IBM’s AutoAI Has The Smarts To Make Data Scientists A Lot More Productive – But What’s Scary Is That It’s Getting A Whole Lot Smarter

By Patrick Moorhead - May 23, 2022

I recently had the opportunity to discuss current IBM artificial intelligence developments with Dr. Lisa Amini, an IBM Distinguished Engineer and the Director of IBM Research Cambridge, home to the MIT-IBM Watson AI Lab. Dr. Amini was previously Director of Knowledge & Reasoning Research in the Cognitive Computing group at IBM’s TJ Watson Research Center in New York. Dr. Amini earned her Ph.D. degree in Computer Science from Columbia University. Dr. Amini and her team are part of IBM Research tasked with creating the next generation of Automated AI and data science.

I was interested in automation's impact on the lifecycles of artificial intelligence and machine learning and centered our discussion around next-generation capabilities for AutoAI. 

AutoAI automates the highly complex process of finding and optimizing the best ML model, features, and model hyperparameters for your data. AutoAI does what otherwise would need a team of specialized data scientists and other professional resources, and it does it much faster. 

AI model building can be challenging

[Figure: “How Much Automation Does a Data Scientist Want?” Source: IBM]

Building AI and machine learning models is a multifaceted process that involves gathering requirements and formulating the problem. Before model training begins, data must be acquired, assessed, and preprocessed to identify and correct data quality issues. 

Because the process is so complex, data scientists and ML engineers typically create ML pipelines to link those steps together for reuse each time data and models are refined. Pipelines handle data cleansing and manipulation operations for model training, testing and deployment, and inference. Constructing and tuning a pipeline is not only complex but also labor-intensive. It requires a team of trained resources who understand data science, plus subject-matter experts knowledgeable about the model’s purpose and outputs.
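
To make the idea of a pipeline concrete, here is a minimal sketch in plain Python. It is not AutoAI code; the step names (`impute_missing`, `min_max_scale`) and the `run_pipeline` helper are hypothetical and exist only to show how preprocessing steps can be chained and reused.

```python
from statistics import mean

def impute_missing(rows):
    """Replace None values in each column with that column's mean."""
    columns = list(zip(*rows))
    means = [mean(v for v in col if v is not None) for col in columns]
    return [[m if v is None else v for v, m in zip(row, means)] for row in rows]

def min_max_scale(rows):
    """Rescale every column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    return [[(v - lo) / s for v, lo, s in zip(row, lows, spans)] for row in rows]

def run_pipeline(rows, steps):
    """Apply each preprocessing step in order and return the result."""
    for step in steps:
        rows = step(rows)
    return rows

# Hypothetical raw data with two missing readings.
raw = [[1.0, 10.0], [None, 20.0], [3.0, None]]
prepared = run_pipeline(raw, [impute_missing, min_max_scale])
print(prepared)
```

A production pipeline would also fit its statistics on training data only and reuse them at inference time, which is part of what makes hand-built pipelines labor-intensive.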

It is a lengthy process because there are many design choices to be made, plus a myriad of tuning adjustments for various data processing and modeling stages. 

The pipeline's high degree of complexity makes it a prime candidate for automation.

IBM AutoAI automates model building across the entire AI lifecycle

According to Dr. Amini, AutoAI does in minutes what would typically take hours to days for a whole team of data scientists. Automated functions include data preparation, model development, feature engineering, and hyperparameter optimization.

End-to-end automation of an entire model building process can result in significant resource savings. Here is a partial list of AutoAI features:

  • Automatic analysis of data and automatic generation of model pipelines customized for predictive modeling problems.
  • Model pipelines are created iteratively as AutoAI analyzes datasets and discovers data transformations, algorithms, and parameter settings that work best for problem settings.
  • Results are displayed on a leaderboard, showing the automatically generated model pipelines ranked according to problem optimization objective.
  • Visualizations are available at each stage of the process, ranging from data preparation to algorithm selection to model creation.
  • The user can easily deploy the model or generate the Python notebook for any pipeline with a single mouse click. 
  • Automated tasks for continuous model improvement make it possible to integrate AI model APIs into applications where needed. 
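
The iterative search and leaderboard described in the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not AutoAI's implementation: the toy dataset, the two transformations, the two algorithms, and the choice of holdout mean absolute error as the objective are all invented for the example.

```python
from itertools import product
from statistics import mean

# Toy data where y depends on x squared (purely illustrative).
train = [(x, x * x + 1.0) for x in range(0, 8)]
holdout = [(x, x * x + 1.0) for x in range(8, 11)]

# Candidate data transformations.
transforms = {"identity": lambda x: x, "square": lambda x: x * x}

def fit_mean(xs, ys):
    """Baseline that always predicts the training mean."""
    avg = mean(ys)
    return lambda x: avg

def fit_linear(xs, ys):
    """Ordinary least squares for a single feature."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)

# Candidate algorithms.
algorithms = {"mean_baseline": fit_mean, "linear_regression": fit_linear}

def build_leaderboard():
    """Train every transform/algorithm pair and rank by holdout error."""
    rows = []
    for (t_name, t), (a_name, fit) in product(transforms.items(), algorithms.items()):
        model = fit([t(x) for x, _ in train], [y for _, y in train])
        mae = mean(abs(model(t(x)) - y) for x, y in holdout)
        rows.append((f"{t_name} + {a_name}", mae))
    return sorted(rows, key=lambda r: r[1])

leaderboard = build_leaderboard()
for rank, (name, mae) in enumerate(leaderboard, start=1):
    print(rank, name, round(mae, 3))
```

In this toy setup the pipeline that squares the input before fitting a line ranks first, because that transformation matches how the data was generated. Discovering such transformations automatically is the kind of work the article attributes to AutoAI.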

AutoAI provides a significant productivity boost. Even a person with basic data science skills can automatically select, train, and tune a high-performing ML model with customized data in just a few mouse clicks. 

Expert data scientists, meanwhile, can rapidly iterate on potential models and pipelines and experiment with the latest models, feature engineering techniques, and fairness algorithms. This can all be done without having to code pipelines from scratch. 

Future AI automation projects

IBM Research is working on several next-generation AI automation projects, including algorithms that handle new data types, bring new automated quality and fairness capabilities, and dramatically boost scale and performance. 

Dr. Amini provided a deep dive into two especially interesting next-generation capabilities for scaling enterprise AI: AutoAI for Decisions and Semantic Data Science. 

AutoAI for improved decision making

Time series forecasting is one of the most popular, but also one of the most difficult, forms of predictive analytics. It uses historical data to predict future values over time. Time series forecasting is commonly used for financial planning, inventory, and capacity planning. The time dimensions within a dataset make analysis difficult and require more advanced data handling.

IBM’s AutoAI product already supports Time Series forecasting. It automates the following steps of building predictive models:

  • Prepares data sets for training
  • Determines which model, such as classification or regression, is needed based on the type of data
  • Incorporates appropriate imputation transformers into pipelines to handle missing data
  • Handles feature selection by determining which data columns best support the problem
  • Tests various hyperparameter tuning options for best results
  • Generates and ranks pipelines based on such things as accuracy and precision. 
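
A stripped-down version of those steps for a time series might look like the following sketch. The forward-fill imputation, the three candidate forecasters, and the one-step-ahead backtest are assumptions chosen for brevity; the source does not say which imputation methods or forecasting models AutoAI uses.

```python
from statistics import mean

def forward_fill(series):
    """Fill None gaps with the most recent observed value."""
    filled, last = [], None
    for value in series:
        last = value if value is not None else last
        filled.append(last)
    return filled

def naive(history, season):
    """Predict the last observed value."""
    return history[-1]

def seasonal_naive(history, season):
    """Predict the value observed one full season ago."""
    return history[-season]

def moving_average(history, season):
    """Predict the mean of the most recent season."""
    return mean(history[-season:])

def backtest(series, forecaster, season, n_test):
    """Mean absolute error of one-step-ahead forecasts over the last
    n_test points, always training only on earlier observations."""
    errors = []
    for t in range(len(series) - n_test, len(series)):
        prediction = forecaster(series[:t], season)
        errors.append(abs(prediction - series[t]))
    return mean(errors)

# Hypothetical demand with a repeating pattern of length 4 and one gap.
demand = [10, 20, 30, 40, 10, 20, None, 40, 10, 20, 30, 40]
clean = forward_fill(demand)
scores = {f.__name__: backtest(clean, f, season=4, n_test=4)
          for f in (naive, seasonal_naive, moving_average)}
ranking = sorted(scores, key=scores.get)
print(ranking, scores)
```

The backtest never shuffles the data: each forecast sees only earlier observations. That constraint is one reason the time dimension requires more careful handling than ordinary tabular data.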

Dr. Amini explained that, in many settings, the next step after creating a time series forecast is to leverage that forecast for improved decision-making. 

For example, a data scientist might build a time series forecasting model for product demand, but the model can also be used as input for inventory restocking decisions, with the goal of maximizing profit by reducing costly overstocking and avoiding lost sales due to stock outages. 

Simple heuristics are sometimes used for inventory restocking decisions, such as determining when inventory should be restocked and by how much. In other cases, a more systematic approach, called decision optimization, is leveraged to build a prescriptive model to complement the predictive time series forecasting model. 
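
The difference between a simple heuristic and a decision-optimization approach can be shown with a toy restocking problem. Everything here is hypothetical: the unit cost and price, the five demand scenarios, and the brute-force search, which stands in for the mathematical optimization a real prescriptive model would use.

```python
from statistics import mean

UNIT_COST, UNIT_PRICE = 3.0, 10.0  # hypothetical economics

def profit(order_qty, demand):
    """Revenue from units actually sold minus the cost of all units ordered."""
    return UNIT_PRICE * min(order_qty, demand) - UNIT_COST * order_qty

def expected_profit(order_qty, scenarios):
    """Average profit across equally likely demand scenarios."""
    return mean(profit(order_qty, d) for d in scenarios)

# Demand scenarios, e.g. a point forecast of 100 plus past forecast errors.
scenarios = [80, 90, 100, 110, 120]

# Heuristic: order exactly the point forecast.
heuristic_qty = 100

# Prescriptive: search for the quantity that maximizes expected profit.
best_qty = max(range(60, 141), key=lambda q: expected_profit(q, scenarios))

print("heuristic:", heuristic_qty, expected_profit(heuristic_qty, scenarios))
print("optimized:", best_qty, expected_profit(best_qty, scenarios))
```

Because a lost sale costs more than an unsold unit in this setup, the optimized order quantity sits above the point forecast. Heuristics that ignore forecast uncertainty miss that kind of trade-off.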

Prescriptive analytics (as opposed to predictive analytics) uses sophisticated mathematical modeling techniques and data structures for decision optimization and requires expertise that is in short supply. However, products that automatically generate decision optimization pipelines directly from data, as AutoAI does for predictive models, do not exist today.

Multi-model pipelines

Dr. Amini explained that the best results are obtained by using both machine learning and decision optimization. To support that capability, IBM researchers are working on multi-model pipelines that could accommodate the needs of predictive and prescriptive models. Multi-models will allow business analysts and data scientists to use a common model to discuss aspects of the problem from each other's perspectives. Such a product would also promote and improve collaboration between diverse but equally essential resources. 

Automation for Deep Reinforcement Learning

The new capability to automate pipeline generation for decision models is now available through the Early Access program from IBM Research. It leverages deep reinforcement learning to learn an end-to-end model from data to decision policy. The technology, called AutoDO (Automated Decision Optimization), uses reinforcement learning (RL) models and gives data scientists the capability to train machine learning models to perform sequential decision-making under uncertainty. Automation is critical for RL because RL algorithms are highly sensitive to their internal hyperparameters and therefore require significant expertise and manual effort to tune for specific problems and data sets.

Dr. Amini explained that the technology automatically selects the best reinforcement learning model to use according to the data and the problem. Using advanced search strategies, it also selects the best configuration of hyperparameters for the model. 

The system can search historical data sets or any gym-compatible environment to automatically generate, tune, and rank RL pipelines. The system supports various flavors of reinforcement learning, including online and offline learning and model-free and model-based algorithms.
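
To give a feel for what automated RL tuning involves, the sketch below runs a small grid search over two hyperparameters and ranks the results. It makes several loud assumptions: `CorridorEnv` is a made-up toy environment that only mimics the reset/step shape of Gym-style environments, tabular Q-learning stands in for the deep RL models AutoDO is described as using, and grid search stands in for its "advanced search strategies."

```python
import random
from itertools import product

class CorridorEnv:
    """Hypothetical toy environment with a Gym-style reset/step interface."""
    def __init__(self, length=5, max_steps=20):
        self.length, self.max_steps = length, max_steps

    def reset(self):
        self.pos, self.steps = 0, 0
        return self.pos, {}

    def step(self, action):  # action 0 = left, 1 = right
        self.steps += 1
        move = 1 if action == 1 else -1
        self.pos = max(0, min(self.length - 1, self.pos + move))
        terminated = self.pos == self.length - 1  # reached the goal cell
        reward = 10.0 if terminated else -1.0
        truncated = self.steps >= self.max_steps and not terminated
        return self.pos, reward, terminated, truncated, {}

def greedy(q, state):
    """Pick the action with the highest estimated value."""
    return max((0, 1), key=lambda a: q[state][a])

def train_q_learning(env, alpha, epsilon, episodes=200, gamma=0.95, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(env.length)]
    for _ in range(episodes):
        state, _ = env.reset()
        done = False
        while not done:
            explore = rng.random() < epsilon
            action = rng.randrange(2) if explore else greedy(q, state)
            nxt, reward, terminated, truncated, _ = env.step(action)
            target = reward + (0.0 if terminated else gamma * max(q[nxt]))
            q[state][action] += alpha * (target - q[state][action])
            state, done = nxt, terminated or truncated
    return q

def evaluate(env, q):
    """Total reward of one episode under the greedy policy."""
    state, _ = env.reset()
    total, done = 0.0, False
    while not done:
        state, reward, terminated, truncated, _ = env.step(greedy(q, state))
        total += reward
        done = terminated or truncated
    return total

def rank_configs():
    """Train one agent per hyperparameter setting and rank by return."""
    env = CorridorEnv()
    results = []
    for alpha, epsilon in product((0.1, 0.5), (0.1, 0.3)):
        q = train_q_learning(env, alpha, epsilon)
        results.append(({"alpha": alpha, "epsilon": epsilon}, evaluate(env, q)))
    return sorted(results, key=lambda r: r[1], reverse=True)

rl_leaderboard = rank_configs()
for config, episode_return in rl_leaderboard:
    print(config, episode_return)
```

Even in this tiny problem, the learning rate and exploration rate have to be chosen before training starts. In realistic problems the number of such settings grows quickly, which is the tuning burden the article says automation is meant to remove.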

Scaling AI with automation 

Automation for reinforcement learning tackles two pressing problems for scaling AI in the enterprise. 

First, it provides automation for sequential decision-making problems where uncertainty may weaken heuristic and even formal optimization models that don't utilize historical data. 

Secondly, it brings an automated, systematic approach to the challenging reinforcement learning model building domain.

Semantic Data Science

State-of-the-art automated ML products like AutoAI can efficiently analyze historical data to create and rank custom machine learning pipelines. It includes automated feature engineering, which expands and augments the feature space of data to optimize model performance. Automated methods currently rely on statistical techniques to explore the feature space. 

However, if a data scientist understands the semantics of the data, it is possible to leverage domain knowledge to expand the feature space and increase model accuracy. This expansion can be done using complementary data from internal or external data sources. Feature space is the set of features used to characterize data. For example, if the data is about cars, the feature space could be (make, model year, mileage). 

Complementary feature transformations may be found in existing Python scripts or in relationships described in the literature. Even so, to know which features and transformations are relevant, a user must have sufficient technical skills to decipher and translate them from code and documents.

New semantic power for data scientists

Dr. Amini described another powerful new capability created by IBM Research called Semantic Data Science that automatically detects semantic concepts for a given dataset. Semantic concepts provide a way to represent the meaning of words and sentences. Once AutoAI has detected the proper semantic concepts, the program uses those concepts in a broad search for relevant features and feature engineering operations that may be present in existing code, data, and literature. 

AutoAI can use these new, semantically-rich features to improve the accuracy of generated models and provide human-readable explanations with these generated features.
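
The general idea of concept detection followed by feature lookup can be illustrated with a deliberately naive sketch. None of this reflects how Semantic Data Science works internally: the `CONCEPT_HINTS` table, the `FEATURE_CATALOG`, the substring matching, and the body-mass-index example are all hypothetical stand-ins chosen because the formula is widely known.

```python
# Hypothetical knowledge base: semantic concept -> column-name hints.
CONCEPT_HINTS = {
    "body_weight_kg": ("weight", "wt_kg"),
    "body_height_m": ("height", "ht_m"),
}

# Hypothetical catalog of derived features, keyed by the concepts they need.
FEATURE_CATALOG = [
    {
        "name": "body_mass_index",
        "needs": ("body_weight_kg", "body_height_m"),
        "compute": lambda weight, height: weight / (height ** 2),
        "source": "domain literature (BMI = weight / height^2)",
    },
]

def detect_concepts(columns):
    """Map a concept to a column when the column name contains a known hint."""
    found = {}
    for column in columns:
        for concept, hints in CONCEPT_HINTS.items():
            if any(hint in column.lower() for hint in hints):
                found[concept] = column
    return found

def add_semantic_features(rows):
    """Add every catalog feature whose required concepts were detected,
    and return a human-readable explanation for each one."""
    concepts = detect_concepts(list(rows[0].keys()))
    explanations = []
    for feature in FEATURE_CATALOG:
        if all(c in concepts for c in feature["needs"]):
            cols = [concepts[c] for c in feature["needs"]]
            for row in rows:
                row[feature["name"]] = feature["compute"](*(row[c] for c in cols))
            explanations.append(
                f'{feature["name"]} from {cols}; source: {feature["source"]}'
            )
    return rows, explanations

# Hypothetical dataset with weight in kilograms and height in meters.
patients = [{"Weight": 70.0, "Height": 1.75}, {"Weight": 90.0, "Height": 1.80}]
enriched, notes = add_semantic_features(patients)
print(enriched[0]["body_mass_index"], notes)
```

The explanation strings mirror, in miniature, the human-readable explanations the article says accompany generated features: each new column records which inputs it came from and where the transformation originated.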

Even without having domain expertise to assess these semantic concepts or new features, a data scientist can still run AutoAI experiments. However, data scientists who want to understand and interact with the discovered semantic concepts can use the Semantic Feature Discovery visual explorer to explore discovered relationships. 

Users can go directly from the visual explorer into the Python code or document where the new feature originated simply by clicking the Sources hyperlink, as shown in the graphics below.

[Two graphics. Source: IBM]

The Semantic Data Science capability is also available as an IBM Research Early Access offering. Some of the capabilities are even available for experimentation on IBM’s API Hub.

Dr. Amini concluded our conversation by summing up the vast research effort IBM is pouring into AutoAI in a single, efficient sentence:

“We want AutoAI and Semantic Data Science to do what an expert data scientist would want to do but may not always have the time or domain knowledge to do by themselves.” 

Wrap-up key points

  • AutoAI allows people without deep data science expertise to generate various model types, and even those with deep data science expertise to more rapidly prototype and iterate. Models can be rapidly generated at scale with AutoAI.
  • AutoAI will reduce the effort of model building and increase productivity and accuracy. It should also increase the number of enterprise models that are deployed and operationalized.
  • AutoAI for Decisions will expand the types of problems that can be solved using automatically generated pipelines to those requiring decision optimization under uncertainty and reinforcement learning.
  • Semantic Data Science will add considerable power to the model-building process. It will increase the quality of models being built by acting as an expert resource to gather and incorporate widespread, difficult-to-find information of varied types and sources.
  • AutoAI is part of IBM Watson Studio. More information can be found here

Analyst Notes:

  1. In this article, I mentioned IBM’s little-known Early Access Program. Clients have tested a high percentage of IBM's pre-commercial AI technology under this program. Each offering in the program consists of a package of pre-commercial assets, called capabilities, available to clients for a one-year license that can be renewed or refreshed at the end of the term. More information on Early Access and its candidate programs can be found here.
  2. IBM also has an open-source framework product called CodeFlare, developed to scale AI’s pipelines. CodeFlare simplifies the integration, scaling, and acceleration of complex multi-step analytics and machine learning pipelines on the cloud. My earlier Forbes.com article on CodeFlare is here.

For more information and comments about quantum computing and artificial intelligence, follow Paul Smith-Goodson on Twitter @Moor_Quantum

Patrick Moorhead

Patrick founded the firm based on his real-world technology experiences and his understanding of what he wasn’t getting from analysts and consultants. Ten years later, Patrick is ranked #1 among technology industry analysts in terms of “power” (ARInsights) and in “press citations” (Apollo Research). Moorhead is a contributor at Forbes and frequently appears on CNBC. He is a broad-based analyst covering a wide variety of topics including the cloud, enterprise SaaS, collaboration, client computing, and semiconductors. He has 30 years of experience including 15 years of executive experience at high tech companies (NCR, AT&T, Compaq, now HP, and AMD) leading strategy, product management, product marketing, and corporate marketing, including three industry board appointments.