IBM Built A Giant AI Supercomputer In The Cloud To Train Its Massive AI Models


Artificial intelligence has received a great deal of attention ever since OpenAI unveiled ChatGPT, its AI language model. Public interest surged once people got a free, hands-on taste of AI. The response was so great that Microsoft and Google were forced to quickly integrate AI into their search engine infrastructures. The release of ChatGPT will likely go down as a turning point in the evolution of AI.

IBM Research has a longstanding leadership position in AI

OpenAI and Google are not the only companies doing extensive AI research. IBM has one of the largest and best-funded AI research programs in the world. Decades of work with artificial intelligence have kept the company at the forefront of advanced AI research.

Recently IBM has been focused on creating AI models that streamline the operations of internal IBM business units. Not only does this work make IBM operate more efficiently, it also allows its researchers to gain valuable experience to further refine the technology. IBM has also been conducting innovative AI research in life-affecting areas such as chemistry, biology, medicine, and healthcare.

A significant amount of IBM’s recent research has been dedicated to foundation models and generative AI. These models are trained on large amounts of unlabeled data and can be used for multiple tasks with only minor modifications. Foundation models are huge, usually with billions of parameters. Models of this scale are so large that they can only be trained with supercomputers.
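To give a rough sense of why parameter counts in the billions push training beyond a single machine, here is a back-of-the-envelope sketch. It is not an IBM calculation. It assumes mixed-precision training with an Adam-style optimizer at roughly 16 bytes of state per parameter (a commonly cited rule of thumb) and a hypothetical accelerator with 80 GB of memory. Activations, data, and communication buffers are ignored, so real requirements would be higher.

```python
# Rough, illustrative estimate of training-state memory for a large model.
# All constants are assumptions for illustration, not IBM or Vela figures.
import math

# Assumed: fp16 weights (2) + fp16 gradients (2) + fp32 optimizer state (12).
BYTES_PER_PARAM_TRAINING = 16
# Assumed memory per accelerator, in GB (hypothetical device).
GPU_MEMORY_GB = 80


def training_state_gb(num_params: int) -> float:
    """Memory for weights, gradients, and optimizer state, in GB (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM_TRAINING / 1e9


def min_gpus_for_state(num_params: int) -> int:
    """Lower bound on GPUs needed just to hold the training state."""
    return math.ceil(training_state_gb(num_params) / GPU_MEMORY_GB)


if __name__ == "__main__":
    for billions in (1, 10, 100):
        n = billions * 1_000_000_000
        print(f"{billions}B params: about {training_state_gb(n):.0f} GB of state, "
              f"at least {min_gpus_for_state(n)} GPU(s)")
```

Under these assumptions, a 100-billion-parameter model needs about 1,600 GB just for its training state, which is at least 20 such GPUs before any training data is processed.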

Unfortunately, classical supercomputers were not designed for the computational complexity necessary for optimal training of AI models. IBM realized that building an AI supercomputer with an architecture designed to build and train massive AI models would be beneficial to its research efforts, and eventually, its customers.

The decision to build an AI supercomputer was easy. Where to build it was less obvious, and after a fair amount of internal debate, IBM decided it should be built in the cloud.

According to Dr. Talia Gershon, Director, Cloud Infrastructure Research, IBM has been dedicated to developing AI-focused and high-performance infrastructure in the cloud for many years.

“At IBM Research, we are heavily leaning into foundation models,” said Dr. Gershon. “Our research in this area has been remarkable and groundbreaking. Due to the performance that these models deliver and the ability to adapt quickly with minimal time to achieve value, IBM sees foundation models as a massive and disruptive opportunity that we are determined to seize.”

IBM has developed a number of generative AI models for various life-impacting and business-related domains, such as antimicrobials, chemistry, materials, and code. More information about this particular class of models is available in my earlier article.

Building the model


Dr. Gershon explained why developing massive foundation models is a challenging and time-consuming task, often requiring dozens or even hundreds of GPUs to run for weeks or months during the training phase.

She further explained that, to ensure an efficient model-building workflow, special attention must be paid to every step of the process, from initial data collection and preparation to validation and, eventually, operationalization. Data must be cleaned and prepared, and the model's performance must then be validated on various downstream tasks. Finally, the model must be served, which, because of its size, is a complex and expertise-intensive task.

Goals and considerations for IBM’s cloud-based AI supercomputer

Figure: A comparison of the HPC AI system stack (left) and the cloud-native AI system stack. (Image: IBM)

The IBM Research team responsible for building the AI supercomputer, named Vela, decided that building it in the cloud was the most efficient and effective way to achieve its goals:

  • Cloud makes it easy for researchers and customers to collaborate with each other.
  • Cloud provides access to various public cloud services for enhanced security and privacy.
  • Software can be configured on each node to meet the needs of research teams.
  • With cloud, AI researchers have greater flexibility and more independence to access the latest software tools and libraries needed for models.
  • Cloud’s high redundancy ensures the system continues to operate in the event of a component failure.

To function properly, AI infrastructure needs nodes composed of many GPUs. Nodes can be configured in one of two ways: as physical machines (commonly called "bare metal") that maximize AI performance, or as virtual machines (VMs) that give support teams the flexibility to adjust the infrastructure and allocate resources between workloads.

The AI supercomputer design team used clever engineering to combine the advantages of bare-metal node capabilities (such as GPUs, CPUs, networking, and storage) with the flexibility of VMs. This was accomplished by configuring the host for virtualization while ensuring all devices and connections were accurately represented inside the VM. This gave Vela the capability to operate at nearly the same level of performance as a physical machine while also offering VM flexibility.

Vela: architected for near bare metal performance

Figure: Representative graphic illustrating the architecture of Vela. (Image: IBM)

Vela’s platform is built on OpenShift, making it easily transferable to any cloud or hybrid environment. It is a huge multi-node, multi-GPU system that uses NVLink for high-speed communication between GPUs. NVSwitch is used to connect multiple NVLinks for all-to-all, high-speed GPU communication within a single node. NVSwitch also extends communications across nodes to create a seamless, high-bandwidth, multi-node GPU cluster, effectively forming a data center-sized GPU.

The design team decided Vela needed native cloud technologies. Rather than build an InfiniBand system, IBM chose Ethernet for its flexibility, scalability, and ease of operation and management. Equally important, Ethernet made the system compatible with IBM’s cloud infrastructure.

IBM declined to specify Vela’s exact number of GPUs, but to effectively manage resources in a thousand-plus GPU system, it developed a cloud-native batch scheduling technology that runs on top of Kubernetes. This technology is currently in use on the production OpenShift cluster to efficiently queue, prioritize, and manage jobs.
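To make the idea of queuing and prioritizing GPU jobs concrete, here is a minimal toy sketch. It is not IBM's scheduler and does not use any Kubernetes or OpenShift API. The class name `GpuBatchQueue`, its methods, and the job names are all hypothetical. It assumes a strict priority policy in which a lower number means more urgent, and in which a job that does not fit blocks the jobs queued behind it.

```python
# Toy priority queue for GPU batch jobs. Illustrative only: all names here
# are hypothetical and unrelated to IBM's actual scheduling technology.
import heapq
from itertools import count


class GpuBatchQueue:
    def __init__(self, total_gpus: int) -> None:
        self.free_gpus = total_gpus
        self._heap = []        # entries: (priority, sequence, name, gpus)
        self._seq = count()    # tie-breaker keeps FIFO order within a priority

    def submit(self, name: str, gpus: int, priority: int) -> None:
        """Queue a job; a lower priority number is dispatched first."""
        heapq.heappush(self._heap, (priority, next(self._seq), name, gpus))

    def dispatch(self) -> list:
        """Start queued jobs in priority order while the head job fits."""
        started = []
        while self._heap and self._heap[0][3] <= self.free_gpus:
            _, _, name, gpus = heapq.heappop(self._heap)
            self.free_gpus -= gpus
            started.append(name)
        return started

    def release(self, gpus: int) -> None:
        """Return GPUs to the pool when a job finishes."""
        self.free_gpus += gpus


if __name__ == "__main__":
    queue = GpuBatchQueue(total_gpus=16)
    queue.submit("finetune", gpus=4, priority=2)
    queue.submit("pretrain", gpus=16, priority=0)
    queue.submit("eval", gpus=8, priority=1)
    print(queue.dispatch())   # the highest-priority job takes all 16 GPUs
    queue.release(16)         # that job finishes
    print(queue.dispatch())   # the remaining jobs now fit
```

A production scheduler would also need to handle preemption, quotas, and fairness across teams, none of which this sketch attempts.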

Wrapping up

IBM Research has dedicated a great deal of resources to building and training large foundation and generative AI models across many domains and modalities. It has partnered with many of IBM's internal business units to train world-class models faster and more efficiently, so that those models can be operationalized and turned into business value. Because training AI models places significantly different demands on compute than the workloads classical supercomputers were designed for, IBM designed and built its own AI supercomputer and deployed it in the cloud in May 2022.

Artificial intelligence models require an inordinate amount of compute resources and time to build and train. IBM’s AI supercomputer, Vela, was ultimately designed to build and train huge models with efficiency and speed.

IBM did a monumental amount of design and engineering work to limit Vela’s performance overhead to within a few percentage points of bare metal. Its entire stack is built on top of OpenShift, which makes it portable to any on-prem or public cloud environment and gives it the capability to run in a hybrid cloud environment.

Although IBM hasn’t signaled its intention to provide cloud-based AI supercomputing as a service, Vela is a massive AI supercomputer with every feature and function necessary for such an offering. The architecture is such that it would be simple for IBM to take a slice of Vela’s infrastructure and offer it as a service.

Everything considered, I wouldn’t be surprised to see an AI supercomputer as-a-service offering sometime in early 2024. It would be beneficial for IBM as well as the entire AI ecosystem.

Paul Smith-Goodson

Paul Smith-Goodson is the Moor Insights & Strategy Vice President and Principal Analyst for quantum computing and artificial intelligence. His early interest in quantum began while he was working on a joint AT&T and Bell Labs project, when, during 360 overviews of Murray Hill advanced projects, Peter Shor provided an overview of his ground-breaking research in quantum error correction.

Patrick Moorhead

Patrick founded the firm based on his real-world technology experiences and his understanding of what he wasn’t getting from analysts and consultants. Ten years later, Patrick is ranked #1 among technology industry analysts in terms of “power” (ARInsights) and in “press citations” (Apollo Research). Moorhead is a contributor at Forbes and frequently appears on CNBC. He is a broad-based analyst covering a wide variety of topics including the cloud, enterprise SaaS, collaboration, client computing, and semiconductors. He has 30 years of experience, including 15 years of executive experience at high tech companies (NCR, AT&T, Compaq, now HP, and AMD) leading strategy, product management, product marketing, and corporate marketing, including three industry board appointments.