IBM’s newly announced Artificial Intelligence Unit (AIU) is its first system-on-chip design. The AIU is an application-specific integrated circuit (ASIC) designed to train and run deep learning models that require massively parallel computing. For these workloads, the AIU is much faster than existing CPUs, which were designed for traditional software applications years before deep learning emerged. IBM provided no release date for the AIU.
The IBM Research AI Hardware Center developed the new AIU chip over five years. The center focuses on developing next-gen chips and AI systems, with the goals of improving AI hardware efficiency by 2.5x annually and of training and running AI models a thousand times faster in 2029 than in 2019.
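The two targets measure different things (hardware efficiency versus training and inference speed), and the source does not say how they relate. As a rough arithmetic sketch only, the snippet below compounds a 2.5x annual gain over the ten years from 2019 to 2029 and, separately, computes the annual rate that a thousandfold gain over the same period would imply. The function names are illustrative and not taken from IBM.

```python
def compound(rate: float, years: int) -> float:
    """Total improvement after compounding `rate` once per year."""
    return rate ** years


def required_annual_rate(total: float, years: int) -> float:
    """Annual rate that compounds to `total` over `years` years."""
    return total ** (1.0 / years)


years = 2029 - 2019  # ten annual steps

print(f"2.5x per year for {years} years: {compound(2.5, years):,.0f}x")
print(f"1000x over {years} years: {required_annual_rate(1000.0, years):.2f}x per year")
```

Compounding 2.5x for ten years gives roughly 9,500x, while a thousandfold gain needs just under 2x per year, so the two figures are best read as separate goals rather than one derived from the other.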
Unpacking the AIU
According to the IBM blog, “Our complete system-on-chip features 32 processing cores and contains 23 billion transistors — roughly the same number in our z16 chip. The IBM AIU is also designed to be as easy to use as a graphics card. It can be plugged into any computer or server with a PCIe slot.”
Deep learning has traditionally relied on a combination of CPUs and GPU co-processors to train and run models. GPUs were initially developed to render graphics, but the technology later proved well suited to artificial intelligence.
The IBM AIU is not a graphics processor. It was specifically designed and optimized to accelerate matrix and vector computations used by deep learning models. The AIU can solve computationally complex problems and perform data analysis at speeds far beyond the capability of a CPU.
Growth of AI and deep learning
The growth of deep learning is straining available compute power. The use of AI and deep learning models is growing exponentially across industries and for a wide range of applications.
In addition to growth, another problem is model size. Deep learning models are huge, with billions, and sometimes trillions, of parameters. Unfortunately, according to IBM, hardware efficiency has lagged behind the exponential growth of deep learning.
Historically, computation has relied on high-precision 64- and 32-bit floating-point arithmetic. IBM believes that level of precision is not always needed. It has a term for lowering traditional computational precision: “approximate computing.” On its blog, IBM explains its rationale for using approximate computing:
“Do we need this level of accuracy for common deep learning tasks? Does our brain require a high-resolution image to recognize a family member, or a cat? When we enter a text thread for search, do we require precision in the relative ranking of the 50,002nd most useful reply vs the 50,003rd? The answer is that many tasks including these examples can be accomplished with approximate computing.”
Approximate computing played an essential role in the design of the new AIU chip. IBM researchers designed the AIU to use less precision than a CPU typically would. Lower precision was vital to achieving high compute density in the new hardware accelerator. Instead of the 32-bit or 16-bit floating-point arithmetic typically used for AI training, IBM used hybrid 8-bit floating-point (HFP8) calculations. The lower-precision computation allowed the chip to operate 2x faster than with FP16 calculations while providing similar training results.
The design goals appear to conflict, but IBM reconciled them. Low-precision computation was necessary to obtain higher density and faster computation, yet the accuracy of deep learning (DL) models had to remain at a level consistent with high-precision computation.
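The trade-off between precision and accuracy can be illustrated with a small, self-contained sketch. It is not IBM's HFP8 implementation. The helper `round_mantissa` is a hypothetical function that models only a reduced mantissa width and ignores exponent range, subnormals, and hardware rounding modes. The 10-bit setting matches the FP16 mantissa width; the 3-bit setting is an assumption standing in for an 8-bit float with a 3-bit mantissa. Inputs are rounded to low precision and the products are accumulated at full precision, then compared with the full-precision dot product.

```python
import math
import random


def round_mantissa(x: float, mantissa_bits: int) -> float:
    """Round x to the nearest value with `mantissa_bits` explicit mantissa bits.

    Hypothetical helper: models only reduced mantissa width, not exponent
    range, subnormals, or any specific hardware format.
    """
    if x == 0.0 or not math.isfinite(x):
        return x
    m, e = math.frexp(x)              # x == m * 2**e with 0.5 <= |m| < 1
    scale = 2 ** (mantissa_bits + 1)  # +1 accounts for the implicit leading bit
    return math.ldexp(round(m * scale) / scale, e)


def dot(a, b, mantissa_bits=None):
    """Dot product; optionally round the inputs to low precision first.

    Products are accumulated at full Python float precision.
    """
    if mantissa_bits is not None:
        a = [round_mantissa(v, mantissa_bits) for v in a]
        b = [round_mantissa(v, mantissa_bits) for v in b]
    return sum(x * y for x, y in zip(a, b))


def relative_error(approx: float, exact: float) -> float:
    """Relative difference between an approximate and an exact result."""
    return abs(approx - exact) / abs(exact)


random.seed(0)
# Positive inputs keep the exact result away from zero, so relative error is meaningful.
a = [random.uniform(0.1, 1.0) for _ in range(1024)]
b = [random.uniform(0.1, 1.0) for _ in range(1024)]

exact = dot(a, b)
for bits in (10, 3):
    err = relative_error(dot(a, b, bits), exact)
    print(f"{bits:2d} mantissa bits: relative error {err:.2e}")
```

With a 3-bit mantissa, each input can be off by up to about 6 percent, so a single product can be off by roughly 13 percent in the worst case. Because individual rounding errors tend to partly cancel across many terms, the error of the whole sum is usually well below that bound, which is one intuition for why low-precision arithmetic can give results close to high-precision ones.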
IBM designed the chip for streamlined AI workflows. According to IBM, “Because most AI calculations involve matrix and vector multiplication, our chip architecture features a simpler layout than a multi-purpose CPU. IBM designed the AIU to send data directly from one compute engine to the next, creating enormous energy savings.”
IBM’s announcement contained very little technical information about the chip. However, we can gain some insight into its performance by looking back at its initial prototype. IBM presented the performance results of that early 7nm chip design at the International Solid-State Circuits Conference (ISSCC) in 2021.
Rather than 32 cores, IBM’s prototype for the conference demonstration was an experimental 4-core 7nm AI chip that supported FP16 and hybrid FP8 (HFP8) formats for training and inference of DL models. It also supported INT4 and INT2 formats for scaling inference. A summary of the prototype chip’s performance was contained in a 2021 Linley Group newsletter that reported IBM’s demonstration that year:
- At peak speed, using HFP8, the 7nm design achieved 1.9 teraflops per watt (TF/W).
- TOPS (tera operations per second) measures how many trillions of operations an accelerator can perform in one second. It provides a way to compare how different accelerators perform on a given inference task. Using INT4 for inference, the experimental chip achieved 16.5 TOPS/W, bettering Qualcomm’s low-power Cloud AI module.
- Although few specs and no pricing were released, a broad price estimate would be in the $1,500 to $2,000 range. With the right price-performance, the AIU should be able to establish its place in the market rapidly.
- Because of a lack of information, it’s not possible to directly compare the AIU and GPUs solely based on AI processing cores.
- The low-precision technologies used in the AIU were based on earlier IBM Research work that pioneered the first 16-bit reduced-precision systems for deep learning training, the first 8-bit training techniques, and state-of-the-art 2-bit inference results.
- According to IBM Research, the AIU chip uses a scaled version of the AI accelerator in its Telum chip.
- The Telum uses 7nm transistors, but the AIU will use faster 5nm transistors.
- It will be interesting to see how the AIU stacks up against other technologies if it is released in time for next year’s MLPerf benchmarking tests.
Note: Moor Insights & Strategy writers and editors may have contributed to this article.