Today at the IEEE's Hot Chips 33 conference, IBM presented a preview of IBM Telum, the next-generation processor for IBM z and LinuxONE systems, planned for the first half of 2022.
Immediately I was intrigued by the name change from the rather drab sequence of z14, z15 to “Telum.” The highlights announced were the expected increases in performance, a new cache design, and an integrated accelerator designed for real-time embedded artificial intelligence (AI). I think the latter feature is a game-changer that justifies the name change.
I have long maintained that the name of the game today in chip design is vertical integration. You cannot apply a homogeneous chip to a specific task and expect the best performance and power efficiency. Apple and AWS popularized the approach, and IBM has been vertically integrating for twenty-five years.
The business case for an AI accelerator with real-time low latency
I am a chip guy at heart, but before diving into the silicon, I will discuss why you should care about Telum. There are two broad scenarios where inference tasks embedded directly in the transaction workload can deliver benefits. The first is where AI is used on business data to derive insights, such as fraud detection on credit card transactions, customer behavior prediction, or supply chain optimization. The second is where AI is used to make infrastructure more intelligent, such as workload placement, database query plans based on AI models, or anomaly detection for security.
Let us examine credit card fraud detection further. AI embedded directly in the transaction in real time with low latency can prevent credit card fraud before the transaction completes, rather than just detecting it after the fact. If you try to do this off-platform, you will inevitably run into network delays that lead to higher latency and a lack of consistency. Off-platform, you literally need to move the data from Z to another platform. Low latency is needed to score every transaction consistently. With spikes in latency, some transactions will go unchecked; some customers manage to score only 70% of their transactions, leaving 30% unprotected. There is a business opportunity to do AI scoring on all transactions consistently. Financial systems even have transaction speed requirements, which is another challenge.
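To make the latency-spike point concrete, here is a minimal sketch in Python. The latency samples and the 5 ms scoring budget are made-up numbers chosen for illustration, and `fraction_scored` is a hypothetical helper, not anything from IBM.

```python
def fraction_scored(latencies_ms, budget_ms):
    """Return the share of transactions whose fraud score arrived within the budget."""
    if not latencies_ms:
        return 0.0
    scored = sum(1 for latency in latencies_ms if latency <= budget_ms)
    return scored / len(latencies_ms)


# Illustrative (made-up) scoring latencies for ten transactions, in milliseconds.
off_platform = [2.0, 3.0, 40.0, 2.5, 55.0, 3.5, 2.0, 60.0, 4.0, 3.0]  # network spikes
on_chip = [1.1, 1.2, 1.1, 1.2, 1.1, 1.1, 1.2, 1.1, 1.2, 1.1]          # steady

BUDGET_MS = 5.0  # assumed time allowed for scoring before the transaction must complete

print(fraction_scored(off_platform, BUDGET_MS))  # 0.7
print(fraction_scored(on_chip, BUDGET_MS))       # 1.0
```

In this toy data set, three spikes out of ten are enough to leave 30% of transactions unscored, even though the typical off-platform latency looks fine.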
Additionally, going off-platform creates security risks in sending sensitive or personal data to a separate platform through a network with concerns about encryption, auditing, and an increased attack surface.
Design goals for the Telum AI accelerator
Being able to directly embed AI tasks into the transaction broker on IBM z allows customers to run the most accurate model for the task and run it at low latency without security concerns. Telum is the result of building a centralized on-chip AI accelerator that is shared by all the cores and delivers very low and consistent inference latency.
Whenever a core switches into AI, it gets the compute capacity of the entire accelerator to perform the AI task. Low latency results from the full power of the accelerator being available to the core when it needs it. There is enough total compute capacity in the AI accelerator to enable every transaction to have embedded AI, as each accelerator has a six teraflop (TFLOP) compute capacity.
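A back-of-the-envelope sketch shows why giving one core the whole accelerator helps latency. This is an idealized model that assumes compute time is simply work divided by available throughput; it ignores queuing, memory traffic, and utilization. The per-inference work figure is hypothetical, and the even eight-way split is my own strawman for comparison, not a description of any particular chip.

```python
ACCELERATOR_TFLOPS = 6.0  # per-chip figure quoted in the article
CORES_PER_CHIP = 8        # per-chip core count quoted in the article


def inference_time_us(work_gflop, available_tflops):
    """Idealized compute time in microseconds: work divided by available throughput."""
    # 1 GFLOP of work at 1 TFLOP/s takes 1/1000 of a second, i.e. 1000 microseconds.
    return work_gflop / available_tflops * 1000.0


WORK_GFLOP = 0.003  # hypothetical: 3 million floating-point operations per inference

# One core gets the entire accelerator for the duration of its task.
shared = inference_time_us(WORK_GFLOP, ACCELERATOR_TFLOPS)
# Strawman: the same silicon split evenly, one fixed slice per core.
partitioned = inference_time_us(WORK_GFLOP, ACCELERATOR_TFLOPS / CORES_PER_CHIP)

print(f"shared: {shared:.2f} us")            # shared: 0.50 us
print(f"partitioned: {partitioned:.2f} us")  # partitioned: 4.00 us
```

Under these assumptions a single inference finishes eight times sooner on the shared design, which is the latency argument in a nutshell.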
Customer use cases draw on a variety of AI technologies, not just deep learning. The range runs from traditional machine learning algorithms to various deep neural networks such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). The accelerator has operations that help in all of these different types of AI models.
Even though no data goes off-platform, security remains essential, and the on-chip accelerator has enterprise-grade memory virtualization and protection.
Another important consideration for the AI accelerator is extensibility with future firmware and hardware updates. AI is a relatively new field, evolving quickly. The design includes firmware that allows the delivery of new functionality on the same hardware platform over time.
Diving into the silicon
Let me take you one step further down into the details of how the accelerator works. IBM has designed a new memory-to-memory CISC instruction set called the Neural Network Processing Assist. This new instruction set operates directly on the tensor data (the primary data structure used by neural networks) in the program's userspace, enabling matrix multiplication, convolution, pooling, and activation functions. These primitives make up typical AI algorithms.
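For readers less familiar with these primitives, here is a minimal pure-Python sketch of three of them: matrix multiplication, a one-dimensional convolution, and max pooling. The function names and sample values are mine and purely illustrative. This shows the math only, not IBM's instruction interface or data layout.

```python
def matmul(a, b):
    """Multiply an m x k matrix by a k x n matrix, both given as lists of rows."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for row in a
    ]


def conv1d(signal, kernel):
    """Slide the kernel over the signal without padding (cross-correlation, as in most neural networks)."""
    width = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(width))
        for i in range(len(signal) - width + 1)
    ]


def max_pool_1d(values, window):
    """Keep the largest value in each non-overlapping window."""
    return [max(values[i:i + window]) for i in range(0, len(values) - window + 1, window)]


features = [[1.0, -2.0, 0.5]]                     # one transaction, three features
weights = [[0.5, -1.0], [0.25, 0.5], [2.0, 1.0]]  # 3 x 2 weight matrix

print(matmul(features, weights))                   # [[1.0, -1.5]]
print(conv1d([1.0, 2.0, 3.0, 4.0], [1.0, -1.0]))   # [-1.0, -1.0, -1.0]
print(max_pool_1d([1.0, 3.0, 2.0, 5.0], 2))        # [3.0, 5.0]
```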
Firmware running on both the core and the AI accelerator implements the new instructions in combination. The core performs address translation and access checking for the tensor data, translating each virtual program address to a physical address and completing all access checks before handing the work to the accelerator.
The core also prefetches tensor data into the L2 cache to be readily available for the AI accelerator. The firmware coordinates the staging of the data into the L2 cache and the accelerator.
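The division of labor can be pictured with a toy model: the core walks a page table, checks permissions, and hands the accelerator only physical addresses. Everything in this sketch is an assumption for illustration, including the 4 KB page size, the dictionary page table, and the names `translate_range` and `AccessError`. It is a generic virtual-memory model, not IBM's actual translation mechanism.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class AccessError(Exception):
    """Raised when a tensor buffer touches an unmapped or read-only page."""


def translate_range(page_table, virtual_address, length, write=False):
    """Translate a virtual buffer into (physical_address, size) chunks, checking each page."""
    chunks = []
    address = virtual_address
    end = virtual_address + length
    while address < end:
        page_number, offset = divmod(address, PAGE_SIZE)
        entry = page_table.get(page_number)
        if entry is None:
            raise AccessError(f"page {page_number} is not mapped")
        frame_number, writable = entry
        if write and not writable:
            raise AccessError(f"page {page_number} is read-only")
        # A chunk ends at the page boundary or at the end of the buffer.
        size = min(PAGE_SIZE - offset, end - address)
        chunks.append((frame_number * PAGE_SIZE + offset, size))
        address += size
    return chunks


# Hypothetical page table: virtual page -> (physical frame, writable flag).
page_table = {0: (7, False), 1: (3, True)}

# A 200-byte tensor that straddles two pages becomes two physical chunks.
print(translate_range(page_table, 4000, 200))  # [(32672, 96), (12288, 104)]
```

Because the checks happen before any physical address is produced, a write request against the read-only page in this model raises `AccessError` and the accelerator never sees the buffer.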
The accelerator can deliver six TFLOPs per chip from two independent compute arrays, one geared towards matrix operations and the second geared towards activation functions. A thirty-two chip system with four drawers will provide over 200 TFLOPs of compute and have access to 8GB of total system cache.
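As a quick sanity check on the system-level figures, here is the arithmetic in Python. It uses only the numbers quoted in this article. Treating six TFLOPs as a per-chip floor rather than an exact value is my assumption, as is counting 1 GB as 1024 MB.

```python
CHIPS = 32
TFLOPS_PER_CHIP = 6.0  # per-chip figure quoted in the article
SYSTEM_CACHE_GB = 8    # total system cache quoted in the article

# Aggregate compute if every chip delivers exactly six TFLOPs.
system_tflops = CHIPS * TFLOPS_PER_CHIP

# Per-chip throughput needed for the system to reach 200 TFLOPs.
per_chip_for_200 = 200 / CHIPS

# Share of the system cache that sits on each chip.
cache_per_chip_mb = SYSTEM_CACHE_GB * 1024 / CHIPS

print(system_tflops)      # 192.0
print(per_chip_for_200)   # 6.25
print(cache_per_chip_mb)  # 256.0
```

Exactly six TFLOPs per chip works out to 192 TFLOPs across 32 chips, so the "over 200" figure implies each chip delivers somewhat more than six, at least 6.25 TFLOPs.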
The matrix array consists of 128 processor tiles with eight-way FP16 SIMDs connected in a mesh-like topology. The activation array consists of 32 processor tiles with eight-way FP16/FP32 SIMDs optimized for ReLU, sigmoid, tanh, log, and complex functions like softmax and LSTM (used in natural language processing).
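For reference, here is what a few of these activation functions compute, written as a minimal Python sketch using only the standard library. Softmax is included as a typical example of a more complex function. The implementations are textbook definitions in double precision and say nothing about how the hardware tiles evaluate them or in which number format.

```python
import math


def relu(x):
    """Pass positive values through and clamp negatives to zero."""
    return max(0.0, x)


def sigmoid(x):
    """Squash x into (0, 1); the two branches avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softmax(values):
    """Turn a list of scores into probabilities that sum to one."""
    peak = max(values)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


print(relu(-2.0))           # 0.0
print(sigmoid(0.0))         # 0.5
print(math.tanh(0.0))       # 0.0
print(softmax([0.0, 0.0]))  # [0.5, 0.5]
```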
An intelligent data fabric controls the data flow to keep the six TFLOP compute array busy. The intelligent prefetcher works with the core to receive translated addresses, fetch the source, and store results. The AI accelerator has intelligent prefetch, write-back controllers, large scratch pads, and data buffers controlled by micro-cores to ensure efficient use of the compute capacity.
The data movers can shuffle the data to and from the chip ring with about 100GB/s bandwidth. Then internally, this data can be distributed from the scratchpad to the compute engines with more than 600GB/s bandwidth, ensuring high utilization of the compute arrays, which provides low latency and high bandwidth AI capabilities.
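To put those bandwidth figures in perspective, here is a rough calculation of how long it takes to move a tensor at each rate. The 1 MB tensor size is a hypothetical example, and the model is idealized: it assumes the full quoted bandwidth is sustained and ignores setup and contention.

```python
def transfer_time_us(size_bytes, bandwidth_gb_per_s):
    """Idealized time in microseconds to move size_bytes at the given bandwidth (1 GB = 1e9 bytes)."""
    return size_bytes / (bandwidth_gb_per_s * 1e9) * 1e6


TENSOR_BYTES = 1_000_000     # hypothetical 1 MB tensor
RING_GB_PER_S = 100.0        # "about 100GB/s" to and from the chip ring
SCRATCHPAD_GB_PER_S = 600.0  # "more than 600GB/s" from scratchpad to compute engines

ring_us = transfer_time_us(TENSOR_BYTES, RING_GB_PER_S)
scratchpad_us = transfer_time_us(TENSOR_BYTES, SCRATCHPAD_GB_PER_S)

print(f"ring: {ring_us:.2f} us")                         # ring: 10.00 us
print(f"scratchpad to compute: {scratchpad_us:.2f} us")  # scratchpad to compute: 1.67 us
```

The sixfold gap suggests that data staged in the scratchpad can be reused several times by the compute engines for each trip over the ring, which is consistent with the emphasis on large scratchpads and prefetching.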
There is a software ecosystem that enables the exploitation of this accelerator. Customers can build and train AI models anywhere on any platform. Familiar tools used by data scientists such as Keras, PyTorch, SAS, MATLAB, Chainer, MXNet, and TensorFlow are supported. Trained models exported to the Open Neural Network Exchange (ONNX) format are fed to the IBM Deep Learning Compiler, which compiles and optimizes them for execution directly on the AI accelerator on the Telum chip.
It may seem odd to start with AI, but that’s where the heat is here. But there is more. IBM says it is delivering a “40% per socket performance improvement.” Each chip delivers 8 cores / 16 threads at over 5GHz. The maximum performance is delivered with 32 cores / 64 threads with a four-drawer system configuration. Each core has a 32MB L2, connected via a 320GB/s ring for L3 and L4. Each chip is 530mm², has 22B transistors, and is fabbed on Samsung 7“nm.” I am putting “nm” in quotes as I do not believe the node is actually 7nm. Based on performance, power, and density characteristics, I think it’s closer to Intel’s 10nm or Intel 7 process.
Telum is the work of an extensive team across IBM covering chip design, operating systems, and software, with research to define the silicon technology and the AI accelerator.
Others talk about real-time inference and deep learning, often in the context of image recognition. What IBM is tackling here goes beyond recognizing cats and dogs to optimizing fraud detection by bringing real-time AI and deep learning inferences to transactional workloads that are very latency-sensitive.
It is fair to say that IBM is the first to bring real-time deep learning to response-time-sensitive transactional workloads. A differentiating factor when performing inference is that the entire 6 TFLOPS per chip of the inference accelerator is available to one core for AI work. In contrast, competitive chips allocate dedicated silicon across many cores. The AI accelerator provides the total inference capacity to make every single transaction perform at low latency.
IBM does shy away from publishing traditional third-party benchmarks, claiming they are irrelevant to what customers are doing on a massive scale. IBM has sizing tools for capacity planning, and Large Systems Performance Reference (LSPR) numbers are used to compare processor generations for planning.
IBM has worked with several customers to validate the design goal of real-time low-latency inference. This effort involved building proxy models for real-world applications of AI, such as a recurrent neural network model for credit card fraud detection co-developed with a global bank. Running that model on a single Telum chip achieved more than one hundred thousand inference tasks per second with a latency of only 1.1 milliseconds. Scaling up to thirty-two chips achieved three and a half million inferences per second with the latency still low at 1.2 milliseconds. While this test only ran the inference tasks and not the credit card transaction workload, it confirms that the AI accelerator has sufficient bandwidth to deliver inference at scale with low enough latency to be embedded directly into transactional workloads like credit card processing.
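One way to read these throughput and latency numbers together is Little's law, which says the average number of requests in flight equals the arrival rate multiplied by the time each request spends in the system. The sketch below applies it to the quoted figures. It assumes steady-state operation and treats the quoted latency as an average, neither of which the published results confirm, and `requests_in_flight` is a hypothetical helper name.

```python
def requests_in_flight(throughput_per_s, latency_ms):
    """Little's law: average concurrency = arrival rate x time in system."""
    return throughput_per_s * (latency_ms / 1000.0)


# Figures quoted in the article.
single_chip = requests_in_flight(100_000, 1.1)
full_system = requests_in_flight(3_500_000, 1.2)

print(round(single_chip))  # 110
print(round(full_system))  # 4200
```

Under those assumptions, roughly 110 inference tasks are in flight at any moment on a single chip, and about 4,200 across the scaled-up system.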
As the Telum chip gets into the market in the first half of 2022, it will be exciting to see how customers realize value by embedding AI capabilities directly into enterprise workloads.
Note: Moor Insights & Strategy writers and editors may have contributed to this article.