Graphcore, a UK-based startup, launched its first Intelligence Processing Unit (IPU) for AI acceleration in 2018. Today it introduced its second-generation product for AI, a massively parallel chip with 59.4 billion transistors that delivers some 250 Trillion Operations per Second (TOPS). However, the company’s potential to challenge NVIDIA, the leader in data center AI, lies not just in the new chip’s performance, but in the new IPU-Machine, Poplar software and interconnect fabric. Together, these elements promise to provide the scalability necessary to handle massive AI models. While we must await real application performance comparisons to assess the platform’s true capabilities, Graphcore’s new offering potentially represents the first candidate capable of challenging NVIDIA’s new Ampere-based A100, at least for very large-scale AI models. Time will tell if that is good enough to overcome NVIDIA’s massive lead in software and AI ecosystem, but investors certainly believe in the company—they have bid up Graphcore’s value to nearly $2B.
The Colossus MK2 IPU (GC200)
The new Colossus MK2, manufactured by TSMC, is a 59.4-billion-transistor parallel processor that delivers some 250 Trillion Operations per Second (TOPS) across 1,472 cores and 900MB of In-Processor-Memory (a 3X increase over the MK1). All of this is interconnected across a 2.8Tb/s low-latency fabric. Most of the architectural design of the MK1 generation carries over to the MK2 platform. Like its predecessor, the processing tiles contain cores and on-die SRAM, interconnected over the same fabric, which can extend off-die to communicate with other IPU domains.
The MK2 has a few new features beyond just more cores and memory. New sparsity optimizations help avoid wasted cycles for a range of data patterns, including block, scalar, static and dynamic sparsity. According to Graphcore, these optimizations stand to increase performance by a factor of two or more. The MK2 IPU also gets a performance boost from a set of novel floating-point techniques Graphcore calls AI-Float. These are used to tune energy and performance for AI computation, including stochastic rounding that enables FP16 master weights to match FP32 accuracy, and an FP16.16 format that matches the accuracy of FP16.32 for forward and backward propagation. The chip also supports 62.5 TFLOPS of single-precision FP32.
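The appeal of stochastic rounding is that it makes the rounding error zero in expectation, which is what keeps low-precision master weights from drifting systematically during training. The NumPy sketch below is purely illustrative (my model of the general technique, not Graphcore's hardware implementation):

```python
import numpy as np

def stochastic_round_fp16(x, rng):
    """Illustrative sketch: round FP32 values to FP16 stochastically.

    Each value rounds up or down to an adjacent FP16 value with
    probability proportional to its distance from each neighbour, so
    the expected value of the result equals the input. This is a toy
    NumPy model, not Graphcore's AI-Float hardware.
    """
    x = np.asarray(x, dtype=np.float32)
    nearest = x.astype(np.float16)
    # pick the FP16 neighbour at or below x, then the one just above it
    down = np.where(nearest.astype(np.float32) <= x,
                    nearest, np.nextafter(nearest, np.float16(-np.inf)))
    up = np.nextafter(down, np.float16(np.inf))
    d32, u32 = down.astype(np.float32), up.astype(np.float32)
    p_up = (x - d32) / (u32 - d32)   # fractional position inside the gap
    return np.where(rng.random(x.shape) < p_up, up, down)
```

Averaged over many weight updates, the stochastically rounded values track the true FP32 mean far more closely than deterministic round-to-nearest, which always lands on the same neighbour.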
Graphcore productized its MK1 silicon in a two-IPU PCIe board to ease adoption and speed time to market. With the MK2 version, Graphcore took this productization a step further. The company announced a 1-rack-unit appliance, called the M2000 IPU-Machine, that contains four MK2 chips. At a system level, the larger on-die IPU memory is now supplemented by up to 448GB of “Streaming Memory” DRAM on the IPU-Machine. Since deep learning model size is doubling every 3.5 months (according to OpenAI), this memory architecture could be a game changer. Graphcore claims it provides 100 times the bandwidth and 10 times the capacity of High Bandwidth Memory (HBM2), at a significantly lower cost.
The 1U boxes are interconnected over 100Gb Ethernet with RoCE (RDMA over Converged Ethernet) for low-latency access. Using Ethernet avoids the bottlenecks and costs of PCIe connectors and enables a flexible CPU-to-accelerator ratio. The IPU Fabric connects to other IPUs by tunneling its protocol and data over Ethernet, maintaining the same programming model and Bulk Synchronous Parallel (BSP) inter-tile synchronization regardless of the size of the deployment. The IPU-Machine carries a list price of $32,450, which may sound expensive to the uninitiated. However, this is a good value when one compares the platform’s performance to the competition.
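The BSP model alternates a local compute phase with a synchronized exchange phase. The toy sketch below, using plain Python threads, mimics that compute/sync/exchange cycle; `bsp_run` and its message format are my invention for exposition, not Graphcore's Poplar API:

```python
import threading

def bsp_run(num_tiles, supersteps, compute):
    """Toy Bulk Synchronous Parallel scheduler (illustrative only).

    Each superstep runs a local compute phase on every tile, waits at
    a barrier, delivers messages in an exchange phase, then waits
    again, mirroring the cycle BSP enforces across IPU tiles.
    `compute(tile_id, step, received)` returns (dest_tile, payload)
    pairs to send.
    """
    barrier = threading.Barrier(num_tiles)
    current = [[] for _ in range(num_tiles)]  # delivered last superstep
    nxt = [[] for _ in range(num_tiles)]      # being sent this superstep

    def tile(tid):
        for step in range(supersteps):
            outgoing = compute(tid, step, current[tid])  # compute phase
            barrier.wait()                       # sync: all compute done
            for dst, payload in outgoing:        # exchange phase
                nxt[dst].append(payload)         # atomic under the GIL
            barrier.wait()                       # sync: all exchange done
            if tid == 0:                         # one tile swaps buffers
                for i in range(num_tiles):
                    current[i], nxt[i] = nxt[i], []
            barrier.wait()

    threads = [threading.Thread(target=tile, args=(t,))
               for t in range(num_tiles)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# Example: each tile broadcasts its id in superstep 0, then sums what
# it received in superstep 1, a miniature all-reduce.
results = [None] * 4

def compute(tid, step, received):
    if step == 0:
        return [(dst, tid) for dst in range(4)]
    results[tid] = sum(received)
    return []

bsp_run(4, 2, compute)
print(results)  # [6, 6, 6, 6]
```

The point of the barriers is that no tile reads messages until every tile has finished sending them, which is what lets the same program run unchanged whether the "tiles" sit on one die or across an Ethernet-tunneled fabric.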
The second-generation IPU fabric
Built-in fabrics are becoming a necessity for AI training accelerators, since model sizes are increasing dramatically. The Graphcore fabric enables a flexible disaggregation model, allowing the user to configure an array of accelerators on the fly without being constrained by a fixed ratio of CPUs to accelerators (as they would be on most other systems). By leveraging 100Gb Ethernet, elastic configurations are easy to deploy and simple to use, enabling scaling up to 64,000 IPUs. This disaggregated scaling model is one of the most significant features of the second generation Graphcore IPU platform, enabling a wide range of deployment options.
Graphcore says the new 4-chip IPU-Machine delivers 7-9 times the performance of the 2-chip predecessor PCIe card in training neural networks, and over 8 times the performance in inference processing. Since the new box packs twice as many chips, a chip-to-chip comparison would put the MK2 at roughly a 3.5-4.5 times improvement over the MK1.
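That normalization is simple back-of-envelope arithmetic (my calculation, not a Graphcore figure): halve the box-level speedup, because the M2000 carries twice the chips of the MK1 card.

```python
# Hypothetical normalization of the reported box-level speedup
# (7-9x for a 4-chip M2000 vs. a 2-chip MK1 PCIe card) to per-chip terms.
mk1_chips, mk2_chips = 2, 4
per_chip = [box_speedup * mk1_chips / mk2_chips for box_speedup in (7, 9)]
print(per_chip)  # [3.5, 4.5]
```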
Graphcore already offers its first-generation accelerators in the Microsoft Azure cloud, as well as in Dell-EMC servers. Delivering the new Colossus MK2 IPU in a plug-and-play hardware and software platform could prove to be a good strategy for wider adoption of Graphcore technology, as customers and partners can use these building blocks to deliver any scale of infrastructure desired. I believe that Graphcore is one of the few startups so far that can challenge the status quo in AI acceleration and advance parallel processing in a meaningful way. It remains to be seen if Graphcore can also extend this platform to inference processing, where cost-effectiveness and software ecosystems represent potential hurdles for any startup. It also remains to be seen which large customers may become clients, and how well the new platform can perform in real-world applications. But from what we know today, it looks like the new MK2 and IPU-Machine could become a contender. For more, see the Moor Insights & Strategy white paper on this topic.