On Monday, seven technology companies announced that they had reached an agreement to develop and implement an interconnect that would enable different vendors’ CPUs and accelerators to talk to one another while sharing main memory. Specifically, Advanced Micro Devices, ARM Holdings, Huawei Technologies, IBM, Mellanox, Qualcomm and Xilinx jointly announced that they would collaborate to build a cache-coherent fabric to interconnect their CPUs, accelerators and networks.
This will be no small task. It is hard enough to build a cache-coherent interface between two or four homogeneous chips such as CPUs. Building one that allows devices to share data across disparate implementations of CPUs, FPGAs, GPUs and network chips will be a monumental challenge. However, if they can pull this off, the potential benefits could be tremendous: plug-and-play compute and network acceleration for whatever processor you choose, with much better performance than is available today over the PCIe interconnect on which today's systems depend.
The CCIX fabric connects processors to accelerators and other processors. (Source: CCIX Consortium)
What problems does this solve?
As Moore's Law has slowed the traditional doubling of performance every 18 months, engineers are increasingly turning to accelerators such as GPUs, FPGAs, DSPs and network offload engines to keep up with customers' demands for better, cheaper and faster computing. Today's accelerators typically connect to the processor, be it x86, ARM or POWER, via PCIe (PCI Express): the tried-and-true I/O connectivity protocol that has been around in one form or another since 2003.
But PCIe was designed as an I/O interface and is not well suited to the high-bandwidth, low-latency connectivity required between processors, which typically demand links an order of magnitude faster than PCIe can provide. Accelerator "Nirvana" would be to have the CPU and a reasonable number of accelerators all behave as first-class citizens on a shared memory bus or fabric, sharing access to high-speed memory that is "cache-coherent," ensuring everyone retrieves the same value from a memory location even when another party has updated that location in its local cached copy. The alternative is to have the CPU intervene in accelerator memory accesses, which creates an unacceptable bottleneck. Cache coherency sounds easy, but it is definitely not, especially as you scale the number of processors, each of which must "snoop" everyone else's cache to ensure they all remain "coherent" without incurring unacceptable overhead.
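To make the snooping idea concrete, here is a deliberately minimal sketch, in Python, of a toy invalidate-on-write coherence scheme. This is purely illustrative: the classes and names are hypothetical and not drawn from CCIX or any real protocol, and it also shows why coherence traffic grows as more caches are attached, since every write must probe every peer.

```python
# Toy snooping-coherence sketch (hypothetical, for illustration only):
# each write broadcasts an invalidate to all other caches on the bus,
# so peers drop stale copies and re-fetch the current value on read.

class Bus:
    def __init__(self):
        self.caches = []
        self.memory = {}     # addr -> value (backing store)
        self.snoops = 0      # count of snoop probes generated

    def attach(self, cache):
        self.caches.append(cache)

    def broadcast_invalidate(self, writer, addr):
        for c in self.caches:
            if c is not writer:
                c.snoop_invalidate(addr)
                self.snoops += 1   # every write probes every other cache

class Cache:
    def __init__(self, name, bus):
        self.name = name
        self.data = {}       # addr -> locally cached value
        self.bus = bus
        bus.attach(self)

    def read(self, addr):
        if addr not in self.data:                  # miss: fetch from memory
            self.data[addr] = self.bus.memory.get(addr, 0)
        return self.data[addr]

    def write(self, addr, value):
        self.bus.broadcast_invalidate(self, addr)  # snoop: invalidate peers
        self.data[addr] = value
        self.bus.memory[addr] = value              # write-through, for simplicity

    def snoop_invalidate(self, addr):
        self.data.pop(addr, None)                  # drop the stale copy

bus = Bus()
cpu, gpu = Cache("cpu", bus), Cache("gpu", bus)
cpu.write(0x10, 1)
assert gpu.read(0x10) == 1   # the accelerator sees the CPU's write
gpu.write(0x10, 2)
assert cpu.read(0x10) == 2   # the CPU's stale copy was invalidated
```

Real coherence protocols (MESI and its variants) track per-line states and use directories or filtered snoops precisely because this broadcast-everything approach stops scaling as the number of participants grows, which is the overhead problem the article describes.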
Interestingly, IBM and NVIDIA have each developed their own technologies to address this challenge. With POWER8, IBM announced the Coherent Accelerator Processor Interface, or CAPI, which Xilinx is now using to improve performance. But CAPI is supported only by IBM. NVIDIA, meanwhile, has invested in its own solution, NVLink, which provides faster connectivity among GPUs and between GPUs and IBM POWER processors. It is reasonable to expect CAPI and CCIX to converge over time; NVIDIA, however, is notable by its absence from this announcement.
A common threat makes for strange bedfellows
This announcement shows that the companies involved recognize the need to work closely together to present a seamless system architecture to their customers, which is exactly what Intel is widely expected to do with its acquisition of FPGA vendor Altera: reaching Accelerator Nirvana with an all-Intel solution that likely includes Intel's coherent bus, QPI. IBM with OpenPOWER, AMD with x86, and the ARM partners (AMD, Qualcomm and Huawei) could each go it alone, or they can join forces. Not collaborating would cede a substantial advantage to Intel and risk a fragmentation the industry would not accept.
Where do we go from here?
It's very early days for CCIX. There's a one-page website with no details even on board structure or bylaws; this is a natural first step. The CCIX consortium has promised to provide more information about its specification over the coming months. No timetable was given for technology access or product availability, but it is reasonable to expect that this could take years to develop, since it depends on the generation of processors after those already in flight: after POWER9, after AMD's Zen, and perhaps after ARM's next generation. This implies that we probably won't see the fruits of this effort until 2019 or perhaps even 2020. The partners will need to wrestle with complex technical, business and cultural issues, any of which could derail the effort.
There is no doubt that the computing industry will laud these efforts as the only sensible path toward an alternative to an all-Intel computing world.