Mellanox & HDR InfiniBand

As SC16 has ended, I find myself recapping the many noteworthy things I saw. For those of you who are not heavily involved in High Performance Computing (HPC), SC, or Supercomputing, is the premier event where the HPC segment of the industry "struts its stuff." Most things were not a surprise, but it is still nice to see their availability begin to take shape. One of the most notable (at least to me) is the 200Gb/s HDR InfiniBand that Mellanox plans to introduce early next year. It looks interesting, and in fact I almost entitled this article "A Work-Delivering Network," because that is its focus. The real measure of a system is enabling as much work as possible with the best economics. Everyone is asking the question, "Do I go with InfiniBand, or move to OmniPath or even Ethernet?" If you are looking for that answer here, you won't find it. What you will find is some food for thought about what Mellanox brings to the table to perform real work. These are things to consider in all solutions.

(Image: ConnectX-6 adapter. Source: Mellanox Technologies)

Before jumping into the Mellanox offering, a brief mention of Ethernet and OmniPath is probably in order. Most folks are familiar with Ethernet because of its extensive use in consumer devices like notebooks and PCs; in fact, the connection on the back of your broadband modem at home is most likely Ethernet. Ethernet is widely used in data centers, and the 400Gb/s specification is expected to be ratified next year. Some HPC systems where a low-latency network is not required use Ethernet and rely on the Transmission Control Protocol / Internet Protocol (TCP/IP), a "heavy" software stack. OmniPath, Intel's network alternative, currently has a bandwidth of 100Gb/s. Intel has disclosed plans to integrate the OmniPath host channel adapter onto the CPU, and it will be interesting to see its performance.
While Mellanox has announced the 200Gb/s speed, the real ability to do work involves much more than just network speed and moving data from point to point. Mellanox refers to this as Data-Centric offload, built on the idea that there are many places where work can be performed, and that as data moves through the network, you should take advantage of them. Here are the major parts of their whole solution:
  1. ConnectX-6 adapter: 200Gb/s @ 0.6 µs latency and 200M messages/second
  2. Quantum switch: 40 HDR ports (80 HDR100 breakout ports) @ 16 Tb/s with <90 ns latency
  3. LinkX transceivers: active optical and copper cables up to 200Gb/s
  4. HPC-X software package: for MPI, SHMEM / PGAS, and UPC
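To get a feel for what these headline numbers mean in practice, here is a back-of-envelope calculation using the figures above. It is purely illustrative (the constants come from the list; the function name and the simplifying assumptions are mine), and it deliberately ignores cable propagation, PCIe transfers, and software overhead:

```python
# Figures from the component list above (illustrative only).
HDR_BANDWIDTH_BPS = 200e9    # 200 Gb/s per ConnectX-6 port
ADAPTER_LATENCY_S = 0.6e-6   # ~0.6 microsecond adapter latency
SWITCH_LATENCY_S = 90e-9     # <90 ns per Quantum switch hop

def transfer_time_s(message_bytes, switch_hops=1):
    """Rough one-way time: serialization + adapter latency + switch hops.

    Ignores cable propagation delay, PCIe, and software overhead.
    """
    serialization = message_bytes * 8 / HDR_BANDWIDTH_BPS
    return serialization + ADAPTER_LATENCY_S + switch_hops * SWITCH_LATENCY_S

# A 1 MiB message across one switch hop: at this size the time is
# dominated by serialization, not by adapter or switch latency.
print(f"{transfer_time_s(1 << 20) * 1e6:.1f} us")
```

The takeaway is that for large messages the 200Gb/s line rate dominates, while for small messages the sub-microsecond adapter and sub-100 ns switch latencies are what matter.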
The interesting part is the way all these building blocks play together. Mellanox ConnectX adapters have always been able to run InfiniBand or Ethernet. Mellanox can now boast that its RDMA solution, even at 200Gb/s, needs only 1% CPU utilization (not a lot of overhead for this data rate). This behavior includes data access and movement for both compute and storage. All traffic is managed and operated by the adapter and its adaptive-routing and congestion-management software. This data rate is not for the faint of heart: HDR requires either a PCIe Gen 3 x32 or a PCIe Gen 4 x16 interface on the platform. ConnectX-6 also includes offloads for NVMe over Fabrics, T10-DIF and erasure coding, encryption / decryption, and others.

The HPC-X software package is a toolkit that includes MPI, PGAS / SHMEM, and UPC communication libraries. Its MPI is especially noteworthy, as it uses Tag-Matching and Rendezvous Protocol offloads. This capability means that part of the MPI library actually runs on the switch. When results are coalesced in the network, not only are the main server processors freed, but the operation scales with the network: as additional switches are added, the ability to coalesce or collect results into a single result increases. It is going to be interesting to see how this will impact overall cluster performance and how competitive offerings will stack up.
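The in-network coalescing idea can be sketched abstractly: instead of every node sending its raw result to one root (whose work grows with the node count), each switch combines the partial results of the nodes below it and forwards a single value up the tree. The sketch below is purely illustrative, not Mellanox's implementation; the function name is mine, and the fan-in of 40 mirrors the 40-port Quantum switch:

```python
from functools import reduce
import operator

def tree_reduce(values, fan_in=40, op=operator.add):
    """Combine values level by level, the way switches with `fan_in`
    ports could coalesce partial results instead of forwarding raw data."""
    level = list(values)
    while len(level) > 1:
        # Each "switch" reduces up to fan_in inputs into one output.
        level = [reduce(op, level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
    return level[0]

# 1600 node contributions through 40-port switches: two levels suffice,
# and the result matches a flat sum.
total = tree_reduce(range(1600), fan_in=40)
print(total == sum(range(1600)))
```

Because each level shrinks the data by the fan-in factor, adding switches adds coalescing capacity rather than load on the root, which is the scaling property described above.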