IBM Spectrum Storage for AI Brings Scalable Storage To Deep Learning

By Steve McDowell - December 12, 2018

AI and deep learning are invading the enterprise. NVIDIA Corporation is in the midst of an unprecedented run, delivering targeted technology and products that enable companies to learn from their data. These learnings can lead to competitive insights, recognizing new trends, fueling control systems for intelligent infrastructure, or simply providing predictive capabilities to better manage the business. The challenge in deploying these systems is one of balance. Storage in the datacenter has evolved to service the needs of mainstream business applications, not highly-parallel deep learning systems.

The fidelity and efficiency of machine learning are highly dependent on delivering data to the learning pipelines with high throughput and low latency. Beyond simply feeding the compute engines, overall system throughput is impacted by the efficiency of ingesting data from its source, as well as archiving data once it has served its training purposes. This becomes especially true at scale, where large amounts of data feed large numbers of processing engines.

High volumes of data need to be ingested and cleaned. Actual machine learning is all about high-throughput, low-latency, small-block random I/O. Models are archived to cloud or tape, driving the need for high degrees of scalability on the back end.

Scale is where IBM Corporation shines the brightest. It was no surprise this week when IBM announced its highly scalable IBM Spectrum Storage for AI with NVIDIA DGX solution to solve the problems of storing and serving data for AI-driven analytics and machine learning.

IBM Spectrum Storage for AI with NVIDIA DGX

IBM Spectrum Storage for AI with NVIDIA DGX seeks to solve the problems inherent in storing and serving data in the AI-driven machine learning environment. It aims to solve those problems at scale. It also promises to seamlessly integrate into an enterprise workflow, leveraging IBM’s pervasive Spectrum Storage suite of offerings.

The new offering is a reference architecture that combines NVIDIA DGX servers with a new AI-tuned storage stack from IBM. IBM’s AI storage stack combines IBM Spectrum Scale v5 software-defined storage with a new NVMe all-flash storage solution, all while allowing full integration into the world of IBM’s Spectrum Storage suite of products. Tying it all together is an InfiniBand network powered by Mellanox Technologies.

At the heart of IBM’s value-add is the updated Spectrum Scale v5 software-defined storage solution. Spectrum Scale v5 is architected specifically for AI workloads.
Training models for machine learning requires highly efficient metadata access, small-file access, and fast small-block random I/O. This is what IBM Spectrum Scale v5 is built to deliver.

IBM’s software-defined composable storage approach makes a lot of sense for enterprises looking to expand an AI capability over time. Organizations can deploy the solution with a single system and seamlessly scale out as growth is needed. The flexibility also allows IBM’s partners to deliver solutions that are the right size for the end customer’s needs.

Keeping with IBM’s enterprise and total solution focus, IBM Spectrum Storage for AI seamlessly integrates with the rest of the IBM storage portfolio. This brings the powerful ability to integrate AI and machine learning into existing storage workflows, up to and including archiving to cloud or tape.

Supplementing the updated Spectrum Scale software is a new NVMe all-flash storage solution. The new storage box promises to deliver over 300TB in every 2U building block, with 120GB/s of data throughput in a standard rack configuration.

Scalability is at the heart of IBM’s value proposition. The solution supports up to nine NVIDIA DGX-1 servers and three 2U NVMe flash storage arrays in a single rack. The performance numbers that IBM shared at the announcement were impressive, showing linear scalability and top-tier performance for systems with high numbers of GPUs.

Performance

IBM Spectrum Storage for AI with NVIDIA DGX is built for scalable performance. The new NVMe all-flash storage in IBM Spectrum Storage for AI can deliver throughput of 120GB/s in a rack, with over 30GB/s sustained throughput from each 2U storage array.

More impressive than raw array performance is the scaling curve that is achievable with IBM’s Spectrum Scale software. IBM demonstrates near-linear performance scalability while feeding up to 72 GPUs in a single rack.
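
The rack-level figures quoted above can be cross-checked with some simple arithmetic. The Python sketch below is only a back-of-the-envelope illustration, not anything IBM published. It assumes eight GPUs per DGX-1 (not stated in this article, though consistent with nine servers feeding 72 GPUs), treats "over 300TB" as a lower bound, and splits the rack throughput evenly across arrays and GPUs. The function names `rack_summary` and `scaling_efficiency` are hypothetical helpers introduced here for illustration.

```python
# Back-of-the-envelope rack arithmetic using figures quoted in the article.
# GPUS_PER_DGX1 is an assumption (the article does not state it), chosen
# because 9 servers x 8 GPUs matches the article's 72 GPUs per rack.

DGX1_SERVERS_PER_RACK = 9      # from the article
GPUS_PER_DGX1 = 8              # assumption, see note above
ARRAYS_PER_RACK = 3            # from the article
RACK_THROUGHPUT_GB_S = 120.0   # GB/s per rack, from the article
ARRAY_CAPACITY_TB = 300.0      # "over 300TB" per 2U, used as a lower bound


def rack_summary():
    """Derive per-GPU and per-array figures, assuming an even split."""
    gpus = DGX1_SERVERS_PER_RACK * GPUS_PER_DGX1
    return {
        "gpus": gpus,
        "gb_per_s_per_gpu": RACK_THROUGHPUT_GB_S / gpus,
        "gb_per_s_per_array": RACK_THROUGHPUT_GB_S / ARRAYS_PER_RACK,
        "min_capacity_tb": ARRAYS_PER_RACK * ARRAY_CAPACITY_TB,
    }


def scaling_efficiency(base_gpus, base_rate, gpus, rate):
    """Measured speedup divided by ideal linear speedup (1.0 = perfectly linear)."""
    ideal_speedup = gpus / base_gpus
    measured_speedup = rate / base_rate
    return measured_speedup / ideal_speedup


if __name__ == "__main__":
    for name, value in rack_summary().items():
        print(f"{name}: {value:.2f}")
```

Under those assumptions, a fully populated rack offers each GPU a share of roughly 1.67GB/s, and an even split works out to 40GB/s per array, which is consistent with the "over 30GB/s sustained" figure. `scaling_efficiency` expresses what "near-linear" means in practice: feed it training rates measured at two GPU counts, and a result close to 1.0 indicates the storage is keeping up as GPUs are added.
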
This scalability is directly attributable to the AI-focused tuning that IBM put into Spectrum Scale v5. IBM showed performance numbers that compare IBM’s Spectrum Storage for AI with NVIDIA DGX against similarly architected converged solutions also built around NVIDIA DGX.

I can’t publish the actual numbers that IBM shared with the analyst community for its ResNet training benchmarks. I will say that no published benchmark numbers beat what IBM claims above 8 GPUs for ResNet-152 and ResNet-50, and IBM sits nearly alone above 32 GPUs on both benchmarks. Comparing the results published to date for converged NVIDIA DGX solutions gives IBM a serious performance advantage in head-to-head tested data throughput results.

While these numbers need to be validated by real-world testing on released hardware, if they are anywhere close to accurate then IBM will continue its impressive run of delivering top-tier storage at scale. I like that IBM is unafraid of providing a performance comparison. IBM also takes scalability one step further than its competition with its unique software-defined storage approach. IBM’s solution is extremely competitive at the entry and mid-level tiers, and looks as if it will dominate performance numbers.

The competition

IBM Spectrum Storage for AI with NVIDIA DGX marks NVIDIA’s fourth public collaboration with a storage vendor to deliver an AI and machine learning reference architecture. Three of those collaborations are targeted at the enterprise and commercial AI and machine learning market.

Pure Storage was the first to collaborate with NVIDIA. The Pure Storage AIRI set the stage for what is possible when it comes to combining high-performance storage with NVIDIA DGX servers. Marrying Pure’s FlashBlade technology with NVIDIA servers, Pure delivered new technology targeted directly at solving the issues intrinsic in AI and learning workloads. I really like what Pure did with both AIRI and FlashBlade.
Less attractive is NVIDIA’s collaboration with NetApp. While NVIDIA brings stellar compute capabilities to the relationship, it seems that NetApp is simply bundling its existing A800 all-flash array into a package with NVIDIA’s DGX-1 and a 100Gb Ethernet switch from Cisco, and branding the package ONTAP AI. The solution might be fine to service the needs of NetApp’s existing installed base, but it’s not a differentiated solution.

IBM’s Spectrum Storage for AI is differentiated from both the NetApp and Pure Storage offerings. IBM Spectrum Storage for AI provides a level of scalability that is nearly unmatched by anyone in the industry. It’s both incredibly fast at scale, and it scales linearly. Beyond raw performance, I like that IBM is bringing its enterprise legacy to enable AI in the datacenter. The ability for IBM Spectrum Storage for AI to seamlessly integrate with the rest of the Spectrum Storage suite should make IBM’s solution an easy decision for enterprise buyers.

I love competition. It drives our industry forward and forces innovation. IBM has a great offering with Spectrum Storage for AI, one that raises the bar.

Concluding thoughts

Beyond providing just another indicator of NVIDIA’s near total domination of the AI and machine learning space, IBM’s announcement brings both additional credibility to AI in the enterprise, and a solution that is built to integrate with the tools enterprises are already relying upon.

Enterprise IT is demanding targeted solutions for AI and machine learning. There is no vendor more in touch with the needs of enterprise computing than IBM. IBM delivering this bundle to market provides further validation that machine learning, and machine learning at scale, is a very real trend for enterprise buyers. It’s a trend we should all watch.

The world of AI and machine learning in the enterprise is in its early days.
The industry is still figuring out the right way to exploit the amazing capabilities that are promised by the technology. Balancing storage and compute is a huge part of how this world will evolve. IBM is delivering something to learn from with IBM Spectrum Storage for AI with NVIDIA DGX. I can’t wait to get my hands dirty and play with this beast.