Ampere Computing is a merchant datacenter SoC company that delivers high-performance and efficient computing to the cloud data center market. Over the past few years, I have written a number of articles about the company and followed its progress in creating a new cloud-native processor category.
I believe the shift towards the cloud-native software approach demands new compute engines that sustainably provide performance and aren’t burdened with transistors focused on legacy software. Ampere is a company at the forefront of meeting this demand by developing more efficient processors delivering performance across all segments of the cloud data center.
Ampere recently introduced a new family of Arm-compatible server processors called “AmpereOne”. AmpereOne features a custom CPU core developed in-house; it can scale up to 192 cores—the largest core count currently available in the industry. In this article, I examine Ampere’s new milestone, which is unique and undoubtedly disruptive to all the datacenter CPU and SoC players.
Ampere brings new appeal to its second-generation design
Ampere has established itself with the major cloud service providers like Azure, Google Cloud, Oracle Cloud, Alibaba and Baidu, and is making inroads into enterprise on-premises data centers thanks to the inclusion of Ampere processors in servers from Hewlett Packard Enterprise, SuperMicro and other manufacturers.
That said, nobody in the enterprise space relishes buying version one of anything, and the more risk-averse will usually wait until the second version of a product (if not later) before they adopt it. Now, the second-gen design embodied in AmpereOne should make enterprises feel more comfortable about using it, especially because of its highly appealing value proposition: maximum VMs and containers per rack with the highest efficiency at low power.
AmpereOne picks up where Ampere Altra left off
Ampere’s first two processors, introduced in 2020, were built on cores licensed from Arm Ltd. and manufactured on a 7-nanometer process: the 80-core Altra and the 128-core Altra Max. Together, the Altra family spans 32 to 128 cores. The new AmpereOne family extends the portfolio with 136 to 192 single-threaded cores and more IO, memory, performance and cloud features, with no overlap between the product families.
I expect the use cases for AmpereOne will include AI inferencing, web servers, databases, caching services, media encoding and video streaming. A vital advantage of this CPU is its ability to scale for application workloads almost linearly.
The AmpereOne family of chips is a departure from the previous Altra family. Ampere has custom-designed the cores to tailor its products to better meet hyperscalers’ needs. The CPU is built using the latest 5nm manufacturing process. The custom-designed CPU uses the Arm Instruction Set Architecture (ISA), ensuring compatibility with applications developed on the Altra processors.
Under the hood of the new AmpereOne family
As mentioned, AmpereOne is available with from 136 to 192 single-threaded cores; it also has eight channels of DDR5 memory, 128 lanes of PCIe Gen5 IO and 2MB per core of private cache (double that of the Altra).
AmpereOne also has all the features present in the Altra family: besides single-threaded cores, which are ideal for the cloud, it features 2×128-bit vector units per core, which provide efficiency when manipulating large amounts of data, and FP16 and Int16 support for memory efficiency.
In custom-designing the cores, Ampere has been able to implement several other impressive features. The CPU has bfloat16 support, which is helpful in AI and deep-learning applications where large-scale neural networks are trained and deployed. It has gained traction thanks to its ability to maintain reasonable numerical precision for gradient computations and weight updates, which are critical in training deep-learning models while still providing memory efficiency.
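To make the bfloat16 trade-off concrete, here is a minimal Python sketch of how the format works. Bfloat16 keeps float32's sign and 8-bit exponent but only 7 mantissa bits, so it is simply the top 16 bits of a float32; the conversion below uses truncation for simplicity (real hardware typically rounds to nearest). This is an illustration of the number format, not Ampere's implementation.

```python
import struct

def to_bfloat16(x: float) -> float:
    # Pack as a 32-bit float, then keep only the high 16 bits:
    # sign (1), exponent (8), mantissa (7). This is bfloat16 via truncation.
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    return struct.unpack('>f', struct.pack('>I', bits & 0xFFFF0000))[0]

import math
print(to_bfloat16(math.pi))   # → 3.140625 (only ~3 decimal digits of precision)
print(to_bfloat16(3.0e38))    # still finite: same dynamic range as float32
```

The example shows why bfloat16 suits deep learning: precision drops (pi becomes 3.140625), but the full float32 exponent range is preserved, so gradients rarely overflow or underflow, and each value takes half the memory of float32.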
It’s one thing to have a lot of cores; it’s another to have the right architecture to feed them. Ampere has implemented mesh congestion management to handle the 192 cores. Mesh congestion management is a type of intelligent traffic management that avoids bottlenecks with techniques such as adaptive routing, virtual channels and load balancing. This optimizes the performance and efficiency of the mesh interconnect network between the cores, minimizing the impact of congestion on overall system performance.
Memory and system-level cache (SLC) quality-of-service enforcement are mechanisms implemented to ensure that no single user gets an unfair share of memory bandwidth or SLC capacity, thus providing performance consistency for everyone. Meanwhile, nested virtualization is a way for cloud providers to offer additional services to users by extending virtualization capabilities to allow a VM to act as a virtual host, enabling the creation and execution of nested VMs within it. This allows for scenarios such as running a hypervisor within a VM, which can in turn host additional VMs.
Fine-grained power management provides Ampere users with OS-based granular control and visibility into what’s happening with the processor from a power perspective. Advanced droop detection is a mechanism used to monitor and detect variations in power supply voltage levels, specifically droops or voltage sags, to ensure the stable and reliable operation of the processor. Voltage can “droop” when the power supply voltage to a CPU temporarily decreases due to high current demand, sudden load changes or power delivery limitations.
In the longer term, all processors age, which can affect their performance. To address this, process aging allows Ampere customers to monitor how a processor is aging under conditions of high utilization and low idle time. This ensures that reliability goals are met, and that end users don’t experience the effects of processor aging.
New security features
Ampere has introduced several security measures in its new generation of chips. Secure virtualization provides security across a multi-tenant environment and single-key memory encryption is supported when deploying machines in untrusted locations where physical access is risky.
Memory tagging is a long-requested customer feature that has no analog in the x86 space. This security and data-integrity feature protects against buffer-overflow attacks and increases data integrity for applications like large databases, where memory can be corrupted over time. Memory tagging ensures trust when accessing memory by associating tags or labels with memory regions or individual memory addresses. These tags track and enforce memory access permissions, detect unauthorized access and mitigate the impact of specific security vulnerabilities.
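The mechanism can be sketched in a few lines. In the spirit of Arm's Memory Tagging Extension, every small granule of memory carries a tag, a copy of the tag rides in the otherwise-unused upper bits of each pointer, and the hardware checks that the two match on every access. The Python model below is purely illustrative; the class, granule size and tag position are assumptions for the sketch, not Ampere's actual design.

```python
GRANULE = 16    # bytes of memory covered by one tag (MTE uses 16)
TAG_SHIFT = 56  # the tag rides in the top bits of a 64-bit pointer

class TaggedMemory:
    """Toy model: loads succeed only when pointer tag == memory tag."""
    def __init__(self, size: int):
        self.mem = bytearray(size)
        self.tags = [0] * (size // GRANULE)

    def tag_region(self, addr: int, length: int, tag: int) -> int:
        # Tag every granule covering [addr, addr + length), and hand back
        # a "tagged pointer" with the tag folded into its upper bits.
        for g in range(addr // GRANULE, -(-(addr + length) // GRANULE)):
            self.tags[g] = tag
        return (tag << TAG_SHIFT) | addr

    def load(self, tagged_ptr: int) -> int:
        tag = tagged_ptr >> TAG_SHIFT
        addr = tagged_ptr & ((1 << TAG_SHIFT) - 1)
        if self.tags[addr // GRANULE] != tag:
            raise MemoryError("tag mismatch: likely overflow or use-after-free")
        return self.mem[addr]

mem = TaggedMemory(256)
p = mem.tag_region(0, 32, tag=5)   # a 32-byte "allocation" tagged 5
mem.tag_region(32, 32, tag=9)      # the neighboring allocation, tagged 9
mem.load(p)       # OK: pointer tag 5 matches memory tag 5
# mem.load(p + 32)  would raise MemoryError: the overflow walks into tag 9
```

A buffer overflow that runs off the end of one allocation lands in a granule with a different tag, so the stray access is caught immediately rather than silently corrupting a neighbor's data.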
Performance metrics that people can relate to
Ampere considered two performance benchmarks to gauge the relative performance of AmpereOne. First is the number of virtual machines per rack.
The 192-core AmpereOne delivered 7,296 VMs per rack, 2.9 times more than the 96-core AMD EPYC 9654 “Genoa” at 2,496 VMs per rack. AmpereOne also delivered 4.3 times more VMs per rack than the 60-core Intel Xeon 8480+ “Sapphire Rapids” at 1,680 VMs per rack.
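The multipliers follow directly from the per-rack figures Ampere quoted; a quick check:

```python
# Per-rack VM counts as quoted in Ampere's comparison
ampereone_vms = 7296  # AmpereOne, 192 cores
epyc_vms = 2496       # AMD EPYC 9654 "Genoa", 96 cores
xeon_vms = 1680       # Intel Xeon 8480+ "Sapphire Rapids", 60 cores

print(round(ampereone_vms / epyc_vms, 1))  # → 2.9
print(round(ampereone_vms / xeon_vms, 1))  # → 4.3
```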
For the second performance benchmark, Ampere used two different types of AI inferencing workloads to put the AmpereOne through its paces. AmpereOne’s performance per rack for a generative AI Stable Diffusion workload was compared against the Genoa AMD EPYC 9654. AmpereOne delivered 2.3 times more frames or images per second compared to Genoa.
The second AI workload was AI recommenders, specifically the deep learning recommendation model (DLRM), a machine learning model designed to provide personalized recommendations for users. The DLRM is a deep neural network architecture specifically tailored for recommendation tasks, such as suggesting products, movies or content based on user preferences and historical behavior. The use case involves significant amounts of data and is very latency-sensitive. At the rack level, AmpereOne delivered almost twice the number of recommendations per second compared to Genoa. If you’re wondering why Intel’s Sapphire Rapids was left out of this comparison, it’s because the performance gap there was even wider.
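To see why this workload stresses memory rather than raw compute, here is a tiny, purely illustrative sketch of the DLRM idea: categorical features (a user ID and an item ID) are looked up in embedding tables, combined with a dot-product interaction, and squashed into a click probability. All tables, sizes and values below are made up for the example; production DLRM embedding tables run to many gigabytes, which is why memory bandwidth and latency dominate.

```python
import math
import random

random.seed(0)
DIM = 8  # toy embedding width; real models use far larger dimensions

# Toy embedding tables: one vector per user ID and per item ID
user_emb = {u: [random.gauss(0, 1) for _ in range(DIM)] for u in range(4)}
item_emb = {i: [random.gauss(0, 1) for _ in range(DIM)] for i in range(6)}

def score(user_id: int, item_id: int) -> float:
    # DLRM-style step: embedding lookups, a dot-product interaction,
    # then a sigmoid to turn the result into a click probability.
    interaction = sum(a * b for a, b in zip(user_emb[user_id], item_emb[item_id]))
    return 1 / (1 + math.exp(-interaction))

# Rank all items for user 0 by predicted score
ranking = sorted(range(6), key=lambda i: score(0, i), reverse=True)
```

Each recommendation request is dominated by those scattered table lookups, so serving many of them per second rewards exactly what AmpereOne emphasizes: lots of cores, each with private cache and ample memory bandwidth.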
Cloud providers and enterprises will likely always need to buy x86 processors from Intel or AMD for x86 applications that cannot be ported to Arm or RISC-V architectures.
After more than a decade, plenty of cloud-optimized code is now Arm-optimized as well, a trend AWS has cemented with its Graviton processors. That in turn means Arm is beginning to gain traction in the data center market, which x86-based processors from Intel and AMD currently dominate.
The Arm general-purpose datacenter merchant silicon charge is currently being led by Ampere which, with the addition of the AmpereOne family, addresses nearly all cloud-native computing needs, from the lowest-power and most constrained uses to the largest-scale deployments.
The simple message is more cores, IO, memory, performance and cloud features—and that’s a message that’s sure to resonate in many quarters.