Ampere Computing has made waves as the Arm chip company for the cloud datacenter. The company brought to market the first credible merchant Arm-based silicon for general-purpose compute. Azure, Google Cloud (GCP), Oracle Cloud Infrastructure (OCI) and others have adopted Ampere's CPUs to deliver Arm-based cloud instances. The one notable exception is AWS, which developed its own Arm chip, called Graviton.
In addition to being the merchant Arm general-purpose CPU of choice for the cloud, Ampere's CPUs have been designed into servers from HPE and Lenovo. While these servers target the hyperscaler and service-provider spaces, they are also built to power enterprise IT, with the manageability, security, and reliability needed to run cloud-native workloads.
Because Ampere has become so popular in the cloud, it is easy to overlook the architecture of its CPUs and the value they bring to high-performance computing and AI. What few people are discussing is that more AI workloads run on CPUs than on GPUs. In the following sections, I'll dig deeper into Ampere's Altra and AmpereOne CPUs and why I believe these chips can deliver a compelling value proposition for certain AI workloads.
Translating AI characteristics into CPU requirements
When discussing AI, one's mind usually goes to high-performance computing with the highest-performing CPUs and discrete GPUs. And when talking about AI training, this can be an accurate way to look at the equation: many large-scale training projects benefit from accelerators such as GPUs or ASICs.
However, in many AI training deployments, the acceleration capabilities built into the CPU are more than capable of delivering the required performance. In the x86 market, for example, Intel has designed several accelerators to boost training performance significantly; Advanced Matrix Extensions (AMX) and Advanced Vector Extensions 512 (AVX-512) work with other acceleration engines to remove the requirement for discrete accelerators in many cases.
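To make this concrete, here is a minimal sketch of a training step that stays entirely on the CPU using reduced precision. It assumes PyTorch, whose CPU backend (oneDNN) dispatches bfloat16 matrix math onto AMX or AVX-512 instructions when the processor supports them; the model, shapes, and hyperparameters are purely illustrative.

```python
# Hedged sketch: one CPU-only training step in bfloat16 autocast.
# PyTorch's oneDNN backend uses AMX/AVX-512 paths when the CPU has them.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(64, 1024)            # illustrative batch
labels = torch.randint(0, 10, (64,))

# Forward pass runs in reduced precision; the backward pass follows outside
# the autocast context, which is the recommended pattern.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = loss_fn(model(inputs), labels)
loss.backward()
optimizer.step()
print(f"loss: {loss.item():.4f}")
```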
Ampere has employed a similar strategy in the Arm ecosystem, designing its CPUs for AI workloads from the ground up. Considering the two elements that drive the need for acceleration, the amount of data and the complexity of the model, Ampere has designed a CPU that pushes the point at which a GPU becomes necessary further to the right, if you will, on the CPU-to-GPU continuum. For example, its native support for the FP16 data format can double the speed of some AI workloads, making it a strong candidate for many mainstream AI use cases.
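As a rough illustration of what reduced precision buys, the sketch below times an FP32 matrix multiply against an FP16 one on the CPU. Whether the FP16 path lands on native half-precision hardware (as on Ampere's cores), a software fallback, or no kernel at all depends on the PyTorch build and platform, so the half-precision run is wrapped defensively; the sizes and iteration counts are arbitrary.

```python
# Illustrative sketch: FP32 vs. FP16 matrix multiply timing on the CPU.
# FP16 CPU kernel availability varies by PyTorch build, hence the try/except.
import time
import torch

def bench(dtype, n=2048, iters=10):
    a = torch.randn(n, n).to(dtype)
    b = torch.randn(n, n).to(dtype)
    torch.matmul(a, b)                       # warm-up
    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    return (time.perf_counter() - start) / iters

print(f"fp32: {bench(torch.float32) * 1e3:.1f} ms per matmul")
try:
    print(f"fp16: {bench(torch.float16) * 1e3:.1f} ms per matmul")
except RuntimeError as err:
    print(f"FP16 matmul not supported by this PyTorch build: {err}")
```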
As CPU architectures evolve to account for the requirements of AI workloads, silicon designers such as Ampere, AMD, and Intel will continue to build acceleration capabilities into the chip, further pushing the line at which a discrete accelerator is required. This has always been the natural evolution; we have seen functions like FPUs, memory controllers, and I/O that were once external devices get sucked into the SoC.
Beyond training, AI inference has a different performance profile and a different set of requirements. Whereas training will typically consume whatever compute is available, inferencing applies a predefined set of operations to data, so the compute requirements are far lower.
From a data perspective, inferencing deals with small amounts of data in real time or near real time, whereas training churns continuously through very large datasets.
For inferencing applications such as image recognition, natural language processing, and recommendation systems, predictability, scalability, and parallelization become essential when considering the optimal deployment model.
Ampere’s architecture is about performance, efficiency, economics
There has been some confusion about Arm's relevance in AI. I believe this is born of the market simply not understanding the space amid the continuous drumbeat of Nvidia's well-deserved generative AI hype. The reality is this: organizations embarking on AI projects need to dig deeper into their actual needs. Chances are there is an overallocation of resources for both training and inference. Taking this a step further, chances are that deploying inference on Ampere could deliver significant price-performance advantages.
What makes Ampere's architecture compelling for AI? Let's address both training and inference. On the training side, CPUs are primarily used for data preparation, feature extraction, and managing data flow to and from GPUs. This translates into a CPU with a higher core count, higher clock speeds, larger memory capacity, and lots of fast I/O. This is especially true for frameworks that can take advantage of parallelism, such as PyTorch and TensorFlow.
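Here is a minimal sketch of that CPU-side role: a parallel data-loading pipeline that spreads preprocessing across many cores before handing batches to whatever does the math. The dataset, shapes, and worker count are placeholders rather than anything Ampere-specific.

```python
# Minimal sketch: the CPU's data-pipeline role in training, using a
# multi-worker PyTorch DataLoader. Dataset and sizes are illustrative.
import os
import torch
from torch.utils.data import DataLoader, Dataset

class SyntheticImages(Dataset):
    """Stand-in dataset; a real pipeline would decode and augment images here."""
    def __len__(self):
        return 4096
    def __getitem__(self, idx):
        return torch.randn(3, 224, 224), idx % 1000

if __name__ == "__main__":
    loader = DataLoader(
        SyntheticImages(),
        batch_size=256,
        num_workers=min(16, os.cpu_count() or 1),  # scale toward the core count
        pin_memory=torch.cuda.is_available(),      # faster host-to-GPU copies
        prefetch_factor=4,
    )
    for images, labels in loader:
        pass  # hand each batch to the accelerator, or train on the CPU itself
    print("data pipeline drained")
```

The point of the sketch is simply that high core counts pay off in the feeding of the model, whether or not a GPU sits downstream.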
Ampere's CPUs scale up to 192 single-threaded cores, enabling predictable and scalable performance that is optimal for parallelized workloads, and the top-end TDP of 350W is impressive. Of course, not every training workload is the same, and every organization will find the performance-power-price balance that works best for it.
In the case of inference, the story is similar and even more compelling. Because inference happens everywhere, organizations require a range of platforms. However, all of these platforms have to deliver the same predictability, scalability, and parallelization.
Here’s what I believe makes Ampere’s cloud-native CPUs appealing for AI.
- Single-threaded design. Because each Ampere core runs a single thread, performance is far more predictable: there is no sibling hardware thread competing for the pipeline, caches, or memory bandwidth. While this may sound somewhat trivial to a business user, each of these factors has a measurable impact on a workload such as inference, which has real-time or near-real-time performance requirements.
- High core counts. Up to 192 single-threaded cores means tremendous scalability and, in theory, best-in-class parallelism. For inference, handling that many simultaneous requests means faster results, and to a business, faster results mean happier customers. (A minimal sketch of this scale-out pattern follows this list.)
- Cost-effectiveness. Defining the best inference platform is different for each organization, but the measurements tend to be performance, cost, and power (which itself contributes to cost). Performance per dollar, or performance per instance cost (since inference is a popular cloud workload), is the relevant measure. And this is another area where Ampere has demonstrated leadership.
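To illustrate the scale-out pattern referenced above, here is a hedged sketch of one common approach: several inference worker processes, each capped at a small, fixed thread budget so per-request latency stays predictable as requests arrive in parallel. The model, sizes, and worker counts are placeholders, not a prescribed Ampere configuration.

```python
# Hedged sketch: partitioning many independent cores into inference workers,
# each with a fixed intra-op thread count for predictable latency.
from concurrent.futures import ProcessPoolExecutor

import torch
import torch.nn as nn

_model = None  # populated once per worker process


def init_worker(threads_per_worker: int = 4):
    """Cap threads and load the model once inside each worker."""
    global _model
    torch.set_num_threads(threads_per_worker)
    _model = nn.Linear(512, 128)  # placeholder for loading a real model


def handle_request(seed: int) -> int:
    """Run one inference request and return the predicted class index."""
    x = torch.randn(1, 512, generator=torch.Generator().manual_seed(seed))
    with torch.inference_mode():
        return _model(x).argmax(dim=1).item()


if __name__ == "__main__":
    # Eight workers x four threads each targets a 32-core slice of the chip.
    with ProcessPoolExecutor(max_workers=8, initializer=init_worker) as pool:
        results = list(pool.map(handle_request, range(64)))
    print(f"served {len(results)} requests")
```

The design idea is that each worker owns a private, fixed slice of the chip, which is exactly the kind of partitioning that many small, single-threaded cores make straightforward.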
Proving out the story
Every vendor markets numbers that substantiate a value proposition or claim, and Ampere is no different. The company makes bold claims about its performance relative to other CPUs and from cloud to cloud. Here's what stood out to me.
- Ampere claims a 3.6x advantage in inference relative to AMD's 3rd Generation EPYC CPU, codenamed "Milan."
- Ampere claims a 6x advantage in performance per dollar relative to Intel's 3rd Generation Xeon Scalable CPU, codenamed "Ice Lake."
- Ampere claims a 6.4x advantage in CPU inference in the cloud relative to AWS (OCI A1 vs. Graviton 2).
- Ampere claims an 11.8x cost advantage for cloud AI relative to AWS (OCI A1 vs. Graviton 2).
A few notes on these numbers.
- First, the comparisons are against N-1 CPUs from both Ampere and its competition; these were the current cloud-based instances at the time of benchmarking.
- These results are based on the industry-standard ResNet-50 benchmark, an inferencing test that measures the speed of image classification (a rough sketch of this kind of measurement follows these notes).
- From what I can see, the testing team attempted to be as impartial as possible in its configurations.
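For context, the sketch below shows the general shape of such a measurement: batched ResNet-50 image-classification throughput on the CPU using PyTorch and torchvision. The batch size, warm-up, and iteration counts are arbitrary choices for illustration, not Ampere's published benchmark methodology.

```python
# Rough sketch of a ResNet-50 CPU inference throughput measurement.
# Parameters are illustrative, not a published benchmark recipe.
import time

import torch
from torchvision.models import resnet50

model = resnet50(weights=None).eval()   # random weights are fine for timing
batch = torch.randn(16, 3, 224, 224)    # a batch of synthetic 224x224 images

with torch.inference_mode():
    for _ in range(3):                  # warm-up passes
        model(batch)
    iters = 20
    start = time.perf_counter()
    for _ in range(iters):
        model(batch)
    elapsed = time.perf_counter() - start

print(f"throughput: {iters * batch.shape[0] / elapsed:.1f} images/sec")
```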
As I mentioned, benchmarking can be tricky, and vendors often resort to tricks to create synthetic results: a compiler optimized for one CPU but not another, or servers using different memory configurations, to name just two. Ampere hasn't done this in its testing, which is refreshing.
Closing
I'll finish this article the way I started: it's easy to overlook the capabilities of Arm, and specifically Ampere, for AI, and for two reasons. The first is Ampere's success in the cloud market. The second, frankly, is a lack of awareness of the performance characteristics of AI workloads.
Not every AI workload requires the most power-hungry CPU coupled with the most power-hungry GPU. In fact, most AI workloads don't require these expensive and over-provisioned platforms. As CPU architectures continue to evolve and account for the distinct needs of AI training and inference, I expect to see more acceleration built into the CPU. This has been the case for 30 years, and I don't see it changing. That doesn't mean there won't be a need for higher-performance datacenter GPUs in the future; there will be. But those will be working on the next-generation workloads that, over time, will themselves be sucked into the SoC.
Ampere makes a strong case for its AI play in the cloud. The company has demonstrated a significant cost advantage on the training side of the equation. On the inference side, the company shows its leadership in raw performance and price-performance.
I look forward to seeing updated AI benchmark numbers that show AmpereOne versus AMD’s 4th Gen EPYC and Intel’s 4th Gen Xeon.
NOTE: This blog contains many contributions from Moor Insights & Strategy VP and Principal Analyst Matt Kimball.