Arm Goes For Performance

Recently announced high-end processors are pushing performance capabilities to the forefront.

At the recent Linley Processor Conference, Arm presented two processors. This was regarded as so confidential that the pre-conference version of the presentations didn’t contain the Arm one, even though that PDF was only put online about an hour beforehand. But Arm had already talked about most of the outline of what they presented back in May, a few months ago.

I said recently that this seemed to be the beginning of the age of Arm. See my post from a month ago, The Start of the Arm Era. In that post, I mentioned that Apple would announce their first Arm-based Macs (and, more importantly for readers of this blog, their first Arm-based Mac chips). They just announced M1, their first Arm-based Mac chip. It is built in 5nm, presumably by the obvious suspect, but Apple never talks about that. It has 16B transistors and a 16-core neural engine. The claim is that it delivers the performance of a “typical” laptop CPU at a fourth of the power.

But that wasn’t even the topic of this blog post. Arm has “gone for it,” too. Historically, Arm has had to deliver designs that were economical with area and power, and okay for performance. That was what mobile suppliers wanted. If they did a processor that was as fast as possible, area and power be damned (well, it has to be manufacturable and packageable), then who would license it? But Amazon/Annapurna has shown what is possible, and today Apple showed more of the same (with the obvious caveat that this is a marketing presentation that has not yet been confirmed with actual measurements from anyone independent like AnandTech).

Anyway, Apple is not the subject of this post. The details of the M1 will come out soon. As I said, at Linley, Arm dug into two processors.

Arm Cortex-X1

The first was the Arm Cortex-X1, billed as the first CPU from the Cortex-X Custom Program (CXC). The word “custom” makes it seem like this is something designed on an individual basis for customers, but as far as I understand from the presentation, this is a standard product, at least at the level of any other high-end Arm CPU. Of course, it is parameterizable in various ways such as cache size. I’m sure if you pay Arm enough, they’ll do other stuff for you. It was certainly like that 25 years ago when I was in charge of Arm licenses at VLSI Technology (one of the original investors in Arm when it was spun out). The big difference about the X1 is that it is focused on performance without worrying too much about anything else. Annapurna showed with Graviton 2 (see my post Xcelium Is 50% Faster on AWS’s New Arm Server Chip) that you could push the architecture to data center standards, and Apple showed something similar today. I have no idea if Arm was a little embarrassed that several licensees were getting more out of its architecture than Arm was itself, but clearly, the Cortex-X1 is Arm pushing the envelope itself. It will be next year before any licensee actually gets working silicon, so we’ll have to see how it stacks up against these designs by companies that have an architectural license.

Some features:

  • 33% increase in dispatch bandwidth (up to 8 instructions per cycle)
  • 40% increase in out-of-order window size (224 instruction window)
  • 2X floating-point and MAC bandwidth (so better machine learning performance)
  • Much larger L0-BTB capacity
  • Increased instruction fetch bandwidth (5 per cycle from the iCache and 8 per cycle from the Mop cache). A Mop is a “macro-operation,” the decoded form of an instruction, and sometimes operations are fused together into one. Plus there is a bigger Mop cache.
  • 2/3 larger L2 TLB capacity (to 2K entries)
  • Up to 1MB L2 and 8MB L3
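The percentages in this list are relative to the previous generation, so it is easy to back out roughly what the older numbers must have been. Here is a minimal sketch of that arithmetic. The baselines it prints are derived from the bullets, not figures from the presentation, and I am assuming “2K entries” means 2048.

```python
# Figures quoted in the feature list: (new value, fractional increase).
# Assumption: "2K entries" is taken to mean 2048.
quoted = {
    "dispatch width (instructions/cycle)": (8, 0.33),
    "out-of-order window (instructions)": (224, 0.40),
    "L2 TLB (entries)": (2048, 2 / 3),
}


def implied_baseline(new_value, increase):
    """Previous-generation value implied by new = old * (1 + increase)."""
    return new_value / (1 + increase)


# Derived values only; the quoted percentages are rounded, so these are approximate.
baselines = {
    name: implied_baseline(new_value, increase)
    for name, (new_value, increase) in quoted.items()
}

for name, base in baselines.items():
    print(f"{name}: roughly {base:.0f} in the previous generation")
```

The dispatch and window figures come out at round numbers (about 6 and 160), which suggests those percentages are close to exact. The TLB figure does not, so “2/3” is presumably a rounded description.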

Arm Cortex-A78

The other processor was the Arm Cortex-A78. With the X1 covering the very high end, the A78 is more focused on cutting stuff back a little to save even more power and area. In the same thermal envelope, it is 20% faster than its predecessor, the A77. It is compatible with the A55, so you can do big.LITTLE multicore designs with some A78 cores (big) and some A55 cores (little). Even so, there is a lot of architectural innovation:

  • Branch prediction expanded to two taken branches per cycle
  • Improvements to conditional branch prediction
  • Additional integer multiply (two per cycle)
  • Lots of improvements in the OoO microarchitecture such as a bigger ROB, a bigger out-of-order window, and so on
  • Bigger caches supported

Result: Compared to the A77, it is 30% lower power at the same performance, 7% better performance at the same power, and 36% lower power at peak performance. Unlike the X1, this is Arm’s traditional sweet spot: lots of performance, though not the absolute maximum, and economical with power in getting there.
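The first two comparison figures can both be turned into a performance-per-watt ratio, which makes it easier to see that they describe different operating points. A minimal sketch follows. The function name is mine, and the third figure (36% at peak) is left out because the relative peak performance of the two cores is not given.

```python
def efficiency_gain(perf_ratio, power_ratio):
    """Performance-per-watt of the new core relative to the old one."""
    return perf_ratio / power_ratio


# Same performance, 30% lower power: power ratio is 0.70.
iso_performance = efficiency_gain(perf_ratio=1.00, power_ratio=1 - 0.30)

# Same power, 7% better performance: performance ratio is 1.07.
iso_power = efficiency_gain(perf_ratio=1.07, power_ratio=1.00)

print(f"same performance, 30% less power: {iso_performance:.2f}x perf/W")
print(f"same power, 7% more performance:  {iso_power:.2f}x perf/W")
```

The two ratios differ a lot (roughly 1.43x versus 1.07x). That is what you would expect if power rises faster than performance as a core is pushed harder, so holding performance fixed and banking the savings looks much better than holding power fixed and taking the speed.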

Arm Cortex-A78C

They also announced the A78C, which is a derivative of the A78 for lower-end mobile applications. It was unclear what the difference was since it was all in headline bullet points like “improve security,” “unleashed multithreaded performance,” and “best-in-class efficiency.” Arm also talked about “pointer authentication” with this (but not with the other processors, though they may have it, too). Pointer authentication signs pointers such as return addresses so that a corrupted one can be detected, which makes attacks like return-oriented programming harder. Security has had a lot more attention since the Spectre and Meltdown attacks from a couple of years ago, which are not so much an attack on any individual processor as on all processors that use branch prediction and speculative execution. That means all high-end processors: Arm Cortex-A, but also all the competing high-end processors.

Summary

The Cortex-X1 is clearly designed to have the best single-thread performance, as opposed to Arm’s normal value proposition of lots of computation in a low power envelope. Annapurna, with Graviton, has shown what is possible, as has Apple with the M1 announcement. The Arm Cortex-X1 is Arm’s own entry into this high-end space, competing head-to-head with other datacenter-level processors.

Meanwhile, the A78 is their high-end solution for more power-constrained applications like mobile, Arm’s biggest market (I assume). Lower power than the X1, but higher performance than its predecessor.

I believe these two designs (both from Arm in Austin, TX) are variations on the same basic architecture but with different sizes of things like branch target buffers, caches, reorder buffers, and so on. This is also notable as the first year that Arm has produced two high-end (Cortex-A level) processors in a single year.


