Innovation At The Core

Three companies add powerful variations to ARM’s symmetric multi-processing chips. Here’s how they did it.


By Barry Pangrle
A number of next-generation ARM-based multi-core systems are starting to show up in the press. Nvidia has released information on its upcoming Tegra 3 (also known as “Kal-El”). At last week’s ARM TechCon in Santa Clara, ARM gave several presentations on its Cortex-A7 (Kingfisher) and Cortex-A15 (Eagle) architectures, and on how the two combine in its big.LITTLE strategy. Qualcomm is also starting to talk more about its new Snapdragon S4.

All three of these competing styles are targeting multi-core designs that offer the best possible energy efficiency, while at the same time providing the computing power needed to be competitive in the market for the next generation of smartphones and tablets. Each of the multi-core architectures has ARM-based cores, but that is about where the similarities end.

Nvidia is using what it calls a Variable Symmetric Multiprocessing (vSMP) architecture that consists of four fast Cortex-A9 cores implemented in a 40nm “G” process plus one slower Cortex-A9 core that is essentially implemented in a 40nm “LP” process, all on one die. All five cores can be individually turned on and off, but the four fast cores share the same voltage supply and clock, so the fast cores that are turned on at any given time all operate at the same voltage and frequency point.

The chart below in Figure 1 shows the power-performance tradeoff curves for the two implementations of the A9 cores. At some point the voltage on the “LP” core has to be pushed so high to match the performance of a “G” core that the “G” cores become more efficient. This cutover point is about 500MHz.
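
The cutover can be read as a simple selection rule. The sketch below is only an illustration of that idea, not Nvidia's actual scheduling policy: the function name and the hard threshold are assumptions, and the 500MHz figure is the approximate value quoted above.

```python
# Illustrative sketch only; not Nvidia's actual vSMP policy.
# CUTOVER_MHZ is the approximate crossover point quoted in the article.
CUTOVER_MHZ = 500


def select_core_type(required_mhz: float) -> str:
    """Pick the A9 implementation expected to be more efficient.

    Below the cutover the "LP" core uses less power for the same work;
    at or above it, the "G" cores are the more efficient choice.
    """
    return "LP" if required_mhz < CUTOVER_MHZ else "G"


if __name__ == "__main__":
    for demand in (200, 499, 500, 1300):
        print(demand, "MHz ->", select_core_type(demand))
```

A real governor would presumably also weigh the cost of migrating work between cores and add hysteresis around the threshold; neither is modeled here.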


Fig. 1


ARM’s big.LITTLE strategy takes a more heterogeneous approach that uses Cortex-A15 cores for applications that need higher performance and Cortex-A7 cores that are relatively lower performance but more energy efficient. The Cortex-A7 cores were designed foremost with energy efficiency in mind. The chart below in Figure 2 shows the power-performance tradeoff curves for the A7 and A15 cores. Note that unlike the Tegra 3, the curves don’t intersect. This is because of the heterogeneous nature of the two cores: the one designed for energy efficiency (A7) always falls below the curve of the higher-performance one (A15). At some point the A7 simply runs out of gas and can’t deliver the same level of performance as the A15.


Fig. 2

The differences in the architectures between the A7 and A15 again illustrate that architecture plays a significant role in any energy-efficient design and that performance comes at a significant cost to efficiency. Table 1 illustrates the performance-energy tradeoffs between the two architectures. Note that in most cases doubling the energy doesn’t even double the performance (i.e., the performance/energy-efficiency ratio is less than 1.0). Even for the IMDCT case, where 3x the energy leads to 3x the performance, one has to realize that spending 3x the energy in 1/3 the time means the power has jumped to 9x, or nearly one order of magnitude. Having the flexibility to shift between the two architectures creates the possibility to generate well-balanced systems that will have good performance and power (energy) characteristics.
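
The 9x figure follows from power being energy divided by time. A minimal sketch of that arithmetic is below; the function names are made up for illustration, and only the 3x/3x IMDCT numbers come from the text.

```python
def relative_power(energy_ratio: float, speedup: float) -> float:
    """Power ratio of the faster core versus the slower one.

    Power = energy / time, and running `speedup` times faster means the
    task takes 1/speedup of the time, so the ratio is energy * speedup.
    """
    return energy_ratio * speedup


def perf_per_energy(energy_ratio: float, speedup: float) -> float:
    """Performance gained per unit of extra energy (1.0 = break-even)."""
    return speedup / energy_ratio


if __name__ == "__main__":
    # IMDCT case from the article: 3x the energy buys 3x the performance.
    print(relative_power(3.0, 3.0))   # 9.0
    print(perf_per_energy(3.0, 3.0))  # 1.0
```

So even at break-even energy efficiency, the power delivery and thermal budget have to absorb a ninefold jump while the faster core is running.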


Qualcomm has taken yet another approach, which it calls Asynchronous Symmetric Multiprocessing (aSMP). It uses Qualcomm’s own “Krait” custom design (based on the ARMv7 instruction set architecture) for performance and efficiency. In terms of relative overall performance, Qualcomm claims that the Krait micro-architecture offers a 60% increase in performance in DMIPS/MHz over the previous generation “Scorpion” core, which it claimed to be about 60% higher performance than the base ARM11 architecture. The Krait core is expected to be able to operate in the 1.5 GHz – 2.5 GHz clock frequency range.
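
If the two 60% claims are taken at face value and assumed to compound on the same per-clock metric (an assumption; the Scorpion-versus-ARM11 comparison is not stated in DMIPS/MHz terms here), the implied Krait-versus-ARM11 ratio can be worked out as follows. The helper name is hypothetical.

```python
def compound_gain(*fractional_gains: float) -> float:
    """Multiply successive fractional gains, e.g. 0.60 for a 60% gain."""
    total = 1.0
    for gain in fractional_gains:
        total *= 1.0 + gain
    return total


if __name__ == "__main__":
    # Krait over Scorpion (60%), Scorpion over ARM11 (about 60%).
    ratio = compound_gain(0.60, 0.60)
    print(f"Implied Krait vs. ARM11 per-clock ratio: about {ratio:.2f}x")
```

This is back-of-the-envelope arithmetic on the quoted claims, not a measured result.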

There is simply one curve for the power and performance of the core, since all of the (up to four) cores are identical. Obviously, the curves for deciding when to raise the voltage and frequency or bring on another core to handle a new task would be much more interesting.

Snapdragon S4 has the widest range of simultaneous operating points between the cores of the three styles discussed here. Each S4 core has its own voltage and clock control so they can independently operate at the most efficient point based on their individual workloads, leading to an efficient overall design. Another advantage that the Snapdragon S4 has over other early entries into the next generation of parts is that it is built on a 28 nm LP process.

It should also be noted that TSMC’s 28LP process is SiON and not HKMG. This should help in terms of getting yields up quicker on 28nm, and in part explains Qualcomm’s ability to get these parts out sooner. This will also give the Snapdragon S4 an early process advantage over competing 40nm parts. Figure 3 illustrates the expected power savings to be had by being able to dynamically adjust voltage and frequency operating points on each individual core based on its workload, as opposed to having all cores run at the same voltage and frequency operating point.

Fig. 3
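
The effect of per-core voltage and frequency control can be sketched with the standard first-order model for dynamic power, P ≈ C·V²·f. Everything specific in the sketch below is an assumption for illustration: the operating-point table is invented (not taken from any Qualcomm or Nvidia data), capacitance is normalized to 1, leakage is ignored, and idle cycles are assumed to be clock-gated.

```python
# Hypothetical (frequency in MHz, voltage in V) operating points.
# These values are invented for illustration only.
OPERATING_POINTS = [(400, 0.80), (800, 0.90), (1200, 1.00), (1500, 1.10)]


def voltage_for(freq_mhz: float) -> float:
    """Lowest tabulated voltage whose frequency meets the demand."""
    for freq, volts in OPERATING_POINTS:
        if freq >= freq_mhz:
            return volts
    raise ValueError("demand exceeds the fastest operating point")


def dynamic_power(demands_mhz, per_core: bool) -> float:
    """Relative dynamic power for a set of per-core cycle demands.

    Uses P ~ C * V^2 * f with C = 1. Each core is charged only for the
    cycles it needs, so the two modes differ only in the voltage used:
    per_core=True lets each core use its own voltage, per_core=False
    forces every active core to the voltage of the busiest core.
    """
    active = [d for d in demands_mhz if d > 0]
    if not active:
        return 0.0
    shared_volts = voltage_for(max(active))
    total = 0.0
    for demand in active:
        volts = voltage_for(demand) if per_core else shared_volts
        total += volts * volts * demand
    return total


if __name__ == "__main__":
    demands = [1500, 400, 400, 0]  # one busy core, two light, one off
    shared = dynamic_power(demands, per_core=False)
    independent = dynamic_power(demands, per_core=True)
    print(f"shared rail: {shared:.0f}, per-core rails: {independent:.0f}")
    print(f"saving: {100 * (1 - independent / shared):.1f}%")
```

With these made-up numbers the saving comes entirely from the lightly loaded cores running at a lower voltage; when every core has the same demand, the two modes give identical results in this model.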

It’s interesting to see how companies are focusing on architecture to improve the energy efficiency of their designs. A wide range of support tools, on both the hardware and software sides, will be needed to make the most of the strategies touched upon in this article (vSMP, big.LITTLE and aSMP), and this should provide much more material for future articles. Thanks for reading.

Barry Pangrle is a solutions architect for low-power design and verification at Mentor Graphics.