From Multicore To Many-Core

Getting sufficient performance and power improvements will require serious engineering; business trumps technology.


By Ed Sperling
Future SoCs will move from multiple cores—typically two to four in a high-power processor—to dozens of cores. But answers are only beginning to emerge as to where and how those cores will be deployed and how they will be accessed.

Just as Moore’s Law forced a move to multicore architectures inside a single processor because of leakage at higher frequencies, it will begin to force a move to many-core architectures at 22nm and beyond. How to deal with those cores presents challenges, however. There are still big questions about what value many-core designs really bring to end customers when design costs rise into the stratosphere. There also are issues about how to eke the most performance out of all these cores, and what that will mean for the SoC or FPGA design process.

These are hardly academic questions. Most of the leading processor and SoC vendors are working diligently on 22/20nm and beginning to experiment with 15nm. In the past few weeks, Intel and ARM have both introduced multicore processors with a promise of many-core versions in the future. MIPS is expected to follow suit in the very near future with its own rollout of multicore to many-core. And Nvidia has begun hooking multicore chips to many-core graphics processors because it believes heat will limit the capabilities of a single chip at future nodes.

One of the questions that has plagued processor developers for years is how to use all the cores. Even before those cores were on a single chip, a massive amount of research was done on symmetric multiprocessing across multiple servers. The conclusion is that most software applications will never be parallelized sufficiently to run on many cores simultaneously.
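The limit described above is commonly expressed with Amdahl's law, which the article does not name but which captures the same point: the serial part of a program caps the speedup no matter how many cores are added. The sketch below is illustrative only; the function name and the 50% parallel fraction are assumptions chosen for the example, not figures from any vendor.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup for a fixed workload under Amdahl's law.

    parallel_fraction is the share of run time that can be spread
    across cores; the remainder stays serial.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical application that is only half parallelizable.
    for cores in (2, 4, 16, 64):
        print(f"{cores:>2} cores: {amdahl_speedup(0.5, cores):.2f}x")
```

With half the work stuck in serial code, the speedup stays below 2x however many cores are available, which is why simply adding cores does little for most existing software.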

The alternative has been to use virtualization to allow multiple functions or operating systems to run on a single device simultaneously by assigning them to any available cores, with acceleration technology added for those applications that can benefit from extra processing power. For example, a smartphone has a number of functions ranging from a camera to a GPS and MP3 player. Using virtualization, all of them can run at the same time on different cores, which are asymmetrically connected over the interconnection fabric within the phone. The advantage of this approach is that the virtualization layer can be used to load-balance the resources to prevent hot spots on the SoC, which at a low level can impact signal integrity and at a high level destroy the device.
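The article does not say how a virtualization layer decides where each function runs. One simple way to picture load balancing is a greedy rule that always hands the next-heaviest task to the least-loaded core. The sketch below assumes that rule; the task names and load numbers are hypothetical, and a real hypervisor would also weigh temperature, priorities and core affinity.

```python
import heapq


def assign_tasks(task_loads: dict[str, int], num_cores: int) -> dict[int, list[str]]:
    """Greedy placement: heaviest task first, onto the least-loaded core."""
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    # Min-heap of (accumulated load, core id), so the lightest core pops first.
    heap = [(0, core) for core in range(num_cores)]
    heapq.heapify(heap)
    placement: dict[int, list[str]] = {core: [] for core in range(num_cores)}
    # Sort by descending load, breaking ties by name for a stable result.
    for name, load in sorted(task_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        total, core = heapq.heappop(heap)
        placement[core].append(name)
        heapq.heappush(heap, (total + load, core))
    return placement


if __name__ == "__main__":
    # Hypothetical per-function loads, in percent of one core.
    phone = {"camera": 60, "gps": 30, "mp3": 20, "ui": 10}
    print(assign_tasks(phone, 2))
```

In this example the camera ends up alone on one core while the three lighter functions share the other, so both cores carry the same total load and neither becomes a hot spot.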

“There are two extremes in virtualization,” said Shay Benchorin, director of marketing and strategic alliances inside of Mentor Graphics’ Embedded Systems Division. “One is where you run multiple operating systems on a single core, which is what you get with a low-priced Android phone. It’s cheaper to add more cores than a chipset for all the features. With four cores you do symmetric multiprocessing on two sets of two cores and do asymmetric processing between them. Basically you live with unsupervised SMP. With 8 and 16 cores things get more interesting. You can use variable resources from fractional to multiple cores and get load balancing.”

ARM’s new Cortex-A15 is headed in the latter direction. Industry sources outside of ARM say the new processor will feature many cores in future iterations, and the company is publicly talking about clusters of up to eight cores. But SoCs can contain multiple processors, so the limitation also may be defined as much by the application and the device manufacturer’s power/heat budget as by ARM’s processor design.

“We worked with our review board on the A-15 to determine what is the right architecture and how to make it efficient,” said Nandan Nayampally, director of product marketing for ARM’s Processor Division. “There are some compromises in the chip because virtualization, coherency and high-end power implementations are connected to all points in the value chain.”

The real value in the short term is separating secure and non-secure functions, so a single mobile device may be used for home and work with a secure barrier between the two worlds to prevent data theft and leakage. While that doesn’t affect performance, it does make the device more attractive to consumers and creates a big market for four- or eight-core implementations.

From a business perspective, there is a real need inside of corporations for safeguarding corporate data on mobile devices. From a technology standpoint there are tradeoffs to any virtualization approach, however. Using homogeneous cores is like using a general-purpose operating system. It gets the job done, but not always with the highest efficiency or performance.

New approaches to rightsizing
Heterogeneous cores, in contrast, can be much more efficient, particularly when software is written for a particular function and matched with the hardware. They are also much more expensive to design as a standard, re-usable platform, which is why the idea hasn’t caught on particularly well. This is the high point of custom design, and so far it’s also the highest cost.

A different approach, and one being adopted by both Intel and MIPS, is to regulate the power in the cores—essentially rightsizing by adjusting the power. Mark Throndson, product marketing director for technologies at MIPS, said virtualization allows multiple cores to scale up and down on power.

“If you pick a moment where the software load requires all cores, the power will be multiples of a single core,” Throndson said. “The benefit of multicore is a wider degree of top performance vs. low power.”

The approach is to tie each core to a separate power island, then shut off those cores when they are not in use to limit both dynamic power and static leakage.
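A rough way to see why per-core power islands matter is to compare total chip power with and without gating the idle cores. The model below is a deliberately simplified sketch: the wattages are invented for illustration, and it assumes a gated core draws nothing, which real silicon only approximates.

```python
def chip_power(active_cores: int, total_cores: int,
               dynamic_w: float, leakage_w: float,
               power_gated: bool) -> float:
    """Total power in watts for a chip with some cores idle.

    Active cores burn dynamic power plus leakage. Idle cores still leak
    unless their power island is switched off (power_gated=True).
    """
    if not 0 <= active_cores <= total_cores:
        raise ValueError("active_cores must be between 0 and total_cores")
    active = active_cores * (dynamic_w + leakage_w)
    idle_cores = total_cores - active_cores
    idle = 0.0 if power_gated else idle_cores * leakage_w
    return active + idle


if __name__ == "__main__":
    # Hypothetical four-core chip: 1.0 W dynamic and 0.25 W leakage per core.
    for busy in (1, 2, 4):
        ungated = chip_power(busy, 4, 1.0, 0.25, power_gated=False)
        gated = chip_power(busy, 4, 1.0, 0.25, power_gated=True)
        print(f"{busy} busy: {ungated:.2f} W ungated, {gated:.2f} W gated")
```

Under these assumptions, power with every core busy is simply a multiple of the single-core figure, as Throndson describes, and gating pays off only when some cores sit idle.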

Intel, meanwhile, has established six power modes for its latest processors and integrated vPro security and power management with virtualization.

“Intel has had virtualization technology in the chip since 2006-2007,” said Brian Tucker, director of marketing for Intel’s business client group. “The first generation was to put it into the CPU. The second generation, which we are rolling out now, brings in page tables and memory allocations. There also are new hypervisors being developed by our partners, but those are not part of Sandy Bridge.”

And finally, Nvidia has taken a completely different approach. Instead of trying to build all the processing power into a single chip, it has set up arrays of many-core graphics processing units that work together on a single computational problem that can be parallelized, then connected those arrays back to the processor. For computationally intensive tasks such as modeling and simulation this approach has been extremely successful. But how this plays into the consumer electronics world, where power efficiency is directly proportional to battery life, remains to be seen.