Part 1: Design teams are rethinking the right number and kinds of cores, how big they need to be, and how they’re organized.
The optimal number of processor cores in chip designs is becoming less obvious, in part due to new design and architectural options that make it harder to draw clear comparisons, and in part because just throwing more cores at a problem does not guarantee better performance.
This is hardly a new problem, but it does have a sizable list of new permutations and variables—right-sized heterogeneous cores, better-written software, new memory types, and multiple packaging options. Moreover, while this may seem like just one more discussion about partitioning, it’s accompanied by a growing sense of urgency and concern. As transistor density continues to increase, it is becoming necessary to make some fundamental changes to designs to reduce leakage current and dynamic power density, as well as to reduce contention for memory and I/O.
One way to resolve these issues is to add more processing elements of all types, with much more attention to details about where they need to be placed and what they actually do. But which ones, what size, what clock frequency, where to put them, how to utilize them, and what impact they will have on an overall system design are increasingly difficult questions to answer.
The great debate
Parts of this discussion may sound familiar. Selecting the right number of processing elements has been a point of contention since Burroughs first introduced a dual-processor mainframe with asymmetric multiprocessing in 1961. That was followed by a variety of schemes to add more processing elements into computers, starting with symmetric multiprocessing in mainframes in the mid-1960s and leading up to computational and graphics accelerator chips in PCs in the 1990s. Then, in the mid-2000s, Intel introduced dual-core processors as a way of containing heat rather than continuing to push up clock speeds.
And there the benefits have largely stopped. Aside from a handful of embarrassingly parallel applications, such as image and video processing on PCs, and database- and search-related applications in data centers—not to mention compute-intensive applications such as hardware-accelerated verification in chip design—the number of cores in a CPU often has proved to be little more than an extension of the old MIPS/MHz/GHz marketing wars. More is considered better, even if they don’t actually affect performance. And in some cases, extra cores don’t do anything other than add to the cost of the design.
Chipmakers, software developers, research houses and various consortia have tried for decades to solve the parallelism problem. Enormous sums of money have been spent on developing new languages, parallel programming methodologies, and new curricula for computer science majors in universities. Had they been successful, the debate over the right number of cores would have evaporated. But most consumer applications still cannot take advantage of multiple computing cores.
This explains why large, complex SoCs have literally dozens of separate compute elements. That may seem counterintuitive, until you consider that some of these cores are accelerators for other cores, while others function as islands—they have little or no interaction with any other functions in a device. It often is simpler to separate out functions with their own highly-specific processors, particularly for IP subsystems, than to have those functions run on a central processor unit. And while that can lead to problems involving system-level coherence, it is a much easier way to design chips and far more power efficient.
“This all comes down to Amdahl’s Law,” said Arvind Raghuraman, staff engineer in the Embedded Systems Division at Mentor Graphics. “There is a practical limitation imposed on scaling of software. Various software programming paradigms impose restrictions, such as recomposing software into microkernels where you have a sequential problem. In most cases, this is a significant bottleneck.”
Even in the best of cases, adding more cores does not scale linearly in terms of power or performance. Running a process on two cores at lower clock speeds does not double the performance of running that process faster on one core. And even for applications that parallelize easily, there is still overhead in splitting the work apart and recombining the results.
“The reality, as everyone knows in SMP, is that adding four cores may only give you the real performance of two cores,” said Kurt Shuler, vice president of marketing at Arteris. “And a lot of stuff in the embedded world, particularly in phones, is single-threaded and event-driven, so it’s not really predictable.”
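Amdahl's Law makes that math concrete: if a fraction p of a workload can run in parallel, the best possible speedup on N cores is 1 / ((1 − p) + p/N). A minimal sketch, using an assumed (not measured) parallel fraction, shows how four cores can end up delivering roughly the performance of two:

```c
#include <stdio.h>

/* Amdahl's Law: speedup(N) = 1 / ((1 - p) + p / N),
 * where p is the fraction of the workload that can run in parallel. */
static double amdahl_speedup(double p, int n_cores) {
    return 1.0 / ((1.0 - p) + p / (double)n_cores);
}

int main(void) {
    double p = 0.67;  /* assumed: about two-thirds of the work parallelizes */
    for (int n = 1; n <= 8; n *= 2) {
        printf("%d cores -> %.2fx speedup\n", n, amdahl_speedup(p, n));
    }
    /* With p = 0.67, four cores deliver only about a 2x speedup --
     * consistent with "adding four cores may only give you the real
     * performance of two cores." */
    return 0;
}
```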
That has led to a number of different strategies for using processing elements more effectively. The biggest problem here is that not everything is a straight apples-to-apples comparison. For example, software co-developed with a heterogeneous multi-core chip might require fewer cycles per compute function than commercially developed software running on a comparable processor. Likewise, splitting processing among various elements may look more efficient on a spec sheet, but throughput to memory might be lower than with a different configuration, a similar configuration in a different chip, or even one built with a different manufacturing process.
“If you look at deep learning, the strategy has been to use lots of highly parallelizable cores,” said Chris Rowen, a Cadence fellow. “The current thinking is that you may want 1,000 cores, but it may be easier and faster to use 1 core and 1,000 multipliers. It’s like asking, ‘Which is smarter, 1 two-year-old or 1,000 cockroaches?’ Software is a key part of this. At the highest level of abstraction, you may be running Windows or Android or iOS, while at the lower levels you are using more real-time-oriented software that relies on an escalation principle. The lowest-level software can recognize its own limits and interrupt higher-level processes. So there are more processes, but not necessarily more processors.”
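To see why one control core plus a wide multiplier array can win, consider the kind of kernel deep learning spends most of its time in. The loop below is only an illustrative multiply-accumulate sketch, not any particular vendor's design: the control flow is a single trivial loop, so extra general-purpose cores add little, while extra multipliers map directly onto the inner loop.

```c
#include <stdio.h>

/* Dot-product / multiply-accumulate kernel typical of neural-network layers.
 * Almost all of the work is multiplies and adds, which is why one control
 * core feeding a wide multiplier array can beat many general-purpose cores
 * on this workload. */
static float mac_kernel(const float *weights, const float *activations, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += weights[i] * activations[i];  /* maps onto parallel multipliers */
    return acc;
}

int main(void) {
    float w[4] = {0.5f, -1.0f, 2.0f, 0.25f};
    float a[4] = {1.0f,  2.0f, 3.0f, 4.0f};
    printf("%f\n", mac_kernel(w, a, 4));  /* 0.5 - 2 + 6 + 1 = 5.5 */
    return 0;
}
```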
Pick a number
The general consensus among dozens of engineers and architects is that the main CPU in a smartphone tops out at four to eight cores, and more often than not applications use only one of them. The others are held in reserve, and generally powered off.
“Smart phones have 20 to 30 cores,” said Emerson Hsiao, vice president of North American Operations at Andes Technology. “A lot of them are doing simple tasks like test control. We’re also seeing dedicated cores that are optimized for power or performance, and others that work together on a specific task. So for a camera, you might identify objects, not just an image. Those usually are customized processors, not general-purpose.”
But even here the math doesn't always add up. In black-box IP subsystems, for example, no one other than the IP developer really knows how many processing elements are inside. Some of them don't connect with anything outside of that subsystem.
“It certainly makes it easier to be flexible with a subsystem if you can do it without much of an integration burden,” said Drew Wingard, CTO at Sonics. “From an encapsulation perspective, this is really important because you are not thinking about contention for a processing resource. You don’t have to deal with interrupts or worry if data processing is going on in the subsystem. That’s also good from an abstraction perspective because the compute resource is not available to the rest of the system.”
Other processing elements in this type of scheme communicate only sporadically, such as an accelerator that kicks into gear when a user calls up a particular process. Smartphones have been working with this approach for years, using a combination of DSPs, GPUs and MCUs optimized for specific operations such as listening to music, streaming videos, or playing games.
Those are, by definition, heterogeneous cores. But some are more heterogeneous than others, and not all of it depends on the core itself. Other important factors include where they are placed, how they are utilized, and in many cases who is using them. By matching cores narrowly to certain tasks and sizing them appropriately, particularly with software written for those tasks and cores, significant gains in efficiency and performance can be achieved.
“Twelve years ago we began developing our own cryptographic microprocessor using our own custom instruction set,” said Pat Rugg, vice president of sales and marketing at the Athena Group. “They do multiplication, addition and subtraction, but they also can do AES and random number generation. They’re also more than 10 times more efficient than using individual cores.”
This is the strategy behind heterogeneous multicore CPUs, as well. ARM’s big.LITTLE architecture is one example. Intel likewise said it plans to offer its own twist on that with FPGAs connected to multi-core processors using high-speed bridges.
“The challenge here is that it’s not always easy to determine which core should be used,” said Mentor’s Raghuraman. “It’s a compromise on the design side. It’s not always easy to decide when to shift from the big cluster down to the little cluster, so it could fail on latency requirements.”
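A toy policy illustrates the compromise. In real devices this decision lives in the operating system's scheduler; the thresholds below are invented purely for the sketch.

```c
#include <stdbool.h>
#include <stdio.h>

typedef enum { LITTLE_CLUSTER, BIG_CLUSTER } cluster_t;

/* Toy migration policy: stay on the little cluster to save power unless
 * recent utilization or a tight deadline argues otherwise. The 75% and
 * 2 ms thresholds are arbitrary; choosing them badly is exactly how a
 * design "fails on latency requirements." */
static cluster_t pick_cluster(double utilization, double deadline_ms, bool on_battery) {
    if (deadline_ms < 2.0)                  /* tight latency budget */
        return BIG_CLUSTER;
    if (utilization > 0.75 && !on_battery)  /* sustained heavy load */
        return BIG_CLUSTER;
    return LITTLE_CLUSTER;
}

int main(void) {
    printf("%d\n", pick_cluster(0.30, 16.6, true));  /* light UI work -> LITTLE */
    printf("%d\n", pick_cluster(0.90,  1.0, true));  /* tight deadline -> BIG   */
    return 0;
}
```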
In business settings, such as cloud-based servers, the pendulum frequently swings in the opposite direction. There are far more opportunities to apply multiple cores simultaneously to a parallelized process, which is where homogeneous multi-core is most useful and where heterogeneous multi-core architectures also come into play. Depending upon how they are used, and whether data needs to remain consistent, the cores can either be cache-coherent or function independently.
Kalray, a processor company based outside of Paris, is developing chips with up to 288 64-bit cores for networking and storage inside of data centers, as well as self-driving cars. “Our focus is embarrassingly parallel applications where you don’t need to share data from one processor to another,” said Jean-Pierre Demange, the company’s vice president of sales and marketing. “There are many processes that have nothing to do with each other. The real key here is that you need throughput.”
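That style of workload is the easiest to scale because nothing is shared. The sketch below, using a stand-in workload rather than Kalray's software, gives each thread its own slice of data, so there are no locks and no coherence traffic between cores.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define N_CORES 4
#define ITEMS_PER_CORE 1024

/* Each worker owns a private slice of the data -- no locks, no shared
 * state, no coherence traffic between workers. Throughput scales with
 * the number of cores because nothing needs to be shared. */
static uint32_t data[N_CORES][ITEMS_PER_CORE];
static uint32_t result[N_CORES];

static void *worker(void *arg) {
    long id = (long)arg;
    uint32_t sum = 0;
    for (int i = 0; i < ITEMS_PER_CORE; i++)
        sum += data[id][i];          /* stand-in for per-packet processing */
    result[id] = sum;
    return NULL;
}

int main(void) {
    pthread_t t[N_CORES];
    for (long i = 0; i < N_CORES; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < N_CORES; i++)
        pthread_join(t[i], NULL);
    for (int i = 0; i < N_CORES; i++)
        printf("core %d: %u\n", i, result[i]);
    return 0;
}
```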
Software
One of the key factors in deciding how many cores to use and how to use them is software, and this is one area where a discussion about cores really gets confusing. Commercially developed software may be able to run across multiple cores, but it usually isn’t as efficient as software developed in conjunction with a processor with fewer cores, regardless of whether those cores are heterogeneous or homogeneous.
“Software is what defines the upper limit for cores,” said Tom De Schutter, director of product marketing for virtual prototyping at Synopsys. “So with embedded vision, you can have massive parallelism. GPUs and FPGAs need a different level of parallelism. This is why we’re seeing the return of dedicated processors, which is a trend over the last few years. For a long time we saw the demise of IP vendors selling processors. Now we’re seeing a reverse trend with new applications reviving application-specific processors.”
De Schutter pointed to two trends in software for multiple cores. One involves more standardization of software for popular processors, such as what the Linaro consortium is doing with ARM cores. The second is much more specialization, including dedicated algorithms to minimize power consumption. “These are very specific software stacks, and more software developers are becoming highly specialized.”
Coming in part 2: Design considerations such as new architectures and memory strategies, and how to build and test these systems.
Related Stories
Heterogeneous Multi-Core Headaches
Experts At The Table: Multi-Core And Many-Core
Multicore Madness
I think this paragraph is key: “This explains why large, complex SoCs have literally dozens of separate compute elements. … It is a much easier way to design chips and far more power efficient.”
No matter how many processor cores share a processor, there is only one shared memory, and memory accesses may limit speed. (And the OS?)
Separating functions and giving each one local memory for both instructions and data means that things can actually work in parallel.
But if every processor requires its own memory and caches, it can quickly become too expensive. That is why small, scalable, programmable interconnected blocks should be used. (It works for parallel searches, so it should work for parallel functions.)
There must be an interconnect so that messages (request/response/data) and data can be passed between blocks.
Altera Qsys is one example that is available as a starting point. I have not used Qsys, but I was in discussions while it was being developed and believe in the concept.
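A minimal sketch of that pattern, with an invented mailbox rather than any particular interconnect's API: each processing element works out of its own local memory and only touches the shared channel to pass request/response/data messages.

```c
#include <stdint.h>

/* Hypothetical single-slot mailbox between two processing elements. Each PE
 * keeps its instructions and data in local memory; the only shared resource
 * is this small message channel, so the PEs can otherwise run fully in
 * parallel. (Real hardware would use a FIFO and proper memory barriers --
 * this only sketches the pattern.) */
typedef struct {
    uint32_t type;        /* request, response, or data */
    uint32_t payload[4];
} message_t;

typedef struct {
    message_t    slot;
    volatile int full;
} mailbox_t;

static void mbox_send(mailbox_t *m, const message_t *msg) {
    while (m->full) { }   /* wait until the receiver has drained the slot */
    m->slot = *msg;
    m->full = 1;
}

static void mbox_recv(mailbox_t *m, message_t *msg) {
    while (!m->full) { }  /* wait for a message to arrive */
    *msg = m->slot;
    m->full = 0;
}
```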
Karl, good point on the software and programmability. Stay tuned for part two.