Have Processor Counts Stalled?

Have chips reached a plateau for the number of processor cores they can effectively make use of? Possibly yes, until you change the programming model.


Survey data suggests that additional microprocessor cores are not being added to SoCs, but you have to dig into the numbers to find out what is really going on.

The reasons are complicated. They include everything from software programming models to market shifts and new use cases. So while the survey numbers appear to be flat, market and technology dynamics could have a big impact in reshaping these trends in the future.

The current comparisons are highlighted in a just-released functional verification survey conducted by Wilson Research and Mentor, a Siemens Business. The survey, based on 1,492 respondents — about half of whom were involved in ASICs — focuses on such aspects as chip size and processor count. The survey data shows a very slight increase in the “8 or more” category, but one almost within the survey’s +/- 3% margin of error. That puts it at roughly the same level as in 2016. SoCs with two processors have decreased slightly compared with the previous two surveys.

“The number of processors has stalled in platforms like PCs or portable devices,” says Michael Frank, fellow and system architect at Arteris IP. “This has less to do with Moore’s Law leveling out than with the fact that it’s getting very hard, from a software point of view, to find enough work for more processors. Unless you have highly parallel workloads, it is really, really hard to keep all these cores busy.”

Workloads are shifting, too. “For mobile applications processors, which is the number one volume product, the general architecture seems to have settled out at a cluster of eight cores — at least for the time being,” says Larry Przywara, senior group director of marketing for Tensilica IP at Cadence. “What we are seeing is that the number of cores is going up for doing vision and AI types of workloads.”

But outside of the smart phone market, there is plenty of room for change. “We are seeing increased compute requirements across a range of markets, from automotive to infrastructure,” says Peter Greenhalgh, vice president of technology and fellow at Arm. “This means we are nowhere near the limit of useful processors. New applications have differing needs, such as power, safety and efficiency. From a low-power IoT perspective, we can expect increasing adoption of multi-core solutions. And we can expect this trend to extend to more applications and end markets.”

Markus Willems, product marketing manager for ASIP tools at Synopsys, agrees. “What we are seeing is quite the opposite from what appears in the survey, namely that the number of programmable components on the chip is increasing. This is due to the very application-specific, or domain-specific kinds of accelerators that are becoming more software programmable. A multi-core cluster of 32-bit superscalar cores may have become limited because memory is an issue.”

There would appear to be a large change about to happen in the smaller embedded market. “If we look at the new chips coming out in the embedded market, they are typically dual-core, quad-core types of configurations,” says Jeffrey Hancock, senior product manager in Mentor’s Embedded Software Division. “There are still devices that are single-core, but multi-core is finally about to arrive in this area, as well. From the silicon providers’ standpoint, they are more than happy to add cores if people can use them, so long as they can stay within their cost structure.”

But these figures hide the bigger story. “While the number of embedded processors in an ASIC may have been flat, the number of domain-specific or specialized processors is increasing,” says Ravi Subramanian, vice president and general manager of Mentor’s IC Verification Solutions Division. “We are seeing that SoCs in edge devices are changing in nature in that the data, and the amount of compute, is growing, and a significant amount of that compute is moving to application- or domain-specific processors.”

New workloads have significantly different processing demands. “The traditional ASIC computer architectures with multiple embedded processors cannot deliver the GOPS/mW/MHz required for many new applications,” says Subramanian. “It should be noted that algorithmic complexity is far outstripping Moore’s Law and, as a result, is driving new architectures. For these new architectures, the programming model and the hardware architecture are co-developed so the two can be jointly optimized.”

This also is driving large data center architectures. “We see supercomputers where the number of cores goes up because their workloads, such as finite elements and big iterative systems, are easily parallelizable,” says Arteris’ Frank. “Everything that has big linear algebra workloads is scaling nicely.”

Limitations of CPUs
While general-purpose CPUs are useful, their generality is also a limiter. “One of the problems is that CPUs are not really good at anything,” says Frank. “CPUs are good at processing a single thread that has a lot of decisions in it. That is why you have branch predictors, and they have been the subject of research for many years. But accelerators, especially custom accelerators, serve two areas. One is where you have a lot of data moving around, where the CPU is not good at processing it. Here we see vector extensions going wider. There are also a lot of operations that are very specific. If you look at neural networks, where you have non-linear thresholding and huge matrix multiplies, doing this with a CPU is inefficient. So people try to move the workload closer to memory, or into specialized function units.”
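To make that concrete, consider the matrix-multiply case Frank mentions. The sketch below is purely illustrative (it is not from Arteris or the survey) and assumes C compiled with OpenMP SIMD support, e.g. gcc -O3 -fopenmp-simd. The point is that the inner j loop has no cross-iteration dependencies, which is exactly the property that lets ever-wider vector extensions process more elements per instruction.

#include <stddef.h>

/* Naive matrix multiply, C = A * B, all n x n, row-major. */
void matmul(size_t n, const float *A, const float *B, float *C) {
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++)
            C[i * n + j] = 0.0f;
        for (size_t k = 0; k < n; k++) {
            float a = A[i * n + k];
            /* Independent iterations: each j can occupy one SIMD lane,
               so a wider vector unit finishes the row in fewer
               instructions. */
            #pragma omp simd
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
    }
}

Even fully vectorized, every element of A and B still streams through the CPU’s registers and caches, which is why the quote points toward near-memory processing and specialized function units for huge matrix workloads.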

Closely tied to the processor architecture is the memory. “For some applications, memory bandwidth is limiting growth,” says Mentor’s Subramanian. “One of the key reasons for the growth of specialized processors, as well as in-memory (or near-memory) compute architectures, is to directly address the limitations of traditional von Neumann architectures. This is especially the case when so much energy is spent moving data between processors and memory, versus the energy spent on actual compute.”

Another thing hidden in the numbers is core complexity. “What we’re seeing is that while the clusters remain intact, the individual cores become more powerful,” says Synopsys’ Willems. “We are seeing the trend of moving to 64-bit for certain domains. In that sense, it is the same number of cores, but with more capabilities.”

Limited by software
The PC switched from being a single processor to multi-core more than 20 years ago, and yet the number of applications that can successfully utilize multiple cores remains limited. “If you look at how much parallelism you can extract out of a regular program, you can find a level of parallelism of about two instructions per clock cycle,” says Frank. “If you change your programming model to become an explicit data flow model, where you have data dependencies that you manage, you can suddenly get 13X to 20X performance improvement.”
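As a hedged illustration of what an explicit data flow model looks like in practice, the C sketch below uses OpenMP task dependencies, one of several ways to express managed data dependencies (the 13X to 20X figure is Frank’s observation, not a property of this toy example). It assumes a compiler with OpenMP 4.0 task support, e.g. gcc -fopenmp.

#include <stdio.h>

int main(void) {
    float x = 0.0f, y = 0.0f;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: x)
        x = 1.0f;                  /* producer of x */

        #pragma omp task depend(out: y)
        y = 2.0f;                  /* producer of y, independent of x */

        #pragma omp task depend(in: x) depend(in: y)
        printf("%f\n", x + y);     /* consumer: scheduled only after
                                      both producers have completed */
    }
    return 0;
}

Because the programmer declares what each task reads and writes, the runtime discovers the parallelism instead of the hardware’s instruction window, and the two independent producers are free to run on separate cores.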

Can humans do better? “Try to find a programmer that can write a parallel program that can efficiently use two threads,” adds Frank. “You’ll find those. Try to find someone who can, when you do not have a highly parallel workload, use more than four threads. Very hard. And the more threads you want to add, the more complicated and harder the problem is.”

Data-oriented programming makes all the difference. “Parallel computing concepts have been part of computer science for decades, but have been relegated to highly specialized tasks, and parallel processing is difficult,” says Subramanian. “This has significantly changed over the last five to seven years to the point where parallel computing at scale is possible. This is especially important today in many new ‘data-based’ disciplines, where the programming model needs to efficiently enable software algorithms to run on very large parallel computing architectures. Now, they are being extended to heterogeneous parallel-computing architectures, and we are seeing new programming frameworks emerge for the new workloads — TensorFlow, Caffe, mlpack, CUDA, Spark, and MOA are just a few — to significantly reduce the barriers in programming these systems.”

Progress is happening outside of the AI space, too. “There are programming models, or programming languages that enforce programming models, like Cilk, a task-based model,” says Frank. “There are runtime libraries that support things like OpenMP task models. When adopting these, you will suddenly see that you can fill 10 processors. So as soon as we get to that level, and these kinds of programming models — task-based, data-dependency-managed programming models — you will find that there will be a next step in the increase of processors.”
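A minimal sketch of the task-based style Frank describes, expressing Cilk-style fork/join with OpenMP tasks (illustrative only; the 10,000-element serial cutoff is an arbitrary tuning choice):

#include <stdio.h>

/* Recursively split the range, fork the left half as a task,
   keep the right half on the current thread, then join. */
static long range_sum(const int *a, long lo, long hi) {
    if (hi - lo < 10000) {             /* serial cutoff */
        long s = 0;
        for (long i = lo; i < hi; i++)
            s += a[i];
        return s;
    }
    long mid = lo + (hi - lo) / 2;
    long left, right;
    #pragma omp task shared(left)      /* fork */
    left = range_sum(a, lo, mid);
    right = range_sum(a, mid, hi);     /* parent keeps working */
    #pragma omp taskwait               /* join */
    return left + right;
}

int main(void) {
    enum { N = 1000000 };
    static int a[N];
    for (long i = 0; i < N; i++)
        a[i] = 1;

    long total = 0;
    #pragma omp parallel
    #pragma omp single
    total = range_sum(a, 0, N);

    printf("%ld\n", total);            /* prints 1000000 */
    return 0;
}

The work decomposes into independent subranges, so any idle core can steal either half. That independence, rather than hand-assigned threads, is what lets task-based programs keep 10 or more processors busy.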

There are other factors pushing people to look at multi-core programming. “Some people are looking at multi-core from either a functional safety or security realm,” says Mentor’s Hancock. “Designers need to isolate, or separate, multiple cores into individual subsystems in order to protect some things. However, people are reluctant to change. You need trailblazers to lead the charge, and there probably has to be some big event that happens. Until that happens, they tend to build off of what they already know and are comfortable with. Automotive needed somebody like a Tesla to shake everything up. Traditional automotive companies initially just saw a startup in the valley and dismissed them. All of a sudden, they take notice and realize they need to make changes.”

New drivers
While smart phones remain the volume and dollar leader, they are no longer the only market segment that is causing technology to advance. “Computing in automotive applications is going through an incredible transformation, driven by the demand for more immersive, rich in-vehicle experiences, increasing levels of automation for safety and convenience, and the move toward software-defined functionality,” says Arm’s Greenhalgh. “This has led to a rapid consolidation of individual electronic control units (ECUs) into fewer multi-function ECUs, and this requires not only more CPU compute power in the form of multi-core CPUs, but also deployment of heterogenous compute elements. We are now seeing SoCs with increasing multi-core CPUs also augmented by GPUs, ISPs and ML accelerators, allowing software to be deployed to the most efficient compute element for each workload.”

The amount of compute power going into automotive chips is staggering. “The trend is definitely moving toward domain-specific architectures, on-chip and off-chip accelerators, and embedded programmable logic,” says Sergio Marchese, technical marketing manager for OneSpin Solutions. “For many applications, throwing more standard processors at a problem is not the most effective solution. For example, if you look at the Tesla Full Self Driving chip, it has 12 Cortex-A72 processors that take a good portion of the chip area. However, the two neural network accelerators take almost twice as much area (see figure 1).”

Fig. 1: Die image of a Tesla chip. Source: OneSpin Solutions

Cadence’s Przywara agrees. “What we’re seeing in automotive, to support the more sophisticated Level 2 and Level 3 autonomous vehicles, is multi-core clusters of DSPs to support radar, and engines to support the increasingly sophisticated AI algorithms, as depicted in figure 2. We are seeing examples of maybe upwards of 8 AI engines being used to run the AI algorithms because of the number of TOPS required.”

Fig. 2: Blocks within a complex automotive chip. Source: Cadence

Artificial intelligence is a driver on its own. “With artificial intelligence, it’s about how you handle the memory accesses and the address calculations,” says Willems. “We see a lot of activity and a lot of processors being developed that are not off-the-shelf types of processors. Some systems will have multiple AI processors, simply because they have an always-on function where they need to process something, and then they wake up another AI processor to do something more massive in terms of computation. So you may have multiple processors assigned to tasks.”

AI is affecting smaller edge devices, as well. “MCUs are taking on the ability to do machine learning, and their capability is being further enhanced with the addition of DSPs,” says Przywara. “TinyML is helping to drive that. This is something they haven’t traditionally been using, but as more and more applications for AI and machine learning are investigated and developed, it is going to be a driver for MCUs with DSP accelerators, or all products, to become more capable. This will drive the trend toward higher numbers of processing cores in these devices.”

5G also is having a considerable impact. “A radio modem used to have one big massive DSP,” says Willems. “It was powerful enough to run the necessary tasks in sequence or interleaved. Moving to 5G, it is clear that a single DSP cannot do the job anymore. Instead, they strip off certain things, turning them into multiple programmable accelerators — a DSP specialized for certain functionality, such as doing matrix inversion or equalization. But it’s not all the same DSP or multiples of it. It’s multiple cores, each being specialized by the instruction set and the memory interface.”

Base stations are seeing even more processor growth. “For 5G base station applications, we’re seeing huge numbers of cores, approaching 100, to do the workload,” says Przywara. “5G, certainly on the infrastructure side, is going to be pushing things up.”

Future growth
So what should we expect in the future? “There will be more special processors,” says Frank. “In the standard SoC area, I expect the number of processor cores will increase, possibly up to 256 cores, but not for PCs, notebooks or phones. Where we will see it is in embedded systems targeted for cars, where increasing demand for high performance and coherency has become more important than on the machine learning side.”

To make use of it, programming models have to change. “The number of ASICs with new computer architectures driven by workload power and performance efficiency is rapidly growing,” says Subramanian. “The demand for solutions to address a large variety of workloads (e.g. speech recognition, X-ray analysis, object detection, facial recognition, evolution of biological cells) is already forcing the development of new areas in computational analysis, where new mathematical models to understand specific problems are developed. And it’s rapidly bringing those findings to the world of computer science and, specifically, the analysis of the performance of new computer architectures. This is leading to a world of computer architectures that are evolving to heterogeneous, multi-processing architectures, and a move from von Neumann to neuromorphic computing architectures. We have just started a renaissance in computer architecture today.”

Perhaps it is time for the survey question to be rewritten to better reflect the evolution of processing power in chips. Perhaps it should identify the number of independent instruction streams, or threads of control, but all such metrics would continue to be biased towards today’s control-driven paradigms.



2 comments

Tanj Bennett says:

All increases in processor count – and that includes adding specialized processors on an SoC – are ultimately about packaging. And Moore’s Law is actually an economic law about packaging. It was enabled technically by Dennard scaling, which dovetailed with it until about 15 years ago, but the reason Moore’s Law remains alive beyond the end of that runway is that the economic engine created in the first 35 years has become pervasive and no longer requires any one method to get the goals it demands.

When we look at parallelism we can distinguish two separate trends. One is algorithmic parallelism. This is in its purest form in vector or tensor processing units. The other is packaging parallelism, and the use of multiple CPUs in servers or databases is generally of this kind. Packaging parallelism uses pools of computation interconnected to pools of resources, but in general it works best when the various threads actually have nothing in common. The ground in between, the classic “try to make things go twice as fast with 2 cores on one algorithm,” is a fascinating academic exercise, but if you actually pay attention to what is really being built, *it is not all that common*. Even something like a GPU with hundreds of processors doing what looks like one job is actually succeeding because that job is cut into polygons that can be processed independently. And that phone SoC is generally using 2 or 3 cores because one supervises the network, one supports the screen and other parts of the UX (heavily offloaded to GPU and signal processors), while one runs the application logic.

Now, the cases where core counts are climbing are those where packaging dominates. A modern server CPU in a data center has in one socket as many cores as a rack had 15 years ago when the cloud got its start. Remarkably, that one socket may also have around the same number of customers as the rack. And those customers have nothing to do with each other. So the problem today in getting to higher core counts on servers is nothing like those old algorithm puzzles. It is much more about keeping resources – memory, bandwidth, caches, cores – cleanly isolated. It is actually, done right, pretty close to linear scaling with core count, and no end in sight.

Packaging will continue to drive this because of the economic savings that still come from the ecosystem which Moore’s Law built. It is cheaper to build one socket than a rack. It is also cheaper to build one socket than two sockets. You also get far better bandwidth in one socket than when you need to split across two. Overall power per core goes down even as the socket total power rises.

And the great thing is, the apps running in this do not need to be rewritten. The whole thing looks to them like their original 1, 2, or 4 core server which does not care if that is done with 4 sockets or with 0.0625 of a socket.

This may turn out to be fundamentally different than how scaling continues for vehicles, phones, or even laptops. They seem to be advancing more by investments into specialized accelerators. They have only one machine to run or one user to entertain, so perhaps there will be an upper limit where there is no incentive to add cores, or accelerators.

But in either case the classic multi-programming problem is not central to what is driving scaling today.

harrie geenen says:

Other strategies are also possible for faster computing. We could radically change the architecture of today’s microprocessor by using the architecture of the human brain.
Current microprocessors work at more than 3 GHz; people have to make do with 30 Hz, so without the G.
You will understand that with an old-fashioned microprocessor architecture you can do nothing at 30 steps per second. Talking and putting your thoughts in order? Forget it.
So with its 30 Hz (no transistors, but ion transport in electrolytes), the brain must have a super architecture in order to still be able to achieve something. For those who are not yet familiar with this: everything that plays out in consciousness always travels on a kind of main highway through the brain, a bus with an estimated width of 20 kbit. There are 100,000 memory blocks permanently watching along that highway. On a match, a reply follows on an answer bus. So when you talk, that kind of environment is already there in the pre-processors of consciousness.
Back to the real microprocessor: no parallel input but a serial one, a 20-kbit bus, and a large number of comparators.
Example: a DDoS attack, an attack with one or a few forms of an internet packet in large numbers. The new processor puts the entire incoming packet on the internal bus at once; the criminal packet is recognized in one step and thrown in the trash, or the IP address of the sender is stored so subsequent packets can be trashed.
For specific applications (e.g., sound-based imaging), the architecture needs to be adapted.
I estimate that such microprocessors can become more than a million times faster than current processors.
