Second of two parts: More cores and faster cores are supposed to be better. That’s true sometimes, but not others.
Check out any smartphone these days and you’ll find some reference to the number of cores in the device. But it isn’t the number of cores that makes the difference, or even the clock speed at which they run. Performance depends on the underlying design: how the cores are utilized and how often, how much memory they share, how much interaction there is between them, and whether the software applications can actually take advantage of them.
Performance is tightly bound to a power budget, especially in mobile devices, where more time between battery charges is essential, but increasingly also in devices that are plugged into the wall. The rule of thumb in semiconductor and full-system design is that you can trade power for performance, but you can’t have both. Less obvious, but just as significant, are the speeds of the interconnects, how much processing is done on-chip versus off-chip, and how large a role cache coherency plays. Designs that get this balance wrong may have lots of cores and boast fast clock speeds, yet still not outperform the models they replace, which had fewer cores and slower clocks. So while a single processor can’t get any faster by turning up the clock frequency, there are ways of significantly improving performance outside of processor speed.
Software virtualization has emerged as a way of utilizing more cores within a design. It is less frequently associated with performance, because it doesn’t actually speed up individual operations. But it can speed up a multitude of operations, with or without some help from the hardware. It also can bog the system down in coherency issues if it isn’t architected correctly.
The technology—first developed by IBM in the late 1960s to boost performance on its mainframes by allowing more tasks to be scheduled simultaneously—gained widespread attention over the past decade as a way of improving utilization across data centers. By allowing applications running on different operating systems to be queued up on the same server, each inside a virtual machine managed by a thin layer of code called a hypervisor, server utilization was increased from a low of about 5% to as much as 90%. That, in turn, allowed data centers to save a huge amount of money on both powering and cooling racks of servers, as well as to shrink the overall number of machines inside of those data centers. In some cases, the savings were measured in tens of millions of dollars.
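The consolidation math behind those savings follows directly from the utilization figures above. A minimal sketch, where the fleet size is a hypothetical and only the 5% and 90% utilization numbers come from the article:

```python
# Consolidation ratio implied by the utilization figures: total useful work
# is conserved, so servers_after = servers_before * util_before / util_after.
def consolidated_servers(servers_before, util_before, util_after):
    """Physical servers needed after virtualization, assuming ideal packing."""
    return servers_before * util_before / util_after

# A hypothetical fleet of 1,000 servers at 5% utilization collapses to ~56 at 90%.
print(round(consolidated_servers(1000, 0.05, 0.90), 1))  # 55.6
```

The real ratio is lower in practice, since workloads never pack perfectly and headroom must be reserved for peaks.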
That approach has been further refined with “Type 1” hypervisors, which use an even thinner layer of code that runs directly on the metal rather than on an operating system. The result is even better energy efficiency and performance for the same operation. Virtualization also has become particularly attractive recently as a security measure on all sizes of processors, because it allows transactions to be isolated from each other. Rather than physically separating functions with components and processes that never actually touch, they can be virtually separated, which is a much less costly approach.
But not all virtualization needs to be done in software. A less publicized approach is hardware virtualization, which utilizes unused processor cycles as if they were running as different processors. That makes it faster, less prone to error, and even more efficient, and it allows for better utilization of processing capability even on single-core processors.
“You maximize your processor use by utilizing it as if it’s a different processor,” said Andrew Caples, senior product manager for Nucleus in Mentor Graphics’ Embedded Software Division. “So it functions as virtual CPUs, but physically it’s only one CPU. This is one area where MIPS (now part of Imagination Technologies) has a performance gain. You also can use it to minimize power and heat while increasing performance.”
Improving work per clock cycle is one of the big challenges for makers of all types of processors. Part of that depends on the instruction set, which can involve both the number of bits and the efficiency of the operations. Synopsys has voiced a similar goal to MIPS, for example, with a different twist on the technology.
“This isn’t just about adding instructions,” said Mike Thompson, product marketing manager for Synopsys’ ARC processor. “You can add more core registers and specify source and destination there, and you can bring in registers so there is single clock access to read and write. And you can improve performance by using a hypervisor with performance density, so you’re running a second system on the same processor. That has implications in terms of maximizing efficiency because you’re taking advantage of unused cycles.”
It also has implications at 16/14nm because the logic runs faster, but at least initially the memories will run slower than the logic. “There are ways to mitigate that, such as more stages to access the memory,” noted Thompson. “But how much money are you willing to spend on memory for bigger drivers?”
Offloading some of the processing from a general-purpose processor is at the heart of a number of schemes for improving performance and reducing power. This varies by the capability of applications to take advantage of multiple cores.
“You can either have a hard affinity to specific cores, or you can have a soft affinity and load a different task when a processor becomes available,” said Mentor’s Caples. “That way you can allocate a couple processors to a number of tasks. So you get a performance gain from software that works on multiple cores, not just from adding more and more cores.”
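On Linux, the hard-affinity option Caples describes maps onto the scheduler’s per-process CPU mask; soft affinity amounts to leaving the mask alone. A sketch using Python’s standard library (the affinity calls are Linux-only, and the core numbers are illustrative):

```python
import os

# Hard affinity: pin the calling process (pid 0) to cores 0 and 1.
# os.sched_setaffinity is Linux-only, so guard it for portability.
if hasattr(os, "sched_setaffinity"):
    os.sched_setaffinity(0, {0, 1})
    print(os.sched_getaffinity(0))   # the current mask, e.g. {0, 1}

# Soft affinity is the default: with no mask set, the OS scheduler is free
# to load a ready task onto whichever core becomes available first.
```

The same masks can be set per-thread from C via `pthread_setaffinity_np`, which is how most RTOS and embedded schedulers expose the choice.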
How many cores are effective depends on how many of them the software can actually use. Most software can be threaded to use two cores. Some can be threaded to use four cores effectively. Beyond that, however, the software has to be doing highly repetitive tasks that can be parsed into small pieces, computed independently, and then reassembled at the end, which is the biggest challenge in the multiprocessing world.
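The parse/compute/reassemble pattern described above is the classic fork/join model. A minimal sketch with Python’s multiprocessing pool, where the worker count and workload are purely illustrative:

```python
from multiprocessing import Pool

def square(x):
    # An independent, repetitive unit of work: no shared state, no ordering.
    return x * x

if __name__ == "__main__":
    data = list(range(8))
    # Parse the work into small pieces, compute them independently on
    # four worker processes, then reassemble the results at the end.
    with Pool(processes=4) as pool:
        results = pool.map(square, data)
    print(sum(results))  # 140
```

The pattern only pays off when each piece is large enough to amortize the cost of distributing it, which is why so few workloads scale cleanly past a handful of cores.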
“At some level, this is a problem we’ve been dealing with since the dawn of the SoC—how to write the code,” said Drew Wingard, chief technology officer at Sonics. “This all comes down to the energy efficiency of the processor. We’ve gotten to the knee of the curve where making it run faster is prohibitively expensive in terms of power. So the question is no longer about parallel processors. It’s about how many separate processors should be used, and how we can help the software community use platforms with a mix of processors and different instruction sets. That also requires some changes in intermediate assembly, where you have intelligent schedulers to move processes around.”
There is still a big effort in application-specific processors and in compiler technology to speed the software for specific processors. This helps explain why Synopsys acquired Target Compiler Technologies earlier this month.
Power and throughput
One of the challenges for processor makers at advanced nodes is throughput. Wires between memory and logic become longer and thinner at each new process node, which raises their resistance and capacitance, and that raises questions about power as well as performance. The new metric of choice is performance per watt, but sometimes even that isn’t sufficient.
“In some cases it’s not only performance/watt,” said Eran Briman, vice president of marketing at CEVA. “Sometimes it is sheer performance gaps that necessitate an application-specific processor over a general-purpose one. For example, you cannot even imagine running LTE PHY (physical layer) on even the fastest CPUs you’ll find in wireless base stations. It just won’t fit, whatever your power budget may be. Another example comes from the computational photography domain. One of the most sought-after applications in this domain is lens arrays, where multiple—up to 4 x 4—sensors capture an image simultaneously, and then manipulate and fuse them together to create a single high-quality image. Trying to map such computations onto the fastest CPUs you’ll find in application processors will fail. In such extreme cases, these general-purpose processors simply lack the performance required, let alone the power concerns.”
Briman noted that application-specific processors range from extremely low-power, always-on designs all the way up to high-performance vector floating-point processors. “The most significant hurdle for vendors to embrace such processors is their programming, and opening them up to software developers. These developers are usually very familiar with CPUs and comfortable with their tools and ISA. How can you get them to use a different set of tools, with an application-specific ISA and a clear focus on coding in a power-efficient manner? This is where various automatic offloading tools come in handy, helping to abstract the processor from software developers, to a stage where the developer does not really need to know what processor he’s running his code on. From his point of view, it’s all ARM-based, and the actual offloading to application-specific processors happens automatically.”
These kinds of questions are being asked more frequently. Granularity in design has now reached the processor, the application, and the power budget.
“Do you need to crank up the clock speed to do what you need to do?” asked Aveek Sarkar, vice president of product engineering and support at ANSYS. “Is the clock speed the determining factor, or do you need more cores that are more application-specific and optimize it that way? With the finFET, transistor leakage current is out of the picture. But if you crank the clock up, you still have to deal with power. If you drop the supply voltage to 0.7 volts or even 0.6 volts the performance hardly changes with a finFET, but if you’re at 28nm and you drop it you get a significant decrease in performance. At the same time, if you push up the clock speed, you negate the purpose of what you’re trying to accomplish. This is why we’re seeing a general trend to more cores for specific functions. In the past it was all about clock speed. Now it’s about functionality and application-specific design.”
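Sarkar’s point about voltage versus clock speed follows from the standard dynamic-power relation P ≈ α·C·V²·f: the quadratic voltage term means dropping the supply voltage saves far more power than a proportional frequency cut would. A sketch with illustrative numbers (none of the capacitance or frequency values are from the article):

```python
def dynamic_power(c_eff, vdd, freq, activity=1.0):
    """Classic CMOS switching-power estimate: P = activity * C_eff * Vdd^2 * f."""
    return activity * c_eff * vdd ** 2 * freq

# Assumed values: 1 nF effective switched capacitance, 1 GHz clock.
p_nominal = dynamic_power(1e-9, 0.9, 1e9)   # 0.81 W at 0.9 V
p_lowv    = dynamic_power(1e-9, 0.7, 1e9)   # 0.49 W at 0.7 V
print(round(p_lowv / p_nominal, 2))          # 0.6 -- ~40% less power, same clock
```

This is why a finFET that holds its performance at 0.7 V is such a win: the voltage drop is nearly free, whereas at 28nm the same drop costs significant speed.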
Stacked die—both 2.5D and 3D—will give another boost in terms of improved throughput, less memory leakage and lower I/O power, provided they can be proven cost-effective. The distances between components are shorter and the wires connecting them are wider, which requires less energy to drive signals. But a reverse trend is beginning to take shape, as well. Rather than processors defining the software, there is discussion under way at systems companies about having the software define the hardware itself, as well as how it behaves. That could prove a much less expensive approach to design, because it may not require the latest process technology for improved performance. A billion gates may not matter for hardware that is designed according to the needs of software.
In every part of design, though, power is becoming the limiting factor on just how fast a processor can run. So while it makes sense to add more cores for specific functions, processors themselves aren’t getting faster. There are just more of them, and they are all working more efficiently. That’s what good engineering is all about. But it may take a long time for the marketing world to figure out a good way to sell it.
To view part one of this report, click here.