Are More Processor Cores Better?

One architectural change can have far-reaching consequences for hardware, software and the design flow. Adding a second processing core adds untold complexity.

Up until the early 2000s, each generation of processor was faster, used more exotic architectures, had deeper pipelines, used more transistors, ran at higher clock frequencies and consumed more power. In fact, power was rising faster than performance, which led to the extrapolation that within a few generations processors would run as hot as nuclear reactors.

Something had to change, and that change was the migration to multi-core processors. Individual processors got simpler, while total compute power was maintained or increased by adding cores. This has had a profound impact on every aspect of hardware and software, a change that not only shifted complexity but magnified it. In fact, the ripple effects are much larger than they appear on the surface. Some aspects of that impact are still being realized in areas as far-reaching as the EDA tools and the solutions being used for chip design.

Semiconductor Engineering talked with many experts in the industry to find out how they are coping with the complexity burden thrown off by the processor migration.

The industry has seen a convergence of two major application areas. One of those was desktop and general-purpose computing. This area demanded the fastest processors, and mainframes adopted homogeneous multi-processing a long time ago. “I have been working on multi-processor systems with interconnect and coherency for 25 years,” says Chris Rowen, a Cadence fellow. “The people who have been designing this category of systems have been having this nightmare for a long time. This is not a new nightmare. It’s decades old. But now the nightmare has been brought down to 1cm² of silicon, so it is everyone’s nightmare.”

The second area was led by mobile phones and smart TVs. These started out heterogeneous, with additional cores being added to perform dedicated functions. The early phones had a CPU and a DSP, but all of the heavy lifting, the baseband communications, was performed on the DSP. The CPU was a simple controller whose job was to service keyboard interrupts or display a few characters on a low-resolution screen.

The addition of the applications processor has turned the phone and the smart TV into computers. On the other side, because of their power envelopes, general-purpose computers needed to look more like SoCs and started to integrate all of the functions that used to be on separate devices. Today they include not only the applications processor, but also GPUs, audio sub-systems and many other functions.

“The first change, and the cost benefits of this level of integration are only realized if you share memory,” explains Drew Wingard, chief technology officer for Sonics. “Until then, each of them had their own local memory, which frequently involved external memory chips. The PCB would be populated by a number of discrete memory parts supporting the separate sub-systems. When you put them together it would require far too many pins for all the memory interfaces. You only get the cost benefit from pooling those resources and being able to share them. This gave us a lot more DRAM capacity but now we ran out of bits per second. Throughput became the new problem.”

The total number of lines of software is so large that there is a willingness to put extra features in the hardware to reduce the cost of developing software. “In the past, regions of memory were private to each sub-system and so there was no need to make addressing schemes consistent or no requirements for making these visible to the host operating system,” explains Wingard. “They also had no need for virtual memory. All of these walls are coming down.”

The primary way to speed up memory access was to build a cache. This was fairly simple for one processor, but as soon as multiple cores exist, the programming model being used requires memory coherency. “People have done a fair amount of shared memory programming,” says Rowen, “and have developed hardware features that made this relatively efficient, such as cache coherency. While cache is more efficient than going off-chip to get the common copy, it is a meaningful amount of overhead.”
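The overhead Rowen describes can be made concrete with a toy model. The Python sketch below assumes a simplified MSI (Modified/Shared/Invalid) snooping protocol tracking a single cache line; the `Bus` class and the `ping_pong` and `private` helpers are hypothetical names invented for illustration, not any vendor's protocol. It counts only bus messages and ignores latency and data transfer.

```python
from enum import Enum


class State(Enum):
    INVALID = "I"
    SHARED = "S"
    MODIFIED = "M"


class Bus:
    """Tracks one cache line across cores and counts coherence messages.

    Hypothetical, simplified MSI model for illustration only.
    """

    def __init__(self, n_cores):
        self.states = [State.INVALID] * n_cores
        self.messages = 0

    def read(self, core):
        if self.states[core] is State.INVALID:
            self.messages += 1  # read request broadcast on the bus
            # A core holding the line Modified writes back and downgrades.
            for other, st in enumerate(self.states):
                if other != core and st is State.MODIFIED:
                    self.states[other] = State.SHARED
            self.states[core] = State.SHARED
        # Shared or Modified: local hit, no bus traffic.

    def write(self, core):
        if self.states[core] is not State.MODIFIED:
            self.messages += 1  # invalidate / read-for-ownership broadcast
            for other in range(len(self.states)):
                if other != core:
                    self.states[other] = State.INVALID
            self.states[core] = State.MODIFIED
        # Already Modified: local hit, no bus traffic.


def ping_pong(n_writes):
    """Two cores alternate writes to the same line."""
    bus = Bus(2)
    for i in range(n_writes):
        bus.write(i % 2)
    return bus.messages


def private(n_writes):
    """One core writes the line repeatedly; the other never touches it."""
    bus = Bus(2)
    for _ in range(n_writes):
        bus.write(0)
    return bus.messages
```

Under these assumptions, `ping_pong(100)` generates 100 coherence messages while `private(100)` generates one. The line stays correct in both cases, but data shared between writers pays for coherence on every access.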

This overhead extends into the verification task. “Cache coherence is a huge problem to verify,” says Michael Sanie, senior director of verification marketing at Synopsys.

Pranav Ashar, chief technology officer at Real Intent, adds, “Multi-core systems are being designed with lots of bandwidth to support parallelism, but they continue to have insidious latency-traps. Cache-coherency and memory access latency are two often encountered examples.”

At the most recent Decoding Formal event, Robert Kurshan, a recently retired formal expert from Cadence, provided an example of how complexity in the system can lead people to verify the wrong things, potentially with disastrous results. “Memory consistency lives at a higher, more abstract level than cache coherence, and as a result you can find more bugs faster if you seek to verify memory consistency.” He explained how Intel almost missed a cache problem on the P5 because of this wrongly focused verification approach. “It was actually a flaw in the way write-back was performed and it was not found because they had been verifying their cache protocol. It had not occurred to them that a race condition in the write-back could have created a problem.”
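The distinction Kurshan draws can be illustrated with a classic "store buffering" litmus test. The Python sketch below is not a reconstruction of the P5 flaw. It is a generic, hypothetical model in which memory holds a single copy of each variable (so there is nothing for a coherence check to find), yet per-thread write buffers still allow an outcome that sequential consistency forbids. All function names are invented for illustration.

```python
def interleavings(a, b):
    """Yield every merge of event lists a and b that keeps each list's order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule, buffered):
    """Execute one schedule and return (r0, r1), the values each thread loaded."""
    mem = {"x": 0, "y": 0}
    buf = {0: {}, 1: {}}  # per-thread write buffers (used only if buffered)
    regs = {}
    for tid, op, var in schedule:
        if op == "store":
            (buf[tid] if buffered else mem)[var] = 1
        elif op == "flush":
            mem.update(buf[tid])  # drain the thread's buffer to memory
            buf[tid].clear()
        else:  # load: check own buffer first, then memory
            regs[tid] = buf[tid].get(var, mem[var])
    return regs[0], regs[1]


def thread(tid, mine, other):
    """Thread stores 1 to `mine`, then loads `other`; flush drains its buffer."""
    return (tid, "store", mine), (tid, "load", other), (tid, "flush", None)


def outcomes(buffered):
    """Collect every (r0, r1) result reachable over all legal schedules."""
    s0, l0, f0 = thread(0, "x", "y")
    s1, l1, f1 = thread(1, "y", "x")
    if buffered:
        # The flush may happen before or after the thread's own load.
        orders0 = [[s0, f0, l0], [s0, l0, f0]]
        orders1 = [[s1, f1, l1], [s1, l1, f1]]
    else:
        orders0, orders1 = [[s0, l0]], [[s1, l1]]
    return {
        run(s, buffered)
        for a in orders0
        for b in orders1
        for s in interleavings(a, b)
    }
```

Comparing `outcomes(False)` with `outcomes(True)` shows that the buffered model adds the result `(0, 0)`, in which each thread misses the other's store. A check aimed only at the cache protocol would not flag this, while a check of the memory consistency model would.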

Another area of added complexity is power management. While power issues ended the reign of the single core, the problem has now become a chip-wide issue. “Power management architectures have become a critical aspect of design,” says Aveek Sarkar, vice president for Ansys-Apache. “Do you use clock gating or power gating to turn off unused cores? Do you have voltage regulators to control them?” But it is not just about controlling power in a centralized manner. The issue spreads to other parts of the design flow. “If you power gate them, do you gate them in a distributed manner or sprinkle the power gates across the block, do you put them around the block? This controls several architectural decisions and as the number of cores increases your power gating architecture changes.”

This has led to additional tools and methodologies being added to the design flow. “System functionality could be mapped onto a low-end processor with hardware accelerators, or multiple high-performance processors,” says Jon McDonald, strategic program manager at Mentor Graphics. “Each design could potentially provide the same level of performance, but the cost and flexibility of the two approaches are dramatically different.” The tradeoffs between hardware and software are now a significant development problem.

“Many of the performance related features are quite dependent on the software workload,” says Gene Matter, vice president at Docea Power. “EDA tools are evolving to provide dynamic use case simulation of complex SoCs that not only estimate performance/task consumption but also power and thermal behavior. Most system designs are thermally limited, so the thermal management algorithms can be modeled to take into account both the power and performance trade-offs.”

Power is one of the problems that bring the hardware and software teams together. “Power management software debug is becoming a fairly significant issue and this is one area where a Virtual Prototype has a very significant impact,” says Johannes Stahl, director of product marketing for virtual prototyping at Synopsys. “In mobile applications you will see designs that have 300 to 400 power islands. The verification of this is a huge challenge. While people have been dealing with this for a while now, there are a bunch of problems, such as how do you connect these? At the architectural level it is not that difficult to connect them, but at RTL, a lot of things can go wrong.”

“Consider two cores sitting next to each other,” says Sarkar. “One core is operating and the other is shut off. You need to wake the second core and the regulator has to push in a lot of current. That creates noise on the chip which can capacitively and inductively couple to the neighboring block and collapse their supply rail.”

Bob Smith, senior vice president of marketing and business development for Uniquify, Inc., points to another area in which complexity has increased. “With the proliferation of on-chip processors, the demand for ‘smarter’ interconnect is growing. An intelligent memory controller cannot be fully responsible to ensure accesses approach the theoretical maximum bandwidth of the memory. With multiple processors making multiple simultaneous requests to memory, the system interconnect needs to play a role to ensure that memory bandwidth is not crippled.”
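One way to see why the order of requests matters is a toy model of a DRAM bank's row buffer. The Python sketch below assumes a single bank with an open-page policy, where an access to the already-open row is a fast "hit" and any other access forces a slower row change. The function names, the two request streams and the batch size are all hypothetical choices made for illustration, not measurements of any real system.

```python
def row_hits(requests):
    """Count open-row hits for one bank under an open-page policy."""
    open_row = None
    hits = 0
    for row in requests:
        if row == open_row:
            hits += 1
        open_row = row  # the accessed row is left open
    return hits


def interleave(streams):
    """Naive arbitration: take one request from each requester in turn."""
    return [req for group in zip(*streams) for req in group]


def batch(streams, size):
    """Grant each requester `size` consecutive requests before switching."""
    out = []
    for start in range(0, len(streams[0]), size):
        for s in streams:
            out.extend(s[start:start + size])
    return out


# Two hypothetical requesters, each streaming through its own DRAM row.
cpu0 = [0] * 8
cpu1 = [1] * 8
```

In this model, `row_hits(interleave([cpu0, cpu1]))` is 0 because every access closes the other requester's row, while `row_hits(batch([cpu0, cpu1], 4))` is 12 out of 16. The memory controller sees the same requests in both cases; only the order in which the interconnect delivers them differs.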

Bill Neifert, chief technology officer for Carbon Design Systems, points out that “while it may be possible to get a first-order understanding of the performance of an interconnect by using traffic generators and VIP blocks, the only true way to understand system behavior is to execute the system together with the software. Data can then be analyzed across multiple runs to visualize the impact of various system configuration and software changes.”

That’s no simple task. “You can build the best hardware and software in the world but if processes are fighting each other, everything slows to a crawl,” says Bernard Murphy, chief technology officer for Atrenta. “This has to be managed very collaboratively between hardware and software teams, understanding likely use-case models.” Murphy adds that both power management and security are coming into that collaboration. “The current focus in hardware is on proving security robustness as a trusted execution environment (TEE), which is a necessary but not sufficient condition.”

Hem Hingarh, vice president of engineering at Synapse Design, also sees the network as the place where many of the design concerns come together. “Designing energy-efficient methodologies for various Network on Chip (NoC) domains, such as the routing algorithms, buffered and buffer-less router architectures, fault tolerance, switching techniques, voltage islands, and voltage-frequency scaling significantly affects the NoC performance. Therefore, the optimization of routing algorithms for the NoC is a key concern in enhancing the NoC performance and to minimize the energy consumption.”

“The major limit of a modern SoC growth is the verification of these systems,” claims David Kelf, vice president of marketing for OneSpin Solutions. “The current interconnect solutions allow a layered verification approach critical to system test, with the functional verification of IP blocks leading to only an integration test at the SoC level. This approach is absolutely dependent on the verification of the interconnect itself, and this, in turn, relies on the interconnect solution.”

Murphy does not see the interconnect making verification simpler. “In general, verification complexity is getting harder at the interconnect. You have to worry not just about traffic management, performance, QoS and so on, but also all of this working correctly in various partitioned modes, when different masters or slaves are running in different voltage/frequency domains or are powered off, or are in transition between these domains.”

Adding multiple application cores has had far-reaching effects on every aspect of the hardware and software design and verification flows. If one could measure the increase in total system complexity caused by this one architectural change, it would probably dwarf every other change in the development of the integrated circuit.

“As a rule, whenever you are designing hardware or software to allow more things to happen in the same amount of time, you are not simplifying the verification problem,” observes Ashar. That may be something of an understatement.