Power-Centric Chip Architectures

New approaches can lower power, but many are harder to design.


As traditional scaling runs out of steam, new chip architectures are emerging with power as the starting point.

While this trend has been unfolding for some time, it is getting an extra boost and sense of urgency as design teams weigh a growing number of design challenges and options across a variety of new markets. Among the options are multi-patterning and finFETs, leveraging different materials such as FD-SOI, and a number of advanced packaging approaches that include variants of fan-outs and 2.5D.

At the same time, much more attention is being paid to getting more out of existing process nodes through architectures, microarchitectures, various caching schemes, and new versions of those processes that are optimized for specific markets. That also can include doing in hardware what used to be handled by software, or doing in software what traditionally had been done in hardware.

“We’re seeing new architectures where they are rethinking the fundamental paradigm of processor, cache and external memory,” said Krishna Balachandran, product management director at Cadence. “Companies are looking at the whole thing from a power-efficiency point of view and applying performance per watt to 21st century applications such as climate mapping and even bomb simulation. The bottom line is we cannot just go on building server farms like we have in the past.”

At some point the electronics industry will narrow its scope and build in the economies of scale that will make some of these options affordable and good enough for many applications. But for the time being, the number of options and possible combinations appears to be increasing rather than shrinking, with power reduction—and related physical effects such as thermal—becoming a central part of many designs.

Heterogeneous approaches
One strategy for reducing power involves heterogeneous multicore computing, which has been gaining traction across multiple vertical markets because it allows compute jobs to run on processors and memory that are sized for specific tasks. Done right, this approach has an added benefit of saving on silicon because the utilization rate of smaller cores can be increased. In many devices such as smart phones or wearable electronics, some cores are in the “off” state most of the time, and while this is more efficient than keeping them on, it may not be as efficient as using a mix of different-sized cores.

How that works with other power-saving approaches, such as virtualization, remains to be seen. Virtualization originally was developed as a way of allowing any software to run on the same hardware, but the market really began ramping when data centers began using it to increase server utilization instead of powering and cooling racks of mostly idle multicore processors. In the past these servers had been segregated by operating system, but hypervisors allow multiple operating systems to run on the same processor, which means they can be queued up for processing together or separately with intelligent scheduling.

But hypervisors become more complicated to manage in a heterogeneous multicore architecture because they only run on the large cores, which means some other compute schemes have to be included, as well.

“We’re seeing designs with four big and four little cores, where you can swap one core in and out at a time,” said Felix Baum, product manager for embedded virtualization at Mentor Graphics. “But having that kind of capability is difficult for customers to comprehend. Only very recently did Linux provide a scheduler. This brings a lot of challenges. If you’re working on a dual-core device, which one boots first?”

Cache coherency is another option because it allows all of the cores to work together, but this is significantly more difficult in a heterogeneous environment than in a homogeneous one. “We know how to build it in theory and there have been some examples of this working, but no successful ones in the market,” said Drew Wingard, CTO of Sonics. “It assumes a uniform view of memory and that the system has a coherent view of memory. You need memory management units, and different architectures assume different models for virtual memory.”

Wingard noted that heterogeneous computing has been common in the mobile space for some time, but it has not been coherent. “At the root of heterogeneous computing is the built-in efficiency to be gained by choosing at run time where to schedule different computing tasks.”
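
To make the idea of run-time scheduling concrete, here is a minimal Python sketch of one way such a choice could be made. It does not describe any shipping scheduler. The core names, throughput figures and power figures are hypothetical, chosen only for illustration. The sketch picks whichever core can finish a job within its deadline using the least energy.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Core:
    name: str
    ops_per_sec: float  # hypothetical throughput
    watts: float        # hypothetical active power


def pick_core(cores: Iterable[Core], work_ops: float, deadline_s: float) -> Optional[Core]:
    """Return the core that meets the deadline with the lowest energy, or None."""
    best = None
    best_energy = float("inf")
    for core in cores:
        runtime_s = work_ops / core.ops_per_sec
        if runtime_s > deadline_s:
            continue  # this core is too slow for the job
        energy_j = runtime_s * core.watts  # energy = time * power
        if energy_j < best_energy:
            best, best_energy = core, energy_j
    return best


# Illustrative numbers only: the big core is 4x faster but draws 8x the power.
BIG = Core("big", ops_per_sec=4e9, watts=2.0)
LITTLE = Core("little", ops_per_sec=1e9, watts=0.25)
```

With these made-up numbers, a job with a relaxed deadline lands on the little core, while a tight deadline forces the same job onto the big one.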

There is work underway to make cache coherency easier to implement across heterogeneous compute elements. One big challenge is making the data consistent between different cores, which is rather straightforward in a homogeneous multicore system, but much more difficult when cores, memories and other IP blocks are developed by different vendors. This is particularly troublesome when some of the IP is internally developed by chipmakers and they want to leverage it for as long as possible.

“The key is how everything talks to each other,” said David Kruckemyer, chief hardware architect at Arteris. “How you place it on a die can affect clocking and power management. By using the interconnect layer you can lash all kinds of IP together and translate from any native coherent protocols into a single layer and create a coherent subsystem.”

The challenge is understanding the sum of all the possible behaviors of the coherence protocols, Kruckemyer said. “But if you can adapt the native coherence models, you can have caches of different sizes. So you can scale the functionality of the interconnect by adding more components, and you can lower the power with multiple clock and voltage domains.”
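
Why separate clock and voltage domains help can be seen from the standard first-order model of dynamic CMOS power, in which power scales with switching activity, capacitance, the square of the supply voltage, and clock frequency. The Python sketch below assumes that model and ignores leakage. All values are hypothetical and are not drawn from any vendor's design.

```python
def dynamic_power(activity: float, capacitance_f: float,
                  voltage_v: float, frequency_hz: float) -> float:
    """First-order dynamic power in watts: activity * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz


# Hypothetical block running at full voltage and clock.
full_speed = dynamic_power(0.2, 1e-9, 1.0, 1e9)

# The same block in its own domain at 0.8x the voltage and half the clock.
slow_domain = dynamic_power(0.2, 1e-9, 0.8, 0.5e9)

# Fraction of the original dynamic power that remains.
remaining_fraction = slow_domain / full_speed
```

Under these assumptions the slower domain burns roughly a third of the original dynamic power, because the voltage term is squared.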

This is easier said than done, and it requires a lot of work. But it also provides some interesting possibilities to lower power at existing nodes without changing manufacturing processes or materials, and to extend the life of IP that is already paid for and market tested.

Heterogeneous cores add another interesting element, as well, in mission-critical markets. Secure firmware updates can be done sequentially, rather than rebooting an entire device.

Homogeneous approaches
Homogeneous computing isn’t out of the picture, either. While there is more attention being paid to heterogeneous solutions, there are simpler ways to achieve results.

“The bottom line is that if you want power efficiency you need to customize logic,” said Cadence’s Balachandran. “Hardware acceleration is one approach. There also are new forms of computing based upon how expensive it is to move data.”

That’s the idea behind Rex Computing, a startup developing a new processor architecture that it claims will deliver a 10X to 25X increase in energy efficiency compared with existing GPU- and CPU-based systems. The strategy is to sharply limit the movement of data, relying on hard scheduling between homogeneous processor cores scattered around a die with about 80% of the die’s surface taken up by SRAM. The company plans to tape out its first 16-core chip on TSMC’s 28nm HPC process next month, with a 16nm finFET-based production chip slated for the second half of 2017.

“If you look at modern processors today, moving data around takes 40 times more power than processing data,” said Thomas Sohmers, Rex’s founder and CEO. “Our goal is to do as much in software to reduce complexity and make the system more efficient. We got rid of MMUs, paging, virtual memory.”
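
A quick back-of-the-envelope calculation shows why that ratio dominates. The Python sketch below takes the 40X figure from the quote and treats it as a per-operation energy ratio, which is an assumption. The operation and transfer counts are hypothetical.

```python
# Ratio quoted above, interpreted here as energy per data move
# relative to energy per compute operation (an assumption).
MOVE_TO_COMPUTE_RATIO = 40.0


def relative_energy(compute_ops: int, data_moves: int,
                    ratio: float = MOVE_TO_COMPUTE_RATIO) -> float:
    """Total energy in units of one compute operation."""
    return compute_ops * 1.0 + data_moves * ratio


# Hypothetical workload: one data move for every operation.
baseline = relative_energy(compute_ops=1000, data_moves=1000)

# Same compute, but 90% of the moves avoided by keeping data local.
mostly_local = relative_energy(compute_ops=1000, data_moves=100)

improvement = baseline / mostly_local
```

In this toy model, the compute work is unchanged, yet cutting data movement by 90% reduces total energy by a factor of about eight.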

The result, he said, is static latency that is predictable and quick to build. The company has gone from concept to tapeout in about a year.

And then there are some chipmakers that aren’t quite heterogeneous or homogeneous. Intel, for example, is combining Altera FPGAs with its server processors to add programmability into the chip itself, but it is doing so with redundant arrays of small FPGAs.

ARM’s big.LITTLE architecture likewise uses different-sized cores, and many chipmakers are using multiple implementations of big.LITTLE chips. In some cases, though, the solution may extend beyond the design into how various compute elements are utilized in the first place.

“We’re seeing 32-bit and 64-bit microcontrollers making their way into MEMS pretty quickly,” said Zach Shelby, vice president of IoT marketing at ARM. “But the problem is how we get the right types of software applications that are high volume enough that they do the same things over and over. FPGA doesn’t quite work for low-power applications. If you do specialized mixed-signal vision detection algorithms on silicon, you have to use a microcontroller, but it’s the same application over and over again.”

Designing differently
Nor is this just about what gets used where. Another option is to design chips more efficiently using less silicon by reducing the margin. That may be the hardest challenge of all, because it requires shaking up the design flow within companies, as well as better tools to understand what can be changed.

“One company overdesigned their power grid because they couldn’t analyze the voltage drop well enough in the context of timing,” said John Lee, general manager and vice president of Ansys. “By fixing that they were able to decrease the chip size by 10% and it freed up routing resources.”

Lee noted that with the IoT, many designs are not at the most advanced nodes, but they still face the same challenges for reducing power. “Their challenge, like everyone else, is how to interpret all of the data with regard to timing, layout and extraction, and many other things.” As with functionality, he said all of this data analysis needs to shift left, and there need to be tools to help decipher it more effectively.

There also is a growing recognition that software and hardware need to come together more effectively, with the emphasis on making the hardware more accessible to the software rather than trying to get software programmers to take better advantage of hardware features.

“There are two possible approaches here,” said Sonics’ Wingard. “One is to develop a magic compiler to figure out the hardware. The other is to say, ‘There is a block over here with an API to do this kind of job, and there is a resource manager in the programming stack.’ You don’t have to turn programmers into embedded system programmers. You have to solve the problem where the software guys want to work. It also does not make sense to think of this as a steady-state machine. What you’re trying to do is design for integration.”
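
The second approach Wingard describes can be pictured with a small Python sketch. This is only an illustration of the general pattern, not an API from Sonics or anyone else. The class, method and job names are all hypothetical. Software asks a resource manager to run a kind of job, and the manager routes it to a registered hardware block if one exists, or falls back to software if not.

```python
from typing import Any, Callable, Dict


class ResourceManager:
    """Hypothetical dispatcher that routes jobs to hardware blocks by kind."""

    def __init__(self, software_fallback: Callable[[str, Any], Any]) -> None:
        self._blocks: Dict[str, Callable[[Any], Any]] = {}
        self._software_fallback = software_fallback

    def register_block(self, job_kind: str, handler: Callable[[Any], Any]) -> None:
        """Advertise a hardware block that can handle one kind of job."""
        self._blocks[job_kind] = handler

    def submit(self, job_kind: str, payload: Any) -> Any:
        """Run the job on a matching block, or in software if none is registered."""
        handler = self._blocks.get(job_kind)
        if handler is None:
            return self._software_fallback(job_kind, payload)
        return handler(payload)


def run_in_software(job_kind: str, payload: Any) -> str:
    # Stand-in for a general-purpose CPU implementation.
    return f"cpu:{job_kind}:{payload}"


manager = ResourceManager(run_in_software)
# Stand-in for a driver call into a dedicated accelerator block.
manager.register_block("fft", lambda samples: f"fft-block:{samples}")
```

The application programmer only calls `manager.submit("fft", data)` and never needs to know which block, if any, did the work.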

Power is emerging as the major constraint in designs these days, in terms of its impact on battery life and utility bills, the associated physical effects such as heat and electromigration, and the impact of those physical effects on reliability over time. In many ways power is a cost that is rising, which is why it has become the starting point around which an increasing number of chips are being architected.

There are many ways to tackle power. None of them is simple, and no single approach solves everything. Power is a system-level problem, which means it has to be considered at every level, from what kinds of components are used, to how they are utilized and connected together, how they are laid out on a chip, what materials and manufacturing process work best, and even how they are ultimately tested.

But it also is a huge opportunity for the semiconductor and design industry. New tools and approaches are under development that can make designing for power much more predictive with significantly better accuracy and faster turnaround. Change is coming, but in this case it is likely to come across many markets and technologies at once. Power is, after all, a global problem, and it will take an entire industry to solve it.
