Making Chips Run Faster

Just turning up the clock frequency or using more cores will drain the battery or cook the chip, but there are lots of other options.

For all the talk about low power, the real focus of most chipmakers is still performance. The reality is that OEMs might be willing to forgo performance gains in exchange for longer battery life, but they will rarely lower performance to reach that goal.

This is more obvious for some applications than others. A machine monitor probably isn’t the place where performance will make much of a difference. It’s quite another story waiting for a video to download and watching the action hang at a critical moment.

“Anything that’s video-enabled, which goes all the way down to cell phones, is going to be performance-sensitive,” said Bernard Murphy, chief technology officer at Atrenta. “Performance is still a factor.”

But turning up the clock frequency is no longer an option. In fact, Intel’s fastest Xeon processor runs at a maximum of 3.66GHz, down slightly from the 3.8GHz Pentium 4 that the company introduced in 2004 using a 90nm process technology. IBM’s POWER7 chip likewise runs at a maximum of 4.25GHz, down from 5.0GHz in the previous generation.

The solution to boosting performance for these high-performance chips has been multithreading and homogeneous multicore architectures, but for most applications, and even in cloud-based server farms, that approach won’t provide sustainable gains in performance in the future. In the mobile world, utilizing four homogeneous cores continuously can eat up a battery charge as fast as, or faster than, one core at the same or similar frequencies, and in data centers cores that are turned on all the time will make CIOs wince at the cost of powering and cooling racks of servers.

The solution is much more engineering, and not all of it on the processor side. In fact, there are many things that affect performance other than the frequency of processor cores.

“Some of the obvious culprits are memory bandwidth, and that takes us back to Wide I/O,” said Murphy. “A lot of the problem with video is the compression and decompression and all the other tricks you’re playing on it. All of that is very memory-intensive. So you need to have a fast and wide link to whatever you are using as memory DRAM. With any kind of multiprocessing platform you’d like to have dedicated compute on that stream that is not going to be interrupted by other things you are doing. And that means you have to have a multiprocessor, which includes the ARM 5x series or an equivalent with Nvidia or Imagination. And then for communication, MIPI-type stuff could be valuable because you can have multiple channels to multiple transmitters to receivers, which would potentially speed up your communication.”

All of this requires better floor planning, architecting better access to memory, sometimes using different sizes of wires, and different IP, all of which can help boost performance. In addition, improving energy efficiency and rethinking how compute cycles are used can allow more intelligent use of the fastest resources and even provide headroom for boosting clock frequencies or running more cores when necessary.

Bring in the architects—again
In the chip architect world, this sounds like déjà vu. Architects are suddenly back in vogue, intimately involved in all aspects of the design process rather than handing off a chip’s blueprint and moving on.

Moreover, this is true for both the digital and the analog side. But analog has become particularly thorny for designs at advanced nodes. “When you’re looking at performance in analog, it’s more about getting the noise out,” said Carlson, group director of marketing at Cadence. “Because the analog portions are not scaling well, instead of 5,000 square microns maybe you have 3,000 square microns and a DSP to improve the quality of audio streaming.”

Part of this is an exercise in good floor planning for prioritized signal paths, which is much more difficult as more and more functionality is added onto an SoC and there is contention for shared memory, buses and I/O.

“One of the main things we see as we go to smaller feature sizes is that the RC delay from wires is increasing,” said Arvind Narayanan, product marketing manager for the Place and Route Division at Mentor Graphics. “That affects timing and it can lead to over-buffering because you’re not using your upper metal layers for critical nets. You need to prioritize and then you need to follow through on that.”
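The link between wire RC delay and over-buffering can be illustrated with a rough sketch. It uses the standard Elmore approximation for a distributed RC wire, in which delay grows with the square of the length, and a simplified repeater model that ignores driver resistance and load capacitance. The per-micron resistance and capacitance values, the wire length and the buffer delay are made-up numbers chosen only for illustration, not figures from Mentor Graphics or any foundry.

```python
def wire_delay(r_per_um, c_per_um, length_um):
    """Elmore delay in seconds of an unbuffered distributed RC wire."""
    return 0.5 * r_per_um * c_per_um * length_um ** 2


def buffered_delay(r_per_um, c_per_um, length_um, n_segments, t_buf):
    """Delay when the wire is split into equal segments, each driven by a buffer.

    Simplified: each segment costs its own Elmore delay plus a fixed buffer delay.
    """
    segment_um = length_um / n_segments
    return n_segments * (wire_delay(r_per_um, c_per_um, segment_um) + t_buf)


def best_segment_count(r_per_um, c_per_um, length_um, t_buf, max_segments=64):
    """Segment count that minimizes total delay under the simplified model."""
    return min(
        range(1, max_segments + 1),
        key=lambda n: buffered_delay(r_per_um, c_per_um, length_um, n, t_buf),
    )


# Illustrative assumptions only (not process data).
C_PER_UM = 0.2e-15        # farads per micron
R_LOWER_METAL = 5.0       # ohms per micron, thin lower-layer wire
R_UPPER_METAL = 0.5       # ohms per micron, thick upper-layer wire
LENGTH_UM = 2000.0        # a long critical net
T_BUF = 20e-12            # seconds per repeater

lower_buffers = best_segment_count(R_LOWER_METAL, C_PER_UM, LENGTH_UM, T_BUF)
upper_buffers = best_segment_count(R_UPPER_METAL, C_PER_UM, LENGTH_UM, T_BUF)

if __name__ == "__main__":
    print("segments on lower metal:", lower_buffers)
    print("segments on upper metal:", upper_buffers)
```

Under these assumptions the same net routed on the more resistive lower layer calls for more repeaters than it does on the upper layer, which is the over-buffering effect described in the quote.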

That follow-through is changing, though. At 28nm, architects could hand off a design and move on. At 20nm and beyond, Narayanan said the SoC architect often needs to provide feedback after the initial floor plan is completed. “Once that’s done designers can determine, ‘Is it feasible?’ If not, it may require changes to the spec. With finFETs, the dynamic power is larger than the leakage power and there are an increasing number of scenarios that may have been forgotten. What we’re finding is that the worst case for timing is not necessarily the worst case for leakage, and the number of corners is rising. There are 10 to 15 corners for implementation, but maybe as many as 40 or 50 corners for signoff.”

Nor is this confined just to silicon. The package is now an integral part of the performance equation. The generally accepted power budget for mobile devices is about 3 watts, after which they become uncomfortably warm with extended use.

Rather than turn up the clock frequency, a package-on-package approach using a fan-out and more wires around the edge can double the bandwidth, according to Raj Pendse, chief marketing officer at STATS ChipPAC. He noted that this requires the addition of high-density organic substrates, but the result is increased performance or lower power.

In fact, packaging is being viewed as a very big knob to turn in improving performance, whether it involves Wide I/O, more pins, through-silicon vias, or even silicon photonics to communicate between chips.

“What we’re dealing with here is the evolution of the PC board,” said Cadence’s Carlson. “If you shorten the distance you get less capacitance, which allows you to increase the speed because there’s less distance to travel.”

Memory is another place to tweak performance. The Hybrid Memory Cube and Wide I/O-2 have been shown to significantly improve performance and memory throughput. There are opportunities to make DRAM significantly more efficient, as well, according to Craig Hampel, chief scientist for the memory and interface division at Rambus. “The key is to use resources more efficiently and undo the brute-force approach,” he said, utilizing such techniques as memory threading—basically the equivalent of multiple cores in processors—and reducing the voltage swing.

Power tradeoffs
Trading off power for performance is an important part of the equation, too. In many cases, it’s one or the other, and with fixed power budgets design teams have to reduce power in order to improve performance somewhere else on an SoC.

“What’s happened at the leading edge is an accelerated technology migration from 28nm to 16/14nm,” said Aveek Sarkar, vice president of product engineering and support at ANSYS-Apache. “That allowed people to add more and more cores, which were aided by the technology migration. Now the question is how many of them are in idle mode, and whether they’re still burning power when they’re idling. So you need a lot more power management to handle the increased performance, and you need to decrease the supply voltage for the majority of the chip. But memory may need to go at a slightly higher voltage, so now people are putting power gates in there and deciding whether to put in one big power switch around a block or scatter it across the block. That also affects performance.”

With those multiple cores, power gating also can create problems. Waking up cores fast causes a spike in in-rush current, which can affect long-term reliability as well as causing noise on another core. That noise can affect performance, regardless of whether it’s digital or analog, and the problem is magnified if more than one core wakes up at the same time.

“If you don’t design the switches right you can actually slow down the chip because you can’t transition between gates fast enough,” said Sarkar. “You also need to develop good packaging rules at smaller technology nodes because of the resource contention between power and ground. We’re starting to see more design-dependent packaging. One size does not fit all.”
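The tradeoff in this section, lowering supply voltage in one place to free up budget somewhere else, follows from the textbook relation for dynamic power, P = a * C * V^2 * f. The sketch below applies it to a single block. The 2 W block, the 1.0 V and 0.9 V supplies and the helper names are hypothetical. It also assumes the block still meets timing at the lower voltage and ignores leakage, neither of which can be taken for granted in a real design.

```python
def dynamic_power(activity, c_eff_farads, vdd_volts, freq_hz):
    """Textbook dynamic power: activity * C * V^2 * f, in watts."""
    return activity * c_eff_farads * vdd_volts ** 2 * freq_hz


def scaled_power(power_w, vdd_old, vdd_new):
    """Dynamic power after a supply change at unchanged frequency and activity."""
    return power_w * (vdd_new / vdd_old) ** 2


# Hypothetical block: draws 2 W at 1.0 V, then is dropped to 0.9 V.
BLOCK_POWER_W = 2.0
after_w = scaled_power(BLOCK_POWER_W, vdd_old=1.0, vdd_new=0.9)
freed_w = BLOCK_POWER_W - after_w

if __name__ == "__main__":
    print(f"block power after scaling: {after_w:.2f} W")
    print(f"budget freed for other blocks: {freed_w:.2f} W")
```

Because power scales with the square of the voltage, a 10% supply reduction trims roughly 19% of the block's dynamic power in this model, and that headroom can be spent on another part of the SoC within the same fixed budget.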

Shrinking features vs. reworking established nodes
What’s also changed is that while some companies have accelerated their march to the next process node, others have sharply decelerated. Far more options are opening up at 28nm, for example, including various flavors from foundries and different materials such as fully depleted silicon on insulator, which has received strong support from STMicroelectronics, GlobalFoundries and Samsung.

Mark Milligan, vice president of marketing at Calypto, said the next process nodes still offer better performance for a price, but he said there also is a lot more microarchitecture work going on at established nodes to close that gap.

“This is a renewed area of design all of a sudden,” Milligan said. “What we’re seeing is companies trying out different microarchitectures for pipelines and doing RTL power analysis to see the power impact. Then they make incremental changes for different customers. So they may make one version for one customer based on low power and cost, and then turn on different blocks for other customers that have higher performance.”

This kind of granularity and more purposeful utilization of resources has been talked about for several years. But while 14/16nm is in full swing for designs, the challenges of getting there with multipatterning and finFETs have caused many companies to stop and take stock of their options. In some cases, relying on better architectural approaches is the less expensive alternative, while in others shrinking is the only way to eke out enough performance and area to remain competitive.

Richard Solomon, technical marketing manager at Synopsys, has seen this firsthand with PCIe and MPCIe, the Mini PCI Express standard that basically combines MIPI with second-generation PCIe.

“A lot of customers are now looking at power over time,” Solomon said. “So is it better to use a long period of time and a slow data rate, or do everything in a short period of time at a faster data rate? If it’s the latter, you can buffer the video, blast it and then go quiet again. But you have to look at the traffic profile and make decisions that sometimes are counterintuitive.”
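The "power over time" question can be framed as an energy comparison across a fixed window: energy is active power times busy time plus idle power times the remaining time. The sketch below is a toy model along those lines. The payload size, data rates and power figures are invented for illustration and do not describe PCIe, MIPI or any real link.

```python
def window_energy(payload_bits, rate_bps, active_w, idle_w, window_s):
    """Energy in joules to move a payload within a window, idling afterwards."""
    busy_s = payload_bits / rate_bps
    if busy_s > window_s:
        raise ValueError("payload does not fit in the window at this rate")
    return active_w * busy_s + idle_w * (window_s - busy_s)


# Illustrative assumptions only.
PAYLOAD_BITS = 1e9   # data to move in each window
WINDOW_S = 1.0       # length of the window

# Slow and steady: the link stays busy for the whole window.
slow = window_energy(PAYLOAD_BITS, rate_bps=1e9, active_w=0.5, idle_w=0.05, window_s=WINDOW_S)

# Burst, then go quiet, with a deep low-power idle state.
burst_deep_idle = window_energy(PAYLOAD_BITS, rate_bps=4e9, active_w=1.2, idle_w=0.05, window_s=WINDOW_S)

# Same burst, but the link cannot drop into a deep idle state.
burst_shallow_idle = window_energy(PAYLOAD_BITS, rate_bps=4e9, active_w=1.2, idle_w=0.3, window_s=WINDOW_S)

if __name__ == "__main__":
    print(f"slow:               {slow:.4f} J")
    print(f"burst, deep idle:   {burst_deep_idle:.4f} J")
    print(f"burst, shallow idle:{burst_shallow_idle:.4f} J")
```

With these numbers the burst wins only when the link can idle at low power. Raise the idle power and the slow transfer comes out ahead, which is one way the answer can turn out to be counterintuitive depending on the traffic profile.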

While this kind of approach has been used in processors for some time, particularly with a burst mode that runs for a very short interval, it’s a relatively new concept for bus architectures. But what’s clear from all of these approaches is that just relying on shrinking features to achieve performance and power improvements is no longer the only approach. And as 2.5D and package-on-package stacking begin to take root in the industry, there will be far more options for mixing things up in ways no one has been economically motivated to investigate prior to 16/14nm, multipatterning, EUV delays, high-mobility materials and a host of other problems facing continued feature shrinks.



3 comments

Excellent discussion. Thanks!

Very interesting article. Our approach is to provide the ‘different IP’ as suggested in the article. On the advanced nodes, performance optimisation schemes require conditions to be accurately monitored on-chip and within the core. We have the belief that PVT conditions should be monitored and sensed by small analog sensors such as accurate temperature sensors, voltage monitors (core & IO) and process monitors. Quite simply, the more accurately you sense conditions the more watts can be saved for both idle leakage and active states of a device. For example, our embedded temperature sensors have been developed to monitor to a high accuracy for this reason. Once you have the ‘gauges’ in place you can then play with the ‘levers,’ by implementing Dynamic Voltage and Frequency Scaling (DVFS) schemes or Energy Optimisation Controllers (EOCs) which are able to vary system clock frequencies, supplies and block powering schemes.

Again, we believe that these peripheral monitors are nowadays less ‘nice to have’ and becoming a more critical requirement. With that, these monitors must be reliable and testable in situ, as failing sensors could have a dramatic effect on the system.

Another point is that we’re seeing device architectures that cannot cope with each and every block being enabled. With increased gate densities on 28nm bulk and finFET, hence greater power densities, hence greater thermal dissipation, we’re seeing that devices cannot be fully powered up and, at the same time, operate within reasonable power consumption limits.

All these problems of coping with PVT conditions on-chip and the increasing process variability on advanced nodes mean that the challenges, and opportunities for innovation, in implementing more accurate, better distributed embedded sensors and effective Energy Optimisation (EO) schemes are here to stay.

Interesting discussion. I agree with the comment that analog designs are not scaling well as we move to finFET technologies (increased mismatch, increased track and via impedance, etc). Part of the reason for requiring ‘good performance’ analog circuits is to monitor on-chip PVT conditions so that performance and energy optimization schemes, such as those discussed, can be implemented.
