Darker Silicon

MRAMs offer less volatile cache to address the dark silicon dilemma. What happened to Dennard’s Law?


For the last several decades, integrated circuit manufacturers have focused their efforts on Moore’s Law, increasing transistor density at constant cost. For much of that time, Dennard’s Law also held: As the dimensions of a device go down, so does power consumption. Smaller transistors ran faster, used less power, and cost less.

As most readers already know, however, there was a limit. Smaller devices with thinner dielectrics and shorter channels are more prone to leakage. Indeed, leakage, negligible for much of the industry’s history and ignored in Dennard’s original paper, now approaches the same order of magnitude as the circuit’s dynamic power. Advances such as the introduction of high dielectric constant gate dielectric materials helped, but leakage-limited transistor structures are now a fact of life. Switching a transistor at a lower threshold voltage requires a thinner gate dielectric, but leakage constraints place a lower bound on dielectric thickness. As a result, while feature sizes have continued to shrink, threshold voltage has not.

Plenty of transistors, not enough power
This failure of Dennard scaling has introduced the era of what designers call “dark silicon.” If the number of transistors doubles, but the power budget for the circuit as a whole stays the same — or goes down, thanks to the proliferation of mobile devices — then the available power for each transistor is cut in half. If threshold voltage stays the same, then the number of transistors that can operate at one time is also cut in half. These non-operational transistors are dark silicon, measured as a fraction of the chip’s total area.
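The halving argument above can be sketched in a few lines of Python. This is a toy model, not a projection: it assumes the chip was fully usable at a baseline generation, that per-transistor power no longer falls because threshold voltage is fixed, and that each generation doubles the transistor count. The function name dark_fraction and its parameters are illustrative only.

```python
def dark_fraction(doublings: int, power_budget_scale: float = 1.0) -> float:
    """Fraction of transistors that must stay dark after `doublings` density doublings.

    Assumptions (hypothetical, for illustration):
      - at generation 0 the power budget is just enough to run every transistor;
      - per-transistor power stays constant (threshold voltage does not scale);
      - `power_budget_scale` is the total power budget relative to generation 0.
    """
    if doublings < 0 or power_budget_scale <= 0:
        raise ValueError("doublings must be >= 0 and power_budget_scale > 0")
    transistor_count = 2 ** doublings  # relative to generation 0
    active_fraction = min(1.0, power_budget_scale / transistor_count)
    return 1.0 - active_fraction


# Constant power budget: each doubling halves the usable share.
for n in range(4):
    print(f"{n} doublings: {dark_fraction(n):.1%} dark")
```

Under these assumptions a constant budget leaves half the chip dark after one doubling and seven-eighths after three. A shrinking budget, as in mobile devices, makes the dark share larger still.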

Calculating the power consumption of a generic chip is difficult. It depends on a wide range of factors, from dielectric thickness and process variation to the workload of different parts of the chip. Still, as Greg Yeric, senior principal engineer at ARM, explained in a short course at the 2014 IEEE International Electron Devices Meeting (IEDM), projections estimate the dark silicon fraction will be about one-third of total area in the 20nm technology node (including 16/14nm finFETs), increasing to as much as 80% by the 5nm node. Real products are likely to achieve better results, but clearly power consumption imposes an increasingly severe design constraint.

At that point manufacturers may be tempted to ask why they are putting so much effort into making smaller transistors if designers aren’t planning to use them. Part of the answer is that “dark” silicon is not “useless” or “wasted” silicon. In any design, many circuit paths will be “dark” at any given moment. Some elements, such as specialized logic and cache memory, are particularly “dark-silicon friendly,” in that they contribute to overall IC performance while consuming power only in special situations.

Specialized cores help…somewhat
In fact, this insight led to the integrated circuit industry’s current focus on multicore designs. If a problem can be broken into parallel components, then several cores running at a relatively low speed can still deliver better overall performance than a single core running at high speed. Many problems, and in particular many computation-intensive problems — digital photography, video rendering, database searching, etc. — are readily parallelizable. Moreover, the availability of parallel processing allows designers and software engineers to address larger problems with larger data sets in the same amount of time.

The tradeoff between parallel and sequential processing is still hotly debated in the software and design worlds. As frequency and voltage scaling have become more difficult, devices with multiple general-purpose cores have proliferated. As power constraints become more severe, though, the fundamental design assumption that silicon area is expensive and should be conserved has been turned on its head.

In an expensive silicon paradigm, it makes sense to design general-purpose logic that can be re-used by many different problems. In the dark silicon era, though, transistors are readily available, but power is very limited. Thus, as UC San Diego professor Michael Taylor explains it, designers can “spend” transistors in order to “buy” power efficiency. For example, a circuit might have many different special-purpose cores that perform one task very efficiently but are dark the rest of the time.

Along those lines, Taylor’s group has proposed “GreenDroid,” a power-optimized approach to the popular Android phone and tablet platform. They found that 43,000 static instructions accounted for 95% of the typical Android device’s workload, and estimate that only 7 mm² of silicon in a 45nm process is needed to accommodate those instructions. In place of general-purpose cores, the proposed GreenDroid design uses many different “conservation cores” optimized for specific key functions.

As a design paradigm, this approach is problematic. While the conservation cores can be automatically generated, based on statistical measures of the target workload, the design also needs to be able to dynamically switch between specialized and general-purpose blocks, depending on which tasks a particular piece of software requires. Over-reliance on specialized cores risks creating a “Tower of Babel” situation, in which a core cannot be used for even closely related computations, and software developers are slow to adopt new hardware because of the difficulty of programming for it.

Meanwhile, all of these factors conspire to substantially increase hardware complexity and therefore the demands placed on human designers and programmers. And even after all that effort, specialized cores, like general-purpose cores, will only keep the dark silicon problem at bay for so long. Ultimately, the overhead involved in switching between cores will itself consume a substantial fraction of the available power.

Can non-volatile cache memory change the game?
This is where the manufacturing side of the house comes in. In broad terms, any kind of multicore design approach makes use of “power gating.” The parts of the chip that are not in use are powered off completely, eliminating both static leakage losses and dynamic power consumption. However, as Takahiro Hanyu and colleagues at Tohoku University explained at December’s IEEE International Electron Devices Meeting (paper #28.2), switching to the “off” state requires a “backup” step to store the logic state to memory, and a “bootup” step to restore it. These operations consume both power and time, while the cache memory used for storage also uses power. DRAMs are difficult to scale to very small dimensions, difficult to integrate with CMOS processes, and require a regular “refresh” operation. SRAMs, currently used for on-chip cache memory, are prone to leakage. Existing non-volatile memory technologies, like flash, are too slow and require too much write power.
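The backup and bootup overhead described above implies a break-even idle time: gating a block only saves energy if it stays off long enough for the avoided leakage to repay the cost of saving and restoring its state. The sketch below is a simple energy balance, not a model from the Tohoku paper. The function name, parameter names, and example values are all hypothetical.

```python
import math


def break_even_idle_time(backup_energy_j: float,
                         bootup_energy_j: float,
                         leakage_power_w: float,
                         retention_power_w: float = 0.0) -> float:
    """Idle time in seconds beyond which power gating saves energy.

    Staying on costs leakage_power_w * t. Gating costs the one-time backup
    and bootup energy plus retention_power_w * t for the memory holding the
    saved state. All inputs are hypothetical illustration values.
    """
    saved_per_second = leakage_power_w - retention_power_w
    if saved_per_second <= 0:
        return math.inf  # the backing memory costs as much as the leakage it avoids
    return (backup_energy_j + bootup_energy_j) / saved_per_second


# Example with made-up numbers: 2 nJ backup, 2 nJ bootup, 1 mW block leakage.
leaky_store = break_even_idle_time(2e-9, 2e-9, 1e-3, retention_power_w=0.5e-3)
zero_standby = break_even_idle_time(2e-9, 2e-9, 1e-3)
print(f"backing memory drawing 0.5 mW: {leaky_store * 1e6:.1f} us")
print(f"backing memory with no standby power: {zero_standby * 1e6:.1f} us")
```

In this toy balance, a backing memory that draws no standby power halves the break-even time relative to one that draws half the block's leakage, which is why the standby behavior of the cache matters as much as the cost of the backup and bootup steps themselves.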

However, as K. Ikegami and colleagues at Toshiba pointed out (IEDM 2014, paper #28.1), non-volatile cache memories don’t need to store data “forever” in the way that bulk storage components like flash disks do. Rather, they only need to store data for some multiple of the cache refresh interval, long enough to ensure that the data is no longer needed. According to the Toshiba group, a data retention time of a second or so should be more than adequate for most applications.

This observation creates a potential opening for “less volatile” memory elements, with retention times longer than conventional DRAMs or SRAMs, but lower write current requirements than conventional non-volatile memories. In spin-transfer torque MRAMs, Ikegami explained, the write current depends on the length of the write pulse: smaller cells are slower, but require less power. In processor simulations, write times of 3 to 4ns appear to be fast enough for mobile processor cache access, and offer write currents of less than 45 microamps. Meanwhile, retention time depends on the thermal stability factor, delta, a measure of the stability of the element’s magnetic behavior.

As the Toshiba group showed, thermal stability factors of 70 or more are needed for “permanent” storage, but a delta of only 60 is adequate for cache memory. This level of performance was demonstrated in an MTJ-Last process, where the magnetic tunnel junctions used by STT-MRAMs were integrated after CMOS metal fabrication. In benchmark studies, these devices reduced energy requirements by 60% while suffering only a 7% performance degradation relative to SRAM cache memory.
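To see why a modest relaxation in delta matters, one common single-bit approximation is the Néel-Arrhenius relation, in which the mean time before a thermally induced flip grows exponentially with delta. This relation and the 1 ns attempt time used below are assumptions brought in for illustration, not figures from the Toshiba paper. Real array-level retention targets also depend on the number of bits and the acceptable error rate, so only the ratio between the two cases is meaningful here.

```python
import math


def mean_retention_time_s(delta: float, attempt_time_s: float = 1e-9) -> float:
    """Single-bit mean retention time under the Neel-Arrhenius approximation.

    tau = attempt_time * exp(delta). The 1 ns attempt time is a commonly
    quoted order of magnitude and is an assumption, not a measured value.
    """
    return attempt_time_s * math.exp(delta)


# Compare the two thermal stability factors quoted in the text.
ratio = mean_retention_time_s(70) / mean_retention_time_s(60)
print(f"delta 70 vs. delta 60 retention ratio: about {ratio:,.0f}x")
```

Because the dependence is exponential, dropping delta from 70 to 60 shortens single-bit retention by a factor of e^10, roughly 22,000, in this approximation. That is the kind of margin a cache can give up in exchange for lower write current.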

While integrating non-volatile cache memory presents many process challenges, it is conceptually easy to visualize because the circuit logic will behave in the same way and be approachable with the same design tools. But more radical solutions to the dark silicon challenge have also been proposed.

One, to be considered in a future article, is neuromorphic computing. In biological brains, the “retention time” of a neural pathway depends on the frequency with which it is activated. For instance, you may remember how to get to your childhood home more easily than a restaurant you visited last month. Neuromorphic computing sees the brain — a powerful low-voltage, low-frequency computational system — as, Taylor wrote, “an existence proof of highly parallel, reliable, and dark operation,” and a potential model for human-designed systems that avoid the constraints of conventional serial, Boolean logic.