Target: 50% Reduction In Memory Power

Is it possible to reduce the power consumed by memory by 50%? Yes, but it requires work both inside the memory and at the architecture level.


Memory consumes about 50% or more of the area and about 50% of the power of an SoC, and those percentages are likely to increase.

The problem is that static random access memory (SRAM) has not scaled in accordance with Moore’s Law, and that will not change. In addition, with many devices not chasing the latest node and with power becoming an increasing concern, the industry must find ways to do more with what is available.

So while it is possible to reduce the power consumption of memory by 50%, don’t expect your memory provider to come up with all of the savings. Along with improvements in the memory itself, attitudes also must change.

Back to basics
Saving power starts with the internals of memory. “Memory has several building blocks—the bitcells, column and row decoders that allow you to access specific locations, and sense amplifiers,” explains Farzad Zarrinfar, managing director of the IP division of Mentor, a Siemens Business. “The bitcells could be of different types, such as static or dynamic memory, where dynamic memory also needs refresh circuitry. It is often the design of the peripheral elements that differentiate one memory from another and can have a large impact on power consumption.”

Fig 1. Simplified block diagram of a static memory. Source: Semiconductor Engineering

Where is power consumed? “There is the static leakage, which is required to make sure the cell remembers,” explains Tony Stansfield, CTO at sureCore. “Then there is the active power, which is basically the cost to perform a read or write. The leakage currents are tiny, but when you have millions of bitcells, a small number times a big number can give you a noticeable number.”
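
Stansfield’s “small number times a big number” point is easy to quantify. The sketch below, in C, multiplies an assumed per-cell leakage by an assumed bitcell count; both numbers are illustrative, not figures from the article.

```c
#include <stdio.h>

/* Illustrative only: per-cell leakage varies widely with process,
 * temperature, and cell design. Both inputs here are assumptions. */
int main(void) {
    double leak_per_cell_pA = 10.0;          /* assumed ~10 pA per bitcell  */
    double cells = 8.0 * 1024 * 1024 * 8;    /* an 8 MB SRAM: ~67M bitcells */

    double total_mA = leak_per_cell_pA * cells * 1e-9;   /* pA -> mA */
    printf("Total array leakage: %.2f mA\n", total_mA);  /* ~0.67 mA */
    return 0;
}
```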

Hiroyuki Nagashima, general manager for Alchip U.S., concurs. “Designers care about both dynamic power and leakage power. In general, the periphery circuitry contributes a large portion of the dynamic power, while the bit array is the major contributor to leakage power consumption.”

For read and write operations, it is impossible to overcome the laws of physics, which state that dynamic power is related to CV². “There are lots of wires that charge and discharge,” adds Stansfield. “Memories are basically massively parallel circuits. You typically access a whole row at once, so you are driving a whole row’s worth of bitlines, and that gets muxed down by a sense amplifier. This means large numbers of relatively long wires are involved. Then the data has to get somewhere, so there are a number of on-chip issues relative to where the data will be used.”
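
As a rough feel for the CV² relationship, here is a back-of-envelope calculation in C. The bitline capacitance, line count, and voltages are assumptions chosen for illustration; real values depend on the process and the macro.

```c
#include <stdio.h>

/* Back-of-envelope dynamic energy for one access: E = C * V^2 per
 * full-swing line, summed over the active lines. Values are assumed. */
int main(void) {
    double c_line  = 50e-15;   /* assumed 50 fF per bitline   */
    double n_lines = 512.0;    /* one row's worth of bitlines */
    double vdd     = 0.8;      /* nominal supply, volts       */

    double e_nom = n_lines * c_line * vdd * vdd;
    printf("Energy per access at %.2f V: %.2f pJ\n", vdd, e_nom * 1e12);

    /* The V^2 term means a modest Vdd reduction pays off quadratically */
    double vdd_low = 0.6;
    double e_low = n_lines * c_line * vdd_low * vdd_low;
    printf("Energy per access at %.2f V: %.2f pJ (%.0f%% lower)\n",
           vdd_low, e_low * 1e12, 100.0 * (1.0 - e_low / e_nom));
    return 0;
}
```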

Reducing power has to target either the V or the C term. The minimum voltage at which an SRAM works reliably has decreased as process technology has progressed, but the slope has been flattening. Further reductions in Vmin push the memory cell transistors, which tend to be the smallest transistors in a design, into a region of operational instability. That requires additional circuitry to ensure correct operation and maintain performance. These circuit enhancements must compensate for the manufacturing spread and the inherent variability of these processes. “There is not a huge amount you can do about it,” admits Stansfield. “You just have to live with what the fab gives you.”

One such problem is random dopant fluctuation. A migration to finFET or FD-SOI technologies decreases this problem, because they both use an undoped channel, but the problems of line-edge roughness and metal gate granularity remain. These both become major problems below 20nm.

One way to reduce the impact is to use larger transistors that do not exhibit as much variability and thus operate more stably at low voltage. “This is the voltage level where most memories are in a deep sleep mode,” says Stansfield. “With the right design, it is still possible to access the memory at this voltage. This is very attractive to developers who want to also run their logic at a low voltage—designed to solve a system power issue—not simply a memory power issue. This does have a performance hit because of the extra circuitry required to make the low voltage work and also has a significant area hit—again because of the extra circuitry needed to make the low-voltage operation work.”

Another strategy is to decrease the capacitance. “A very large memory is typically not built as one big memory,” says Stansfield. “Instead, it will be two or more smaller memories, and then there are extra levels of muxing to pass the data on. That means you only activate what you need to.”

The memory essentially performs a read access to all cells on the row and throws away the results for all memory cells except those of the addressed word. During a write access, the remaining words on the same physical row as the accessed word behave as though a read cycle is taking place. Power can be saved by shortening the bitlines and other vertically routed signals, which typically dominate dynamic power consumption for moderate- to large-size memory blocks.
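
The addressing logic behind banking can be sketched in a few lines of C. Only the ratios matter; the bank count, bank size, and the software model of a hardware enable signal are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one logical memory split into four banks, with
 * only the addressed bank activated for a read. */
#define NUM_BANKS   4u
#define BANK_WORDS  1024u

typedef struct {
    bool     active;                 /* models the bank's clock/sense enable */
    uint32_t words[BANK_WORDS];
} bank_t;

static bank_t banks[NUM_BANKS];

uint32_t banked_read(uint32_t addr) {
    uint32_t bank   = addr / BANK_WORDS;   /* upper address bits pick the bank */
    uint32_t offset = addr % BANK_WORDS;   /* lower bits pick the word         */

    banks[bank].active = true;             /* only this bank's bitlines switch */
    uint32_t data = banks[bank].words[offset];
    banks[bank].active = false;            /* the other banks stayed idle      */
    return data;
}
```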

Various muxing techniques can be used to reduce the line lengths. As the muxing depth increases, the number of rows in the memory decreases, while the number of columns increases. Although the increased memory width imposes additional delay on horizontally routed signals such as the wordline, this penalty is more than compensated for by gains in the speed of vertically routed signals.
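
This geometry tradeoff is simple to tabulate. The sketch below assumes a hypothetical 4,096-word by 32-bit macro and shows how each doubling of the mux depth halves the row count (shorter bitlines) while doubling the physical column count (longer wordlines).

```c
#include <stdio.h>

/* For a W-word, B-bit macro with column-mux depth M:
 * physical rows = W / M, physical columns = B * M. Sizes are assumed. */
int main(void) {
    unsigned words = 4096, bits = 32;

    for (unsigned mux = 1; mux <= 16; mux *= 2) {
        unsigned rows = words / mux;    /* bitlines shorten as mux grows   */
        unsigned cols = bits * mux;     /* wordlines lengthen as mux grows */
        printf("mux=%2u: %4u rows x %3u columns\n", mux, rows, cols);
    }
    return 0;
}
```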

A significant portion of the read access time of the memory is due to the length of time needed to develop a measurable signal driven from the memory cells onto the bitlines. The signal development time is directly proportional to the bitline capacitance. “The way that you design the hierarchy of sensing is important,” says Zarrinfar. “The goal is for it to be roughly proportional to size. Designers will also look at different aspect ratios.”
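
The proportionality follows from t = C·ΔV/I, the time needed to develop a sense margin ΔV on a bitline of capacitance C from a cell read current I. The numbers below are assumptions for illustration only.

```c
#include <stdio.h>

/* Signal development time: t = C * dV / I_cell. All values assumed. */
int main(void) {
    double c_bitline = 50e-15;   /* 50 fF bitline                    */
    double dv_sense  = 0.1;      /* 100 mV swing the sense amp needs */
    double i_cell    = 20e-6;    /* 20 uA bitcell read current       */

    double t = c_bitline * dv_sense / i_cell;   /* halve C, halve t */
    printf("Signal development time: %.0f ps\n", t * 1e12);  /* 250 ps */
    return 0;
}
```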

Much of the advancement in memory design is associated with the read and write circuitry. “Read-assist and write-assist technologies have been implemented to resolve the issues caused by low-VDD operation,” says Alchip’s Nagashima. “For example, negative bitline write assist is a common technique to improve the write margin for robust write operation.”

Other techniques exist to reduce capacitance. “Can we compartmentalize the design or keep wire tracks a bit shorter or maybe spaced a little further apart to control capacitance?” asks Stansfield. “Can we manipulate the circuit so that not as much is active? Such a design would have similar performance and significantly better power with a slight area penalty.”

Making architectural changes
While changes in the memory design can yield significant power reduction, it is not enough to get to 50% power savings. “Most of the potential power savings, or inefficiencies, are locked in by the architectural decisions,” asserts Dave Pursley, product management director at Cadence. “Our industry estimates that architectural changes have 4X more potential to improve, or worsen, power than lower-level changes.”

Start by looking at the big-ticket items. “Don’t move data if you don’t have to,” says Marc Greenberg, group director, product marketing in the IP Group at Cadence. “This might seem obvious, but it covers a host of hardware and software techniques that can lead to power reduction. Consider whether the precision of arithmetic operations is appropriate to the task at hand. Consider whether intermediate values need to be stored or whether they could be regenerated.”
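
Greenberg’s two suggestions can be made concrete with a small, admittedly contrived C sketch; the sensor scenario and function names are hypothetical.

```c
#include <stdint.h>

/* (1) Match precision to the task: a 12-bit sensor sample does not need
 * a 32-bit floating-point pipeline. Fixed-point 16-bit halves the data
 * moved per operand. (Hypothetical example.) */
int16_t scale_sample(int16_t raw) {
    return (int16_t)((raw * 3) >> 2);    /* x0.75 in fixed point, no floats */
}

/* (2) Regenerate cheap intermediates instead of storing them: one multiply
 * is often cheaper than a round trip to memory. */
int32_t energy_of(int16_t x) {
    return (int32_t)x * x;               /* recomputed on demand, never stored */
}
```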

The converse of this is when maximum performance is required. “If the CPU has branch prediction or speculative fetch of data, then by nature these techniques will move some data unnecessarily if the prediction/speculation is incorrect,” adds Greenberg. “That’s the price of the increased performance.”

If there are times when a memory is not going to be used, it often can be put into a sleep mode. “A deep sleep mode means that you can retain the data, but you can shut off the peripheral circuitry,” says Zarrinfar. “The only power consumption is now the leakage in the bitcells. While dependent on many factors, including the size of the memory, turning off the periphery drops leakage by about 60% or 70%.”

Power gating does add some design complexity. “You have to be careful that you do not turn off the power to the bitcells, and when you turn the power on or off, you have to ensure that no glitches are created,” warns Stansfield. “This has a habit of looking like a write cycle and thus could corrupt the data.”

Another technique is to manipulate the source bias voltage. “This dramatically reduces leakage,” says Zarrinfar. “It still looks like a fully functional SRAM, but the banks that are not being used remain in light sleep.”

Light sleep will not save as much power, but it can be turned on and off more quickly. Deep sleep saves more power, but the time needed to bring it back is longer.
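
One way this plays out in a controller is a simple policy that weighs predicted idle time against wakeup cost. The state names, thresholds, and break-even points below are hypothetical, purely to illustrate the light-sleep/deep-sleep tradeoff.

```c
#include <stdint.h>

/* Hypothetical sleep-mode policy: deep sleep saves more power but costs
 * more cycles to wake, so it only pays off for long idle periods. */
typedef enum { MEM_ACTIVE, MEM_LIGHT_SLEEP, MEM_DEEP_SLEEP } mem_state_t;

#define LIGHT_SLEEP_BREAK_EVEN    100u   /* cycles (assumed) */
#define DEEP_SLEEP_BREAK_EVEN   10000u   /* cycles (assumed) */

mem_state_t pick_sleep_state(uint32_t predicted_idle_cycles) {
    if (predicted_idle_cycles >= DEEP_SLEEP_BREAK_EVEN)
        return MEM_DEEP_SLEEP;    /* periphery off, bitcells retain data  */
    if (predicted_idle_cycles >= LIGHT_SLEEP_BREAK_EVEN)
        return MEM_LIGHT_SLEEP;   /* source-biased banks, fast wakeup     */
    return MEM_ACTIVE;            /* idle period too short to be worth it */
}
```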

Knowing when a memory can be turned on and off requires system-level knowledge. “You may have a lot of data but you’re accessing it infrequently, in which case it is leakage that matters the most,” says Stansfield. “Or you might have a small memory that is accessed frequently, in which case it is the active power that matters.”

The storage and communication architectures are decided early, and everything else is built around those decisions. “Consider an application where two pieces of data are needed each cycle,” says Pursley. “Even for just that simple problem, the architectural choices are numerous. Should there be two separate memories or just one? If one memory, is it more efficient to have a dual-port memory, or do the memory access patterns allow a single wider memory (where each word contains multiple values)? Or maybe the best tradeoff is to use a single-port memory running at twice the speed.”
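
The last of Pursley’s options, a single wider memory, looks like this in miniature. The packing scheme and sizes are hypothetical; the point is that one single-port access delivers both operands per cycle.

```c
#include <stdint.h>

/* Hypothetical: two 16-bit operands that are always consumed together,
 * packed into one 32-bit word of a single-port memory. */
#define WORDS 1024u

static uint32_t mem[WORDS];

void read_pair(uint32_t addr, int16_t *a, int16_t *b) {
    uint32_t word = mem[addr];            /* one access, one port      */
    *a = (int16_t)(word & 0xFFFFu);       /* low half: first operand   */
    *b = (int16_t)(word >> 16);           /* high half: second operand */
}
```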

Greenberg points to another large waste of power. “Don’t move data further than you have to. This is mostly about caching data appropriately and not flushing it if you don’t have to. Ideally, we would store all data very close to the CPU in local storage or L1 caches. But those caches are necessarily small and expensive, so we use a hierarchy of memory. Cache-to-cache transfers are much more efficient if they don’t go through the main memory (DRAM). As a rule of thumb, compared to local storage of data, I assume 10X the energy to store something in L3 cache and 100X the energy to store it in DDR DRAM. New memory interface technologies like HBM2 and LPDDR5 help to reduce the amount of energy required if the data should need to go off-chip into external DRAM by reducing I/O capacitance and reducing I/O voltage.”
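
Greenberg’s rule of thumb can be turned into quick arithmetic. Only the 1X/10X/100X ratios come from his comment; the traffic mixes below are invented to show how placement dominates the energy bill.

```c
#include <stdio.h>

/* Relative energy per byte: 1x local, 10x L3, 100x DRAM (rule of thumb).
 * The two workload mixes are hypothetical. */
int main(void) {
    double local = 1.0, l3 = 10.0, dram = 100.0;
    double bytes = 1024.0 * 1024.0;   /* 1 MB of traffic */

    double cache_friendly = bytes * (0.90 * local + 0.09 * l3 + 0.01 * dram);
    double dram_heavy     = bytes * (0.10 * local + 0.20 * l3 + 0.70 * dram);

    printf("Relative energy, cache-friendly mix: %.0f\n", cache_friendly);
    printf("Relative energy, DRAM-heavy mix:     %.0f (%.0fx worse)\n",
           dram_heavy, dram_heavy / cache_friendly);
    return 0;
}
```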

When architects and memory designers work together, other optimizations become possible. “With vector compute, you know that after you have read one piece of data, you have a good idea about what comes next,” says Stansfield. “When you have predictable access patterns, you can exploit that in the design of the memory system. If you are working with one chunk of data now, you may know that you will do a lot of accesses to that chunk and not to others, and you can use that kind of information. When the product architect gets together with the memory architect, we can see the tradeoffs that can be made on the two sides together.”
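
One concrete form of this cooperation is prefetching along a known stride. The sketch below models it in C: open_row() stands in for a hypothetical memory-controller hook, and the row size is assumed.

```c
#include <stddef.h>
#include <stdint.h>

#define ROW_WORDS 256u   /* assumed words per physical row */

/* Hypothetical hook: in hardware this would precharge/activate the row */
static void open_row(size_t row) { (void)row; }

/* Vector-style access: the stride is known, so the next row can be
 * opened ahead of time and only one region is ever active. */
int32_t sum_vector(const int32_t *v, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        if ((i % ROW_WORDS) == 0 && i + ROW_WORDS < n)
            open_row(i / ROW_WORDS + 1);   /* prefetch the next row */
        acc += v[i];
    }
    return acc;
}
```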

Stansfield confirms that they have successfully reduced memory power by 50% by designing the memory to fit the application. “It is not the kind of thing that we, as a memory company, can optimize in isolation. There is information about the access patterns that we do not know. We could make a best guess and produce a memory design that is optimized for a particular access pattern, but unfortunately, everyone would have a slightly different access pattern.”

In the end, it comes down to a tradeoff. Significant power can be saved, but the cost is probably a little bit of area and increased design time. Is it worth it? That’s an architectural decision, and there is no simple answer.

3 comments

neil says:

“…static random access memory (SRAM) has not scaled in accordance with Moore’s Law…” Why?

C. Hodson says:

I’m not able to assess whether Spin Orbit Torque (SOT) MRAM is ready for prime time on current nodes (10/7nm), but it sounds pretty compelling…

Brian Bailey says:

Intel has documented what it has been able to achieve with scaling over the past few generations. For SRAM, it plotted ideal scaling against actual scaling from 90nm down to 10nm. Ideal scaling would have delivered an 81X improvement in density, but Intel only managed 32X. Reasons for this include increased poly rules, the impact of capacitance, smaller noise margins, and increased variability, which in turn places increased demands on the surrounding circuitry.
