Part 2: What micro-architectural techniques can designers use to explore and reduce the power consumption in their designs?
By Abhishek Ranjan, Saurabh Shrimal and Sanjiv Narayan
In the first part of this series, we discussed the need to perform power optimization and exploration at higher levels of abstraction, where the potential to reduce power consumption is highest. While fine-grained local changes for power reduction (such as clock gating and operand isolation) are well understood and widely adopted, coarser changes at higher levels of abstraction are needed to exploit the full power-saving potential. In a design, power is mostly consumed by logic, memories and the clock network, and effort should be made to address the power consumed in each of these categories.
The published literature describes many techniques for reducing the power of clocks, memories and logic. In this article, we present some of the most potent micro-architectural transformations, each of which can have a significant impact on a design's power dissipation. Each of these addresses one category of power dissipation: clock, logic and memory, respectively.
Shift register vs. circular buffer
One of the most common micro-architectural transformations to minimize power is replacing shift registers with circular buffers. In a shift register, values are constantly being transferred (or ‘shifted’) along a chain of registers as newer values are written in.
For example, consider the shift register in the figure below. Every time a new value is written to register A, the existing data moves from register A to B to C and finally to D. These transfers cause multiple, entirely unnecessary toggles at the flops (and in any logic driven by them), which increases the power dissipated in the design.
Unlike a shift register, a circular buffer does not require values to be moved around the buffer whenever a new value is written or consumed. As shown above, two pointers keep track of where the next value is to be written to or read from. From an initial empty state, incoming values are routed to registers A, B, C and D in sequence. During a read operation, values are read directly from the appropriate register.
By eliminating the unnecessary movement of data, a circular buffer implementation consumes far less power than a shift register implementation. However, care must be taken to ensure that the extra logic added for the read/write pointers does not itself consume more power than is saved by using the circular buffer.
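The difference can be illustrated with a rough toggle count. The Python sketch below is not a power model: it only counts bit flips in the storage flops, assumes a four-deep buffer that starts cleared, and tracks only the write pointer of the circular buffer. The function names are hypothetical and chosen for illustration.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions that differ between two register values."""
    return bin(a ^ b).count("1")


def shift_register_toggles(values, depth=4):
    """Count flop bit flips when every write shifts the whole chain."""
    regs = [0] * depth
    toggles = 0
    for v in values:
        # The new value enters stage 0; every older value moves one stage on.
        new_regs = [v] + regs[:-1]
        toggles += sum(hamming(old, new) for old, new in zip(regs, new_regs))
        regs = new_regs
    return toggles


def circular_buffer_toggles(values, depth=4):
    """Count flop bit flips when only the addressed slot is written."""
    regs = [0] * depth
    wr_ptr = 0
    toggles = 0
    for v in values:
        # Only the slot under the write pointer changes; the others hold.
        toggles += hamming(regs[wr_ptr], v)
        regs[wr_ptr] = v
        # The pointer flops toggle as well, which is the overhead to watch.
        next_ptr = (wr_ptr + 1) % depth
        toggles += hamming(wr_ptr, next_ptr)
        wr_ptr = next_ptr
    return toggles
```

For a single write of 0xFF into a cleared buffer, this model counts 8 flips for the shift register and 9 for the circular buffer, because the pointer flips too. Over longer streams of changing data the shift register count grows much faster, which mirrors the trade-off described above.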
It is usually difficult to clock gate a flop with multiple fan-outs, because at least one of its sinks may actually be using the value downstream. In the example below, flop F provides an operand for three arithmetic operations. Although the three operations are mutually exclusive, flop F cannot be gated because the value it holds is required by whichever of the three operations is active. So even when only an addition is being executed, the logic in the multiplier and comparator toggles needlessly and dissipates power.
Cloning flop F into three flops (F1, F2 and F3) provides an opportunity to gate at least two of the arithmetic operations completely while the third is being computed. This reduces the power consumption of the design, provided the additional flops introduced by cloning consume far less power than the arithmetic operations they control downstream.
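A small simulation can make the effect concrete. The sketch below is an illustration under stated assumptions, not a measurement: it uses bit flips at the operand input of each unit as a stand-in for downstream switching, it ignores the power of the extra flops and gating logic, and the unit names and trace format are hypothetical.

```python
OPS = ("add", "mul", "cmp")


def hamming(a: int, b: int) -> int:
    """Number of bit positions that differ between two values."""
    return bin(a ^ b).count("1")


def input_toggles_shared(trace):
    """One flop F feeds all three units, so each unit sees every change."""
    f = 0
    toggles = {op: 0 for op in OPS}
    for _op, value in trace:
        flips = hamming(f, value)
        f = value
        # F cannot be gated, so the new operand reaches every unit.
        for unit in OPS:
            toggles[unit] += flips
    return toggles


def input_toggles_cloned(trace):
    """One clone per unit; a clone is clocked only when its unit is selected."""
    clones = {op: 0 for op in OPS}
    toggles = {op: 0 for op in OPS}
    for op, value in trace:
        # Only the selected clone loads; the gated clones hold their value.
        toggles[op] += hamming(clones[op], value)
        clones[op] = value
    return toggles
```

Each trace entry is a pair of the selected operation and the new operand value, for example `[("add", 0x0F), ("add", 0xF0), ("mul", 0xFF), ("cmp", 0x00)]`. Comparing the two dictionaries shows how many operand flips each idle unit avoids once its clone is gated.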
Large memories waste power because only a few addresses are accessed in any given interval of time. Memory banking implements the memory as several smaller banks, which allows the designer to shut off the banks that are not being accessed.
The illustration above shows two ways in which a 1024-word memory may be banked. When the memory is built from two banks of 512 words, only one bank is actively accessed by the design at a time, while the other can be gated off to save power. When the memory is built from four banks of 256 words each, three of the banks can be gated off at any point in time.
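The bank selection itself amounts to a simple address decode. The following sketch assumes the banks are equal in size and that one address is presented per access; the function names are hypothetical, and the code says nothing about how the gating is implemented in hardware.

```python
def decode_bank(addr, total_words=1024, num_banks=2):
    """Split an address into one-hot bank enables and an offset in the bank.

    Assumes num_banks divides total_words evenly. For power-of-two bank
    sizes this is the same as using the upper address bits as the bank index.
    """
    if total_words % num_banks != 0:
        raise ValueError("num_banks must divide total_words")
    if not 0 <= addr < total_words:
        raise ValueError("address out of range")
    words_per_bank = total_words // num_banks
    bank, offset = divmod(addr, words_per_bank)
    # Exactly one enable is True; every other bank is a gating candidate.
    enables = [b == bank for b in range(num_banks)]
    return enables, offset


def gated_banks(enables):
    """Number of banks that are not selected and could be gated off."""
    return sum(1 for e in enables if not e)
```

With two banks, address 700 selects the second bank at offset 188 and leaves one bank idle; with four banks, the same address selects the third bank at offset 188 and leaves three banks idle.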
The example above illustrates how the total required address space can be constructed from different combinations of smaller sized memory blocks. Another axis of memory power exploration relates to configuring the required memory word size from a given set of memory blocks with fixed widths.
In the example below, given four available memory components, there are a number of possible ways of implementing a 512-word x 28-bit memory configuration, two of which are shown below. One implementation has the exact 28-bit word size (16+8+4), but requires extra decoding logic to select between the two banks. The other implementation uses a single 32-bit memory component. While no extra decode logic is required, 4 bits of each memory word are wasted.
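This kind of word-size exploration can be sketched as a small enumeration. The code below assumes the four available components are 4, 8, 16 and 32 bits wide (inferred from the widths mentioned in the text, not stated outright), that each can be used at most once, and that all have the required depth. It ranks candidates only by wasted bits and part count, not by power; the function name is hypothetical.

```python
from itertools import combinations


def width_options(target_bits, widths):
    """List minimal sets of component widths that cover the target word size.

    Returns (combo, wasted_bits) pairs, least waste first.
    """
    options = []
    for r in range(1, len(widths) + 1):
        for combo in combinations(widths, r):
            total = sum(combo)
            if total < target_bits:
                continue  # too narrow to hold the word
            # Skip combos that would still cover the target with a part removed.
            if any(total - w >= target_bits for w in combo):
                continue
            options.append((combo, total - target_bits))
    return sorted(options, key=lambda o: (o[1], len(o[0])))
```

Calling `width_options(28, (4, 8, 16, 32))` under these assumptions yields the two implementations discussed above: the exact-fit combination of 4-, 8- and 16-bit components, and the single 32-bit component with 4 unused bits.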
Regardless of whether exploration is accomplished by decomposing the desired memory configuration along address space or word size, care must be taken to weigh the power saved against the power consumed by the extra decode logic required to build a larger memory block from smaller constituent blocks.
We have presented a few of the options available to address design power in the earlier phases of the design. There are numerous other ways in which design power can be saved. So why isn't every designer deploying such micro-architectural techniques to save power? The simple answer is that when it comes to power exploration, designers are mostly flying blind. There is no easy way of knowing whether there is even scope for saving power. Applying every available technique in the hope that power will eventually be saved could ultimately end up hurting power. Designers need precise information about which of the available techniques will save power in their design, and how much.
In the third and final part of this series, we will present a methodology to complete this design exploration earlier in the design process. This methodology solves the biggest bottleneck in power exploration: quickly knowing which technique to apply and determining the associated power savings.