Trading Off Power For Performance

When it comes to media ICs, the focus on streaming video and high performance makes it much harder to find big savings in power.


By Pallab Chatterjee

Integration of CODECs and graphics cores with new processor engines is proving to be a trouble spot for power optimization.

Because these blocks are driven by performance and are high-duty-cycle components, the main focus has been to push the limit for process performance.
These blocks still use most of the tricks identified by both UPF and CPF, including multi-phase clocking, dynamic voltage and frequency scaling, control of body bias, asynchronous paths, power gating, parallelism versus pipelines, and multiple supply domains/voltages. Those are in addition to process scaling from 40nm through 22nm. But when it comes to trading off area, power and performance, performance is the clear winner.

The reason: Graphics blocks, such as TI’s OMAP chip and several incarnations of the H.264 CODEC for high-resolution video, are now being combined with high-performance, low-power CPUs in full combination chips. These differ from standard processors because they are optimized for the data size and the idiosyncrasies of both replay and streaming video, and that is where the power optimization problems come in. Video data is different from Web data in that it has to be processed and rendered locally to display on the screen. The computational requirements grow further when the block is called on to deliver ray tracing, 3D shading and output from multiple streams.

Typical computation requirements for video include pixel processing of 12-bit video data at a resolution of 1920 x 1080 (2,073,600 pixels) and 60 frames per second (1080p video), or 1,492,992,000 bits per second. A typical video runs from 10 to 120 minutes. As a result, the GPU and graphics processing cores do not go idle during the “operational data” stage, as a CPU would during extended memory operations and instruction preparation. This means the GPUs and CODECs tend to be either powered off completely when there is no video data being processed, or on at 100% and sometimes 120% of standard performance while video is being processed. The 120% is achieved with overvoltage on the variable power supplies, while FSBB drives the floating body of the device, to provide maximum power supply levels and device switching speed.
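The bit-rate figure above can be checked with a few lines of arithmetic. The sketch below simply multiplies out the numbers quoted in the text. The function and variable names are illustrative, and the total for a full-length clip assumes the raw rate is sustained for the whole duration.

```python
# Raw pixel data rate for the 1080p example quoted in the text.
WIDTH = 1920             # pixels per line
HEIGHT = 1080            # lines per frame
FRAMES_PER_SECOND = 60
BITS_PER_PIXEL = 12      # the text's "12-bit video data"


def raw_bit_rate(width, height, fps, bits_per_pixel):
    """Bits per second needed to touch every pixel of every frame."""
    return width * height * fps * bits_per_pixel


def total_bits(duration_minutes, bit_rate):
    """Total bits processed if bit_rate is sustained for the whole clip."""
    return duration_minutes * 60 * bit_rate


pixels_per_frame = WIDTH * HEIGHT
bits_per_second = raw_bit_rate(WIDTH, HEIGHT, FRAMES_PER_SECOND, BITS_PER_PIXEL)

print(f"pixels per frame: {pixels_per_frame:,}")
print(f"bits per second:  {bits_per_second:,}")
for minutes in (10, 120):  # the clip lengths mentioned in the text
    print(f"{minutes:>3} min clip: {total_bits(minutes, bits_per_second):,} bits")
```

Because the rate is constant for the length of the clip, the block has no natural idle window in which a mid-power state could be entered.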

In this mode, these GPU and CODEC blocks do not really benefit from the power management design techniques promoted by the tool flows and vendors. The single-digit percentage advantage those techniques provide is either negated by the device use model (the blocks are “on” or “off,” with no mid-power states) or rarely encountered in practice, and the techniques add to design complexity. These devices do benefit greatly from process scaling, as the reduced device size, lower supply voltage and lower interconnect capacitance cut active power. Papers from the recent ISSCC conference indicate that process scaling resulted in double-digit power reductions relative to larger processes.
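The link between process scaling and active power can be illustrated with the standard first-order CMOS dynamic power relation, in which power is proportional to switched capacitance times supply voltage squared times clock frequency. This relation is not stated in the article, and every scale factor below is a hypothetical value chosen only to show the shape of the trade-off. None of them is a measurement from a real 40nm or 22nm process.

```python
# First-order CMOS dynamic power model: P is proportional to C * V^2 * f.
# All scale factors are relative to a baseline design (1.0 = unchanged).


def relative_dynamic_power(cap_scale, volt_scale, freq_scale):
    """Dynamic power relative to baseline for the given scale factors."""
    return cap_scale * volt_scale ** 2 * freq_scale


baseline = relative_dynamic_power(1.0, 1.0, 1.0)

# Hypothetical process shrink: 30% less switched capacitance and a 10%
# lower supply voltage at the same clock frequency. Assumed values only.
scaled = relative_dynamic_power(cap_scale=0.7, volt_scale=0.9, freq_scale=1.0)

# Hypothetical overdrive mode: 120% clock rate bought with a 10% overvoltage
# on the same process. Assumed values only.
overdrive = relative_dynamic_power(cap_scale=1.0, volt_scale=1.1, freq_scale=1.2)

print(f"baseline:  {baseline:.3f}")
print(f"scaled:    {scaled:.3f}")
print(f"overdrive: {overdrive:.3f}")
```

Under this simplified model, capacitance and voltage reductions compound, which is consistent with process scaling producing larger savings than techniques that only help in states the block rarely occupies. The model ignores leakage and any body-bias effects.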

These devices have a number of things in common. One is that power isolation of the GPU and CODECs from the rest of the logic and CPUs is key. A second is dynamic voltage and frequency scaling, which allows for overvoltage operation and clocks fast enough to keep data on time. Another is the shift from a single clock block to internal Async-Sync paths. A group of simultaneous synchronous paths that start from a common register bank and end in a common register bank would normally be clocked together. Here the paths are decoupled from the common clock and become individual function paths, each with its own synchronous clock. This type of path is more power-efficient than the old “ripple” style logic, which did not have managed control of peak switching power.

While the low-power techniques bring some value to the overall chip design, their benefits are in the single digits, not the double-digit power reductions that advanced process scaling delivers. As a result, design teams may face some hard business decisions about whether the design overhead of squeezing out every last percentage point of power is really worth it, or whether it makes more sense to get to market more quickly.