Can RTL Clock Power Be Accurate Enough For Sub-20nm Multi-GHz Designs?

Can RTL power adequately model the key physical aspects of clocks in order to make reliable power-related decisions? What about further complexities that sub-20nm high-performance designs pose?

Register Transfer Level (RTL) analysis has increasingly been adopted to enable early, high-impact power decisions. As a cycle-accurate hardware abstraction, RTL is expected to deliver reasonable power accuracy. Clocks are particularly important to analyze and optimize for power because they switch the most and drive the highest loads. Clock gating is an effective power reduction technique that shuts off redundant clock cycles, and an RTL power methodology can augment existing clock gating approaches to further reduce clock power. However, an RTL design description is limited to logical connectivity and carries no physical knowledge, of clock nets in particular. Can RTL power adequately model the key physical aspects of clocks in order to make reliable power-related decisions? What about further complexities that sub-20nm high-performance designs pose?

Let’s first take a look at some of the physical considerations for clock nets. Clock Tree Synthesis (CTS) during physical design ensures that the timing and power constraints are met as the global clock net is routed to every sequential element. The clock net is extensively buffered to handle the significant capacitive load while meeting clock skew, on-chip variation (OCV), power and other constraints. Various clock layout structures have their respective pros and cons:

• Conventional CTS follows a balanced clock tree topology. This provides a low power footprint and the flexibility to add several levels of clock control logic, including coarse block-level and fine-grained register-level clock gates. However, because the clock paths from the source to different sinks share few common segments, conventional CTS is susceptible to OCV.

• Some of the highest-performance designs, such as processors clocked at 5 GHz, require different topologies to meet their stringent timing goals. Clock mesh is a preferred topology in which the clock net travels along a long shared path consisting of a pre-mesh tree that drives a dense mesh structure. From the mesh tap points, only a limited number of clock path elements connect to the sequential devices. This leads to tight control of clock skew and a high tolerance of OCV, a particular concern for sub-20nm designs. However, there is limited flexibility for inserting clock elements such as multiple levels of clock gates. Combined with the dense mesh fabric, this makes the clock mesh power hungry.

• Variations of the clock mesh structure include hybrid approaches that combine the best of mesh and conventional CTS to balance power and timing. Trees, local and global meshes come together to realize today’s complex SoCs with several clock domains.

So how much of these physical structures and constraints should be considered in the RTL clock power model?

RTL power provides the performance and capacity to analyze multi-million-instance designs all at once. An analysis of the RTL logical connectivity, simulation activity and clock power distribution can identify blocks that would benefit from higher-level clock gates. Block-level clock gates shut off many flops at the same time, leading to significant power savings. RTL power can also complement synthesis-based clock gating by identifying additional fine-grained register-level clock gating opportunities.

The key to such power optimizations, however, is the predictability of the RTL clock power. Consider an example where RTL power identifies an opportunity to insert a new clock gate in a design that employs a clock mesh. During CTS, the proposed clock gate and any additional buffers may have to be split and sized in order to drive the load. The added clock elements will also need to honor timing constraints such as the limit on clock path elements between the mesh and the flops. In the end, the power consumed by the clock gates and buffers introduced can exceed the potential savings from the reduced clock toggles.
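The trade-off in the clock gate example above can be illustrated with the standard switching-power relation P = activity × C × Vdd² × f. What follows is a minimal sketch, not a model from any RTL power tool: the function and class names are hypothetical, every capacitance, voltage and duty-cycle value is made up for illustration, and the added gate and buffer cells are assumed to toggle on every cycle.

```python
from dataclasses import dataclass


def dynamic_power_w(activity: float, cap_f: float, vdd_v: float, freq_hz: float) -> float:
    """Switching power in watts: P = activity * C * Vdd^2 * f.

    `activity` is the number of full charge/discharge cycles per clock
    period: 1.0 for a free-running clock net, lower for a gated one.
    """
    return activity * cap_f * vdd_v ** 2 * freq_hz


@dataclass
class ClockGateProposal:
    """Hypothetical description of one proposed clock gate."""
    gated_cap_f: float     # clock-net capacitance downstream of the new gate (farads)
    enable_duty: float     # fraction of cycles in which the gate passes the clock
    overhead_cap_f: float  # capacitance of added gate/buffer cells (assumed always toggling)


def net_saving_w(p: ClockGateProposal, vdd_v: float, freq_hz: float) -> float:
    """Power saved by gating minus power spent in the added clock cells."""
    ungated = dynamic_power_w(1.0, p.gated_cap_f, vdd_v, freq_hz)
    gated = dynamic_power_w(p.enable_duty, p.gated_cap_f, vdd_v, freq_hz)
    overhead = dynamic_power_w(1.0, p.overhead_cap_f, vdd_v, freq_hz)
    return (ungated - gated) - overhead


VDD_V, FREQ_HZ = 0.8, 5e9  # illustrative supply voltage and clock frequency

# Logical view at RTL: a single gate cell with little added capacitance.
rtl_view = ClockGateProposal(gated_cap_f=200e-15, enable_duty=0.8, overhead_cap_f=5e-15)
# After CTS: the gate has been split and buffers added to drive the load.
post_cts_view = ClockGateProposal(gated_cap_f=200e-15, enable_duty=0.8, overhead_cap_f=60e-15)

for name, view in (("RTL estimate", rtl_view), ("post-CTS", post_cts_view)):
    print(f"{name}: net saving = {net_saving_w(view, VDD_V, FREQ_HZ) * 1e3:+.3f} mW")
```

With these illustrative numbers the same proposal shows a positive net saving in the logical view and a negative one once the split gates and buffers are counted. That sign flip is what a purely logical estimate cannot see.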

Figure 1: To predict post-layout clock power reliably, the clock physical structure, timing and power constraints, and interconnect capacitance need to be modeled at RTL

Some leading high-performance designs have experienced exactly this, resulting in more design iterations and an overall loss of productivity. To accurately predict power savings from changes at RTL, the added or removed logic, changed activity, physical effects and timing constraints must be qualified upfront. A physically aware and timing-aware CTS engine is required at RTL to make early clock power decisions with confidence.