Rethinking Timing Optimization

The impact of smaller geometries on physical effects is forcing engineering teams to sharply alter their approach, akin to a complete genetic rewrite.


By Ann Steffora Mutschler
As semiconductor manufacturing technology continues its march toward 20nm, SoCs are plagued with advanced interconnect delays, cross capacitance, and process variability, as well as area and power constraints—and the significance of these factors is increasing with each passing node.

“With lower nodes we are getting advantage on area, more and more logic is getting included onto the dies, which means many interfaces are also integrated on the same die. All the interfaces may have their own unique requirements of clocking and when they interface with each other there are many constraints that come into picture which result into complex clocking schemes for the ASICs,” explained Shrikrishna Mehetre, engineering manager at Open-Silicon.

In advanced SoCs, additional modes of operation dictate that extra tradeoffs be made in timing closure so that the device works in every mode with the minimal amount of power, said Andy Inness, place and route specialist at Mentor Graphics. “Fixing power in one mode may break an operation in another mode. You need to make sure you take both into account, so when you optimize one you don’t break a different one.” The physical effects of advanced nodes also change with operating voltage, so their impact differs from one voltage to the next.

Advanced timing closure requires the P&R tool to account for all operating modes and all process corners. This significantly reduces the number of timing violations and speeds timing convergence. Source: Mentor Graphics

In recent designs, Open-Silicon has seen that additional complexity firsthand. It has integrated numerous such interfaces and ended up with as many as 400 clock domains, many of which are generated clocks that depend on each other. With so many interdependent clocks interacting on an 18mm x 18mm die, the insertion delays and skews for all these clocks have a huge impact on overall timing optimization.

Beyond the clocking required for design functionality, complexity also comes from the added DFT logic and from the test requirements for so much logic and memory, because DFT logic brings clocking requirements of its own. In this scenario, timing optimization is not restricted to achieving better data-path delays. It also requires consideration of physical effects such as interconnect delays, cross capacitance, clock skews and variability.

High frequencies = extreme variability
Companies like Open-Silicon are working with clocks ranging from hundreds of MHz to a few GHz at the 40nm and 28nm technology nodes. At such high frequencies, achieving the required skews and insertion delays for such a big clock network over large dies is very challenging. The variability across different functional/DFT modes and different process corners is very high. When that is combined with clock distribution over a huge die and the variability at smaller nodes, timing optimization needs to consider these large overheads, Mehetre said. Power optimization using clock gating has become the norm, but achieving timing closure for clock-gating latches is an additional complexity that needs to be addressed.

In addition, timing optimization is not a standalone task anymore. It needs to consider all the parameters, such as the available clock period, physical effects (interconnect delays, crosstalk, variability), and clock skews across different corners and modes. Adding excessive margins at different stages (data-path optimization before CTS, clock skew optimization after CTS, variability optimization once CTS is completed, SI optimization after routing) is becoming inefficient at smaller nodes. EDA tools need to model these effects right from the initial optimizations rather than relying on margins at each stage. For instance, instead of applying a single OCV number to the entire design, Mehetre said it’s better to use AOCV (local OCV), and all IP vendors need to support AOCV models at smaller nodes. Likewise, instead of trying to achieve a global skew number, effort should go into exploiting useful skew, and planning for useful skew should start with the initial optimizations.
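The difference between one OCV number and a depth-aware derate can be sketched in a few lines. This is a simplified illustration: the derate values, the table and the function names are hypothetical, and real AOCV tables are supplied by the library vendor and typically depend on distance as well as path depth.

```python
# Hypothetical late derate factors indexed by logic depth (number of stages).
# Deeper paths get a smaller derate because random variation tends to average out.
AOCV_TABLE = {1: 1.12, 2: 1.10, 4: 1.07, 8: 1.05, 16: 1.03}


def flat_ocv_delay(stage_delays, derate=1.10):
    """Path delay with a single derate applied to the whole design."""
    return sum(stage_delays) * derate


def aocv_derate(depth, table=AOCV_TABLE):
    """Pick the derate for the largest tabulated depth not exceeding `depth`."""
    eligible = [d for d in sorted(table) if d <= depth]
    return table[eligible[-1]] if eligible else table[min(table)]


def aocv_delay(stage_delays, table=AOCV_TABLE):
    """Path delay with a derate chosen from the path's own depth."""
    return sum(stage_delays) * aocv_derate(len(stage_delays), table)


deep_path = [0.1] * 8    # eight stages of 0.1 ns each
short_path = [0.1]       # a single 0.1 ns stage

for name, path in (("deep", deep_path), ("short", short_path)):
    print(name, round(flat_ocv_delay(path), 4), round(aocv_delay(path), 4))
```

With these assumed numbers the flat derate is pessimistic for the deep path and optimistic for the short one, which is the kind of mismatch that forces extra margin when a single number covers the whole design.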

The impact of shrinking process nodes
While there was an era when interconnect delays were marginal compared with cell delays, the situation seems to be reversing, beginning at 28nm. For these designs, enhanced physical synthesis that accounts for clocking and models interconnect delays, crosstalk delays and variability is needed to replace added margins. Logical optimizations that consider variability and different corners also need to be part of the initial optimizations, Mehetre asserted.

As such, there needs to be a change in the way clocks are used for such huge designs. Rather than looking at how to optimize timing and clocks to cope with these effects, the approach should be to minimize their impact during the architecture and RTL design phase itself. This is especially true if the clocks are shared or interdependent, and during logic design the goal should be to reduce that interdependency.

Paul Cunningham, senior group director of R&D at Cadence, agreed that smaller geometries are having an impact and that, by far, the biggest impact is on the wires rather than on the scaling of the transistors. “The transistors are getting smaller faster than the wires. We went from being all about the transistors to more about the overall wire capacitances and then at the very advanced nodes, 28nm and below, we’re seeing it’s more and more about the resistances in the wires.”

All of these issues require engineering teams to rethink the algorithms, which has led to a clock concurrent optimization process, he said. “The underlying model in timing terms is, rather than thinking about a path in the design—I call them chains—where a chain would be a sequence of functions that span multiple register boundaries. Eventually these functions form some kind of a feedback loop and it’s that loop that really dictates the speed. In a traditional flow, if I’ve got a four-pipeline stage loop, my speed is driven by whichever of those four pipeline stages is the longest. In a clock concurrent world, I’m going to take the average of those four pipeline stages and that’s what drives the speed of my chip.”
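Cunningham's four-stage loop can be worked through numerically. The sketch below is an idealized illustration, not Cadence's algorithm: the stage delays are made-up numbers, the function names are hypothetical, and hold constraints, register overhead and limits on how far a clock can be skewed are all ignored.

```python
def traditional_period(stage_delays):
    """With zero skew everywhere, the slowest stage sets the clock period."""
    return max(stage_delays)


def clock_concurrent_period(stage_delays):
    """Ideal lower bound for a closed loop when clock arrival times may shift:
    the total loop delay divided by the number of stages."""
    return sum(stage_delays) / len(stage_delays)


def useful_skew_offsets(stage_delays):
    """Clock arrival offset at each register that lets every stage meet the
    averaged period. Stage i launches at offsets[i] and is captured at
    offsets[i + 1] + period; the first register is the reference (0.0)."""
    period = clock_concurrent_period(stage_delays)
    offsets = [0.0]
    for delay in stage_delays[:-1]:
        offsets.append(offsets[-1] + delay - period)
    return offsets


# Hypothetical delays (ns) of the four pipeline stages in one feedback loop.
loop = [0.8, 1.2, 1.0, 0.6]

print(traditional_period(loop))
print(round(clock_concurrent_period(loop), 3))
print([round(x, 3) for x in useful_skew_offsets(loop)])
```

In this example the registers after slow stages receive their clock later and the registers after fast stages receive it earlier, so time is borrowed around the loop. Because the offsets have to return to zero after one full trip around the loop, the average stage delay is the best period this idealized model can reach.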

Embracing that approach has required major changes to the tools themselves. “It’s a complete genetic rewrite in terms of the algorithm,” Cunningham asserted. “To the end user, it just looks like a new kind of next-generation step in the flow. So I’ve just rewritten step ‘n.’ What comes into step n, what goes out of step n is still the same but the actual result is now much better for power, performance and area. You are really optimizing the clock parts and the logic parts at the same time.”

Overall, boundaries are blurring throughout the entire optimization flow. The trend in the industry is toward optimization steps that are more and more intimately linked, so the transition from step n to step n+1 to step n+2 is very smooth and possibly more fine-grained. Twenty years ago they were very distinct, isolated steps, almost as if you were running different tools, he observed.

“It’s really the sum of all the steps—making the combination greater than the sum of its parts—that’s where the industry is focused these days,” Cunningham concluded.