Optimal power, performance, and timing hinge on making the right decisions about the clock network architecture.
Laying the proper clock network architecture foundation makes all the difference for the best performance, power, and timing of a chip, particularly in advanced node SoCs packed with billions of transistors.
Each transistor, which acts like a standard cell, needs a clock. An efficient clock network should ensure the switching transistors save power. In today’s advanced nodes, when a design ports from N5 to N3, the foundry ensures that the design saves power in switching, which leaves the design engineer out of the loop.
But even more performance can be eked out of process scaling with clocking and clock routing. By removing some of the buffers, or sizing them differently, more power can be saved with the next generation. This also allows for more power optimization options.
The key is getting a handle on power variations that are the result of clocking. Typically, power variations happen in the design when the clock is gated or ungated, and power variation is an artifact of the clocking scheme.
But as clock networks span a reticle-sized chip, timing closure becomes exponentially more complicated, especially for large synchronous systems. Designers face immense timing closure challenges as clock skews become more challenging to maintain and timing margins tighten.
To make matters worse, clock networks consume power at a rapid rate. “The clock network can consume up to half of the chip’s dynamic power and lead to increased system costs, diminished power envelopes at the system level, and increased total cost of ownership for AI, data center, and automotive OEMs,” said Jeffrey Fredenburg, CTO and co-founder at Movellus.
Traditionally, choosing a clock network topology and strategy was a physical design decision. “However, design centers are applying a shift-left directive with their clock architectures and bringing the decision closer to the architectural stage. By shifting left, designers can extract greater power efficiency and performance leaps in the construction (place-and-route) phase,” Fredenburg said.
In the past, engineering teams built elaborate clock trees to keep everything on the chip, from one corner to the other, in sync. “The challenge was they’d have to have a clock somewhere near the center of the chip, and then would distribute that clock in a way that any circuit on the chip was always in sync with other circuits,” said Steven Woo, fellow and distinguished inventor at Rambus. “However, this global synchronization got to be a problem because distributing a high frequency clock all across a chip was a lot of power. Then what people started saying was, ‘Maybe I don’t have to have everything globally in sync. Maybe in just little pockets of my chip design I could have locally synced up circuits.’ But then there were domain crossing issues to go to another domain, and required special circuits to allow data to be passed between non-clock-aligned boundaries. Looking at the companies designing AI chips, what they do is say, ‘Maybe I don’t really need to even have big islands of synchronization. Maybe I just have to be in sync with my immediate neighbor circuits.’ What that does is reduce the need to distribute a high-precision clock all across the chip.”
This doesn’t mean we will see clock-less designs. “Clocks are still needed, but the requirement for them all to be in sync all across the chip gets smaller,” Woo said. “Then you can start reducing the amount of power, as well. Some of these AI chips, and things like systolic arrays, don’t necessarily need the huge global clock tree that sucks so much power. That’s something where now you take all that power and turn it back into things like more SRAM and more compute elements.”
Cause/effect
But if clocks are not in sync, there is an impact. “From a performance standpoint, it will take a little bit of time to cross these boundaries because it’s not perfectly in sync, and you have to just be okay with that,” he said. “In the case of AI developers, they say, ‘We’ll deal with that kind of thing because what we care about is global throughput. If I can show that with the power savings I can have more compute engines, while it may take a little longer to get the data through, I can have more compute pipelines, and life’s better for me if I do that.’ From a power standpoint, you save clock distribution power, so it tends to be good, but the latency to get things through a chip tends to be higher. So if you’re in a very latency-critical application, in general, you’d like to have more synchronous circuits and synchronous boundary crossings. But if you’re willing to give up some of that, then moving away from that kind of architecture makes a lot of sense.”
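The tradeoff Woo describes can be sketched with a toy model. The snippet below is only an illustration under stated assumptions: the `Design` class, the engine counts, and the two extra cycles per boundary crossing are hypothetical numbers chosen to show the shape of the tradeoff, not measurements from any chip. The larger engine count in the locally synchronized case stands in for clock-distribution power being reclaimed for compute.

```python
from dataclasses import dataclass


@dataclass
class Design:
    engines: int          # parallel compute pipelines (hypothetical count)
    stages: int           # pipeline stages a result passes through
    crossing_cycles: int  # extra cycles spent at each stage boundary (assumed)
    freq_ghz: float       # clock frequency in GHz

    def latency_ns(self) -> float:
        # Each stage costs one cycle plus any boundary-crossing penalty.
        cycles = self.stages * (1 + self.crossing_cycles)
        return cycles / self.freq_ghz

    def throughput_gops(self) -> float:
        # Once the pipelines are full, each engine retires one result per cycle.
        return self.engines * self.freq_ghz


# Globally synchronous: no crossing penalty, fewer engines fit in the power budget.
global_sync = Design(engines=64, stages=8, crossing_cycles=0, freq_ghz=1.0)
# Locally synchronous: crossings cost cycles, reclaimed power buys more engines.
local_sync = Design(engines=80, stages=8, crossing_cycles=2, freq_ghz=1.0)

for name, d in (("global", global_sync), ("local", local_sync)):
    print(f"{name}: latency {d.latency_ns():.1f} ns, "
          f"throughput {d.throughput_gops():.1f} Gops/s")
```

In this toy setup the locally synchronized design has higher latency but also higher aggregate throughput, which is the exchange a throughput-oriented AI design may accept and a latency-critical one may not.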
Variation adds another wrinkle, because any variation in power impacts timing. “Just as you have power ripples through your chip, the performance ripples in an analogous way as the voltage drops on the cell,” said Marc Swinnen, director of product marketing at Ansys. “It still may be well within the bounds of the allowable voltage drop, but it does mean that cell is a little slower. The lower the supply voltages, the slower the transistors switch. So as your voltage dips, your performance dips on that one, as well. And if you have enough local dips, each one may be okay in its own right. But taken together, they hit this critical path, and this path is now too slow or too fast. If the voltage spikes, you could have a faster performance and you could miss the hold time. It comes down to the impact of power noise on timing, and the primary victim of this is the clock.”
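One way to see how small local dips add up along a path is a first-order delay model. The sketch below uses the alpha-power law approximation, in which gate delay scales as V / (V - Vth)^alpha. This model is a substitute chosen for illustration, not the method used by any tool mentioned here, and the threshold voltage, exponent, and supply values are assumed numbers.

```python
def gate_delay(vdd: float, vth: float = 0.30, alpha: float = 1.3) -> float:
    """Relative gate delay from the alpha-power law (arbitrary units).

    vth and alpha are illustrative assumptions, not values for a real process.
    """
    if vdd <= vth:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return vdd / (vdd - vth) ** alpha


def path_delay(stage_voltages: list[float]) -> float:
    """Sum the delay of each stage, given the local supply each stage sees."""
    return sum(gate_delay(v) for v in stage_voltages)


# Ten-stage path at a nominal 0.75 V supply (assumed).
nominal = path_delay([0.75] * 10)
# Same path, but four stages each see a 30 mV local dip.
drooped = path_delay([0.75] * 6 + [0.72] * 4)
# Same path with a 30 mV overshoot everywhere: faster, a possible hold-time risk.
spiked = path_delay([0.78] * 10)

print(f"slowdown from dips: {100 * (drooped / nominal - 1):.2f}%")
print(f"speedup from spike: {100 * (1 - spiked / nominal):.2f}%")
```

Each dipped stage is only slightly slower on its own, but the slowdowns accumulate along the path, while an overshoot shortens the path delay and eats into hold margin.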
The clock is a big network, and ideally there is a very regular stream of pulses coming down the clock tree. But because of the switching in proximity to the clock, these clock transmission gates will randomly slow down or speed up a little bit, and that causes a clock jitter. One of the major components of clock jitter today is power ripple and power variation.
“Any electronic circuit these days is basically a state machine, where the clock steps you from state to state to state, and the clock interval, the amount of time you have between two clock ticks, is basically how much work you can get done in any clock phase,” Swinnen said. “But you have to margin. Of course, there’s setup and hold times, which eat into your clock, but jitter is yet another variability or unknown that narrows the time you have to do work, because the clock may be late or early. It’s random. It’s statistical the way it jitters, so it impacts the entire chip’s timing. If you can reduce the jitter by, let’s say, 10%, or at least predict it better, it’s like getting a 10% faster clock, which has a huge impact. Margining is becoming much more difficult to do these days, so you need accuracy. Still, the main cause of jitter is power variation, which is where clock jitter analysis comes in. Traditional timing tools see voltage as a fixed thing. They’re all built with a fixed voltage assumption, so they’re poorly situated to handle this problem, because from the get-go they don’t see power variation. That’s not one of their parameters.”
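The margin arithmetic behind this can be sketched as a simple setup budget. The function names and the picosecond figures below are illustrative assumptions, and real sign-off uses statistical analysis rather than a plain sum. The point is only that every picosecond of jitter margin is a picosecond the logic cannot use.

```python
def min_clock_period_ps(logic_delay_ps: float, setup_ps: float,
                        jitter_ps: float, skew_ps: float = 0.0) -> float:
    """Smallest period that still meets setup with the given margins."""
    return logic_delay_ps + setup_ps + jitter_ps + skew_ps


def max_frequency_ghz(period_ps: float) -> float:
    """Convert a clock period in picoseconds to a frequency in GHz."""
    return 1000.0 / period_ps


# Assumed numbers: 400 ps of logic, 30 ps setup, 50 ps of jitter margin.
baseline_period = min_clock_period_ps(400.0, 30.0, 50.0)
# Same path with the jitter margin trimmed to 45 ps through better analysis.
improved_period = min_clock_period_ps(400.0, 30.0, 45.0)

print(f"baseline: {baseline_period:.0f} ps, "
      f"{max_frequency_ghz(baseline_period):.3f} GHz")
print(f"improved: {improved_period:.0f} ps, "
      f"{max_frequency_ghz(improved_period):.3f} GHz")
```

How much frequency is recovered depends on how large the jitter margin is relative to the rest of the budget, which is why predicting jitter accurately matters as much as reducing it.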
Further, at very advanced process nodes, mismatch effects cause jitter.
“The matching of the transistors is the key advantage of advanced process technology in that you can have two transistors which will be matched very, very closely with each other, such that if the area is higher, the mismatch will be smaller,” said Priyank Shukla, senior product marketing manager at Synopsys. “As geometries shrink, you not only have smaller form factors, but also have finFETs. We used to have planar, so it was easier to match. In the finFET, there are three fins to implement a transistor, and process steps have been added to implement those fins. So the mismatch has shown itself in different types of jitter, which is something new. It wasn’t there earlier. Also, in the past, the standard technique was sizing up. But now it doesn’t generally help, because it increases capacitance if the area is higher. You are adding capacitance, so new circuits are needed to tackle these kinds of challenges. This is a new point with respect to advanced nodes.”
For digital designers, commercial tools can address this directly, from analyzing clock jitter to variation-aware statistical timing analysis. These tools can be used to analyze the effect of voltage and voltage noise on the timing, and can calculate clock jitter, show what the jitter will be, and what’s causing that jitter.
Other techniques include dynamic voltage drop analysis, which shows what is switching in the neighborhood of the clock. This allows the design team to strengthen the clock network to make it less sensitive to voltage noise without having to pay the price for that across the entire chip, or even across the entire clock network. Specific stages will be more affected by voltage noise than others, so the team can run a root cause analysis, target those stages, and surgically fix or re-buffer the clock there to make it less sensitive. That granular approach helps to limit margin because it doesn’t assume worst case everywhere.
“Clocks are often disregarded, but people forget that in high-speed chips 30% to 50% of the entire power in the chip is consumed in just the clock,” Swinnen said. “A third to half the power gets burned up in the clock network. So when you’re doing low power-design, that’s the first thing you should look at.”
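A back-of-the-envelope check on that range uses the standard dynamic power relation, P = activity x C x V² x f. The clock toggles every cycle, so its activity factor is 1, while data nets switch far less often. The capacitance, activity, voltage, and frequency values below are assumptions picked for illustration, and `gated_fraction` is a hypothetical knob for the share of clock capacitance held idle by clock gating.

```python
def dynamic_power_w(activity: float, cap_farads: float,
                    vdd: float, freq_hz: float) -> float:
    """Dynamic switching power: activity * C * V^2 * f."""
    return activity * cap_farads * vdd ** 2 * freq_hz


# Illustrative assumptions, not data from any real design.
VDD = 0.75            # supply voltage in volts
FREQ_HZ = 2e9         # 2 GHz clock
CLOCK_CAP_F = 1.0e-9  # total switched clock-network capacitance
DATA_CAP_F = 8.0e-9   # total data/logic net capacitance
DATA_ACTIVITY = 0.15  # average fraction of cycles in which a data net toggles


def clock_share(gated_fraction: float = 0.0) -> float:
    """Fraction of dynamic power burned in the clock network."""
    active_clock_cap = CLOCK_CAP_F * (1.0 - gated_fraction)
    clock = dynamic_power_w(1.0, active_clock_cap, VDD, FREQ_HZ)
    data = dynamic_power_w(DATA_ACTIVITY, DATA_CAP_F, VDD, FREQ_HZ)
    return clock / (clock + data)


print(f"clock share, no gating:   {clock_share():.1%}")
print(f"clock share, half gated:  {clock_share(0.5):.1%}")
```

Even though the clock network accounts for a small part of the total capacitance in this example, its activity factor of 1 pushes its share of dynamic power into the range Swinnen cites.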
Analog/mixed signal
For analog engineers, there aren’t tools that address the clock in the same way, so much of this is learned on the job. “Analog engineers are not taught these things because advanced finFET process nodes are very rarely taught,” Shukla said. “But at work, all the advanced chips are in finFET processes, so the analog engineer learns this in the industry. They learn how to design this better through IEEE papers or a lot of webinars.”
While tools exist, does this mean issues with clock network architectures and the resulting impact on power are a solved problem?
Rambus’ Woo doesn’t think so. “I don’t think anybody’s happy. Everybody wants the power to be lower, especially for clocking. In some of our PHY circuits, we have clocks that have to distribute over long distances, and distributing high frequency clocks over long distances is challenging. There have to be repeaters and things like that, so nobody’s happy with it. Is it a solved problem? It’s like everything in chip design — it’s a problem people can cope with right now. Then at some point, they can’t cope with it anymore, and then you’ve really got to do something. It’s the Whac-A-Mole problem, where there’s this other mole that you’ve got to whack first. In the same way when clock one pops its head up, then you’ve got to go deal with that. So much of this is just trying to keep everything at a level where no one thing is sticking out more than the other. There are solutions right now, but clock power is always a problem. The question is when you’re going to really be forced to deal with it next.”
Others point to similar concerns. “When we have architectures in which there are data dependencies between different layers of compute, we need to ensure that there is some kind of a synchronization in the whole data path,” said Ramesh Chettuvetty, general manager of memory solutions at Infineon Technologies. “Second, when the chip sizes grow, as in embedded RAMs and multicores, we are talking about 100 mm² or 800 mm² dies. Ensuring that we have synchronization at the four corners of the chip itself is going to be a challenge because there will definitely be clock routing delays within the chip itself. To address all that, we will need to have a very foolproof clocking architecture, which takes into account all the setup and hold time requirements. And these are high-frequency clocks, so you definitely will need some kind of a synchronization network between the different cores. The other way we can do it in a synchronous manner is relying on handshaking between the cores. The problem with that approach is it brings down the efficiency of data movement.”
The biggest challenge in these architectures, especially in in-memory compute applications, is improving data flow efficiency.
“The moment we have a handshaking protocol or arbitration circuit that relies on back-and-forth communication, it will slow down the data movement between layers, which is not a good thing in these architectures,” Chettuvetty said. “It will drop the efficiency down big time. Engineering teams rely on using synchronous architectures as much as possible, but synchronous architectures without any bottlenecks in the data flow. That is what everyone is striving for. How successful they are will determine the overall power efficiency of the system. That is a very important aspect. Clocking architecture is a very important aspect in these systems. These are standard practices that engineering teams use in many of these SoCs, and they have a fair grasp on how it is done. I don’t see a totally innovative approach in the clock architectures specifically for AI, but data flow efficiency and synchronized architectures are what people are looking for to improve efficiency, in general, which obviously means clocking has an important place.”
Conclusion
Clock networks are one of the largest networks on chips, and they have a big impact on power, performance, and area, Movellus’ Fredenburg said. “With each clock network topology, designers must make crucial tradeoffs, such as power for performance or time-to-market for performance.”
While clock topologies have remained relatively static over the past 25 years, new topologies are emerging, bringing a notable leap in power efficiency and timing closure agility.