Glitch Power Issues Grow At Advanced Nodes

Problem is particularly acute in AI accelerators, and fixes require some complex tradeoffs.

An estimated 20% to 40% of total power is being wasted due to glitch in some of the most advanced and complex chip designs. At this point there is no single best approach for how and when to address it, and reports are mixed about how effective the available solutions can be.

Glitch power is not a new phenomenon. DSP architects and design engineers are well-versed in the power wasted by long, slow data paths. But the problem is spreading. At advanced nodes, glitch power issues are becoming more common and impactful. No one solution works for all chips or types of design, and just addressing it earlier doesn’t necessarily yield a better design.

In combinatorial circuits, a clock is supposed to enable or prevent different states from propagating. But frequently there are delays at the gates or in the wires, and as a result the inputs do not all reach the gate at the same time.

“Say you’ve got an AND or an OR gate,” said Joseph Davis, senior director for Calibre interfaces and EM/IR product management at Siemens EDA. “All of your signals don’t arrive at the same time, so you’ve got a window for settling time that you allow. What can happen in today’s circuits — and frankly, which has always been the case — is that you get delays. One input will switch and the other one doesn’t, and then it switches. When the first thing switched, perhaps the output switched. But then the other input switched, and now it switches back.”

For a simple NAND gate, if there is a delay in timing, the gate may open and close without the signal reaching it in time. The more inputs, and the longer the input sequence between latches, the greater the opportunity for this to happen, and the more power that is wasted.

“These are called hazards,” Davis said. “A hazard is an element in the circuit that has the possibility to create this glitch. The most common source is an inverted signal. Then, both the normal and the inverted signals get passed to the output gate. Any delay between those two things has a potential to cause a glitch of some sort. So depending on the type of logic, if there are a lot of cases like that, you can have a lot more of this glitch power. If there is a very wide fan-in, or very long, deep combinatorial logic, then there is a higher likelihood of these glitches happening until it settles out. They are very high frequency things. They switch up, then turn almost immediately back off, and this can happen multiple times all over the place.”
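
The inverted-signal hazard Davis describes can be sketched in a few lines of Python. This is a minimal illustration rather than a timing model: the unit-step time base, the function names `simulate_hazard` and `count_transitions`, and the delay values are all assumptions made for the example. The output y = a AND (NOT a) should logically stay at 0, but when the inverted copy lags behind the original, y pulses high for as long as the lag lasts.

```python
def simulate_hazard(inverter_delay, steps=10, switch_at=2):
    """Waveform of y = a AND (NOT a) when the inverted copy lags by inverter_delay steps."""
    # Input a rises from 0 to 1 at time switch_at.
    a = [0 if t < switch_at else 1 for t in range(steps)]
    # The inverter sees a delayed view of a; before any history exists it sees a = 0.
    not_a = [1 - (a[t - inverter_delay] if t >= inverter_delay else 0)
             for t in range(steps)]
    return [a[t] & not_a[t] for t in range(steps)]


def count_transitions(wave):
    """Number of 0->1 and 1->0 edges in a sampled waveform."""
    return sum(1 for prev, cur in zip(wave, wave[1:]) if prev != cur)


ideal = simulate_hazard(inverter_delay=0)   # both copies switch together
skewed = simulate_hazard(inverter_delay=3)  # inverted copy arrives 3 steps late
print(count_transitions(ideal), count_transitions(skewed))
```

With zero skew the output never moves. With a three-step skew it rises and falls once, which is two transitions that do no useful work.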

Glitch in AI accelerators
The problem is particularly troublesome for AI accelerators, which are designed for maximum performance at minimum power.

“In the neural network processing hardware, there are a lot of multiply accumulate computations (MACs),” said William Ruby, director of product marketing, low power solution at Synopsys. “In fact, a lot of neural network processors are rated on how many millions, billions, gazillions of MACs they do per second, and that’s a measure of performance. But if you look at a traditional design of a hardware multiplier, it’s got logic in it that performs a lot of what are called exclusive ‘OR’ functions. You can think of this as the foundation of a simple adder type of a circuit. Also, the adder becomes the foundation for a multiplier, as well. These types of circuits are connected in series, and they are pipelined. What happens is there are all these transitions of signals that are taking place, even within a single clock cycle, that eventually settle down to a final result because of different delays through different circuits, and so on. The multipliers in these neural network processors are very prone to glitch power because of the way the circuitry is designed, and it takes multiple transitions to settle down to the final result.”

Fig. 1: Glitch source identification and ranking. Source: Synopsys
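
The settling behavior Ruby describes can be reproduced with a toy ripple-carry adder. The sketch below is an illustration under stated assumptions, not a model of any real multiplier: every full adder is given the same unit delay, and the helper names (`to_bits`, `step`, `stable_state`, `count_toggles`) are hypothetical. It counts every toggle on the sum and carry nets while the adder settles, then compares that with the number of nets whose final value actually differs from the starting value.

```python
def to_bits(value, width):
    """Little-endian list of bits for an unsigned integer."""
    return [(value >> i) & 1 for i in range(width)]


def step(a, b, carries):
    """Unit-delay update: each full adder sees the carries from the previous step."""
    sums, new_carries = [], [0]  # carry into bit 0 is tied to 0
    for ai, bi, cin in zip(a, b, carries):
        sums.append(ai ^ bi ^ cin)
        new_carries.append((ai & bi) | (cin & (ai ^ bi)))
    return sums, new_carries


def stable_state(a, b):
    """Iterate until the carries stop changing and return the settled nets."""
    carries = [0] * (len(a) + 1)
    while True:
        sums, nxt = step(a, b, carries)
        if nxt == carries:
            return sums, carries
        carries = nxt


def count_toggles(width, old, new):
    """Return (total toggles while settling, toggles that were strictly necessary)."""
    sums, carries = stable_state(to_bits(old[0], width), to_bits(old[1], width))
    start = sums + carries
    a, b = to_bits(new[0], width), to_bits(new[1], width)
    total = 0
    while True:
        new_sums, new_carries = step(a, b, carries)
        changed = sum(x != y for x, y in zip(sums + carries, new_sums + new_carries))
        if changed == 0:
            break
        total += changed
        sums, carries = new_sums, new_carries
    useful = sum(x != y for x, y in zip(start, sums + carries))
    return total, useful


total, useful = count_toggles(4, old=(0, 0), new=(15, 1))
print(total, useful, total - useful)
```

For a 4-bit adder moving from 0 + 0 to 15 + 1, this model records 10 toggles where only 4 nets end up with a different value. The other 6 are sum bits that rise and then fall again as the carry ripples through.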

Overall efficiency
Glitch also impacts the overall efficiency of a design. “When you switch something, it’s using the energy that’s coming from the voltage sources all the way at the pins, but also energy that’s stored in the capacitance of the network,” said Siemens’ Davis. “So if you’re switching ON and OFF like that, you’re charging and dissipating those capacitors unnecessarily so that energy is no longer available for the real switching that you care about.”
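
A rough way to put numbers on that wasted charge is the standard CMOS switching-energy relation, in which each rising edge on a net of capacitance C draws C times Vdd squared from the supply. The sketch below applies it to a single net. The capacitance, voltage, and edge counts are made-up values, and `supply_energy` and `glitch_share` are hypothetical helper names, so treat it as back-of-the-envelope arithmetic rather than a power analysis.

```python
def supply_energy(cap_farads, vdd_volts, rising_edges):
    """Energy drawn from the supply: each rising edge costs C * Vdd^2."""
    return rising_edges * cap_farads * vdd_volts ** 2


def glitch_share(useful_rises, glitch_rises):
    """Fraction of switching energy spent on glitches for one net (C and Vdd cancel)."""
    return glitch_rises / (useful_rises + glitch_rises)


# Assumed example: a 2 fF net at 0.7 V with 3 functional and 1 glitch rising edge.
cap, vdd = 2e-15, 0.7
wasted = supply_energy(cap, vdd, rising_edges=1)
print(f"{wasted:.3e} J wasted, share = {glitch_share(3, 1):.0%}")
```

One spurious pulse among three functional transitions already accounts for a quarter of the switching energy on that net in this example.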

And it is made worse by advanced technologies, due to the increased RC delay.

“In advanced nodes, the transistors are getting smaller but the wires are staying the same,” said Davis. “If they get narrower, they get taller, the overall capacitance goes up. The resistances aren’t going anywhere and the capacitances are going up, so the delays are starting to be dominated by the RC portion. As you go into increasingly advanced nodes, you make these little bitty transistors, and they’ve got to drive these large loads. The farther you have to drive it, the more opportunity there is for delay and for variation. If you’ve got a hazard in that transmission line that is going along, that’s what adds to the probability of having significant glitches.”
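
The effect of distance on RC delay can be approximated with the Elmore delay of a lumped RC ladder. That is a textbook first-order estimate brought in here for illustration, not something the article's sources specify. The function name `elmore_delay`, the segment count, and the resistance and capacitance values are assumptions.

```python
def elmore_delay(r_total, c_total, segments=100, c_load=0.0):
    """Elmore delay (seconds) of a uniform RC wire split into equal lumped segments."""
    r_seg = r_total / segments
    c_seg = c_total / segments
    delay = 0.0
    for k in range(1, segments + 1):
        # Segment k's resistance charges every capacitor at or beyond node k.
        downstream_cap = (segments - k + 1) * c_seg + c_load
        delay += r_seg * downstream_cap
    return delay


# Assumed wire: 1 kOhm and 100 fF in total. Doubling its length doubles both.
short = elmore_delay(1e3, 100e-15)
long_ = elmore_delay(2e3, 200e-15)
print(f"short = {short:.3e} s, long = {long_:.3e} s, ratio = {long_ / short:.1f}")
```

Because resistance and capacitance both scale with length, doubling the wire roughly quadruples the estimated delay. That widens the arrival-time gap between a long route and a short one feeding the same gate.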

Another contributor to glitch could be in the types of circuits that people are designing or how they’re doing the implementation that causes this to become more of an issue. “Since glitch power is very dramatically affected by the implementation, in designs today, you have a longer critical path or a small set of critical paths,” he said. “[Design teams] try to push as many paths as possible, as close to critical as they can, such that everything is critical. In doing so, that could be causing more of these glitches.”

Further, there is also a long pulse glitch, also called a long constant glitch or a transport glitch, where there are two inputs arriving at different times that eventually settle at a value that was not supposed to cause the toggle on the output of the combinational gate.

“In between, because of the differences in the arrival times of these two input signals, there was a glitchy pulse and output, and many times the propagation is actually a bigger impact,” said Suhail Saif, principal product manager at Ansys. “The glitch at that one signal might just cause this wasted power consumption because the transition was never expected or never accounted for. For this reason, this power consumption is considered waste power. From the designer’s point of view, they never accounted for it, so that power consumption at that signal and that gate is called cell glitch, because it is the glitch power on that particular gate. But the more concerning factor for designers these days is the downstream impact of it, because this glitch doesn’t just stay at that signal.”

This is where things get really complicated. “Many times it can propagate downstream because the combinational logics are multi-stage, and there’s never one single stage,” Saif said. “And these days, the data paths are deeper with faster clock frequencies. The data paths could be as deep as 15 or 20 stages, and the glitch at this signal can propagate all the way down to the block, where it gets captured because of the sequential nature of the circuits there. It passes through the combinational circuit, and causes wasted power consumption from every gate that it passes.”

Others agree. “If you think about digital logic, it’s all synchronous,” said Rob Knoth, group director for strategy and new ventures at Cadence. “So you get a clock pulse that happens, and essentially it’s like a race starts. All the runners leave the starting blocks and start racing through their gates. Eventually the next clock pulse happens, and they get captured at the other end. Logic states have synchronized out. That’s fantastic. You’ve got set-up time, you’ve got hold time, and digital logic just happens. Now let’s think about what actually happens, because that was the idealized 0 time version of the world. In reality, each of those gates in the logic path have some kind of delay, and wires in between the gates also have some kind of delay. So when the runners at the signals leave the starting flip-flop and start racing through those gates, some signals are going to show up sooner than others. While they all will eventually settle out by the time the next clock pulse happens, in between the start and the finish there can be a huge amount of up and down, and little pulses happening as those signals are settling through the gates. All of those swings before you have finished can count as glitch power. Those are all functional glitches, so they eventually will settle out if you’ve met the setup time and the hold time. Each of those swings are spending some kind of power, and how many swings happen is going to dramatically impact the power.”

In the past, there weren’t as many concerns about glitch power because it wasn’t a significant portion of the total dynamic power. “You were already accounting for the functional swings of the signals as they would transition from 1 to 0, or 0 to 1, for the regular logical swing on those gates, so glitch power wasn’t that big of a concern,” Knoth noted. “But what we started to see around 7nm — give or take a bit, depending upon designs — logic depths/combinatorial logic paths started getting so deep for the average design that glitch power became a big problem. Suddenly, it was accounting for 25% to 40% of total dynamic power in some designs.”

Avoiding problems with glitch
There are other problems, too. Eric Hong, vice president of engineering at Mixel, said glitch power can cause timing failures, which in turn can lead to functional failures. He noted that power delivery at the system level needs to be carefully designed in order to minimize voltage ripples (glitch).

That also needs to be considered very early in the layout process. “From a design-implementation perspective, in the implementation tools, glitch power is one of the constraints/settings in how you optimize your place-and-route tool,” said Siemens’ Davis. “People want to do the analysis for glitch power before implementation.”

He noted there are a few commercial tools that claim to work in this area, and this is going to be largely dependent on the coupling to the implementation tool, because wire delays and so forth are going to be modeled before implementation. “If you do glitch power at RTL, you’re modeling what you’re going to see on the output. There are a lot of things you can do at the RTL level, such as optimizing a combinatorial logic circuit to minimize the hazards. You can do that at a logical level. But then, if you still have hazards, you need to look at the delays you’re going to get, how likely you are to create these glitches, and how big they will be. The larger the delay between the two transients that come in, you have more time to switch and then go back. If they’re very close together, then it doesn’t really go anywhere, it just kind of shrugs its shoulders. But if you’ve got a good delay between them, then you have an opportunity to really glitch.”
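
Davis' point about closely spaced versus widely spaced transitions can be illustrated with a simple inertial-delay filter. The model below is deliberately crude: it assumes the glitch pulse is as wide as the arrival skew, that a gate swallows any pulse no wider than its own inertial delay, and that the pulse keeps its width from stage to stage. The names `glitch_width` and `stages_toggled` and all of the timing values are hypothetical.

```python
def glitch_width(arrival_a, arrival_b):
    """Width of the window in which a two-input gate sees inconsistent inputs."""
    return abs(arrival_a - arrival_b)


def stages_toggled(pulse_width, gate_inertial_delays):
    """Count the gates in a chain that a pulse toggles before one of them filters it."""
    count = 0
    for delay in gate_inertial_delays:
        if pulse_width <= delay:
            break  # pulse is too narrow to switch this gate, so it dies here
        count += 1
    return count


chain = [5] * 15  # assumed: 15 stages, 5 ps inertial delay each
print(stages_toggled(glitch_width(10, 12), chain))  # 2 ps skew
print(stages_toggled(glitch_width(10, 40), chain))  # 30 ps skew
```

Under these assumptions the 2 ps skew is filtered at the first gate, while the 30 ps skew toggles all 15 stages of the chain.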

Synopsys’ Ruby noted that in the past, if you were doing transistor-level SPICE type simulations, you could detect the problematic transitions at the gate level, but they couldn’t really be seen at the RTL. “There’s technology that allows us to do that at the RTL, but it’s been hidden a lot of times. Glitch power has always been there, maybe not at today’s range, and it was hiding in the fact that there is still some accuracy differences between a simulation and silicon measurements and things like that. But in AI accelerators and neural network processing chips, it is becoming very significant.”

Ruby says it is possible to design for lower glitch power, but it’s not simple. “You can look at the RTL code and do certain modifications, but a lot of times it really comes down to measuring the power with the original RTL, and if I make a change to try to reduce it, am I really reducing it or am I going back up to increase it,” he said. “Especially with this emerging glitch power issue, you really have to have something that measures power — and glitch power, specifically — side by side with your RTL source to tell you, first of all, where the glitch sources are. With arithmetic circuits, like comparators, that produced these glitchy outputs, because you have a glitchy output may not mean much, unless that output is driving a lot of load capacitance. That’s where the rubber meets the road. Now I need to look not at my glitch sources, but the downstream glitch power, which is a function of that capacitance. That’s really the key. We can design based on that information. We can re-design based on that information. We can change the architecture. There are all kinds of logic design tricks or transformations that can be done, such as resource sharing, pipelining, and re-pipelining, data gating, or operand isolation. That can all be brought to bear on this problem. But in many cases, there is no such thing as a free lunch. When you start re-designing things inside, how does that impact the area? How does that impact the timing? All of these things need to be considered.”
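
Of the transformations Ruby lists, operand isolation is the easiest to sketch. The toy model below assumes a latch-style scheme in which a block's inputs are held at their last used value whenever its result is not needed, and it uses input bit flips as a stand-in for switching activity inside the block. The helper names `input_toggles` and `isolate`, the 8-bit width, and the operand stream are assumptions for illustration.

```python
def input_toggles(operands, width=8):
    """Total bit flips between consecutive operand values presented to a block."""
    mask = (1 << width) - 1
    return sum(bin((x ^ y) & mask).count("1")
               for x, y in zip(operands, operands[1:]))


def isolate(operands, enables):
    """Hold the last used operand while the block's result is not needed."""
    held, out = 0, []
    for value, enable in zip(operands, enables):
        if enable:
            held = value
        out.append(held)
    return out


operands = [0x00, 0xFF, 0x0F, 0xF0, 0xAA, 0x55]
enables = [1, 0, 0, 0, 0, 1]  # result is only consumed in the first and last cycle
print(input_toggles(operands), input_toggles(isolate(operands, enables)))
```

In this example isolation cuts the input activity from 32 bit flips to 4. The extra holding logic is not modeled, which is exactly the kind of area and timing cost Ruby warns has to be weighed.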

So the earlier it can be addressed, the better. “Imagine you’re creating a glitch optimal multiplier, and you define that architecture versus another architecture that may be glitch heavy,” said Preeti Gupta, director of product management at Ansys. “At synthesis and place-and-route, you’re optimizing within that architecture, so your degrees of freedom are much less, whereas at architectural and RTL stages, you’re saying this multiplier is not as good for glitch, but that multiplier is. That’s why doing these kinds of glitch power fixes early makes a lot of sense to the design community. That’s why our users are asking us for more insights into where glitch power is being consumed and why. For example, ‘This instance by itself is not consuming so much glitch, but it’s causing a whole lot of problems downstream, so maybe we want to fix this glitch source.’ In the concept of aggressors and victims in our design industry, rather than focus on the victims, focus on the aggressors. That’s one of the emerging ways to look at glitch power.”
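
The aggressor-versus-victim view can be sketched as a ranking over a netlist graph. The example below is a hypothetical scoring scheme, not how any commercial tool works: it pessimistically assumes every glitch at a source reaches every net downstream of it, and scores each source by its glitch count times the capacitance it can disturb. The net names, capacitance values, and the functions `downstream_nets` and `rank_glitch_sources` are invented for illustration.

```python
def downstream_nets(fanout, source):
    """All nets reachable from source through the fanout graph."""
    seen, stack = set(), [source]
    while stack:
        net = stack.pop()
        for nxt in fanout.get(net, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def rank_glitch_sources(fanout, cap, glitches):
    """Order glitch sources by glitch count times the total capacitance they can toggle."""
    scores = {}
    for src, count in glitches.items():
        reach = downstream_nets(fanout, src) | {src}
        scores[src] = count * sum(cap[net] for net in reach)
    return sorted(scores, key=scores.get, reverse=True)


fanout = {"cmp_out": ["mux", "bus"], "mux": ["bus"], "xor_out": ["ff_d"]}
cap = {"cmp_out": 1, "mux": 2, "bus": 20, "xor_out": 1, "ff_d": 1}  # arbitrary units
glitches = {"cmp_out": 2, "xor_out": 5}  # glitch pulses observed at each source
print(rank_glitch_sources(fanout, cap, glitches))
```

Here `cmp_out` glitches less often than `xor_out` but ranks first because it drives a heavily loaded bus. That is the case Gupta describes, where an instance that looks minor on its own causes most of the downstream waste.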

However, there are surprises, she noted, even if glitch is addressed up front in the design. “Customers are working very closely with us right now, saying, ‘We measured power in silicon, and glitch power was beyond our expectations. We want to look at more ways of mitigating it early.’ I don’t know that we have a state-of-the-art in glitch power measurement on silicon. How do you separate glitch power from total power? Maybe techniques like doing the sweep of frequency and measuring power at all of those different frequency levels would help. Certainly, glitch power is an emerging area.”

Conclusion
There are many different strategies to address glitch, but nothing is perfect. Complexity carries a cost in many ways, and glitch power is just one result.

“We do have both EDA technology as well as computational hardware that’s advanced enough that we can start to analyze and optimize glitch early,” said Cadence’s Knoth. “The earlier you address it in the design, the better off you are, because of all the design corners. If you address it earlier, are you going to be doing different things with the information at whatever point you’re at? If you’re addressing things earlier, are you looking at a less precise model of it, compared to later?”

This is one of the main challenges. “The key is to understand that you’re giving up some of that precision for actually being able to explore a wider universe of solutions,” Knoth said. “And while you might not know the exact interconnect delay, because it might not be the exact routing topology, big delay variations can happen in the gates and in the wires because of the process, voltage, and temperature changes that can happen in advanced nodes. Waiting till the end could be a fool’s errand. By that point, you’ve tied your hands and you can’t make as many big changes because you’re so close to the end. If you looked at it as ‘accurate enough,’ and came up with an accurate enough prototype early enough in the design, you can make those bigger architectural moves. You can make those bigger synthesis moves, and you can do more effective techniques to suppress glitches. But again, this requires you to shift left to bring the accuracy of the implementation earlier, and you package it in a manner in which an architect can interpret the findings. You marry together a better interoperability between the functional verification and the design implementation tools. That is a useful paradigm, and where we’re at in the industry, that’s not a pipe dream. It’s what we’re working on with our partners now.”
