The Challenge Of Defining Worst Case

What are the worst-case conditions for a chip, and should you worry about them? Of course, it is a little more complicated than that.


Worst-case conditions within a chip are impossible to define exhaustively. But what happens if you miss a corner case that causes chip failure?

As the semiconductor market becomes increasingly competitive — startups and systems companies are now competing with established chipmakers — no one can afford to consider theoretical worst cases. Instead, they must intelligently prune the space to make sure they are dealing only with realistic scenarios. The problem is that there is a huge number of variables that contribute to the notion of worst case, some of which show various degrees of correlation while others are independent. And they range from architectural issues down to the smallest details of implementation. So worst case for a product may be about throughput or latency, or peak power or total energy, or maybe maximizing yield.

As with many forms of analysis, incomplete simulation is usually prevalent early on, followed by statistical techniques that prove to be too pessimistic. And then more controlled methods take over. Today, the industry is trying to define what those controlled methods are. When software is added, the problem takes on a different level of complexity, and about 100 AI companies are betting huge amounts of money on the assumption that they understand it better than the next company.

Today, this topic is becoming more important to a greater number of people. “You can have a theoretical worst case that will blow up any chip,” says Neil Hand, director of marketing at Mentor, a Siemens Business. “It becomes a lot harder to define for SoCs, and this is what makes it an interesting discussion. How do you generate a realistic worst case?”

There are two parts to this problem. First, how can physical implementation and fabrication impact the notion of worst case? And second, what is the worst-case activity that can happen within a design?

“You really do have to separate it into two areas,” says Hand. “There is the implementation, or the realization aspects of worst case, where things will change as you go through synthesis, layout and the whole implementation process. But you also have a worst case in terms of activity and logic.”

Physical and fabrication issues
There are a number of physical things that impact worst case. “Some of them are compliance requirements,” says João Geada, chief technologist at ANSYS. “Maybe you are dealing with a nominal environment where you have to certify your chip works from -40°C to 120°C, or inside a car it may require -60°C to 150°C. Those are boundary cases. But worst is an interesting qualifier. We know the boundaries of the envelope, but where is worst? We usually have to deal with the process, voltage, temperature (PVT) envelope. Some people define worst case by the corners of that multi-dimensional space. They look at the corners of this three-dimensional cube and assume they are covered on the inside. That has probably never been true.”
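
To make the "corners of a cube" idea concrete, the short sketch below (with hypothetical process, voltage, and temperature values) contrasts the eight classic corner points with a fuller sweep that also visits the interior of the PVT envelope, which is where the rest of this discussion lives.

```python
from itertools import product

# Hypothetical boundary values for an automotive-grade part (illustration only).
process = ["slow", "typical", "fast"]        # global process corner
voltage_v = [0.72, 0.80, 0.88]               # supply envelope in volts
temperature_c = [-40, 25, 150]               # junction temperature in °C

# The classic "corner" view: only the extremes of each axis.
corners = list(product(process[::2], voltage_v[::2], temperature_c[::2]))
print(f"{len(corners)} corner points")       # 2 x 2 x 2 = 8

# A fuller sweep also visits interior points of the PVT cube,
# which is where effects such as temperature inversion can hide.
grid = list(product(process, voltage_v, temperature_c))
print(f"{len(grid)} grid points")            # 3 x 3 x 3 = 27
```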

This has become increasingly challenging for newer nodes. “The end of Dennard Scaling and increased process variability, especially on finFET nodes, are both making worst case even worse,” says Richard McPartland, technical marketing manager for Moortec. “It means that chips have the propensity to run hotter, and on-chip voltage drops are getting bigger. We do see designs pushing the limits and occasionally on the wrong side. Different applications may have different worst-case temperature, voltage and RC corners. Worst-case power is not just concerned with the maximum power dissipation, although that is naturally a good starting point. It is also about bursts of activity that cause temperature cycling and power differences, which cause temperature gradients across the chip.”

While Shift Left is often seen as a good thing, this is an area where it is becoming a necessity. “We could ignore the design and just say it has to operate within this environment,” says ANSYS’ Geada. “That ignores gradients, local effects, etc. This is the way people used to talk about worst case.”

There are two ways in which this simplistic analysis falls apart. First, the assumption that the corners are the worst case is not really true. There are islands in the middle of the envelope where designs may not operate correctly.

“You assume that the full system is symmetrical across all dimensions,” explains Geada. “But it is neither symmetric nor monotonic. With temperature, there is a CMOS effect called temperature inversion. Because of the relationship of transistor conductivity to metal conductivity, there are certain temperatures where lowering the temperature makes everything go faster. As you lower the temperature past a certain point, you start to increase metal conductivity. That impacts the behavior of transistors, and at a certain point things start working the other way around, and lowering the temperature makes things operate slower. There is an inflection point in the middle where the system has the least performance.”
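
Temperature inversion shows up even in a textbook alpha-power delay model. The toy coefficients below are not calibrated to any real process, but they are enough to show the sign of the temperature coefficient flipping between a low and a nominal supply voltage.

```python
# Toy alpha-power-law gate delay: delay ~ Vdd / (mobility * (Vdd - Vth)^alpha).
# All coefficients are illustrative, not from any foundry model.
def gate_delay(temp_c, vdd):
    t_k = temp_c + 273.15
    mobility = (t_k / 300.0) ** -1.5          # carrier mobility degrades when hot
    vth = 0.35 - 0.001 * (t_k - 300.0)        # threshold voltage drops when hot
    return vdd / (mobility * (vdd - vth) ** 1.3)

for vdd in (0.55, 0.90):
    cold, hot = gate_delay(-40, vdd), gate_delay(125, vdd)
    trend = "slower when cold" if cold > hot else "slower when hot"
    print(f"Vdd={vdd} V: {trend}")
```

At the low supply voltage the threshold-voltage effect dominates and the gate is slower when cold; at the nominal supply the mobility effect dominates and it is slower when hot.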

Similar issues exist for the other axes of process variation. Considering only a fast process or a slow process can mask cases where the worst case is actually a combination of some devices being fast and others slow. The second way the simple analysis falls apart is that ensuring a design works under all of these conditions at once is highly pessimistic. “When you manufacture a chip there are global effects,” adds Geada. “A particular production line is likely going to produce similar transistors, and a particular location on the wafer is going to have a particular trend that is fixed. A portion of those effects are correlated, so you cannot treat these as independent statistics.”
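
A quick Monte Carlo sketch illustrates why the correlated and independent components cannot simply be lumped together. The split between die-to-die and within-die variation below is invented, but the two models clearly predict different statistics for a chip's worst path.

```python
import numpy as np

rng = np.random.default_rng(0)
n_chips, n_paths = 10_000, 50
global_sigma, local_sigma = 0.05, 0.03   # invented die-to-die vs. within-die spread

# Correlated model: every path on a chip shares one global (die-to-die) shift,
# plus its own independent local mismatch.
g = rng.normal(0.0, global_sigma, size=(n_chips, 1))
local = rng.normal(0.0, local_sigma, size=(n_chips, n_paths))
worst_correlated = (1.0 + g + local).max(axis=1)

# Naive model: the same total variance treated as fully independent per path.
total_sigma = (global_sigma**2 + local_sigma**2) ** 0.5
worst_independent = (1.0 + rng.normal(0.0, total_sigma, size=(n_chips, n_paths))).max(axis=1)

for name, w in [("correlated", worst_correlated), ("independent", worst_independent)]:
    print(f"{name}: median worst path {np.median(w):.3f}, 99th pct {np.percentile(w, 99):.3f}")
```

In this toy example the two models disagree noticeably about the typical chip, which is why the correlation structure has to be modeled rather than assumed.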

Manufacturing processes play a significant role here. “Process variations are now so large that designing for worst case and including wide guard bands is no longer seen as a valid approach,” points out Moortec’s McPartland. “It simply leaves too much of the performance advantages of moving to a smaller node under-utilized. New approaches are being explored that minimize the guard bands and optimize supply voltages on a per-chip basis.”

Geada agrees. “If you got close to a margin, you added a little more next time around. You would pay a little design penalty, but as you get more competitive and you start looking at very expensive technologies, it matters that you can get the most from your invested dollars.”

The problem is that previous types of analysis were design- and activity-agnostic. Today, companies only care that their design operates correctly, and that makes analysis more complicated.

“If you engineer every part of the chip for the absolute theoretical worst case, the amount of margin you build in is way too high,” says Mentor’s Hand. “IP companies will have to start providing system-level power models. Once you have the ability to do power modeling for the lower-level blocks, you can put them into a higher-level system model and start running real scenarios. Life starts to become easier because you are no longer guessing at what will happen.”
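
A minimal sketch of what that composition might look like appears below. The block names, power states, and numbers are invented; the point is simply that once each block exposes a state-based power model, a system-level scenario reduces to summing over a timeline.

```python
# Hypothetical block-level power models (average mW per operating state).
BLOCK_POWER_MW = {
    "cpu":   {"idle": 5,  "active": 250, "boost": 400},
    "gpu":   {"off": 0,   "render": 600},
    "modem": {"sleep": 2, "tx": 180},
}

# A "scenario" is a timeline of (duration_ms, {block: state}) steps.
scenario = [
    (10, {"cpu": "active", "gpu": "off",    "modem": "tx"}),
    (5,  {"cpu": "boost",  "gpu": "render", "modem": "tx"}),
    (50, {"cpu": "idle",   "gpu": "off",    "modem": "sleep"}),
]

peak_mw, energy_uj = 0.0, 0.0
for duration_ms, states in scenario:
    step_mw = sum(BLOCK_POWER_MW[block][state] for block, state in states.items())
    peak_mw = max(peak_mw, step_mw)
    energy_uj += step_mw * duration_ms      # mW * ms = µJ

print(f"peak power: {peak_mw} mW, total energy: {energy_uj} µJ")
```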

Design and activity issues
Looking only at legal activity for a chip limits the space that has to be considered, at least in theory. But it is not quite as simple as that. “We can reasonably define worst-case activity at the block level,” says Hand. “We also can do it at the subsystem level. But when you get to the chip level, software plays a huge role. It becomes an interesting challenge — how do you predict worst case today versus tomorrow? You may do a software update that will change what you perceive to be the worst case.”

Bringing activity into the picture may appear to always rein in the theoretical worst-case conditions. That is not always the case, however.

“Designs themselves have patterns of activity, and patterns of forbidden activity that bring their own correlations,” says Geada. “Temperature is an interesting space. On a tester, you force the chip to be at a uniform temperature. It has been heated or cooled to bring it to a uniform compliance temperature. The entire chip, for the brief period it is on the tester, will be uniform. In real-life usage, it will not be at a uniform temperature because a significant chunk of the temperature is going to be produced by the chip’s own operation. When you are dealing with a multi-core microprocessor, the core that is active will be hotter than the cores that are doing less. If you do analysis assuming uniform temperatures, you may miss a corner of the operating space where a hot core is talking to a cold core, for example.”
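
As a rough illustration of that blind spot, the sketch below uses an invented delay-versus-temperature coefficient to show how a thermal gradient between two cores turns into clock skew that a uniform-temperature analysis would report as zero.

```python
# Minimal sketch of the uniform-temperature blind spot Geada describes.
# The delay model and numbers are invented: assume each clock branch's delay
# increases roughly 0.05% per °C above 25 °C.
def branch_delay_ns(nominal_ns, temp_c):
    return nominal_ns * (1.0 + 0.0005 * (temp_c - 25))

CORE_TEMPS_C = {"core0": 95, "core1": 45}   # a hot core talking to a cooler one
NOMINAL_CLOCK_BRANCH_NS = 1.2               # identical branch layout to both cores

launch = branch_delay_ns(NOMINAL_CLOCK_BRANCH_NS, CORE_TEMPS_C["core0"])
capture = branch_delay_ns(NOMINAL_CLOCK_BRANCH_NS, CORE_TEMPS_C["core1"])
skew_ps = (launch - capture) * 1000

# On a tester both branches sit at the same temperature, so this skew is ~0.
print(f"thermally induced clock skew: {skew_ps:.1f} ps")
```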

In the past there was little connection between the various hierarchical levels of the development process. The development of a block of IP had no notion of how it would be used, and so had to assume the worst. That also happened at the subsystem level, where several IP blocks were aggregated together. But today, more sophisticated analysis tools are allowing some of these issues to be dealt with.

“As you look at Portable Stimulus (PSS) and analysis tools working together, you can start to take those worst cases that the designers know about, and it is possible to see if they can exist in the real world,” says Hand. “They can model it and make sure it works at the block level, and pass that knowledge up through the subsystem and system levels to ensure that those traffic patterns are validated as you bring the design together. PSS allows you to re-use the scenarios in both a top-down manner, where you can use the scenarios to test the blocks, and a bottom-up manner, where you generate the worst-case scenarios for all of the lower-level systems. You then merge that information together, build realistic traffic patterns, and start to see if you can hit these worst-case conditions.”

It takes time to bring it all together. “It is similar to the emergence of static timing analysis,” says Geada. “It started with people running vectors through simulation, and then systems became complicated enough that you were never sure if you had covered the worst cases. So automation was added, such that at every cell it would compute the minimum and maximum delay of every path and give you effectively a guaranteed pessimistic bound on the performance envelope for every path through the design. Then people started asking that very question — can that actually happen? The industry came up with SDC as a way for humans to interject domain knowledge into the system and say that certain things cannot happen. Set false path, set multi-cycle paths, etc.”
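
The sketch below mimics that progression on a toy netlist: compute a pessimistic bound over every structural path, then let designer knowledge remove paths that can never be exercised, which is the role SDC exceptions such as set_false_path play. The netlist and delays are invented.

```python
# Toy static-timing-style bound: enumerate every path, take the pessimistic
# maximum, then exclude paths a designer knows are never exercised.
EDGES = {                       # node -> [(next_node, delay_ns)]
    "regA": [("mux", 0.4)],
    "regB": [("mux", 0.9)],     # slow branch the designer knows is never exercised
    "mux":  [("adder", 0.6)],
    "adder": [("regC", 0.3)],
    "regC": [],
}

def all_paths(node, path=(), delay=0.0):
    path = path + (node,)
    if not EDGES[node]:
        yield path, delay
        return
    for nxt, d in EDGES[node]:
        yield from all_paths(nxt, path, delay + d)

def worst_delay(false_paths=frozenset()):
    paths = [(p, d) for start in ("regA", "regB") for p, d in all_paths(start)]
    paths = [(p, d) for p, d in paths if (p[0], p[-1]) not in false_paths]
    return max(paths, key=lambda pd: pd[1])

print(worst_delay())                                 # pessimistic bound via the regB path
print(worst_delay(false_paths={("regB", "regC")}))   # after "set_false_path"-style knowledge
```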


Fig. 1: Process corners at different nodes and conditions. Source: Arm

Mitigation
For many designs, finding the worst case is not the right strategy. “Incorporating in-chip PVT monitoring gives visibility into on-chip conditions and places designers in a much stronger position to design on the right side of the limit without over-designing,” says Moortec’s McPartland. “Strategies employed for thermal management range from simple thermal cut-off, where some or all of the circuitry is switched off or ramped down if a certain temperature is reached, to more sophisticated DFS and DVFS schemes, where the operating point and power in terms of clock frequency and supply voltage can be controlled and dropped to a lower level. Similar things are happening with IR analysis. Designers need to know what to pay attention to and what not to pay attention to.”
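
A control loop of the kind McPartland describes can be sketched in a few lines. The thresholds, operating points, and monitor readings below are placeholders, not recommendations.

```python
# Minimal DVFS/thermal-cut-off policy driven by an in-chip temperature monitor.
OPERATING_POINTS = [            # (freq_mhz, vdd_v), fastest first
    (2200, 0.85),
    (1600, 0.75),
    (800, 0.65),
]
THERMAL_CUTOFF_C = 110          # hard cut-off: stop the clock entirely
THROTTLE_UP_C = 95              # step down one operating point above this
RELAX_C = 80                    # step back up below this

def next_operating_point(current_idx, temp_c):
    if temp_c >= THERMAL_CUTOFF_C:
        return None                                   # shut down / clock-gate
    if temp_c >= THROTTLE_UP_C:
        return min(current_idx + 1, len(OPERATING_POINTS) - 1)
    if temp_c <= RELAX_C:
        return max(current_idx - 1, 0)
    return current_idx                                # hold in the hysteresis band

# Example: a burst of activity heats the die, then it cools back down.
idx = 0
for temp in [70, 92, 98, 104, 112, 90, 76]:
    idx = next_operating_point(idx, temp)
    if idx is None:
        print(f"{temp} °C: thermal cut-off")
        idx = len(OPERATING_POINTS) - 1               # resume at the slowest point
    else:
        print(f"{temp} °C: {OPERATING_POINTS[idx]}")
```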

System throttling is not always an option. “Systems that live under hard real-time constraints cannot back off,” points out Geada. “If you are controlling your anti-lock braking system or an obstacle detection system, you want that to fire very predictably — regardless of your operational temperature. You have no leeway. But if I am playing a game and I lose a few frames per second, I probably won’t even notice. Whether you have the luxury to do that depends upon the application and the constraints placed on it.”

But there are still things they can do. “Any design, no matter how worst case gets defined, has to have some power management,” says Hand. “You have to validate those mechanisms to make sure they work. Then, if there is a new worst case, the question becomes, ‘Do you have the ability to change the power management architecture within the chip itself, or at least in the software, and what are the degrees of freedom?'”

Conclusion
While dealing with worst case is a significant challenge, the industry believes it has a good handle on the problem. “None of this bothers me too much from the verification side,” says Hand. “We have a reasonable handle on it. The bigger challenge for the industry is how do they gain the necessary visibility when they are putting an SoC together. Nobody wants to design for ultimate worst case. That is not manufacturable or economically viable.”

How close to the limit do SoC development teams get? “We see most, if not all, SoC teams pushing the limits to extract maximum performance, whether that is maximizing processing power in AI, minimizing power consumption for smart phones, or maximizing reliability in automotive,” says McPartland.


