Have Margins Outlived Their Usefulness?

Why big data techniques are critical to building efficient chips.


To automate the process of solving complex design problems, the traditional approach has been to partition them into smaller, manageable tasks. For each task, we have built the best possible solution, which we continuously refine over time. We have managed the interdependencies between tasks by defining boundaries, or margins; these often have been best- and worst-case values used to quantify the impact of other tasks on the task currently being solved.

This worked well as long as the interdependencies were small or relatively uniform across the design. We could achieve design closure and had confidence that the design was nominally optimized. However, we often overlooked the fact that this optimization was achieved only within the bounds of a margin-driven methodology. Since the best- and worst-case values defined a ‘box,’ the end result was constrained to fit within that box, which prevented us from reaching other, potentially better solutions and optimizations.

Of course, this isn’t a surprise to most designers. Consider a multidimensional optimization problem. The current approach is to optimize one dimension at a time, relying on best- and worst-case values for the other dimensions. The result will be a nominal solution, but most likely not the optimal one. Ideally, you would solve for all variables simultaneously, driving each toward its optimum. Achieving design closure this way has been practically impossible because of the time and compute resources required. Almost all design automation software in use today was architected 20+ years ago, with databases and data structures built for single machines, so it does not lend itself to distributing the problem over multiple machines, each with multiple cores. Performance scaling saturates very quickly.
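To make the contrast concrete, here is a minimal Python sketch of the two approaches on an invented, coupled two-variable cost function. The cost function is a stand-in for a real design metric, not an EDA model: one path fixes the second variable at a pessimistic worst case and optimizes the first alone; the other sweeps both together.

    # Toy illustration: optimizing one variable with a margin on the other
    # vs. exploring both variables jointly. The cost function is an invented
    # stand-in for a coupled design metric, NOT a real EDA model.
    import itertools

    def cost(x, y):
        # Coupled cost: the best x depends on y, and vice versa.
        return (x - 0.6 * y) ** 2 + (y - 0.8) ** 2 + 0.5 * x * y

    grid = [i / 100 for i in range(101)]       # search each variable on [0, 1]

    # 1) Margin-driven: fix y at its assumed worst case, optimize x alone.
    y_worst = 1.0                              # pessimistic "margin" for y
    x_margin = min(grid, key=lambda x: cost(x, y_worst))
    margin_result = cost(x_margin, y_worst)

    # 2) Joint optimization: sweep both variables together.
    x_joint, y_joint = min(itertools.product(grid, grid),
                           key=lambda p: cost(*p))
    joint_result = cost(x_joint, y_joint)

    print(f"margin-driven: cost={margin_result:.3f} at x={x_margin:.2f} (y fixed at {y_worst})")
    print(f"joint search:  cost={joint_result:.3f} at x={x_joint:.2f}, y={y_joint:.2f}")

Even on this toy problem the joint search lands on a noticeably lower cost, because the margin-driven path can never exploit the coupling between the two variables.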

For System-on-Chip (SoC) or ASIC designs created on advanced process technologies, there are strong interdependencies between the architecture defined, the applications targeted, the layout style used, the power delivery network designed, the thermal signature seen, and so on. And there is no real boundary between the chip, the package and the system, even though we design them separately. These interdependency effects are neither small nor uniform across the design. So the traditional approach of treating these elements as independent of each other and solving them with best/worst-case values to mimic their impact is no longer viable, especially if one is looking to reduce both design costs (size) and design schedules.

Let’s look at power grid design and voltage drop signoff as an example. Today a design team might partition a signoff threshold of 15% noise into 10% for the chip, 3% for the package and 2% for the board. They would analyze the design across only a very small fraction of all operating modes and, to compensate for the lack of signoff coverage, they would over-simulate and pad the results with additional margin. At higher supply voltages (say, 1.2V), this approach worked reasonably well, but it resulted in over-designed power grid widths and an increased need for de-caps. Even though the design would meet its performance goals, it would be larger than needed and possibly late.
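For reference, here is the arithmetic behind that budget split, worked out in a short Python sketch. The 15/10/3/2 percentages come from the example above; the two supply values are illustrative points, not signoff recommendations.

    # Absolute noise budgets implied by the percentage split above.
    def budget(vdd_mv, chip_pct=10, pkg_pct=3, board_pct=2):
        return {name: vdd_mv * pct / 100
                for name, pct in [("chip", chip_pct),
                                  ("package", pkg_pct),
                                  ("board", board_pct),
                                  ("total", chip_pct + pkg_pct + board_pct)]}

    print(budget(1200))   # 1.2 V supply: chip 120, package 36, board 24, total 180 (mV)
    print(budget(500))    # 0.5 V supply: chip  50, package 15, board 10, total  75 (mV)

The same percentage budget that left 180 mV of absolute headroom at 1.2V leaves only 75 mV at 0.5V, which is why the next paragraph’s noise numbers are so troubling.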

This changes completely for finFET-based designs, because these devices switch faster, creating significantly higher di/dt from each switching event. They are also more densely packed, creating considerable localized power surges. Combined with increased rail parasitics (both on the chip and on the package/PCB), this can produce 25% to 30% peak-to-peak voltage fluctuation on top of a 500 mV supply. What’s more, there is no guarantee that this noise is the worst that will be seen, since the analyses are based on limited simulation scenarios. Your only recourse is to heavily over-design, creating wide rails and big via arrays across the chip, or multiple power planes on the package with a dense array of bumps. This bloats chip area (cost) and significantly complicates closure (schedule).
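To put those percentages in absolute terms, and to show how quickly first-order L·di/dt and IR contributions can reach them, here is a back-of-the-envelope Python sketch. Only the 25% to 30% of 500 mV figure comes from the text above; the inductance, current and resistance values are assumed examples, since real parasitics vary widely from design to design.

    # First-order check on the quoted noise magnitudes.
    vdd = 0.5                       # V, the finFET-era supply from the text
    print("25-30% of supply:", 0.25 * vdd, "to", 0.30 * vdd, "V peak-to-peak")  # 0.125 to 0.15 V

    # Illustrative contributors (assumed values, not from the article):
    L_loop = 50e-12                 # H, assumed package + grid loop inductance
    di_dt = 1.5e9                   # A/s, assumed aggregate current ramp
    I_peak = 2.0                    # A, assumed localized peak current
    R_grid = 0.03                   # ohm, assumed effective grid resistance

    print("L*di/dt contribution:", L_loop * di_dt, "V")   # 0.075 V
    print("I*R contribution:    ", I_peak * R_grid, "V")  # 0.06 V

With these assumed values the two first-order terms alone add up to roughly 135 mV, already inside the quoted 125-150 mV range on a 500 mV rail.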

You’ll see similar problems in other domains. In reliability analysis, EM and ESD are strongly coupled to temperature, and contrary to the assumption built into today’s simulations, thermal effects are far from uniform across the die and across realistic use cases. Coupling between clock jitter and supply voltage noise (also variable across the die and across use cases) is another example. In each of these cases, the problem is not that we lack tools to do the analyses, but that we lack ways to effectively model the interdependencies in a coupled simulation environment. Ideally, instead of analyzing and optimizing one dimension at a time and settling for whatever optimum is visible, we could explore the multi-dimensional space and find a far better optimum that, in these cases, would translate into savings in routing resources, package routing, thermal cooling, and so on.

You could argue that this is not a big deal. Sure, you’re leaving something on the table, but you get to a workable result in a reasonable time, so it’s all good. But that “workable” result is becoming increasingly less workable. In the voltage-drop example, die area expands significantly, and it may take a lot more work to close timing. Worse yet, if you didn’t over-margin enough, you may face unforeseen reliability problems. These problems will only get worse as design schedules are compressed, ultra-low-voltage operation becomes more common, and technology continues to scale.

The software architectures of the EDA solutions used to solve these design problems are inherently limited, forcing compromises in order to cope with the increasing size and complexity of today’s designs. These solutions also impose strict constraints on the compute resources they can use. In the face of these challenges, design teams will be forced to keep partitioning their tasks and to lean even more heavily on margins to converge on their designs. So even as they are required to shrink their chips and pull in their schedules, the limitations of the available methodologies will force them to over-design, without necessarily preventing the failure scenarios that lead to extended development cycles.

A modern SoC or ASIC is no less complex than the big-data problems being solved routinely with innovation happening outside of EDA. To solve these complex multi-physics problems of power noise coupling, signal integrity, thermal/EM/ESD reliability and EMI/EMC compliance, while reducing chip-package-system costs, it is imperative to leverage big data techniques. These techniques can provide nearly unlimited scalability on commodity compute resources, enabling rapid multi-variable analysis and design-weakness feedback to drive multi-domain optimization.
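As one illustration of the kind of scale-out being argued for, the Python sketch below fans a large number of analysis scenarios out across worker processes and reduces the worst-case result. Here evaluate_scenario is an invented placeholder for a real chip/package/board analysis, and multiprocessing stands in for an elastic, commodity compute cluster.

    # Sketch of scenario-level scale-out: evaluate many operating modes in
    # parallel and reduce the worst case, instead of padding a handful of
    # simulations with margin. 'evaluate_scenario' is a placeholder, not a
    # real analysis engine.
    from multiprocessing import Pool
    import random

    def evaluate_scenario(seed):
        # Placeholder analysis: returns a fake worst voltage-drop figure (mV)
        # for one operating mode / vector set identified by 'seed'.
        rng = random.Random(seed)
        return seed, 60 + 40 * rng.random()

    if __name__ == "__main__":
        scenarios = range(10_000)          # far more modes than a margin flow would cover
        with Pool() as pool:
            results = pool.map(evaluate_scenario, scenarios)
        worst = max(results, key=lambda r: r[1])
        print(f"worst drop {worst[1]:.1f} mV in scenario {worst[0]}")

The point of the sketch is architectural: when the analysis can be partitioned by scenario and distributed, coverage grows with the compute you throw at it, rather than with the margin you pad onto a few simulations.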


