New Design Approaches At 7/5nm

Smaller features and AI are creating system-level issues, but traditional ways of solving these problems don’t always work.

The race to build chips with a multitude of different processing elements and memories is making it more difficult to design, verify and test these devices, particularly when AI and leading-edge manufacturing processes are involved.

There are two fundamental problems. First, there are much tighter tolerances for all of the components in these designs due to proximity effects. Second, as a result of those tighter tolerances, better characterization data is required. However, the behavior of these chips or their component parts can’t always be characterized precisely. This is especially true for AI chips, as well as chips that include AI, where predictability and measurability are relative terms. Two devices may start out identical and diverge over time due to different use cases or environmental conditions.

“With AI, it’s not just silicon—it’s also software,” said Raik Brinkmann, CEO of OneSpin Solutions. “You can measure what’s going on with the training set, but you have limited visibility into how the network will interpolate from that. So two chips may still perform in a similar way, but they might not be at the same accuracy. The big question here is whether it is accurate enough.”
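
To make the “accurate enough” question concrete, here is a minimal sketch in Python of how two nominally identical devices might be compared on the same evaluation set. The error rates and matching tolerance are invented for illustration; they are not measurements from any real chip.

```python
# Hypothetical comparison of two nominally identical AI chips on the same
# evaluation set. All numbers below are invented stand-ins, not silicon data.
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=10_000)        # ground-truth classes

# Start both devices from perfect predictions, then inject small, independent
# error rates to mimic per-device divergence over time.
pred_chip_a = labels.copy()
pred_chip_b = labels.copy()
flip_a = rng.random(labels.size) < 0.021
flip_b = rng.random(labels.size) < 0.024
pred_chip_a[flip_a] = rng.integers(0, 10, size=int(flip_a.sum()))
pred_chip_b[flip_b] = rng.integers(0, 10, size=int(flip_b.sum()))

acc_a = np.mean(pred_chip_a == labels)
acc_b = np.mean(pred_chip_b == labels)
TOLERANCE = 0.005  # assumed spec: accuracies may differ by at most 0.5 points
print(f"chip A: {acc_a:.4f}  chip B: {acc_b:.4f}  delta: {abs(acc_a - acc_b):.4f}")
print("within tolerance" if abs(acc_a - acc_b) <= TOLERANCE else "out of tolerance")
```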

Some of these issues are new, others are not. But they are cumulative at advanced nodes, which is where most of these designs are being developed, increasingly driving design teams to tackle solutions at the system level rather than using traditional divide-and-conquer approaches. So while dynamic power density, static leakage and thermal issues are well understood at 16/14nm, the tolerances for all of those are much tighter at 7nm and below. Thinner insulating layers and wires have made digital circuits as susceptible to noise as analog circuits. For example, electromagnetic interference was largely ignored by digital designers at larger nodes, but it has become challenging at 7nm. The same is true for various other types of noise, including power, heat and vibration.

Historically, the most expedient and cost-effective solution to these kinds of issues has been to add margin into chips—extra circuitry to guard-band against any possible issues that might arise. But as more heterogeneous elements are included in advanced-node designs, the ability to guard-band is much more limited. For one thing, margin adds cost, decreases performance and increases the amount of power necessary to drive signals over longer, skinnier wires. For another, there are too many components to guard-band everything. What worked for a single processing element in the past no longer works for multiple processors and dozens or hundreds of IP blocks in a 7nm or 5nm manufacturing process.

“We’re moving toward multi-application scenarios across different platforms,” said Max Odendahl, CEO of Silexica. “What you need to do is gather much more data. So now you have sub-object analysis. You need to understand how an array is being accessed. That will make a huge difference in how you’re going to stream data to your DSP or GPU, for example. You might be able to see there’s a bottleneck in the computation, but you may be blind to the reason. So there’s high-level stuff, where you’re synchronizing too much or you’re sending too much data to another system, versus whether you’re fulfilling your tasks and sending too little data. There are different layers, too, and you need to figure out where the bottleneck is. If it’s not in a critical path, you may not care whether it’s an optimized loop. But to find that out, you need that system view.”
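
As a rough illustration of that layered, system-level view, the sketch below ranks a handful of hypothetical tasks by whether compute or data transfer dominates, and only considers the ones on the critical path. The task names and timings are made up; the point is the triage logic, not the numbers.

```python
# Toy bottleneck triage: per-task compute vs. data-transfer cost (microseconds),
# plus whether the task sits on the critical path. All values are invented.
tasks = {
    # name:        (compute_us, transfer_us, on_critical_path)
    "preprocess":  (120,  40, True),
    "dsp_filter":  ( 80, 310, True),   # streams far more data than it computes on
    "gpu_infer":   (450,  90, True),
    "logging":     ( 60, 200, False),  # slow, but off the critical path
}

for name, (compute, transfer, critical) in tasks.items():
    dominant = "transfer" if transfer > compute else "compute"
    note = "critical path" if critical else "ignore for now"
    print(f"{name:12s} bottleneck={dominant:8s} ({note})")

# Only critical-path tasks are worth optimizing first; rank them by worst cost.
hot = max((t for t in tasks.items() if t[1][2]), key=lambda t: max(t[1][:2]))
print("optimize first:", hot[0])
```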

Different approaches at 7/5nm
All of these issues are exacerbated by smaller feature sizes, as the whole supply chain is discovering. Test chips already are under development at 5nm, with research underway at 3nm. The consensus today is that at 3nm some form of gate-all-around transistor will be needed to replace finFETs, whether those are nanosheets or horizontal nanowires. All of those developments will add their own set of technical issues, ranging from variation to electrostatics to electron mobility, and will require some corporate restructuring to deal with the changes.

“Whatever happens at 5nm and beyond is different,” said Steve Lewis, marketing director at Cadence. “All of these chips have variation in design, but if you build larger chips, you have to start dealing with what used to be third-order effects. At smaller geometries, you also have to deal with finFETs. The good news is that structures are more regular. There is not the same degree of freedom as with planar transistors, so there is not as much variation along old tracks. That doesn’t make it easier, though, because they influence each other more. You also have new sources of variation because of the extreme size of the gates, the closeness of the transistors and routing congestion.”

Variation historically has been a problem for the foundries, which have added their own margin into chips. But as margin begins to impact performance and power—and as it becomes less effective in dealing with some of these issues—foundries and their customers have shared the problem with design teams and EDA vendors to figure out solutions. So now the entire design chain is being forced to deal with variation, from initial architecture to layout all the way through to design for manufacturing and test, and they are forced to make tradeoffs throughout the design process.

“We are now dealing with smaller and smaller budgets, and everything has to be tightened up,” said John Sturtevant, director of technical marketing at Mentor, a Siemens Business. “With resists and traceability, we are inching away at tiny fractions of a budget and we are starting to look at the probability of failure. With the sheer number of patterns, if you look at [lithography] dose and focus, even ignoring random effects, you need to look to the edge of a distribution.”
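
The arithmetic behind looking “to the edge of a distribution” can be sketched with a simple Gaussian tail calculation. The pattern count and sigma margins below are illustrative assumptions, not fab data, but they show why a tail that is negligible per pattern still matters across billions of patterns.

```python
# Expected pattern failures for a normally distributed parameter that must
# stay within +/- N sigma of nominal. Illustrative numbers only.
import math

def fail_prob(sigma_margin: float) -> float:
    """Two-sided Gaussian tail probability beyond +/- sigma_margin."""
    return math.erfc(sigma_margin / math.sqrt(2))

patterns = 10**10  # assumed number of printed patterns on a large die
for margin in (4.0, 5.0, 6.0, 7.0):
    p = fail_prob(margin)
    print(f"{margin:.0f} sigma: p = {p:.2e}, expected failing patterns ~ {p * patterns:,.1f}")
```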

The result is that where there used to be distinct steps, from design through manufacturing, there now are overlaps and blurred lines.

“Just following DRC is not good enough anymore,” said Cadence’s Lewis. “You need to know how you’re going to place and route a design, and you have to know that models are only approximations. The front end and the back end cannot work in silos anymore.”

One of the benefits of eliminating, or even softening, silos is greater innovation and lower resistance to change. So rather than trying to nail down everything that can possibly go wrong up front, there is more emphasis on adding resilience into systems to deal with the unexpected. This does require extra circuitry in some places, but the idea is that total margin will go down and the net impact is lower from an overall system standpoint.

“If you have lots and lots of processors, why not use those processors for doing analytics on chip and locally?” said Rupert Baines, CEO of UltraSoC. “So you route the analytics to a subsystem, and rather than using an expensive I/O, you’re using processors internally. The advantage of that is you can do it live. We have customers laying out their chips, and they’re using dedicated processors for this task. Some of them are doing this for safety and security applications. The analytics use data to prevent safety failures, potential hacks and malware, and they’re doing that live and dynamically on the chip, observing traffic patterns as they flow past. In addition, they can act incredibly quickly because you’re not sending anything off-chip.”
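
One way such a live monitor could work is sketched below: an exponentially weighted running mean and variance over bus-traffic samples, flagging anything far outside recent behavior. This is a generic illustration of the idea, not UltraSoC’s implementation, and all thresholds and traffic values are assumptions.

```python
# Streaming anomaly detector with constant memory, suitable in spirit for
# on-chip analytics. Model, thresholds and traffic numbers are invented.
class TrafficMonitor:
    def __init__(self, alpha: float = 0.05, threshold_sigma: float = 4.0):
        self.alpha = alpha        # smoothing factor for the running statistics
        self.k = threshold_sigma  # flag samples this many sigmas from the mean
        self.mean = None
        self.var = 0.0

    def observe(self, bytes_per_cycle: float) -> bool:
        """Update running stats; return True if this sample looks anomalous."""
        if self.mean is None:     # first sample just seeds the estimate
            self.mean = bytes_per_cycle
            return False
        deviation = bytes_per_cycle - self.mean
        anomalous = self.var > 0 and abs(deviation) > self.k * self.var ** 0.5
        # Exponentially weighted mean/variance update.
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation**2)
        return anomalous

mon = TrafficMonitor()
samples = [100 + (i % 7) for i in range(500)] + [900]  # steady traffic, then a spike
flags = [mon.observe(x) for x in samples]
print("anomaly at sample:", flags.index(True) if True in flags else None)
```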

The AI uncertainty principle
Where this really can have an impact is in automotive design, where 7nm AI chips are being developed for autonomous and semi-autonomous vehicles. Automotive OEMs are demanding these chips last for 18 years with zero defects, but it’s not always clear how to define a defect, let alone prevent one from slipping into the market.

AI brings its own set of issues, as well, some of which are well understood and some of which are not. From a hardware design standpoint, there are a number of concerns.

“What’s interesting about AI is that it’s always on,” said Mike Gianfagna, vice president of marketing at eSilicon. “And almost every design has some AI content these days for things like predictive analysis and resource utilization. The goal is to make these designs more efficient and get more out of the same footprint. But if you add in margin, will it still perform to the same specs in a few years? At a local level, you still have a SerDes and a processor running firmware, but there’s also adaptive behavior for a system environment. And it may not just be a chip. It may be a system-in-package. So now you’re looking at the performance of a chip, a SerDes and the next chip, and anything between them, which could be the HBM PHY and memory.”

AI adds a whole new level of confusion and uncertainty into chip design, and that needs to be viewed in the context of the system rather than the individual processing elements or even subsystems. This makes the traditional approach of guard-banding far less effective for several reasons:

  • Algorithms are in a constant state of flux, which makes it difficult to develop the perfect optimization.
  • While more processing elements can speed up both training and inferencing, those processors are always on and need to be kept busy to be efficient. Whenever there is work to do, maximizing the available logic circuitry is critical, and adding margin reduces that processing capability.
  • Results will vary over time because the whole idea behind AI systems is that they adapt to different use cases and environments.

“One of the big issues is how to transfer that knowledge to a chipset,” said Ron Lowman, strategic marketing manager for IoT at Synopsys. “So there is a process for developing the algorithm. But then you have to figure out how to transfer that to a chipset, whether that’s for training or inference. That adds some new challenges for the application of those algorithms, but also power and memory on the chip. What we’re seeing is that exploration is more important with AI architectures.”
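
The exploration Lowman describes can be approximated with a simple design-space sweep. The cost model below is deliberately toy, with invented throughput and power formulas standing in for real characterization data, but it shows the shape of the loop: enumerate candidate architectures, score them, and keep the Pareto-efficient points.

```python
# Toy architecture exploration: sweep processing-element count and on-chip
# SRAM, score each point with an assumed cost model, keep the Pareto set.
def throughput(pes: int, sram_mb: int) -> float:
    # Compute scales with PEs until an assumed memory bandwidth wall bites.
    return min(pes * 1.0, sram_mb * 2.5)      # TOPS, illustrative only

def power(pes: int, sram_mb: int) -> float:
    return 0.4 * pes + 0.15 * sram_mb + 1.0   # watts, illustrative only

candidates = [(p, m) for p in (8, 16, 32, 64) for m in (4, 8, 16, 32)]
scored = [(throughput(p, m), power(p, m), (p, m)) for p, m in candidates]

# A point survives if no other point matches its throughput at lower power.
pareto = [s for s in scored
          if not any(o[0] >= s[0] and o[1] < s[1] for o in scored)]
for tops, watts, (p, m) in sorted(pareto):
    print(f"{p:3d} PEs, {m:3d} MB SRAM -> {tops:5.1f} TOPS @ {watts:5.1f} W")
```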

The question is how to best achieve that in a design when the final outcome isn’t always so well-defined.

“Verification becomes more statistical, and you cannot give any guarantees,” said OneSpin’s Brinkmann. “So you push the boundaries of what you can guarantee toward the system-level. If the process you’re using doesn’t give you the right results, there is no chance of pinpointing problems. But these platforms are related to a high degree, so at the system level you can focus better on what goes wrong.”
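
What “verification becomes more statistical” can look like in practice is sketched below: Monte Carlo sampling of a pass/fail check under variation, reporting a confidence interval on the pass rate rather than a hard guarantee. The Gaussian timing model is a stand-in assumption, not a real signoff flow.

```python
# Monte Carlo pass-rate estimate with a 95% confidence interval.
import math
import random

random.seed(1)

def path_meets_timing() -> bool:
    # Assumed variation model: path delay ~ N(0.92ns, 0.04ns) vs. a 1.0ns budget.
    return random.gauss(0.92, 0.04) < 1.0

N = 100_000
hits = sum(path_meets_timing() for _ in range(N))
p = hits / N
half = 1.96 * math.sqrt(p * (1 - p) / N)  # normal-approximation interval
print(f"estimated pass rate: {p:.5f} +/- {half:.5f} (95% CI)")
```

The result is not a guarantee, but it is a quantified bound that can be composed and reasoned about at the system level.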

This represents a completely different way of looking at the verification problem, and it points to a much more selective use of margin in advanced designs.

Conclusion
There are multiple trends underway at 7/5nm and in systems that either include AI, or which are themselves AI chips, and they point the way toward some pronounced changes in design. One is that engineering teams need to obtain and make sense of much more data than in the past.

“This is like using Google Earth,” said Silexica’s Odendahl. “It’s not good enough if you see a hotspot but you don’t know what’s going on. You need to tie this back to the original source code. So we could run the same model twice and compare the two iterations, where one time something ran in 2 milliseconds and another time it ran in 25 milliseconds. Then you need to figure out what went wrong. So we need to do static analysis, dynamic analysis and semantic analysis to really understand what the root cause is, where it came from, and how it affected the system.”
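
That workflow can be sketched as a diff of per-function timings between the fast and the slow run. The profiles below are invented to mirror the 2ms/25ms example; a real flow would pull them from instrumentation tied back to source locations.

```python
# Hypothetical per-function timings (milliseconds) from two runs of the same model.
run_fast = {"load_weights": 0.3, "conv_block": 1.2, "postprocess": 0.5}
run_slow = {"load_weights": 0.3, "conv_block": 23.6, "postprocess": 1.1}

total_regression = sum(run_slow.values()) - sum(run_fast.values())
deltas = sorted(((run_slow.get(fn, 0.0) - t, fn) for fn, t in run_fast.items()),
                reverse=True)
for delta, fn in deltas:
    share = delta / total_regression if total_regression else 0.0
    print(f"{fn:14s} +{delta:5.1f} ms ({share:6.1%} of the regression)")
# The top entry is where static, dynamic and semantic analysis should focus next.
```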

Engineers also need to take that highly granular data and apply it at the system level, which could involve one chip or a combination of chips in a package or in multiple packages. And finally, they need to recognize that not everything in a design can be predicted up front or verified even at first silicon, particularly when systems are developed to adapt.

While many of these problems are well understood individually, developing solutions that can both understand what is happening at the block or circuit level—and then apply them at the system level—represents a significant shift in chip design. And EDA companies, which have been eyeing vast opportunities at the system level, are about to get a closer look.
