Customization, Heterogeneous Integration, And Brute Force Verification

Why new approaches are required to design complex multi-chip systems.


Semiconductor Engineering sat down to discuss why new approaches are required for heterogeneous designs, with Bari Biswas, senior vice president for the Silicon Realization Group at Synopsys; John Lee, general manager and vice president of the Ansys Semiconductor business unit; Michael Jackson, corporate vice president for R&D at Cadence; Prashant Varshney, head of product for Microsoft Azure; Rob Aitken (Arm fellow at the time of this panel, now distinguished architect at Synopsys); and Mohamed Elmalaki, senior principal engineer at Intel. Part one of this discussion is here.

SE: We’re seeing many more heterogeneous designs produced in lower volume. For those, do you use the same tools as you do for high-volume designs, which may have multiple test chips and iterations?

Jackson: There will be a mix of both custom designs and reuse in these systems, just as there is a mix of them today in a single die. Both will need to be leveraged to address the overall system and design objectives. In terms of the fabric that’s going to be customized, there will be certain tools for that. If there is some processor type application, and it’s all custom, you’ll apply one methodology. If you’re utilizing some of these reusable components and chiplets, you’ll effectively be taking them and integrating them into the overall system. A lot of the same tools are being used for both scenarios, and they’re both needed to pull together the overall system.

Biswas: The same tools will be used for these heterogeneous systems as for other chips. What is different is the methodology, and that will change depending on the different kinds of designs. Whether you are doing processor cores, a low-power design, or I/O, there are different methodologies you will use. Some will be custom-heavy. Some will be more digitally oriented. And then some will be integrated together. That connection between the heterogeneous chip systems is extremely important because there are IPs that are going to be created to connect them. What methodologies are we going to use to analyze or design each of these chiplets or units by themselves, and together? The change will be in the methodologies, not in the tools that are used to create them.

SE: We’re also seeing more physical effects than in the past, including power, thermal, all sorts of noise. How do you deal with these?

Lee: The lines between system and silicon have blurred. We no longer can margin, or guard-band, at a die level. If you look at logic die, you are bringing compute and memory closer together with an interposer. You may add in RF communications on a separate die, or move silicon photonics and electrical compute closer together in an advanced package. But you also have a new set of problems to deal with. You run into electromagnetic interference between the die where you have a bus line, and it may unintentionally switch because of inductive coupling, either on that die or from adjacent die. The thermal integrity issues are real as you put multiple die on an interposer. That can increase temperature, which can cause mechanical stress and strain, which can affect the reliability of a part. So for custom silicon versus more standard chips, or low volume versus high volume, it all comes down to the content. Increasingly, we’re putting systems into automotive, which is a very harsh environment that also requires very high reliability. These methods need to evolve or be strengthened in order to ensure the electronic reliability of the system. The challenge here is that you may be designing all of the silicon yourself, in which case you have full visibility. But what happens when you’re buying silicon from a supplier, and that supplier doesn’t want to give you all the details — the IP, the GDS, the schematics? How do you ensure thermal and electromagnetic reliability? That’s a big challenge.

SE: Will all of this complexity require more brute force computing for simulation, emulation, and analysis?

Varshney: Verification's share of the overall design cycle could be anywhere from 50% to 70%, depending upon the customer, but it’s absolutely right in terms of the compute hours spent doing an SoC. As a cloud vendor, we are looking at verification of a standard workload to see how you can optimize the infrastructure or architecture required for integrating that workload, and how that architecture is changing going forward. There’s a lot of slicing and dicing of the workload itself going on right now. How are we doing memory management and storage management in the cloud, and the data transfer to the cloud and back? L3 cache has become more important, for example. It’s the same thing in storage. We are looking at high-performance network storage for verification workloads. We see infrastructure as a dimension of optimization. We are working very hard with new technologies to keep it on that track, but we also are working closely with the EDA vendors to see what we can do together.

Biswas: Even when we talk about a brute force approach, we need to look at it in a more intelligent way, which is how the software matches to the hardware. Functional verification is inherently a parallel problem. There is distribution of it across multiple cores, multiple systems, and it can be leveraged in the cloud to get the scalability that is required. But having said that, as we go to heterogeneous systems, you can imagine this today with the hard IP now becoming chiplets. The scale is enormous. So we will need to think of other ways to divide and conquer the problem. You can call this a hierarchical approach. We need to be thinking about the debug as an extremely critical piece, because while we can do the simulation, we won’t get the insights from inside the design to prevent bug escapes. One of the things we’re seeing with heterogeneous design, as well as on a single chip, is the need to take advantage of the data at each step. We have something that is AI-augmented that can be added on top of the data to improve coverage closure, analytics, and debug that comes out of the verification and simulation.
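As a rough illustration of why functional verification parallelizes so naturally, the sketch below fans independent test seeds out across a worker pool and merges the coverage each run reports. Everything in it is an assumption made for illustration: `run_test` is a hypothetical stand-in for launching a real simulation, the coverage bins are invented, and a thread pool is used only to keep the example self-contained. A production flow would more likely dispatch jobs to separate processes or cloud machines.

```python
from concurrent.futures import ThreadPoolExecutor
import random

NUM_BINS = 64  # hypothetical number of coverage bins in the testbench


def run_test(seed):
    """Stand-in for one simulation run: returns the set of coverage bins hit."""
    rng = random.Random(seed)  # per-test RNG, so each run is independent and reproducible
    return {rng.randrange(NUM_BINS) for _ in range(16)}


def run_regression(seeds, max_workers=8):
    """Run every seed independently and merge (union) the coverage results."""
    covered = set()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Tests share no state, so they can be scheduled in any order.
        for bins in pool.map(run_test, seeds):
            covered |= bins
    return covered


def coverage_percent(covered):
    """Percentage of the hypothetical coverage bins that were hit."""
    return 100.0 * len(covered) / NUM_BINS
```

Because the merged result is a plain set union, it does not depend on how many workers are used, for example `run_regression(range(50), max_workers=1) == run_regression(range(50), max_workers=8)`. That independence is what lets this kind of workload scale out; it does not by itself address the debug and divide-and-conquer concerns raised above.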

Jackson: This definitely will require more brute-force computing. Tomorrow’s 2.5D and 3D designs will place greater demand in terms of functional verification, especially when it comes to capacity and performance. This applies to other forms of verification, too. We’re talking about functional verification here, but we’re also seeing an explosion in complexity for the other aspects of verifying the design, including thermal, power, integrity and timing. The various simulation technologies are going to need to evolve. Emulation is going to be playing an important role. The complexity of the system is going to pose questions in terms of how you verify things and where do you get the vectors. The best way to do that today is with emulation, which can run thousands of times faster than using standard simulation techniques. Being able to boot software earlier in the cycle is always important, but when you do that in a system that is heterogeneous and has fixed die as well as custom silicon, there are going to be problems that need to be solved to enable all of this.

SE: In the past, foundries had enough margin in their processes to fix power or layout issues. Over the past several nodes, that’s shifted left into the design side. What impact does that have in terms of tools and how much processing needs to be done?

Lee: It’s made the problem much tougher. If you think about a single die pumping in 400 or 500 watts of power, and at the same time you want to operate at low voltage and run off a battery as much as possible, those are conflicting challenges. The answer if you’re running at low voltage and you have massive die and compute, and then you have noise that comes from power integrity or other sources like other die, is that you need to have much better coverage. If you don’t have the right coverage you have to guard-band. But even margining or guard-banding is not sufficient. We’ve seen with the complexity of multi-physics that timing and power and thermal and electromagnetics all interfere with each other. So it’s hard to come up with a reasonable guard-band. The answer to all of this is better compute, better coverage, and better mathematical methods. AI/ML is one of those methods. There are other methods we’re bringing forth to help solve that, as well. But ultimately, to do everything flat and brute force is not practical, even with the cloud. In all other phases of designing chips, we’ve developed abstractions, modeling, and hierarchical approaches. The same now needs to be done across multiple die. There’s a need for multi-physics models that abstract the die in a way that gives you enough resolution so that when you look across multiple die you can really look at these multi-physics effects. There’s a lot of work ongoing today in terms of research and practical development.

SE: So is the future something that is good enough for 80% or 90% of the market? Or is it all going to be customized?

Elmalaki: The answer is product-specific. There is a market that will benefit from general-purpose, like when you have a codec that is running at 80%. But for other markets, chip vendors will need a solution that is 100% for their market. I’m not seeing any change. We will still put chips in both. It depends on the workload and how big the market is.

Aitken: When you start merging these things together they’ll be used in ways you didn’t think of, and things will start to break. An approach we’re going to have to start incorporating in a much more mainstream way is resilience. Architecturally these systems are going to have to recognize that things can go wrong, that things will go wrong, and that they need to react to them. That splits the verification problem into two classes: things that will break the system if they happen, and things that are bad if they happen, but which can be worked around if they’re identified in time and then dealt with. There’s a whole multi-layer abstraction to that, as well. You can put that in the software, so that if the software asks for a non-time-sensitive computation, and that computation doesn’t happen for whatever reason, then it just re-asks. That’s the simplest time-redundancy approach. But the verification problem is essentially an economic one. People spend 70% of their budget on it because that’s what it takes to get a chip out the door. If that shifts so that 95% of your budget is spent on verification and 5% is spent on the design itself, that’s not really sustainable. So we’ll have to think of better ways of doing it. Instead of just throwing brute force simulation at it and hoping for the best, we can target algorithmically what might actually go wrong. Of the billion things I’m simulating, which of these things is more likely to happen? That kind of thing already exists in circuit design. People don’t just throw random Monte Carlo simulations at stuff. They look at what might fail, and what might happen as a result. That same process has to happen in systems. Arm does spend an enormous amount of time validating and verifying processors. Other people who want to play in the chiplet space are going to have to put similar effort into verifying their stuff or nobody is going to use it.
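The "just re-asks" form of time redundancy mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of any vendor's mechanism: `request_with_retry`, `ComputationFailed`, and `make_flaky` are hypothetical names, and a fault is modeled simply as a raised exception.

```python
import time


class ComputationFailed(Exception):
    """Raised when a request still has no result after all attempts."""


def request_with_retry(compute, max_attempts=3, delay_s=0.0):
    """Ask for a result; if the computation does not happen, ask again.

    Returns (result, attempts_used). Only suitable for work that is not
    time-sensitive and is safe to repeat.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return compute(), attempt
        except Exception as err:  # any failure is modeled as a transient fault
            last_error = err
            if attempt < max_attempts:
                time.sleep(delay_s)  # optional pause before re-asking
    raise ComputationFailed(f"no result after {max_attempts} attempts") from last_error


def make_flaky(fail_times):
    """Hypothetical computation that fails `fail_times` times, then returns 42."""
    calls = {"n": 0}

    def compute():
        calls["n"] += 1
        if calls["n"] <= fail_times:
            raise RuntimeError("transient fault")
        return 42

    return compute
```

With these definitions, `request_with_retry(make_flaky(2))` returns `(42, 3)`: two failed attempts are absorbed and the third succeeds. A fault that persists past `max_attempts` raises `ComputationFailed`, which corresponds to the other class described above, the failures that cannot be worked around at this layer.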

EDA Gaps At The Leading Edge (part 1 of the above roundtable)
What’s missing from tool chains and methodologies as chiplets and advanced packaging become more popular.
