The Impact of Domain Crossing on Safety

Experts at the Table, part 3: What additional problems does this create in application areas such as automotive? When will we see more automated solutions?


Semiconductor Engineering sat down to discuss problems associated with domain crossings with Alex Gnusin, design verification technologist for Aldec; Pete Hardee, director, product management for Cadence; Joe Hupcey, product manager and verification product technologist for Mentor, a Siemens Business; Sven Beyer, product manager design verification for OneSpin; and Godwin Maben, applications engineering, scientist for Synopsys. Part one can be found here. Part two is here. What follows are excerpts of that discussion.

SemiEngineering roundtable: (Left to Right) Joe Hupcey, product manager and verification product technologist for Mentor, a Siemens Business; Godwin Maben, applications engineering, scientist for Synopsys; Sven Beyer, product manager design verification for OneSpin; Alex Gnusin, design verification technologist for Aldec; and Pete Hardee, director, product management for Cadence.

SE: The smartphone has pushed the need for multiple application domains more than anything else. As more companies move into domains such as automotive, is that changing things and adding extra dimensions of reliability and safety to the problem? How do you prove that each of the pieces works separately and that the whole thing is guaranteed to work together?

Hardee: The Mil/Aero space has been contemplating this for quite a long time.

SE: But they can afford any solution.

Hupcey: You might be surprised. They are using a lot of rad-hard FPGAs and that is it.

Hardee: Which would be an expensive solution for automotive. The sense I get is that automotive is still getting its arms around functional safety and reliability, and has not really contemplated how all of these things come together. I don't get the sense they completely comprehend how all of these things interact yet.

Hupcey: I take issue with that. We see the data points, and there are basically two types of customers in the industry—customers that are already serving the automotive industry and those that are about to. There is a sense of inevitability. The customers who are in the industry now were the first, two years ago, to ask us to qualify software for CDC, and that was the first tool we qualified across the company. There are some coming into the zone who realize this is another thing they have to do, and there is always that initial denial about it. They have to hire and train a new group of people, and that can delay entry into this space. People who have a high reliability requirement or a high cost of failure already have that, but there are a lot of people now entering the higher-cost-of-failure market, and that is an opportunity space.

Gnusin: I don’t think we can provide a 100% complete solution for CDC issues in safety-critical designs. The risk may be small, but it is still there, and we may have missed something. One of the safest methods is simply to run hardware verification of CDC issues. I am talking about amplifying CDC effects in hardware, using timing randomizers for signals injected into actual FPGAs, and running for, say, 20 hours with real stimulus. Then you can check that the FPGA keeps working in a deterministic way.
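The intuition behind that randomized-timing testing can be sketched behaviorally. The Python model below is purely illustrative, not a device characterization: it assumes a sample landing in the flop's setup/hold window resolves to a random value for one cycle, and shows why a second flop is given a full cycle to settle.

```python
import random

def two_flop_synchronizer(samples, violation_prob=0.1, seed=0):
    """Behavioral model of a 2-flop CDC synchronizer.

    'samples' is the asynchronous input as seen at each destination-clock
    edge. With probability 'violation_prob' (an assumed, illustrative
    number) a sample lands inside the setup/hold window and the first
    flop resolves to a random value that cycle, mimicking metastability.
    The second flop gives that value a full cycle to settle before use.
    """
    rng = random.Random(seed)
    ff1, ff2 = 0, 0
    out = []
    for d in samples:
        ff2 = ff1                       # second flop captures settled value
        if rng.random() < violation_prob:
            ff1 = rng.randrange(2)      # timing violation: random resolution
        else:
            ff1 = d                     # clean capture
        out.append(ff2)
    return out

# A slowly toggling input sampled over many destination-clock cycles:
stimulus = [(i // 8) % 2 for i in range(64)]
print(two_flop_synchronizer(stimulus))
```

With `violation_prob=0.0` the model degenerates to a plain two-stage pipeline; raising it mimics the stress that hardware timing randomizers apply.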

SE: But you still have the problem that you need to ensure that the stimulus it is seeing is not only representative, but covers the entire range of things that it may see in real life.

Gnusin: What I am suggesting provides much more stimulus than running simulation. No amount of simulation can compare with FPGA testing.

Hupcey: We should make clear that if you are waiting until gate-level simulation to try to deal with these issues, you are making a big mistake. Gate-level has its role, but the whole point is to stack verification. We do formal under the hood up front, we do it at RTL, and we get that as clean as possible, involving all of the domains. And then there is another round of signoff CDC that has to be performed because of all of the other things we talked about previously. Gate-level simulation is too late. You can’t make any meaningful tradeoffs. That is what is nice about RTL and even architectural analysis, where you can save yourself some issues about how to arrange things and make tradeoffs. This is much easier than any gate-level methodology.

Hardee: One realization that has taken hold in Mil/Aero and automotive is that CDC and RDC issues are not the only sources of metastability. EMI and power-line integrity also have been big issues for years in those spaces. That comes into the thinking as well. This is another aspect on top of considerations that these verticals have been dealing with for years.

Hupcey: You can add the fault domain.

SE: Would you care to explain that?

Hupcey: You have a design, and there is a layer of fault analysis, which some people talk about as the fault domain. This covers traditional manufacturing faults (stuck-at, bridging, etc.), but now you want to find transient faults as well, and have a scalable methodology. This requires techniques that can inject the faults in a meaningful way, and then collapse and isolate the safe faults: those that have no impact and cannot affect the outputs, or the logic that guards those outputs. That guarding logic is the safety mechanism, and you need to find out whether it is doing its job.
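That classification can be shown on a toy example. The Python sketch below assumes a duplicated gate with a lockstep comparator as the safety mechanism; the circuit, net names, and exhaustive simulation are illustrative stand-ins for what real tools do on full netlists with formal fault pruning.

```python
from itertools import product

def golden(a, b):
    return a & b                    # the function under protection

def run(a, b, fault=None):
    """Duplicated AND gate with a lockstep comparator as safety mechanism.
    'fault' sticks one copy's output at a value: ('copy1'|'copy2', 0|1)."""
    c1 = golden(a, b)               # copy 1 drives the functional output
    c2 = golden(a, b)               # copy 2 only feeds the comparator
    if fault:
        net, val = fault
        if net == 'copy1': c1 = val
        if net == 'copy2': c2 = val
    alarm = int(c1 != c2)           # comparator flags any mismatch
    return c1, alarm

def classify(fault):
    """Exhaustively classify a fault as safe / detected / dangerous."""
    dangerous = detected = False
    for a, b in product([0, 1], repeat=2):
        out, alarm = run(a, b, fault)
        if out != golden(a, b):     # output corrupted by the fault
            if alarm:
                detected = True     # ...but the safety mechanism caught it
            else:
                dangerous = True    # ...silently: undetected corruption
    if dangerous: return 'dangerous'
    if detected:  return 'detected'
    return 'safe'                   # never corrupts the observed output

for f in [('copy1', 0), ('copy1', 1), ('copy2', 0), ('copy2', 1)]:
    print(f, classify(f))
```

Note that faults on the comparator-only copy come out 'safe' here: they can raise a false alarm but can never corrupt the output, which is exactly the kind of fault a tool wants to collapse away rather than simulate.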

SE: For those just beginning to hit the problems, how should someone dealing with it for the first time approach the problem?

Hupcey: The vendors have been working with customers who have been doing this for 10 years. EDA as a whole is pretty well prepared to help customers. We each offer solutions. There are plenty of problems waiting to be solved, or which are still being grappled with. CDC itself is not a solved problem, but it is understood. Adding reset extends CDC with reasonably low risk. The interactions that we are talking about here — we are all still grappling with those and trying to get the cleanest methodology we can. It is complicated by different customer requirements. Some are more or less risk-averse. Some have many more IP blocks in their designs. Some are just looking at low power, which is pretty simple for them as they just turn on or off a whole channel, whereas others in the consumer and mobile application spaces are turning off every bit of circuitry that they can at every moment and they have a different level of complexity. We can tell customers that EDA is moving in this direction and is sensitive to it and can help.

Gnusin: Education is important. Our focus is on customers in the FPGA domain, so we concentrate on education within that domain, but many of the same issues exist between ASICs and FPGAs. Education does not need to go into the fine details of MTBF calculations or formal methods. We have to make clear what metastability issues are, what they mean, and why they happen. It is important for customers to understand these issues before using tools. Tools simply present the issues and help find them, but customers still have to understand and correct them.

Hardee: The issues have been understood for some time, but only by a subset of the engineers in any given company. What is happening is that increasing complexity has driven a real, reusable IP methodology, plus the need to manage dynamic power. Not just power intent, but the need to manage dynamic power, is driving the multiplicity of clocks in designs of any size. Those are the main factors making this not just the implementation engineer’s problem. It has to be brought forward, and it has become the designer’s problem. It is also the verification engineer’s problem. There is no escaping it. These issues have to be dealt with earlier in the flow.

Beyer: On the vendor side, we are obliged to do more in the automation space. Given the huge numbers of waivers, and the lengths some customers may have to go to, we have to look for anything that can be proven safe under the hood. If structural analysis, backed by formal or other technology, can add that extra layer of assurance, then an issue does not need to be reported at all. Try to make things more precise. We also can do more with grouping, making it more accessible and easier to use. But you still need the understanding.

Hupcey: We have our work cut out for us on automation. There is a lot more that we can automate, and a lot of advancement that can still be done. The results and the workflow need to make sense for the people coming into these job functions.

Maben: You have to look at it as a solution for the whole chip. It is not sufficient to look at one minor segment of the solution. When we consider clock and reset, I don’t see any design that still has one clock or one reset. Most people have understood it; it is a known problem, and solutions exist. With the areas that are getting added, people need to get educated, but the more they look at it as a solution rather than a point problem, the better they will be able to partition their designs. The solution space is expanding, and we have a lot more to work on. We need more automation. At the end of the day, the goal is to solve the problem as early as possible, and to solve it without depending on vectors. Structural analysis, formal: how far can we get? If we can reach 99% coverage by doing this, then we are successful.

SE: What would make the problem simpler and easier for everyone? A common waiver mechanism?

Maben: It has to be at the architectural level. But there are domain-specific architects, and it is difficult to find one SoC-level architect who knows everything. Even if one person knows everything, how do they pass the information down? That is missing today. IP comes together without it being known how it will be stitched. How do I define that information for the IP so it is known how to stitch it together?

Hupcey: Meta-constraints at the higher level, above the level of SDC files, would be beneficial. It is characterizing each IP and having a standard for how to describe the CDC, RDC and power behavior, so that it can be easily reused and transferred. If that IP is frozen, then that characterization can be used in the current design. Build that up in a hierarchy and you do not need to do a full flat analysis, because the IP is already telling you what it knows. It basically enables a bottom-up approach in a hierarchical manner.
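That bottom-up idea can be sketched as data. In the hypothetical Python model below, each frozen IP ships an abstract of its port clock domains plus the inputs it already synchronizes internally, and the top level checks only the stitching; the class, field names, and port names are all invented for illustration, since no such standard exists yet (which is the point being made).

```python
from dataclasses import dataclass, field

@dataclass
class IPAbstract:
    """Hypothetical abstract CDC model shipped with a frozen IP:
    the clock domain of each port, plus the inputs that the IP
    already synchronizes internally."""
    name: str
    port_domain: dict                               # port -> clock domain
    synchronized_inputs: set = field(default_factory=set)

def check_stitch(src_ip, src_port, dst_ip, dst_port):
    """Flag a top-level connection only when the domains differ AND the
    destination IP's abstract does not already synchronize that input."""
    if (src_ip.port_domain[src_port] != dst_ip.port_domain[dst_port]
            and dst_port not in dst_ip.synchronized_inputs):
        return f"unsynchronized: {src_ip.name}.{src_port} -> {dst_ip.name}.{dst_port}"
    return None   # crossing already covered by the IP-level characterization

cpu = IPAbstract("cpu", {"irq_in": "clk_cpu", "req_out": "clk_cpu"},
                 synchronized_inputs={"irq_in"})
dma = IPAbstract("dma", {"req_in": "clk_dma", "done_out": "clk_dma"})

print(check_stitch(cpu, "req_out", dma, "req_in"))   # flagged: dma has no sync
print(check_stitch(dma, "done_out", cpu, "irq_in"))  # clean: cpu syncs irq_in
```

Because the abstracts carry what was already verified inside each block, only the new inter-IP crossings need analysis at the top, which is what makes the hierarchical approach scale.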

Related Stories
So Many Waivers Hiding Issues
Experts at the Table, part 2: Domain crossings can produce thousands of waivers. How does a team put in place a methodology for dealing with them?
Domain Crossing Nightmares
Experts at the Table, part 1: How many domain crossings exist in a typical SoC today and when is the right time to verify their correctness?
UPF-Aware Clock-Domain Crossing
How to minimize the impact of CDC on power at RTL.
