Domain Crossing Nightmares

Experts at the Table, part 1: How many domain crossings exist in a typical SoC today and when is the right time to verify their correctness?


Semiconductor Engineering sat down to discuss problems associated with domain crossings with Alex Gnusin, design verification technologist for Aldec; Pete Hardee, director, product management for Cadence; Joe Hupcey, product manager and verification product technologist for Mentor, a Siemens Business; Sven Beyer, product manager design verification for OneSpin; and Godwin Maben, applications engineering scientist for Synopsys. What follows are excerpts of that discussion.

SemiEngineering roundtable: (Left to Right) Joe Hupcey, product manager and verification product technologist for Mentor, a Siemens Business; Godwin Maben, applications engineering scientist for Synopsys; Sven Beyer, product manager design verification for OneSpin; Alex Gnusin, design verification technologist for Aldec; and Pete Hardee, director, product management for Cadence.

SE: Initially, designers had to consider clock domain crossings (CDC). Then reset became an issue, and now designs have increasing numbers of power domains. Every time a signal crosses a domain boundary there is the potential for problems. Why did this become such a large problem?

Hardee: The management of dynamic power is the driver for this. It is important to use the correct clock frequency for each block, and this is driving the explosion of multiple clocks. When we talk about managing dynamic power we are not just limiting that to mobile or portable equipment. Today, high-end servers care about power. Everyone cares about it.

Hupcey: There are two additional domains that you did not mention—the test domain, which is coming into play in the CDC arena in unexpected ways, and the analog domain, which is also of interest. In terms of clock domains, it is pretty common to have over 200 clock domains. We see up to 50 different frequencies or asynchronous clocks. It is happening for all SoCs, which can have a few hundred IPs on them. Not all of those IPs come from the same vendor, so they have different behaviors that were not completely specified. Or the documentation is inadequate, which means there could be surprises when you assemble them together. Third-party IP is a big driver when you are working with an SoC. Even within IPs, they have their own internal behaviors, which may include multiple domains in terms of power, reset and clock.

Maben: Let me start by looking at clocks and reset. There are so many IPs being integrated into an SoC, and that creates many clocks. In order to reduce power, you do all kinds of tricks with these clocks to manage timing and power. That leads to many more domains. We are talking about 200 to 300 clock domains. On top of that you have reset domains, and there are many resets these days. I have worked on designs with more than 15 reset domains. Then you add power domains. You may have 60 or 70 power domains. When you put all of those together there are too many domains. How do we begin to concurrently solve the problem when each of them influences the others? That is the challenge.

Beyer: We live with the trend of everything getting bigger and bigger, and too much of anything becomes a problem. With all of the optimizations everyone is trying to do in terms of power, the number of dependencies that you get between the different domains is mind-boggling.

Gnusin: So far people have talked about synchronous design, but we also must consider the asynchronous events arriving in each domain. Any time there are asynchronous events there are issues. We have to divide this issue into two parts. One part is metastability, and the second is non-determinism. Metastability means that we may have excessive current flow, which could cause overheating. Non-determinism is a functional issue.
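To make Gnusin's distinction concrete, here is a minimal sketch in Python of the non-determinism half, using an assumed toy model rather than any real simulator. An edge passing through a standard two-flop synchronizer legally arrives after either two or three destination clock cycles, depending on phase alignment that no single run can pin down, so downstream logic must be correct for both outcomes.

```python
# Toy model (illustrative only): latency of one edge through a 2-flop
# synchronizer, measured in destination clock cycles. If the edge lands
# in the first flop's setup/hold window, the legal outcomes are "caught
# this cycle" or "caught one cycle later" -- both are valid hardware.
import random

def crossing_latency_cycles():
    hit_setup_window = random.random() < 0.1  # phase alignment is unknown
    extra = random.choice([0, 1]) if hit_setup_window else 0
    return 2 + extra

# Re-running the "same" stimulus can legally yield different latencies:
print({crossing_latency_cycles() for _ in range(1000)})  # e.g. {2, 3}
```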

SE: It would appear there are two aspects to this problem. One of them is wanted, an active way to reduce power. The other is unwanted and stems in part from the design methodology employed by the industry, which demands this variation in clock frequency. Even within IPs there are multiple domains. If things are getting so complex, isn't it time to rethink how we are doing this? What are we trying to accomplish, and is this getting us there?

Hardee: Complexity is a real issue. We are seeing a recent shift in where power optimization and CDC are handled. They have to be considered at the IP level and cannot be left to the chip integration level. Both of those started off as the responsibility of the chip integrator, who would have been responsible for the UPF and for deciding where the power domains were and when they would be switched on and off. You didn't have to worry about it until the integration stage. CDC typically was left until the chip-level netlist stage. Those methodologies are no longer workable, and you have to consider all of that at the IP stage. To cope with the complexity, the long-talked-about IP reuse methodology is absolutely real everywhere now. No one is designing an IP block for just one chip implementation. It is always designed with multiple chip implementations in mind. So the IP needs to be designed in a more robust manner, and the checks for power intent, clock domain crossing and reset domain crossing have to be dealt with in the IP design and not left until chip integration.

Hupcey: The customers that have been successful have characterized each IP, and if it spans more than one domain, have characterized it across the big three – reset, power and clock. That characterization becomes part of the IP and is communicated to the assembly and integration level, where there is another iteration. To leave it and do it all flat is impossible even for a medium-sized chip. There is a lot more responsibility on the IP maker to characterize their design. They need to capture that into its own reusable description, and that leads to more accurate analysis and higher throughput when you roll up the whole thing, just like the hierarchical methodology employed with everything else. You verify the RTL of the IP, then put it together into a larger set of blocks, then into clusters, and then the whole chip – the same activity has to occur here.
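As an illustration of the characterization-and-roll-up flow Hupcey describes, here is a minimal Python sketch. The record format and all names are hypothetical; real flows capture this information in tool-specific abstract models rather than ad hoc structures.

```python
# Hypothetical per-IP characterization record, composed hierarchically
# instead of re-analyzing the whole chip flat.
from dataclasses import dataclass, field

@dataclass
class DomainCharacterization:
    ip_name: str
    clock_domains: set[str] = field(default_factory=set)
    reset_domains: set[str] = field(default_factory=set)
    power_domains: set[str] = field(default_factory=set)

def roll_up(ips):
    """Compose block-level characterizations into one SoC-level record."""
    soc = DomainCharacterization(ip_name="soc_top")
    for ip in ips:
        soc.clock_domains |= ip.clock_domains
        soc.reset_domains |= ip.reset_domains
        soc.power_domains |= ip.power_domains
    return soc

uart = DomainCharacterization("uart", {"clk_periph"}, {"rst_periph"}, {"PD_AON"})
ddr = DomainCharacterization("ddr", {"clk_ddr", "clk_periph"}, {"rst_ddr"}, {"PD_MEM"})
print(roll_up([uart, ddr]))  # SoC-level view built from block-level records
```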

Beyer: It is basically following the good paradigm of divide and conquer. Everything that can be done at the block level should be done there once. And then we have the reuse aspect of that, which is already partially covered at this stage.

Hupcey: There is an interesting push-pull in the sense that you can't completely ignore the global view of the chip, because to some extent it is useless to do the lower-level analysis if you don't know what the clock tree is going to look like. IPs need to know their relationship to each other and the overall scheme. What can be done is to make sure that there is at least a first pass of the clock-tree definition, the reset tree and the UPF, so that the IP developer has some context to work with.

Hardee: That is the point of bringing the checks forward to the IP level. You can't assume that you are working with a single chip implementation of a clock tree. Bringing ideas of propagated clocks into that is premature for an IP that will be reused in multiple different implementations. Specifically for the CDC checks, to create robust IP, where you are not in charge of where it will be implemented, you do not assume an ideal clock. And you do not assume that different clocks are synchronous unless it is specifically specified that they are related. That is the way to make CDC-robust IP. Yes, you have to pay attention to clock trees for the final netlist signoff, but you don't know those at the IP design stage.

Maben: There are two things that we can do. First, there are certain things that can be done at the architecture level. And second, there are certain things that can be done at the tool level. The designer faces two issues. One is how much can be validated/verified just by structural analysis. An architect knows the clock architecture for the entire SoC. Then there is a reset architect who knows that architecture, and a power architect. When we put all of this together, how quickly can we solve or identify problems by structure? When it goes to functional validation, that is where it gets tricky. How can we ensure that most of the problems can be detected by structural checking and not depend on functional validation? When you look at CDC, RDC or power domain crossing problems, most of it can be done at the structural level. But there are certain things that can't. However, by looking at the structure you can pass on information that would be helpful for a functional team. With power domains coming into the picture, it becomes a lot more complex because now every IP comes with 10 or 15 power domains, and when you integrate, there are thousands. But the tools are going to optimize and reduce that. More emphasis needs to be put on this so that we have fewer silicon respins because of these kinds of issues.
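A toy version of the structural checking Maben refers to might look like the following Python sketch. The netlist model and the synchronizer recognition are deliberately simplified assumptions; production CDC tools also analyze convergence, glitches and crossing protocols.

```python
# Hypothetical flattened netlist: name -> (clock domain, fanin flops,
# whether the cell is a recognized synchronizer).
flops = {
    "tx_req":   ("clk_a", [],           False),
    "sync_ff1": ("clk_b", ["tx_req"],   True),
    "sync_ff2": ("clk_b", ["sync_ff1"], True),
    "rx_data":  ("clk_b", ["tx_req"],   False),  # raw crossing: a bug
}

def unsynchronized_crossings(netlist):
    """Flag any flop fed from another clock domain without a synchronizer."""
    violations = []
    for dst, (dst_clk, fanin, is_sync) in netlist.items():
        for src in fanin:
            src_clk = netlist[src][0]
            if src_clk != dst_clk and not is_sync:
                violations.append((src, src_clk, dst, dst_clk))
    return violations

# Flags rx_data but accepts the two-flop synchronizer chain:
print(unsynchronized_crossings(flops))
```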

Gnusin: I have already mentioned that it is possible to divide CDC issues into metastability issues and non-determinism issues. Metastability can be solved in a static way. We can ensure that we have the necessary flip-flops, etc. Non-determinism is a lot more difficult to fully analyze, and that analysis is required for high-reliability designs such as mil/aero. It is very hard to ensure that a CDC design works reliably, and currently we don't have complete solutions. We could take formal and make some additions, but it is still not sufficient. We don't have a complete solution for functional verification of CDC issues.

Hardee: I disagree that you can deal with metastability issues purely structurally, because what CDC checkers will do is make a structural check that there is a synchronizer in place such that a domain crossing should not suffer from metastability. But they are not perfect, and there is a tradeoff between how good a synchronizer you want and the possibility of glitches. Most solutions have synchronizers that are getting more complex to avoid glitches. For example, FIFO synchronizers use Gray code, and checking that the Gray code is being generated correctly is a functional check, not a structural check. On top of that, a lot of synchronizers will allow an occasional glitch, and you have to have some kind of metastability injection solution to see if that glitch would create a problem. There are a number of stages to these checks, and 90% of the time it probably can be done with structural checks. You have to do those first, but there are important functional checks that come after that.
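The Gray code property Hardee mentions is easy to state as a functional check: successive values of an asynchronous-FIFO pointer must differ in exactly one bit, so a crossing sampled mid-transition can be off by at most one count, never wildly wrong. A minimal Python sketch of that check, with illustrative names rather than any vendor's tool:

```python
def binary_to_gray(n: int) -> int:
    """Standard reflected Gray code; consecutive codes differ in one bit."""
    return n ^ (n >> 1)

def check_gray_sequence(values):
    """Assert that consecutive pointer samples differ in exactly one bit."""
    for prev, curr in zip(values, values[1:]):
        changed_bits = bin(prev ^ curr).count("1")
        assert changed_bits == 1, f"non-Gray step: {prev:04b} -> {curr:04b}"

check_gray_sequence([binary_to_gray(i) for i in range(16)])  # passes

try:
    check_gray_sequence([0b0011, 0b0100])  # a binary counter step: 3 bits flip
except AssertionError as err:
    print("caught:", err)
```

Reflected Gray code also changes only one bit on the wrap-around step, which is why a real FIFO pointer check covers the wrap as well.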


