Digital systems need clocks. Today’s designs require more from clocking schemes than ever before, and it’s likely this trend will continue.
Increasing power constraints have resulted in finer-grained partitioning of designs into functional domains that can have their clocks disabled or, more drastically, be powered down entirely. Systems are required to adaptively manage clocks to minimize switching power.
Performance and area constraints have led to the abandonment of more conservative practices in favor of more aggressive designs; for instance, removing register banks around memories that serve as buffers between functional or clock domains. Amid this increasing complexity, cost and time-to-market pressures raise the rigor of the verification required prior to the release of any system, SoC, or FPGA design, regardless of the industry or application.
Given these trends, clocking issues are more prevalent than ever in FPGA and system products heading into the lab or, more importantly, the market. According to the Wilson Research Group 2018 Functional Verification Study, clocking flaws are the second-leading cause of production issues in FPGAs, up almost 20% in the last six years, and are on pace to become the leading cause of FPGA production issues, as seen in Figure 1.
Figure 1. Historic trend of causes of FPGA production issues
The same Wilson Research Group study identified that in 2018, 84% of FPGA designs had non-trivial bugs escape into production, up 6% in two years as seen in Figure 2. So there is a real need to improve the adoption level and maturity of verification methodologies in the FPGA space.
Figure 2. Historic trend of number of non-trivial bugs in released FPGA designs
The good news for FPGA and system designers is that the tools to address these clocking issues have been developed and proven by the ASIC community and are readily applicable to the FPGA and system design space as well.
The trouble with clock domain crossings
Clocking issues, therefore, should be a significant concern in the FPGA development community. However, clocking issues are often misunderstood to be about the integrity of the clock itself. Rather, clocking issues often have to do with data corruption or signal loss across the boundary between two asynchronous clock domains, known as a clock domain crossing (CDC). As data or control signals pass from one clock domain to another, those signals have timing characteristics relative to the sourcing clock domain. These signals are eventually sampled in the receiving domain by a clocked element, such as a flip-flop. A flip-flop whose data changes too close to the clock edge can enter a transitional, indeterminate state whose duration is probabilistic. This is known as a metastable state. Many clock-boundary synchronization schemes exist to ensure that control or data signals are transferred accurately despite this metastable behavior. Without such a mechanism, incorrect values on the data or control signals can be sampled and erroneous behavior will result.
Metastability happens in systems with asynchronous clocks. If metastability is exposed to the functional part of the design itself, the resulting issues are difficult to debug in testing and even harder to debug if they get into production or the field. The indeterminate, probabilistic value of the flip-flop is affected by environmental conditions, and certain conditions may make metastability more likely. The probabilistic nature also manifests as non-deterministic behavior: given the same set of circumstances and conditions, sometimes the failure occurs and sometimes it doesn’t. Designs that treat CDCs lightly cost teams significant hours of debug in the lab or, worse, lead to customer quality issues in the field, challenging root cause analysis, and expensive rework.
For simple designs with one or two clocks, many clocking issues can be caught before a design release. They can be found by combining a disciplined methodology of basic lint clock checks, a limited set of synchronization schemes, detailed reviews of CDCs, and clock variations in simulation that are representative of the final system implementation. These analyses are tractable and reviewable since the number of clock domain permutations is no more than two.
However, as seen in Figure 3, the majority of FPGA designs now have at least three clocks. The share of designs with three or more clocks has risen six percent over the last six years, despite a small decrease from 2014 to 2016. The share of FPGA designs that exceed two clocks grew to 73% in 2018.
Figure 3. Historic trend of number of asynchronous clock domains per FPGA design
The number of CDCs requiring verification does not scale linearly with the number of domains, but with the permutations of domain crossings, that is, the ordered source/destination pairs. A design with four clock domains therefore has a potential worst case of 4 × 3 = 12 different permutations to check. When you add the complexities of clock management, power management, performance, and latency considerations, the techniques and solutions that work for simple designs simply do not scale. Manual reviews and stringent methodologies are not enough.
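To make the scaling concrete, here is a minimal Python sketch that enumerates the worst-case directed crossings between clock domains. The function names and the example domain names are hypothetical and chosen only for illustration; a real design typically has fewer crossings than this upper bound, since not every domain talks to every other.

```python
from itertools import permutations


def directed_crossings(domains):
    """Return every ordered (source, destination) pair of distinct domains."""
    return list(permutations(domains, 2))


def worst_case_crossing_count(n_domains):
    """Worst-case number of directed domain pairs: n * (n - 1)."""
    return n_domains * (n_domains - 1)


# Hypothetical domain names, used only for this example.
example_domains = ["clk_sys", "clk_mem", "clk_io", "clk_dbg"]
print(len(directed_crossings(example_domains)))  # prints 12

# The worst case grows quadratically with the number of domains.
for n in range(2, 9):
    print(n, worst_case_crossing_count(n))
```

With two domains the count is 2, which matches the simple case described earlier; with eight domains it is already 56.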
Limitations of simulation
As a result, many teams turn to functional simulations to verify their CDCs. The goal of such simulations is to identify when data is corrupted or signals are lost across the CDC. While this is a noble goal, there are several challenges that a team can encounter that can lead them to miss issues.
The first is that digital simulations, by definition, do not handle non-digital behavior well. Thus, the metastable behavior of a flip-flop or other storage element is not modeled in traditional digital simulations. As a result, verifying the correct synchronization against the metastable behavior is challenging.
Should a team conquer that issue successfully, it must next ensure that the constraints on the clocks precisely reflect the scenarios on the real clocks in silicon. Only by doing this will the simulations hit the scenario required to create and test clock domain crossings properly. This is not easily done exhaustively for complex designs.
The next challenge may seem counterintuitive: constraining all clocks correctly is insufficient. Synchronization issues only happen when data or control signals transit through the clock domain crossing on cycles in which the clocks are close enough to cause a timing issue. It is possible to add functional coverage on these events to ensure that they occur in simulation, but this will cover only the crossings that are known and won’t verify those that are inadvertent.
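One way to picture the events such coverage would target is to list, for two ideal clocks, the source edges that land just ahead of a destination edge. The Python sketch below is an illustration only: the function names, the integer time unit, and the single `window` parameter are assumptions made for this example, and it ignores jitter, drift, and hold-side effects that real clocks exhibit.

```python
from bisect import bisect_left


def rising_edges(period, phase, end_time):
    """Rising-edge times of an ideal clock, in integer time units."""
    return list(range(phase, end_time + 1, period))


def close_edge_pairs(src_edges, dst_edges, window):
    """Find source edges followed by a destination edge within `window`.

    dst_edges must be sorted in ascending order. Each returned pair is
    (source_edge_time, destination_edge_time).
    """
    pairs = []
    for s in src_edges:
        i = bisect_left(dst_edges, s)  # first destination edge at or after s
        if i < len(dst_edges) and dst_edges[i] - s <= window:
            pairs.append((s, dst_edges[i]))
    return pairs


# Example: a period-10 source clock and a period-7 destination clock.
src = rising_edges(period=10, phase=0, end_time=70)
dst = rising_edges(period=7, phase=3, end_time=70)
print(close_edge_pairs(src, dst, window=1))  # prints [(10, 10), (30, 31)]
```

Only two of the eight source edges in this run fall inside the window, which illustrates why a test must also move data through the crossing on exactly those cycles before it exercises the synchronizer.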
The final challenge has to do with the definition of a passing test. It is possible that a functional test may not actually fail even though data corruption or signal loss occurs through a clock domain crossing. The test has to be sensitized in some manner to the path in question, and not all tests focus on all paths.
In sum, relying on functional simulation is at best a very challenging prospect and at worst an error-prone one that will cause a design team to miss key CDC issues that will exist in the final system. Clearly, simulation is not enough. What is needed instead is exhaustive, non-simulation-based verification of clock networks and CDCs.
Borrowing CDC techniques from the ASIC world
Fortunately for FPGA and system designers, the challenges of multiple-clock-domain designs were generally encountered in ASIC systems-on-chip (SoCs) earlier than in FPGA and system designs. Since that time, the ASIC SoC industry has developed innovative and mature CDC verification techniques, resulting in a robust and broad CDC solution spectrum that has been deployed successfully for nearly 20 years. These techniques are readily applicable to the FPGA and system design space as well.
The success of the ASIC CDC verification effort can be seen in the differences in results between the ASIC community, in which the use of CDC verification is common, and the FPGA community, in which it is less common. As seen in Figure 4, the Wilson Research Group 2018 survey identified clocking issues as being the root cause of a functional flaw in an ASIC only 26% of the time, compared to 43% of the time for FPGA designs.
Figure 4. Differential between ASIC and FPGA verification and clocking flaws
Clocking issues are the third-leading cause of failures in ASICs and have been since 2016 (Figure 5). In comparison, they are the second-leading cause of failures for FPGAs (refer to Figure 1). This discrepancy is even more significant since it exists despite ASICs having 23% more clock domains on average than FPGAs in the 2018 survey (with 95% more in 2016!).
Figure 5. Historic trend of causes of ASIC functional flaws leading to respin
This higher rate of clocking failures in FPGAs, relative to the number of clocks, is evident when comparing Figure 3 with Figure 6.
Figure 6. Historic trend of number of asynchronous clock domains per ASIC design
So what do these successful CDC verification techniques look like? In addition to static formal RTL design analysis, the CDC solution space encompasses dynamic simulation-based solutions that add metastability models to the RTL simulations under the direction of the static formal CDC tool. These models inject random delays at every CDC identified by the exhaustive formal methods, so that each crossing is sensitized in simulation, and they add coverage reporting. These approaches address the more challenging issues discussed above with regard to verifying CDCs in simulation.
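The idea behind a simulation metastability model can be sketched in a few lines of Python. This is an assumption-laden illustration, not the behavior of any particular CDC tool: the function name, the `setup` and `hold` parameters, and the choice to resolve a violating sample randomly to the old or new value are all hypothetical simplifications.

```python
import random


def sample_with_metastability(old_value, new_value, data_change_time,
                              clock_edge_time, setup, hold, rng):
    """Model one sample taken by a receiving flip-flop.

    Returns (sampled_value, violated). If the data changes strictly inside
    the window (edge - setup, edge + hold), the sample is chosen at random
    between the old and new values, and violated is True so the event can
    be counted for coverage.
    """
    window_start = clock_edge_time - setup
    window_end = clock_edge_time + hold
    if window_start < data_change_time < window_end:
        return rng.choice((old_value, new_value)), True
    if data_change_time <= window_start:
        return new_value, False  # change met setup: new value is captured
    return old_value, False      # change came after hold: old value is captured


# Sweep a 0 -> 1 data change across a receiving clock edge at t = 10.0.
rng = random.Random(1)  # fixed seed so a failing run can be reproduced
for t in (9.0, 9.95, 10.02, 10.5):
    value, violated = sample_with_metastability(
        0, 1, t, clock_edge_time=10.0, setup=0.1, hold=0.05, rng=rng)
    print(t, value, violated)
```

The two middle cases fall inside the window and are flagged, so a downstream checker sees either value across runs; counting the flagged samples per crossing gives a simple form of the coverage reporting mentioned above.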
Netlist-focused CDC tools have been introduced as well that analyze the final implementation netlist ready for commitment to production. These tools are specifically tuned for netlist-focused performance, import constraints from prior CDC runs on RTL, and look specifically for new CDC violations that may have been introduced by implementation tools, such as synthesis or test insertion.
The features of the basic static CDC analysis tools have also been enhanced over time. Additions include automatic inference of synchronizer schemes from code analysis, specification-based design intent flows, and synchronizer protocol verification through both formal and dynamic simulation means.
Finally, while it is clear that there is an increasing need for the FPGA and system design communities to adopt CDC verification tools, caution should be exercised as well. There are significant differences in the capabilities of CDC products on the market in terms of maturity, quality of results, and completeness of analysis. Some are formal-based; others rely on netlist syntactic analysis to identify issues. Some are more accurate than others. Some are mature and ASIC-hardened, while others are relatively new to the market and are still in their infancy in terms of quality of results and features. It is critical that a team identify its needs and find the right solution that will scale with its roadmap and requirements.
In short, FPGA and system designs are incurring issues today that drove the ASIC SoC community to create, mature, and harden CDC tools years ago. This is good news for the FPGA and system design communities as the necessary tools are available today to significantly reduce risk and accelerate the path to market and revenue.