Understanding Scandump: A Key Silicon Debugging Technique

Repurposing DFT scan chains for functional debugging of complex and low-probability silicon issues.

Scandump is an advanced silicon debugging technique that repurposes DFT (Design For Testability) scan chains for functional debugging. It extracts the states of the registers and latches that are stitched into the scan chains, which makes it particularly valuable when the CPU is deadlocked or the system hardware becomes unresponsive. By capturing a comprehensive snapshot of the internal states, Scandump gives engineers the data they need to analyze system conditions and to identify and diagnose complex issues during silicon validation and debugging.
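
To make the extraction step concrete, here is a minimal Python sketch that models a scan chain as a shift register and maps the shifted-out bitstream back to flop names. It is an illustration under stated assumptions, not any partner's implementation: the flop names, the four-cell chain, and the helper functions are hypothetical, and real Scandump flows depend on each design's DFT methodology.

```python
from dataclasses import dataclass


@dataclass
class ScanCell:
    name: str   # hypothetical hierarchical flop name
    value: int  # state held when the clocks were stopped (0 or 1)


def shift_out(chain):
    """Return the bits seen at scan-out, one per scan clock.

    chain[0] is the cell nearest scan-in and chain[-1] the cell nearest
    scan-out, so the last cell's value emerges first.
    """
    cells = [c.value for c in chain]
    stream = []
    for _ in range(len(cells)):
        stream.append(cells[-1])   # bit currently visible at scan-out
        cells = [0] + cells[:-1]   # shift toward scan-out; scan-in tied low
    return stream


def decode(stream, chain_order):
    """Map the scan-out bitstream back to flop names.

    chain_order lists flop names from scan-in to scan-out (assumed to come
    from the DFT stitching information for the design).
    """
    return dict(zip(reversed(chain_order), stream))


# Hypothetical four-cell chain, for illustration only.
chain = [
    ScanCell("cpu0.lsu.req_valid", 1),
    ScanCell("cpu0.lsu.ack_pending", 0),
    ScanCell("cpu0.fsm_state[0]", 1),
    ScanCell("cpu0.fsm_state[1]", 1),
]

stream = shift_out(chain)
snapshot = decode(stream, [c.name for c in chain])
for name, bit in snapshot.items():
    print(f"{name} = {bit}")
```

The point of the sketch is that the raw dump is only a bitstream; it becomes a usable snapshot once each bit position is tied back to a named register through the chain order.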

What advantages does Scandump offer during silicon debugging?

Identifying a problem on silicon can be challenging because visibility into system activity is limited, and reproducing a hang can take considerable time. For instance, one partner experienced a silicon deadlock that took 48 hours on over 200 devices to reproduce once. Another partner needed a week and 80 devices to replicate a silicon hang. Debugging such low-probability silicon issues is extremely difficult when relying solely on software debugging tools like CoreSight. Scandump, however, allows internal states to be captured at the first occurrence, providing insights for further analysis that significantly accelerate troubleshooting.

Moreover, when issues arise in silicon, the other available debugging methods include CoreSight, PMU, and ELA. We generally combine all these methods for the most effective debugging approach. However, many debugging methods fall short when faced with complex silicon issues. For instance, while helping a partner debug a server chip issue in a highly intricate system with 256 cores, we encountered several challenges:

  • The external debugger could not halt the CPU cores when a system hang occurred.
  • Due to trace bandwidth limitations, the trace data could only capture information from a subset of cores, which was insufficient for comprehensive analysis.
  • While PMU (Performance Monitoring Unit) counters indicated that a CPU was not responding to a DVM snoop, they could not provide further details needed for effective debugging.
  • ELA (Embedded Logic Analyzer) could only capture specific signals that lacked relevant insights related to the issue.
  • The design team struggled to replicate the system hang scenario because it lacked sufficient information about the triggering conditions and the internal status.

In such cases, debugging could be nearly impossible without Scandump. When Scandump is available, it offers critical insight into the silicon’s internal states by capturing a snapshot of nearly all internal registers as they stood in a single cycle. This gives the design team rich information to identify and address the issue.

In some instances, the design team can identify the root cause directly from the internal states in the Scandump data, often needing only a few hours to pinpoint the source of the problem. They can then quickly reproduce the issue in simulation, which allows a complex RTL bug on silicon to be resolved within a couple of days. In some extremely complex cases, designers may not find the root cause directly from the Scandump data but can develop hypotheses based on it. They can then test these hypotheses through simulation or formal verification until the root cause is discovered.

Distinction between Scandump and CoreSight

Some partners have questioned whether Scandump is necessary when CoreSight is already integrated. It is therefore important to clarify the differences between the two, so that partners understand the distinct role each technology plays in system debugging and validation.

Standardization & Implementation
  • CoreSight: CoreSight represents Arm’s Debug and Trace Architecture and offers a standardized implementation for partners through products like CoreSight SoC-400 and CoreSight SoC-600.
  • Scandump: Scandump is a non-Arm debugging technique, independently developed by various partners. Each partner typically employs a unique implementation approach, tailored to their specific DFT methodologies.

Primary Purpose
  • CoreSight: Designed to provide detailed real-time monitoring and control capabilities that enable developers to debug software and hardware systems effectively.
  • Scandump: Designed for identifying and debugging hardware issues on silicon.

Capabilities
  • CoreSight:
      1. Run-control debug that enables a detailed insight into the CPU’s operations and the system’s activities.
      2. Instruction or data trace from an ETM to provide CPU’s historical operation information.
      3. Performance profiling.
  • Scandump: Provides a valuable one-cycle snapshot view of nearly all internal register states.

CoreSight can be beneficial in diagnosing certain silicon issues. For instance, if a deadlock scenario can be replicated through CoreSight while running test software, the design team can build an RTL simulation environment that mirrors that specific software scenario and debug the silicon issue through waveform analysis. However, CoreSight has limitations with silicon issues that cannot be effectively debugged through software debugging approaches, or that cannot be easily replicated with straightforward software code.

Scandump’s challenges

Scandump offers significant advantages when addressing system deadlocks or hangs on silicon because it provides comprehensive visibility into the internal status that other debugging tools do not match. Nonetheless, it can be less effective for issues that occur dynamically during system operation. In some cases, Scandump may not capture the precise moment an issue arises, which limits its usefulness in certain dynamic debugging scenarios.

Furthermore, the analysis of Scandump data presents its own set of challenges. Currently, there are no automated methods available for such analysis, and the data typically includes hundreds of thousands or even millions of register bits. Consequently, engineers must manually sift through this data to identify root causes. This process usually requires extensive collaboration and significant time investment from multiple engineers to resolve complex issues.
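
Small scripts do not replace this manual analysis, but they can narrow where engineers look first. The Python sketch below is one hypothetical triage aid, not an established method: it assumes the dump has already been decoded into a name-to-bit mapping and that a reference snapshot is available for comparison (for example, from simulation), neither of which is described in this article. The register names are invented for illustration.

```python
from collections import defaultdict


def diff_snapshots(reference, captured):
    """Return {name: (reference_bit, captured_bit)} for bits that differ."""
    return {
        name: (reference[name], captured[name])
        for name in reference
        if name in captured and reference[name] != captured[name]
    }


def group_by_block(diffs, depth=2):
    """Group differing flop names by their first `depth` hierarchy levels."""
    groups = defaultdict(list)
    for name in sorted(diffs):
        block = ".".join(name.split(".")[:depth])
        groups[block].append(name)
    return dict(groups)


# Hypothetical decoded snapshots, for illustration only.
reference = {
    "cpu0.lsu.req_valid": 0,
    "cpu0.lsu.ack_pending": 0,
    "cpu0.ifu.fetch_stall": 0,
    "cpu1.lsu.req_valid": 0,
}
captured = {
    "cpu0.lsu.req_valid": 1,
    "cpu0.lsu.ack_pending": 1,
    "cpu0.ifu.fetch_stall": 0,
    "cpu1.lsu.req_valid": 0,
}

diffs = diff_snapshots(reference, captured)
groups = group_by_block(diffs)
for block, names in groups.items():
    print(f"{block}: {len(names)} differing bit(s)")
```

Grouping by hierarchy only ranks blocks by how much they differ from the reference; deciding which differences matter still requires the engineers who know the design.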

Conclusion

Scandump is a powerful silicon debugging technique that offers unique insights into system states during critical failures. To ensure comprehensive and efficient resolution of complex silicon issues, it is best utilized in conjunction with other debugging methods such as CoreSight for run-control processes and real-time instruction/data tracing, PMU for monitoring internal events, and ELA for capturing specific logic signals. This integrated approach maximizes diagnostic capabilities, offering a complete view of system failures and facilitating a thorough understanding to resolve complex silicon issues.
