Reliability Verification For Smart ICs

Thermal, mechanical and electrical behavior need to be accurately modeled and simulated to prevent future problems.

By Arvind Shanmugavel
Smart ICs are the electronic brains behind today’s advanced systems, paving the way for advances in the consumer electronics, energy, biomedical, automotive and avionics industries.

Power efficiency and system integration are keys to the success of these smart systems. The IC industry has swiftly responded with state-of-the-art low-power techniques and chip integration initiatives for active power reduction and management. The need to design products that can reliably function under various operating and environmental conditions is imperative, so reliability verification is no longer an afterthought in the IC design flow. Complex multi-physics failure mechanisms in ICs have drastically changed the way we perform reliability simulations.

Power on demand
Today’s ICs no longer operate on a continuous basis. Functional blocks on ICs are powered up only to execute the operation that is required. They then go into standby mode, which drastically reduces the power the IC draws while idle. Power is supplied to these blocks on demand using complex power-gating techniques or on-die voltage regulator modules. Active power reduction techniques such as clock-gating and dynamic voltage and frequency scaling (DVFS) help to reduce unwanted power consumption during normal functional block operations. Design automation tools can design and verify the proper power intent seamlessly. Low-power techniques, including power-gating, clock-gating, and voltage and frequency scaling, are represented using the appropriate constructs in the Unified Power Format (UPF) or Common Power Format (CPF) and are verified for proper implementation. However, the electrical verification tools that simulate these complex behaviors in time are still evolving.

As we add more controls to manage power efficiency in ICs, there is a huge demand to simulate and verify the operation of the power delivery network (PDN). We can no longer afford to use outdated electrical verification techniques to simulate power-efficient ICs. Static voltage drop simulations, or simple dynamic voltage drop simulations on the PDN, will not adequately represent the complex switching behavior or power transitions. Operational reliability of these ICs must be checked more rigorously across multiple state transitions to ensure that power supply noise does not affect functionality.

Complex interactions between cores, peripherals, I/Os and IP cause a chip to go into different active states. These states can be accompanied by huge variations in power and current consumption. Identifying the critical state transitions and using them for electrical simulations of the PDN is essential for ensuring operational reliability. This identification can be done at the RTL stage, where millions of cycles are processed and cycles of interest can be selected for electrical simulations. The selected cycles should typically cover a range of high di/dt transitions and high power cycles. Once the cycles have been identified, electrical simulation of the PDN must be performed for dynamic voltage drop noise analysis. Directly simulating the electrical behavior of the PDN across millions, or even thousands, of cycles is typically prohibitive at the full-chip level. Using RTL power models that capture these state transitions with proper activity and power will enable full-chip PDN verification.
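
The cycle-selection step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-cycle power trace, the function names and the window length are hypothetical, and the cycle-to-cycle power step is used as a rough proxy for di/dt at a fixed supply voltage. It does not represent how any particular RTL power tool works.

```python
from typing import List, Tuple


def highest_power_window(power: List[float], window: int) -> Tuple[int, float]:
    """Return (start_cycle, average_power) of the highest-power window."""
    if window <= 0 or window > len(power):
        raise ValueError("window must be between 1 and len(power)")
    running = sum(power[:window])
    best_start, best_sum = 0, running
    # Slide the window one cycle at a time, updating the sum incrementally.
    for start in range(1, len(power) - window + 1):
        running += power[start + window - 1] - power[start - 1]
        if running > best_sum:
            best_start, best_sum = start, running
    return best_start, best_sum / window


def steepest_transition(power: List[float]) -> Tuple[int, float]:
    """Return (cycle, delta) for the largest cycle-to-cycle power step.

    The step magnitude stands in for di/dt (an assumption of this sketch).
    """
    if len(power) < 2:
        raise ValueError("need at least two cycles")
    cycle = max(range(len(power) - 1), key=lambda i: abs(power[i + 1] - power[i]))
    return cycle, power[cycle + 1] - power[cycle]


# Hypothetical per-cycle power trace in mW, e.g. exported from an RTL power run.
trace_mw = [10, 12, 11, 80, 85, 83, 20, 15]
window_start, window_avg = highest_power_window(trace_mw, window=3)
step_cycle, step_delta = steepest_transition(trace_mw)
```

The two selections generally pick different parts of the trace, which is why both a high-power window and a high di/dt transition are worth carrying into the electrical simulation.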

Integration: A reliability challenge
Performance, lower power, and form factor typically have been the driving forces behind SoC integration for semiconductor ICs. Increasingly, cores and IP, I/O-subsystems and graphics are being integrated on the same die. More revolutionary techniques such as three-dimensional ICs (3D-ICs) integrate the entire memory subsystem in a single package. With higher levels of integration comes higher complexity in reliability verification.

Generally, voltage islands provide the ability to either power down a functional block or operate the block at a lowered voltage. Powering down a block can drastically reduce the leakage power during standby mode because leakage current in a CMOS transistor is a function of the voltage difference across its drain and source terminals. Similarly, reducing the operating voltage of an island reduces its switching power quadratically. Both of these techniques are widely used to manage the active and standby power of low-power ICs.
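
The quadratic dependence on voltage follows from the standard first-order expression for CMOS switching power. As a short worked sketch, with the activity factor \alpha, switched capacitance C, clock frequency f and scaling factor k as symbols introduced here rather than taken from the article:

```latex
P_{\mathrm{sw}} = \alpha \, C \, V_{DD}^{2} \, f
\qquad\Longrightarrow\qquad
\frac{P_{\mathrm{sw}}(k V_{DD})}{P_{\mathrm{sw}}(V_{DD})} = k^{2}
```

For example, lowering an island's rail by 10 percent (k = 0.9) at the same frequency and activity cuts its switching power to 0.81 of the original, a 19 percent reduction.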

Reliability verification for multiple voltage islands on the same die is complex. Operational failures can occur when the voltage levels of these islands have not reached full-rail before a functional transaction happens. Power-gated islands typically turn on and off several hundred times within a fraction of a second. The ability of the voltage rails to recover from the power-down state to the on state must be simulated and analyzed in a transient fashion. In addition, the in-rush current during power-up is another reliability check that needs to be performed on power-gated designs because the amount of in-rush current can impact reliable operation of the voltage regulator module or neighboring functional blocks. These power transitions need to be analyzed not only around the locality of the block, but also with the impact of the package and board parasitics.
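
A first-order feel for the wake-up trade-off can be had from a lumped RC model. The sketch below assumes the gated island behaves as a single capacitance charged through a single power-switch resistance from an ideal supply; the function names and component values are hypothetical, and a real sign-off analysis would use a transient simulation with extracted grid, package and board parasitics.

```python
import math


def rail_voltage(t_s: float, vdd: float, r_ohm: float, c_farad: float) -> float:
    """Island rail voltage t_s seconds after the power switch closes."""
    return vdd * (1.0 - math.exp(-t_s / (r_ohm * c_farad)))


def peak_inrush_current(vdd: float, r_ohm: float) -> float:
    """In-rush current at the instant of turn-on, when the rail is at 0 V."""
    return vdd / r_ohm


def time_to_fraction(fraction: float, r_ohm: float, c_farad: float) -> float:
    """Time for the rail to reach the given fraction of full rail."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    return -r_ohm * c_farad * math.log(1.0 - fraction)


# Illustrative values only: 1.0 V rail, 0.5 ohm switch, 10 nF island capacitance.
VDD, R_SWITCH, C_ISLAND = 1.0, 0.5, 10e-9
i_peak = peak_inrush_current(VDD, R_SWITCH)
t_95 = time_to_fraction(0.95, R_SWITCH, C_ISLAND)
v_at_t95 = rail_voltage(t_95, VDD, R_SWITCH, C_ISLAND)
```

In this simplified model a larger switch resistance lowers the peak in-rush current but lengthens the time before the rail is close enough to full-rail for a functional transaction, which is the tension the transient analysis has to resolve.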

Integrating a large number of IPs with isolated power domains and substrate isolations poses another level of complexity for reliability verification. It is increasingly common to see USB modules, GPS modules, PLLs and RF components sharing the same silicon real estate with high-speed digital cores. Sensitive analog circuits will need to be verified with the impact of substrate coupling from their noisy digital neighbors. Substrate isolation schemes that are used for one process node may not scale appropriately for subsequent process nodes, and isolation techniques that filter out noise for one frequency may not work well for other frequencies. So, a rigorous simulation of the substrate isolation techniques needs to be performed before implementation.

Electrostatic discharge (ESD) is an event-based reliability failure mechanism that affects all voltage islands and I/Os. Every power domain or I/O port in a chip that has a path to a package pin must be protected by ESD devices. Furthermore, every combination of signal pin to power or ground pin needs to be protected through ESD devices. IP blocks that have their own isolated power domains will need to be checked for the placement of cross-domain ESD elements. Current density checks during an ESD event are also a big part of the reliability verification process. They ensure that on-die interconnects do not burn out during an ESD event. The verification complexity of the resistance and current density checks is staggering, considering the number of domains and pins found in today’s ICs.
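
The pin-pair coverage requirement lends itself to a simple enumeration. The following sketch assumes a hypothetical table of extracted discharge-path resistances keyed by (signal pin, supply pin); the pin names, resistance values and limit are made up for illustration, and the function is not part of any real ESD tool.

```python
from itertools import product
from typing import Dict, List, Tuple


def check_esd_paths(
    signal_pins: List[str],
    supply_pins: List[str],
    path_resistance_ohm: Dict[Tuple[str, str], float],
    max_ohm: float,
) -> List[Tuple[str, str, str]]:
    """Flag every signal-to-supply pin pair with no path or too much resistance."""
    violations = []
    for sig, sup in product(signal_pins, supply_pins):
        resistance = path_resistance_ohm.get((sig, sup))
        if resistance is None:
            violations.append((sig, sup, "missing"))
        elif resistance > max_ohm:
            violations.append((sig, sup, "too_resistive"))
    return violations


# Hypothetical extracted resistances in ohms; ("IO2", "VSS") has no ESD path.
extracted = {
    ("IO1", "VDD"): 0.8,
    ("IO1", "VSS"): 1.6,
    ("IO2", "VDD"): 0.9,
}
violations = check_esd_paths(["IO1", "IO2"], ["VDD", "VSS"], extracted, max_ohm=1.0)
```

The number of pairs grows with the product of signal pins and supply pins, which gives a sense of why the check becomes so expensive on chips with many domains.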

Integrated power management techniques such as on-die voltage regulators provide the flexibility for powering a voltage island without the overhead of an external power supply mesh. On-die voltage regulators provide the capability to reduce voltage levels in order to reduce switching power. They can also respond faster than a traditional off-chip regulator to switching currents, without consuming valuable package pins. With package pin count at a premium and the drive for lower power a necessity, on-die voltage regulators are becoming a standard in most mobile chips. Verification of on-die regulators using the appropriate loading conditions and state transitions is critical to the operational reliability of the IC. Full-chip simulations need to be performed to capture the behavior of the regulator along with the parasitics of the entire power grid.

Integration of 3D-ICs poses a unique set of challenges for reliability verification. The adoption of 3D-IC by the semiconductor industry is largely driven by power and performance, not by cost. Stacking dies closer in proximity reduces the interconnect length for cross-die communication. Through-silicon vias (TSVs) typically are used as connectors between different dies stacked one over the other. The total capacitance of these TSVs is far less than the capacitance of traditional package and board interconnects. Because switching power is directly proportional to the interconnect capacitance, TSVs offer a huge advantage in terms of power reduction. The Wide I/O standard that uses 512 bits to communicate across stacked ICs and memories has quickly become a JEDEC standard due to the advances in 3D-IC integration. The Wide I/O standard has been shown to drastically reduce power and increase bandwidth for memory interfaces.

Strong power-thermal interactions in 3D-ICs pose a unique set of challenges in verifying their thermal behavior. Thermal boundaries need to be well understood for each die, along with its neighboring dies. Modeling self-heat of TSV layers and thermal-conduction mechanisms at the boundaries of these dies is critical to understanding the entire thermal picture of 3D-ICs. A micron-resolution thermal analysis must be performed across the stacked ICs and the package to understand the thermal impact of one die on the other. Concurrent thermal analysis of the IC-package-chassis using appropriate models for each die is necessary for accurate power-thermal convergence at the system level.

Lifetime failure mechanisms such as electro-migration (EM) will need to be analyzed with true thermal impact, rather than a worst-case thermal envelope of the IC. Process migration has had a large impact on the amount of current that on-die interconnects can carry at higher temperatures. With physically smaller interconnects in advanced process nodes, the amount of current required to cause an EM failure is also lower. An accurate thermal profile of the die needs to be used when sizing power and signal interconnects to avoid overdesigning for EM.
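
The sensitivity of EM lifetime to temperature is commonly approximated with Black's equation, MTTF = A * J^(-n) * exp(Ea / (k * T)). The sketch below uses that textbook model, which the article itself does not name, to compare the lifetime at a local temperature with the lifetime at a worst-case envelope for the same current density. The activation energy and both temperatures are illustrative assumptions, not foundry data.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def em_lifetime_ratio(
    temp_actual_c: float, temp_worst_c: float, activation_energy_ev: float
) -> float:
    """Ratio MTTF(actual) / MTTF(worst case) from Black's equation.

    The prefactor A and the current-density term cancel because both
    lifetimes are evaluated at the same current density.
    """
    t_actual_k = temp_actual_c + 273.15
    t_worst_k = temp_worst_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (
        1.0 / t_actual_k - 1.0 / t_worst_k
    )
    return math.exp(exponent)


# Assumed values: 0.9 eV activation energy, a wire at 105 C inside a
# die whose worst-case thermal envelope is 125 C.
ratio = em_lifetime_ratio(105.0, 125.0, activation_energy_ev=0.9)
```

Under these assumed numbers the wire at 105 C is predicted to last roughly four times longer than the worst-case envelope suggests, which illustrates how a blanket worst-case temperature can lead to oversized interconnects.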

Concurrent multi-die analysis for IR drop is also a necessary check for reliable operation of 3D-ICs. The impact of a shared interposer or a shared TSV across multiple dies can significantly affect the operation of the IC stack. Typically, a concurrent analysis of all the operating dies along with the interposers needs to be performed for IR drop sign-off. Proper modeling techniques such as Chip-Power-Models that can represent the behavior of individual dies should be used when concurrent simulation is prohibitive.
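
A lumped DC example shows why analyzing each die on its own can be optimistic. The sketch assumes every die draws its current through one shared TSV or interposer resistance followed by its own grid resistance; the function names and all values are hypothetical, and a real analysis would solve the full extracted network.

```python
from typing import List


def ir_drop_concurrent(
    shared_r_ohm: float, die_r_ohm: List[float], die_current_a: List[float]
) -> List[float]:
    """Per-die IR drop when all dies draw current through the shared path."""
    total_current = sum(die_current_a)
    return [
        shared_r_ohm * total_current + r * i
        for r, i in zip(die_r_ohm, die_current_a)
    ]


def ir_drop_standalone(
    shared_r_ohm: float, die_r_ohm: List[float], die_current_a: List[float]
) -> List[float]:
    """Per-die IR drop if each die is analyzed as though it were alone."""
    return [(shared_r_ohm + r) * i for r, i in zip(die_r_ohm, die_current_a)]


# Illustrative values: 10 mOhm shared path, two dies with their own grids.
SHARED_R = 0.01
DIE_R = [0.02, 0.05]
DIE_I = [2.0, 1.0]
concurrent = ir_drop_concurrent(SHARED_R, DIE_R, DIE_I)
standalone = ir_drop_standalone(SHARED_R, DIE_R, DIE_I)
```

With these numbers the stand-alone estimate is 60 mV for each die, while the concurrent estimate is 70 mV and 80 mV, because the shared path carries the current of both dies.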

Design for reliability
Modeling failure mechanisms in today’s smart ICs and accurately simulating them have become a necessity. With the use of low-power techniques and high levels of physical integration, reliability verification tools need to model complex failure mechanisms. IC designers can no longer follow a ‘correct by construction’ approach for reliability in this climate. With the impending adoption of 3D-IC standards, multi-physics simulations of thermal, mechanical, and electrical behavior are a must for reliability verification.

–Arvind Shanmugavel is director of application engineering at Apache Design.