Achieving Physical Reliability Of Electronics With Digital Design

Why it’s important to use both simulation and testing to improve reliability and productivity in digital designs.


By John Parry and G.A. (Wendy) Luiten

With today’s powerful computational resources, digital design is increasingly used earlier in the design cycle to predict zero-hour nominal performance and to assess reliability. The methodology presented in this article uses a combination of simulation and testing to assess design performance, providing more reliability and increased productivity.

Reliability is “the probability that a system will perform its intended function without failure, under stated conditions, for a stated period of time.” The first part of this definition focuses on performance: the product must function as intended, without failure. The second part addresses usage: the conditions under which the product will be used. The third part addresses time: how long the product will be operating.

Figure 1: System development V-diagram.

The flow of digital design for performance is depicted by the V-model (Figure 1): requirements flow down, and capabilities flow up. Business and marketing requirements flow down to the system, then to the subsystems, and then to the components on the left-hand side of the V. After design, the capability of each component to fulfill its sub-function without failure is verified, followed by the subsystems and then the system. Finally, the full system is validated against business and marketing expectations.

Designing for Reliability in Three Parts
Digital design improves and speeds the verification step by calculating whether the specified system, subsystem, or component inputs will result in the required output. Digital design can also be used to guide architecture and design choices. For electronics cooling design and analysis, 3D computational fluid dynamics (CFD) software constructs a thermal model of the system at the concept stage, before design data is committed into the electronic design automation (EDA) and/or mechanical CAD (MCAD) systems. The model is then elaborated with data imported from the mechanical and electrical design flows to create a digital twin of the thermal performance of the product, which is then used for verification and analyses.

The second part of designing for reliability focuses on conditions: incorporating use cases for the different stages of the system’s life cycle, including transport, use preparation, first use, normal use, and end-of-use scenarios. The product should withstand normal transport conditions (drops, vibrations, temperature extremes) and maintain performance despite handling mistakes. During normal use, different loading conditions will occur in varying temperature and humidity environments. And after end-of-use, a product should be easily recycled to avoid environmental damage. These use cases represent scenarios well beyond typical use conditions in a lab environment. Digital design simulates specific steps in the life cycle, for instance, drop and vibration tests to mimic transport conditions, and “what-if” scenarios simulating worst-case environmental conditions.

The third part of the reliability definition is about the time span that a product is expected to perform its intended function without failure. This is measured by the failure rate, defined simply as the proportion of the running population that fails within a certain time. If we start with a population of 100 running units, and we have a constant failure rate of 10%, then at t = 1, 90 units (90% of 100) are still running and at t = 2, 81 (90% x 90) are running.
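The constant-failure-rate arithmetic above can be sketched in a few lines; the numbers below are the article's own example (100 units, 10% failure rate per period):

```python
# Survival of a population under a constant failure rate.
def surviving_units(initial: float, failure_rate: float, periods: int) -> float:
    """Units still running after `periods` intervals at a constant failure rate."""
    return initial * (1.0 - failure_rate) ** periods

print(surviving_units(100, 0.10, 1))  # 90 units running at t = 1
print(surviving_units(100, 0.10, 2))  # 81 units running at t = 2 (90% of 90)
```

Each period, the surviving fraction is multiplied by (1 − failure rate), so the running population decays geometrically.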

Figure 2: Bathtub curve showing the rates of failure over time.

In time, the failure rate changes. Hardware product performance can be illustrated by a bathtub curve (Figure 2). The first phase, infancy, has a decreasing failure rate as kinks are worked out of an immature design and its production. Example root causes of infancy failure include manufacturing issues arising from part tolerances, transport or storage conditions, installation, or start-up. This stage confirms that the manufactured product performs as designed. Note that, from the business perspective, failures do not refer to a single instance of a product but to the population the business produces. Temperature affects all parts of the bathtub curve, so the thermal performance of the system should be checked against the simulation model at this stage.

The next phase is normal life, the flat bottom of the bathtub curve. Random failures from various sources of overstress combine into a constant aggregate failure rate; overstress is defined as an excursion outside known safe-operating limits. In the third part of the curve, the failure rate increases as the product wears out over time with use.

Failure and Stages of Maturity
The V-diagram shows that reliability is ensured by adherence of the manufactured product to its requirements. Parts that do not meet these requirements are considered defective, with early failure assumed. Typically, higher levels are an aggregation of many lower levels, for example, an electronics assembly comprising multiple boards, with each board containing multiple components and an even larger number of solder joints. This also means that lower levels need progressively lower failure rates to ensure reliability at higher levels. In high-reliability environments, failure rates are expressed in terms of parts per million (ppm) and the process capability index (Cpk).
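To see why lower levels need progressively lower failure rates, consider how independent part defect rates compound at the assembly level. The sketch below uses hypothetical numbers (a board with 500 solder joints, each at 1 ppm), not figures from the article:

```python
# Defect rate of an assembly built from many independent parts:
# the assembly yield is the product of the part yields.
def assembly_ppm(part_ppm: float, part_count: int) -> float:
    """Defect rate (ppm) of an assembly of `part_count` independent parts."""
    part_yield = 1.0 - part_ppm / 1e6
    return (1.0 - part_yield ** part_count) * 1e6

# Hypothetical example: 500 solder joints at 1 ppm each.
print(round(assembly_ppm(1.0, 500)))  # ~500 ppm at board level
```

Even at 1 ppm per joint, the board-level defect rate is roughly the sum of the joint-level rates, which is why component-level targets must be far tighter than assembly-level targets.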

In the electronics industry supply chain, maximum acceptable failure rates for electronic assemblies start from a Cpk of 1.0, corresponding to 2,700 ppm falling outside the upper or lower specification limits. Large suppliers typically work to a Cpk of 1.33 (60 ppm), rising to a Cpk of 1.67 for critical parts (<1 ppm). In automotive applications, the increasing number of electronics subsystems (particularly for safety) is driving the supply chain to ever-lower defect rates, now approaching 1 ppm at the level of individual components.
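The Cpk-to-ppm figures quoted above follow from the normal distribution. A minimal sketch, assuming a centered process with normally distributed output, where the two-sided out-of-spec rate is 2 × (1 − Φ(3 · Cpk)):

```python
import math

def cpk_to_ppm(cpk: float) -> float:
    """Two-sided out-of-spec rate (ppm) for a centered, normally distributed process."""
    # One-sided tail probability beyond 3*Cpk standard deviations.
    tail = 0.5 * math.erfc(3.0 * cpk / math.sqrt(2.0))
    return 2.0 * tail * 1e6

print(round(cpk_to_ppm(1.00)))  # ≈ 2700 ppm
print(round(cpk_to_ppm(1.33)))  # ≈ 66 ppm (the "60 ppm" quoted above)
print(cpk_to_ppm(1.67))         # ≈ 0.5 ppm, i.e. < 1 ppm
```

A Cpk of 1.0 places the nearest specification limit 3 sigma from the mean, which is where the familiar 2,700 ppm figure comes from.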

A reliability-capable organization learns from experience and operates proactively. The IEEE 1624-2008 Guide for Organizational Reliability Capability defines five stages in a reliability capability maturity model (CMM), ranging from stage 1 (purely reactive) to stage 5 (proactive). Table 1 shows an extract from the matrix covering reliability analysis and testing, beginning with stage 2.

Table 1: IEEE1624 capability maturity matrix excerpt on reliability analysis and testing.

For a complex design, the multitude of failure conditions and use cases results in many potential failure modes that are costly and time-consuming to test in hardware. Hardware-based testing also requires a mature product late in the design cycle, so for a complex product, moving beyond a stage 1 approach requires predictive modeling. Digital design (computer simulations and modeling) is deployed from CMM stage 2. At the lower levels, this is purely performance- and environment-driven: can the product perform its intended function in all use cases, without failure, based on nominal inputs and outputs?

Pilot runs, manufacturing investments, and lifetime tests typically start after design freeze. They entail investments of time and money that do not allow for an iterative approach. Stage 2 companies therefore often deploy computer simulation as design verification before design freeze. Experience shows that design rework is often needed to meet the requirements of the parts’ safe-operating limits, such as a maximum ambient temperature.

By stage 3, virtual analysis should be highly correlated with failure conditions, for instance through field data and dedicated reliability tests, so that there is a high likelihood of detecting failures through virtual analysis before they happen. In design failure mode and effects analysis (DFMEA), a risk priority number (RPN) is assigned to each potential product failure as the product of scores for severity, occurrence, and detection. Increasing the likelihood of detection can lower the RPN by as much as 80%.
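A small worked example shows how better detection alone can cut the RPN by 80%. The scores below are hypothetical, on the common 1–10 DFMEA scales (10 = worst):

```python
# DFMEA risk priority number: RPN = severity x occurrence x detection.
def rpn(severity: int, occurrence: int, detection: int) -> int:
    return severity * occurrence * detection

# Hypothetical failure mode: severity 8, occurrence 5.
before = rpn(8, 5, 10)  # detection 10: failure only found in the field
after = rpn(8, 5, 2)    # detection 2: high likelihood of detection via virtual analysis
print(before, after)    # 400 80, an 80% reduction from improved detection alone
```

Severity and occurrence are unchanged; moving the detection score from 10 to 2 is what earlier, correlated virtual analysis buys.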

In CMM stage 4, simulation is typically used early in the design process to calculate both nominal performance and its statistical distribution. Failure is then assessed with more granularity: not as a yes/no binary outcome but as a probability of failure, the statistical capability of the design as expressed in Cpk. In the DFMEA, this lowers the RPN further by backing up the claim of a low or remote occurrence score. In thermal design, higher-CMM companies evolve to use measurements to underpin the fidelity of the simulation model by confirming material properties, bond-line thicknesses, and other parameters along the heat-flow path.

Early design models, such as the automotive ADAS control unit shown in Figure 3, simulated before component placement is closed in the EDA design flow, can be used to support the choice of cooling solution, apply deterministic design improvements, and explore the likely impact of input variables.

Figure 3: Initial design for automotive ADAS unit modeled in Simcenter Flotherm.

The combination of computer simulation and statistical techniques is powerful for addressing both nominal design and statistical design capability. In a design of experiments (DOE), a scenario consisting of a number of specific cases is calculated as an array of virtual experiments. The cases are selected to separate out the effects of individual inputs and of combinations of inputs, yielding the nominal performance output as a quantified function of the design inputs. At the lower CMM levels, this function can be used to choose design inputs so that the design meets its intended function under all stated conditions.
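The DOE idea can be sketched as a two-level full factorial on two hypothetical design inputs. In practice each run would be a CFD simulation; here a stand-in response function is used, and all factor names and numbers are illustrative only:

```python
from itertools import product

def simulated_temp(flow_hi: int, fins_hi: int) -> float:
    # Stand-in for a thermal simulation result (deg C); hypothetical coefficients.
    return 95.0 - 10.0 * flow_hi - 6.0 * fins_hi + 2.0 * flow_hi * fins_hi

# Full factorial over two factors at two levels (0 = low, 1 = high).
runs = [(f, n, simulated_temp(f, n)) for f, n in product([0, 1], repeat=2)]

def main_effect(runs, idx):
    """Mean response at a factor's high level minus its low level."""
    hi = [t for *x, t in runs if x[idx] == 1]
    lo = [t for *x, t in runs if x[idx] == 0]
    return sum(hi) / len(hi) - sum(lo) / len(lo)

print(main_effect(runs, 0))  # effect of fan flow on temperature: -9.0 deg C
print(main_effect(runs, 1))  # effect of fin count on temperature: -5.0 deg C
```

Because the cases span all factor combinations, the main effects (and the interaction) separate cleanly, giving the quantified input-to-output function the text describes.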

Becoming a Highly Capable Reliability Company
At higher CMM levels, the V-model also includes knowing the statistical distribution of the inputs and having a requirement on the allowed probability of failure, usually expressed as a Cp/Cpk statistical capability or a sigma level. Again, a DOE can determine output performance as a function of design inputs and noise factors; subsequently, the effect of noise and of the statistical distribution of the input factors can be determined through Monte Carlo simulation. For each design input and each noise factor, a random value is picked from the relevant distribution and substituted into the function to calculate the performance output. Repeating this a large number of times, say 5,000, yields a predicted data set of 5,000 values for the performance output, showing the expected statistical distribution, statistical capability, and failure rate.
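The Monte Carlo step can be sketched as follows. The transfer function, distributions, and spec limit are all hypothetical stand-ins for what a DOE on a real thermal model would produce:

```python
import random
import statistics

random.seed(1)  # reproducible sketch

def junction_temp(ambient: float, power: float, rth: float) -> float:
    # Hypothetical transfer function from a DOE: Tj = Tambient + P * Rth.
    return ambient + power * rth

USL = 110.0  # assumed upper spec limit on junction temperature, deg C

# 5,000 virtual experiments: sample each input/noise from its distribution.
samples = []
for _ in range(5000):
    ambient = random.gauss(45.0, 3.0)  # deg C
    power = random.gauss(10.0, 0.5)    # W
    rth = random.gauss(5.0, 0.3)       # K/W
    samples.append(junction_temp(ambient, power, rth))

mean = statistics.fmean(samples)
sigma = statistics.stdev(samples)
cpk = (USL - mean) / (3.0 * sigma)  # one-sided capability (upper limit only)
print(f"mean={mean:.1f} C  sigma={sigma:.2f}  Cpk={cpk:.2f}")
```

The 5,000 sampled outputs give the predicted distribution directly, and the Cpk against the spec limit quantifies the expected failure rate rather than a yes/no verdict.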

Figure 4: Workflow for combining digital and statistical design.

The workflow for a higher-CMM company is shown in Figure 4, with the results of the capability analysis of the 5,000 simulations shown for an improvement to the design in Figure 3. The demonstrated Cpk of 1.05 is well below 1.33, so the expected failure rate far exceeds the acceptable ppm level. Because a low failure rate is sought, the number of Monte Carlo experiments needed is high, as illustrated in Figure 5.

Figure 5: Prediction of junction temperature for critical IC7 component for 5,000 simulations, accounting for statistical variation in input parameters using HEEDS.

A Proactive vs. Reactive Approach
Lower-CMM organizations take a reactive approach to high levels of failure in normal use, that is, nominal calculations addressing the failure rate in the flat part of the bathtub curve. Mature organizations work in more fields simultaneously and deploy both nominal and statistical modes of digital design specific to the different parts of the bathtub curve: product infancy, normal use, and wear-out. Stage 5 CMM organizations also invest in understanding the root causes of the failure mechanisms underpinning random failures in normal life and wear-out.

Assessment of the package’s thermal structure is used to calibrate a detailed 3D thermal simulation model for the highest predictive accuracy during design. The graph in Figure 6 compares thermal structure functions for a thermal model of an IGBT against measurements of the actual part under active power cycling.

Comprehensive cycling strategies cover different use-case conditions and capture a range of electrical and thermal test data that can be applied to the model, in addition to regular thermal transient tests. The results can identify damage to the package interconnect or locate the cause of degradation within the part’s thermal structure, thereby meeting the testing requirements of CMM stage 4 and providing the data necessary to achieve stage 5.

Wendy Luiten is a well-known thermal expert and Master Black Belt in Innovation Design for Six Sigma. She has authored over 25 papers, holds six granted and pending patents, and is a well-known lecturer. She received the Semitherm best paper award in 2002, the Harvey Rosten Award for Excellence in 2013, and the Philips Research Outstanding Achievement award in 2015. After more than 30 years at Philips Research, she is now principal of her own consultancy and continues to work as a thermal expert and Master Black Belt, as a lecturer at the High Tech Institute, and as a DfSS lead trainer.
