The Quest To Make 5G Systems Reliable

Experts at the Table: Where are the challenges and what can be done to solve them.


Semiconductor Engineering sat down to discuss 5G reliability with Anthony Lord, director of RF product marketing at FormFactor; Noam Brousard, system vice president at proteanTecs; Andre van de Geijn, business development manager at yieldHUB; and David Hall, head of semiconductor marketing at National Instruments. What follows are excerpts of that conversation.

SE: How do we measure the reliability of 5G chips and systems, both from the base station and the receiving device side?

Lord: To make a good reliable device, whether it’s 5G or even 4G, or memory or application processors, it really comes down to the quality of the device models. Before designing any sort of IC, you need to have the building blocks. That includes a process design kit, which is made up of models of transistors, inductors, capacitors, and interconnects, but you can only do that correctly if the models for those devices are accurate. The role of the SPICE modeling lab or the device modeling team is to create those models. That requires lots of different types of measurements, whether it’s I-V (current-voltage) curves, C-V (capacitance-voltage) curves, S-parameter (scattering parameter) measurements, or local measurements. From that data they can extract the models, create databases of what a transistor looks like in that process, and then the IC designer can use them. The more accurate that model is, the more likely the IC is going to work the first time, or second time at least, and the more likely that it’s going to be a reliable chip when it goes into the system, whether it’s the handset or the base station, and that the whole system works. So it’s really about getting the basics of the transistor models right and making sure they’re accurate.

Brousard: Before a chip makes it into the field, there’s still a long voyage it has to make through the production cycles. It’s paramount to get the production, the testing, and the qualification of this chip right, and to have the best starting point once out in the field. Up-and-coming 5G applications will not tolerate failure. This is not purely about smartphones anymore. For instance, 5G applications in autonomous vehicles are much less failure-tolerant. Best-known practices for producing electronic devices, be it testing and qualification at the chip level, the system level, or a system-of-systems level, historically worked well enough. But once we started getting down to 7nm, 5nm, and soon-to-be 3nm technologies, we needed to rethink how we take these to the next level. For instance, a siloed production and testing approach, where we make sure the chip is okay and then throw it ‘over the wall’ to the system vendor, who makes sure the board that includes that chip is okay, and so on, just doesn’t cut it anymore, especially with these stringent requirements. Now we need a holistic, end-to-end view of the chip throughout its production lifecycle, and to ensure there’s one language or one data view that is truly indicative of its quality at every stage. Based on that data, we also can correlate between the different stages to see how we can make it a more streamlined and more efficient production line.

Hall: We really have to break the problem into a couple different phases. There’s the chip reliability, there’s the system reliability, and the way that you measure or characterize those is different depending on if you’re talking about sub-6 GHz 5G versus millimeter wave. With millimeter wave there’s a significant risk that the chip will function as designed, but the system may not function as designed because there are significant challenges with millimeter wave over-the-air propagation. There’s a lot to be figured out in terms of beamforming, and there’s a lot to be figured out in terms of network design. A lot of the network reliability will include things like throughput and quality of service measurements. It’s completely possible — in fact, likely — that the chip will function as designed, but the network may not.

van de Geijn: Our customers are developing the chips and modules for both receivers and senders. They not only test the chips, they test the whole module to make sure these chips work in their environment. And they do that not only for a single moment in time, but over time. They measure the results every week or every month, for up to six months, to see how the chips degrade over time, because that’s another issue. Designing those RF modules is quite new for them, and they need to see over time how they behave, and they use our tools to see how they correlate over time: how they drift, how they behave with each other, how the receivers are doing, how the senders are doing. You can start out making sure that your chips are good, but after that does it mean your modules are good? This is where things get really complicated with 5G. It’s not just one market. It’s a lot of different markets rolled into one, including millimeter wave, base stations, and whatever the end receiving device happens to be.

SE: Some of these devices or systems will be changed over on a fairly regular basis, other ones will be in the field potentially for decades, and in the case of base stations and repeaters, exposed to harsh environments. So what do we have to think about in terms of aging and taking into account all these different factors?

Brousard: Looking at this from the chip perspective, the reasons for silicon or transistor degradation are well-known, and especially how pronounced they are in the more advanced nodes. So with a chip in the field for an extended period of time, scientifically we know what the physics of failure behind that are. What is missing is a way to translate that into something we can understand, that is measurable and actionable. How do we monitor and measure it over time and deduce the implications of these results? A straightforward way is to continuously test how the chip is behaving and if its performance is degrading over time. This falls short on multiple fronts. When monitoring the functional reaction to degradation we might only notice the effects of the degradation once they actually cause a failure, or we won’t know the actual physical cause for the degradation. And maybe we won’t be able to use this knowledge to predict when in the future the failure will happen. Another approach is to use telemetry, based on what we call Agents, embedded in the silicon. Those are always monitoring the physical parameters of the silicon at the transistor level, mapping their degradation over time. You can plug that data into well-known formulas for hot carrier injection or NBTI, which cause aging in chips, and run ML algorithms that monitor the degradation rate and from there you can calculate and predict time to failure ahead of its actual occurrence.
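The prediction step Brousard describes can be illustrated with a minimal sketch. It assumes the monitored parameter is a threshold-voltage shift that follows a power law in time, shift = a * t**n, a form commonly used for NBTI-style aging. The sample readings, the 50 mV failure limit, and the function names are hypothetical and are not taken from proteanTecs or any product; a least-squares fit stands in for the ML algorithms he mentions.

```python
import math


def fit_power_law(times, shifts):
    """Least-squares fit of shift = a * t**n, done as a line fit in log-log space."""
    xs = [math.log(t) for t in times]
    ys = [math.log(s) for s in shifts]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    n = sxy / sxx                        # slope of the log-log line is the exponent
    a = math.exp(mean_y - n * mean_x)    # intercept gives the prefactor
    return a, n


def time_to_threshold(a, n, limit):
    """Time at which the fitted shift a * t**n reaches the failure limit."""
    return (limit / a) ** (1.0 / n)


if __name__ == "__main__":
    # Hypothetical telemetry: hours in the field and measured shift in millivolts.
    hours = [100.0, 1000.0, 10000.0]
    shift_mv = [5.0, 8.0, 12.5]
    a, n = fit_power_law(hours, shift_mv)
    # Hypothetical failure limit of 50 mV.
    print("estimated hours to limit:", time_to_threshold(a, n, 50.0))
```

The point of the sketch is that a fitted trend yields a time-to-failure estimate well before any functional test would fail, which is the advantage over purely functional monitoring described above.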

Hall: With base stations, there is a high reliability requirement based on time. Do you see the requirements in automotive getting more difficult than some of the base stations, or are they similar?

Brousard: The automotive sector has very high standards, of course, but this also is the way of the future for a lot of different markets. There is a steady paradigm shift where reliability isn’t just a requirement of the highest-end sectors, but also an economic necessity. It might be required from a reliability standpoint, because you just don’t want things to fail. From an economic point of view, you want to know when you need to roll out trucks to replace stuff. If you can look inside a chip and see when it’s failing, you can take action. If a base station fails, you’re cut off from communication. But if your lidar fails, or if the cameras in your autonomous vehicle fail, that’s a bigger problem.

Hall: We’ve seen some of our automotive semiconductor customers go to ATE type of testing practices, even in the validation lab, which historically you wouldn’t do. You would do a pretty minimal set of measurements, which might include a bunch of stress testing in a thermal chamber or a pressurized chamber, and you would do it for a few dozen samples. But they’re at the point where they’re doing thousands of samples, which is pretty new for our industry.

Lord: We’re seeing a lot of reliability testing built around stress. That can include higher temperatures and higher bias than normal in order to accelerate failures such as electromigration, or to shorten the time it takes to break down a junction. That allows us to predict failures in transistors and build better processes, or at least to have a good knowledge of when and under what circumstances they might fail so you can build in redundancies. For automotive, devices need to be tested over a wide range of temperatures. A typical device characterization today is from -40° to 125° Celsius. Automotive now is being tested at 175° Celsius because chips are closer to hot components in the car. There’s more work going on in reliability, particularly driven by automotive.
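One common way to relate stress time at elevated temperature to time in normal use is the Arrhenius acceleration factor. The sketch below is an illustration of that general model, not a method attributed to any of the panelists. The activation energy of 0.7 eV and the 55° Celsius use temperature are assumed values chosen only for the example; real values depend on the failure mechanism and the process.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV per kelvin


def arrhenius_acceleration(ea_ev, t_use_c, t_stress_c):
    """Acceleration factor between a use temperature and a stress temperature.

    ea_ev is the activation energy in eV (an assumed, mechanism-specific value);
    temperatures are given in degrees Celsius and converted to kelvin.
    """
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    exponent = (ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use_k - 1.0 / t_stress_k)
    return math.exp(exponent)


if __name__ == "__main__":
    # Assumed example: 0.7 eV activation energy, 55 C in use, 125 C and 175 C under stress.
    for stress_c in (125.0, 175.0):
        af = arrhenius_acceleration(0.7, 55.0, stress_c)
        print(f"stress at {stress_c:.0f} C: one stress hour ~ {af:.0f} use hours")
```

Under this model, raising the stress temperature shortens the test time needed to represent a given field lifetime, which is why stress conditions go beyond the normal operating range.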

van de Geijn: Those checks are happening across a wide range of temperatures and a wide range of voltages, which is necessary because those are non-stable factors. The environments also are tested to see what the correlations are and how the different tests behave in those environments over time. Automotive customers are still at the beginning of the process. There are so many unknowns, especially for automotive and 5G. It’s nice if it works, but if you need to go past 1 billion devices with no failures rather than 1 million, which is what’s happening in automotive, then what you’re doing will be completely different than for standard chips.

SE: We’ve been hearing about parts per quadrillion for some materials.

van de Geijn: What you really see for critical components is they are added into systems twice, so if one is failing, the other one takes over, or you balance out the results from the two systems before you really use the data and do something with it. But it is not only for 5G. It’s also for microcontrollers.

Lord: That’s similar to aircraft, where there are multiple systems doing the same job. If one goes down, then two others can pick it up.

SE: So what is a failure in 5G? Is it a dropped call? Is it drift that happens occasionally? Or is it all of these, and no aberrations are acceptable?

van de Geijn: It’s more or less trying to get a signal out of chaos. With other systems, you’re using fidelity decoders and complex algorithms to get your signal in. So you either get zero data in, or you get data in, which means you get correct data. That’s a good thing. But if you don’t get data, you need a failover mechanism. That can be a second system, or in automotive it can be a person taking over.

Hall: If you look at a transistor failure in a low-noise amplifier or a power amplifier, the outcome is that you get zero output of your transmitter, or you get zero input on your receiver. And so in a complete failure mode type of situation, you would cease the ability to communicate. What likely happens is more of a degradation, where over time the consistent ability of a power amplifier to deliver the appropriate output power may go down. That translates into a degradation in communication throughput. The standard is actually adaptive, so it can use modulation techniques to handle whatever the signal-to-noise ratio the communication link can afford. If your low-noise amp has a worse noise floor, or your power amplifier can’t produce the right output power, then the network will switch you to a lower data throughput, but you still have a reasonably robust communication link. The outcome is lower quality of service, even though the system actually still may be functional.
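The adaptive behavior Hall describes can be sketched as a lookup from signal-to-noise ratio to modulation order. The modulation schemes listed (QPSK through 256QAM) are ones 5G NR supports, but the SNR thresholds below are purely illustrative assumptions and are not taken from the 3GPP specifications; real link adaptation also selects a coding rate and uses channel-quality feedback.

```python
# (minimum SNR in dB, modulation name, bits per symbol), highest order first.
# The SNR thresholds are hypothetical values chosen for illustration only.
MODULATION_TABLE = [
    (22.0, "256QAM", 8),
    (16.0, "64QAM", 6),
    (10.0, "16QAM", 4),
    (3.0, "QPSK", 2),
]


def select_modulation(snr_db):
    """Return (name, bits_per_symbol) for the highest order the SNR supports.

    Returns None when the SNR is below every threshold, meaning no link.
    """
    for min_snr_db, name, bits in MODULATION_TABLE:
        if snr_db >= min_snr_db:
            return name, bits
    return None


if __name__ == "__main__":
    # A degrading amplifier lowers the SNR; throughput steps down before the link is lost.
    for snr_db in (25.0, 18.0, 12.0, 5.0, 1.0):
        print(snr_db, "dB ->", select_modulation(snr_db))
```

The step-down pattern matches the point made above: a degraded amplifier shows up first as lower throughput, and only at the bottom of the table as a lost link.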

Brousard: While there are mechanisms of compensation, looking at the rate of degradation at the transistor level is important because you don’t want to reach the point where you pass the threshold. Knowing the rate of degradation is nearly as important as the degradation itself. You can compensate to a point, but if degradation continues, at a certain point there will be failure. If you have a system where you can monitor the degradation, you can compensate communications-wise, and take precautionary action (replacement, service), and not have to wait for the failure. In a car, you can drive to the shop and have it replaced before a failure ever happens, which can have an impact on how much redundancy you need and the cost of that redundancy. So monitoring degradation is not just about compensating when something bad happens.
