When You Can’t Afford To Scrimp On System Reliability

Key challenges to maintaining system reliability in the face of natural phenomena and malicious attacks.

Failure happens, whether we like it or not. What matters is being prepared for it, which means putting measures in place that let us address or resolve the problem quickly. But not all failures are created equal. For example, a laptop that you use daily might experience occasional glitches. If it’s well-designed, you can simply reset the machine to get it back to its original state. But, as Carlos Rodriguez, a Maxim Integrated microcontroller expert, notes, “We don’t always have this luxury of having a manual reset and being able to reset the device we are designing.”

Rodriguez recently hosted a webinar, “When and Why System Reliability Matters,” during Maxim’s virtual Sensors Experience. In the session, which is now available on demand, he highlighted some key challenges to maintaining system reliability, plus some solutions to help you keep your systems up and running.

To frame his discussion, Rodriguez outlined four scenarios where system reliability is non-negotiable:

  • The device is not easily accessible. This is the case for agricultural or industrial applications, where there could be hundreds of sensors across a field (figure 1) or inside a manufacturing facility. Having to physically locate a failing sensor in these situations isn’t a sustainable way of doing business. And in an industrial IoT environment, if the sensor is in a hard-to-reach location, this could entail stopping the manufacturing line to resolve the issue.
  • Maintaining robust communication is critical. An example is an emergency response device, where missed communication could lead to a devastating consequence.
  • Continuous processing is crucial. Medical devices like pacemakers, for instance, cannot stop working for even a brief moment; otherwise, the consequences could be lethal.
  • Stored data is valuable. One example is a crypto hardware wallet, where access to funds could be lost if the memory location holding a private encryption key gets corrupted.


Fig. 1: When sensors are distributed across a farm, tracking down a point of failure can be time-consuming and laborious. A better tactic is to establish and maintain high system reliability at the design stage.

What are the three main things that could go wrong and cause a system to fail? Rodriguez noted that there could be corruption in code that prevents the desired output from being produced. Data could get lost or corrupted. Or, something can go wrong while data is in transit from point A to point B.

Diving into the root causes of these system failures, Rodriguez pinpointed two key problems, as well as ways to solve them. One of the problems is something we can’t see: alpha particles. Cosmic rays that rain down from space contain these alpha particles, which can trigger undesired bit flips within the memory of electronic devices and lead to:

  • System failures
  • Memory corruption, causing data loss
  • Unpredicted device output
  • Other unexpected events

The larger the memory in the device, the greater the likelihood of bit errors, since there is more memory area for the alpha particles to strike. When a device like a laptop is affected in this way, you can reset it. But a system that cannot be reset needs error-correcting code (ECC) in its memory. ECC detects the exact location of bit errors in memory and corrects those errors. “ECC is going to increase the robustness of the memory and decrease the likelihood of the system failing throughout the lifetime of the product,” said Rodriguez.
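
To make the idea of locating and correcting a flipped bit concrete, here is a minimal sketch using the textbook Hamming(7,4) code. This is an illustration only: the article does not say which ECC scheme any particular memory uses, and real memory ECC typically protects much wider words. The function names `hamming74_encode` and `hamming74_decode` are hypothetical.

```python
from typing import Tuple


def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits as a 7-bit Hamming codeword.

    Codeword positions are numbered 1..7; parity bits sit at positions
    1, 2 and 4, data bits at positions 3, 5, 6 and 7.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]  # covers positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # covers positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # covers positions 4, 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    # Bit i of the returned integer holds codeword position i + 1.
    return sum(b << i for i, b in enumerate(bits))


def hamming74_decode(codeword: int) -> Tuple[int, int]:
    """Correct at most one flipped bit and return (nibble, error_position).

    error_position is 0 when no error was detected, otherwise the
    1-based position of the bit that was corrected.
    """
    # XOR-ing the positions of all set bits gives 0 for a valid codeword
    # and the position of the flipped bit after a single-bit error.
    syndrome = 0
    for pos in range(1, 8):
        if (codeword >> (pos - 1)) & 1:
            syndrome ^= pos
    if syndrome:
        codeword ^= 1 << (syndrome - 1)  # flip the faulty bit back
    data_bits = [(codeword >> (pos - 1)) & 1 for pos in (3, 5, 6, 7)]
    nibble = sum(b << i for i, b in enumerate(data_bits))
    return nibble, syndrome


# Simulate a single bit flip in a stored word, then recover from it.
stored = hamming74_encode(0b1011)
corrupted = stored ^ (1 << 4)  # flip codeword position 5
recovered, position = hamming74_decode(corrupted)
```

The three parity bits are enough to name any one of the seven positions, which is how the decoder pinpoints the error. A code this small cannot handle two flips in the same word; schemes that also detect double-bit errors add a further parity bit.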

Hackers present another major problem. Rodriguez noted that hackers have found numerous ways to alter the behavior of a device, for example by:

  • Intercepting data during communication and replacing it with false information
  • Injecting code into the microcontroller to change the behavior of the application
  • Stealing information from the stored data

The solution, explained Rodriguez, is to integrate into your design a microcontroller built with robust security features, such as encryption engines, cyclic redundancy check (CRC), secure bootloaders, true random number generators (TRNG), and secure nonvolatile key storage. Such features make it harder for hackers to do any kind of tampering, he said.
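
As a rough illustration of how a CRC flags data that changed between point A and point B, the sketch below appends a CRC-32 to a message and verifies it on receipt. It uses Python’s standard `zlib.crc32`; the framing (4-byte big-endian CRC at the end) and the names `frame_with_crc` and `check_frame` are assumptions made for this example, not anything described in the webinar or implemented by a specific device.

```python
import zlib


def frame_with_crc(payload: bytes) -> bytes:
    """Return the payload followed by its CRC-32 (4 bytes, big-endian)."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + crc.to_bytes(4, "big")


def check_frame(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare it with the trailer."""
    if len(frame) < 4:
        return False  # too short to contain a CRC trailer
    payload = frame[:-4]
    received = int.from_bytes(frame[-4:], "big")
    return (zlib.crc32(payload) & 0xFFFFFFFF) == received


# A frame that arrives intact passes; one with a flipped bit does not.
frame = frame_with_crc(b"sensor=17;temp=21.5")
damaged = bytearray(frame)
damaged[3] ^= 0x01  # flip one bit in the payload
intact_ok = check_frame(frame)
damaged_ok = check_frame(bytes(damaged))
```

A CRC on its own catches accidental corruption. Because anyone can recompute it, it is normally paired with the cryptographic features in the same list, such as the encryption engines and secure key storage, when the concern is deliberate modification.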

Microcontroller for highly reliable systems
Maxim Integrated has introduced a new low-power microcontroller that’s designed to keep system reliability high: the MAX32670, which is based on a 100MHz Arm Cortex-M4 processor with floating-point unit. Its embedded memory includes 384KB of flash with ECC, 160KB of SRAM with optional ECC, and 16KB unified cache with ECC. For security, the device features a secure boot ROM, secure nonvolatile key storage, TRNG, CRC 16/32, and Advanced Encryption Standard (AES) 128/192/256. The device is energy efficient, operating as low as 44µA/MHz in active mode and roughly 0.1µA in its lowest power sleep mode, and is available in a 5mm x 5mm TQFN package (a smaller WLP will soon be available).

“It’s really going to allow you to take your system to the next level as far as reliability goes,” Rodriguez said.

To learn more, evaluate the MAX32670 for your next design by buying the evaluation kit, MAX32670EVKIT, check out other devices in the DARWIN family of ultra-low-power Arm microcontrollers, or watch a video on ECC for a better understanding of how devices detect and correct errors.


