NoC Reliability: Simplified

There are four primary failure modes associated with NoCs.

Recently, the reliability features of network-on-chip (NoC) IP have received much attention. One reason for this focus is the rush of companies into the automotive electronics market, together with the explosion of new automotive features being implemented in electronic systems. While the details may vary, the high-level view of on-chip network reliability is really quite simple.

At the architectural block level, an on-chip network appears as just another functional block in the system-on-chip (SoC) design, albeit a very important one. The function of the NoC is to move data between other blocks in the SoC design. To make the designers’ job easier, modern NoC products also layer in many services and features, such as protocol conversion (including data width), quality of service (QoS), automated support for multiple (or unlimited) combinations of power and clock domains, network security, power management, and interrupt management. However, the main function is simply moving data, and that is what reliability engineers must be concerned with when working toward ISO 26262 compliance.

[Table: the four primary NoC failure modes and their detection methods]

As the table above shows, there are only four primary failure modes associated with NoCs. One failure mode has to do with the corruption of the data during transport. The other three modes have to do with the delivery of the data to the proper recipient block. While there may be other methods to detect these types of errors, the detection methods listed in the table are very straightforward. As should be expected, how the errors are handled will depend on the level of resiliency desired.

Parity provides only limited protection: a single parity bit detects an odd number of flipped bits but cannot correct them, and it misses any error that flips an even number of bits. Error Correcting Code (ECC) supplies more capability, but it brings a few more decisions. Because modern NoCs support transfers in which the sender and receiver have different word sizes, the designer must choose to: (1) use only one word size across the entire network; (2) ECC encode based on the recipient word size; or (3) do all encoding at the byte level. Method 1 is very simple to implement, but it requires redesigning every connected IP core to use a single word size, which is very costly. Method 2 may use fewer wires than method 3, but it requires each endpoint to know the word size of the destination of each transaction, which complicates the design of the IP blocks and of the software that programs them. Method 3 is therefore the preferred approach, as it is more flexible and allows rapid reuse of IP cores from many sources.
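To make method 3 concrete, the following is a minimal sketch of byte-level encoding, assuming a textbook Hamming(12,8) single-error-correcting code. The article does not say which code any particular NoC uses, and a production design would more likely add an overall parity bit for double-error detection. All function names here are hypothetical. The point of the sketch is that each byte carries its own check bits, so regrouping bytes from a 64-bit initiator word into 32-bit target words needs no re-encoding.

```python
# Illustrative sketch only: a textbook Hamming(12,8) code applied per byte.
# Names are hypothetical and not taken from any NoC product.

DATA_POSITIONS = (3, 5, 6, 7, 9, 10, 11, 12)  # 1-based codeword positions holding data
CHECK_POSITIONS = (1, 2, 4, 8)                # power-of-two positions hold check bits


def parity_bit(byte: int) -> int:
    """Even-parity bit for one byte: 1 if the byte contains an odd number of 1s."""
    return bin(byte & 0xFF).count("1") & 1


def encode_byte(byte: int) -> int:
    """Encode one byte as a 12-bit codeword (bit pos-1 of the int is position pos)."""
    bits = [0] * 13  # index 0 unused so indices match 1-based positions
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (byte >> i) & 1
    for check in CHECK_POSITIONS:
        # Each check bit covers every position whose index has that bit set.
        for pos in range(1, 13):
            if pos != check and pos & check:
                bits[check] ^= bits[pos]
    return sum(bits[pos] << (pos - 1) for pos in range(1, 13))


def decode_byte(codeword: int):
    """Return (byte, syndrome). Syndrome 0 means clean; 1..12 is the corrected position."""
    bits = [0] + [(codeword >> (pos - 1)) & 1 for pos in range(1, 13)]
    syndrome = 0
    for pos in range(1, 13):
        if bits[pos]:
            syndrome ^= pos
    if 1 <= syndrome <= 12:
        bits[syndrome] ^= 1  # a single-bit error sits at the position named by the syndrome
    byte = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return byte, syndrome


def regroup(codewords, bytes_per_word):
    """Repack per-byte codewords into words of the receiver's width."""
    return [codewords[i:i + bytes_per_word]
            for i in range(0, len(codewords), bytes_per_word)]


# One 64-bit initiator word (8 bytes) delivered to a 32-bit target (4 bytes per word).
payload = bytes([0xDE, 0xAD, 0xBE, 0xEF, 0x01, 0x23, 0x45, 0x67])
codewords = [encode_byte(b) for b in payload]
codewords[2] ^= 1 << 6  # inject a single-bit upset in transit
target_words = regroup(codewords, 4)
recovered = bytes(decode_byte(cw)[0] for word in target_words for cw in word)

# Plain parity cannot see a two-bit flip: the parity bit is unchanged.
double_flip_missed = parity_bit(0x5A) == parity_bit(0x5A ^ 0b11)
```

The cost of this flexibility is wire count: in this sketch every 8 data bits carry 4 check bits, whereas a code computed over a wider word needs proportionally fewer. That is the efficiency trade-off noted for method 2.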

A “shadow network” is simply another network mirroring the connections of the first network. To be useful in detecting errors, however, the shadow network's implementation must be isolated from that of the primary network. Designers ensure this isolation by keeping the shadow network one cycle out of phase with the primary network and by giving it a different layout. In this way, errors caused by the environment (such as particle hits or power spikes) are not likely to affect both networks in the same way. By comparing the results of the primary and shadow networks, designers can determine whether significant errors have occurred in the routing, framing, or delivery of transactions.
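The comparison step can be sketched in a few lines. This is a behavioral model under stated assumptions, not a description of any vendor's checker: the `Transaction` fields, the `ShadowChecker` class, and the per-cycle `step` interface are all hypothetical. The model holds each primary output for one cycle so that it lines up with the delayed shadow output, then flags any disagreement.

```python
# Behavioral sketch of a primary/shadow comparator. All names are hypothetical.
from collections import deque
from typing import NamedTuple, Optional


class Transaction(NamedTuple):
    dest: int     # hypothetical ID of the target block
    payload: int  # hypothetical data word


class ShadowChecker:
    """Compares primary output with shadow output that lags by `lag_cycles`."""

    def __init__(self, lag_cycles: int = 1):
        # Delay line that holds primary outputs until the shadow catches up.
        self._held = deque([None] * lag_cycles)

    def step(self, primary_out: Optional[Transaction],
             shadow_out: Optional[Transaction]) -> bool:
        """Advance one cycle; return True if the two networks disagree."""
        self._held.append(primary_out)
        expected = self._held.popleft()  # primary output from lag_cycles ago
        return expected != shadow_out


def run(primary_trace, shadow_trace):
    """Feed two per-cycle output traces through a checker; return mismatch flags."""
    checker = ShadowChecker(lag_cycles=1)
    return [checker.step(p, s) for p, s in zip(primary_trace, shadow_trace)]


a, b, c = Transaction(1, 0xAA), Transaction(2, 0xBB), Transaction(3, 0xCC)
misrouted_b = Transaction(5, 0xBB)  # same data, delivered to the wrong block

# The shadow network emits the same transactions one cycle later.
clean = run([a, b, c, None], [None, a, b, c])
# A routing fault in the primary network shows up once the shadow copy arrives.
faulty = run([a, misrouted_b, c, None], [None, a, b, c])
```

A mismatch shows only that the two networks disagree, not which one is wrong, so the response (retry, interrupt, or safe state) is left to the system-level error handling.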

NoC resiliency is very straightforward to analyze. As in all situations, what to do about a failure depends on the consequences of that failure for the function being implemented. In the parlance of ISO 26262, the desired ASIL helps determine how designers should handle a failure. Incorporating a NoC that includes rich error detection features makes error handling inside the SoC easier, through the interrupts and other system services the NoC provides.


