Mitigating Silent Data Corruptions in High Performance Computing


A new technical paper titled “Mitigating silent data corruptions in HPC applications across multiple program inputs” was published by researchers at the University of Iowa, Baidu Security, and Argonne National Laboratory. The paper was a Best Paper finalist at SC22.

The researchers “propose MinpSID, an automated SID [selective instruction duplication] framework that automatically identifies and re-prioritizes incubative instructions in a given program to enhance SDC [silent data corruption] coverage. Evaluation shows MinpSID can effectively mitigate the loss of SDC coverage across multiple inputs,” states the paper.
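To make the idea of re-prioritizing across inputs concrete, here is a toy sketch in Python. It is not the paper's algorithm: the instruction names, costs, benefit numbers, and the greedy benefit-per-cost selection are all assumptions made up for illustration. "Incubative" is read here as an instruction that looks low-risk for silent data corruption under one reference input but is more SDC-prone under another input.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    name: str                 # hypothetical instruction label
    cost: float               # hypothetical overhead of duplicating it
    benefit_by_input: dict    # hypothetical SDC-coverage gain, per program input

def select(instrs, budget, benefit):
    """Greedily pick instructions to duplicate: best benefit-per-cost first,
    skipping any instruction that would exceed the overhead budget."""
    chosen, spent = [], 0.0
    for ins in sorted(instrs, key=lambda i: benefit(i) / i.cost, reverse=True):
        if spent + ins.cost <= budget:
            chosen.append(ins.name)
            spent += ins.cost
    return chosen

def reference_benefit(ins):
    # Priority measured on the single reference input only.
    return ins.benefit_by_input["ref"]

def max_benefit_across_inputs(ins):
    # Priority re-computed over every profiled input (an assumed heuristic).
    return max(ins.benefit_by_input.values())

instrs = [
    Instr("add_1", 1.0, {"ref": 0.30, "alt": 0.30}),
    Instr("mul_2", 1.0, {"ref": 0.05, "alt": 0.40}),   # looks benign on "ref" only
    Instr("load_3", 1.0, {"ref": 0.20, "alt": 0.10}),
]

single_input = select(instrs, budget=2.0, benefit=reference_benefit)
multi_input = select(instrs, budget=2.0, benefit=max_benefit_across_inputs)
print(single_input)  # ['add_1', 'load_3']
print(multi_input)   # ['mul_2', 'add_1']
```

With the same overhead budget, the selection based only on the reference input leaves `mul_2` unprotected, while the selection that considers both inputs moves it to the top of the list. That shift in priority is the kind of coverage loss across inputs the quoted passage refers to.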

Find the technical paper here or here. Published November 2022. Presentation slides are here.

Huang, Yafan, et al. “Mitigating silent data corruptions in HPC applications across multiple program inputs.” Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. 2022.

Related Reading
Screening For Silent Data Errors
More SDEs can be found using targeted electrical tests and 100% inspection, but not all of them.
Silent Data Corruption
How to prevent defects that can cause errors.
Why Silent Data Errors Are So Hard To Find
Subtle IC defects in data center CPUs result in computation errors.
