Ensuring AI Reliability: Mitigating Silent Data Corruption Risks


Silent Data Corruption (SDC) is an industry challenge affecting data centers worldwide with increasing frequency. This phenomenon stems from untraceable hardware failures that make detection notoriously difficult. SDCs don’t leave any record in system logs or trigger exception mechanisms. The corrupted data they produce can propagate unnoticed, causing cascading failures that often demand ext...

Why Hardware Monitoring Needs Infrastructure, Not Just Sensors


Chipmakers need comprehensive hardware monitoring, with monitors (Agents) and sensors distributed throughout their devices, to manage the growing complexity and scale of modern SoCs. As designs now incorporate billions of transistors, multiple power and clock domains, and advanced process technologies, traditional characterization, test, and guard-banding approaches no longer provide sufficient...

Ensuring AI Reliability: Mitigating OCP’s Silent Data Corruption Risks


Silent Data Corruption (SDC) is an industry challenge affecting data centers worldwide with increasing frequency. This phenomenon stems from untraceable hardware failures that make detection notoriously difficult. SDCs don’t leave any record in system logs or trigger exception mechanisms. The corrupted data they produce can propagate unnoticed, causing cascading failures that often demand ext...

Resilient And Optimized GenAI Systems


AI and data center systems are being pushed to their limits, with soaring complexity, nonstop inference workloads, and rising energy demands. Addressing these pressures requires more than incremental improvements; it calls for collaboration across the ecosystem. That’s why proteanTecs has joined forces with Arm, bringing our real-time monitoring technology into Arm’s Neoverse Compute Subsys...

Thermal Sensing Headache Finally Over For 2nm And Beyond


Effective thermal management is crucial to prevent overheating and optimize performance in modern SoCs. Inadequate temperature control due to inaccurate thermal sensing compromises power management, reliability, processing speed, and lifespan, leading to issues like electromigration, hot carrier injection, and even thermal runaway. Unfortunately, precise thermal monitoring reached an inflect...

Critical Optimization Factors For GenAI Chipmakers


Today’s GenAI arms race is fought with novel chip architectures and packaging. Specialized hardware designs are proliferating in the form of GPUs, TPUs, NPUs, and more, all tuned for parallelism and matrix-heavy AI math. In this hyper-competitive landscape, chip vendors scramble to differentiate their products on multiple fronts. They promise some mix of better performance, efficiency, or ...

Same Chip, Two Destinies: How Power Profiles Improve With On-Chip Monitoring


What happens to critical power-related considerations when the same chip is handled two different ways, with or without visibility from within? This article begins by examining how the absence of on-chip monitoring impacts peak power, average power, and di/dt noise (noise caused by the rate of current change), as illustrated in the diagram below and the subsequent discussion. It then details how these aspects c...

The Painful Reality Of Scaling Cloud AI


The shift to Generative AI (GenAI) has overwhelmed existing infrastructure, transforming previously rare issues into daily operational realities. Skyrocketing costs, intense energy consumption, and hardware failures at unprecedented scales illustrate the strain of current AI workloads. With models like GPT-4 costing tens of millions and GPT-5 projected to surpass a billion-dollar threshold, the...

GenAI’s Breakneck Pace Is Reshaping The Semiconductor Industry


Humankind is witnessing a technological revolution so extreme that its full magnitude might extend beyond the scope of our intellect. Generative AI (GenAI) is doubling its performance every six months [1], outpacing Moore's Law in what the industry calls Hyper Moore's Law. Some cloud AI chipmakers expect to double or triple performance every year for the next ten years [2]. In this three-part b...

Can Your ATPG Do This? Cut Defects Escaping Detection With ML


Chipmakers worldwide consider Automatic Test Pattern Generation (ATPG) their go-to method for achieving high test coverage in production. ATPG generates test patterns designed to detect faults in the silicon and ensures they are applied effectively using the chip’s Design-for-Test (DFT) infrastructure. This combination enhances fault detection while optimizing test efficiency. These patter...
