Margin Sensors In The Wild

Silicon health and performance monitoring for SoCs.

popularity

Back in March, I wrote up an article here that looked at how a proxy circuit could be used to measure variations in circuit performance as conditions changed in the operating environment. There were a couple of recent presentations on margin sensors at two of the big EDA vendors’ customer engineering forums that we’ll look at as well as another product with an upcoming presentation at DAC. Margin sensors have applications for silicon health and performance monitoring for SoCs, characterization, yield, reliability, safety, power, and performance. How they are configured, though, determines their best suited tasks.

The first presentation was given at Synopsys’ SNUG Silicon Valley on March 20, 2024, titled “Diagnosis of Timing Margin on Silicon with PMM (Path Margin Monitor)”, by Gurnrack Moon, Principal Engineer at Samsung. One of the key aspects of the PMM that Samsung appreciated was the closer correlation between the PMM and the actual paths versus, say, using a Ring Oscillator approach.

Fig. 1: Synopsys Path Margin Monitor diagram. (Source: Synopsys)

My previous article described how the “Monitor Logic” portion of the PMM diagram shown above in figure 1 would conceptually work. Taps taken along the synthetic circuit of buffers could be compared to see how far the signal made it down the path and thus determine how much margin is available. A strength of this approach is that it allows one PMM to be used on multiple paths. It does have a disadvantage, though, of introducing additional control overhead and adding additional delay components in to the monitor path.

The PMMs on the chip are connected in a daisy-chain fashion which reduces the number of signals needed to send information from the PMMs to the Path Margin Monitor Controller. This also reduces the number of signals for communication. This setup efficiently uses chip area to provide information about the state of the silicon. Typically, one might expect this type of capability to be exercised in a “diagnostic” mode where data would be captured, analyzed, and then used to determine appropriate voltage and frequency settings as opposed to a more dynamic or adaptive approach.

Samsung appreciated being able to “determine if there are problems or what is different from what is designed, and what needs to be improved. In addition, PMM data fed to the Synopsys Silicon.da analytics platform provides rich analytics, shortening the debug/analysis time.” This was used on production silicon. Synopsys also has other blog articles here and here for the interested reader.

The second presentation was given at CadenceLIVE Silicon Valley, April 17, 2024, titled “Challenges in Datacenters: Search for Advanced Power Management Mechanisms”, and presented by Ziv Paz, Vice President of Business Development at proteanTecs. His presentation focused on proteanTecs’ Margin Agents and noted how these sensors were sensitive to process, aging, workload stress, latent defects, operating conditions, DC IR drops, and local Vdroops.

Fig. 2: Reducing voltage while staying within margin. (Source: proteanTecs, CadenceLIVE)

Figure 2 shows how designers must handle “worst-case” scenarios and often do so by creating enough margin to operate under those conditions. In the diagram shown here, that margin shows up as a higher operating VDD. If the normal operating mode is 650mV with an allowance for a -10% change in VDD then the design is implemented to run at 585mV (90% * 650mV). Most of the time though, the circuitry will operate properly below 650mV so that running at 650mV is just wasting energy.

proteanTecs then presented a case study that was designed using TSMC’s 5nm technology. The chip incorporated 448 margin agents consisting of buffers with a unit delay of 7ps.

Fig. 3: Example margin agents and corresponding voltage. (Source: proteanTecs, CadenceLIVE)

Figure 3 above shows the margin agents (all 448) on the left side with the thicker black line showing the worst case for all 448. The right side shows the voltage. It also demonstrates that when the threshold is lowered the voltage will now drop to 614mV and the design continues to operate properly.

Fig. 4: Example margin agents with droop and corresponding voltage. (Source: proteanTecs, CadenceLIVE)

Figure 4 shows that as the voltage on the right drops that the worst-case margin agent values also drop and once they cross the yellow(-ish) line the voltage is signaled to return to the pre-AVS voltage of 650mV. The margin agent values then improve and the AVS voltage of 614mV will kick back in. By reacting when the margin agents cross the yellow line, it allows time for the voltage to increase and adjust before the voltage hits the red (585mV) line, thus always keeping it in the proper operating zone.

For this case, proteanTecs saw a 10.77% power saving and said that they’ve typically seen savings in the 9%-14% range. For this data center-oriented customer, this was important because of a limited power budget per rack, cooling limitations, carbon neutrality requirements (PUE), and a high CAPEX. Other benefits are a higher MTTF, lower maintenance costs, and a prolonged system lifetime. proteanTecs claimed a minimal impact on area and that currently most of their designs are in 7nm, 5nm, and below.

The third vendor announced their Aeonic Insight product line including a droop detector on November 14, 2023. Movellus’ Michael Durr, Director of Application Engineering is scheduled to give a talk at DAC on Wednesday, June 26, 2024, titled “Droop! There it is!” Movellus has been long known for their digital clock generation IP and, as one might guess, their design uses a synthetic circuit for detecting changes in the operating environment. Leveraging their clock generation expertise, they are initially targeting an adaptive frequency (or clock) scaling (AFS) approach that also leverages their digital clock generation IP.



Leave a Reply


(Note: This name will be displayed publicly)