New approaches surface for persistent DRAM issue.
Rowhammer is proving to be a difficult DRAM issue to fix.
While efforts continue to mitigate or eliminate the effect, no solid solution has yet made it to volume production. In addition, more aggressive process nodes are expected to exacerbate the problem. In the absence of a fix, testing may give DRAM manufacturers and users a way to segregate devices that are more susceptible to the effect, and so improve system security.
“Row hammering came to our attention because of security,” said Yervant Zorian, fellow and chief technologist for hardware analytics and test at Synopsys. “By doing that hammering, one can reverse-engineer the content of the memory.”
Mohammad Farmani, a researcher at the University of Florida, reported similar findings in a presentation at the 2021 European Test Symposium. “Since the introduction of rowhammer in 2014, every year, several attacks have been presented in which attackers from the software level exploited rowhammer vulnerability on the circuit level and violated the integrity of main memory.”
Doing a full, exhaustive test of rowhammer vulnerability would take far too long to be economically viable. “It’s really time-consuming to do that,” said Zorian. “Even though the test is running at speed, it’s a very long test.”
Currently, the only economically practical option is not to test at all. But a research project [1] presented at the recent European Test Symposium suggests it may be possible to get most of the benefits of an exhaustive test from an approach that is far from exhaustive. If fully deployed, it could give manufacturers feedback on which devices are more vulnerable, and give users a viable incoming test process that would let them be more selective about the parts they use.
A stubborn challenge
Rowhammer is a DRAM-specific phenomenon, where repeatedly accessing a specific memory row can corrupt data on physically adjacent rows. Each time a row is accessed, a small puff of electrons drifts over into a neighboring cell.
Individually, these puffs aren’t significant. But if someone intentionally accesses the row (called the “aggressor”) over and over in quick succession, which is known as “hammering” the row, those accumulated electrons can change the state of the adjacent cell in the neighboring row, referred to as the “victim.” This has been found to be a security challenge, potentially enabling attackers to take control of a system.
If a refresh is executed prior to any corruption, the memory starts afresh. The challenge is when too many accesses occur between refreshes. Closer cell packing makes the problem worse, which is why older DRAMs are less susceptible and newer ones are expected to be worse. The number of accesses required to flip a bit is expected to go down as dimensions shrink.
Such hammering wouldn’t be typical of a normal application, since it’s an unusual access pattern. “In normal operation, you don’t do hammering,” pointed out Zorian. It occurs predominantly during an attack, making this exclusively a security issue, not a quality issue. “Vulnerable cells are not the same as weak cells in the memory,” noted Farmani.
There have been many discussions of both mitigations and possible fixes for rowhammer. Mitigations are in place, with mixed reviews, but no fix has yet been conclusively adopted by the industry. “Recent attacks show that the current solutions are not completely effective,” said Farmani.
Implicit in the consideration of testing rowhammer is the expectation that some chips are more vulnerable than others. Ideally, testing could let manufacturers cull particularly vulnerable chips. Or, implemented as an incoming quality inspection, system builders could decline to use more vulnerable parts, considering them a security risk.
In the latter case, one is dealing with a DIMM, not an individual memory chip. That means that testing has to exercise not only a single chip, but every chip on that DIMM.
Fig. 1: Each DIMM has multiple DRAMs, each of which has multiple banks, each of which has multiple subarrays of DRAM cells. Source: © 2021 IEEE
If all cells in each chip are equally at risk, then testing provides no benefit. “If there is no consistency every time we apply the rowhammer, then the errors happen in a new location [with each test],” said Farmani. “It means that every time all the cells have the same probability to show errors, which makes it infeasible to find a test solution.”
That’s because exhaustive testing is simply too expensive. One would have to expose each row to hundreds of thousands of accesses, examining adjacent rows for errors, in order to perform the test.
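A rough back-of-the-envelope estimate shows how quickly the cost adds up. The sketch below simply multiplies rows by hammers by the time per activation. Every figure in it (65,536 rows per bank, 8 banks, 300,000 hammers per row, 50 ns per activation) is an illustrative assumption rather than a value from the research, and the function name is hypothetical.

```python
def exhaustive_hammer_time_s(rows_per_bank: int, banks: int,
                             hammers_per_row: int, ns_per_activation: int) -> float:
    """Lower bound on pure hammering time for one chip, in seconds.

    Assumes banks are hammered one after another and ignores the time
    needed to write test patterns and read back the victim rows.
    """
    total_activations = rows_per_bank * banks * hammers_per_row
    return total_activations * ns_per_activation / 1e9


# Illustrative figures only (assumptions, not measurements).
seconds = exhaustive_hammer_time_s(
    rows_per_bank=65_536, banks=8, hammers_per_row=300_000, ns_per_activation=50
)
print(f"{seconds / 60:.0f} minutes of hammering per chip")  # about 131 minutes
```

Under these assumptions the hammering alone takes more than two hours per chip, before any pattern writes or read-backs, and a DIMM carries several chips.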
Testing from an SoC
Rowhammer tests are available today in memory built-in self-test (MBiST) modules. Such MBiST blocks usually are placed in an SoC for use in testing the embedded memories on the SoC. In the case of DRAM, however, the memory is external to the SoC rather than being embedded.
For this purpose, the MBiST block can access the DRAM through the high-speed DRAM access port. This provides a way to test the DRAM under the control of an SoC. The MBiST capabilities may cover many different kinds of faults, but rowhammer can be one optional test. “We have a special engine that talks to the DDR memory, to the HBM memory,” said Zorian. “And then we apply this hammering approach.”
Such a capability is particularly useful in advanced packages. “When you package a die in a multi-chip module, the die is still exposed during the packaging, the handling, and so on,” said Zorian. “It can be damaged. That’s why it has to be re-tested post-packaging. But I don’t have direct access to it anymore. It’s connected to logic. And if I try to probe it, it’s extremely difficult, because these are very high-speed connections you’re testing.”
Some of the chips may even be upside down or covered by another chip in a 3D configuration. So an MBiST approach via the DRAM port can be an effective way of testing such an assembly.
Automotive applications have the unique requirement for power-on, interim, and power-off tests. But because rowhammer is a security issue, not a quality issue, it’s less likely to be performed in-system (although it could be at the system-builder’s option). This is primarily a manufacturing-level test prior to a system being deployed in the field — or prior to a DIMM being installed.
Some MBiST blocks allow for different programs to be executed in different settings. So the full test, including rowhammer testing, could be performed at manufacturing test. Other settings — say, for power-on test and for interim tests — may involve shorter tests, and they can omit the rowhammer test entirely.
“Some people like to have a shorter test, just to see if the DRAM is alive,” said Zorian. “In other applications, such as in automotive or high-end computing, they apply the entire DRAM test via that [SoC-to-DRAM] bridge.”
With a programmable MBiST block, a generic rowhammer-test algorithm can be built into the MBiST, while leaving important parameters under the control of whoever writes the test program. In that manner, the specific addresses to test, and the number of “hammers” to perform, can be programmed in.
“In our BiST engine, we have a loop,” said Zorian. “And we leave it to the control of the packaging company, whoever has put it together, and they decide how many cycles to do.”
Test time is always important. So the big question becomes, “How many rows should be accessed, how many columns should be read, which specific rows and columns should be tested, and how many hammers should be performed?” An MBiST block may allow that information to be programmed into the test, but it provides no guidance on its own about how to set those parameters.
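The idea of a fixed hammer loop with programmable parameters can be sketched in software. The example below is a toy illustration, not any vendor's MBiST interface: `HammerParams`, `ToyDram`, and `run_hammer_test` are hypothetical names, the device model is a stand-in in which a cell flips once a neighbor has been activated a set number of times, and it assumes rows with adjacent addresses are physically adjacent.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class HammerParams:
    """Hypothetical knobs a programmable test block might expose."""
    aggressor_rows: Sequence[int]   # which rows to hammer
    check_columns: Sequence[int]    # which columns to read back
    hammer_count: int               # activations per aggressor row


class ToyDram:
    """Stand-in device, not a real DRAM model: a victim cell flips from 1 to 0
    once a neighboring row is activated at least its threshold number of times."""

    def __init__(self, rows: int, cols: int, thresholds: Dict[Tuple[int, int], int]):
        self.rows, self.cols = rows, cols
        self.thresholds = thresholds            # (row, col) -> hammers needed to flip
        self.data = [[1] * cols for _ in range(rows)]

    def write_row(self, row: int, value: int) -> None:
        self.data[row] = [value] * self.cols

    def hammer(self, row: int, count: int) -> None:
        for victim in (row - 1, row + 1):
            if 0 <= victim < self.rows:
                for col in range(self.cols):
                    limit = self.thresholds.get((victim, col))
                    if limit is not None and count >= limit:
                        self.data[victim][col] = 0

    def read(self, row: int, col: int) -> int:
        return self.data[row][col]


def run_hammer_test(dram: ToyDram, params: HammerParams) -> List[Tuple[int, int, int]]:
    """Return (aggressor, victim, column) for every flipped bit observed."""
    failures = []
    for aggressor in params.aggressor_rows:
        victims = [r for r in (aggressor - 1, aggressor + 1) if 0 <= r < dram.rows]
        for victim in victims:
            dram.write_row(victim, 1)           # start each pass from known data
        dram.hammer(aggressor, params.hammer_count)
        for victim in victims:
            for col in params.check_columns:
                if dram.read(victim, col) != 1:
                    failures.append((aggressor, victim, col))
    return failures


dram = ToyDram(rows=8, cols=4, thresholds={(2, 1): 100_000, (5, 3): 400_000})
short_run = run_hammer_test(dram, HammerParams([1, 3, 4, 6], [0, 1, 2, 3], 200_000))
long_run = run_hammer_test(dram, HammerParams([1, 3, 4, 6], [0, 1, 2, 3], 400_000))
print(short_run)  # [(1, 2, 1), (3, 2, 1)]
print(long_run)   # [(1, 2, 1), (3, 2, 1), (4, 5, 3), (6, 5, 3)]
```

The loop itself never changes. Only the parameters do, and the second run shows how a larger hammer count exposes a cell that the shorter run missed.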
The University of Florida team was searching for a more efficient way to systematically test for vulnerable parts — that is, do it in a manner directed at the most vulnerable cells. This would apply to an MBiST implementation or a test controlled externally by automated test equipment (ATE). But, in order for that to make sense, some locations within each memory array would need to be more vulnerable than others — and this correlation would need to be consistent from part to part for a given design.
Alternatively, if all cells were equally subject to corruption, then testing a random subset of the memory might provide some benefit in less time. But selecting what to test and how much to hammer would be more or less randomly chosen, making it difficult to justify the specific choices.
So, given an ATE or MBiST program that allows a subset of the array to be tested for rowhammer vulnerability, which subset should that be?
Getting acquainted with the neighbor
A systematic, directed rowhammer test requires knowledge of row adjacencies. There are three levels of address in a design like this. There’s the logical address that the software uses, and that address is turned into a physical address by the memory management unit (MMU). The third level is where a row actually sits on the die: the physical layout of these DRAMs can scramble addresses, such that two rows with adjacent “physical” addresses may not be physically adjacent.
“We used to have five, six types of scrambling – bit twisting, address scrambling, block scrambling, and so on,” said Zorian. “Today, there are about 20 types of scrambling.”
Fig. 2: An example of row address scrambling. The top and bottom sets of rows aren’t scrambled, but the middle ones are. Source: © 2021 IEEE
The chip manufacturer will know how to map the physical addresses to what the research team referred to as “implementation addresses.” But a system builder, as a DRAM customer, would not know that mapping. In order to come up with a test, there would need to be a way to reverse-engineer the implementation adjacencies.
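A small example may make the mapping problem concrete. The scramble below is invented for illustration (address bits 0 and 1 are swapped for rows 4 through 11 of a 16-row array) and does not describe any real device; `to_implementation_row` and `adjacent_rows` are hypothetical names.

```python
from typing import List


def to_implementation_row(physical_row: int) -> int:
    """Hypothetical scramble: in rows 4-11, address bits 0 and 1 are swapped."""
    if 4 <= physical_row <= 11:
        bit0 = physical_row & 1
        bit1 = (physical_row >> 1) & 1
        return (physical_row & ~0b11) | (bit0 << 1) | bit1
    return physical_row


def adjacent_rows(physical_row: int, total_rows: int = 16) -> List[int]:
    """Physical addresses of the rows that sit next to `physical_row` on the die."""
    position_of = {row: to_implementation_row(row) for row in range(total_rows)}
    row_at = {position: row for row, position in position_of.items()}
    here = position_of[physical_row]
    return [row_at[p] for p in (here - 1, here + 1) if p in row_at]


print(adjacent_rows(2))   # [1, 3]  unscrambled region
print(adjacent_rows(5))   # [6, 7]  scrambled: not rows 4 and 6
```

A tester that assumed rows 4 and 6 were the victims of row 5 would check the wrong cells, which is why the mapping has to be known or recovered first.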
This was the first step in the Florida team’s project. It’s exhaustive work, since it’s done by hammering the device row by row to see which rows become corrupted. Hammering one aggressor row should cause two other victim rows to show bit flips. Those two victim rows would then be assumed to lie on either side of the row being hammered. Indeed, this is what they found.
“For each aggressor, there are two victim rows except the beginning and the end of the memory,” Farmani said. The three different vendors tested (whose identities were not disclosed) each had a different map, but for each vendor that map was consistent over multiple devices.
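Once each aggressor's victims are known, the row order can be rebuilt by walking from one edge row to the other. The sketch below is one simple way to do that under the two-victims-per-aggressor assumption. It is not presented as the team's algorithm, and the observations are made up.

```python
from typing import Dict, List, Optional, Set


def implementation_order(victims_of: Dict[int, Set[int]]) -> List[int]:
    """Rebuild the on-die row order from per-aggressor victim sets.

    Assumes interior rows have exactly two victims and edge rows have one.
    """
    ends = sorted(row for row, victims in victims_of.items() if len(victims) == 1)
    if not ends:
        raise ValueError("no edge row found; observations look incomplete")
    order = [ends[0]]
    previous: Optional[int] = None
    while len(order) < len(victims_of):
        # Step to the victim we did not just come from.
        onward = [row for row in victims_of[order[-1]] if row != previous]
        if not onward:
            break
        previous = order[-1]
        order.append(onward[0])
    return order


# Hypothetical observations: hammering row 2 flipped bits in rows 0 and 1, etc.
observed = {0: {2}, 2: {0, 1}, 1: {2, 3}, 3: {1}}
print(implementation_order(observed))  # [0, 2, 1, 3]
```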
There is a concern that with newer technologies the effect may corrupt not just the neighboring rows, but the rows next to the neighbors. That’s because the rows will be packed closer together, and the puff of electrons may be able to reach additional rows.
In that case, using this methodology, one would expect four rows with corruption – and with two of those rows being more corrupted than the other two. The more corrupted rows would then be interpreted as the immediate neighbors, and the less corrupted rows would be the neighbors’ neighbors. This project used older memories, so this wasn’t a consideration in their specific work.
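That interpretation amounts to ranking the corrupted rows by flip count. A minimal sketch follows, with invented flip counts and a hypothetical function name. The project itself did not need this step, and real data near the array edges or with noisy counts would call for more care.

```python
from typing import Dict, List, Tuple


def split_by_distance(flips_per_row: Dict[int, int]) -> Tuple[List[int], List[int]]:
    """Split corrupted rows into (immediate neighbors, neighbors' neighbors).

    The two rows with the most flips are taken as immediate neighbors and
    the next two as second-order neighbors.
    """
    corrupted = [row for row, flips in flips_per_row.items() if flips > 0]
    ranked = sorted(corrupted, key=lambda row: flips_per_row[row], reverse=True)
    return sorted(ranked[:2]), sorted(ranked[2:4])


# Hypothetical flip counts seen after hammering one aggressor row.
near, far = split_by_distance({10: 3, 11: 41, 13: 37, 14: 2, 20: 0})
print(near, far)  # [11, 13] [10, 14]
```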
Because older devices require many more hammers to cause corruption, the team hammered each row 15 million times – enough that it would be hard to do between refresh cycles. So they disabled the automatic refresh, refreshing manually only when desired (although extending the period between refreshes could also result in some random cells leaking their state away naturally).
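The arithmetic behind that decision is easy to check. The 15 million figure comes from the project. The 50 ns per activation and the 64 ms refresh window are assumptions typical of commodity DRAM, not values reported by the team.

```python
HAMMERS = 15_000_000             # hammers per row, as reported for the project
NS_PER_ACTIVATION = 50           # assumption: time for one row activation
REFRESH_WINDOW_NS = 64_000_000   # assumption: 64 ms refresh window

hammer_time_ns = HAMMERS * NS_PER_ACTIVATION
windows_needed = hammer_time_ns / REFRESH_WINDOW_NS
max_hammers_per_window = REFRESH_WINDOW_NS // NS_PER_ACTIVATION

print(f"{hammer_time_ns / 1e6:.0f} ms of hammering")          # 750 ms
print(f"{windows_needed:.1f} refresh windows")                # 11.7
print(f"{max_hammers_per_window:,} hammers fit in one window")  # 1,280,000
```

Under these assumptions, hammering one row takes more than ten times the refresh window, so automatic refresh would wipe out the accumulated disturbance long before the count was reached.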
Looking for correlation and patterns
Once the address mapping is known, the next step is to figure out whether certain parts of the chip are more vulnerable than others. This restarts a hammering sequence, although with fewer hammers. By applying autocorrelation to the results, they can separate cells that are corrupted randomly from those that are corrupted systematically.
What they ended up with is a distribution of vulnerable cells, including the number of hammers it takes to corrupt different cells. This is where it’s important for there to be consistency from chip to chip for all but the random bit flips, and this is what was found.
“We saw that the rowhammer-vulnerable cells are highly consistent,” said Farmani. “They have correlation in the bit lines, which helps us to locate the more vulnerable lines in the memory. They didn’t show any correlation with the rows. Different chips from the same series of a vendor are very similar to each other. Rowhammer-vulnerable cells, on average, are more than 80% consistent for all three vendors.” That means that the same test can be used for all of the devices with the same design.
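To illustrate what chip-to-chip consistency and bit-line ranking might look like in data, the sketch below uses plain set overlap (Jaccard similarity) and flip counts per bit line. This is a deliberately simpler stand-in for the autocorrelation analysis the team used, the article does not say how its 80% figure was computed, and the cell coordinates are invented.

```python
from collections import Counter
from itertools import combinations
from typing import List, Sequence, Set, Tuple

Cell = Tuple[int, int]  # (row, bit line)


def pairwise_consistency(chips: Sequence[Set[Cell]]) -> float:
    """Mean Jaccard overlap of vulnerable-cell sets across all chip pairs."""
    scores = [len(a & b) / len(a | b) for a, b in combinations(chips, 2) if a | b]
    return sum(scores) / len(scores) if scores else 0.0


def rank_bit_lines(chips: Sequence[Set[Cell]]) -> List[int]:
    """Bit lines ordered from most to fewest vulnerable cells (ties by index)."""
    counts = Counter(bit_line for chip in chips for (_row, bit_line) in chip)
    return sorted(counts, key=lambda bit_line: (-counts[bit_line], bit_line))


# Hypothetical vulnerable cells found on three chips of the same design.
chip_a = {(0, 3), (5, 3), (9, 3), (2, 7)}
chip_b = {(0, 3), (5, 3), (9, 3), (4, 1)}
chip_c = {(0, 3), (5, 3), (9, 3), (2, 7)}
chips = [chip_a, chip_b, chip_c]

print(f"{pairwise_consistency(chips):.2f}")  # 0.73
print(rank_bit_lines(chips))                 # [3, 7, 1]
```

In this made-up data, bit line 3 dominates on every chip while the rows involved show no pattern, which mirrors the kind of result the team described.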
The practical impact is that for a given memory design, one would need to go through this mapping and correlation process once, using multiple chips, to effectively characterize the rowhammer distribution for that design. That data could then seed the test programs for any other chips that use the same design.
It also means that the process would need to be performed for each design from each vendor. In theory, that data could be made available for others to use in creating test programs. In practice, business considerations may result in the data being collected separately by multiple organizations.
Providing for tunable test programs
There’s no hard drop-off between vulnerable and invulnerable cells. They change by degrees. So the distribution they found during this phase told them how many cells they could flip for a given number of hammers. It also provided a distribution of bit line correlations, in order of vulnerability.
This gives test creators two knobs to play with when trading off coverage and cost. For higher coverage, the program could relax the correlation, taking in more bit lines, and/or increase the number of hammers on each row tested, at the expense of a longer test. That assumes, of course, that the manner of examining the results can benefit from fewer bit lines being checked.
A shorter test could focus only on the most vulnerable bit lines, using fewer hammers (since the more vulnerable bit lines require fewer hits to corrupt their neighbors). As for the rows, there’s no correlation there, so any subset of the rows could be used for the test. This becomes a third knob, with more rows providing higher coverage for higher cost.
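The three knobs can be put together in a small planning sketch. Everything here is hypothetical: the `HammerPlan` and `build_plan` names, the table of hammers needed per bit line, the use of a row stride to pick a row subset, and the 50 ns per activation used for the time estimate.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class HammerPlan:
    bit_lines: List[int]   # knob 1: which bit lines to check
    hammer_count: int      # knob 2: hammers per aggressor row
    rows: List[int]        # knob 3: which rows to hammer

    def hammer_time_s(self, ns_per_activation: int = 50) -> float:
        """Hammering time only; pattern writes and read-backs are ignored."""
        return len(self.rows) * self.hammer_count * ns_per_activation / 1e9


def build_plan(hammers_to_flip: Dict[int, int], max_bit_lines: int,
               total_rows: int, row_stride: int) -> HammerPlan:
    """Pick the most vulnerable bit lines and hammer just enough to flip them."""
    most_vulnerable = sorted(hammers_to_flip, key=hammers_to_flip.get)[:max_bit_lines]
    hammer_count = max(hammers_to_flip[bit_line] for bit_line in most_vulnerable)
    rows = list(range(0, total_rows, row_stride))
    return HammerPlan(most_vulnerable, hammer_count, rows)


# Hypothetical characterization data: bit line -> hammers needed to flip it.
table = {3: 150_000, 7: 220_000, 1: 400_000, 12: 900_000}

quick = build_plan(table, max_bit_lines=2, total_rows=65_536, row_stride=16)
thorough = build_plan(table, max_bit_lines=4, total_rows=65_536, row_stride=4)
print(quick.bit_lines, quick.hammer_count, f"{quick.hammer_time_s():.1f} s")
print(thorough.bit_lines, thorough.hammer_count, f"{thorough.hammer_time_s():.1f} s")
```

With these invented numbers the quick plan needs about 45 seconds of hammering and the thorough plan about 737 seconds, which is the coverage-versus-cost trade the knobs are meant to expose.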
While newer process nodes are expected to result in more vulnerable chips, this is actually good news for the test. The fewer hammers required to cause corruption, the faster the test.
For tests performed by ATE, these parameters provide direction as to how to write the test program. For a programmable MBiST block, as long as the hammer count and target addresses can be programmed, the ATE test (or a variant of it) can be implemented in the MBiST program.
Further research will be required to extend these results to more vendors and newer technologies. Automation then could see rowhammer vulnerability being added to the bring-up characterization of a new DRAM chip.
Sources
[1] M. Farmani, M. Tehranipoor, F. Rahman, “RHAT: Efficient RowHammer-Aware Test for Modern DRAM Modules,” 2021 IEEE European Test Symposium (ETS), 2021, pp. 1-6, © 2021 IEEE.
Why not add simple hardware to prevent consecutive accesses to the same row, and hence prevent rowhammer, since that access pattern is not needed in functional mode? Just a thought!
Thanks for this well-written article on a paper presented recently at the IEEE European Test Symposium (ETS-2021). The full paper on this topic will soon appear in IEEE’s digital library, IEEE Xplore. In the meantime, papers and Zoom video recordings of the presentation at ETS-2021 are available for on-demand viewing until June 27, 2021 for all registered ETS attendees. BTW, you can still sign up for ETS. Of course, the ETS live program concluded on May 27, but you can still watch all sessions. See: https://ets2021.vfairs.com/.
Erik Jan Marinissen
Good thought indeed, Baker, as pointed out in an ISCA 2018 paper and presentation, “Mitigating Wordline Crosstalk using Adaptive Trees of Counters,” and already implemented in commercially available 2Gb or 4Gb (and recently 8Gb) DDR3L DRAM ICs, as mentioned in the author’s previous related article, “DRAM’s Persistent Threat To Chip Security.” Now you may ask why not all DRAM brands implement this watchdog (and “out-of-term” row refreshing) scheme to make sure that their DRAM does not leave a hardware backdoor open to ever more sophisticated rowhammer attacks. The answer is that it comes at an extra cost in chip real estate, which makes it tough to survive in the price-driven, cut-throat DRAM business, one that some major DRAM brands, such as Qimonda, Promos, and Elpida, did not survive over recent decades.