What comes after DRAM and SRAM? Maybe more of the same, but architected differently.
New memory types and approaches are being developed and tested as DRAM and Moore’s Law both run out of steam, adding greatly to the confusion over what comes next and how that will affect chip designs.
What fits where in the memory hierarchy is becoming less clear as the semiconductor industry grapples with these changes. New architectures, such as fan-outs and 2.5D, raise questions about how many levels of cache need to be on-die, and whether high-speed connections and shorter distances can provide equal or better performance off-chip. But that’s only part of the picture. With no obvious successor to DDR4, there are more questions being asked about which type of memories to use for what, how they should be packaged and used, and how those new memories will impact data storage further downstream at the disk level.
There are a number of factors that need to be considered:
• SRAM takes up valuable real estate on a die. In some chips it can account for as much as 80% of an SoC’s area. If some of that can be moved off-die using high-bandwidth connectivity, chips either can be made smaller, which has a direct effect on yield and therefore cost, or they can include additional functionality that previously didn’t fit on the die.
• Contention for memory resources on a complex SoC is growing as more functionality is added onto a die. Moving some memory off-chip can alleviate wiring congestion, which in turn reduces heat and other physical effects. The result is improved reliability and performance of memory, as well as other components.
• Isolating memory and rightsizing it to processors can be much more power efficient and significantly faster, particularly if software is written to support it. In the past, there has been much talk but little effort in this area, in large part because SRAM and DRAM were proven technologies. But with more pressure to reduce power, these kinds of decisions are getting much more serious attention.
• In the datacenter, the quantities of data that need to be moved in and out of memories are huge and growing. Managing big data is moving from a macro problem to one that requires much more attention at the processor level. Understanding how to best process that data, and whether it makes more sense to store it locally or centrally, can have a significant impact on the memory architecture.
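The link between die size, yield and cost in the first bullet above can be illustrated with the simple Poisson yield model, Y = exp(-A * D0). This is only a rough sketch. The defect density, the two die areas and the function names are assumptions chosen for illustration, not figures from any foundry or from the companies quoted here.

```python
import math


def poisson_yield(die_area_cm2: float, defects_per_cm2: float) -> float:
    """Fraction of good dies under the simple Poisson model Y = exp(-A * D0)."""
    return math.exp(-die_area_cm2 * defects_per_cm2)


def relative_cost_per_good_die(die_area_cm2: float, defects_per_cm2: float) -> float:
    """Silicon area spent per good die, in arbitrary units.

    Ignores wafer-edge losses, test cost and packaging cost.
    """
    return die_area_cm2 / poisson_yield(die_area_cm2, defects_per_cm2)


# Assumed values, for illustration only.
DEFECTS_PER_CM2 = 0.5
ORIGINAL_DIE_CM2 = 1.0  # hypothetical die with all SRAM on-chip
REDUCED_DIE_CM2 = 0.6   # same die after moving some SRAM off-chip

for label, area in (("original", ORIGINAL_DIE_CM2), ("reduced", REDUCED_DIE_CM2)):
    y = poisson_yield(area, DEFECTS_PER_CM2)
    cost = relative_cost_per_good_die(area, DEFECTS_PER_CM2)
    print(f"{label}: area={area:.2f} cm^2  yield={y:.1%}  relative cost={cost:.3f}")
```

Under this model the smaller die wins twice, because it uses less silicon and a larger share of the dies work. The model leaves out the cost of the off-die memory and the package, which any real comparison would have to include.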
On-chip, off-chip, way off-chip
How much memory stays on chip in the form of high-speed static RAM (SRAM), and how much is moved off-chip in the form of dynamic RAM, used to be a straightforward calculation when it involved homogeneous processors and far less data. That’s no longer the case. In addition to variations on this theme, such as eDRAM for L3 cache, there are now more heterogeneous compute elements scattered around an SoC. That has created the need for a better understanding of where bottlenecks will show up within a complex chip, along with the best ways to deal with those bottlenecks.
Cache allows commonly used instructions or data to be stored very close to the processor. But because cache sizes are limited to what can be put on a die, they are set up hierarchically, depending upon how frequently that data needs to be accessed and the tolerance for latency.
“When you add L2 cache, you get another 3% improvement,” said Charlie Cheng, CEO of Kilopass. “That may not sound like much, but it’s a big deal because instead of wasting 100 cycles you only have to wait 3 cycles. If you go up to L3, you get 2% improvement, and with L4 you get another 2%. The last 2% is either DRAM or SSD.”
Cheng said L4 cache has blurred the lines between SRAM and DRAM. “That boundary is really interesting. What can you do to make that boundary more optimized?”
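Cheng’s percentages can be turned into a back-of-the-envelope average access time. The sketch below reads them as the share of memory accesses served at each level, which is one interpretation of the quote. The 91% L1 share is inferred as the remainder, and every latency other than the 3-cycle and 100-cycle figures is an assumption.

```python
def average_access_cycles(levels):
    """Weighted average latency in cycles.

    levels: iterable of (name, fraction_of_accesses_served, latency_cycles).
    """
    levels = list(levels)
    total = sum(fraction for _, fraction, _ in levels)
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"fractions sum to {total}, expected 1.0")
    return sum(fraction * latency for _, fraction, latency in levels)


# Shares for L2, L3, L4 and main memory follow the percentages in the quote.
# The L1 share is the inferred remainder; the L1, L3 and L4 latencies are assumed.
full_hierarchy = [
    ("L1", 0.91, 1),
    ("L2", 0.03, 3),
    ("L3", 0.02, 10),
    ("L4", 0.02, 30),
    ("DRAM", 0.02, 100),
]

# Same workload with only L1 in front of main memory.
l1_only = [
    ("L1", 0.91, 1),
    ("DRAM", 0.09, 100),
]

print(f"L1 only:        {average_access_cycles(l1_only):.2f} cycles per access")
print(f"full hierarchy: {average_access_cycles(full_hierarchy):.2f} cycles per access")
```

With these assumed numbers, a few percent of accesses caught by each extra cache level cuts the average wait by more than half, because each miss that reaches main memory is so expensive.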
The answer to that question is complicated. For an SoC, that boundary might be extra memory on a board, connected by a high-speed interface. In a data center, it could involve additional cache levels plus a completely different DRAM architecture that emphasizes moving the data closer to the processors. So instead of data being hundreds of feet away from the server and connected over networks with variable latency, depending on traffic, that data can be much closer to the processors.
“The world is going down the path of scaling out to more servers,” said Steven Woo, vice president of solutions marketing at Rambus. “The problem is that the data is distributed among so many servers that you risk a long reach across high-latency networks to get that data. We’ve looked at a number of ways to change that equation. One of the solutions we’re experimenting with is cards with FPGA flexibility that include 24 DIMM modules. So you may have 1 card with 1.5 terabytes of data on a single card. That allows you to execute portions of the program right up against memory.”
This represents a radically different way of looking at how and where large amounts of data are stored and accessed. One potential approach is to move these massive memory cards inside of server racks, greatly reducing the distance the data needs to travel. “You used to drag data to the processor,” Woo said. “That worked fine when there was not as much data. But now we’ve got terabytes and petabytes of data, and the slowest approach is to move that data around. It’s much more efficient to move the computation to the data.”
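Woo’s point about moving computation rather than data comes down to simple transfer arithmetic. The sketch below uses the 1.5-terabyte card capacity from his example. The link speed, the size of the code shipped to the card and the function name are assumptions, and the model ignores protocol overhead and network contention.

```python
def transfer_seconds(num_bytes: float, link_gbit_per_s: float, latency_s: float = 0.0) -> float:
    """Idealized time to push num_bytes over a link: latency plus bytes / bandwidth."""
    return latency_s + (num_bytes * 8) / (link_gbit_per_s * 1e9)


DATASET_BYTES = 1.5e12  # 1.5 TB, the card capacity cited above
KERNEL_BYTES = 10e6     # assumed: 10 MB of code sent to run next to the data
LINK_GBPS = 10.0        # assumed: a 10 Gb/s network link

move_data = transfer_seconds(DATASET_BYTES, LINK_GBPS)
move_compute = transfer_seconds(KERNEL_BYTES, LINK_GBPS)

print(f"drag the data to the processor: {move_data:,.0f} s")
print(f"send the code to the data:      {move_compute:.3f} s")
```

Even in this idealized case the full dataset takes minutes to move while the code takes milliseconds, which is the gap the near-memory approach is trying to exploit.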
New memory types
There are other ways to tackle memory latency and bandwidth issues, as well. New memory types, such as magnetoresistive RAM (MRAM), resistive RAM (ReRAM) and ferroelectric RAM (FeRAM), are receiving much more attention these days than in the past.
MRAM has been touted as a universal memory, an idea that has been floating around since the 1990s when it was suggested that one type of memory could be used to replace SRAM and DRAM. So far that has not come about, and it may never happen. While DRAM and SRAM continue to shine in terms of manufacturability, that kind of mass production for MRAM reportedly remains difficult to achieve.
Magnetoresistive RAM stores data magnetically. DRAM and SRAM, by contrast, store data using an electric charge. ReRAM, another contender, works by changing the resistance in a dielectric material. And FeRAM is similar to DRAM, but uses a ferroelectric layer rather than a dielectric one.
But DRAM also is running out of steam, so some sort of succession plan is necessary. While there is a low power DDR5 for mobile devices on the JEDEC roadmap, the path for standard DRAM ends at DDR4. There are efforts to scale DDR4 to new process geometries in order to continue reducing the cost, but there are few proponents at this point for extending the architecture to DDR5.
“For the last 30 years, whatever the PC makers bought was also the lowest cost memory option,” said Drew Wingard, CTO at Sonics. “That’s gone. There’s nothing beyond DDR4.”
What comes next may be a new memory type, or it may be a new architectural approach using the same technology. The Hybrid Memory Cube (HMC), for example, stacks DRAM dies on a logic layer and connects them with through-silicon vias (TSVs).
Another option is a high-bandwidth memory interface (HBM2) to a stack of DRAM chips, which are connected internally with TSVs and externally to one or more chips using microbumps. The main benefits of HBM are increased speed over current DRAM technology, a smaller form factor, and dual sourcing. At this point it is sold by both Samsung and SK Hynix. Since its commercial rollout in early 2015, HBM has seen commercial adoption in 2.5D packages in the networking and graphics markets, with more designs expected this year.
“If you look at HMC and HBM in terms of performance and power, there are very attractive numbers using very different architectures,” said Wingard. “With HMC, you’re improving performance by eliminating the locality of data. With HBM, it’s very different. You basically get rid of the pad ring between the SoC and DRAM.”
The best options
Which route is best will depend on what design teams are trying to achieve. For inexpensive and quick implementations, the simple choice at the moment is still a combination of SRAM and DRAM. For more advanced designs—where power and performance are competitive metrics, and where companies have more time and money to spend on exploration—there are more options to consider. There are benefits and drawbacks to each. Consequently, the amount of work that needs to be done up front and throughout the design process increases dramatically.
“Advanced users go through hundreds of configurations,” said Bill Neifert, director of models technology at ARM. “You need to model all of these things accurately, and the accuracy of the model lets you understand the impact of performance tradeoffs.” He noted this applies to system configurations, as well, including how interconnects are structured, caches are sized and memory controllers are configured.
One big consideration involves thermal effects. The more data that needs to be moved, the greater the heat. But it’s also true that the higher the heat, the slower the movement of that data because systems will automatically slow down once they hit thermal limits in order to protect the circuitry.
“If you look at a high-end server, the biggest bottleneck is the memory interface, not the compute power,” said Arvind Shanmugavel, senior director of applications engineering at Ansys’ Apache business unit. “The more data that is transferred, the greater the heat. You need to look at that from a chip, package and board level. But when you move from 20nm to 16/14nm and 10nm, you’ve also got to deal with issues like self-heat. The localized temperature effects are exacerbated. There is a higher thermal gradient build-up.”
Shanmugavel noted that the No. 1 effect of increased temperature is on reliability, particularly how long metal lines last before they fail. “A small change in temperature has a big impact on lifespan. With ICs that are connected with microbumps, a small change in tolerance can lead to problems. Historically, you would design a processor to run as fast as it could and tune it down from there. Now, you need to understand the maximum performance in a thermal envelope. All of that has to be simulated.”
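The sensitivity of lifespan to temperature can be sketched with the temperature term of Black’s equation, a standard electromigration model in which lifetime scales with exp(Ea / kT). This is not the simulation flow Shanmugavel describes. The 0.9 eV activation energy and the two operating temperatures are assumptions, and current density is treated as unchanged.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def lifetime_ratio(temp_cool_c: float, temp_hot_c: float,
                   activation_energy_ev: float = 0.9) -> float:
    """How many times longer a metal line lasts at temp_cool_c than at temp_hot_c.

    Uses only the exp(Ea / kT) term of Black's equation, so current density
    is assumed to be the same at both temperatures.
    """
    t_cool_k = temp_cool_c + 273.15
    t_hot_k = temp_hot_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_cool_k - 1.0 / t_hot_k)
    return math.exp(exponent)


# Assumed operating points: a 10-degree rise from 85 C to 95 C.
ratio = lifetime_ratio(85.0, 95.0)
print(f"lifetime at 85 C is about {ratio:.1f}x the lifetime at 95 C")
```

With these assumed values, a 10-degree rise cuts the expected lifetime by more than half, which is why thermal envelopes now have to be part of the design up front.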
Flash, disk and other options
There are changes in other memory types, as well. Flash memory has become critical to the basic input/output system (BIOS) of a computer because it can be easily updated. The trouble with flash is that it wears out more quickly than DRAM or electromechanical disks. The search is on to find an alternative to solid state drives, which are based on NAND flash, that will last longer.
That is the impetus behind 3D XPoint, which was developed by Intel and Micron as a faster and much more robust alternative to SSD. The memory uses a 3D lattice arrangement for stacking non-volatile memory (NVM), with a big emphasis on lower latency and higher throughput. Intel claims the memory is up to 1,000 times faster than NAND with up to 1,000 times greater endurance and 10 times more density. That memory is expected to begin showing up in computers next year under Intel’s Optane moniker.
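The wear-out concern behind this search can be estimated with simple endurance arithmetic: the total bytes a drive can absorb, divided by how fast it is written. Every number below is an assumption chosen for illustration, including the program/erase cycle rating and the write amplification factor. The last line simply takes the vendor’s “up to 1,000 times” endurance claim at face value.

```python
def flash_lifetime_years(capacity_bytes: float, pe_cycles: float,
                         host_bytes_per_day: float,
                         write_amplification: float = 1.0) -> float:
    """Rough years until the rated program/erase cycles are used up.

    Assumes perfect wear leveling and a constant daily write volume.
    """
    total_writable_bytes = capacity_bytes * pe_cycles
    physical_bytes_per_day = host_bytes_per_day * write_amplification
    return total_writable_bytes / physical_bytes_per_day / 365.0


# Assumed drive and workload.
CAPACITY = 512e9       # 512 GB drive
PE_CYCLES = 3000       # rated program/erase cycles per cell
DAILY_WRITES = 100e9   # 100 GB written by the host per day
WRITE_AMP = 2.0        # each host write costs two physical writes

baseline = flash_lifetime_years(CAPACITY, PE_CYCLES, DAILY_WRITES, WRITE_AMP)
heavy = flash_lifetime_years(CAPACITY, PE_CYCLES, 10 * DAILY_WRITES, WRITE_AMP)
claimed = flash_lifetime_years(CAPACITY, 1000 * PE_CYCLES, 10 * DAILY_WRITES, WRITE_AMP)

print(f"baseline workload:             {baseline:.1f} years")
print(f"10x heavier workload:          {heavy:.1f} years")
print(f"heavier workload, 1,000x endurance: {claimed:,.0f} years")
```

The point of the sketch is that endurance only becomes a limiting factor under write-heavy workloads, which is exactly where datacenter storage is headed.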
There also is read-only memory, including programmable ROM, field-programmable ROM, and one-time programmable NVM, where bit settings are locked by fuses or antifuses. Each of these serves a purpose in the memory hierarchy, with OTP gaining ground in such areas as automotive electronics due to its temperature resilience and better security.
Even spinning disk drives are getting a boost through fusion drives, which combine flash with standard magnetic storage, and through new options that are likely to be rolled out over the next couple of years. One is heat-assisted magnetic recording, which allows data to be written on a much smaller scale than in the past. Also on deck are shingled magnetic recording, which increases density using overlapping tracks, and helium-filled drives, which reduce friction and lower overall power consumption.
But how successful any of these technologies becomes depends not on just one element in the memory hierarchy, from cache all the way up to disk, but on all of them. And unlike in the past, when SRAM and DRAM were fixed quantities, they are all now in flux.
Changes are being considered at every level of the memory hierarchy, ranging from on-chip SRAM all the way up to external disk storage. The focus is still on improving density to reduce area, and on lower power and higher performance, but how to achieve all of that will have to change.
There already are a number of architectural and packaging options to move data in and out of memory more quickly, an increased focus on new memory types, and different ways to utilize memory. More will follow as the industry turns its focus to improving efficiency and solving new bottlenecks. But which ones, at what time, and for which applications, all are difficult to assess at this point. Change is happening everywhere at once, and that makes predictions significantly more difficult.