Rethinking Memory

New design options create new ways to use memory to improve power and performance.

Getting data in and out of memory is as important as the speed and efficiency of a processor, but for years design teams managed to skirt the issue because it was quicker, easier and less expensive to boost processor clock frequencies with a brute-force approach. That worked well enough prior to 90nm, and adding more cores at lower clock speeds filled the gap starting at 65nm.

After that, the solution of choice was to pack more SRAM around processors. In fact, some SoCs are now up to 80% memory, which is not the most efficient way to design chips. For one thing, it puts the onus on operating system, middleware and embedded software teams to integrate the flow of data and make it all work. Even though that approach has been well tested and market-proven, it too is beginning to run out of steam.

That puts chipmakers back in front of the original challenge of getting data in and out of memory more efficiently, but with some new hurdles and options:

• While plans are in place to shrink transistors down as small as 5nm, and maybe even 3nm, the big problem areas are the interconnects, the wires, and the thinning gate oxides. Resistance/capacitance delay and the resulting thermal effects, electromigration, electromagnetic interference and electrostatic discharge are increasingly time-consuming and expensive to deal with.
• The number of cores continues to increase, but most of them remain dark most of the time because software still cannot be effectively parallelized—a problem that has dogged computer science for more than 50 years. As a result, more processing is being distributed around the chip because function-specific processors such as DSPs with hardware accelerators use less energy than general-purpose processors, and memories are being scattered around the chips with them.
• The commercialization of fan-outs and 2.5D approaches is allowing chipmakers to rethink how and where to add memory on the Z axis, and new memory types are providing new options to balance cost, performance and data reliability.

All of these point to some rather pronounced changes in how engineering teams are approaching memory issues and how to solve them.

“There was a classic hierarchy of memory of SRAM to DRAM to flash to hard disk that worked well for a long time,” said Rob Aitken, an ARM fellow. “Computer architectures evolved alongside of this, and the behavior in software evolved with it. Now there are new forms of memory and new ways of accessing that memory. So with magnetic memory, if you use it only as memory, the performance is okay. But you also can do interesting things with it, such as save on leakage power of SRAM and active power in DRAM. And with more complicated systems, there are all sorts of options.”

This is more than just changing out one type of memory and replacing it with another. It’s a system-level change that affects everything from architecture to layout to a host of physical and proximity effects, adding new interfaces and altering the speed at which data is accessed and the amount of power needed to make that happen.

“The Hybrid Memory Cube is a stack of DRAM, but it acts differently than conventional DRAM,” Aitken said. “That changes how cores interact with memory. We’ve seen in our own studies that single-threaded performance is not the determining factor in system performance. Getting data in and out of memory is critical.”

Variability at advanced nodes
This is compounded by a number of factors that are escalating from second-order effects to first-order effects. Among them is process variation, which is generally defined as deviation from the nominal specs in the fabrication process. While it has been dealt with relatively painlessly by EDA tools for years, the problem is getting worse at each new node to the point where it is now affecting memories.

Sparsh Mittal, postdoctoral research associate at the Oak Ridge National Laboratory, said one of the most common approaches to deal with process variation in memory has been redundancy. But he noted that the impact is getting worse in all types of memory—SRAM, DRAM, embedded DRAM and non-volatile memory—particularly as the voltage is turned down to save battery life in mobile devices, or lower energy bills in large datacenters.

“With NVM, for example, you have an expectation of how it will work and you assume standard variation,” said Mittal. “But if you decrease the voltage, process variation is even more pronounced.”

Mittal mapped process variation effects throughout a design, from multi-core to stacked die, and from low-power to high-performance designs. The effects are significant across the design spectrum. In DRAM, for example, he said there was variation in terms of how much time a block would retain data before losing a charge, as well as the impact on one memory block versus another. “One may have a latency of 10 nanoseconds, while another may have a latency of 8 nanoseconds.” (Sections 5.2 through 5.4 of Mittal’s research paper deal with process variation’s effects on memory.)

Performance and power
One of the key metrics for memory performance is throughput. This is hardly a straightforward measurement, though. Congestion in layout, contention for memory from multiple cores, and varying read/write speeds all can play havoc with throughput. So can the number of calls back and forth between memory and the processor (or, with multiple levels of cache, the memories), which also determines how much power is used by these components.

That complexity is compounded by a growing number of choices involving different memory and processor configurations, prioritization and scheduling for how those memories get used, whether they are on the same die, off-die, or off-die in the same package, and what the maximum voltage will be. Power can affect throughput, as well.

“You have a major set of possibilities that affect the architecture and the integration of various IP blocks,” said Steven Woo, vice president of solutions marketing at Rambus. “You could make processors with lots and lots of cache, or you can use smaller die, more processors, and higher-bandwidth memory. For the low power community, you want to look at where power is being wasted. Moving data long distances is wasteful, which is why in phones you see the memories located very close to the processor.”

The main location where power is dissipated is in the memory core, where bits are written and retrieved. Power is spent transmitting data to and from those cores, and in the past a common way of boosting throughput was simply to turn up the power. That doesn’t work anymore because it will cook the chip, which is why there has been so much focus on pipelining in planar designs, and interposers and through-silicon vias for systems-in-package. But all of this has ripple effects that extend well beyond the memory.

“With something like Wide I/O, instead of a few wires at high speed you use a lot of wires at lower speed,” said Woo. “You get better performance characteristics that way. What we’re witnessing is that the physical design envelope is changing. It’s no longer design in isolation. Packaging is changing, and that’s changing other things. TSVs are a change to the value chain and the way DRAMs are sold. When they’re assembled, there are a host of other issues, like where it can go bad and whether it went bad in assembly. Packaging changes the relationships between the players, too. There are established methods for determining where things went wrong, but those signals might now be inaccessible. This is like a big equation. When the benefits outweigh the cost of assembly, test and manufacturing, then people adopt it.”

Making tradeoffs
Adoption depends on a variety of factors, but there are only so many companies to drive these changes—fewer with consolidation—while there is a long list of options. For the most part, memory designs can be molded to almost any design constraints imaginable these days. While 90% of design teams use standard foundry bit cells, in highly power-sensitive applications they can develop their own, using back biasing or dual rails to limit power consumption.

“The big question to start with is how much memory will be on chip and how much will be off-chip,” said Lou Ternullo, product marketing director for the IP group at Cadence. “The DRAM guys have always focused on what you need when, what gets turned on when. That’s getting carried over to the SoC now with frequency scaling. So with a video application you may have background computation, and with dynamic voltage frequency scaling you can still access it at a lower rate. Plus, you may not need some data at all, and some you might not need all the time. You can build intelligence into the controllers where if you’re not going to see data for 100 cycles, you can power down some of the memory.”
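The controller policy Ternullo describes can be sketched in a few lines. This is only an illustration, not any vendor's controller logic: it assumes the controller knows the upcoming access pattern for a memory bank, powers the bank down across any idle gap of at least 100 cycles, and pays a fixed wake-up cost each time. The function name, the `wake_cycles` parameter and the access trace are all hypothetical.

```python
def powered_down_cycles(access_cycles, idle_threshold=100, wake_cycles=0):
    """Count the cycles a bank could spend powered down.

    access_cycles: sorted cycle numbers at which the bank is accessed
                   (assumed to be known ahead of time by the controller).
    idle_threshold: minimum idle gap, in cycles, worth powering down for.
    wake_cycles: hypothetical cost, in cycles, of bringing the bank back up.
    """
    saved = 0
    for prev, nxt in zip(access_cycles, access_cycles[1:]):
        gap = nxt - prev - 1  # idle cycles strictly between two accesses
        if gap >= idle_threshold:
            # Only the part of the gap not spent waking up is actually saved.
            saved += max(0, gap - wake_cycles)
    return saved


# Hypothetical access trace: two short gaps and two long ones.
trace = [0, 10, 250, 260, 600]
print(powered_down_cycles(trace))                  # no wake-up penalty
print(powered_down_cycles(trace, wake_cycles=10))  # 10-cycle wake-up penalty
```

The short gaps never reach the threshold, so only the two long idle stretches count. Raising the wake-up cost shrinks the saving, which is the same tradeoff against throughput that the next paragraphs raise.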

But what does this do to throughput? And is there enough bandwidth for the data to flow freely in and out of memory? Those kinds of questions are being asked much more frequently by design teams these days. In the past, the memory subsystem was an almost fixed element. Increasingly it is being seen as another knob to turn, which is why there is so much attention being placed on memory controllers.

“If you make a memory system faster, that doesn’t mean you necessarily can use it,” said Ternullo. “The question is how efficiently the controller can access data from memory. That’s one more reason why systems companies are moving to high-end modeling. RTL can only run so much traffic analysis. With transaction-level modeling, you can start running code in a system and understand how to access that in DRAM. As you increase frequency (CV²F), dynamic power increases. So now you have to question whether you need to run at that performance all the time. In the DDR world, there are two components to performance—efficiency and data rate. Are you better off with 100% efficiency at 1GHz or 50% efficiency at 2GHz?”
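Ternullo's closing question can be worked through with simple arithmetic. The sketch below assumes that useful throughput is just efficiency multiplied by data rate, and that dynamic power follows the CV²F relation he cites with capacitance and voltage held constant. Both are simplifications: in practice a higher frequency often requires a higher voltage, which would widen the gap further. The function names and unit values are illustrative only.

```python
def effective_rate_ghz(efficiency, data_rate_ghz):
    """Useful transfer rate: fraction of cycles moving data times the raw rate."""
    return efficiency * data_rate_ghz


def dynamic_power(capacitance, voltage, frequency):
    """Dynamic power from the C * V^2 * F relation, in arbitrary units."""
    return capacitance * voltage ** 2 * frequency


# The two options from the quote.
slow = effective_rate_ghz(1.0, 1.0)  # 100% efficiency at 1GHz
fast = effective_rate_ghz(0.5, 2.0)  # 50% efficiency at 2GHz

# Assumed: identical capacitance and voltage (1.0 each) in both cases.
p_slow = dynamic_power(1.0, 1.0, 1.0)
p_fast = dynamic_power(1.0, 1.0, 2.0)

print(f"effective rate: {slow} vs {fast}")
print(f"relative dynamic power: {p_slow} vs {p_fast}")
```

Under these assumptions both options deliver the same useful throughput, but the 2GHz option burns twice the dynamic power, which is why controller efficiency matters as much as raw data rate.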

Prasad Saggurti, product marketing manager for embedded memory IP at Synopsys, agrees: “We’re seeing external memory management units being used even for on-chip memory. The architects know the access patterns of the memory and they know how to do the write and read, which allows them to build clever controllers where they can space out or localize the read and write. This isn’t the whole memory operating. The rest of it is in a low-leakage state. We’ve built memories like that in the past. They tend to be for networking companies or discrete, where the reads and writes fall into a certain pattern. Cache controllers tend to do a job like that, too.”

Saggurti noted that one other trend is to re-use circuitry for memory at low voltage. “So the bit cell part and the periphery part are different. The periphery can run at a lower voltage than the bit cell. We’re seeing more and more customers do that. They are not sacrificing the bit cell, but they are running at a lower voltage without running voltage rails across the chip.”

The memories themselves can be customized, as well.

“We see more and more customized memories for almost every chip,” said Mike Gianfagna, vice president of marketing at eSilicon. “There are thousands of memory blocks in a chip. Typically, you find about 10% are on critical paths, which contributes to bottlenecks. If you want to reduce power, you have to take out some memory and customize some of what you have. But you can’t just buy that off the shelf. We’re seeing a lot more of that in 2.5D where people are looking at what memory blocks are needed, what applications they will be used for, and what is the intended use. You can optimize that by playing around with the bit cell or decreasing functionality. Maybe you change the bit cell size. If you think about it, logic is optimized like crazy to improve performance. Most people don’t think about optimizing memory, but if you can take out 10% or 15% of the power, that’s a very big deal.”

Conclusions
All of this just scratches the surface of what can be done with memory architectures. There are a variety of signaling approaches, including some that are gaining a second look such as optical communication between die in a package. There are different options for how bits are laid out within the memories, as well.

In addition, there is research underway into changing the resistance/capacitance of the wires and the inductance of the pins in a device, more work on buses, and III-V materials are being developed to improve the flow of electrons through interconnects. And on top of that, there is work underway into how data itself is structured, which is particularly important for Big Data processing.

“We’ve seen a move toward near-data processing, where the data sets are so large that it’s cheaper to move the processor closer to the data than the data closer to the processor,” said Rambus’ Woo. “There’s also a movement to minimize the data through semantic awareness, where you understand the structure of the data and you walk down a list of what’s done right in memory and an FPGA instead of having to return to a Xeon processor.”

What’s clear, though, is that memory is no longer just a checklist item in any advanced design. It’s now an integral part of the design, and it can be tweaked, bent and twisted in ways that were largely ignored in the past to improve performance, reduce power, and create differentiation.



  • Sandeep Patil

    Hi Ed,
    Thanks for the post..
    I am wondering how exactly light sleep and deep sleep architectures vary w.r.t. performance and other issues for any memory?

    • Ed Sperling

      Hi Sandeep, good question with no simple answer. It depends on the speed of the conduit to memory, the type of memory itself, the signal path from input to logic for wakeup, the prioritization for signals within memory, how much is controlled by software versus hardware, and the overall application. It also depends on whether security is part of this scheme, and whether that security is active or passive.

      • Sandeep Patil

        Thanks for the reply Ed. Being a P.D. engineer from a design service house, I have come across scenarios where memory selection would be done based on clk-q delay versus power consumed for the operation. The other points you have mentioned might be considered at higher levels of execution.
        Merry Xmas to you!

        Thanks,
        Sandeep