Memory is emerging as the starting point for SoCs, adding more confusion to already complex designs.
Memory is becoming one of the starting points for SoC architectures, evolving from a basic checklist item that was almost always in the shadow of improving processor performance or lowering the overall power budget. In conjunction with that shift, chipmakers must now grapple with many more front-end decisions about placement, memory type and access prioritization.
There are plenty of rules and standards, but like everything else in the semiconductor world, complications surrounding memories have grown and compounded to the point where they have created their own special brand of complexity and confusion. For example, consider access to memory. That access can be prioritized through software, or it can be done physically by moving a processor closer to the memory to reduce the latency or by changing the wire thickness. Or the memories can be located off-chip, where latency is higher but the cost is lower. Off-chip memories can be bigger, as well, and with new packaging options there are even more tradeoffs to consider.
That’s just the starting point. Chipmakers also have to decide how fast those memories run, at what voltage, and what the likely use models will be. There are software decisions, as well, because some software is more processor-intensive, while other software is more memory-intensive. All of that has implications for system performance, what it will cost, how much heat it will generate, how long it will take to debug and verify, and ultimately whether it will be competitive in the market.
“The physical problems are relatively more important than for any other component,” said Chris Rowen, CTO of Cadence’s IP Group. “It’s big, so it has an effect on cycle time and latency. You need to think about how much is needed and where. But memory is the one place where you don’t know how much you need, so having more memory and more code resident is something that almost everyone does.”
What’s particularly daunting with memories is that for every possible decision there is an opposite possibility, and each has to be weighed against many others.
“You always want to make memory as small as possible, but at the same time you also want it as big as possible,” said Rowen. “So for some kinds of problems, cache hierarchy is critical. For other problems, you can’t use the cache hierarchies, so you need to be thoughtful about sizing. There also is a constant push toward centralization and decentralization. The more centralized it is, the more flexible the memory resources. One of the big areas for decentralization is with more parallelism, and one of the ways you achieve that is by separating the computation engines.”
One opportunity, at least on paper, is to more quickly adapt SoC designs for specific market applications without significant changes to the design itself. This is not a trivial problem, however, even if it does make the design more flexible.
“There are a couple different ways of using embedded memory,” said Prasad Saggurti, product marketing manager for embedded memory IP at Synopsys. “One is to lower the voltage. You can use the foundry bit cells, but to go to low voltage you need to add other circuitry for read and write. The second way is not to use the foundry bit cells. You can use a logic bit cell, which is larger. That works if the value of low power is greater than the concern about area. So for the networking guys, it’s usually okay to have extra area. For the smart phone guys, they cannot use a logic bit cell because the area is more important.”
Kilopass likewise has been developing its own memory bit cells. Charlie Cheng, the company’s CEO, said that one of the big problems is that DRAM, SRAM and embedded flash have been developing in a straight line for the past quarter century. “Power dissipation is exponentially proportional to temperature. The capacitor leaks, but the transistor leaks more. When you shift the refresh from 50 milliseconds to 20 milliseconds, it can’t refresh. This is a road map problem and an architectural problem.”
The IoT has added another wrinkle here, as well. By dropping the voltage and frequency, the memory can be operated near or even below the threshold voltage of its transistors. This is particularly important for smart devices that don’t need to wake up quickly and where batteries are difficult to replace, such as on a streetlight or a bridge. Near-threshold and sub-threshold design has a serious impact on performance, but the power savings are significant.
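To see why that tradeoff can be worth it, recall that dynamic switching power scales roughly as CV²f, while the energy per task scales with V² once the longer run time is accounted for. The sketch below is a back-of-envelope model using purely assumed numbers (effective capacitance, operating points, task length); it is not a characterization of any real memory or process, and it ignores leakage, which can dominate in real sub-threshold designs.

```python
# Back-of-envelope model of voltage/frequency scaling for an always-on device.
# All numbers below are illustrative assumptions, not measured silicon data.
# Leakage power is deliberately omitted for simplicity.

def dynamic_power(c_eff, vdd, freq, activity=0.1):
    """Dynamic switching power: P = alpha * C * V^2 * f."""
    return activity * c_eff * vdd**2 * freq

C_EFF = 1e-9          # effective switched capacitance (F), assumed
TASK_CYCLES = 1e6     # work per wake-up event in clock cycles, assumed

# Nominal operation vs. a near-threshold operating point (assumed values).
for label, vdd, freq in [("nominal", 0.9, 500e6), ("near-threshold", 0.5, 50e6)]:
    p_dyn = dynamic_power(C_EFF, vdd, freq)
    t_task = TASK_CYCLES / freq                 # the task takes longer at low frequency
    e_task = p_dyn * t_task                     # energy per task (J)
    print(f"{label:>15}: {p_dyn*1e3:6.2f} mW, {t_task*1e3:6.2f} ms/task, "
          f"{e_task*1e6:6.2f} uJ/task")
```

In this toy model the near-threshold point burns far less power and roughly a third of the energy per task, but each task takes ten times longer, which is why the approach suits devices that can afford to wake up slowly.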
A more balanced approach is to use memories in a dual-rail mode, so the periphery is run at a lower voltage than the bit cell. “We’re seeing more and more customers doing that,” said Synopsys’ Saggurti. “They’re not sacrificing the bit cell performance, but they’re running a lower voltage and they’re not running voltage rails across the chip. If some memories don’t have to run as fast, they can be kept in a lower voltage. The first step was to break up the periphery into lower voltages. Now those are being broken up even faster.”
Size matters
What’s confusing about memory discussions is that they tend to contradict themselves, and they sometimes conflict with basic directions for IC design. In memory, more bits generally are considered better. In SoC design, in contrast, the focus has been on reducing margin wherever possible to improve performance and reduce power.
Ask any memory expert about using the right amount of memory for any job in advanced designs, and they almost always agree that the minimum is better. Ask them about a particular SoC design, and they usually will opt for the maximum allowed. And then there is a group that fits somewhere in between, pushing for rightsizing and more efficient usage wherever appropriate.
“We’ve asked our customers what they’re looking for most, and the number one request is lower power,” said Kurt Shuler, vice president of marketing at Arteris. “The second is better performance. So you can add cache coherency into a design to improve both, but the big problem there is that all the cache coherent subsystems are vendor-specific, and each vendor does its own thing.”
Still, cache coherency is a critical feature for multi-core designs. It allows memories to be shared across cores using the same processor instruction set.
“At least it’s logically unified, so it can live in the same address space,” said Cadence’s Rowen. “It’s not necessarily a way to improve performance, though. We’ve seen L3 cache shared across a processor cluster, each with its own L2 cache, to improve performance.”
That cache can be shared equally or unequally through a process called non-uniform memory access (NUMA), Rowen said. Sometimes that is as simple as changing levels, which can improve performance by up to 10 times.
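A rough way to quantify what an extra cache level buys is average memory access time (AMAT). The Python sketch below uses hypothetical hit rates and latencies, not figures from any vendor, to compare a private L1/L2 hierarchy against the same hierarchy with a shared L3 inserted in front of DRAM.

```python
# Average memory access time (AMAT) for a cache hierarchy.
# AMAT = T_L1 + m_L1*(T_L2 + m_L2*(T_L3 + m_L3*T_mem))
# All hit rates and cycle counts below are illustrative assumptions.

def amat(levels, memory_latency):
    """levels: list of (hit_rate, hit_latency_cycles), ordered L1 -> Ln."""
    total, reach_prob = 0.0, 1.0
    for hit_rate, latency in levels:
        total += reach_prob * latency        # every access reaching this level pays its latency
        reach_prob *= (1.0 - hit_rate)       # fraction that misses and falls through
    return total + reach_prob * memory_latency

DRAM_LATENCY = 300  # cycles, assumed

two_level   = [(0.90, 4), (0.60, 12)]               # private L1 and L2
three_level = [(0.90, 4), (0.60, 12), (0.80, 40)]   # plus a shared L3

print(f"L1+L2 only    : {amat(two_level, DRAM_LATENCY):5.1f} cycles")
print(f"with shared L3: {amat(three_level, DRAM_LATENCY):5.1f} cycles")
```

With these assumed numbers the shared L3 roughly halves the average access time. Whether real silicon sees anything like that depends on how much the cores’ working sets overlap and how far each core sits from the shared cache, which is exactly where the NUMA effects Rowen describes come into play.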
New memory choices
One of the new options in memory architectures is another contradiction of sorts. For years, it was a given that on-chip memory was faster than off-chip memory. Bandwidth to get off chip, and even through the chip, has always been problematic. That has changed with the commercialization of high-bandwidth memory this year.
“The choice in some cases is embedded memory versus HBM, or basically what’s inside versus what’s outside,” said Bill Isaacson, director of ASIC product marketing at eSilicon. “Many of the chips we’re doing are dedicated function. We’re not putting together chips around the processor. We’re having discussions with customers about what the main bandwidth hogs are in the system and about customized memory. If you can improve memory bandwidth, that can drive the memory size. If you can decrease memory area by 20%, that’s a big deal.”
Isaacson said that in most cases, the starting point for those discussions is not memory optimization. “It’s typically about area and power. But the way to get there, particularly with multi-core architectures, is optimization. Customers haven’t been doing custom memory design so they’re not used to this discussion.”
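One hedged way to read the bandwidth-versus-size comment is through buffering for a streaming block: if compute on one tile of data is overlapped with the fetch of the next, the minimum on-chip tile buffer depends on memory latency and bandwidth. The sketch below uses entirely assumed parameters (latency, arithmetic intensity, compute rate, DDR- and HBM-class bandwidths) and illustrates the general tradeoff only; it is not a description of eSilicon’s methodology.

```python
# Minimum tile size for double-buffered streaming: compute on tile i overlaps
# the fetch of tile i+1. To hide the fetch completely we need
#     latency + tile_bytes/bandwidth <= tile_bytes * ops_per_byte / compute_rate
# All parameters below are assumed, illustrative values.

def min_tile_bytes(latency_s, bandwidth_bps, ops_per_byte, compute_ops_s):
    compute_per_byte = ops_per_byte / compute_ops_s   # seconds of compute per fetched byte
    fetch_per_byte   = 1.0 / bandwidth_bps            # seconds of transfer per byte
    slack = compute_per_byte - fetch_per_byte
    if slack <= 0:
        return float("inf")                           # bandwidth-bound: overlap can't hide the fetch
    return latency_s / slack

LATENCY  = 100e-9    # round-trip memory latency in seconds, assumed
OPS_BYTE = 50        # arithmetic intensity (ops per byte fetched), assumed
COMPUTE  = 1e12      # compute throughput in ops/s, assumed

for label, bw in [("DDR-class", 25e9), ("HBM-class", 250e9)]:
    tile = min_tile_bytes(LATENCY, bw, OPS_BYTE, COMPUTE)
    print(f"{label}: min tile {tile/1024:.1f} KiB, double buffer {2*tile/1024:.1f} KiB")
```

In this toy model the tenfold jump in bandwidth lets a single stream’s double buffer shrink from roughly 20 KiB to about 4 KiB; multiplied across the many buffers in a multi-core design, that is the kind of lever that can turn into the double-digit area savings Isaacson describes.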
HBM and the Hybrid Memory Cube both require advanced packaging, whether that is 2.5D, 3D, or monolithic 3D.
“If we accept the argument that DDR4 is the last general DRAM, and LPDDR5 is the last low-power version, then you have to look at what comes next,” said Drew Wingard, CTO of Sonics. “Wide I/O-2, HBM and HMC are all multichannel. That’s a clear trend. Another trend has to do with the total complexity of the design. To get around that you want to make subsystems as self-sufficient as possible, with private memory that you can exploit locally.”
There are other possible choices, as well, such as different classes of memories in a multi-die package, and more distributed memory with supervisory routing running in the background, Wingard said.
New packaging options certainly can reduce the parasitics in memory access using interposers or through-silicon vias, but whether those approaches are faster than on-die memory depends upon location, interconnect speed and a variety of other factors ranging from materials to whether the memory is dedicated, what voltage and frequency it is using, and how the data path was architected.
As with all things related to memory, there are no simple answers, but there are plenty of questions and contradictions. And those questions and contradictions will only grow in number as complexity continues to increase everywhere else.