Flexibility Improves Memory Interface Bandwidth

With memory at the heart of most SoCs today, careful design is key to achieving the best bandwidth, performance and power. Flexibility plays a big role.


In today’s SoCs, memory is the heart, or at least one of the main elements, of the design. As such, designing memories carefully is paramount to achieving the best bandwidth, performance and power.

Performance is very important for accessing the memory and for transferring and storing information among different IPs, whether through shared memories or local memories. From the power perspective, every access to memory consumes many cycles and lots of power, so designing the interfaces to memories carefully is one of the fundamentals of SoC design, according to Ziyad Hanna, senior vice president of R&D, chief architect, and general manager of Jasper Israel Ltd.

“Typically in CPU design,” he said, “you have different levels of memories, from first-level cache, second-level cache, external memories, etc. Over the years, designers created and developed several techniques of different memory hierarchies, focusing on locality and cache misses, to think about how the memories are being utilized. We know there are different kinds of configurations of memories and, in particular, memory could be storage that has some control, some interface to it and writes from the memory to the peripheries and to the blocks or the IPs.”

Different structures evolved and were designed over the years, using different banks and hierarchies. Depending on the width and depth of the memory, there can be multiple ports for read and write access. That allows information to be written and stored in memory, and to be sent directly to the output ports. There are now also techniques that treat memories dynamically, where the structure of the memory is not static but rather changes based on the application (see http://esl.cise.ufl.edu/Publications/suscom11.pdf).

“Every system has its own focus and drive for efficiency, for speed or for low power,” Hanna said. “For example, for low-power designs like mobile or handheld devices, power is very important. Sometimes you don’t want to access the memory for every transaction because it causes several instructions to access and go outside from one IP to another IP to shared memory. So [engineering teams] try to do some localized techniques to have some local storage to do the local caching to avoid accessing the more extensive interfaces.”
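The localized caching Hanna describes can be sketched in a few lines. This is a toy model, not any vendor's implementation; the LRU policy, the capacity, and all names are assumptions made for illustration:

```python
# Toy model of local caching inside an IP: repeated reads are served from a
# small local store instead of crossing the interconnect to shared memory.
# Capacity and the LRU eviction choice are illustrative assumptions.
from collections import OrderedDict

class LocalCache:
    def __init__(self, shared_mem, capacity=4):
        self.mem = shared_mem
        self.lines = OrderedDict()     # addr -> value, kept in LRU order
        self.capacity = capacity
        self.external_reads = 0        # costly trips out to shared memory

    def read(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)   # hit: served locally, no bus trip
            return self.lines[addr]
        self.external_reads += 1           # miss: go out to shared memory
        value = self.mem[addr]
        self.lines[addr] = value
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False) # evict least recently used line
        return value

mem = {a: a * 10 for a in range(16)}       # stand-in for shared memory
cache = LocalCache(mem)
for a in [0, 1, 0, 1, 2, 0]:
    cache.read(a)
print(cache.external_reads)   # 3 external accesses instead of 6
```

Only the three cold misses leave the IP; the repeated reads are absorbed locally, which is exactly the power and latency win the technique targets.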

In addition, Wendy Elsasser, an architect at Cadence, said that as design speeds have increased, DRAM delays such as longer tRCs (row cycle times), recovery times and precharge delays are having more of an effect on bandwidth. “One of the biggest things from a memory subsystem aspect is the ability to re-order commands and try to group things to the same page as much as possible.”
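The command grouping Elsasser describes can be illustrated with a toy scheduler. The request format and the greedy grouping strategy here are invented for the sketch; a real controller also weighs timing constraints, write/read turnaround and fairness:

```python
# Sketch of page-aware command grouping: requests to the same (bank, row)
# are served back-to-back so each row is activated only once.
from collections import defaultdict

def count_activates(schedule):
    """Count row activations: one each time a bank's open row changes."""
    open_row = {}          # bank -> currently open row
    activates = 0
    for bank, row, _col in schedule:
        if open_row.get(bank) != row:
            activates += 1  # page miss: precharge + activate needed
            open_row[bank] = row
    return activates

def group_by_page(requests):
    """Greedy reorder: cluster requests that hit the same (bank, row)."""
    buckets = defaultdict(list)
    order = []             # preserve first-seen order of pages
    for req in requests:
        key = (req[0], req[1])
        if key not in buckets:
            order.append(key)
        buckets[key].append(req)
    return [req for key in order for req in buckets[key]]

# Five requests ping-ponging between two rows of bank 0:
requests = [(0, 5, 0), (0, 9, 0), (0, 5, 8), (0, 9, 4), (0, 5, 16)]
print(count_activates(requests))                 # -> 5 (every access misses)
print(count_activates(group_by_page(requests)))  # -> 2 (one activate per row)
```

In-order execution pays an activate on every access; grouping the same-page requests cuts that to one activate per row, which is where both the bandwidth and the power recovery come from.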

Grouping can improve power if pages are not continually opened and closed. “It’s kind of a double-edged sword. The longer we keep a page open in memory — and this goes back to whether or not we have an open-page or closed-page bank policy — the more power it’s going to consume. From a power perspective, it’s better to have a closed-page policy. From a performance perspective, if you’re going to hit that same page multiple times, it’s better to have an open-page policy. There is definitely a balance between those two policies where we want to keep it open as long as possible as long as we have traffic going to it, but we want the ability to dynamically close it when we are truly done,” she explained.
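That balance can be modeled roughly. All cycle costs below are made-up numbers; the point is only the shape of the tradeoff between an idle-timeout policy and the two extremes:

```python
# Toy model of open-page vs. closed-page vs. idle-timeout row policies.
# ACT/PRE/HIT cycle costs are invented for illustration.
ACT = 3      # cycles to activate (open) a row
PRE = 2      # cycles to precharge (close) a row
HIT = 1      # cycles for a column access to an already-open row

def run(trace, idle_timeout=None):
    """Replay (cycle, row) accesses; return (busy_cycles, idle_open_cycles).

    idle_timeout=None -> pure open-page policy (never close proactively)
    idle_timeout=0    -> closed-page policy (close after every access)
    idle_timeout=k    -> dynamically close the row after k idle cycles
    """
    open_row, last_use = None, 0
    busy, idle_open = 0, 0
    for cycle, row in trace:
        gap = cycle - last_use
        if open_row is not None and idle_timeout is not None and gap > idle_timeout:
            open_row = None                 # timed out: row was closed early
            idle_open += idle_timeout
        elif open_row is not None:
            idle_open += gap                # row sat open, burning power
        if row == open_row:
            busy += HIT                     # page hit
        elif open_row is None:
            busy += ACT + HIT               # page empty: activate first
            open_row = row
        else:
            busy += PRE + ACT + HIT         # page miss: close, then open
            open_row = row
        last_use = cycle
    return busy, idle_open

trace = [(0, 5), (1, 5), (2, 5), (50, 5)]   # a burst, then one late straggler
print(run(trace, None))   # -> (7, 50)  fast, but row idles open 50 cycles
print(run(trace, 0))      # -> (16, 0)  no idle power, every access re-opens
print(run(trace, 5))      # -> (10, 7)  timeout captures most of both wins
```

The timeout policy keeps the page open through the burst (cheap hits) but closes it during the long gap, which is the "keep it open as long as we have traffic, close it when we're truly done" behavior described above.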

Deciding which way to go is pushed back to the higher-level masters, Elsasser said. “Essentially, it’s twofold. One is the higher-level masters. The other is optimizations we can do on the controller itself.”

Further, the open/closed page policy will have an implication on both performance and power. “A lot of what we’re seeing today is a lot of the enterprise customers trying to achieve higher bandwidths using multicore solutions. If I have four or eight cores out there that are going through one controller to a memory subsystem, there’s a lot of work that we’re doing in order to understand how to best address the memory such that we can handle all the accesses from all the cores and optimally schedule the commands and not continually be opening and closing banks and so on. Part of the overall picture is how am I addressing the memory, and if I’m getting traffic from multiple sources, how is that divided into the full memory space.”
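The addressing question, how traffic from multiple sources is divided across the memory space, comes down in part to where the bank bits sit in the address. A toy decode with invented field widths shows the effect:

```python
# Two hypothetical address maps; all field widths are invented for the sketch.
LINE = 64            # bytes per cache line
BANKS = 8
COLS_PER_ROW = 128   # cache lines per row
ROWS = 1024          # rows per bank

def decode_row_major(addr):
    """Bank bits above the row bits: a streaming core hammers one bank."""
    line = addr // LINE
    col = line % COLS_PER_ROW
    row = (line // COLS_PER_ROW) % ROWS
    bank = (line // (COLS_PER_ROW * ROWS)) % BANKS
    return bank, row, col

def decode_interleaved(addr):
    """Bank bits just above the column bits: consecutive rows rotate banks."""
    line = addr // LINE
    col = line % COLS_PER_ROW
    bank = (line // COLS_PER_ROW) % BANKS
    row = line // (COLS_PER_ROW * BANKS)
    return bank, row, col

# Two cores streaming disjoint 1MB regions through one controller:
core0 = range(0, 1 << 20, LINE)
core1 = range(1 << 20, 2 << 20, LINE)
banks_rm = {decode_row_major(a)[0] for a in list(core0) + list(core1)}
banks_il = {decode_interleaved(a)[0] for a in list(core0) + list(core1)}
print(len(banks_rm), len(banks_il))   # -> 1 8
```

With the row-major map both cores pile into a single bank and serialize; with the interleaved map the same traffic spreads across all eight banks, which is what lets the controller schedule around activate/precharge delays.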

Traffic optimization
Marc Greenberg, director of product marketing for DDR Controller IP at Synopsys, said there are many things that can be done to optimize traffic on the memory bus. But all of that hard work can be undone if the requestors inside the SoC are not following a responsible pattern of accessing the memory.

On-chip bus standards can put restrictions on what the memory controller can do when it’s trying to optimize the traffic in the system. One of the common restrictions that on-chip buses can impose is to require all traffic with the same thread or tag ID to be returned in order.

When designing a memory controller, one way to interpret that requirement is to say, ‘If the on-chip bus requires the responses to those requests to be returned to the interface in order, that means I must execute the transactions on the memory bus in that same order.’ This relies on the system giving well-ordered traffic, because the traffic will be executed in the same order the on-chip bus requested it, Greenberg said.

Interestingly, what Synopsys has done with its read-reorder buffer is to remove that restriction, because with on-chip buses — as much as everybody has good intentions about how they might access the DRAM — there will always be cases, short-term or long-term, in which returning responses in the required order would force inefficient execution on the DRAM. “By removing that restriction with the read-reorder buffer, it gives us the freedom on the memory bus to execute the transactions in the order that’s most efficient for the bandwidth of the memory and quality of service requirements that have been given to the system. We can execute those transactions in the best order for the memory and then return the results back to the on-chip buses in an order that is required by the on-chip bus standard,” Greenberg explained.

“If you do this sort of simple implementation of a controller without the read-reorder buffer, you either are relying on multi-port traffic to be able to do your reordering, or relying on the requestors in the system issuing a large number of tags to you that would then also allow you to do reordering. But we can reorder traffic for more optimal memory bus bandwidth even from a single port and even from a single requestor requesting it at a time. That’s what the read-reorder buffer does for us,” he added.
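A read-reorder buffer of the kind Greenberg describes can be sketched as follows. This is a simplified illustration, not Synopsys’s implementation; the class and method names are invented:

```python
# Minimal read-reorder buffer sketch: the DRAM side may complete reads in
# whatever order is most efficient, but responses for a given tag ID are
# released to the on-chip bus only in the order they were requested.
from collections import deque

class ReadReorderBuffer:
    def __init__(self):
        self.pending = {}      # tag -> deque of request IDs awaiting return
        self.done = {}         # request ID -> data, completed out of order

    def request(self, tag, req_id):
        """On-chip bus issues a read with an ordering tag."""
        self.pending.setdefault(tag, deque()).append(req_id)

    def complete(self, req_id, data):
        """DRAM finished this read, in any order the scheduler chose."""
        self.done[req_id] = data

    def drain(self, tag):
        """Release responses for `tag` in request order, as the bus requires."""
        out = []
        q = self.pending.get(tag, deque())
        while q and q[0] in self.done:
            out.append(self.done.pop(q.popleft()))
        return out

rrb = ReadReorderBuffer()
rrb.request('A', 1)
rrb.request('A', 2)
rrb.complete(2, 'd2')            # DRAM happened to finish request 2 first
print(rrb.drain('A'))            # -> []  (request 1 must return first)
rrb.complete(1, 'd1')
print(rrb.drain('A'))            # -> ['d1', 'd2']  (in order, to the bus)
```

The buffer is what decouples the two orderings: the scheduler on the DRAM side is free, while the bus side still sees the order its protocol demands.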

Elsasser said Cadence’s approach relies heavily on the fact that its memory controllers are highly configurable and programmable. “We have lots of different customers and we don’t know in the end — and a lot of times they don’t know from the beginning of the project phase until it’s actually in silicon — what things will look like a year from now or what their software or systems will look like. So we try to approach things in a very programmable way, where you can programmatically tell us how your logical addresses map to a physical address region, and give them enough flexibility so they can achieve good performance and good distribution across different cores and different masters. A lot of it comes down to programmability and flexibility in the address mapping.”
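The programmable address mapping Elsasser mentions can be pictured as a configuration table that assigns each address bit to a physical field, rather than hardwiring the decode. This sketch is illustrative only; the field names and widths are assumptions:

```python
# Sketch of a programmable logical-to-physical address map: the decode is
# driven by a per-bit configuration table instead of fixed logic.
def make_decoder(bit_map):
    """bit_map[i] names the physical field that address bit i feeds."""
    def decode(addr):
        fields = {}
        pos = {}                       # next bit position within each field
        for i, name in enumerate(bit_map):
            bit = (addr >> i) & 1
            fields[name] = fields.get(name, 0) | (bit << pos.get(name, 0))
            pos[name] = pos.get(name, 0) + 1
        return fields
    return decode

# One possible configuration: low 6 bits -> column, next 2 -> bank, rest -> row.
bit_map = ["col"] * 6 + ["bank"] * 2 + ["row"] * 8
decode = make_decoder(bit_map)
# addr with col=3, bank=2, row=5: 3 + (2 << 6) + (5 << 8) = 1411
print(decode(1411))   # -> {'col': 3, 'bank': 2, 'row': 5}
```

Because the table is data, not logic, the same controller can be reprogrammed late in a project, once real traffic patterns are known, to interleave banks differently without a silicon change.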

With so many techniques available, Hanna said, “The whole thing about memory is about how to verify the correctness of these things. From the efficiency perspective, you want to verify memory interfaces regardless of whether they are dynamic or static, and whether they use standard protocols like DRAM controllers or custom protocols. There is a challenge here to make sure there is some coherency happening, especially when it is a multicore design accessing the same memory.”