Architecturally Optimizing Memory Bandwidth

Optimizing at the architectural level is the best bet because of its impact on system performance and power.

Optimizing an SoC’s memory bandwidth is a crucial part of the design process today, given how heavily it affects overall system performance.

There are many ways to approach this issue, and all of them can have a direct bearing on the competitiveness of a chip in terms of both power and performance. So where should you start?

“Number one, choosing the right algorithm makes a big difference in memory bandwidth,” said Cadence Fellow Chris Rowen. “You often have many choices about how you write the software. Some of the ways you write things are going to dramatically change how much bandwidth you need. To take a simple example, suppose you’re analyzing images and you need to read each of the images multiple times and the size of the image is much bigger than your local memory in your processor. You could break the image down into sub-regions and do all of the different reads that you want to do on each sub-image, which might mean that part of the image you’re working on actually does fit into the local memory.”

In this particular case the only bandwidth you generate is the first time you read that section. In fact, you end up reading the original full size image just once. “Or you could choose an algorithm that attempts to work on the whole image altogether and you have to read each piece of it multiple times, and you have to go all the way to main memory and fetch that particular piece many, many times. You could potentially reduce the amount of memory traffic by an order of magnitude if you choose to partition or organize the computation right. So, choice of algorithm is important.”
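To make Rowen's image example concrete, here is a minimal sketch that counts main-memory fetches for the two access orders. It assumes a small local memory with least-recently-used replacement, and the block count, pass count, and capacity are hypothetical numbers chosen only for illustration. It also ignores any overlap between neighboring sub-regions, which a real image filter would have to handle.

```python
from collections import OrderedDict

def main_memory_reads(access_order, local_capacity):
    """Count fetches from main memory for a sequence of block ids,
    given a local memory that holds `local_capacity` blocks (LRU)."""
    local = OrderedDict()
    fetches = 0
    for block in access_order:
        if block in local:
            local.move_to_end(block)       # hit: served from local memory
        else:
            fetches += 1                   # miss: go out to main memory
            local[block] = True
            if len(local) > local_capacity:
                local.popitem(last=False)  # evict the least recently used block
    return fetches

# Hypothetical sizes: the image is 64 blocks, local memory holds 4,
# and the algorithm needs to read every block 10 times.
NUM_BLOCKS = 64
PASSES = 10
LOCAL_CAPACITY = 4

# Option A: every pass streams over the whole image.
whole_image_order = [b for _ in range(PASSES) for b in range(NUM_BLOCKS)]

# Option B: finish all passes on one sub-region before moving to the next.
tiled_order = [
    b
    for start in range(0, NUM_BLOCKS, LOCAL_CAPACITY)
    for _ in range(PASSES)
    for b in range(start, min(start + LOCAL_CAPACITY, NUM_BLOCKS))
]

print("whole-image fetches:", main_memory_reads(whole_image_order, LOCAL_CAPACITY))
print("tiled fetches:      ", main_memory_reads(tiled_order, LOCAL_CAPACITY))
```

With these assumed numbers the whole-image order fetches 640 blocks while the tiled order fetches 64, which is the order-of-magnitude gap Rowen describes. The ratio equals the number of passes, so the benefit grows with how often the algorithm revisits the data.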

The second thing engineering teams do is organize the memory hierarchy, focusing in particular on how different memory references are spread across different banks of memory and what size of transfer is used. That allows them to take better advantage of the characteristics of the memory devices.

“This is because DRAMs, of course DDR, likes to fetch things within a current open page,” said Rowen. “It’s very inexpensive if a page is already open to fetch more bits from that same page, and it’s relatively expensive to switch between pages. As such, you want to optimize it to do as much of your fetching in a single page as possible. That may mean you want to fetch longer bursts from a given page. It may be that you use the pages as a kind of cache so that you access within a page very easily and you switch pages only reluctantly.”

That means organizing the transfers. Some of this is done automatically in the DDR controller itself, which can see a stream of requests and prioritize them. The result is less thrashing of the DDR controller, less total time to satisfy a string of requests, and less energy used because making requests within a given page is a lower energy operation than switching pages, he explained.
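The effect of keeping requests within an open page can be sketched with a toy model of a single DRAM bank. The page size and the relative costs of a page hit versus a page switch below are assumptions, not figures from any datasheet, and the reordering function is a deliberately simple stand-in for what a real controller's scheduler does within a bounded window of pending requests.

```python
ROW_BYTES = 2048   # assumed page (row) size
HIT_COST = 1       # assumed relative cost of an access to the open page
MISS_COST = 3      # assumed relative cost of an access that must open a new page

def page_switches_and_cost(addresses):
    """Return (page activations, total cost) for requests served in order.
    The very first access counts as an activation because no page is open yet."""
    open_row = None
    switches = 0
    cost = 0
    for addr in addresses:
        row = addr // ROW_BYTES
        if row == open_row:
            cost += HIT_COST
        else:
            switches += 1
            cost += MISS_COST
            open_row = row
    return switches, cost

def group_by_row(addresses):
    """Reorder requests so accesses to the same page are adjacent.
    Pages keep the order of their first request; order within a page is preserved."""
    first_seen = {}
    for index, addr in enumerate(addresses):
        first_seen.setdefault(addr // ROW_BYTES, index)
    return sorted(addresses, key=lambda a: first_seen[a // ROW_BYTES])

# Two interleaved streams that keep bouncing between two pages.
arrival_order = [0, 4096, 64, 4160, 128, 4224, 192, 4288]

print("as issued:", page_switches_and_cost(arrival_order))
print("grouped:  ", page_switches_and_cost(group_by_row(arrival_order)))
```

In this example the requests as issued open a page 8 times for a total cost of 24, while the grouped order opens a page twice for a cost of 12. The model leaves out multiple banks, refresh, and fairness between requesters, all of which a production controller has to balance against page locality.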

Along the same lines, Navraj Nandra, senior director of marketing for Synopsys’ DesignWare Analog and Mixed-Signal IP, noted that optimizing memory bandwidth from an architectural perspective should enable users to tune system performance according to the types of traffic typical of the system. “Configurable parameters should include address and data widths, depth of look-ahead in the scheduler, priority bypass, and ECC.”

He noted that there are commercially available tools that allow designers to run “what if” scenarios to optimize their system performance based on the actual traffic expected in the SoC. And CAM-based scheduling algorithms used in certain IPs are aimed at optimizing the overall memory bus utilization.

Communication is key
Another good architectural starting point for avoiding memory bandwidth issues is to understand the communication between the cores and subsystems on the chip: not just how much of it there is, but also the size of the data structures they will be exchanging, said Drew Wingard, CTO of Sonics.

“An easy example to talk about where things don’t fit on chip is when we’re doing video processing because the frame buffer takes up a lot of space, and you often need many copies of the frame buffer so you have no choice but to keep those frame buffers in external memory,” he explained. “But then other algorithms can be working on smaller data sets, and those data sets can be kept on chip and using compiled memories on the chip can be a much more effective way. You’ll sacrifice some die area but you can increase your performance, reduce your power and a whole bunch of good things if you can build some on-chip memory, which may well be shared in technologies such as SRAM or 1P SRAM.”

These kinds of tradeoffs all affect access to memory, and optimizing them can improve throughput and lower power.

“This is what we’ve been talking about with the memory wall,” said Hans Bouwmeester, vice president of engineering operations at Open-Silicon. “The number of I/Os that you need to get sufficient bandwidth is going up and up. This is why we’re seeing more interest in the Hybrid Memory Cube, which has super high bandwidth over a serial interface, and our Interlaken IP, which provides bandwidth from chip to chip. That leverages SerDes rather than a parallel PHY.”

He noted that 2.5D IC will be a straightforward way of achieving high throughput. “You integrate memory on an interposer and you don’t need big I/Os driving capacitance. Instead you use TSVs, which provide hundreds or thousands of interconnects. This is necessary if you take something like ASIL (Automotive Safety Integrity Level), which can be limited because it doesn’t optimize area. You have to do something different to get the data in and out.”

Wingard suggests the starting point is an analysis of the communication patterns, the bandwidths associated with them, and then the sizes of the data sets that are being described. “[From this,] you can start to build up interesting memory hierarchies. In general-purpose computing we build deeper and deeper memory hierarchies largely to try to hide the fact that the latencies out to the external memory are so long. In an SoC context, a lot of times it’s not about latency, it’s about bandwidth. We want to try to store objects into the appropriate level in our memory hierarchy to improve our throughput to gain access to those, and to reduce the energy associated with those accesses and therefore the power of the application.”

Still, in most applications, the memory system is the bottleneck and engineering teams run into problems like designing in enough memory bandwidth. “If we’re not careful we get it in chunk sizes that are too big. We’re at DDR3 and DDR4 technologies at this point, and that means the minimum transaction you can send to memory is 8 words long. If, to get the bandwidth you need, you end up making that word size large, and the objects you’re trying to move to and from memory don’t fit nicely into those chunks, you’re going to waste some of that precious bandwidth you’re building. Most of the processors today have either 32-byte or 64-byte cache line sizes. They don’t work well with memory systems that have a minimum access of 128 bytes,” he explained.

Wingard said what people end up doing architecturally is breaking the memory subsystem up into multiple independent channels, and by doing so they reduce the access granularity to the memory system and then get bandwidth in the right-size pieces.
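The arithmetic behind the granularity problem is simple enough to sketch. The burst length of 8 words comes from Wingard's description; the 128-bit and 64-bit channel widths are assumptions picked so that the wide case matches the 128-byte minimum access he mentions.

```python
import math

BURST_LENGTH = 8  # minimum transaction of 8 words, per the DDR3/DDR4 description above

def min_access_bytes(channel_width_bits):
    """Smallest transfer one channel can perform: burst length times word size."""
    return BURST_LENGTH * channel_width_bits // 8

def useful_fraction(object_bytes, channel_width_bits):
    """Share of the transferred bytes that the requester actually wanted."""
    granule = min_access_bytes(channel_width_bits)
    transferred = math.ceil(object_bytes / granule) * granule
    return object_bytes / transferred

# One wide 128-bit channel (assumed) versus two independent 64-bit channels (assumed).
for width_bits in (128, 64):
    for line_bytes in (32, 64):
        print(
            f"{width_bits}-bit channel, {line_bytes}-byte cache line: "
            f"min access {min_access_bytes(width_bits)} bytes, "
            f"useful fraction {useful_fraction(line_bytes, width_bits):.2f}"
        )
```

Under these assumptions a 64-byte cache line uses only half of each 128-byte access on the wide channel, but all of a 64-byte access once the same pins are split into two independent 64-bit channels. Peak bandwidth is unchanged; what improves is how much of it carries data the processor asked for.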

Rowen noted that doing the right thing from a performance standpoint often helps power. “There are also opportunities in many cases where people put pre-fetch into the behavior of the application or into the processor hardware itself so that it will anticipate likely future cache misses and attempt to make those inexpensive sequential accesses in anticipation of future need. It will fetch larger blocks, or it will fetch blocks that have not yet been requested but which are likely to be requested, and get them from DDR and move them into the local memory or cache so as to prevent future requests coming. So it speculates.”

There is an intriguing tradeoff between wasting bandwidth and saving on latency that comes from that speculation, he said. “The latency may be improved because, let’s say, 75% of the time it was a good bet to fetch something, and by fetching ahead you never took a cache miss, but 25% of the time you ended up just throwing it away: you fetched it, you polluted your cache with data you weren’t using. And it caused you to do extra work on the memory interface rather than waiting until you knew that you needed it, but you fetched it in anticipation and 25% of the time you were wrong. So you used a bit more bandwidth but you may have reduced the average latency. Fortunately, there are modeling tools here that allow you to get insights into what goes on with some of these scenarios.”
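Rowen's 75%/25% example can be turned into a back-of-the-envelope model. The sketch below assumes one prefetch is issued per block the program eventually needs, that a wrong prefetch is thrown away and the needed block is then fetched on demand, and that the hit and miss latencies are hypothetical cycle counts. It does not model the cache pollution he mentions.

```python
def prefetch_tradeoff(accuracy, hit_latency, miss_latency):
    """Return (average access latency, memory traffic relative to no prefetching).

    accuracy     -- fraction of prefetches that fetch the block actually needed
    hit_latency  -- assumed latency when the block is already local
    miss_latency -- assumed latency when the block must come from DDR on demand
    """
    average_latency = accuracy * hit_latency + (1.0 - accuracy) * miss_latency
    # Every needed block is transferred once; each wrong guess adds one wasted transfer.
    traffic_ratio = 1.0 + (1.0 - accuracy)
    return average_latency, traffic_ratio

HIT_LATENCY = 2      # hypothetical cycles
MISS_LATENCY = 100   # hypothetical cycles

no_prefetch = prefetch_tradeoff(0.0, HIT_LATENCY, MISS_LATENCY)
with_prefetch = prefetch_tradeoff(0.75, HIT_LATENCY, MISS_LATENCY)

print("no prefetch   (latency, traffic):", (no_prefetch[0], 1.0))
print("75% accurate  (latency, traffic):", with_prefetch)
```

With these assumed latencies, a 75%-accurate prefetcher cuts the average latency from 100 cycles to 26.5 while raising memory traffic by 25%. Whether that trade is worthwhile depends on how close the memory interface already is to saturation, which is the kind of question the modeling tools Rowen mentions are meant to answer.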

Big challenges coming up
Given that engineering teams spend a lot of power in their memory systems, having more accurate power modeling and modeling on more realistic usage scenarios is high on Rowen’s list of near-term challenges to overcome. “Relatively small changes in either the configuration of the memory system or in the behavior of the application can make a significant difference in the actual power that gets dissipated in the DDR subsystem, so it can move the needle quite a bit for overall system power.”

Another area that deserves attention is modeling of the memory system behavior using simple traffic generators. Those can be leveraged for real application traces that are directly attributable to the system application that’s running, or to a realistic video subsystem or vision system, he said.