Memory, Bandwidth And SoC Performance

Off-chip communication bottlenecks, unexpected process effects and 3D unknowns are adding new challenges for SoC designers.


By Ann Steffora Mutschler
High-end SoC architectures today can contain dozens of processing engines—multiple cores from MIPS and ARM, DSPs from Tensilica and CEVA, and even graphics processors. But with so many cores there also is a need for enormous amounts of memory, and that has been creating some unexpected design problems,

In many cases so much memory is required for an SoC that some of it has to be added off-chip. If it’s designed well, that shouldn’t have a major impact on the main functions of a chip. But traffic can back up everywhere, like air travel during a blizzard: one bottleneck leads to another, and the cumulative effect on performance can be huge.

“Memory is a part of every SoC, and the amount of memory can be quite large depending on the application,” said Mike Gianfagna, vice president of marketing at Atrenta. “Design teams often run out of room and/or power for on-chip embedded memory and need to go off chip. Once that happens, there are issues with memory access speed. Going through I/O buffers and board-level wiring will slow down access, and that can have a big impact on overall system throughput.”

Better direction of network traffic on and off the chip helps solve at least some of this issue, which explains why network-on-chip vendors such as Sonics and Arteris are now finding receptive audiences at far more companies than in the past. James Mac Hale, vice president of Asia operations at Sonics, said that many customers that have relied on a traditional bus mistakenly believe they have a memory bandwidth problem when in fact they have an interconnect problem.

“Even though they calculated how much peak memory bandwidth they needed in theory—2GB per second or whatever it might be—they would select whatever memory technology would supply that bandwidth,” said Mac Hale. “But they were finding in many video applications they were only getting anywhere between 30% and 40% of that theoretical bandwidth delivered, and that it actually is an interconnect problem.”
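The gap Mac Hale describes is easy to quantify. A minimal sketch, using the 2GB/s peak and the 30% to 40% efficiency figures from the article (the function name and structure are illustrative assumptions):

```python
# Delivered vs. theoretical DRAM bandwidth when interconnect inefficiency
# eats into the peak. Numbers match the figures quoted in the article.

def effective_bandwidth(peak_gb_per_s, efficiency):
    """Bandwidth actually delivered, given an end-to-end efficiency factor."""
    return peak_gb_per_s * efficiency

peak = 2.0  # GB/s theoretical peak, as in the article
for eff in (0.30, 0.40):
    delivered = effective_bandwidth(peak, eff)
    print(f"{eff:.0%} efficiency -> {delivered:.2f} GB/s delivered")
```

At 30% efficiency, a nominal 2GB/s interface delivers only 0.6GB/s, which is why teams that sized the DRAM correctly on paper still starve their video cores.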

Streamlining the interface between those cores and off-chip DRAM is vital. So is playing traffic cop for the transactions being presented to the memory subsystem.

Mac Hale noted that the video core needs a certain amount of bandwidth—as much as 30% of the total bandwidth of the system. “How do you make sure you can keep the CPU latency performance good, ensure the video core is getting enough bandwidth, and ensure that overall efficiency coming in and out of DRAM is higher than normal? It becomes a very complex balancing act where it is the interaction of all these systems that causes the end result.”
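The balancing act Mac Hale describes can be sketched as a simple budget check: what the DRAM subsystem can actually deliver versus what all the cores request. The core names and per-core numbers below are hypothetical; only the 2GB/s peak and mid-30s efficiency echo figures from the article:

```python
# Toy bandwidth-budget check: does the deliverable DRAM bandwidth cover
# the aggregate demand of all cores? All per-core requests are made up
# for illustration.

def check_budget(peak_gb_s, efficiency, requests_gb_s):
    """Return (deliverable, total_requested, fits) for a set of core requests."""
    deliverable = peak_gb_s * efficiency
    total = sum(requests_gb_s.values())
    return deliverable, total, total <= deliverable

requests = {"cpu": 0.3, "video": 0.6, "gpu": 0.4, "dsp": 0.2}  # GB/s, hypothetical
deliverable, total, fits = check_budget(2.0, 0.35, requests)
print(f"deliverable {deliverable:.2f} GB/s, requested {total:.2f} GB/s, fits: {fits}")
```

With 35% efficiency the budget fails (0.70GB/s deliverable against 1.50GB/s requested), which illustrates why raising interconnect efficiency can matter more than buying a faster DRAM part.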

Process and temperature effects
Even with memory on the same chip, the move from 40nm to 28nm process technology is not delivering the expected performance scaling at the slow process corner and low temperatures. “We always characterize these products across commercial operating ranges, which is -40 degrees C to 125 degrees C, and it is at that low-temperature corner that we’re not seeing the process scaling very well, so customers are having to adjust their specifications to be able to meet the targets for their markets,” offered Lisa Minwell, DesignWare embedded memory product manager at Synopsys.

Design teams will actually adjust the product spec to make 0 degrees C the lower bound of the operating range, or will have to compromise in the way they build their memory subsystem, partitioning the memories more than they otherwise would.

“The increased memory content continues, so these products have a larger aggregate amount of memory, but they cannot use a single large memory instance and achieve the speeds they are looking for because they have to partition it,” Minwell said. “Not only are they trying to meet the performance, but everywhere across the board they are dealing with green initiatives. A lot of these products are merging together where you had high-performance requirements, but they were in line-powered environments. Now we are seeing even in line-powered environments that there are initiatives for power management.”
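The partitioning trade-off Minwell describes can be sketched numerically: a single large SRAM instance has too long an access time, so it is split into banks until each bank fits the cycle time. The access-time model and all constants below are made-up assumptions for illustration, not characterization data:

```python
# Hypothetical model: per-bank access time = base delay + a term that
# grows with bank size. Split a large memory into power-of-two bank
# counts until each bank meets the target cycle time.

def banks_needed(total_kb, cycle_ns, base_ns=0.5, ns_per_kb=0.002):
    """Smallest power-of-two bank count whose per-bank access time fits cycle_ns."""
    banks = 1
    while base_ns + ns_per_kb * (total_kb / banks) > cycle_ns:
        banks *= 2
    return banks

print(banks_needed(total_kb=4096, cycle_ns=1.25))  # 4MB memory at 800MHz -> 16
```

Under these toy assumptions a 4MB memory must be split into 16 banks to run at 800MHz, and every extra bank adds decoders, muxing, and routing—exactly the area and power cost that collides with the "green initiatives" Minwell mentions.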

This is driving tighter coupling between simulation capabilities and the various power management schemes that need to be simulated. Design teams also need to be able to simulate actual performance across the chip. “It’s really EDA having to keep up with the models and some of that has not really made the strides as quickly as the marketplace has desired. It’s kind of lagging there,” Minwell acknowledged.

As a memory IP provider, she said Synopsys is trying to characterize memories so that they are as accurate as possible. “These memory compilers generate instances – thousands and thousands of different instance possibilities come out of one compiler – and we have to be very careful about the way that we characterize them so that there is enough buffer for our customer but not too much because the customer wants to be running on the edge.”

3D stacking adds new challenges
As more semiconductor companies seek to implement 3D design and/or packaging, these technologies add even more issues.

Sonics’ Mac Hale believes there is the potential for some revolutionary new architecture that could potentially eliminate the bandwidth bottleneck between the logic die and a separate memory or DRAM die. “Today, with the DDR2 or DDR3 interface, you basically have very wide paths inside the logic die, going through a narrow interface, going to a very wide structure inside the DRAM chip. With 3D stacking and through-silicon vias (TSVs), there is the potential for having a much wider connection directly between the logic die and the DRAM die.”
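The wide-versus-narrow trade-off Mac Hale describes comes down to bus width times transfer rate. A back-of-the-envelope comparison, where the widths and clock rates are illustrative assumptions rather than figures from any specific JEDEC device:

```python
# Peak bandwidth of a narrow, fast off-chip DDR-style interface vs. a
# wide, slower TSV-based connection. All widths and rates are assumed
# example values.

def bandwidth_gb_s(width_bits, transfer_rate_mt_s):
    """Peak bandwidth in GB/s: bus width in bytes times transfers per second."""
    return width_bits / 8 * transfer_rate_mt_s / 1000

ddr3 = bandwidth_gb_s(width_bits=32, transfer_rate_mt_s=1600)     # narrow, fast pins
wide_io = bandwidth_gb_s(width_bits=512, transfer_rate_mt_s=200)  # wide, slow TSVs
print(f"DDR3-style: {ddr3:.1f} GB/s, wide I/O-style: {wide_io:.1f} GB/s")
```

Even running at one-eighth the transfer rate, the 512-bit TSV connection in this sketch doubles the peak bandwidth of the 32-bit interface, which is the promise behind stacking DRAM directly on the logic die.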

Fortunately, there are standards being developed, one of which is the Wide I/O specification from JEDEC. If the industry coalesces around that standard, so that DRAM vendors provide chips with that interface and the price drops to an acceptable level, then there is the promise of much higher bandwidth for future designs, he added.

One of the most popular 3D stacked-die configurations in these early days of 3D is memory on top of a processor, with a wide I/O interface implemented using TSV technology. This configuration combines off-chip memory capacity and performance with close to on-chip access speeds.

“This sounds like the best of both worlds, and it is with exceptions,” said Atrenta’s Gianfagna. “The exceptions include increased cost due to the newness of TSV technology and a host of new thermal and mechanical stress issues due to the proximity of the silicon slices.”
