Overcoming The Design Bottleneck

What needs to happen to improve SoC design? The industry weighs in with some suggestions and the obstacles in the way.

SoCs control most advanced electronics these days, and their functionality, quality, power and security are determined by a combination of hardware and software. Throughout the development of today’s complex systems, the memory hierarchy has remained the same—preserving the notion of a continuous computing paradigm. Today, that decision is leading to performance and power issues.

There are several reasons why this no longer works:

  • Software, which plays an increasing role in functionality, power and reliability, has become attuned to memory being an almost limitless resource. It isn’t.
  • Processor performance stopped scaling more than a decade ago, when individual processors stopped becoming more powerful and additional computing resources came from adding more, simpler cores.
  • Memory, like ASICs and SoCs, is no longer scaling according to Moore’s Law.
  • Bandwidth between off-chip memory and the rest of the system has hit its limit. It is no longer scalable without a major change in packaging technology or the adoption of optical interconnects.

Each new node compounds these problems. “Memory controller accesses seem to be one of the major issues,” points out Tom DeSchutter, director of product marketing for virtual prototyping at Synopsys. “It is also getting more complex with multi-cluster systems and cache coherence across levels 1, 2 and 3 and across the interconnect. Then you have the memory controllers and quality of service. All of these pieces are becoming complex and they all have to be taken into account. It typically leads to significant overdesign to ensure no mistakes are made.”
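
One way to see what those hierarchy levels cost is to time a dependent pointer chase over progressively larger working sets; the measured latency steps up each time the footprint spills out of another cache level and finally lands in DRAM. The C sketch below is illustrative only; the working-set sizes, iteration counts and use of clock_gettime are arbitrary choices, not tied to any particular SoC.

```c
/* Sketch: expose the memory hierarchy by timing a dependent pointer chase
 * over growing working sets. Sizes and iteration counts are arbitrary
 * illustrative choices, not tuned for any particular SoC. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Average nanoseconds per load for a random cyclic chase over n entries. */
static double chase(size_t n, size_t steps)
{
    size_t *next = malloc(n * sizeof *next);
    if (!next) return -1.0;
    for (size_t i = 0; i < n; i++) next[i] = i;
    /* Sattolo's shuffle: always yields a single cycle, so the chase visits
     * every entry and the prefetcher cannot guess the next address. */
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }
    struct timespec t0, t1;
    size_t p = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t s = 0; s < steps; s++) p = next[p];   /* dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    volatile size_t sink = p; (void)sink;             /* keep the loop alive */
    free(next);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)steps;
}

int main(void)
{
    /* Working sets from 32KB to 128MB: each jump past a cache level shows
     * up as a step in the reported latency. */
    for (size_t kb = 32; kb <= 128 * 1024; kb *= 4)
        printf("%8zu KB : %6.1f ns/load\n",
               kb, chase(kb * 1024 / sizeof(size_t), 10u * 1000 * 1000));
    return 0;
}
```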

Is all of the complexity in the memory architecture really necessary? “People are always pushing the envelope on performance, power and functionality, and memory plays such a central role in the functionality of the systems,” says Chris Rowen, fellow at Cadence. “Sometimes that is an unnecessary complexity, but other times it is necessary. You don’t expect to be able to conquer new mountains without doing some amount of hard work, and the scaling up of performance while staying within energy bounds and cost constraints is one of the central hard problems.”

How did we get here and what are the possible ways to steer a course into the future that does not have this limitation?

The role of memory
“Today’s memory and processor architectures are a superb piece of engineering and work around the limitation that both bandwidth and latency worsen as a program’s memory requirement goes up,” says Pranav Ashar, chief technology officer at Real Intent. “This works reasonably well as long as the program’s memory access patterns are regular and predictable. When a program does not meet that bar, such as when the memory access pattern is irregular, its performance falls off a cliff.”
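
Ashar’s “falls off a cliff” is easy to reproduce on any machine: stream through an array once, then walk the same data through a shuffled index table so the prefetcher cannot help. The sketch below, with an arbitrary 128MB working set and a crude rand()-based shuffle, is a rough illustration of the gap rather than a benchmark.

```c
/* Sketch: regular (streaming) vs. irregular (gather) access over the same
 * data. The array size and the timing approach are illustrative assumptions. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 24)   /* 16M doubles (~128 MB): comfortably DRAM-resident */

static double now_ms(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec * 1e3 + t.tv_nsec / 1e6;
}

int main(void)
{
    double *a  = malloc(N * sizeof *a);
    size_t *idx = malloc(N * sizeof *idx);
    if (!a || !idx) return 1;
    for (size_t i = 0; i < N; i++) { a[i] = 1.0; idx[i] = i; }
    /* Shuffle the index array so the gather pass hits DRAM in a pattern
     * the prefetcher cannot follow. */
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }

    double t0 = now_ms(), s = 0.0;
    for (size_t i = 0; i < N; i++) s += a[i];           /* regular stream  */
    double t1 = now_ms();
    for (size_t i = 0; i < N; i++) s += a[idx[i]];      /* irregular gather */
    double t2 = now_ms();

    printf("regular:   %.0f ms\nirregular: %.0f ms\n(checksum %.0f)\n",
           t1 - t0, t2 - t1, s);
    free(a); free(idx);
    return 0;
}
```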

In many emerging markets, overdesign may not be tolerated because of reduced margins, and the creation of a product family makes things even more difficult. “With reuse across several chips within mobile phones, you start with the highest-level product because that has the highest margins,” explains DeSchutter. “You assume a certain throughput and bandwidth required for the memory controller. Then you try to reduce the cost of the SoC for derivative designs and often run into a serious memory bottleneck. That can cause it to significantly underperform. In this case, using a few components that cost less, or a different type of memory, can result in under-design.”

Much of the bottleneck comes from going off-chip to the DRAM. “At JEDEC, where they look at improving the DRAM interface, it is clear that there is very little to zero tolerance to changes to the underlying memory system architecture that result in a higher price per bit of DRAM,” says Drew Wingard, chief technology officer at Sonics. “They can do anything they want from a technology perspective so long as the cost per bit of DRAM continues to go down.”

Companies also do not like the variability that comes from off-chip memory in certain markets. “One problem with off-chip memory is that the price of DRAM is variable and that makes it difficult to have a predictable business model,” says Farzad Zarrinfar, managing director of the IP division of Mentor Graphics. “When it is on-chip you know your costs. It also eliminates capacitive loading from external connections and thus enables wider bandwidth compared to off-chip. For video applications or imaging they need the highest bandwidth and they will stay on-chip. This also avoids high-pin-count packages that are expensive.”

Impact on software
Ask any hardware person about the wasteful practices of software and you will get countless stories. “There is a guy who, every time a new generation of PC comes along or a new version of Windows or a new version of Office, runs the same benchmark on each of them—the same spreadsheet, the same PowerPoint, the latest version of Office on the latest version of Windows running on the latest hardware,” says Wingard. “The results are essentially flat with those of the past. Processors used to clock at under 1GHz, so even with a 10- to 50-fold improvement in every hardware metric, the actual performance is no better. This is because they have added so much more stuff around it, such as the time the processor spends on virus checking and the extra error checking looking for problems, and they have built a self-fulfilling prophecy that requires ever-increasing performance and amounts of memory to do the same amount of work.”

The reasons are obvious. “Windows 3.1 had 2.3M lines of code, and that has grown to 40M in Windows 10,” says Andreas Kuehlmann, senior vice president and general manager of the software integrity group at Synopsys. “Has functionality increased that much? Why is that? The reason is that the programming languages are getting more wasteful. C++ is unbelievably wasteful. Java or JavaScript are even worse. Replicating code doesn’t cost anything in software. Moore’s Law has always made memory cheaper, so adding software doesn’t cost anything until you realize that it has to execute on a processor and that it is an embedded device that consumes power. Nobody is going back to more efficient languages because the cost of development would go up. So it is not just the interface that has to be optimized. Nobody takes into account the raw efficiency of high-level programs or their power consumption.”

And none of this even starts to consider the challenges associated with multi-processing. “The old software paradigm of single-threaded execution, and architectures that had separate memory and CPU chips, certainly makes migration to a non-CC/NUMA many-core platform difficult,” points out Kevin Cameron, consultant at Silvaco.

Rowen explains why we have gotten to this state: “From a programmer’s perspective you would like for everyone to just take memory for granted. You don’t have to think about what it is, or where it is. It just works. That leads to a very high level of abstraction in the paradigm for memory, and that has led to solutions where we say that memory is one grand unified address space. That is the motivation for large off-chip memory and quite complex memory hierarchies requiring cache coherency between cores that maintain the illusion of a grand unified memory space. In reality there are a lot of hardware elves working very hard to make that illusion remain true. It takes a lot of hardware to do coherency and intelligent pre-fetch and smart cache. This enables some rough approximation to the unified memory that software relies on.”
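
The cost of keeping that illusion alive shows up even in trivial code. In the hedged sketch below, two threads increment independent counters; when the counters happen to share a cache line, the coherence machinery bounces that line between cores and the loop slows down noticeably. The 64-byte line size, thread count and iteration counts are assumptions chosen for illustration.

```c
/* Sketch: the price of the coherency "illusion". Two threads increment
 * independent counters; when the counters share a cache line, the coherence
 * protocol bounces the line between cores and the loop slows down markedly.
 * Assumes 64-byte cache lines; counts are arbitrary. Build with -pthread. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

enum { ITERS = 100 * 1000 * 1000 };

struct shared { volatile long a, b; };                       /* same line     */
struct padded { volatile long a; char pad[64]; volatile long b; }; /* split  */

static struct shared s;
static struct padded p;

static void *bump_sa(void *x) { (void)x; for (long i = 0; i < ITERS; i++) s.a++; return NULL; }
static void *bump_sb(void *x) { (void)x; for (long i = 0; i < ITERS; i++) s.b++; return NULL; }
static void *bump_pa(void *x) { (void)x; for (long i = 0; i < ITERS; i++) p.a++; return NULL; }
static void *bump_pb(void *x) { (void)x; for (long i = 0; i < ITERS; i++) p.b++; return NULL; }

/* Run two threads to completion and return the elapsed wall time in seconds. */
static double run(void *(*fa)(void *), void *(*fb)(void *))
{
    struct timespec t0, t1;
    pthread_t ta, tb;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&ta, NULL, fa, NULL);
    pthread_create(&tb, NULL, fb, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    printf("false sharing: %.2f s\n", run(bump_sa, bump_sb));
    printf("padded:        %.2f s\n", run(bump_pa, bump_pb));
    return 0;
}
```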

Then the software teams were told that what their companies really cared about was programming productivity, so they built more levels of abstraction, layers of libraries, operating systems, middleware and services that allow them to do a better job of creating interesting new applications. “I would guess that 90% of every increase in performance or capacity goes to making the software team’s job easier and only 10% to making the application run faster,” says Rowen.

Times are changing
While we can argue about the fate of Moore’s Law, this is not the only technology under pressure. “We are at the end of the road for JEDEC standard DRAM interfaces,” says Wingard. “There appears to be little hope for anything after DDR4, except possibly for an LPDDR5.”

Several changes already have been mentioned that are putting additional strain on the memory interface, and new markets have very different requirements. “Thank heaven the PC doesn’t drive all electronics,” Wingard says. “If you go to the other end of the product spectrum and consider wearables and the Internet of Things (IoT), you have a different world. There I care about form factor, which means the size of the battery, and that depends on the amount of energy I spend. I care about peak power dissipation, and thus how efficiently I am using the memory resources, or whether I have memory that is leaking. Products in the IoT space do not appear to have much margin, so I have hope that we will get to more efficient hardware, software and system design.”

There is increasing pressure on the software teams. “Software is getting larger and larger and increasingly buggy, as well as consuming more power,” says Kuehlmann. “This is because there is no regularity enforced in software. Regularity in hardware minimizes corner cases and that makes for more robust designs. If you have a spaghetti design you have a lot more corner cases.”

Software itself may also be undergoing change. “Many fundamental computer science problems at the core of numerous interesting applications, such as SoC verification and machine learning, have memory access patterns that are irregular,” Kuehlmann notes. “This causes the performance of current memory architectures to fall off a cliff. Another cause for concern is that the current state of memory architectures is a bottleneck to scaling up multi-core parallelization, both in the parallelization of single programs as well as in being able to run multiple programs simultaneously.”

This means that the hardware teams have to be ready to respond with new architectures. “We have an opportunity in terms of hardware design to rethink compute models,” says Harry Foster, chief verification scientist at Mentor Graphics. “It is not just the software guys who are guilty. There are much more efficient ways we could do caching, and better ways in which processors and memory could interact. It will take both teams to solve this problem.”

Possible approaches and solutions
There are several approaches being considered, including new memories, new fabrication technologies, new packaging, and new types of tools that could change the compute paradigm. It is certainly not clear which ones will see successful introductions and how they may get combined. It is also unclear if cost per bit will continue to go down.

Two approaches that are receiving a lot of attention are 3D memories and 3D stacking of chips. “This is a dimensional problem,” says Cameron. “A 2D chip has a 1D edge, and a 3D stack has a 2D surface. This changes the limits on how fast you can move data in and out. The bigger the memory, the worse the I/O problem.”
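
A rough way to put numbers on that argument is to treat storage capacity as growing with die area while edge-limited I/O grows only with the perimeter, and face (TSV-style) I/O grows with area. The constants in the sketch below are made-up densities, chosen only to show the scaling trend Cameron describes, not real pad or TSV pitches.

```c
/* Sketch of the dimensional argument: capacity scales with area (side^2),
 * edge-limited I/O scales with the perimeter (side), TSV-style face I/O
 * scales with area. All densities are hypothetical placeholders. */
#include <stdio.h>

int main(void)
{
    const double pads_per_mm_edge  = 20.0;  /* hypothetical edge pad density */
    const double tsvs_per_mm2_face = 100.0; /* hypothetical TSV density      */
    const double bits_per_mm2      = 1e8;   /* hypothetical storage density  */

    printf("%8s %12s %12s %12s\n", "side/mm", "bits", "edge I/O", "face I/O");
    for (double side = 5.0; side <= 20.0; side += 5.0) {
        double bits    = bits_per_mm2 * side * side;
        double edge_io = pads_per_mm_edge * 4.0 * side;    /* 1D boundary */
        double face_io = tsvs_per_mm2_face * side * side;  /* 2D surface  */
        printf("%8.0f %12.2e %12.0f %12.0f\n", side, bits, edge_io, face_io);
    }
    /* Edge I/O per bit shrinks as 1/side while face I/O per bit stays flat:
     * the bigger the memory, the worse the edge-limited I/O problem. */
    return 0;
}
```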

NAND flash memories that have grown in the vertical direction are already in production. In DRAM, one approach is the Micron Hybrid Memory Cube (HMC). Wingard says this is a “rethink of some basics of DRAM architecture in that the controller, which understands the organization of the memory and generates all of the RAS and CAS signals, is put into a logic chip that sits underneath the DRAM stack. So from an SoC perspective, you have something that looks more like a SerDes interface to send packets to and from the memory stack, so there is no longer a memory controller in the SoC. Because of the speed of the SerDes and the ability to have multiple lanes in parallel, Micron claims it can get to 320GB/s from a single stack.”
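
The headline number comes from aggregating many fast serial lanes rather than widening a parallel bus. The back-of-the-envelope sketch below uses placeholder lane rates, lane counts and link counts, not the actual HMC specification, only to show how aggregates of that order are reached.

```c
/* Back-of-the-envelope for a SerDes-based memory interface. The lane rate,
 * lane count and link count are hypothetical placeholders, not the HMC
 * specification; the point is only that many serial lanes aggregate into
 * very large memory bandwidth without a parallel DDR-style bus. */
#include <stdio.h>

int main(void)
{
    double gbit_per_lane = 15.0;  /* assumed line rate per lane, Gb/s */
    int lanes_per_link   = 16;    /* assumed */
    int links            = 4;     /* assumed */
    int directions       = 2;     /* full duplex: transmit plus receive */

    double gbytes = gbit_per_lane * lanes_per_link * links * directions / 8.0;
    printf("aggregate raw bandwidth: %.0f GB/s\n", gbytes);
    /* With these placeholder numbers the raw aggregate is 240 GB/s; faster
     * lanes or more links push it into the range quoted above. */
    return 0;
}
```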

Many in the industry are looking for the day when existing memories will be completely replaced by non-volatile storage such as resistive and other emerging memory technologies. “3D-architected versions of these memories will deliver the capacity of rotating magnetic platters, in solid state,” says Ashar. “The upshot will be that all memory will be solid state, and two memory-hierarchy layers (DRAM and disk) will merge into one.”

2.5D and 3D integration also may bring more of the memory into the package. This could be using Wide I/O, “which focuses on mobile and relies on bonding DRAM right above you in the same way as package-on-package, but using through-silicon vias,” explains Wingard. “There have been no large commercial adoptions of this yet, but there are continued rumors that this will be seen next year. Alternatively, there is High Density Bulk Memory, which looks like taking a set of LPDDR chips and putting them into a 3D stack but not stacking them on top of an SoC, instead using a 2.5D interposer. It has the benefit of relatively standard DRAM interfaces and achieves low latency access while scaling the bandwidth massively, but takes a lot of pins and traces on the interposer.”

Alternatively, we can rethink some of the existing memories. “We will see a path in memories that looks similar to the techniques used in flash technologies,” predicts Rowen. “With any memory technology there is a tradeoff between how reliable the bitcell is versus how much other stuff you wrap around it in order to compensate for the flakiness of the individual cells. A typical NAND flash cell is a pretty lousy storage device. Lots of them are wrong, they are stuck, they don’t hold their values as long as you would like, and they wear out, so we put a layer on top of them that can find the bad cells, provide increasing levels of ECC, and do wear leveling. All of these improve the poor characteristics of the memory. Very little of this has yet been applied to DRAM. It is an expensive proposition to require a bitcell that is 99.9999% correct. If we can lop a few 9s off of that, you don’t have to go as far. People will add higher levels of protocols to compensate for the degrading quality of the bitcell.”
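
The “layer on top” that Rowen describes starts with error-correcting codes. A minimal illustration is a Hamming(7,4) code, which wraps three parity bits around four data bits so that any single flipped cell can be located and repaired; real controllers use far stronger codes (BCH, LDPC) plus wear leveling, so the sketch below shows only the principle.

```c
/* Sketch: Hamming(7,4), the simplest version of wrapping parity around
 * unreliable cells so a single flipped bit can be found and corrected.
 * Real DRAM/flash controllers use far stronger codes; this is the idea only. */
#include <stdio.h>

/* Encode 4 data bits d[0..3] into a 7-bit codeword c[0..6] (positions 1..7). */
static void encode(const int d[4], int c[7])
{
    c[2] = d[0]; c[4] = d[1]; c[5] = d[2]; c[6] = d[3];  /* data at 3,5,6,7 */
    c[0] = d[0] ^ d[1] ^ d[3];                           /* p1 covers 3,5,7 */
    c[1] = d[0] ^ d[2] ^ d[3];                           /* p2 covers 3,6,7 */
    c[3] = d[1] ^ d[2] ^ d[3];                           /* p3 covers 5,6,7 */
}

/* Correct at most one flipped bit in place; returns the 1-based error
 * position, or 0 if the codeword was clean. */
static int correct(int c[7])
{
    int s1 = c[0] ^ c[2] ^ c[4] ^ c[6];   /* checks positions 1,3,5,7 */
    int s2 = c[1] ^ c[2] ^ c[5] ^ c[6];   /* checks positions 2,3,6,7 */
    int s3 = c[3] ^ c[4] ^ c[5] ^ c[6];   /* checks positions 4,5,6,7 */
    int pos = s1 + 2 * s2 + 4 * s3;
    if (pos) c[pos - 1] ^= 1;
    return pos;
}

int main(void)
{
    int data[4] = { 1, 0, 1, 1 }, cw[7];
    encode(data, cw);
    cw[4] ^= 1;                            /* simulate one flaky bitcell */
    int pos = correct(cw);
    printf("corrected bit at position %d; data back = %d%d%d%d\n",
           pos, cw[2], cw[4], cw[5], cw[6]);
    return 0;
}
```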

Cameron also sees the possibility of a tool-based approach to the problem. “If we treat C/C++ more like an HDL and fully elaborate the runtime call structure, it will allow a non-cache-coherent/NUMA architecture to happen. Synopsys’ Coverity tool, and other static code analyzers, can perform the analysis but don’t have the code generation capability yet.”

Cameron explains that you can view the existing code execution paradigm as being more like a high-speed pseudo-code interpreter than a real attempt at compiling the code. “The new computing paradigm will include inserting forks, and joins, and compiling the communication between threads rather than relying on the single data-bus architecture that most current computers use for sharing data.”
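
One plausible reading of that paradigm is code in which the forks, joins and inter-thread communication are explicit rather than hidden behind a shared, coherent address space. The sketch below is a guess at the shape of such generated code, using a deliberately tiny mutex-and-condition-variable mailbox between two pthreads; it is not the output of any actual tool.

```c
/* Sketch of the kind of code a "compile the communication" flow might emit:
 * explicit fork/join plus a point-to-point channel between threads, rather
 * than threads implicitly sharing data over one coherent bus. The single-slot
 * mailbox is for illustration only. Build with -pthread. */
#include <pthread.h>
#include <stdio.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  ready;
    int  value;
    bool full;
} channel_t;

static channel_t ch = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, false };

static void ch_send(channel_t *c, int v)
{
    pthread_mutex_lock(&c->lock);
    while (c->full) pthread_cond_wait(&c->ready, &c->lock);
    c->value = v; c->full = true;
    pthread_cond_signal(&c->ready);
    pthread_mutex_unlock(&c->lock);
}

static int ch_recv(channel_t *c)
{
    pthread_mutex_lock(&c->lock);
    while (!c->full) pthread_cond_wait(&c->ready, &c->lock);
    int v = c->value; c->full = false;
    pthread_cond_signal(&c->ready);
    pthread_mutex_unlock(&c->lock);
    return v;
}

static void *producer(void *x) { (void)x; for (int i = 1; i <= 5; i++) ch_send(&ch, i * i); return NULL; }
static void *consumer(void *x) { (void)x; for (int i = 0; i < 5; i++) printf("got %d\n", ch_recv(&ch)); return NULL; }

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);   /* explicit fork */
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);                      /* explicit join */
    pthread_join(c, NULL);
    return 0;
}
```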

The architectures of tomorrow almost certainly will be different than the ones we use today. It will be interesting to see who will be the champion for each of the new possibilities.


