Part 1: Moving data is expensive, but decisions about when and how to move data are very context-dependent. System and chip architectures are heading in opposite directions.
Should data move to available processors, or should processors be placed close to memory? The academic community has been examining that question for decades. Moving data is one of the most expensive and power-hungry operations in a system, and it is often what limits overall performance.
Within a chip, Moore’s Law has enabled designers to physically move memory closer to processing, and that has remained the most economical way to approach the problem. At the system level, different cost functions are leading to alternative conclusions. (Part two will examine decisions at the system level.)
Historically, a chip was only large enough to contain the processor, so memory had to be off-chip. Each new process node enabled more functionality to be moved onto the chip, often providing a bigger performance boost, at lower power, than could be obtained by enhancing the processor itself.
“The goal is always to bring the data as close to the processor as possible,” says Michael Thompson, senior manager of product marketing for the DesignWare ARC processors at Synopsys. “This is so that memory can be accessed in a single cycle. Memory continues to be the slowest component in these systems. That is why we have cache and other mechanisms to manage the access times to memory.”
Over time, cache has become a lot more complex. “The whole concept of cache is trying to get data closer to where the processing happens,” says Anush Mohandass, vice president of marketing and business development at NetSpeed Systems. “This was easy in a world that was single-CPU dominated, and you can do optimizations on latency or bandwidth or power. Many of those approaches fall apart when you come into the world of heterogeneous architectures where there are multiple compute engines and each has different characteristics about how it processes its data.”
The rise of video processing has also changed the equation due to the large amount of memory required and the streaming rates necessary for the latest video standards.
Multi-layered cache is now normal. “We see ever larger and deeper memory hierarchies and ever larger shared memory just in front of DRAM to try and minimize the sustained memory bandwidth requirements,” adds Drew Wingard, chief technology officer at Sonics. “We see a push towards HBM2, HBM3, and other forms of very-high-bandwidth DRAM because there are some applications that have dataset sizes where it isn’t practical, even with multi-level hierarchies, to get enough bandwidth.”
Fig. 1: HBM2 memory. Source: Samsung
To try to satisfy heterogeneous demands, more complex hierarchies become necessary. “One thing that is happening within a cluster of multiple processors is to implement an L2 cache that is attached directly to each of the processors,” explains Thompson. “You have a dedicated L2 and then an L3 behind that. It is an attempt to move more of the data closer to the processor in real time. With 512KB of L2 dedicated to the processor, you don’t have to maintain coherence there because you can do that in the L1 cache. The chances of having a hit in L2 become a lot higher.”
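The benefit of that extra level can be illustrated with a back-of-the-envelope average memory access time (AMAT) estimate. The C sketch below uses hypothetical hit rates and latencies (assumptions for illustration, not figures from the article) to show how a dedicated L2 pulls the average cost of a memory reference down compared with going from L1 straight to a shared last-level cache.

```c
#include <stdio.h>

/* Average memory access time for a multi-level hierarchy:
 * AMAT = hit_time[0] + miss_rate[0] * (hit_time[1] + miss_rate[1] * (...))
 * Latencies are in CPU cycles; all numbers are illustrative assumptions. */
static double amat(const double *hit_time, const double *miss_rate,
                   int levels, double dram_cycles)
{
    double penalty = dram_cycles;
    /* Fold the hierarchy from the level closest to DRAM back to L1. */
    for (int i = levels - 1; i >= 0; i--)
        penalty = hit_time[i] + miss_rate[i] * penalty;
    return penalty;
}

int main(void)
{
    /* L1 backed only by a shared last-level cache. */
    double ht2[] = { 1.0, 40.0 };
    double mr2[] = { 0.10, 0.30 };

    /* L1, a dedicated per-core L2, then the shared L3. */
    double ht3[] = { 1.0, 12.0, 40.0 };
    double mr3[] = { 0.10, 0.25, 0.30 };

    printf("L1 + shared L3 only   : %.2f cycles\n", amat(ht2, mr2, 2, 200.0));
    printf("L1 + private L2 + L3  : %.2f cycles\n", amat(ht3, mr3, 3, 200.0));
    return 0;
}
```

With these assumed numbers the dedicated L2 cuts the average reference from roughly 11 cycles to under 5, which is the effect Thompson describes: more hits land close to the processor instead of spilling to the shared levels.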
But all of this is very different from the academic work that wanted to implant processing within memories. “People called them smart memories,” recalls Wingard. “For certain classes of application, the amount of memory needed is so large that you would never be able to bring the whole dataset onto chip. So they asked if we could do better by putting a little bit of processing into a large amount of memory rather than the other way around.”
There are several problems with this, including the fact that memories often use specialized production processes that are not amenable to the incorporation of much logic. Trying to adapt the process to include more logic would decrease the cost effectiveness of the memory.
“The processes for building DRAMs don’t tend to have nice transistors for building digital logic and so you end up paying a significant performance penalty,” adds Wingard. “Also, the density of those parts is such that it is difficult to get much logic in. We have not yet found a killer app for smart memories. The types of operations that we know how to do are of limited benefit. People are interested in it in the area of cognitive computing because some of those are very memory bandwidth-bound – especially in the training phase. However, they also have the property of being highly connected and needing a lot of communications, which is not easy to accomplish in a memory structure.”
Changing the programming paradigm
Any change from a single contiguous memory space affects the programming paradigm that has been used for decades. With so much software that relies on that, change is difficult.
“Moore’s Law is dying and we have to figure out a different way,” says Ty Garibay, CTO for ArterisIP. “We are seeing an increase in heterogeneity for exactly this reason. We need application-specific data processing in order to continue to scale. The old software paradigm will bend and possibly be forced to pop up a level. The host CPU will still call everything as a software call, but it will be executed on a GPU or an FPGA or custom block, and the software folks still perceive it as if it was a single thread or set of threads that executes faster. So long as we have a lot of work that continues to look like a single thread, it is relatively straightforward to optimize those things into heterogeneous processors.”
Software portability remains important. “You want to decouple software and hardware,” adds Mohandass. “You do not want software engineers thinking about the hardware architecture when coding. Software and hardware have to be optimized together but it falls apart if you have a bunch of software guys trying to write code based on the hardware structure.”
And it is not just about execution. “Any complex asymmetric processing architecture will have a problem associated with booting, about system-wide communications or IPC across sub-systems,” points out Warren Kurisu, director of product management for the embedded systems division of Mentor, a Siemens Business. “OpenAMP solves a complex problem and changes the paradigm about how people develop a system that is heterogeneous in nature, both in terms of the cores and in terms of operating systems. Virtualized I/O is necessary to support multiple machines accessing limited I/O devices. OpenAMP enables you to boot multiple instances of OSs on different cores and to create communication channels across different OS instances. To make this work you have to have shared memory.”
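Conceptually, that shared-memory channel boils down to a message ring that both processors can see, plus an agreed ordering discipline. The C sketch below is a deliberately simplified single-producer, single-consumer ring of that general kind; it is an illustration of the idea, not OpenAMP's actual RPMsg API, and the structure names and sizes are invented for the example.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Illustrative message ring placed in memory visible to both cores
 * (for example, a region reserved by the linker script or platform).
 * Names and sizes are assumptions for the sketch, not part of OpenAMP. */
#define SLOTS    16
#define MSG_SIZE 64

struct shm_ring {
    _Atomic uint32_t head;            /* advanced by the producer core */
    _Atomic uint32_t tail;            /* advanced by the consumer core */
    uint8_t msg[SLOTS][MSG_SIZE];     /* fixed-size message payloads   */
};

/* Producer side: copy a message in, then publish it by advancing head. */
static int shm_send(struct shm_ring *r, const void *data, size_t len)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == SLOTS || len > MSG_SIZE)
        return -1;                    /* ring full, or message too large */

    memcpy(r->msg[head % SLOTS], data, len);
    /* Release ordering makes the payload visible before the new head. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 0;
}

/* Consumer side: take the oldest message if one is available. */
static int shm_recv(struct shm_ring *r, void *out, size_t len)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (head == tail)
        return -1;                    /* ring empty */

    memcpy(out, r->msg[tail % SLOTS], len < MSG_SIZE ? len : MSG_SIZE);
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 0;
}
```

A real framework adds a notification path (a mailbox interrupt or doorbell) so the receiving core does not have to poll; the sketch leaves that out for brevity.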
One thing that changes with increasing amounts of integration is the depth of that cache hierarchy. “When you communicate between sub-systems through memory, and that communication happens over a short enough window of time and involves a small enough dataset, you can afford to keep it on chip,” says Wingard. “It can either be done as a peer-to-peer transfer, where one heterogeneous cluster can access memory in another one, or through some form of last-level cache. This is typically used for communicating between sub-systems on a chip. Memories that are tightly integrated into a sub-system are only useful to that sub-system.”
But there have been many cases where accelerators contain private memory. Usually these are programmed by dedicated teams who understand the hardware more intimately, so the main software flow is not disturbed. But this is increasingly seen as an imperfect solution. “There certainly is private memory within some processors, and at a higher level we are making more of that memory available to other blocks on the SoC,” explains Thompson. “For example, we had a low-latency memory that sat between the cores in a cluster and allowed information to be passed between them. We moved that memory so it sits at the same level as the L2 cache and is accessible over the AXI bus. That means the information can still be shared between the cores in the cluster, but it can also be shared with other functional blocks in the system. This makes the whole system more efficient.”
Combining data and processing
So are smart memories destined to remain of academic interest only? “There is a lot of research, but there has yet to be anything that has made it into volume production,” notes Garibay. “There are cases with video memory where some processing was done in the memory itself. That lost out to the benefit of having high-volume sockets rather than custom devices.”
GPUs are another way in which total latency has been brought down simply by having a dedicated, wide, high-speed path to memory.
But there is one example that has been commercially viable in this area for a long time. “FPGAs interleave memory with functional blocks,” Garibay says. “In-memory processing is possible with FPGAs, and this is one way in which they are trying to win the battle of AI inferencing. For most things, FPGAs are not particularly power- or area-efficient, but with inferencing you can take advantage of the variable bit width of the FPGA and have something that looks like in-memory or near-memory processing. That makes it more efficient than GPUs.”
The people programming FPGAs are more likely to be aware of hardware architectures, and can thus make full use of these devices’ capabilities. Alongside standalone FPGA chips, embedded FPGAs are being talked about far more than in the past, and commercial devices are now appearing. FPGA programmers also are rapidly adopting new programming techniques, such as OpenCL, which are far more amenable to heterogeneous systems and parallel execution. In addition, OpenCL does not depend on all memory being shared and can exploit a much more flexible memory definition to gain performance.
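To give a flavor of what that looks like, the sketch below is a minimal OpenCL C kernel (OpenCL C is a C dialect) that stages data in on-chip local memory and reduces it there, the kind of near-memory pattern an FPGA toolchain can map onto block RAM sitting next to the logic. The kernel name, tile size, and launch assumptions are illustrative, not taken from any particular vendor flow.

```c
/* Illustrative OpenCL C kernel: stage a tile of the input into on-chip
 * local memory (block RAM on an FPGA), then reduce from that local copy
 * instead of repeatedly going back to external DRAM.
 * Assumes a work-group size of exactly TILE. */
#define TILE 256

__kernel void local_sum(__global const float *in,
                        __global float *partial,
                        const unsigned int n)
{
    __local float tile[TILE];

    const size_t gid = get_global_id(0);
    const size_t lid = get_local_id(0);

    /* Each work-item copies one element into local memory. */
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Tree reduction carried out entirely in local memory. */
    for (size_t stride = TILE / 2; stride > 0; stride >>= 1) {
        if (lid < stride)
            tile[lid] += tile[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    /* One partial result per work-group goes back to global memory. */
    if (lid == 0)
        partial[get_group_id(0)] = tile[0];
}
```

The explicit split between __global and __local buffers is exactly the kind of flexible, non-uniform memory model referred to above: the programmer (or the mapping tool) decides what lives next to the compute.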
“For the ASIC you have layers of software that shield the implementation from the programmers,” says Frank Schirrmeister, senior group director for product management and marketing at Cadence. “The FPGA market appears more advanced because they have a higher degree of freedom. The hardware is more programmable. The layer to automatically map a description from OpenCL down to an implementation becomes more important. You cannot have every software designer understand all of the details, so it is abstracted into a higher-level programming model.”
All of this points toward change and innovation at the architectural level.
“We will see more examples of closer ties between memory and computation as systems become more heterogeneous,” says Garibay. “The reason why we have not seen them in the past is that historically there was a huge bonus applied to homogeneity. If there were tasks that could be done better using in-memory or near memory processing, nobody wanted to build the chip because it would not be usable for anything else. If the trend of increasing heterogeneity continues, we could see a bit of everything.”
Conclusion
Devices created for mainstream applications are likely to continue along the same path they have been following for decades, which is to reduce the power and latency of memory transfers through the application of more complex cache hierarchies. Heterogeneous architectures are making this more difficult because optimization requirements for each core may be different, but the paradigm can basically be preserved. New technologies are emerging in the package and at the system level that will continue to improve both power and latency, such as HBM within the package and photonic interconnect between packages.
Where new applications that are outside of the traditional programming paradigm are encountered, such as machine learning and inferencing, and when suitable hardware architectures exist, such as FPGAs, we may see a lot closer connection between processing and memory. However, these are destined to remain in the minority.
Part two of this report will examine the same question from the standpoint of the system and look at how quickly the paradigms there are changing.
Related Stories
Move Data Or Process In Place? (Part 2)
Costs associated with moving data escalate when we move from chips to systems. Additional social and technical factors drive different architectures.
How Cache Coherency Impacts Power, Performance
Part 1: A look at the impact of communication across multiple processors on an SoC, and how to make that more efficient.
How Cache Coherency Impacts Power, Performance
Second of two parts: Why and how to add cache coherency to an existing design.
CCIX Enables Machine Learning
The mundane aspects of a system can make or break a solution, and interfaces often define what is possible.