Data Buffering’s Role Grows

As data streaming continues to balloon out of control, managing the processing of that data is becoming more important and more difficult.


Data buffering is gaining ground as a way to speed up the processing of increasingly large quantities of data.

In simple terms, a data buffer is a region of physical memory that temporarily holds data while it is being moved from one place to another. This becomes increasingly necessary in data centers, autonomous vehicles, and machine learning applications. These applications demand advanced signal equalization along with greater capacity and bandwidth. Data buffering techniques, implemented either as a discrete chip within the memory module or integrated into an SoC, help make this all possible.

The data buffer chip was conceived during the development of DDR3. It captures data traveling between the CPU and the DRAM, and it manages read and write commands from the CPU to the DRAM. A memory module that uses buffer chips is known as a load-reduced dual in-line memory module (LRDIMM). Those chips boost total memory capacity while maintaining the same performance and speed.

“As data is growing at exponential rates, reducing CPU load through data buffer chips becomes necessary to increase DRAM capacity per CPU,” said Victor Cai, director of product marketing for Rambus’ Memory and Interfaces Division. “LRDIMMs are common in data centers that are performing data-intensive applications, such as big data analytics, artificial intelligence and machine learning.”

The use cases for this kind of technology are growing across a variety of markets. “It is unbelievable how much data is streaming in each second from things like radio telescopes,” explained Randy Allen, director of advanced research at Mentor, A Siemens Business. “It is literally petabytes of data. There is no way you can possibly have storage for more than two or three seconds worth of data, and this is something that you won’t be running 24 hours a day for years and years. For streaming applications like that, you have to build up a quick frame of data because the instruments are continually feeding it in. You go do whatever analysis you want to do on it, which in the case of a radio telescope is a couple of cross correlations, for example, then move on to the next set of data. It’s that buffer you need in there of data that is continually streaming.”
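
The frame-at-a-time pattern Allen describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `frames` and `process_stream`, the frame size, and the use of `sum` as a stand-in for the real analysis are all hypothetical and do not come from any telescope pipeline.

```python
from typing import Callable, Iterable, Iterator, List


def frames(stream: Iterable[float], frame_size: int) -> Iterator[List[float]]:
    """Group a continuous sample stream into fixed-size frames.

    Only one frame is held in memory at a time, so storage stays
    bounded no matter how long the stream runs.
    """
    if frame_size <= 0:
        raise ValueError("frame_size must be positive")
    buffer: List[float] = []
    for sample in stream:
        buffer.append(sample)
        if len(buffer) == frame_size:
            yield buffer
            buffer = []  # start a fresh frame; the old one can be discarded
    # A trailing partial frame is dropped because it is not a complete frame.


def process_stream(
    stream: Iterable[float],
    frame_size: int,
    analyze: Callable[[List[float]], float],
) -> List[float]:
    """Run the analysis on each complete frame, then move on to the next."""
    return [analyze(frame) for frame in frames(stream, frame_size)]


# Hypothetical usage: ten samples, frames of four, summed per frame.
results = process_stream(range(10), 4, sum)
```

Here the last two samples never form a full frame, so they are not analyzed. A real system would have to decide whether to pad, wait, or drop such data.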

This also is one of the goals for heterogeneous computing, because with all that data coming in, “you’ve decided you have some length of time, 15 nanoseconds, for instance, in order to process the data,” Allen said. “You tell it the processing that you want to do. If you do it in 5ns instead of 15ns, it doesn’t do you any good. All you care about is getting the cheapest hardware that can do the processing you want in the 15ns. This makes you willing to go around and get cheaper hardware that will go and do it, which is how you start getting into heterogeneous applications.”

Fig. 1: Data buffering inside Google’s Tensor processing unit. Source: Google

Different uses
In the case of machine learning, there are vision sensors that are doing exactly the same thing, Allen said. “They are shoving data in all the time. The same is true with . There is a continuous stream of data coming in, and it’s picking out what thing you want to work on at a particular point in time, working on it, and then moving on.”

Data buffering chips target a range of memory applications, specifically DDR3, DDR4 and DDR5, as well as the next wave of memory technology, which includes high-bandwidth memory (HBM), according to Rishi Chugh, senior group director for product marketing for USB, PCIe and SerDes in Cadence’s IP Group.

There are four driving forces for these data buffering chips, Chugh noted. “The first is to get more capacity for memory. When you have multiple memory modules attached to a single controller, you will require a data buffering chip. Second, data buffering chips can be used for more bandwidth. For instance, if additional parallel buffering is needed or high bandwidth data is flowing between the device and external memory, this requires a data buffering chip. Third, and the most common use of data buffering chips, is for better management of RAS (reliability, availability, serviceability) features in the memory. When there is a data buffering chip there is better controllability of the data. So if there is any corruption or an ECC on the data, the buffering device acts as a second source of defense to get it rectified.”

The fourth use of data buffering chips, which is very prominent in today’s world of machine learning or artificial intelligence, is that this technology provides additional logic to do interesting processing on the data bus by itself before it heads to the processor, Chugh explained. “This is predominantly used in the high-bandwidth memory designs where there are various types of memory, depending upon the application in terms of capacity and bandwidth, multiple layers of DRAM are stacked, which are tied together with a through silicon via (TSV). The TSV block gets embedded within a silicon interposer. Between the interposer and the TSV block, there is a data buffer chip that manages the data flow between these HBMs with 4 to 8 TB of memory. The buffer becomes a critical gateway between the HBM module and the actual processor or the GPU that is sitting on the other side where both the GPU and the HBM are talking to each other through the silicon interposer.”

According to Chris Rowen, CEO of Babblabs, software architects look at data buffering in at least three ways, especially in real-time systems that have data streaming in and results streaming out.

“First, it may be a way to smooth the unevenness of incoming data rates. With sufficient buffering, the architect can ensure that there is always enough data to complete one chunk of processing and complete one logical block of results, as required by the functionality and the interfaces,” he said. “Second, the architect may be able to achieve the needed function with many different incoming data block sizes, but it is often most efficient to work on large blocks. This shows up strongly in dense computations like training and inferencing on neural networks. On some platforms, the compute capacity dwarfs the memory bandwidth, so the software architect wants to re-use as much code and data as possible, usually by operating on large matrices or other data blocks. This implies buffering. For the lowest latency behavior, you might like a vision system using deep learning to operate on each image frame as it arrives, but this may degrade performance by as much as a factor of 5 or 10 relative to the performance on a ‘batch’ of 128 or 256 images.”
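
The batching tradeoff in the second point can be made concrete with a toy cost model. Everything here is an assumption for illustration: the function name `batch_metrics`, the fixed per-call overhead, the per-item cost, and the arrival rate are invented numbers, not measurements from any platform.

```python
from typing import Tuple


def batch_metrics(
    batch_size: int,
    arrival_rate_hz: float,
    overhead_s: float,
    per_item_s: float,
) -> Tuple[float, float]:
    """Return (throughput in items/s, worst-case latency in s) for a batch size.

    Assumed model: each processing call costs a fixed overhead plus a
    per-item cost, and the first item in a batch must wait for the
    remaining items to arrive before processing can start.
    """
    fill_time = (batch_size - 1) / arrival_rate_hz
    compute_time = overhead_s + per_item_s * batch_size
    throughput = batch_size / compute_time
    latency = fill_time + compute_time
    return throughput, latency


# Hypothetical numbers: 100 frames/s arriving, 9 ms overhead, 1 ms per frame.
single = batch_metrics(1, 100.0, 0.009, 0.001)
batched = batch_metrics(128, 100.0, 0.009, 0.001)
```

Under these assumed costs the larger batch sustains far more items per second, but the first frame in the batch waits more than a second for its result, which is the tension between efficiency and latency that the quote describes.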

And third, buffering may be inherent in the algorithm. “If an FFT (fast Fourier transform) is required in a speech processing system, for example, the range of frequencies to be handled and the sample rate dictate the size of the FFT. And that whole FFT block of input data is necessary to create even the first bit of data output. As a result, the block sizes in the algorithm directly control the minimum latency for processing in real-time streaming applications. So buffering is a two-edged sword. Deeper buffering typically allows for more sophisticated algorithms and more efficient computation, but it inevitably hurts system latency, which can be equally important, especially in safety-critical systems and user experiences,” Rowen added.
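
The link between FFT size and minimum latency is simple arithmetic: a block of N samples at sample rate f_s takes N / f_s seconds to arrive, and the spacing between frequency bins is f_s / N. The sketch below assumes a power-of-two FFT size, which is a common implementation choice rather than something the text specifies, and the function names and the 16 kHz example are hypothetical.

```python
def fft_size_for_resolution(sample_rate_hz: float, resolution_hz: float) -> int:
    """Smallest power-of-two FFT size whose bin spacing is <= resolution_hz."""
    if sample_rate_hz <= 0 or resolution_hz <= 0:
        raise ValueError("rates must be positive")
    n = 1
    while sample_rate_hz / n > resolution_hz:
        n *= 2
    return n


def block_latency_s(fft_size: int, sample_rate_hz: float) -> float:
    """Time needed to collect one full FFT block before any output exists."""
    return fft_size / sample_rate_hz


# Hypothetical speech example: 16 kHz sampling, bins no wider than 20 Hz.
size = fft_size_for_resolution(16_000, 20)   # 1024 samples
latency = block_latency_s(size, 16_000)      # 0.064 s of unavoidable buffering
```

No amount of extra compute removes that 64 ms in this example, because the input block has to exist before the transform can start.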

Fig. 2: Buffering options to improve throughput or lower latency in FPGA architecture. Source: National Instruments.

Things to keep in mind
In every application, there are tradeoffs to be made, and one of these is whether to implement the data buffering as a discrete chip or integrated within the SoC.

Chugh noted that while data buffering can be integrated within the SoC, this is a specialized function. It isn’t something every logic designer can do, and therefore is typically left to the experts at Texas Instruments, IDT and Inphi/Rambus. “I’m not saying it’s rocket science to do it, but it’s definitely something that requires a special skillset. It requires a very heavy analog design skillset, so normally companies will license this technology or purchase the IP as such from the specialist companies to integrate it within the SoC. So while it is possible, it’s not something that is typically done in-house.”

However, Rambus’ Cai noted that the physical layout in the DIMM industry-standard specifications requires a chip rather than an IP block. Integrating the data buffering function into the DRAM negates the main purpose of reducing the memory bus load, because there would still be the same number of DRAM chips per DIMM. In other words, implementing the data buffer as an IP block would not increase DRAM capacity, he said.

Therefore, designers need to keep in mind that signal integrity (SI) is critical in DIMM design, and proper SI methodology needs to be followed in the placement and routing of signals between the data buffer chip, the host and the DRAM. Crosstalk, power noise or reflections from a poorly designed system all will limit the maximum speed at which LRDIMMs can operate, Cai added.

Another consideration with data buffering is latency. The data buffer interfaces with very high-speed signals internally, and at the same time it has to act on the signal very quickly. “It has to immediately pass the data or send the data out of it with no transient time allocated to it,” Chugh said. “It is a very intriguing way of designing the analog components.”

This means the system architect must look at the management features that need to be enabled inside the data buffer, including specific RAS features such as one- or two-bit ECC correction, and the algorithms associated with them. In addition, when considering the hit rate of the memory, the architect must also account for what is connected to it and what the data rate is, including how many transactions the data buffer has to tolerate in mainstream operation.
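
To give a feel for what one-bit correction involves, here is a minimal sketch of the classic Hamming(7,4) code, which protects four data bits with three parity bits and can correct any single flipped bit. It is a textbook example chosen for brevity, not the algorithm used in any particular data buffer. Memory ECC generally works on much wider words, and two-bit correction requires stronger codes than this one. The function names are hypothetical.

```python
from typing import List, Tuple


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into a 7-bit codeword (bit positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword: List[int]) -> Tuple[List[int], int]:
    """Correct up to one flipped bit.

    Returns (data bits, error position), where position 0 means no
    error was detected and 1..7 is the position that was corrected.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_pos = s1 + 2 * s2 + 4 * s4  # syndrome read as a binary position
    if error_pos:
        c[error_pos - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], error_pos


# Hypothetical usage: corrupt one bit in transit and recover the data.
word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1
recovered, position = hamming74_decode(word)
```

If two bits flip, this code miscorrects silently, which is one reason the choice of correction strength is an architectural decision rather than a detail.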

A third consideration is the width of the data buffer. Is it 64, 128 or 256 bits wide? And how is the buffering within the data bits taking place? Then, in terms of slice management, what resolution can the buffer support for the memory, or for the data passing through it?

“All of those things factor into the overall aspect of performance latency associated with it, along with the footprint of the data buffer, because another aspect of the data buffer is also the footprint and the power,” Chugh said. “Many times the data buffer is not integrated inside the DRAM, but is integrated inside the memory module. If you look at the memory module, you’ll find multiple DRAMs and data buffer sitting right in the center.”

To be sure, memory modules are very sensitive to power. “There is only so much power the DIMMs get from the DIMM slots, and there is no fan running on the memories,” he said. “So they have to be self-dissipating in terms of the heat. That’s why the data buffer is one of the key critical components, which is heavy on power consumption, so they have to be very agnostic about the fact that the parameters associated with where this buffer is sitting is not conducive with high power dissipation.”

IoT ‘garbage collection’ is another application area where data buffering chips are being used. IoT applications gather data over a period of time and hold it in a buffer rather than flushing it constantly. Following edge computing, the data is collected, buffered and kept. Then, at set intervals, it is flushed and sent to the central CPU or central server.

This is done so that all of the data is flushed at one time. That way the servers can get a holistic view of the entire condition based on all of the data collected, whether it covers a specific piece of farmland or an entire metropolitan area. If the view of that data is too narrow, it will not provide clear or deterministic output.
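
The collect-then-flush model can be sketched as a small buffer class. This is an illustration only: the class name `IntervalBuffer`, the `send` callback, and the ten-second interval are assumptions, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
from typing import Callable, List, Optional


class IntervalBuffer:
    """Hold readings locally and send them upstream in one batch per interval."""

    def __init__(self, flush_interval_s: float, send: Callable[[List[float]], None]):
        self.flush_interval_s = flush_interval_s
        self._send = send  # hypothetical uplink to the central server
        self._readings: List[float] = []
        self._last_flush: Optional[float] = None

    def add(self, reading: float, now_s: float) -> None:
        """Store a reading; flush once a full interval has elapsed."""
        if self._last_flush is None:
            self._last_flush = now_s  # the first reading starts the clock
        self._readings.append(reading)
        if now_s - self._last_flush >= self.flush_interval_s:
            self.flush(now_s)

    def flush(self, now_s: float) -> None:
        """Send everything collected so far as one batch, then start over."""
        if self._readings:
            self._send(list(self._readings))
            self._readings.clear()
        self._last_flush = now_s


# Hypothetical usage: one reading per second, flushed every ten seconds.
sent: List[List[float]] = []
buffer = IntervalBuffer(10.0, sent.append)
for second in range(25):
    buffer.add(float(second), float(second))
```

After this loop the server has received two complete batches, and the last four readings are still held at the edge until the next interval ends or `flush` is called directly.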

At the end of the day, data buffering is not complicated rocket science. “You’re just basically gathering a set of data, holding onto it and then throwing it out,” said Mentor’s Allen. “But there are some architectural considerations to keep in mind—size and access time. The buffering must be sized correctly, and the system architect must determine whether it should be implemented as a discrete chip or a block of IP, and be able to hold the right size buffer. The buffer must be available for all of the necessary processing. Depending on the type of memory it is in, it has to be available quick enough to get the processing done before it is all cleared out.”

And while it is most commonly done on streaming applications and data coming from sensors, there are other uses. In machine learning, for example, it can act like another level of cache. “There, you’ve got pure digital and just having the data you need at the right part in the parallel hierarchy is going to be an important consideration,” Allen said.

Finally, data buffering chips are designed on bleeding-edge nodes, which brings additional considerations.

“Given an opportunity, data buffering chips are always leaping ahead on the bleeding edge nodes because of the power constraint,” said Cadence’s Chugh. “What happens is there is a tradeoff between the power and the density, so when they go to a new process node they can have a higher density with the same power, or the same density with lower power. They always love to go with the lowest process nodes but on the same size, since it is an analog-heavy design, they have to do a tradeoff in terms of yield. That’s why these chips are not cheap.”
