Designing for Data Flow

Processing more data in more places while minimizing its movement becomes a requirement and a challenge.


Movement and management of data inside and outside of chips is becoming a central theme for a growing number of electronic systems, and a huge challenge for all of them.

Entirely new architectures and techniques are being developed to reduce the movement of data, to accomplish more per compute cycle, and to speed the transfer of data between various components on a chip and between chips in a package. Alongside that, new materials are being developed to increase electron mobility and to reduce resistance and capacitance.

While all of this is necessary, it may not be enough. The rollout of AI in more devices, fueled by data-producing sensors everywhere, is creating an explosion of data. The continued build-out of the edge, and the move to pre-process data at the endpoint, are a recognition that data needs to be processed closer to the source because there is simply too much to send to the cloud. Exactly how many zettabytes of data are generated each year is up for debate, but everyone agrees the number is enormous, and that the vast majority of it was created in the past couple of years.

That has created several issues around processing, partitioning, prioritizing, and storing data, and it has led to approaches ranging from high-speed bridges and hybrid bonding to stacked memory and new transistor structures. It also has driven more customization around different data types, where one size no longer fits all, along with partitioning of data by function and importance.

“When stitching several functions on a chip, one has to plan for efficient flow of data between those functional modules,” said Alpesh Kothari, chief technologist at Siemens EDA. “Engineers need to plan for the size of the data, whether the data needs to be synchronous or asynchronous, the operating frequency of the design, and the direction of the data flow. The data flow can be one-way or two-way. The planning for data flow also needs to take into consideration whether data exchange is happening between functional modules and memories (cache) versus between two functional modules.”
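The parameters Kothari lists can be captured in a simple planning record. The sketch below is purely illustrative, not a real EDA data model; all names (`DataFlowLink`, `Direction`, `Endpoint`) are hypothetical:

```python
from dataclasses import dataclass
from enum import Enum

class Direction(Enum):
    ONE_WAY = "one-way"
    TWO_WAY = "two-way"

class Endpoint(Enum):
    MODULE = "functional module"
    CACHE = "cache/memory"

@dataclass
class DataFlowLink:
    """One planned data path between two blocks on a chip."""
    src: str
    dst: str
    width_bits: int      # size of the data being moved
    synchronous: bool    # synchronous vs. asynchronous transfer
    freq_mhz: float      # operating frequency of the link
    direction: Direction
    dst_kind: Endpoint   # module-to-cache vs. module-to-module

    def peak_bandwidth_gbps(self) -> float:
        """Upper bound on throughput if one word moves per cycle."""
        return self.width_bits * self.freq_mhz * 1e6 / 1e9

# A 128-bit synchronous link to cache at 800 MHz:
link = DataFlowLink("dsp", "l2_cache", 128, True, 800.0,
                    Direction.ONE_WAY, Endpoint.CACHE)
print(link.peak_bandwidth_gbps())  # 102.4 Gb/s peak
```

Even a bookkeeping structure this simple makes the tradeoffs visible: widening the link or raising the frequency buys bandwidth, but both have physical costs, as discussed below.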

Moving data is costly in terms of both performance and the power needed to move it. And in the case of AI/ML, there is the additional cost of highly customized accelerators and specialized memories that can process that data as quickly as possible to achieve results in real time, or as close to it as possible.

“Transferring the data can be serialized to reduce the number of bits, but it is at the expense of how fast the entire data can be transferred to the receiver side,” said Kothari. “Moreover, serialization requires planning of retention logic on the receiver side until all the data is received. On the other hand, allowing increased data transfer can take up precious silicon space. Also, data speed is dependent on the physical locations of the two functional modules communicating with each other.”
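The serialization tradeoff Kothari describes is straightforward to quantify: narrowing the link saves wires but multiplies the cycles needed to move a payload, and the receiver must hold partial words until the transfer completes. A minimal sketch (the function name is hypothetical):

```python
from math import ceil

def transfer_cycles(payload_bits: int, lane_width_bits: int) -> int:
    """Cycles needed to move one payload when only lane_width_bits
    cross the link per cycle. The receiver's retention logic must
    buffer partial words until all cycles have completed."""
    return ceil(payload_bits / lane_width_bits)

# Serializing a 256-bit word: fewer wires, more cycles.
for width in (256, 64, 8):
    print(width, "wires ->", transfer_cycles(256, width), "cycles")
# 256 wires -> 1 cycle; 64 -> 4 cycles; 8 -> 32 cycles
```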

Floor-planning on steroids
This has made floor-planning a critical and increasingly creative step in design, rather than a mundane, low-level place-and-route exercise.

“This is what makes this whole area of AI so interesting,” said Sailesh Chittipeddi, executive vice president at Renesas Electronics. “From the EDA side, we’re seeing new AI place-and-route methodologies being implemented by companies like Cadence, Synopsys, and Siemens, as well as system-level simulation. And you have to optimize for a particular application. You can’t have one generic device performing the functionality that you need today. We’re also seeing systems companies taking on more of the functions that used to be separate.”

That increases the amount of silicon required, however, which is one of the reasons advanced packaging has begun picking up steam. Data needs to be processed according to priority for a device, which can vary significantly depending upon use cases, applications, and functions.

“It’s not enough to have a controller in memory, wire it up, and expect it to work fine,” said Niels Faché, vice president and general manager of PathWave Software Solutions at Keysight. “You need to design this in context of the different technologies that you’re bringing together. It’s not just designing the memory or the control. It’s the entire design that goes into packages.”

In some cases, processing is done synchronously, in others it is asynchronous, and sometimes there are both types. It also may involve various levels of cache, or no cache at all. But all of those variables are based on a number of factors, ranging from the amount and criticality of certain data to the precision required in processing that data.

“The size of the floorplan will determine how much sampling is needed to transfer the data from one side of the design to the other,” said Siemens’ Kothari. “The smaller the sampling, the faster the data speed that can be achieved. Sampling rate is not just a function of the floorplan but also of the cell delays (i.e. how fast are the buffers, and the length that the buffers can drive the signal). Placement of those sampling registers, buffers, and net delays at advanced nodes plays an important role. Placement of the functional modules within the floorplan determines data speed. Designers need to plan for placing the functional modules, which communicate the most with each other in close proximity, to drive their place-and-route tools to produce optimal placement.”
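The relationship between floorplan distance and sampling can be sketched in the same back-of-the-envelope way. Assuming a buffer chain can drive a signal only a fixed distance per clock cycle (the numbers below are illustrative, not node-specific):

```python
from math import ceil

def sampling_stages(distance_mm: float, reach_mm_per_cycle: float) -> int:
    """Sampling (pipeline) registers needed so that each wire segment
    is short enough for its buffer chain to drive in one clock cycle.
    Each stage adds one cycle of latency to the data path."""
    return max(1, ceil(distance_mm / reach_mm_per_cycle))

# Placing the modules that talk most next to each other cuts
# latency directly:
print(sampling_stages(6.0, 1.5))  # 4 cycles across the die
print(sampling_stages(1.2, 1.5))  # 1 cycle when placed adjacently
```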

Put simply, the bottlenecks in designs are being defined by the type and volume of data, and the speed at which it needs to be processed. “SoCs are getting bigger and more complex, fitting everything in the actual chip,” he said. “So data exchange, which used to happen at a system level, is now happening within the IC. This means efficient circuit design for data transfer is required to achieve the overall expected performance. The data flow design at the logic level is quite abstract. In the past, the chips were smaller and mostly driven by specific functionality, so there were only a few stages required to plan for data flow. With bigger chips, this has changed, and more effort is needed to understand the data sampling and placement of the appropriate functional modules next to each other, to achieve optimal data flow.”

Data integrity also is becoming a challenge. In addition to crosstalk and various types of noise, which are prevalent at advanced nodes, there are a variety of aging effects that can appear over longer lifetimes, thermal mismatch between increasingly heterogeneous components, and latent defects that can become real defects as the amount of processing required on a chip or in a package increases.

While many of the topics around designing for data flow have stayed the same in recent years, there is an increasing focus on composable architectures and standards. “Data flow is a very interesting approach to processor architectures. It is maybe not mainstream von Neumann classic compute, but there are some classes of problems where it is a good fit — network processing, security, some classes of communications, etc.,” said Rupert Baines, chief marketing officer at Codasip. He noted the approach is ripe for customization of the processor itself. “You can use a standard ISA to get the benefits of standard software, tools, and libraries. But then you can customize the architecture of the core to support data flow operations with the tight integration between the comms links and the processor.”

Imperas CEO Simon Davidmann agreed, saying RISC-V has become a critical enabler in the process of designing for data flow. “The reason for that is that RISC-V can be small or medium or large, and that can help in terms of the computations you have to provide. You could put up a whole row of RISC-V engines in a line and have very efficient communication by building a sort of data flow machine. We know of people that are doing that, especially in the AI space. You have complicated RISC-V arrays all communicating efficiently, which is enabled by having custom extensions and custom instructions. The tasks are huge, and RISC-V helps because it allows you to have very sophisticated processes with all the vector engines as well as efficient and sophisticated communication.”
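A rough software analogy for such a row of cores is a chain of stages that stream values directly from one to the next, rather than bouncing every intermediate result off a central memory. This is only an illustration of the dataflow idea, not RISC-V code, and the stage functions are placeholders:

```python
def make_pipeline(*stages):
    """Chain of processing stages: each value flows through every
    stage in order, like data streaming along a row of cores."""
    def run(values):
        for v in values:
            for stage in stages:
                v = stage(v)
            yield v
    return run

pipe = make_pipeline(lambda x: x * 2,      # e.g., scale (vector engine)
                     lambda x: x + 1,      # e.g., add bias
                     lambda x: max(x, 0))  # e.g., ReLU-style clamp
print(list(pipe([-3, 0, 4])))  # [0, 1, 9]
```

In hardware, the payoff is that stage-to-stage hand-offs replace round trips to shared memory, which is exactly the communication efficiency Davidmann describes.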

Fig. 1: Composable architectures in a data center based on data processing and movement. Source: Cadence

Moving data
Regardless of your perspective, moving data is a challenge. But depending upon whether you are looking inside a chip, inside a package with multiple heterogeneous chips or chiplets, or inside a data center, where data can be shared between different servers, the problems that need to be solved can look very different.

Arif Khan, product marketing group director for PCIe, CXL, and interface IP at Cadence, noted that the industry is facing interconnect delays, with power dominating the actual compute. That is one of the many factors driving customization. “The market and applications are much more intensely data-driven, with extremely specialized requirements for compute versus interconnect technologies in HPC/AI,” he said. “This is leading to a demand for more composable architectures. Technology standards such as CXL are adapting to these needs to support this, with new features in the 3.0 specification for advanced fabric capabilities that will enable such systems to be built up efficiently.”

CXL plays a significant role in this scenario, particularly in chips used in data centers. “First and foremost, CXL 3.0 increases the speed of the link to 64GT/s and introduces 256-byte flits,” said Khan. “It also introduces a new feature that allows non-tree architectures to be built, allowing for full-fabric capabilities. This allows for implementing global fabric-attached memory, which disaggregates the memory pools from the processing units. The memory pools can also be heterogeneous, with various types of memory. In the future, we can envision a leaf/spine architecture, with leaves for NICs, CPU, memory, and accelerators, with an interconnected spine switch system built around CXL 3.0. UCIe and other innovative standards that have come up recently attempt to standardize the slew of proprietary die-to-die interfaces that arose as implementers addressed problems partitioning designs to address reticle limits.”
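As a back-of-the-envelope check on those numbers, raw per-direction link bandwidth scales with the signaling rate and lane count. The function below is illustrative and deliberately ignores flit framing, CRC, and FEC overhead, so real usable throughput is somewhat lower:

```python
def raw_link_bandwidth_gbs(gt_per_s: float, lanes: int) -> float:
    """Raw per-direction bandwidth in GB/s: each transfer moves one
    bit per lane, and 8 bits make a byte. Protocol overhead (flit
    framing, CRC, FEC) is ignored in this rough estimate."""
    return gt_per_s * lanes / 8

# A hypothetical x16 link at CXL 3.0's 64 GT/s signaling rate:
print(raw_link_bandwidth_gbs(64, 16))  # 128.0 GB/s per direction
```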

Designing for data flow is critical as data volumes grow. Engineering teams must manage the tradeoffs between data quantity and speed, as well as design around data size, operating frequency, data direction, and whether the data exchange takes place between functional modules and cache or between two functional modules. They also need to build these chips in the context of different applications, use cases, data types, and the classic tradeoffs of performance, power, and area/cost.

In the future, this will likely drive entirely new architectures and packaging approaches, many of them expanding into the z-axis with through-silicon vias, bridges, and other high-speed interconnects, as well as new materials and bonding techniques. The future of chip design is increasingly about how to deal with more data, processing in more places — in memory, near memory, or using pooled resources with extremely fast interconnects. That will open the door to all sorts of new opportunities and architectures, some of which have been sitting on the sidelines, and some of which have yet to be discovered.

—Ed Sperling contributed to this report.
