Shifting Toward Data-Driven Chip Architectures

Rethinking how to improve performance and lower power in semiconductors.

An explosion in data is forcing chipmakers to rethink where to process data, which are the best types of processors and memories for different types of data, and how to structure, partition and prioritize the movement of raw and processed data.

New chips from systems companies such as Google, Facebook, Alibaba, and IBM all incorporate this approach. So do those developed by vendors such as Apple and Samsung, as well as many carmakers. And the adoption of these design approaches is spreading.

Many of these changes are evolutionary. Others, such as memories and network interface cards that incorporate some level of machine learning or other in-device processing, have been stuck in research for years and are just now beginning to surface commercially. Collectively, they point to a foundational shift in designs, offering orders of magnitude improvements in processing speeds at the same or lower power.

This shift has opened the door to a number of different options, each with its own set of challenges. For example, just having more and higher-performing processors or less data movement doesn’t help if the power delivery network cannot supply enough power to all the processing elements at the same time. On top of that, system resource needs can vary by application. In some cases, data may need to be retained, stored, and remain accessible for a decade or more, while in other applications some or all of it may be discarded almost immediately.

At the base level, the changes roughly fall into three broad areas:

  • Faster processing, with improvements dependent on the number and type of processors and accelerators.
  • Faster data throughput, both on-chip and off-chip.
  • Improved energy efficiency, which depends on how data is structured, how many different types of data are employed, and how the processing of that data is prioritized. This becomes particularly important as more data is generated at the edge, where many systems rely on battery power.

In all designs, data needs to flow into a chip or system. From there, it must be routed appropriately, processed as needed for a particular application, and either sent along for further processing, stored, or trashed. The closer to the data source that all of this can happen, the lower the latency and the overall energy consumption. That may sound simple enough conceptually, but the implementation can get extremely complex very quickly. It depends on the specific use case, various dependencies on-chip or off-chip, time-to-market design constraints, as well as budgets for design and manufacturing costs, system power, and the ability of a system to dissipate heat without affecting other components.

“It requires an entire ecosystem,” said Peter Greenhalgh, fellow at Arm. “You want to move data around, process that data, you want it to be secure, and you want it to be handled through the right kind of memory management. You ideally want to leverage as many standard software environments as you can, so you probably want virtualization to allow it to be handled by different compute cores. And you want to be able to debug it when you bring up software, manage performance — and have some of that available to customers, whereas you might want something deeper for yourself. So you’re doing heterogeneous compute from larger computing variants, like video acceleration and machine learning. And to manipulate, move and operate on that data, you need much more hardware underneath. That volume of hardware raises the bar so that when you build something, you can accelerate that data.”

Many of the initial data-driven architectures have been developed for data centers and high-volume devices, such as smart phones, where NRE costs can be more easily justified. But the concepts are beginning to spread beyond just the largest companies. Increasingly, chipmakers are customizing systems to be able to handle different types of data more effectively. RISC-V has gained in popularity because its open instruction set can be modified and extended for specific purposes, but it’s certainly not the only option. Most other processors now come with a range of customization options, and there is a related push to add some level of programmable logic into many of these devices.

“We promote as much flexibility as possible, and we’ve always had the ability to bolt on a customer’s accelerator,” said Rich Collins, product marketing manager for IP sub-systems at Synopsys. “But now we’re seeing more and more customers taking advantage of that. AI is the big buzzword, and now you can bolt a neural network engine onto a standard processor.”

That helps to process specific types of data faster, but it’s just one piece of the puzzle. Software-defined architectures were a first step, helping to customize designs for specific applications and end markets. The next big shift is happening at a higher level of abstraction, pinpointing where data is processed, which hardware is used for different data types, and how that data should be moved, stored and prioritized.

This requires rethinking what’s actually happening in a system, and what the ramifications are of choosing different options. Processing some data in-memory or near-memory reduces the distance that data needs to travel, which in turn reduces the amount of energy required to move that data.
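
As a rough illustration of why distance matters, the minimal sketch below estimates the energy cost of moving the same 100MB over different paths. The picojoule-per-bit figures are illustrative assumptions, not measurements of any specific design, but they reflect the commonly cited gap of roughly two orders of magnitude between a local SRAM access and an off-chip DRAM transfer.

```python
# Back-of-the-envelope energy cost of moving 100 MB over different paths.
# The pJ/bit figures are illustrative assumptions, not measured values.
ENERGY_PJ_PER_BIT = {
    "local SRAM / near-memory": 1.0,    # assumed: short on-chip wires
    "across the die (NoC hop)": 5.0,    # assumed: longer on-chip routing
    "off-chip DRAM access": 100.0,      # assumed: PHY + DRAM I/O
}

def movement_energy_mj(megabytes, pj_per_bit):
    """Energy in millijoules to move `megabytes` of data at `pj_per_bit`."""
    bits = megabytes * 8e6
    return bits * pj_per_bit * 1e-9  # pJ -> mJ

for path, cost in ENERGY_PJ_PER_BIT.items():
    print(f"{path:26s}: {movement_energy_mj(100, cost):7.1f} mJ per 100 MB")
```

Whatever the exact numbers for a given process and package, the ratio is what drives the architectural decision to keep processing close to where the data lives.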

But it also begins to blur the line between memory and processor, which has been in place since the early days of computing. In addition, this adds new concerns about how to prioritize where processing gets done, and how quickly it needs to happen. Not everything demands the fastest processing. For example, a backup camera in a vehicle needs to take priority, while a music selection using that same infotainment system can wait.
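
That kind of prioritization can be illustrated with a plain software priority queue, as in the sketch below. The task names and priority values are invented for this example, and a real system would enforce this with hardware QoS and real-time scheduling rather than application-level code.

```python
import heapq

# Minimal sketch of priority-based handling of a shared compute resource.
# Task names and priority values are illustrative assumptions.
PRIORITY = {"backup_camera_frame": 0, "navigation_update": 1, "music_selection": 2}

class TaskQueue:
    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker keeps equal-priority tasks in FIFO order

    def submit(self, task):
        heapq.heappush(self._heap, (PRIORITY[task], self._seq, task))
        self._seq += 1

    def next_task(self):
        return heapq.heappop(self._heap)[2]

q = TaskQueue()
q.submit("music_selection")
q.submit("backup_camera_frame")  # arrives later, but is served first
print(q.next_task())  # -> backup_camera_frame
print(q.next_task())  # -> music_selection
```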

“If you have new architectures where there is no longer a distinction between memory and computing, and if you have something like neural networks, then we will need a different way to describe those systems,” said Roland Jancke, department head for design methodology at Fraunhofer IIS’ Engineering of Adaptive Systems Division. “While there won’t be a need to change the mission profile format, you will see continuous improvement in the structure or description of systems.”

This is readily apparent with Apple’s current compute architecture. “One of the most surprising things that happened a few years ago was when Apple became the first company to come out with a 64-bit architecture for an application processor,” said Joe Sawicki, executive vice president for IC EDA at Siemens EDA. “Previous to that, every time you thought about 64-bit, it was an address space issue about being able to manage bigger sets of data, bringing in bigger pieces of software. But Apple didn’t do it for that reason. They did it because it let them be more power-efficient. What that’s really talking about is an aspect of designing the silicon to the software stack lying on top of it.”

This shifts the focus from validating a design to spec toward validating the end-user application. “It’s about really taking a look at what the end user application is,” Sawicki said. “That end user application may go beyond just simple data processing. It may involve being interfaced to the outside world, and it’s changing both design and validation such that it has to span out and increasingly handle those aspects of validating an end user software stack operating in real world, which is way more data processing on the design side of things, way more invested in end user experience, and far more holistic about how you optimize for design.”

Data throughput
To make this all work, data has to move intelligently, quickly, and securely.

“We still have to figure out how to move data around correctly,” said Frank Schirrmeister, senior group director, solutions and ecosystem, at Cadence. “How fast can it move, and where do you put it? Do you have it in the cache? Does it have to go over a chiplet border — or worse, a package border? And how impactful is all of that? This is data and compute co-design, and it’s an optimization criterion. This is why edge computing exists. If we could instantaneously get all of this data to a data center, then we could put all of the data centers in Antarctica and do all the computing there. Instead, we have to make careful decisions about where to compute the data. Do you do it on the sensor, on a device, or on the far edge? They all have different latencies and power and compute requirements. So you have to be very disciplined about how you design your application even beyond the data center.”

Design still requires the standard checks done in complex chips, of course, such as making sure wires are thick enough to minimize resistance so they don’t generate too much heat or emit enough electromagnetic radiation to disrupt other signals. And the more heterogeneous and complex the design, the greater the challenges in getting everything to work correctly.

“Our business is analysis, and a lot of it is signal integrity, electromagnetic interference and power data,” said Marc Swinnen, director of product marketing at Ansys. “A slow, short distance wire doesn’t require too much analysis. A simple RC extractor will do. But when you’re running that same high-speed bus across an interposer layer 4 or 5 centimeters away, and you’re trying to squeeze high-speed SerDes on there, the analysis part becomes much more critical, and the interference modalities also increase. So things you didn’t have to worry about before, particularly electromagnetic interference, become more critical, and the analysis ramps up.”

In addition, all of this needs to be set in the context of priorities and dependencies, which often includes routing data on-chip and off-chip, and even on-premises and off-premises. That, in turn, requires significantly more flexibility in the routing than in the past.

“There are dynamic routing opportunities at runtime,” said K. Charles Janac, chairman and CEO of Arteris IP. “We’ve always resisted runtime dynamic routing because there are issues with verification. If you have billions of transactions, the verification is much simpler if you’re forcing the traffic to go onto a single connection every time. But there are opportunities for easing that in the future and have the NoC essentially be able to reroute traffic dynamically based on some sort of routing controller, which in turn is controlled by some global software.”
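
Conceptually, the software-controlled rerouting Janac describes can be sketched as a routing controller that keeps a flow on its fixed, verified path and only switches to an alternate path when global monitoring software reports congestion. The topology, flow names, and threshold below are illustrative assumptions, not a description of any vendor’s NoC.

```python
# Illustrative sketch of software-controlled rerouting in a NoC: a flow
# stays on its default path unless monitoring software reports congestion.
# Topology, flow names, and the threshold are assumptions for illustration.
ROUTES = {
    "cpu_to_dram": (
        ["cpu", "r0", "r1", "mem_ctrl"],   # default (statically verified) path
        ["cpu", "r0", "r2", "mem_ctrl"],   # alternate path
    ),
}

class RoutingController:
    def __init__(self, congestion_threshold=0.8):
        self.threshold = congestion_threshold
        self.link_utilization = {}  # updated by global monitoring software

    def select_path(self, flow):
        default, alternate = ROUTES[flow]
        if any(self.link_utilization.get(hop, 0.0) > self.threshold for hop in default):
            return alternate
        return default

ctrl = RoutingController()
ctrl.link_utilization["r1"] = 0.95       # congestion reported on router r1
print(ctrl.select_path("cpu_to_dram"))   # -> route through r2 instead of r1
```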

That requires a level of system intelligence, which is beginning to show up across the board in everything from interconnects to network interface cards and various memory offerings. Alongside all of this, chip and system architectures are undergoing significant changes.

“How to structure the architecture to optimize for data movement should be studied from an application/use-case perspective,” said Kamesh Medepalli, vice president of technology, innovation and systems at Infineon Technologies Americas. “For applications such as local sensor processing, it would be efficient to not use much storage at all and process the samples as they come. For applications such as wireless networking in IoT, there are going to be certain memory requirements inherently dictated by the TCP congestion control protocol to achieve maximum throughput performance. Finally, the performance vs. power tradeoff for these applications needs to be considered, as well, in determining the optimal architecture.”
 
AI plays a key role in many of these designs, increasingly made possible by the shrinking footprint of inference algorithms. Unlike in the past, when many systems required gigabytes of data for inferencing, the current thinking is that more targeted inferencing can be done using much less data and much closer to the source of that data. That, in turn, significantly reduces the amount of energy required to process it.
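
A quick way to see why this works is to compare the weight footprint of a large server-class model with that of a small, quantized edge model. The parameter counts and byte widths in the sketch below are illustrative assumptions, but they show how a targeted model can fit in on-chip memory next to the sensor while a large model cannot.

```python
# Weight footprint of a large server-class model vs. a small quantized edge
# model. Parameter counts and byte widths are illustrative assumptions.
def weight_footprint_mb(params_millions, bytes_per_weight):
    return params_millions * 1e6 * bytes_per_weight / 1e6

scenarios = [
    ("server-class vision model, FP32 weights", 100.0, 4),   # assumed 100M params
    ("edge keyword-spotting model, INT8 weights", 0.25, 1),  # assumed 250k params
]

for name, params, width in scenarios:
    print(f"{name:42s}: {weight_footprint_mb(params, width):8.2f} MB")
```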

“Local inference not only intelligently processes the data locally, it eliminates the need for additional memory and battery drain of transferring data to/from the cloud,” Medepalli said. “Depending upon the application and algorithms used, AI at the edge also can achieve good power/performance tradeoffs using reduced SRAM footprint and off-chip memory, reducing leakage and product cost. Advances in analog AI are reducing the need for data conversion/storage. Similarly, neuromorphic computing is allowing high-performance AI at low power with in-memory compute. These techniques, in combination with on-chip secure, high-performance non-volatile memory, are helping provide advanced architectural options to improve performance and power while being cost-effective for a wide range of IoT applications.”

On-chip, off-chip, in-package
Moving data efficiently and quickly has been a major concern for hardware design teams since the introduction of computing. The von Neumann architecture, coupled with continued process shrinks in keeping with Moore’s Law and a broad collection of new materials, has enabled continuous improvements in both performance and power reduction. But scaling by itself no longer offers sufficient improvements in power and performance, and both are critical as the amount of data generated by sensors everywhere continues to skyrocket.

This has led to one of the most important architectural shifts in chip design, based on the recognition that moving data has an associated cost. Processing all data in the cloud and sending it back to end devices can lower the design cost and bill of materials for those devices, but it requires massive bandwidth and power to drive the signals, and it adds latency. This is true even with off-chip memory, and design teams have been wrestling with what is an acceptable amount of latency for different functions and applications.

“One of the challenges is that anytime you’re moving data and communicating with external chips, it does take a lot of power,” said Steven Woo, fellow and distinguished inventor at Rambus. “DDR5 is the industry’s next technology for main memory, which is more power-efficient. It delivers more bandwidth, and it is compatible and very similar to the type of infrastructure we already have. In a lot of ways, that’s music to the industry’s ears, because it does check all those boxes of being a great transition plan while also addressing the concerns of performance and power.”


Fig. 1: Cost of moving data with HBM2, with PHY and DRAM at 2Gbps streaming workload and power breakdown for 100% reads or 100% writes. Source: Rambus

However, not all technologies evolve at the same pace. This is a key reason why standards are so important. They help to smooth over those differences and provide backward compatibility when a new version of a particular technology is released. But with so many pieces in motion, and the lines blurring across different approaches, it remains to be seen how well those standards efforts will fare in the future.

Getting more from less
Amid the swirl of new technology and approaches, some old approaches are being re-examined in a new context. Consider compression, for example, which used to be viewed as the best way of moving large amounts of data. Now, with better throughput, this needs to be weighed against the power required for compression/decompression.

Ashraf Takla, CEO of Mixel, highlighted some of the challenges. “From a system perspective, do you use four lanes at high frequency without compression? Then, after the compression, you need less bandwidth, so what do you do? Do you reduce the speed, or do you, for example, reduce the number of lanes? Typically, the latter is a better solution. Instead of running at lower speed, you run at the full speed, but you reduce the number of lanes, because that not only saves power but it also saves pins.”
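
The tradeoff Takla describes reduces to simple arithmetic: the number of lanes needed is the compressed payload rate divided by the per-lane rate. The payload rate, per-lane rate, and 3:1 compression ratio in the sketch below are illustrative assumptions, but they show how keeping full lane speed while dropping lanes saves both power and pins.

```python
import math

# Lanes needed for a given payload: keep full lane speed, cut the lane count.
# The payload rate, per-lane rate, and 3:1 ratio are illustrative assumptions.
def lanes_needed(payload_gbps, lane_rate_gbps, compression_ratio):
    return math.ceil((payload_gbps / compression_ratio) / lane_rate_gbps)

payload = 20.0    # Gbps of uncompressed video (assumed)
lane_rate = 6.0   # Gbps per lane at full speed (assumed)

print("no compression :", lanes_needed(payload, lane_rate, 1.0), "lanes")  # 4 lanes
print("3:1 compression:", lanes_needed(payload, lane_rate, 3.0), "lanes")  # 2 lanes
```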

The key variable here is data movement. “For the controller, there are pin count constraints, but users want speed,” said Joe Rodriguez, product marketing manager at Rambus. “So we try to get as much out of the PHY as needed. Then, when it comes to the controller for display technology, we make sure that the display stream compression engine is not only getting the packet information, but also knows what frames are coming their way. We have an optional video interface that we use when we do a hard DSC integration, and it really knows the visibility to the file is non-existent there. So in terms of improving the throughput on the back end, that video interface is a huge benefit to the ease of integration and validation.”

Increasing transistor density only adds to the complexity. Signals at 5nm and 3nm are bombarded with physical effects such as EMI and various types of noise, and they need to be planned against thermal gradients that can vary from one side of a chip to another. Offloading some of the data processing to other chips and systems can help limit those effects on a single die, but demands for faster data processing and data movement are raising challenges everywhere.

“Bandwidth is a big issue in display technology,” said Alain Legault, vice president of IP products at Hardent. “Display technology has four dimensions. It’s got the X and Y, plus pixel depth and time, and we’ve been expanding on all of that. Display resolution is getting higher. People want to go from 8- to 10-bit video now, and they want twice the frame rate they used to have. With 16-bit becoming quite common, and mobile applications aiming at 120 frames per second, bandwidth has been going through the roof. Engineering teams have been looking for ways to manage that bandwidth. Visually lossless compression is a really good way to do this. With standardized DSC compression, we’ve been able basically to compress the video by 3X while having no visual impact on the picture quality, with what we call visually lossless compression. So effectively we managed to reduce the bandwidth by three times with a clever mix of algorithms that address different aspects of the image.”
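
The bandwidth math behind that quote is straightforward: raw display bandwidth scales with resolution, bits per pixel, and frame rate. The configurations in the sketch below are illustrative, with the 3:1 figure matching the visually lossless compression ratio Legault cites.

```python
# Raw display bandwidth = width * height * bits-per-pixel * frame rate.
# The configurations are illustrative; 3:1 matches the visually lossless
# DSC compression ratio cited above.
def raw_bandwidth_gbps(width, height, bits_per_pixel, fps):
    return width * height * bits_per_pixel * fps / 1e9

cases = [
    ("1080p, 8-bit RGB (24 bpp), 60 fps", 1920, 1080, 24, 60),
    ("4K, 10-bit RGB (30 bpp), 120 fps", 3840, 2160, 30, 120),
]

for name, w, h, bpp, fps in cases:
    raw = raw_bandwidth_gbps(w, h, bpp, fps)
    print(f"{name:36s}: {raw:5.1f} Gbps raw, {raw / 3:5.1f} Gbps with 3:1 DSC")
```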

Latency in moving and processing data has some obvious impacts when it comes to display technology. With augmented and virtual reality, latency can make the user nauseous. And in automotive applications, latency can cause an accident.

Conclusion
In the past, much of the semiconductor industry’s focus has been on improving performance and reducing power through shrinking features, but as the benefits of Moore’s Law continue to dwindle, the emphasis has shifted to architectural improvements. Now, with more data being generated at the end points and the edge, the focus is shifting again to how to process that growing volume of data with minimal latency and the lowest amount of energy.

Design teams are now wrestling with the best ways to process data with a minimum amount of movement, and how to partition designs so that the most important data is processed first and fastest. The challenges are non-trivial, but the benefits of intelligent and pervasive computing are enormous, and that approach is expected to continue expanding into new markets and applications for the foreseeable future.
