Design teams rethink the movement of data on-chip, off-chip, and between chips in a package.
Rapid increases in machine-generated data are fueling demand for higher-performance multi-core computing, forcing design teams to rethink the movement of data on-chip, off-chip, and between chips in a package.
In the past, this was largely handled by the on-chip interconnects, which often were a secondary consideration in the design. But with the rising volumes of data in markets ranging from autonomous driving to AI, SoCs and systems-in-package are being architected around the movement of data. As a result, interconnects are now a central component in this scheme, and they are becoming more complex, more expensive, and much more critical to the success of the overall design.
The interconnect within chips, often referred to as a network on chip (NoC), and the interconnect between chips have both evolved dramatically as older approaches run out of steam.
“There are more and more advanced 2.5D and 3D packaging options becoming available with high-bandwidth, low-power die-to-die interfaces that offer similar characteristics to on-chip interconnects,” said Jeff Defilippi, director of product management for Arm's Infrastructure line of business. “Now, high-performance interconnect designs need to be optimized across multiple chips without adversely impacting performance or blowing up the system power budget. With the added complexity, the traditional quality-of-service techniques are no longer enough. Adding intelligence and software hooks for dynamic cache, bandwidth, and memory management is crucial for providing efficient, predictable workload performance.”
The interconnect between chips essentially has been viewed as an extension of on-chip interconnects (often referred to as networks on chip), with the digital controller parts of the interconnect between two chips closely coupled to each on-chip interconnect.
“The digital controller is what transfers the digital packets to an underlying PHY or SERDES physical layer (or even a bunch of bare wires/pins) on a chip for receipt by the other die’s/chip’s physical layer, and then digital controller,” explained Kurt Shuler, vice president of marketing at Arteris IP. “It also can provide flow control and error correction capabilities.”
Fig. 1: D2D digital controller is part of on-chip NoC interconnect. Source: Arteris IP
The goal is to keep those interconnects as short as possible to minimize the movement of data, and thereby maximize energy efficiency and performance. This is evident with off-chip memory, such as HBM, where the distances that signals need to travel have been steadily shrinking as packaging technologies improve.
“These chips are right next to the processors, and there is starting to be movement in the area of stacking in 3D,” said Steven Woo, fellow and distinguished inventor at Rambus. “Instead of chips being next to each other, and having millimeter interconnects, now they’re stacking chips on top of each other and having one to two orders of magnitude better interconnect length.”
But there also are an enormous number of architectural choices, and they are multiplying as more chips are added into packages. Those options have a big impact on what data has to be moved, where it has to be moved to, and what the optimal plumbing is for moving that data.
“2.5D is just as expensive as going up to the next chip layer above you,” said Marc Swinnen, director of product marketing at Ansys. “If you had to travel more than 10 microns across the chip, you might as well go up, because it costs the same to travel to the chip above it. So it doesn’t have to be a long-distance wire before you justify going up and down. The technology is approaching that, and it is becoming increasingly fine-grained.”
As of today, methodologies and tools don’t exist to do this fine-grained partitioning. “How are the tools going to separate paths and wires across chips at that level?” Swinnen said. “In practice, we still partition into blocks, and the entire functional block is on one chip or the other chip. Then there are wires across, and it’s not clear how you would do this fine-grained atomic dispersion across chips.”
But tools do exist for the interconnects themselves. “Right now the industry has a number of tools for different jobs,” said Rambus’ Woo. “For very long haul, we have optics. For moderate distances, we have things like Ethernet. And within a server, we have things like PCI Express. The industry has developed a number of tools in that toolbox, and distance is a primary determinant of which tool you use.”
The big tradeoffs with any data movement are the power it takes to move data, as well as the number of electrical and physical effects that need to be addressed when that data is moved.
“When we talk about the interconnect, there are interesting ramifications of optical versus electrical interfaces, and the latency and power implications of those, as well as maintainability,” said Scott Durrant, product marketing manager at Synopsys. “The decisions that are made regarding the type of interface to use definitely do come into play in terms of all of those factors, and where to start with this, not to mention the economics of it.”
How far data needs to travel is an important consideration, and it can have a substantial impact on both power and performance. “If it’s within a rack, there is a different set of criteria to focus on than if it’s between data centers or across the country,” Durrant said. “All of these things are important factors in deciding what interfaces to use at the chip level, and therefore what gets driven beyond that.”
Many of these considerations and tradeoffs depend upon the application, and they can vary widely. That, in turn, affects the relationship between the interconnects, including the IP used for those interconnects, and the interfaces. “Where possible or practical, designers try to make the interconnects modular so they can change them out if necessary,” Durrant said. “While that’s nice from a flexibility perspective, it doesn’t necessarily optimize for power or latency or some other criteria, even though it gives the end user the flexibility of changing the interface based on their needs for a particular deployment.”
Saving power
There are other knobs to turn to save power, as well. Chip-to-chip communications, for example, typically maximize the per-lane data rate and reduce the number of lanes.
“For instance, if you have a chip-to-chip communication, and the media and the process for both chips allow you to run at a higher data rate, then I will design the system to run at that maximum data rate,” said Ashraf Takla, CEO of Mixel. “In doing so, you are reducing the number of lanes. And, of course, when you run at a high data rate, you consume less power. But you’re also reducing the number of lanes. Usually that results in saving total power of the system. That’s more of a system issue, obviously, but it does affect the IP.”
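To make that lane-count tradeoff concrete, here is a back-of-the-envelope sketch comparing lane counts and link power for a fixed aggregate die-to-die bandwidth. All of the numbers (the 128 Gb/s target, the per-lane rates, the static power and energy-per-bit figures) are hypothetical placeholders, not Mixel data; whether fewer, faster lanes actually save power depends on the real PHY characteristics.

```python
# Back-of-the-envelope sketch: fewer, faster lanes for a fixed aggregate
# chip-to-chip bandwidth. All figures below are illustrative placeholders,
# not measured values from any vendor.
import math

TOTAL_BW_GBPS = 128          # assumed aggregate die-to-die bandwidth
LANE_STATIC_MW = 10          # assumed fixed overhead per lane (clocking, bias, etc.)

def lanes_needed(total_gbps, per_lane_gbps):
    """Lanes required to carry total_gbps at a given per-lane rate."""
    return math.ceil(total_gbps / per_lane_gbps)

def link_power_mw(total_gbps, per_lane_gbps, pj_per_bit):
    """Rough link power: per-lane overhead plus energy per bit times bit rate."""
    lanes = lanes_needed(total_gbps, per_lane_gbps)
    return lanes * LANE_STATIC_MW + total_gbps * pj_per_bit   # Gb/s * pJ/bit == mW

# Hypothetical operating points: energy per bit creeps up at higher rates,
# but the lane count (and its fixed overhead) drops faster.
for rate, pj_bit in [(8, 1.0), (16, 1.1), (32, 1.2)]:
    lanes = lanes_needed(TOTAL_BW_GBPS, rate)
    power = link_power_mw(TOTAL_BW_GBPS, rate, pj_bit)
    print(f"{rate:>2} Gb/s per lane -> {lanes:>2} lanes, ~{power:.0f} mW")
```

Under these assumed numbers, the fastest, narrowest configuration comes out lowest in total power, which mirrors Takla's point that running fewer lanes at a higher rate usually saves power at the system level.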
Inside the chip there are more options to save power. “With the on-chip interconnect, the main question is how to connect different blocks inside the chip so the data gets passed between these different blocks in an efficient manner,” said Tom Wong, director of marketing for Cadence's IP Group. “When chips were less complicated decades ago, this was a non-issue because a simple silicon bus was sufficient. As devices gradually increased in complexity, so did the operation of the bus.”
This is evident with a microcontroller, which contains a CPU, EEPROM, simple communication interfaces like UART, as well as a watchdog timer and other elements to perform basic functions.
“How communication takes place inside a microcontroller has been established for 30 years,” Wong said. “It’s what we call a silicon bus. You have a motherboard, which is a PCB. You put the CPU on it, you put on the discrete chips, and these devices are connected to each other through copper traces on a PCB. When microcontrollers got complicated, they replicated the original structure, and put a bus inside the chip. Each module will have either an address or a register. You flip the bit to 1, you get the bus. You flip the bit to 0, you don’t care about the data, because someone else wants the bus. The PCB bus was replicated onto silicon to make a connection. This happened in the very early days of system on a chip. The chip was no longer dedicated. There was a system inside the chip, and it communicated between blocks, on a PCB, or in a module. That worked for a few decades.”
With the advent of application processors for smartphones, companies like Qualcomm, Nvidia, and Samsung, among others, started building complicated SoCs that could do much more than a simple MCU. “It was no longer a microcontroller with one Arm core, or even a proprietary microcontroller core, along with nonvolatile memory and a few I/Os,” he said. “They started adding more things in the edge processor. It needed to have a little module that talked to the touchscreen. It needed a microphone, a speaker. It needed to do some voice processing. And don’t forget the camera. Now the SoC had to take the pixels from a camera, process the data for image and video, then send it somewhere else to make a JPEG or MPEG file. So it was video processing and audio processing inside the SoC.”
This also meant the simple silicon bus with a register didn’t work anymore. “The old fashioned controller meant that some computation was done, the register was set, it talked to this block, data was transmitted off-chip to somewhere else, and it was done. But when you are doing audio processing, video processing, a modem, and WiFi, these functions are not one direction. A packet is processed, it’s sent back to the system, the system does something, and sends it to another block. This means a simple silicon bus doesn’t work anymore. Interconnect fabrics are essentially smart buses. Not only is it communicating by sending electrons back and forth, it actually has processing power to process the packet, to know where the packets come from, and where they need to go,” he said.
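Wong's distinction between a dumb bus and a smart fabric can be illustrated with a toy routing model. The sketch below is our own illustration, far simpler than any production NoC: it carries a packet with explicit source and destination coordinates across a small mesh using dimension-ordered routing, which is the kind of per-packet awareness a plain shared bus does not have.

```python
# Toy sketch of packet-switched routing on a small 2D mesh NoC.
# Illustrative only; real fabrics add virtual channels, flow control,
# arbitration, and quality-of-service on top of this basic idea.
from dataclasses import dataclass

@dataclass
class Packet:
    src: tuple      # (x, y) coordinates of the sending block
    dst: tuple      # (x, y) coordinates of the receiving block
    payload: bytes

def xy_route(packet):
    """Return the list of mesh hops using dimension-ordered (X-then-Y) routing."""
    x, y = packet.src
    dx, dy = packet.dst
    hops = [(x, y)]
    while x != dx:               # travel along X first
        x += 1 if dx > x else -1
        hops.append((x, y))
    while y != dy:               # then along Y
        y += 1 if dy > y else -1
        hops.append((x, y))
    return hops

# Example: a block at (0, 0) sends a result to an accelerator at (3, 2).
pkt = Packet(src=(0, 0), dst=(3, 2), payload=b"partial result")
print(xy_route(pkt))   # [(0,0), (1,0), (2,0), (3,0), (3,1), (3,2)]
```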
Early interconnects facilitated the development of these application processors, and processors from every manufacturer have adopted this architecture, moving from SoCs to heterogeneous SoCs. Introducing the block into a piece of silicon to facilitate communication took the silicon bus to the next level. This approach worked for a little over a decade until several years ago, when the internet giants started doing AI/ML and wanted to build their own chips.
“Instead of having an Intel processor talk to a Xilinx FPGA or some custom ASIC, and connecting these chips on a motherboard, they said they would have to build their own chips because they couldn’t afford to put all of what they needed on the motherboard,” Wong said. “Instead of a single CPU or dual CPU, now there were 4, 16, 128 cores — all of which needed to talk to each other because the workload was being partitioned to the appropriate core to minimize power and to maximize performance, such as with Arm’s big.LITTLE architecture. This is where something is processed in core number one, and the result needs to go to core number two, number three, number four, depending on the workload. Now, all of a sudden, instead of sending minimal data between the modules inside an SoC, you’re sending a lot of traffic — constantly. This is why all the bottlenecks showed up, and a simple network on chip wouldn’t cut it anymore. At the same time, the application processors were running at maybe a gigahertz. But ML/AI chips don’t run at a gigahertz. They run at a much higher speed. You get more data, more traffic, and the benefit of Moore’s Law. Everything needs to run faster. However, Moore’s Law does a lot of bad things when you try to run fast, because the signal level is shrunk. The signal-to-noise ratio goes bad, and a lot of other problems show up. The original concept of a NoC has run out of juice because we didn’t anticipate people putting them in a data center for servers, running machine learning algorithms, running facial recognition, or doing autonomous driving prediction. Now we have a bottleneck inside the chip.”
Solutions
Looking at the network on chip/on-chip bus is a good place to start, because one of the biggest challenges for on-chip interconnects today is scalability.
“As chips become extremely large, the interconnect is touching all of the IP blocks in the chip,” said Benoit de Lescure, CTO at Arteris IP. “In this way, the interconnect is growing like the chip. Other components are not. A PCI controller will stay a PCI controller, but the interconnect size grows along with the size of the chip, so there are scalability issues, especially because designing a good interconnect requires an understanding of how it will be implemented physically. How will it connect all those components on the chip? What amount of free space on the die will be left for the interconnect to use? What switch topology are you going to implement so that the physical aspects are easier later on? As the size of the problem grows bigger, it becomes significantly more difficult to come up with good interconnect decisions.”
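A rough structural count shows why the interconnect, unlike a fixed-function block, scales with the chip. The sketch below is our own illustration (not Arteris data) of how switch and link counts grow for a simple square mesh as more IP blocks are attached.

```python
# Rough illustration of interconnect scaling: switch and link counts for a
# square mesh NoC as the number of connected IP blocks grows. The numbers are
# purely structural; they say nothing about wire length, area, or congestion.
import math

def mesh_resources(num_blocks):
    """Switches and inter-switch links for a near-square mesh, one block per switch."""
    cols = math.ceil(math.sqrt(num_blocks))
    rows = math.ceil(num_blocks / cols)
    switches = rows * cols
    links = rows * (cols - 1) + cols * (rows - 1)   # horizontal + vertical links
    return switches, links

for n in (16, 64, 256, 1024):
    s, l = mesh_resources(n)
    print(f"{n:>4} IP blocks -> {s:>4} switches, {l:>4} links")
```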
De Lescure noted that, increasingly, engineering teams are advised not to start the interconnect design process before they have a reasonable understanding of the floorplan, at least at a very high level. “Otherwise, the decisions they make will likely have to be redone later on. Previously, there were two approaches. One was divide and conquer, which meant when the problem was too big, it was sliced into smaller problems to make it easier to solve. Also, hierarchies would be built into the design, and sub-interconnects would be created for sub-systems. These were connected to the backbone interconnect so that everything would stay within a reasonable size. That’s one way of doing it. Another way of doing it was pen and paper, sitting down with the customer, looking at the floorplan, and trying to come up with a reasonable organization of the switches so that every connection between switches stays within some reasonable distance, and these kinds of things.”
Thanks to attention by university researchers, and the fact that every major company in the hyperscale computing space needs a better way to deal with data in and between chips, there’s a lot of homegrown research happening in this space. The semiconductor ecosystem recognizes today there must be movement to a super-smart NoC, workloads must be considered thoroughly, and data movement has to be worked out. Out of this industry work came Compute Express Link (CXL).
CXL started its life within Intel about four years ago as part of a joint development project with Arm on CCIX. CCIX became 1.0, then 1.1, and then became CXL 1.0, which has been adopted by a lot of different industry consortia, including the PCI SIG, in conjunction with PCIe 5.0 and PCIe 6.0, to manage traffic. The CXL 2.0 Specification Evaluation Copy is now available for download.
The memory industry has taken note of CXL, as well, and DRAM will be sporting a CXL interface. “It’s not a traditional DRAM I/O anymore, and these I/Os and DRAM are going to be smart,” Wong said. “CXL is everywhere to manage the bottlenecks around traffic across different processors, because every block in the modern chip is a processor. It’s not just the CPU core. The DSP core is a processor. The graphics core is a processor. A modem is a processor. Everything is a processor. When you have data being moved around, being computed by different blocks, all the answers have to go back to somewhere for the final computation. How do you do that with a dumb bus? The way to fix that is with caches. Every CPU has cache memory. CXL helps maintain cache coherency so that when the data is needed from the next block, CXL calls the next block, and the data gets served up at the right time. That’s what cache coherency does, and it’s built into the CXL spec. This is why DRAM suppliers are supporting CXL. They want their memory, even external, to be cache coherent with the computation.”
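The coherency behavior Wong describes can be sketched with a deliberately simplified directory model: when one agent writes a cache line, every other cached copy is invalidated so the next reader gets the new value. This is a generic illustration of the idea, not the actual CXL.cache/CXL.mem protocol, and the agent names are made up.

```python
# Toy directory-based coherence: when one agent writes a line, every other
# cached copy is invalidated so the next reader sees the new value.
# Greatly simplified; real protocols (MESI/MOESI, CXL.cache) track far more state.

class Agent:
    def __init__(self, name):
        self.name = name
        self.cache = {}          # address -> locally cached value

class Directory:
    def __init__(self):
        self.memory = {}         # address -> current value
        self.sharers = {}        # address -> set of agents caching the line

    def read(self, agent, addr):
        self.sharers.setdefault(addr, set()).add(agent)
        agent.cache[addr] = self.memory.get(addr, 0)
        return agent.cache[addr]

    def write(self, agent, addr, value):
        # Invalidate every other copy before accepting the new value.
        for other in self.sharers.get(addr, set()) - {agent}:
            other.cache.pop(addr, None)
        self.sharers[addr] = {agent}
        self.memory[addr] = value
        agent.cache[addr] = value

directory = Directory()
cpu, dsp = Agent("cpu"), Agent("dsp")
directory.read(cpu, 0x100)           # CPU caches the line
directory.write(dsp, 0x100, 42)      # DSP writes; CPU's stale copy is invalidated
print(0x100 in cpu.cache)            # False: CPU must re-read the line
print(directory.read(cpu, 0x100))    # 42
```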
Rambus’ Woo noted that CXL is being developed primarily because the other tools that exist don’t exactly fit the needs people have. “There’s this physical interconnect based on PCI Express today, but there’s a logical part of it that helps serve a need that the other kinds of interconnects don’t serve yet. From an interconnect design standpoint, there are some really interesting activities happening. People are looking at chiplets, for example, and the various ways you can do things there. AMD has done a great job with chiplets, showing how to actually use those in the design of a system.”
Conclusion
With most applications today, the central focus is data and data movement. The interconnect plays a key role in those designs — how to move the data efficiently, quickly, and with the least amount of area.
What used to be a fairly simple, secondary consideration is now a central part of chip design, from initial architecture through manufacturing and test. Without a good interconnect strategy, performance will suffer, power will be wasted, and no chip or system will be competitive in its intended market.
Related
Big Changes In Tiny Interconnects
Below 7nm, get ready for new materials, new structures, and very different properties.
Interconnects In A Domain-Specific World
When and where tradeoffs between efficiency and flexibility make sense.
Different Levels Of Interconnects
Chip Basics: How different layers at the chip level can affect the performance across a system.
Interconnect Challenges Grow, Tools Lag
More data, smaller devices are hitting the limits of current technology. The fix may be expensive.
Choosing Between CCIX And CXL
What’s right for one design may not be right for the next. Here’s why.