More data and smaller devices are hitting the limits of current technology. The fix may be expensive.
Interconnects are becoming much more problematic as devices shrink and the amount of data being moved around a system continues to rise. This limitation has shown up several times in the past, and it’s happening again today.
But when the interconnect becomes an issue, it cannot be solved in the same way that issues are solved for other aspects of a chip. Typically it disrupts how the tools and solutions in place are used, provided they are up to the task in the first place. These problems range from physical challenges to logical and architectural ones.
“Being ‘just a fabric’ that connects functional blocks doing the ‘real work’, interconnects are often neglected when building system architecture,” says Aleksandar Mijatovic, senior design engineer for VTool. “In small systems, interconnect challenges are not something you will be concerned about. The complexity of interconnect is not comparable with the complexity of other blocks.”
So what makes contemporary design interconnects so challenging? The answer can be viewed from various levels, because interconnects are used at every level of a chip and between chips within a package. They span from metal 0 all the way up to the connections between chips and chiplets, and tools need to account for all of them.
“At the physical level, there will always be interesting problems as speeds increase and designs become denser,” says David Choe, senior principal product engineer for the Custom IC & PCB Group at Cadence. “Designs strive for increased speed, lower voltage power rails and faster interfaces. That requires an increasing use of equalization and the same, if not more difficult, margins to meet. Selection of materials to handle manufacturing tolerances and better loss properties are now basic exercises, while new methods such as chiplets and silicon interposers are introducing new variables.”
At a higher level, the interconnect acts as a limiter. “The challenge is very similar for both SoCs and 2.5/3D systems,” says Andy Heinig, group manager for advanced system packaging at Fraunhofer Institute for Integrated Circuits IIS. “In every case, the interconnects limit the overall performance of the system. Currently, and in the near future, there are no technical solutions available to overcome the issue. The only way is to consider the limitation in the design process.”
Interconnects do not scale in the same manner as the rest of the design. “The interconnect poses the biggest SoC integration challenge, simply because the rule of divide and conquer for mastering growing complexity does not apply,” says Tim Kogel, principal applications engineer for Synopsys. “IP sub-systems like CPU, GPU, and accelerators, can all be developed and tested individually. However, when being integrated into an SoC, all sub-systems need to compete for access to the shared memory sub-system. The rise of AI has aggravated the interconnect problem, because artificial neural networks are based on brute force processing of huge data sets. As a result, AI enabled SoCs need to accommodate the bandwidth requirements of AI accelerators without starving other components.”
Problems within a chip
Chips are growing. “Ten years ago, the interconnect would be concerned with about 10K gates,” says Benoit de Lescure, CTO for Arteris IP. “Now they need to interconnect 10M gates on a chip, so there’s been a very significant increase in complexity. The number of clients on the interconnect has increased.”
Complexity is increasing in other ways. “In the past, a single master would speak to one slave at a time,” says VTool’s Mijatovic. “The challenge was simply for that master to access any of the slaves using as few wires as possible. And it was not especially hard since slaves were mutually exclusive. It started to get more complicated with new protocols, with multi-master, multi-slave systems. And now you need to provide many routes through the same interconnect. You cannot connect every master to every slave with an independent traffic system in a brute force way. You would not know what to do with that many wires.”
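Back-of-the-envelope arithmetic shows why. The sketch below assumes an illustrative data-path width and port count; the point is only the scaling, not any particular design:

```python
# Rough sketch: wires needed for a full crossbar vs. a shared NoC-style fabric.
# Bus width and port counts are illustrative assumptions only.

DATA_BITS = 512           # assumed width of one data path
masters, slaves = 32, 64  # assumed number of initiators and targets

# Full crossbar: every master gets a dedicated data path to every slave.
crossbar_wires = masters * slaves * DATA_BITS

# Shared fabric: each client gets one port into the interconnect,
# and routing/arbitration happens inside the fabric.
fabric_wires = (masters + slaves) * DATA_BITS

print(f"full crossbar: {crossbar_wires:,} data wires")  # 1,048,576
print(f"shared fabric: {fabric_wires:,} data wires")    # 49,152
```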
There are physical challenges, as well. “In order to keep your signal losses at a minimum, and minimize electromigration, which impacts quality and reliability, you have to use wider traces,” says Rita Horner, senior product marketing manager for 3DIC at Synopsys. “This defeats the concept of why you want to go to the smaller technology.”
Fig 1: Higher resistance with smaller wire width. Source: Lam Research
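A first-order resistance formula, R = ρL/(W·T), is enough to see the trend in Fig 1. The dimensions and resistivity below are illustrative assumptions, and they ignore barrier layers and surface scattering, which make real nanoscale copper even worse:

```python
# First-order wire resistance: R = rho * L / (W * T).
# Values are illustrative; real nanoscale copper is worse due to
# barrier/liner overhead and surface/grain-boundary scattering.

RHO_CU = 1.7e-8   # ohm*m, bulk copper resistivity (assumed)
LENGTH = 100e-6   # 100 um route (assumed)

def resistance(width_nm, thickness_nm):
    return RHO_CU * LENGTH / (width_nm * 1e-9 * thickness_nm * 1e-9)

for w, t in [(100, 200), (50, 100), (25, 50)]:  # shrinking cross-sections
    print(f"W={w}nm T={t}nm -> R = {resistance(w, t):.0f} ohm")
```

Each halving of the cross-section roughly quadruples the resistance of the same route, which is why designers fall back to wider traces and give up some of the benefit of the smaller node.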
Interconnects have not scaled in the same ways as logic. “Interconnect scaling has been somewhat retarded due to multiple patterning pegged by 193nm,” says Milind Weling, senior vice president of programs and operations at Intermolecular. “This could change with EUV or even increasing use of DSA. At that point, RC delays (and associated k reduction) will re-emerge as priorities. A good tradeoff would be to stick to k=~3.0 for inter-metal (Mx to Mx+1) dielectric and equally prioritize mechanical strength. However, k reduction will likely get pushed more aggressively for intra-metal (between 2 coplanar lines of Mx) dielectric because it is those intra-metal lines and spaces that will shrink faster with patterning advances.”
Having the interconnect keep up is important. “The technology for networking chips was established when 90nm was the bleeding edge,” says Arteris IP’s de Lescure. “It’s been designed with a certain amount of logic that is required for a typical application. They may say, ‘I want to have between 20 and 25 levels of logic between flip flops.’ As the processes are shrinking, people don’t want to redesign everything just to change the amount of logic between flops. There is flexibility in terms of putting pipelines that span long distances. So people keep adding pipelines where necessary, and that’s it. The technology itself doesn’t change. What will change is that frequencies are getting higher.”
None of this is happening fast enough for some chipmakers. “How much data are you moving in one go?” asks Mick Posner, senior director of product marketing for DesignWare IP, High Performance Computing Solutions at Synopsys. “Is it 256, 512, or even perhaps 1,024 data bits? 1,024 data bits is something we see on the horizon. This increases the amount of data being transferred in the same number of clock cycles. So we expect further scaling of the data path.”
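The raw arithmetic behind that scaling is straightforward. Assuming, purely for illustration, a 1GHz fabric clock and one data beat per cycle:

```python
# Peak data-path bandwidth = width * clock rate (one beat per cycle assumed).
# The clock frequency is an illustrative assumption.

CLOCK_HZ = 1e9  # assumed 1 GHz fabric clock

for width_bits in (256, 512, 1024):
    gbytes_per_s = width_bits / 8 * CLOCK_HZ / 1e9
    print(f"{width_bits:4d}-bit data path -> {gbytes_per_s:.0f} GB/s peak")
```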
Existing protocols are not keeping up. “On-chip protocols need to become more complex,” says de Lescure. “Three years ago, we started to hear about new communication patterns required, and one of them was the broadcast or multi-cast. In an AI chip, there will be a need for the master to be able to broadcast, or multicast, for things like weight coefficient to multiple processing elements. There are no simple industry standard protocols to do broadcast. If you have a master that wants to do a broadcast write, there is no semantic in AMBA AXI for that.”
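Without a broadcast semantic, the fallback is to issue the same write once per target, which multiplies fabric traffic. A rough sketch of that cost, with an assumed payload size and processing-element count:

```python
# Cost of emulating broadcast with unicast writes, as a rough sketch.
# Payload size and processing-element count are illustrative assumptions.

WEIGHT_BYTES = 2 * 1024 * 1024  # assumed 2 MB of weight coefficients
PES = 64                        # assumed number of processing elements

unicast_traffic = WEIGHT_BYTES * PES   # same payload sent once per PE
broadcast_traffic = WEIGHT_BYTES       # sent once, fanned out in the fabric

print(f"unicast emulation: {unicast_traffic / 2**20:.0f} MB moved")   # 128 MB
print(f"native broadcast : {broadcast_traffic / 2**20:.0f} MB moved") # 2 MB
```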
Problems within a package
An increasing number of applications are being forced into multi-die solutions. “Devices are getting so large that physically, it is impossible to manufacture them with enough yield to make the device cost-effective,” says Synopsys’ Horner. “They’re encroaching on the reticle sizes of the steppers. When you’re breaking the die into pieces, you have to have some kind of physical interconnect device, a PHY, to be able to make those connections. Some of them could be parallel interfaces, which means they require a lot of traces between the two devices, or it could be a serializer/deserializer (SerDes) over differential pairs, which tends to be much faster. But you have to go through the conversion of parallel to serial, which adds latency. So there are subtle differences as to how you make this die disaggregation partitioning and also still be able to interconnect and communicate between the pieces.”
This has progressed beyond a simple two-die solution. “Traditionally, package design was simple enough that users could plan the interconnect with a spreadsheet and eyeball it to check the connections were properly made,” says John Ferguson, product management director for Mentor, a Siemens Business. “Now, with multiple dies, interposers, bridges and chiplets, that approach becomes impossible. A related problem is the lack of standards. This looks to be getting some new attention from various sources (Intel’s Advanced Interface Bus, OCP’s Open Domain-Specific Architecture). Without these standards, it becomes very difficult to ensure that the various chiplets being connected can actually work together in the same package.”
Physical analysis also becomes more complex. “If you are running long wires, especially outside of the chip, you’re potentially talking about a few inches of trace,” says Horner. “You have moved beyond the realm of modeling in the resistance and capacitance RC concept. You need to include inductance and mutual inductance and all these coefficients to be able to model the parasitics of the traces and interconnect between the devices. And depending on how well you extract those parasitics, that will impact the accuracy of the simulation results.”
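A quick comparison of resistive and inductive impedance shows why RC-only extraction breaks down for package- and board-level traces. The per-unit-length values below are rough, assumed textbook-order numbers rather than extracted parasitics:

```python
# When does inductance matter? Compare R against omega*L for a package/board trace.
# Per-unit-length values are rough, assumed textbook-order numbers.

import math

R_PER_CM = 0.5    # ohm/cm, assumed trace resistance
L_PER_CM = 4e-9   # H/cm, assumed trace inductance
LENGTH_CM = 5.0   # a few inches of trace is roughly 5-10 cm

for freq_ghz in (0.1, 1, 5):
    omega = 2 * math.pi * freq_ghz * 1e9
    r_total = R_PER_CM * LENGTH_CM
    xl_total = omega * L_PER_CM * LENGTH_CM
    print(f"{freq_ghz:>4} GHz: R = {r_total:.1f} ohm, omega*L = {xl_total:.0f} ohm")
```

Even at a few hundred megahertz, the inductive term dwarfs the resistance for a trace of this length, so a model that only captures R and C will not predict the real behavior.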
As you begin to stack dies vertically as well as horizontally, it introduces new levels of interconnects. “It introduces a new level of hierarchy into the netlisting,” says Mentor’s Ferguson. “These require additional design planning and verification steps at each interface level. Compounding this, we also see the introduction of die-to-die coupling that can impact the system performance and behavior.”
The scope is becoming larger. “Traditionally we used to talk about AMBA buses and network-on-chip architectures,” says Synopsys’ Posner. “But now when we talk about interconnect, it’s no longer limited to the silicon. You have to consider how the interconnect interacts across dies or across chips. Does it have to be memory-coherent? These interconnect protocols have become very complex.”
That complexity adds to the tradeoffs that have to be made. “Time and attention is being spent on 3D stacking, particularly around power, performance, cost and footprint tradeoffs,” says Peter Greenhalgh, vice president of technology and fellow at Arm. “Performance continues to be a key driver of interconnect technology, especially high bandwidth die-to-die interfaces for symmetric multi-processing using chiplets or memory chips. Due to the range of chip topologies that are being created, design partitioning with flexibility around RAM integration is key.”
Some tools may not be keeping up. “Designing pure interconnect components, be they interposers, package RDL, or silicon bridges, is a challenge,” says Ferguson. “Consider on-chip latency for really large chips compared to the low resistivity for pure passive interposers. Traditional tools for chip design and verification were built on the assumption of having some active components. Without them they cease to work.”
Problems within a system
The need to get out of the package always will exist. “There is a lot of movement of data, and there’s a lot of memory access,” says Horner. “When you are going from a processing unit to the memory, it could be a cache, it could be accessing data from the storage. You have to make multiple hops and the turnaround time is prohibitive. You cannot access data in one hop, so there are a lot of stops, which adds latency. You have to try to bring your memory as close to the processor as possible, and lots of cache coherency concepts are going on in terms of standards such as the CCIX or CXL, which are allowing this memory at least virtually to come closer.”
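Those hops add up. The sketch below sums purely illustrative per-hop latencies to show why bringing memory closer pays off so directly:

```python
# Round-trip latency as a sum of hops from core to off-package memory.
# Per-hop latencies are illustrative assumptions, not measured values.

hops_ns = {
    "core to L2 cache":       2,
    "L2 to coherent fabric":  8,
    "fabric to memory ctrl": 15,
    "controller to DRAM":    45,
}

one_way = sum(hops_ns.values())
print(f"one-way: {one_way} ns, round trip: {2 * one_way} ns")
# Every hop removed, or every level of memory pulled closer, cuts this directly.
```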
Fig 2: System Interconnect builds upon existing concepts. Source: ODSA
These standards add to design complexity. “The explosion of standards and memory types is increasing pressure on the designs to be able to morph to different uses and markets,” adds Arm’s Greenhalgh. “When considering HPC, cloud server, SmartNIC, networking, mobile, industrial and automotive, we not only see large performance changes, but entirely different memory architectures (such as persistent/NVM) as well as standards like CCIX/CXL. Strong delivery of coherence, quality of service and RAS features is simply assumed, despite these being evolving and challenging topics. Once security demands are considered – from memory encryption to architecture to side-channels and beyond – it is clear to see why design and verification costs are rapidly increasing for state-of-the-art interconnects.”
Solution architecting
Pragmatically speaking, “one must understand the purpose and usage models of the final product,” asserts Mijatovic. “Should I build a highway, or will a footpath be enough? The ideal goal is the minimum design for the requested performance.”
Complex packages make that analysis more difficult. “Even with new packaging, the costs in terms of dollars, power, and performance of going from one piece of silicon to another continues to increase,” says Geoff Tate, CEO of Flex Logix. “The best architectures will be those that minimize the need to go off chip more than necessary. For example, in AI accelerators, many of the leading chips use eight or more DRAMs and are now moving to HBM. This gives good performance, but at high cost. AI accelerators that can achieve similar throughput with less DRAM bandwidth will be much lower power and lower cost.”
The system needs to be understandable, too. “One challenge, particularly with AI applications, is how people deal with this in a way that they understand,” says de Lescure. “A common approach is to make the interconnect regular and we see more and more mesh architectures being used. With a mesh architecture, switches and interconnect sneak around the computing blocks in a predictable fashion, so it’s implementation-friendly, and performance-friendly. A customer understands it. Threads of execution are increasing proportionally to the number of clients on the interconnect. The performance you get is proportional to the typical average distance that you’re trying to reach from your starting point.”
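That distance argument can be made concrete by computing the average Manhattan (hop) distance between source/destination pairs on an N×N mesh. The sketch below does this by brute force over all pairs; the mesh sizes are arbitrary examples:

```python
# Average hop (Manhattan) distance between all source/destination pairs
# on an N x N mesh NoC. Mesh sizes below are arbitrary examples.

from itertools import product

def avg_hops(n):
    nodes = list(product(range(n), repeat=2))
    total = sum(abs(ax - bx) + abs(ay - by)
                for (ax, ay) in nodes for (bx, by) in nodes)
    return total / len(nodes) ** 2

for n in (4, 8, 16):
    print(f"{n}x{n} mesh: average distance = {avg_hops(n):.2f} hops")
```

The average grows roughly as 2N/3, so latency and pipelining budgets grow along with the mesh.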
Interconnects require significant analysis. “As you scale up data movement, you can’t just put in pipeline stages to close timing because you kill your latency,” says Posner. “So parallel architectures need to be really proved out and that’s stressing development, because even with RTL design, it forces you to do full timing closure to prove out these architectures are possible — and then expand that into a dummy SoC environment where you are potentially going across the whole chip.”
Users are having analysis problems at all levels. “Currently, there are a limited number of tools and methods available to consider the influence of interconnects on the overall performance of the system,” says Fraunhofer’s Heinig. “There is a lot of research necessary in the coming years to find better metrics, methods and tools to predict the overall system performance while taking the influence of the interconnects into account.”
It’s clear the industry cannot afford to continue the way it has. “It’s the job of the SoC interconnect architecture to orchestrate the communication requirements of dozens, in some cases over 100 components, all competing for access to a shared memory subsystem,” says Synopsys’ Kogel. “The big problem is that the effective performance of a shared interconnect and memory subsystem is very difficult to predict — even more so when caches and cache coherent interconnect are part of the picture. The traditional approach was to use spreadsheets for rough estimates and then overdesign the interconnect by 2X to be on the safe side. This is not an option anymore. The free lunch served by Moore’s law is over. The overprovisioning of resources is becoming prohibitively expensive, especially in terms of power consumption and physical implementation issues.”
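The spreadsheet-era budget Kogel describes looks roughly like the sketch below, with a blanket safety factor on top. The client list, bandwidth demands, efficiency, and 2X margin are all illustrative assumptions:

```python
# Spreadsheet-style memory bandwidth budget with a blanket 2x overdesign margin.
# Clients, demands, and efficiency are illustrative assumptions only.

demand_gbs = {            # average bandwidth demand per client
    "CPU cluster":     25,
    "GPU":             60,
    "AI accelerator": 120,
    "ISP / video":     20,
    "I/O + rest":      15,
}
DRAM_EFFICIENCY = 0.7     # assumed achievable fraction of peak DRAM bandwidth
OVERDESIGN = 2.0          # the traditional "to be safe" factor

total = sum(demand_gbs.values())
required_peak = total / DRAM_EFFICIENCY * OVERDESIGN
print(f"sum of demands: {total} GB/s")
print(f"provisioned peak with 2x margin: {required_peak:.0f} GB/s")
# The margin alone is paid for in memory channels, power, and package pins.
```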
Analysis has to be done at multiple levels. “The initial requirements drive the topology,” says Darko Tomusilovic, verification director for VTool. “In theory, you should first come up with the requirements for power and performance and determine the parts that are the most critical. Only then should you build the architecture according to that. At some point, you need to prove that the requirements have been met, and that is a process that involves going around in circles. Usually the process would be to get basic scenarios running, such as making sure that every master exercises every slave. The problem is that you get to verify performance, and deal with real concurrency, very late.”
Most verification tools are ill-suited to the task. “Functional verification of the interconnect fabric is one of the hardest challenges of today’s big SoCs,” says Sergio Marchese, technical marketing manager for OneSpin Solutions. “With the number of connections reaching the order of hundreds of thousands, connectivity verification must be fully automated with solutions of adequate capacity and usability. Moreover, engineers can use plug-and-play apps and VIPs to find corner-case bugs in the various interface protocols used in the interconnect. But more is needed to ensure the absence of high-level issues, like deadlocks and livelocks, something that simulation and emulation cannot achieve. With security IPs and features bringing additional complexity to the interconnect, the need to go beyond use-case scenario verification and toward exhaustive, formal-based verification is even stronger.”
One solution is to become more systematic about the specification and dimensioning of the interconnect architecture. “There are virtual prototyping solutions that enable the construction of executable models of the SoC Architecture to analyze power and performance,” says Kogel. “This allows exploring architecture trade-offs and fine-tuning the many design parameters to arrive at a just-right interconnect configuration.”
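In miniature, that kind of exploration looks like the toy model below: masters with assumed request rates compete for a single shared memory port under round-robin arbitration. Everything here is an assumption made for illustration, not a description of any vendor's tool:

```python
# Toy cycle-based contention model: masters share one memory port,
# round-robin arbitration, one request served per cycle.
# Request rates and the arbitration scheme are illustrative assumptions.

import random

def simulate(request_rates, cycles=100_000, seed=0):
    random.seed(seed)
    pending = [0] * len(request_rates)   # queued requests per master
    served = [0] * len(request_rates)
    rr = 0                               # round-robin pointer
    for _ in range(cycles):
        for m, rate in enumerate(request_rates):
            if random.random() < rate:   # master m issues a request
                pending[m] += 1
        for i in range(len(request_rates)):  # serve one request this cycle
            m = (rr + i) % len(request_rates)
            if pending[m]:
                pending[m] -= 1
                served[m] += 1
                rr = m + 1
                break
    return [s / cycles for s in served]

# Demands sum to 1.2 requests/cycle against a port that serves 1.0 -> contention.
rates = [0.5, 0.4, 0.3]
for m, got in enumerate(simulate(rates)):
    print(f"master {m}: wanted {rates[m]:.2f}, got {got:.2f} req/cycle")
```

Even this toy shows how a master that demands more than its fair share is throttled once total demand exceeds the port's capacity, which is exactly the kind of effect a spreadsheet estimate misses.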
You also have to deal with the very low level. “Very accurate simulation techniques with high capacity and speed need to be employed to solve for not only the interconnects on individual fabrics, but also as a combined system in a single simulation,” says Cadence’s Choe. “The selection of a specific numerical method to analyze these complex and dense structures with acceptable accuracy is becoming increasingly important. The idea that a divide-and-conquer approach will accurately solve most problems is quickly becoming outdated. Solving separately at the chip, package, or PCB level and assuming some of the field interactions are minimal is an assumption that just cannot be made as 2.5D and 3D implementations come to the forefront.”
Conclusion
Interconnects tend to be ignored until they cause problems. Chips, packages, and systems have now reached a level of complexity where interconnects cannot be ignored. At the same time, they could also be used as a competitive advantage.
Unfortunately, existing tools and methodologies do not adequately address the issue. Within the chip, similar pressures forced the evolution from logic synthesis to physical synthesis. Will we be forced to go through a substantial retooling again to take the full implications of interconnects into account?