When and where tradeoffs between efficiency and flexibility make sense.
Moving data around is probably the least interesting aspect of system design, but it is one of the three legs that define the key performance indicators (KPIs) for a system. Computation, memory, and interconnect all need to be balanced. Otherwise, resources are wasted and performance is lost.
The problem is that the interconnect is rarely seen as a contributor to system functionality. It is seen as overhead. In an era where algorithms and workloads are changing more rapidly than chip designs, the interconnect may well become the piece of the puzzle that makes systems resilient to change.
The interconnect is hardly a new issue. While some new hurdles have appeared, the core problem has remained the same for decades.
“We have been trying to predict the performance of communications for a long time,” says Frank Schirrmeister, senior group director for solution marketing at Cadence. “What used to be a high-level prediction of what the communication would look like has become so central in the era of domain-specific compute that companies are actually doing all kinds of analyses much earlier and doing them with detailed models. I don’t think interconnect is left out. It is because of the ability to auto-generate their implementation that this happens as part of architectural exploration.”
After that, teams must ascertain if the predictions were correct. “Once you have done the architecture investigation, what people do is fix the parameters of the interconnect and say, ‘This is the way I will configure the interconnect,’” says Johannes Stahl, senior director of product marketing at Synopsys. “This is the way I’m going to configure the memory controller, and that’s what I’m taking as a spec to go into RTL implementation. When they have the final RTL they can actually measure these KPIs.”
In many cases this analysis cannot be done purely at the architectural level. “The decision for such architectures must be supported by a lot of high-level simulations and considerations,” says Andy Heinig, group leader for advanced system integration and department head for efficient electronics at Fraunhofer IIS’ Engineering of Adaptive Systems Division. “One of the necessary simulations is system-level functional simulation/verification, but rapid floor planning, package estimations, power and ground estimation, thermal estimations, and many other issues have to be considered. Power, performance, and area (PPA) can be optimized, but a lot of systems are also limited by other factors, such as power delivery, heat dissipation, and package costs.”
A couple of issues have changed recently. The first is the amount of data being moved around, and the increasing need to reduce power and latency while increasing throughput for specific tasks and workloads. “There are three aspects to the problem statement for data movement,” says Mo Faisal, president and CEO of Movellus. “There is the bandwidth issue — how fast you can push the data. There are physical limits, such as how many wires you can use. And finally, there is synchronization to make sure that data remains coherent. Anytime you look at moving data from point A to point B, you look at how wide your bus is, how fast the bus can go, and what you have to do to keep track of signal conditioning and synchronization.”
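To make those three knobs concrete, here is a small back-of-the-envelope sketch that turns an assumed bus width, clock rate, and synchronization overhead into peak and usable bandwidth. None of the numbers come from a real design; they only illustrate how the parameters interact.

```python
# Hypothetical check of the three data-movement knobs described above:
# bus width (how many wires), clock rate (how fast the bus can go), and
# the resulting raw bandwidth. All figures are illustrative assumptions.

def raw_bandwidth_gbps(bus_width_bits: int, clock_ghz: float,
                       transfers_per_cycle: int = 1) -> float:
    """Peak bandwidth in Gbit/s before protocol and synchronization overhead."""
    return bus_width_bits * clock_ghz * transfers_per_cycle

# A 256-bit bus at 2 GHz, double data rate (all values assumed):
peak = raw_bandwidth_gbps(bus_width_bits=256, clock_ghz=2.0, transfers_per_cycle=2)
print(f"Peak: {peak:.0f} Gbit/s ({peak / 8:.0f} GB/s)")

# Synchronization and signal conditioning eat into this; an assumed 15%
# overhead shows why the usable figure is always lower than the peak.
usable = peak * (1 - 0.15)
print(f"Usable (assumed 15% overhead): {usable:.0f} Gbit/s")
```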
The second recent change is the evolution of system-level aspects. “The more defined you get, the easier it is to lock things down,” says Manuel Uhm, director for silicon marketing at Xilinx. “But that limits flexibility in what you can do with it in the future. AI and ML are evolving so rapidly that it is actually difficult to lock anything down in terms of the algorithms, the base kernels and operations, or the compute functions needed to support them. Even the data types are changing. There used to be a lot of focus on the Int8 datatype, and folks are now finding that if they reduce down to Int4, they can usually get accuracy that is virtually identical, but significantly reduce the compute requirements and throughput needs.”
That has a huge impact on power. “The prediction of power consumption of chips under a given workload is one of the most complex tasks our industry must tackle today,” says Guillaume Boillet, director of product management for Arteris IP. “It requires both a very detailed representation of the hardware and the underlying traffic. However, the power consumption of convolutional neural networks varies almost linearly with the precision of the quantization. In other words, an implementation relying on a 32-bit representation would consume roughly twice as much as one relying on 16 bits.”
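The near-linear relationship Boillet describes can be written out as simple arithmetic. In the sketch below, the 32-bit baseline power is a placeholder; only the scaling with bit width is the point.

```python
# Rough sketch of the near-linear relationship between quantization bit
# width and power described above. The 32-bit baseline power is an
# assumed, illustrative figure, not a measurement.

BASELINE_BITS = 32
BASELINE_POWER_W = 10.0  # assumed power of a 32-bit implementation

def estimated_power_w(bits: int) -> float:
    """Scale power linearly with bit width relative to the 32-bit baseline."""
    return BASELINE_POWER_W * bits / BASELINE_BITS

for bits in (32, 16, 8, 4):
    print(f"int{bits}: ~{estimated_power_w(bits):.1f} W")
# 16 bits comes out at roughly half the 32-bit figure, matching the
# "twice as much" comparison above; int8 and int4 continue the trend.
```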
Futureproofing adds another dimension to the problem. “That’s where the term software-defined networking (SDN) comes in,” says Cadence’s Schirrmeister. “You define the bandwidth, being careful not to over-design things. You need the runway underneath to meet all your goals. Then you can define the SDN on top of it. This is difficult. People can’t really predict what the end bandwidth will be. And that is why everything has become more software loaded.”
Does that add more overhead? “In some cases, there could be,” says Xilinx’s Uhm. “If you don’t need all the throughput that is capable of being moved, there could be some over-provisioning. But to provide the flexibility, that’s always the case. If you don’t need that flexibility, just use an ASSP or build an ASIC. Lock all those things down and just never change them, or spend millions of dollars to change them later.”
The abstraction of design and the limitations of implementation make the prediction difficult. “A virtual prototype could be a very abstract model used for architectural exploration, where you just model the interconnect and memory, and the rest are abstract workload models,” says Synopsys’ Stahl. “This provides a good indication of tradeoff decisions, but no accurate value whatsoever. If you have a good functional model at that level, then you can write software, but in terms of real PPA values, you are too far away. It’s too abstract or too incomplete to have any values that matter as a prediction for implementation.”
Problem scope
Gone are the days when problems were constrained to the chip, or to the chips it directly connects to. In many cases, the scope of analysis has to be very carefully defined. “Within a package, and that may be a monolithic die or dies connected 2.5D or 3D, you need to deal with the communication there,” says Schirrmeister. “Memory is central in those cases. But when you go up in scope and perhaps consider the data center (see Fig. 1), when you look at the racks and how they are connected and you have the notion of that extending throughout the network, now you are looking into latencies. You have to look at compute, latency, power, and performance — all interconnected when you make decisions. Do I process on my IoT end device? Do I process on my near edge? Do I process on my far edge? Do I process at a local data center? Or do I get it all into the cloud? This notion of hyper-connectivity plays into it, but you need to be really aware of the scope.”
Fig. 1: Scoping the interconnect problems. Source: Cadence
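One way to frame that scoping question is as a side-by-side comparison of the processing tiers Schirrmeister lists. The sketch below uses entirely made-up latency, energy, and compute figures, plus an assumed latency budget, purely to show the shape of the tradeoff an architect would weigh.

```python
# Toy comparison of processing tiers for a single task. Latency, energy,
# and relative compute values are placeholders chosen only to illustrate
# the kind of tradeoff table an architect might build.

tiers = {
    # tier:               (round-trip ms, energy mJ, relative compute)
    "IoT end device":     (0.0,  50.0,    1),
    "near edge":          (2.0,  20.0,   10),
    "far edge":           (10.0, 15.0,   50),
    "local data center":  (25.0, 12.0,  200),
    "cloud":              (80.0, 10.0, 1000),
}

latency_budget_ms = 20.0  # assumed end-to-end latency requirement

for name, (latency_ms, energy_mj, compute) in tiers.items():
    ok = latency_ms <= latency_budget_ms
    print(f"{name:18s} latency={latency_ms:5.1f} ms  "
          f"energy={energy_mj:5.1f} mJ  compute={compute:4d}x  "
          f"{'fits budget' if ok else 'exceeds latency budget'}")
```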
While these issues are not new to people designing data centers, they are new to chip/package designers. “We need to recognize there is a higher latency or higher delay time that needs to be described in models,” says Synopsys’ Stahl. “For people who model large systems already, this is nothing new. What we do see is a need for modeling the physics of these off-die connections as part of the process of verifying the system or tuning the system. Imagine you have a DRAM and the DRAM PHY. You need to train and optimize the PHY. You can use the standard sequence that is validated in some silicon node and maybe you are happy with that. But if you really want to push the boundaries of your system, you might want to be very specific for training this DRAM interface to the adjacent die.”
A lot of people are considering the migration to multi-die packages. “Yield is really a big factor behind this,” says Movellus’ Faisal. “There are silicon physical aspects that come into play when you start doing bigger and bigger tiles. You start running into second-, third-, fourth-order effects of variation, and degradation in your clocking, degradation in your power networks, and then issues around data movement and around power delivery. However, you have to understand the tradeoffs. If you were to cut a single 400mm² chip into four 100mm² chips and go full bandwidth, you would be taking a 200W chip and turning it into a 250W chip. The additional 50W is I/O overhead.”
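Faisal's example reduces to straightforward arithmetic, reproduced below using only the figures from his quote.

```python
# Working through the chiplet example above: splitting a 400 mm^2, 200 W
# monolithic die into four 100 mm^2 chiplets at full bandwidth adds roughly
# 50 W of die-to-die I/O overhead (figures taken from the quote).

monolithic_area_mm2 = 400
monolithic_power_w = 200
num_chiplets = 4
total_chiplet_power_w = 250  # quoted figure for the partitioned design

io_overhead_w = total_chiplet_power_w - monolithic_power_w
print(f"Chiplet area: {monolithic_area_mm2 // num_chiplets} mm^2 each")
print(f"I/O overhead: {io_overhead_w} W "
      f"({io_overhead_w / monolithic_power_w:.0%} of the monolithic power budget)")
```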
Dataflow and workload
While some chips are designed to tackle narrow problems, more flexibility is needed where workloads are less constrained. “You can have vision processing or audio processing or autonomous driving,” says Faisal. “Those are all different problem statements in terms of data structure and different graphs. That, in turn, modifies interconnect constraints.”
But it is possible to develop a flow for this. “You essentially map out the data flow for the algorithm, for the system that you’re developing, and you program the network-on-chip (NoC) to accommodate that data flow,” says Uhm. “It is really important to map out the algorithm and the data flow, then do some traffic analysis. We have a tool (see Fig. 2) that tells you if the goals you have established are possible. If you’ve tried to impose latency restrictions, it will work within those latency restrictions, or tell you it’s not possible. You set things like the width of the data and other things that are important.”
Fig. 2: Mapping out the interconnect. Source: Xilinx
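A minimal sketch of that kind of feasibility check might look like the following. It is not the Xilinx tool; the flows, per-link capacity, and per-hop latency are assumed values chosen to show how a latency or bandwidth target gets flagged as unachievable.

```python
# Minimal sketch of a NoC feasibility check: given per-flow bandwidth
# demands and latency budgets, flag anything the assumed fabric cannot meet.

from dataclasses import dataclass

@dataclass
class Flow:
    name: str
    bandwidth_gbps: float    # sustained demand
    latency_budget_ns: float
    hops: int                # routed path length

LINK_CAPACITY_GBPS = 128.0   # assumed per-link capacity
LATENCY_PER_HOP_NS = 5.0     # assumed router + wire delay per hop

flows = [
    Flow("camera_in",   bandwidth_gbps=32.0, latency_budget_ns=200.0, hops=4),
    Flow("dma_weights", bandwidth_gbps=96.0, latency_budget_ns=500.0, hops=6),
    Flow("ctrl_regs",   bandwidth_gbps=0.5,  latency_budget_ns=40.0,  hops=9),
]

for f in flows:
    est_latency = f.hops * LATENCY_PER_HOP_NS
    bw_ok = f.bandwidth_gbps <= LINK_CAPACITY_GBPS
    lat_ok = est_latency <= f.latency_budget_ns
    verdict = "OK" if (bw_ok and lat_ok) else "NOT FEASIBLE"
    print(f"{f.name:12s} est. latency {est_latency:5.1f} ns  {verdict}")
```

In this toy run the control-register flow misses its latency budget, which is the kind of answer such a tool would hand back to the architect.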
This may not always be quite so easy, however. “The challenge is, at that stage of the design you don’t have the real software,” says Stahl. “You may only have abstract representations of the software workload. This has the right transactions and will generate representative traffic on the interconnect, but it’s not the actual software. It’s a representation of the software. It is an abstract workload model. That may be good enough for architects who want to make basic tradeoffs.”
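To make the idea of an abstract workload model concrete, the sketch below generates a stream of representative transactions from an assumed traffic mix instead of executing real software. The transaction types, weights, and inter-arrival gaps are all hypothetical.

```python
# One way to think about an abstract workload model: instead of running
# the real software, emit transactions with a representative mix, size,
# and timing. Everything here is an assumed, simplified stand-in.

import random

random.seed(0)

# Assumed traffic mix: (transaction type, relative weight, bytes moved).
TRAFFIC_MIX = [
    ("read",      6, 64),
    ("write",     3, 64),
    ("dma_burst", 1, 4096),
]

def generate_transactions(n: int):
    """Yield (cycle, type, bytes) tuples that mimic the workload's traffic shape."""
    weights = [w for _, w, _ in TRAFFIC_MIX]
    cycle = 0
    for _ in range(n):
        kind, _, size = random.choices(TRAFFIC_MIX, weights=weights)[0]
        yield cycle, kind, size
        cycle += random.randint(1, 10)  # assumed inter-arrival gap

for txn in generate_transactions(5):
    print(txn)
```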
Other tools are coming that help with the generation of workloads. “There are new languages, such as the Portable Stimulus Standard (PSS) defined by Accellera Systems Initiative,” says Schirrmeister. “It allows you to generate test streams. Aspects of the use cases can be defined and created as a test environment. Then it can go through all the cycles with the constraint solver.”
There are a lot of considerations when building a NoC (see Fig. 3). “There is a lot to take into account,” says Uhm. “You may have multiple topologies with different routing, having to make sure there are no deadlocks, no contention, which can happen with the bus.”
Fig. 3: Network-on-chip concerns. Source: Xilinx
The NoC also can perform many other functions. “Each core will have variation and skew problems,” says Faisal. “You may need to modify voltages on a particular core to meet your performance target, but then if you want to go talk to another core, you have to put something in between them. You need a NoC in the middle because you cannot maintain the data transfer at the same edge. Your clocks on the two cores are not going to be aligned or synchronous.”
Many people consider interconnects to be tedious. “Imagine an SoC architecture, and you have all the blocks attached to the network-on-chip interconnect,” says Kurt Shuler, vice president of marketing at Arteris IP. “You may have 200 things connected to your NoC at the center of the chip. The NoC tool manages all of the meta data for the IP connected to it. Back-figuring all that information is a huge source of systematic errors. We all make mistakes. And that causes problems, not just for regular chips. But can you imagine that for typical functional safety requirements?”
Security
The communication of data is an attack surface and requires careful consideration. “You don’t have to protect the entire system from a security perspective,” says Faisal. “You just protect the key component of it. But the downside is that anytime you start playing around with protocols for security reasons, or any other reasons, you add a lot of latency to your system. If you want to start worrying about security in the communication protocols of interconnect, you have to make that tradeoff around latency and overall system throughput and performance.”
That takes us full circle to the initial architecting and analysis. “That’s why system analysis tools look at things like thermal effects, electromagnetic effects, fluidity effects,” says Schirrmeister. “There is some level of security in it, but it’s nothing more than the age-old race between somebody trying to secure a system, and a hacker trying to hack it. Once it’s hacked, that will continue for quite some time. The same is true for in-built debug systems. Those are attack surfaces for people. You have to be very careful how you deal with those. So, yes, there are new attack surfaces that are introduced by this, but people are thinking quite actively how to secure them.”
Conclusion
How long do you intend your chip to remain relevant? That is not a question many had to ask in the past. But with the rate at which algorithms and data types are changing, a design that is not flexible enough may mean you have to keep turning chips at an ever-faster pace. That makes it difficult to recoup the non-recurring engineering (NRE) costs.
However, if you add too much flexibility you may not be competitive initially. The interconnect is a key component of flexibility, and the design of it is being taken more seriously than in the past. The incorporation of multiple dies in a package also is creating new possibilities and making system-level architectural analysis even more important.