Partitioning For Better Performance And Power

Design tradeoffs are becoming more complex and are shifting to the system level.


Partitioning is becoming more critical and much more complex as design teams balance different ways to optimize performance and power, shifting their focus from a single chip to a package or system involving multiple chips with very specific tasks.

Approaches to design partitioning have changed over the years, most recently because processor clock speeds have hit a wall while the amount of data that needs to be processed has skyrocketed. In the past, most processors came in the form of a CPU that was responsible for just about everything. But as processing needs increase, driving up transistor density, physical effects like heat and noise are forcing designers to leverage heterogeneous architectures with specialized accelerators and memories, either on a single die or in an advanced package.

“Those CPUs were responsible for doing all the computing,” said Steven Woo, fellow and distinguished inventor at Rambus. “They were responsible for processing I/O. They were responsible for processing network traffic. They also were responsible for whatever graphics were there. Fast forward a couple of decades, and now we have orders of magnitude more gates, and the way that people think about processors now is very multi-core and multi-functional. This means in addition to having multiple CPU cores, now they also have things like graphics, some of them have specialized accelerators for things like encryption, and even high-performance computation like vector engines.”

This also requires more partitioning of processing, access to various resources such as memory, and prioritization of data traffic on- and off-chip.

“Within the processors — especially in something like an Intel Xeon, where there are multiple cores designed to support multiple users, with multiple threads of computation — the question starts to become, ‘If I’ve got all these cores, and potentially all these users, or maybe multiple programs all executing on the CPU, what’s a fair partition of all these resources?'” said Woo. “By ‘fair,’ you have to make it so that everybody can have their own resources, but if not everybody’s using all those resources, you don’t want them to go to waste. So they have to be both shareable and partitionable. We didn’t have to worry about that kind of thing in the past.”

Making this work requires a series of tradeoffs and some complex analysis of data flows and use models. “There are physical challenges as well, and questions about, ‘I understand, in theory, I can have all these cores, and all these resources, but how do I get them out because it takes time to cross a chip to get from one place to another?’ Here, we are seeing a very spatial type of layout where there will be a core, it’ll have some fraction of the memory resources and things close by. That means you have to think a lot about the connection topology. In this new era with so many transistors, we have to think a lot about what’s the right partitioning of those transistors, and how to make resources able to be provisioned, and globally accessible at the same time. That’s how you’re going to get the best performance and utilization,” he said.

2.5D and 3D approaches add to the confusion.

“The design approach is different, especially when you think about how to partition the floor plan if you’re using chiplets, because certain components need to be physically close together,” said John Ferguson, director of product management at Siemens Digital Industries Software. “You have to think about that. Then, if you’re planning a new design and you’re targeting a 2.5D or 3D implementation, how do you determine where to partition things into different dies? How do you determine what process is best per die? There aren’t great answers for that today. Everybody is working toward it. It’s a bit easier if everything is just two levels, since more levels of freedom means more variables, and it’s really quite complicated to try to figure out which one is optimal.”

Another big consideration is prioritization of resources, which can vary greatly by use case and architecture. To deal with this, Arm has developed the Memory System Resource Partitioning and Monitoring (MPAM) framework to tie resource controls to the software that accesses the memory system. This permits placing limits on the resources used by a virtual machine or by a particular application.

“There are a few common use cases for resource controls in computing,” said Steve Krueger, system architect at Arm. “Think about how an RTOS and a non-RTOS share a system. In order to give the RTOS the level of responsiveness expected, resources, such as part of the cache memory, can be dedicated to its exclusive use. This type of design might occur in an automotive system, for example, where a safety-critical function and a non-safety-critical function both run on one integrated system, such as the backup camera display running on the in-vehicle entertainment system. There are many different uses for resource control, and MPAM provides a resource control framework that is adaptable to different system designs and different applications.”
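To make the idea concrete, here is a minimal Python sketch of MPAM-style partitioning, assuming a toy model rather than Arm's actual programming interface. The Partition class, the PARTID values, the 16-way cache, and the bandwidth percentages are all hypothetical; in real hardware these controls live in system registers and memory-mapped configuration, not in application code.

```python
# Toy model of MPAM-style resource partitioning (illustrative only).
# The class name, PARTID values, and resource numbers are hypothetical.

from dataclasses import dataclass

@dataclass
class Partition:
    partid: int             # partition ID attached to memory-system requests
    cache_ways: set         # cache ways this partition may allocate into
    mem_bw_percent: int     # ceiling on memory bandwidth, as a percentage

# Dedicate ways 0-3 of a 16-way cache to the RTOS so its working set cannot be
# evicted by the infotainment stack, which shares the remaining ways.
rtos         = Partition(partid=1, cache_ways=set(range(0, 4)),  mem_bw_percent=30)
infotainment = Partition(partid=2, cache_ways=set(range(4, 16)), mem_bw_percent=70)

def may_allocate(partition: Partition, way: int) -> bool:
    """Would a cache fill from this partition be allowed to land in `way`?"""
    return way in partition.cache_ways

assert may_allocate(rtos, 2) and not may_allocate(infotainment, 2)
```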

Power limitations
The same kind of partitioning and prioritization is needed for power optimization, which may be done for a variety of reasons beyond just conserving energy. For example, with each new node, dynamic power density is increasing, which in turn can cause overheating of certain parts of a chip or package. That, in turn, can affect overall functionality and cause premature aging of circuits that may be expected to last 10 or 20 years. As a result, design teams may set an upper limit for power in a particular part of a chip, and that in turn can affect performance among multiple components in a system.

Godwin Maben, a Synopsys fellow, lists six steps involving partitioning and prioritization in order to optimize power and performance:

  1. Partition the design so there are repetitive modules that are power-sensitive. This ensures that fixing one block/partition/module results in power savings across the design. For example, most AI/accelerator chips have many computing units, which can be beneficial if re-used across the design. In one recent chip, a particular compute unit was re-used more than 10,000 times, so even a 1mW power saving in that unit would result in a 10W savings overall.
  2. Clock gate in every way possible, and ensure that all sequential elements, memories, and macros are gated at the most fine-grained level when not needed.
  3. Encode/decode all heavy-traffic buses to reduce power, including the use of differential data encoding.
  4. Shut down power to blocks when not needed, even if it’s for 1,000 clock cycles.
  5. Deploy DVFS/AVFS as much as possible to take advantage of the quadratic benefit of “V” on power (see the sketch after this list).
  6. Since glitch power is becoming dominant at lower geometries and in compute-intensive designs, use glitch-tolerant architectures wherever possible to accommodate both timing and power.
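The quadratic benefit in step 5 can be seen with a back-of-the-envelope dynamic-power model, P_dyn ≈ alpha * C * V^2 * f. The Python sketch below uses invented activity, capacitance, voltage, and frequency numbers purely to show the shape of the savings; real operating points come from the library and the voltage/frequency pairs the design actually supports.

```python
# Rough dynamic-power model behind step 5: P_dyn ~ alpha * C_eff * V^2 * f.
# All numbers below are placeholders chosen only to illustrate the scaling.

def dynamic_power(alpha, c_eff, v, f):
    """Switching power in watts: activity * effective capacitance * V^2 * f."""
    return alpha * c_eff * v * v * f

nominal = dynamic_power(alpha=0.2, c_eff=2e-9, v=0.90, f=2.0e9)  # full speed
scaled  = dynamic_power(alpha=0.2, c_eff=2e-9, v=0.72, f=1.6e9)  # 80% of V and f

print(f"nominal {nominal:.2f} W, scaled {scaled:.2f} W, "
      f"saving {100 * (1 - scaled / nominal):.0f}%")
# Dropping V and f by 20% each cuts dynamic power roughly in half (0.8^3 ~ 0.51),
# which is why DVFS/AVFS is worth the control complexity it adds.
```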

Fundamentally, partitioning decisions always come down to the priorities for the design.

“Usually you will hear about the magic triangle of performance, power, and area, and then you work to settle into the correct positioning in order to have your perfect product, such that, ‘We’ll give you maximum performance for the lowest power and the lowest price,’” said Aleksandar Mijatovic, digital design manager at Vtool. “That is just a marketing lie. You cannot do that. But you can try very hard to optimize for your product’s needs with current technology, and to try to get the battery life as long as possible.”

Improving energy efficiency requires cooperation between the hardware and software teams that will be using it. “They rely on someone knowing where the maximum traffic will be to make sure, for example, the entire internal interconnect is prioritizing that to lower latency, even sometimes putting in a dedicated bus from a place-and-route point of view, or moving a certain block of memory closer to its biggest user to make life easier,” Mijatovic said. “Specifically, you do many tricks at the very basic clock-level scale. You’re trying to not have too many arbitrations in between the critical consumer and critical producer. You’re also trying in place-and-route to minimize latency through the wires from part to part, which is old school on paper. But usually you will have multiple users and multiple producers, and everything is interlocked, so you will need to give certain paths less priority than others. No matter how good your estimation tools are, the most important aspect is that the input to the tools is properly defined.”

While many tools will estimate power, throughput, and latency, the big question is whether they are configured to reflect real usage. “If you miss that, then you’re planning for some other chip,” Mijatovic said. “You need to fit your application properly. First, it is important to analyze the use case, or in most chips a set of the most common use cases, to optimize the data lines and the place-and-route. Usually after that comes optimization of power domains or gates, by grouping things that will work together. This is especially tricky with power, since every power domain costs wiring and isolation cells. You are not getting anything for free, so you really need to give some thought to how routing to a particular partition will impact timing, or whether you are introducing latency, for example in wake-up, since every power-down actually requires power-up and wake-up processes. Maybe you’re wasting more energy than you’re saving just because you need to re-initialize things. It’s very interesting from the point of view of the architecture. It must be viewed big-picture, especially for what you will need over the life of a chip.”
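That wake-up caveat boils down to a break-even check: gating only pays off when the leakage energy saved during the sleep window exceeds the cost of the power-down, power-up, and re-initialization sequence. The Python sketch below makes that comparison with placeholder leakage, sleep-time, and transition-energy values; none of them come from a real design.

```python
# Break-even check for power gating: does the sleep window save more energy
# than the entry/exit and re-initialization sequence costs? Placeholder numbers.

def gating_pays_off(leakage_w, sleep_s, entry_exit_j, reinit_j):
    """True if the energy saved while gated exceeds the transition overhead."""
    saved = leakage_w * sleep_s
    overhead = entry_exit_j + reinit_j
    return saved > overhead

# A 5 mW leaky block gated for 1,000 cycles at 1 GHz (one microsecond)...
short_nap = gating_pays_off(leakage_w=5e-3, sleep_s=1e-6,
                            entry_exit_j=2e-8, reinit_j=5e-8)
# ...versus the same block gated for a full millisecond.
long_nap = gating_pays_off(leakage_w=5e-3, sleep_s=1e-3,
                           entry_exit_j=2e-8, reinit_j=5e-8)
print(short_nap, long_nap)  # False True: short naps can cost more than they save
```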

Consider, for example, what’s involved in configuring the interconnect for a big chip. “Protocols and throughputs are specified. Number crunching is performed for optimization, and so on. You get what you have asked for,” Mijatovic said. “No tool will save you from not knowing your own project. If we are talking about prioritizing speed, which is one of the approaches, usually you start from one of the corners. It’s either very high speed, or you are trying to prioritize power. Most likely, you will first try to build the fastest chip possible, and as long as you are within your requirements boundaries, then you try to get less area or lower power consumption. I have yet to see a project that is primarily area- or power-driven, one that starts from the lowest power consumption and tries to meet the requirements from there. Requirements are fixed. Power and area are good to have, but performance requirements must be met. Your customers may give you allowances for most things, but if you do not meet minimal performance, you are gone. This is why all the power partitioning, clock gating, use of different frequencies, and everything else are a second layer of techniques applied during the chip’s life cycle. The first thing you want to check is that your chip is functional. That, of course, can bring you to a completely unacceptable product from the perspective of power consumption and area, because if your chip is too big, your production often will fall behind your competitors. You need to think smart from the start.”

Partitioning for lower latency
A key aspect of performance is latency, which only recently has become a big deal. Optimizing for lower latency is intertwined with partitioning.

“People didn’t have to deal with latency other than maybe in the finance world, when they did some sort of high-frequency trading and had to issue orders really fast,” said Tony Chen, director of product management and marketing at Cadence. “Today, because we have to move and process more data, when engineering teams start architecting their chip, they have to treat this as a priority and make certain tradeoffs to get the latency down. For example, before we may have emphasized the performance, while today we are willing to trade off some performance for the latency gain. You can have all the performance in the world, but if your latency is too long, that doesn’t help you.”
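A crude model makes that tradeoff concrete: the time to move a block of data is roughly the link latency plus the payload size divided by bandwidth, so latency dominates for small transfers. The link latencies and bandwidths in this Python sketch are illustrative assumptions, not measurements of any particular interconnect.

```python
# Time to move a payload over a link is roughly latency + size / bandwidth.
# The link numbers below are invented for illustration.

def transfer_time_us(size_bytes, latency_us, bandwidth_gbits):
    """Approximate transfer time in microseconds (1 Gb/s = 125 bytes/us)."""
    return latency_us + size_bytes / (bandwidth_gbits * 125.0)

# A 4 KB message on a fast-but-laggy link vs. a slower link with lower latency.
print(transfer_time_us(4096, latency_us=2.0, bandwidth_gbits=32))  # ~3.0 us
print(transfer_time_us(4096, latency_us=0.5, bandwidth_gbits=16))  # ~2.5 us
# For small transfers the lower-latency link wins even at half the bandwidth,
# which is the tradeoff described above: giving up peak throughput for latency.
```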

Fig. 1: Partitioning becomes especially complicated in a heterogeneous multi-chip package. Source: Cadence

The same kind of tradeoff applies to power. “If you want to move data quickly, you have to consume a little bit more power, and you have to be smart in partitioning that design. You may want to optimize a certain part of your design with a low-latency path, while for the other part of the design you still want to keep the power low. So the design partitioning needs to be considered, as well,” Chen said. “One thing that’s specific to a lot of our customers is that in the interfaces between different building blocks, there are typically issues with synchronizing the blocks so that they can operate and data can flow synchronously. In the past, for this type of interface, people used a lot of flip-flops to make sure synchronization happens properly, since a lot of overhead happens in the interface. Today, we’re seeing our customers and our internal engineering teams rethinking the interface for block-to-block communications, and that’s also happening with the shift to chiplet designs. The same issue happens there — how do you get the chips to talk to each other and not incur the latency penalty?”
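To get a feel for the overhead Chen describes, a conventional multi-stage flip-flop synchronizer adds on the order of a couple of destination-domain clock cycles per crossing. The stage counts and clock rates in this Python sketch are assumptions chosen only to show how the penalty scales with the number of block-to-block or die-to-die hops.

```python
# Added latency of a multi-stage flip-flop synchronizer at a block interface.
# Stage counts and clock rates are assumed values, not vendor data.

def crossing_latency_ns(sync_stages, dest_clock_mhz):
    """Roughly sync_stages destination-domain cycles per crossing, in ns."""
    return sync_stages * 1000.0 / dest_clock_mhz

# Two-stage synchronizer into a 500 MHz domain vs. a three-stage one at 200 MHz.
print(crossing_latency_ns(2, 500))  # 4.0 ns per crossing
print(crossing_latency_ns(3, 200))  # 15.0 ns per crossing
# Multiply by the number of block-to-block (or die-to-die) hops on a path and
# the overhead adds up, which is one reason these interfaces are being rethought.
```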

Arif Khan, product marketing group director at Cadence, said a key tradeoff against performance is robustness. “When trading off latency, constraints are created on the system by saying, ‘Yes, you’re going to get this latency, but now your system design has to be tighter, because if you really want this you’ve got to manage your clock skews better and you’ve got to lay out these blocks better.’ You really have to be in control of both ends of the system so that you can actually maximize the performance. Should that not happen, then you will fall back to worse latencies, because the system cannot fail. You can’t design under the assumption of ideal scenarios. You can’t have data corruption because you’ve designed for your sunny-day scenario.”

Tradeoffs involving partitioning can get very complicated, sometimes requiring judgment calls on what’s most important and to whom.

“You have to think about the things that matter absolutely the most to the market and the main end users,” said Rambus’ Woo. “The driver in a lot of this is really data movement. You want your performance, of course. The question is what the big roadblocks are to all of that. What we’re seeing is that the movement of data is really the thing that is limiting both performance and power efficiency. That leads to the point where you say, ‘When I have a core, its resources have to be close. When I have a bunch of data here, I have to figure out how to minimize its movement, because that’s going to impact the performance and the power that I’m spending.’ So data movement is a really big determinant of how things are put together.”

Making tradeoffs
How this gets partitioned depends a lot on the workload, Woo said. “The first thing you do is look at the workload and try to understand what it’s trying to do. In many cases, there’s a capture phase where you’re trying to gather a bunch of information, then you’re trying to compute on it, and then there’s something that you’re doing with the result. You look at each of those phases and decide where’s the best place to do each of these things. You have to balance that against how much data movement is needed to accomplish it, and it is a bit of a tug of war. There are places where some silicon is designed to be much more efficient at certain tasks. GPUs are great at doing graphics and vector kinds of computation. That’s the right place to do those kinds of things, so long as you don’t pay such a heavy price to involve that engine that it wipes away all the benefits. It really is a tradeoff game of, ‘I’ve got this incredibly powerful engine here, but how do I make sure that overall the equation balances and it’s to my advantage to make use of it?’ So you’ll see people change algorithms and reorganize computation.”
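That tug of war can be framed as a simple break-even condition: offloading a phase to an accelerator only helps if the accelerated compute time plus the round-trip data movement is still shorter than running the phase in place. The sizes, speedups, and link rates in this Python sketch are invented purely to illustrate the condition.

```python
# Offload break-even: accelerator time plus round-trip data movement must beat
# running the phase where the data already is. All numbers are invented.

def offload_wins(cpu_time_s, speedup, data_bytes, link_bytes_per_s):
    """True if offloading the phase is faster once data movement is counted."""
    accel_time = cpu_time_s / speedup
    move_time = 2 * data_bytes / link_bytes_per_s  # ship the data there and back
    return accel_time + move_time < cpu_time_s

# 10 ms of CPU work, a 20x faster engine, and 64 MB of data over a 32 GB/s link.
print(offload_wins(10e-3, 20, 64 * 2**20, 32e9))  # True: worth involving the engine
# The same kernel on only 1 ms of work no longer pays for the data movement.
print(offload_wins(1e-3, 20, 64 * 2**20, 32e9))   # False: movement wipes out the gain
```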

There are many ways to optimize a design, and many tradeoffs among those choices. Each requires a different way to partition a chip to improve data movement, based on what technologies are available at any point in time.

Fig. 2: Intel’s Xeon processor. Source: Intel

“You look at your application, and the way you might have written it 10 years ago could be completely different from the way you write it now,” Woo said. “If I refactor my application by breaking it down differently to use the new silicon that’s available, how might I do things? That’s a lot of what goes on with the hardware that’s available, and with the algorithms you want to try to implement. How do you refactor those things? Some applications that have been around for many years contain a lot of embedded knowledge, so it is worth optimizing them for the architectures that exist today. But the refactoring task is very intensive for certain applications, and it is a bit of a tradeoff to figure out what the development cost is going to be, because sometimes it works out to not be advantageous to undertake that. It’s expensive to do software development and then to re-verify everything, and sometimes the equation just doesn’t work out.”

At the same time, Siemens’ Ferguson said where and how those partitioning decisions happen depends on the company and the team, because there are a lot of different opinions involved. “Today, it’s a little bit dictated by where there are existing components that either are already built, or that we know we can build and know are going to yield and be reliable and reusable. If they can do that, then it makes it easy. They don’t have to think about where to partition because it’s self-defined. That would be done very early in the design process, such as, ‘Here’s some stuff we already have. Design around it to make this work.’”

Conclusion
While partitioning has been discussed quite a bit for a single chip, these conversations are now happening at the system level, as well.

“The way systems were designed and put together is based on a view of data movement from a couple of decades ago,” said Woo. “It’s evolved since then, but there’s a much greater willingness now to think about more radical changes to the way systems are put together. You can see things that will work well for chips that won’t necessarily be the right answer for the entire system, and vice versa. All of the answers are not yet there, but it’ll be driven a lot by the data movement needs and what people believe is going to happen in the next decade.”


