Tradeoffs To Improve Performance, Lower Power

Customized designs are becoming the norm, but making them work isn’t so simple.


Generic chips are no longer acceptable in competitive markets, and the trend is growing as designs become increasingly heterogeneous and targeted to specific workloads and applications.

From the edge to the cloud, in everything from vehicles and smartphones to commercial and industrial machinery, the focus increasingly is on maximizing performance using the least amount of energy. This is why tech giants such as Apple, Facebook, Google, Amazon, Microsoft, and others are creating custom silicon for their targeted applications.

Compute technology historically has swung between general-purpose and domain-specific, but the rising cost of scaling and the diminishing PPA (power, performance, and area) benefits, along with a focus on more intelligence in these systems, are tilting the scale heavily toward customization. That specificity can be designed into a complex SoC, but it also can involve advanced packages with different compute elements such as GPUs, CPUs, FPGAs, custom accelerators, and different memory types and configurations.

“GPUs already were being used in the high-performance computing community, and people could see the power advantage,” observed Steven Woo, fellow and distinguished inventor at Rambus. “AI was a target application, as well, for what people wanted to do on the next supercomputers, so discussions began around how to use these GPUs as accelerators. But in all of this, GPUs give you a really difficult power envelope to operate under. You just look at the power envelope and see there’s no way to move the data around like you used to before. You know the picojoules per bit to move a piece of data. You look at that and say, ‘There’s no way I can do what I used to do.’”

In the past, developers could just wait for the next silicon node and there would be a bigger, more capable processor running on a smaller geometry technology base, able to deliver higher performance at lower power.

“As scaling has slowed, this is no longer a viable strategy,” said Russell Klein, HLS platform program director at Siemens EDA. “To meet increasing computational demands today, developers have been looking to alternative approaches. This is especially true on the edge, where there can be severe constraints on compute resources and available energy. There is always a tradeoff between having an accelerator be programmable and extracting the greatest performance and efficiency. GPUs, TPUs, and similar arrays of small processors have great flexibility in that they can be easily reprogrammed, but they leave a fair amount of performance and efficiency on the table. To achieve the absolute highest performance and efficiency, a fixed hardware accelerator is required.”

But there is a tradeoff between performance and energy efficiency. “Going as fast as possible means pushing the clock frequency and voltages to their limits. It also means adding more computational elements, more and larger cache memories, and wider data paths. All of these improve performance, but use more energy,” Klein said.
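Klein’s point can be made concrete with the first-order CMOS dynamic-power relation, P ≈ C·V²·f. The capacitance, voltage, and frequency values below are illustrative assumptions, not figures from the article:

```python
def dynamic_power(c_eff, v_dd, f_clk):
    """First-order CMOS dynamic power: P = C_eff * Vdd^2 * f.
    Raising the clock usually requires raising the voltage too,
    so the energy cost of 'going faster' compounds."""
    return c_eff * v_dd**2 * f_clk

# Hypothetical operating points: 1 nF effective switched capacitance
base  = dynamic_power(1e-9, 0.8, 1.0e9)   # 0.8 V at 1.0 GHz
burst = dynamic_power(1e-9, 1.0, 1.5e9)   # 1.0 V at 1.5 GHz
ratio = burst / base                      # (1.0/0.8)^2 * 1.5 = 2.34x
```

A 50% clock boost that also needs a 25% voltage bump more than doubles dynamic power, which is why simply cranking frequency is a poor lever inside a fixed energy budget.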

Particularly at the edge, this shift is being driven by the explosion in data and the penalties — economic, power, time — in moving all of that data to the cloud, processing it, and sending it back to the edge device. In fact, the edge buildout is a result of those factors and the need for real-time, or at least near real-time, results.

“The main challenge for these designers is meeting the highest possible performance inside a given energy budget,” said George Wall, director of product marketing for Tensilica Xtensa Processor IP at Cadence. “These use cases are driving the proliferation of domain-specific processing solutions. It’s all about computes-per-milliwatt and computes-per-millijoule in these use cases, and there is just no way to get there effectively without tailoring the compute resources to the task at hand. It doesn’t matter whether it’s automotive radar, or object classification for autonomous vehicles, or voice command recognition in a noisy environment. The growth of machine learning algorithms in these areas reinforces the explosive growth of compute requirements in these areas. But there is only so much energy to go around, and no one wants a bigger car battery, for example.”

This is particularly important in automotive electronics. “Intelligence requirements within the vehicle, and communication to and from the vehicle, are both increasing dramatically,” said Robert Day, director of automotive partnerships in Arm’s Automotive and IoT Line of Business. “However, defining what ‘edge’ means in a particular context, and then what sort of applications and processing are required for each variant of edge differ dramatically. In automotive, these could be roadside units that are linked to cameras that are watching the environment, connected to infrastructure elements such as traffic signals, or connected to vehicles in their locale. A common theme with these units is the local processing and aggregation of data, rather than feeding the data directly to the cloud for processing.”

The cloud still plays a role, but one that leverages its massive compute capability with an acceptable level of delay in results. “This data, once processed locally, can then be fed as an environmental snapshot to the cloud, which can then process the data further to provide a view of the world for that locale, be it weather, traffic, or accidents,” said Day. “This consolidated information can be fed back to vehicles within that locale to aid with driver assistance, route planning, and autonomous driving. These edge units will often run similar workloads to the cloud, but will need to run on more efficient, ‘embedded’ computing elements, including CPUs, GPUs, and MLAs. These edge units also could communicate to local vehicles, directly warning them in ‘real time’ of possible dangers such as accidents, weather conditions, or vulnerable road users in the vicinity.”

Minimizing data movement
Core to the design is understanding what works best where, and then partitioning the compute, memory, and I/O based upon performance and power requirements, with as many customized elements as time and cost will allow.

“You actually see the benefits in practice now,” Rambus’ Woo said. “Probably the biggest things that we see are that domain-specific processors are very good at matching the resources needed for a particular computation, not including the kitchen sink, and only including what you really need. People realized that computation is important in a lot of this, but what’s a bit more of a bottleneck is the movement of data. As a result, these domain-specific processors keep pushing forward the ability to compute faster and faster, but if you can’t keep those pipelines full, then there’s no point in doing it. A lot of this is also about moving data on and off chip. Once these chips have data on-chip, they try not to move it around. They try and use it as much as they can, because if data is constantly being moved on and off, that’s not a good thing either. AI accelerators, in particular, need more bandwidth as compute engines get faster, and there’s only a couple of really good memory solutions that will work. When you do have to move data, you want it to be as power efficient as possible. Then, once you get it on your chip, you try not to move it around at all.”

Some system architects take the approach of keeping some of the data stationary, and changing the computation going on around the data. “Think of it as a vast array of little compute cells,” said Woo. “These compute cells can hold a little bit of data, and there are different combinations of transistors to do different kinds of calculations on that piece of data. It’s actually better to have those compute resources all in one place, and then not to move the data around, than it is to have unique and dedicated resources and to be shipping the data back and forth between these little compute units. In machine learning, this is called weight stationary. Applied in machine learning, you perform all those repeated computations, multiplying a weight against an input. The weight is kept in one place, then other data is flowed through the architecture. The weight is used over and over again in some of these calculations, so it’s to your advantage to keep it, and make it stay put if you can.”
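The weight-stationary dataflow Woo describes can be sketched in a few lines. This is a toy functional model, not any vendor’s design: each “cell” permanently holds one weight, and only the inputs and partial sums move through the array.

```python
import numpy as np

def weight_stationary_matvec(weights, inputs):
    """Toy weight-stationary dataflow: each cell holds one weight for
    the whole run; input vectors stream past and partial sums
    accumulate, so the weights themselves never move."""
    n_out, n_in = weights.shape
    results = []
    for x in inputs:                              # inputs flow through
        acc = [0.0] * n_out
        for r in range(n_out):
            for c in range(n_in):
                acc[r] += weights[r, c] * x[c]    # weight stays in its cell
        results.append(acc)
    return results

W = np.array([[0.5, -0.25],
              [0.125, 1.0]])                      # weights loaded once
xs = [np.array([1.0, 2.0]), np.array([-1.0, 0.5])]
ys = weight_stationary_matvec(W, xs)
```

In hardware the payoff is that the expensive data movement, fetching each weight, happens once per batch of inputs rather than once per multiply.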

That has a big impact on performance, but it also requires a deep understanding of data movement within a particular domain.

“Efficiently caching data and intermediate results is critical for optimal performance,” said Siemens’ Klein. “Anything that minimizes the movement of data and reduces the size of memories in the design will improve power and performance. Moving computations from software on a general-purpose processor to hardware can dramatically improve the operating characteristics of the system.”

A hardware accelerator can dramatically outperform software for a number of reasons, he explained. “First, processors are fairly serial machines, while a hardware accelerator can perform lots of operations in parallel. Second, processors have a limited set of registers for holding intermediate results. In most cases they need to write results out to a local cache, or maybe even to main memory, which is slow and consumes a lot of power. In a hardware accelerator, data can be passed directly from one processing element to the next. This saves a lot of time and a lot of energy. Finally, a hardware accelerator can be tailored to the specific task it will be working on. The multiplier in a general-purpose processor needs to be able to handle anything any program can send to it, meaning it needs to be 32 or 64 bits wide. The multipliers in a hardware accelerator may only need to be 11 bits or 14 bits wide, whatever is needed to support the computation being performed.”

Many applications can move from floating-point to fixed-point number representation. Fixed-point operators are about half the size of equivalent floating-point operators. Further, these applications do not always need 32 bits to represent the numbers. Depending on the range and precision needed, a smaller representation can be used. This benefits the design in a number of ways. First, there is less data to move and store, which means smaller memories and smaller data paths. Second, the arithmetic and logical operators can be smaller.
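The range-versus-storage tradeoff is easy to see with a Q-format round-trip. This is a generic fixed-point sketch, with 0.7071 chosen as an arbitrary example value:

```python
def to_fixed(x, frac_bits):
    """Round a real value to the nearest fixed-point code with
    frac_bits fractional bits."""
    return round(x * (1 << frac_bits))

def to_real(q, frac_bits):
    """Convert a fixed-point code back to a real value."""
    return q / (1 << frac_bits)

# Store 0.7071 with 7 vs. 15 fractional bits: half the storage and
# data-path width, at the cost of coarser precision.
q8  = to_fixed(0.7071, 7)     # fits in 8 bits with sign
q16 = to_fixed(0.7071, 15)    # fits in 16 bits with sign
err8  = abs(to_real(q8, 7)  - 0.7071)
err16 = abs(to_real(q16, 15) - 0.7071)
```

The rounding error is bounded by half an LSB in each case, so the designer’s job is picking the narrowest width whose error bound the algorithm can tolerate.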

Multipliers, among the larger operators in a design, are roughly proportional in size to the square of the width of their inputs. An 8-bit fixed-point multiplier is about 1/16 the size and power of a 32-bit fixed-point multiplier. (A 1-bit multiplier is simply an AND gate.) There’s nothing magical about powers of two, either. In an accelerator, a multiplier can process numbers of any width, so any reduction in operand size, so long as it does not impair the computation, benefits the design.
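That quadratic scaling can be checked directly. This is a first-order model of an array multiplier; real implementations (Booth-encoded, tree multipliers) deviate from it somewhat:

```python
def relative_multiplier_area(bits, baseline_bits=32):
    """First-order model: array-multiplier area (and, roughly, power)
    scales with the square of operand width."""
    return (bits / baseline_bits) ** 2

eight  = relative_multiplier_area(8)    # 1/16 of a 32-bit multiplier
eleven = relative_multiplier_area(11)   # ~0.12, for an 11-bit operand
```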

Inferencing is one example of an algorithm that can use significantly reduced precision and still deliver good results. “Neural networks are typically trained using 32-bit floating-point numbers, but they can be deployed for inferencing using a much smaller fixed-point representation for features and weights,” Klein said. “The values in neural networks are usually normalized to between 1.0 and -1.0, meaning most of the range of the floating-point number representation is simply not used. The quantizing effect of using fewer bits can actually improve neural network performance in some cases. In a recent example, we were able to use 10-bit fixed-point numbers and still be within 1% of the accuracy of the original floating-point neural network implementation. A 10-bit fixed-point multiplier is about 1/20th the area and power of a 32-bit floating-point multiplier. That’s a big savings.”
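As a rough illustration of the 10-bit case Klein describes, the sketch below quantizes weights normalized to [-1, 1) into 10-bit fixed-point codes. The data is randomly generated for illustration; this is not Siemens’ flow:

```python
import numpy as np

def quantize(w, bits=10):
    """Quantize weights in [-1, 1) to signed fixed-point codes with
    (bits - 1) fractional bits, then dequantize to get the values the
    accelerator would actually compute with."""
    scale = 1 << (bits - 1)                       # 512 levels per sign
    q = np.clip(np.round(w * scale), -scale, scale - 1)
    return q / scale

rng = np.random.default_rng(0)
w = rng.uniform(-1.0, 1.0, 10_000)                # stand-in for trained weights
worst = np.max(np.abs(w - quantize(w)))           # bounded by ~1 LSB (1/512)
```

Each weight lands within about one LSB of its original value, which is why a well-normalized network can often absorb the quantization with little accuracy loss.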

To make this work requires a reasonably good estimate of the power, performance, and area for each option under consideration. Manually developing an RTL implementation is impractical when considering more than a couple of alternatives, which is one of the reasons why high-level synthesis has surged over the past couple of years.

But rising design complexity and focus on specific domains also has increased the demand for quicker ways of making tradeoffs across a number of disciplines.

“For domain-specific processors, as the name indicates, you have to have a good knowledge of the domain,” said Markus Willems, senior product marketing manager, processor solutions at Synopsys. “That must be translated into an architecture. Then, you need to give it to the hardware designers, along with the software development tools for that processor — because in the end it’s a processor, so your end users expect to have a simulator and debugger and compiler and everything. That’s already four different groups that you need to bring together.”

This has a big impact on design methodologies. “Adopting an Agile approach is crucial for a domain-specific processor, because you need to make sure that you’re always verifying it against your domain,” Willems said. “It’s simple for controllers. Everybody knows RISC, but probably in the first years of RISC we also went through this process. Fortunately, by now everybody knows what it is supposed to do, so then it’s about squeezing out the micro-architecture. Now, with a domain-specific processor you’re back into the step of innovation. If you innovate, there is uncertainty. You want to make sure you bring it back so that all the necessary disciplines are accounted for, because that’s what enables engineering teams to innovate.”

Fig. 1: A multidisciplinary Agile design flow. Source: Synopsys

But making that work also requires a number of tradeoffs, particularly around power and performance. Both of these are key to the user experience, and they are foundational to the whole domain-specific approach. “While lower-cost compute capabilities and wireless connectivity enable this mega trend in the market today of pushing more decision making out to the edge, the enhanced user experience is what this edge compute capability enables,” said James Davis, director of marketing at Infineon Technologies. “This is the value to the end-user.”

Examples of this could be in security system applications, where the edge compute capability needs to distinguish humans from animals, or voice wake-word detection in other applications. Davis noted that to optimize power and performance in these applications, systems architects and designers now can utilize autonomous peripheral control techniques. To make that as efficient as possible, they can determine which portions of a processor are active at different times in a particular use case. That minimizes the overall system power, while allowing higher-power portions of the system to be turned off or put into sleep mode.

There are other benefits, too. “This is yet another aspect of the design that systems designers need to consider and optimize,” he said. “With these new connected edge applications, we’re always being watched with battery-operated cameras or listened to with smart speakers from various suppliers or with connected sensors. They’re monitoring our steps, health, heart rate, etc. While enhanced edge compute capabilities enable more distributed computing, they also enable these advanced user experiences while protecting our privacy. This means instead of constantly sending this data to the cloud for processing, where the data could be compromised in transit or in the cloud, these systems can run the complex DSP/machine-learning algorithms at the edge and never send private data to the cloud. So, in addition to power and performance tradeoffs, an aspect of performance that affects the user experience is privacy and security that also needs to be considered.”

Further, Andy Heinig, head of the Efficient Electronics department at Fraunhofer Institute for Integrated Circuits IIS, noted that reaching increasingly challenging power goals requires optimization at multiple levels, from the transistor and gate level up to the system level. “At the system level, the optimization can be done on the part of the software, but also on the part of the hardware architecture. The biggest performance gains can be realized by hardware-software co-optimization. It allows the design of domain-specific processing solutions. But to realize the greatest performance gain, it is not enough to look at the system level, because some relevant impacts can only be seen at the transistor level (e.g., power, performance, area). That is why models with very good correlation between the different levels, as well as with power, performance, and area aspects, will be necessary in the future.”

Advanced packaging
One option that is gaining traction is advanced packaging. Already, some of the most advanced chips from Apple, Intel, and AMD are using this approach to customize their designs. Intel and AMD are using chiplets to make that whole process simpler.

“There’s a difference between the traditional multi-chip module (MCM) and the goal that people are shooting for now with chiplet heterogeneous integration,” said Marc Swinnen, director, product marketing for the semiconductor business unit at Ansys. “With MCMs, a standard chip is taken out of its package and placed with multiple other chips on a dense substrate, such as ceramic, laminate, or silicon interposer. However, the way the chiplet effort is progressing, a lot of the savings are in power and speed. When you break a chip apart into heterogeneous blocks, you’re going to pay a penalty in area, obviously, but you also want to avoid having to pay the penalty in speed and power.”

The way to do that is to remember that chips today have big I/O drivers that go off chip. “These chips have to drive a PCB trace, and PCB traces are big and long, have to take ESD shocks, and support a built-in external interface,” Swinnen said. “If you’re going to mount the chip on a substrate, where the wires are really short and really thin, you actually can get rid of these big I/O drivers and use much higher-speed, higher-throughput, low-power drivers that have an extremely short range and can’t drive much of a signal. And the good news is you don’t need much on these chiplets. That’s the promise — you can get rid of this entire power ring and get high-speed, low-power communication into chips, and make the cost of breaking them apart bearable.”

The downside is that these individual chiplets, interconnects, and packages have to be designed especially for this purpose, so they aren’t really usable for anything else. It may take a while, and a number of standards, before the industry has enough critical mass to make this work on a multi-vendor scale.

“If you look at where this has been implemented today by Nvidia, by Intel, by Qualcomm, they do have domain-specific versions of those chips, but they’ve been designed in a vertically integrated manner such that they design the package, the chips, the whole system,” he said. “It’s all one company and that works. But with an open market, where you take a random chip off the shelf and think it’s going to work across a number of applications, we’re not there yet. Also, to what degree will the chip need to be designed for its specific packaging environment versus the generic environment that we have today?”

The chip industry is just beginning to wrestle with all of the possible options to maximize performance for the minimal amount of energy for specific applications. That’s a system-level challenge, and it’s one that spans everything from a distributed supply chain to more tightly integrated hardware and software.

“You can come up with the most sophisticated architecture that has great ways of reducing power, but if your software team has to program everything in assembly, they will not take advantage of all the capabilities,” said Synopsys’ Willems. “It would be like giving them a race car while they’re taking driving lessons. That’s really the challenge — to keep programmability in mind from a software developer’s perspective.”

At the same time, architects and designers need to remember that domain-specific computing places stress on lots of other things surrounding the design. “The processing gets so much faster that the I/O becomes even more important,” said Rambus’ Woo. “Memory, chip-to-chip communications, and the concepts seen in things like HBM memory where you stack to try and improve the power efficiency — all of this argues for things like moving toward stacked architectures, shorter distances, and extremely power-efficient I/Os. This means memory interfaces are going to be that much more important, and that’s going to be true anywhere at the edge. It’ll be true at the end points. You’ll see a lot of people doing things with on-chip memory, on-chip caches, and related things, if they can. If they can’t, stacking would be probably the best way to address the problem. You really don’t have many choices. You really don’t have many ways out of the box, so you’ve got to figure out how to change the architecture to take advantage of those things.”

But this isn’t an option. It’s something all design teams will have to wrestle with.

“Facebook is making its own chips, along with Google, Amazon, and others,” said Ansys’ Swinnen. “Everybody’s seen that if you don’t have control over that hardware, and you fall back on generic solutions, the other guys will just pass you by.”
