New Architectures, Approaches To Speed Up Chips

Metrics for performance are changing at 10nm and 7nm. Speed still matters, but one size doesn’t fit all.

The need for speed is back. An explosion in the amount of data that needs to be collected and processed is driving a new wave of change in hardware, software and overall system design.

After years of emphasizing power reduction, performance has re-emerged as a top concern in a variety of applications such as smarter cars, wearable devices and cloud data centers. But how to get there has changed significantly. In the past, increased density was the preferred method of delivering power and performance improvements. In effect, the cheapest solution was to throw more transistors and memory bits at a problem. That is no longer the case.

Even 10nm and 7nm may be a stretch for many chipmakers. After that, the semiconductor road map is hazy, partly due to physics, and partly because there is skepticism that enough companies will be able to afford to develop it. As a result, chipmakers are examining new hardware and software architectures, machine learning, and better data throughput both inside and outside of devices. And they are doing this on a market-by-market basis, because with limited power budgets, one size is no longer ideal for all applications.

“It’s all about how you optimize bandwidth and latency,” said Mark Papermaster, chief technology officer at AMD. “How big is your pipe feeding those engines? How fast can you move data in and out of those engines? And you have to design a balanced machine. That extends outside the chip, as well, to how you connect to the rest of the world. It’s the same thing on memory and I/O. You have to have enough pipes, or bandwidth, to optimize your latency to ensure that you don’t create bottlenecks.”

Those same principles apply, whether it is a blazing fast computer or a wearable device. But this also creates a quandary for semiconductor companies. If they develop custom solutions for specific markets, it becomes more difficult to define performance, and possibly more difficult to prove its value.

“Performance is always in context,” said Chris Rowen, a Cadence fellow and CTO of the company’s IP Group. “So what can you get into a laptop, a phone, the mirror of a car, or a Fitbit? Each of those is a compute-bound problem and it has to fit into a certain form factor. But there will be extreme 1-watt, 10-watt and 100-watt computing. There will be an extreme part at every level.”

He noted that will lead to new architectures. “Some problems are memory-bound. That raises a question of how you get larger, de facto shared memory at the same power level. There is ample reason to push Moore’s Law even further. We will need more compute power in a Fitbit. So pushing to smaller geometries will go on, but we will not rely as much on that. In the past, it was all about smaller, faster, cheaper transistors. There will be a slow shift toward architectural innovation.”

Signal speed
There are several key challenges that need to be addressed on the hardware side to improve performance at any node. One is faster signal speed without compromising signal integrity, which can be affected by a variety of factors such as memory congestion, thermal noise and process variation.

These issues become particularly problematic at advanced nodes because wires don’t scale. If you think of a wire like a pipe for electrons, shrinking the diameter of the pipe makes it harder to push something through it. At 10nm and 7nm the wires are so narrow that they require more power to push electrons through them, which in turn generates more heat. That changes the power/performance equation, forcing chipmakers to plan carefully how the wires are routed in a design and to weigh the tradeoffs.
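
A rough back-of-the-envelope sketch shows why. A wire’s resistance is roughly R = ρL / (W × H), so shrinking the cross-section drives resistance, and therefore I²R power, up. The dimensions below are invented round numbers for illustration, not real 10nm/7nm process figures:

```c
/* Illustrative only: why narrower wires cost more power.
 * Dimensions are made-up round numbers, not real process data. */
#include <stdio.h>

static double wire_resistance(double rho, double length,
                              double width, double height) {
    return rho * length / (width * height);   /* R = rho * L / (W * H) */
}

int main(void) {
    const double rho    = 1.7e-8;   /* copper resistivity, ohm-meters */
    const double length = 100e-6;   /* 100 um routing segment */

    /* Hypothetical cross-sections for an "older" and a "scaled" node. */
    double r_old = wire_resistance(rho, length, 40e-9, 80e-9);
    double r_new = wire_resistance(rho, length, 20e-9, 40e-9);

    printf("older node:  %.0f ohms\n", r_old);
    printf("scaled node: %.0f ohms (%.1fx higher)\n", r_new, r_new / r_old);
    return 0;
}
```

Driving the same current through the narrower wire dissipates proportionally more power as heat, which is the tradeoff routers now have to plan around.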

“If you can improve the power integrity, it frees up resources for routing,” said Tobias Bjerregaard, CEO of Teklatech. “So now you need a full picture in order to optimize a design so you don’t constrain it in any direction. That requires a huge investment to be able to work smarter, and you need a high volume before you see any economic benefit from scaling. So only the big guys with sufficient volumes are going to harvest benefits from scaling that far. But benefits also were being left on the table when scaling from 65nm to 40nm, and even down to 28nm. So there is room for improvement everywhere.”

A second challenge involves parallelism and coherent distribution of processing, which has long been a hard problem outside of corporate applications built on database platforms and some “embarrassingly parallel” tasks such as video rendering or scientific computation. Adding parallelism has a renewed sense of urgency at advanced process nodes, though, because even with very expensive power management, individual cores run too hot. Spreading the processing across cores lowers the heat, but that adds challenges in getting all of the cores to work in unison.
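
A minimal sketch of the easy end of that spectrum is below: an embarrassingly parallel summation split across POSIX threads, where each worker touches only its own slice. The thread count and data size are arbitrary choices for illustration; coordinating cores that must share state is where the real difficulty begins.

```c
/* Sketch: split an embarrassingly parallel sum across worker threads. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_THREADS 4
#define N (1 << 20)

static double data[N];

struct slice { size_t begin, end; double partial; };

static void *sum_slice(void *arg) {
    struct slice *s = arg;
    double acc = 0.0;
    for (size_t i = s->begin; i < s->end; i++)
        acc += data[i];
    s->partial = acc;   /* each thread writes only its own struct */
    return NULL;
}

int main(void) {
    for (size_t i = 0; i < N; i++)
        data[i] = 1.0;

    pthread_t tid[NUM_THREADS];
    struct slice work[NUM_THREADS];
    size_t chunk = N / NUM_THREADS;

    for (int t = 0; t < NUM_THREADS; t++) {
        work[t].begin = t * chunk;
        work[t].end   = (t == NUM_THREADS - 1) ? N : (t + 1) * chunk;
        pthread_create(&tid[t], NULL, sum_slice, &work[t]);
    }

    double total = 0.0;
    for (int t = 0; t < NUM_THREADS; t++) {
        pthread_join(tid[t], NULL);
        total += work[t].partial;
    }
    printf("sum = %f\n", total);   /* expect 1048576.0 */
    return 0;
}
```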

“You can distribute processing if it fits in the energy budget and you can afford the communication,” said Cadence’s Rowen. “What’s changed is that rather than re-using existing algorithms, we are starting fresh, and the new algorithms have more parallelism than in the past.”

This is the reason that heterogeneous cache coherency has been getting so much attention lately from companies such as Arteris and NetSpeed Systems. If the cores are sized differently, they can still get more work done than a single core, but they can do it with fewer watts than if all of the cores are identical.

“We’re seeing improvements coming from designing a new architecture at the same nodes, as well,” said Charlie Janac, chairman and CEO of Arteris. “After 14nm, the only thing that scales is density. Power, performance and cost per transistor do not scale, so you need to pay more attention to how you design chips to get a competitive advantage.”

Better software
One of the biggest opportunities for improving performance, as well as reducing power, is on the software side.

Software typically is viewed as a stack. At the bottom are embedded drivers, low-level code that interfaces directly with the hardware and tells it to perform specific functions, such as turning on a block or reducing power when the temperature reaches a certain threshold. Above that are the operating system and middleware layers, and on top of that are the applications. And all of this historically has been developed around general-purpose hardware, with most of the performance gains coming from choosing off-the-shelf or partially customized accelerators.
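
What that bottom layer does can be sketched in a few lines of C. The register address and bit layout below are invented for illustration and do not correspond to any real SoC; on real silicon these come from the chip’s memory map and reference manual:

```c
/* Hypothetical bare-metal driver sketch: poke a memory-mapped control
 * register to power a block on, or drop it to a low-power state when
 * a temperature limit is hit. Address and bits are made up. */
#include <stdint.h>

#define BLOCK_CTRL_REG   ((volatile uint32_t *)0x40001000u)  /* invented address */
#define BLOCK_ENABLE_BIT (1u << 0)
#define BLOCK_LOWPWR_BIT (1u << 1)

void block_power_on(void) {
    *BLOCK_CTRL_REG |= BLOCK_ENABLE_BIT;          /* turn the block on */
}

void block_thermal_throttle(uint32_t temp_c, uint32_t limit_c) {
    if (temp_c >= limit_c)                        /* over the limit: back off */
        *BLOCK_CTRL_REG |= BLOCK_LOWPWR_BIT;
}
```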

What’s changing is that hardware increasingly is purpose-built for a specific end market application. As a result, the software can play a role in defining that hardware specification, rather than being built to an existing spec. Hardware still has to deal with physics and work with tools that chipmakers have already purchased, so truly software-defined hardware may be something of an overstatement in most cases. But the software definitely is gaining influence throughout the process, in part because it is much easier to modify and fix, even after a device ships.

The result is much more tightly integrated hardware and software, with much less bloated code. This is particularly important in complex systems with many cores, where there’s a need to drive value from the advanced features provided by the architecture. One solution is to make use of performance monitoring units provided by the hardware.

“The key is to make the most efficient use of hardware resources,” said Paul Black, product manager at ARM. “This can have a significant impact on the run-time and power consumption of code. A typical example with configurations such as ARM’s big.LITTLE is to make sure that software runs on the correct core for its needs. The architecture of both the big and LITTLE cores looks identical to the software, and the cores have identical views of memory, but one is optimized for performance while the other is optimized for efficiency. Another example might be load-balancing between a CPU and GPU. The challenge is to ensure that code runs on the most appropriate cores. This can present opportunities for significant performance improvements.”
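
One blunt way to illustrate the idea of steering work to the right core is thread affinity on Linux, although in practice the scheduler (energy-aware scheduling, for example) normally makes this decision, and this is not ARM’s recommended mechanism. The big-core IDs used below are an assumption that varies from SoC to SoC:

```c
/* Sketch: pin a demanding thread to assumed "big" cores (IDs 4-7)
 * on a Linux big.LITTLE system. Illustration only; core numbering
 * differs per SoC and the OS scheduler usually handles placement. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int pin_to_big_cores(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 4; cpu <= 7; cpu++)   /* assumed big-core IDs */
        CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);  /* 0 = calling thread */
}

int main(void) {
    if (pin_to_big_cores() != 0)
        perror("sched_setaffinity");
    /* ... run the performance-critical work here ... */
    return 0;
}
```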

Black said that significant improvements also might be found when analyzing areas such as cache performance using hardware performance counters. “One of the examples that we ship with our performance analyzer is a simple program that sums the elements of an array. When the program sums the values by working along the rows of the array, cache use is very efficient and there are few cache reloads. However, when summing values by working down the array columns, the cache needs to be reloaded very frequently. It’s an example of two pieces of code with exactly the same result showing big differences in execution time and power consumption.”
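
The effect Black describes is easy to reproduce. The sketch below is not ARM’s shipped example, just a generic C version of the same idea: both functions compute the same sum, but C stores arrays row-major, so the column-order walk strides through memory and reloads cache lines far more often.

```c
/* Same result, very different cache behavior. */
#include <stdio.h>

#define ROWS 4096
#define COLS 4096

static float a[ROWS][COLS];

static double sum_by_rows(void) {
    double s = 0.0;
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            s += a[r][c];          /* sequential addresses: cache-friendly */
    return s;
}

static double sum_by_cols(void) {
    double s = 0.0;
    for (int c = 0; c < COLS; c++)
        for (int r = 0; r < ROWS; r++)
            s += a[r][c];          /* large strides: frequent cache reloads */
    return s;
}

int main(void) {
    printf("%f %f\n", sum_by_rows(), sum_by_cols());
    return 0;
}
```

Hardware performance counters (cache refills, in this case) make the difference between the two loops visible without guessing.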

Andrew Caples, senior product manager for the Nucleus product line at Mentor Graphics, agreed. “Everything is metamorphosing into multicore, even MCUs. We’re being asked about multicore, multiple operating systems, and about how to separate those two—particularly on devices with limited resources. And it’s not just for one market. It’s across the board.”

Caples said that includes markets such as consumer and wearables, as well as automotive. “ADAS is the crème de la crème of embedded applications. There are lots of processing units, and you can run complex algorithms or simple algorithms on them with less mean time to failure, lower bill of materials cost, and less heat dissipation. But it also adds to the complexity, because now you’re dealing with multiple OSes. Now you have to develop and debug on a single SoC package.”

Debugging the software running on modern, multi-core processors presents a rich set of challenges. ARM’s Black pointed out that hardware and software complexity tends to rise over time. “Rising complexity of debug and trace infrastructure, increasing core counts, aggressive power management, and complex reset functionality can all be a nightmare for the debugger because every implementation is different. The only real way to address this in the debugger is with a functional and flexible scripting API that can customize and extend debugger functionality. However, debugger usability is also a key concern, as you very quickly accumulate a large library of complex and configurable scripts. So a good script management system which enables you to easily understand, manage, and share extensive script libraries, is essential.”

Machine learning
One of the new and promising areas in developing faster chips is machine learning. Rather than hand-coding every function, developers train algorithms to optimize a system for specific use cases. In effect, you optimize the system rather than just the software running on it.

“The goal is highest performance without manual customization,” said Anush Mohandass, vice president of marketing at NetSpeed Systems. “The way to achieve that is machine learning. If you can throw a lot of training data at the device, that’s ideal. So you develop libraries and algorithms. But it’s a lot more understanding of different layers. It’s not just about operations per second.”

That raises additional issues about benchmarking for the future, though. How a chip performs when it adapts to a specific use case may be different from one device to the next, or one user to the next, and it can change over time. “This is a lot more nuanced performance,” said Mohandass. “You have to understand how to train a machine, which means you really need to understand what the problem is and train it accordingly. The state of the art today is still one size fits all. We are insisting that it’s configurability, and we’re fighting that battle today.”

NetSpeed isn’t alone in this fight. Drew Wingard, CTO of Sonics, said there is ample opportunity to learn from different forms of computing and to apply it to new designs. “Whether it’s machine learning or ambient intelligence, there are lots of different forms of computing to mine. That could include little computations over a massive number of nodes in a data center. Computer vision is the one that everyone talks about, and there have been spectacular results. But the model is all about delivering ever-increasing numbers of operations per second for the same amount of power or at an energy budget. There is creativity from the architectural level going on now, and the metrics are based on how many operations are needed to get that done or how much data you can stream through.”

Faster tools
All of this bodes well for EDA vendors, of course. More complex designs with more features require more simulation of more things up front, and more debugging and verification on the back end of the design process. And all of this adds up to demand for more powerful tools.

“Historically, EDA has been about 1.5% to 1.8% of semiconductor revenue,” said Antun Domic, executive vice president and general manager of the design group at Synopsys. “As the amount of work we are doing with key customers—foundries and other partners—increases, EDA will accelerate as opposed to the large, generic market. We need to make sure we can design super-advanced chips. At the same time, at established nodes, there are a lot of designs being done at 28, 40, or 65nm. We have to make sure we have enough of a supply of those chips, so we have to improve productivity there.”

Sales of emulation, which is simulation’s modern follow-on, have been rising consistently for the past several years as conventional simulators run out of steam. So has integrated simulation for mechanical and electrical engineering, such as Ansys’ multi-physics simulation. And in the future, as more machine learning is implemented, that will help develop faster solutions targeted at even narrower markets more quickly.

Conclusions and questions
While speed is back in demand, getting more speed out of designs without increasing the power budget is getting harder at the most advanced nodes. As a result, there is a growing emphasis on solving problems with architectures and microarchitectures, and by using new techniques such as machine learning.

But all of this adds a number of question marks into design, as well. For example, will a device that is designed from the ground up and modified for a specific use case be recognized as a better use of resources by broad markets? Will these approaches be as reliable or more reliable than standardized approaches? Will they be able to tip the value proposition to customers in favor of higher selling prices? And how exactly do we distinguish a good customer experience from a bad one?

At this point there are a lot of questions and very little data. And the fear among some executives is that as the semiconductor industry moves down this path, there will  be many more questions that go unanswered for years to come.

2 comments

markus says:

Your article aligns with the approach that we’re taking to benchmark development. Previous generations of benchmarks were targeted at the raw performance of the processor. Benchmarks now target systems, which attempt to measure the performance (and energy) of the processor plus software; the two are inseparable. Here, software can be the operating system, drivers, protocol stacks, etc.

everest333 says:

“…Better software
One of the biggest opportunities for improving performance, as well as reducing power, is on the software side….”

May’s Law states, in reference to Moore’s Law:
“Software efficiency halves every 18 months, compensating Moore’s Law.”

aka Professor David May, of transputer fame
https://www.cs.bris.ac.uk/~dave/index.html
