Hardware/Software Tipping Point

Has the tide turned from increasingly general-purpose, software-defined products to a world where custom hardware makes a comeback?


It doesn’t matter whether you believe Moore’s Law has ended or is just slowing down. It is becoming very clear that design in the future will be significantly different than it is today.

Moore’s law allowed the semiconductor industry to reuse design blocks from previous designs, and these were helped along by a new technology node—even if it was a sub-optimal solution. It lowered risk and the technology node provided performance and power gains that enabled increasing amounts of integration.

In the early 1990s, former Sony CTO Tsugio Makimoto made an observation that electronics cycled between custom solutions and programmable ones approximately every 10 years. This became known as Makimoto’s Wave. 2017 is the next crossing point in his prediction, and is meant to migrate us from highly flexible SoCs to more standardized integrated devices.

For many chips today, large parts of the chip remain dark. There is a finite amount of heat that can be removed from the die, and this limits the amount of power that the chip can consume. This has led to a significant increase in the time and attention paid to low-power design techniques, but so far most of this effort has been in the lower levels of design. In order to make further gains and to increase the percentage of the chip that can be doing useful work, higher-level design decisions need to be examined and architectural changes made.

Makimoto may well be right. If so, we will start to see additional functions being pulled out of general-purpose processors and implemented as high-performance, lower-powered blocks that are more customized for the task. Will that mean dedicated hardware, or just more specialized processors?

Slow turn or landslide?
“More and more we are seeing people build things that have a more focused system model in mind,” says Drew Wingard, CTO at Sonics. “We see targeted chips that are doing just neural network inferencing, or at the edge we see IoT devices that may power a smart watch. These have a more constrained set of system requirements at least with respect to the non-CPU resources.”

But this is not likely to be a landslide. “We will not get as big gains for the new nodes compared to previous ones because we are not reducing the voltage as much,” says Pulin Desai, marketing director for Cadence’s Tensilica Vision DSP. “There is still some gain, and new materials are helping. We also can look to multiple chips being interconnected rather than being on a single die. So there is room to go using process technology. It also depends on the segment you are working in where people are looking at different ways to utilize the available technology.”

Legacy means that this transition will be like steering the Titanic. “A lot of SoCs are built to give the software an easier task,” points out Doug Amos, ASIC design product marketing manager for Mentor, a Siemens Business. “You put more processors in, little processors, more caches, and what you are trying to do is to use a hardware solution to compensate for software being inefficient. Software has tended toward fatware, and this has been going on for way too long. We are compensating for software inefficiency by building better hardware.”

Keeping silicon dark
Most of the time, dark silicon is talked about as being a bad thing because it is an inefficient use of chip area and resources. But is it fair to assume that all parts of a chip should be used all of the time?

“We see, in extremely energy constrained applications such as those running off batteries, that there is a strong desire to be able to have some reptilian brain operating on the chip all the time, even when the main processor may not need to be powered,” explains Wingard. “In those systems, the standard action is to put a much smaller, lower-power-and-performance processor on the chip as a system controller that is responsible for maintenance of the chip in those modes. This includes monitoring the state of the sensors in the system and trying to decide when the time is right to wake up the real machine.”

Wingard believes this strategy can be taken even further. “We advocate that people actually shut that off, as well, and let the energy processor in the network fabric be the brain. That implies some extra complexity and an extra level of partitioning in the system. The role of the operating system’s (OS) power management shifts a bit and then the OS responsibility becomes setting up the policy choices. Then you can move the actual management of the power states into hardware.”
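The split Wingard describes can be sketched as a toy model: the OS sets the power-state policy once, while an always-on hardware manager (modeled here as a class) applies it autonomously. All names and thresholds below are illustrative assumptions, not any vendor's actual interface.

```python
class HardwarePowerManager:
    """Models an always-on controller in the network fabric."""

    def __init__(self):
        self.state = "sleep"          # main processor powered down
        self.policy = {}              # filled in once by the OS

    def set_policy(self, wake_threshold, sleep_timeout):
        # The OS's only job: choose the policy, not manage transitions.
        self.policy = {"wake_threshold": wake_threshold,
                       "sleep_timeout": sleep_timeout}

    def on_sensor_event(self, activity_level):
        # Hardware decides, with no software in the loop,
        # when to wake the "real machine."
        if self.state == "sleep" and activity_level >= self.policy["wake_threshold"]:
            self.state = "awake"
        return self.state

    def on_idle(self, idle_cycles):
        if self.state == "awake" and idle_cycles >= self.policy["sleep_timeout"]:
            self.state = "sleep"
        return self.state

pm = HardwarePowerManager()
pm.set_policy(wake_threshold=10, sleep_timeout=1000)
assert pm.on_sensor_event(3) == "sleep"    # below threshold: stay asleep
assert pm.on_sensor_event(42) == "awake"   # interesting event: wake up
assert pm.on_idle(2000) == "sleep"         # long idle: back to sleep
```

Once the policy is latched, every transition happens without firmware involvement, which is the point of moving power-state management into hardware.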

Fundamentally there are two ways to control power. “Software creates runtime variation in the thermal profile of chips,” says Oliver King, CTO for Moortec. “This makes it difficult to predict, at design time, the thermal issues unless the software is already well defined. There are a couple of approaches to this problem. The first is to have hardware which can sense and manage its own issues. The second is for software to take into account data from thermal and voltage sensors on die.”
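King's second approach, software reading on-die sensors and adjusting the operating point, can be sketched as a simple throttling loop. The frequency steps and temperature thresholds below are invented for illustration.

```python
FREQ_STEPS_MHZ = [400, 800, 1200, 1600]   # assumed DVFS operating points
T_THROTTLE_C = 85.0                        # assumed throttle threshold

def next_frequency(current_mhz, die_temp_c):
    """Step frequency down when hot, back up when there is headroom."""
    idx = FREQ_STEPS_MHZ.index(current_mhz)
    if die_temp_c > T_THROTTLE_C and idx > 0:
        return FREQ_STEPS_MHZ[idx - 1]     # throttle one step
    if die_temp_c < T_THROTTLE_C - 10 and idx < len(FREQ_STEPS_MHZ) - 1:
        return FREQ_STEPS_MHZ[idx + 1]     # recover, with 10 C of hysteresis
    return current_mhz

assert next_frequency(1600, 90.0) == 1200  # too hot: throttle
assert next_frequency(800, 60.0) == 1200   # cool again: step back up
```

The hysteresis band keeps the loop from oscillating around the threshold; a hardware implementation of King's first approach would close the same loop without software in the path.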

Many in the industry are frustrated with how few of the power-saving features that have been designed into chips actually get used by software. “A lot of chips that have aggressive power management capability require so much firmware to be able to turn them on and off that they never get to use the features – they never have time to write this firmware,” adds Wingard.

Adding accelerators
The migration of software into hardware is usually accomplished with the addition of accelerators. These could be dedicated hardware, or more optimized programmable solutions that are tailored to specific tasks. These include DSP, neural networks and FPGA fabrics.

Cadence’s Desai provides one example: “For imaging you have a Bayer camera, and the first level of processing—which converts Bayer patterns into RGB—has been optimized in hardware. If you can do something in hardware, then this is the best way. You get the best speed, the lowest gate count, the lowest power. People have perfected that over a long period of time. But after that, if you want any kind of imaging algorithm, then everyone has a different idea or different things they want to do, and thus they want a programmable machine. You could run that on a CPU or GPU, but that is not the most power-efficient solution. So they look toward a DSP that is more power-efficient and higher-performance.”
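The fixed first stage Desai mentions is regular enough to see why it gets frozen into hardware. Here is a naive 2x2 binning demosaic, assuming an RGGB tile layout (a deliberately simplified stand-in for real demosaic pipelines):

```python
def demosaic_rggb(bayer):
    """Collapse each 2x2 RGGB tile into one (R, G, B) pixel."""
    h, w = len(bayer), len(bayer[0])
    out = []
    for y in range(0, h, 2):
        row = []
        for x in range(0, w, 2):
            r = bayer[y][x]
            g = (bayer[y][x + 1] + bayer[y + 1][x]) / 2   # average both greens
            b = bayer[y + 1][x + 1]
            row.append((r, g, b))
        out.append(row)
    return out

tile = [[100, 60],
        [40, 20]]                          # one RGGB tile
assert demosaic_rggb(tile) == [[(100, 50.0, 20)]]
```

The same few arithmetic operations run on every tile with no data-dependent control flow, which is exactly the profile that rewards a hardwired implementation; everything after this stage varies by customer, which is where the programmable DSP takes over.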

There are a number of different kinds of accelerators, each with its own set of attributes. But overall, the intent and utility of accelerators is the same. “It makes sense to deploy accelerators when the algorithms are well enough understood that you can make use of that hardware effectively,” points out Wingard. “Then we can do things with less energy.”

Sometimes, a standard becomes so important within the industry that custom hardware also becomes the right choice. An example of this is the H.264 video compression standard. “It would be foolish to do this with a programmable solution,” adds Desai.

It is also possible to generalize multiple standards into a class of operations and to create a partially optimized solution for them. “Audio and voice may be using 24-bit processing versus baseband, which may be doing complex math necessary for complex FFT and FIR,” explains Desai. “For vision we have different precisions, such as 8-bit or 32-bit fixed point math. We can do scatter/gather algorithms, and need to perform ALU and multiply operations at the same time. You are looking at vertical applications and optimizing the instruction set, the SIMD bits and the data width.”
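Two of the vision-DSP operations Desai lists, scatter/gather addressing and simultaneous multiply-and-ALU work, can be modeled in a few lines. This is a scalar illustration of what a specialized SIMD instruction set would do across many lanes at once; the function names are mine.

```python
def gather(memory, indices):
    """Scatter/gather addressing: fetch from arbitrary offsets in one op."""
    return [memory[i] for i in indices]

def mac(vector, coeffs, acc=0):
    """Multiply and ALU (add) in the same step, lane by lane."""
    for v, c in zip(vector, coeffs):
        acc += v * c
    return acc

memory = [0, 10, 20, 30, 40, 50, 60, 70]
taps = gather(memory, [1, 3, 5])           # fetch 10, 30, 50
assert mac(taps, [2, 2, 2]) == 180         # (10 + 30 + 50) * 2
```

Optimizing a vertical application means deciding how wide these operations should be (the SIMD bits), what precisions they must support, and whether gather and MAC can issue in the same cycle.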

Then there is the emerging area of neural networks. “While neural networks are not standardized, you know from a high level that there are standard ways of doing things,” says Desai.

Wingard goes on to explain that “the inner loop of a neural network looks like a matrix multiply and a non-linear function that determines what I do with the result of that matrix multiply. At the lowest level it is very generic. If you go up one level, there is a network topology that is implemented. We have logic connections to the plane before, and the one after. As you try to take advantage of knowledge of those network topologies you make choices about wiring resources. You may put bounds on the maximum connectivity. The interesting part of building these neural networks is deriving the topology. The learning process to get the weights and the connections is computationally intensive, but the way to get better results is by coming up with better topologies.”
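The inner loop Wingard describes is small enough to write out: a matrix multiply followed by a non-linearity (ReLU here, one common choice). One fully connected layer, no framework:

```python
def relu(x):
    # Non-linear function applied to each matrix-multiply result.
    return x if x > 0 else 0.0

def layer(weights, inputs):
    """One layer: per-neuron dot product, then the non-linearity."""
    return [relu(sum(w * x for w, x in zip(row, inputs))) for row in weights]

W = [[0.5, -1.0],
     [1.0,  1.0]]
assert layer(W, [2.0, 3.0]) == [0.0, 5.0]   # -2.0 clipped to 0 by ReLU
```

The generic part is this loop; the differentiating part, as Wingard notes, is the topology, i.e. how many such layers exist, how wide they are, and which outputs feed which inputs.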

The world would be a lot simpler and more power efficient if there were not so many competing standards.

Embedding FPGAs
FPGAs have been used for a long time as co-processors, but they require a coarse-grained approach to partitioning due to the high latency between the processor and the FPGA, which resides in a separate chip. But that is changing.

“We have seen specific FPGA architectures that had arrays of ALUs, both academically and commercially, to do software offload,” says Amos. “Having it integrated into the SoC means that you will use more silicon area compared to hardware, but you do retain the ability to change your mind, which is the big advantage that software has over hardware. You still have to be able to do it economically. You buy this freedom, and the price in this case is silicon area. But it buys you a lot of choices.”

And for some applications, the economics do point in the right direction. “To maximize battery life, it is important to maximize computation achieved per Joule,” says Tony Kozaczuk, director of architecture solutions at Flex Logix. “We performed an analysis of the energy requirements for several DSP applications using an embedded FPGA compared to an ARM processor with memory accesses. Our findings showed that using embedded FPGA to offload some DSP algorithms took up to 5X fewer clock cycles and consumed less energy than an ARM Cortex M4F for those functions.”
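A back-of-the-envelope version of that comparison shows why fewer cycles can dominate. Every number below is invented for illustration, not a Flex Logix measurement; only the "up to 5X fewer cycles" ratio comes from the article.

```python
def energy_joules(cycles, freq_hz, power_watts):
    # Energy = power * time, with time = cycles / frequency.
    return power_watts * cycles / freq_hz

cpu_cycles = 1_000_000
efpga_cycles = cpu_cycles / 5            # 5X fewer cycles for the offloaded kernel

# Assumed operating points: 100 MHz for both, eFPGA drawing more power.
cpu_energy = energy_joules(cpu_cycles, freq_hz=100e6, power_watts=0.05)
efpga_energy = energy_joules(efpga_cycles, freq_hz=100e6, power_watts=0.08)

# Even at higher assumed power, fewer cycles can win on energy.
assert efpga_energy < cpu_energy
```

Under these assumptions the offloaded kernel finishes in a fifth of the time, so the energy (power times time) drops even though instantaneous power went up; that is the computation-per-Joule argument in miniature.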

There are other advantages, as well, for eFPGAs. “Compared to standalone FPGAs, eFPGAs offer multiple orders of magnitude lower interface latency and higher bandwidth,” says Steve Mensor, vice president of marketing at Achronix. “These are critical parameters when FPGA technology is used for hardware accelerators like data center web acceleration and machine learning, automotive sensor fusion, Mobile Edge Compute (MEC), and 5G wireless infrastructure.”

One of the advantages of including eFPGAs is that many of these markets are still nascent, so changes are likely over the life of a product because standards are still being defined. Being able to program the hardware adds an element of “future proofing” and flexibility. Traditionally this has been done with software, which is slower and less power-efficient, or with a discrete programmable logic chip, which adds to the cost.

These devices may be a new contender for that middle ground solution. “To have FPGA as a middle ground between the hardware and software is intriguing, especially when embedded into the SoC,” says Amos. “Someone has to be willing to pay for the extra area. There is also the tradeoff between NRE and chip costs. You may save the development costs of a new chip, which is not an inconsiderable savings, but each chip will cost more.”

What should be accelerated?
Deciding which software functions should be migrated into hardware or specialized processors is not an easy task. For one thing, the industry lacks the tools to make this easy. “You have to be able to measure it,” exclaims Amos. “Without this you cannot even measure how efficient the hardware/software combination is. What gets measured gets fixed.”

Amos explains that we need a new way to measure how this combination is performing in the real world. One option is to build the chip and measure it, but it would be much better to do this before the silicon has been fixed. “There are tools that look at hardware and can optimize it just by looking at it statically,” he says. “If we go from static to dynamic analysis, you can run vectors and find both the peaks and averages and that provides closer to reality measurements. But this is still south of where you need to be for software profiling, even though it helps a great deal.”
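The static-to-dynamic step Amos describes boils down to running vectors and reporting both the peak and the average, rather than a single static estimate. A minimal sketch, with a made-up per-cycle power trace:

```python
def profile(power_trace_mw):
    """Return (peak, average) power from a per-cycle trace, in mW."""
    peak = max(power_trace_mw)
    avg = sum(power_trace_mw) / len(power_trace_mw)
    return peak, avg

trace = [12, 15, 90, 14, 13, 11]          # one burst in an otherwise quiet run
peak, avg = profile(trace)
assert peak == 90                          # what the power grid must survive
assert round(avg, 1) == 25.8               # what the battery actually sees
```

The gap between the two numbers is the information a static tool cannot produce: a design sized only for the average would brown out during the burst, while one sized only for the peak would look wasteful on paper.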

Others are using FPGA prototypes to measure how efficient the software is so that decisions can get made. “We need a model with just enough accuracy to fool you into thinking you are running on silicon,” concludes Amos. “As you go up the software stack, this should be easier and easier to do. As you go down the stack, almost to RTL, then you need more accuracy. Running a full stack validation model on an FPGA prototype may help.”

Improving software
Even without hardware changes, there is a lot of gain that could be made just by updating the software. “Legacy can never be ignored,” Amos says. “There is a lot of software out there, and to go back to code that has been running for 10 years is unlikely.”

But it does happen when the gains are significant enough, as demonstrated by the uptake of DSPs. “If you have a general idea about how to write a program for DSP, you should be able to write the code fairly easily, but the SIMD and vectorization aspects are not easy,” explains Desai. “We do have auto-vectorization compilers and we do provide libraries, so if you want to build an imaging algorithm we might already have it in the library that you can reuse. But you do need to have some understanding about parallel programming.”

Amos also has a warning about getting too aggressive moving functionality into hardware. “We have had a migration from a community of hardware engineers when software was just emerging. Since then the universities are churning out software engineers, and there are not enough hardware engineers to make big changes. If we start moving software into hardware, who is going to do that?”

As with many things, economics will be the final decision-maker. If something can be done more efficiently for the same cost or less, then it will happen. Today, we are seeing some of the costs and benefits change, but it will take time before they fully work through the system. Makimoto may well be proven to be correct again, but today that swing has only just started.

Related Stories
Embedded FPGAs Going Mainstream?
Programmable devices are being adopted in more market segments, but they still haven’t been included in major SoCs. That could change.
Tech Talk: EFPGA Acceleration
When and why to use embedded FPGAs.
FPGA Prototyping Gains Ground
The popular design methodology enables more sophisticated hardware/software verification before first silicon becomes available.
Custom Hardware Thriving
Predictions about software-driven design with commoditized IoT hardware were wrong.