Why Hardware-Dependent Software Is So Critical

It may not be the most glamorous type of software development, but getting it right is essential for the success of any hardware platform.


Hardware and software are two sides of the same coin, but they often live in different worlds. In the past, hardware and software rarely were designed together, and many companies and products failed because the total solution was unable to deliver.

The big question is whether the industry has learned anything since then. At the very least, there is widespread recognition that hardware-dependent software has several key roles to play:

  • It makes the features of the hardware available to software developers;
  • It provides the mapping of application software on to the hardware; and
  • It decides upon the programming model exposed to the application developers.

A weakness in any one of these, or a mismatch against industry expectations, can have a dramatic impact.

It would be wrong to blame software for all such failures. “Not everybody who failed went wrong on the software side,” says Fedor Pikus, chief scientist at Siemens EDA. “Sometimes, the problem was embedded in a revolutionary hardware idea. Its revolutionary-ness was its own undoing, and basically the revolution wasn’t needed. There was still a lot of room left in the old boring solution. The threat of the revolutionary architecture spurred rapid development of previously stagnating systems, but that was what was really needed.”

In fact, sometimes hardware existed for no good reason. “People came up with hardware architectures because they had the silicon,” says Simon Davidmann, founder and CEO for Imperas Software. “In 1998, Intel came out with a four-core processor, and it was a great idea. Then, everybody in the hardware world thought we must build multi-cores, multi-threads, and it was very exciting. But there wasn’t the software need for it. There was lots of silicon available because of Moore’s Law and the chips were cheap, but they couldn’t work out what to do with all these weird architectures. When you have a software problem, solve it with hardware, and that works well.”

Hardware generally needs to be surrounded by a complete ecosystem. “If you just have hardware without software, it doesn’t do anything,” says Yipeng Liu, product marketing group director for Tensilica audio/voice IP at Cadence. “At the same time, you cannot just develop software and say, ‘I’m done.’ It’s always evolving. You need a large ecosystem around your hardware. Otherwise, it becomes very difficult to support.”

Software engineers need to be able to use the available hardware. “It all starts with a programming model,” says Michael Frank, fellow and system architect at Arteris IP. “The underlying hardware is the secondary part. Everything starts with the limitations of Moore’s Law, hitting the ceiling on clock speeds, the memory wall, etc. The programming model is one way of understanding how to use the hardware, and scale the hardware — or the amount of hardware that’s being used. It’s also about how you manage the resources that you have available.”

There are examples where companies got it right, and a lot can be learned from them. “NVIDIA wasn’t the first with the parallel programming model,” says Siemens’ Pikus. “The multi-core CPUs were there before. They weren’t even the first with SIMD, they just took it to a larger scale. But NVIDIA did certain things right. They probably would have died, like everybody else who tried to do the same, if they didn’t get the software right. The generic GPU programming model probably made the difference. But it wasn’t the difference in the sense of a revolution succeeding or failing. It was the difference between which of the players in the revolution was going to succeed. Everybody else largely doomed themselves by leaving their systems basically unprogrammable.”

The same is true for application-specific cases, as well. “In the world of audio processors, you obviously need a good DSP and the right software story,” says Cadence’s Liu. “We worked with the entire audio industry — especially the companies that provide software IP — to build a big ecosystem. From the very simple codecs to the most complex, we have worked with these providers to optimize them for the resources provided by the DSP. We put in a lot of time and effort to build up the basic DSP functions used for audio, such as the FFTs and biquads that are used in many audio applications. Then we optimize the DSP itself, based on what the software might look like. Some people call it co-design of hardware and software, because they feed off each other.”

Getting the hardware right
It is very easy to get carried away with hardware. “When a piece of computer architecture makes it into a piece of silicon that somebody can then build into a product and deploy workloads on, all the software to enable access to each architectural feature must be in place so that end-of-line software developers can make use of it,” says Mark Hambleton, vice president of open-source software at Arm. “There’s no point adding a feature into a piece of hardware unless it’s exposed through firmware or middleware. Unless all of those pieces are in place, what’s the incentive for anybody to buy that technology and build it into a product? It’s dead silicon.”

Those thoughts can be extended further. “We build the best hardware to meet the market requirements for power, performance, and area,” says Liu. “However, if you only have hardware without the software that can utilize it, you cannot really bring out the potential of that hardware in terms of PPA. You can keep adding more hardware to meet the performance need, but when you add hardware, you add power and energy as well as space, and that becomes a problem.”

Today, the industry is looking at multiple hardware engines. “Heterogeneous computing got started with floating point units when we only had integer arithmetic processors,” says Arteris’ Frank. “Then we got the first vector engines, we got heterogeneous processors where you ended up having a GPU as an accelerator. From there, we’ve seen a huge array of specialized engines that cooperate closely with control processors. And so far, the mapping between an algorithm and this hardware has been the work of clever programmers. Then came CUDA, SYCL, and all these other domain-specific languages.”

Racing toward AI
The emergence of AI has created a huge opportunity for hardware. “What we’re seeing is people have these algorithms around machine learning and AI that are needing better hardware architectures,” says Imperas’ Davidmann. “But it’s all for one purpose — accelerate this software benchmark. They really do have the software today around AI that they need to accelerate. And that’s why they need these hardware architectures.”

That need may be temporary. “There are a lot of smaller-scale, less general-purpose companies trying to do AI chips, and for those there are two existential risks,” says Pikus. “One is software, and the other is that the current style of AI could go away. AI researchers are saying that back propagation needs to go. As long as we’re doing back propagation on neural networks we will never actually succeed. It is the back propagation that requires a lot of the dedicated hardware that has been designed for the way we do neural networks today. That matching creates opportunities for them, which are quite unique, and are similar to other captive markets.”

Many of the hardware demands for AI are not that different from other mathematically based applications. “AI now plays a huge role in audio,” says Liu. “It started with voice triggers, and voice recognition, and now it moves on to things like noise reduction using neural networks. At the core of the neural network is the MAC engine, and these do not change dramatically from the requirements for audio processing. What does change are the activation functions, the nonlinear functions, sometimes different data types. We have an accelerator that we have integrated tightly with our DSP. Our software offering has an abstraction layer of the hardware, so a user is still writing code for the DSP. The abstraction layer basically figures out whether it runs on the accelerator, or whether it runs on the DSP. To the user of the framework, they are generally looking at programming a DSP instead of programming specific hardware.”
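The abstraction-layer idea can be illustrated with a minimal sketch. Everything here is hypothetical: the operation names, `ACCELERATED_OPS`, and the `run_on_*` functions are stand-ins invented for illustration and do not reflect any vendor's actual SDK. The point is only that the caller invokes one entry point, and the layer decides where the multiply-accumulate (MAC) work runs.

```python
from typing import List, Tuple

# Hypothetical set of operation types the accelerator can handle.
ACCELERATED_OPS = {"conv", "fully_connected"}


def mac(weights: List[float], inputs: List[float]) -> float:
    """Multiply-accumulate: sum of element-wise products."""
    acc = 0.0
    for w, x in zip(weights, inputs):
        acc += w * x
    return acc


def run_on_accelerator(weights: List[float], inputs: List[float]) -> float:
    # Stand-in for an offload call; in this sketch it computes the same MAC.
    return mac(weights, inputs)


def run_on_dsp(weights: List[float], inputs: List[float]) -> float:
    # Stand-in for the DSP code path.
    return mac(weights, inputs)


def dispatch(op: str, weights: List[float], inputs: List[float],
             accelerator_present: bool = True) -> Tuple[str, float]:
    """Pick a target for the operation and return (target_name, result)."""
    if accelerator_present and op in ACCELERATED_OPS:
        return "accelerator", run_on_accelerator(weights, inputs)
    return "dsp", run_on_dsp(weights, inputs)
```

A caller would write `dispatch("conv", w, x)` without knowing which engine executes it. An unsupported operation, or a device without the accelerator, falls back to the DSP path with the same numerical result.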

This model can be generalized to many applications. “I’ve got this particular workload. What’s the most appropriate way of executing that on this particular device?” asks Arm’s Hambleton. “Which processing element is going to be able to execute the workflow most efficiently, or which processing element is not contended for at that particular time? The data center is a highly parallel, highly threaded environment. There could be multiple things that are contending for a particular processing element, so it might be quicker to not use a dedicated processing element. Instead, use the general-purpose CPU, because the dedicated processing element is busy. The graph that is generated for the best way to execute this complex mathematical operation is a very dynamic thing.”
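One simple way to picture that decision is to compare estimated completion times rather than raw execution speed. The sketch below is an assumption-laden illustration, not a description of any real scheduler: the `ProcessingElement` class, its fields, and the microsecond figures are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProcessingElement:
    name: str
    exec_time_us: float    # hypothetical time to run the workload on this element
    queued_work_us: float  # hypothetical work already waiting on this element

    def completion_time_us(self) -> float:
        # The workload finishes only after the queued work has drained.
        return self.queued_work_us + self.exec_time_us


def pick_element(elements: List[ProcessingElement]) -> ProcessingElement:
    """Choose the element with the earliest estimated completion time."""
    return min(elements, key=lambda e: e.completion_time_us())


busy_accel = ProcessingElement("accelerator", exec_time_us=10.0, queued_work_us=100.0)
idle_cpu = ProcessingElement("cpu", exec_time_us=40.0, queued_work_us=0.0)
chosen = pick_element([busy_accel, idle_cpu])
```

In this example the accelerator is four times faster in isolation, but its queue pushes its completion time to 110 µs against 40 µs for the idle CPU, so `chosen` is the CPU. Because queue depths change constantly, the answer can differ from one invocation to the next.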

From application code to hardware
Compilers are almost taken for granted, but they can be exceedingly complex. “Compilers generally try and schedule the instructions in the most optimal way for executing the code,” says Hambleton. “But the whole software ecosystem is on a threshold. On one side, it’s the world where deeply embedded systems have code handcrafted for it, where compilers are optimized specifically for the piece of hardware we’re building. Everything about that system is custom. Now, or in the not-too-distant future, you are more likely to be running standard operating systems that have gone through a very intense quality cycle to uplevel the quality criteria to meet safety-critical goals. In the infrastructure space, they’ve crossed that threshold. It’s done. The only hardware-specific software that’s going to be running in the infrastructure space is the firmware. Everything above the firmware is a generic operating system you get from AWS, or from SUSE, Canonical, Red Hat. It’s the same with the mobile phone industry.”

Compilers exist at multiple levels. “If you look at TensorFlow, it has been built in a way where you have a compiler tool chain that knows a little bit about the capabilities of your processors,” says Frank. “What are your tile sizes for the vectors or matrices? What are the optimal chunk sizes for moving data from memory to cache? Then you build a lot of these things into the optimization paths, where you have multi-pass optimization going on. You go chunk by chunk through the TensorFlow program, taking it apart, and then either splitting it up into different places or processing the data in a way that they get the optimal use of memory values.”
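Tiling is the kind of transformation such a tool chain applies. The sketch below is a hand-written, pure-Python illustration of the idea and is not how TensorFlow implements it: the function name and the `tile` parameter are assumptions made for this example. The result is identical for any tile size; what changes is the order in which memory is touched, which is the property a compiler would tune per processor.

```python
from typing import List

Matrix = List[List[float]]


def matmul_tiled(a: Matrix, b: Matrix, tile: int = 2) -> Matrix:
    """Multiply a (n x k) by b (k x m), visiting the data in tile-sized blocks."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    # Outer loops walk over blocks; inner loops work inside one block,
    # so each block of a and b is reused while it is still "hot".
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
result = matmul_tiled(a, b, tile=2)  # [[58.0, 64.0], [139.0, 154.0]]
```

On real hardware the best `tile` value depends on cache and vector-register sizes, which is why a tool chain that knows the target can pick it automatically.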

There are limits to compiler optimization for an arbitrary instruction set. “Compilers are generally built without any knowledge of the micro-architecture, or the potential latencies that exist in the full system design,” says Hambleton. “You can only really schedule these in the most optimal way. If you want to do optimizations within the compiler for a particular micro-architecture, it could run potentially catastrophically on different hardware. What we generally do is make sure that the compiler is generating the most sensible instruction stream for what we think the common denominator is likely to be. When you’re in the deeply embedded space, where you know exactly what the system looks like, you can make a different set of compromises.”

This problem played out in public with the x86 architecture. “In the old days, there was a constant battle between AMD and Intel,” says Frank. “The Intel processors would be running much better if the software was compiled using the Intel compiler, while the AMD processors would fall off the cliff. Some attributed this to Intel being malicious and trying to play bad with AMD, but it was mostly due to the compiler being tuned to the Intel processor micro-architecture. Once in a while, it would be doing bad things to the AMD processor, because it didn’t know the pipeline. There is definitely an advantage if there is inherent knowledge. People get a leg up on doing these kinds of designs and when doing their own compilers.”

The embedded space and the IoT markets are very custom today. “Every time we add new hardware features, there’s always some tuning to the compiler,” says Liu. “Occasionally, our engineers will find a little bit of code that is not the most optimized, so we actually work with our compiler team to make sure that the compiler is up to the task. There’s a lot of feedback going back and forth within our team. We have tools that profile the code at the assembly level, and we make sure the compiler is generating really good code.”

Tuning software is important to a lot of people. “We have customers that are building software tool chains and that use our processor models for testing their software tools,” says Davidmann. “We have annotation technology in our simulators so they can associate timing with instructions, and we know people are using that to tune software. They are asking for enhancements in reporting, ways to compare data from run to run, and the ability to replay things and compare things. Compiler and toolchain developers are definitely using advanced simulators to help them tune what they’re doing.”
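As a rough illustration of associating timing with instructions, the sketch below sums a per-instruction cycle cost over an instruction trace and compares two runs. The cycle table, the mnemonics, and the function names are hypothetical and chosen only for this example; real simulators model timing in far more detail.

```python
from typing import Dict, List

# Hypothetical per-instruction cycle costs, not taken from any real core.
CYCLES: Dict[str, int] = {"add": 1, "mul": 3, "mac": 3, "load": 4, "store": 2}


def estimate_cycles(trace: List[str], cycles: Dict[str, int] = CYCLES,
                    default: int = 1) -> int:
    """Sum the annotated cost of every instruction in the trace."""
    return sum(cycles.get(op, default) for op in trace)


def compare_runs(baseline: List[str], candidate: List[str]) -> int:
    """Return candidate cycles minus baseline cycles (negative means faster)."""
    return estimate_cycles(candidate) - estimate_cycles(baseline)


# Same computation compiled two ways: separate mul + add versus a fused mac.
baseline_trace = ["load", "load", "mul", "add", "store"]
tuned_trace = ["load", "load", "mac", "store"]
delta = compare_runs(baseline_trace, tuned_trace)
```

With these assumed costs the baseline takes 14 cycles and the tuned trace 13, so `delta` is -1. Run-to-run comparisons of this kind are what let tool chain developers check whether a compiler change actually helped.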

But it goes further than that. “There’s another bunch of people who are trying to tune their system, where they start with an application they are trying to run,” adds Davidmann. “They want to look at how the tool chain does something with the algorithm. Then they realize they need different instructions. You can tune your compilers, but that only gets you so far. You also can tune the hardware and add extra instructions, which your programmers can target.”

That can create significant development delays, because compilers have to be updated before software can be recompiled to target the updated hardware architecture. “Tool suites are available that help identify hotspots that can, or perhaps should, be optimized,” says Zdeněk Přikryl, CTO for Codasip. “A designer can do fast design space iterations, because all he needs to do is to change the processor description, and the outputs, including the compiler and simulator, are regenerated and ready for the next round of performance evaluation.”

Once the hardware features are set, software development continues. “As we learn more about the way that feature is being used, we can adapt the software that’s making use of it to tune it to the particular performance characteristics,” says Hambleton. “You can do the basic enablement of the feature in advance, and then as it becomes more apparent how workloads make use of that feature, you can tune that enablement. Building the hardware might be a one-off thing, but the tail of software enablement lasts many, many years. We’re still enhancing things that we baked into v8.0, which was 10 years ago.”

Liu agrees. “Our hardware architecture has not really changed much. We’ve added new functionalities, some new hardware to accelerate the new needs. Every time the base architecture remains the same, but the need for continuous software development has never slowed down. It has only accelerated.”

That has resulted in software teams growing faster than hardware teams. “In Arm today, we have approximately a 50/50 split between hardware and software,” says Hambleton. “That is very different to eight years ago, when it was more like four hardware people to one software person. The hardware technology is relatively similar, whether it’s used in the mobile space, the infrastructure space, or the automotive space. The main difference in the hardware is the number of cores, the performance of the interconnect, the path to memory. With software, every time you enter a new segment, it’s an entirely different set of software technologies that you’re dealing with — maybe even a different set of tool chains.”

Software and hardware are tightly tied to each other, but software adds flexibility. Continuous software development is needed to keep tuning the mapping between the two over time, long after the hardware has become fixed, and to make it possible to efficiently run new workloads on existing hardware.

This means that hardware not only has to be delivered with good software, but the hardware must ensure it gives the software the ability to get the most out of it.
