Co-design may be getting a second chance, but the same barriers that stalled it the first time could still get in the way.
The core concepts in hardware-software co-design are getting another look, nearly two decades after this approach was first introduced and failed to catch on.
What’s different this time around is growing complexity and an emphasis on architectural improvements alongside device scaling, particularly for AI/ML applications. Software is a critical component, and the more tightly integrated the software, the better the power and performance. Software also adds an element of flexibility, which is essential in many of these designs because algorithms are in a state of almost constant flux.
The initial idea behind co-design was that a single language could be used to describe hardware and software. With a single description, it would be possible to optimize the implementation, partitioning off pieces of functionality that would go into accelerators, pieces that would be implemented in custom hardware and pieces that would run as software on the processor—all at the touch of a button.
This way of working failed to materialize for several reasons. Primarily, the systems at that time were largely single-threaded applications designed to run on a single processor with a fairly simple system architecture. In order to do analysis, the environment and its implementation had to be deterministic, and any partitioning had to be coarse-grained enough so that the communications costs could be amortized. Under these restrictions, most partitioning was somewhat intuitive and thus automation did not provide enough value.
It was hoped that co-design would bring hardware and software teams closer together. “It required restructuring a complete industry,” says Roland Jancke, head of design methodology at Fraunhofer IIS/EAS. “It would have meant diminishing the borders between hardware and software design groups, and making it a rule, rather than an exception, that an executable architectural model exists for hardware and software designers to work with instead of numerous pages of written specifications.”
There is widespread agreement on that point. “The number of changes that a project team would need to make to adopt a co-design methodology was immense,” says Frank Schirrmeister, senior group director for product management and marketing at Cadence. “When you consider how conservative most project teams are, the likelihood of a new methodology being adopted is inversely proportional to the number of changes it requires.”
And until recently, that wasn’t as critical from a business standpoint. But even with rising complexity, it’s still not clear whether that will be enough to upend the existing way of doing things. “Companies have been able to get away with over-engineering the hardware and independently developing the software and still producing economically viable products, especially when their competition has been doing the same thing,” says Russ Klein, HLS platform program director at Mentor, a Siemens Business. “There are a lot of benefits to designing the hardware and software in concert, but the existing methodologies are deeply ingrained in the cultures of hardware and software developers, and the companies where they work. Overcoming that inertia will be difficult.”
What survived
Still, several technologies did come out of the co-design efforts, including virtual prototypes, co-verification, high-level synthesis (HLS) and software synthesis, although this last technology was not developed within the traditional EDA companies.
Virtual platforms are an abstraction of the hardware implementation. “I would claim that virtual platforms with software executing on a virtual model of the hardware are a half step towards co-design,” says Schirrmeister. “Virtual platforms have gained some traction because the number of changes is confined to one group, hardware, while the software guys are happy to use it because it is early and fast.”
But that was not enough to bring about change. “Co-verification products gave developers the ability to run and debug hardware and software together for the first time,” says Klein. “We thought that was going to be the beginning of the end of the great divide between the hardware and software teams. It didn’t have that impact on the industry. While folks who used it really liked it, it usually did not cause the company to adopt a more cooperative development methodology — despite the clear benefits.”
In the software industry, some companies use model-based development methodologies in which the software is automatically generated from verified models. “Consider the F-16 program,” says Schirrmeister. “Much of the software is auto-generated. Various UML tool suites are still in existence, and today, people are thinking about SysML. Then there are tools like MathWorks’ Real-Time Workshop that also do software generation.”
Software languages also have advanced. “CUDA can be viewed as a heterogeneous processing environment where GPUs and CPUs work together,” points out Raymond Nijssen, vice president and chief technologist at Achronix. “OpenCL is another, which also brings FPGAs into that context.”
In the hardware world, SystemC was created as an attempt to bridge the two worlds. “The architect may write a single description, perhaps in SystemC,” says Tom Anderson, technical marketing consultant for OneSpin Solutions. “But it is modified extensively during the design process to produce RTL code suitable for synthesis and C/C++ code to run on the embedded processors in the SoC.”
Using generic C code presents other problems. “High-level synthesis (HLS) does abstract things, but you cannot just download some software and run it through HLS to get hardware,” says Schirrmeister. “You have to write the input in a way that is predictable, so you know it is going to be mapped into hardware.”
One approach to addressing these problems is using a utility to optimize the C code prior to running it through HLS, says Max Odendahl, CEO at Silexica. “The tool performs automatic and guided code refactoring to make the code synthesizable by HLS compilers. Then we analyze the code to find parallelism and automatically insert HLS pragmas to guide the compiler on how to implement the functions in hardware. This approach can significantly reduce the barriers to using HLS and optimize the performance of the IP.”
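For readers unfamiliar with what that restructuring looks like, here is a minimal sketch of HLS-ready C++ code. The function, array sizes, and pragma choices are hypothetical, and the directives follow the Vivado/Vitis HLS style purely as an illustration; a tool like the one Odendahl describes would insert or tune such pragmas automatically.

```cpp
// Hypothetical fixed-size multiply-accumulate kernel written for HLS.
// Dynamic allocation and unbounded loops are avoided so the tool can
// infer hardware; the pragmas guide how the loop maps to logic.

#define N 64

void mac64(const int a[N], const int b[N], int *result) {
// Split the interface arrays so several elements can be read per cycle.
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=8
#pragma HLS ARRAY_PARTITION variable=b cyclic factor=8
    int acc = 0;
    for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
#pragma HLS UNROLL factor=8
        acc += a[i] * b[i];   // fixed trip count -> predictable hardware
    }
    *result = acc;
}
```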
Change through necessity
Two things could cause a breakthrough — necessity and/or opportunity.
“The pressure is there, be it from extremely efficient artificial intelligence processors for L4/L5 autonomous driving, from ultra-low power designs for distributed sensor networks, or from ultra-low latency communications for the tactile Internet,” says Fraunhofer’s Jancke. “Maybe the breakthrough is indeed accelerated if open-source hardware architectures, like RISC-V, take the step from academia to industry, and complementary open-source design tools are coming of age.”
Industry pressures are changing, too. “Turnaround times are beginning to drive this more and more,” says Achronix’s Nijssen. “Systems are reaching a scale where partitioning still could be done by hand, and you could probably do a better job that way, but you don’t have the time or the people to do it. So you pay a price in terms of efficiency and let the tools give you a result much faster.”
Technology may force change, as well. “We are seeing the breaking of the existing methodology,” says Mentor’s Klein. “In the past, algorithms developed as software would be deployed on embedded systems as software running on the embedded processors. If they did not run fast enough, the answer was simply to get a faster processor or more of them, and processor vendors obliged with ever-bigger, faster cores in larger clusters. What we see now, especially in the inferencing space on convolutional neural networks, is that they simply cannot run anywhere near fast enough with software. These algorithms must be implemented, at least partially, in hardware. This is new territory for a lot of developers.”
Machine learning is creating opportunities for change within the industry. “The industry has created huge software stacks, like TensorFlow and Caffe, which are complex software algorithms for doing ML and inference,” explains Simon Davidmann, chief executive officer for Imperas. “Now they ask, what platforms can we run it on? People have tried to put it on x86 SMP and on GPUs, and they realized that it still takes forever. What they need is parallel hardware. They need large fabrics with lots of processors. Fifteen years ago, everyone was building these hardware fabrics with no software. Today, the whole world has the software and they need better platforms to run it on. So we have moved to a completely different place where they have the software and they are trying to build a hardware fabric to efficiently execute this software in parallel. This has created a resurgence in architecture.”
It is not just inferencing that is forcing more hardware solutions. “We are also seeing video processing pushing the boundaries of what can be done on software,” adds Klein. “The increasing image resolutions and frame rates are pushing some computational loads beyond what software can address. I suspect there will be algorithms related to processing some of the protocols in the 5G suite that will drive algorithm developers to look to hardware implementations.”
One of the problems with the original concept was that it relied on a completely deterministic platform and workloads. “You started off with architectural exploration, including some back-of-the-envelope calculations, and that all works fine,” says Gajinder Panesar, chief technology officer for UltraSoC. “But people get things wrong, sometimes very badly. You need visibility into the system. Having an open-loop system, like hardware-software co-design 20 years ago, does not work. You need a closed-loop system where you have to live with the hardware you have, but with visibility inside the whole system and information that can be fed into some autonomous agent that says everything is behaving okay, or it isn’t. Then you can provide some remedial action or some intelligent tuning to sort things out.”
New opportunities
Several new technologies are creating opportunities, either because they change the status quo or because they advance what we can do with existing technologies.
One such technology is the RISC-V instruction set architecture (ISA). It provides a specification from which solutions can be generated. “RISC-V represents a very interesting opportunity,” says Klein. “It is an additional dimension that developers can explore to optimize their designs. Being open source, we should see a lot of innovation and creativity that would not be possible with traditional processor IP.”
This represents an interesting co-design opportunity. “Moving functionality back and forth between hardware and software is not trivial and is done infrequently,” says OneSpin’s Anderson. “RISC-V may change this situation a bit, since the specification allows custom instructions to be added to the processor. The decision whether to implement a new function in hardware potentially can be made later in the project. This puts pressure on the verification process, since the new features must be verified and the baseline functionality must be proven to be unbroken. Any RISC-V verification solution must be flexible enough to accommodate such extensions.”
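As a rough illustration of what deciding later can look like at the code level, the sketch below wraps a candidate function so it can be backed either by software or by a custom RISC-V instruction. It is a minimal sketch assuming a GCC/binutils RISC-V toolchain that supports the .insn directive; the build flag, the encoding (custom-0 opcode and funct fields), and the assumed instruction semantics are all hypothetical.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical wrapper for a custom "dot-product step" instruction added
// to a RISC-V core. The same API works with a plain software fallback, so
// the hardware/software decision can be revisited late in the project.
static inline uint32_t dotstep(uint32_t a, uint32_t b) {
#if defined(__riscv) && defined(USE_CUSTOM_DOTSTEP)
    uint32_t r;
    // ".insn r opcode, funct3, funct7, rd, rs1, rs2" emits an R-type
    // instruction; 0x0b is the custom-0 opcode space, and the funct
    // fields here are placeholders for this sketch.
    asm volatile(".insn r 0x0b, 0x0, 0x0, %0, %1, %2"
                 : "=r"(r)
                 : "r"(a), "r"(b));
    return r;
#else
    // Software fallback implementing the same (assumed) semantics:
    // a dot product of two packed 16-bit pairs.
    return (a & 0xffffu) * (b & 0xffffu) + (a >> 16) * (b >> 16);
#endif
}

int main() {
    // 2*4 + 3*5 = 23 under the fallback semantics above.
    std::printf("%u\n", dotstep(0x00020003u, 0x00040005u));
    return 0;
}
```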
Even before verification, up-front analysis is important. “Advanced software profiling tools are a must, so that software-informed hardware can be created,” says Chris Jones, vice president of marketing at Codasip. “There is certainly an opportunity for more automation of hardware generated via statistical software profiling, though human-directed inputs will always be more efficient. The openness of the RISC-V architecture makes it fertile ground for innovative approaches to ISA optimization.”
There are multiple dimensions to that possibility. “We are seeing innovation in three areas,” says Imperas’ Davidmann. “First, in the individual instructions; second, in the fabrics and the way in which they are laid out with different processors, and third, in new memory hierarchies. Traditionally, you had your L1, L2 and L3 cache and that seemed to work well. But with the new fabrics that is no longer the right structure and people need to innovate with them.”
Fixed hardware implementations can be limiting. “Embedded FPGA addresses this limitation, allowing the design to be updated for bugs and new features,” says Klein. “It trades off a bit of power and performance for that benefit, but it also allows a level of customization of the hardware implementation not possible without it. The algorithms in eFPGA can potentially modify themselves over time as systems are used, adapting to the environment of the system. This will be essential for systems that learn in the field.”
This isn’t as simple as programming software at a high level of abstraction, but the results are more efficient. “FPGAs have a reputation for being hard to program,” says Nijssen. “Certainly, it is no walk in the park compared to pushing out some Perl or Python, but a lot of people are trying to do something about that. A lot of progress has been made in popularizing FPGAs, which was helped by Intel’s acquisition of Altera. What is changing is that the acceleration workload is not static. It needs to be highly programmable and it is evolving constantly, even post-silicon.”
This fundamentally changes the co-design view of the system. “In the literature, you will find that the accelerator is presented to the co-design tool as a set of fixed-function accelerators,” explains Nijssen. “You can take those concepts further with eFPGA because that library of functions becomes much richer. This is a tremendous opportunity for co-design to come up with much more efficient mappings. They can then choose which functions will be in the accelerators. The compiler is, in effect, coming up with its own instructions.”
Again, machine learning may have an impact here. “The algorithms are embodied in some form of software, perhaps C++ or Python or R,” says Klein. “Compiler technology can be used in automated transformations of these algorithms. We are seeing a lot of interest in using high-level synthesis in this capacity. This may be the catalyst that finally pushes hardware developers from RTL to a higher level of abstraction.”
A different space
When you start to question one architectural choice, others also come into the picture. “You will see demand for much more fine-grained co-processing,” predicts Nijssen. “That also brings up questions of cache coherency, because the more fine-grained it becomes, the more control you need to manage the memory resources so that you do not have any conflicts or inconsistencies in how the memory gets transferred.”
Others see a need to change the memory systems. “Cache is the curse of the devil,” says UltraSoC’s Panesar. “I know it is a necessary evil for some things, but as soon as you put in a cache, the system is no longer predictable. Machine learning also brings a different programming paradigm, one that will break hardware that has stuck to previous programming paradigms, because the old assumptions about spatial and temporal locality are no longer valid.”
Existing tools have to adapt to these emerging opportunities. “What we found in the RISC-V world is that people are not building sophisticated out-of-order pipelines, they are not speculatively executing—they are building 64-bit in-order relatively simple pipelines and not a lot of superscalar,” says Davidmann. “We can do quite good timing analysis on this. We can do performance analysis using estimation about the way software is running on a single processor, and then allow you to add instructions and put timing for those in and help people tune the way in which the algorithm runs on the hardware. This is very different from the past, when people expected cycle-accurate timing.”
Systems are becoming more dynamic, too. “You may have to tune your system throughout the life of the product,” says Panesar. “There will be software updates. There are changes in sensors. There are changes in the environment. All that will affect the behavior of the software running on the hardware. Some chips will have an on-chip analytics subsystem that will take the data from the monitors and will either do level adjustments or local anomaly detection, and then there will be cases where that information is converted into metadata. For example, in the IoT space you may have something that augments the local analysis or anomaly detection with something in the Cloud.”
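As a loose illustration of the closed-loop idea Panesar describes, the sketch below shows an on-chip agent consuming monitor samples and choosing between local tuning and escalation. The monitor fields, thresholds, and actions are invented placeholders rather than any real analytics API.

```cpp
#include <cstdint>
#include <cstdio>

// Data a hardware monitor might capture (fields are illustrative only).
struct MonitorSample {
    uint32_t busLatencyCycles;   // observed interconnect latency
    uint32_t cacheMissPerK;      // cache misses per 1,000 accesses
};

// Remedial action fed back into the system or escalated as metadata.
struct Adjustment {
    int  dvfsStep;        // e.g., nudge the operating point up or down
    bool escalateToCloud; // forward anomaly metadata for off-chip analysis
};

// The local "autonomous agent": decides whether behavior is acceptable
// and what tuning or escalation to apply. Thresholds are placeholders.
Adjustment evaluate(const MonitorSample& s) {
    Adjustment a{0, false};
    if (s.busLatencyCycles > 500) a.dvfsStep = +1;          // tune locally
    if (s.cacheMissPerK > 300)    a.escalateToCloud = true; // anomaly -> metadata
    return a;
}

int main() {
    MonitorSample s{620, 120};   // one hypothetical sample
    Adjustment a = evaluate(s);
    std::printf("dvfs %+d, escalate=%d\n", a.dvfsStep, a.escalateToCloud);
    return 0;
}
```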
Conclusion
Many people had high hopes for co-design 20 years ago. Will it be different this time? It’s too early to tell.
“Folks are being forced to deal with this,” says Klein. “We are at the beginning of that. Maybe we will end up with a language that is suitable for hardware and software. But, then, I had hoped we would have made a lot more progress toward designing systems, rather than designing software and hardware separately, as we still seem to do. The cultural resistance to moving toward developing hardware and software together is really a very strong force that will take a long time to overcome.”
“If you always do what you’ve always done, you’ll always get what you always got.”