Designers must carefully weigh the gains against the costs, many of which are not immediately obvious.
While the ability to extend a processor is nothing new, market dynamics are forcing a growing percentage of the industry to consider it a necessary part of their product innovation. From small IoT functions to massive data centers and artificial intelligence, the need to create an optimized processing platform is often the only way to get more performance or lower power out of the silicon area available.
Many consider extensibility as the ability to add instructions, and yet larger gains are possible when infrastructure and communications are considered as part of the opportunity space. For example, direct processor-to-processor communications through dedicated hardware channels can offer huge power and performance advantages over old techniques such as symmetric multiprocessing (SMP). These old mechanisms rely on communications across power-hungry bus architectures and shared memory.
Distributed processing also enables greater degrees of specialization, but to benefit from this the entire flow must become efficient enough that the performance gains justify the increased development costs. The industry appears to be at a tipping point, with RISC-V helping to nudge it towards being a standard design practice.
“A large portion of the industry is still — either because of legacy purposes or infrastructure that existed — committed to standard processors and standard architectures,” says Neil Hand, director of marketing for design verification technology at Mentor, a Siemens Business. “They deal with any additional capabilities they need through hardware accelerators. You have another set of people that have started to look at alternatives such as RISC-V.”
“The challenge, historically, has been that the provision of custom instructions was confined to niche applications where the tradeoffs in additional design effort, power and the need for customized software tool chains made sense,” says Tim Whitfield, vice president of strategy for Arm’s Automotive & IoT Line of Business. “The cost and overhead of managing the software outweighed the performance or power benefits. Often the use is limited to embedded applications where the software or user interface is not exposed to third parties.”
Performance improvements are forcing those alternatives. “If you can’t get the performance you need from a standard fixed ISA processor, you are forced to add hardware external to the processor and connect that somehow,” says George Wall, director of product marketing for the Tensilica group of Cadence. “With an extensible ISA, you can accelerate that particular function as an instruction, or set of instructions, and meet the overall performance requirement, without having to offload to fixed hardware.”
The desire to do this is certainly not new. “There’s been a dream to take C code and magically port it over to a custom processor that can optimally run that C code,” says Graham Wilson, product marketing manager for ARC DSP processors at Synopsys. “The reality is that it’s not as easy to achieve that dream as was originally thought, but configurable and extendable processors get them closer to the dream. The ability to quickly respond to changes in the market, new algorithms, or different ways of implementing previous algorithms provides the biggest successes for custom instructions and allows people to achieve much better performance.”
Impact of RISC-V
Adding instructions isn’t anything new. “A lot of processors have been around for a while, but they were very strictly controlled,” says Mentor’s Hand. “They allowed you to add special instructions, but it was without disturbing the main decode structure of the processor. What RISC-V has changed is there’s now a reusable software ecosystem, making it easier for people to consider these specialized processors. We see more people looking to build their own, or modify their own, because they don’t have to invest in that software ecosystem. The software ecosystem was always the expensive part in trying to explore novel architectures.”
This option has never existed before. “Combining a modular design approach with the ability to create custom instructions based on an open-source ISA is a groundbreaking idea for processors,” says Louie De Luna, director of marketing for Aldec. “The modular approach can cover a broader set of domains, and the custom instructions can address more domain-specific requirements.”
And because RISC-V is open, the possibilities are boundless. “RISC-V allows you to extend it, or to try out a configuration with many cores, where you need your own instructions to do coherency and moving data around,” says Simon Davidmann, CEO for Imperas Software. “Maybe you want a FIFO between them. The idea of an open ISA is that it gives you the ability to add simple things, like custom instructions to do specific algorithms, but also it allows you to connect arrays or play with communication. What RISC-V gives you is the freedom to innovate.”
Types of extension
Certain algorithms can benefit from the addition of instructions. “It might be a VLIW type of instruction or just an instruction that combines two very common operations, like an add and a shift,” says Cadence’s Wall. “Depending on the extension and capabilities, you can sometimes incorporate some decision metrics into that data processing. You may have an instruction that says, ‘If this condition is true, I will load my register from this memory location. Otherwise, I retain the old value.’ Those types of instructions can be very valuable in image processing.”
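As an illustrative sketch only (not any vendor’s actual intrinsic), the fused conditional load Wall describes behaves as follows. The custom instruction folds the test into the load, removing a hard-to-predict branch from an inner loop:

```python
# Toy model of a hypothetical predicated "load-if" instruction, as
# described above: if the condition holds, load from memory; otherwise
# keep the old register value. Names are illustrative, not a real ISA.

def cond_load(cond, memory, addr, old_value):
    """Semantics of the fused conditional-load instruction."""
    return memory[addr] if cond else old_value

def branchy(cond, memory, addr, old_value):
    """The branch sequence a compiler would otherwise emit."""
    if cond:
        return memory[addr]
    return old_value
```

In an image-processing loop this is one instruction per pixel instead of a compare, a branch, and a load, which is where the cycle savings come from.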
Adding extensions is more than just adding computation. “You can also add interfaces,” continues Wall. “Through those interfaces one can connect external hardware blocks. Understanding the bandwidth and throughput between the processor and external hardware blocks is a fundamental part of the system architecture. You need to come up with a system architecture, and then come up with the algorithms that you intend to execute.”
That transforms it from being a software extension to an architectural extension. “You can add your own custom interfaces — whether it be GPIO, FIFO registers, an auxiliary register, or custom registers that can mirror the hardware blocks — so they become tightly coupled to the hardware,” says Synopsys’ Wilson. “That gives developers an optimal processor that is very tightly coupled and connected into their system through the extension of interfaces. Then instructions are created to access those GPIO, FIFO and auxiliary registers.”
Multi-core architectures are pushing this type of extension. “When you’ve got an array of processors, they tend to sit in a matrix where they talk to their neighbors north, south, east and west,” says Imperas’ Davidmann. “Possibly there is a FIFO between them. You can do that programmatically, but it’s far more efficient to add your own instructions to control the communication fabric.”
There is a downside, however. “Now you have so many degrees of freedom and you’ve got to be careful which of those degrees of freedom you use,” warns Hand. “Every change you make will have an impact on verification.”
Quantifying gains
What can people expect when they extend a processor? “At DAC 2017, Microsemi reported creating custom DSP instructions for their RISC-V-based audio processor products,” says Roddy Urquhart, senior marketing director at Codasip. “Their custom processor delivered 4.24X the performance of the original core but required only a 48% increase in silicon area. Since power consumption is proportional to clock frequency, reducing the clock frequency reduced the power significantly more than the increase in power due to the larger processor area. Furthermore, the code size shrank to 43% of the original size, with associated area and power benefits.”
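A back-of-envelope check of those numbers illustrates the tradeoff. This assumes throughput is held constant (so the clock can drop by the full performance gain) and a first-order model in which dynamic power scales with area times frequency — both simplifying assumptions, not figures from the Microsemi report:

```python
# Reported figures from the quote above.
perf_gain = 4.24      # same work in 1/4.24 of the cycles
area_ratio = 1.48     # 48% more silicon area

# Hold throughput constant: the clock can drop by the performance gain.
freq_ratio = 1 / perf_gain

# First-order dynamic power model (assumption): P ∝ area × frequency.
power_ratio = area_ratio * freq_ratio
print(f"relative power: {power_ratio:.2f}")  # ≈ 0.35 of the original
```

Even with a 48% larger core, the 4.24X cycle reduction dominates, which is why the quote reports a significant net power saving.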
The gains can be even larger. “We have a number of customers that have used Tensilica cores to accelerate AES encryption,” says Wall. “If you were to run that encryption on a pure processor, and then compare it to one with processor extensions, it is about 200X faster. That is an extreme case, but it gives you an example of the range of possibilities. The basic instructions in your standard RISC processor were not designed with AES in mind. There are some special XOR and shifts that are required. Being able to add those not only increases the data computation but also allows you to capture some of the decision making into hardware so there’s less need to perform branches which are costly in a processor.”
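The “special XOR and shifts” Wall mentions come from arithmetic in the AES finite field GF(2^8). The `xtime` step below — multiplication by {02}, as defined in FIPS-197 — shows the shift-and-conditional-XOR pattern that a fused custom instruction can execute in one cycle instead of a multi-instruction, branchy sequence:

```python
def xtime(a):
    """Multiply a byte by {02} in the AES field GF(2^8) (FIPS-197):
    shift left, then conditionally XOR in the reduction polynomial."""
    a <<= 1
    if a & 0x100:           # carry out of bit 7 means we must reduce
        a ^= 0x11B          # x^8 + x^4 + x^3 + x + 1
    return a & 0xFF
```

A generic RISC ISA needs a shift, a test, a branch (or mask trick) and an XOR here; an AES extension collapses the whole round step, including this decision, into hardware.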
It may provide a smaller chip. “Apart from the performance, you may see gains in power and area. In terms of power, you may need fewer instructions to perform some operations. If you look at key benchmarks that are consuming 70% of the processor load and you accelerate those, you could halve the number of clock cycles needed to implement an algorithm. That translates into a lower-energy solution. While the core may be a little bit bigger because of the extra instructions, your resultant energy is much less. If you merge a large number of instructions, it also reduces your instruction code size, which will reduce your instruction memory size and the power consumed when you’re accessing memory.”
And the result could be a cheaper solution. “People have made changes to a processor that gave them orders of magnitude speedup of the software,” says Hand. “That allowed them to go to an older process node and actually save money, without making too big a change to their design.”
Performance analysis
It all starts with analysis. “Not all computation is a good fit with a general-purpose processor core,” says Codasip’s Urquhart. “If software is profiled, it is possible to identify computational hotspots allowing designers to investigate how to extend the instruction set in order to better achieve design goals. Designers can create additional instructions and then use analysis tools for feedback on whether the performance bottlenecks have been addressed.”
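In software terms, the first pass Urquhart describes is ordinary profiling. A minimal sketch with Python’s standard `cProfile` and `pstats` modules illustrates the idea — target-side profilers for embedded cores work the same way, and the kernel below is just a stand-in for a real workload:

```python
import cProfile
import io
import pstats

def hotspot(n):
    # Deliberately expensive inner loop standing in for a DSP kernel.
    total = 0
    for i in range(n):
        total += (i * i) % 7
    return total

profiler = cProfile.Profile()
profiler.enable()
hotspot(100_000)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())  # top entries are the candidate custom instructions
```

The functions that dominate the report are the ones worth examining for fusable operation patterns; after adding candidate instructions, the same profile shows whether the bottleneck actually moved.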
Sometimes you need a deeper dive to get all of the details. “You need to quickly get a feeling for the impact on area, complexity, and timing paths to ensure they do not hurt achievable megahertz and other aspects,” says Wall. “You can very quickly flow it through your EDA process to come up with a gate-level implementation, and then you can usually get a rough order-of-magnitude estimate of the power. That makes it easy to compare two different implementations. You may see that power went up by 10%, but application performance doubled.”
This is not an open-loop process. “You need to keep a tight rein because you are adding logic to the core,” says Wilson. “It is an iterative process of developing your instructions, doing a physical implementation, assessing how many gates, calculating maximum clock frequency and so forth. As long as you adhere to that, you will get a lower power and generally a smaller solution if you take into account the instruction memory.”
Maintaining the tool chain
If the tool chain is broken when extensions are made, the gains evaporate. Some solutions build protection into the way extensions can be made, while others provide more flexibility but may also allow you to get into more trouble. At a minimum, maintaining tool compatibility for RISC-V means remaining compliant with the standard.
Many solutions start from a single description of the processor. “After extending the instruction set, the necessary golden reference ISS, RTL, software toolchain and UVM verification environment and tests for checking that the RTL matches the ISS are automatically generated,” says Urquhart. “This highly automated approach is both more cost-effective and lower-risk than alternative approaches involving manually, or partly manually, created extensions to the ISS, RTL, compiler and validation environment.”
Some solutions may go further than others. “We also use a single source for the instruction extensions, and that single source guides both the hardware design and the software design,” says Wall. “The software tool chain gets generated with the hardware. A key aspect is the ease of programming with these extensions. If you look at your standard RISC instruction set, the compiler decides which instructions need to be issued and how the resources used by those instructions get scheduled. That’s not so easy when extensions are made, but the compiler can make that decision, even when using custom instructions.”
Other solutions ensure that the tool chains cannot be broken. “We approach this in a different way to many other architectures,” says Arm’s Whitfield. “Our approach enables the addition of custom data-path instructions with minimal impact on either the hardware or the software ecosystem. One impact of adding custom logic to any CPU implementation is the requirement to ensure that not only your new instructions work correctly, but that you haven’t broken any other aspect of the CPU. This verification overhead can be quite considerable. Our approach reduces impact on verification and keeps the software tools and ecosystem consistent.”
The approach is sensible for many. “In the Arm solution, they do not give people complete freedom to add any old instruction,” says Davidmann. “It is a very clean solution that you can add to and extend in a controlled way. It gives you the freedom to do certain extensions necessary for algorithmic improvements, but they did it in such a way that it couldn’t damage the rest of the Arm fabric. That means you didn’t need to revalidate everything. In the RISC-V world, there are no safeguards. You’ve basically got the RTL in front of you. It is very easy to add an instruction here or there, but there is no policeman, apart from within your company.”
Verification costs can cause the gains to evaporate. “While the impact of software customizations can be kept more localized, this is much harder with hardware,” says Nicolae Tusinschi, design verification expert at OneSpin Solutions. “Adding a single custom instruction to a pipelined RTL processor has huge implications in terms of the functional corner cases that could hide bugs. What these companies need is a rigorous verification flow that can implement a sort of equivalence checking between the ISA, including the custom instructions, and the RTL implementation that their tools generate. This is similar to having an equivalence checking tool that verifies an RTL model versus the FPGA netlist generated by the synthesis tool of the FPGA vendor.” [The verification problems associated with custom processors were addressed in a previous article, “Practical Processor Verification.”]
Hardware / software co-development
In order to have an optimized processor, you need software. “If you have software, you can certainly analyze it and identify where some of the hotspots are,” says Wall. “But if you don’t have the software, you can still identify what types of computation, or requirements, are going to be expected of the processor. And once you have that, you can start making some of the tradeoffs about what the processor needs to do.”
In some cases, the traditional arrangement may be switched. “One company ran popular cell phone apps to measure the performance impact of their next-generation systems and to aid them in making architectural choices,” says Hand. “What we’re trying to do is provide a hardware/software platform where people can start doing profiling. Then, as they migrate towards real hardware, they can start substituting that real hardware into the platform, either through simulation, emulation or prototyping. To take advantage of a lot of these tradeoffs, you are going to want true hardware/software co-design.”
Even if you do not have production software, progress can still be made. “We know that for audio, we have to be able to do fast Fourier transform (FFT) or infinite impulse response (IIR) algorithms,” says Wall. “Even though we don’t know every FFT or IIR that might be run on the processor, we are able to capture the basics of it. We can write a small software program that captures those algorithms and develop the extensions. Then we can analyze the impact of those extensions.”
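A representative kernel can be tiny. The textbook radix-2 Cooley-Tukey FFT below — a generic reference implementation, not vendor code — exposes the butterfly’s complex multiply, add and subtract, which is exactly the pattern that DSP instruction extensions (fused multiply-add, paired add/subtract) typically target:

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # The butterfly: one complex multiply by the twiddle factor,
        # then a paired add and subtract — prime fusion candidates.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

Profiling a small driver around a kernel like this, before any production firmware exists, is enough to size the extensions and estimate their payoff, as Wall describes.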
Conclusion
Processor extensibility can have a profound impact on a design. This can be manifested in terms of significantly improved performance and lower power. It opens up architectures that could never have been considered in the past.
However, companies going down this path must be fully aware of the obstacles and hidden costs that will be incurred — especially while some of the technology is still in its infancy.