How late can decisions be deferred during the development process? With dynamically extensible processors, that may be while the device is operating in the field.
The convergence of two technologies, extensible processors and embedded FPGAs, is enabling the creation of processors that can be dynamically configured in the field. But it’s not clear whether there is a need for them, or how difficult it would be to program them. This remains an open question, even though there is past evidence of their usefulness and new products are expected to reach the market soon.
Processor extensibility has been around for a long time, although interest has increased due to the open-source RISC-V ISA and the growing ecosystem that surrounds it. [See related article.]
The second technology seeing increased adoption is the embedding of FPGA resources into SoCs. Chips with a programmable fabric are not new, and many FPGA chips have embedded processors in them. Several soft-core processors have been defined that can be mapped into these programmable fabrics, and they still see application today.
Using accelerators with processors is a common way to extend performance. Accelerators are designed to offload heavy computation from the processor, often utilizing a piece of hardware optimized for that function. They are typically accessed via the system bus or a special-purpose interface built into the processor.
Would it be possible to make that accelerator part of the programmable fabric? Can the programming environment for it be created, and is there a market for such a product? “In general, dynamic reconfiguration of programmable logic is a great idea, and one that clearly can bring significant power and performance benefits to many applications,” says Tobias Welp, product manager for OneSpin Solutions.
Others agree. “Embedded FPGA (eFPGA) technology offers the prospect of adding custom functionality to a wide range of SoC applications,” says Chris Shore, director of product management for Arm’s Automotive and IoT Line of Business. “The advent of 5G, automotive and IoT has seen a shift in the demand of compute requirements toward the edge of the network. Achieving the tight performance and power requirements of these edge applications often requires application-specific compute or hardware accelerators. As the algorithms and functions of these accelerators evolve, so too does the hardware. The ability to integrate the FPGA component tightly with other logic on the same piece of silicon provides advantages not available to traditional FPGA products.”
Vision is driving a lot of architectural innovations. “The industrial segment has been using FPGAs for factory floor cameras,” says Joe Mallett, senior marketing manager for Synopsys. “They’ve been using IP cameras, where they’re doing a lot of the image processing in the camera. There is so much research right now that your base algorithm may be completely different in the future. You need some level of configurability to be able to optimize your algorithm for what you’re doing today, but you also need to be able to adopt better technologies of the future.”
Automotive vision is evolving even faster. “An IP camera probably wants to identify things that are moving and ignore things like swaying trees,” says Mallett. “But when you’re in a car, everything’s moving. You are processing the entire world dynamically, and at the same time you have to deal with IR vision at night and normal vision in the daytime.”
This places additional burdens on the algorithms. “They need something that’s much more specific for the problem they’re trying to solve, such as near real-time vision, while probably mixing that with other sensors like LiDAR and radar,” says Kurt Shuler, vice president of marketing at Arteris IP. “It’s not as general-purpose as something that you’d see from the more academic benchmarks. So those guys are having to innovate a lot more than the traditional AI algorithms that you read about.”
And they are not standing still. “AI is, by comparison to the standard computer industry, an infant,” adds Mallett. “There is so much changing, and optimizations, and better ways of doing things. Somebody will always come up with a better and more efficient way to do it.”
The state of technology
Processor extensions happen in a few ways. “There are two main approaches to implementing additional instructions,” says Roddy Urquhart, senior marketing director at Codasip. “The first is to create an extended processor with the additional instructions fully implemented in the processor pipeline. The second is interfacing the base processor to a co-processor, where the additional instructions are executed. Today either approach would normally be implemented either in an SoC or in an FPGA.”
Interfacing also has a few options. “Custom instructions really are small hardware accelerators focused on accelerating software execution that would otherwise take many cycles using the CPU’s standard instruction set,” says Andy Jaros, vice president of sales for Flex Logix. “eFPGAs are now being used to do similar acceleration functions by programming a specific accelerator into an FPGA fabric, usually connected to the system bus.”
But can that be taken further? “eFPGA can most easily be used to add accelerator functionality as memory-mapped peripherals connected to the on-chip bus,” says Arm’s Shore. “These can operate very flexibly and independently of the main processor. A more tightly coupled solution would be to use the eFPGA to implement an application-specific coprocessor, connected to a core via the built-in coprocessor interface. In this way, instructions executed by the coprocessor are integrated into the main instruction stream, making for a very straightforward software development process.”
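To make the loosely coupled option concrete, here is a minimal sketch of driving such a memory-mapped eFPGA accelerator from C. The base address, register layout, and polling protocol are all invented for illustration, not taken from any particular product.

    #include <stdint.h>

    /* Hypothetical register map for an eFPGA accelerator exposed as a
       memory-mapped peripheral on the on-chip bus. All addresses and
       offsets are invented for this sketch. */
    #define ACC_BASE    0x40020000u
    #define ACC_ARG0    (*(volatile uint32_t *)(ACC_BASE + 0x00))
    #define ACC_ARG1    (*(volatile uint32_t *)(ACC_BASE + 0x04))
    #define ACC_CTRL    (*(volatile uint32_t *)(ACC_BASE + 0x08))
    #define ACC_STATUS  (*(volatile uint32_t *)(ACC_BASE + 0x0C))
    #define ACC_RESULT  (*(volatile uint32_t *)(ACC_BASE + 0x10))

    #define ACC_START   1u
    #define ACC_DONE    1u

    /* The accelerator runs independently of the core. Software writes
       the operands, kicks it off, then polls (or takes an interrupt). */
    uint32_t accel_run(uint32_t a, uint32_t b)
    {
        ACC_ARG0 = a;
        ACC_ARG1 = b;
        ACC_CTRL = ACC_START;
        while (!(ACC_STATUS & ACC_DONE))
            ;                   /* could sleep or do other work here */
        return ACC_RESULT;
    }

The coprocessor-interface option Shore describes removes this explicit register traffic, because the custom instructions flow through the main instruction stream instead.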
Other possibilities may involve a fixed set of instructions, but a dynamically reconfigurable way in which processors talk to each other, or data moves around in a system. “AI chips contain a lot of multiply-accumulate functions,” says Mallett. “You are pruning down the network, you optimize it, do all kinds of things to utilize resources in the best way you can. The math functions you are doing typically don’t change. You just configure the block for the right math function and use it from there on. Memories and interconnect need to be programmable.”
The process
To design such a processor, you have to have a plan. “The most difficult part in both acceleration techniques is determining what software function needs hardware assist,” says Flex Logix’s Jaros. “This is usually determined by analyzing the software execution on a target processor to see where the CPU expends most of its cycles. This usually leads to a relatively few functions that don’t efficiently execute with the native instruction set and lend themselves to hardware acceleration. Once identified, an RTL instruction or hardware accelerator can be written to reduce the CPU cycle count for those particular functions.”
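As a concrete example of the kind of hot spot such profiling turns up, consider a bit-counting loop, a classic acceleration candidate on ISAs without a population-count instruction. The function below is an invented illustration, not taken from any vendor flow.

    #include <stddef.h>
    #include <stdint.h>

    /* A typical acceleration candidate: counting set bits across a
       buffer. Without a popcount instruction, the inner loop costs
       several cycles per set bit, so a profiler would flag this
       function as worth a custom instruction or accelerator. */
    uint64_t buffer_popcount(const uint32_t *buf, size_t n)
    {
        uint64_t total = 0;
        for (size_t i = 0; i < n; i++) {
            uint32_t w = buf[i];
            while (w) {          /* one iteration per set bit */
                w &= w - 1;      /* clear the lowest set bit */
                total++;
            }
        }
        return total;
    }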
Fig 1: Processor definition drives both hardware implementation and software tool chain. Source: Synopsys
Now you have to make the hardware decisions. “Once the instruction accelerator is identified, the CPU subsystem architect can choose to add it as a hardwired instruction to the CPU instruction extension interface, or alternatively, program it into an eFPGA that is attached to the CPU instruction extension interface,” Jaros says. “From a software perspective, both options look the same: the software developer codes in a function call in the software that tells the software compiler to use the new instruction instead of a combination of instructions from the native ISA. The software now gets the benefit of a hardware accelerator using standard software tools supporting extensible CPUs.”
That software/hardware interface has to be defined. “This would require pre-defining a set of say 32 functions in a standard format like 2 inputs and one output like c = co_processor_instr_1(a, b),” says Codasip’s Urquhart. “On the hardware side, a well-defined coprocessor interface would be needed to support the instruction format and to have the necessary control and status signals. On the compiler side, the functions would need to be defined as intrinsics or built-ins.”
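A minimal sketch of what that format could look like on the software side follows, assuming a GCC-style toolchain. The intrinsic name follows the article’s example, and the instruction mnemonic is a placeholder rather than a real opcode.

    #include <stdint.h>

    /* Hypothetical two-input, one-output coprocessor intrinsic in the
       c = co_processor_instr_1(a, b) format described above. With a
       real toolchain this would be a compiler built-in; here it is
       sketched as an inline-assembly wrapper around an invented
       placeholder mnemonic, "cop.exec1". */
    static inline uint32_t co_processor_instr_1(uint32_t a, uint32_t b)
    {
        uint32_t c;
        __asm__ volatile ("cop.exec1 %0, %1, %2"
                          : "=r"(c) : "r"(a), "r"(b));
        return c;
    }

    /* Usage: the call compiles down to a single coprocessor
       instruction in the main instruction stream. */
    uint32_t example(uint32_t x, uint32_t y)
    {
        return co_processor_instr_1(x, y);
    }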
There are some things that still can be implemented in either hardware or software, says Arteris’ Shuler. “When it comes to data flow management, what do I manage in software, what do I make automatic in hardware? There are some people who are doing things that are deeply embedded or deeply custom, where those algorithms are totally done internally. We see this happening a lot in automotive.”
Merging the pieces
Can these two acceleration methodologies be combined to provide the ultimate programmable and reconfigurable processor? “The answer is yes, and it has been proven in silicon with one of our customers,” says Jaros. “The benefit of using eFPGA in the instruction extension interface is that it now has a flexible ISA and the chip can be optimized for different applications. For a system company, this can reduce inventory costs and leverage higher volumes across a single SKU. For a semiconductor company, this enables chips that are hardware-customizable, but customized through software.”
Being able to define optimized processors for multiple applications with a single chip could be a huge advantage. “Extending a processor with a co-processor implemented as an FPGA has the advantage that the co-processor bitstream and firmware updates can be supplied after tapeout,” says Urquhart. “In planning the SoC it would be necessary to budget the right size of embedded FPGA to create a processor for the desired additional instructions. The user would need to design the co-processor logic to implement the subset or whole set of pre-defined functions. On the software side, libraries and version control would be needed to ensure that the SDK could only issue those instructions that were supported by the co-processor.”
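One plausible way that SDK-side guard could work is sketched below: the library checks which of the pre-defined functions the currently loaded bitstream implements, and falls back to software otherwise. The capability word and helper names are assumptions made for this illustration.

    #include <stdint.h>

    /* Hypothetical capability word: bit i set means pre-defined
       function i is implemented by the co-processor bitstream that is
       currently loaded. How it is read is platform-specific. */
    extern uint32_t coproc_capability_word(void);

    extern uint32_t co_processor_instr_1(uint32_t a, uint32_t b);

    /* Pure-software fallback with identical semantics. */
    static uint32_t instr_1_sw(uint32_t a, uint32_t b)
    {
        return a * b + (a ^ b);  /* placeholder for the real function */
    }

    /* Only issue the custom instruction if this bitstream supports
       it, so the SDK can never emit an unimplemented instruction. */
    uint32_t instr_1(uint32_t a, uint32_t b)
    {
        if (coproc_capability_word() & (1u << 1))
            return co_processor_instr_1(a, b);
        return instr_1_sw(a, b);
    }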
This type of flexibility does not come without potential problems. “The hardware architects of the chip have to be sitting next to the software architects to really come up with a good system,” warns Shuler. “If not, what you get is more of a toy system, where the software guys have to hack and hack and hack to get what they want. If you’re not anticipating these things, then it’s on the software guys’ shoulders. The smart companies have the software guys right next door.”
We also must be careful about how we use the term dynamically reprogrammable. “In the process of developing custom instructions, the software developer can create a library of custom instructions and use them when writing their software,” explains Jaros. “The compiler then schedules them when generating the binary executables. When the processor sees those instructions, it can kick off a DMA transaction to load that particular instruction from memory and program the eFPGA. Depending on the size of the eFPGA, chip frequency and other factors, this can take anywhere from 11µs to 223µs. So there would be a latency associated with dynamically reconfiguring the eFPGA with each new instruction, but that may be well offset by the improved performance.”
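Sketched in driver-style C, that load-on-demand flow might look roughly like the following. The helper names and polling protocol are invented, and in a real system much of this would happen in hardware when the processor decodes an instruction that is not yet loaded.

    #include <stdbool.h>

    /* Hypothetical driver-side view of load-on-demand instructions.
       All of these helpers are invented for illustration. */
    extern bool efpga_instr_resident(unsigned instr_id);
    extern void dma_load_bitstream(unsigned instr_id);
    extern bool efpga_config_done(void);

    void ensure_instr_loaded(unsigned instr_id)
    {
        if (efpga_instr_resident(instr_id))
            return;                   /* already programmed, no cost */

        dma_load_bitstream(instr_id); /* fetch bitstream from memory */

        /* Reconfiguration latency: roughly 11µs to 223µs in the
           figures above, depending on eFPGA size and clock rate. */
        while (!efpga_config_done())
            ;                         /* spin, or yield to other work */
    }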
All of this needs to be measured. “Quantifying the improvements can be done by looking at the reduced number of cycles now needed by the CPU + custom instructions, and then adding back in the latency required to program the eFPGA and comparing the result to the cycles needed by the CPU without custom instructions,” Jaros explains. “Keep in mind that some of the latency will be hidden, as most modern processor architectures support out-of-order execution, so programming the eFPGA may not cause processor stalls if the custom instructions are written as non-blocking instructions.”
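As a back-of-the-envelope version of that accounting: at 1GHz, a 100µs reprogramming latency costs about 100,000 cycles, so a custom instruction that saves 50 cycles per call needs roughly 2,000 invocations just to break even. A minimal helper makes the trade-off explicit:

    #include <stdint.h>
    #include <stdio.h>

    /* Net cycles saved by reconfiguring: per-call savings times call
       count, minus the one-time eFPGA programming cost. A positive
       result means the reconfiguration paid for itself. */
    static int64_t net_cycles_saved(int64_t saved_per_call,
                                    int64_t calls,
                                    int64_t reconfig_cycles)
    {
        return saved_per_call * calls - reconfig_cycles;
    }

    int main(void)
    {
        int64_t reconfig = 100000;  /* 100µs at 1GHz */
        printf("2,000 calls: %lld cycles net\n",
               (long long)net_cycles_saved(50, 2000, reconfig));
        printf("10,000 calls: %lld cycles net\n",
               (long long)net_cycles_saved(50, 10000, reconfig));
        return 0;
    }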
Verification troubles
Flexibility impacts verification. “While the emergence of open-source ISAs like RISC-V with support for custom extensions gives an incredible amount of freedom to processor designers, it poses a very interesting verification challenge,” says Shubhodeep Roy Choudhury, CEO and co-founder of Valtrix Systems. “You have to make sure that all the designs are compliant and functionally correct. This calls for a shift in the way the test generators are designed. They need to be highly configurable to allow verification of custom features, along with legacy/baseline features.”
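A hint of what that configurability might look like, sketched as an invented fragment of a test generator whose instruction pool mixes baseline opcodes with the custom extensions enabled for a given design:

    #include <stddef.h>
    #include <stdlib.h>

    /* Invented sketch of a configurable instruction-stream generator:
       the pool mixes baseline ISA opcodes with custom extensions, so
       generated tests cover both, and extensions can be switched off
       when verifying a design that does not implement them. */
    typedef struct {
        const char *mnemonic;
        int         is_custom;  /* 1 = custom extension under test */
    } instr_desc;

    static const instr_desc pool[] = {
        { "add",       0 },
        { "lw",        0 },
        { "cop.exec1", 1 },     /* placeholder custom instruction */
    };

    const instr_desc *next_instr(int custom_enabled)
    {
        for (;;) {
            const instr_desc *d =
                &pool[rand() % (sizeof pool / sizeof *pool)];
            if (d->is_custom && !custom_enabled)
                continue;       /* regenerate: extension disabled */
            return d;
        }
    }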
OneSpin’s Welp agrees with this difficulty. “Processors that can be extended dynamically could be one of the most challenging uses of embedded programmable logic, because functional verification also becomes dynamic in these scenarios. The user needs to verify the intended model and that the specific instance of programmed logic matches that model. How can this be verified if the intended model changes dynamically? Perhaps we could have a flow that performs verification and equivalence checking on a large set of intended models and their corresponding instances of programmed logic. If this verification covers a quantifiable measure of the configuration space, this could even allow for dynamic configuration to be applied to safety-critical applications.”
The programming model
Many architectures and processors have looked good in hardware, but the software programming model has never taken off. “Can you build a tool that will allow any C programmer to build hardware?” asks Mallett. “That is the fundamental challenge that you run into. It comes down to the fundamental difference between the hardware guys who think clocks, and the software guys who think interrupts. When you’re designing hardware from a pure software perspective, it might be optimized from the software side, but it may not necessarily be optimized from the hardware side. You still have this push and pull between hardware and software guys in terms of being able to optimize it for the best performance, best area, best power.”
When these processors are deeply embedded, the programming audience may be quite limited. “The companies have to decide if their customer base has the expertise to do this,” Mallett says. “They may also have to decide how badly it will burden their support if and when a customer is doing something that doesn’t work. Third, how much of my secret sauce do I have to put out there in order for a customer to be able to configure this?”
Conclusion
While dynamic extensibility may sound like a logical next step in the trajectory of extensible instruction-set processors, it does lead to a number of new challenges. It has been tried in the past, such as with Microsoft’s eMIPS back in 2006, and all of these challenges need to be overcome before it becomes a viable solution.
For a company that does succeed, the rewards could be significant. Today, companies often use a single die with different configurations and packaging to change I/O or other capabilities, but this would enable significant changes in performance for defined applications. If the dynamic configuration is locked down before the chip ships, the end user may never know what underlying technology is being used. For areas where the technology is advancing rapidly, it may be the only viable way to keep products from becoming obsolete before they even ship.
Related
What’s So Important About Processor Extensibility?
Designers must carefully weigh the gains against the costs, many of which are not immediately obvious.