Customizing Processors

How custom a processor needs to be depends on many factors, but selecting the appropriate tool chain may be the right place to start.

The design, verification, and implementation of a processor is the core competence of some companies, but others just want to whip up a small processor as quickly and cheaply as possible. What tools and options exist?

Processors range from very small, simple cores that are deeply embedded into products to those operating at the highest possible clock speeds and throughputs in data centers. In some cases, they may utilize the same instruction set architecture (ISA), differentiated only by the micro-architecture. Other times, the processor architecture is highly tuned for a single application or optimized for very specific conditions. While there is an essentially infinite number of possible processors, there are only a few ways to go about designing and implementing them.

“A server, where you have to run arbitrary applications at very high speed, is a different game from a portable SoC,” says Gert Goossens, senior director for ASIP tools at Synopsys. “You may have a similar ISA, but the implementation will be strongly dependent on what process technology you can afford, and what clock frequencies you want to achieve. These elements are driven by your product, or your design, and those elements determine which micro-architectural implementations you want to have. The challenge is to do the tradeoff analysis to figure out what is best for your application domain.”

One-size-fits-all is a common design pitfall. “For example, one cannot expect to create a core driven by client workloads to work well for cloud native workloads, or a core focused on HPC to work for cloud native workloads,” says Jeff Wittich, chief product officer for Ampere. “Another is underestimating the amount of fine tuning of the core architecture that is required to deliver performance and efficiency leadership. It takes an extremely skilled and experienced design team to create a world-class processor that is fully validated in both pre- and post-silicon phases for its intended usage.”

For other applications, an off-the-shelf product may suffice. “If a pre-existing IP block meets your PPA and performance targets, that is the least-risky path,” says George Wall, director of product marketing, Tensilica Xtensa processor IP at Cadence. “Once you get beyond that, where off-the-shelf doesn’t meet the requirements, that’s when customers start looking at how to customize a processor. The criteria would include performance requirements, data throughput requirements, and what kinds of algorithms are going to be run on the processor. What other software is going to be run on the processor, such as RTOSes and other components like that? And what tool support is going to be needed, both for my own development purposes and for my end user’s development purposes? Those give you the requirements that you need to map to, and then it’s a matter of how quickly you can explore various implementations. How easy is it for me to change the pipeline depth? How easy is it for me to select or de-select common instructions like multiplies or divides or floating-point operations? And how easy is it for me to add resources to the processor, be they instructions, registers, or even interfaces? How quickly can I do that, and how is my software impacted by doing that?”

Architecture and micro-architecture
In general, you get performance either through parallelization or specialization. You get specialization mainly from the architecture, while you get the parallelization mainly from the micro-architecture. So how do you balance those two? Should people be looking at changing the architecture or the micro-architecture first?

Specialization has long been a subject of debate. “Back in the early ’80s, there was a RISC versus CISC discussion,” says Michael Frank, chief architect for Arteris IP. “Back then, Dave Patterson came down on the side of RISC and the industry followed. As we increased clock speeds, things got more difficult, and we started to hit the memory wall. Today, there is an advantage of CISC again, because you carry more work into the core of the processor in a single instruction.”

For most people, changing the architecture means changing the instruction set. “This is much easier for most people,” says Rupert Baines, chief marketing officer at Codasip. “You get most of the benefits, because you can get massive acceleration. And it’s just much easier to verify. Once you start going into the micro-architecture, you have to consider how you are going to verify it. If you stick to ISA changes, then you have a simulator that understands the new instruction. It is designed within the tool, so it can generate the variant, the UVM for that instruction, etc.”
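
To make that concrete, here is a minimal sketch of how a new instruction surfaces to C code on a RISC-V target. CUSTOM_0 is one of the opcode spaces the RISC-V specification reserves for extensions, but the funct fields and the instruction’s semantics below are purely hypothetical, and a real flow would also teach the simulator, compiler, and verification environment about the instruction:

```c
/* Hypothetical sketch: invoking a custom RISC-V instruction from C.
 * The encoding (funct3/funct7 values) and the operation it performs
 * are illustrative only. */
#include <stdint.h>

static inline uint32_t my_custom_op(uint32_t a, uint32_t b)
{
    uint32_t result;
    /* .insn r <opcode>, <funct3>, <funct7>, rd, rs1, rs2 */
    __asm__ volatile(".insn r CUSTOM_0, 0x0, 0x0, %0, %1, %2"
                     : "=r"(result)
                     : "r"(a), "r"(b));
    return result;
}
```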

Micro-architectural changes can be messy. “Changing the underlying micro-architecture is potentially fraught with peril,” says Cadence’s Wall. “It’s not something you want to do unless you have a full understanding of how those micro-architectural changes are going to ripple through the software tools, such as debuggers, as well as higher-level software such as operating systems. The basic plumbing software that sits on top of every embedded processor, between the application and the core, is very sensitive to the micro-architecture, as are operating systems, compilers, and debuggers. Those are the tradeoffs that customers really need to understand if they want to change the micro-architecture.”

To contain those verification risks, some industry groups are considering using a base processor plus an accelerator that is accessed through an interface. “We have customers that want to start from a RISC-V architecture, but they also need much wider instruction words,” says Synopsys’ Goossens. “They typically go in the direction of a VLIW architecture, but you may have a scalar slot in your VLIW that is basically the full RISC-V architecture. Then you have additional parallel issue slots, where you go to more parallel operations. You can continue to use a lot of the RISC-V ecosystem, but for high-performance applications, you get a performance boost from the parallel extensions that you made next to it.”
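
A minimal sketch of that base-plus-accelerator pattern, assuming a hypothetical memory-mapped register interface (the base address and register layout are invented for illustration):

```c
/* The base core drives a loosely coupled accelerator over MMIO.
 * Addresses and register offsets are hypothetical. */
#include <stdint.h>

#define ACCEL_BASE  0x40000000u
#define REG(off)    (*(volatile uint32_t *)(ACCEL_BASE + (off)))
#define ACCEL_SRC   REG(0x00)   /* source buffer address */
#define ACCEL_DST   REG(0x04)   /* destination address   */
#define ACCEL_LEN   REG(0x08)   /* element count         */
#define ACCEL_CTRL  REG(0x0C)   /* write 1 to start      */
#define ACCEL_STAT  REG(0x10)   /* bit 0 set when done   */

void accel_run(uint32_t src, uint32_t dst, uint32_t len)
{
    ACCEL_SRC  = src;
    ACCEL_DST  = dst;
    ACCEL_LEN  = len;
    ACCEL_CTRL = 1;                    /* kick off the operation */
    while ((ACCEL_STAT & 1) == 0)      /* poll until complete    */
        ;
}
```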

Degrees of customization
There are many levels of customization, ranging from selecting a piece of pre-designed IP up to full custom design. Among them:

  • Pre-existing IP block
  • IP blocks that allow for a certain amount of customization
  • Tool-based configuration and customization based on templates
  • Tools that allow arbitrary processors to be described at both the architectural and micro-architectural levels
  • Full custom design

Within each of those categories, various flavors exist and are supported by a number of companies and tools.

“To design a high-performance processor core, one must use a combination of both manual design and synthesis tools,” says Ampere’s Wittich. “It is unlikely any auto-generated processor will meet the needs of cloud native workloads that have unique characteristics and stringent demands around latency and throughput. There will always be performance and efficiency left on the table if you do so. Every aspect of the core, like the size of the TLB or the branch prediction algorithm, can have big impacts on real workload performance and must be carefully optimized for the specific product usage.”

High-performance processors tend to use every possible means for optimization, and then these may be supplied as IP blocks. “The baseline functions to enable technologies like speculative execution have fairly different implementations based on the performance target, as such structures do not scale forever and need different forms depending on their sizes,” says Fred Piry, lead CPU architect and fellow at Arm. “Taking speculative execution as an example, tackling an out-of-order window of 20 cycles for a two-wide issue CPU requires a micro-architecture that is totally different from that of a CPU able to execute speculatively across 600 instructions or more. To cover all possible performance points, a single, configurable CPU is not sustainable, and this is the reason why Arm builds different micro-architectures, covering multiple requirements.”

There are limits to how far you can take any technique. “There is a diminishing return on investment,” says Hunglin Hsu, vice president of engineering at Arteris IP. “You only have a few architectural registers, so at the end of the day, register dependency will prevent you from retiring as many instructions as you dispatch. As you build bigger and bigger front-end dispatch machines, your clock frequency starts going down. I would prefer to go to multi-core rather than building a gigantic wide superscalar core. And when it comes to multi-core, two things are the most important. One is the thread scheduler in the OS. It needs to know CPU affinity. It needs to allocate a thread to a core that is close to the memory. The other thing that is important is the coherent fabric.”
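
On the affinity point, here is a small sketch of how software pins a thread to a chosen core on Linux, using the GNU extension pthread_setaffinity_np:

```c
/* Pin a thread to a specific core so the scheduler keeps it close
 * to its data. Linux-specific; requires _GNU_SOURCE. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

int pin_to_core(pthread_t thread, int core_id)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    /* Returns 0 on success; core_id must name an online core. */
    return pthread_setaffinity_np(thread, sizeof(cpu_set_t), &set);
}
```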

Many things become interlinked at this level of optimization. “Features like speculative execution or multiple issue are all linked together, and there is no single design choice for each performance point,” says Arm’s Piry. “For example, it would be counterproductive to increase the execution parallelism without similarly increasing the capacity for the CPU to execute instructions speculatively. The number of pipeline stages dictates the maximum clock frequency the CPU will reach, and frequency has a direct implication on performance. The number of pipeline stages also has a direct implication on the performance loss incurred by a branch misprediction. The longer the pipeline, the higher the cost of a misprediction. As such, a CPU architect needs to understand the likely timing-critical structures of the design, then find the best compromise between frequency, area, and power. So the number of pipeline stages needs to be defined upfront, otherwise configurability becomes complex.”
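
A back-of-the-envelope model makes that depth-versus-frequency compromise concrete. The sketch below assumes illustrative numbers for branch frequency, predictor miss rate, the clock gain from deeper pipelining, and a flush cost of roughly one cycle per stage; real designs balance many more effects:

```c
/* Toy model of the pipeline-depth tradeoff: deeper pipelines clock
 * faster but pay more cycles per branch misprediction. All numbers
 * are assumptions for illustration. */
#include <stdio.h>

int main(void)
{
    double branch_freq     = 0.20;   /* fraction of instructions that branch */
    double mispredict_rate = 0.05;   /* assumed predictor miss rate */

    struct { int stages; double ghz; } designs[] = {
        { 8, 2.0 }, { 14, 3.0 }, { 20, 3.6 }
    };

    for (int i = 0; i < 3; i++) {
        double penalty = designs[i].stages;          /* flush ~ depth cycles */
        double cpi  = 1.0 + branch_freq * mispredict_rate * penalty;
        double mips = designs[i].ghz * 1e3 / cpi;    /* millions of instr/s */
        printf("%2d stages @ %.1f GHz: CPI %.3f, ~%.0f MIPS\n",
               designs[i].stages, designs[i].ghz, cpi, mips);
    }
    return 0;
}
```

With a worse predictor or a smaller frequency gain per added stage, the deeper designs lose; that sensitivity is exactly why the depth decision has to be made upfront.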

Building from a base
The introduction of the RISC-V ISA has ignited interest in the middle ground when it comes to processor design. “RISC-V certainly has a lot of benefits, having a common instruction set base and defining a common architecture,” says Wall. “It defines a certain set of ground rules, such that IP vendors and college students don’t have to reinvent the wheel. They don’t have to create the instruction set or create memory management models, or other great things that are in the RISC-V world. But in terms of what the processor looks like, it can really be as simple as a tiny two-stage microcontroller, or a massive high-performance core. The RISC-V ISA can be used in either one. It was designed with that scalability in mind.”

In many cases, processor generators auto-generate a functional implementation of a processor, based on user-specified customization of a baseline processor design. The supplier of the generator will have designed a number of base micro-architectures, and the customer can select from a limited set of customization options, such as the number of pipeline stages or size of caches.
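
The menu of options such a generator exposes might look something like the following. This configuration is purely hypothetical and implies no real tool’s format:

```c
/* Hypothetical illustration of generator knobs layered on a base
 * micro-architecture. No real tool's configuration is implied. */
struct core_config {
    int pipeline_stages;            /* e.g., 3 for area, 8 for frequency */
    int icache_kbytes;
    int dcache_kbytes;
    int has_mul, has_div, has_fpu;  /* optional instruction subsets */
};

static const struct core_config deeply_embedded = { 3,  4,  4, 1, 0, 0 };
static const struct core_config application    = { 8, 32, 32, 1, 1, 1 };
```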

“This approach works well when available base models can meet the typically modest performance and power-efficiency requirements of the target application,” says Wittich. “They are also attractive where development cost and lack of specialized processor design expertise are key constraints, i.e., the user does not have the expertise, time, or budget to build a truly optimized processor.”

Many times, those base models are targeting specific application spaces. “You’re starting from something sensible, something reasonably common, but the customizations could take it in a very different direction,” says Codasip’s Baines. “You could add new instructions, and those new instructions will be utterly unique. Nobody else will ever have done those before. You could change the micro-architecture, and you could do that in quite fundamental ways. But you’ve got to stay within the guide rails of what the starting point was.”

These tools allow for exploration. “They allow you to do tradeoff analysis, where you compile real application code on your processor, you execute the generated code in the simulator, and you can profile it,” says Goossens. “You can really see the tradeoffs for your application. These tools, because of the automatic generation of compilation, of profiling, allow you to do an analysis that is not just guesswork, but it’s based on real data. That’s why we strongly advocate the use of processor design tools.”
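
Underneath that profiling loop is a cycle-level measurement. A minimal sketch, assuming a RISC-V RV64 target on which the cycle CSR is readable at the running privilege level:

```c
/* Count the cycles a kernel takes by reading the RISC-V cycle CSR
 * before and after. RV64 assumed, so one read returns 64 bits. */
#include <stdint.h>

static inline uint64_t read_cycles(void)
{
    uint64_t c;
    __asm__ volatile("rdcycle %0" : "=r"(c));
    return c;
}

uint64_t profile_kernel(void (*kernel)(void))
{
    uint64_t start = read_cycles();
    kernel();                          /* code under evaluation */
    return read_cycles() - start;      /* elapsed core cycles   */
}
```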

While each tool is based on a different language, they all share a similar flow, as shown in Fig. 1.

Fig. 1: Design and verification flow for a processor utilizing a generation tool. Source: Synopsys

While some tools may only allow features to be selected from a list, others have more general processor description languages that allow a greater range of processors to be developed. The nML language, for example, exposes an array of optimization possibilities, as shown in Fig. 2.

Fig. 2: Features accessible within the nML language. Source: Synopsys

An important part of this tool chain is the generation of the compiler. “When you have two processors with the same ISA but different implementations, you can in theory use the same compiler for both of them,” says Arteris’ Frank. “Think back to the AMD and Intel wars. A program compiled using the Intel compiler would run much better on an Intel machine than on an AMD machine. That was due mostly to the compiler being tuned to a particular processor. Once in a while, it would do bad things to the other processor because it didn’t know the pipeline. There is definitely an advantage if there is inherent knowledge.”

The processor languages contain both the architectural and micro-architectural information, and this can be used to generate an efficient compiler. “Our tool starts with LLVM,” says Baines. “It produces a customized version of LLVM, in effect compiling the compiler. You write your variation, and it generates the compiler, the debugger, the profiler. They are all automatic outputs, and they understand the changes you’ve made. The reason we can do that is that the language we’ve got is only for processors. It is not a new RTL. It is not a general-purpose language.”

This is different from high-level synthesis that is more generic. “High-level synthesis is a considerably more effort-intensive approach to designing processors,” says Wittich. “Here, the user needs to have the deep expertise to be able to specify the algorithm or functionality they are targeting in a high-level language like C or C++. The synthesis tool can then generate a functional implementation. In addition to requiring insight and expertise, current HLS capabilities only support the synthesis of a limited range and scope of algorithms (such as signal, image and video processing). And the implementations generated, while functionally correct and generated quickly, are typically not well-tuned for optimum performance, power, and area. This approach has yet to be successfully demonstrated on a high-performance processor.”
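
For context, the input to such a flow is ordinary C written within hardware-friendly constraints. A generic sketch of the style HLS tools typically accept, with vendor-specific pragmas for pipelining and unrolling omitted:

```c
/* A fixed-size FIR filter in the synthesizable style HLS tools
 * commonly expect: static bounds, no dynamic allocation, no
 * recursion. Pragma syntax varies by vendor and is omitted. */
#define TAPS 8

int fir(const int sample[TAPS], const int coeff[TAPS])
{
    int acc = 0;
    for (int i = 0; i < TAPS; i++)   /* fixed trip count: synthesizable */
        acc += sample[i] * coeff[i];
    return acc;
}
```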

Conclusion
The creation of a processor can range from almost no effort (buying off-the-shelf IP) to full custom design, which itself often ends up as a piece of IP. For those who require something more tailored to their needs, processor generation tools can guide a user through the process, so long as they adhere to a set of limitations and rules. It is a tradeoff between the level of optimization, the amount of time and effort that can be afforded, and the advantage that optimization translates into for your product. Which tool is most suitable for your next processor design depends upon how far you want to push the boundaries.

Comments

JC Bouzigues, Menta says:

A way to customize and add versatility to a processor is to add a logic block to act as a processor extension…even better if you can program this logic block using an embedded FPGA (eFPGA).
