How To Optimize A Processor

There are at least three architectural layers to processor design, each of which plays a significant role.


Optimizing any system is a multi-layered problem, but when it involves a processor there are at least three levels to consider. Architects must be capable of thinking across these boundaries because the role of each of the layers must be both understood and balanced.

The first level of potential optimization is at the system level. For example, how does data come in and out of the processing cores? The next level is the architecture of the processing cores themselves. This may involve adopting an existing processor, or adding instructions to an extensible core. The final level of optimization is the micro-architectural level. This is where implementation pipelines are defined.

When systems are created out of pre-defined components, the freedom of choice is restricted. But when custom silicon is to be deployed, it can be easy to be overwhelmed by the amount of flexibility and to dive into details before it is appropriate. A top-down discipline has to be maintained.

“The mission is to optimize the execution of a task or a group of tasks,” says Gajinder Panesar, technical fellow for Siemens EDA. “It’s not about the processor. It’s about the process you’re trying to optimize, and therefore you have to think about the system that is going to achieve the task. How you architect and partition the system design is a key question before you even begin to think about choosing a CPU and determining whether you need to customize it.”

Partitioning of tasks and architecting the system is one challenge. Once it has been ascertained that a processor is to be used, that processor needs to be optimized and potentially customized for a given task or group of tasks.

Processors can be optimized for any number of reasons, including combinations of throughput, latency, and power. “Specialization of a processor is basically the introduction of parallelism,” says Gert Goossens, senior director for ASIP tools at Synopsys. “It can be instruction-level parallelism, or it can be data-level parallelism. Processing vectors can be task-level parallelism, or maybe you deploy a multi-core architecture. The second technique is specialization. You make sure the functional units of your processor can execute things in one cycle that on a traditional processor would take hundreds of cycles.”
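How much a specialized functional unit helps depends on how much of the workload it actually covers. A minimal sketch using Amdahl's law, which is not cited in the interviews and is brought in here only as a familiar first-order estimate, makes the point. The function name and the 80% and 100x figures are hypothetical.

```python
def overall_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: whole-task speedup when only part of the run time is accelerated.

    accelerated_fraction: share of baseline cycles spent in the code that the
        new instruction or functional unit replaces (0.0 to 1.0).
    kernel_speedup: how many times faster that code runs after specialization.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    remaining_time = (1.0 - accelerated_fraction) + accelerated_fraction / kernel_speedup
    return 1.0 / remaining_time


# Hypothetical workload: the kernel accounts for 80% of baseline cycles and a
# custom functional unit runs it 100 times faster.
print(f"overall speedup: {overall_speedup(0.8, 100.0):.2f}x")
```

Even a hundredfold gain on the kernel leaves the task bounded by the 20% that was not specialized, which is one reason profiling real application code comes before adding instructions.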

Often performance and power are quite tightly coupled. “The classic example would be to accelerate some tasks by adding computational resources,” says George Wall, director of product marketing for the Tensilica Xtensa processor IP group of Cadence. “The goal is typically to produce a lower energy implementation. Energy is power times time. If you customize the processor, there typically would be a small increase in the power, which hopefully is offset by a significant decrease in the cycle time, overall providing a net win on the energy side.”
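The energy trade-off Wall describes can be checked with simple arithmetic. The sketch below assumes a fixed clock frequency and uses made-up figures (100 mW and one million cycles for the baseline, 110 mW and 250,000 cycles for the customized core). None of these numbers come from Cadence.

```python
def energy_joules(power_watts: float, cycles: int, clock_hz: float) -> float:
    """Energy = power x time, with time derived from cycle count and clock rate."""
    execution_time_s = cycles / clock_hz
    return power_watts * execution_time_s


CLOCK_HZ = 500e6  # hypothetical 500 MHz clock, same for both variants

# Hypothetical baseline core: 100 mW, 1,000,000 cycles for the task.
baseline = energy_joules(0.100, 1_000_000, CLOCK_HZ)

# Hypothetical customized core: 10% more power, but 4x fewer cycles.
customized = energy_joules(0.110, 250_000, CLOCK_HZ)

print(f"baseline:   {baseline * 1e6:.1f} microjoules")
print(f"customized: {customized * 1e6:.1f} microjoules")
print(f"energy ratio (customized / baseline): {customized / baseline:.3f}")
```

With these assumed figures the customized core draws more power yet uses roughly a quarter of the energy, because the task finishes in far fewer cycles.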

System-level optimization
In the ideal case, the processor would always be busy doing productive work, never waiting for data to become available. This is rarely the case, however. Every cycle in which the processor sits idle, or performs speculative work that is later discarded, wastes time and power.

“You have a processor, and they need someone to feed them,” says Michael Frank, fellow and system architect at Arteris IP. “They need cache. They need peripherals like interrupt controllers that supply the vectors. And when you get interrupts, they need SMMUs for virtualization. There is a whole ecosystem that needs to exist around the processor. It doesn’t operate on its own. And then you need the cache infrastructure that feeds the processor, because your processors have become fast. You cannot have them talk to the remote memory.”
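Frank's point about caches can be put in numbers with the textbook average memory access time (AMAT) formula. This is a rough sketch with a single cache level. The latencies and miss rate are hypothetical and not taken from any Arteris design.

```python
def average_memory_access_cycles(hit_cycles: float,
                                 miss_rate: float,
                                 miss_penalty_cycles: float) -> float:
    """Textbook AMAT for one cache level: hit time + miss rate x miss penalty."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_cycles + miss_rate * miss_penalty_cycles


# Hypothetical numbers: 2-cycle cache hit, 200-cycle trip to remote memory.
REMOTE_MEMORY_CYCLES = 200.0
with_cache = average_memory_access_cycles(2.0, 0.05, REMOTE_MEMORY_CYCLES)

print(f"every access to remote memory: {REMOTE_MEMORY_CYCLES:.0f} cycles")
print(f"with a cache at 5% miss rate:  {with_cache:.0f} cycles on average")
```

The model ignores overlapping misses, prefetching, and multiple cache levels, so it indicates only the order of magnitude at stake when a fast core has to talk to remote memory.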

In multi-processor systems, each of the processors needs to be orchestrated. “How will the accelerator be managed?” asks Sharad Chole, chief scientist for Expedera. “Where is the workload orchestration going to happen? What sort of bandwidth is required for the accelerator? How much DDR access or how much shared memory will be required at the workload level? When we discuss the solutions with the customers, it is typically a hardware/software co-design problem. You need to analyze the workload, and you need to define what impact the workload will have on the entire SoC. It is important to not miss that. We are not optimizing a single CPU core. If you optimize one CPU core in isolation, you end up with a multi-core architecture that is not deterministic, and the performance is dependent on compilation. How good is the compiler?”

Fig. 1: Key elements in processor optimization. Source: Brian Bailey/Semiconductor Engineering

Optimization happens on multiple levels. “Systems need to be analyzed to make sure the communication works correctly and to locate any bottlenecks,” says Simon Davidmann, founder and CEO for Imperas Software. “Many systems are employing sophisticated communication that could be either synchronous or asynchronous. It is like a floor-planning challenge, to make sure that data doesn’t bottleneck as it flows through the system.”

Often, those communications involve software. “You have to consider the firmware and software running on the device because that determines whether the product is performant or not,” says Siemens’ Panesar. “We can illustrate this by thinking about the various types of visibility that can be delivered. A CPU-centric view would suggest that tracing instruction execution should be enough, but without visibility of what’s going on elsewhere in the system, it’s a very blunt instrument. Is the network-on-chip (NoC) correctly dimensioned and configured? Are memory bandwidth and latency affecting performance? Without a system-level view of factors such as these, all of the CPU customizations in the world will not deliver a successful product.”
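One common way to check whether memory bandwidth, rather than the core, limits performance is the roofline model. It is not mentioned by Panesar and is used here only as an illustrative first-order check. The peak compute rate, bandwidth, and arithmetic intensities below are hypothetical.

```python
def attainable_gflops(peak_gflops: float,
                      mem_bandwidth_gbs: float,
                      flops_per_byte: float) -> float:
    """Roofline bound: performance is capped by compute or by memory traffic."""
    memory_bound_gflops = mem_bandwidth_gbs * flops_per_byte
    return min(peak_gflops, memory_bound_gflops)


PEAK_GFLOPS = 100.0      # hypothetical peak compute rate of the core
BANDWIDTH_GBS = 10.0     # hypothetical sustained memory bandwidth

# A kernel doing 2 operations per byte fetched is limited by memory.
streaming_kernel = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, 2.0)

# A kernel doing 20 operations per byte fetched can reach peak compute.
dense_kernel = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, 20.0)

print(f"streaming kernel: {streaming_kernel:.0f} GFLOP/s (memory-bound)")
print(f"dense kernel:     {dense_kernel:.0f} GFLOP/s (compute-bound)")
```

In the memory-bound case, adding instructions or functional units to the core does not move the result. Only changes to data movement do, which is the system-level view the quote argues for.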

Some of these decisions are influenced by the type of the processor core. “CPUs and GPUs can perform random data access, but AI processing cores are different in that they are designed to execute a limited set of algorithms with very specific and well-known data flows,” says Kristof Beets, vice president of technology innovation at Imagination Technologies. “This enables streamed processing and allows for much smaller logic, and less local caches. The operations and data flows supported are more limited than the GPU, and especially very limited compared with what a CPU supports.”

For many processors, the bus interface may be seen as a limiter. “Interfaces are in some cases as important as, or more important than, the actual ISA in terms of creating an efficient design,” says Cadence’s Wall. “The traditional processors most commonly have a bus interface, such as an AMBA protocol-based interface, to interface to other elements on the SoC, to interface to the main memory storage, and to interface to various I/O devices. There’s only so much an interface like that can scale, based on the number of elements that are competing for those devices. Part of the embedded design process these days is to consider if it makes sense to interface the processor to a particular device over the system bus. Or, can there be an alternative way to interface it? And that is another way to extend the processor: to create interfaces like queue interfaces, or look-up interfaces, where these other devices can be interfaced more directly to the processor.”

Ignoring these types of issues can lead to less-than-optimal solutions. “Data movement is critical,” says Manuel Uhm, director for Versal marketing at AMD. “The I/O, the memory bottlenecks all had to be thought through. We’ve actually doubled up the onboard tightly coupled memory — basically the program memory — attached to every single core. We also added something called mem tiles, which basically builds the buffer that supports these cores. They’re not actually part of the core itself, unlike the tightly coupled program memory, but they are their own tiles that support all these. We learned, and the learning wasn’t all about the compute. It was about how you move the data, it’s how you manage memory, how you bring it all in. I/O is a huge part of that problem.”

A number of more recent workloads may not even lend themselves toward a von Neumann-like processor architecture. “We did not start by assuming we needed a processor,” says Expedera’s Chole. “We did not start with the von Neumann architecture. What we started with were the building blocks of operations that are used in a neural network — for example, matrix multiplication, convolution, activation functions, pooling layers. We started by defining how we could do this best. We looked at the costs of doing this, and how to make sure that every computational block is always busy. The problem was to make sure that all operands are available when the computation needs to happen. Then we built an architecture that does not have any back pressure and is completely deterministic.”

Processor architecture optimization
Until recently, the opportunities for processor architecture optimization were limited unless you were building a completely custom processor. “A processor architecture has two parts,” says Zdeněk Přikryl, CTO for Codasip. “First is the instruction set architecture (ISA), and second is the micro-architecture. That is the implementation of the architecture. In the case of a proprietary ISA, you are rarely allowed to change the ISA, so you are limited to micro-architecture changes. You can think of those changes being evolutionary rather than revolutionary. On the other hand, if you start with an open ISA, it gives you a really good starting point and you can focus on the innovation and key differentiation. That becomes your secret sauce. You can add new instructions that help improve performance, it may reduce memory footprint, etc. In this case, it could be described as revolutionary rather than evolutionary.”

Some ISAs contain a significant amount of flexibility. “There’s configurability and extensibility,” says Rich Collins, director of product marketing at Synopsys. “Lots of the standard implementation is configurable in terms of widths of buses and sizes of memories. All those things are configurable without having to do any customization whatsoever. There may be pre-defined forms of extensibility, like providing a set of condition codes, a set of auxiliary registers, or extended sets of instructions, or even bolt-on hardware accelerators. You don’t have to just add custom instructions. If you have your own custom, secret-sauce accelerator, then you can interface that to the processor.”

The adoption of a baseline processor, or fixed-architecture processor, does have benefits. “People can re-use the ecosystem that already exists for a baseline processor, like Arm, RISC-V, or ARC,” says Synopsys’ Goossens. “That ecosystem could extend to peripherals, units for interfacing with your processor, and the software libraries that exist. The fact that you can re-use those elements is important, and that is a reason why people may prefer to have a RISC-V baseline, or an ARC baseline, and go from there.”

While some people in the industry decry the lack of processor architects, this is a temporary problem. “With the rise in RISC-V, there have become many more computer engineering courses that cover processor design, so the industry will have more people coming in the next few years,” he says. “I don’t think there’s a shortage of people who know how to build processors. It takes a couple of people as a crystallization point, and they collect the team around them to start building processors. There are no special skills required to build processors. It’s just another IP, but to build a good processor is not easy.”

It does require a good flow that enables architectural changes to be analyzed. “You need the software tools, especially compilers,” says Goossens. “You don’t want to program your extensions in low-level assembly code, and then do inline assembly and link it with your compiler generated code. Compilers must be able to exploit all the specializations you have added. Then you can get immediate feedback about the quality of your architecture by using real life application code. If you don’t have that, then you are using guesswork to make your extensions.”

There are two ways to handle this. “The traditional approach is to have a team of software engineers working on the software development kit (SDK), then a team working on RTL and another team working on verification,” says Codasip’s Přikryl. “There is nothing wrong with this approach, but it requires a lot of engineers, and you need to synchronize among them to be sure that they are aligned. The other approach is to use a tool that automates most of this. You describe the processor, including its ISA and microarchitecture, in a high-level language. Within Codasip we call that CodAL (a C-like language). From that, it generates the SDK, RTL and verification tools.”

Micro-architectural optimization
Once the processor architecture has been defined, it has to be implemented. “Micro-architecture defines how well you execute the instruction set,” says Arteris’ Frank. “When people stick to the agreed upon instruction set architecture, you are squeezing the lemon, the micro-architecture, to provide more performance.”

The implementation of a processor is generally considered to be a standard RTL design process. However, that does not mean it is trivial in any way, especially if it utilizes out-of-order parallelism, speculation, branch predictors, or one of many other techniques.
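To give a flavor of one such technique, the sketch below models a classic two-bit saturating-counter branch predictor and counts mispredictions on a synthetic trace. It is a behavioral illustration only, not RTL and not any vendor's design. The class name, table size, and trace are all hypothetical.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by the low bits of the PC.

    Counter values 0 and 1 predict not-taken; values 2 and 3 predict taken.
    """

    def __init__(self, table_size: int = 16) -> None:
        self.table_size = table_size
        self.counters = [1] * table_size  # start at "weakly not-taken"

    def predict(self, pc: int) -> bool:
        return self.counters[pc % self.table_size] >= 2

    def update(self, pc: int, taken: bool) -> None:
        index = pc % self.table_size
        if taken:
            self.counters[index] = min(3, self.counters[index] + 1)
        else:
            self.counters[index] = max(0, self.counters[index] - 1)


def misprediction_count(predictor: TwoBitPredictor, trace) -> int:
    """Replay (pc, taken) pairs, predicting before each update."""
    misses = 0
    for pc, taken in trace:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# Synthetic trace: one loop branch, taken 9 times then not taken, run 3 times.
LOOP_BRANCH_PC = 0x40
trace = ([(LOOP_BRANCH_PC, True)] * 9 + [(LOOP_BRANCH_PC, False)]) * 3

misses = misprediction_count(TwoBitPredictor(), trace)
print(f"{misses} mispredictions out of {len(trace)} branches")
```

The two-bit hysteresis means a single loop exit does not flip the prediction for the next pass through the loop. Real predictors add branch history and far larger tables, and verifying them is a good part of why processor RTL is not trivial.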

There are many aspects to the design, creation, and optimization of a processor that can execute a defined task, or set of tasks, and those aspects are all tightly linked. Paying too much attention to one and ignoring others can create problems. As with most systems, starting at the top leads to the biggest gains.

“Hindsight is 20/20,” says AMD’s Uhm. “If you don’t get something into the market, you can’t learn from that to get better. You see that with many AI companies today. Everyone who ran out of money after their first device is out of business. The folks who had some levels of success are onto their next chip, they learned from it, and have improved upon it. You need a constant feedback loop about what’s going right, what’s going wrong.”
