Most applications can be decomposed into a number of tasks, and there are many options to create better implementations of them.
The optimization of one or more tasks is an important aspect of every SoC created, but with so many options now on the table, it is often unclear which is best.
Just a few years ago, most people were happy to buy processors from the likes of Intel, AMD and Nvidia, and IP cores from Arm. Some even wanted the extensibility that came from IP cores like Tensilica and ARC. Then, in 2018, John Hennessy and David Patterson delivered the Turing Lecture titled “A New Golden Age for Computer Architecture: Domain-Specific Hardware/Software Co-Design, Enhanced Security, Open Instruction Sets, and Agile Chip Development.” While it was not the lecture that started the drive towards more optimization, it certainly elevated it to the global consciousness.
It is easy to get sidetracked into thinking of a subset of solutions. “The mission is not to optimize the processor,” says Gajinder Panesar, technical fellow for Siemens EDA. “Instead, the mission should be to optimize the execution of a task or a group of tasks. It’s not about the processor, it’s about the process you’re trying to optimize, and therefore you have to think about the system that is going to achieve the task. How you architect and partition the system design is a key question before you even begin to think about choosing a CPU and determining whether you need to customize it. It is these kinds of system-wide requirements that will help with ‘what the CPU needs to do and how.’”
The optimal solution always will be an ASIC. “If you build everything out of hard blocks for one specific use case and problem, you have extreme optimization in terms of performance and power,” says Manuel Uhm, director for Versal marketing at AMD. “However, if standards are still evolving, if all the algorithms aren’t 100% locked down, or if the customer wants to be able to differentiate, then you have a strong case for the need for flexibility, the need for programmability and the need for reconfigurability. If you’re going into a space which has been tightly standardized and defined, and there’s a big market behind it, honestly, you’re not going to beat an ASIC. If you can get to market at exactly the right time, you can take advantage of it for two, three years before the next evolution comes along. But there are fewer and fewer places where that exists as a real scenario. The need for flexibility, programmability, reconfigurability, is becoming higher.”
Over time, several processor architectures have been developed to optimize broad classes of problems. “When it comes to optimizing processor designs, we have to keep in mind that each processing block has different strengths and weaknesses,” says Kristof Beets, vice president of technology innovation at Imagination. “The CPU, for example, tends to be most optimized for single-instruction-stream processing supporting high clock frequencies, with a processing architecture fine-tuned for branching, usually connected to a large cache structure. On the other hand, GPUs are designed with much more inherent parallelism in mind. GPUs were originally managing the processing of billions of pixels and vertices, which could all be handled in parallel without any dependencies and with minimal or zero branching.”
There are several ways to reach optimization goals, often with many of them considered at the same time. “The main reason you use a processor is that you want to have software programmability, and that’s a no brainer these days,” says Gert Goossens, senior director for ASIP tools at Synopsys. “Applications are evolving so fast, and you want to add smart functionality to your SoC, even after the SoCs have been built. Therefore, you need programmability. So you need embedded processors in your SoCs. But if you restrict yourself to standard processors, then sooner or later you hit limitations in terms of the performance that you can get, and in terms of power consumption.”
There are many definitions of optimal, and companies must figure out what it means to them. It may be different for each task contained within the same SoC. “Optimal for what?” asks AMD’s Uhm. “There are a set of tradeoffs, and there is no perfect solution. Tradeoffs are generally between flexibility and optimization, particularly where standards may not be set or are evolving. In price-sensitive markets, you may be trying to minimize cost. Or perhaps it’s about ease of use, because making a processor easy to use takes time. It may require hardening everything like an ASIC, and therefore it’s easier for an end customer to use. Or you may have an advanced set of tools, which is a massive investment. If you’re trying to broaden flexibility, that’s a tradeoff often with ease of use.”
The list of optimization goals is extensive. “There are probably as many goals as there are companies,” says Simon Davidmann, founder and CEO for Imperas Software. “Some people want to get the job done in the shortest possible time, some want the lowest possible development cost, some want it for the smallest area, lowest power, high performance, cost-effectiveness. Everybody’s got a different set of requirements. There’s a cross product of all of these things. When you’re optimizing something, you have to do it with the end in mind. You have to start by looking at your architectural design. This may include the type of processor architecture you are going to use. Then you can focus on optimizations.”
It is important to understand the context of the problem. “Are you looking at general-purpose programmability versus optimizing for a specific deployment use case?” asks Sharad Chole, chief scientist for Expedera. “The difference between these two approaches is that general-purpose programmable solutions don’t necessarily look at a specific workload. They are looking at the workloads as a very broad criteria, like is it scientific computing, graphics, supercomputing, neural network training or inference? When we are looking at this domain we are asking, what are 90% of the operations going to look like? And how do I optimize for those 90% of the operations? How do I make sure that I get the best performance for the area? How do I get the best performance for power?”
It all starts with the definition of a task. “You look at your algorithm, you decompose it, you identify where you have bottlenecks,” says Michael Frank, fellow and system architect at Arteris IP. “Then you figure out what you can do. You may group a sequence of instructions and figure out if you can build something that does this thing directly in hardware instead of having to sequence through a handful of instructions. You’re trying to get rid of loops, because loops are bad. Data flow is easier to analyze than control flow, but you need to do the data flow analysis to make sure you do not spend too much time getting data in and data out. That means dependencies need to be resolved.”
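Frank's point about data movement can be made concrete with a back-of-the-envelope model. The Python sketch below is illustrative only. The function name, the profile numbers, and the simple additive timing model are assumptions, not something taken from Arteris IP. It estimates the whole-task speedup from moving one bottleneck kernel into dedicated hardware, including the time spent getting data in and out.

```python
def offload_speedup(total_s, kernel_s, accel_factor, bytes_moved, bandwidth_bytes_per_s):
    """Estimate whole-task speedup when one kernel moves to dedicated hardware.

    Assumed model (hypothetical): the kernel runs accel_factor times faster,
    but its data must first cross a link, and transfer does not overlap compute.
    """
    transfer_s = bytes_moved / bandwidth_bytes_per_s
    new_kernel_s = kernel_s / accel_factor + transfer_s
    new_total_s = (total_s - kernel_s) + new_kernel_s
    return total_s / new_total_s


# Hypothetical profile: a 1.0 s task in which one kernel takes 0.6 s,
# a 20x faster hardware block, and 4 MB of data moved per run.
fast_link = offload_speedup(1.0, 0.6, 20.0, 4_000_000, 1_000_000_000)  # 1 GB/s link
slow_link = offload_speedup(1.0, 0.6, 20.0, 4_000_000, 10_000_000)     # 10 MB/s link

print(f"fast link: {fast_link:.2f}x, slow link: {slow_link:.2f}x")
```

Under these assumed numbers the same 20x hardware block yields a bit more than 2x overall on the fast link and only about 1.2x on the slow one, which is why the data flow has to be analyzed before the hardware is specified.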
Understanding the workload
Benchmarks are, by definition, looking in the rearview mirror. They define what was important to enough people that they bothered creating a benchmark for it. The application space, particularly with AI, is changing rapidly. Designers have to anticipate what the application will be by the time silicon is produced.
However, benchmarks can provide a starting point for analysis. “Benchmarks are very important for existing IP,” says Imperas’ Davidmann. “You can probably download or work with Arm, SiFive, Andes, or Codasip, and you take your benchmarks and try to run those on their cycle-accurate models. A lot of the processor IP vendors will run suites of benchmarks, and say this is the speed of our core on these benchmarks. Then, if you know that your application area requires these benchmarks, these workloads, you can select a core and benchmark the different vendors in those areas.”
But that may only be the first step. “Industry benchmarks, existing software, or rough prototyping software will provide some idea of the performance bottlenecks and if they can be alleviated,” says George Wall, director of product marketing for the Tensilica Xtensa processor IP group of Cadence. “With the industry talking about the Shift Left strategy, more of that software development and software analysis is done as early in the design cycle as possible. It is almost guaranteed that by the time that the chip goes to market, some things will have changed. There’s always new networks coming up that might solve a problem better, and that is why you try to retain as much programmability as you can.”
Even when the eventual application is not known, attributes of it are likely to be fairly mature. “Performance depends on critical kernels in your application, and the processor architecture must be optimized to run those critical kernels,” says Synopsys’ Goossens. “In applications, like neural networks, people have a pretty good idea about what those critical kernels are. It’s always a bit of a guess because you never know what kind of algorithmic changes may happen, but people who are in the field have a pretty good idea about how the architectures must be specialized to run those critical kernels. In addition, there’s a lot of housekeeping software that sits around those kernels, a lot of control flow, and that can change depending on the application. So that’s why a processor with some carefully optimized functional units for critical kernels, plus the whole software infrastructure around it for the housekeeping code, that combination makes a lot of sense.”
There are different ways to look at those critical kernels. “We look at the characteristics of neural networks like ResNet, YOLO, MobileBERT, FSRCNN, etc.,” says Paul Karazuba, vice president of marketing for Expedera. “For each of the networks, you see very different characteristics for the number of operations that the network has, the number of weights it needs to process, the number of activations that have to be handled during the execution of the entire neural network (see Fig. 1).”
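The three characteristics Karazuba lists can be counted directly from layer shapes. The Python sketch below is a simplified illustration, not Expedera's methodology. The helper name and the two layer shapes are hypothetical, and bias terms, stride, and padding details are ignored. It computes multiply-accumulate operations, weights, and output activations for a single convolution layer.

```python
def conv2d_stats(in_ch, out_ch, kernel, out_h, out_w):
    """Count MACs, weights and output activations for one 2D convolution layer.

    Simplified: square kernel, no bias, output size given directly.
    """
    weights = in_ch * out_ch * kernel * kernel
    macs = weights * out_h * out_w          # each weight is used once per output position
    activations = out_ch * out_h * out_w    # number of output values produced
    return {"macs": macs, "weights": weights, "activations": activations}


# Two hypothetical layers: a wide early layer and a deep late layer.
early = conv2d_stats(in_ch=3, out_ch=64, kernel=7, out_h=112, out_w=112)
late = conv2d_stats(in_ch=512, out_ch=512, kernel=3, out_h=7, out_w=7)

for name, stats in (("early", early), ("late", late)):
    print(name, stats)
```

With these assumed shapes the two layers need a similar number of operations, yet the early layer is dominated by activations and the late layer by weights. The same kind of contrast appears between whole networks, which is why operation counts alone do not describe a workload.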
Fig. 1: A Growing Diversity of Networks Demands Flexibility. Source: Expedera.
Some people may attempt to optimize for convolution networks or networks that are known to have sparsity. “Our architecture is not tailored towards running a CPU workload,” adds Expedera’s Chole. “The architecture is tailored toward running neural networks. We know there are layers in the neural network, there is a static graph, there is connectivity, there is dependency of the workload, and these layers can be broken down into smaller chunks. We call them packets, and so the question becomes how these packets should be scheduled, how the workload should be optimized to get the best out of the system.”
There are times when benchmarks do not exist. “Do you really understand your target markets?” asks AMD’s Uhm. “It’s not just the industry, but the dynamics of the industry and where it’s at in its lifecycle. If you’re going to be successful, you’d better understand that. It requires an incredible amount of insight and coordination, but also anticipation because of the long lead time of a semiconductor product.”
Uhm pointed to the original focus behind AMD’s AI engine in the Versal products. “The AI engines were originally called math engines. They were designed for wireless, in particular, targeting where we were going to end up going with 5G and 6G in the future. Just doing incremental improvements to the programmable logic and the DSP blocks, which we did, wasn’t going to be enough. We had to take a different approach. We took the processor-based approach. It’s all about supporting matrix math and linear algebra. The reason we did that was because a lot of what fundamentally goes into advanced signal processing, like beamforming, is matrix math. Knowing that, we could break that down into the instructions we needed to support. Some of those had to be supported in real time, and that drives restrictions around latency and what kind of cycle times that they have. And that expands even further to accommodate the memory and I/O, because if you end up bottlenecked in one or the other, all the compute in the world doesn’t matter. As it turns out, a lot of AI algorithms, particularly neural networks, use fundamentally the same mathematical functions. The difference between wireless and AI are the data types. Wireless uses complex data, usually 16-bit IQ. For AI learning, you’re using real data. When we came up with the first iteration of the engines, it was 8-bit, but now we’ve seen that you can get to 4-bit, and you can actually go lower than that without sacrificing hardly any accuracy.”
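The effect of narrower data types can be explored with a small experiment. The Python sketch below is a generic illustration of symmetric uniform quantization, not AMD's implementation. The function names and sample values are hypothetical. It rounds a handful of real-valued numbers to 8-bit and 4-bit signed integers and reports the worst-case rounding error.

```python
def quantize(values, bits):
    """Round values to signed `bits`-bit integers (symmetric, one scale), then map back to floats."""
    qmax = 2 ** (bits - 1) - 1                       # 127 for 8-bit, 7 for 4-bit
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0         # real value represented by one integer step
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def max_error(values, bits):
    """Largest absolute difference between the original and quantized values."""
    return max(abs(a - b) for a, b in zip(values, quantize(values, bits)))


# Hypothetical real-valued weights.
weights = [0.82, -0.41, 0.07, -0.93, 0.25, 0.0]
err8 = max_error(weights, 8)
err4 = max_error(weights, 4)

print(f"8-bit max error: {err8:.4f}, 4-bit max error: {err4:.4f}")
```

This measures numeric rounding error only. Whether a given network tolerates that error is an empirical question about the model and its training, which a sketch like this cannot answer.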
Conclusion
There are many ways to implement a processor to execute a task efficiently, but that should never be attempted without first fully understanding the target market and the applications. In many cases it requires building in a significant amount of flexibility, because the applications are potentially a moving target.
Once the workloads are understood, the architecture and the implementation of the solution can be discussed.