The Challenge Of Optimizing Chip Architectures For Workloads

Hardware/software co-design has progressed beyond mapping one onto the other. It is about holistically solving a problem in the most efficient manner with a given set of constraints.


It isn’t possible to optimize a workload running on a system just by looking at hardware or software separately. They need to be developed together and intricately intertwined, an engineering feat that also requires bridging two worlds that have a long history of operating independently.

In the early days of computing, hardware and software were designed and built by completely separate teams, often in different rooms or different buildings, and with little or no direct communication or common management. That made it difficult to optimize anything.

More recently, hardware and software engineering teams have learned that huge gains can be made through greater cooperation. The walls first began to soften with the introduction of electronic system-level design (ESL) around the turn of the century, when EDA companies attempted to build flows that automatically created hardware and software from a unified description. With ESL, designs were partitioned and pieces were mapped to the most suitable implementation technologies. That proved to be too ambitious for ESL to gain significant traction, but today the notion of customizing processors for a defined workload is becoming standard practice.

In the race for faster and more efficient designs, optimization is not just about building better hardware or better software. It is about defining the workloads in ways that are cognizant of implementation capabilities. For example, it is important to understand the implications associated with defining an AI learning algorithm that has 1% better accuracy. Do the costs outweigh the benefits? Or if quantization is used, what impact will that have on accuracy, power, cost, and other factors? How efficient are the compilers or other tools that assist in mapping the application to the hardware?

Similarly in quantum computing, what workloads are possible given the error rates and other restrictions of the hardware? There is a process for maturing the system, but not everything happens at the same time or the same rate.

Where are we today?
To understand what’s changing, it is helpful to look at how the process worked in the past. “What happened is that you had one architecture for many, many years that does things in a certain way,” says Suhas Mitra, product marketing director for Tensilica AI products at Cadence. “You developed more software over time, and the hardware evolved, as well. Somebody designed a working architecture for a processor, it fetches instructions and executes. That worked for simple applications. But when they want to scale up to higher applications, they realized that they needed more memory, perhaps some cache. Then they need memory protection, or something else, they need to do certain things a certain way, because of low power — all driven by competition.”

The risk associated with any disruption can be high. “Many years ago, GPU architectures were designed by looking at the algorithmic requirements,” says George Wall, director of product marketing for Tensilica Xtensa Processor IP, Cadence. “But until a hardware prototype existed, there really wasn’t much for the software folks to do. They couldn’t take advantage of that hardware until they had the hardware in hand. So hardware development would precede software. That is changing. Shift left is pushing the industry to accelerate the start of that software development, so that they’re not stuck waiting for the hardware.”

The opening up of processor architectures also is playing into that trend. In a recent blog, Keith Graham, head of the university program at Codasip, wrote that “conventional research has been limited to software algorithms and external hardware resources due to fixed and closed processor architectures. (See figure 1.) Unfortunately, an important component of the research equation — the processor — has been left out. Processor architecture optimization involves two key concepts, tightly coupling application-specific functionality into the processor and enhancing the processor performance through cycle-count reduction.”

Fig. 1: Processor architecture optimization now included in university research. Source: Codasip

But this is not the path that traditionally has been used to optimize a system. “The real challenge here is that for domain-specific architecture developers, to significantly beat existing architectures such as CPUs and GPUs, they need to be highly specialized,” says Russell Klein, HLS platform program director at Siemens EDA. “But to address a large enough market to be commercially successful, they will need to be highly flexible — that is, they must be programmable. This is a tough combination, as the two goals are diametrically opposed. Developers with very specialized needs may want to create their own hardware accelerators, which eliminates the need for it to be highly programmable or target a large market.”

Many of the proposed architectures appear to be in search of a problem. “The evolution from CPU to xPU continues, and there is hardly a day that goes by that does not see a new company launch a new compute platform,” says Neil Hand, strategy director for the IC Design Verification division of Siemens EDA. “Many will not succeed. At its core, the creation of new architectures for a workload is all about the risk/reward ratio. The underlying question is whether anything is happening to change that risk/reward equation. And if so, does it accelerate the transition from general-purpose to workload-specific processing.”

Perhaps it is markets that fail rather than proposed architectures. “I would not assume that someone designs a chip and then thinks, ‘Let me find a problem where I can use that chip,'” says Michael Frank, fellow and system architect at Arteris IP. “That’s backwards thinking. It is possible that you have a chip, and then you look for an alternate use of that chip. Initially, that chip was designed by having a certain use case in mind.”

All of this supposes that hardware can meet minimum requirements. “The primary focus in quantum computing today is to create more reliable quantum computers that can perform all the operations required for fault-tolerant quantum computation, with the required accuracy and precision,” says Joel Wallman, R&D operating manager at Keysight Technologies. “There are two primary advances needed. First, the physical implementations of quantum systems need to be made more robust. Second, control systems for quantum computers need to be developed to implement extremely precise and fast, sub-microsecond control sequences with real-time feedback, enabling efficient circuit execution and error diagnostics.”

Evolving architectures
A significant amount of analysis needs to be performed to select the right architecture. “First you think about the requirements of that use case,” says Arteris’ Frank. “Where are the hotspots? What is the memory footprint? And what are the memory bandwidth requirements for a particular use case? Architects in big companies used to do spreadsheet-driven analysis. This has given way to much more realistic modeling, possibly using an emulator like Gem5 or QEMU. You take your application software, run it, and collect traces. These may be analyzed for the number of reads and writes within a certain amount of time, for the memory footprint, and for what kind of operations are being performed. For example, if you have a lot of vector type of operations, you would use this data to drive your decisions for building a machine. It’s a race to parallelize things. It’s a multi-level type of analysis that you do, and hopefully you don’t leave any rock unturned, because if you miss something important your system may not be able to deal with the workload.”
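The trace analysis Frank describes can be illustrated with a minimal sketch. It is not the output format of Gem5 or QEMU. The `Access` record, its fields, the `summarize` helper, and the 64-byte line size are all assumptions made for illustration. The sketch counts reads and writes, estimates the memory footprint from the distinct cache lines touched, and computes average bandwidth over the traced window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    """One memory access from a hypothetical trace (format is assumed)."""
    time_ns: int  # timestamp in nanoseconds
    op: str       # "R" for read, "W" for write
    addr: int     # starting byte address
    size: int     # number of bytes accessed (assumed >= 1)


def summarize(trace, line_bytes=64):
    """Return read/write counts, footprint, and average bandwidth."""
    if not trace:
        return {"reads": 0, "writes": 0,
                "footprint_bytes": 0, "avg_gb_per_s": 0.0}

    reads = sum(1 for a in trace if a.op == "R")
    writes = sum(1 for a in trace if a.op == "W")
    total_bytes = sum(a.size for a in trace)

    # Footprint: count each distinct cache line touched at least once.
    lines = set()
    for a in trace:
        first = a.addr // line_bytes
        last = (a.addr + a.size - 1) // line_bytes
        lines.update(range(first, last + 1))

    # Bytes per nanosecond is numerically equal to GB/s (10^9 bytes/s).
    span_ns = max(a.time_ns for a in trace) - min(a.time_ns for a in trace)
    avg_gb_per_s = total_bytes / span_ns if span_ns > 0 else 0.0

    return {"reads": reads, "writes": writes,
            "footprint_bytes": len(lines) * line_bytes,
            "avg_gb_per_s": avg_gb_per_s}


# Tiny made-up trace, purely to exercise the function.
demo = [
    Access(time_ns=0, op="R", addr=0, size=8),
    Access(time_ns=10, op="W", addr=60, size=8),   # straddles two lines
    Access(time_ns=20, op="R", addr=128, size=4),
]
stats = summarize(demo)
print(stats)
```

A real flow would feed in millions of accesses and break the numbers down per time window or per address region, but the quantities are the same ones named in the quote: reads and writes over time, footprint, and bandwidth.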

Once a workload has been accepted by the market, the requirements for it often evolve. “What I am seeing in the development community is not so much new workloads, but demands for better performance and efficiency for existing algorithms to deliver solutions that weren’t practical in the past,” says Siemens’ Klein. “Users want real-time 4K image processing and AI built into a pair of glasses for augmented reality, and they want it to weigh a fraction of an ounce, be unobtrusive, and run for a week on a single charge.”

Success breeds proliferation. “Video playback is a good example,” says Frank. “In the old days, they built specialized chips for that, just to decode video. Today, every little device has a video decoder and encoder for high resolution video in all different standards. You have a use case and that use case needs to address different markets. Performance requirements are a generic term because performance requirements may not only include getting the job done in a certain amount of time, it’s also I can get the job done with a certain amount of available energy. It’s a multi-dimensional problem.”

We are seeing a similar situation with AI. “Models are becoming extremely complex in the data center,” says Cadence’s Mitra. “Models also need to be simplified on the edge. That means the spectrum is extremely wide and that is what is driving domain specific computing. One company may want to target IoT and that dictates hardware that will only be applicable for that particular domain.”

For inference, which may happen in both locations, that has significant implications. “Inference is no longer happening just in the cloud, but at the edge, as well,” says Philip Lewer, senior director of product for Untether AI. “This has firmly established the need for AI workload acceleration spanning from standalone high-end AI accelerators that can reach 500 TOPs down to dedicated AI acceleration intellectual property for microcontrollers down in the tens of GOPs of performance.”

Predicting the future
People designing hardware face significant challenges. “Everything has a challenge and also a peril,” says Mitra. “You build something for a type of workload, and it doesn’t pan out because it took three years and now the market is moving in a different direction. Models are evolving, and not just by scaling. Today, for natural language processing, transformer architectures are taking over. If you build an ASIC, you harden everything. You are playing the game of power and area, and what will succeed in a market. There are certain markets where people favor FPGAs because they do not have to harden everything. They can keep some functionality in the fabric. Those are more flexible. Like anything else, there’s a pro and a con to every decision that you make.”

That challenge is being echoed around the industry. “We agree and see an increase in the usage of large language models architected on transformer-based networks,” says Untether’s Lewer. “These networks were initially focused on that use case, but we are now seeing them increasingly being used in other applications, like vision, where they are replacing CNNs.”

Economics drives everything. “If someone thinks an area has stabilized enough, and that they can build hardware to support the networks that they’ll need to run in a couple years, they have more confidence to harden things,” says Cadence’s Wall. “But having a level of programmability in any of these workload solutions is important, because you don’t want to have a design that just targets one particular class of network and then find out two years from now the market has moved on. You need to have that programmability and flexibility in the hardware.”

In addition to algorithmic advances, there are technology advances that may have a large impact. “The current paradigm of moving all the data to the central processing unit — the operative word being ‘central’ — and then back out to memory is incredibly inefficient,” says Klein. “In many systems, data movement limits the overall performance. Processing the data closer to where it is stored can deliver huge improvements in performance and power consumption. I see a big challenge for in-memory/near-memory compute in programming. How does a software developer describe the operations and synchronize the activity? To some degree it is a parallel programming problem, which the software development community has largely shied away from so far.”
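One common way to reason about Klein’s point that data movement limits performance is a roofline-style estimate, which is not something the interviewees describe but is a widely used back-of-the-envelope model. The sketch below assumes a hypothetical machine with a 1,000 GFLOP/s compute peak and 100 GB/s of memory bandwidth. The function names and all of the numbers are illustrative assumptions.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline estimate: the lower of the compute and memory ceilings."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


def is_memory_bound(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """True when the memory ceiling sits below the compute ceiling."""
    return bandwidth_gb_s * flops_per_byte < peak_gflops


# Hypothetical machine parameters (assumed, not from any real device).
PEAK_GFLOPS = 1000.0
BANDWIDTH_GB_S = 100.0

# Vector add on 4-byte floats: 1 operation per 12 bytes moved
# (two operands read, one result written).
vector_add_intensity = 1 / 12

# n x n matrix multiply with ideal on-chip reuse: 2*n^3 operations
# over three n x n matrices of 4-byte floats.
n = 1024
matmul_intensity = (2 * n**3) / (3 * n**2 * 4)

for name, intensity in [("vector add", vector_add_intensity),
                        ("matmul", matmul_intensity)]:
    gflops = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, intensity)
    bound = "memory" if is_memory_bound(
        PEAK_GFLOPS, BANDWIDTH_GB_S, intensity) else "compute"
    print(f"{name}: {gflops:.1f} GFLOP/s attainable, {bound}-bound")
```

Under these assumptions, the vector add reaches less than 1% of peak because every operation waits on memory traffic. That gap is what in-memory and near-memory compute aim to close.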

The state of AI
AI was enabled by GPUs. “This was a situation where software was developed, and it defined a set of requirements for hardware,” says Frank. “At the time, the GPU was the best thing available. The GPU does very nice parallel processing of independent elements, and that is wonderful. ‘This is what my algorithm looks like.’ Or, ‘This is how I can shape my algorithm.’ In fact, sometimes you have to modify the algorithm to work better on existing hardware. At some point they realized that they didn’t need the whole GPU. Instead, they needed something that was more specialized.”

Many aspects of the GPU have evolved into AI accelerators. “The GPU uses floating point,” says Mitra. “But you only need fixed-point version of those floating-point models for inferencing. Today, people are experimenting with even smaller versions of floating-point models — like bfloat16. For some workloads, even 8-bit quantization may be sufficient. You may need higher quantization. There is no reason for us to believe there will be one right answer. There’s no one quantization format for everything. That becomes very hard for a silicon company, or some academic company, because they want to future-proof something. You will have to be able to re-orient, re-do things, and make things happen.”
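To make the quantization tradeoff concrete, the following is a minimal sketch of symmetric 8-bit quantization, one of many possible schemes and not necessarily what any vendor quoted here uses. The function names and sample weights are hypothetical. It maps floating-point values to integers in the range -127 to 127 using a single scale factor, then measures how much precision is lost on the way back.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    if not values:
        return [], 1.0
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Python's round() rounds exact halves to the nearest even integer.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the 8-bit integers."""
    return [q * scale for q in quantized]


def max_abs_error(values):
    """Largest round-trip error introduced by quantization."""
    quantized, scale = quantize_int8(values)
    restored = dequantize(quantized, scale)
    return max((abs(a - b) for a, b in zip(values, restored)), default=0.0)


# Made-up weights, purely for illustration.
weights = [1.27, -0.5, 0.0, 0.25]
q, scale = quantize_int8(weights)
print(q, scale, max_abs_error(weights))
```

The round-trip error of each value is bounded by half the scale factor, so a tensor with one large outlier loses precision everywhere else. That sensitivity is one reason there is no single quantization format that suits every workload.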

As the workloads become better understood, other architectural changes emerge. “For compute parallelization, you’ll continue to see various types of spatial architectures,” says Lewer. “These allow complex graphs to be broken down into thousands of processing elements connected through some type of network on chip. Memory speeds and greater memory depths are enabling rapid access to a large number of weights and activations, and in turn you’ll continue to see a tighter coupling between memory and compute.”

What most architectures are doing today is attempting to optimize operations, not workloads. “Hardware looks for an application, but the workload is underlying,” says Frank. “The workload may be matrix matrix multiplies. You build something that does matrix matrix multiplies well, knowing there are a lot of applications that can make use of that. But you should not confuse application as a synonym for workload. It’s an element of an algorithm that you try to accelerate, and that defines what I would call the workload.”

Software is moving faster than hardware today, and that is pushing hardware teams to look for strategies that not only meet the needs of today’s software, but also can adapt to future needs without sacrificing performance and becoming non-competitive. The demands on software also are evolving, because it must account for more than one operating environment.

“There are phases of optimization,” says Mitra. “There will be phases of new network development, phases of new architectures, and phases of development where people will modify existing architectures to suit their needs. People may be looking at either training or inference or both, depending upon what they are trying to do, and that means they will require different solutions that meet PPA requirements.”

Hardware is being driven to evolve faster than ever, and now the economics of that must be worked out. Software has directly become a driver for hardware, but it also is influenced by what is possible in hardware.

