Architectural Considerations For AI

What will it take to be successful with a custom chip for processing a neural network targeting the data center or the edge? A lot of money is looking for that answer.


Custom chips, labeled as artificial intelligence (AI) or machine learning (ML), are appearing on a weekly basis, each claiming to be 10X faster than existing devices or to consume 1/10 the power. Whether that is enough to dethrone existing architectures, such as GPUs and FPGAs, or whether they will survive alongside those architectures isn’t clear yet.

The problem, or the opportunity, is that the space of possible designs is huge. When architecting an AI/ML device, a number of questions need to be addressed:

  • What is the application, or range of applications, it is intended to cover?
  • Is it for training or inference, and where is the inference going to be done?
  • What is the size of the market, and is that big enough to support a single-purpose product or does it need to be more flexible?
  • How sticky is a design win, and how long will it be before either new hardware gets an advantage or algorithms advance, obsoleting existing hardware?
  • How do you create a significant competitive advantage that can be maintained?

Hardware creation always has been a continuum between a full-custom ASIC and a general-purpose programmable device. What has changed is that new types of architectures are being thrown into the mix that change the relationship between processing and memory. That means multiple continuums now exist. Or, viewed another way, these architectures provide an additional axis of freedom that has become relevant.

“The architectural innovation that we have seen in the last three or four years is something that we haven’t seen over the preceding couple of decades,” says Stelios Diamantidis, director of AI products and research at Synopsys. “Programmability will give you more flexibility, which is useful for things like autonomous vehicle environments. But then ultra-optimality is also extremely interesting for cases where you know your application will not change.”

For training, which is done in the data center, everyone is looking to dethrone the GPU. “The GPU remains a temporal machine,” notes Diamantidis. “It remains something that was architected for geometry processing, looking at vertices in isolation and in parallel streams. However, the architecture of the GPU has certainly evolved. The predominant GPU architecture in the data center today looks more like an AI chip, something designed from scratch, compared to a GPU from 2016.”

The same is true for FPGAs that now contain many fixed function blocks. “Do you harden an AI inference block?” asks Nick Ni, director of product marketing for AI and software at Xilinx. “Or do you harden a particular 5G algorithm block to optimize for a particular niche application? You have to look at the ROI and determine if this is for a small market segment, or can it be re-purposed for larger adjacent markets? There’s a whole strategy that goes behind it.”

While the FPGA is making a strong push for inference in the data center, no de facto solutions have emerged. “Most of the chips being built are specialized ASICs targeting an end application,” says Susheel Tadikonda, vice president of engineering at Synopsys’ Verification Group. “Consider vision. Facebook has its own headset for AR/VR, and inside that is their own vision processor. That particular AI chip is targeting this application. If I put that in an automobile, it might not do the job. That vision processor is doing something different. It has a different workload. A lot of ASICs cater to the end application. They are all playing within a niche market.”

Multiple solutions may coexist. “I don’t think it’s a winner takes all market,” says Anoop Saha, senior manager for strategy and business development at Siemens EDA. “It is a growing market and there are different demands from each part of the market. There is space for GPUs, there is space for processors, there is space for FPGAs, and there will be space for data center specific chips — specific for AI.”

The mix may change over time. “FPGAs are used when there is no off-the-shelf solution and the volumes do not warrant implementing a full ASIC yet,” says Bob Beachler, VP of product for Untether AI. “In the early days of the AI explosion (2016-18) there were only GPUs and FPGAs available to accelerate AI workloads. FPGAs were better than GPUs from a performance standpoint, but at the cost of ease of implementation. Now, with the advent of purpose-built AI accelerators, the need to use FPGAs for AI is diminished, as AI accelerators provide substantially better throughput, latency, and performance compared to GPUs and FPGAs. FPGAs will always have a home in novel use cases where standard products are not available. Luckily for them there are always new applications being invented.”

Edge or data center?
The discussion needs to separate computation done in the data center and at the edge. “AI chips in the data center are all about chewing through massive amounts of data to do complex computations,” says Synopsys’ Tadikonda. “This is where we talk more in terms of learning, and GPUs dominate. There are very few companies attempting to make AI chips for the data center. Google is the most visible with the TPUs. There are efforts from other companies, but from an economics point of view it is going to be very hard for these companies to play a big role. To build an AI chip that targets the data center there needs to be volume, and AI chips are pretty expensive to make. Companies like Google can afford it because they are their own consumers.”

To succeed in the data center takes more than just a good chip. “You need strong hardware, and a board with the right heat sink and air flows,” says Xilinx’s Ni. “That needs to be qualified or certified across many servers and OEMs in order for it to be in the game. It’s not good enough to have a one peta-op chip that doesn’t fit into the right form factor, or doesn’t fit into most of the servers. That can be a very difficult problem.”

A strong case can be made for generality. “Learning chips are seeing decent volume, but the inferencing chip volume is quite a bit higher in comparison,” says Joe Mallett, senior product marketing manager at Synopsys. “The adoption of the FPGA was because of its power reduction compared to a GPU. Using a GPU for inferencing is very high-power, and FPGAs are starting to be considered the low-power, low-cost solution. Custom ASICs are a follow on to those FPGAs. Some companies are betting there are generic enough neural nets, where they can make devices that are semi-programmable, put in an ASIC, and take over the market from the entrenched players.”

A discussion about the edge becomes a lot more confusing. “People think that AI within your cell phone is the edge,” says Tadikonda. “But the definition of edge moves beyond that, especially because of 5G innovations. The consumer or edge AI is mostly devices such as IoT devices and cell phones. Then there is something else called the enterprise edge, which could include robotics, or your manufacturing unit, or even in your small cell where a lot of data gets aggregated in 5G. Then there is a next level of edge, such as the telecom edge. That deals with the data just before it hits the data center.”

Many companies are ready to predict that devices targeting these will become custom silicon. “Due to challenging design constraints in edge AI devices and applications, specialized AI processors are expected to replace general-purpose architectures over the coming years,” said Tim Vehling, senior vice president of product and business development at Mythic.

Another example comes from Dana McCarty, vice president for inference sales, marketing and applications at Flex Logix. “We believe that in AI Edge platforms, the GPU is what everyone is using today,” he said. “FPGAs are not used except in expensive pre-production prototypes. In 2022, edge AI accelerators will start to appear.”

If it sounds like the Wild West, Siemens’ Saha agrees. “It’s a complete Wild West out there. I don’t think we have anything that can be called a dominant player on the edge. It’s much more dynamic and much more uncertain as to how things will evolve. It is a big and growing market, and there is a disruption in the market driven by both the needs and application use cases, as well as what the existing players are doing. Add performance, power, and energy efficiency to it, and you need many different solutions.”

But herein lies the big difference. “There’s no such thing as an AI chip at the edge,” says Synopsys’ Diamantidis. “AI acceleration and processing is a key component of systems that operate at the edge, but those systems include different kinds of processing, as well. In many cases they have to include scalar CPUs.”

Tadikonda is in full agreement. “For most of the edge there is no independent AI chip. It’s an SoC, and part of the SoC is an AI IP. We are seeing around 5% to 8% of the die area being allocated to the AI engine. We expect to see that grow in the future.”

Developing an SoC requires a greater breadth of experience. “For any startup claiming they are going to do a big edge or endpoint AI, they not only have to get the neural network engine right — with the right neural network support and the right performance — but they also have to get the SoC around that right,” says Ni. “In addition, to get into markets such as robotics or automotive, you have to get the necessary safety certifications.”

On the edge, a good chip can create new markets. “Energy efficiency is far more important at the edge than at the data center,” says Saha. “Think about the extra value that a chip can create for the user if it can be battery-operated versus the need to be always plugged in. A chip very specifically designed for a specific task may not be high volume because it’s a very specific design, but if the financial metrics play out, you will see growth of those custom chips.”

The right chip can change the ROI. Diamantidis points to a hearing aid as an example. “Consider what happens if you’re able to fit a model in a hearing aid that can autonomously perform a task such as natural language processing. This is not possible today. It would mean I don’t have to transmit the audio content to a data center for processing, which means that my device is smaller because I did not need communications circuitry. My device consumes significantly less power because I don’t have to transmit anything. And the applications can operate at significantly lower latency and enable some very significant breakthroughs in terms of the user experience, simply because I was able to fit the model and run it on device versus having it communicate,” he says.

The software connection
The past is littered with significant hardware advances that failed because of inadequate software. “AI developers want no part in writing custom hardware using hardware description languages,” says Greg Daughtry, member of the technical and business staff at CacheQ Systems. “They are accustomed to using tool frameworks that are basically Python scripts. So how can they customize or add additional logic to differentiate themselves from what everybody else has?”

While the framework may be standardized, a lot of software is still involved. “A solution stack can be used directly by the AI scientists,” says Ni. “But we actually invest more in software engineers, or compiler engineers, than hardware engineers to ensure we are successful. Because of that, it means that the AI scientists can pretty much take any models they have trained on a CPU, GPU, or whatever, from TensorFlow or PyTorch, and directly compile them into FPGAs. That is a huge investment.”

Ni is quick to praise the early entrants, Nvidia and Intel. “I would definitely credit Intel and Nvidia because they invested in CUDA, as an example, more than a decade ago. They were then the first ones to jump onto deep neural network support in their tools. It took them a long time to build up a community of enough people who can basically take their trained network and do inference on GPUs. This effort literally takes many years. It took Nvidia more than 10 years, but the good news is they paved the way for everyone that follows. Now we know exactly what to do.”

But that is not the end of the story. “The real difficulty isn’t the application software,” says Mallett. “It’s the drivers, it’s the pieces that interface between the hardware that’s doing the functionality, and the operating system or whatever’s above it.”

Tadikonda goes a little further. “The issue is how do you map your compiler output into these ASICs, or different architectures. That’s hard. This is the meat of every offering. The chip is actually quite simple. AI chips are very straightforward. The complexity is in the software that maps the model down into the hardware. Legacy providers have an advantage in that their library, and their application and framework support, is robust. People can just adopt it, and it becomes straightforward.”

What makes it so difficult? “Neural networks are evolving, and nowadays people are using more transformer-based networks,” says Ni. “These are disruptive changes. Hardware needs a very different approach to buffering and new memory hierarchies so that the next layer can start to compute when the data is ready. That is an extremely difficult computer architecture problem. In almost 90% of the cases, memory is the bottleneck for running any neural network workloads. Devices with a hardened task structure, such as a CPU or GPU, cannot modify their relationship to memory.”
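One common way to reason about the memory bottleneck Ni describes is the roofline model, in which attainable throughput is the smaller of peak compute and memory bandwidth multiplied by arithmetic intensity (operations per byte moved). The Python sketch below illustrates this for a single matrix-vector product. It is not taken from the interviewees: the peak-compute and bandwidth figures, and all function names, are hypothetical values chosen only for illustration.

```python
# Roofline sketch: is a workload limited by compute or by memory traffic?
# All hardware numbers below are hypothetical, not measurements of any real chip.

PEAK_FLOPS = 100e12      # assumed peak compute: 100 TFLOP/s
MEM_BANDWIDTH = 1e12     # assumed memory bandwidth: 1 TB/s


def attainable_flops(peak_flops: float, mem_bandwidth: float, intensity: float) -> float:
    """Roofline bound: min(peak compute, bandwidth * FLOPs-per-byte)."""
    return min(peak_flops, mem_bandwidth * intensity)


def is_memory_bound(peak_flops: float, mem_bandwidth: float, intensity: float) -> bool:
    """True when memory traffic, not arithmetic, sets the ceiling."""
    return mem_bandwidth * intensity < peak_flops


def matvec_intensity(n: int, bytes_per_value: int = 2) -> float:
    """Arithmetic intensity of y = W @ x for an n-by-n weight matrix.

    Counts 2*n*n FLOPs (one multiply and one add per weight) against
    reading n*n weights and n inputs and writing n outputs, with no reuse.
    """
    flops = 2 * n * n
    bytes_moved = bytes_per_value * (n * n + 2 * n)
    return flops / bytes_moved


if __name__ == "__main__":
    intensity = matvec_intensity(4096)
    bound = attainable_flops(PEAK_FLOPS, MEM_BANDWIDTH, intensity)
    print(f"intensity: {intensity:.3f} FLOPs/byte")
    print(f"attainable: {bound / 1e12:.2f} TFLOP/s of {PEAK_FLOPS / 1e12:.0f} peak")
    print("memory-bound" if is_memory_bound(PEAK_FLOPS, MEM_BANDWIDTH, intensity) else "compute-bound")
```

Under these assumed numbers, a layer that performs roughly one operation per byte fetched can use only a small fraction of the peak compute, which is why buffering and memory hierarchy choices matter so much to the architecture.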

The industry is grappling with three things at once: evolving algorithms, understanding what is possible in hardware, and perfecting the software that maps one onto the other. After a company has spent the time and money to build a successful product, will it manage to build a moat around itself?

Part two will explore the issues surrounding both hardware and software churn rates and the advantages that may be inherent in early design wins.


Gil Russell says:

One of the most limiting factors is that we seem to be stuck in a Von Neumann mindset in which incremental hardware architecture is exceedingly difficult to “mentally climb out of” when it comes to solving a “Brain Inspired” problem of which we do not fully understand yet are approaching a useable set of options for integration into the world of commerce.

The implication is that the problem is “In-Memory” and that it is highly parallel in nature – and has to run on ~twenty Watts.

So far no one has come upon “The Master Algorithm” though some are exhibiting strong salable efforts in that direction – try Pattern Computer, Redmond/Friday Harbor, Washington, announcing a 15 second Covid-19 test without reagents this week. Their platform? “Pattern Discovery Engine” a trademarked item that beats IBM in their newly announced “Watson Discovery as a service”. There’s lots more coming from Pattern Computer – the world seems to be their oyster at present.

The original name for Pattern Computer was “Coventry” as in “sent to Coventry”. I’ll let you figure that out – the answer is a tell-all about dogma…

Michael Kanellos says:

Great story. I’ve been watching AI processor companies hatch at what seems like a rate of once a week and I’ve wondered where it might end.
