Complex Tradeoffs In Inferencing Chips

One size does not fit all, and each processor type has its own benefits and drawbacks.


Designing AI/ML inferencing chips is emerging as a huge challenge due to the variety of applications and the highly specific power and performance needs for each of them.

Put simply, one size does not fit all, and not all applications can afford a custom design. For example, in retail store tracking, it’s acceptable to have a 5% or 10% margin of error for customers passing by a certain aisle, while in medical diagnosis or automotive vision, accuracy needs to be significantly higher. But accuracy of results also comes at a cost in terms of power and performance and design complexity. Add to that always-on/no downtime, throughput, data flow and management, latency, and programmability.

In AI, accuracy is a measure of the probability that an answer is correct, and is defined as the number of correct predictions divided by the total number of predictions. For example, if 85 out of 100 samples are correctly predicted, the accuracy is 85%.
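That definition can be expressed as a short, illustrative snippet (the function name and sample data are this sketch's own):

```python
# Accuracy = correct predictions / total predictions.
def accuracy(predictions, labels):
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# 85 of 100 samples predicted correctly -> accuracy of 0.85
preds = [1] * 85 + [0] * 15
labels = [1] * 100
print(accuracy(preds, labels))  # 0.85
```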

Venkatesh Pappakrishnan, senior staff data scientist at Palo Alto Networks, contends it is almost impossible for an ML algorithm to achieve 100% prediction accuracy. In general, a good ML algorithm with 80% to 85% accuracy is more realistic. Achieving accuracy approaching 95% takes major effort, time, deeper domain knowledge, and additional data engineering and collection. Most likely, a model achieving 75% to 85% accuracy may be released and then improved upon later.

Another key metric is precision, which has a direct impact on accuracy. When implementing inference solutions, developers represent numbers in an integer format whose bit width sets the precision. For edge inferencing, that is typically int8 or lower. Int1 represents a 1-bit integer, while int8 represents an 8-bit integer. The higher the bit width, the higher the precision. A simple analogy is the number of pixels in a photograph. The more pixels, the higher the resolution. In inferencing, int8 will yield higher accuracy than int4, but it also requires more memory and longer processing time. In one test, NVIDIA demonstrated a 59% speedup for int4 precision compared with int8.
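To illustrate why lower bit widths trade accuracy for memory and speed, here is a minimal symmetric-quantization sketch. The scheme and names are illustrative only, not any vendor's actual implementation:

```python
# Map float values onto a signed n-bit integer grid and back, then
# compare the rounding error at different bit widths. Illustrative
# symmetric quantization; real toolchains use more elaborate schemes.
def quantize(values, bits):
    qmax = 2 ** (bits - 1) - 1          # 127 for int8, 7 for int4
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) for v in values], scale

def dequantize(quantized, scale):
    return [q * scale for q in quantized]

weights = [0.91, -0.42, 0.07, -0.88]
for bits in (8, 4):
    q, s = quantize(weights, bits)
    restored = dequantize(q, s)
    err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"int{bits}: max rounding error {err:.4f}")
```

Running this shows the int4 grid introduces a noticeably larger rounding error than int8, which is the accuracy cost paid for the smaller memory footprint and faster arithmetic.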

“There is a wide range of accuracy and precision requirements, and it all is dependent on the use case,” said Parag Beeraka, director of segment marketing for the IoT and Embedded, Automotive and Infrastructure Line of Business at Arm. “For example, if AI/ML is used for live language translation, then you do need to have higher accuracy and precision for it to be understandable. But if AI/ML is used for an object recognition use case, the higher the accuracy required, the more complex the mapping of the AI/ML models will be to low-power AI chips. You can reduce the level of complexity by sacrificing some precision and accuracy. This is the reason you see a lot of low-power edge AI chips using int8 (8-bit) format, but you’ll see a lot of newer ML technologies that also support lower (1-bit, 2-bit and 4-bit) formats that cater to much lower-power edge AI chips.”

When and where to make those tradeoffs depends on the application and the use case. “Accuracy and precision depend a lot on system-level use cases,” said Suhas Mitra, director of product marketing for Tensilica AI DSPs at Cadence. “Different metrics are used to determine what accuracy/precision can be tolerated for a certain application. For example, an image classification running on a low-power edge IoT device may be able to tolerate less accuracy vs. an automotive autonomy-based system that requires higher accuracy. All of these affect not only how one designs software but also the hardware.”

Running on top of the AI/ML chips is software. AI/ML algorithms and implementations have evolved over time. In the past, algorithms ran on CPUs. Increasingly, however, such software is being embedded in chips, and for applications at the edge, specific software modules are being deployed.

“AI algorithms span diverse functions,” said Russ Klein, HLS platform director for Siemens EDA. “Some are quite modest and can run comfortably on an embedded processor, while others are large and complex, needing specialized dedicated hardware to meet performance and efficiency requirements. Many factors help determine where AI algorithms are deployed, but the hardware/software tradeoffs for software, an off-the-shelf accelerator, or bespoke hardware happen similarly to those for most embedded systems.”

Software is by far the most flexible and adaptable approach to implementing any function and offers the least expensive development. “Software is ‘future-proof,’” Klein said. “CPUs will run any yet-to-be-discovered inferencing algorithms. CPU-based systems can often be updated while deployed. However, software is likely to be the slowest and most energy-inefficient way to deploy AI algorithms.”

CPUs, GPUs, FPGAs, and ASICs are all used for inferencing chips today. While CPUs are still used in some AI/ML inferencing applications because of their flexibility, GPUs, FPGAs, and ASICs are preferred for deep neural networks (DNNs) and convolutional neural networks (CNNs) because of their greater performance efficiency, and their appeal is growing across a variety of new applications.

GPUs offer very efficient parallel processing and memory management. They typically are programmed in high-level languages such as C/C++, and they work well in high-performance DNN and CNN applications such as data centers and autonomous driving. For edge inferencing applications such as wearables and security cameras, however, GPUs may be too power-hungry.

FPGAs, in contrast, provide programmable I/O and logic blocks. Mapping AI/ML models onto them efficiently, using tools such as hardware description languages (HDLs), is important for inferencing applications, as is efficient memory management.

“Low power, programmability, accuracy, and high throughput are four conflicting forces pulling on the sheet corners of an efficient edge AI solution,” said Andy Jaros, vice president of IP sales, marketing, and solutions architecture at Flex Logix. “ASIC solutions with model-specific accelerators will always be the most power-efficient solution at the expense of losing programmability. Embedded processors have been enhanced over the years to increase the multiply and accumulate (MAC) processing AI models need, but do not have the MAC density for high-accuracy requirements.”

Researchers and system designers are now exploring a number of paths using eFPGAs for AI processing solutions, Jaros said. “The approaches being investigated include model-specific, processor-custom instructions, where the instruction set can change on a model-by-model basis. That instruction set versatility can, in turn, leverage eFPGA DSP MACs for more traditional FPGA-based accelerators, utilizing binary or ternary models that run very efficiently in an FPGA logic structure, while retaining reasonable accuracy. Leveraging eFPGA reprogrammability and flexibility enables an AIoT device, where the end customer can choose the right programmability, power, performance, and accuracy combination for their application,” he said. “Utilizing eFPGAs for AI also provides an added level of security for end users, since their proprietary instructions or accelerators can be programmed post-manufacturing in a secure environment. No one has to see their secret circuits. Bitstream encryption with PUF technologies, like our recent partnership with Intrinsic ID, adds a higher level of security to bitstream protection.”

By their very structure — and unlike GPUs and FPGAs — ASICs are customized for a specific application. Designs can be very expensive, depending upon the complexity and the process node, and making changes late in the design flow to accommodate updated protocols or engineering change orders, for example, can push those costs into the stratosphere. On the other hand, because of the dedicated functions for a specific application, the architecture is much more power-efficient.

“If low power is one of the key criteria, then ASICs are the right solution for building low-power AI chips,” said Arm’s Parag. “An eFPGA might be a good fit if the end device is a low-volume product. However, this could translate to higher cost. There are certain segments that can cater to the eFPGAs, but the majority of the volume would be ASICs.”

According to a report published by the Center for Security and Emerging Technology, ASICs delivered up to 1,000-fold gains in efficiency and speed compared with CPUs, while FPGAs delivered up to 100-fold. GPUs delivered up to 10-fold gains in efficiency and up to 100-fold in speed. Most of these chips can achieve inference accuracy between 90% and almost 100%. (See Table 1 below.)

Table 1: Comparison of various chip technologies in AI/ML inferencing. Source: Saif M. Khan/Center for Security and Emerging Technology

“Specialized accelerators like GPUs, tensor processing units (TPUs), or neural processing units (NPUs) deliver higher performance than general-purpose processors, while maintaining much of a traditional processor’s programmability and demonstrating more per-inference efficiency than CPUs. Increasing specialization, though, risks not having the right operations mix for next-generation AI algorithms. But implementing a specialized eFPGA/FPGA or ASIC accelerator, tailored for a specific AI algorithm, can meet the most demanding requirements.”

Critical real-time applications, such as autonomous mobility, or those that must sip power from energy harvesting, could benefit from a custom developed accelerator. But customized accelerators also are the most expensive to develop, and if they do not have some amount of programmable logic built in, they could become outdated very quickly.

“An FPGA or eFPGA retains some amount of re-programmability, but at the cost of higher power and lower performance than an equivalent ASIC implementation,” said Siemens EDA’s Klein.

As with most designs, reusability reduces design cost. In some cases, up to 80% of a chip may be re-used in the next version. For inferencing devices, being able to re-use IP or other parts of a chip can significantly improve time to market as well, which is important because algorithms are almost constantly being updated. While general-purpose chips, such as CPUs, can be used for inferencing with different software or algorithms, the tradeoff is lower performance. On the other hand, the reusability of an ASIC is much more limited unless a very similar application is implemented. The middle of the road is an FPGA or eFPGA, which is built mostly from standard programmable logic, allowing the design to be reprogrammed with minimal effort.

“There are a number of considerations in making part of the design reusable for AI,” said Arm’s Parag. “These include scalable hardware AI/ML accelerator IP (with good simulation and modeling tools), a software ecosystem to support the different frameworks on the scalable hardware accelerator IP, and multi-framework support models to cover the broadest number of use cases.”

Others agree. “The main consideration would be how to map new AI model topologies quickly,” said Cadence’s Mitra. “Sometimes we get caught up into squeezing every ounce out of the hardware, but AI networks change so quickly that optimizing every line of logic may have a counter effect. For a design to be reusable, it should be able to handle a big, broad umbrella of math-heavy functions that include various formats of convolutions, activation functions, etc.”

The scaling factor
Today’s AI/ML inferencing accelerator chip design has the challenge of packing high-performance processing, memory, and multiple I/Os, all within a small package. But high-performance processing consumes more power and generates more heat, and design teams must strike a balance between performance, power, and cost.

Add sensor fusion — audio, video, light, radar, for example — and this can become much more complex. But at least some of the industry learnings can be utilized.

“One simple way to solve the sensor fusion problem in the video/imaging interface is to adopt the MIPI standard,” said Justin Endo, senior manager for marketing and sales at Mixel. “Initially, MIPI was used for the mobile industry. It now has expanded to cover many consumer and AI/ML edge applications. For example, the Perceive Ergo chip, an inference processor, has 2 MIPI CSI-2 and 2 CPI inputs and one MIPI CSI-2 output, which support two simultaneous image processing pipelines — one high-performance 4K using a 4-lane MIPI D-PHY CSI-2 RX instance, and one 2K/FHD using a 2-lane MIPI D-PHY CSI-2 RX instance.”

Perceive’s Ergo chip, an ASIC, is capable of 30 FPS at 20 mW in a video inferencing application. Other AI/ML chips may consume 2 to 5 watts, depending upon the chip architecture. In battery-operated devices such as home security cameras, low power consumption matters. It makes the battery last longer, and in a wearable application, it also helps the device run at a lower temperature.

“Efficiency is important,” said Steve Teig, CEO of Perceive. “When higher performance is required, the difference in power consumption is more pronounced. For example, if the video performance is increased to 300 FPS, the Ergo chip will consume around 200 mW, while other chips may consume between 20 W and 50 W. This can be significant.”

Not everyone measures performance and power the same way, however. Syntiant, a supplier of always-on inference ASICs, demonstrated its performance in the Inference Tiny v0.7 test conducted by the tinyML organization. In one test, its products scored latencies of 1.8 ms and 4.3 ms, compared with other products ranging from 19 to 97 ms. In the energy/power consumption category, Syntiant scored 35 µJ and 49 µJ, compared with others consuming 1,482 µJ to 4,635 µJ. According to Syntiant, the chips can run full-scale inferencing operations at 140 µW.
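Per-inference energy figures like these follow from the relation energy = average power × latency (milliwatts × milliseconds = microjoules). A back-of-envelope sketch, with purely illustrative numbers rather than any vendor-reported pairing:

```python
# Energy per inference = average power x latency.
# mW x ms = uJ, since 1e-3 W * 1e-3 s = 1e-6 J.
def energy_uj(avg_power_mw, latency_ms):
    return avg_power_mw * latency_ms

# Hypothetical chips: a 20 mW part finishing an inference in 2 ms
# spends 40 uJ; a 2 W part needing 5 ms spends 10,000 uJ.
print(energy_uj(20, 2))    # 40
print(energy_uj(2000, 5))  # 10000
```

This is why latency and power must be read together: a faster chip at higher power can still win on energy per inference, and vice versa.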

But making comparisons between these chips is something of a minefield. There are no universal standards available to measure AI/ML inference performance. Therefore, it is important for users to look at how a chip performs under a real-world workload in a specific domain, and to put that in the context of what is important for that end market. Performance may be less of an issue than power consumption in some applications, and the reverse may be true in others. The same goes for accuracy versus performance. Achieving higher inferencing accuracy requires more complex algorithms with longer execution times.

Balancing all of these factors is a constant challenge to AI/ML inferencing. What is the right chip, or combination of chips, depends on the application and use case. In some cases, it may be a fully customized design. In others, it may be some combination of off-the-shelf standard parts and re-used IP that are cobbled together to meet a very tight deadline.

“AI algorithms are growing more complex, with computations and parameters increasing exponentially,” said Siemens EDA’s Klein. “Computational demands are far outpacing silicon improvements, pushing many designers toward some form of hardware acceleration as they deploy AI algorithms. For some applications, a commercial accelerator or neural network IP suffices. But the most demanding applications call for a custom accelerator to take developers from an ML framework into specialized RTL quickly. High-level synthesis (HLS) offers the fastest path from an algorithm to hardware implementation targeting ASIC, FPGA, or eFPGA. HLS reduces design time and, perhaps more important, demonstrates the AI algorithm has been correctly implemented in hardware, addressing many of the verification challenges.”
