AI At The Edge: Optimizing AI Algorithms Without Sacrificing Accuracy

Using benchmarks to guide implementation of AI compression techniques without unduly impacting accuracy.


The ultimate measure of success for AI will be how much it increases productivity in our daily lives. However, the industry faces huge challenges in evaluating progress. The vast range of AI applications is in constant churn: finding the right algorithm, optimizing it, and finding the right tools. In addition, the complex hardware underneath is being updated rapidly across many different system architectures.

Recent history of the AI hardware conundrum

A 2019 Stanford report stated that demand for AI compute is accelerating faster than hardware development. “Prior to 2012, AI results closely tracked Moore’s Law, with compute doubling every two years. […] Post-2012, compute has been doubling every 3.4 months.”

Since 2015, when an AI algorithm first beat the human error rate in object identification, large investments in AI hardware have pushed semiconductor IP to expedite next-generation processing, memories, and higher-bandwidth interfaces to try to keep pace. Figure 1 shows how quickly an image classification competition progressed once modern deep neural networks, trained with back-propagation and combined with compute-heavy GPU engines from Nvidia, were introduced in 2012.

Fig. 1: After the introduction of modern neural networks in 2012, classification error rates dropped rapidly and soon beat the human error rate.

AI algorithms

AI algorithms are too large and too demanding to run on SoCs designed for consumer products, which require low power, small area, and low cost. The algorithms are therefore compressed using techniques such as pruning and quantization. These techniques reduce the memory and compute the system requires, but they also affect accuracy. The engineering challenge is to apply compression without degrading accuracy beyond what the application can tolerate.
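
As a minimal sketch of these two techniques, assuming a PyTorch environment and an already-trained model (the 50% sparsity and int8 settings below are placeholder choices, not recommendations), compression might be applied like this:

    import torch
    import torch.nn.utils.prune as prune

    def compress(model, sparsity=0.5):
        # Magnitude pruning: zero the smallest weights in each Conv/Linear layer.
        # The sparsity level is an application-specific knob chosen against an
        # accuracy budget, not a fixed rule.
        for module in model.modules():
            if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
                prune.l1_unstructured(module, name="weight", amount=sparsity)
                prune.remove(module, "weight")  # make the pruned weights permanent

        # Post-training dynamic quantization: store Linear weights as int8 and
        # quantize activations at runtime, trading some accuracy for memory/compute.
        return torch.quantization.quantize_dynamic(
            model, {torch.nn.Linear}, dtype=torch.qint8
        )

In practice, accuracy is re-measured on validation data after each step, and heavier methods such as structured pruning or quantization-aware training are used when simple post-training compression loses too much accuracy.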

In addition to the growth in AI algorithm complexity, the amount of data required for inference has also grown dramatically because of larger inputs. Figure 2 shows the memory and compute required for an optimized vision algorithm engineered to a relatively small footprint of 6MB of memory (the memory requirement for SSD-MobileNet-V1). In this particular example, the larger challenge isn't the size of the AI algorithm but the size of the input data. As resolution and color depth increase, the memory requirement grows from 5MB to over 400MB for the latest image captures.

Today, the latest Samsung mobile phone CMOS image sensors support up to 108MP. Such cameras could theoretically require 40 tera operations per second (TOPS) of performance at 30fps and over 1.3GB of memory. Techniques in the ISP, along with region-of-interest processing in the AI algorithms, have kept requirements below these extremes, and 40 TOPS of performance isn't yet available in mobile phones. But the example highlights the complexity and challenges of edge devices, and it is driving sensor interface IP as well: MIPI CSI-2 is specifically adding region-of-interest capabilities to address this, and MIPI C/D-PHYs continue to increase bandwidth to handle the data from the latest CMOS image sensors as they push toward hundreds of megapixels.

Fig. 2: Memory and compute requirements for SSD-MobileNet-V1 engineered to a 6MB memory footprint, benchmarked across input pixel sizes.
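
As a rough check of the figures quoted above, the back-of-the-envelope sketch below assumes float32 RGB input buffers, about 1.2 billion operations per 300×300 SSD-MobileNet-V1 frame, and compute that scales linearly with pixel count; none of these assumptions come from the benchmark itself.

    # Rough arithmetic behind the 108MP example; assumptions noted above.
    pixels        = 108e6   # 108MP sensor
    channels      = 3       # RGB
    bytes_per_val = 4       # float32 activation buffer (assumed)
    fps           = 30

    frame_bytes = pixels * channels * bytes_per_val
    print(f"Input tensor per frame: {frame_bytes / 1e9:.2f} GB")        # ~1.30 GB

    ops_300x300   = 1.2e9   # assumed ops per 300x300 SSD-MobileNet-V1 inference
    ops_per_frame = ops_300x300 * pixels / (300 * 300)
    print(f"Compute at {fps} fps: {ops_per_frame * fps / 1e12:.0f} TOPS")  # ~40+ TOPS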

Solutions today compress the AI algorithms, compress the images, and focus on regions of interest. This makes hardware optimization extremely complex, especially for SoCs with limited memory, limited processing, and small power budgets.

Many customers benchmark their AI solutions, and existing SoCs are benchmarked with several different methods. Tera operations per second (TOPS) is a leading indicator of performance, and additional performance and power measures, such as the types and precision of operations a chip can process, give a clearer picture of its capabilities. Inferences per second is also a leading indicator, but it needs context such as the clock frequency, the model being run, and other parameters. So, additional benchmarks have been developed for evaluating AI hardware.
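
As a concrete illustration of why those headline numbers need context, the sketch below relates peak TOPS, achieved utilization, and inferences per second using entirely hypothetical hardware and model parameters.

    # Hypothetical numbers only; real chips publish very different values.
    mac_units   = 4096      # parallel multiply-accumulate units
    clock_hz    = 1.0e9     # 1 GHz clock
    utilization = 0.5       # fraction of peak sustained on a real workload

    peak_tops      = mac_units * 2 * clock_hz / 1e12   # 2 ops (mul + add) per MAC
    effective_tops = peak_tops * utilization

    model_ops = 2.4e9       # assumed operations per inference for some network
    inferences_per_second = effective_tops * 1e12 / model_ops

    print(f"Peak: {peak_tops:.1f} TOPS, sustained: {effective_tops:.1f} TOPS")
    print(f"~{inferences_per_second:.0f} inferences/s on this workload")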

There are standardized benchmarks such as those from MLPerf/MLCommons and ai-benchmark.com. MLCommons provides measurement rules covering accuracy, speed, and efficiency, which is very important for understanding how well hardware handles different AI algorithms. As mentioned earlier, compression techniques can fit AI into very small footprints, but without clear accuracy goals it is easy to trade away more accuracy than the application can afford. MLCommons also provides common datasets and best practices.

The Computer Vision Lab at ETH Zurich also provides benchmarks for mobile processors and publishes its results and hardware requirements, along with other information that enables reuse. The suite includes 78 tests covering over 180 aspects of performance.

An interesting benchmark from Stanford, DAWNBench, has since supported the MLCommons effort, but its tests measured not only an AI performance score but also the total time for processors to execute both training and inference of AI algorithms. This addresses a key hardware engineering goal: reducing total cost of ownership. The time it takes to process AI workloads helps determine whether renting cloud AI capacity or owning edge computing hardware is more viable for an organization's overall AI hardware strategy.
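
A toy total-cost-of-ownership comparison, with entirely made-up prices that ignore maintenance and utilization, shows how processing time feeds that decision.

    # Hypothetical prices; the point is the break-even structure, not the values.
    cloud_cost_per_hour = 3.00      # rented cloud accelerator, $/hour
    edge_hw_cost        = 20000.0   # up-front edge hardware purchase, $
    edge_cost_per_hour  = 0.30      # power and hosting for owned hardware, $/hour

    break_even_hours = edge_hw_cost / (cloud_cost_per_hour - edge_cost_per_hour)
    print(f"Owning edge hardware pays off after ~{break_even_hours:.0f} compute hours "
          f"(~{break_even_hours / 24:.0f} days of continuous use)")

Faster hardware shifts this break-even point, which is why time-to-result benchmarks like DAWNBench matter for the decision.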

Another popular benchmarking method is to use common open-source graphs and models such as ResNet-50. There are three concerns with some of these models. First, ResNet-50 is typically run on small ImageNet-style inputs (around 224×224 to 256×256 pixels), which is not necessarily the resolution used in the end application. Second, the model is older and has fewer layers than many newer models. Third, the model may be hand-optimized by the processor IP vendor and therefore not represent how the system will perform with other models. That said, a large number of open-source models beyond ResNet-50 are in use that better represent the latest progress in the field and provide good indicators of performance.
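
For teams taking this route, a minimal latency measurement might look like the sketch below, assuming a recent PyTorch/torchvision environment; a full benchmark, for example under MLPerf rules, would also fix accuracy targets, preprocessing, and batch sizes.

    import time
    import torch
    import torchvision.models as models

    # Open-source ResNet-50 with pretrained ImageNet weights (downloads on first use).
    model = models.resnet50(weights="IMAGENET1K_V1").eval()
    x = torch.randn(1, 3, 224, 224)     # standard ResNet-50 input resolution

    with torch.no_grad():
        for _ in range(10):             # warm-up runs to stabilize caches/clocks
            model(x)
        runs = 50
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        elapsed = time.perf_counter() - start

    print(f"{runs / elapsed:.1f} inferences/s ({1000 * elapsed / runs:.1f} ms each)")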

Finally, customized graphs and models for specific applications are becoming more common. This is the best-case scenario for benchmarking AI hardware, since it ensures optimizations can be made effectively to reduce power and improve performance.

SoC developers have very different goals: some SoCs aim to provide a platform for high-performance AI, others for lower performance, some for a wide variety of functions, and others for very specific applications. For teams that don't yet know exactly which AI models their SoC must be optimized for, a healthy mix of custom and openly available models provides a good indication of performance and power, and this mix is the most common approach in today's market. However, the newer benchmarking standards described above appear to be gaining relevance for comparisons once SoCs reach the market.

Pre-silicon evaluations

Due to the complexity of optimization at the edge, AI solutions today must co-design the software and hardware. To do this, teams must use the right benchmarking techniques, such as those outlined earlier. They also need tools that let designers accurately explore different optimizations of the system, the SoC, or the semiconductor IP, investigating the process node, the memories, the processors, the interfaces, and more.

Synopsys provides effective tools to simulate, prototype, and benchmark the IP, the SoC, and, in certain cases, the broader system.

The Synopsys HAPS prototyping solution is commonly used to demonstrate the capabilities and tradeoffs of different processor configurations. In particular, Synopsys has demonstrated where bandwidth in the broader AI system, beyond the processor, begins to become a bottleneck and when additional sensor-input bandwidth (via MIPI) or memory-access bandwidth (via LPDDR) may not be optimal for a processing task.

For power estimation, vendors' figures can vary widely, and emulation has proven more accurate than simulation and/or static analysis for AI workloads. This is where the Synopsys ZeBu emulation system can play an important role.

Finally, system-level views of the SoC design can be explored with Platform Architect. Initially used for memory and processing performance and power exploration, Platform Architect is now increasingly used to understand system-level performance and power for AI. Sensitivity analysis can be performed to identify optimal design parameters using Synopsys IP, with pre-built models of LPDDR, ARC Processors for AI, memories, and more.

Summary

AI algorithms are bringing constant change to hardware, and as these techniques move from the cloud to the edge, the optimization problems become more complex. To ensure competitive success, pre-silicon evaluation is becoming more important. Co-design of hardware and software has become a reality, and the right tools and expertise are critical.

Synopsys has a proven portfolio of IP that is being used in many AI SoC designs, along with an experienced team developing AI processing solutions, from ASIP Designer to the ARC Processors. A portfolio of proven Foundation IP, including memory compilers, has been widely adopted for AI SoCs. Interface IP for AI applications ranges from sensor inputs through I3C and MIPI, to chip-to-chip connections via CXL, PCIe, and Die-to-Die solutions, and networking via Ethernet.

Finally, Synopsys tools provide a method to utilize the expertise, services, and proven IP in an environment best suited to optimize your AI hardware in this ever-changing landscape.


