Inference-focused benchmarks can distract SoC designers from optimizing performance for end-to-end AI applications.
Artificial Intelligence (AI) is shaping up to be one of the most revolutionary technologies of our time. By now you’ve probably heard that AI will transform entire industries, from healthcare to finance to entertainment, delivering richer products, streamlining experiences, and augmenting human productivity, creativity, and leisure.
Even non-technologists are getting a glimpse of how pervasive AI’s impact on the world could be, thanks to accessible, cloud-based applications such as ChatGPT, an AI-powered chatbot that holds natural-language conversations on a wide range of topics, and DALL-E, an AI model that generates images from textual descriptions.
To meet the demand from researchers and companies looking to pioneer and profit from these AI applications, the AI hardware industry is growing rapidly. According to a report by Allied Market Research, the global AI hardware market was valued at $4.85 billion in 2018 and is expected to reach $261.11 billion by 2027, growing at a CAGR of 38.9% from 2020 to 2027.
Existing hardware juggernauts in both the semiconductor and hyperscaler worlds, as well as silicon startups targeting the data center and the edge, have been investing heavily in new processors, chips, and other hardware designed specifically for AI workloads. This competition has produced an abundance of compute platforms for AI application developers to choose from; with so many available, however, it has become increasingly difficult to pick the right one for a given solution. Anticipating which compute platform will best support the applications of the future, applications that are not yet possible or don’t yet exist, is an even more herculean task.
In an effort to make comparing these compute platforms more straightforward, engineers and researchers from Baidu, Google, Harvard University, Stanford University, and the University of California, Berkeley launched MLPerf in 2018: a set of industry-standard benchmarks for measuring machine learning performance, now administered by the MLCommons consortium. The MLPerf benchmarks have become useful tools for comparing the relative performance of different systems on Deep Neural Network (DNN) inference. DNN inference performance, however, is not always a good indication of a platform’s broader AI application performance.
MLPerf benchmarks include a handful of general-purpose DNNs covering a variety of AI use cases, such as Image Classification, Object Detection, Speech-to-Text, and Natural Language Processing (NLP). In theory, these benchmarks give an AI application developer an apples-to-apples comparison of how each compute platform performs on the tasks they want to build solutions for; in practice, however, they fail to address one of the most important aspects of deploying AI applications: the compute surrounding the DNN inference.
MLPerf’s Inference Rules state that “sample-independent pre-processing that matches the reference model is untimed.”
Excluding the time needed to perform these operations is problematic because deploying an AI application is rarely, if ever, as simple as passing image data directly from a camera or sensor to a DNN for inference. Just as rarely are the raw inference outputs of those DNNs meaningful without some type of post-processing.
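To make the excluded work concrete, here is a minimal sketch, assuming a typical camera-to-DNN vision pipeline written with NumPy, of the kind of sample pre-processing that the rules leave untimed but that every deployed frame still pays for. The 224×224 input resolution and normalization constants are illustrative, not taken from any MLPerf reference model.

```python
import numpy as np

def preprocess(frame: np.ndarray, size: int = 224) -> np.ndarray:
    """Resize, normalize, and re-layout a raw HWC uint8 camera frame for a DNN."""
    h, w, _ = frame.shape
    # Naive nearest-neighbor resize to the model's input resolution.
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = frame[rows][:, cols]
    # Scale to [0, 1] and normalize with illustrative per-channel statistics.
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    normalized = (resized.astype(np.float32) / 255.0 - mean) / std
    # Transpose HWC -> CHW, the layout most inference runtimes expect.
    return np.ascontiguousarray(normalized.transpose(2, 0, 1))

camera_frame = np.random.randint(0, 256, (1080, 1920, 3), dtype=np.uint8)
model_input = preprocess(camera_frame)  # shape (3, 224, 224), dtype float32
```

None of this work would appear in an inference-only benchmark result, yet it runs on every single frame.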
As an example, let’s take a deep dive into a real-world AI application you may already be familiar with: face recognition.
Below is a simple flow diagram depicting a face recognition application, similar to the Face ID feature popularized by Apple as an alternative to numeric passcodes on its devices:
The entire face recognition pipeline is composed of eleven kernels. Nine of them are classical algorithms that would normally be compiled to target a DSP or CPU rather than an inference accelerator. These classical algorithms are necessary to extract meaningful information from the DNNs’ inferences, yet their runtime might not be reported under MLPerf’s current Inference Rules if this application were adopted as a benchmark.
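As a rough illustration of that structure, the sketch below outlines one plausible version of the pipeline in Python, with the two DNN kernels stubbed out and the classical kernels shown inline. The function names, thresholds, and exact kernel split are assumptions for illustration; only the shape of the pipeline, classical compute wrapped around DNN inference, is the point.

```python
import numpy as np

# Stubs standing in for the two DNN kernels (which would run on an NPU).
def face_detection_dnn(x):            # e.g. kernel #5 in the diagram
    return np.random.rand(50, 4), np.random.rand(50)   # boxes, scores

def face_embedding_dnn(crop):
    return np.random.rand(128)                          # identity embedding

def recognize(frame: np.ndarray, gallery: np.ndarray) -> int:
    """Return the index of the best-matching enrolled identity, or -1."""
    x = frame.astype(np.float32) / 255.0                 # classical: pre-processing
    boxes, scores = face_detection_dnn(x)                # DNN inference
    keep = scores > 0.5                                  # classical: NMS would run here (kernel #6)
    crops = [frame[:112, :112] for _ in boxes[keep]]     # classical: align/crop (placeholder)
    embeddings = [face_embedding_dnn(c) for c in crops]  # DNN inference
    best_id, best_sim = -1, 0.6                          # 0.6 is an illustrative match threshold
    for emb in embeddings:                               # classical: cosine match vs. gallery
        sims = gallery @ emb / (np.linalg.norm(gallery, axis=1) * np.linalg.norm(emb) + 1e-9)
        if sims.max() > best_sim:
            best_id, best_sim = int(sims.argmax()), float(sims.max())
    return best_id

gallery = np.random.rand(10, 128)                        # enrolled identity embeddings
match = recognize(np.zeros((1080, 1920, 3), dtype=np.uint8), gallery)
```

Only two of these steps are DNN inference; everything else is classical compute that an inference benchmark leaves untimed.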
Anyone who has ever tried to deploy an object detection pipeline like the one described earlier knows that these algorithms cannot be ignored when optimizing for performance. At best, they compute element-wise operations whose cost scales linearly with the size of the data. At worst, they scale superlinearly and become bottlenecks in high-throughput, big-data, or low-latency applications.
As an example, let’s focus on Kernel #6: non-max suppression (NMS). This kernel applies a non-max suppression filter to the bounding-box coordinates predicted by the Face Detection DNN in Kernel #5. NMS is a post-processing step commonly used in object detection to remove duplicate detections of the same object: among overlapping detections, only the one with the highest confidence score is kept.
While this may seem simple, NMS can be computationally expensive because it involves comparing each detection against all other detections and, typically, sorting them by score. Its computational complexity therefore grows quadratically with the number of detections. When there are many detections, which is often the case in object detection tasks, NMS can become very time-consuming and slow down the entire pipeline.
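A minimal NMS sketch, assuming axis-aligned [x1, y1, x2, y2] boxes and an illustrative IoU threshold of 0.5, makes that cost visible: each kept box is compared against every remaining box via pairwise IoU, which is O(n²) in the worst case.

```python
import numpy as np

def non_max_suppression(boxes: np.ndarray, scores: np.ndarray,
                        iou_threshold: float = 0.5) -> np.ndarray:
    """Return indices of boxes kept after suppressing overlapping detections."""
    order = scores.argsort()[::-1]          # sort detections by score: O(n log n)
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        # Pairwise IoU of the kept box against all remaining boxes:
        # O(n) per iteration, O(n^2) overall in the worst case.
        xx1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_best + area_rest - inter + 1e-9)
        # Drop every remaining box that overlaps the kept box too much.
        order = rest[iou <= iou_threshold]
    return np.array(keep, dtype=int)

# A dense scene: a thousand candidate detections before suppression.
boxes = np.random.rand(1000, 4) * 100
boxes[:, 2:] += boxes[:, :2]            # ensure x2 > x1 and y2 > y1
scores = np.random.rand(1000)
kept = non_max_suppression(boxes, scores)
```

None of these comparisons are matrix multiplications, so they land on whatever general-purpose core is available rather than the accelerator.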
AI inference benchmarks, like MLPerf’s, lend themselves to designs in which classical algorithms and DNN inference are segmented to run on different, specialized compute nodes: DNN inference kernels are targeted at an AI accelerator or Neural Processing Unit (NPU), while classical algorithms are compiled for whatever CPU or digital signal processor (DSP) is available. Benchmarks build this assumption into their reporting structure because heterogeneous computing is the current industry-standard way of designing hardware for AI applications like the face recognition pipeline described earlier.
Heterogeneous compute nodes are computing devices with different architectures, each optimized for a specific task; an AI SoC might include a CPU, a GPU, and an AI accelerator, for example. As a design principle for AI, heterogeneous computing introduces several challenges, not least of which is programmability.
Programmability challenges are particularly concerning for AI application developers hoping to reproduce the performance reported in benchmarks in their own solutions.
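The sketch below illustrates what that partitioning looks like from the application developer’s side. `NpuSession` is a hypothetical stand-in for a vendor-specific NPU runtime, not a real API; the point is the explicit hand-off of buffers between the accelerator and the CPU, each living behind its own toolchain.

```python
import numpy as np

class NpuSession:
    """Stand-in for an NPU runtime: separate memory space, runs only the DNN graph."""
    def run(self, input_tensor: np.ndarray) -> np.ndarray:
        npu_input = np.array(input_tensor, copy=True)   # copy into accelerator memory
        raw_detections = np.random.rand(100, 5)         # placeholder for the DNN's output
        return np.array(raw_detections, copy=True)      # copy results back to host memory

def detect_faces(frame: np.ndarray, npu: NpuSession) -> np.ndarray:
    x = frame.astype(np.float32) / 255.0        # CPU/DSP: classical pre-processing
    raw = npu.run(x)                            # NPU: the only part an inference benchmark times
    boxes, scores = raw[:, :4], raw[:, 4]
    keep = scores > 0.5                         # CPU: post-processing (e.g. NMS) runs back here
    return boxes[keep]

faces = detect_faces(np.zeros((1080, 1920, 3), dtype=np.uint8), NpuSession())
```

Every hop between compute nodes adds a copy, a synchronization point, and another toolchain to learn and maintain.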
AI inference benchmarks, such as MLPerf’s, do not represent all of the facets of AI compute that matter to application developers. Worse, they distract SoC designers from optimizing performance for the end-to-end AI applications that will let them capture the fast-growing AI hardware market. At present, there are no standardized benchmarks for the entire AI compute pipeline.
Quadric replaces the heterogeneous design paradigm with an architecture uniquely designed for both DNN inference and the classical algorithms that typically surround inference in AI pipelines, all in a single, fully programmable processor. Software developers targeting the Chimera General Purpose Neural Processing Unit (GPNPU) architecture gain productivity by no longer having to partition an AI application across two or three different kinds of processors, while still reaping the performance benefits of a processor optimized for machine learning workloads.