SPONSOR BLOG

The Best AI Edge Inference Benchmark

Proxies for actual workloads can only be so accurate.

March 4th, 2021 - By: Dana McCarty

When evaluating the performance of an AI accelerator, there is a range of methodologies available to you. In this article, we’ll discuss some of the different ways to structure your benchmark research before moving forward with an evaluation that directly runs your own model. Just like when buying a car, research will only get you so far before you need to get behind the wheel and give your future vehicle a few test drives. Many of the available benchmarks give you numbers, but tell you little about how your model will actually run on a given accelerator.

Let’s see how some common metrics may help or hinder your search for the inference accelerator that is best for your application and criteria.

TOPS = Tera Operations per Second

The most common number you’ll see is the peak number of operations per second that an architecture can theoretically achieve. It is calculated by multiplying the number of multiply-and-accumulate (MAC) units in the hardware by the nominal clock speed, and then by two, since one MAC counts as two operations. But how do you know if the rest of the accelerator (memory subsystem, software, etc.) can keep the MACs well utilized? The pitfall with this metric is that without an understanding of utilization (how often those compute units are actually doing useful calculations), this number tells you very little about the performance you will actually get.
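To make the arithmetic concrete, here is a minimal sketch of the peak calculation and of one common way to express utilization. The function names, the MAC count, the clock speed, and the workload figures are all hypothetical values chosen for illustration; they do not describe any real accelerator.

```python
def peak_tops(num_macs: int, clock_hz: float) -> float:
    """Theoretical peak in tera-operations per second (1 MAC = 2 operations)."""
    return num_macs * clock_hz * 2 / 1e12


def utilization(macs_per_inference: float, inferences_per_sec: float,
                num_macs: int, clock_hz: float) -> float:
    """Fraction of the peak MAC rate that the measured workload actually uses."""
    achieved_macs_per_sec = macs_per_inference * inferences_per_sec
    peak_macs_per_sec = num_macs * clock_hz
    return achieved_macs_per_sec / peak_macs_per_sec


# Hypothetical accelerator: 4096 MAC units at a nominal 1 GHz clock.
example_peak = peak_tops(4096, 1.0e9)

# Hypothetical workload: 4 GMACs per inference, measured at 250 inferences/s.
example_util = utilization(4.0e9, 250.0, 4096, 1.0e9)

print(f"peak: {example_peak:.3f} TOPS, utilization: {example_util:.1%}")
```

In this made-up case the part advertises about 8 TOPS, yet the workload keeps the MACs busy less than a quarter of the time, which is the gap the peak number hides.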

Micro Benchmarks: Convolutional kernels and individual layers

Another approach is to develop micro-benchmarks that test the capability of each subsystem an accelerator relies on. Deriving an accurate utilization figure to correlate with TOPS can actually be more trouble than it is worth. Different accelerators may have very different architectures, which are rarely well documented, and that can skew results. Not all layers are created equal, so you’ll have to understand the different memory and compute patterns of every layer’s convolutional kernel. Ultimately, while a ‘micro’ benchmark approach like this holds promise, in practice it is difficult to draw meaningful conclusions from benchmarking at this level, especially without details of the accelerator’s micro-architecture.
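As a rough illustration of why layers differ, the sketch below estimates the MAC count and the MACs performed per byte moved for two convolution layers. It rests on simplifying assumptions that are ours, not the article's: stride 1, "same" padding, one byte per value (INT8), and every tensor moved exactly once. The helper name and the layer shapes are hypothetical.

```python
def conv_layer_stats(height: int, width: int, c_in: int, c_out: int, kernel: int):
    """Return (MACs, bytes moved, MACs per byte) for one convolution layer.

    Assumes stride 1, 'same' padding, 1 byte per value, and that weights,
    inputs, and outputs are each transferred exactly once.
    """
    macs = height * width * c_in * c_out * kernel * kernel
    weight_bytes = kernel * kernel * c_in * c_out
    input_bytes = height * width * c_in
    output_bytes = height * width * c_out
    total_bytes = weight_bytes + input_bytes + output_bytes
    return macs, total_bytes, macs / total_bytes


# Two hypothetical layers working on the same 56x56 feature map.
conv3x3 = conv_layer_stats(56, 56, 64, 64, 3)    # 3x3 convolution, 64 -> 64 channels
conv1x1 = conv_layer_stats(56, 56, 64, 256, 1)   # 1x1 convolution, 64 -> 256 channels

for name, (macs, nbytes, intensity) in (("3x3", conv3x3), ("1x1", conv1x1)):
    print(f"{name}: {macs:,} MACs, {nbytes:,} bytes, {intensity:.1f} MACs/byte")
```

Under these assumptions the 3x3 layer does several times more compute per byte than the 1x1 layer, so the same hardware can be compute-bound on one and memory-bound on the other.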

Model level benchmarks

This is where benchmarks begin to be useful, because models will have a variety of layers with different levels of utilization within a single model, and the units of measurement (inferences per second) will be more directly applicable to your workload. Usually these benchmarks are based around a standard set of common CNNs, such as ResNet, VGG, YOLO, and more. With these types of benchmarks, at least now the work being done can be tied more directly to an end application such as object detection or image classification.

Fig. 1: Differing aspects of common CNN models.

However, these models are still just proxies for your own model. Some of these models are residual networks with skip connections, while others have novel operations in their object detector heads. They all have different structures, and those structures dramatically change which optimizations a given hardware solution can apply. While these models might be similar in some ways to your model, chances are there won’t be a single model so close to yours that you can confidently draw a conclusion about performance. However, if you derived your own model from an open-source benchmark model, that benchmark is likely to be a better indicator of your customized model’s performance.

Other factors to consider

Some accelerators implement Winograd transformations, which can accelerate convolutions by up to 2.25x. The benefits of Winograd acceleration show up in inferences/second (throughput), but comparing MACs/second or utilization requires an additional step, because Winograd computes the same convolution with fewer multiplications than the nominal MAC count implies.
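Where a figure like 2.25x can come from is easiest to see in one dimension. The sketch below uses the well-known Winograd F(2,3) formulation, which produces two outputs of a 3-tap filter with 4 multiplications instead of 6. Nesting it in two dimensions for a 3x3 kernel gives 16 multiplications instead of 36, a ratio of 2.25. The article does not say which Winograd variant it has in mind, so treating it as F(2x2, 3x3) is our assumption, and the sample values are arbitrary.

```python
def direct_f23(d, g):
    """Two outputs of a 3-tap filter over 4 inputs, using 6 multiplications."""
    y0 = d[0] * g[0] + d[1] * g[1] + d[2] * g[2]
    y1 = d[1] * g[0] + d[2] * g[1] + d[3] * g[2]
    return y0, y1


def winograd_f23(d, g):
    """The same two outputs computed with only 4 data multiplications."""
    # Filter transform: depends only on the weights, so it can be precomputed.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Four multiplications between transformed data and transformed filter.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform: additions and subtractions only.
    return m1 + m2 + m3, m2 - m3 - m4


data = [1.0, 2.0, 3.0, 4.0]    # arbitrary sample inputs
taps = [0.5, -1.0, 2.0]        # arbitrary sample filter

direct_result = direct_f23(data, taps)
winograd_result = winograd_f23(data, taps)

speedup_1d = 6 / 4                 # multiplications: direct vs. Winograd, 1-D
speedup_2d = (3 * 3 * 4) / (4 * 4) # 36 vs. 16 multiplications for a 2x2 output tile
```

Both functions return the same outputs, which is why a MACs/second figure based on the direct count overstates the work the hardware actually performed.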

Some accelerators prune models or skip zero weights to speed computation. The former may lose accuracy while the latter reduces MACs. The impact of these optimizations is clear in inferences/second, but may not be clear in intermediate metrics.
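A small sketch shows how zero-skipping muddies intermediate metrics. The weights and the number of output positions below are made up for illustration, and the helper is hypothetical.

```python
def mac_counts(weights, output_positions: int):
    """Return (dense MACs, MACs actually executed when zero weights are skipped)."""
    dense = len(weights) * output_positions
    nonzero = sum(1 for w in weights if w != 0)
    executed = nonzero * output_positions
    return dense, executed


# Hypothetical 3x3 kernel in which 5 of the 9 weights are zero.
kernel = [0.5, 0.0, -1.25, 0.0, 0.0, 2.0, 0.0, 0.75, 0.0]

dense_macs, executed_macs = mac_counts(kernel, output_positions=100)
print(f"dense: {dense_macs} MACs, executed: {executed_macs} MACs")
```

A MACs/second number means something different depending on which of the two counts it is based on, whereas inferences/second is the same either way.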

What these benchmarks won’t tell you

Only your actual model running on the solution will provide definitive results. There’s simply no way to accurately measure your performance without running your workload! Neural networks come in so many different shapes and sizes that even a few models’ performance results won’t tell you enough about your own model’s performance. What’s more, running your own model will give you a chance to probe the solution’s software stack and determine whether it is flexible enough to handle your set of evolving workloads. After all, even after doing all the research you can when purchasing a car, there’s always something you didn’t know to consider until you went on that test drive!
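A test drive can start with something as simple as the timing loop sketched below. It is a generic harness, not any vendor’s tooling: `run_inference` stands for whatever call executes your model on the device under test, and `stand_in_model` is a placeholder so the sketch runs on its own. Both names, and the warm-up and repeat counts, are assumptions for illustration.

```python
import time
from typing import Any, Callable, Sequence


def measure_throughput(run_inference: Callable[[Any], Any],
                       inputs: Sequence[Any],
                       warmup: int = 5,
                       repeats: int = 3) -> float:
    """Return inferences per second, using the fastest of several timed passes."""
    # Warm-up runs so one-time setup costs do not distort the measurement.
    for sample in inputs[:warmup]:
        run_inference(sample)

    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for sample in inputs:
            run_inference(sample)
        best = min(best, time.perf_counter() - start)

    return len(inputs) / best if best > 0 else float("inf")


def stand_in_model(sample):
    """Placeholder for a real model call; replace with your own inference function."""
    return sum(value * value for value in sample)


samples = [[float(i + j) for j in range(256)] for i in range(200)]
throughput = measure_throughput(stand_in_model, samples)
print(f"{throughput:.0f} inferences/s")
```

Swapping the placeholder for your real model, fed with representative inputs, gives the inferences/second figure that the other metrics can only approximate.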

Dana McCarty
Dana McCarty is vice president of sales and marketing for inference products at Flex Logix. Previously, McCarty was a strategic advisor for medical device, MRAM IP and EDA companies. Prior to that, he was vice president of North American sales for Arm. Before that, he was vice president of global sales for MaxLinear. Prior to MaxLinear, McCarty was senior vice president of worldwide sales for LitePoint, a Teradyne company. He also spent 12 years at Broadcom in various roles, including vice president of emerging business and channel sales, vice president pan-Asia sales, and Taiwan country manager. McCarty started his career as a test and product engineer at AMD before transitioning into sales. He holds a Bachelor of Science in Electrical Engineering from the University of Texas at San Antonio.
