Proxies for actual workloads can only be so accurate.
When evaluating the performance of an AI accelerator, there’s a range of methodologies available to you. In this article, we’ll discuss some of the ways to structure your benchmark research before moving forward with an evaluation that directly runs your own model. Just as when buying a car, research will only get you so far before you need to get behind the wheel and give your future vehicle a few test drives. Oftentimes, the available benchmarks provide plenty of numbers but tell you little about how your model will actually run on any given accelerator.
Let’s see how some common metrics may help or hinder your search for the inference accelerator that is best for your application and criteria.
The most common number you’ll see is the peak operations an architecture can theoretically achieve, calculated by multiplying the number of multiply-and-accumulate (MAC) units in the hardware by the nominal clock speed, then by two, since one MAC counts as two operations. But how do you know whether the rest of the accelerator (memory subsystem, software, etc.) can keep the MACs well utilized? The pitfall with this metric is that, without an understanding of utilization (how often those compute units are doing genuinely useful calculations), it tells you very little on its own.
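Here is a minimal sketch of that arithmetic, assuming a hypothetical accelerator with made-up MAC count, clock speed, and utilization figures chosen purely for illustration:

```python
# Hedged sketch: how a peak TOPS number is derived, and why utilization matters.
# The MAC count, clock speed, and utilization values are hypothetical.

mac_units = 4096        # multiply-accumulate units in the hypothetical accelerator
clock_hz = 1.0e9        # nominal clock speed (1 GHz)

peak_ops_per_s = mac_units * clock_hz * 2   # one MAC = 2 operations (multiply + add)
peak_tops = peak_ops_per_s / 1e12
print(f"Peak: {peak_tops:.2f} TOPS")        # ~8.19 TOPS on paper

# The datasheet number implicitly assumes 100% utilization. Real utilization
# depends on the memory subsystem, compiler, and layer shapes, and is often far lower.
for utilization in (1.0, 0.6, 0.3):
    print(f"At {utilization:.0%} utilization: {peak_tops * utilization:.2f} effective TOPS")
```

The same headline TOPS figure can therefore correspond to very different delivered performance, which is exactly the information the peak number hides.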
Another approach is to develop micro-benchmarks that test the capability of each of an accelerator’s subsystems. In practice, deriving an accurate utilization figure to correlate with TOPS can be more trouble than it is worth. Different accelerators may have very different architectures, which are rarely well documented, and that can skew results. Not all layers are created equal, so you’d have to understand the distinct memory and compute patterns of every layer’s convolutional kernel. Ultimately, while a ‘micro’ benchmark approach holds promise, it’s difficult to draw meaningful conclusions from benchmarking at this level, especially without the details of the accelerator’s micro-architecture.
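To see why layers differ so much, consider a rough back-of-the-envelope profile of a few convolution shapes. This is a sketch only; the layer shapes are illustrative (loosely ResNet-like) and the byte counts ignore tiling, caching, and dataflow details that real hardware exploits:

```python
# Hedged sketch: estimating per-layer compute and data movement, the kind of
# behavior a micro-benchmark tries to isolate. Layer shapes are hypothetical.

def conv_layer_profile(h, w, c_in, c_out, k, bytes_per_elem=1):
    """Rough MAC count and data movement for one KxK convolution layer."""
    macs = h * w * c_in * c_out * k * k
    weight_bytes = c_in * c_out * k * k * bytes_per_elem
    activation_bytes = (h * w * c_in + h * w * c_out) * bytes_per_elem
    intensity = macs / (weight_bytes + activation_bytes)   # MACs per byte moved
    return macs, weight_bytes + activation_bytes, intensity

layers = {
    "early 3x3, few channels": (112, 112, 64, 64, 3),
    "late 3x3, many channels": (7, 7, 512, 512, 3),
    "1x1 pointwise":           (14, 14, 256, 256, 1),
}

for name, shape in layers.items():
    macs, bytes_moved, intensity = conv_layer_profile(*shape)
    print(f"{name:26s} {macs/1e6:8.1f} MMACs  {bytes_moved/1e3:8.1f} KB  "
          f"{intensity:6.1f} MACs/byte")
```

Even with this crude model, the ratio of compute to data movement varies by nearly an order of magnitude between layers, so a micro-benchmark that stresses one pattern says little about the others.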
This is where whole-model benchmarks begin to be useful: a single model contains a variety of layers with different levels of utilization, and the unit of measurement (inferences per second) is more directly applicable to your workload. Usually these benchmarks are built around a standard set of common CNNs, such as ResNet, VGG, and YOLO. With these benchmarks, at least the work being done can be tied more directly to an end application such as object detection or image classification.
Fig. 1: Differing aspects of common CNN models.
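Measuring the inferences-per-second figure described above is conceptually simple. The harness below is a sketch; run_inference() is a hypothetical stand-in for whatever call your accelerator’s SDK exposes to execute one batch, and the dummy workload exists only so the script runs as-is:

```python
# Hedged sketch: how an inferences/second number is typically measured.
import time

def measure_throughput(run_inference, batch_size, warmup=10, iters=100):
    """Time repeated inference calls and report inferences per second."""
    for _ in range(warmup):              # warm-up runs exclude one-time setup costs
        run_inference()
    start = time.perf_counter()
    for _ in range(iters):
        run_inference()
    elapsed = time.perf_counter() - start
    return (iters * batch_size) / elapsed

if __name__ == "__main__":
    # Placeholder compute standing in for a real model execution call.
    dummy = lambda: sum(i * i for i in range(10000))
    print(f"{measure_throughput(dummy, batch_size=1):.1f} inferences/sec (dummy workload)")
```

The key point is that the number this loop produces already folds in utilization, memory behavior, and compiler quality, which is why it maps to an end application better than TOPS does.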
However, these models are still just proxies for your own model. Some are residual networks with skip connections, while others have novel operations in their object-detector heads. They all have different structures that dramatically change which optimizations can be applied on different hardware solutions. While some of these models may be similar to yours in certain ways, chances are none is so close that you can confidently draw conclusions about performance. If, however, your model was derived from one of these open-source networks, that benchmark may be a better indicator of your customized model’s performance.
Some accelerators implement Winograd transformations, which can accelerate 3×3 convolutions by up to 2.25x. The benefit of Winograd shows up directly in inferences/second (throughput), but it requires an extra interpretation step when looking at MACs/second or utilization, because Winograd computes the same convolution in a different, more efficient way.
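A short sketch of where that 2.25x figure comes from, assuming the common F(2×2, 3×3) Winograd tiling; the 100 MMAC layer size is a made-up example:

```python
# Hedged sketch: the 2.25x Winograd figure and why it muddies MACs/second.

# Direct convolution: a 2x2 output tile of a 3x3 conv needs 2*2*3*3 multiplications.
direct_mults_per_tile = 2 * 2 * 3 * 3        # 36
# Winograd F(2x2, 3x3): the same tile needs a 4x4 element-wise multiply.
winograd_mults_per_tile = 4 * 4              # 16
speedup = direct_mults_per_tile / winograd_mults_per_tile
print(f"Multiplication reduction: {speedup:.2f}x")       # 2.25x

# Consequence: if a layer "costs" 100 MMACs by the textbook count, a Winograd
# engine performs only ~44 MMACs of actual work, so its reported MACs/second or
# utilization can look low even though inferences/second went up.
textbook_mmacs = 100
print(f"Work actually performed: {textbook_mmacs / speedup:.1f} MMACs")
```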
Some accelerators prune models or skip zero-valued weights to speed computation. The former may cost accuracy, while the latter reduces the number of MACs actually performed. The impact of these optimizations is clear in inferences/second, but may not be visible in intermediate metrics.
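The effect on the MAC count is easy to sketch; the nominal layer size and sparsity levels below are hypothetical:

```python
# Hedged sketch: how pruning or zero-skipping changes the effective MAC count.

nominal_mmacs = 100.0    # textbook MAC count for a hypothetical layer (in millions)

for sparsity in (0.0, 0.5, 0.8):   # fraction of weights that are zero or pruned away
    effective = nominal_mmacs * (1.0 - sparsity)
    print(f"{sparsity:.0%} zero weights -> {effective:6.1f} MMACs actually computed")

# Inferences/second reflects the reduced work directly; a MACs/second or
# utilization figure based on the nominal count can mislead in either direction.
```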
Only your actual model running on the solution will provide definitive results; there’s simply no way to accurately measure performance without running your workload. Neural networks come in so many shapes and sizes that even a handful of models’ results won’t tell you enough about your own model’s performance. What’s more, running your own model gives you a chance to probe the solution’s software stack and determine whether it is flexible enough to handle your evolving set of workloads. After all, even after doing all the research you can when purchasing a car, there’s always something you don’t discover until you go on that test drive.