Be aware of bottlenecks that can constrain theoretical peak performance.
You’ll often hear vendors talk about how many TOPS their chip has, implying that more TOPS means better inference performance.
If you use TOPS to pick your AI inference chip, you will likely not be happy with what you get.
Recently, Vivienne Sze, a professor at MIT, gave an excellent talk entitled “How to Evaluate Efficient Deep Neural Network Approaches.” Slides are also available. Her talk goes into much more detail than this blog.
What is TOPS?
TOPS = trillions of operations per second, where a multiply counts as one operation and an add counts as one operation. TOPS can be quoted for 4-bit integers, 8-bit integers, 16-bit floating point, etc. For inference, the key compute building block is typically the multiply-accumulate (MAC), which is 2 operations.
You compute the TOPS of a chip from how many MAC hardware units it has and the frequency they run at. If you have 1,000 MACs running at 1 GHz (1 ns per MAC), that is 1 trillion MACs per second, or 2 TOPS.
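To make the arithmetic concrete, here is a minimal sketch using the hypothetical numbers from the example above, not any real chip’s specs:

```python
# Illustrative sketch: how peak TOPS is derived from MAC count and clock speed.
# The numbers below are the hypothetical example from the text, not a real chip.

def peak_tops(num_macs: int, clock_hz: float, ops_per_mac: int = 2) -> float:
    """Peak trillions of operations per second (multiply + add = 2 ops per MAC)."""
    return num_macs * clock_hz * ops_per_mac / 1e12

print(peak_tops(num_macs=1000, clock_hz=1e9))  # 1,000 MACs at 1 GHz -> 2.0 TOPS
```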
More TOPS means more silicon area and more cost. But it doesn’t necessarily mean more throughput.
There can be many other bottlenecks. For example, in a Xilinx FPGA there may be thousands of MACs that by themselves can run very fast, but the FPGA interconnect that connects them is the bottleneck. In an ASIC, the MACs can be bottlenecked moving data in and out as they contend for buses, cache memory, and DRAM access. MACs that are running but cannot get data to process do no useful work.
So TOPS is merely an indicator of the theoretical peak inference performance.
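To see why starved MACs matter, here is a simple roofline-style sketch: delivered throughput is capped by whichever is lower, the peak compute rate or the rate at which memory can feed the MACs. All numbers are illustrative placeholders, not measurements of any device.

```python
# Roofline-style sketch: achievable ops/s is limited by either peak compute or
# memory bandwidth times arithmetic intensity (ops performed per byte moved).
# All numbers are illustrative placeholders, not measurements of any device.

def achievable_tops(peak_tops: float,
                    mem_bandwidth_gbps: float,
                    ops_per_byte: float) -> float:
    """Effective TOPS after applying a simple memory-bandwidth ceiling."""
    memory_bound_tops = mem_bandwidth_gbps * 1e9 * ops_per_byte / 1e12
    return min(peak_tops, memory_bound_tops)

# A "100 TOPS" chip with 50 GB/s of DRAM bandwidth, running a layer that does
# 20 ops per byte of data moved, delivers only ~1 TOPS of useful work.
print(achievable_tops(peak_tops=100, mem_bandwidth_gbps=50, ops_per_byte=20))
```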
In our experience, the customers who ask “how many TOPS do you have?” assume that more TOPS will translate linearly into more throughput for all neural network models.
What really matters to a customer is the model they want to run.
In fact, for edge systems, neural network processing can be well described as application-specific inference because any given customer only cares about their model’s performance.
You may think ResNet-50 solves the problem of TOPS not predicting performance. Unfortunately, ResNet-50 is not representative of the models customers really want to run: it is a benchmark that classifies 224×224 images, whereas customers typically want to process much larger megapixel images.
An accelerator that runs ResNet-50 well on 224×224 images may be brought to its knees by megapixel images, because the memory needed for intermediate activations grows dramatically, typically demanding much more memory capacity and/or bandwidth.
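A rough back-of-the-envelope sketch shows why: activation memory for a convolution layer scales with the number of output pixels, so going from 224×224 to a ~2-megapixel frame multiplies it by roughly 40x. The layer shape below (64 channels at full input resolution, INT8) is a hypothetical early-layer example, not a specific model’s layer.

```python
# Back-of-the-envelope sketch: activation memory for one convolution layer that
# produces 64 channels at the input resolution, stored as INT8 (1 byte/value).
# The layer shape is hypothetical; the point is how it scales with image size.

def activation_bytes(height: int, width: int, channels: int,
                     bytes_per_val: int = 1) -> int:
    return height * width * channels * bytes_per_val

small = activation_bytes(224, 224, 64)      # ResNet-50-style benchmark input
large = activation_bytes(1080, 1920, 64)    # ~2-megapixel camera frame

print(f"224x224:   {small / 1e6:.1f} MB")   # ~3.2 MB
print(f"1080x1920: {large / 1e6:.1f} MB")   # ~132.7 MB, ~41x larger
```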
We find that the most commonly requested model people want to use is YOLOv3 object detection and recognition on 1 or 2 megapixel images. If you are going to pick one open source model to benchmark accelerators, YOLOv3 is a good choice.
Here is a chart we presented recently at the AI Hardware Summit benchmarking our new InferX X1 against the Nvidia Xavier NX, Nvidia Tesla T4, and Blaize El Cano, running YOLOv3 INT8 at 608×608, batch=1.
More TOPS and more DRAM connections directly correlate to larger and more expensive die and packages.
Since the chips differ by more than 10x in size, we compare the efficiency of the architectures by normalizing: throughput (inferences per second) per DRAM connection versus throughput per TOPS.
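The normalization itself is just two ratios. The sketch below uses placeholder values (not the measured figures from our chart) to show how the comparison is formed.

```python
# Sketch of the normalization used in the chart: efficiency is expressed as
# throughput per TOPS and throughput per DRAM connection. The input values
# here are placeholders, not the measured numbers from the benchmark.

def efficiency(inferences_per_sec: float, tops: float, dram_connections: int):
    return {
        "throughput_per_tops": inferences_per_sec / tops,
        "throughput_per_dram_connection": inferences_per_sec / dram_connections,
    }

print(efficiency(inferences_per_sec=50, tops=10, dram_connections=1))
```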
You can see the three GPU solutions roughly cluster, so for them TOPS roughly correlates with throughput. But InferX X1 has a very different and much more efficient architecture, and so gets ~5x more throughput per TOPS. Even that understates the variation: we have seen customer models with unusual convolution operations where InferX X1’s throughput advantage over Xavier NX is 10x larger than it is for YOLOv3.
TOPS will lead you to the wrong conclusion. The only good way to benchmark an accelerator is to run your model on it and the alternatives.