Use Inference Benchmarks Similar To Your Application

How the wrong benchmark can lead to incorrect conclusions.


If an Inference IP supplier or Inference Accelerator Chip supplier offers a benchmark, it is probably ResNet-50.

As a result, it might seem logical to use ResNet-50 to compare inference offerings. If you plan to deploy ResNet-50, it is; but if your target application model is significantly different from ResNet-50, it could lead you to pick an inference offering that is not the best one for you.

ResNet-50 has 50 layers, processes 224×224 images, has 22.7 Million weights and needs 3.5 Billion MACs per image. Vendors typically quote this benchmark at a large batch size.

How can using this lead you to the wrong conclusion?

Batch Size
If your application is in the data center, large batch sizes may be the right thing. But if your application is outside the data center, in a car, a camera, an edge server, etc., then you probably need batch size = 1, or maybe 2 or 4.

Many inference architectures load weights slowly, so their throughput is optimized by using large batches. But if an architecture loads weights slowly, its throughput at batch = 1 is likely to be half, or well below half, of its throughput at the large batch size.
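The effect of slow weight loading can be sketched with a toy amortization model. The timing numbers below are made-up illustrations, not any vendor's figures:

```python
# Toy model: if loading weights takes t_load and computing one image takes
# t_compute, a batch of N images amortizes the weight load across all N.
# t_load_ms and t_compute_ms are illustrative assumptions, not real data.

def throughput(batch, t_load_ms=4.0, t_compute_ms=1.0):
    """Images/second, amortizing the weight-load time over the batch."""
    total_ms = t_load_ms + batch * t_compute_ms
    return batch / (total_ms / 1000.0)

for n in (1, 2, 4, 16, 64):
    print(n, round(throughput(n)))
```

With these assumed numbers, batch = 1 achieves only about a fifth of the batch = 64 throughput, which is why a large-batch headline number says little about batch = 1 performance.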

Image/Input Size
ResNet-50 processes 224×224 images. It is very unlikely this is the image size your application will use. Today's image sensors typically generate 2 Megapixel images, and using larger images yields more precise inference results.

YOLOv3, a real-time object detection model, processes images of various sizes.

A larger image results in larger intermediate activation sizes (an intermediate activation is the output of a layer).

ResNet-50’s largest intermediate activation is less than 1MB. YOLOv3’s largest intermediate activation, for 2 Megapixel images, is 67MB!

So larger image sizes will put more stress on the on-chip SRAM and the off-chip DRAM bandwidth required.

ResNet-50 requires 3.5 Billion MACs per image. YOLOv3 needs 400 Billion MACs per image (for 2 Megapixels), and this varies linearly with image size: a 224×224 image needs 9.5 Billion MACs, still far more than ResNet-50.
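The linear scaling can be checked directly from the 400-billion-MAC figure for ~2 Megapixels (the small difference from the 9.5 Billion figure above comes from YOLOv3's exact layer dimensions):

```python
# MACs in a fully convolutional model scale roughly linearly with pixel
# count. Scaling the article's 400 billion MACs at ~2 MP down to 224x224:

def yolov3_macs(pixels, macs_at_2mp=400e9):
    """Estimated MACs per image, scaled linearly from the 2 MP figure."""
    return macs_at_2mp * pixels / 2e6

print(yolov3_macs(224 * 224) / 1e9)  # ~10 billion MACs
```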

More MAC intensive models may stress inferencing chips in different ways depending on their architecture.

Every inference solution has a fixed number of MACs, but each layer of a model requires a different number. The inference architecture must either exploit parallelism when a layer needs fewer MACs than the hardware provides, or break the layer into sub-steps when it needs more than are available. Different architectures face different challenges in doing this.

ResNet-50 is a CNN: a convolutional neural network.

Most imaging applications (including Lidar/Radar) use convolutions, but not all models rely on standard ones: MobileNet v2, designed for PCs/phones, replaces most standard convolutions with operations that map better to a general-purpose processor.

If your application is for speech or other non-convolution models, your model won’t benefit from the specialized hardware for CNNs in most inference IP/chips.

Also, some inference IP/chips implement the Winograd transform, which for 3×3 convolutions with a stride of 1 reduces MACs by 2.25x but increases weight size by 1.8x.

For ResNet-50, Winograd can increase throughput by about 1.4x, based on the number of layers that can be accelerated; for YOLOv3 the acceleration is about 1.7x.

But be careful: Winograd increases weight size by 1.8x, so weights that used to fit in a chip's SRAM may now have to be stored in DRAM, and the required DRAM bandwidth grows not just by 1.8x but by 1.8x times the acceleration factor (so for YOLOv3 the DRAM bandwidth grows by 1.8 × 1.7 ≈ 3.1x!).

Some chips/IP, however, may store the weights in non-Winograd format and expand them as they are brought into the processing unit, trading extra on-chip work for lower DRAM traffic.
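The Winograd bandwidth trade-off above is just the product of the two factors from the text:

```python
# Winograd trade-off for 3x3 stride-1 convolutions, using the figures
# from the text: 2.25x fewer MACs but 1.8x larger weights. If the
# expanded weights spill from SRAM to DRAM, the required DRAM bandwidth
# grows by the weight expansion times the throughput speedup.

weight_growth = 1.8   # 3x3 kernel expands toward 4x4: 16/9 ~= 1.8
yolo_speedup = 1.7    # the text's YOLOv3 acceleration estimate

dram_bw_growth = weight_growth * yolo_speedup
print(round(dram_bw_growth, 1))  # ~3.1x
```

So a transform that looks like a pure win on MAC count can more than triple DRAM bandwidth demand, depending on where the weights live.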

Why don’t your suppliers offer you relevant benchmarks?
The real mystery is why almost no supplier offers even a ResNet-50 benchmark.

Maybe it's because their architectures are difficult to model (cache hit rates, bus contention, DRAM bandwidth contention), so they don't actually know what the throughput will be.

Or maybe it’s because their throughput isn’t very good and they know that if they told you, you wouldn’t use them. So instead they cloud the issue with fuzzy numbers like TOPS and TOPS/watt without specifying batch size or process/temperature/voltage; or by comparing to a processor or FPGA, which of course they’ll be faster than.

Insist on benchmarks that are relevant to your application or, even better, ask them to benchmark your model. If they won't, find another supplier.
