High Neural Inferencing Throughput At Batch=1

Minimizing latency in neural network inferencing while retaining high hardware utilization.


Microsoft presented the following slide as part of their Brainwave presentation at Hot Chips this summer:

In existing inferencing solutions, high throughput (and high percentage utilization of the hardware) is possible at large batch sizes: instead of processing, say, one image at a time, the inferencing engine processes 10 or 50 images in parallel. This minimizes the number of times weights need to be loaded, which is typically the slowest step in existing inferencing solutions.

The disadvantage of larger batch size is that latency increases.

And in edge applications where there is only one sensor, it isn’t possible to batch at all: at the edge, performance needs to be measured at batch=1.

For existing inferencing solutions, throughput and hardware utilization drop dramatically at smaller batch sizes. This is because in existing architectures loading the weights takes a long time, and while weights are loading the MACs (multiplier-accumulators) sit idle, doing no useful computation.
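The amortization effect described above can be sketched with a toy cost model. The timing numbers below are illustrative assumptions, not measurements of any real accelerator:

```python
# Toy cost model: execution time is weight-load time plus per-image
# compute time; batching amortizes the weight load across images.
WEIGHT_LOAD_MS = 8.0  # time to load the weights (assumed value)
COMPUTE_MS = 1.0      # MAC compute time per image (assumed value)

def batch_metrics(batch):
    """Return (images/sec, MAC utilization, latency in ms) for a batch."""
    total_ms = WEIGHT_LOAD_MS + COMPUTE_MS * batch
    throughput = batch / total_ms * 1000           # images per second
    utilization = (COMPUTE_MS * batch) / total_ms  # fraction of time MACs are busy
    return throughput, utilization, total_ms

for b in (1, 10, 50):
    tput, util, lat = batch_metrics(b)
    print(f"batch={b:2d}: {tput:6.0f} img/s, {util:4.0%} MAC utilization, {lat:.0f} ms latency")
```

Even in this crude model, throughput and utilization climb steeply with batch size while latency grows, which is exactly the trade-off the slide illustrates.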

The number of weights in a neural model is much larger than the image data it processes: ResNet-50 has >20 million weights while its input images are just 224×224 pixels. YOLOv3 has >60 million weights; its images can be of any size, but even a high-resolution frame is just 2 megapixels.
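The imbalance is easy to quantify. A rough byte-for-byte comparison, assuming one byte per weight and per pixel channel (the counts come from the figures above; the RGB assumption is ours):

```python
# Rough size comparison: model weights vs. one input image.
resnet50_weights = 20_000_000   # >20M weights (from the text)
resnet50_image = 224 * 224 * 3  # one 224x224 RGB input
yolov3_weights = 60_000_000     # >60M weights (from the text)
hires_image = 2_000_000 * 3     # ~2 megapixel RGB frame

print(f"ResNet-50: weights are ~{resnet50_weights / resnet50_image:.0f}x the image data")
print(f"YOLOv3:    weights are ~{yolov3_weights / hires_image:.0f}x the image data")
```

So for ResNet-50 the weights outweigh the input by over two orders of magnitude, which is why weight loading, not data loading, dominates.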

At batch=1 it is still possible to achieve your target throughput with low hardware utilization, but you’ll need more hardware, which means more cost and power.
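The cost of low utilization can be stated directly: the peak hardware you must provision scales inversely with the utilization you sustain. A hypothetical sizing calculation (the target throughput is an assumed number for illustration; the two utilization figures echo the estimates later in this post):

```python
# Peak TOPS you must provision to sustain a target effective throughput,
# given the fraction of peak the architecture actually achieves.
TARGET_TOPS = 10.0  # effective TOPS the application needs (assumed)

def peak_tops_needed(utilization):
    return TARGET_TOPS / utilization

print(f"at 70% utilization: {peak_tops_needed(0.70):.1f} peak TOPS provisioned")
print(f"at 25% utilization: {peak_tops_needed(0.25):.1f} peak TOPS provisioned")
```

At 25% utilization you must buy nearly three times the raw silicon of a 70%-utilized design to deliver the same effective throughput.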

NMAX: High Throughput at Batch Size = 1
NMAX is a new neural inferencing architecture from Flex Logix that is able to load weights very quickly. This means NMAX throughput at a batch size of 1 is almost as high as at batch sizes of 10+.

NMAX is a modular architecture that can deliver 1 to >100 TOPS of throughput; and NMAX is scalable: twice the MACs means roughly twice the throughput.

Below is a comparison of NMAX, in three sizes, to two existing Data Center class inferencing solutions: NVidia’s Tesla T4 and Habana’s Goya, both recently announced.

You can see the throughput rolloff of about 50% for Goya going from Batch Size of 10 to 5 to 1.

Goya has throughput at batch 10 similar to a 12×12 NMAX array but drops at batch 1 to the throughput of a half-size 6×12 NMAX. NMAX has some small roll-off at batch 1 but stays much closer to peak throughput.

Goya and T4 don’t publish MAC utilizations, but we estimate T4 is <25%. NMAX achieves 60-70% MAC utilization on the same ResNet-50 model.

NMAX achieves its performance with higher utilization and with 1/8th the DRAM. High utilization at batch=1 means you need less hardware for your target throughput, which means a smaller, cheaper chip. Less DRAM means lower cost and lower memory bandwidth requirements.

Our best estimate is that NMAX is ~1/3 the power of the Habana and T4 solutions at equivalent throughputs. The NMAX architecture is optimized for edge applications: high throughput at batch 1, low cost, low power.
