SPONSOR BLOG

High Neural Inferencing Throughput At Batch=1

Beating latency in neural networks while retaining high hardware utilization.

December 6th, 2018 - By: Geoff Tate

Microsoft presented the following slide as part of their Brainwave presentation at Hot Chips this summer:

In existing inferencing solutions, high throughput (and high hardware utilization) is possible at large batch sizes: instead of processing, say, one image at a time, the inferencing engine processes 10 or 50 images in parallel. This minimizes the number of times weights need to be loaded, which is typically the slowest step for existing inferencing solutions.

The disadvantage of a larger batch size is that latency increases: the result for any one image is not available until its whole batch has been processed.

And in edge applications where there is only one sensor, it isn’t possible to batch at all: at the edge, performance needs to be measured at batch=1.

For existing inferencing solutions, throughput and hardware utilization drop dramatically at smaller batch sizes. This is because in existing architectures loading the weights takes a long time, and while the weights are loading the MACs (multiplier-accumulators) sit idle, doing no useful computation.
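The trade-off described above can be sketched with a toy model. This is only an illustration under stated assumptions: every batch pays a fixed weight-loading time during which the MACs are idle, followed by a fixed compute time per image. The function name `batch_stats` and the timing values (8 ms to load weights, 2 ms of compute per image) are hypothetical placeholders, not measurements of any real chip.

```python
# Toy model of batching: a fixed weight-load cost per batch, then per-image compute.
# All names and timings here are hypothetical, chosen only for illustration.

def batch_stats(batch_size, weight_load_ms, compute_ms_per_image):
    """Return (MAC utilization, throughput in images/s, batch latency in ms)."""
    compute_ms = batch_size * compute_ms_per_image
    total_ms = weight_load_ms + compute_ms      # MACs sit idle during the load
    utilization = compute_ms / total_ms         # fraction of time spent computing
    throughput = batch_size / total_ms * 1000.0
    return utilization, throughput, total_ms


for batch in (1, 5, 10, 50):
    util, ips, latency = batch_stats(batch, weight_load_ms=8.0, compute_ms_per_image=2.0)
    print(f"batch={batch:3d}  utilization={util:6.1%}  "
          f"throughput={ips:7.1f} img/s  latency={latency:6.1f} ms")
```

In this model, utilization and throughput both rise with batch size because the load cost is amortized over more images, while latency grows with every image added to the batch. Shrinking the weight-load time is the only way to raise utilization at batch=1.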

The number of weights in a neural model is much bigger than the size of the image data it processes: ResNet-50 has >20 million weights while the images are just 224×224 pixels. YOLOv3 has >60 million weights while the images can be of any size, but even a high-resolution image is just 2 megapixels.
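A quick calculation makes the gap concrete. This is a rough sketch: it assumes 3 color channels per pixel (an assumption, since the text gives only pixel counts) and uses the lower bounds quoted above for the weight counts. The helper names are hypothetical.

```python
# Compare the number of weights to the number of input values per image.
# Assumes 3 color channels per pixel; weight counts are the lower bounds from the text.

def input_values(pixels, channels=3):
    """Number of input values in one image."""
    return pixels * channels


def weights_per_input_value(num_weights, pixels, channels=3):
    """How many weights the model has for every input value it receives."""
    return num_weights / input_values(pixels, channels)


resnet50 = weights_per_input_value(20_000_000, 224 * 224)
yolov3 = weights_per_input_value(60_000_000, 2_000_000)

print(f"ResNet-50: {input_values(224 * 224):,} input values, "
      f"{resnet50:.1f} weights per input value")
print(f"YOLOv3 at 2 megapixels: {input_values(2_000_000):,} input values, "
      f"{yolov3:.1f} weights per input value")
```

Under these assumptions ResNet-50 has about 133 weights for every input value, and YOLOv3 still has about 10 even at 2 megapixels, so at batch=1 moving the weights dominates moving the image.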

At batch=1 it is still possible to achieve your target throughput with low hardware utilization, but you’ll need more hardware, which means more cost and power.
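The sizing argument can be written as simple arithmetic. The sketch below assumes throughput scales linearly with the number of identical hardware units and that each unit delivers its peak throughput multiplied by its utilization. The function name `units_needed` and all numbers are hypothetical and do not describe any particular product.

```python
import math

# Hardware sizing under a linear-scaling assumption (hypothetical numbers throughout).

def units_needed(target_ips, peak_ips_per_unit, utilization):
    """Units required to reach target_ips when each unit runs at the given utilization."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    effective_ips = peak_ips_per_unit * utilization   # what one unit really delivers
    return math.ceil(target_ips / effective_ips)


low = units_needed(target_ips=1000, peak_ips_per_unit=500, utilization=0.25)
high = units_needed(target_ips=1000, peak_ips_per_unit=500, utilization=0.65)
print(f"25% utilization: {low} units; 65% utilization: {high} units")
```

For the same target, the low-utilization case needs roughly twice as many units in this example, which is the extra cost and power the paragraph above refers to.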

NMAX: High Throughput at Batch Size = 1
NMAX is a new neural inferencing architecture from Flex Logix that is able to load weights very quickly. This means NMAX has throughput that is almost as high at a batch size of 1 as at batch sizes of 10+.

NMAX is a modular architecture that can deliver 1 to >100 TOPS of throughput, and it is scalable: twice the MACs means roughly twice the throughput.

Below is a comparison of NMAX, in three sizes, to two existing Data Center class inferencing solutions: NVidia’s Tesla T4 and Habana’s Goya, both recently announced.

You can see a throughput roll-off of about 50% for Goya going from a batch size of 10 to 5 to 1.

Goya has throughput at batch 10 similar to an NMAX 12×12 array but drops to the throughput of a half-size NMAX 6×12 at batch 1. NMAX has some small roll-off at batch 1 but stays much closer to peak throughput.

Goya and T4 don’t give MAC utilization figures, but we estimate T4 is <25%. NMAX achieves 60-70% utilization on the same ResNet-50 model.

NMAX achieves its performance with higher utilization and with 1/8th the DRAM. High utilization at batch=1 means you need less hardware for your target throughput, which means a smaller, cheaper chip. Less DRAM means lower cost and bandwidth.

Our best estimate is that NMAX is ~1/3 the power of the Habana and T4 solutions at equivalent throughputs. The NMAX architecture is optimized for edge applications: high throughput at batch 1, low cost, low power.

Geoff Tate
Geoff Tate is a technology strategy advisor. He was the founding CEO of Flex Logix (now part of Analog Devices). Before that, he was the founding CEO of Rambus, and prior to that he was senior vice president of AMD's processor group. He received his BSc in computer science from the University of Alberta, and an MBA from Harvard Business School.
