Inference Acceleration: Follow The Memory

What two different neural network models can show us about the importance of using memory effectively.


Much has been written about the computational complexity of inference acceleration: very large matrix multiplies for fully-connected layers and huge numbers of 3×3 convolutions across megapixel images, both of which require many thousands of MACs (multiplier-accumulators) to achieve high throughput for models like ResNet-50 and YOLOv3.
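To make that arithmetic concrete, here is a minimal sketch of how MAC counts are computed for the two dominant layer types; the layer shapes used are illustrative assumptions, not taken from either model.

```python
# Rough MAC counts for a convolution layer and a fully-connected layer.
# Layer shapes below are illustrative only; they are not from ResNet-50 or YOLOv3.

def conv_macs(out_h, out_w, out_c, in_c, k=3):
    """MACs for one convolution layer: every output element needs k*k*in_c
    multiply-accumulates."""
    return out_h * out_w * out_c * in_c * k * k

def fc_macs(in_features, out_features):
    """MACs for one fully-connected layer: a plain matrix-vector multiply."""
    return in_features * out_features

# One 3x3 convolution over a 1920x1080 feature map, 64 channels in and out:
print(f"conv: {conv_macs(1080, 1920, 64, 64) / 1e9:.1f} billion MACs")
# One 2048 -> 1000 fully-connected classifier layer:
print(f"fc:   {fc_macs(2048, 1000) / 1e6:.1f} million MACs")
```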

The other side of the coin is managing the movement of data in memory in order to, first, keep the MACs supplied with the weights and activations to achieve the highest hardware utilization and, second, do this using the least power possible.

Let’s use two popular neural network models to examine the challenge of memory in inference. We’ll assume the Winograd Transformation is used for the popular 3×3, stride 1 convolutions.
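For reference, the standard Winograd F(2×2, 3×3) transform computes each 2×2 output tile with 16 multiplies instead of the 36 a direct convolution needs, about a 2.25x reduction in multiplies (paid for with extra additions and transformed weights). A minimal sketch of that arithmetic:

```python
# Multiply-count reduction from the Winograd F(2x2, 3x3) transform used
# for 3x3, stride-1 convolutions.
def winograd_reduction(out_tile=2, kernel=3):
    direct = (out_tile ** 2) * (kernel ** 2)   # 2x2 outputs x 3x3 taps = 36 multiplies
    winograd = (out_tile + kernel - 1) ** 2    # 4x4 transformed tile = 16 multiplies
    return direct, winograd, direct / winograd

direct, wino, ratio = winograd_reduction()
print(f"direct: {direct} multiplies per tile, Winograd: {wino}, {ratio:.2f}x fewer")
```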

Unfortunately, few inference vendors give any benchmarks, choosing instead to use vague statements of TOPS without giving the details behind them.

When a benchmark is given it is usually ResNet-50, and often without stating batch size (batch size is critical: most architectures do well with large batch sizes, but few have high throughput at low batch sizes – if a batch size is not given you can assume it’s large). It’s important to note that ResNet-50 is a neural network that nobody actually plans to use in their products; instead, it’s used to compare different architectures. Be careful because performance on ResNet-50 may not correlate with performance using the much more challenging models needed to do, for example, real time object detection and recognition.

The biggest difference between ResNet-50 and YOLOv3 is the choice of image size. Look at what happens if ResNet-50 is run using 2 Megapixel images like YOLOv3: MACs/image increase to 103 Billion and the largest activation to 33.6 MB. On large images ResNet-50’s characteristics look close to YOLOv3’s.
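A quick back-of-the-envelope check of that scaling, assuming the 2 Megapixel image is 1920×1080 (an assumption; the exact resolution is not specified here):

```python
# Scaling ResNet-50's per-image figures from 224x224 up to ~2 Megapixels.
# MACs and activation sizes grow roughly in proportion to pixel count.
base_pixels  = 224 * 224          # standard ResNet-50 input
large_pixels = 1920 * 1080        # assumed 2 Megapixel input
scale = large_pixels / base_pixels

largest_activation_224 = 0.8      # MB, ResNet-50's largest activation at 224x224
print(f"pixel scale factor: {scale:.1f}x")
print(f"largest activation at 2 MP: ~{largest_activation_224 * scale:.0f} MB")
```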

Let’s follow the memory in ResNet-50 using the traditional 224×224 images.

Note that caching has little benefit for neural network models, which are very different from traditional processor workloads where data re-use is high: the 22.7 Million weights are cycled through once per image and not re-used until the next image. A weights cache needs to hold all of the weights; a smaller weights cache just flushes continuously. Similarly with activations: in some models a few activations are used again in later stages, but for the most part activations are generated and used immediately only to feed the next stage.

So for each image processed by ResNet-50 the memory transactions required are as follows (tallied in the sketch after this list), assuming for now that all memory references are to/from DRAM:

  • 0.15 MB input image read in (224×224×3 bytes)
  • 22.7 MB weights read in (assuming 8-bit integers, which is the norm)
  • 9.3 MB of activations are written cumulatively as the outputs of all of the stages
  • All but the last activation is read back in for the next stage for another almost 9.3 MB
  • This gives a total of 41.4 MB of memory read/writes per image
  • We are ignoring here the memory traffic for the code for the accelerator since there is no data available for any architecture. Code may benefit from caching.
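A quick tally of the figures above (values in MB; 8-bit weights and activations, no on-chip SRAM):

```python
# Per-image DRAM traffic for ResNet-50 at 224x224, from the list above (MB).
input_image     = 0.15   # 224 x 224 x 3 bytes
weights         = 22.7   # every weight read once per image
activations_out = 9.3    # written as each stage completes
activations_in  = 9.3    # read back by the following stage (all but the last)

total = input_image + weights + activations_out + activations_in
print(f"DRAM traffic per image: ~{total:.1f} MB")   # matches the ~41.4 MB total above
```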

Memory references to DRAM use about 10-100x the power of memory references to an SRAM on-chip.
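To get a feel for what that factor means for the ~41.4 MB of ResNet-50 traffic above, here is a sketch; the absolute SRAM energy per byte is a placeholder assumption, and only the 10-100x ratio comes from the text.

```python
# Energy impact of keeping traffic on-chip, using the 10-100x DRAM-vs-SRAM ratio.
SRAM_PJ_PER_BYTE = 1.0            # placeholder assumption, for illustration only
traffic_bytes = 41.4e6            # ResNet-50 per-image DRAM traffic from above

sram_uj = traffic_bytes * SRAM_PJ_PER_BYTE / 1e6
for ratio in (10, 100):
    dram_uj = sram_uj * ratio
    print(f"{ratio:>3}x: ~{dram_uj:.0f} uJ from DRAM vs ~{sram_uj:.0f} uJ from on-chip SRAM")
```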

To reduce DRAM bandwidth there are two options for ResNet-50 (their effect is compared in the sketch after this list):

  1. Add enough SRAM to store all 22.7MB of weights on chip
  2. Add SRAM on chip to store intermediate activations so stage X writes to the activation cache and stage X+1 reads from it. For ResNet-50 the largest intermediate activation is 0.8 MB so 1MB of SRAM eliminates about half of the DRAM traffic.
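A minimal sketch of what each option does to the per-image traffic, using the ResNet-50 figures listed earlier:

```python
# Effect of the two options on ResNet-50's ~41.4 MB of per-image DRAM traffic (MB).
input_image, weights, act_write, act_read = 0.15, 22.7, 9.3, 9.3
baseline = input_image + weights + act_write + act_read

option1 = baseline - weights                 # all 22.7 MB of weights held on chip
option2 = baseline - act_write - act_read    # activations kept in a ~1 MB on-chip SRAM
print(f"no on-chip SRAM:   ~{baseline:.1f} MB/image")
print(f"weights on chip:   ~{option1:.1f} MB/image")
print(f"activation SRAM:   ~{option2:.1f} MB/image")
```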

Let’s look at YOLOv3 to see the DRAM traffic needed without on-chip SRAM:

  • 6MB input image size (remember each pixel has 3 Bytes for RGB)
  • 61.9 MB weights read in
  • 475 MB activations generated cumulatively as output of all of the stages written to DRAM
  • 475 MB activations read back in for the next layer
  • This gives a total of 1,108 MB = 1.1 GB of DRAM traffic to process just one image!
  • Much more SRAM is required to reduce DRAM bandwidth: 62 MB for weight caching and, since the largest intermediate activation is 64 MB, another 64 MB for activation caching. This would eliminate the DRAM traffic, but roughly 128 MB of SRAM in 16nm is about 140 square millimeters of silicon, which is very expensive.
  • The practical option for cost-effective designs is an activation cache big enough for most layers but not all: only 1 layer has a 64 MB activation output, 2 layers have 32 MB activation outputs, and 4 layers have 16 MB activation outputs – all the rest are 8 MB or less. So there is a tradeoff between activation cache size and DRAM bandwidth (sketched after this list).
  • For weights there is no tradeoff: either store all 61.9 MB on chip or read them all from DRAM.
  • You can see why YOLOv3 doesn’t run faster with batch sizes > 1: multiple batches require saving multiple sets of activations, and the activations are too big.
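To illustrate the activation-cache tradeoff, here is a sketch covering only the large layer outputs called out above; the many layers at 8 MB or less are ignored, so this shows the shape of the tradeoff rather than the full totals.

```python
# Activation-cache size vs the extra DRAM traffic from YOLOv3's largest layers.
# Only the big outputs listed above are modeled; everything at 8 MB or below
# is assumed to fit in the cache and is ignored here.
big_outputs_mb = [64] + [32] * 2 + [16] * 4   # 1 x 64 MB, 2 x 32 MB, 4 x 16 MB

for cache_mb in (8, 16, 32, 64):
    # any output too large for the cache is written to DRAM and read back
    spill = sum(2 * size for size in big_outputs_mb if size > cache_mb)
    print(f"{cache_mb:>2} MB activation cache -> ~{spill} MB of DRAM traffic "
          f"from these layers per image")
```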

The trend is toward larger models and larger images, so YOLOv3 is more representative of the future of inference acceleration – using on-chip memory effectively will be critical for low-cost, low-power inference.


