Inference Acceleration: Follow The Memory

What two different neural network models can show us about the importance of using memory effectively.


Much has been written about the computational complexity of inference acceleration: very large matrix multiplies for fully-connected layers and huge numbers of 3×3 convolutions across megapixel images, both of which require many thousands of MACs (multiplier-accumulators) to achieve high throughput for models like ResNet-50 and YOLOv3.
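To make that arithmetic concrete, here is a minimal sketch of how MAC counts are computed for the two dominant layer types; the layer shapes used are illustrative assumptions, not taken from either model.

```python
# Rough MAC counts for a convolution layer and a fully-connected layer.
# Layer shapes below are illustrative only; they are not from ResNet-50 or YOLOv3.

def conv_macs(out_h, out_w, out_c, in_c, k=3):
    """MACs for one convolution layer: every output element needs k*k*in_c
    multiply-accumulates."""
    return out_h * out_w * out_c * in_c * k * k

def fc_macs(in_features, out_features):
    """MACs for one fully-connected layer: a plain matrix-vector multiply."""
    return in_features * out_features

# One 3x3 convolution over a 1920x1080 feature map, 64 channels in and out:
print(f"conv: {conv_macs(1080, 1920, 64, 64) / 1e9:.1f} billion MACs")
# One 2048 -> 1000 fully-connected classifier layer:
print(f"fc:   {fc_macs(2048, 1000) / 1e6:.1f} million MACs")
```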

The other side of the coin is managing the movement of data in memory in order to, first, keep the MACs supplied with the weights and activations to achieve the highest hardware utilization and, second, do this using the least power possible.

Let’s use two popular neural network models to examine the challenge of memory in inference. We’ll assume the Winograd Transformation is used for the popular 3×3, stride 1 convolutions.
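For reference, the standard Winograd F(2×2, 3×3) transform computes each 2×2 output tile with 16 multiplies instead of the 36 a direct convolution needs, about a 2.25x reduction in multiplies (paid for with extra additions and transformed weights). A minimal sketch of that arithmetic:

```python
# Multiply-count reduction from the Winograd F(2x2, 3x3) transform used
# for 3x3, stride-1 convolutions.
def winograd_reduction(out_tile=2, kernel=3):
    direct = (out_tile ** 2) * (kernel ** 2)   # 2x2 outputs x 3x3 taps = 36 multiplies
    winograd = (out_tile + kernel - 1) ** 2    # 4x4 transformed tile = 16 multiplies
    return direct, winograd, direct / winograd

direct, wino, ratio = winograd_reduction()
print(f"direct: {direct} multiplies per tile, Winograd: {wino}, {ratio:.2f}x fewer")
```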

Unfortunately, few inference vendors give any benchmarks, choosing instead to use vague statements of TOPS without giving the details behind them.

When a benchmark is given it is usually ResNet-50, and often without stating batch size (batch size is critical: most architectures do well with large batch sizes, but few have high throughput at low batch sizes – if a batch size is not given you can assume it’s large). It’s important to note that ResNet-50 is a neural network that nobody actually plans to use in their products; instead, it’s used to compare different architectures. Be careful because performance on ResNet-50 may not correlate with performance using the much more challenging models needed to do, for example, real time object detection and recognition.

The biggest difference between ResNet-50 and YOLOv3 is the choice of image size. Look at what happens if ResNet-50 is run using 2 Megapixel images like YOLOv3: MACs/image increase to 103 Billion and the largest activation to 33.6 MB. On large images ResNet-50’s characteristics look close to YOLOv3’s.
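A quick back-of-the-envelope check of that scaling, assuming the 2 Megapixel image is 1920×1080 (an assumption; the exact resolution is not specified here):

```python
# Scaling ResNet-50's per-image figures from 224x224 up to ~2 Megapixels.
# MACs and activation sizes grow roughly in proportion to pixel count.
base_pixels  = 224 * 224          # standard ResNet-50 input
large_pixels = 1920 * 1080        # assumed 2 Megapixel input
scale = large_pixels / base_pixels

largest_activation_224 = 0.8      # MB, ResNet-50's largest activation at 224x224
print(f"pixel scale factor: {scale:.1f}x")
print(f"largest activation at 2 MP: ~{largest_activation_224 * scale:.0f} MB")
```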

Let’s follow the memory in ResNet-50 using the traditional 224×224 images.

Note that caching has little benefit for neural network models, which are very different from traditional processor workloads where data re-use is high: the 22.7 Million weights are cycled through once per image and not re-used until the next image. A weights cache needs to hold all of the weights; a smaller weights cache just flushes continuously. Similarly with activations: in some models a few activations are used again in later stages, but for the most part activations are generated and used immediately only to feed the next stage.

So for each image processed by ResNet-50 the memory transactions required are as follows (tallied in the sketch after this list), assuming for now that all memory references are to/from DRAM:

  • 0.15 MB input image read in (224×224×3 bytes)
  • 22.7 MB weights read in (assuming 8-bit integers, which is the norm)
  • 9.3 MB of activations are written cumulatively as the outputs of all of the stages
  • All but the last activation is read back in for the next stage for another almost 9.3 MB
  • This gives a total of 41.4 MB of memory read/writes per image
  • We are ignoring here the memory traffic for the code for the accelerator since there is no data available for any architecture. Code may benefit from caching.
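A quick tally of the figures above (values in MB; 8-bit weights and activations, no on-chip SRAM):

```python
# Per-image DRAM traffic for ResNet-50 at 224x224, from the list above (MB).
input_image     = 0.15   # 224 x 224 x 3 bytes
weights         = 22.7   # every weight read once per image
activations_out = 9.3    # written as each stage completes
activations_in  = 9.3    # read back by the following stage (all but the last)

total = input_image + weights + activations_out + activations_in
print(f"DRAM traffic per image: ~{total:.1f} MB")   # matches the ~41.4 MB total above
```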

Memory references to DRAM use about 10-100x the power of memory references to an SRAM on-chip.
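To get a feel for what that factor means for the ~41.4 MB of ResNet-50 traffic above, here is a sketch; the absolute SRAM energy per byte is a placeholder assumption, and only the 10-100x ratio comes from the text.

```python
# Energy impact of keeping traffic on-chip, using the 10-100x DRAM-vs-SRAM ratio.
SRAM_PJ_PER_BYTE = 1.0            # placeholder assumption, for illustration only
traffic_bytes = 41.4e6            # ResNet-50 per-image DRAM traffic from above

sram_uj = traffic_bytes * SRAM_PJ_PER_BYTE / 1e6
for ratio in (10, 100):
    dram_uj = sram_uj * ratio
    print(f"{ratio:>3}x: ~{dram_uj:.0f} uJ from DRAM vs ~{sram_uj:.0f} uJ from on-chip SRAM")
```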

To reduce DRAM bandwidth there are two options for ResNet-50 (their effect is compared in the sketch after this list):

  1. Add enough SRAM to store all 22.7MB of weights on chip
  2. Add SRAM on chip to store intermediate activations so stage X writes to the activation cache and stage X+1 reads from it. For ResNet-50 the largest intermediate activation is 0.8 MB so 1MB of SRAM eliminates about half of the DRAM traffic.
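A minimal sketch of what each option does to the per-image traffic, using the ResNet-50 figures listed earlier:

```python
# Effect of the two options on ResNet-50's ~41.4 MB of per-image DRAM traffic (MB).
input_image, weights, act_write, act_read = 0.15, 22.7, 9.3, 9.3
baseline = input_image + weights + act_write + act_read

option1 = baseline - weights                 # all 22.7 MB of weights held on chip
option2 = baseline - act_write - act_read    # activations kept in a ~1 MB on-chip SRAM
print(f"no on-chip SRAM:   ~{baseline:.1f} MB/image")
print(f"weights on chip:   ~{option1:.1f} MB/image")
print(f"activation SRAM:   ~{option2:.1f} MB/image")
```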

Let’s look at YOLOv3 to see the DRAM traffic needed without on-chip SRAM:

  • 6MB input image size (remember each pixel has 3 Bytes for RGB)
  • 61.9 MB weights read in
  • 475 MB activations generated cumulatively as output of all of the stages written to DRAM
  • 475 MB activations read back in for the next layer
  • This gives a total of 1,108 MB = 1.1 GB of DRAM traffic to process just one image!
  • Much more SRAM is required to reduce DRAM bandwidth: 62 MB for weight caching and, since the largest intermediate activation is 64 MB, another 64 MB for activation caching. This would eliminate the DRAM traffic, but roughly 128 MB of SRAM in 16nm is about 140 square millimeters of silicon, which is very expensive.
  • The practical option for cost-effective designs is an activation cache big enough for most layers but not all: only 1 layer has a 64 MB activation output, 2 layers have 32 MB activation outputs, and 4 layers have 16 MB activation outputs – all the rest are 8 MB or less. So there is a tradeoff between activation cache size and DRAM bandwidth (sketched after this list).
  • For weights there is no tradeoff: either store all 61.9 MB on chip or read them all from DRAM.
  • You can see why YOLOv3 doesn’t run faster with batch sizes > 1: multiple batches require saving multiple sets of activations, and the activations are too big.
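To illustrate the activation-cache tradeoff, here is a sketch covering only the large layer outputs called out above; the many layers at 8 MB or less are ignored, so this shows the shape of the tradeoff rather than the full totals.

```python
# Activation-cache size vs the extra DRAM traffic from YOLOv3's largest layers.
# Only the big outputs listed above are modeled; everything at 8 MB or below
# is assumed to fit in the cache and is ignored here.
big_outputs_mb = [64] + [32] * 2 + [16] * 4   # 1 x 64 MB, 2 x 32 MB, 4 x 16 MB

for cache_mb in (8, 16, 32, 64):
    # any output too large for the cache is written to DRAM and read back
    spill = sum(2 * size for size in big_outputs_mb if size > cache_mb)
    print(f"{cache_mb:>2} MB activation cache -> ~{spill} MB of DRAM traffic "
          f"from these layers per image")
```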

The trend is toward larger models and larger images, so YOLOv3 is more representative of the future of inference acceleration – using on-chip memory effectively will be critical for low-cost, low-power inference.


