What two different neural network models can show us about the importance of using memory effectively.
Much has been written about the computational complexity of inference acceleration: very large matrix multiplies for fully-connected layers and huge numbers of 3×3 convolutions across megapixel images, both of which require many thousands of MACs (multiplier-accumulators) to achieve high throughput for models like ResNet-50 and YOLOv3.
The other side of the coin is managing the movement of data in memory: first, to keep the MACs supplied with weights and activations and achieve the highest hardware utilization; and second, to do so using the least power possible.
Let’s use two popular neural network models to examine the challenge of memory in inference. We’ll assume the Winograd Transformation is used for the popular 3×3, stride 1 convolutions.
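As a quick aside on the Winograd assumption: the standard F(2×2, 3×3) Winograd transform computes each 2×2 output tile of a 3×3, stride-1 convolution with 16 multiplies instead of 36, a 2.25× reduction. A minimal arithmetic sketch (mine, not from any vendor's numbers):

```python
# Winograd F(2x2, 3x3): a 2x2 output tile of a 3x3, stride-1 convolution
# is computed in the transform domain with a 4x4 element-wise product,
# i.e. 16 multiplies, versus 2*2*9 = 36 for the direct method.
direct_muls = 2 * 2 * 3 * 3    # 36 multiplies per 2x2 output tile
winograd_muls = 4 * 4          # 16 multiplies per 2x2 output tile
print(f"multiply reduction: {direct_muls / winograd_muls:.2f}x")  # 2.25x
```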
Unfortunately, few inference vendors give any benchmarks, choosing instead to use vague statements of TOPS without giving the details behind them.
When a benchmark is given it is usually ResNet-50, and often without stating batch size. Batch size is critical: most architectures do well with large batch sizes, but few have high throughput at low batch sizes, so if a batch size is not given you can assume it's large. It's important to note that ResNet-50 is a neural network that nobody actually plans to use in their products; instead, it's used to compare different architectures. Be careful: performance on ResNet-50 may not correlate with performance on the much more challenging models needed for, say, real-time object detection and recognition.
The biggest difference between ResNet-50 and YOLOv3 is the choice of image size. Look at what happens if ResNet-50 is run on 2 Megapixel images like YOLOv3: MACs/image increase to 103 billion and the largest activation grows to 33.6 MB. On large images, ResNet-50's characteristics look close to YOLOv3's.
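That scaling is easy to sanity-check. A back-of-the-envelope sketch, assuming the standard ResNet-50 first-layer output shape (112×112×64) as the largest activation, INT8 values, and 2 Megapixels counted as 2×1024² pixels (these shapes and units are my assumptions, not from the article):

```python
# Scale ResNet-50 from 224x224 to 2 Megapixel inputs. Both MACs/image
# and activation sizes grow in proportion to the pixel count.
base_pixels  = 224 * 224                 # 50,176 pixels
large_pixels = 2 * 1024 * 1024           # 2 Megapixels
scale = large_pixels / base_pixels       # ~41.8x

largest_act_mb = 112 * 112 * 64 / 1e6    # ~0.80 MB at 1 byte/value (INT8)
print(f"scale: {scale:.1f}x")
print(f"largest activation at 2MP: {largest_act_mb * scale:.1f} MB")  # ~33.6 MB
print(f"implied MACs at 224x224: {103e9 / scale / 1e9:.1f} billion")  # ~2.5 B
```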
Let's follow the memory traffic in ResNet-50 using the traditional 224×224 images.
Note that caching has little benefit for neural models, which is very different from traditional processor workloads, where data re-use is high: the 22.7 million weights are cycled through once and not re-used until the next image. A weights cache needs to hold all of the weights; a smaller weights cache just flushes continuously. Similarly with activations: in some models, certain activations are used again in later stages, but for the most part activations are generated and consumed immediately, only to feed the next stage.
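To see why a too-small weights cache "just flushes continuously", here is a tiny LRU simulation (illustrative only; the `hit_rate` helper, the line granularity, and the sizes are all mine): streaming sequentially through a working set even one line larger than the cache yields a 0% hit rate, while a cache that fits the whole working set hits on every pass after the first.

```python
from collections import OrderedDict

def hit_rate(cache_lines: int, working_set: int, passes: int) -> float:
    """LRU cache hit rate over repeated sequential passes -- the
    weight-streaming access pattern of inference (one pass per image)."""
    cache, hits, refs = OrderedDict(), 0, 0
    for _ in range(passes):
        for addr in range(working_set):
            refs += 1
            if addr in cache:
                hits += 1
                cache.move_to_end(addr)        # mark most-recently-used
            else:
                cache[addr] = True
                if len(cache) > cache_lines:
                    cache.popitem(last=False)  # evict least-recently-used
    return hits / refs

print(hit_rate(cache_lines=1000, working_set=1000, passes=4))  # 0.75
print(hit_rate(cache_lines=999,  working_set=1000, passes=4))  # 0.0
```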
So what memory transactions are required for each image processed by ResNet-50? Assume for now that all memory references are to/from DRAM: every one of the 22.7 million weights must be read once per image, and every intermediate activation must be written out and then read back by the next stage.
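A rough total is easy to sketch under stated assumptions (all mine: INT8 values at 1 byte each, every activation written once and read once, and ~10 MB of total intermediate activations, which is roughly what summing the standard ResNet-50 layer output shapes gives):

```python
# Rough per-image DRAM traffic for ResNet-50 at 224x224, all in DRAM.
weights_mb     = 22.7                 # all weights read once per image
activations_mb = 10.0                 # ASSUMED total intermediate activations
act_traffic_mb = 2 * activations_mb   # each activation written once + read once
total_mb = weights_mb + act_traffic_mb

print(f"~{total_mb:.0f} MB of DRAM traffic per image")     # ~43 MB
print(f"~{total_mb * 30 / 1000:.1f} GB/s at 30 frames/s")  # ~1.3 GB/s
```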
Memory references to DRAM use roughly 10-100x the power of memory references to on-chip SRAM.
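Applying that ratio to the ~43 MB/image estimated above gives a feel for the stakes. The absolute pJ/byte figures below are assumptions chosen for scale, not measured values:

```python
# Illustrative energy cost of ~43 MB/image of traffic, using the
# article's 10-100x DRAM-vs-SRAM ratio (100x assumed here).
dram_pj_per_byte = 100      # ASSUMED DRAM access energy per byte
sram_pj_per_byte = 1        # ASSUMED on-chip SRAM access energy per byte
traffic_bytes = 43e6

print(f"DRAM: {traffic_bytes * dram_pj_per_byte / 1e9:.1f} mJ/image")   # ~4.3 mJ
print(f"SRAM: {traffic_bytes * sram_pj_per_byte / 1e9:.2f} mJ/image")   # ~0.04 mJ
```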
To reduce DRAM bandwidth there are two options for ResNet-50: hold all 22.7 million weights (22.7 MB at INT8) in on-chip SRAM so they never need to be re-fetched, or batch many images so that each weight fetched from DRAM is re-used across the whole batch (which is why large batch sizes flatter most architectures).
Let's look at YOLOv3 to see the DRAM traffic needed without on-chip SRAM. YOLOv3 has roughly 62 million weights, and at 2 Megapixel images even single-layer activations run to tens of megabytes, so every weight and every activation must stream to and from DRAM for every image.
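A rough sketch of that traffic, with the activation total loudly flagged as an assumption (YOLOv3's ~62 million parameter count is a public figure; the per-image activation volume at 2 Megapixels is my estimate):

```python
# Back-of-the-envelope DRAM traffic for YOLOv3 on 2 Megapixel images,
# with no on-chip SRAM. INT8 (1 byte/value) assumed throughout.
weights_mb     = 62.0    # read once per image
activations_mb = 300.0   # ASSUMED total intermediate activations at 2 MP
traffic_mb = weights_mb + 2 * activations_mb   # acts written once + read once
fps = 30

print(f"~{traffic_mb:.0f} MB/image -> ~{traffic_mb * fps / 1000:.0f} GB/s at {fps} fps")
# ~662 MB/image -> ~20 GB/s, versus ~1.3 GB/s for ResNet-50 at 224x224
```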
The trend is toward larger models and larger images, so YOLOv3 is more representative of the future of inference acceleration: using on-chip memory effectively will be critical for low-cost, low-power inference.