Real-Time Object Recognition At Low Cost/Power/Latency

Benchmarks for a new neural inferencing architecture.


Most neural network chips and IP are benchmarked on ResNet-50 (image classification at 224×224 pixels). But we find that the neural network of most interest to customers is real-time object recognition, such as YOLOv3.

Direct comparisons are not possible here because no other vendor publishes a YOLOv3 inferencing benchmark. But it is very possible to improve on the inferencing performance of existing devices.

ResNet-50: NMAX versus Nvidia Tesla T4 and Habana Goya
The table below uses ResNet-50 to compare NMAX (Flex Logix’s neural inferencing architecture), in several configurations explained below, against the Nvidia Tesla T4 and Habana Goya. ResNet-50 has >20 million weights, and processing a single image takes about 7 billion operations (one operation = one multiply or one accumulate).
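The figures above translate directly into throughput: peak TOPS times achieved MAC utilization, divided by operations per image. A minimal sketch, using the source's ~7-billion-op figure; the 10 TOPS accelerator is a hypothetical example, not any specific device in the table:

```python
# Back-of-envelope ResNet-50 inference throughput from the figures quoted above.
OPS_PER_IMAGE = 7e9  # ~7 billion ops per 224x224 image (1 op = 1 multiply or 1 accumulate)

def throughput_images_per_sec(peak_tops, utilization):
    """Images/second for an accelerator with a given peak TOPS and MAC utilization."""
    return peak_tops * 1e12 * utilization / OPS_PER_IMAGE

# Hypothetical 10 TOPS accelerator at the 87% utilization quoted for NMAX at batch = 1
print(round(throughput_images_per_sec(10, 0.87)))  # 1243 images/sec
```

This is why utilization matters as much as peak TOPS: the same silicon at 25% utilization would deliver only about a third of the throughput.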

According to the published results, NMAX outperforms T4 and Goya while using just 1 DRAM to do so. NMAX achieves 87% to 90% utilization of the MACs even when operating at batch size = 1. NMAX is 10x lower power and lower cost than T4/Goya because we need less silicon area for MACs and far fewer DRAMs. Customers tell us that DRAM cost and power dominate in existing inferencing solutions.

For data center applications, large batch sizes are acceptable (within latency constraints). But for edge applications (cars, cameras, airplanes, etc.), there is often just one sensor or a few, so batch size needs to be 1 (or perhaps 2 or 4). For the edge, then, the only relevant column in the table above is batch = 1. T4 and Goya suffer significant performance loss at low batch sizes because they cannot load weights as fast as NMAX can.

Improvements with YOLOv3
Real-time object recognition is a much more demanding neural network model: YOLOv3 has 62 million weights, 3x that of ResNet-50, and requires 800 billion operations for one 2-megapixel image, roughly 100x ResNet-50 (which uses 224×224 pixel images).
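The ratios quoted above can be checked directly from the per-model figures in the text:

```python
# Figures quoted in the text: weights and operations per image for each model.
resnet50 = {"weights": 20e6, "ops": 7e9}    # per 224x224 image
yolov3   = {"weights": 62e6, "ops": 800e9}  # per 2-megapixel image

print(yolov3["weights"] / resnet50["weights"])  # ~3.1x the weights
print(yolov3["ops"] / resnet50["ops"])          # ~114x the ops, i.e. roughly 100x
```

The compute ratio dwarfs the weight ratio because the operation count scales with image resolution, not just model size.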

YOLOv3 can use any image size, but customers want high-resolution images because the accuracy of object recognition increases with higher resolution.

The table above shows NMAX configurations running YOLOv3 for autonomous driving. The 6×6 configuration processes HD video for 1 camera, the 12×6 for 2, and the 12×12 for 4, with DRAM bandwidth of only 10-14 gigabytes/second. The result is about 10x lower cost and power for the neural network subsystem (accelerator plus DRAMs) than other approaches: a quarter of the DRAMs and higher MAC utilization, which in turn means much lower silicon area.
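A rough sense of the per-camera workload follows from the source's 800-billion-op figure. A minimal sketch, assuming a 30 frames/second video rate (an assumption on our part; the text does not state the frame rate):

```python
OPS_PER_FRAME = 800e9  # YOLOv3 on one 2-megapixel image (figure from the text)
FPS = 30               # assumed HD video frame rate; not stated in the text

# Sustained compute needed per camera, in TOPS
tops_per_camera = OPS_PER_FRAME * FPS / 1e12
print(tops_per_camera)  # 24.0
```

At that rate, each additional camera adds tens of TOPS of sustained compute, which is why the larger 12×6 and 12×12 arrays are needed for multi-camera configurations.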

Architecture matters
The key here is a modular architecture that distributes processing, keeping it close to on-chip SRAM so that data movement is minimized. Weights need to be kept close to the MACs, which are also distributed, to speed up loading. And the interconnect must provide high-bandwidth, reconfigurable connections between compute and memory blocks, stage by stage. In the YOLOv3 table above, the on-chip SRAM bandwidth is huge: this is why so little DRAM bandwidth is needed.

NMAX can be “arrayed” from 1×1 to 12×12 or larger to scalably deliver from 1 to >100 TOPS, and throughput is essentially linear: 2x the MACs gives 2x the throughput for a given model. On-chip NMAX SRAM is configurable to accommodate different models: ResNet-50 needs about half the SRAM that YOLOv3 does. But NMAX is not an eFPGA: there is no Verilog programming for NMAX; the NMAX Compiler takes TensorFlow and Caffe models and programs the NMAX array directly.
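The linear-scaling claim can be sketched numerically. The per-tile TOPS figure below is a hypothetical value we chose so that a 12×12 array lands above 100 TOPS, consistent with the ">100 TOPS" claim; the actual per-tile rating is not given in the text:

```python
# Linear throughput scaling: 2x the MACs gives 2x the throughput for a given model.
def array_tops(rows, cols, tops_per_tile):
    """Peak TOPS for a rows x cols array of tiles, assuming perfect linear scaling."""
    return rows * cols * tops_per_tile

TOPS_PER_TILE = 0.75  # hypothetical per-tile figure, chosen for illustration only

print(array_tops(1, 1, TOPS_PER_TILE))    # 0.75
print(array_tops(12, 12, TOPS_PER_TILE))  # 108.0
print(array_tops(12, 6, TOPS_PER_TILE) == 2 * array_tops(6, 6, TOPS_PER_TILE))  # True
```

Under this model, doubling the array in one dimension exactly doubles throughput, which matches the "essentially linear" scaling described above.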

NMAX is now in implementation on TSMC 16FFC/12FFC using building blocks from our silicon-proven EFLX eFPGA, which means rapid time to market.