Modeling AI Inference Performance

TOPS may correlate with cost, but not necessarily with throughput.


The metrics that matter to customers in AI inference are throughput/$ and/or throughput/watt for their model.

One might assume throughput correlates with TOPS, but it does not. Examine the table below:

    Chip                 Inferences/TOP
    Nvidia Tesla T4      7.4
    Nvidia Xavier AGX    15
    InferX X1            34.5

And the InferX X1 does it with 1/10th to 1/20th of the DRAM bandwidth of the other two (the Nvidia chips use higher-bandwidth, more expensive DRAMs).

On YOLOv3 with 2-megapixel images, a much more relevant benchmark, the T4 gets 16 frames/second vs. 12 frames/second for the X1, which has just 7% of the TOPS and 5% of the DRAM bandwidth of the T4.
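Normalizing throughput by raw TOPS makes the mismatch concrete. A quick sketch using the figures above; the T4's TOPS is set to 1.0 purely as a normalization, since the text only gives the ratio:

```python
# Sketch: throughput per unit of raw compute. Frame rates and the 7%
# TOPS ratio come from the benchmark above; t4_tops = 1.0 is just a
# normalization, since only the ratio matters.
t4_fps = 16.0          # YOLOv3, 2-megapixel images
x1_fps = 12.0
t4_tops = 1.0          # normalized
x1_tops = 0.07         # X1 has ~7% of the T4's TOPS

t4_eff = t4_fps / t4_tops
x1_eff = x1_fps / x1_tops

print(f"X1 delivers {x1_eff / t4_eff:.1f}x the throughput per TOP of the T4")
```

On these numbers the X1 extracts roughly an order of magnitude more throughput from each TOP, which is the point of the comparison.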

So TOPS and DRAM bandwidth correlate with cost but not necessarily with throughput.

Designing a high-efficiency inference chip requires the ability to model performance ahead of silicon (the InferX X1 is now close to tape-out).

People have asked us how we can accurately predict our performance before silicon on so many models. To us it was critical to be able to do it to ensure we made the right architectural tradeoffs. So we invested in what was needed to ensure we could do it right.

To be able to model performance with good accuracy three things are required:

  1. Software and Hardware Readiness – all of the SoC blocks must be well defined, and real-world models like YOLOv3 must be ready to simulate from start to finish
  2. The inference software/compiler must be developed in parallel with the hardware: manually optimized benchmarks are not feasible in production, so the benchmarks must run compiler-produced code
  3. The architecture must be deterministic, with minimum resource contention and static scheduling and resource allocation

If the software team works on the compiler only after the hardware team is done, the likelihood of an efficient design is low.

If the architecture is not deterministic, the ability to reliably predict performance is low.

For example, cache hit rates are well understood for existing code but very hard to predict for totally new workloads, and if buses are shared, contention can be very hard to predict. In the InferX X1, the compute resources are located within an eFPGA fabric, so for the duration of each layer a programmed interconnect directly connects the memory containing the input activations to the MACs, the MACs to the LUTs for activation, and the LUTs back to the memory storing the output activations. This is totally deterministic.
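With no contention and static scheduling, pre-silicon performance prediction reduces to bookkeeping: total time is a sum over layers. A minimal sketch of that idea; the MAC-array width and clock rate below are hypothetical placeholders, while the ~2 microsecond reconfiguration time is the figure given in this article:

```python
# Minimal sketch: in a deterministic, statically scheduled architecture,
# run time is just per-layer compute plus reconfiguration, summed.
# MACS_PER_CYCLE and CLOCK_HZ are invented placeholders, not X1 specs.
RECONFIG_S = 2e-6          # ~2 us per layer reconfiguration (from the article)
MACS_PER_CYCLE = 4096      # hypothetical MAC array width
CLOCK_HZ = 1.0e9           # hypothetical 1 GHz clock

def predict_time(layer_macs):
    """Deterministic total: per-layer compute plus reconfiguration."""
    compute = sum(m / MACS_PER_CYCLE / CLOCK_HZ for m in layer_macs)
    return compute + RECONFIG_S * len(layer_macs)

# A YOLOv3-scale workload: ~100 layers of ~3 billion MACs each
t = predict_time([3_000_000_000] * 100)
print(f"predicted run time: {t * 1e3:.1f} ms")
```

No cache-hit-rate or arbitration guesswork appears anywhere in the model, which is why the prediction can be trusted before silicon.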

Layers are reconfigured quickly, in roughly 2 microseconds. A model like YOLOv3 processing 2-megapixel images takes >300 billion MACs per image; at a little more than 100 layers, that is >3 billion MACs per layer per image. So reconfiguration time is very small compared to compute time.
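A back-of-the-envelope check makes the claim concrete. The MACs-per-layer and reconfiguration figures come from the text; the effective MAC rate below is a hypothetical placeholder, not an X1 spec:

```python
# Back-of-the-envelope: reconfiguration overhead per layer.
macs_per_layer = 3e9       # ~3 billion MACs per layer (from the text)
reconfig_s = 2e-6          # ~2 us reconfiguration (from the text)
mac_rate = 4e12            # hypothetical effective rate of 4 TMAC/s

compute_s = macs_per_layer / mac_rate   # 750 us of compute per layer
overhead = reconfig_s / compute_s
print(f"reconfiguration is {overhead:.2%} of per-layer compute time")
```

At any realistic compute rate, 2 microseconds of reconfiguration is a fraction of a percent of each layer's compute time.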

Deep layer fusion can allow multiple layers to be implemented simultaneously with one feeding directly to the next – this can eliminate many of the largest activations (in YOLOv3, the largest activation is 64MB from layer 0 to layer 1: with layer 0 and 1 fused together the 64MB is directly passed between the layers with no DRAM writes or reads).
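The savings from fusing just those first two layers are easy to quantify with the 64MB figure from the text, counting one DRAM write plus one DRAM read per image in the unfused case:

```python
# DRAM traffic eliminated by fusing YOLOv3 layers 0 and 1: without
# fusion, the 64 MB activation is written to DRAM by layer 0 and read
# back by layer 1; with fusion, it never leaves the chip.
activation_mb = 64
unfused_mb = activation_mb * 2   # one write + one read per image
fused_mb = 0                     # passed directly between fused layers
print(f"fusion saves {unfused_mb - fused_mb} MB of DRAM traffic per image")
```

That is 128MB of DRAM traffic per image removed by a single fusion, before counting any of the other large activations.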

InferX X1 brings in the weights and code for the next layer during the execution of the current layer – they are stored in cache locations and then quickly loaded during the short reconfiguration period. Doing this “hides” almost all DRAM traffic behind computation time. For YOLOv3 at 2 megapixels, just 4% of cycles are stalls where DRAM traffic blocks the MACs.
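This overlap behaves like classic double buffering: the next layer's weights stream in while the current layer computes, so the MACs stall only if the transfer outlasts the compute. A sketch of that relationship (the timing values in the examples are hypothetical, in microseconds):

```python
# Hypothetical double-buffering model of weight prefetch: DRAM traffic
# for layer N+1 overlaps with compute for layer N, so the visible stall
# is only the portion of the transfer that exceeds the compute time.
def visible_stall_us(compute_us: int, transfer_us: int) -> int:
    return max(0, transfer_us - compute_us)

# Transfer fully hidden when it is shorter than the compute:
print(visible_stall_us(750, 200))   # no stall
# Only the excess is visible when it is longer:
print(visible_stall_us(750, 800))   # 50 us of stall
```

When per-layer compute dwarfs per-layer DRAM traffic, as in the YOLOv3 case above, almost every transfer falls into the fully hidden case, leaving only a few percent of stall cycles.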

Finally, we are running the full InferX X1 RTL, including the PCIe and LPDDR4 controllers, on Mentor Veloce emulation boxes. This allows us to boot the SoC in emulated Linux, load the X1 kernel driver, run the model, trigger on events, and dump detailed waveforms and VCD vectors, as well as perform system-level performance and power-rail analysis.

Our nnMAX Compiler tool is now available: it runs models in TFLite or ONNX and reports predicted performance, which we expect to be very close to the actual silicon/boards due in early 2020. Contact us at [email protected] if you would like to try our software on your model(s).
