Modeling AI Inference Performance

TOPS may correlate with cost, but not necessarily with throughput.


The metric in AI inference that matters to customers is throughput per dollar and/or throughput per watt on their model.

One might assume that throughput correlates with TOPS, but that assumption would be wrong. Examine the numbers below:

The Nvidia Tesla T4 gets 7.4 inferences per TOPS, the Xavier AGX 15, and the InferX X1 34.5. And the InferX X1 does it with 1/10th to 1/20th of the DRAM bandwidth of the other two (the Nvidia chips use higher-bandwidth, more expensive DRAMs).

On YOLOv3 with 2-megapixel images, a much more relevant benchmark, the T4 gets 16 frames/second versus 12 frames/second for the X1, which has 7% of the TOPS and 5% of the DRAM bandwidth.
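
To make that comparison concrete, here is a quick back-of-the-envelope check using only the ratios quoted above; no additional chip specifications are assumed.

```python
# Back-of-the-envelope check using only the ratios quoted above (YOLOv3, 2 MP).
t4_fps = 16.0          # Tesla T4 frames/second (from the text)
x1_fps = 12.0          # InferX X1 frames/second (from the text)
x1_tops_ratio = 0.07   # X1 has ~7% of the T4's TOPS (from the text)
x1_bw_ratio = 0.05     # X1 has ~5% of the T4's DRAM bandwidth (from the text)

# Relative throughput per TOPS and per unit of DRAM bandwidth
per_tops = (x1_fps / t4_fps) / x1_tops_ratio
per_bw = (x1_fps / t4_fps) / x1_bw_ratio
print(f"X1 delivers ~{per_tops:.1f}x the throughput per TOPS of the T4")
print(f"X1 delivers ~{per_bw:.1f}x the throughput per unit of DRAM bandwidth")
```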

So TOPS and DRAM bandwidth correlate with cost but not necessarily with throughput.

Designing a high-efficiency inference chip requires the ability to model performance ahead of silicon (the InferX X1 is close to tape-out now).

People have asked us how we can accurately predict our performance before silicon across so many models. For us it was critical to be able to do this, to ensure we made the right architectural tradeoffs, so we invested in what was needed to do it right.

To model performance with good accuracy, three things are required:

  1. Software and hardware readiness – all of the SoC blocks must be well defined, and real-world models, like YOLOv3, must be ready to simulate from start to finish.
  2. The inference software/compiler must be developed in parallel with the hardware: manually optimized benchmarks are not feasible in production, so performance must be measured on compiler-produced code.
  3. The architecture must be deterministic, with minimal resource contention and with static scheduling and resource allocation.

If the software team only starts work on the compiler after the hardware team has finished, the likelihood of an efficient result is low.

If the architecture is not deterministic, the ability to reliably predict performance is low.

For example, cache hit rates are well understood for existing code but very hard to predict for totally new workloads, and if buses are shared, contention can be very hard to predict. In the InferX X1, the compute resources are located within an eFPGA fabric, so the interconnect is programmed for the duration of each layer: it directly connects the memory containing the input activations to the MACs, the MACs to the LUTs for activation, and the LUTs back to the memory storing the output activations. This is totally deterministic.
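
As a rough illustration of why this matters for modeling, a deterministic, statically scheduled datapath lets a per-layer latency estimate reduce to simple arithmetic, with no cache-hit-rate or bus-contention terms to guess at. The sketch below is a minimal illustration only; the MAC count, clock rate, and utilization are placeholder assumptions, not InferX X1 specifications.

```python
# Minimal sketch of a deterministic per-layer latency model. Because the
# interconnect is fixed for the whole layer, there are no cache or bus
# contention terms to estimate. Hardware numbers here are illustrative
# assumptions, not InferX X1 specifications.
from dataclasses import dataclass

@dataclass
class Hardware:
    macs: int           # MAC units available to the layer (assumed)
    clock_hz: float     # operating frequency (assumed)

@dataclass
class Layer:
    mac_ops: float      # multiply-accumulates required by the layer
    utilization: float  # fraction of MACs kept busy by the static schedule

def layer_compute_time_s(hw: Hardware, layer: Layer) -> float:
    """Pure arithmetic: cycles = work / (MACs actually busy), then divide by clock."""
    cycles = layer.mac_ops / (hw.macs * layer.utilization)
    return cycles / hw.clock_hz

# Example with placeholder numbers.
hw = Hardware(macs=4096, clock_hz=1.0e9)
layer = Layer(mac_ops=3e9, utilization=0.7)
print(f"predicted compute time for this layer: {layer_compute_time_s(hw, layer)*1e3:.2f} ms")
```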

Layers are reconfigured quickly, in roughly 2 microseconds. A model like YOLOv3 processing 2-megapixel images takes more than 300 billion MACs per image; at a little more than 100 layers, that is more than 3 billion MACs per layer per image. So reconfiguration time is very small compared to compute time.
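
A quick sanity check of that claim, using the figures above plus an assumed (not actual) MAC array size and clock:

```python
# Sanity check: reconfiguration time vs. per-layer compute time for YOLOv3 (2 MP).
# The MAC array size and clock are illustrative assumptions, not X1 specifications.
macs_per_image = 300e9                    # >300 billion MACs per image (from the text)
layers = 100                              # a little more than 100 layers (from the text)
macs_per_layer = macs_per_image / layers  # ~3 billion MACs per layer

hw_macs, clock_hz = 4096, 1.0e9           # assumed MAC count and clock
compute_s = macs_per_layer / (hw_macs * clock_hz)
reconfig_s = 2e-6                         # ~2 microseconds, as stated above

print(f"compute per layer: ~{compute_s*1e6:.0f} us")
print(f"reconfiguration overhead: ~{reconfig_s / (compute_s + reconfig_s) * 100:.1f}% of layer time")
```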

Deep layer fusion allows multiple layers to be implemented simultaneously, with one feeding directly into the next. This can eliminate many of the largest activations: in YOLOv3, the largest activation is the 64MB passed from layer 0 to layer 1, and with layers 0 and 1 fused, that 64MB is passed directly between them with no DRAM writes or reads.
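
A simple traffic count illustrates the saving; the only figure taken from the text is the 64MB intermediate activation, and weight traffic and other activations are ignored for clarity.

```python
# Illustrative DRAM-traffic comparison for fusing two layers.
# Only the 64 MB intermediate activation size comes from the text;
# weight traffic and other activations are ignored for clarity.
intermediate_mb = 64.0   # activation passed from layer 0 to layer 1 (YOLOv3)

# Without fusion: the activation is written to DRAM, then read back.
unfused_traffic_mb = intermediate_mb * 2   # one write + one read

# With layers 0 and 1 fused: the activation stays on-chip.
fused_traffic_mb = 0.0

print(f"DRAM traffic saved by fusing layers 0 and 1: "
      f"{unfused_traffic_mb - fused_traffic_mb:.0f} MB per image")
```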

InferX X1 brings in the weights and code for the next layer during the execution of the current layer; they are stored in cache locations and then quickly loaded during the short reconfiguration period. Doing this “hides” almost all DRAM traffic behind computation time. For YOLOv3 with 2-megapixel images, just 4% of cycles are DRAM traffic that stalls the MACs.
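
A minimal sketch of this overlap, with placeholder cycle counts rather than X1 measurements: the only DRAM cycles that stall the MACs are those that do not fit under the current layer's compute time.

```python
# Sketch of hiding the next layer's weight/code prefetch behind the current
# layer's compute. All cycle counts are illustrative placeholders, not X1 data.
def stall_cycles(compute_cycles: float, prefetch_cycles: float) -> float:
    """DRAM cycles that cannot be hidden under the current layer's compute."""
    return max(0.0, prefetch_cycles - compute_cycles)

layers = [
    # (compute cycles, cycles needed to prefetch the NEXT layer's weights/code)
    (1_000_000, 300_000),
    (800_000, 900_000),   # prefetch slightly exceeds compute: small stall
    (1_200_000, 400_000),
]

total_compute = sum(c for c, _ in layers)
total_stall = sum(stall_cycles(c, p) for c, p in layers)
print(f"stall cycles: {total_stall / (total_compute + total_stall) * 100:.1f}% of total")
```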

Finally, we are running the full InferX X1 RTL, including the PCIe and LPDDR4 controllers, on Mentor Veloce emulation boxes. This allows us to boot the SoC under emulated Linux, load the X1 kernel driver, run the model, trigger on events, and dump detailed waveforms and VCD vectors, as well as perform system-level performance and power-rail analysis.

Our nnMAX Compiler tool is now available to run models in TFLite or ONNX and gives predicted performance that we expect will be very close to the actual silicon and boards we expect to have early in 2020. Contact us at [email protected] if you would like to get our software to try out on your model(s).


