TOPS, Memory, Throughput And Inference Efficiency

Evaluate inference accelerators to find the best throughput for the money.

Dozens of companies have developed, or are developing, IP and chips for neural network inference.

Almost every AI chip company quotes TOPS, but gives little other information.

What is TOPS? It stands for Tera (trillions of) Operations per Second. It is a measure of the maximum achievable throughput, not a measure of actual throughput. Most operations are MACs (multiply/accumulates), so TOPS = (number of MAC units) x (frequency of MAC operations) x 2.
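To make the arithmetic concrete, here is a minimal sketch in Python; the MAC count and clock frequency below are hypothetical, chosen only to illustrate the formula:

```python
# Hypothetical accelerator: 4,096 INT8 MAC units clocked at 1 GHz.
num_mac_units = 4096
mac_frequency_hz = 1.0e9

# Each MAC performs two operations per cycle, a multiply and an add,
# which is where the factor of 2 comes from.
peak_ops_per_second = num_mac_units * mac_frequency_hz * 2
peak_tops = peak_ops_per_second / 1e12

print(f"Peak throughput: {peak_tops:.1f} TOPS")  # -> 8.2 TOPS
```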

So more TOPS means more silicon area, more cost, more power and maybe more throughput, but that depends on other aspects of the inference accelerator.

TOPS alone is not enough information. You need to know the throughput for your model, your image size and your batch size – this will tell you whether the chip or IP will meet your throughput requirement.

But at what cost and power? For that you need to measure inference efficiency.

Inference efficiency
Throughput/$ (or ¥ or €) for a given model, image size and batch size is inference efficiency, and it allows direct comparison between alternatives.
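As a minimal sketch, the metric is just a ratio; the throughput and price figures below are hypothetical:

```python
def inference_efficiency(throughput_ips: float, price_usd: float) -> float:
    """Images/second per dollar for a fixed model, image size and batch size."""
    return throughput_ips / price_usd

# Two hypothetical accelerators running the same model at the same batch size.
chip_a = inference_efficiency(throughput_ips=1000.0, price_usd=500.0)   # 2.0 ips/$
chip_b = inference_efficiency(throughput_ips=1500.0, price_usd=1200.0)  # 1.25 ips/$
# Chip B has higher raw throughput, but Chip A delivers more throughput/$.
```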

Little price information is available, but we can estimate cost by looking at the key factors that drive the cost of the chip.

All inference accelerators have four key components that make up most of the chip:

  1. MACs (let’s assume for now that all are INT8, but many will have INT16 and BFloat16 options);
  2. SRAM (may be distributed or central on the chip);
  3. DRAM (each DRAM requires a DDR PHY on chip and about 100 extra BGA balls);
  4. The interconnect architecture that connects the compute and memory blocks along with logic that controls execution of the neural network model.

More MACs, more SRAM, more DRAM and more interconnect will both improve throughput AND increase cost.

The objective is to get maximum inference efficiency: maximize throughput (for a given model, image size, batch size) with the fewest MACs and the least SRAM, DRAM and interconnect. This will maximize throughput/$. Note that cost and power roughly correlate: power dissipation comes from the MACs, SRAM, DRAM and interconnect – more of each translates to more power.
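One way to picture this is a rough cost proxy built from the four components above; the weights below are placeholders, not real silicon-area or power figures, and the function is only a sketch of the trade-off:

```python
def chip_cost_proxy(num_macs: int, sram_mb: float,
                    num_drams: int, interconnect_units: float) -> float:
    """Relative cost score: more MACs, SRAM, DRAM and interconnect all add cost."""
    MAC_WEIGHT = 1.0e-5   # hypothetical relative cost per MAC
    SRAM_WEIGHT = 2.0     # hypothetical relative cost per MB of on-chip SRAM
    DRAM_WEIGHT = 15.0    # hypothetical cost per DRAM (DDR PHY + extra BGA balls)
    NOC_WEIGHT = 1.0      # hypothetical cost per unit of interconnect
    return (num_macs * MAC_WEIGHT + sram_mb * SRAM_WEIGHT
            + num_drams * DRAM_WEIGHT + interconnect_units * NOC_WEIGHT)

# Inference efficiency is then throughput divided by this proxy; because power
# scales with the same four components, the proxy tracks throughput/watt too.
```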

Some (but not many) companies provide additional data for their inference accelerator chip: TOPS, number of DRAMs (determines DRAM bandwidth) and throughput for ResNet-50. Throughput/$ can be approximated by looking at the combination of throughput/TOPS, throughput/SRAM and throughput/DRAM.
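A sketch of how those three ratios fall out of the published numbers; the chip names and specs here are hypothetical placeholders, not vendor data:

```python
chips = {
    "chip_x": {"throughput_ips": 2000.0, "tops": 10.0, "sram_mb": 8.0, "num_drams": 1},
    "chip_y": {"throughput_ips": 3000.0, "tops": 40.0, "sram_mb": 32.0, "num_drams": 4},
}

for name, c in chips.items():
    per_tops = c["throughput_ips"] / c["tops"]       # MAC efficiency
    per_sram = c["throughput_ips"] / c["sram_mb"]    # SRAM efficiency
    per_dram = c["throughput_ips"] / c["num_drams"]  # DRAM efficiency
    print(f"{name}: {per_tops:.0f} ips/TOPS, "
          f"{per_sram:.0f} ips/MB SRAM, {per_dram:.0f} ips/DRAM")
```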

ResNet-50 is not likely the best benchmark to use. No one actually uses it in an application. But it is the only benchmark with sufficient data to make some comparisons. Keep in mind that the relative performance on larger models and larger image sizes will likely change significantly depending on the characteristics of each architecture.

Below we will compare inference accelerators with TOPS from 400 (Groq) to ~0.5 (Jetson Nano). Few of them give all the data we’d like, but there is enough to see some trends. Inference chips are listed if they have published TOPS and ResNet-50 performance for some batch size. The chips are ordered from highest ResNet-50 throughput to lowest, with two columns showing batch=1 throughput and batch=10+ throughput. Where a batch size was not given, we assume it is a large batch.

Notice that TOPS and throughput correlate only loosely: some chips deliver more throughput from fewer TOPS than others. This is because architecture, SRAM size and the number of DRAMs are also very important in determining throughput.

Throughput/TOPS: an indicator of how efficiently MACs are utilized for a model
Let’s look at Throughput/TOPS. This tells us how efficiently a chip uses its MACs, at least for a given model.

None of the chips except the InferX X1 discloses how much SRAM it has (the X1 has 8MB). More SRAM and more DRAM will both help improve the utilization of MACs, but at a cost. So the highest throughput/TOPS is not necessarily the best throughput/$: we also need to know how much memory is used, since memory adds to cost.
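Throughput/TOPS can be turned into an estimated MAC utilization if you know the work per image. ResNet-50 at 224x224 is commonly cited at roughly 4 GMACs (~8 GOPs) per image; treat that figure, and the chip numbers below, as approximations for illustration only:

```python
OPS_PER_IMAGE = 8.0e9  # ~4 GMACs x 2 ops/MAC for ResNet-50 at 224x224 (approximate)

def mac_utilization(throughput_ips: float, peak_tops: float) -> float:
    """Fraction of peak MAC throughput actually delivered on this model."""
    delivered_ops_per_s = throughput_ips * OPS_PER_IMAGE
    peak_ops_per_s = peak_tops * 1e12
    return delivered_ops_per_s / peak_ops_per_s

# Hypothetical chip: 1,000 images/s from 16 peak TOPS.
print(f"{mac_utilization(1000.0, 16.0):.0%}")  # -> 50%
```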

The table below shows throughput/TOPS for ResNet-50 in descending order of throughput.

Throughput/DRAM: an indicator of how efficiently DRAMs are utilized
Next we look at ResNet-50 throughput/DRAM (counting the number of DRAMs, not the gigabits of capacity: DRAM is used in inference primarily for bandwidth, not for capacity).

The table is sorted in descending throughput/DRAM for ResNet-50.

Throughput/SRAM: an indicator of how efficiently SRAM is utilized
On-chip SRAM can take as much or more die area than the MACs, so knowing SRAM capacity is very important in estimating throughput/$. Unfortunately, very few chips provide it: only two are in the table below, and the SRAM size for the Hailo-8 is an estimate by Microprocessor Report.

Estimating Throughput/$ by plotting Throughput/TOPS vs Throughput/DRAM and then Throughput/TOPS vs Throughput/SRAM MB
The highest throughput/$ architectures will be good at throughput/TOPS, throughput/DRAM and throughput/SRAM. The available data is limited but we can draw some conclusions.

We have the most data on TOPS and DRAM — we plot them below for ResNet-50 batch=1 then ResNet-50 batch=10+.

There is also a little data on TOPS and SRAM which is plotted below for ResNet-50 batch=1.
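A sketch of the plotting approach in matplotlib, using hypothetical chips; the same pattern works with throughput/SRAM on the y-axis where SRAM data is available:

```python
import matplotlib.pyplot as plt

# Hypothetical placeholder data, not published figures.
chips = {
    "chip_x": {"throughput_ips": 2000.0, "tops": 10.0, "num_drams": 1},
    "chip_y": {"throughput_ips": 3000.0, "tops": 40.0, "num_drams": 4},
    "chip_z": {"throughput_ips": 500.0,  "tops": 2.0,  "num_drams": 1},
}

for name, c in chips.items():
    x = c["throughput_ips"] / c["tops"]       # throughput/TOPS
    y = c["throughput_ips"] / c["num_drams"]  # throughput/DRAM
    plt.scatter(x, y)
    plt.annotate(name, (x, y))

plt.xlabel("ResNet-50 throughput / TOPS")
plt.ylabel("ResNet-50 throughput / # DRAMs")
plt.show()  # chips toward the upper right are likely better throughput/$
```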

Conclusion: how to use this methodology for your application
Decide which model, image size and batch size is most relevant for your application.

Then ask your vendors to give you their INT8 throughput for that model/image size/batch size AND to give you their TOPS, megabytes of on-chip SRAM and the number of DRAMs used to achieve that throughput. Plotting the results using the approach above will give you insight into the key components of throughput/$ for your application.



3 comments

YanjunMa says:

Very nice summary. Quick questions:
1. On the formula TOPS = (number of MAC units) x (frequency of MAC operations) x 2: where is the x2 coming from? Is it from the two operations per MAC, multiply and addition?

2. Also, “ResNet-50 is not likely the best benchmark to use. No one actually uses it in an application.” Which algorithms are being used?

Tanj Bennett says:

ResNet-50 is a relatively old benchmark of small size and simple topology (convolution, using early layers specialized to finding primitive features). It is still relevant if the task you need is a variant of image recognition over moderate sets of images. Other models popular for ML benchmarks these days include GNMT, NCF, and BERT (currently the heavyweight) which are used for tasks such as recommendation and language comprehension.

Also, training and inference are generally different. Inference uses a derived network with a pre-computed topology and set of coefficients, valuing measures like rapid throughput. Batch size in this case leverages pipelining to put multiple observations through the pipeline, and may correspond to using a shared ASIC from multiple clients who each need to set up differently. Batch size for learning, the process which calculates the coefficients, has much greater implications for memory size and interconnect bandwidth. Batch sizes vary with the problem, with small batches tending to be noisier and quicker, while batches that are too large may slow the iterative pace of the calculations. This article appears to be mostly looking at inferencing, the delivery end of AI.

YanjunMa says:

GNMT, NCF, and BERT are for language processing, translation, etc. ResNet is for image processing.
