Neural Network Performance Modeling Software

How to build an inference chip for the edge.


nnMAX Inference IP is nearing design completion. The nnMAX 1K tile will be available this summer for design integration in SoCs, and it can be arrayed to provide whatever inference throughput is desired. The InferX X1 chip will tape out late Q3 this year using 2×2 nnMAX tiles, for 4K MACs, with 8MB SRAM.

The nnMAX Compiler is in development in parallel, and the first release is available now. Evaluation licenses are available to use the nnMAX Compiler to get accurate performance estimates for any nnMAX configuration, including the X1. Currently, TensorFlow Lite models with INT8 are supported. Later this quarter, ONNX support will also be available, and support will be extended to INT8/BFloat16 (any mix of layers).

Unlike other inference architectures, nnMAX is a fully deterministic architecture. For each layer the datapath is programmed, using our XFLX interconnect which has been proven out from 180 to 12nm, to provide a fixed path from SRAM to hardware units and back to SRAM. The datapath is reconfigured in a microsecond or less between layers.

nnMAX Compiler Flow
Below is the architecture of the nnMAX Compiler.

The user can specify any nnMAX array spec: the array is N rows by M columns with X MB of SRAM per tile, where X is 1MB, 2MB or 4MB. The InferX X1 chip is 2×2 with 2MB/tile for 4K MACs and 8MB SRAM in total. Throughput increases roughly linearly with increasing array size. Depending on the model, the amount of SRAM required will vary; also more SRAM will typically reduce DRAM bandwidth required.
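The array-spec arithmetic above can be sketched in a few lines. This is an illustrative helper, not part of the actual nnMAX Compiler API; the only facts it encodes from the article are 1K MACs per tile, the 1/2/4 MB per-tile SRAM options, and the X1's 2×2, 2MB/tile configuration.

```python
# Illustrative sketch of the nnMAX array-spec arithmetic; function and field
# names are hypothetical, not the nnMAX Compiler's real interface.
MACS_PER_TILE = 1024  # each nnMAX 1K tile provides 1K MACs

def array_spec(rows, cols, sram_mb_per_tile):
    """Summarize an N x M nnMAX array with X MB SRAM per tile (X in {1, 2, 4})."""
    assert sram_mb_per_tile in (1, 2, 4), "per-tile SRAM must be 1, 2 or 4 MB"
    tiles = rows * cols
    return {
        "tiles": tiles,
        "total_macs": tiles * MACS_PER_TILE,
        "total_sram_mb": tiles * sram_mb_per_tile,
    }

# The InferX X1 configuration: 2 x 2 tiles at 2 MB/tile -> 4K MACs, 8 MB SRAM.
x1 = array_spec(2, 2, 2)
print(x1)  # {'tiles': 4, 'total_macs': 4096, 'total_sram_mb': 8}
```

Throughput scaling roughly linearly with array size follows directly from `total_macs` growing with `rows * cols`.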

Different parsers allow inputs from TensorFlow Lite and ONNX, and perhaps other model formats in the future, although so far all customers have indicated TensorFlow Lite or ONNX meet their needs. The parser converts the neural model into an internal representation format.

The nnMAX Compiler front-end groups layers (layer fusion) to maximize throughput into a series of configurations.
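To make the idea of layer fusion concrete, here is a deliberately simplified sketch: greedily group consecutive layers into one configuration until a per-configuration SRAM budget is exhausted. The real nnMAX fusion algorithm optimizes for throughput and is certainly more sophisticated; this only illustrates the general shape of the problem.

```python
# Simplified, hypothetical layer-fusion sketch -- NOT the nnMAX Compiler's
# actual algorithm. Greedily pack consecutive layers into configurations
# subject to an SRAM budget per configuration.

def fuse_layers(layer_sram_mb, sram_budget_mb):
    """Partition per-layer SRAM footprints into a series of configurations."""
    configs, current, used = [], [], 0.0
    for i, need in enumerate(layer_sram_mb):
        if current and used + need > sram_budget_mb:
            configs.append(current)      # close the current configuration
            current, used = [], 0.0
        current.append(i)                # fuse this layer into it
        used += need
    if current:
        configs.append(current)
    return configs

# Example: six layers against an 8 MB budget (the X1's total SRAM).
print(fuse_layers([3.0, 2.0, 4.0, 1.0, 5.0, 2.0], 8.0))
# [[0, 1], [2, 3], [4, 5]]
```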

Then the nnMAX Compiler generates the soft logic which will control the nnMAX array execution for the duration of the configuration. The soft logic runs in the eFPGA LUTs of the nnMAX tiles.

Before X1 silicon availability, we will be able to verify all of the generated configurations, datapaths and soft logic on an FPGA prototype to ensure 100% functional accuracy.

The next step is back-end place-and-route, which uses our existing EFLX eFPGA place-and-route. It has been running for 4 years for eFPGA arrays in 180nm, 40nm, 28nm, 16nm, 14nm and 12nm. The one change is that flip-flops have been added to the interconnect array for pipelining, when needed, to achieve 1.067GHz operation under worst-case conditions.

The final step is generation of the configuration binaries which will be loaded into the nnMAX array or InferX X1 to run the desired neural model. Multiple models can run on nnMAX.

Modeling configurations

The nnMAX Performance Modeler is available now under an evaluation license. Currently, any TensorFlow Lite INT8 model is supported.

The Modeler figures out which layers to fuse into successive configurations and then computes how many cycles are required to execute each configuration, how many to reconfigure between layers, and how many cycles of DRAM “stall” may occur, for example due to large activation reads/writes.

For their nnMAX configuration, model and batch size, the customer sees the throughput in frames/second.
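The frames/second arithmetic the Modeler performs can be sketched as follows. This is a hedged illustration under the assumptions that per-frame cycle counts are known for each configuration and that everything runs at the 1.067GHz clock mentioned earlier; the function name and example cycle counts are invented for illustration.

```python
# Illustrative throughput estimate: total per-frame cycles are execution plus
# reconfiguration plus DRAM-stall cycles, at the 1.067 GHz clock. The cycle
# counts below are made-up example numbers, not modeled results.
CLOCK_HZ = 1.067e9

def frames_per_second(exec_cycles, reconfig_cycles, stall_cycles):
    """Estimate throughput from per-configuration cycle counts for one frame."""
    total = sum(exec_cycles) + sum(reconfig_cycles) + sum(stall_cycles)
    return CLOCK_HZ / total

fps = frames_per_second(
    exec_cycles=[400_000, 600_000, 500_000],   # cycles to run each configuration
    reconfig_cycles=[1_000, 1_000, 1_000],     # ~1 us reconfiguration between layers
    stall_cycles=[0, 50_000, 0],               # DRAM stall cycles, if any
)
print(round(fps, 1))
```

Because nnMAX execution is deterministic, these cycle counts can be computed exactly rather than measured, which is what makes the Modeler's estimates accurate.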

Other useful information is provided:

  • DRAM bandwidth & SRAM bandwidth
  • TeraMACs/second
  • MAC utilization
  • Array area
  • Whether configurations and/or weights are stored in SRAM

It is possible to provide this information because nnMAX execution is 100% deterministic. There is no bus contention or RAM contention, because datapaths are configured using our interconnect from memory, through the hardware units, and back to memory.

There is room for further improvement in our software. The layer fusion algorithm is good but we have noticed a few cases where it is possible to do better. And there are cases where weights or configurations of earlier layers may be able to be saved in SRAM for later layers that use the same ones.
