Author's Latest Posts


Where Is The Edge AI Market And Ecosystem Headed?


Until recently, most AI was in datacenters and most was training. Things are changing quickly. Projections are AI sales will grow rapidly to $10s of billions by the mid 2020s, with most of the growth in Edge AI Inference. Edge inference applications Where is the Edge Inference market today? Let’s look at the markets from highest throughput to lowest. Edge Servers Recently Nvidia annou... » read more

Modeling AI Inference Performance


The metric in AI Inference that matters to customers is either throughput/$ for their model and/or throughput/watts for their model. One might assume throughput will correlate with TOPS, but you’d be wrong. Examine the table below: The Nvidia Tesla T4 gets 7.4 inferences/TOP, Xavier AGX 15 and InferX 1 34.5. And InferX X1 does it with 1/10th to 1/20th of the DRAM bandwidth of the ... » read more

Advantages Of BFloat16 For AI Inference


Essentially all AI training is done with 32-bit floating point. But doing AI inference with 32-bit floating point is expensive, power-hungry and slow. And quantizing models for 8-bit-integer, which is very fast and lowest power, is a major investment of money, scarce resources and time. Now BFloat16 (BF16) offers an attractive balance for many users. BFloat16 offers essentially t... » read more

eFPGA Macros Deliver Higher Speeds from Less Area/Resources


We work with a lot of customers designing eFPGA into their SoCs.  Most of them have “random logic” RTL, but some customers have large numbers of complex, frequently used blocks. We have found in many cases that we can help the customer achieve higher throughput AND use less silicon area with Soft Macros. Let’s look at an example: 64x64 Multiply-Accumulate (MAC), below: If yo... » read more

AI Inference Memory System Tradeoffs


When companies describe their AI inference chip they typically give TOPS but don’t talk about their memory system, which is equally important. What is TOPS? It means Trillions or Tera Operations per Second. It is primarily a measure of the maximum achievable throughput but not a measure of actual throughput. Most operations are MACs (multiply/accumulates), so TOPS = (number of MAC units) x... » read more

TOPS, Memory, Throughput And Inference Efficiency


Dozens of companies have or are developing IP and chips for Neural Network Inference. Almost every AI company gives TOPS but little other information. What is TOPS? It means Trillions or Tera Operations per Second. It is primarily a measure of the maximum achievable throughput but not a measure of actual throughput. Most operations are MACs (multiply/accumulates), so TOPS = (number of MAC... » read more

Do Large Batches Always Improve Neural Network Throughput?


Common benchmarks like ResNet-50 generally have much higher throughput with large batch sizes than with batch size =1. For example, the Nvidia Tesla T4 has 4x the throughput at batch=32 than when it is processing in batch=1 mode. Of course, larger batch sizes have a tradeoff: latency increases which may be undesirable in real-time applications. Why do larger batches increase throughput... » read more

Neural Network Performance Modeling Software


nnMAX Inference IP is nearing design completion. The nnMAX 1K tile will be available this summer for design integration in SoCs, and it can be arrayed to provide whatever inference throughput is desired. The InferX X1 chip will tape out late Q3 this year using 2x2 nnMAX tiles, for 4K MACs, with 8MB SRAM. The nnMAX Compiler is in development in parallel, and the first release is available now... » read more

Multi-Layer Processing Boosts Inference Throughput/Watt


The focus in discussion of inference throughput is often on the computations required. For example, YOLOv3, a power real time object detection and recognition model, requires 227 BILLION MACs (multiply-accumulates) to process a single 2 Mega Pixel image! This is with the Winograd Transformation; it’s more than 300 Billion without it. And there is a lot of discussion of the large size ... » read more

Inference Acceleration: Follow The Memory


Much has been written about the computational complexity of inference acceleration: very large matrix multiplies for fully-connected layers and huge numbers of 3x3 convolutions across megapixel images, both of which require many thousands of MACs (multiplier-accumulators) to achieve high throughput for models like ResNet-50 and YOLOv3. The other side of the coin is managing the movement of d... » read more

← Older posts