Edge Inference Applications And Market Segmentation

What matters for different customers in the rapidly growing edge inference market.


Until recently, most AI ran in the data center/cloud, and most of that was training. Things are changing quickly. AI sales are projected to grow rapidly to tens of billions of dollars by the mid-2020s, with most of the growth in edge AI inference.

Data center/cloud vs. edge inference: What’s the difference?

Inference started in the data center/cloud, running on Xeons. To gain efficiency, much of the processing shifted to ICs with lots of MACs, especially Nvidia’s GPUs.

In the data center, an inference accelerator must be able to run all of the models in the data center. Data centers are cooled, and the servers have further cooling, so PCIe boards can burn 75W to 300W TDP. And since data centers are bigger than football fields, at any given time there are likely dozens of different jobs running the same model, so they can be batched to give higher aggregate throughput (at some cost in latency). The size of new models running in the data center is growing fast.

Inference at the edge (systems outside of the cloud) is very different:

  • Other than autonomous vehicles, edge systems typically run one model from one sensor.
  • The sensors are typically capturing some portion of the electromagnetic spectrum (we’ve seen light, radar, LIDAR, X-Ray, magnetic, laser, infrared, …) in a 2D “image” of 0.5 to 6 megapixels.
  • The sensors capture data at frame rates from 10 to 100 frames/second.
  • The applications are almost always latency sensitive: the customer wants to process the neural network model as soon as frames are captured to take action – quicker is better. So customers want batch=1. (Batching from one sensor means waiting to accumulate 2, 4 or 8 images before even starting to process them – latency is very bad.)
  • Many applications are accuracy critical: think medical imaging for example. You want your X-Ray or Ultrasound diagnosis to be accurate!
  • The customer models are typically convolution intensive, often derivatives of YOLOv3.
  • Some edge systems incorporate small servers in them (think MRI machines which are big and expensive to begin with) and can handle 75W PCIe cards.
  • Many edge servers are lower cost and can benefit from less expensive PCIe cards with good price/performance.
  • Higher volume edge systems incorporate inference accelerator chips that dissipate up to 15W (no fans).
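The latency cost of batching from a single sensor, described in the list above, can be sketched with a little arithmetic. This is a minimal illustration, not a measurement: the function name is hypothetical, the 30 frames/second rate is an assumed value within the 10 to 100 frames/second range mentioned above, and only the time spent waiting for frames to arrive is counted (inference time itself is ignored).

```python
def batching_wait_ms(batch_size: int, frames_per_second: float) -> float:
    """Time the first frame of a batch waits for the rest of the batch to arrive.

    Counts only the accumulation delay from one sensor; the time to run the
    model on the batch would be added on top of this.
    """
    if batch_size < 1 or frames_per_second <= 0:
        raise ValueError("batch_size must be >= 1 and frames_per_second > 0")
    frame_interval_ms = 1000.0 / frames_per_second
    # The first frame must wait for (batch_size - 1) more frames to be captured.
    return (batch_size - 1) * frame_interval_ms


# Assumed sensor rate for illustration only.
ASSUMED_FPS = 30.0

for batch in (1, 2, 4, 8):
    wait = batching_wait_ms(batch, ASSUMED_FPS)
    print(f"batch={batch}: first frame waits {wait:.1f} ms before processing starts")
```

At batch=1 the wait is zero. Every doubling of the batch size adds more frame intervals of delay before any processing starts, which is why edge customers with one sensor ask for batch=1.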

Edge inference applications

The application everyone thinks of first is typically autonomous vehicles. But real autonomous driving is a decade or more away. In the 2020s, the value of inference will be in driver assistance and safety (detecting distraction, sleep, etc.). Design cycles are 4-5 years, so a new inference chip today won’t show up in your vehicle until 2025 or later.

What are the other markets using edge inference today?

Edge servers

Last year, Nvidia announced inference sales outstripped training for the first time. Much of this was likely shipped to data centers, but there are also many applications outside of data centers.

This means that sales of PCIe inference boards for edge inference applications are likely in the hundreds of millions of dollars per year and rapidly growing.

A lot of edge servers are deployed in factories, hospitals, retail stores, financial institutions and other enterprises. In many cases, sensors in the form of cameras are already connected to the servers, but they are just recording what’s happening in case of an accident or a theft. Now these servers can be super-charged with low cost PCIe inference boards.

There is a wide range of applications: surveillance, facial recognition, retail analytics, genomics/gene sequencing, industrial inspection, medical imaging, and more. Since training is done in floating point and quantization requires a lot of skill/investment, most edge server inference is likely done in 16-bit floating point, with only the highest volume applications being done in INT8.

Until now, edge servers that did inference used the Nvidia Tesla T4, a great product but priced at $2000+. Many servers are low cost and can now benefit from inference accelerator PCIe boards at prices as low as $399, with throughput/$ the same as or better than the T4.
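To make the throughput/$ comparison concrete, here is a small sketch of the break-even arithmetic. The function name is hypothetical, and the only inputs are the two prices quoted above ($399 and $2000); no throughput figures are assumed.

```python
def break_even_throughput_fraction(board_price: float, reference_price: float) -> float:
    """Fraction of the reference board's throughput that a cheaper board needs
    in order to match the reference on throughput per dollar."""
    if board_price <= 0 or reference_price <= 0:
        raise ValueError("prices must be positive")
    # Equal throughput/$ means: board_tput / board_price == ref_tput / ref_price,
    # so board_tput / ref_tput == board_price / ref_price.
    return board_price / reference_price


# Prices taken from the text; the T4 price is a floor ("$2000+").
fraction = break_even_throughput_fraction(399.0, 2000.0)
print(f"Break-even throughput relative to the T4: {fraction:.4f}")
```

In other words, a $399 board matches a $2000 T4 on throughput/$ if it delivers roughly a fifth of the T4's throughput on the customer's model. Because the T4 price is quoted as "$2000+", the real threshold is this or lower.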

Higher volume, high accuracy/quality imaging

Applications include robotics, industrial automation/inspection, medical imaging, scientific imaging, cameras for surveillance and object recognition, photonics, etc. In these applications, the end products sell for thousands to millions of dollars, the sensors capture 0.5 to 6 megapixels, and “getting it right” is critical. Customers therefore want to use the best models (for example, YOLOv3, which is a heavy model at 62 million weights and >300 billion MACs to process a 2 megapixel image) and the largest image size they can: just as for humans, it is easier to recognize people in a large, crisp image than in a small one.
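The YOLOv3 figures above imply a sustained compute rate and a weight footprint that are easy to estimate. The sketch below is back-of-the-envelope only: the function names are hypothetical, the 30 and 60 frames/second rates are assumed for illustration, and the bytes per weight (1 for INT8, 2 for 16-bit floating point) are the standard sizes of those formats. Since the MAC count is quoted as ">300 billion", the results are lower bounds.

```python
# Figures quoted in the text for YOLOv3 on a 2-megapixel image.
YOLOV3_MACS_PER_FRAME = 300e9   # lower bound: ">300 billion MACs"
YOLOV3_WEIGHTS = 62e6           # 62 million weights


def required_macs_per_second(macs_per_frame: float, frames_per_second: float) -> float:
    """Sustained MAC rate needed to keep up with the sensor at batch=1."""
    return macs_per_frame * frames_per_second


def weight_storage_megabytes(num_weights: float, bytes_per_weight: int) -> float:
    """Storage needed for the model weights, in megabytes (10^6 bytes)."""
    return num_weights * bytes_per_weight / 1e6


for fps in (30, 60):  # assumed frame rates, for illustration
    tera_macs = required_macs_per_second(YOLOV3_MACS_PER_FRAME, fps) / 1e12
    print(f"{fps} frames/second: at least {tera_macs:.0f} trillion MACs/second")

for name, size in (("INT8", 1), ("16-bit float", 2)):
    mb = weight_storage_megabytes(YOLOV3_WEIGHTS, size)
    print(f"{name} weights: about {mb:.0f} MB")
```

This counts useful MACs only. An accelerator's peak MAC rate has to be higher than this to the extent that its MACs are not fully utilized on the customer's model.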

The main player here is Nvidia’s Jetson line (Nano, TX2, Xavier AGX and Xavier NX) at 5-30W and $250-$800.

Customers we talk to are starved for throughput and are looking for solutions that will give them more throughput and larger image sizes for the same power/price as what they use today: when they get it, their solutions will be more accurate/reliable and market adoption and expansion will accelerate. So although the applications today are in the thousands or tens of thousands of units, this will grow rapidly with the availability of inference that delivers more and more throughput/$ and throughput/watt.

Today there are inference accelerators that can outperform Xavier NX at lower power and, in million/year quantities, at prices that are 1/10th of Xavier NX. This will drive much higher volume applications of high-performance inference acceleration.

This market segment should become the largest over time because of the breadth of applications.

Low accuracy/quality imaging

Many consumer products or applications where accuracy is nice but not critical will opt for very small images and simpler models like Tiny YOLO. In this space the leaders are Jetson Nano, Intel Movidius, and Google Edge TPU at $50-$100.

Voice and lower throughput inference

Imaging neural network models require trillions of MACs/second for 30 frames/second of megapixel images. Voice processing requires billions of MACs/second, or even less for just keyword recognition. These applications, like Amazon Echo, are already very significant in adoption and volume, but $/chip is much lower. The players in this market are totally different from those in the market segments above.

Cell phones

Almost all cell phone application processors include an AI module on the SoC for local processing of simple neural network models. The main players here are Apple, Qualcomm, Mediatek and Samsung. This is actually the highest unit volume of AI deployment at the edge today.

What matters for edge inference customers

First is latency. Edge systems are making decisions on images coming in at up to 60 frames per second. In a car, for example, it is obviously important that objects like people, bikes and cars be detected and acted upon in as little time as possible. In all edge applications latency is #1, which means batch size is almost always 1.

Second is numerics. Many edge server customers will stay with floating point for a long time, and BF16 is the easiest for them to move to since they just truncate 16 bits off their FP32 inputs and weights. Fanless systems will be INT8 if they are high volume but many will be BF16 if volumes stay in the thousands given the cost and complexity of quantization. An inference accelerator that can do both gives customers the ability to start quickly with BF16 and shift seamlessly to INT8 when they are ready to make the investment in quantization.
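The "truncate 16 bits" step can be sketched in a few lines. BF16 keeps the sign bit, the 8 exponent bits and the top 7 mantissa bits of an FP32 value, so dropping the low 16 bits of the FP32 bit pattern yields a BF16 value directly. The helper names below are hypothetical, and the sketch uses plain truncation as described above; real converters often round to nearest instead, which this example does not do.

```python
import struct


def fp32_to_bf16_bits(value: float) -> int:
    """Return the 16-bit BF16 pattern obtained by truncating an FP32 value."""
    # Pack as IEEE 754 single precision, then reinterpret as a 32-bit integer.
    (bits32,) = struct.unpack("<I", struct.pack("<f", value))
    # Drop the low 16 mantissa bits; sign and exponent are untouched.
    return bits32 >> 16


def bf16_bits_to_fp32(bits16: int) -> float:
    """Expand a BF16 bit pattern back to FP32 by zero-filling the low 16 bits."""
    (value,) = struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))
    return value


for x in (1.0, -2.5, 3.14159):
    bits = fp32_to_bf16_bits(x)
    back = bf16_bits_to_fp32(bits)
    print(f"{x!r}: BF16 bits 0x{bits:04X}, value after round trip {back!r}")
```

Values that need only a few mantissa bits, such as 1.0 and -2.5, survive unchanged, while others lose precision in the low bits. The exponent range is the same as FP32, so no rescaling or calibration is needed, which is what makes BF16 an easy first step compared with INT8 quantization.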

Third is throughput for the customer’s model and image size. Any given customer typically is running one model and knows their image size and sensor frame rate. Almost every application wants to process megapixel images (1, 2 or 4) at frame rates of 30 or even 60 frames/second. Most applications are vision CNNs, but there are many applications with very different models, even ones processing 3-dimensional images or processing images in time (think MRI, etc.) or LIDAR or financial modeling. The only customers who run more than one model are automotive customers, who must process vision, LIDAR and 1 or 2 other models simultaneously.

Fourth is efficiency: almost all customers want more throughput/image size per dollar and per watt. Most tell us they want to increase throughput and increase image size for their current dollar budget and power budget. But as throughput/$ and throughput/watt increase, new applications will become possible at the low end of the market where the volumes are exponentially larger.

Edge inference is on the edge of rapid growth

The availability of vastly superior inference accelerators to replace CPUs, GPUs and FPGAs at much higher throughput/$ and throughput/watt will cause rapid market expansion.

And the presence of numerous competitors in fast growing markets will result in rapid innovation and further improvements in throughput efficiency and accuracy.

The next five years will see a tidal wave of growth and innovation for customers and for those inference chip companies with the superior architectures.
