The different forms of inference at the edge and the outlook for the accelerator ecosystem.
Until recently, most AI ran in datacenters, and most of it was training. Things are changing quickly: projections call for AI sales to grow rapidly to tens of billions of dollars by the mid-2020s, with most of the growth in edge AI inference.
Edge inference applications
Where is the edge inference market today? Let’s look at the segments from highest throughput to lowest.
Edge Servers
Nvidia recently announced that inference sales outstripped training sales for the first time. Much of that volume likely shipped to datacenters, but there are also many applications outside of datacenters, generally referred to as “the edge.” This means that sales of PCIe inference boards for edge inference applications are likely in the hundreds of millions of dollars per year and growing rapidly.
There is a wide range of applications: surveillance, facial recognition, retail analytics, genomics/gene sequencing, and more. Since training is done in floating point and quantization requires significant skill and investment, most edge server inference is likely done in 16-bit floating point, with only the highest-volume applications being done in INT8. PCIe inference boards range from 75W (Nvidia Tesla T4) to 200W (Habana Goya).
Autonomous Vehicles
A year ago, automakers and suppliers were talking about moving quickly to fully autonomous driving with their own custom chips. Today plans are more modest: off-the-shelf solutions (we hear Xavier AGX and NX a lot) for mid-2020 model years, with object detection and correction on megapixel images used as a driver supplement for increased safety. Volumes, for now, are in the tens of thousands of conspicuous test vehicles, such as Google’s Waymo, with large cameras, prominent lidars and trunks full of electronics. In 5 years, volumes could be in the millions for well-integrated, mass-market Level 2 object detection and correction.
Fanless Systems for Image/CNN Applications
The main players here are Nvidia Jetson (Nano, TX2, Xavier AGX and Xavier NX) at 5-30W and Intel Movidius Myriad at single-digit watts but roughly 1/10th of the throughput. The applications cover a very wide range: surveillance cameras, gene sequencing, home doorbells, medical systems (e.g. ultrasound), photonics, robotic vision and more. Most of these run CNNs, but some run models very different from image CNNs.
Fans are unacceptable in this market. The customers we talk to are starved for throughput and are looking for solutions that give them more throughput and larger image sizes for the same power and price as what they use today; when they get it, their solutions will be more accurate and reliable, and market adoption and expansion will accelerate. So although the applications today ship in the thousands or tens of thousands of units, this will grow rapidly with the availability of inference accelerators that deliver more and more throughput/$ and throughput/watt.
This market segment should become the largest over time because of the breadth of applications.
Voice and Lower Throughput Inference
Image CNNs require trillions of MACs/second. Voice processing is billions of MACs/second, or even less for just keyword recognition. These applications, like the Amazon Echo, are already very significant in adoption and volume, but $/chip is much lower. The players in this market are completely different from those in the segments above.
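To make the scale difference concrete, here is a rough back-of-envelope sketch in Python. The model sizes, MAC counts and rates below are illustrative assumptions, not measured figures for any specific chip or benchmark.

```python
# Back-of-envelope MAC-rate estimates (illustrative assumptions, not benchmarks).

def mac_rate(macs_per_inference, inferences_per_second):
    """Sustained multiply-accumulates per second for a given workload."""
    return macs_per_inference * inferences_per_second

# Image CNN: assume a ResNet-50-class backbone (~2e9 MACs at 224x224),
# scaled to a 2-megapixel frame (~40x the pixels), running at 30 frames/second.
image_cnn_macs_per_s = mac_rate(2e9 * 40, 30)   # ~2.4e12 -> trillions of MACs/s

# Keyword spotting: assume a small DS-CNN-style model (~5e6 MACs per inference),
# evaluated ~25 times per second on streaming audio.
keyword_macs_per_s = mac_rate(5e6, 25)          # ~1.3e8 -> well under a billion MACs/s

print(f"Image CNN:        {image_cnn_macs_per_s:.2e} MACs/s")
print(f"Keyword spotting: {keyword_macs_per_s:.2e} MACs/s")
```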
What matters for edge inference customers
First is latency. Edge systems are making decisions on images coming in at up to 60 frames per second. In a car, for example, it is obviously important that objects like people, bikes and cars be detected, and their presence acted upon, in as little time as possible. In all edge applications latency is #1, which means batch size is almost always 1.
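A quick worked example shows why batching hurts at the edge (the frame rate and batch sizes here are illustrative):

```python
# Why edge inference runs at batch size 1 (illustrative arithmetic).
fps = 60
frame_period_ms = 1000 / fps                 # ~16.7 ms between frames

# With batch size 1, a frame waits only for its own inference.
# With batch size N, the first frame in the batch must wait for N-1 more
# frames to arrive before inference can even start.
for batch in (1, 4, 8):
    queuing_delay_ms = (batch - 1) * frame_period_ms
    print(f"batch={batch}: ~{queuing_delay_ms:.1f} ms of added delay before inference starts")
```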
Second is numerics. Many edge server customers will stay with floating point for a long time, and BF16 is the easiest for them to move to since they just truncate 16 bits off their FP32 inputs and weights. Fanless systems will be INT8 if they are high volume but many will be BF16 if volumes stay in the thousands given the cost and complexity of quantization. An inference accelerator that can do both gives customers the ability to start quickly with BF16 and shift seamlessly to INT8 when they are ready to make the investment in quantization.
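As a minimal sketch of what “truncate 16 bits off FP32” means in practice, here is a NumPy illustration; real converters typically use round-to-nearest-even and framework-specific types rather than plain truncation:

```python
import numpy as np

def fp32_to_bf16_truncate(x):
    """Reduce FP32 values to BF16 precision by dropping the low 16 mantissa bits.
    (Truncation for illustration; production flows usually round-to-nearest-even.)"""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    bf16_bits = bits & np.uint32(0xFFFF0000)   # keep sign, exponent, top 7 mantissa bits
    return bf16_bits.view(np.float32)          # stored as FP32, but only BF16-precise

weights = np.array([0.1234567, -3.1415926, 1e-8], dtype=np.float32)
print(fp32_to_bf16_truncate(weights))
```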
Third is throughput for the customer’s model and image size. Any given customer is typically running one model and knows their image size and sensor frame rate. Almost every application wants to process megapixel images (1, 2 or 4 megapixels) at frame rates of 30 or even 60 frames/second. Most applications are vision CNNs, but there are many applications with very different models, even ones processing three-dimensional images (think MRI, …) or LIDAR or financial modeling. The only customers who run more than one model are automotive, which must process vision, LIDAR and 1 or 2 other models simultaneously.
Fourth is efficiency: almost all customers want more throughput and larger image sizes per dollar and per watt. Most tell us they want to increase throughput and image size within their current dollar and power budgets. But as throughput/$ and throughput/watt increase, new applications will become possible at the low end of the market, where the volumes are exponentially larger.
Edge inference accelerator ecosystem outlook
The king of the market today is Nvidia, with Tesla T4 PCIe inference boards for edge servers and Jetson modules for fanless edge systems.
Intel’s Movidius Myriad X has numerous adopters, but its throughput is an order of magnitude below Nvidia’s Jetson Xavier AGX/NX, and customers using it make significant sacrifices in image size, frame rate, model complexity and thus prediction accuracy. At Hot Chips in August, Intel announced Spring Hill, now called NNP-I, but has not published any benchmarks other than a single ResNet-50 result of unspecified batch size.
Intel’s market capitalization is over $200 billion and Nvidia’s is over $100 billion. For both of them, success in edge inference is very important to maintaining their market valuations.
What about startups?
In edge inference, the first apparent success is Habana Labs with its 200W Goya inference board, which delivers higher throughput than the Tesla T4. As of this writing, Intel is rumored to be in discussions to acquire Habana for over $1 billion, which would make it Intel’s third major AI acquisition after Movidius and Nervana.
Intel and Nvidia will both be under pressure to acquire emerging winners rather than lose critical market share that would impact their market capitalizations. This makes venture capitalists very happy. Not all emerging winners will choose to be acquired; it will depend on their economics and their need for capital.
Groq and Blaize (formerly ThinCI) both claim working silicon, but neither has published benchmarks or specifications. Both appear to be targeting datacenter-class power and throughput. Mythic also claims working silicon but has not given benchmarks or specs.
Most startups, like Habana, are targeting PCIe boards for edge and datacenter applications at 75W or much more per board.
Very few startups, other than Flex Logix and Mythic, are targeting fanless CNN systems. Both companies’ chips can also be used in low-power, lower-cost but high-performance PCIe boards for edge servers.
Another group of startups is targeting the much lower-throughput, lower-power voice inference segment.
People claim there are almost 100 AI chip startups today. Almost none have come to market, despite some having been funded as much as 5 years ago. Money has flowed to these companies in large amounts from investors betting on the promise of the market and the competitive advantages claimed by the founders. Customers tell us they have heard a lot of promises, almost none of which are delivered upon when silicon becomes available. It appears that many startups did not have accurate performance modeling and/or did not develop their software together with their hardware architecture, resulting in major shortfalls in actual performance. We hear rumors that one high-profile startup has received silicon that falls far short of customer expectations, and we have seen dozens of resumes from team members saying the group is being shut down.
Neural networks are very portable thanks to standards like TensorFlow Lite and ONNX, so in a given market segment there will be a few winners and a lot of losers, and the decisions will come quickly in 2020 and 2021. By late 2020 the herd of AI chip startups is likely to be cut in half, and by late 2021 to at most 20 companies. There will be at most 2-3 winners in each major segment: training, datacenter inference, edge server inference, fanless image/CNN systems and voice inference; with 3 winners each, that makes about 15 survivors.
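That portability is one reason switching costs between accelerators are low. As a hypothetical sketch (the model, file name and input size are placeholders), exporting a trained PyTorch vision model to ONNX looks like this; the resulting file can then be handed to whichever vendor toolchain accepts ONNX:

```python
import torch
import torchvision

# Hypothetical example: export a pretrained vision CNN to ONNX so it can be
# compiled by whatever inference accelerator toolchain accepts ONNX models.
model = torchvision.models.resnet50(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)   # batch size 1, as in most edge deployments

torch.onnx.export(
    model, dummy_input, "resnet50.onnx",
    input_names=["image"], output_names=["logits"],
    opset_version=11,
)
```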
Edge inference is on the edge of rapid growth
The availability of vastly superior inference accelerators to replace CPUs, GPUs and FPGAs at much higher throughput/$ and throughput/watt will cause rapid market expansion.
And the presence of numerous competitors in fast-growing markets will result in rapid innovation and further improvements in throughput, efficiency and accuracy.
The next 5 years will see a tidal wave of growth and innovation for customers and for those inference chip companies with the superior architectures.