Whether data needs to be processed immediately makes a big difference to inference implementation.
Deep learning and AI inference originated in the data center and were first deployed in practical, high-volume applications there. Only recently has inference begun to spread to edge applications (anywhere outside of the data center).
In the data center, much of the data to be processed is a “pool” of data. For example, when you see your photo album tagged with all of the pictures of your dog or your family members, the tagging was done by an inference program that ran “in the background” over your photos. The data is available in large chunks and nobody is waiting for the results, so it can be processed in large batches to maximize throughput/$.
At the edge, the data to be processed typically comes from a sensor (most often a camera, but also LIDAR, radar, medical imaging devices, and others).
The most common sensor is a camera, which typically captures megapixel-sized images at 30 frames/second.
So data at the edge arrives in streams and typically needs to be processed in real time, which makes latency very important.
Let’s consider an example: a 2 Megapixel camera capturing 30 frames/second.
So a new frame is available every 33 milliseconds.
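As a quick sanity check, here is a minimal sketch of that arithmetic. The 3-bytes-per-pixel figure is an assumption for illustration, not something stated in the example:

```python
# Back-of-the-envelope numbers for a 2 megapixel, 30 fps camera.
FRAME_RATE_FPS = 30
MEGAPIXELS = 2
BYTES_PER_PIXEL = 3  # assumed RGB, 8 bits per channel (illustrative only)

frame_period_ms = 1000 / FRAME_RATE_FPS
raw_mb_per_second = MEGAPIXELS * 1e6 * BYTES_PER_PIXEL * FRAME_RATE_FPS / 1e6

print(f"New frame every {frame_period_ms:.1f} ms")            # ~33.3 ms
print(f"Raw sensor data rate ~{raw_mb_per_second:.0f} MB/s")  # ~180 MB/s
```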
In a typical application there are three steps in the processing pipeline: the image is captured from the sensor, inference is run on it, and the result is acted on.
For example, if the application is autonomous driving, all 3 steps must be completed in a very short time in order to detect and avoid hitting objects like pedestrians or cars.
Different applications will have different needs based on what they are doing and their power/cost/size constraints.
Let’s consider an application that is running a YOLOv3 neural network model which detects and recognizes objects.
Consider the Nvidia Xavier NX: it has three processing units that can run the model, a GPU and two deep learning accelerators (DLA1 and DLA2).
What is the throughput of the Nvidia Xavier NX for YOLOv3?
For a “Pool” application, all three units can work on independent images in parallel: the GPU takes about 95 msec per image (roughly 10.5 frames/second) and each DLA takes about 290 msec (roughly 3.4 frames/second), for a combined throughput of about 17 frames/second.
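As a quick check, here is a minimal sketch of that calculation, using the 95 msec GPU and 290 msec DLA latencies that appear in the dispatch walkthrough below:

```python
# Pool throughput: all three units process independent images in parallel,
# so their per-unit throughputs simply add.
GPU_LATENCY_S = 0.095
DLA_LATENCY_S = 0.290

pool_fps = 1 / GPU_LATENCY_S + 2 * (1 / DLA_LATENCY_S)
print(f"Pool throughput ~{pool_fps:.1f} frames/second")  # ~17.4 fps
```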
What about for a Streaming application? A new image arrives every 33 milliseconds, and the images must be processed in order so they can be acted on sequentially. For example, to track a pedestrian you must first detect the pedestrian when they come into sight, then track them frame by frame as they move. The Xavier NX cannot keep up with 30 frames/second; it cannot even keep up with 15 frames/second, despite the “Pool” throughput of 17 frames/second above.
Let’s break it down, assuming we process every 2nd image, so images arrive every 67 msec:
Image 0: arrives at 0 msec, dispatched to the GPU, processed by 95 msec
Image 1: arrives at 67 msec, GPU is busy, so dispatched to DLA1, processed by 67 + 290 = 357 msec
Image 2: arrives at 133 msec, GPU is available, processed by 133 + 95 = 228 msec
Image 3: arrives at 200 msec, GPU is busy and DLA1 is busy, so dispatched to DLA2, processed by 200 + 290 = 490 msec
Image 4: arrives at 267 msec, GPU is available, processed by 267 + 95 = 362 msec
So you can see the images are being processed out of sequence, which is not acceptable for the application.
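The schedule above can be reproduced with a small simulation. This is a minimal sketch under an assumed dispatch policy (send each frame to the first idle unit, preferring the GPU, then DLA1, then DLA2), not a description of Nvidia's actual runtime scheduler:

```python
# Reproduce the dispatch walkthrough: every 2nd frame of the 30 fps stream
# is sent to the first idle unit; if all are busy, it waits for the earliest.
units = {"GPU": {"latency": 95, "free_at": 0},
         "DLA1": {"latency": 290, "free_at": 0},
         "DLA2": {"latency": 290, "free_at": 0}}

arrivals = [0, 67, 133, 200, 267]  # ms, every 2nd frame at ~66.7 ms spacing

for i, arrival in enumerate(arrivals):
    # Pick the first unit that is idle when the frame arrives;
    # if none is idle, fall back to the unit that frees up earliest.
    idle = [name for name, u in units.items() if u["free_at"] <= arrival]
    name = idle[0] if idle else min(units, key=lambda n: units[n]["free_at"])
    start = max(arrival, units[name]["free_at"])
    done = start + units[name]["latency"]
    units[name]["free_at"] = done
    print(f"Image {i}: arrives {arrival:3d} ms -> {name}, processed by {done} ms")
```

Running it prints image 1 finishing at 357 msec and image 3 at 490 msec, well after images 2 and 4, i.e. out of order.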
If we instead choose to process 10 frames/second, images arrive every 100 msec, so the 95 msec GPU can keep up. So the Streaming Throughput of the Xavier NX for this application is 10 frames/second.
Streaming throughput = the inverse of the latency of the execution unit that can “keep up.”
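Here is a minimal sketch of that relationship for this example, assuming frames can only be taken on the camera's 30 frames/second grid (so the 95 msec GPU ends up processing every 3rd frame):

```python
import math

# Streaming throughput is bounded by the latency of the unit that keeps up
# (here, the 95 ms GPU), quantized to the camera's frame grid.
CAMERA_FPS = 30
GPU_LATENCY_MS = 95

frame_period_ms = 1000 / CAMERA_FPS
every_nth = math.ceil(GPU_LATENCY_MS / frame_period_ms)  # process every Nth frame
streaming_fps = CAMERA_FPS / every_nth

print(f"Upper bound from latency: {1000 / GPU_LATENCY_MS:.1f} fps")  # ~10.5
print(f"Achievable on a 30 fps stream: {streaming_fps:.0f} fps")     # 10
```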
AI inference is very new to most of us, so it is easy to get confused. Large-batch throughput numbers sound very impressive, but edge applications process streams at batch size 1, and execution latency is what matters. Streaming throughput is the inverse of execution latency.