Issues And Challenges In Super-Resolution Object Detection And Recognition

Comparing the impact of using megapixel images and larger models.


If you want high-performance AI inference, such as Super-Resolution Object Detection and Recognition, in your SoC, the challenge is to find a solution that meets your needs and constraints.

  • You need inference IP that can run the model you want at high accuracy.
  • You need inference IP that can run the model at the frame rate you want with batch=1 processing: a higher frame rate means lower latency and more time for decision making.
  • And you need inference IP that does the above within your cost and power budgets.
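
To make the frame rate point concrete, here is a minimal Python sketch of the per-frame time budget. It assumes batch=1 with frames processed one at a time and no overlap between frames; the function name frame_budget_ms is ours, for illustration only.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available to finish one inference, in milliseconds.

    Assumption: batch=1 and no pipelining, so each frame must be
    fully processed within one frame period.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


if __name__ == "__main__":
    for fps in (10, 30, 60):
        print(f"{fps} fps -> {frame_budget_ms(fps):.1f} ms per frame")
```

Under these assumptions, doubling the frame rate halves the time the inference IP has to finish each frame.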

First, let’s look at the issue of accuracy.

Detection and recognition accuracy can be improved in three ways:

  • First, larger image sizes: just like humans, it’s easier to recognize objects in a large, high resolution image than a small, grainy one. But just going from 640×640 to 1280×1280 increases the pixel count 4x, which will increase compute workload per frame by 4x or more. The larger image and activations will stress the memory subsystem and cause more reads/writes to go to DRAM instead of local SRAM, slowing performance further.
  • Second, training: even the best model won’t detect objects if it has not been trained correctly. Recently, a Cruise Automation autonomous vehicle in San Francisco crashed into a bus because it was an articulated bus and the car had only been trained on normal buses, so it didn’t realize the back half was part of the bus. There are many types of cars and trucks, and the angle of orientation is important too: a model trained to detect cars from the front or side may struggle to detect cars in a drone image looking down.
  • Third, the model: models have receptive fields, basically looking for objects at various sizes, and can add more filtering to improve accuracy. If a model has receptive fields for large and for medium-size images of cars, it won’t “see” cars that are smaller than the medium-size receptive field. Adding more receptive fields increases the compute workload.
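
The image-size arithmetic in the first point can be checked with a short Python sketch. It assumes a 3-channel RGB input stored at 1 byte per value; both are our assumptions for illustration, and input_bytes is a hypothetical helper, not part of any model's API.

```python
def input_bytes(width: int, height: int, channels: int = 3,
                bytes_per_value: int = 1) -> int:
    """Size of one input image tensor in bytes.

    Assumptions (not from the article): 3 channels and 1 byte per value.
    """
    return width * height * channels * bytes_per_value


small = input_bytes(640, 640)      # 640x640 input
large = input_bytes(1280, 1280)    # 1280x1280 input

if __name__ == "__main__":
    print(f"640x640:   {small} bytes")
    print(f"1280x1280: {large} bytes")
    print(f"ratio:     {large / small:.1f}x")
```

The input alone grows 4x, and the intermediate activations grow with it, which is why more traffic tends to spill from local SRAM to DRAM.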

A state-of-the-art Object Detection and Recognition model is Yolov5, the latest in the Yolo family.

There are eight versions of Yolov5: the first four process 640×640 images, and the second four process 1280×1280 images.

  • Yolov5s, 5m, 5l and 5x process 640×640 images with an increasing number of layers
  • Yolov5s6, 5m6, 5l6 and 5x6 process 1280×1280 images with even more layers
  • Yolov5l6 has 184 layers versus 85 for Yolov5s, and has to process 4x the pixels. In addition to having more layers, each layer has larger dimensions, which results in a higher number of operations per layer.

Let’s see how a selection of these models does at detecting small objects. Here is an aerial view of a congested intersection processed on Yolov5s:

Fig. 1: Aerial view of intersection as processed by Yolov5s.

43 vehicles are detected by Yolov5s, but some are missed.

Next let’s see how Yolov5s6 does with four times the pixels:

Fig. 2: Aerial view of intersection as processed by Yolov5s6.

Now 61 cars are detected. In addition to the cars, one motorbike and one person are also detected, thanks to the super-resolution capability of the network. One thing to note is that the detection probabilities are relatively low and, as a result, there are two key misdetections: a truck in the lower left corner and a person in the lower right corner. Still, the results are much better than with Yolov5s.
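
Low detection probabilities matter because detectors typically keep only the detections whose confidence clears a threshold. The Python sketch below illustrates that trade-off. The labels and scores are made-up illustrative values, not the results from the figures, and filter_detections is a hypothetical helper, not part of Yolov5.

```python
from typing import List, Tuple

# (class label, confidence score in [0, 1])
Detection = Tuple[str, float]


def filter_detections(detections: List[Detection],
                      threshold: float) -> List[Detection]:
    """Keep only detections whose confidence is at or above the threshold."""
    return [d for d in detections if d[1] >= threshold]


# Hypothetical scores for illustration only.
sample: List[Detection] = [
    ("car", 0.82),
    ("car", 0.41),
    ("motorbike", 0.35),
    ("truck", 0.28),
    ("person", 0.26),
]

if __name__ == "__main__":
    for threshold in (0.25, 0.50):
        kept = filter_detections(sample, threshold)
        print(f"threshold {threshold:.2f}: kept {len(kept)} of {len(sample)}")
```

A low threshold keeps small, faint objects but also lets weak, possibly wrong detections through; a high threshold drops both. A model that assigns higher probabilities to correct detections eases that trade-off.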

Finally, we see below how Yolov5l6 does (the l stands for large; it is not the numeral 1):

Fig. 3: Aerial view of intersection as processed by Yolov5l6.

Yolov5l6 detects all of the vehicles that are fully visible, but misses a few that are visually obstructed. This could be due to training, for example: the model may have been trained to detect vehicles at street level rather than from above. Yolov5l6 also detects each vehicle with a higher detection probability than Yolov5s6, which eliminates the misdetections seen in the previous figure. Lastly, two people in the scene are detected, which is very impressive because the people are very small and difficult to spot.

An argument can be made that in the future, even better performing networks with even longer latencies will be desired. This will increase the need for flexible high performance AI inference IP like InferX.

The benefits of using megapixel images and larger models are clear, but the compute required goes up by an order of magnitude (4x the pixels and 2-3x the layers). How can you run these big images and models within your SoC’s area and power budget?
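
As a rough check on that order-of-magnitude figure, here is a back-of-the-envelope Python sketch. It assumes compute scales linearly with both pixel count and layer count, which is a simplification of ours: it ignores the larger per-layer dimensions mentioned earlier, so the real increase may be higher. The layer counts are the ones quoted above for Yolov5s and Yolov5l6.

```python
def relative_workload(pixel_ratio: float, layer_ratio: float) -> float:
    """Rough compute multiplier.

    Assumption: cost is proportional to pixels times layers.
    """
    return pixel_ratio * layer_ratio


pixel_ratio = (1280 * 1280) / (640 * 640)  # 4.0
layer_ratio = 184 / 85                     # Yolov5l6 vs Yolov5s layer counts

if __name__ == "__main__":
    estimate = relative_workload(pixel_ratio, layer_ratio)
    print(f"pixels: {pixel_ratio:.1f}x, layers: {layer_ratio:.2f}x")
    print(f"estimated workload: {estimate:.1f}x")
    # Range implied by "2-3x the layers"
    print(f"range: {relative_workload(4, 2):.0f}x to {relative_workload(4, 3):.0f}x")
```

Under this simple model the multiplier lands between 8x and 12x, consistent with "an order of magnitude".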

InferX is hardware and software IP that is available for integration in your SoC for finFET nodes from 16nm to 3nm. InferX hardware comes as a tile which can be built into arrays for more processing, then delivered with an AXI bus interface to connect to your SoC’s NOC.

Fig. 4: InferX compute tile.

InferX is 80% hardwired: almost all of the datapath is hardwired. But it is 100% reconfigurable because the 16 tensor processors are connected by a reconfigurable interconnect and eFPGA is used as the control plane to manage operation. eFPGA can also be used to implement new operators, which pop up all the time as models continue to evolve. Unlike fully hardwired solutions, having eFPGA means you can always adapt to changing models.

The table below shows the performance of 1 to 8 tiles in N7. InferX is optimized for low latency batch=1 operation. And InferX works efficiently with relatively low DRAM bandwidth, which is important because each DRAM requires 100+ package balls to connect and large ball-grid packages and their substrates get exponentially expensive as ball count grows.

Fig. 5: InferX performance in N7, batch=1, 1 DRAM for 1-2 tiles, 2 DRAMs for 4 and 8 tiles.

Even 8 tiles of InferX take up only about 50 mm² in N7. Eight tiles of InferX outperform even the Orin AGX at 60W while using less DRAM bandwidth.

Of course, for many of your applications you may only need 1 or 2 tiles.

You can get more information on InferX at
