Getting the best performance from a convolutional neural network requires a flexible architecture.
For a neural network to run at its fastest, the underlying hardware must execute every layer efficiently. During inference of any CNN, whether it is built on an architecture such as YOLO, ResNet, or Inception, the workload regularly shifts from being bottlenecked by memory to being bottlenecked by compute resources. You can think of each convolutional layer as its own mini-workload, so the challenge for any inference solution is to ensure that all of these different mini-workloads run efficiently. An architecture that can adapt to the changing requirements of inference by handling both extremes of computational intensity, from layers with few filters to layers with many, is crucial for getting the best performance.
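To make the memory-bound versus compute-bound distinction concrete, here is a minimal back-of-the-envelope sketch in Python. The peak throughput, bandwidth, and layer shapes are hypothetical values chosen purely for illustration; they are not measurements of any particular device or model.

```python
# A back-of-the-envelope sketch of the memory-bound vs. compute-bound split.
# The peak throughput, bandwidth, and layer shapes below are hypothetical
# values chosen only for illustration, not figures for any real device or model.
PEAK_MACS_PER_SEC = 4e12     # assumed peak multiply-accumulate rate
PEAK_BYTES_PER_SEC = 25e9    # assumed peak off-chip memory bandwidth
RIDGE = PEAK_MACS_PER_SEC / PEAK_BYTES_PER_SEC  # MACs/byte where the bottleneck flips

def conv_intensity(h, w, c_in, c_out, k, bytes_per_elem=1):
    """MACs per byte moved for one stride-1, same-padded convolution (int8 tensors)."""
    macs = h * w * c_in * c_out * k * k
    bytes_moved = bytes_per_elem * (
        h * w * c_in            # input activations
        + k * k * c_in * c_out  # weights
        + h * w * c_out         # output activations
    )
    return macs / bytes_moved

layers = {
    "few filters  (112x112x16 -> 32, 3x3)": (112, 112, 16, 32, 3),
    "many filters (14x14x512 -> 512, 3x3)": (14, 14, 512, 512, 3),
}
for name, shape in layers.items():
    intensity = conv_intensity(*shape)
    bound = "compute-bound" if intensity > RIDGE else "memory-bound"
    print(f"{name}: {intensity:6.1f} MACs/byte -> {bound}")
```

On this hypothetical device, the layer with few filters falls below the ridge point and is starved for data, while the layer with many filters sits above it and keeps the math units busy. Hardware tuned for only one of those regimes leaves performance on the table for the other.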
As we discussed in past blog posts, not all convolutions are the same, because the compute-to-memory-access ratio can change dramatically from one convolution to another. As a rule of thumb, the ratio of compute operations to bytes required is set by the number of filters in a convolutional layer. Put another way, the more features you are looking for in a single layer, the more you reuse each piece of data you have already fetched to perform computations.
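Here is a rough sketch of that rule of thumb: holding the input fixed and sweeping the number of filters, the operations performed per byte of data moved grows, because the input activations fetched once are reused by every additional filter. The layer geometry and int8 data type are assumptions chosen only for illustration.

```python
# A rough sketch of the rule of thumb: sweep the filter count of a single
# 3x3 convolution and watch the operations-per-byte ratio grow. The layer
# geometry and int8 data are assumptions chosen purely for illustration.
H, W, C_IN, K = 56, 56, 64, 3  # hypothetical input: 56x56 feature map, 64 channels

for c_out in (16, 64, 256, 1024):
    macs = H * W * C_IN * c_out * K * K
    bytes_moved = (
        H * W * C_IN            # input activations, fetched once
        + K * K * C_IN * c_out  # weights
        + H * W * c_out         # output activations
    )
    print(f"{c_out:5d} filters: {macs / bytes_moved:6.1f} MACs per byte moved")
```

With only 16 filters, each fetched byte participates in roughly a hundred multiply-accumulates; with 1024 filters, that reuse climbs past four hundred, which is why filter count is a reasonable proxy for where a layer lands on the spectrum.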
In the chart below, we see that a few different models each contain many different convolutions, and these convolutions lie at different points along the compute-to-memory spectrum. As it turns out, most convolutions are also unique to each model, so being able to run one model efficiently doesn’t necessarily mean you can run another! Instead, it’s better to look for an architecture that is flexible enough to handle all the different kinds of models.
Beyond offering a better-matched datapath, a flexible architecture also has the advantage of being able to accommodate future versions of your model. As your model evolves, so must the architecture beneath it. Reconfigurable computing is a must-have for any application with a long lifetime and iterative updates, which is exactly what you find in an edge or embedded device. This flexibility ensures that you can update your model in the field without worrying about whether the underlying hardware can support it.
Ultimately, the underlying hardware’s ability to reconfigure does not need to be exposed to developers, who are already focused on achieving maximum accuracy for their models. Instead, a good edge inference solution should include a robust compiler that leverages the reconfigurability of the underlying hardware automatically, reducing the cognitive load on deep learning engineers to achieve maximum performance. The benefits of reconfigurable computing, such as maximum performance and support for future models, can only be unlocked by a compiler stack developed in sync with the hardware.
In summary, if you are looking for the best performance for your edge inference workload, you should look for a solution built on flexible hardware with a mature compiler that can unlock the benefits of the reconfigurable computing platform.