Performance Metrics For Convolutional Neural Network Accelerators

The challenges of defining a benchmark to assess inference hardware.


Across the industry, there are few benchmarks that customers and potential end users can employ to evaluate an inference acceleration solution end-to-end.

Early on in this space, the performance of an accelerator was measured as a single number: peak TOPS (tera-operations per second). However, the limitations of using a single number have been covered in detail by previous blogs. Nevertheless, if the method of calculating utilization is held constant across different platforms, utilization numbers can be useful for demonstrating the relative efficacy of one architecture compared to another.
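To make the utilization idea concrete, here is a minimal sketch of the arithmetic. All numbers (peak rating, per-inference operation count, throughput) are illustrative assumptions, not figures from any specific accelerator:

```python
def utilization(achieved_ops_per_s: float, peak_ops_per_s: float) -> float:
    """Fraction of the datasheet peak compute actually sustained."""
    return achieved_ops_per_s / peak_ops_per_s

# Hypothetical example: a chip rated at 100 TOPS peak, running a model
# that requires ~8 GOPs per inference at 2,500 inferences/second.
peak = 100e12                                    # 100 TOPS (datasheet)
ops_per_inference = 8e9                          # model dependent
inferences_per_second = 2500                     # measured throughput
achieved = ops_per_inference * inferences_per_second  # 20e12 ops/s

print(f"Utilization: {utilization(achieved, peak):.0%}")  # → 20%
```

The point of holding the calculation method constant is visible here: the result is only comparable across platforms if "achieved ops" is counted the same way (e.g., whether zero-skipped or pruned operations are included) on every chip being compared.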

Table 1: The table above is organized by levels of abstraction across the market, computational, and mathematical spaces. For each level of abstraction, there exist one or more evaluation metrics. These metrics can be measured using benchmarks of fundamental operations.

Attempts at a standardized microbenchmarking suite for convolutional kernels and other common neural network operations have not resulted in widespread adoption. Macrobenchmarks, or benchmarks of full models, are better indicators for system-level performance, but there is potential for microbenchmarks to give a fuller picture of performance over the computational space.
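A microbenchmark of a single operation is conceptually simple: warm up the device, time many repetitions, and report a robust statistic. The harness below is a generic sketch of that methodology using a stand-in workload; a real suite would substitute an actual convolutional kernel launch and device-side synchronization:

```python
import time
from statistics import median

def microbenchmark(op, warmup: int = 10, iters: int = 100) -> float:
    """Return the median latency in seconds of a single operation.

    Warm-up runs let caches, clocks, and any JIT settle before timing;
    the median resists outliers from scheduler noise.
    """
    for _ in range(warmup):
        op()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        op()
        samples.append(time.perf_counter() - t0)
    return median(samples)

# Stand-in for a kernel invocation (illustrative only):
latency = microbenchmark(lambda: sum(i * i for i in range(10_000)))
print(f"median latency: {latency * 1e6:.1f} µs")
```

Sweeping such a harness over kernel shapes (channel counts, filter sizes, strides) is what would let microbenchmarks map performance across the computational space rather than at the handful of points a full model happens to exercise.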

Finally, at the top of the performance evaluation stack, there are macrobenchmarks of actual CNNs. MLPerf’s contributors and organizers have done a notable job of curating a list of models across a range of potential end applications, and of building infrastructure around macrobenchmarks to make running them more accessible. However, this method of comparing performance across platforms is still incomplete, as the performance of one CNN model is not necessarily a suitable predictor of the performance of another CNN on the same hardware.

Despite the variety of metrics for measuring performance across different acceleration architectures, key decision makers and system builders still struggle to make apples-to-apples comparisons between inference acceleration solutions, so any thorough evaluation must include an actual customer workload. Furthermore, performance measurements are just the beginning of a customer’s evaluation process, as the work to integrate a new component or process into an existing system or workflow can negate the performance upside of a superior architecture. Nevertheless, by raising the performance of fundamental operations across different layers of the inference stack, chip, micro-, and neural architecture designers will continue to push the boundaries of their respective layers and allow for new and better products at the application level.
