An end-to-end workflow for deploying embedded machine learning.
Today, I’ve teamed up with Ram Cherukuri of MathWorks to provide an overview of the MathWorks toolchain for machine learning (ML) and the deployment of embedded ML inference on Arm Cortex-A using the Arm Compute Library.
MathWorks tools let engineers get started quickly, making machine learning practical without requiring deep expertise. If you’re an algorithm engineer interested in leveraging deep learning networks in your embedded application, this article gives you an overview of the end-to-end workflow – from designing your complete application, with the expressive power of MATLAB, to deploying it as a standalone application on Arm Cortex-A processors. We illustrate how the workflow overcomes the common challenges encountered when moving from training to deployment on an embedded system.
Read on for an overview of each step in the workflow.
MATLAB – Deep Learning Framework
MATLAB is a comprehensive deep learning framework that provides an end-to-end workflow – from data access and data preparation to training – all the way to deployment of the complete application. It’s being used by engineers across industries to train deep learning algorithms for common tasks, such as object detection, classification, and semantic segmentation.
We’ll discuss a few key MATLAB capabilities for deep learning, such as labeling ground truth, training off-the-shelf networks for detection and classification, and generating optimized C++ code from the trained networks for Arm Cortex-A processors. We’ll also cover how to deploy these algorithms on Cortex-A processors using code generation, leveraging Arm platforms for embedded inference.
Get started quickly
You can start by creating a network from scratch or start with a pretrained network, using a transfer learning approach to train it with your own data. MATLAB supports a full range of network architectures – from convolutional networks to LSTMs – and is interoperable with open-source deep learning frameworks. It also allows you to import and export models with other deep learning frameworks using the ONNX model format or specific converters.
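As a sketch of the interoperability mentioned above, the following shows importing and exporting a network via ONNX. The file names are placeholders for your own models, and the parameters shown are one common configuration rather than the only option:

```matlab
% Import a network trained in another framework via the ONNX format.
% 'model.onnx' is a placeholder for your own exported model file.
net = importONNXNetwork('model.onnx', ...
    'OutputLayerType','classification');

% Export a MATLAB network back to ONNX for use in other frameworks.
exportONNXNetwork(net,'exported_model.onnx');
```

The same round trip works for networks built entirely in MATLAB, which makes it easy to exchange models with colleagues using other toolchains.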
Before you can start training a network model, you need a set of labeled training data. This may be a corpus of images annotated with the locations and labels of objects of interest – a task that requires you to sift through every image, video frame, or time-series sample, marking each object of interest. This process is known as ground truth labeling, and it is often the most time-consuming part of training.
To save time – and your sanity! – you can automate this laborious task using MATLAB’s data tools, such as datastores for reading large data sets, and labeler apps for ground truth labeling of audio, video, image, and time-series data.
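As a minimal sketch of the datastore workflow, the snippet below reads a labeled image collection without loading it all into memory. The folder name 'pets' and the 70/30 split are illustrative assumptions:

```matlab
% Read a large image collection lazily; folder names double as
% class labels ('pets/cat', 'pets/dog', ...).
imds = imageDatastore('pets', ...
    'IncludeSubfolders',true, ...
    'LabelSource','foldernames');

% Hold out 30% of each class for validation.
[trainSet,valSet] = splitEachLabel(imds,0.7,'randomized');
```

Because datastores read data on demand, the same pattern scales from a handful of files to data sets far larger than memory.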
Training your network
Now that you have labeled data, it’s time to start with a pre-trained network and train it with your new data. Taking image classification as an example, you might consider popular pre-trained models such as VGG, ResNet or Inception. You can import these models into MATLAB and modify the network to suit your application.
If this approach interests you, check out this comprehensive list of pre-trained models supported in MATLAB and an example to get started with your own network.
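A minimal transfer-learning sketch along these lines is shown below, using ResNet-50 as the pre-trained model (it requires the Deep Learning Toolbox Model for ResNet-50 support package). The layer names and the value of numClasses are assumptions to adapt to your own network – use the network analyzer to confirm the names in your release:

```matlab
% Adapt ResNet-50 to a new classification task.
numClasses = 5;                       % example: five new categories
net    = resnet50;
lgraph = layerGraph(net);

% Replace the final learnable layer and the classification layer
% so the network outputs scores for the new classes.
lgraph = replaceLayer(lgraph,'fc1000', ...
    fullyConnectedLayer(numClasses,'Name','fc_new'));
lgraph = replaceLayer(lgraph,'ClassificationLayer_fc1000', ...
    classificationLayer('Name','class_new'));
```

Only the replaced layers start from scratch; the rest of the network keeps the features it learned on a large data set, which is why transfer learning needs far less data than training from scratch.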
MATLAB’s Deep Learning Network Analyzer provides an intuitive way to visualize your network architecture. Once you’re ready for training, you can train your model on your local machine, on either the CPU or GPU, or scale to clusters – all with a few simple training options.
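The training options mentioned above can be sketched as follows, assuming a labeled datastore trainSet and a layer graph lgraph prepared as described earlier; the solver and hyperparameter values are illustrative:

```matlab
% A few options switch training between CPU, GPU, and clusters.
options = trainingOptions('sgdm', ...
    'InitialLearnRate',1e-4, ...
    'MaxEpochs',10, ...
    'MiniBatchSize',32, ...
    'ExecutionEnvironment','auto', ...  % 'cpu','gpu','multi-gpu','parallel'
    'Plots','training-progress');

trainedNet = trainNetwork(trainSet,lgraph,options);
```

Changing 'ExecutionEnvironment' is the only edit needed to move the same training run between hardware setups.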
Finally, your algorithm is not just the network. Typically, there is pre-processing logic to prepare the input data before it is passed to the trained network for inference and then the predicted output is also used in post-processing logic. For instance, in the case of images, you must resize and/or extract the region of interest; for audio signals, you may have to extract certain features to get the input to the trained network. The pre- and post-processing logic is easily expressed by the various domain-specific higher-level functions provided by MATLAB.
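For the image case described above, hypothetical pre- and post-processing functions might look like the following sketch; the region of interest, input size, and output format are assumptions for illustration:

```matlab
% Crop an example region of interest and resize to the network's
% expected input size before inference.
function I = pre_processing_function(input)
    roi = [100 100 224 224];          % example region of interest
    I = imcrop(input,roi);
    I = imresize(I,[224 224]);        % match the network input size
end

% Turn the network's raw score vector into a usable result.
function output = post_processing_function(prediction)
    [score,idx] = max(prediction);
    output = struct('classIndex',idx,'score',score);
end
```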
The next step is to generate optimized code for the entire application and deploy the generated code as a standalone application on GPUs or CPUs or FPGAs.
Using the Compute Library
Arm Cortex-A processors are commonly used across many vision platforms, from low-cost hardware like the Raspberry Pi to automotive-grade vision platforms like the NXP S32V vision processor. Cortex-A is particularly well suited for applications that need to minimize power consumption and maintain high levels of data processing such as automotive, IoT, and other embedded applications.
To get optimal performance on Arm Cortex-A platforms, Arm provides the Arm Compute Library, a collection of low-level functions optimized for Arm CPU and GPU architectures, which can be used for image processing, computer vision, and ML.
MATLAB’s code generation technology leverages this library so that you get the best performance automatically, without having to become an expert in writing low-level code for the Arm architecture. Because the library is mature, well tested, and continuously improved, code generated from MATLAB automatically benefits from future improvements in its performance and efficiency.
Generating and Deploying Your Code
When generating application code, you define an entry-point function that receives the input data. You can verify the behavior of your application by calling this function with test inputs in MATLAB. You can also use data acquired from sensors, such as a webcam or an audio device, to test your application with live data.
function output = entry_point_func(input)
    I = pre_processing_function(input);
    persistent trainednet;
    if isempty(trainednet)
        trainednet = coder.loadDeepLearningNetwork('trained_network_saved.mat');
    end
    prediction = trainednet.predict(I);
    output = post_processing_function(prediction);
end
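To exercise the entry-point function with test inputs before generating code, you can call it directly in MATLAB. The test image below is a sample that ships with MATLAB, used here purely as a stand-in for your own data:

```matlab
% Call the entry point with a representative test input to verify
% its behavior in MATLAB before code generation.
testImage = imread('peppers.png');
result = entry_point_func(testImage);
disp(result)
```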
Having tested the algorithm successfully in MATLAB, you can either generate optimized C++ code and compile it into an executable or generate a static or dynamic library that you integrate into a larger application and deploy to an Arm Cortex-A processor.
MATLAB Coder provides a command-line API as well as an app to guide you through the workflow of generating code for the entire application. As shown in the image below, MATLAB Coder lets you specify a few parameters of the Arm Compute Library to get optimal performance on your target, and you can deploy the generated code to any Arm processor that supports NEON.
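At the command line, the configuration might look like the following sketch; the library version, architecture, and input size are example values to adjust for your target:

```matlab
% Configure MATLAB Coder to generate C++ that calls the
% Arm Compute Library on the target.
cfg = coder.config('lib');
cfg.TargetLang = 'C++';

dlcfg = coder.DeepLearningConfig('arm-compute');
dlcfg.ArmComputeVersion = '19.05';   % library version on the target
dlcfg.ArmArchitecture   = 'armv8';   % 'armv7' or 'armv8'
cfg.DeepLearningConfig  = dlcfg;

cfg.GenCodeOnly = true;              % compile on the target board

% Generate code for the entry-point function with an example
% 224x224 RGB single-precision input.
codegen -config cfg entry_point_func -args {ones(224,224,3,'single')}
```

Setting GenCodeOnly lets you copy the generated sources to the board and build them there with the toolchain that matches your installed Compute Library.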
This workflow is already being used by the NXP Vision Toolbox to deploy deep learning networks on the Arm Cortex-A53 processor included in the NXP S32V234.
Going further
If you’re interested in finding out more, MATLAB’s Deep Learning Toolbox provides simple commands for creating and interconnecting the layers of a deep neural network. Examples and pre-trained networks make it easy to use, even without knowledge of advanced computer vision algorithms or neural networks. You’ll find everything from industrial automation applications, such as defect detection, to automotive applications like pedestrian detection.