Combining SLAM And CNN For High-Performance Augmented Reality

Enabling realistic interactions between real and virtual objects.


Robots and headsets or goggles are the most common hardware devices that use AR, VR, or mixed reality, and AR is coming to mobile phones, tablets, and automobiles as well. For hardware devices to see the world around them and augment that reality with inserted graphics or images, they need to determine their position in space and map the surrounding environment.

Simultaneous localization and mapping (SLAM) algorithms provide a geometric position for the AR system. SLAM algorithms can build 3D maps of an environment while tracking the location and orientation of the camera in that environment. The algorithms estimate the position of the sensor (built into the camera, cellphone, goggles, etc.) while modeling the environment to create a map (Figure 1). Knowing the sensor’s position and pose, combined with the generated 3D map of the environment, lets the device (and the user looking through the device) move through the real environment.

Figure 1: SLAM algorithms build a 3D map of the surroundings by identifying points and edges of objects and performing plane extraction from the data

SLAM can be implemented in multiple ways. Visual SLAM is a camera-only version that doesn’t rely on fancy inertial measurement units (IMUs) or expensive laser sensors. Monocular visual SLAM – which has become very popular – relies on one camera like the one in a mobile phone. A typical implementation of monocular visual SLAM includes several key tasks:

  • Feature extraction or the identification of distinct landmarks (like the lines forming the edge of a table). Feature extraction is often done with algorithms like ORB, SIFT, FAST, SURF, etc.
  • Feature matching between frames to determine how the motion of the camera has changed.
  • Camera motion estimation, including loop detection and loop closure (addressing the challenge of recognizing a previously visited location).

These tasks are computationally intensive and influence the choice of hardware for an AR system.
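The feature matching step can be illustrated with a minimal sketch. It assumes binary descriptors (the kind ORB produces) compared by Hamming distance, modeled here as small Python integers. The function names, the 8-bit descriptor values, and the `max_distance` threshold are all hypothetical choices for illustration, not part of any particular SLAM implementation.

```python
# Toy sketch: brute-force matching of binary feature descriptors between two frames.
# Descriptors are modeled as plain integers (made-up 8-bit values here);
# real ORB descriptors are 256 bits long.

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")


def match_features(desc_prev, desc_curr, max_distance=2):
    """Return (index_prev, index_curr, distance) for mutual best matches."""
    if not desc_prev or not desc_curr:
        return []
    matches = []
    for i, d_prev in enumerate(desc_prev):
        # Best candidate in the current frame for this previous-frame feature.
        j = min(range(len(desc_curr)), key=lambda k: hamming(d_prev, desc_curr[k]))
        # Cross-check: the candidate must also pick feature i as its best match.
        i_back = min(range(len(desc_prev)), key=lambda k: hamming(desc_curr[j], desc_prev[k]))
        dist = hamming(d_prev, desc_curr[j])
        if i_back == i and dist <= max_distance:
            matches.append((i, j, dist))
    return matches


prev_frame = [0b10110010, 0b01001101, 0b11110000]  # descriptors from frame N (made up)
curr_frame = [0b01001111, 0b10110011, 0b00001111]  # descriptors from frame N+1 (made up)
matches = match_features(prev_frame, curr_frame)
print(matches)
```

Each surviving match pairs a landmark seen in one frame with the same landmark in the next. Camera motion estimation then works from how those paired points moved in the image. The brute-force search compares every descriptor against every other, which hints at why this stage is costly at full frame rate.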

Adding deep learning/CNNs for perception
While SLAM provides the ability to determine a camera’s location in the environment and a 3D model of the environment, perceiving and recognizing items in that environment require deep learning algorithms like CNNs. CNNs, the current state-of-the-art for implementing deep neural networks for vision, complement SLAM algorithms in AR systems by enhancing the user’s AR experience or adding new capabilities to the AR system.

CNNs can be very accurate for object recognition tasks – which include localization (identifying the location of an object in an image) and classification (identifying the image class – i.e., dog vs cat, Labrador vs German Shepherd) based on pre-training of the neural network’s coefficients. While SLAM can help a camera move through an environment without running into objects, CNNs can identify that the object is a couch, refrigerator, or desk, and highlight where it is in the field of view. Popular CNN graphs for real-time object detection – which includes classification and localization – are YOLO v2, Faster R-CNN, and Single Shot MultiBox Detector (SSD).

CNN object detection graphs can be specialized to detect faces or hands. With CNN-based facial detection and recognition, AR systems can add a name and social media information above a person’s face in the AR environment. Using a CNN to detect the user’s hands allows game developers to place a device or instrument needed in the game player’s virtual hand. Detecting a hand’s existence is easier than determining the hand’s position. Some CNN-based solutions require a depth camera output as well as RGB sensor output to train and execute a CNN graph.

CNNs can also be applied successfully to semantic segmentation. Unlike object detection, which only cares about the pixels in an image that could be an object of interest, semantic segmentation is concerned with every pixel. For example, in an automotive scene, a semantic segmentation CNN would label all the pixels of the sky, the road, buildings, and individual cars as groups, which is critical for self-driving car navigation. Applied to AR, semantic segmentation can find ceilings, walls, and the floor as well as furniture or other objects in the space. Semantic knowledge of a scene enables realistic interactions between the real and virtual objects.
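To make "every pixel gets a label" concrete, here is a minimal sketch of the last step of semantic segmentation: picking the highest-scoring class for each pixel. The class list and the 2x2 score grid are made up for illustration, and in a real system the scores would come from a trained segmentation CNN rather than being typed in by hand.

```python
# Hypothetical class list for an indoor AR scene (illustrative only).
CLASSES = ["floor", "wall", "ceiling", "furniture"]


def label_pixels(scores):
    """scores[row][col] is a list of per-class scores; return one label per pixel."""
    return [
        [CLASSES[max(range(len(CLASSES)), key=lambda c: pixel[c])] for pixel in row]
        for row in scores
    ]


# A made-up 2x2 "image" of per-pixel class scores, standing in for CNN output.
scores = [
    [[0.1, 0.2, 0.6, 0.1], [0.1, 0.7, 0.1, 0.1]],
    [[0.8, 0.1, 0.0, 0.1], [0.3, 0.1, 0.0, 0.6]],
]
label_map = label_pixels(scores)
print(label_map)
```

The resulting label map has the same height and width as the input, so an AR application could, for example, restrict a virtual object to pixels labeled as floor.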

Hardware implementations for high performance systems
Both SLAM and CNN algorithms require a significant number of computations per camera-captured image (frame). Making a seamless environment for the AR user – to merge the real world with the virtual without significant latency – requires a video frame rate of 20-30 frames per second (fps). That means the AR system has about 33 to 50 ms per frame to capture, process, render, and display results to the user. The faster it can complete those tasks, the higher the frame rate and the more natural the AR feels.
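The per-frame time budget follows directly from the frame rate: one second divided by the number of frames. A small sketch of that arithmetic (the function name is just an illustrative choice):

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


for fps in (20, 25, 30):
    print(f"{fps} fps -> {frame_budget_ms(fps):.1f} ms per frame")
```

Everything in the pipeline, from capture through SLAM and CNN processing to rendering and display, has to fit inside that window.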

Considering a monocular (single camera) SLAM system for an SoC, computational efficiency and memory optimization are both critical design concerns. If the camera captures a 4K image at 30 fps, that means 8,294,400 pixels a frame or 248,832,000 pixels a second need to be stored and processed. Most embedded vision systems store each frame in an external DDR and then – as efficiently as possible – transfer portions of that image for vision processing (Figure 2).

Figure 2: Vision data is stored in off-chip memory and transferred to the processor over the AXI bus
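The pixel counts quoted for a 4K stream can be checked with a few lines of arithmetic. The sketch below also estimates raw memory traffic, which depends on the pixel format. The 2 bytes per pixel used here is purely an assumption for illustration and is not a figure from the article.

```python
WIDTH, HEIGHT, FPS = 3840, 2160, 30  # 4K UHD at 30 fps, as in the text
BYTES_PER_PIXEL = 2                  # assumption (a 16-bit packed format), not from the article

pixels_per_frame = WIDTH * HEIGHT
pixels_per_second = pixels_per_frame * FPS
bytes_per_second = pixels_per_second * BYTES_PER_PIXEL

print(f"{pixels_per_frame:,} pixels per frame")
print(f"{pixels_per_second:,} pixels per second")
print(f"{bytes_per_second / 1e6:.1f} MB/s of raw pixel traffic (assumed format)")
```

This counts a single pass over the data. Any algorithm that reads a frame more than once multiplies the traffic between external memory and the processor accordingly.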

Processing the algorithms necessary for advanced AR systems on a CPU – such as a mobile phone’s application processor – is inefficient. Offloading to a GPU, which is present in an AR system for drawing graphics, will speed up SLAM and CNN calculations compared to the CPU. However, while the performance advancements provided by GPUs helped usher in the era of AI and deep learning computing, implementing a deep learning algorithm on a GPU could require 100W of power or more. The most optimized approach is to allocate embedded vision processing to dedicated cores.

Performance and power efficiency can be achieved by pairing a flexible CNN engine with a vector DSP. The vector DSP is designed to handle applications like SLAM, while the dedicated CNN engine can support all common CNN operations (convolutions, pooling, elementwise) and will offer the smallest area and power consumption because it is custom-designed for these parameters.
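For readers less familiar with the three operation types named above, the following is a plain-Python sketch of a convolution, an elementwise activation (ReLU), and 2x2 max pooling applied to a tiny made-up image. It only illustrates the arithmetic and says nothing about how a dedicated CNN engine implements these operations in hardware. The image, the filter values, and the function names are all hypothetical.

```python
def conv2d(image, kernel):
    """Valid (no padding), stride-1 2D convolution as used in CNNs (cross-correlation)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(fmap):
    """Elementwise activation: negative values become zero."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool2x2(fmap):
    """2x2 max pooling with stride 2 (an odd trailing row or column is dropped)."""
    return [
        [max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
         for c in range(0, len(fmap[0]) - 1, 2)]
        for r in range(0, len(fmap) - 1, 2)
    ]


image = [
    [1, 2, 0, 1, 3],
    [0, 1, 2, 3, 1],
    [1, 0, 1, 2, 2],
    [2, 1, 0, 1, 0],
    [0, 3, 1, 0, 1],
]
kernel = [[1, -1], [0, 0]]  # made-up 2x2 filter; real coefficients come from training

features = max_pool2x2(relu(conv2d(image, kernel)))
print(features)
```

Even this 5x5 example performs 64 multiply-accumulates for one filter. A real network applies many filters per layer to full-resolution frames, which is the workload a dedicated engine is built to absorb.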

For an SoC designer of an AR system, embedded vision processor IP provides an optimized solution to address performance/power concerns. The DesignWare EV61, EV62, and EV64 Embedded Vision Processors integrate a high-performance 32-bit scalar core with a 512-bit vector DSP and an optimized CNN engine for fast, accurate object detection, classification, and scene segmentation. The vector DSPs are ideal for implementing the SLAM algorithm and run independently of the CNN engine. The processors are fully programmable and configurable and combine the flexibility of software solutions with the high performance and low power consumption of dedicated hardware.

The combination of SLAM and deep learning algorithms like CNN will make new and improved AR systems possible, opening up new experiences in gaming, education, autonomous vehicles, and more. Building complex AR systems with stringent performance, power, and area requirements can be simplified by using embedded vision processors as a companion to the host CPU. EV processors provide AR system developers with the ability to combine deep learning and evolving SLAM techniques.

A version of this article was first published in Synopsys DesignWare Technical Bulletin.
