SPONSOR BLOG

Combining SLAM And CNN For High-Performance Augmented Reality

Enabling realistic interactions between real and virtual objects.

April 4th, 2019 - By: Gordon Cooper

Robotics and headsets or goggles are the most common hardware devices requiring AR/VR/mixed reality, and AR is coming to mobile phones, tablets, and automobiles as well. For hardware devices to see the world around them and add to that reality with inserted graphics or images, they need to determine their position in space and map the surrounding environment.

Simultaneous localization and mapping (SLAM) algorithms provide a geometric position for the AR system. SLAM algorithms can build 3D maps of an environment while tracking the location and orientation of the camera in that environment. The algorithms estimate the pose of the sensor (built into the camera, cellphone, goggles, etc.) while modeling the environment to create a map (Figure 1). Knowing the sensor’s position and orientation, combined with the generated 3D map of the environment, lets the device (and the user looking through the device) move through the real environment.

Figure 1: SLAM algorithms build a 3D map of the surroundings by identifying points and edges of objects and performing plane extraction from the data
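To make the link between pose, map, and inserted graphics concrete, here is a minimal sketch of projecting one 3D map point into the image. It assumes a simple pinhole camera model, which the article does not specify; the function name, the intrinsics (fx, fy, cx, cy), and the sample values are all hypothetical.

```python
def project_point(point_world, rotation, translation, fx, fy, cx, cy):
    """Project a 3D map point into pixel coordinates (pinhole model, assumed).

    rotation (3x3 nested lists) and translation (3-vector) map world
    coordinates into the camera frame: p_cam = R * p_world + t.
    Returns (u, v), or None if the point lies behind the camera.
    """
    x, y, z = (
        sum(rotation[r][c] * point_world[c] for c in range(3)) + translation[r]
        for r in range(3)
    )
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


# Hypothetical pose: camera at the world origin, looking down +Z.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixel = project_point(
    (0.5, -0.25, 2.0), IDENTITY, (0.0, 0.0, 0.0),
    fx=500.0, fy=500.0, cx=320.0, cy=240.0,
)
print(pixel)  # (445.0, 177.5)
```

Once SLAM supplies the pose for each frame, the same projection tells the renderer where a virtual object anchored to a map point should appear on screen.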

SLAM can be implemented in multiple ways. Visual SLAM is a camera-only version that doesn’t rely on fancy inertial measurement units (IMUs) or expensive laser sensors. Monocular visual SLAM – which has become very popular – relies on one camera like the one in a mobile phone. A typical implementation of monocular visual SLAM includes several key tasks:

Feature extraction or the identification of distinct landmarks (like the lines forming the edge of a table). Feature extraction is often done with algorithms like ORB, SIFT, FAST, SURF, etc.
Feature matching between frames to determine how the camera has moved.
Camera motion estimation, including loop detection and loop closure (addressing the challenge of recognizing a previously visited location).

These tasks use many calculations and will have an impact on choosing the best hardware for an AR system.
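As an illustration of the feature-matching task, the sketch below matches binary descriptors between two frames by brute-force Hamming distance with a mutual-nearest-neighbor check. This is one common approach for ORB-style binary descriptors, not necessarily what any particular SLAM implementation uses. The function names, the distance threshold, and the toy 8-bit descriptors are hypothetical (real ORB descriptors are 256 bits).

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two packed binary descriptors."""
    return bin(a ^ b).count("1")


def match_features(desc_prev, desc_curr, max_distance=2):
    """Brute-force matching with a mutual (cross-check) test.

    desc_prev, desc_curr: lists of integers, each a packed binary descriptor.
    Returns a list of (index_prev, index_curr) pairs.
    """
    if not desc_prev or not desc_curr:
        return []

    def nearest(d, candidates):
        return min(range(len(candidates)), key=lambda j: hamming(d, candidates[j]))

    matches = []
    for i, d in enumerate(desc_prev):
        j = nearest(d, desc_curr)
        # Keep the pair only if it is close enough and the match is mutual.
        close_enough = hamming(d, desc_curr[j]) <= max_distance
        if close_enough and nearest(desc_curr[j], desc_prev) == i:
            matches.append((i, j))
    return matches


# Toy descriptors for two consecutive frames (hypothetical values).
prev_frame = [0b10110010, 0b01001101, 0b11110000]
curr_frame = [0b01001111, 0b10110011, 0b00001111]
matches = match_features(prev_frame, curr_frame)
print(matches)  # [(0, 1), (1, 0)]
```

The nested search compares every descriptor in one frame against every descriptor in the other, which hints at why these tasks are computationally demanding at video frame rates.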

Adding deep learning/CNNs for perception
While SLAM provides the ability to determine a camera’s location in the environment and a 3D model of the environment, perceiving and recognizing items in that environment require deep learning algorithms like CNNs. CNNs, the current state-of-the-art for implementing deep neural networks for vision, complement SLAM algorithms in AR systems by enhancing the user’s AR experience or adding new capabilities to the AR system.

CNNs can be very accurate for object recognition tasks – which include localization (identifying the location of an object in an image) and classification (identifying the image class – e.g., dog vs. cat, Labrador vs. German Shepherd) based on pre-training of the neural network’s coefficients. While SLAM can help a camera move through an environment without running into objects, CNNs can identify that the object is a couch, refrigerator, or desk, and highlight where it is in the field of view. Popular CNN graphs for real-time object detection – which includes classification and localization – are YOLO v2, Faster R-CNN, and Single Shot MultiBox Detector (SSD).

CNN object detection graphs can be specialized to detect faces or hands. With CNN-based facial detection and recognition, AR systems can add a name and social media information above a person’s face in the AR environment. Using a CNN to detect the user’s hands allows game developers to place a device or instrument needed in the game player’s virtual hand. Detecting a hand’s existence is easier than determining the hand’s position. Some CNN-based solutions require a depth camera output as well as R-G-B sensor output to train and execute a CNN graph.

CNNs can also be applied successfully to semantic segmentation. Unlike object detection, which only cares about the pixels in an image that could be an object of interest, semantic segmentation assigns a label to every pixel. For example, in an automotive scene, a semantic segmentation CNN would label all the pixels belonging to the sky, road, buildings, and individual cars as groups, which is critical for self-driving car navigation. Applied to AR, semantic segmentation can find ceilings, walls, and the floor as well as furniture or other objects in the space. Semantic knowledge of a scene enables realistic interactions between the real and virtual objects.
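The per-pixel nature of semantic segmentation can be sketched as follows. The example assumes a segmentation network has already produced one score per class for every pixel, and simply picks the highest-scoring class. The class list, function names, and score values are hypothetical.

```python
CLASSES = ["floor", "wall", "ceiling", "furniture"]  # hypothetical label set


def label_pixels(scores):
    """Assign every pixel the class with the highest score.

    scores[row][col] is a list of per-class scores, one per entry in CLASSES.
    Returns a 2D grid of class names with the same height and width.
    """
    return [
        [CLASSES[max(range(len(CLASSES)), key=px.__getitem__)] for px in row]
        for row in scores
    ]


def pixels_of(label_map, name):
    """Return (row, col) coordinates of every pixel labeled `name`."""
    return [
        (r, c)
        for r, row in enumerate(label_map)
        for c, label in enumerate(row)
        if label == name
    ]


# A made-up 2x2 "image" of per-class scores.
scores = [
    [[0.1, 0.7, 0.1, 0.1], [0.1, 0.6, 0.2, 0.1]],
    [[0.8, 0.1, 0.0, 0.1], [0.3, 0.1, 0.0, 0.6]],
]
label_map = label_pixels(scores)
print(label_map)                      # [['wall', 'wall'], ['floor', 'furniture']]
print(pixels_of(label_map, "floor"))  # [(1, 0)]
```

An AR application could use the set of floor pixels, for instance, to decide where a virtual object may be placed.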

Hardware implementations for high performance systems
Both SLAM and CNN algorithms require a significant number of computations per camera-captured image (frame). Making a seamless environment for the AR user – to merge the real world with the virtual without significant latency – requires a video frame rate of 20-30 frames per second (fps). That means the AR system has about 33 to 50 ms to capture, process, render, and display results to the user. The faster it can complete those tasks, the higher the frame rate and the more natural the AR feels.
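The per-frame time budget follows directly from the frame rate. A small sketch makes the arithmetic explicit and checks a set of stage timings against it. The stage names and timings are hypothetical and chosen only for illustration.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def fits_budget(stage_ms: dict, fps: float) -> bool:
    """True if the summed stage latencies fit within one frame period."""
    return sum(stage_ms.values()) <= frame_budget_ms(fps)


for fps in (20, 25, 30):
    print(f"{fps} fps -> {frame_budget_ms(fps):.1f} ms per frame")
# 20 fps -> 50.0 ms per frame
# 25 fps -> 40.0 ms per frame
# 30 fps -> 33.3 ms per frame

# Hypothetical pipeline timings in milliseconds.
stages = {"capture": 5.0, "slam": 10.0, "cnn": 12.0, "render_display": 5.0}
print(fits_budget(stages, 30))  # True: 32 ms fits within 33.3 ms
```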

Considering a monocular (single camera) SLAM system for an SoC, computational efficiency and memory optimization are both critical design concerns. If the camera captures a 4K image at 30 fps, that means 8,294,400 pixels a frame or 248,832,000 pixels a second need to be stored and processed. Most embedded vision systems store each frame in an external DDR and then – as efficiently as possible – transfer portions of that image for vision processing (Figure 2).

Figure 2: Vision data is stored in off-chip memory and transferred to the processor over the AXI bus
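The pixel counts quoted for a 4K stream at 30 fps can be reproduced with a few lines of arithmetic. The sketch also estimates raw memory traffic, which depends on the pixel format. The 1.5 bytes per pixel used here (8-bit YUV 4:2:0) is an assumption, since the article does not name a format.

```python
WIDTH, HEIGHT, FPS = 3840, 2160, 30  # 4K UHD at 30 fps
BYTES_PER_PIXEL = 1.5  # assumption: 8-bit YUV 4:2:0; not stated in the article

pixels_per_frame = WIDTH * HEIGHT
pixels_per_second = pixels_per_frame * FPS
bytes_per_second = pixels_per_second * BYTES_PER_PIXEL

print(f"{pixels_per_frame:,} pixels per frame")     # 8,294,400 pixels per frame
print(f"{pixels_per_second:,} pixels per second")   # 248,832,000 pixels per second
print(f"{bytes_per_second / 1e6:.1f} MB/s written to DDR (under the assumed format)")
```

This counts only the write of each captured frame. Every additional read of the frame by a processing stage adds to the external memory traffic.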

Processing the algorithms necessary for advanced AR systems on a CPU – such as a mobile phone’s application processor – is inefficient. Offloading to a GPU, which is present in an AR system for drawing graphics, will speed up SLAM and CNN calculations compared to the CPU. However, while the performance advancements provided by GPUs helped usher in the era of AI and deep learning computing, implementing a deep learning algorithm on a GPU could require 100W of power or more. The most optimized approach is to allocate embedded vision processing to dedicated cores.

Performance and power efficiency can be achieved by pairing a flexible CNN engine with a vector DSP. The vector DSP is designed to handle applications like SLAM, while the dedicated CNN engine can support all common CNN operations (convolutions, pooling, elementwise) and will offer the smallest area and power consumption because it is custom-designed for these parameters.
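For readers less familiar with the CNN operations named above, the following sketch shows a convolution, an elementwise operation (ReLU is used here as the example), and pooling on a tiny input. It is a plain-Python illustration of the math only, with hypothetical function names and values, and says nothing about how a hardware CNN engine implements these operations.

```python
def conv2d(image, kernel):
    """'Valid' 2D convolution as used in CNNs (cross-correlation, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(fmap):
    """Elementwise operation: clamp negative activations to zero."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool2x2(fmap):
    """2x2 max pooling with stride 2 (odd trailing rows/columns are dropped)."""
    return [
        [max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
         for c in range(0, len(fmap[0]) - 1, 2)]
        for r in range(0, len(fmap) - 1, 2)
    ]


image = [[1, 2, 0], [0, 1, 3], [4, 0, 1]]
kernel = [[1, -1], [-1, 1]]

features = conv2d(image, kernel)
print(features)                     # [[0, 4], [-5, -1]]
print(relu(features))               # [[0, 4], [0, 0]]
print(max_pool2x2(relu(features)))  # [[4]]
```

Each output value of the convolution is a small multiply-accumulate loop, which is the kind of regular, repetitive work that dedicated hardware can execute efficiently.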

For an SoC designer of an AR system, embedded vision processor IP provides an optimized solution to address performance/power concerns. The DesignWare EV61, EV62, and EV64 Embedded Vision Processors integrate a high-performance 32-bit scalar core with a 512-bit vector DSP and an optimized CNN engine for fast, accurate object detection, classification, and scene segmentation. The vector DSPs are ideal for implementing the SLAM algorithm and run independently of the CNN engine. The processors are fully programmable and configurable and combine the flexibility of software solutions with the high performance and low power consumption of dedicated hardware.

Conclusion
The combination of SLAM and deep learning algorithms like CNN will make new and improved AR systems possible, opening up new experiences in gaming, education, autonomous vehicles, and more. Building complex AR systems with stringent performance, power, and area requirements can be simplified by using embedded vision processors as a companion to the host CPU. EV processors provide AR system developers with the ability to combine deep learning and evolving SLAM techniques.

A version of this article was first published in Synopsys DesignWare Technical Bulletin.

Gordon Cooper is a product marketing manager for Synopsys’ Embedded Vision Processor family. Cooper brings more than 20 years of experience in digital design, field applications and marketing at Raytheon, Analog Devices, and NXP to the role. Cooper also served as a Commanding Officer in the US Army Reserve, including a tour in Kosovo. He holds a Bachelor of Science degree in Electrical Engineering from Clarkson University.
