A DSP For Implementing High-Performance Sensor Fusion On An Embedded Budget

Efficiently executing the algorithms used to extract relevant information from the data produced by sensors.


By Markus Willems and Pieter van der Wolf

Sensor fusion refers to the combining of data from multiple sensors to obtain more complete and accurate results. By using the information provided by multiple sensing devices, it is possible to achieve better context awareness. Smart mobile devices, autonomous driving, smart home appliances, industrial control, and robotics are just a few of the applications that take advantage of sensor fusion.

Three key things are required for sensor fusion to be successful: miniaturization of sensors, sophisticated algorithms to extract the relevant information from the streams of data produced by the sensors, and an SoC that provides the required performance to execute the algorithms within the available budgets for power consumption and cost.

Sensors are often implemented as Micro-Electro-Mechanical Systems (MEMS), which has made them both smaller and much cheaper, and therefore suited for mass-market applications. As a result, sensors like accelerometers, gyroscopes, and magnetometers, as well as cameras and microphones, can be found in many consumer devices. Radar sensors will soon join this list, enabling gesture control on an ultra-low power budget. Radar, and of course cameras, are well-established sensors in today’s automobiles, and their numbers are rising from generation to generation, with LiDAR poised to be part of the next generation of Advanced Driver Assistance Systems (ADAS). It takes multiple, different sensors to obtain complete and accurate results. As with the human body, in which each “sensor” has complementary strengths and provides unique information, each sensor in an embedded system must contribute its own strengths. Taking the ADAS example, radar is robust in different light and weather conditions, LiDAR provides a wide field of view with good angular resolution, while camera-based vision allows for fast and accurate object classification (figure 1).


Fig. 1: Multiple different sensors in an ADAS system.

Sophisticated algorithms are required to (1) extract information from the sensor signal and (2) combine the information from the different sensor streams. Depending on the application, the complexity of these algorithms can vary widely, resulting in very different performance requirements. The performance needs might also vary over time. An always-on smart home device might only wake up when a certain voice command has been detected, while an ADAS system must monitor its environment on an ongoing basis.

Sophisticated algorithms need an SoC that provides the necessary performance to execute them. And as with any design, it must stay within the available power and area constraints, as this will largely impact the overall profitability. Heat dissipation and limited battery capacity are two main drivers, depending on the application. Ideally, such an SoC is fully programmable to allow for maximum flexibility. Algorithms are likely to evolve during the lifetime of a product, sensors might require different calibrations during their lifetime, and it is highly desirable to use the same SoC for multiple variants of a product, with the differentiation provided by the software.

Let’s look at a couple of application examples. A pedometer, or “step counter”, is part of any mobile phone these days. It relies on multiple sensors, such as an accelerometer, gyroscope, magnetometer, and sometimes sensors for pressure and temperature (for altitude tracking). These sensors are relatively cheap to produce, and they generate a constant stream of information. It takes between 10 and 50 MIPS to process the data and combine it into a meaningful output, something that can be handled by a small MCU.
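To give a feel for how light this kind of processing can be, here is a minimal sketch of a step counter. It is an illustration only, not how any particular phone implements it: it uses the accelerometer alone, and the function name `count_steps`, the input format, and the threshold and debounce values are all assumptions chosen for the example.

```python
import math

def count_steps(samples, threshold=11.0, min_gap=3):
    """Count steps as upward threshold crossings of the acceleration magnitude.

    samples   : list of (ax, ay, az) tuples in m/s^2 (hypothetical input format)
    threshold : magnitude in m/s^2 that a step peak must exceed (assumed value)
    min_gap   : minimum number of samples between two counted steps (debounce)
    """
    steps = 0
    above = False            # was the previous sample above the threshold?
    last_step = -min_gap     # index of the last counted step
    for i, (ax, ay, az) in enumerate(samples):
        mag = math.sqrt(ax * ax + ay * ay + az * az)
        # Count only the rising edge, and only if the debounce gap has passed.
        if mag > threshold and not above and i - last_step >= min_gap:
            steps += 1
            last_step = i
        above = mag > threshold
    return steps
```

A production pedometer would typically add filtering and use the gyroscope and magnetometer to reject false steps, which is where the fusion part comes in.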

For an always-on smart home device, one might see a combination of microphones, cameras, and radar. These devices enable a smart interaction with the user, as they detect the presence of a user and then respond to commands. “Smart” sensors will be used to limit power consumption, e.g. starting face recognition (complex algorithm, high performance requirements) only after a face has been detected (simpler algorithm, low performance requirements). The compute requirements will vary widely over time. The system has to provide peak performance when required but needs to dynamically manage compute resources and the power they consume. With the amount of data coming from vision, voice, and radar sensors, processing quickly requires multiple billions of operations per second (GOPS).
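The staged wake-up idea described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the class name `StagedPipeline` and the two callables are hypothetical stand-ins for a cheap face detector and an expensive face recognizer, and real systems would also gate clocks or power domains rather than just skipping a function call.

```python
class StagedPipeline:
    """Run an expensive stage only when a cheap stage reports something of interest."""

    def __init__(self, cheap_detector, expensive_recognizer):
        self.cheap_detector = cheap_detector              # low compute, always on
        self.expensive_recognizer = expensive_recognizer  # high compute, on demand
        self.expensive_calls = 0                          # how often the costly stage ran

    def process(self, frame):
        # Stage 1: the cheap check runs on every frame.
        if not self.cheap_detector(frame):
            return None
        # Stage 2: the costly stage runs only after a positive detection.
        self.expensive_calls += 1
        return self.expensive_recognizer(frame)
```

Counting the calls to the expensive stage makes the benefit visible: for a stream in which faces are rare, the high-performance path is exercised only a small fraction of the time.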

What are the key features needed for the efficient implementation of sensor fusion?

Sensor fusion consists of two main phases: (1) the extraction of information, and (2) the combination of information to derive a result. This is illustrated in figure 2.


Fig. 2: Sensor fusion processing chain.

Phase 1 is the extraction of information, the frontend processing. For the camera, this is image signal processing with functions such as image scaling, color space conversion, filtering, or feature detection. Here the data is represented as pixels, with data formats from 8 bits up to 16 bits.
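As an illustration of camera frontend processing on 8-bit pixel data, the sketch below converts RGB pixels to grayscale using integer arithmetic only. The weights 77, 150, and 29 are a common 8-bit fixed-point approximation of the BT.601 luma coefficients (they sum to 256, so the result is normalized with a shift); the function names are hypothetical and chosen for this example.

```python
def rgb_to_gray(r, g, b):
    """Convert one 8-bit RGB pixel to 8-bit luma using integer math only.

    Weights approximate 0.299, 0.587, 0.114 scaled by 256, so the
    final right shift by 8 divides by 256 without any floating point.
    """
    return (77 * r + 150 * g + 29 * b) >> 8

def image_to_gray(pixels):
    """Apply the conversion to a list of (r, g, b) tuples."""
    return [rgb_to_gray(r, g, b) for (r, g, b) in pixels]
```

On a SIMD machine the same multiply-accumulate-shift pattern would be applied to many pixels per instruction, which is why narrow integer types matter for vision workloads.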

And finally for radar, such frontend processing includes range and velocity FFTs, and constant false alarm rate (CFAR) for thresholding. Due to dynamic range and precision requirements, data types are half-precision or full-precision floating point.
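To make the CFAR thresholding step more concrete, here is a minimal sketch of one common variant, cell-averaging CFAR, applied to a one-dimensional power spectrum such as the output of a range FFT. The choice of the cell-averaging variant, the function name `ca_cfar`, and the default window sizes and scale factor are assumptions for illustration.

```python
def ca_cfar(power, num_train=4, num_guard=1, scale=4.0):
    """Cell-averaging CFAR on a 1D list of power values.

    For each cell under test, the noise level is estimated as the mean of
    num_train training cells on each side, skipping num_guard guard cells
    next to the cell under test. A detection is declared when the cell
    exceeds scale times that estimate. Returns the detected indices.
    """
    half = num_train + num_guard
    detections = []
    # Edge cells without a full window on both sides are skipped.
    for i in range(half, len(power) - half):
        left = power[i - half : i - num_guard]
        right = power[i + num_guard + 1 : i + half + 1]
        noise = (sum(left) + sum(right)) / (2 * num_train)
        if power[i] > scale * noise:
            detections.append(i)
    return detections
```

Because the threshold follows the local noise estimate rather than a fixed level, the false alarm rate stays roughly constant as the noise floor changes, which is what the name refers to.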

Phase 2 is the combination of information, the backend processing. The algorithms to be used are very application dependent. Tasks may include object detection, recognition, tracking, as well as prediction, e.g. using recursive estimators like Kalman filters. AI-based machine learning algorithms might be applied, as well as linear algebra operations. Data types will, of course, strongly depend on the algorithms.
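As a small example of a recursive estimator in the backend, the sketch below fuses measurements of a single scalar quantity (say, a distance) coming from sensors with different noise levels, using a one-dimensional Kalman filter with a constant-state model. The function name `kalman_fuse`, the default parameters, and the `(value, variance)` input format are assumptions made for this illustration; real trackers use multi-dimensional state and motion models.

```python
def kalman_fuse(measurements, process_var=0.01, x0=0.0, p0=1000.0):
    """Scalar Kalman filter over (value, variance) measurement pairs.

    measurements : iterable of (z, r), where r is the measurement noise
                   variance of the sensor that produced z
    process_var  : variance added per step to model drift of the true state
    x0, p0       : initial estimate and its variance (large p0 = little trust)
    Returns the final estimate and its variance.
    """
    x, p = x0, p0
    for z, r in measurements:
        p += process_var        # predict: uncertainty grows between updates
        k = p / (p + r)         # Kalman gain: trust in this measurement
        x += k * (z - x)        # update the estimate toward the measurement
        p *= (1.0 - k)          # uncertainty shrinks after the update
    return x, p
```

Feeding alternating readings from a noisy and a precise sensor yields an estimate that sits closer to the precise sensor, with a variance lower than that of either sensor alone.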

Because of these specific but varied requirements, sensor fusion requires a digital signal processor (DSP) that is versatile, scalable, and enables performance/power/area (PPA) optimization and efficient software development. Let’s take a closer look at each:

  1. Versatility

Algorithms and data types largely depend on the application. Therefore, a DSP architecture has to support a rich instruction set for the efficient implementation of different algorithms, with a specific focus on performance-critical operations such as FFTs or linear algebra operations. The DSP has to support integer and floating-point data types with different precisions.

Such a DSP needs to qualify as a flexible compute resource, meaning it needs to be able to perform “classical” filtering operations typically associated with a DSP, as well as machine learning and computer vision algorithms.

  2. Scalability

To avoid a one-off investment, scalability is key. While the requirements for the different sensors vary, it is highly desirable to use the same baseline architecture for all signal-processing requirements across different designs, to limit the effort for system integration, and to maximize overall software development productivity. Scalability enables the designer to pick the configuration that delivers the best PPA for the target application.

Scalability is not only about the hardware. A significant investment is in the software, including kernels optimized for the specific architecture. It is important that such software can be reused across these SoCs, enabling the reuse for different versions of an SoC (such as a low-end/medium-end/high-end version).

  3. PPA optimization

There are many facets to the optimization of performance/power/area. Starting with performance, it is about the cycle efficiency (i.e. the number of cycles it takes to perform a specific function) of the core itself, with the available processing engines and an ISA that enables the utilization of these engines. This is directly linked to the efficient support of data movement in parallel to data processing, which in turn is linked to a rich set of (ideally configurable) interfaces, e.g. for connecting accelerators and peripherals directly to the core, without going through system memory.

The maximum frequency at which a DSP can be clocked is another performance aspect. It determines how much horsepower the DSP can provide in cycles per second, but also impacts the effort required for timing closure in physical SoC design.
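The relationship between cycle efficiency and clock frequency can be captured in a short back-of-the-envelope calculation. The helper names and the numbers in the usage note are hypothetical and serve only to show the arithmetic.

```python
def kernel_time_us(cycles, clock_mhz):
    """Execution time in microseconds: cycles divided by cycles per microsecond."""
    return cycles / clock_mhz

def core_utilization(cycles_per_frame, frames_per_second, clock_mhz):
    """Fraction of the core's cycle budget consumed by a periodic workload."""
    required = cycles_per_frame * frames_per_second   # cycles needed per second
    available = clock_mhz * 1_000_000                 # cycles offered per second
    return required / available
```

For instance, a kernel that needs 20,000 cycles takes 25 microseconds at 800 MHz, and a workload of 2,000,000 cycles per frame at 100 frames per second uses a quarter of that core. Halving the cycle count through a better ISA fit has the same effect on throughput as doubling the clock, but without the added timing-closure effort.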

Low power consumption is directly linked to performance efficiency, as well as to the option to wake up certain cores only when needed (as described for the smart-home applications, which wait for the wake-up information).

Finally, small area has a direct impact on cost, as well as on leakage.

  4. Efficient software development

Software development needs to be efficient, as for almost all projects a large portion of the investment (and the people involved) is spent on software development and testing. It takes a high-level programming model with an optimizing compiler, and a rich set of libraries with off-the-shelf optimized kernels for filters, transforms (e.g. FFT), vector math, linear algebra, and machine learning. And, of course, it requires low-level modules such as drivers, DMA handlers, and interrupt handlers. As significant investment goes into the software, it is important that such software is portable over a wide range of architectural options, e.g. supporting different vector lengths without any need for recoding.

DesignWare ARC VPX DSP IP

VPX DSP IP is a family of VLIW/SIMD processors targeting a broad range of signal processing applications, from always-on devices to automotive ADAS to vision to machine learning and high-performance computing. An overview is given in figure 3.


Fig. 3: ARC VPX DSP IP Block Diagram.

The VPX family is an excellent match for the sensor fusion requirements, as it provides scalability and versatility to achieve best PPA, and software development efficiency for overall productivity.

All VPX family members are based on the same VLIW/SIMD architecture. Customers can scale the solution to their needs, selecting from different vector lengths ranging from 128-bit to 512-bit. It is not uncommon to start with a certain vector length in mind, only to realize that the PPA requirements call for a different configuration. The vector-length agnostic (VLA) programming model makes this very easy to do, as code can be migrated among VPX family members. VLA programming ensures that investments into the software are safe, enabling flexibility for the current project, and reusability for future projects. Besides the vector lengths, customers can select from single, dual, or quad-core configurations, with the multicore configurations pre-integrated and prepared for cache coherency and shared multi-channel DMA.
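The idea behind vector-length agnostic code can be illustrated with a small simulation. This is not the MetaWare programming interface, which the article does not show; it is a plain Python sketch in which the hypothetical parameter `lanes` stands in for the hardware vector width (for 32-bit elements, 128-bit, 256-bit, and 512-bit vectors would correspond to 4, 8, and 16 lanes).

```python
def vla_axpy(a, x, y, lanes):
    """Compute a * x[i] + y[i] in chunks of `lanes` elements.

    The kernel never hard-codes a vector width: it asks how many lanes
    are available and masks off the unused lanes in the final chunk,
    so the same source works for any vector length.
    """
    n = len(x)
    out = [0.0] * n
    i = 0
    while i < n:
        active = min(lanes, n - i)        # lanes doing useful work in this chunk
        for lane in range(active):        # stands in for one vector instruction
            out[i + lane] = a * x[i + lane] + y[i + lane]
        i += lanes
    return out
```

Running the kernel with 4, 8, or 16 lanes gives identical results; only the number of loop iterations changes, which is the property that lets code move between family members without recoding.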

Besides the different vector lengths, each VPX core is highly configurable, which allows the architecture to be tailored for the best performance at the lowest area. Take the example of an application with no floating-point needs but tight area and power constraints: using the ARChitect configuration tool, users can choose not to include the floating-point units (one scalar and up to two vector units). Another example of such an optional unit is the specialized vector math unit, for the very efficient execution of operations like sin(x), cos(x), 2^x, div, sqrt, 1/sqrt, log_2(x), etc.

As explained above, depending on the sensors, and the algorithms applied to the sensor data, different data types are needed. VPX supports a wide range of data types, from floating-point types that cover the dynamic range required by applications such as high-resolution radar to small integer types used for AI applications.

The VPX instruction-set architecture (ISA) is tuned to the efficient execution of key signal processing kernels, such as FFTs or matrix operations. Taking the example of an FFT operation: through the combination of vector load/store double (which refers to transferring up to twice the vector length of data between memory and the core) and dedicated FFT instructions, it is possible to perform all FFT operations in software, even for multi-sensor radar scenarios. This avoids the cost of a dedicated hardware accelerator, resulting in power and area savings.
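For readers less familiar with the kernel in question, the sketch below shows a textbook recursive radix-2 FFT. It does not represent the VPX FFT instructions or the MetaWare library implementation; it only illustrates the butterfly structure that such instructions are designed to accelerate. The function name `fft_radix2` is chosen for this example.

```python
import cmath

def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT of a power-of-two-length list."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])   # FFT of even-indexed samples
    odd = fft_radix2(x[1::2])    # FFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Butterfly: combine the two half-size results with a twiddle factor.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

Each butterfly reads two complex values and writes two, so memory bandwidth matters as much as arithmetic throughput, which is consistent with the article pairing wide loads and stores with dedicated FFT instructions.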

ISA and microarchitecture (i.e. the way the different functional units are implemented) are the key ingredients to achieve best PPA. But it takes a software development environment to unleash the capabilities of the hardware. VPX comes with the MetaWare tool suite, which includes an optimizing C/C++ compiler, simulation tools, and a sophisticated debugging environment. It includes a rich set of libraries providing optimized kernels for signal processing, linear algebra, and machine learning. These kernels have been written in a vector-length agnostic way, so code remains portable across all members of the VPX family. To support the growing need for AI, MetaWare also offers the NN SDK and advanced graph mapping tools supporting TensorFlow, Caffe, and ONNX.

Fig. 4: Libraries provided with MetaWare, optimized for VPX.

The VPX family includes the VPXxFS variants (VPX2FS, VPX3FS and VPX5FS), tailored for Functional Safety (FuSa) certification. These cores meet random fault detection and systematic functional safety development flow requirements to achieve up to full ASIL D ISO 26262 compliance. The VPXxFS DSPs integrate hardware safety features such as ECC protection for memories and interfaces, safety monitors and lockstep mechanisms. A comprehensive set of safety documentation helps automotive designers achieve ISO 26262 functional safety certification. In addition, the VPXxFS DSPs offer a “hybrid” option that enables users to select required safety levels up to ASIL D in software and post-silicon.

Conclusion

Sensor fusion is a rapidly growing market, finding its way into almost any application domain. Fueled by the availability of low-cost sensors, and enabled by advanced algorithms, it enables new user experiences in different markets, including smart mobile devices, automotive, health, and industrial control. Sensor fusion results in a heterogeneous signal processing workload, as different sensors require different data types to represent the data, and different DSP algorithms to extract the information relevant for the actual fusion process. The fusion process, i.e. combining the various sensor information streams and deriving meaningful decisions from them, is very much application specific. To handle these heterogeneous workloads, it takes a processor that is scalable to handle different data formats and performance requirements, as well as versatile and configurable to tune the architecture, including its memories and interfaces, to meet the PPA requirements. The ARC VPX family is an ideal solution for sensor fusion applications: with vector lengths of 128-bit, 256-bit, or 512-bit, it addresses a broad range of signal-processing workloads. With a tailored instruction set and a dedicated math hardware engine, it provides excellent cycle efficiency with unmatched PPA. Its vector-length agnostic programming model ensures that the software can be reused across all members of the VPX family, protecting this significant investment.

Markus Willems is a senior product marketing manager at Synopsys.

Pieter van der Wolf is a principal R&D engineer at Synopsys.


