Confusion Grows Over Sensor Fusion In Autos

Multiple approaches are being explored for multiple data types, but it’s still too early to say which is best — or whether any of them will shorten time to market for autonomous vehicles.


A key requirement for fully autonomous vehicles is the ability to fuse inputs from multiple sensors, which is essential for making safe and secure decisions. But that fusion is turning out to be much harder than first imagined.

There are multiple problems that need to be solved, including how to partition, prioritize, and ultimately combine different types of data, and how to architect the processing within a vehicle so that it can make decisions based on those various data types quickly enough to avoid accidents. There is no single best practice for how to achieve that, which is why many automotive OEMs are taking very different approaches. It also helps explain why there are no fully autonomous vehicles on the road today.

“There are three primary ways to look at the problem,” said David Fritz, vice president of hybrid-physical and virtual systems, automotive and mil-aero at Siemens Digital Industries Software. “One approach is to fuse the raw data from multiple sensing sources before processing. While this approach can reduce power consumption, bad data from one sensor array can contaminate good data from other sensors causing poor results. In addition, the transmission of huge amounts of raw data poses other challenges with bandwidth, latency, and system cost.”

A second approach is object fusion, where each sensor processes data and represents its sensor-specific processing results as an interpretation of what it detects.

“This has the advantage of seamlessly integrating results from onboard sensors, infrastructure sensors, and those on other vehicles,” Fritz said. “The challenge of this method is a universal representation and tagging of objects so that they can be shared across disparate vehicles and infrastructures. The third option — and the one we find most compelling from the power, bandwidth, and cost perspective — is a hybrid of the first two methods. In this method, objects are detected by the sensors but not classified. Instead, point clouds of objects are transmitted to onboard central compute systems, which classify (tag) point clouds from different sensors, both internal and external. This significantly reduces bandwidth and latency requirements, keeps the cost and load on the sensors low, and allows the vehicle to interpret, or classify, the objects in any way it likes, thereby eliminating the need for a universal classification standard.”
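The hybrid approach Fritz describes can be illustrated with a short sketch. This is a hypothetical minimal example, not production code: sensors ship unclassified point clusters, and only the central compute assigns labels. The class names and size thresholds are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PointCluster:
    """An unclassified object detected by a sensor: raw 3D points only."""
    sensor_id: str
    points: List[Tuple[float, float, float]]  # (x, y, z) in the vehicle frame

def bounding_box(cluster: PointCluster) -> Tuple[float, float, float]:
    """Extent of the cluster along each axis: (width, depth, height)."""
    xs, ys, zs = zip(*cluster.points)
    return (max(xs) - min(xs), max(ys) - min(ys), max(zs) - min(zs))

def classify(cluster: PointCluster) -> str:
    """Central compute assigns the label; the sensor never classifies.
    Thresholds are illustrative placeholders, not calibrated values."""
    w, d, h = bounding_box(cluster)
    if h > 1.0 and w < 1.0 and d < 1.0:
        return "pedestrian"
    if w > 1.5 and d > 3.0:
        return "vehicle"
    return "unknown"
```

Because classification happens centrally, a cluster forwarded by an infrastructure sensor is handled exactly like one from an onboard lidar, which is the point Fritz makes about avoiding a universal classification standard.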

This discussion in the automotive ecosystem is just beginning, and there are plenty of challenges to overcome.

“You need to figure out which objects you have, and when to use them,” said Frank Schirrmeister, vice president of business development at Arteris IP. “All the formats are very different. If you’re looking into lidar, there are funky maps with distances. In cameras, it is RGB, and there is a set of pixels. With thermal, there is something else. Even before you correlate and fuse all these things, you somehow need to make sense of the formats. From an architecture perspective, that may lead to the processing being most desirable at the sensor or close to it. Then, the object correlation is done between the different bits. But you need to figure out details, like how hot the object is, how far away the object is, etc. There is a Venn diagram of these different sensors having an overlapping set of characteristics, some of them being better than others.”
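One way to “make sense of the formats,” as Schirrmeister puts it, is to normalize every modality into a common, sensor-agnostic record before correlation. The sketch below is an assumption about how such a record might look; the field names and converters are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Track:
    """Sensor-agnostic object record used for correlation downstream."""
    x: float                        # lateral position, meters
    y: float                        # longitudinal position, meters
    distance: Optional[float]       # direct range, if the sensor measures one
    temperature_c: Optional[float]  # only a thermal sensor fills this in

def from_lidar(x: float, y: float) -> Track:
    # Lidar points carry range directly.
    return Track(x, y, distance=(x * x + y * y) ** 0.5, temperature_c=None)

def from_camera(x: float, y: float) -> Track:
    # A single camera frame gives position/bearing but no direct range.
    return Track(x, y, distance=None, temperature_c=None)

def from_thermal(x: float, y: float, temp_c: float) -> Track:
    return Track(x, y, distance=None, temperature_c=temp_c)
```

The `Optional` fields make the Venn-diagram overlap explicit: correlation logic can check which attributes each sensor actually contributed before deciding which value to trust.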

Sensor fusion is an area of rapid innovation, enabled by continuous improvements in algorithms and the chip industry’s deep knowledge of SoC architectures.

“A common denominator in sensor fusion is the need for a heterogeneous processing approach, as it requires a combination of signal processing (often using a DSP), AI processing on a dedicated accelerator, and control code using a CPU,” said Markus Willems, senior product manager at Synopsys. “Depending on the type of sensor, different data types need to be supported. This includes 8-bit integer processing for image data, or 32-bit single-precision (SP) floating point for radar processing, while AI processing might require bfloat16, among others. Running different types of processors on a single chip calls for a sophisticated software development flow, leveraging an optimizing C/C++ compiler and function libraries, as well as graph mapping tools supporting the latest neural networks, including the transformers being used in sensor fusion. Memory, bandwidth, and latency are key design parameters, and designers expect to see early availability of processor simulation models and SoC architecture exploration tools to examine what-if scenarios.”
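The mix of numeric formats Willems lists can be made concrete. The sketch below uses NumPy to show the three precisions side by side; since NumPy has no native bfloat16, it is emulated here (an assumption of this example) by truncating the low mantissa bits of float32.

```python
import numpy as np

def to_bfloat16(x: np.ndarray) -> np.ndarray:
    """Emulate bfloat16 by zeroing the low 16 bits of each float32 word.
    This is truncation toward zero, not IEEE round-to-nearest."""
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

# Image pipeline: 8-bit integer pixels.
pixels = np.array([0, 127, 255], dtype=np.uint8)

# Radar pipeline: 32-bit single-precision floats (e.g., FFT outputs).
radar_bins = np.array([0.001, -3.5, 1.0e4], dtype=np.float32)

# AI accelerator: the same values at bfloat16-style reduced precision.
activations = to_bfloat16(radar_bins)
```

Because bfloat16 keeps float32’s 8-bit exponent, the radar values survive with their full dynamic range but only about 7 bits of mantissa, which is why it suits AI accelerators but not radar signal processing itself.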

While sensor fusion gets a lot of attention in automotive, it’s useful for other markets, as well.

“We are focused on automotive because there is going to be an image sensor in the camera, radar, and maybe a lidar,” said Pulin Desai, product management group director in the Tensilica IP Group at Cadence. “There may also be an image sensor and an IMU in a robotic application. There might be multiple image sensors, and you will fuse those things. Other sensors include gyroscopes, magnetometers, and accelerometers, and these sensors are being used in so many different ways in so many different areas. While there’s a lot of focus on the automotive side, the same image sensors and radar sensors show up in home sweeping robots, which might have a very similar architecture to a drone. And any kind of unmanned vehicle has those kinds of sensors.”

There is a lot of data streaming in. Figuring out where to process it all is a challenge, in part because not all data is in the same format.

“Here exists the classic edge computing situation, where you need to decide how to balance the processing throughout the whole chain — from where you get the data from the analog world, to where you make a decision in the brain or interact with the driver for hybrid use models,” said Arteris’ Schirrmeister. “Object correlation sounds much more realistic, but there are all kinds of challenges. Thermal, lidar, and radar all use different types to even represent the data. If you look at lidar, because it’s essentially giving you points within a certain distance, that’s a completely different type of data than what you would get from a camera. Correlating those all together is certainly not trivial and can be quite compute intensive. Even more, you then need to decide if the different items seem to disagree with each other. If so, what do you choose? Do you use some average value? It’s definitely a challenge for all those sensors to be combined.”
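Schirrmeister’s question about disagreeing sensors ("Do you use some average value?") is often answered with some form of confidence weighting. The sketch below is a deliberately simple illustration, not a reference design; real systems also gate outliers and track uncertainty over time (e.g., with a Kalman filter).

```python
def fuse_estimates(estimates):
    """Confidence-weighted average of per-sensor estimates of one quantity
    (e.g., range to the same object).

    estimates: list of (value, confidence) pairs, confidence in (0, 1].
    Low-confidence sensors are down-weighted rather than discarded.
    """
    total = sum(conf for _, conf in estimates)
    if total == 0:
        raise ValueError("no usable estimates")
    return sum(val * conf for val, conf in estimates) / total
```

For example, a lidar reporting 10.0 m at confidence 0.9 and a camera reporting 12.0 m at confidence 0.3 fuse to 10.5 m; the answer stays close to the sensor the system trusts most.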

When it comes to the actual fusion of the data, Siemens’ Fritz has observed a number of approaches. “Some of the early forays into this, where NVIDIA got a jumpstart, was them saying, ‘We can do a lot of the AI stuff. When the sensor data comes in, we can use our high-end GPUs, try to lower the power consumption on those, then process that with neural networks.’ That’s how we ended up several years ago with a rack in a trunk that had to be water-cooled. Then you throw in the lidar guys who say, ‘I know you can’t pay $20,000 per unit for lidar, so we’re working to get the lidar cheaper.’ And somebody says, ‘Well, wait a minute. Cameras are like 35 cents. Why don’t we put a bunch of cameras and fuse all of this together?’ That started several years ago with a brute-force, pretty much brain-dead approach. And that is the approach of, ‘I have raw lidar data. I have raw camera data. I have radar, lidar, camera. How do I put all that together?’ People did some crazy things, such as converting the lidar data into RGB. ‘We have multiple frames because there is distance information. Then we’ll run it through the simplest convolutional neural network to try to detect objects and classify them.’ That was the extent of it. But some people are still trying to do that.”

Tesla, in contrast, still relies mainly on camera data. Fritz said this is possible because of the capabilities of the stereo camera, or even sequential frames over a fixed period of time from a mono camera, using parallax to determine depth. “Because of this they say, ‘Why do I need lidar? And because I don’t have lidar, I don’t have the sensor fusion problems.’ It just simplified things. But say the lens on the camera gets covered with water or dirt. They have those issues to worry about. At the other extreme, if you’re relying completely on lidar, I’ve seen scenarios where you have a 2D representation of a person being walked across the street, and the car thinks it’s a real person. Why? Because of the reflection. There are all kinds of things that happen to lidar that people don’t know about, and it’s extremely difficult to filter them out.”
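The parallax technique Fritz refers to is the standard stereo relation: depth equals focal length times baseline divided by disparity. A minimal sketch, with illustrative parameter values:

```python
def depth_from_disparity(focal_px: float, baseline_m: float,
                         disparity_px: float) -> float:
    """Stereo depth: Z = f * B / d.

    focal_px:     focal length expressed in pixels
    baseline_m:   separation between the two cameras (or between two
                  poses of a moving mono camera), in meters
    disparity_px: horizontal shift of the same feature between the
                  two images, in pixels
    """
    if disparity_px <= 0:
        raise ValueError("zero disparity: feature at infinity or mismatched")
    return focal_px * baseline_m / disparity_px
```

For example, with a 700-pixel focal length, a 12 cm baseline, and an 8.4-pixel disparity, the object is 10 m away. The inverse relationship also shows the weakness Fritz alludes to: at long range, disparity shrinks toward the noise floor, so small pixel errors produce large depth errors.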

Fusing different data types also depends on what types of sensors are present. “People are talking about early, mid-, and late fusion,” said Cadence’s Desai. “This all depends on our customer and our customer’s customer’s system design, which says what type of problem they’re trying to solve. We are agnostic to some of these things, because there are stereo sensors that can do early fusion, or late fusion, where your image and data both have identified the object and you do the late fusion of that. There might also be a mid-fusion, which is more a system vendor’s choice of how they want to do the fusion, how much computation they want to do, how robust the information is, or what type of problem they’re trying to solve. How difficult is this? Well, it depends on the type of fusion.”
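The early/mid/late distinction Desai describes comes down to where in the pipeline the join happens. These toy functions only illustrate that placement; the merge rule in the late-fusion case (keep the most confident decision per label) is an invented placeholder, not a standard.

```python
import numpy as np

def early_fusion(raw_a: np.ndarray, raw_b: np.ndarray) -> np.ndarray:
    """Early: combine raw, time-aligned sensor data before any processing."""
    return np.concatenate([raw_a, raw_b])

def mid_fusion(feat_a: np.ndarray, feat_b: np.ndarray) -> np.ndarray:
    """Mid: each sensor extracts features first; the features are joined."""
    return np.concatenate([feat_a, feat_b])

def late_fusion(dets_a, dets_b):
    """Late: each sensor emits (label, confidence) decisions independently;
    keep the most confident decision per label."""
    merged = {}
    for label, conf in list(dets_a) + list(dets_b):
        if conf > merged.get(label, 0.0):
            merged[label] = conf
    return merged
```

The tradeoff tracks the quotes above: early fusion preserves the most information but moves the most data; late fusion is cheap on bandwidth but can only reconcile decisions, not the evidence behind them.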

Types of sensor fusion

Fig. 1: Different fusion options. Source: Cadence


Another consideration that comes into play, especially with the tremendous focus on AI/ML techniques, is when to use them, or whether classical DSP is more appropriate, Desai said. “I draw the parallel to some of what we did in the past versus what we do today. There are certain problems where you have a deterministic way to achieve a very high success rate with AI. For example, when we were doing face and people detection in 2012 and 2013, we used classical computer vision algorithms. However, at that time they were not very accurate. It was very difficult to achieve the accuracy. Then, when we moved to AI, we were getting very robust performance with the face detection and people detection. So now there is a very deterministic situation where you say, ‘I’m going to do face detection, and a human can achieve what we call 99% accuracy, and AI can give me 97% accuracy.’ Why do I need to play with something that’s not good enough? I would go and use this AI because I know exactly what it does, and it gives the best accuracy out there. But there are certain situations, such as when I’m still trying to figure things out, where I need to try different algorithms and play within my environment. I need to be able to do X, Y, or Z, and I need flexibility. There, you continue to use your digital signal processor for those algorithms.”

Also, much of the time with AI engines, the data that goes into the engine must be pre-processed, which means it must be in a specific format.

“With specific data types, your AI engine may be saying, ‘I only do fixed point,’” Desai explained. “So you might use a programmable engine to do that. Then, once you put certain things in AI, you may not have a lot of flexibility, and in four years when something new comes in, you may have to change it. There are a lot of different factors. Essentially, if you’re doing something very deterministic, you know you can achieve a very high performance rate, and you know it today. You may say, ‘I’m going to throw AI in to solve that problem today. Tomorrow, I may still do that.’ Then I add flexibility by using a programmable engine. Or, if I don’t know yet and need to play with it, I will still use the classical algorithm. Even if I have AI, I still need to do the pre-processing and post-processing of the data, so I need my classical DSP algorithms for that.”
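The fixed-point constraint Desai mentions is typically handled by quantizing float data before it reaches the AI engine and dequantizing the engine’s outputs afterward. The sketch below shows simple asymmetric int8 quantization; the exact scheme is an assumption for illustration, since real engines specify their own.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Pre-processing: map float data onto the int8 range an AI engine
    expects. Returns (q, scale, zero_point) so the mapping is invertible."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = int(round(-lo / scale)) - 128
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Post-processing: recover approximate floats from the engine's output."""
    return (q.astype(np.float32) - zero_point) * scale
```

This pre/post step is exactly the kind of work that stays on the programmable DSP in the flow Desai describes: it is deterministic, format-specific glue around the fixed-function AI engine.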

Experimentation will be a requirement as automotive OEMs and systems companies evolve their computing architectures toward sensor fusion.

Siemens’ Fritz believes that during this evolutionary period, the right way to handle development is to hire and/or carve out some small teams that do a lot of pilot projects. “Those could be a dozen or two dozen people. Their target might be 300 prototypes in a test environment by 2026 or 2028, for instance.”

Still, where each OEM stands today depends on the OEM, how long they’ve been doing architecture development, and how they want to do this going forward.

“Different OEMs have different levels of expertise. Some are trying to ramp up their teams to figure this out,” Fritz noted. “The majority of OEMs have a bit of the ‘not invented here’ syndrome going on, such that they think they can do this themselves because they have lots of smart people. The problem is, are you going to bump up from 100 to 200 ECUs and double the weight of the vehicle? In other words, they don’t tend to have the people on staff now who think about this in a holistic way. They think about it in terms of, ‘I’ve got a hammer, therefore this must be a nail.’ Then they fail miserably.”

As with most new technologies, the developer recognizes they need a compiler for their CPU, so they try to build their own. “Then they find out that the two people they thought could do it, can’t, and realize they need four more, then, a dozen or two more,” he said. “By the end they’re so emotionally invested in it, it’s hard to kill it, and it just lasts forever until finally they end up buying the chip they need and firing the 100 internal development people. That happens often, and in automotive it’s no different. Sensor fusion is one of a few key areas where we’re seeing that phenomenon play out as we speak. Like everything else in this space, it’s like the starting gun went off years ago, people started running, and then realized, ‘I haven’t trained for this marathon.'”
