Dirty Data: Is the Sensor Malfunctioning?

Why sensor data needs to be cleaned, and why that has broad implications for every aspect of system design.


Sensors provide an amazing connection to the physical world, but extracting usable data isn’t so simple. In fact, many first-time IoT designers are unprepared for how messy a sensor’s data can be.

Every day the IoT motion-sensor company MbientLab struggles to tactfully teach its customers that the messy mountain of data they are seeing is not the result of faulty sensors. Instead, the system design that incorporates those sensors is missing a crucial step in the data-cleaning process.

“I battle this every day,” said MbientLab CEO Laura Kassovic in a recent presentation, warning engineers just how difficult training IoT wearables with machine learning can be. Tools and hardware have improved over the years, but a basic understanding of how to deal with the data still lags behind, she said.

“I applaud users for trying to use sensors to solve problems and research complex topics,” she said. “It’s brave, it’s fun, it’s wild, it’s hard. My issues are with those who blame their failures on our sensors instead of their methodology and failure to solve the real problem. Sensors don’t lie. Sensors aren’t biased. Sensor data is always correct. It is only the user that can misuse or misinterpret the sensor data.”

Sensors aren’t always easy to use, however. And not all data generated by sensors is valuable. The key is to figure out where the real value is, and to separate out that data and discard the rest.

“Most sensing is very cheap,” said Aart de Geus, chairman and co-CEO of Synopsys. “There are some exceptions to that, such as artificial eyes. But some also falls into the AI category, such as the wrist watch that picks up various measurements. What kind of insights can you get? Can you predict a heart attack? If you can, that is of pretty high value. So how much would you pay for what? If you have one minute, you can scribble down ‘thank you’ to your wife and that’s about it. If you have an hour, you can call for a medivac. If you have a few hours, the value and risk change again.”

Data comes in a variety of forms across many applications. What is considered clean in one case may require much more work than in another application. And some of it can be done locally, while other data can be cleaned in a data center.

“Let’s say you have a facial recognition application and only certain employees are allowed to enter this building,” said de Geus. “Every month you update the AI network in the edge device and it will be up-to-date on all the faces. It may do a lot of work because there are a lot of people coming in all the time, but not all of that has to be updated all the time.”

In other cases, data may need to be scrubbed in real time. The tragic Lion Air crash of a new Boeing 737 MAX 8, which killed all aboard on Oct. 29, may be heading toward the “the sensor did it” category. The flight data recorder showed inconsistent readings from one of the aircraft’s two angle-of-attack (AOA) sensors. That apparently incorrect data was enough to trigger the plane’s anti-stall system into a nose-down action, which the pilots wrestled against all the way into the Java Sea.

It’s too early to tell what really happened in this case. “It’s not just a sensor. There are multiple aspects of this system,” said Mahesh Chowdhary, director of STMicroelectronics’ Strategic Platforms and IoT Excellence Center. “There is a sensing part, a connectivity part, and then a computing part. There is some algorithm that looks at sensor data and determines what is the orientation of the airplane. Multiple features have to work together harmoniously and synchronously to provide the information about the orientation of the airplane.”

But not all data is good, and even data that is assumed to be valuable may be corrupted or inaccurate. From a seemingly simple IoT system to a large safety-critical one, when sensor system designs fail, is data (especially dirty data) often the culprit? How do you know whether the sensor or the data is bad? Or is the fault in the algorithms or the firmware that reads and acts upon the data? It would help first to agree on what dirty data is.

“It’s an ambiguous area. Is the sensor working right? Well, yeah it is but it’s not working the way you intended it. So, is it user error or is it sensor error? I find the whole concept of dirty data is super ambiguous because if you get the sensors working right, it’s just not working as intended by the user,” said Robert Pohlen, a product line director at TT Electronics, a company that designs sensors and helps clients create various sensor-based systems.

The data processing path
To understand the difference between clean and dirty data, it’s important to understand how data gets from point A to point B.

To say that data from sensors undergoes post-processing is an understatement. A basic transducer converts one form of energy to another, with or without assistance from external power, to create either an analog or digital signal. The original conversion stems from real-world analog signals—sound, light, temperature, magnetic forces, pressure, and so forth. Somewhere along the line, whether inside the sensor or on the printed circuit board, the analog signal gets conditioned—or amplified if needed—and converted to a digital signal. After that, the data usually is sent to a microcontroller or some other processor for further filtering through algorithms that clean out the noise and pull the relevant information into a useful form.
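As a minimal sketch of that last filtering stage, digital samples can be smoothed with an exponential moving average to suppress high-frequency noise. The sample values and the alpha parameter below are invented for illustration, not taken from any particular sensor:

```python
def ema_filter(samples, alpha=0.2):
    """Exponential moving average: higher alpha tracks faster but filters less."""
    filtered = []
    state = samples[0]
    for x in samples:
        state = alpha * x + (1 - alpha) * state
        filtered.append(state)
    return filtered

raw = [10.0, 10.2, 9.8, 25.0, 10.1, 9.9, 10.3]   # 25.0 is a noise spike
smooth = ema_filter(raw)
print(smooth)   # the spike is attenuated, the baseline preserved
```

Real designs pick the filter (moving average, Kalman, FIR/IIR) to match the sensor’s bandwidth and noise profile; the point here is only that raw samples rarely go to the application untouched.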

Compute architectures are just beginning to come to grips with this kind of data-first approach, where some data needs to be pre-processed at the edge, while other data can be sent off to more powerful servers to be cleaned up.

“Edge computation is going to be a big play,” said Robert Blake, president and CEO of Achronix. “The fundamentals are all there. We know what all the base building blocks are. We need to figure out how to efficiently move data around in whatever formats, paying attention to the memory hierarchy of how you move the data the least distance to get it to the computation. These are fundamentals to how to get more efficient computing.”


It’s also critical to separate data that needs to be acted upon immediately from data that might be used to identify trends over time, and to remove data that holds no value. This is even harder when you consider that there are many different types of data, and in some cases multiple data types may be required to navigate the physical world or to conclude whether someone is about to suffer a medical emergency.

Data also can start out clean and end up dirty, either through updates or viruses. “Globally, all components need to be as secure as possible, so you want to build trust up from the hardware,” said Helena Handschuh, a Rambus fellow. “Once you securely boot up, the communication data already has some sort of trust. But there are also insecure, unknown components, and that requires intrusion detection and software analysis on larger sets of data. That allows you to see if anything has been corrupted. In an automotive scenario, you want to detect which part is giving you anomalies or weird data. That’s a security issue, but it’s also a safety issue.”

Dirty data needs to be addressed, but where and how it becomes dirty determine the action that needs to be taken. If the sensor itself generates dirty raw data, designers need to take that into account from the start. “Solving a sensor problem requires a lot of domain expertise,” said Kassovic. “It requires knowledge of the sensor at the hardware level, understanding of the data extracted from the sensors and experience with software (algorithm) development.”

For instance, don’t mistake data from an accelerometer with data from a GPS. “An accelerometer only measures acceleration of a body,” she said. “What most fail to understand is that is not a substitute for a GPS, which outputs the absolute position of a body in space. Every single application is unique enough that it requires a unique approach to most optimally extract the correct end metric. I am always perplexed by the number of users who think the data coming from the sensors should look exactly like their college textbook. Real-world sensor data is imperfect. When you open your physics, engineering, or computer science textbook, it is littered with perfect curves of bodies in motion. When you take data from the real world, those same curves are going to look quite different. There is noise and error in the real world.”
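Kassovic’s accelerometer-versus-GPS point can be made concrete with a back-of-the-envelope sketch: even a tiny residual bias in measured acceleration, double-integrated to estimate position, drifts by many meters within a minute. All values below are made up for illustration:

```python
# Why an accelerometer is not a GPS substitute: a small constant bias
# (here 0.01 m/s^2) in acceleration grows quadratically in position
# once integrated twice. Sample rate and bias are illustrative.

dt = 0.01            # 100 Hz sample rate
bias = 0.01          # m/s^2 residual bias after imperfect calibration
velocity = 0.0
position = 0.0
for _ in range(60 * 100):       # one minute of samples
    velocity += bias * dt       # integrate acceleration -> velocity
    position += velocity * dt   # integrate velocity -> position

print(f"position drift after 60 s: {position:.1f} m")   # roughly 18 m
```

This is why inertial sensors report relative motion well but must be fused with an absolute reference (GPS, magnetometer, beacons) for position.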


Understanding data
So how exactly do you deal with dirty data? The first step is to understand and interpret the output of a sensor. Sensor data tends to be relative rather than absolute, and sensor readings in the real world aren’t always perfect. 

Sensor makers see basic issues with noise, filters, and algorithms, and they provide tools to help. On the user end, systems designers and platform vendors dealing with the data can spot valid-looking data that is populating their databases incorrectly, and they too provide a watchful eye and tools to help.

“I see dirty data on the analog side, not on the digital side. Dirty data is noisy data. Noise would be my biggest concern,” said TT Electronics’ Pohlen. “Noise can be induced from lots of different sources. You could have just electrical noise that’s being picked up from your wiring harnesses or caused by components going bad.”

Noise caused by some kind of external influence on the actual sensing mechanism is not dirty data, in Pohlen’s eyes. “You know, for example, it’s a light sensor and you have an ambient source of light. I wouldn’t consider that dirty data because that’s really not truly what you’re trying to measure but it is measuring it correctly.”

Uncalibrated sensors generate more dirty data than calibrated ones. “Dirty data generally refers to computation with raw sensor data that is not calibrated, or data that has a lot of noise on it,” said ST’s Chowdhary. “Besides the physical part of sensors using some phenomenon (like measuring Coriolis acceleration, for example, to detect rotation of a device, a user, or a phone), you have signal conditioning blocks. These signal conditioning blocks operate at different conditions for low-power mode, where the designer’s objective is to minimize the sensor’s current consumption. If you do that, the noise on the sensor data moves up, because the more power you apply to signal conditioning, the cleaner your data is.

“Considering these different aspects, dirty data is sensor data that is not calibrated, sensor data that has been impacted by input of noise, whether the noise is due to purely signal conditioning blocks or from external disturbances,” said Chowdhary. He puts external disturbances, such as a magnetometer being affected by an external magnetic field, into the dirty data category. “You know that data can all be clumped together and categorized as dirty data.”
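A hedged sketch of what calibration does to raw readings: a two-point calibration maps raw ADC counts to engineering units using a gain and offset derived from two known reference points. The counts and temperatures below are invented for illustration:

```python
# Two-point calibration sketch: given the raw counts observed at two
# known reference conditions, derive gain and offset, then apply them
# to subsequent raw readings. Reference values are hypothetical.

def make_calibration(raw_lo, ref_lo, raw_hi, ref_hi):
    gain = (ref_hi - ref_lo) / (raw_hi - raw_lo)
    offset = ref_lo - gain * raw_lo
    return lambda raw: gain * raw + offset

# Suppose the sensor read 120 counts at 0 °C and 880 counts at 100 °C.
to_celsius = make_calibration(120, 0.0, 880, 100.0)
print(to_celsius(500))   # a mid-scale raw reading in engineering units
```

Production calibration is usually more involved (temperature compensation, nonlinearity, per-axis cross-talk), but the principle of mapping raw counts through stored per-device coefficients is the same.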

Even within a single batch, sensors can vary because of manufacturing issues. Once in the field, a sensor can be damaged or blocked. A ground crew can damage a plane’s sensor, even an AOA sensor. Parts go bad or wear out. Sensors need to be recalibrated.

From an enterprise point of view trying to make sense of the data, “in sensor-based device networks, dirty data can be the product of one or many issues. Issues can be caused by, but are not limited to, time series lapses, sensor unit measurement, date/time calibration, inappropriate associations of sensors, improper aggregation of data points across regions, etc. Dirty data could also be as simple as data produced not meeting the business objective, and thus being unstable, unusable, or invalid,” said Pratik Parikh, the director of product marketing at Liaison Technologies, a company that helps put usable data on a platform for enterprises to use.

Others have specific definitions of the term. “Dirty data is well-formed data reported by your devices that is invalid in some way. It doesn’t immediately get flagged as this is garbage that we can’t even interpret,” said James Branigan, the cofounder of Bright Wolf, an IoT system integrator. “You can totally read it in, but you find out at some point that that data is actually completely invalid.”

In the IIoT and IoT, dirty data risks contaminating a company’s data lake and triggering erroneous automated actions. It also wastes money. “The reason it is a problem is because in all these IoT systems, as you look for value in the data and you make programmatic analytics that are going to run over those incoming data values, you are going to connect those analytical outputs to your enterprise system in some way,” Branigan said. “There’s some interesting event that is going to happen as the output of all this. And if you base that interesting event on bad assumptions—dirty data that came in—you get into that classic garbage in, garbage out. Dirty data can cause you real harm where you are starting to incur real economic cost, because these automated actions are being kicked off by data that is not actually valid.”

Branigan sees three dirty data issues. “One, something is physically wrong with the sensor. Either the environment has changed or the sensor is having an error that it cannot detect itself, and it is giving you well-formed but completely garbage data.” The next category involves whether the firmware that runs on the device has software bugs. Even newer versions of firmware “can cause different issues where well-formed data is reported in that is totally erroneous. The third category, which is really nefarious, is where you need very specific knowledge of the machine operations in order to understand how to interpret the data that comes in. Without that knowledge you may interpret a data packet as valid, when some other part of the system did not intend it to be interpreted that way.”
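Some of Branigan’s “well-formed but completely garbage” readings can be caught with plausibility checks: values that parse fine but fall outside physical limits, or change faster than the sensor plausibly could. The thresholds and readings below are illustrative, not from any real device:

```python
# Plausibility-check sketch: flag readings outside [lo, hi] or jumping
# more than max_step between consecutive samples. Note that the sample
# after a spike also trips the rate check, a known quirk of naive
# step filters.

def validate(readings, lo, hi, max_step):
    flags = []
    prev = None
    for value in readings:
        bad = not (lo <= value <= hi)
        if prev is not None and abs(value - prev) > max_step:
            bad = True
        flags.append(bad)
        prev = value
    return flags

temps = [21.5, 21.7, 85.0, 21.9, 22.0]   # 85.0 is well-formed but implausible
flags = validate(temps, lo=-40.0, hi=60.0, max_step=5.0)
print(flags)
```

Checks like these address the first category (physical faults) and part of the third (machine-specific interpretation rules encoded as limits); firmware bugs that emit plausible values remain much harder to detect.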

So is dirty data clear as mud? Perhaps the term is too general to be useful?

Help with cleaning chores
Lots of tools are out there to help clean data. “There are so many great tools out there. Matlab, Labview and Python are the most popular. Our very own MetaWear APIs support filters in all major coding languages. I typically recommend that our users use the tools they are most comfortable with. Python is a great tool because it has many machine learning libraries available that are open source, easy to use, and well documented,” said MbientLab’s Kassovic. MbientLab also uses Bosch’s FusionLab, since it offers a Bosch sensor alongside its own.

Bosch Sensortec, which also provides drivers and libraries for its sensors, wants the sensor system to detect, interpret, monitor, be context-aware, and predict intent, writes Marcellino Gemelli, who is responsible for business development of Bosch Sensortec’s MEMS product portfolio. ST provides libraries, drivers, and tools for setting up sensors, along with microcontrollers that can help streamline design.

Finding the right person with the right expertise goes a long way. “What I firmly believe today is you can’t send a software engineer to do a firmware engineer’s job,” said Kassovic.

On the enterprise side, having a data scientist in the loop to clean data will take too much time. “With machines generating the data, whole new classes of dirtiness can happen beyond human-generated data. That is really what the focus of cleaning your dirty data needs to be,” said Branigan. “There are lots of big data cleaning tools in the big data marketplace, but those are centered around the data scientist. You get a fairly static data set, you need to go and clean it, and you need to go and analyze it to look for something interesting. That approach works well at the rate humans generate data. At the rate machines generate data, that approach doesn’t scale. It’s not even possible. You end up having these ingestion systems that are taking live feeds from the devices, streaming analytics over them, and then hooking those outputs up to some enterprise system so the action happens automatically.”

Moving to digital may help. “Moving toward digital communications definitely helps,” said Pohlen. “You are assuming the sensor is getting good data, and the question is whether the data you’re collecting is noisy because it’s analog. The natural trend is toward digital, where you can have error-checking built in. There is some room for noise in a digital system. If the noise is on the lines, who cares, really, because the signal is either high or low, and then you have some kind of error check to go along with it. If that’s the case, you can just throw the data out.”
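The “error check, then throw the data out” idea Pohlen describes can be sketched with a simple additive checksum; real links typically use a CRC, but the principle is the same. Frame values here are invented:

```python
# Each frame carries a one-byte additive checksum. A receiver verifies
# it and discards corrupted frames rather than trying to clean them.

def add_checksum(payload):
    return payload + [sum(payload) & 0xFF]

def check(frame):
    *payload, chk = frame
    return (sum(payload) & 0xFF) == chk

good = add_checksum([0x10, 0x22, 0x05])
corrupted = good.copy()
corrupted[1] ^= 0x04          # simulate a single bit flip on the wire

print(check(good), check(corrupted))   # True False
```

An additive checksum misses some multi-bit errors; CRCs (e.g., CRC-8 on many sensor buses) catch far more, which is why they dominate in practice.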

“Although raw data may be filtered, compensated, and corrected, in most cases there are definite limits to what a user can do with it,” writes Gemelli in a recent article.

“The first step to overcome these challenges is implementing and integrating proper sanitation tools,” said Liaison Technologies’ Parikh. “These sanitation tools not only have to deal with the quality of data but also with validation of identity, trust, time series, and each data point from the perspective of the project. Each project has unique requirements. The project implementer can and should use common technology features but must be ready to do mass customization as needed to achieve business objectives.”

Liaison Technologies provides data cleansing, filtering, and management, along with de-duplication detection. “One of the key features we provide is tracking of data lineage, which allows us to track the data from its raw introduction to a cleansed, structured format,” said Parikh. “Customers can trace and monitor the data lineage, and if they need to make a course correction, they can replay the data after making appropriate changes to the business logic.”

Redundancy may be a good, yet expensive, solution for safety-critical systems. “Everybody wants to get to a higher ASIL rating, but do they necessarily want to commit to having more sensing?” said TT Electronics’ Pohlen. “Again, it all comes down to it might be correct data, it might be incorrect data, but on the back end, how do you interpret that data? Unless you have some kind of self-diagnostic within your sensor, the best way is redundancy.”
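The redundancy Pohlen mentions is often implemented as a median vote: with three channels measuring the same quantity, one faulty sensor is out-voted, which two channels alone cannot do (with only two disagreeing inputs, as in the AOA example above, the system cannot tell which is wrong). A minimal sketch with invented readings:

```python
# Triple-redundant voting sketch: the median of three channels masks a
# single stuck or wildly wrong sensor. Readings are hypothetical.
from statistics import median

def vote(a, b, c):
    return median([a, b, c])

print(vote(5.1, 5.2, 74.5))   # the faulty 74.5 channel is out-voted
```

Higher-assurance designs extend this to 2-out-of-3 lockstep or add per-channel self-diagnostics so a flagged channel can be excluded entirely.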

Ed Sperling contributed to this report.

Related Stories
Security For MEMS, Sensors
Security is an ongoing issue with ubiquitous MEMS and sensors.
Data Vs. Physics
The surge of data from nearly ubiquitous arrays of sensors is changing the dynamics of where and how that data is processed.


Martin Maschmann says:

What a bad idea it was to run the “nose down” procedure after a stall detection based on only a single pair of sensors. Obviously other sensors were not consulted, such as the artificial horizon, altitude, or gyro.

This is not the first time a sensor failure has caused more than 100 deaths, so how the heck could it happen again?

Doc says:

I may be battling against the tide here, but it’s distracting to see the word data continuously used as a singular.

Data are plural.

Datum is singular.
