Rethinking Big Data

Mining data accurately from billions upon billions of sensors will require a different approach to processing that data.


You have to marvel at the sheer genius of what modern-day, edge-of-the-envelope marketing schemes can accomplish. For example, terms such as the Internet of Things (also referred to as the Cloud of Things, the Internet of Everything, or even the Internet of Interconnect) have become sexy, interesting, exciting camouflage layers over the rather dull M2M industry.

The same is about to happen with analytics. It is getting a new suit, shave, and a haircut, and being called “Big Data.” According to Ian Morris, Principal Applications Engineer for RF Connectivity Solutions at NXP Semiconductors, “big data is one of the most popular topics in our worlds.”

There is a lot of noise being made about the IoT and big data, from retail to medicine, defense, homeland security, travel and logistics. And that only scratches the surface. According to Morris, the IoT is of interest to a lot of vendors because of the potential to sell into it: everything from software to networking to silicon. “It is not just one vertical market, and big data represents a tremendous opportunity from a sensor perspective.”

With the sheer volume of data collection devices, once the IoT really exists, the amount of data in the virtual universe will be astronomical – 40 zettabytes, conservatively, by the year 2020. The number of sensors acquiring this data is just as astronomical. No one wants to venture a guess as to the number of sensors, but the numbers being tossed around for IoT devices are between 50 billion and 200 billion. And most devices are brimming with sensors. Smartphones alone integrate accelerometers, compasses, GPS receivers, light and sound sensors, altimeters and more. If one wants to find a prototypical IoT device, this is it.

Smartphones are envisioned as intelligent listening stations that can monitor our health, where we are and how fast we are traveling, our touch, the velocity of our car, the magnitude of earthquakes and countless other things that weren’t even on the radar screen a few years ago. And smartphones are only one of myriad intelligent IoT devices.

Extrapolating from that, if there are only five sensors per intelligent device, and if the 200 billion figure is anywhere near reality, the number of sensors will reach a trillion, and eventually more. With all those sensors collecting all that data, one can understand why there needs to be a revolution in analytics.
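As a quick sanity check on that extrapolation, the arithmetic can be sketched in a few lines. The device counts and the five-sensors-per-device figure are the estimates quoted above, not measurements, and the function name is purely illustrative.

```python
# Back-of-the-envelope sensor count; every figure is an estimate from the text.
SENSORS_PER_DEVICE = 5  # the "only five sensors" assumption


def total_sensors(devices: int, sensors_per_device: int = SENSORS_PER_DEVICE) -> int:
    """Return the total sensor count for a given number of devices."""
    return devices * sensors_per_device


low = total_sensors(50_000_000_000)    # 50 billion devices
high = total_sensors(200_000_000_000)  # 200 billion devices
print(f"{low:,} to {high:,} sensors")
# prints "250,000,000,000 to 1,000,000,000,000 sensors"
```

Even the low end of the device estimates implies a quarter of a trillion sensors.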



Big data vs. traditional analytics
What makes big data a bit different from traditional analytics is how the data are looked at, and what the expected results will be. That distinction is actually credible. With the amount of data being generated, traditional analytics doesn’t have the right tools, nor can it process the data efficiently, even with next-generation supercomputers like Titan and Tianhe-2. The massive amounts of data that need to be analyzed will choke present analytical methodologies, mainly because the analysis needs to be real-time and transparent.

Under the big data umbrella, “every app needs to be an analytical app,” says Gartner Fellow David Cearley. “Everyone who is doing any kind of data analysis has to figure out a way to manage how best to filter the huge amounts of data coming from the IoT, social media and wearable devices, and then deliver exactly the right information to the right person, at the right time. Analytics will become deeply, but invisibly, embedded everywhere.”

For statisticians, big data challenges some basic paradigms. One example is the “large p, small n” problem (here “p” is the number of variables, not a probability or p-value). Traditional statistical analysis generally works with a small number of variables measured on a large number of data points. In that case, the number of variables, p, is small and the number of data points, n, is large. A typical example might be in sales, where a refrigerator comes with a number of different options, including color, ice maker, door accoutrements, drawers, size, and number of doors. That is a decent number of variables, but it is still small compared with the number of consumers polled about what they want.

Big data looks at it from a different direction, and the situation is reversed. An example might be in medicine, specifically cancer research. In a genomics study, the researcher might collect data on 100 patients with cancer to determine which genes confer a risk for that cancer. The challenge is that there are about 20,000 genes in the human genome and even more gene variants. Genome-wide association studies typically look at a half million “SNPs,” or locations on the genome where variation can occur. The number of variables (p = 500,000) is much greater than the sample size (n = 100).

This big data approach is the paradigm shift. When p is larger than n, the number of parameters is huge relative to the information about them in the data, and with traditional analytics a plethora of irrelevant parameters will show up as statistically significant. In classical statistical analysis, if the data contain something that has only a one-in-a-million chance of occurring at random, one can be confident that it is a real effect. But if you analyze the data from a half million places (big data), a one-in-a-million fluke becomes far more likely to turn up somewhere. The trick is to separate relevant findings from chance randomness.
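The scale of the problem can be put in numbers with a minimal sketch. It assumes, for simplicity, that the half million tests are independent, which real genomic data would not strictly satisfy; the function names are illustrative.

```python
import math


def expected_chance_hits(num_tests: int, p_chance: float) -> float:
    """Expected number of purely chance hits across independent tests."""
    return num_tests * p_chance


def prob_at_least_one(num_tests: int, p_chance: float) -> float:
    """Probability that at least one independent test hits by chance alone."""
    # 1 - (1 - p)^m, written with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(num_tests * math.log1p(-p_chance))


places = 500_000   # places examined, as in the SNP example
chance = 1e-6      # a one-in-a-million event

print(expected_chance_hits(places, chance))
print(round(prob_at_least_one(places, chance), 3))
```

Under that independence assumption, the expected number of flukes is about one half, and the chance of seeing at least one is roughly 39 percent. A one-in-a-million event is no longer rare once you look in enough places.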

This is what statisticians call the “look everywhere” effect, and it is one of the issues that plagues big data, because data-driven analysis yields far more results, across a far wider range, than the traditional hypothesis-driven approach.

There are a number of solutions that have been developed to tame this effect. In reality, most data sets, no matter how massive, only have a few strong relationships. The rest is just noise. So by isolating these few significant parameters, the rest can be considered irrelevant. If the one-in-a-million findings fall outside the significant parameters, then they are most likely chance and can be discarded.

The way to do it is fairly simple, and it is a standard mathematical approach in a variety of analytics: setting some parameters to zero. This works well, but requires many passes over the data. By varying which parameters are set to zero and rerunning the analysis, eventually the “thimbleful” of meaningful data will be uncovered.
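One well-known way to set parameters to zero automatically is to add an L1 penalty to an ordinary regression. The sketch below is a minimal, pure-Python illustration of that idea using cyclic coordinate descent. The synthetic data, the penalty value and all function names are assumptions made for the example, not anything prescribed by the sources quoted here.

```python
import random


def soft_threshold(value: float, threshold: float) -> float:
    """Shrink value toward zero; anything within the threshold becomes exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, penalty, sweeps=100):
    """Minimize (1/2n)*||y - Xb||^2 + penalty*||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    residual = list(y)  # equals y - X @ coef while coef is all zeros
    for _ in range(sweeps):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            scale = sum(v * v for v in col) / n
            if scale == 0.0:
                continue
            # Fit of column j to the residual with its own contribution added back
            rho = sum(col[i] * (residual[i] + col[i] * coef[j]) for i in range(n)) / n
            new = soft_threshold(rho, penalty) / scale
            delta = new - coef[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= col[i] * delta
                coef[j] = new
    return coef


# Synthetic data (made up): 10 candidate variables, only the first two matter.
rng = random.Random(0)
n_rows, n_vars = 60, 10
X = [[rng.gauss(0, 1) for _ in range(n_vars)] for _ in range(n_rows)]
y = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0, 0.1) for row in X]

coef = lasso_coordinate_descent(X, y, penalty=0.3)
selected = [j for j, b in enumerate(coef) if b != 0.0]
print(selected)
```

The repeated sweeps over every variable are the "many passes" described above. The penalty zeroes out coefficients that explain too little, leaving only the few strong relationships.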

The problem is that this is computationally intensive and would take a tremendous amount of time with classical statistical hardware and software. Fortunately, technology has come to the rescue. Today, because of advancements in both hardware and software, the approach is feasible.

One of these advancements is called L1-minimization, or the LASSO, introduced by Robert Tibshirani in 1996. One of the places it works well is in the field of image processing, where it enables the extraction of an image in sharp focus from a lot of blurry or noisy data. There are others, such as the false discovery rate (FDR), proposed by Yoav Benjamini and Yosef Hochberg in 1995. Rather than trying to rule out every false finding, FDR accepts that a certain percentage of the findings flagged as significant will be false and keeps that percentage below a chosen level. Subsequent analysis can then be done on the flagged findings to determine which ones hold up.
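The false discovery rate idea is usually applied through the Benjamini-Hochberg step-up procedure, which can be sketched as follows. The p-values in the example are hypothetical and chosen only to show the mechanics.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return indices of the hypotheses rejected at the given false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * fdr
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cutoff
    return sorted(order[:cutoff_rank])


# Hypothetical p-values from ten tests
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_values, fdr=0.05))  # prints [0, 1]
```

For comparison, a plain Bonferroni cutoff of 0.05/10 = 0.005 would keep only the first finding. The FDR approach trades a small, controlled share of false discoveries for more power to detect real ones.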

The third dimension
Most statistical analysis, up to now, has been in two dimensions – n and p. Big data adds a third one – time. Big data analysis within the IoT will be in real time. Data will have to be analyzed on the fly and decisions made on the fly. And these data will be of a whole new type – images, sounds, signals, time-relative measurements, and infinite-space measurements. Such data are not only effectively infinite, but complex. They may require analysis in a geometrical or topological plane, or a three-dimensional paradox.

One of the more interesting applications of this new dimension is Web analytics. The pressure on Web companies to deliver meaningful results to clients so they can “sell” their services is relentless. Such companies benefit greatly from accurately predicting user reactions in order to produce specific user behaviors (e.g., clicking on a client-sponsored advertisement).

This is a perfect big data analysis case. The n will be huge (a million clicks, for example). The p may be large as well (thousands of variables or more: which ad, where, how often, etc.). Because n is much larger than p, classical analysis can in theory be used, except for the time factor. In many cases, the algorithms may have only milliseconds to respond to a click, with another click right behind the first one, and so on. Therefore, these algorithms have to adapt constantly to the input from the user (rotating ads, for example).
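To make the "adapt on every click" idea concrete, here is a minimal sketch of an epsilon-greedy ad rotator. This is a simple textbook technique chosen for illustration, not the method any particular Web company uses. The ad names, click probabilities and class name are all made up.

```python
import random


class AdRotator:
    """Epsilon-greedy ad chooser with constant-time updates per event."""

    def __init__(self, ad_ids, explore_rate=0.1, seed=None):
        self.shows = {ad: 0 for ad in ad_ids}
        self.clicks = {ad: 0 for ad in ad_ids}
        self.explore_rate = explore_rate
        self.rng = random.Random(seed)

    def click_rate(self, ad):
        shown = self.shows[ad]
        return self.clicks[ad] / shown if shown else 0.0

    def choose(self):
        # Occasionally explore a random ad; otherwise exploit the best so far.
        if self.rng.random() < self.explore_rate:
            return self.rng.choice(list(self.shows))
        return max(self.shows, key=self.click_rate)

    def record(self, ad, clicked):
        # Two counter increments, so the model keeps pace with the click stream.
        self.shows[ad] += 1
        self.clicks[ad] += int(clicked)


# Simulated stream; the "true" click probabilities are invented for the demo.
true_rates = {"ad_a": 0.02, "ad_b": 0.05, "ad_c": 0.10}
rotator = AdRotator(true_rates, explore_rate=0.1, seed=1)
user = random.Random(2)
for _ in range(20_000):
    ad = rotator.choose()
    rotator.record(ad, user.random() < true_rates[ad])

most_shown = max(rotator.shows, key=rotator.shows.get)
print(most_shown)
```

Nothing is refit from scratch here. Each event costs a couple of counter updates, which is what makes a millisecond response budget plausible.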

An elegant solution to this challenge, across the Web, is to use massively parallel processing across banks of computers. What is interesting is that this approach combines the holy grail of computing, speed, with the holy grail of statistics, analysis. In the end, such a solution actually works fairly well. Rather than delivering the correct answer every time but taking too long, this approach delivers the right answer most of the time, quickly.
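The structure of such a parallel computation can be sketched on a toy scale: each worker summarizes its own shard of the click log, and the small summaries are then merged. In this sketch, threads stand in for the separate machines, so it shows the shape of the approach rather than any real speed-up. The shard contents and function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def count_clicks(shard):
    """Map step: summarize one shard as (clicks, impressions)."""
    return sum(shard), len(shard)


def parallel_click_rate(shards, workers=4):
    """Reduce step: merge per-shard summaries into one overall click rate."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_clicks, shards))
    clicks = sum(c for c, _ in partials)
    shown = sum(n for _, n in partials)
    return clicks / shown if shown else 0.0


# Toy log: 1 = click, 0 = no click, split across four hypothetical machines.
shards = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0], [1, 0, 0, 0]]
print(parallel_click_rate(shards))  # prints 0.1875
```

Because each shard is reduced to two numbers before anything is combined, the merge step stays cheap no matter how large the shards grow.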

Privacy – the sore thumb
Readers of this site are well aware of the looming security issues of the IoT. The depth and breadth of recent breaches remind us only too well of how vulnerable our data is. There is a wide range of approaches to protecting big data, and traditional means of data security don’t always work efficiently. Thus, new approaches are being developed.

“One example where protecting big data is critical is with oil pipeline monitoring,” says Chowdary Yanamadala, senior vice president of business development at ChaoLogix. “Every so many feet (it varies with the pipeline), they have a flow monitor that senses a number of parameters about the oil flow, such as pressure, density, flow rate. The sheer volume of data from all of the sensors is staggering and securing it is critical. But because of the magnitude of this ‘big data,’ securing the data itself becomes tricky. A lot of security means a lot of overhead, and that can bog down this type of M2M data collection. An approach we found works well is to secure the authentication and use verification techniques to ensure the data hasn’t been compromised.”
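One lightweight way to verify that sensor data has not been altered is to attach a keyed message authentication code to each reading instead of encrypting the whole stream. The sketch below uses HMAC-SHA256 from the Python standard library. It illustrates the general idea only and is not a description of ChaoLogix's technique. The key, the sensor fields and the function names are hypothetical.

```python
import hashlib
import hmac
import json


def sign_reading(reading: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over a canonical JSON encoding of the reading."""
    payload = json.dumps(reading, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_reading(reading: dict, tag: str, key: bytes) -> bool:
    """Compare tags in constant time; False means the reading or tag was altered."""
    return hmac.compare_digest(sign_reading(reading, key), tag)


# Hypothetical shared key and sensor fields, for illustration only.
key = b"demo-key-not-for-production"
reading = {"sensor_id": "flow-0042", "pressure_kpa": 5120.5, "flow_rate": 3.2}
tag = sign_reading(reading, key)

tampered = dict(reading, pressure_kpa=4000.0)
print(verify_reading(reading, tag, key), verify_reading(tampered, tag, key))
# prints "True False"
```

The tag adds a fixed 32 bytes per reading (64 hex characters), which keeps the overhead small compared with wrapping every measurement in heavier protection.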

Protecting the voluminous amount of big data, and doing so in real time, will take some novel solutions. Traditional anonymizing doesn’t scale well as n and p increase. Network-like data pose a special challenge for privacy because so much of the information has to do with relationships between individuals, which change on the fly.

There are some bright spots on the horizon. One developing technology is “differential privacy,” a methodology that adds carefully calibrated noise to results and, in effect, turns privacy into a commodity: users can purchase just as much protection for their data as they need. But by and large, securing big data is still in its infancy.
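In practice, differential privacy is commonly achieved by adding random noise to the answer of a query, with a parameter (usually called epsilon) that sets how much privacy is "bought" at the cost of accuracy. A minimal sketch of the standard Laplace mechanism for a counting query follows. The records, the threshold and the function names are invented for the example.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) noise, drawn as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Release a count with epsilon-differential privacy.

    Adding or removing one record changes a count by at most 1 (sensitivity 1),
    so Laplace noise with scale 1/epsilon is sufficient.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(42)
ages = [23, 35, 41, 29, 67, 52, 38, 71, 19, 44]  # made-up records
noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng)
print(round(noisy, 2))  # the true count is 5; the released value is 5 plus noise
```

A smaller epsilon means more noise and stronger privacy, while a larger epsilon gives more accurate answers with weaker protection. That dial is what lets users choose how much privacy to pay for.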

There is little doubt that big data will be the backbone of information in the IoT. Big data is new to a lot of sectors: not the data, nor the collection, but the analysis. Also, new types of data are coming on the scene, and new methodologies will be required to analyze them.

One of the largest challenges will be mining meaningful statistics in real time and from multiple vectors simultaneously. Doing this will require a merging of scientific, analytical, computational and mathematical practices. New approaches will be required, as well as different perspectives on what is being analyzed.

Statistical analysis is a powerful tool that can, with some degree of certainty, glimpse into the future. With big data and the IoT, and next-generation statistics, we can understand and direct the effects of logistics, medicines, the weather, infrastructures, economics, environments, finances…the list goes on and on. Statistics and analytics will have the power to save and improve lives, increase reliability and lower costs, and improve an unlimited set of things and processes.