A San Francisco-based startup has come up with a novel approach to the big data analytics challenge by leveraging GPUs.
It is now well known that, thanks to the latest innovations in parallel programming, graphics processing units (GPUs) can be harnessed to deal with the enormous data sets regularly encountered in applications ranging from ADAS, artificial intelligence, and gaming to deep learning, scientific computation, and high-performance computing.
But how exactly do you find what you need in a sea of data?
This is where MapD comes into the picture. Founded by CEO Todd Mostak, MapD offers a database and visualization software platform for tackling big data, and touts that its software enables queries of multi-billion-row data sets in milliseconds via GPUs.
The SQL database, as well as the visual rendering of results, runs on any of Nvidia’s off-the-shelf GPU cards. The company’s IP lies in how it uses those GPUs for data analytics.
Mostak believes a new computing paradigm is emerging. “GPUs have been popular for some time for things like HPC, but they are really expanding into a lot of interesting areas: machine learning and data analytics.”
“MapD is a very fast relational database and visual analytics system that leverages the parallel power of GPUs to get huge speed-ups,” Mostak noted. “One of the issues with CPU servers is that they are essentially core-limited, but MapD can use up to 40,000 GPU cores per server, which gives a tremendous amount of parallel performance for any kind of data exploration or data discovery task, so things that used to be compute-bound or memory-bound can now be done interactively on these GPUs.”
The way Mostak created MapD is an interesting story in itself. He was doing graduate work in Middle Eastern Studies at Harvard and trying to analyze hundreds of millions of Tweets: mapping each Tweet to the Egyptian census district it came from, comparing the Tweets with posts on various Islamist forums, and scoring users on how politically Islamist they leaned. He says he has always been a geek about computers and had taken programming classes, but didn’t major in computer science. Halfway through his graduate degree he realized he wanted to pursue it, and used his electives to take computer science courses, thereby discovering his true calling. That led him to MIT, where he took a database course under database gurus Mike Stonebraker and Sam Madden.
It was then that he realized people in many industries are awash in a sea of data that they really don’t have visibility into. “They have plenty of storage; storage has become a commodity, so it’s easy to get stuff on disk, but then what? You have terabytes or petabytes of data, and you really can’t see what’s going on. Having a tool that allows for interactive exploration of data seems pretty powerful for many companies who, if they are using CPU systems, are either CPU-bound or memory-bound and unable to really explore this data.”
MapD knew it had to address the caution often expressed about GPUs, namely the difficulty of programming them, along with the challenge of getting what you actually need out of a dataset.
Mostak explained, “We are using this relatively exotic technology, commodity but still exotic, in the GPUs. One of the things I wanted to do is make it very approachable and usable by common business analysts. Essentially, we married two things: a SQL database built on top of GPUs. SQL is the lingua franca of the analyst world. On top of that, we built a very intuitive drag-and-drop interface (MapD Immerse). You don’t even have to know you’re using GPUs behind the scenes; it just kind of magically works. It doesn’t need special tuning; the application takes care of that.”
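To make that concrete, from the analyst’s side a query against MapD is just ordinary SQL. The snippet below is a minimal sketch using the pymapd Python client; the table name, columns, and connection details are hypothetical placeholders rather than anything from MapD’s documentation.

```python
# Minimal sketch: running plain SQL against a GPU-backed MapD database.
# Assumes the pymapd client; credentials and the "tweets" table are made up.
import pymapd

con = pymapd.connect(
    user="analyst",
    password="***",
    host="localhost",
    dbname="mapd",
)

# Standard SQL goes straight to the GPU-backed engine; no GPU-specific
# tuning is exposed to the analyst.
cur = con.cursor()
cur.execute(
    "SELECT census_district, COUNT(*) AS n "
    "FROM tweets GROUP BY census_district ORDER BY n DESC LIMIT 10"
)
for district, n in cur.fetchall():
    print(district, n)
```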
A second premise MapD kept in mind was that multiple past attempts to use GPUs took the wrong architectural approach. “A lot of those systems would use just one GPU and keep the data on the CPU side, so when a query was sent, it would come into the CPU side, the data would move over the PCI bus to the GPU, the query would run, and then the data would move back. Once you did that, any advantage that you would get by running the query on the GPU was largely lost in the time it took to move it back and forth. We decided to make GPUs first-class citizens,” he asserted.
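The cost of that round trip is easy to see in a toy example. The following sketch is not MapD’s code; it simply contrasts shipping a column over the PCIe bus on every query with keeping the column resident on the GPU, assuming CuPy and an NVIDIA card. The column size and the query itself are made up.

```python
# Conceptual sketch of per-query PCIe transfers versus GPU-resident data.
import numpy as np
import cupy as cp

host_col = np.random.rand(50_000_000)       # column sitting in CPU memory

def query_with_transfer(col):
    """Naive approach: copy the column host -> device for every query."""
    gpu_col = cp.asarray(col)                # bulk PCIe transfer each time
    return float((gpu_col > 0.99).sum())     # the actual work is small by comparison

# "GPUs as first-class citizens": copy once, then query many times.
resident_col = cp.asarray(host_col)          # data stays in GPU memory

def query_resident(col):
    return float((col > 0.99).sum())         # no bulk transfer on the query path

print(query_with_transfer(host_col))
print(query_resident(resident_col))
```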
Further, instead of using one GPU, MapD uses up to 16 GPUs per server, which adds up to a lot of memory across those GPUs: up to a quarter terabyte of DRAM. That DRAM serves as a hot cache, keeping as much compressed data as possible resident on the GPUs, so when queries come in the data is already there, hence the speed.
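One common way to stretch that hot cache, sketched below, is to dictionary-encode low-cardinality columns into small integer codes so far more rows fit in GPU DRAM. This is only a rough illustration of the compressed-cache idea, not MapD’s implementation; it assumes CuPy and an invented country-code column.

```python
# Rough illustration: a dictionary-encoded column cached once in GPU memory.
import numpy as np
import cupy as cp

# Low-cardinality string column: encode it into 1-byte codes instead of strings.
countries = np.array(["EG", "US", "EG", "SA", "EG", "US"] * 1_000_000)
dictionary, codes = np.unique(countries, return_inverse=True)
codes = codes.astype(np.uint8)

gpu_codes = cp.asarray(codes)                # compressed column kept "hot" on the GPU

def count_country(label):
    """Queries run against the already-resident compressed codes."""
    code = int(np.searchsorted(dictionary, label))
    return int((gpu_codes == code).sum())

print(count_country("EG"))
```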
“If you can’t ask the questions that you want because it’s too slow, you’re only going to ask the questions that you can get answered in the time you have, which may be very limiting,” Mostak pointed out. “There’s a bit of a psychological effect, too. If analysts always have to go out and get a cup of coffee whenever they want to run a query on their data, it’s very worrisome, and eventually they will just do the minimum so that they can get their job done. And there’s often a notion of risk. Not asking the right questions, or asking the right questions but taking a downsampling approach, both involve a lot of risk, in that neither provides a full picture of the data.”
While not specifically targeted at the semiconductor design industry, it seems a perfect fit for fast, exploratory querying of any design dataset: power/performance/thermal analysis, simulation, and emulation, to name a few. Then on the back end, one could imagine use in the semiconductor fab on any kind of structured log data to determine, for example, the factors correlated with low chip yield and defects.
MapD customers currently run the gamut: Verizon, social media companies, hedge funds, the U.S. government, and companies in the IoT space with significant sensor data pouring in.
For all of the nitty-gritty details, check out MapD’s website.