Architecting For AI

Experts at the Table, part 1: What kind of processing is required for inferencing, what is the best architecture, and can they be debugged?


Semiconductor Engineering sat down to talk about what is needed today to enable artificial intelligence training and inferencing with Manoj Roge, vice president, strategic planning at Achronix; Ty Garibay, CTO at Arteris IP; Chris Rowen, CEO of Babblelabs; David White, distinguished engineer at Cadence; Cheng Wang, senior VP engineering at Flex Logix; and Raik Brinkmann, president and CEO of OneSpin. What follows are excerpts of that conversation.

L-R: Manoj Roge, Ty Garibay, Chris Rowen, David White, Cheng Wang, Raik Brinkmann

SE: How do we get AI into use as quickly as possible? Obviously inferencing is a huge part of it. What is the best way to understand it?

Rowen: The key to understanding the basic idea of deep learning or neural networks is that you have a large, trainable, mathematical model of some hidden system. Instead of programming the behavior of this complicated function, you train it by providing lots of examples, typically millions of examples covering a large number of inputs. What’s the right answer? What is the way the function should behave? The mechanisms of deep learning automatically train that function so that it becomes a very effective mimic of whatever it is you’re trying to model: the human brain, the behavior of physics, whatever it is. And there are really two distinct dimensions of such a thing in real use.

First, there is the training process, where you iteratively and exhaustively keep exposing this network to those inputs and gradually evolve the mathematical function so that it matches as closely as possible. That typically is a very compute-intensive task. It requires heavy-duty compute under most circumstances. It’s often done in the cloud, and today it’s often done on GPUs because the software’s there and the compute resources are there. But when you put it to work, you don’t need to iteratively train it. You just evaluate the function. You provide the inputs, it’s already trained, and it produces its best approximation of the right answer. It’s deterministic. It is a computational function, which is actually conceptually very simple. It’s usually a kind of layer cake of multiplies and adds and nonlinear functions, and you go through layer after layer of this, but it’s highly deterministic. It’s conceptually simple and computationally hard, but it is something that you can run now in real time on lots of different kinds of hardware, though enormous progress is being made on the hardware to make it even faster, even lower power, and even cheaper.
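As a rough illustration of the "layer cake of multiplies and adds and nonlinear functions" described above, here is a minimal sketch of inference in plain Python. The weights in `LAYERS` are made-up values standing in for an already-trained network, and ReLU is assumed as the nonlinearity; neither comes from the discussion.

```python
def relu(values):
    """Nonlinear step: clamp negative values to zero."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Multiply-and-add step: one weighted sum per row of weights."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


# Hypothetical trained parameters: a list of (weights, biases) per layer.
LAYERS = [
    ([[0.5, -1.0], [1.5, 2.0]], [0.0, -1.0]),
    ([[1.0, -2.0]], [0.5]),
]


def infer(inputs):
    """Evaluate the fixed function layer by layer. No training happens here."""
    activations = list(inputs)
    for index, (weights, biases) in enumerate(LAYERS):
        activations = dense(activations, weights, biases)
        if index < len(LAYERS) - 1:  # no nonlinearity after the last layer
            activations = relu(activations)
    return activations


print(infer([1.0, 2.0]))
```

Because the parameters are fixed, calling `infer` twice with the same input returns the same result, which is the sense in which evaluation is deterministic.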

Garibay: That’s the key. What we’re here to talk about is the rapid evolution of inference hardware, most critically in terms of reducing power consumption. The real opportunity for dramatic growth in inference is at the edge, where there are either battery constraints or thermal envelope constraints. The limitation, for the most part, on deploying inference in the millions and billions at the edge is power efficiency.

White: I’ve been working in and out of the AI/machine learning space since the early ’90s, and there’s a huge graveyard of products to which people tried to apply machine learning and AI. That graveyard is a curse, because it’s fairly easy to come up with a great idea and an algorithm, and to create a cool prototype. Then you have this big chasm between there and scalable deployment of the product that uses machine learning. There’s a chasm with everything that you buy, but especially with machine learning and AI. My observation is that one of the reasons for that huge chasm is that a lot of the solutions are very deterministic, and I think the key to success is being able to build flexibility, adaptability, and learning into these approaches. If you take the example of IoT, I’ve talked with IoT vendors and they can make a number of IoT sensors work perfectly in a lab. But you take them out of that lab and put them in a factory in the Philippines, and all of these unforeseen environmental conditions that they never really envisioned impact those sensors. So, in general, building flexibility and adaptability into those systems is something that machine learning and deep learning have a lot of trouble doing. At the same time, we have to get past thinking of learning as deterministic in terms of how we build real-time learning and adaptive systems. I think that’s going to be the real key to better bridging that gap.

Wang: Are you saying to train at the same time that inference is performed because inference by nature is a non-deterministic outcome? Training, on the other hand, is hugely complex.

Garibay: One of the things that I think is going to have to happen is we have to be able to tune the training, to some extent, on site. There’s algorithmic work being done to figure out how I do the base-level training and then add on without having to redo it all.
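One way to picture "base-level training and then add on without having to redo it all" is to freeze a base model and fit only a small correction on top of it using on-site data. The sketch below is an illustrative stand-in, not the algorithmic work referred to above: `base_model`, the linear `scale`/`offset` head, the sample data, and the learning rate are all assumptions.

```python
def base_model(x):
    """Frozen, hypothetical base model trained off site. Never updated here."""
    return 2.0 * x


def mean_squared_error(samples, scale, offset):
    """Average squared error of the corrected prediction on the samples."""
    errors = [scale * base_model(x) + offset - target for x, target in samples]
    return sum(e * e for e in errors) / len(errors)


def fit_head(samples, steps=2000, learning_rate=0.02):
    """Fit only the small on-site correction (scale, offset) by gradient descent."""
    scale, offset = 1.0, 0.0  # start from "trust the base model as is"
    n = len(samples)
    for _ in range(steps):
        grad_scale = 0.0
        grad_offset = 0.0
        for x, target in samples:
            z = base_model(x)
            error = scale * z + offset - target
            grad_scale += 2.0 * error * z / n
            grad_offset += 2.0 * error / n
        scale -= learning_rate * grad_scale
        offset -= learning_rate * grad_offset
    return scale, offset


# Hypothetical on-site measurements: the local environment adds an offset of 3.
ON_SITE_SAMPLES = [(float(x), 2.0 * x + 3.0) for x in range(5)]

scale, offset = fit_head(ON_SITE_SAMPLES)
print(mean_squared_error(ON_SITE_SAMPLES, 1.0, 0.0))
print(mean_squared_error(ON_SITE_SAMPLES, scale, offset))
```

Only two parameters are adjusted on site, so the cost of the local update is tiny compared with retraining the base model.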

Wang: If you are wrong, how do you correct and add to the training?

White: There are two different ways to do it. There are people who will do a local calibration or retraining on site to the environmental conditions, but the conditions in the factory, for example, are always going to be changing. So I see larger promise in building systems that can expand and adapt.

SE: Where should these algorithms be run? On the edge, or in the cloud?

Roge: First, as has been said already, there are two components, training and inference. Then you’ve got to add another dimension, which is whether you do it in the cloud, where you get a lot more compute power, or on the edge. For the edge, there are two criteria for why you’d want to do it there. One is the availability of network bandwidth, and the second is the real-time aspect. In autonomous driving, you can’t wait. You have to sense the real-time traffic conditions. Even a fraction of a second could lead to an accident, so even the training would need to happen at the edge. Another good example is the factory. All of these mission-critical systems must be trained on the edge. The way we look at it, and we will get into the architectures, is that there will be use cases where you need to do training on the edge, and that gets into the precision requirement. You may not need double-precision floating point there, whereas for a lot of complex workloads in the cloud you may need that kind of precision, and GPUs are good at it.

Today’s landscape is that GPUs are primarily used for training, and CPUs for inference in the cloud. But that’s going to change, and we’ll get into that in the next topic, which is the architecture and what you need for inference.

Brinkmann: You touched on the choice for inference between being on the edge or being in the cloud. I want to add one more vector that you need to consider for performing inference on the edge, which is privacy and security. It is a big concern that the data actually leaves your premises and goes through whatever channel into some other premises. You may want to keep the raw data local for that reason, not just because of bandwidth limitations or latency, which is a big concern. The privacy and security of the data is the primary problem in some applications, so that is one more reason to look at the edge.

Garibay: It’s definitely the marketing case for our friends at Wave Computing. We’ll see if it takes.

Rowen: There are also fundamental bandwidth reasons and cost reasons, especially when you look at high data rate sensors, and cameras are the poster child for high data rate. You simply cannot imagine that the world’s 20 billion CMOS sensors are all going to send their data to the cloud. You do the arithmetic on it, whether it’s network bandwidth or cloud storage or cloud compute, and you definitely get the wrong answer. So it’s quite clear, for all of the reasons (latency, privacy, security, cost, power), that at least some of the processing, and in fact most of the computational cycles, have to take place closer to the edge for these high data rate devices. If you have a thermometer that produces a byte every minute, it doesn’t matter where you do it. But when you talk about video processing, and to a lesser extent audio, you have a whole force of gravity which is pulling that back towards the edge.
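The arithmetic mentioned above can be sketched in a few lines. The sensor count of 20 billion and the thermometer's one byte per minute come from the discussion; the per-camera rate of 1 Mbit/s is purely an assumption chosen for illustration, and real video streams vary widely.

```python
SENSOR_COUNT = 20_000_000_000          # "20 billion CMOS sensors" (from the text)
CAMERA_BITS_PER_SECOND = 1_000_000     # assumption: 1 Mbit/s compressed video
SECONDS_PER_DAY = 86_400
MINUTES_PER_DAY = 1_440


def aggregate_bits_per_second(sensor_count, bits_per_second_each):
    """Total upstream bandwidth if every sensor streamed to the cloud."""
    return sensor_count * bits_per_second_each


def camera_bytes_per_day(sensor_count, bits_per_second_each):
    """Total bytes arriving per day from continuously streaming cameras."""
    total_bits = aggregate_bits_per_second(sensor_count, bits_per_second_each)
    return total_bits * SECONDS_PER_DAY // 8


def thermometer_bytes_per_day(sensor_count):
    """Total bytes per day if each device sends one byte every minute."""
    return sensor_count * MINUTES_PER_DAY


cameras = camera_bytes_per_day(SENSOR_COUNT, CAMERA_BITS_PER_SECOND)
thermometers = thermometer_bytes_per_day(SENSOR_COUNT)
print(cameras, thermometers, cameras // thermometers)
```

Under these assumed numbers the cameras would generate roughly 216 exabytes per day, against under 30 terabytes per day for the same number of thermometers, which is why the location of processing matters for one and not the other.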

Roge: I just wanted to give some statistics on a point that Chris made. There will be 30 to 50 billion connected devices by 2020. And today, less than a percent of the data that these connected devices spew out is analyzed. There is going to be a lot of data. It is doubling every two years, and it’s just not economical to send it to the cloud to process when there are processors at the edge.

Brinkmann: One interesting trend that I wanted to add on the inference side is that you see inference being performed on the same data in different ways. Take the edge as an example. If you have a particular sensor to prevent someone breaking into your house, or a sensor that should actually recognize specific people, you don’t want to run the inference algorithm for recognizing a specific person all the time, for power reasons, for example. So what you will have is a camera image sensor, and you analyze the data to see if there’s any movement there. It’s a very simple thing. Once that is true, you kick off the next algorithm: Is there actually a person? That is another algorithm, which is maybe a little more power intensive than the first one. Once you see there is someone there, you run the face recognition, which is a lot more compute intensive and power intensive. But if you layer that up, you can actually run things like that on a battery for a very long time because you’re not constantly running the very heavy inference algorithm. That’s an interesting approach where you can make this AI edge computing really something that you can deploy on a battery-based device. As there are so many billion connected devices, you have to power them, and I can’t imagine having 20 million or 20 billion wires going all around the world to patch up all these little IoT devices. Most of them will be battery powered, and that’s where you come up with these kinds of solutions to say, okay, how can we extend the battery life from a few days to a few weeks to 5 or 10 years?
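The layering described above (movement, then person, then face) can be sketched as a cascade in which each stage runs only if the cheaper one before it fires. This is a minimal sketch under stated assumptions: the `Stage` class, the energy figures in microjoules, and the dictionary flags standing in for real detectors are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Stage:
    name: str
    energy_uj: int                            # hypothetical energy per run
    detect: Callable[[Dict[str, bool]], bool]  # stand-in for a real detector


def run_cascade(frame: Dict[str, bool], stages: List[Stage]) -> Tuple[List[str], int, bool]:
    """Run stages in order and stop at the first one that does not fire.

    Returns the names of the stages that ran, the energy spent,
    and whether every stage fired.
    """
    ran: List[str] = []
    energy = 0
    for stage in stages:
        ran.append(stage.name)
        energy += stage.energy_uj
        if not stage.detect(frame):
            return ran, energy, False
    return ran, energy, True


# Cheapest check first, most expensive last (all numbers are made up).
STAGES = [
    Stage("motion", 100, lambda f: f["motion"]),
    Stage("person", 5_000, lambda f: f["person"]),
    Stage("face", 200_000, lambda f: f["known_face"]),
]

quiet_frame = {"motion": False, "person": False, "known_face": False}
busy_frame = {"motion": True, "person": True, "known_face": True}
print(run_cascade(quiet_frame, STAGES))
print(run_cascade(busy_frame, STAGES))
```

A frame with no movement costs only the first stage, so if most frames are quiet the expensive recognizer is rarely paid for.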

SE: This relates to the architecture. Will there need to be specific architecture choices depending on where an algorithm is processed? What are the choices? What is the best way to think about the architecture approach?

Rowen: I have sort of a model, and it builds exactly on what Raik is saying, which I would call a cognitive hierarchy. It really says that at the bottom, you want as simple processing as you can because that’s the always-on part of it. Then, if that finds something of interest, where something of interest might be as simple as movement, it escalates to the next level, which may be another process or another processor in the same chip. That level has a more sophisticated process that consumes more power per bit of data that goes into it, but it’s called far less often. It then does a level of processing, and so, even today, you may have three or four levels of that kind of process. The top of the pyramid may well live in the cloud, because if it’s face recognition, you only identify that something is really a face maybe twice a day, and even your battery-powered device may be able to afford actually communicating with the outside world twice a day. The goal is a system that behaves as smart as the smartest element and dissipates power like the smallest element, and hopefully not the other way around.
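The power argument behind this hierarchy is a duty-cycle calculation: average power is the sum, over levels, of active power times the fraction of time that level is running. The sketch below uses invented figures for the three levels and an assumed 3,000 mWh battery, so only the shape of the result is meaningful.

```python
# (level name, active power in mW, fraction of time active): all hypothetical
LEVELS = [
    ("always-on motion detect", 1.0, 1.0),
    ("person detect", 50.0, 0.01),
    ("face recognition", 1000.0, 0.0001),
]

BATTERY_CAPACITY_MWH = 3000.0  # assumed battery size, for illustration only


def average_power_mw(levels):
    """Duty-cycle-weighted average power of the whole hierarchy."""
    return sum(power * duty for _, power, duty in levels)


def battery_life_hours(capacity_mwh, power_mw):
    """Idealized battery life, ignoring self-discharge and conversion losses."""
    return capacity_mwh / power_mw


hierarchical_mw = average_power_mw(LEVELS)
# Monolithic alternative: the most capable level runs all the time.
monolithic_mw = max(power for _, power, _ in LEVELS)

print(hierarchical_mw, battery_life_hours(BATTERY_CAPACITY_MWH, hierarchical_mw))
print(monolithic_mw, battery_life_hours(BATTERY_CAPACITY_MWH, monolithic_mw))
```

With these made-up numbers the hierarchy averages about 1.6 mW, close to the always-on level alone, while running the top level continuously would draw 1,000 mW.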

I think this is the natural way that you think about the architecture. Then you get into the details of, for your given algorithm, what are those layers that you can identify and how obsessed are you about power? Because clearly today the hierarchical system is more complex to design than the monolithic system. And so if you’re really pushing the envelope on either intelligence or power, you’re going to be inclined to take special action, but especially if you’re trying to save power or cost at the edge, I think this investment in building hierarchy is going to pay off and you’ll see a lot more of it.

White: To Chris’s point, the data that’s most valuable will have to make its way to the cloud.

Garibay: From the hierarchy point of view, there’s a hierarchy of the data as well. Part of the smarts, and one of the reasons that Intel bought Mobileye, is that Mobileye is able to gather a large amount of driving data given their installed base. But how do they know which data is worth saving? It’s the same problem Tesla has. There has to be enough processing in that car at the edge to say, ‘Oh, that was interesting. Let’s send that back to HQ,’ because you can’t be dumping the entire data set anymore. So there’s a hierarchy of data: very high bandwidth close to the sensors, then process, process, process, getting to lower and lower bandwidth at each level, and then making a choice to send at high bandwidth this or that example of information that, for some reason, this edge device decided might be useful for the rest of time. Now the data center, the cloud, may decide, ‘I’ve seen that before. I don’t need it.’ But the edge device will make its own choice and try to optimize its use of the local bandwidth.
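One simple way an edge device might "make its own choice" is to score each recorded event for interest and upload the highest-scoring ones that fit within a bandwidth budget. The sketch below is a hypothetical greedy policy, not a description of how any vendor does it; the `interest` scores, event sizes, and budget are invented.

```python
def select_for_upload(events, budget_bytes):
    """Greedily pick the most interesting events that fit the upload budget.

    Each event is a dict with hypothetical keys: "id", "interest"
    (higher means more worth sending), and "size_bytes".
    """
    chosen = []
    remaining = budget_bytes
    for event in sorted(events, key=lambda e: e["interest"], reverse=True):
        if event["size_bytes"] <= remaining:
            chosen.append(event["id"])
            remaining -= event["size_bytes"]
    return chosen


# Made-up events recorded by an edge device during one reporting period.
EVENTS = [
    {"id": "a", "interest": 0.9, "size_bytes": 600},
    {"id": "b", "interest": 0.2, "size_bytes": 100},
    {"id": "c", "interest": 0.7, "size_bytes": 500},
]

print(select_for_upload(EVENTS, budget_bytes=1000))
```

In this example event "c" is skipped because it no longer fits after "a" is chosen, while the smaller "b" still does. The cloud side could then discard anything it has already seen.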
