Verifying AI, Machine Learning

OneSpin’s CEO looks at what’s needed in this computing space, which technologies are winning, and what the key metrics will be.

Raik Brinkmann, president and CEO of OneSpin Solutions, sat down to talk about artificial intelligence, machine learning, and neuromorphic chips. What follows are excerpts of that conversation.

SE: What’s changing in machine learning?

Brinkmann: There’s a real push toward computing at the edge. Everyone knows how to do the data center now, and you can expect incremental improvements over time with GPUs, CPUs, and DSPs. But if you want to go to the edge, you have power consumption requirements that are orders of magnitude lower at the same or higher data rates.

SE: There’s been a lot of talk about that, particularly for cars where computing needs to be parsed across the sensor, the vehicle’s central brain and in the cloud.

Brinkmann: In many cases, especially in IoT, that’s what people are looking to as the application space. You have requirements for latency, privacy and low power that do not fit the cloud model. You need to do a lot of pre-processing and inferencing locally, which is nothing new. But what is new is that people are thinking about doing the learning on the edge. That means you no longer just do the inference part of the network, which is easier in a sense. You also do the learning because you need context awareness in your devices. If you train your own device, you can make it yours and not the one that everyone has. So it can be trained to recognize the people you know, or whatever is specific to you. It is learning for you. The general consensus is that it’s not viable to put all the data on the cloud, so some of it needs to be done locally. But that requires totally new technology to perform the learning. The fundamental problem is how deep learning is done today. It uses gradient descent for linear regression and back propagation. You need to transpose the matrix that you feed forward in the network. It’s an exact copy of it, transposed. Feeding all the data back, it trickles backwards through the network. That takes a lot of power and it takes ages to do so. There are new ideas about how to perform the learning, replacing the back propagation, which is fundamental to deep learning right now. That’s the interesting piece because it will turn everything on its head when it comes to how the architectures look.
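
The transposition mentioned above can be illustrated with a minimal sketch of a single dense layer in plain Python. The layer size, the values, and the function names are illustrative assumptions and are not taken from any particular framework. The point is only that the backward pass needs the same weight matrix as the forward pass, transposed, plus the stored input.

```python
# Minimal sketch: one dense layer y = W x, showing that the backward
# pass reuses the forward weights in transposed form.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def transpose(matrix):
    """Swap rows and columns."""
    return [list(col) for col in zip(*matrix)]

def forward(weights, x):
    """Forward pass: y = W x."""
    return matvec(weights, x)

def backward(weights, x, grad_y):
    """Backward pass for y = W x, given the gradient of the loss w.r.t. y."""
    # Gradient w.r.t. the input uses the transposed weight matrix.
    grad_x = matvec(transpose(weights), grad_y)
    # Gradient w.r.t. the weights is the outer product of grad_y and x,
    # so the forward input has to be kept around as well.
    grad_w = [[g * xi for xi in x] for g in grad_y]
    return grad_x, grad_w

# Hypothetical values: 2 outputs, 3 inputs.
W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
x = [1.0, 0.0, -1.0]

y = forward(W, x)
grad_y = [1.0, 1.0]  # assume the loss is simply the sum of the outputs
grad_x, grad_w = backward(W, x, grad_y)

print("y      =", y)
print("grad_x =", grad_x)
print("grad_w =", grad_w)
```

Every layer repeats this pattern, which is why the weights and activations have to be moved a second time during training.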

SE: How do you prove this works? We’re dealing with technology that in some cases has lives depending on it. Entire factories will be running on machine learning. But we don’t have any history to show what can go wrong.

Brinkmann: It’s going to be the same approach used for functional safety in automotive. It’s a statistical measure. That’s the only way to do it. There’s no functional verification that you can do because fundamentally what you create is a statistical model. You can only statistically analyze it afterwards.
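
One way to read "a statistical measure" is as a bound on the error rate estimated from a held-out test set. The sketch below is an assumption about what such a measure could look like, not a method named in the interview: it uses a one-sided Hoeffding bound, which is conservative but needs no distributional assumptions beyond independent test samples. The error counts and the 95% confidence level are hypothetical.

```python
import math

def error_rate_upper_bound(errors, trials, confidence=0.95):
    """One-sided Hoeffding upper bound on the true error rate.

    With probability at least `confidence`, the true error rate is no
    larger than the returned value, assuming independent test samples.
    """
    if trials <= 0 or not 0 <= errors <= trials:
        raise ValueError("need trials > 0 and 0 <= errors <= trials")
    observed = errors / trials
    delta = 1.0 - confidence
    margin = math.sqrt(math.log(1.0 / delta) / (2.0 * trials))
    return min(1.0, observed + margin)

# Hypothetical validation run: 12 misclassifications in 10,000 samples.
observed_errors = 12
test_samples = 10_000
bound = error_rate_upper_bound(observed_errors, test_samples)

print(f"observed error rate: {observed_errors / test_samples:.4f}")
print(f"95% upper bound:     {bound:.4f}")
```

The margin shrinks only with the square root of the number of test samples, so tightening the bound by a factor of ten takes roughly a hundred times more test data.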

SE: Do we have enough processing power to actually simulate this? You’re not just simulating one piece. You are simulating the whole system and how it works, particularly if you’re offloading some learning to the edge and up into the cloud.

Brinkmann: Simulation is doable. The whole data center model can be verified. Basically, you’re modeling these with Caffe or TensorFlow. If you want to stay in the cloud, you do the floating point version of it, or just do the whole thing. When you want to go to an edge device, which has lower performance, you want to have an integer version of it. You have to model it, to be safe, in your cloud environment with integer precision as well, and verify it before you put it into the chip. Of course, you can run the chip as a simulator itself, but in the end you need to understand the effects of these modeling questions, like reducing the precision of the internal arithmetic. And you need to analyze that in the lab and make sure it works before you deploy it.
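
The effect of reduced precision can be sketched with a single dot product, the basic operation inside a neural network layer. The code below is a minimal illustration under stated assumptions: it uses symmetric linear quantization, which is one common scheme and not necessarily what any given chip does, and the weights and inputs are hypothetical.

```python
def quantize(values, bits=8):
    """Symmetric linear quantization to signed integers.

    Returns the integer values and the scale that maps them back to reals.
    """
    qmax = 2 ** (bits - 1) - 1               # 127 for 8 bits
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak else 1.0
    return [round(v / scale) for v in values], scale

def float_dot(weights, inputs):
    """Reference result in floating point."""
    return sum(w * x for w, x in zip(weights, inputs))

def integer_dot(weights, inputs, bits=8):
    """Same dot product with integer multiply-accumulate, rescaled at the end."""
    q_weights, w_scale = quantize(weights, bits)
    q_inputs, x_scale = quantize(inputs, bits)
    accumulator = sum(a * b for a, b in zip(q_weights, q_inputs))  # integers only
    return accumulator * w_scale * x_scale

# Hypothetical weights and inputs.
weights = [0.42, -1.30, 0.07, 0.95]
inputs = [1.5, 0.25, -2.0, 0.6]

reference = float_dot(weights, inputs)
print(f"floating point: {reference:.4f}")
for bits in (8, 4):
    approx = integer_dot(weights, inputs, bits)
    print(f"{bits}-bit integer:  {approx:.4f}  (error {abs(approx - reference):.4f})")
```

Running the integer version in the same environment as the floating point version makes the precision loss visible before any hardware exists.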

SE: As you’re training these things, will the system be the same as what you started out with? As you distribute the intelligence across the system, it’s not static. This is like knowledge coherency. Can you really make sure all that data has gotten to where it needs to go intact, and that you have a continuum to the next step?

Brinkmann: Again, it’s statistics. You have to verify it all over again. When you run one cycle of learning, you perform the validation again, and then you deploy it again.

SE: So it’s almost like you have to do constant verification?

Brinkmann: Yes. It’s a big requirement on the infrastructure. You need to compute many things to do that. It’s nothing we have done before in engineering. It’s a question of trust in the end—can you trust the whole framework and chain of things? It will be interesting to see how companies approach that question. If you deploy it on a mission-critical path or application, people will want to know whether they can trust it and why.

SE: Where does OneSpin fit in?

Brinkmann: We have several projects on machine learning. We hired data analysts for the team in R&D. We’re using the technology right now to improve our tools and processes internally, like optimizing recurring tasks. You want to make sure your test runs in the right order and you’re not missing anything. Then, on the next level up, you upgrade the tools. You’re analyzing chips and verification data and building the predictive models out of it so we can make better decisions inside the tool for which engine to run, or how long to run it and what to expect from it.

SE: Your focus is trying to improve your products, not necessarily the process of machine learning, right? So you’re not concerned with pruning data to make the machine learning portion more efficient.

Brinkmann: Exactly. Then there’s the other question of what AI chips will look like and what verification challenges come with them. That’s what we’re looking into most of the time, trying to understand it. That’s why we’re studying the algorithms and technologies to see what the potential chip will look like, what types of markets there will be for AI chips, and what verification challenges will come with them.

SE: What have you found?

Brinkmann: Two things. There’s one path that is ‘more of the same.’ In the data center and other applications, you just scale up GPUs and CPUs to an extent we haven’t seen before. That’s where you get a lot of leverage from existing processes. But it will have challenges going bigger and bigger with resets and integration questions. To some extent, the individual blocks may even get simpler. You can think of some IP blocks that are used in these over and over again to fully verify them formally because they’re important and simple enough, so formal can really scale to the full verification level of a key piece of this chip. And then there’s a completely different path. There are new architectures for chips to address power challenges. The data center is also power-challenged, but not in the same way as IoT devices.

SE: How about power at the edge?

Brinkmann: The closer you get to the edge, the more important power consumption becomes. Even in automotive, you cannot afford a 300-watt compute engine. You have implications that come with it. It’s not just the power itself. It’s the power train that you have to put in place. It has some weight. The car gets bigger and heavier if you put in more wires. And you have to provide cooling for it and bigger batteries. A little more power can have big implications on the rest of the system in the car. That’s why car suppliers are really concerned with power consumption. It’s not that they can’t afford the 300 watts by itself. It’s more about the implications it has on the rest of the system, so they’re trying to get it down to maybe 30 watts.

SE: What is the biggest weight there?

Brinkmann: The wiring.

SE: So what’s the solution?

Brinkmann: To address the power challenge, you have to get the compute to the data. The power consumption in AI is dominated by data movement.

SE: One of the biggest energy problems involves moving large quantities of data, right?

Brinkmann: Yes. There is a massive amount of data with the number of pixels you have in sensors. It’s pixel explosion. You need to reduce the amount of data on the edge down to something that’s relevant. Maybe you want to keep the resolution higher in some places where it’s interesting. That’s where we need AI to recognize what’s important—like a face, another car, a road sign. You can model the data to some extent, and you narrow it down. But you still need to have a chip that’s able to perform this task with low power. So people are trying to get compute into the data. One technology that people are looking at is in-memory processing technology, where you distribute little processing elements throughout the DRAM structure. You’re basically combining logic and memory technology. That hasn’t been done before. The challenges there are massive because the whole process today has been optimized for cost. It’s not made for fitting anything new in there, even if it’s just small IP blocks distributed throughout it. The logic process is on the other end. Now you’re trying to combine both.

SE: Memory is also being added into the network, as well.

Brinkmann: Yes, you also can push local memory toward the compute elements, and at some point it becomes a discussion about what is going where. The fact is you need to bring them closer together. If you move the memory to the compute or the other way around, it may not make a big difference in the end. If you look at the architectures of the TPU (tensor processing unit) from Google or the (Intel) Nervana architecture, they use little processing elements, little DSPs with local memory, and then they have a network so these elements can communicate with each other and pass the data through.


Fig. 1: Intel’s Nervana chip. Source: Intel

SE: So taking all of this into account, what will an AI chip look like?

Brinkmann: It depends who you are and where you are. In the data center, the GPU will dominate for quite some time as the best compromise between flexibility, performance and power. There’s another aspect, which is the flexibility of the programming environment. If you want to be more flexible and more horizontally able to address different types of learning and AI tasks, you need to have a very versatile platform to perform that. The CPU is basically on the top of that list. It can do anything. If you go down to GPUs and DSPs, those become less and less versatile, but you can perform much better on certain tasks.


Fig. 2: Tensor processing unit architecture. Source: Google

SE: And the DSPs and GPUs are cheaper, right?

Brinkmann: Yes. You have a good compromise for the data center there, but for edge devices it will be different. There I see more vertical solutions. Narrowly focused architectures that will work for certain applications, like a specific form of image processing or a specific form of natural language processing, will be streamlined for a specific task. There’s still a lot of flexibility in there to handle different ways of doing things, but it won’t be able to handle completely different architectures of neural networks or new learning schemes. These vertical things would be tied into a certain scheme of AI. That’s actually the interesting piece because this space is moving so fast that it’s hard to say what the right architecture will be. When do you make the decision that you want to put your eggs into this basket and say, ‘I’m going to bet on this architecture and this particular narrow niche and make a chip for it?’

SE: How about FPGA prototypes? Are those still useful?

Brinkmann: I’ve talked to people who try to do AI chips and they say they don’t even bother building an FPGA prototype. They go straight from a simulation model into a hardware implementation, and then they go straight to the fab to get it done. Time to market is so critical to them that they don’t want to wait because the algorithms will be outdated. A specialized chip has a very narrow window of opportunity in time.

SE: So do you still develop individual chips, or do you use some sort of packaged platform where 80% of it’s done and then one or two elements are changed out?

Brinkmann: It would be smart to try that, at least, to add in flexibility. But I’m not sure if it’s possible to combine the power and throughput requirements that you want.

SE: What about FPGAs as a core piece of the final design?

Brinkmann: There are mixed reports on the usability of FPGAs in this space. Some say it’s great because you can reprogram them and do all kinds of things. Others say a GPU is more cost-effective and flexible. FPGA may be too expensive at the scale you need, in some cases, to perform the task. Different people are approaching this differently.

SE: Formal tools will work across any of these, right?

Brinkmann: Yes, and I believe there will be an explosion of different designs and people trying them out. If they’re looking at a short time to market, a formal solution is a must-have. They need to have formal to make sure one particular thing really is right the first time.

SE: Is there a point you get to where you just build a library of all the different things we need to test for using formal?

Brinkmann: I don’t think so. There will always be something new that people think up that you need to address by writing assertions and making sure it works. We can now build several IP blocks or verification IP for specific functions. There are protocol checkers. We have something new, verification IP for floating point units, and that’s really cool. But it targets one particular functionality that many people have, and people may want different precision floating point units. That’s something you can’t foresee and anticipate.

SE: Can you have different chips in AI that work together?

Brinkmann: You still need to verify that the models match, and that’s the tricky part. That’s a verification issue by itself—to make sure that one version, the floating point, matches the integer model.

SE: You need to map it, and the mapping is more of a distribution than a fixed number, and it changes over time?

Brinkmann: Yes. You want to re-verify some key properties of your network. You can’t prove that they are equivalent, because they are not, so you need to have metrics on how the network is performing and what the expectations are. You may even need to retrain it to get the error rate down again.
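
Because the two models cannot be proven equivalent, the comparison comes down to metrics over sample data. The sketch below is a hypothetical example of such a check: a tiny linear classifier is evaluated with floating point weights and with quantized weights, and the decision agreement rate and worst-case output deviation are reported. The weights, the random test inputs, and the `MIN_AGREEMENT` threshold are all assumptions made for illustration.

```python
import random

def quantize_weights(weights, bits=8):
    """Round weights onto a signed integer grid; return them in real units."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0
    return [round(w / scale) * scale for w in weights], scale

def score(weights, x):
    """Linear score; the predicted class is the sign of this value."""
    return sum(w * xi for w, xi in zip(weights, x))

def compare_models(weights, samples, bits=8):
    """Return (decision agreement rate, worst absolute score deviation)."""
    q_weights, _ = quantize_weights(weights, bits)
    agree = 0
    worst = 0.0
    for x in samples:
        f = score(weights, x)      # floating point model
        q = score(q_weights, x)    # reduced-precision model
        agree += (f >= 0) == (q >= 0)
        worst = max(worst, abs(f - q))
    return agree / len(samples), worst

# Hypothetical model and test data (fixed seed for repeatability).
rng = random.Random(0)
weights = [0.42, -1.30, 0.07, 0.95]
samples = [[rng.uniform(-1, 1) for _ in weights] for _ in range(1000)]

MIN_AGREEMENT = 0.99  # hypothetical acceptance threshold
agreement, worst = compare_models(weights, samples)
print(f"decision agreement: {agreement:.3f}, worst deviation: {worst:.5f}")
if agreement < MIN_AGREEMENT:
    print("below threshold: retrain or revisit the quantization")
```

The same check has to be repeated whenever the network is retrained, since the weights and therefore the quantization error change with every learning cycle.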

SE: What about neuromorphic chips?

Brinkmann: No one has come up with something really usable yet, but there’s lots of research going on. If you look at how deep learning is done today, it has nothing to do with how the brain works. The brain doesn’t do back propagation and compute the transposition of things. There’s no symmetric relationship between outgoing neurons and incoming data, and the other way around, to feed back what they should do next time. If you succeed at simplifying the system you potentially can accelerate the learning and have a simpler architecture and model, instead of the back propagation that is being done today. That’s where research is heading. No one has a solution yet, but there’s some good work coming out on probabilistic spatial propagation. It’s a new way of performing learning.
