What Will The Next-Gen Verification Flow Look Like?

Experts at the Table: Machine learning is an essential element for dealing with complexity and shorter design cycles, but it may require a different mindset for engineers.


Semiconductor Engineering sat down to discuss what’s ahead for verification with Daniel Schostak, Arm fellow and verification architect; Ty Garibay, vice president of hardware engineering at Mythic; Balachandran Rajendran, CTO at Dell EMC; Saad Godil, director of applied deep learning research at Nvidia; and Nasr Ullah, senior director of performance architecture at SiFive. What follows are excerpts of that conversation.

SE: What do you see as problems in verification today that will have to be solved in the future? Or looked at differently, what comes next that isn’t here today, and what doesn’t come next if you can’t get there?

Garibay: There’s been a continuing challenge of just exponentially more complex designs. We continue to try to raise the level of abstraction modularly. The new thing is not how we use AI to verify things, but how we verify chips that are intended to execute AI in neural networks. That’s a discontinuity that is significantly challenging.

Rajendran: There needs to be a big cultural change in the way verification works in every organization. Some of the choices are technology choices, but many of the choices made for a particular technology are very human choices. It’s the people who decide what tools and workflows they use, and that often depends on what they’re familiar with and what they’ve done before. So there needs to be a change in the process. I cringe whenever I talk to a verification architect or somebody who says, ‘This is how we have done it all along. I’ve taped out about 200 chips in my life, so who are you to tell me what I’m supposed to do?’ There needs to be a change in the thought process. We need to look at verification more holistically.

Ullah: Verification is growing and getting stronger, but there are a number of issues that are coming up. The first is that there are shorter and shorter design cycles. When I was at a large consumer electronics company, we had to bring out a new, complex chip for a phone every single year. That’s a very tight cycle to be able to get a chip with performance and functionality improvements every year. As the cycle gets shorter, we have less time to do all the proper things with all the tools for verification. The second problem is that as designs get more complex, people want all kinds of variations of SoCs. At SiFive we do cores for open source, and people want a lot of different SoCs around that. To deal with these variations you need a very flexible verification system. That leads to a third problem. All the data generated from this is huge. You don’t have enough people to analyze all this data. That’s where I see some of the AI techniques may be useful. Machine learning must be able to look at trends and find problem spots. Integrating some of these algorithms will be very useful in the verification flow. And last, we’re getting a lot more performance verification — not just functional verification — and some of that requires us to work with much more complex systems. With a phone, we found that after running it for some time the performance started to degrade. What was happening was that the reset for branch tables was not working. We never would have found that without running it for days at a time, though. That’s where we need to start integrating emulation, simulation and verification, so we can see what’s happening on a larger scale.
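For readers who want a concrete picture of the trend analysis Ullah alludes to, here is a minimal, illustrative sketch in Python. The record format, window size, and threshold are assumptions rather than part of any SiFive flow; it simply flags tests whose recent pass rate has drifted away from their long-term baseline.

```python
from statistics import mean

# results[test_name] = list of 1/0 pass flags, ordered oldest to newest (hypothetical format)
def find_problem_spots(results: dict[str, list[int]], window: int = 20,
                       drop_threshold: float = 0.15) -> list[str]:
    flagged = []
    for test, history in results.items():
        if len(history) < 2 * window:
            continue                             # not enough history to compare trends
        baseline = mean(history[:-window])       # long-term pass rate
        recent = mean(history[-window:])         # pass rate over the last N regressions
        if baseline - recent > drop_threshold:   # the test is starting to degrade
            flagged.append(test)
    return flagged

# Example: a test that used to be solid but has started failing intermittently
history = {"branch_predictor_stress": [1] * 40 + [1, 0, 1, 0, 0, 1, 0, 1, 0, 0] * 2}
print(find_problem_spots(history))               # ['branch_predictor_stress']
```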

Schostak: Along with the increase in complexity, complexity is taking a new form. It’s no longer individual solutions, but combined solutions — even at the IP level. So you’re not necessarily talking about just needing a CPU for a specific application. You may have something off to the side, which does something separately. So now you’re talking about the CPU plus accelerators together providing that overall performance or that functionality you want to see. From that perspective, you need to build more into individual units rather than keeping things nicely separated, and you can’t always do the normal functional decomposition.

Godil: The increase in complexity has been exponential. We’ve been building more and more complex systems. A lot of this has been driven both by the designs being more complex and by time-to-market pressures causing us to do things in an incredibly short amount of time. What is not going to be sustainable is how these systems are managed. The decisions that we make in our verification processes are all done today by humans relying on their intuition. At some point the system will be too complex to be managed by humans. And so the verification of tomorrow will look very much like the verification of today, except that you’ll have a new important component — a sort of intelligent system that can manage and make all these decisions on the fly. That also will allow you to offload some of the human decision-making to machines using deep learning and machine learning. The other thing we’re going to see follows from the fact that what these intelligent systems can do depends on the data we feed them. A lot of work will need to be done in instrumenting and exposing the data used in our verification process to enable intelligent systems to help us out in verification.

SE: There are two different sides of this. One is that we’re developing AI chips. The second is we’re starting to add in AI/ML/DL into the verification of all chips. But there also is overlap between all of these areas, because some of the most complex chips have an element of AI in them. So how do these worlds go together, and how do we know we have sufficient coverage in verifying these very new and constantly optimizing architectures?

Garibay: I’m a doubter when it comes to the application of true AI to chip design. Google has put something out there about using AI to do placement. Arm has previously published some papers about using AI to guide verification. The problem is we generally don’t have a lot of labeled data to learn from. Arm can take advantage of it because they have test suites and all that from previous generations. It’s really hard to see how others do it for real machine learning, as opposed to heuristic-driven approaches — what we would have called machine learning in the ’80s. So there are all sorts of things you can do. People are doing regression management to try to sort through 1,000 tests and say, ‘Okay, here’s the 20 that failed, and these look like machine glitches, these look like real problems, and this looks like a problem I’ve seen before.’ That seems like it’s an achievable goal, and people are doing it with real tools today. It’s a management help to do filtering and that kind of stuff. But taking the true machine learning techniques to guide verification is different. And even though I’m working on AI, I don’t see how it plays well.
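As a rough illustration of the regression triage Garibay describes, here is a hedged sketch that buckets failing tests by a normalized error signature, so infrastructure glitches and previously seen failures can be filtered before engineers look at the rest. The patterns, signatures, and log format are invented for the example and do not come from any particular tool.

```python
import re
from collections import defaultdict

INFRA_PATTERNS = [r"license", r"disk quota", r"connection reset"]   # assumed glitch markers
KNOWN_SIGNATURES = {"ASSERT fifo_overflow at <N>"}                  # previously triaged bugs

def signature(log_line: str) -> str:
    # Normalize numbers and hex addresses so similar failures collapse into one signature
    return re.sub(r"0x[0-9a-fA-F]+|\d+", "<N>", log_line.strip())

def triage(failures: dict[str, str]) -> dict[str, list[str]]:
    buckets = defaultdict(list)
    for test, log_line in failures.items():
        if any(re.search(p, log_line, re.IGNORECASE) for p in INFRA_PATTERNS):
            buckets["machine glitch"].append(test)
        elif signature(log_line) in KNOWN_SIGNATURES:
            buckets["seen before"].append(test)
        else:
            buckets["new problem"].append(test)
    return dict(buckets)

print(triage({
    "test_dma_burst":  "ERROR: connection reset by peer",
    "test_fifo_depth": "ASSERT fifo_overflow at 0x3f80",
    "test_mmu_walk":   "Mismatch: expected 42, got 17",
}))
```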

Godil: I’ll take a counter position to that. Where we’re at today, I share your viewpoint. It’s going to be a really hard problem trying to get deep learning to work. You mentioned supervised learning and the need for labels. In deep learning a couple of years ago, that was by far the most popular technique. There was ImageNet, there was a supervised data set, and we were able to show really great returns. But back then, that was really the main success story. I don’t think we had a lot of other big success stories for unsupervised learning. Since then, with conversational AI or natural language processing, people now have basically gone ahead and scraped the Internet with no labeled data, and they’re able to build some really amazing complex systems. So this whole area of unsupervised learning has really taken off. The other thing that’s also happened is that we’ve seen a lot of very successful deployment of reinforcement learning. This is the AlphaGo system that DeepMind showed. The cool thing with reinforcement learning is that there’s no labeled data. You basically give the agent a black box that it gets to play with, and over time it learns the rules for that black box and learns how to optimize it. A lot of the problems that we’re looking at internally in chip design look to us like reinforcement learning problems, where there’s a black box. Throw in simulation and now I have this black box that I can run as many times as I want. So if you think about a testbench, for example, I can generate infinite data, run my testbench as many times as I want, and let the agent interact with that. And over time, it will figure out what it needs to go do.
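To make the black-box framing concrete, here is a toy, bandit-style sketch of coverage-driven learning. The “testbench” is a stub that stands in for a simulator run, the stimulus knobs and reward scheme are invented, and none of this reflects Nvidia’s actual implementation; the reward is simply the number of newly hit coverage bins.

```python
import random

COVERAGE_BINS = 100

def run_testbench(knob: int, seed: int) -> set[int]:
    # Stand-in for a simulation run: returns the coverage bins this run happened to hit
    random.seed(hash((knob, seed)))
    return {random.randrange(COVERAGE_BINS) for _ in range(5 + 3 * knob)}

def epsilon_greedy_coverage(knobs=range(8), episodes=200, eps=0.2):
    hit, value = set(), {k: 0.0 for k in knobs}
    for episode in range(episodes):
        # Explore occasionally; otherwise pick the stimulus knob with the best average reward
        knob = random.choice(list(knobs)) if random.random() < eps else max(value, key=value.get)
        new_bins = run_testbench(knob, seed=episode) - hit
        reward = len(new_bins)                        # reward = coverage gained by this run
        hit |= new_bins
        value[knob] += 0.1 * (reward - value[knob])   # incremental value estimate update
        if len(hit) == COVERAGE_BINS:
            break
    return len(hit), value

print(epsilon_greedy_coverage())
```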

SE: How about constantly evolving software that’s co-designed with the hardware?

Godil: Going back to unsupervised learning with natural language processing systems, one of the biggest opportunities that AI has in terms of helping with chip design that we’re not taking advantage of today is code changes. We know that if I change a line of code, that is going to impact where my next bug is coming from, and it’s going to impact what performance degradations I might see. Ideally you want to customize for that. Today, if there’s a really big change, like if I rewrite the scheduler, various engineers can decide to change the balance of runs that are targeting the scheduler, and they’ll manually make some big macro changes. But it’s impossible for us to fine-tune this for every single check-in that happens, and we can have hundreds or thousands of check-ins every day. If we can build AI agents that can understand code changes, and understand what those code changes mean, that could be a great way to feed into some sort of an intelligent system that can then customize the flow to take advantage of what has changed during the day. That’s really the big opportunity that’s sitting in front of us, but it’s an incredibly difficult opportunity. So it’s not going to happen tomorrow. But I do think it’s possible.
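A hedged sketch of what per-check-in customization could look like in its simplest form: map the files touched by a commit to impacted units and boost their share of the nightly regression. The hand-written dependency map is exactly the part a learned model would have to infer from code and bug history, which is the hard problem Godil is pointing at; all names and numbers here are illustrative.

```python
def reweight_regression(changed_files, dep_map, base_weights, boost=3.0):
    # Units whose source files were touched by this check-in
    impacted = {unit for unit, files in dep_map.items() if files & changed_files}
    weights = {unit: w * (boost if unit in impacted else 1.0)
               for unit, w in base_weights.items()}
    total = sum(weights.values())
    return {unit: w / total for unit, w in weights.items()}   # normalized share of runs

dep_map = {"scheduler": {"rtl/sched.sv"}, "lsu": {"rtl/lsu.sv"}, "fpu": {"rtl/fpu.sv"}}
base = {"scheduler": 1.0, "lsu": 1.0, "fpu": 1.0}
print(reweight_regression({"rtl/sched.sv"}, dep_map, base))
# The scheduler gets ~60% of tonight's runs instead of ~33%
```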

Rajendran: I’m working with customers trying to solve three problems. With AI, you have to define exactly what you want to solve. The first problem is the DUT problem. Because there are a lot of random variables being passed to the DUT, you have to determine whether it’s really going to hit all of your coverage. ‘Can I get the right results? Will I be able to find a bug?’ So instead of going through constrained-random, you are using reinforcement learning. We’ve seen good results in the initial stages. The second part is the predictability of the verification. Many times in a typical workflow the user wants to have a design review, and the teams are very distributed. They want to predict when a job will finish, whether there is an 80% likelihood it will take 4 hours, or whether it will take 10 hours. We are having really good results in predicting when a particular set of jobs will finish — and it’s dynamically changing, so you can query the model anytime. This helps build to the next step. In every organization you have a regression system, which is the verification engine that runs every night. We are changing the paradigm so we have a separate agent whose job is to make sure that the chip is good, it’s verified, it runs okay, and it’s going to keep running all the time until the code is done. That’s where the adaptive build comes in. If I check in this particular module file, what are all the dependencies that it’s going to impact? What is the test propagation? What is the error propagation? For some of these you start from an initial framework, but later on we build the model by tying into the JIRA system and into the code check-ins. Then, in the company I’m currently partnering with, every bug that’s open is tied to a particular version of the code base at runtime. Everything has a major correlation to it. And that’s quite exciting too. This all seems like science fiction, but it’s very close to producing really good results. The person who started this engagement at our company is a pessimist. He said, ‘I don’t believe in this stuff. It’s hype. But I’m going to try this.’ That’s how it started. And we’re very excited by the results.
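For the run-time prediction piece, a minimal sketch is shown below: it reports the duration within which a job’s past runs have finished a given fraction of the time. The data and the 80% confidence level are illustrative, and a production system of the kind Rajendran describes would also condition on queue load, test count, and what changed in the code.

```python
from statistics import quantiles

def eta_at_confidence(history_hours: list[float], confidence: float = 0.8) -> float:
    # Empirical quantile of past durations as a crude "done by this time with 80% likelihood"
    cuts = quantiles(history_hours, n=100)
    return cuts[int(confidence * 100) - 1]

past_runs = [3.1, 3.4, 3.8, 3.9, 4.0, 4.2, 4.3, 5.0, 6.8, 9.5]   # hours, made-up history
print(f"80% of past regressions finished within {eta_at_confidence(past_runs):.1f} h")
```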

Schostak: I’m seeing similar results with reinforcement learning for things like targeting coverage and targeting tests, which are the straightforward ones everyone talks about. It’s more than simple test ranking. I wouldn’t necessarily call it AI. It falls within the machine learning bracket. The thing to remember is, if you actually get most of what you want with some sort of machine learning algorithm, does it actually make sense to spend an awful lot more time trying to see whether you can get results that are 1% better? One of the difficulties is that throughout the lifetime of a project your RTL verification environment and your DUT are changing, so at a certain point your model’s fidelity is going to drop and you’re going to have to redo the training. So you’ve got the tradeoff between, ‘Am I spending more cycles getting my machine learning up to speed, compared with the time I’m saving by using it to optimize regression?’
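Schostak’s tradeoff can be put in back-of-the-envelope form. The sketch below compares the compute a test-selection model saves as its fidelity decays against the cost of retraining it; the linear decay assumption and all numbers are invented for illustration.

```python
def retraining_worth_it(hours_saved_per_night: float,
                        fidelity: float,              # current model accuracy, 0..1
                        decay_per_night: float,       # fidelity lost as RTL and testbench change
                        retrain_cost_hours: float,
                        horizon_nights: int = 30) -> bool:
    # Compute saved if we keep the stale model and let its fidelity decay
    saved_without = sum(hours_saved_per_night * max(fidelity - decay_per_night * n, 0.0)
                        for n in range(horizon_nights))
    # Compute saved if we retrain now (assume retraining restores full fidelity)
    saved_with = hours_saved_per_night * horizon_nights
    return saved_with - retrain_cost_hours > saved_without

print(retraining_worth_it(hours_saved_per_night=6, fidelity=0.7,
                          decay_per_night=0.02, retrain_cost_hours=40))   # True
```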


