Applying Machine Learning To Chips

The goal is to improve quality while reducing time to revenue, but it’s not always so clear-cut.


The race is on to figure out how to apply analytics, data mining and machine learning across a wide swath of market segments and applications, and nowhere is this more evident than in semiconductor design and manufacturing.

The key with ML/DL/AI is understanding how devices react to real events and stimuli, and how future devices can be optimized. That requires sifting through an expanding amount of data and using automation to identify complex patterns, anomalies, and what works best where.

“We’ve been gathering data that we use for developing our own methodology,” said Mike Gianfagna, vice president of marketing at eSilicon. “For memories today, we look at the design, the memories, and model different memory configurations. You can run models on it against real implementations. So we take a generic memory model that is parameterized and map it to real memory. We also look at manpower and schedules from previous designs, and do the same for compute resources and EDA licenses. If you only own 12,000 CPUs and you need 24,000, you need to use a cloud-based solution for that bubble. But that doesn’t happen quickly. You have to plan for it, and a lot of that is around memory.”
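
As a rough illustration of that kind of planning, a simple model fit to a few prior projects can project peak CPU demand for a new design and flag how large a cloud “bubble” to budget for. The project records, features and 12,000-CPU pool below are hypothetical stand-ins, not eSilicon’s actual methodology.

```python
# Hypothetical sketch: project peak compute demand for an upcoming design
# from records of past projects, and flag how big a cloud "bubble" is needed
# when the in-house pool (12,000 CPUs here) won't cover the peak.
import numpy as np

OWNED_CPUS = 12_000

# Invented records from past designs:
# (gate_count_millions, memory_instances, peak_cpus_used)
past_projects = [
    (150, 300, 6_000),
    (320, 700, 11_500),
    (500, 1_100, 18_000),
]

def fit_demand_model(projects):
    """Least-squares fit of peak CPU demand against design size."""
    X = np.array([[g, m, 1.0] for g, m, _ in projects])
    y = np.array([cpus for _, _, cpus in projects])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def plan_compute(gate_count_millions, memory_instances, coeffs):
    """Return projected peak CPUs and the cloud burst beyond the owned pool."""
    projected = float(np.dot(coeffs, [gate_count_millions, memory_instances, 1.0]))
    return projected, max(0.0, projected - OWNED_CPUS)

coeffs = fit_demand_model(past_projects)
projected, burst = plan_compute(450, 1_000, coeffs)
print(f"Projected peak: {projected:,.0f} CPUs; cloud burst needed: {burst:,.0f}")
```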

This is just the beginning of a wave of activity to shrink design cycles and reduce potential problems based upon experience.

“The whole industry is learning about how to build and debug systems using artificial intelligence and machine learning,” said Michael Sanie, vice president of marketing for the Verification Group at Synopsys. “There is a lot of modeling and simulating about which stack can be used for AI algorithms. The goal here is to use artificial intelligence in tools as well as outside of tools.”

To a large degree, this is a recognition of just how complex system-level design has become—even the tools need help from other tools.

“Right now, as the EDA ecosystem works on machine learning, we’re trying to figure out how that can help us with larger verification problems, and it comes back to, ‘It’s not the 10 transistors I’m working on early on,’” said Steven Lewis, a marketing director at Cadence. “I’m good with that. It’s the 1 billion transistors that make up that memory. It’s how to route that, and routing in the physical layout world. What is the best way to lay out that circuit, to place those components? That’s always been a little bit of a machine learning task. We didn’t call it machine learning, but it was really in the algorithms that we would all try to develop for figuring out what the best approach was going to be. If you can start working with certain topologies, if you start getting an understanding of the way a 7nm transistor is going to act and behave, you know a lot about that going in, so you can use much better judgment as to when you start placing them, when you start routing them, and when you start to analyze them.”
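
The idea Lewis describes, learning from prior layouts so a tool can judge candidate placements before committing to them, can be sketched with a generic regression model. Everything here is synthetic: the features, the training data and the congestion score are invented stand-ins, not Cadence’s algorithms.

```python
# Synthetic sketch of a learned placement-quality proxy: train on features of
# past placed-and-routed blocks, then rank candidate placements of a new block.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Invented features per placement: [cell_density, avg_net_length, macro_overlap]
# and a congestion score from past layouts (synthetic here).
X_train = rng.uniform(0.0, 1.0, size=(200, 3))
y_train = (2.0 * X_train[:, 0] + 1.5 * X_train[:, 1] + 3.0 * X_train[:, 2]
           + rng.normal(0.0, 0.1, 200))

model = GradientBoostingRegressor().fit(X_train, y_train)

# Candidate placements of a new block, described by the same features.
candidates = rng.uniform(0.0, 1.0, size=(50, 3))
scores = model.predict(candidates)
best = np.argmin(scores)
print("Best candidate:", candidates[best], "predicted congestion:", scores[best])
```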

Some time ago, chipmakers began pushing for six-sigma designs, but they stopped discussing it once complexity reached the point where running the number of simulations required to achieve six-sigma quality simply took too long. But with demands by automakers for defect-free electronics, talk of six sigma has returned, and the only way to get there in a reasonable amount of time is by leveraging machine learning.

“With machine learning, I can program in the behavior of the transistors so I don’t have to run 10 million simulations, statistically speaking,” said Lewis. “I can use algorithmic and machine learning techniques to determine the minimum amount of simulation I need to run, but it still gives me the accuracy I’m looking for. And if I can program more into the algorithms, if I can program more of that behavior into what I’m working on, then I can take in that data and do a good job. That’s where machine learning can help us.”
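
A minimal sketch of the statistical idea, with a surrogate model standing in for most of the runs: execute a limited budget of “real” simulations, fit a cheap model of the response, then estimate the failure tail from millions of inexpensive model evaluations. The simulate() function, variation parameters and spec limit below are invented for illustration.

```python
# Sketch of surrogate-assisted Monte Carlo: run a limited budget of "real"
# simulations, fit a cheap response model, then estimate the failure tail from
# many inexpensive model evaluations instead of millions of full simulations.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)

def simulate(params):
    """Stand-in for an expensive circuit simulation: delay as a function of
    threshold-voltage and channel-length variation (purely illustrative)."""
    vth, leff = params
    return 1.0 + 0.8 * vth**2 + 0.5 * leff + 0.05 * rng.normal()

# Step 1: a few thousand real simulations instead of 10 million.
train_params = rng.normal(0.0, 1.0, size=(2_000, 2))
train_delay = np.array([simulate(p) for p in train_params])
surrogate = GradientBoostingRegressor().fit(train_params, train_delay)

# Step 2: estimate the tail failure rate from cheap surrogate evaluations.
mc_params = rng.normal(0.0, 1.0, size=(2_000_000, 2))
pred_delay = surrogate.predict(mc_params)
spec = 3.0  # invented timing spec
print("Estimated failure rate:", np.mean(pred_delay > spec))
```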

This is happening across the design world already to some extent, but its use will grow significantly in the future.

“What this technology allows you to do is explore more than people ever could in the past,” said Ty Garibay, CTO at ArterisIP. “You can use this technology to figure out where the holes are. If you still can’t prove something, you can let it be antagonistic. With the traditional form of functional safety, you put two of the same thing down, which is what you see in automotive braking systems, engine control and airbag control. But as we move to drive-by-wire, basically you’re creating a server in the car. You’re consolidating function into a system and communicating by wire. A 777 jet does that today, but there’s a big cost/function difference. To make this work in a car we’re going to need to develop new techniques and figure out how to apply them, and replace techniques that were verifiable in the past but too expensive and slow.”

Mountains of data
That will require sifting through an enormous volume of data.

“In Silicon Valley almost everybody is good at getting some form of data,” said Anush Mohandass, vice president of marketing and business development at NetSpeed Systems. “The guys who are really good are the ones who go through that chain, and understand that chain of data from insight to action. We’ve put that intelligence back into our design environment and our IP. Machine learning is one aspect of it.”

Mohandass said a key aspect of utilizing machine learning involves mining that data in the first place. “What is the training data that you use and how wide is it? Are there biases to it? Are you biased toward one form of design or another? We spent a lot of time getting our training set up to speed and without any bias. A second aspect is once you’ve got the data, how do you transform it into insight? Third is our machine learning environment. How does that drive action, both in terms of our design strategies as well as how can we drive action to our customers? When you’re generating the data, especially if you’re feeding into a machine learning engine, the magnitude of the data explodes on you. You’re not looking at hundreds and thousands of lines. You’re looking at millions of lines. So suddenly just looking at graphs of a million things won’t make any sense, so you’re trying to cluster and look at trends.”
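
The “cluster and look at trends” step can be sketched generically: instead of eyeballing a million rows, group them and inspect a handful of cluster summaries. The per-transaction metrics below are invented and the data is random noise, purely to show the mechanics rather than NetSpeed’s environment.

```python
# Sketch of "cluster, then look at trends": reduce a huge table of mined
# metrics to a handful of cluster summaries instead of plotting every row.
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Stand-in for millions of per-transaction rows: [latency, bandwidth, hops].
rows = rng.normal(size=(1_000_000, 3))

scaled = StandardScaler().fit_transform(rows)
km = MiniBatchKMeans(n_clusters=8, random_state=0).fit(scaled)

for label in range(km.n_clusters):
    members = rows[km.labels_ == label]
    print(f"cluster {label}: {len(members):>8} rows, "
          f"mean latency {members[:, 0].mean():+.2f}")
```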

A very practical use of machine learning within the context of data mining is to search for anomalies.

“We’ve done a lot of local smarts so what comes out is high value, good content,” according to Rupert Baines, CEO of UltraSoC. “Today that mostly means an engineer is looking at it, reading through it, and graphing it. At some point, she’ll say, ‘That looks odd,’ and will run some scripts to try and find out why it looks odd.”

This can accelerate an engineer’s efforts by calling attention to items they might otherwise miss, such as a pattern of behavior that has changed suddenly.

“This might be things like, in a camera application, pixels that aren’t changing,” said Baines. “You say, ‘That looks odd.’ You’d expect pixels to change. That probably means there’s a camera failure, and it’s stuck-at. It might be a security application where one particular process is never supposed to access secure memory, and it tries to do so, and the anomaly detector will say, ‘That’s not right, sound an alarm.’ The nice thing about that last one is that we will sound the alarm. It’s like a burglar alarm. If someone was pacing around outside your house at night, you’ve got a light sensor and it will detect them moving, turn on a light and sound an alarm — even if they haven’t actually tried to break in. Even if your lock still works perfectly and they couldn’t get into the house, you know somebody is trying something. It’s an additional layer of security on top of TrustZone or OmniSecure or what have you.”
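
Both checks Baines mentions, pixels that never change and a process touching memory it should never touch, reduce to simple rules over monitored data. The sketch below is a generic illustration with invented frame data and access events, not UltraSoC’s on-chip implementation.

```python
# Two simple anomaly checks: pixels that never change across frames (a likely
# stuck-at camera fault) and a process touching a secure address range it is
# not allowed to access. Frame data, addresses and process IDs are invented.
import numpy as np

def find_stuck_pixels(frames):
    """frames: array of shape (n_frames, height, width).
    Returns a boolean mask of pixels that never changed."""
    return np.all(frames == frames[0], axis=0)

def check_secure_access(events, secure_range, allowed_ids):
    """events: iterable of (process_id, address) tuples. Returns every access
    to the secure range by a process that is not on the allow-list."""
    lo, hi = secure_range
    return [(pid, addr) for pid, addr in events
            if lo <= addr < hi and pid not in allowed_ids]

frames = np.random.randint(0, 256, size=(30, 4, 4))
frames[:, 0, 0] = 128                      # inject one stuck pixel
print("Stuck pixels at:", np.argwhere(find_stuck_pixels(frames)))

events = [(1, 0x1000), (7, 0x8004), (2, 0x8010)]
print("Alarms:", check_secure_access(events, (0x8000, 0x9000), allowed_ids={7}))
```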

ML and simulation
Another area where machine learning is being used is to determine how repeatable simulation will be if it is run the following month, or if it is compared to other simulators. This applies to manufacturing test, as well, where results need to be correlated cycle by cycle, signal by signal.

“The challenge with machine learning is you are giving up the human ability to control things,” said Mark Olen, a product marketing manager at Mentor, a Siemens Business. “That’s the whole point. If I run my simulation over my thousand CPUs using an ML technique with Portable Stimulus, it will produce a set of results based on the way the design responds because it learns from the design’s response. But let’s suppose that after I run for a few hours and we find two bugs in the design, we send that design back to the designers and they fix their Ethernet block or fix the fabric arbitration scheme or whatever, and run the simulation again. If you run the simulation again using machine learning, you won’t get the same result. You won’t get precisely the same cycle-to-cycle correlation because the design is behaving differently, and it should be, because they fixed it. However, this causes a lot of uncertainty with engineers because they say they want to run the exact same stimulus, under the exact same conditions, and the fact is you changed the design. Because of this, we put a switch in some of our technology that could essentially shut down part of the machine learning ability so that it could run in a mimicking mode of something that had been done previously. At the same time, there are some leading-edge customers that are comfortable with this concept and are using the full capabilities.”
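
The trade-off Olen describes can be illustrated with a toy stimulus selector, which is not Mentor’s implementation: in learning mode it reweights scenarios based on what the design’s responses expose, so a rerun after a fix naturally diverges, while a “mimic” switch freezes the weights and the seed so a run can be reproduced. The scenario names and weighting scheme are invented for the example.

```python
# Toy adaptive stimulus selector: in learning mode it biases future stimulus
# toward scenarios that exposed new failures, so reruns differ once the design
# changes; the "mimic" mode freezes learning and the seed for reproducibility.
import random

SCENARIOS = ["ethernet_burst", "fabric_arbitration", "dma_stress", "reset_storm"]

class StimulusSelector:
    def __init__(self, learn=True, seed=42):
        self.learn = learn
        self.rng = random.Random(seed)
        self.weights = {s: 1.0 for s in SCENARIOS}

    def next_scenario(self):
        names = list(self.weights)
        return self.rng.choices(names, weights=[self.weights[n] for n in names])[0]

    def report(self, scenario, new_failures):
        if self.learn:                       # mimic mode never updates weights
            self.weights[scenario] += new_failures

adaptive = StimulusSelector(learn=True)
replay = StimulusSelector(learn=False)       # fixed seed, fixed distribution
for _ in range(5):
    s = adaptive.next_scenario()
    adaptive.report(s, new_failures=1 if s == "fabric_arbitration" else 0)
    print("adaptive:", s, "| replay:", replay.next_scenario())
```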

With design teams running simulation over and over, this gets into the data mining of off-the-chip metrics. Here, what has been useful both internally at Mentor and for its customers is an open-source software ecosystem called Jenkins. “This is a really hot topic right now, and everyone likes it because it’s free,” said Olen. “But it’s not completely free because even though it is an open source ecosystem, you have to invest as a user to actually provide integration. We’ve put a lot of investment into integrating our system into the Jenkins environment.”

One of the primary advantages of Jenkins is its ability to act like a trigger, he said. “You can have a timing trigger that says when everyone is going home on Friday night and there are 10,000 desktop computers sitting idle, let’s use them. So at 9 p.m. on Friday, kick off a regression run regardless of whether it is needed or not. It’s free. And then once the results of those regression runs are done, we can combine all of that result, send it back to Jenkins so that it can now drop an email on some vice president of engineering’s desk when they show up on Monday morning, and it says, ‘Good news, we ran 800 hours of simulation and there were no failures.’ The other triggering that happens is that instead of being based on the wall clock, it’s based on conditions. For example, if there’s a certain rate of code change in a design, it can kick off regression runs. Or every time a file is revised. You can automatically say at the end of the day, at 8 p.m. every night, if more than three files have been revised during the day, then kick off a regression run that night.”
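
Expressed outside of Jenkins for brevity, the two trigger styles amount to a time check and a condition check before launching a regression. The sketch below assumes a git work area and uses a placeholder in place of the real job launch; it is not a Jenkins configuration.

```python
# The two trigger styles as plain Python checks: a wall-clock trigger (Friday
# at 9 p.m.) and a condition trigger (more than three files revised today).
# Assumes a git work area; the regression launch itself is a placeholder.
import subprocess
from datetime import datetime

def files_revised_today(repo_path="."):
    """Count distinct files touched by commits made since midnight."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--since=midnight",
         "--name-only", "--pretty=format:"],
        capture_output=True, text=True, check=True).stdout
    return len({line for line in out.splitlines() if line.strip()})

def should_kick_off_regression(now=None, repo_path="."):
    now = now or datetime.now()
    friday_night = now.weekday() == 4 and now.hour == 21      # Friday, 9 p.m.
    busy_day = now.hour == 20 and files_revised_today(repo_path) > 3
    return friday_night or busy_day

if __name__ == "__main__":
    if should_kick_off_regression():
        print("Kicking off overnight regression run...")   # placeholder launch
```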

This produces regression run after run after run and the accompanying mountains of data. This isn’t transactional data. It’s not about a bus acknowledgement or a fetch or things that actually occur on the chip. Instead, it’s all of the results from simulation, such as bugs detected, coverage achieved, and cycles run. But one of the problems is that from there it’s necessary to figure out how to combine multiple types of data and put them into a large database that can be mined.
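
One hedged sketch of that combining step: land the per-run metrics (bugs found, coverage, cycles) in a single queryable store so they can be mined later. SQLite and the field names below are chosen only for brevity; a production flow would presumably feed a larger analytics database.

```python
# Pool heterogeneous regression results (bugs, coverage, cycles) into one
# queryable store so they can be mined later. SQLite and the invented field
# names are used here only for brevity.
import sqlite3

runs = [
    # (run_id, bugs_found, coverage_pct, cycles_run)
    ("fri_2300_block_a", 2, 87.5, 4_200_000),
    ("fri_2300_block_b", 0, 91.2, 3_900_000),
    ("sat_0200_block_a", 1, 88.1, 4_150_000),
]

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE regression
               (run_id TEXT, bugs INTEGER, coverage REAL, cycles INTEGER)""")
con.executemany("INSERT INTO regression VALUES (?, ?, ?, ?)", runs)

# Example mining query: runs where coverage is high but bugs still show up.
for row in con.execute("""SELECT run_id, bugs, coverage FROM regression
                          WHERE coverage > 85 AND bugs > 0"""):
    print(row)
```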

“So if I ran 10,000 CPUs, and I look at that and figure out that across a simulation farm that’s distributed around the globe how much of that work was redundant and useless, is there a chance that I could actually achieve the same amount of work next time on 5,000 CPUs instead of 10,000? Of course, no one ever does that. What they actually do is double the amount of work and still use the 10,000 but now they are able to expand the scope,” Olen said.
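
A toy way to frame the redundancy question: if a test’s coverage signature is already covered by the tests kept so far, it added nothing. The signatures below are simplified to sets of coverage bins and the greedy pass is order-dependent; it is only meant to show the shape of the analysis.

```python
# Toy redundancy check: a test whose coverage signature is already covered by
# the tests kept so far added nothing. Signatures are simplified to sets of
# coverage bins, and the greedy pass is order-dependent.
test_coverage = {
    "test_001": frozenset({"bin_a", "bin_b"}),
    "test_002": frozenset({"bin_a", "bin_b"}),            # duplicate of test_001
    "test_003": frozenset({"bin_c"}),
    "test_004": frozenset({"bin_a", "bin_b", "bin_c"}),
}

kept_signatures = set()
redundant = []
for name, signature in test_coverage.items():
    if any(signature <= kept for kept in kept_signatures):
        redundant.append(name)
    else:
        kept_signatures.add(signature)

print("Redundant or subsumed tests:", redundant)
```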

Addressing big pain points
David White, senior group director of R&D at Cadence, who has been working on machine learning since the early 1990s, co-edited and co-wrote one of the first textbooks in machine learning. He said it’s an interesting toolset to have for particular sets of problems. “Around 2009, as I was working with more and more customers, I started seeing the same types of problems getting worse and worse. The problems were in three major areas. The first was scale. We’re dealing with larger and larger designs, more design rules, more restrictions, and the result was just more and more data. Whether it is simulation data, extraction data, dealing with more and more shapes and geometries, larger tech files, we began getting heavier and heavier with data. Now you’ve got more complicated designs and electrical rules, there are more interactions between chip, package and board, thermal is becoming a problem, so now complexity is growing as well as the overall scale. Third, both of these began to impact semiconductor design teams in terms of productivity, because scale and complexity create more and more uncertainty, which leads to more redesign, missed schedules, among other things.”

At the core of these issues was the fact that the industry was dealing with more data-driven problems, which required data-driven solutions, White said. Again, the key was to get data into a format in which it could be used for data mining. “Any of our implementations or solutions that use machine learning use some form of analytics and data mining on the front end, machine learning optimization, and some form of parallelization,” White said. “Typically, people will just call that machine learning, even though the walls between them are somewhat fluid.”
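
The three-part structure White outlines, a data-mining front end, a machine-learning or optimization core, and parallel execution, can be sketched abstractly as a pipeline of stages. The stages below are placeholders, not Cadence’s implementation.

```python
# Abstract pipeline with the three pieces White names: a data-mining front end,
# an ML/optimization core, and parallel execution. The stages are placeholders.
from concurrent.futures import ProcessPoolExecutor
import statistics

def mine(raw_chunk):
    """Front end: reduce a chunk of raw simulation data to a few features."""
    return {"mean": statistics.fmean(raw_chunk), "max": max(raw_chunk)}

def optimize(features):
    """Optimization core: placeholder decision based on the mined features."""
    return "rework" if features["max"] > 2 * features["mean"] else "ok"

def run_pipeline(chunks):
    with ProcessPoolExecutor() as pool:      # parallelization across chunks
        return [optimize(f) for f in pool.map(mine, chunks)]

if __name__ == "__main__":
    data = [[1.0, 1.2, 0.9], [1.0, 1.1, 9.0], [2.0, 2.2, 2.1]]
    print(run_pipeline(data))                # ['ok', 'rework', 'ok']
```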

—Ed Sperling contributed to this report.



