Bugs That Kill

Behind closed doors: Semiconductor executives talk about the bugs they fear the most and the problems with solving them.


Are simulation-resistant superbugs stifling innovation? That is a question Craig Shirley, president and CEO of Oski Technology, asked a collection of semiconductor executives over dinner. Semiconductor Engineering was invited to hear that discussion and to present its key points.

To promote free conversation, the participants, who are listed below, asked not to be quoted directly. All comments have been paraphrased and made anonymous.

Shirley painted a picture of the industry as having followed Moore’s Law. “Around the mid-2000s, that changed. We no longer had the frequency scaling and we no longer had the power efficiencies. While we still had transistor density scaling, we had to pivot from synchronous sequential design to parallelism. What has happened on the verification side? We still have simulation and constrained random simulation. We still have emulation. We live in a sequential logic world and simulation. The idea of parallelism is lost on simulation. So we have new-world design but old-world verification. That has created the notion of simulation-resistant superbugs that go through simulation undetected.”

Unsurprisingly, because everyone sitting at the table has a bias towards formal methods, a rallying cry went up for greater adoption of formal. “Our goal is to find the most difficult bugs in the fastest time that we can. Formal brings a lot of value, but it has not been recognized in that way.”

Some verification teams set the bar very high. “Our goal for verification is to run zero simulation cycles! We want to have the design completely verified through formal. I believe it is possible, but the challenge is whether we get to that in 5 years, 20 years or 40 years. We have been increasing the effort in terms of the coverage provided by formal, and we try to push the boundaries in terms of tools capability and staffing. In the latest chip, we found issues using formal that we could not reproduce in simulation, but that we could reproduce in silicon. The architects were not able to convince themselves that the cases were not possible.”

The industry has been working on ways to make formal more approachable. One method is to create waveforms, since these align with the designers’ perspective. “They can see, they can visualize the design and then they can pinpoint where problems are. That is where the designers get excited.”

But with each solution comes a potential problem. “One nice thing about formal is that it shows the designer the shallowest expression of the bug. This is awfully nice since we frequently get bugs at the end of multi-hour simulation runs.”

But those shallow expressions may not appear plausible. “When we show designers a bug found using formal, they often respond saying that is not possible in a real system environment. These events cannot happen in this tight sequence of cycles. But they could happen over a long period of time, and there is really no difference. Designers who style themselves as being system-level knowledgeable engineers look at a formal trace and dismiss it for not being a realistic scenario. That is not the point. The point is that you could get to here eventually.”
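
The point about shallow traces can be illustrated with a small sketch. It is a toy, not how production formal tools work internally: they rely on SAT- and BDD-based engines, while this uses plain breadth-first search over explicit states. The two-entry FIFO occupancy counter, its deliberate bug, and all names here are hypothetical. The search returns the shortest input sequence that breaks the property, and the replay shows that the same events spread out with idle cycles still reach the bad state.

```python
from collections import deque
from itertools import product

DEPTH = 2  # hypothetical FIFO depth
INPUTS = list(product([False, True], repeat=2))  # all (push, pop) combinations

def step(count, push, pop):
    """Next occupancy count of a toy FIFO for one clock cycle."""
    if push and pop:
        # Deliberate bug: a simultaneous push and pop on a full FIFO
        # should leave the count unchanged, but here it increments.
        return count + 1 if count == DEPTH else count
    if push:
        return min(count + 1, DEPTH)  # pushes to a full FIFO are dropped
    if pop:
        return max(count - 1, 0)      # pops from an empty FIFO are ignored
    return count

def holds(count):
    """Safety property: occupancy never exceeds the FIFO depth."""
    return 0 <= count <= DEPTH

def shallowest_counterexample(initial=0):
    """Breadth-first search, so the first violation found is a shortest one."""
    if not holds(initial):
        return []
    seen = {initial}
    queue = deque([(initial, [])])
    while queue:
        state, trace = queue.popleft()
        for push, pop in INPUTS:
            nxt = step(state, push, pop)
            new_trace = trace + [(push, pop)]
            if not holds(nxt):
                return new_trace
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, new_trace))
    return None  # property holds in every reachable state

def replay(trace, initial=0):
    """Apply a sequence of (push, pop) inputs and return the final count."""
    count = initial
    for push, pop in trace:
        count = step(count, push, pop)
    return count

IDLE = (False, False)
trace = shallowest_counterexample()
# The same three events, separated by idle cycles, still reach the bad state.
stretched = [trace[0], IDLE, IDLE, trace[1], IDLE, trace[2]]
print("shortest trace:", trace)
print("count after stretched trace:", replay(stretched))
```

In this toy, the tight three-cycle sequence and the stretched one end in the same illegal state, which is the argument made to designers who dismiss a short formal trace as unrealistic.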

Some participants believe that early warning signs hint at a self-inflicted downturn for the U.S. semiconductor industry. “We have this syndrome in the U.S. and it showed up in the car industry. We think that the substandard cars we are producing are actually satisfactory cars. It turns out that some people who are more diligent than us in other countries are making much better cars than us. If we don’t watch this, it will come back and bite us. The semiconductor industry has been doing quite well over the past years, but this trend may not continue if we don’t watch the quality.”

For some companies it appears that the safe road is the preferred road. “It happens with clock gating all the time. Managers want to limit the amount of clock gating being done because it means more aggressive superbugs could be missed.”

Without a safety net, many companies are scared of being too aggressive. They question how many millions of cycles of simulation it will take to find problems caused by parallelism.
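
A rough back-of-the-envelope sketch shows why the cycle counts grow so quickly. It rests on a strong simplifying assumption that is not from the discussion: the bug needs several independent events to coincide in one cycle, each with a fixed per-cycle probability under random stimulus. The probabilities and function names are hypothetical.

```python
import math

def per_cycle_hit_probability(p_event, n_events):
    """Chance that n independent events all occur in the same cycle."""
    return p_event ** n_events

def cycles_for_confidence(q, confidence=0.95):
    """Smallest n with 1 - (1 - q)**n >= confidence (geometric model)."""
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-q))

# Assumed numbers: three events, each seen in about 1% of cycles.
q = per_cycle_hit_probability(0.01, 3)
expected_cycles = 1.0 / q          # mean of the geometric distribution
n95 = cycles_for_confidence(q)     # cycles for a 95% chance of one hit

print(f"per-cycle hit probability: {q:.1e}")
print(f"expected cycles to first hit: {expected_cycles:,.0f}")
print(f"cycles for 95% confidence: {n95:,}")
```

Under these assumptions the first hit is expected after about a million cycles, and roughly three million are needed for a 95% chance of seeing it once. Each additional coinciding event multiplies those figures by another factor of 100.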

Formal first
There are two mindsets in the industry about when formal should be applied—to find the last 10 bugs, or the first 10 bugs. “Always the first 10 bugs. It takes months to get any results from simulation. They are attempting to get the simulation infrastructure up. If the first 10 bugs are not formal…”

“But nobody thinks that way. They all think of formal as how do you close the last gap that you didn’t find with simulation.”

“What do they do when RTL is being written and the testbench is being constructed? With UVM, building testbenches has become complicated and effort intensive, to the point where it is orders of magnitude more than the design. You need a mechanism to start finding bugs before the testbench is ready.”

Some companies have successfully bridged that divide and find their design team appreciates the early feedback. “The first three months of verification is owned by formal. The designers like that we have ways to find bugs long before the testbench is set up.”

But others are stuck with an older mentality. “This is not the mindset of most development managers. They believe UVM is the gold standard and it is what you do. If it takes engineer-months to find the first bug via the UVM testbench, and you want it sooner, then you apply more engineers to it. They are not looking for an alternate strategy.”

The engineering mentality
Part of the problem is the approach the industry has grown up with. “Once you have written RTL, the first thing you want to do is simulate it. You can’t stop them from doing that. You have to catch them before they are at RTL. Is there any way to force them to do formal before they write RTL?”

This requires a conceptual change. “Today, the approach is what does my design do when I apply this stimulus? We need them to ask the question, ‘How does my design behave?’ After writing this piece of RTL, if you bring up a formal testbench that you haven’t done any simulation on, it is more of an observational thing versus testing it against a preconception of what it is supposed to do. Does it meet that preconception?”

Change is hard. “When you leave verification to the design team, the first thing they want to do is find out what happens when they apply reset. Does it start working? Does formal provide the same kind of adrenaline kick that they get from simulation?”

This leads to a feeling that formal is less immediate. “Getting engineers excited about their impact is important. We may have to go upstream to start getting engineers excited about doing formal. There are a lot of other things that are catching their eyes on a daily basis.”

There is an industry bias that has to be overcome. “There remains a negative connotation about verification. Verification gets the second-rate people because the designers create, and that is positive. The concentration always has been on the negative with verification. How do you get the notion of creating the formal mathematical definition of what you are trying to create to become part of the design process? Could we make formal define the intent of what is to be created, with the design team trying to match that? That now puts the notion of verification in a positive light, and the whole industry will start creating better people.”

Parallel complexity
The structural side of this problem is typified by participants’ comments such as, “Executives and top management grew up in the single-execution world,” “Guys who were the designers during the serial world are now the managers in the parallel world,” and “Management thinks serial and we have to change their mindset to think parallel.”

One suggestion is to think of designs as being composed of both serial parts and parallel parts. “CPU development, as an example, has two problems. One is a serial problem—the core itself—and one is a parallel problem, coherency. The verification plan should not only focus on bringing up the core, but also bringing up the memory sub-system in parallel. Your coherency protocol then gets attention earlier in the pipeline.”

Part of the problem is the complexity. “Historically, we always relied on the designers who were brilliant enough to comprehend the entire operation of the design in their head. As you move to larger and more complex designs, you are going up this curve where fewer and fewer people on the team are capable of holding the entire operation of the design in their head. At some point, you reach a threshold where nobody can comprehend it.”

What happens then is an assignment of responsibilities. “The scariest blocks are the ones that are really hard. We never have a problem with those blocks. The blocks that have the most bugs are the simplest blocks. Why? Because that is where I put the less capable engineers. I have a bell curve of people and I take the brightest guys and I put them on my scariest block and they never have a problem. Then I get a silly bug in the corner of the chip that we didn’t even think about.”

Staffing issues
This led to an in-depth discussion about staffing issues. “The kids that are coming into the industry right now, in the U.S. and Europe, are not the sharpest knives in the stack.”

The participants questioned why this is the case. A few key items emerged. Chip design is no longer sexy. What is sexy are software companies such as Facebook and Google, because they are in front of people all the time. Gaming is also attracting a lot of talent.

This is not an issue that happened overnight, and there are no quick fixes. The industry will feel the impact for a long time because there is a time lag between when someone enters the industry and when they are expected to become the architects and lead verification engineers. If too little talent is coming into the pipeline now, that gap will persist as those engineers move into senior roles.

It was suggested that the biggest issue is making semiconductors look like an interesting problem to solve. “The smartest kids want to work on hard problems.” Some felt that the best candidates were now coming from related fields, such as physics and chemistry, and could easily be trained as electrical engineers or computer scientists.

Another problem is pay. “Software engineers get paid more because at comparable skill levels the hiring market is more competitive.”

Pay issues exist within teams, as well. “In our industry, the most effective designer, the most productive designer, is 10X more productive than the average designer. The tech industry is the only industry where that is true. In other industries, it is 20% between the best and the average. So that makes this industry strange. You can’t pay them 10X as much, which is a tragedy.”

Participants also believe that verification engineers are likely to be paid more than designers these days. “You want your verification engineers, on average, to be slightly better than your designers, and you should be paying them more.”

Added one participant: “My top-paid engineer, with the exception of analog, was a verification guy. I pay them more for two reasons. First, it is a harder problem than design. And second, I am the person who will get fired if the thing does not work. If the design was brilliant or good, nobody can tell the difference when it comes to shipping. But if you get a chip back and it doesn’t work, everyone knows who to fire—and that is me.”

Participants included: Farhan Rahman, AMD; Charlie Janac, ArterisIP; Dan Lenoski, Barefoot; J Augusto de Oliveira, Cypress; Shahriar Rabii, Google; Sandeep Bharathi, Intel; Guy Hutchison, Marvell/Cavium; Anshu Nadkarni, Nvidia; Simon Duxbury, Quantenna; Seonil Choi, Samsung; Samit Chaudhuri, NVXL. From Oski: Craig Shirley, Dave Parry, Vigyan Singhal.

3 comments

Linming Jin says:

Another great article, Brian. A question here: why may clock gating cause “more aggressive superbugs”? It’s technically not clear to me. We have tried very hard to introduce as many clock gating opportunities as possible.

Brian Bailey says:

Hi – clock gating is supposed to reduce power without affecting the functionality of a design. But there are two things that could go wrong. First, if the same set of vectors is run on the pre- and post-gated design, then issues may be missed. It is necessary to look at coverage on the post-gated design to see that verification remains complete. There also may be corner-case conditions that could change behaviors, and again the vector set may not be able to find those. Formal is good for those cases.

LM says:

“The kids that are coming into the industry right now, in the U.S. and Europe, are not the sharpest knives in the stack.”

It is very disheartening to see such a xenophobic remark made with no objection from the folks in the room or even the comment section.

Re: staffing issues: I’m (not) surprised no one mentioned management. It takes an exceptional manager, one with SOFT SKILLS, to identify, attract, cultivate, and retain a team with people of different cultural backgrounds. (Education path and cultural background are intertwined as kids who graduate from undergrad in the US most likely grew up in the US as well) Instead of scapegoating the kids and making up this false narrative, perhaps consider the idea that the managers do not have the skills to understand, communicate, and manage people of different backgrounds.

Also, why would any talent educated in the US/Europe want to work for a manager who thinks their education path is lowly? Obviously, comments like the one above are not said out loud, but it’s easy to sense bias and select another company where we are welcomed. It’s a two way street.

Lastly, it is kinda ironic to see a group of formal verification experts have a discussion on staffing without even questioning the assumption.

Signed,
An ASIC verification engineer with > 15 years experience, who interviews new grads, and treats them all with respect.
