NetSpeed’s CEO talks about the benefits and challenges of using machine learning to optimize on-chip data flow.
Sundari Mitra, co-founder and CEO of NetSpeed Systems, sat down with Semiconductor Engineering to discuss machine learning, training algorithms, what customers are struggling with today, and how startups fare in an increasingly consolidated semiconductor industry. What follows are excerpts of that conversation.
SE: Machine learning is booming. How will this change design?
Mitra: This is a direction that computing will have to go in the future because discrete solutions won’t provide the most optimal performance. Machine learning is basically an analog way of approaching design. If you look at humans, our decisions are gray. They are not just black-and-white, and that’s how we evolve. If you look at machine learning algorithms and what they’re promoting, they are taking us there because they are self-learning. The first wave of compute that came about with artificial intelligence was programming it to come up with a solution. It wasn’t about self-learning, adapting and continual improvement. But the next generation is CNN-based, which is taking it closer to how we operate. It’s moving computing in that direction. There is a lot of excitement in the industry about this. We started using machine learning in our data structures two years ago after we had enough data, because without that data you cannot really perform the operation.
SE: What’s driving this approach?
Mitra: Generally, companies are dealing with a more compressed time to market. They have more pressure to do more in that same amount of time and not make a mistake, because if you make a mistake it’s really expensive. This is partly due to the consolidation in the semiconductor industry. There are fewer people who compete with each other, but the level at which they are competing is a lot more ferocious. They are a lot more secretive about what they are doing. There is less room for error. And there is a lot of pressure to do a lot more complicated things in the same amount or less calendar time. I’m watching every big company implement initiatives to accelerate what they’re doing.
SE: How about startups?
Mitra: It’s fantastic to see new semiconductor startups, and we work with some of them. They don’t have the same legacy of ‘not invented here.’ The entrenched companies have the attitude, ‘We know how to do it because we’ve done it this way for 15 years and you cannot teach us how to do things better, but we want efficiency improvements.’ We have a fair number of smaller companies as customers. There are some large companies that seem to run in the startup mode because the chip initiatives are newer there. A lot of companies that were purely software are now getting integrated vertically, and they are starting to do hardware designs. Those are like startup companies inside because they don’t have any legacy, and they are less worried about whether their competition has 50 customers. This is the cutting edge and it’s new.
SE: For the more entrenched companies, are they more demanding of accuracy or speed or capacity in a tool evaluation?
Mitra: A startup will evaluate a solution for a week, or maybe a maximum of two weeks, and make their decision. It’s about, ‘This is my application. I want to see what you’re going to yield.’ And then they do a comparative study of whatever is open standard. An entrenched company already has five or six or seven flavors of their own interconnect solutions in-house. When you go and pitch to them they are very nebulous in what their selection criteria are, and the evaluation seems to go on and on and on. Andy Grove said, ‘I’m going to fire myself and get rehired because that’s when I can have clarity of thought in terms of how to do things differently.’ That’s what they ought to do on the design side, as well—get out, spin out, come back in, and see, ‘If I were to come in fresh and new into this company, how would I fight this battle?’ Then there would be no difference. But that’s not how it is. There is a lot of legacy. There are values to legacy. I don’t want to discount that. That’s why they are winners and became this big, because they have some strengths. But at some point it starts blurring as to what their real strength is versus legacy baggage that they carry.
SE: How does machine learning/deep learning fit into this picture?
Mitra: The interconnect is controlled by many parameters. If you look at the number of knobs that we need to turn to build an interconnect, it’s probably 2^100 different knobs that we have to manipulate, maneuver and tune to get to the optimal solution. That’s a lot of knobs. At a very high level, to define the space it’s like you have two ports that are connected to each other. You have to find out which is the driver, which is the receiver. Then you have to find out how many bits are on these ports, and you’ve got to see physically whether it is connected this way or that way. Are there congestion points anywhere? What is the optimal way of connecting them? You have to know the bandwidth between these two points. What do you need to service? What is the latency requirement between these two points, and what kind of traffic is originating from these? Is it possible that one of them is dependent on traffic coming from somewhere else? If it is dependent on something else it cannot service that request until the other request is satisfied, so there is a dependency that you need to take care of. If you don’t resolve the dependencies correctly, can this link deadlock? If you haven’t sized it correctly, can it lead to congestion? And what if you’ve worked out all these things, and then you have figured out the route table for it? You know it’s going to go this way, but then when you try to do that there is not enough physical space to fit these two. And this is only two points. Multiply this with N number of IPs with Y number of ports, and now virtualized and real channels. There are so many different knobs that you have to tune/play with to analyze and figure out what that fabric is that you have to use some way of doing it.
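The dependency question Mitra raises (one request cannot be serviced until another is satisfied) can be illustrated with a small sketch. This is not NetSpeed's method; it assumes a hypothetical dependency map, `depends_on`, in which each channel lists the channels it must wait on, and it flags a potential deadlock whenever those waits form a cycle.

```python
from typing import Dict, List

# Hypothetical model: each key is a channel, each value lists the channels
# it must wait on before it can make progress. A cycle of waits means that
# no channel in the cycle can ever advance, which is a potential deadlock.

def has_dependency_cycle(depends_on: Dict[str, List[str]]) -> bool:
    """Return True if the wait-for graph contains a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color: Dict[str, int] = {}

    def visit(node: str) -> bool:
        color[node] = GRAY
        for nxt in depends_on.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                return True  # reached a node already on the current path
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in depends_on)


# Illustrative inputs only; the channel names are made up.
acyclic = {"cpu_req": ["mem_req"], "mem_req": []}
cyclic = {"a": ["b"], "b": ["c"], "c": ["a"]}
```

A real interconnect tool would have to build such a graph from routes, virtual channels and protocol rules; the sketch only shows the cycle check itself.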
SE: How was this done in the past?
Mitra: Initially we used simulated annealing techniques to simulate it, synthesize it, and converge to the right answer. That is the traditional way, but when you do that you’re waiting for a starting point or a seed. The initial design point that you choose is very critical to where your final optimized solution is going to end up. When you have the ability to collect all of these data points, train it and have algorithms, as soon as that database sees a set of new input vectors, it maps it to something that it has seen before. And it will come to the right starting point based on its learning algorithms. It has learned it and says, ‘This is similar to the one I did before. Let me start from there, and start the manipulation process for each one of them.’ That is the way we use it for the fabric. We are giving it the right starting point based on all the training that we have done for the 2^100 different things for each one of them. ‘When you have this, use this. When you have that, let’s use this other approach.’ It’s guiding it so the convergence is much faster, and you get to a better minimum. You get a more optimal solution. That is what we are using it for because it’s a non-trivial problem to solve. These are NP-hard problems.
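As a rough illustration of the idea described here, the sketch below seeds a simulated annealing run with the stored configuration of the most similar earlier design. It is a toy under stated assumptions, not NetSpeed's implementation: the feature vectors, the `library` of past designs, the 0/1 knob encoding and the cost function are all hypothetical, and a plain nearest-neighbor lookup stands in for whatever trained model is actually used.

```python
import math
import random
from typing import Callable, List, Optional, Sequence, Tuple

Config = List[int]  # hypothetical encoding: one 0/1 entry per knob

def nearest_seed(features: Sequence[float],
                 library: Sequence[Tuple[Sequence[float], Config]]) -> Config:
    """Return a copy of the stored config whose design features are closest."""
    _, config = min(library, key=lambda item: math.dist(item[0], features))
    return list(config)

def anneal(start: Config,
           cost: Callable[[Config], float],
           steps: int = 2000,
           t0: float = 1.0,
           rng: Optional[random.Random] = None) -> Tuple[Config, float]:
    """Simulated annealing over bit flips; returns the best config seen."""
    rng = rng or random.Random(0)
    current, current_cost = list(start), cost(start)
    best, best_cost = list(current), current_cost
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9  # linear cooling schedule
        candidate = list(current)
        candidate[rng.randrange(len(candidate))] ^= 1  # flip one knob
        delta = cost(candidate) - current_cost
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = candidate, current_cost + delta
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
    return best, best_cost


# Illustrative data only: features could be (number of IPs, port width).
library = [((4.0, 64.0), [1, 0, 1, 0]), ((16.0, 256.0), [0, 1, 1, 1])]
target = [1, 0, 1, 1]
toy_cost = lambda cfg: float(sum(a != b for a, b in zip(cfg, target)))

seed = nearest_seed((5.0, 64.0), library)
best, best_cost = anneal(seed, toy_cost)
```

Because the run starts from a configuration that was good for a similar design, it begins closer to a low-cost region than a random seed typically would, which is the faster convergence Mitra describes.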
SE: Are engineers open to this approach?
Mitra: The answer to that is very geography-specific. Where there is access to a lot of cheaper manpower, let’s say they have a pool of 1,000 engineers who have been trained in a particular manner, and they are banging out chips, and every 10 months they do a tapeout. It’s like a freight train that’s moving along, and you’re throwing a very big angular turn in that direction. Potentially it will derail that train. Are they ready for a revolutionary change? No. But if you can help them evolve, they probably will be more open to it. This approach is revolutionary, not evolutionary, but over time we have made it possible for people to evolve so this can be used on older designs. Then you can start tweaking it. The benefits are not as great, though.
SE: Any other people-related hurdles?
Mitra: It’s very difficult to change a workforce that has been trained and practicing something for 12 or 15 years. That will change if there is a change in the industry, though. For example, if the mobile industry was the only big semiconductor market segment, nothing would change because all the methods they are used to using would continue to work. But when there is an inflection point there is an opportunity, such as in the autonomous market where real-time decisions are required. The traditional mobile fabric is actually just a tree to memory, and it doesn’t work because then you bottleneck in front of the memory. When you have different kinds of traffic trying to access the same memory, a tree will absolutely gridlock and you have performance issues. As the demands of a mobile app processor change, as mobile devices become even smarter, even that is going to have to change. But for the time being I don’t see them wanting to change. When they get pushed where the traditional methods are not working, they will be forced to change their architecture. That’s the mobile and maybe the lower end. On the higher end of the people problem it’s, ‘You know I have customized this, and I don’t care what you tell me. You know my 500 engineers who have defined this thing for the last 15 years have given us the differentiation that makes us No. 1 in the industry today, and you will never be able to beat it.’
SE: How do you measure the kind of accuracy or performance those potential customers want to see?
Mitra: For designs that are reasonably sized, about 40 or 50 blocks, it’s very easy to put together the fabric, configure it, tune it, push buttons all the way through to place and route to get timing numbers, area numbers and quantify it. ‘This is A, this is B, this is your solution, this is ours. You can see the percentage improvement in the performance and in the area. If it’s worth your time, here it is.’ But for the very big chips, they are trying to figure out how to capture this because we cannot possibly take the entire big design with thousands of blocks in it, run it through all levels of design and then go back to them with an answer. So we tell them to take a look at the toughest part. Normally, it’s the coherent subsystem part of it. The majority of the time that’s a mesh network or a torus network. We configure that and we give them our power performance numbers, draw all kinds of graphs, and let them do a lot of what-if analysis, and we tell them, ‘This is it. Go take a look at it. Yes, it’s a subset, but on all the metrics at the subset level we have shown you that we do better.’
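For readers unfamiliar with the two topologies mentioned, a torus is a mesh whose rows and columns wrap around at the edges. The sketch below compares minimum hop counts between two routers on a 2D grid. The `(x, y)` coordinate scheme and the function names are assumptions made for illustration and are not taken from the interview.

```python
from typing import Tuple

Node = Tuple[int, int]  # hypothetical (x, y) router coordinate

def mesh_hops(a: Node, b: Node) -> int:
    """Minimum hops on a 2D mesh: Manhattan distance, no wraparound links."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def torus_hops(a: Node, b: Node, cols: int, rows: int) -> int:
    """Minimum hops on a 2D torus: each axis may go either way around."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    return min(dx, cols - dx) + min(dy, rows - dy)


# Corner to corner on a 4x4 grid.
corner_mesh = mesh_hops((0, 0), (3, 3))          # 6 hops
corner_torus = torus_hops((0, 0), (3, 3), 4, 4)  # 2 hops
```

The wraparound links shorten worst-case paths at the cost of longer physical wires, which is one of the trade-offs a what-if analysis of the kind described would weigh.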
SE: Any other challenges?
Mitra: We don’t have standard bus protocols. Every couple of years there is a new protocol spec that comes out that we need to conform to, and they are not necessarily backward-compatible. For any small company to keep pace with that becomes a challenge. If you look at the coherency protocols, it is very difficult to be backward-compatible and still get the performance advantages that you want from the next generation.