Marvell’s CEO talks about the rising cost of design and why new packaging approaches are essential.
Sehat Sutardja, chairman and CEO of Marvell, sat down with Semiconductor Engineering to talk about new approaches for design and memory and why costs and time to market are forcing changes in Moore’s Law. What follows are excerpts of that conversation.
SE: What was behind your move into modular packaging?
Sutardja: The cost of building chips is getting out of hand. As we make things more and more complex, we need more and more expensive tools. Mask costs are exploding, but things like R&D, validation, and verification are even more expensive. Time to market slows down. You have to plan way in advance.
SE: So it’s more difficult to put it on one piece of silicon?
Sutardja: You need a really, really good imagination. If you can figure out what the world will need three years from now, it works. I can’t do that. When people say the cost of building chips at 28nm is $50 million, they’re not exaggerating. It may be $40 million for some. And maybe if you’re doing a minor change, by the time you include the R&D, operational costs, validation, verification, test, burn-in, the software—if you add up all the costs, it might be $20 million. How many startup companies can build a chip for less than $50 million? And they’re supposed to be more efficient because they can do things at one-fifth the cost of a big company.
SE: We’re not seeing too many chip startups these days.
Sutardja: Yes, because it may be $100 million or more to build the first chip. Even a minor re-spin will be at least $20 million to $30 million. Let’s say you forget to add a function. It’s $20 million. Even if that function is 1 square millimeter, you have to sell 200 million chips just to break even on a re-spin.
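The 200-million-chip figure follows from a simple break-even calculation. A minimal sketch, where the 10-cent-per-chip margin is an assumed value chosen to make the interview's numbers consistent:

```python
# Back-of-the-envelope re-spin break-even using the figures above.
# The per-chip margin is an illustrative assumption, not a quoted number.
respin_cost = 20_000_000   # $20M minimum re-spin cost cited above
margin_per_chip = 0.10     # assumed incremental profit per chip, in dollars

chips_to_break_even = respin_cost / margin_per_chip
print(f"{chips_to_break_even:,.0f} chips")  # 200,000,000 chips
```

At a thicker margin the count drops proportionally, but the point stands: a forgotten 1 mm² function is ruinously expensive to add after tape-out.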
SE: But moving into a more modular approach isn’t so simple either, right? You now have to test the chip from inside the package.
Sutardja: This is the complexity of building modern chips. Some functions can interact with other functions, either through noise coupling or because the chip or the package gets too hot. Maybe the analog I/Os are too close to the DDR I/Os. Everything is so small. This is the byproduct of making things smaller.
SE: Don’t you have the capability of not doing everything in the same process?
Sutardja: Yes, but we are not taking advantage of that yet. We are using 28nm to build MoChi products. It’s not the cheapest. But we need to design for the future, and in two or three years 28nm will be a commodity.
SE: There are a lot of 28nm processes out there. Any one in particular?
Sutardja: If we don’t need performance, we use the LP process. If we do need performance, we use a high-k/metal gate process.
SE: Could you use FD-SOI, as well?
Sutardja: Yes. And over time we will revisit that.
SE: You’re using this as platforms, right?
Sutardja: Exactly. We don’t want our customers to have to worry about process nodes. What they’re most concerned about is time to market and lowest cost. We want to make it simple like LEGOs. You don’t have to know whether they’re red or yellow or green. You just have to make sure the connections are standardized. By construction it is correct and it will work.
SE: Your variables are time to market and correct by construction, but what impact does that have on power and performance?
Sutardja: If you build your SoC under ideal conditions, it might be slightly lower power. However, I’ve never seen ideal conditions in my lifetime. You have time to market pressures. You have constraints in the silicon. You have constraints in the big IPs, which determine what you can do with an amoeba-shaped area that you have to put everything else into. When you get done, it will be sub-optimal, meaning the power will be higher. Every chip we’ve done so far is lower power.
SE: Is that apples-to-apples, 28nm vs. 28nm?
Sutardja: Yes. And if you move to 10nm, you might get slightly lower power for USB IP. But it will cost a lot more than keeping it at 28nm. When you build a chip on one die, you still have to build all the functions in one process node. If you use a high-leakage process, then the rest of the functions on that chip will suffer from the decision you just made. If we use a modular format, you can select a process for lower standby power. With high-k/metal gate, the standby power is higher. It’s a tradeoff.
SE: How many components have you got, and how many will come from third parties?
Sutardja: Ideally, we want to have four or five processors to start with—single-core, dual-core, quad-core. So it may be a Cortex-A53 or a Cortex-A72. And then you have micro-southbridge devices, which are PCIe or SATA or Ethernet or USB 3.0. If you have 10 of this and 6 of that, then you can build up to 100 combinations. You can add a dual-core processor and another dual-core processor attached to the LEGO interconnect.
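The combinatorics of a parts menu can be made concrete. A sketch with an illustrative menu (the part names are assumptions, not real Marvell SKUs):

```python
from itertools import product

# Hypothetical MoChi building blocks; names are for illustration only.
processors   = ["A53x1", "A53x2", "A53x4", "A72x2", "A72x4"]
southbridges = ["PCIe", "SATA", "Ethernet", "USB3"]

# One processor plus one southbridge already gives 20 distinct systems;
# with 10 processors and 6 southbridges that would be 60, and allowing
# a second southbridge slot pushes the count well past 100.
single_slot = list(product(processors, southbridges))
print(len(single_slot))   # 20
two_io_slots = list(product(processors, southbridges, southbridges))
print(len(two_io_slots))  # 80
```

The catalog grows multiplicatively with each slot, which is why a small number of standardized modules can cover a large product space.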
SE: That didn’t work so well with four big.LITTLE chips hooked together instead of an eight-core processor. What’s changed?
Sutardja: When you build an SoC, you’re building for one application. With a modular architecture, we’re allowing our customers to become the system architect. We want them to participate. And they can make the decisions late in the game. You may want an A53 with four separate cores and ports. You can mix and match the different devices to build whatever you think you need. The software automatically works.
SE: So you’re taking a menu approach to building an SoC?
Sutardja: Yes, it’s a la carte.
SE: How long would it take you to build a chip?
Sutardja: The longest time is spent convincing marketing and sales this makes sense. We used to have 500 chips on a board. Now we have one. But one of what? In the last 10 years, this one has become very complex. It has been taking us three years to build all of the infrastructure. We are at the point where we have a handful of chips. Hopefully by the end of next year we will have 15 chips. That will be enough critical mass to allow customers to build whatever they want. Some customers are already building products. The first functions we’ve added are the ones everyone wants.
SE: How do you differentiate between chips?
Sutardja: We also will be developing functions that not everyone wants. People will pay more for those parts. And we’re not restricted to older nodes. We can move back to older nodes, and we can move forward to finFETs.
SE: So your real secret sauce is the interconnect and the software that makes it work?
Sutardja: That’s right. The interconnect has to be good for at least the next 10 years. It can’t be good enough for only one or two years. We start with 8 gigabits per second per lane. Some devices have two lanes. Some have four lanes. We also can build 8 or 16 lanes, but it’s better to upgrade that over time. But it will still be backward compatible. Twenty years from now, if you have 100 gigabits per second per lane, it will still work with the older chips.
SE: If I’m company A and it used to take me 18 months to build a chip, how long will it take me now?
Sutardja: One hour to make decisions. You have a price list. You decide what you want, how many functions you want, and then you build the evaluation board. Then, using cables, you can connect them together and test whether this is what you want. Each package is small with limited functions, so they are very simple. If you build a chip with 500 pins, you still have to put it on a board. With this approach, you put these devices next to the edge of the board. There is no wire to fan out. From there to the main chip is only two wires.
SE: Does that reduce your bill of materials?
Sutardja: Absolutely. You can build a four-layer board instead of a six- or eight-layer board to fan out 500 pins from your super SoC. We have been brainwashed into thinking more pins are better.
SE: Are the tools there to make this work?
Sutardja: The intelligence is built in. All you really need are PCB layout tools.
SE: How does this work within a system?
Sutardja: We need to provide customers with guidelines for how this will work at a given bandwidth. If you don’t need it, maybe you can go with half the bandwidth. The vast majority of chips have three interfaces, but you can add as many as six or seven. And all these connections can be daisy-chained. The chip closest to the processor will have the shortest latency. The one further down the chain will have to go through the intervening chips, so it will have longer latency.
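The daisy-chain latency described above can be sketched as a toy model, where each intermediate chip adds a fixed forwarding delay (the 10 ns figure and chip names are assumptions for illustration):

```python
# Toy latency model for a daisy-chained interconnect: latency grows
# linearly with the number of hops from the processor.
HOP_LATENCY_NS = 10  # assumed per-chip forwarding latency, in nanoseconds

chain = ["processor", "chip_A", "chip_B", "chip_C"]

def latency_to(device: str) -> int:
    """Hops from the processor to `device`, times the per-hop latency."""
    return chain.index(device) * HOP_LATENCY_NS

print(latency_to("chip_A"))  # 10 ns: directly attached to the processor
print(latency_to("chip_C"))  # 30 ns: traffic traverses chip_A and chip_B
```

This is why placement on the chain is itself a system-architecture decision: latency-sensitive functions belong closest to the processor.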
SE: What are you doing on the memory side?
Sutardja: That’s Final-Level Cache. A lot of people are confused between FLC and Last-Level Cache. With x86, Intel used last-level cache to mean the last level of their processor cache, which is L3. All the cache that’s available in the processor world is Last-Level Cache. FLC is system cache.
SE: So it can be shared among different things?
Sutardja: Yes.
SE: How do you keep that coherent?
Sutardja: Because it is shared, there is only one. There aren’t multiple things in the system where you have to check coherency. It could have multiple levels of final-level cache, but from a system-level view it’s just one cache. The biggest benefit of FLC is to virtualize DRAM so you don’t have to use a big DRAM. We can virtualize the main memory from using a big DRAM to using a much smaller DRAM, and replace the rest with cheaper memory like SSD or flash. This is possible because in the real world there is almost no application on this planet that requires 16 gigabytes of DRAM to run at any instant in time. In a short time frame—one second or one hour—a lot of applications only require 1 gigabyte. But it may be a different gigabyte, which is why you may need a gigabyte or 16 gigabytes or hundreds of gigabytes in a big system. When you’re dealing with millions of people writing code, it’s impossible to get everyone to agree on how to make things more efficient. FLC learns what you need, so only those things are kept in the system cache. The things you use once but don’t use again for the next hour get pushed out to the low-cost memory. If you monitor what’s in the cache, only the most active applications will be there.
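The keep-the-hot-pages-in-fast-memory behavior described above can be sketched with a plain LRU policy. This is a minimal illustration, not Marvell's actual FLC algorithm, which the interview only characterizes as statistical:

```python
from collections import OrderedDict

# Sketch of the FLC idea: a small fast tier (DRAM) backed by a large
# cheap tier (flash/SSD). LRU stands in for the real statistics here.
class FinalLevelCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.fast = OrderedDict()  # small DRAM, ordered coldest-first
        self.slow = {}             # cheap flash/SSD tier

    def access(self, page: str):
        if page in self.fast:
            self.fast.move_to_end(page)  # mark as recently used
            return
        # Miss: promote from the slow tier (or allocate a fresh page).
        self.fast[page] = self.slow.pop(page, None)
        if len(self.fast) > self.capacity:
            victim, data = self.fast.popitem(last=False)  # evict coldest
            self.slow[victim] = data

cache = FinalLevelCache(capacity=2)
for page in ["app1", "app2", "app1", "app3"]:
    cache.access(page)
print(sorted(cache.fast))  # ['app1', 'app3'] stay hot; 'app2' spilled
```

Idle pages migrate to cheap storage automatically, with no cooperation needed from the millions of programmers writing the applications.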
SE: Does that require you to understand the flow of data in the system?
Sutardja: No, because if you were to understand the way things work, that would be a nightmare. As soon as you make it work with one scenario, it will not work with a different scenario. The solution is not to know how things work. It has to work based on statistics. If you have to run 10,000 things at any time, it’s hopeless. But in the real world, there are not that many events you have to keep track of. You have to rely on big data assumptions. There are very few things you care about.
So why exit the smartphone SoC market just ahead of MoChi? Especially since it opens the door to some cooperation with AMD and/or Nvidia for a GPU module at the very high end. Addressing the tablet market would be easier with MoChi, too. Sure, they weren’t targeting a GPU module, but the marketing might work; they could even have dedicated DRAM for the module as a marketing feature. There could be a market in gaming boxes with Razer, Valve, or some others, or maybe TVs in China, where the traditional consoles have little relevance.
In phones, the A72 on 28nm is a big opportunity, and the Snapdragon 652/650 hasn’t arrived as fast as expected. If Marvell had a solution sooner, there was potential for some nice wins at decent ASPs.