Optimization Driving Changes In Microarchitectures

New approaches emerge as demand for improved power and performance overwhelms design tools.

The semiconductor ecosystem is at a turning point for how to best architect the CPU based on the explosion of data, the increased usage of AI, and the need for differentiation and customization in leading-edge applications.

In the past, much of this would have been accomplished by moving to the next process node. But with the benefits from scaling diminishing at each new node, the focus is shifting to architectural and particularly microarchitectural changes to manage tradeoffs like resource allocation, power, throughput, and area. This is putting pressure on EDA vendors to develop a new set of capabilities to optimize power, performance, and area for a complex mix of different tasks and applications. At the same time, it’s rekindling interest in design tools with the kind of “wow factor” not seen in years.

“I remember well when we went to the designers with the first designs [using synthesis],” said Aart de Geus, chairman and co-CEO at Synopsys, in a recent presentation. “We took their design, and in a matter of hours were able to make it much smaller and much faster. Initially, they didn’t believe it. Then they looked at it, they checked it out, and suddenly they found out it was correct. But then a problem happened, which was they thought it was magic and could do anything. No, no, no, we had a lot of work for the last 30 years to make it a lot better. The same is going to happen here.”

Others point to similar examples, observing that the semiconductor ecosystem is shifting gears to deal with the confluence of data, AI, and new applications.

“Looking back at CPU design, there were general-purpose CPUs that followed certain rules like Moore’s Law, Dennard scaling, and Amdahl’s Law, but these either stopped working or limit performance,” said Zdeněk Přikryl, CTO of Codasip. “We have to find different ways to improve performance. One option could be heterogeneous computation, where certain blocks are focused on certain things they are good at. This is what the semiconductor ecosystem has started looking at. We know that we cannot go above about 5 GHz. Yes, there are other ways to achieve that, like optics, or maybe carbon nanotubes, but these are 10 years away. So heterogeneous computation is the best choice, because we have to solve the performance issues now.”

This is more than just designing hardware, though. It’s about how to utilize the hardware, as well. “We used to look at general-purpose microarchitectures and do them independently, without looking at the target domain or software,” Přikryl said. “Now, in heterogeneous computation, you don’t just have a general-purpose one. There’s one that’s meant for AI, and others meant for other parts or domains. AI is particularly interesting because every single one of them has some key differentiation point, and the CPU has to be tuned and customized according to that.”

Fig. 1: Increasing complexity for chip designs. Source: Codasip

Many of these changes are revolutionary, but they’re constructed on top of evolutionary improvements in a number of areas. The result is much more flexibility and the ability to deal with rapidly increasing complexity and heterogeneity.

“If you go back a couple of decades and think about what processors looked like at that time, you had these pieces of silicon that were more or less designed for a single user, and they were more or less designed to do everything,” said Steven Woo, fellow and distinguished inventor at Rambus. “These CPUs were responsible for doing all the computing. They were responsible for processing I/O, for processing network traffic, and for whatever graphics were there. Fast forward a couple of decades and we have orders of magnitude more gates. And the way that people think about processors now is very multi-core and multi-functional. So in addition to having multiple CPU cores, now they also have things like graphics, and some of them have specialized accelerators for things like encryption and even high-performance computation like vector engines.”

CPUs today often have multiple cores and multiple threads, and they can even support multiple users at the same time.

“The question starts to become, ‘If I’ve got all these cores, and potentially all these users or maybe multiple programs all executing on the CPU, what’s a fair partition of all these resources?’” said Woo. “By ‘fair,’ you have to make it so that everybody can have their own resources. But if not everybody’s using all those resources, you don’t want them to go to waste, so they have to be both shareable and partitionable. We didn’t have to worry about that kind of thing in the past.”
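
Woo’s point about resources being both partitionable and shareable can be made concrete with a small sketch. The following Python model is purely illustrative and loosely inspired by cache-way allocation; the class, quotas, and reclaim policy are invented for the example and not any vendor’s mechanism. Each tenant gets a guaranteed share, idle capacity can be borrowed, and a tenant that falls below its guarantee can reclaim ways from borrowers.

```python
# Illustrative only: a toy model of a resource that is both partitionable
# and shareable, loosely inspired by cache-way allocation. Names, quotas,
# and the reclaim policy are hypothetical.

class WayPool:
    def __init__(self, total_ways, guaranteed):
        # guaranteed: tenant -> number of ways reserved for that tenant
        assert sum(guaranteed.values()) <= total_ways
        self.total = total_ways
        self.guaranteed = dict(guaranteed)
        self.held = {t: 0 for t in guaranteed}

    def free_ways(self):
        return self.total - sum(self.held.values())

    def request(self, tenant, ways):
        """Grant up to `ways`. A tenant below its guarantee may reclaim
        ways that others borrowed beyond their own guarantees."""
        granted = min(ways, self.free_ways())
        shortfall = ways - granted
        # Only reclaim up to this tenant's unused guarantee.
        reclaimable = max(0, self.guaranteed[tenant] - self.held[tenant] - granted)
        for other in self.held:
            if shortfall == 0 or reclaimable == 0:
                break
            over_quota = self.held[other] - self.guaranteed[other]
            take = min(max(0, over_quota), shortfall, reclaimable)
            self.held[other] -= take
            granted += take
            shortfall -= take
            reclaimable -= take
        self.held[tenant] += granted
        return granted

pool = WayPool(16, {"core0": 6, "core1": 6, "core2": 4})
print(pool.request("core0", 12))  # 12 -- borrows idle capacity beyond its guarantee of 6
print(pool.request("core1", 8))   # 6  -- reclaims enough from the borrower to reach its guarantee
```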

That also makes system design significantly more complex. “As with everything in this kind of work, we are getting yet again a mixture of contradicting demands,” said Aleksandar Mijatovic, digital design manager at Vtool. “We want to have the least power consumption, in the least area, for maximum flexibility and maximum performance. This frustrates every single designer in the world when management comes with, ‘We want it not to waste energy, but we want to have all the features, and it has to be configurable and optimized for maximum performance.’ One thing is true, you are not going to get all of that. But you are going to try to get as much as possible. For different use cases on the same ASIC, you’re going to do optimization on top of optimization to find the common characteristics of all the optimization changes you need to do. And you’re going to try not to impact the entire system, or to become a chip unto yourself instead of a block inside it, which is likely the biggest danger in this kind of maximum configurability and flexibility approach.”

This is a fundamental contradiction of the term ASIC, which is no longer tied to a single application. “The ASIC is now required to have maximum flexibility, which was something reserved for FPGA designs for quite a long time,” Mijatovic said. “But from a certain perspective you simply cannot get the same performance (clock speeds) and price from an FPGA as you can from an ASIC. We are losing a bit of that gap between FPGA and ASIC approaches. For example, there are ASICs that incorporate small FPGAs in order to load some custom filtering and data processing optimization code based on a processor-requested application. And there is a bit of redundancy on-chip, which needs to be paid for in area, design time, verification time, and power consumption.”

And yet, some things remain the same with ASICs. “Once you’ve built it, what you support, you support,” he said. “There is no going back. You cannot just go fix it and release a patch. That’s what makes it different from software. Once it is out of the factory, you have to work with what you planned for. If at some point something is required from your ASIC that you forgot to build in, then it simply becomes obsolete. So many companies try to add some things at the very beginning that they suspect will be used.”

That makes decisions around microarchitectures, and how to design them, increasingly difficult.

“One of the biggest questions for every market research team is what to support,” he said. “Should we go with two chips, one, or maybe a family of them? What will be the market perspective if we go with one oversized chip and make it a bit expensive? Will anybody buy some of the chips we tried to do in a family when we decided to make five instead of one? Will we end up with one or two that nobody is using? These decisions are getting tricky because it’s all about higher performance, more complexity, and the demands of moving with the market. If you’re late, or if you missed one feature everybody uses, you are a complete market failure.”

Moving data
A huge consideration with many applications today is data movement and where data needs to be processed. The architecture determines which components need to be where, how fast they can run, and how quickly data can be moved. The microarchitecture, in contrast, determines how all of those resources are utilized, which can vary greatly from one application to the next.

“You’ll see people change algorithms,” said Rambus’ Woo. “You’ll see people reorganize computation. There is re-factoring, where you look at your application, and the way you might have written it 10 years ago could be completely different than the way you’d write it now. It’s the notion of, ‘If I re-factor my application, break it down differently to use the new silicon that’s available, how might I do things?’ That’s a lot of what goes on. With the hardware that’s available, and with the algorithms you want to try to implement, how do you re-factor those things? With some of these applications that have been around for many years, you know that it works, and at the same time you know there’s a lot of embedded knowledge in optimizing it for the architectures that exist today. The re-factoring task is very intensive for certain applications, and is a bit of a tradeoff, as well, in trying to figure out what the development cost is going to be. Sometimes it’s just not advantageous to undertake. It’s expensive to do software development and then to re-verify everything, and sometimes the equation just doesn’t work out.”

But the essentials still matter. While exploring different area, timing, and clock optimizations, all points must pass the basic timing test. “They have to,” said Synopsys’ de Geus. “Otherwise, the chip doesn’t work. The optimization really has, first and foremost, one objective. It says, ‘All of this will speed up the chip,’ and therefore timing is completely relevant. Secondly, power is the other dimension, and depending on what type of chip you do, power is more important or timing is. So most of the changes you can make are optimization changes of moving things around, of determining what margins to build into the layout, and thus impacting things like yield, while respecting the structures that you can’t really modify, such as testing structures. In that optimization, timing has always been the single most interconnected feature to optimize.”

That isn’t a new idea. “While Synopsys is known for synthesis, it was really synthesis with built-in timing verification,” he said. “And, of course, timing verification has become dramatically more complex with time borrowing, with moving clocks around, and looking at the statistics of timing. There’s a reason that most people, when they get chips back, ultimately do binning. It’s because there’s a statistical distribution on the timing. And we optimize not just the timing — we optimize the statistical distribution of those timings as much as possible, because that has a huge impact on yield if you have only one window of acceptability. If you can do binning, that means if some things are better than expected, you sell them for more, while others go into lower-performance applications. Timing is probably the single most interwoven aspect of the optimizations we do. And as you know, there are many things you can change in a design to meet timing.”
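
As a rough illustration of how that statistical spread turns into binning, the sketch below sorts dies by measured maximum frequency into speed grades. The thresholds, grades, and sample values are invented for the example and do not reflect any real product.

```python
# Illustrative only: sorting dies into speed bins by measured Fmax.
# The bin thresholds, labels, and sample data are invented for the example.
from collections import Counter

SPEED_BINS = [           # (label, minimum Fmax in GHz), checked highest first
    ("premium",  3.6),
    ("standard", 3.2),
    ("value",    2.8),
]

def bin_die(fmax_ghz):
    """Return the highest speed grade the die qualifies for, or None (scrap)."""
    for label, threshold in SPEED_BINS:
        if fmax_ghz >= threshold:
            return label
    return None

# Measured Fmax values follow a statistical distribution across the wafer.
measured = [3.71, 3.44, 3.05, 2.95, 3.62, 2.61, 3.33]
print(Counter(bin_die(f) for f in measured))
# Counter({'premium': 2, 'standard': 2, 'value': 2, None: 1})
```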

An important part of this process is design space exploration. “You have the specifications, along with the workloads that you would like to run, and you would like to find the best solution, so you have to be really quick, effective, and productive,” said Codasip’s Přikryl. “If you look at the standard approaches, such as RTL or Verilog coding, yes, you can do that, and it is still widely used. But in the face of today’s challenges, you are not exploring all the options that you could. I’m talking about the SDK here as well, because the RTL is only one part. You need to have a compiler that can feed this RTL.”

Přikryl suggests moving to a higher level of abstraction, such as an architecture description language. “We write Codal in such a way that we have a single description, and you can start with some sort of functional model,” he said. “We then start evaluating the system with the workloads that you have, and there you can tune things, refine your architectures, and keep adding details and re-evaluating things. You end up with RTL for the device that’s very tuned to your needs. But the steps through this design space exploration, from the functional model to the implementation model, have to be as smooth as possible. This cannot be done in RTL, because it’s too low-level. If you stay at the C level, it’s too high-level. The architecture description language gives you the power to go through this process.”
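
A tool-agnostic way to picture that exploration loop is sketched below. It is not Codal or any Codasip flow; the parameters, cost model, and weights are placeholders. But the shape is the same: enumerate candidate microarchitecture configurations, evaluate each against the workload, and converge on the best power/performance/area tradeoff.

```python
# Tool-agnostic sketch of a design-space exploration loop.
# Parameters, cost model, and numbers are placeholders, not any vendor's flow.
from itertools import product

DESIGN_SPACE = {
    "issue_width": [1, 2, 4],
    "cache_kb":    [16, 32, 64],
    "has_vector":  [False, True],
}

def evaluate(cfg):
    """Stand-in for running the workload on a functional model of `cfg`.
    Returns rough (cycles, power_mw, area_mm2) estimates."""
    cycles = 1_000_000 / (cfg["issue_width"] * (1.4 if cfg["has_vector"] else 1.0))
    power  = 50 + 20 * cfg["issue_width"] + 0.3 * cfg["cache_kb"] + (30 if cfg["has_vector"] else 0)
    area   = 1.0 + 0.4 * cfg["issue_width"] + 0.01 * cfg["cache_kb"] + (0.8 if cfg["has_vector"] else 0)
    return cycles, power, area

def score(cycles, power, area):
    # Simple weighted PPA objective; a real flow would use workload-level metrics.
    return cycles * 1e-4 + power * 0.5 + area * 20

configs = [dict(zip(DESIGN_SPACE, vals)) for vals in product(*DESIGN_SPACE.values())]
best = min(configs, key=lambda c: score(*evaluate(c)))
print("best configuration:", best)
```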

EDA tools under stress
But rising complexity in microarchitectures also puts more demand on EDA tools.

“The challenge is that consumers need much more than they can get with today’s electronics,” said Simon Davidmann, CEO of Imperas. “Electronics are orders of magnitude more complex now. It’s pushing EDA every which way. It’s not uncommon to see a 1.2 billion-transistor chip for an AI inference engine with 150-plus processors. These kinds of chips change things in lots of ways.”

Giant heterogeneous chips are stretching EDA tools, as well. Davidmann pointed to one case of a customer running an AI inference engine with 150 processors, many of which include complex vector units, among other things, in an AI framework. “It may take two hours to run one of the test sequences, and that is on a pretty sophisticated simulator. But it’s abstractly running however many billions of instructions, so the tool is stretched every which way.”

Davidmann sees the semiconductor industry moving into a different phase, where the complexity and challenges are very much at the forefront. “This is like when Inmos started doing its transputer in the 1980s. They decided they would build everything themselves. Or when Simon Knowles at Graphcore said, ‘We’re going to build this huge parallel machine,’ and they thought, ‘There’s no point going to talk to big EDA, because they’ll just try to sell me Verilog. I’m not interested in Verilog. I’m looking for a machine that can simulate AI,’ so they had to build it themselves. Over time there will likely be consensus for the types of technologies that chip companies need to help them do the different things, and maybe those tools will become available.”

This new cycle is being driven by systems companies such as Google, Facebook, Amazon, Microsoft, Apple, and Tesla, which are creating architecture and design innovations, many centered around AI.

“Twenty years ago, everyone was trying to build parallel hardware, but there was no software,” he said. “No one could program it, no one knew what to do. Very few companies succeeded at building parallel machines. Fast forward to now, and the problem is how to run these AI frameworks on the hardware. They don’t run on an x86 very well, so let’s build one with 100 cores in it, or 1,000 cores in it. That’s why the Graphcores, and all these companies building stuff around Arm or proprietary architectures or RISC-V — everybody’s building their own microarchitecture. They are building these architectures to solve their software problem. And from an EDA point of view, it’s very exciting, because they are pushing EDA with the types of technologies and solutions and the magnitude of the capacity that they need. The leading edge says, ‘I don’t want to simulate one, I want to simulate 1,000 running this stuff. And I want to run it in an hour. I don’t want it to run in a month, or a year. I need the results tomorrow morning every time.’ So they’re pushing it like that. They’re also coming up with new architectures that include vectors and other advanced operations that do things very differently. We don’t really have to understand what these software algorithms are that are running on top of this. We’ve just got to get the microarchitectures to simulate it, and get them excited about it.”

Open source also plays into these dynamics, with much of the work surrounding machine learning living in the open-source domain.

Davidmann sees this feeding into EDA tools, with advanced engineering teams looking for tools they can configure and customize themselves. “Most traditional tools were proprietary and closed. [These teams] don’t want to build simulators. They want to build architectures, and that’s where we’ve hit that sweet spot with them. They don’t care about the simulation underneath. It’s the architecture they want to control and not be restricted in how it should be. Open source is shining a light on the kind of flexibility that people are looking for in new tools.”

Conclusion
Overall, microarchitecture challenges are driven by the needs of the new-world software that’s pushing EDA. “This means EDA has to be more efficient,” Davidmann said. “It’s got to start to use some of this new-world software, because we’re not going to be able to help people design 50 billion-transistor chips unless we have much better tools. It’s making us step up to the mark to build scalable, more efficient, faster, more accessible tools that are cloud-based, with flexible licensing.”

None of this will be simple, though, and it will never be enough.

“The minute you have something useful, users will say, ‘But it’s dog slow, and how come it cannot do this and that?’ What is clear is that whenever you can automatically optimize the lower steps, it moves you up a step as an architect,” said de Geus. “Then you can say, ‘If it had just known that parallel is okay, and not just serial, it could do so much better.’ Why did you not tell it? Because you couldn’t do anything with it at that point in time. The optimizations are both supporting you and also somewhat your limiter. But in that context, if we can enrich the language, step by step, by essentially focusing more on the very optionality that you have to decide upon, this is why you will never be happy with [EDA]. And yet it will make you happy that you can move up, because for every new insight that you have that the tool cannot immediately absorb, you’re going to say, ‘This is still primitive stuff.’ And it is still primitive stuff.”

But it’s also changing rapidly.


