What needs to be considered in optimizing SoCs for power and performance.
Maximizing SoC performance and minimizing power is becoming a multi-layered and multi-company challenge that depends on everything from ecosystem feedback and interactions to micro-architectural decisions about whether analog circuits whisper or shout.
What used to be a straightforward architectural tradeoff between performance and power has evolved into a much more diffuse and collaborative process. And it is only becoming more complex as chipmakers seek to leverage their designs across more products, many of which need to be customized for specific applications.
“The architecture is a world of instructions and memory system rules that sets out a contract between hardware and software to ensure working devices can be developed quickly,” said Peter Greenhalgh, director of technology and a fellow at ARM. “Within this architectural world, the micro-architectures are implementations that might be software-compatible but address anything from an automotive drivetrain to an enterprise server to automotive in-vehicle infotainment (IVI).”
In effect, this has become a customizable platform strategy. When ARM develops a processor core, it also builds micro-architectures for specific market needs based upon performance and power. From there, ARM’s customers and partners modify it even further.
“The process is going to be entirely driven by what the end application is, and in particular, what the demanding parts of the end application are—and almost always in one of these platforms you recognize there’s a handful of things that are going to set the pace,” said Chris Rowen, Fellow and CTO of the IP Group at Cadence. “The architecture decision process tends to look like this—if you’re doing high-performance computing, you have a set of computational benchmarks. If you have a mobile handset, you have a long checklist of capabilities that are driven by what the competition is doing. If you’re building a very new kind of platform for some new layer in networking, you have very concrete computational demands and very concrete I/O flows that have to be supported. If the purpose of the box is to transform a stream of data coming in at so many hundreds of gigabits per second into a stream going out, that immediately tells you something about what the fundamental bandwidth requirements are, and probably what the fundamental computational requirements are.”
Consider a vision-based system, for example. “I know how many cameras are coming in,” said Rowen. “I know roughly what kind of computation is going to be required because I may have decided this is a neural network-based system and my research shows I have this many trillions of operations per second, which are necessary to implement those algorithms. That’s going to be one of the central parts of the decision. How do I bring that amount of mandatory I/O and do that level of probable compute, plus some headroom because the compute requirements do always change and usually go up rather than down over the course of it? That will set some fundamental needs in terms of what kinds of processors, how many processors I’m putting in there.”
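As a rough illustration of the sizing exercise Rowen describes, the sketch below (in Python, using purely hypothetical camera, frame-rate, and ops-per-pixel figures) estimates the mandatory I/O bandwidth and the probable compute, with headroom, for a multi-camera neural-network vision system.

```python
# Rough sizing sketch for a hypothetical multi-camera vision SoC.
# All figures below are illustrative assumptions, not data from the article.

NUM_CAMERAS   = 4            # camera streams coming in
WIDTH, HEIGHT = 1920, 1080   # pixels per frame
FPS           = 30           # frames per second
BYTES_PER_PIX = 2            # e.g., 16-bit raw or YUV422
OPS_PER_PIXEL = 10_000       # assumed neural-network cost per input pixel
HEADROOM      = 1.5          # compute requirements usually grow, not shrink

# Mandatory I/O: raw pixel traffic the chip must absorb.
pixels_per_sec = NUM_CAMERAS * WIDTH * HEIGHT * FPS
io_gbps = pixels_per_sec * BYTES_PER_PIX * 8 / 1e9

# Probable compute, plus headroom, in trillions of operations per second.
tops_required = pixels_per_sec * OPS_PER_PIXEL * HEADROOM / 1e12

print(f"Camera input bandwidth: {io_gbps:.1f} Gb/s")
print(f"Compute with headroom : {tops_required:.1f} TOPS")
```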
So for a smartphone there will likely be a GPU and a number of ARM cores of different sizes. “I may say my differentiating characteristic is this level of vision processing performance,” he noted. “I need this category of vision processing. That populates the mandatory I/O and the mandatory compute.”
From there, and specifically from the compute side, there is usually a complementary analysis of how much memory bandwidth is needed, he explained. Then the fundamental plumbing of the chip is worked out: specifically the memory type, width, and data rate, plus some of the other characteristics of the fundamental DDR interface.
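The arithmetic behind that plumbing is largely a peak-bandwidth calculation: data rate times bus width, derated for achievable efficiency. The sketch below is a minimal version of it, with illustrative interface options and an assumed bandwidth budget rather than figures from any real design.

```python
# Peak DDR bandwidth = data rate (MT/s) x bus width (bits) / 8, derated for
# achievable efficiency. All parameters here are illustrative assumptions.

def usable_bandwidth_gbs(data_rate_mts, bus_width_bits, efficiency=0.75):
    """Usable bandwidth in GB/s for one memory channel."""
    return data_rate_mts * 1e6 * bus_width_bits / 8 * efficiency / 1e9

required_gbs = 15.0   # assumed budget from the compute-side analysis
options = {
    "LPDDR4-3200 x32, 1 channel":  usable_bandwidth_gbs(3200, 32),
    "LPDDR4-3200 x32, 2 channels": usable_bandwidth_gbs(3200, 32) * 2,
    "DDR4-2400 x64, 1 channel":    usable_bandwidth_gbs(2400, 64),
}
for name, gbs in options.items():
    verdict = "meets" if gbs >= required_gbs else "falls short of"
    print(f"{name}: {gbs:5.1f} GB/s, {verdict} the {required_gbs:.0f} GB/s budget")
```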
“For many of these platforms, you’re really driven by performance, fundamentally because at the end of the day people go through the pain and expense of designing a new chip for a reason, and it’s usually because they want to accomplish something significantly harder than what they’ve done before,” Rowen said. “Often it is because they are pushing the envelope in terms of capability. The capability is fundamentally about what kind of data streams and how much compute I can do on them. A central part of the cost is actually associated with the memory system, so the memory system is a critical second step. The third step is wiring up all of the other incidental bits and pieces—I need this many USB ports, I need I2C, I have certain requirements for my audio interfaces. And I need to make sure I can talk to flash and other things, which may not be as central a part of the performance makeup but which may be important to the cost and certainly important to fulfilling the overall checklist of requirements,” Rowen added.
Getting memory right
Part of the architecture and micro-architecture decision process includes a number of choices around the memory, stressed Nick Heaton, distinguished engineer at Cadence. “Very quickly the architects of these devices come to a point of decision about the fundamental DDR technology they’re going to use, because that has a big cost implication for the product. Typically they’re going to make that decision based on the product. In a mobile device, where you want to provide the right amount of performance at the right price point, you would want to put in the cheapest DDR you can get away with that does the job adequately. If you’re trying to build an ARM server where performance is everything, you’ll pick the highest-spec DDR that your budget will allow, because that ultimately will determine the end benchmark metrics that you’re looking to hit.”
He noted that another critical issue that has come to the forefront in the last few years is making decisions about coherent and non-coherent paths to memory. “What ARM has introduced in the last few years is the idea of shared L2 caches, and that has an upside and a downside. It’s got a cost in the hardware, but it can get you some additional performance depending on the use case. It may be that in certain applications traffic just bypasses the coherent part of the system and goes straight to memory because it’s so big there’s no way it could be stored in cache. But other peripherals you would definitely target at the coherent path, where they interact closely. If you’re in a networking switch, you might be looking at processing packet headers, and a lot of that data might sit very nicely in cache, in which case you get a big power and performance benefit because you reduce the number of external memory accesses.”
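A rough way to see why caching packet headers pays off is to compare the energy of an on-chip cache hit against a trip to external DRAM. The sketch below does that with placeholder per-access energies and an assumed access rate; the numbers are illustrative, not measurements.

```python
# Energy comparison for the packet-header example. Per-access energies and the
# access rate are rough placeholder assumptions, not measured data.

CACHE_ACCESS_NJ = 0.5    # assumed energy per on-chip cache access (nJ)
DRAM_ACCESS_NJ  = 10.0   # assumed energy per external DRAM access (nJ)

def memory_power_mw(accesses_per_sec, hit_rate):
    """Average memory-system power (mW) for a given cache hit rate."""
    hits   = accesses_per_sec * hit_rate
    misses = accesses_per_sec * (1.0 - hit_rate)
    nj_per_sec = hits * CACHE_ACCESS_NJ + misses * (CACHE_ACCESS_NJ + DRAM_ACCESS_NJ)
    return nj_per_sec * 1e-6   # nJ/s -> mW

accesses = 50e6   # hypothetical header accesses per second in a switch
for hit_rate in (0.0, 0.5, 0.9):
    print(f"hit rate {hit_rate:.0%}: {memory_power_mw(accesses, hit_rate):6.1f} mW")
```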
Initially there are different levels of macro-architectural decisions, but ultimately there is progressive refinement. A lot of the early exploration and targeting of an architecture can be done in high-level modeling.
“You might start to model what your algorithms look like, and hence get a sense of the memory bandwidth and computational requirements,” Heaton said. “That can direct you to the kind of processor choices and architecture that would match, but ultimately this gets refined all the way to tape-out. You don’t stop doing analysis, because with big multicore systems predicting this performance is actually quite hard. You can make early predictions, but they can vary significantly as you implement the design in RTL and ultimately in silicon.”
IP vendors are well aware these kinds of tradeoffs are being made, and they have added some flexibility into their IP to account for it.
“An engineering team working on an A-Class, high-performance processor used in IVI will chase every fraction of a percent of performance and energy efficiency without needing to hold back to ensure determinism and interrupt latency requirements are met,” said Greenhalgh. “The engineering team working on an R-Class, automotive processor used in drivetrain will be considering whether each feature provides deterministic performance and efficiency in a way that can be tested to meet stringent safety requirements with low interrupt latency. The R-Class team will also have to consider that automotive drivetrain products will want to optimize for embedded flash, which means both extra interfaces and designing for process nodes that are often two generations behind the leading-edge consumer process nodes.”
Navraj Nandra, senior director of marketing for the DesignWare Analog and MSIP Solutions Group at Synopsys, agreed it is essential to get the memory right. He said engineering teams today are using a mix of different types of CPUs, as well as graphics accelerators. Approaches like ARM’s big.LITTLE strategy affect what is happening at a macro level for these processing elements, all of which have to talk to an external memory through the interconnects.
“In making their IP they make a lot of interesting tradeoffs between DRAM capacity, DRAM bandwidth, and the types of access the system can make,” Nandra said. “These are very challenging tradeoffs because they are multi-dimensional and parallel, which makes it very difficult for an engineer to solve the problem.”
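One way to picture that multi-dimensional tradeoff is as a small design-space sweep. The sketch below, with hypothetical per-channel capacities, speeds, and cost units, filters DRAM configurations against assumed capacity and bandwidth requirements and picks the cheapest one that satisfies both.

```python
# Toy sweep over DRAM configurations, illustrating a multi-dimensional tradeoff.
# Capacities, speeds, requirements and cost units are all hypothetical.

from itertools import product

REQ_CAPACITY_GB  = 6.0    # assumed system memory requirement
REQ_BANDWIDTH_GB = 15.0   # assumed bandwidth budget from the compute analysis

viable = []
for channels, speed_mts in product((1, 2), (2400, 3200, 4266)):
    capacity  = 4 * channels                               # assume 4 GB per channel
    bandwidth = speed_mts * 1e6 * 32 / 8 * channels / 1e9  # x32 per channel, GB/s
    cost      = channels * (1.0 + speed_mts / 4000)        # arbitrary cost units
    if capacity >= REQ_CAPACITY_GB and bandwidth >= REQ_BANDWIDTH_GB:
        viable.append((cost, channels, speed_mts, capacity, bandwidth))

# Of the configurations that satisfy both constraints, take the cheapest.
cost, channels, speed_mts, capacity, bandwidth = min(viable)
print(f"Cheapest viable option: {channels} channel(s) of {speed_mts} MT/s x32 -> "
      f"{capacity} GB, {bandwidth:.1f} GB/s, cost {cost:.2f}")
```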
At the same time, a very large fraction of the power in many of these SoCs is spent on the analog circuits that do things like communicate with the outside world, explained Steven Woo, vice president of solutions marketing at Rambus. “What you’re starting to see is more and more of an emphasis on trying to come up with better signaling technologies that will use less power on the signal yet maintain good signal integrity. Also, in general, people are trying not to move data very large distances. If you need to move data really long distances, it’s equivalent to saying those analog circuits have to shout what the data is. And if it’s a little closer, you can get closer to a whisper and drop the amount of power down quite a bit. People are really trying not to move data over long distances because it’s a power issue.”
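To put rough numbers on the whisper-versus-shout distinction, the sketch below multiplies a hypothetical traffic level by assumed energy-per-bit values for short on-chip hops, long on-chip wires, and off-chip links; the pJ/bit figures are illustrative placeholders, not vendor data.

```python
# Signaling power versus distance ("whisper versus shout"). The energy-per-bit
# values are rough placeholder assumptions, not vendor or measured figures.

ENERGY_PJ_PER_BIT = {
    "on-chip, ~1 mm hop":   0.1,   # assumed
    "on-chip, ~10 mm wire": 1.0,   # assumed
    "off-chip memory link": 15.0,  # assumed
}

TRAFFIC_GBPS = 100.0   # hypothetical sustained data movement, gigabits per second

for path, pj_per_bit in ENERGY_PJ_PER_BIT.items():
    watts = TRAFFIC_GBPS * 1e9 * pj_per_bit * 1e-12
    print(f"{path:20s}: {watts:5.2f} W to move {TRAFFIC_GBPS:.0f} Gb/s")
```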
Would a fresh start be easier?
Given all of these considerations, it does seem like it might be easier to start with a blank sheet of paper for an SoC.
“Given a blank slate, and an indefinite amount of time, you can come up with the best tradeoff,” said Arvind Shanmugavel, senior director of applications engineering at Ansys-Apache. “But unfortunately the industry is so competitive that the window of opportunity for these chip manufacturers is so small, they really cannot afford to do a ground up design anymore.”
Most engineering groups today start with an existing design or an existing architecture, making slight modifications just so that they can meet time to market. Shanmugavel said they recognize they’re leaving a lot on the table in terms of optimization, but this is the nature of the game due to time-to-market requirements.
Thomas Bollaert, technical director, Calypto products at Mentor Graphics, noted that high-performance design happens at every stage of the design process. “It’s not that you can resolve everything up front at the architectural level. Even when you start moving down into implementation, you’re still going to have some considerations about the performance aspect of your design. But if we start at the very beginning, there are ways to look at the performance of the architecture with high-level models. This is really what the traditional ESL has been about, and there’s been a lot of talk about this for many years, but today people are truly doing this because there is a real need to look at the performance aspect of the design.”
He said complete chips are never started from a blank page, primarily because there is a lot of reuse and IP that is sourced from third-party vendors. “There are nonetheless going to be some blocks of your design and some architectural decisions that will be yours. Today, high-level architectural tools exist, such as ESL architectural analysis tools. They have models and technologies that help people put those architectural prototypes together pretty quickly. That is definitely happening at the chip architecture level. But then there are different things you’ll want to consider when you look at each of the blocks and IPs inside that SoC architecture. What you may want to consider will actually differ based on the kinds of blocks you are looking at. We see in the industry right now a lot of activity on video codecs, so that’s a key component of almost every chip these days because most chips are connected and need to process video in one form or another. It used to be easy in the sense that there were one or two legacy standard video codecs. Now there are a lot of new codecs in the market, and chips must support those. All of this is to say that there’s a lot of design activity going on, and finding the right architecture for these codecs is important. At this stage, they’re not necessarily going to use architectural analysis tools; high-level synthesis is actually the better way to go about finding the optimal architecture for this algorithmic IP in the system.”
Interestingly, Shanmugavel sees a high degree of over-design in chips today, ‘just to make sure.’ “They know it worked in a previous generation, so the attitude is, ‘Let’s just over-design a little bit; time is more important for me, so that’s fine.’ But in fact, if you don’t over-design, you can have a better, faster product, something that’s more reliable. You can even build it cheaper through die size reductions or by optimizing the chip and package together. There are a lot of different optimizations that you can do at the physical implementation level.”
“People over-design thinking that they’re covering all the bases, but we’ve also seen over-design done incorrectly,” he continued. “For example, when you’re trying to design the power grid, power integrity is a very important aspect in terms of performance. You have very low operating voltages nowadays and a very low noise margin. When you’re designing the power grid, over-designing the wrong layers can backfire. If the bottleneck for your power noise is, let’s say, Metal 2, but you’ve over-designed Metal 4, 5 and 6 while under-designing Metal 2 and 3, you’ve exacerbated the bottleneck while improving something that doesn’t really matter. It’s a very slippery slope.”
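Shanmugavel’s metal-layer example can be seen in a simplified series-resistance model of the power-delivery path. The per-layer resistances below are purely illustrative, but they show why widening Metal 4 through 6 does little for IR drop if Metal 2 and 3 remain the bottleneck.

```python
# Simplified series model of a power-delivery path, one resistance per metal
# layer from the top of the stack down to the device. Values are illustrative.

def ir_drop_mv(layer_resistances_mohm, current_a):
    """Total IR drop in mV for a current flowing through the stacked layers."""
    return sum(layer_resistances_mohm) * current_a

CURRENT_A = 2.0   # hypothetical current drawn by the block

baseline        = {"M6": 5, "M5": 5, "M4": 10, "M3": 30, "M2": 60}   # milliohms
widen_m4_to_m6  = {"M6": 2, "M5": 2, "M4": 4,  "M3": 30, "M2": 60}   # over-design top layers
widen_m2_and_m3 = {"M6": 5, "M5": 5, "M4": 10, "M3": 20, "M2": 30}   # fix the bottleneck

for name, grid in (("baseline", baseline),
                   ("over-design M4-M6", widen_m4_to_m6),
                   ("strengthen M2/M3", widen_m2_and_m3)):
    print(f"{name:18s}: {ir_drop_mv(grid.values(), CURRENT_A):5.0f} mV drop")
```

Even with the top layers made several times wider, the total drop in this model barely moves, because the narrow lower layers dominate the resistance.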
Shanmugavel stressed that navigating these issues requires a high level of expertise along with the analysis capability and the analytics capability. “How do you analyze a design from an early stage to sign-off, and change the analysis modes from being just purely descriptive to being prescriptive? You need to have smart tools that can analyze the design, but also be able to tell you what to do next. On the analytics side, how do you convert the data that comes from the analysis into information that can help you design better? At the end of the day, there is no replacement for good engineering judgment. You can have fantastic tools, but if you have poor judgment in engineering, it’s not going to cut it.”