Keeping The Balance

Making power/performance tradeoffs in multi- or many-core systems requires careful balance; fully understanding the application is critical.


By Ann Steffora Mutschler
The brains of datacenters today are more powerful than ever due to technology advancements in chip architectures and in manufacturing processes that allow more processing power thanks to Moore’s Law. But knowing exactly how and where to configure the processors and cores for optimum throughput and performance within a certain power budget raises a number of questions.

“The first question should always be, ‘What is the nature of the application, especially with respect to how much parallelization is available?’” said Chris Rowen, CTO at Tensilica. “Whenever you’re talking about multi-core or SIMD or vector processors, you’re always talking about how to most effectively exploit the parallelism that’s in the application. And you have parallelism available across that spectrum, from graphics or imaging on the handset to this big datacenter machine. But it comes in different flavors.”

In the datacenter, there are tens or hundreds of thousands of requests, such as running a database, serving a map, or fielding Twitter tweets. There is massive parallelism and a fair degree of independence. These requests may all be tapping into the same data in some sense, but what one person is looking for on their map and what another person is looking for on another map don’t interact very much. “So the first thing you have is lots of threads—lots of very independent tasks. The challenge for the system design team is to get maximum throughput at minimum cost and power for that largely independent set of requests that are coming through,” Rowen explained.
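That "lots of independent threads" pattern is what makes datacenter workloads scale across many cores: because requests don't interact, adding workers adds throughput almost linearly until memory or I/O saturates. A minimal sketch of the pattern, with a made-up request handler standing in for a real map or database lookup:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for one independent request (e.g., a map-tile
# lookup); a real handler would read shared but rarely contended data.
def handle_request(request_id):
    return f"result-{request_id}"

# Requests don't interact, so throughput grows by simply adding
# workers (cores) until memory bandwidth or I/O becomes the bottleneck.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(handle_request, range(100)))

print(len(results))  # 100 independent results
```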

To this point, Charles Webb, IBM fellow of System z processor design, said the first thing to note about the mainframe System z is that it is optimized around system-level, and even datacenter-level, power/performance efficiency. “We really do design this for scale, which changes the way we do some things. At the system level there is somewhat of a higher fixed cost in terms of power, but when you fully configure it, it yields something much more efficient. If you go back in history a couple of generations, to 2008, we introduced the System z10, and that one took a very large jump in processor performance and frequency. It was a redesign of the processor. We did change some of the system-level characteristics, but we also committed at that point, and going forward, to continuing to grow the system performance within that power envelope. What our clients typically do with the large systems is they’ve got them in place in the datacenters and periodically they refresh the technology to get more performance and more capacity.”

In terms of how power is allocated among the processors, he explained that the processors are housed in what IBM calls ‘books,’ whereby each book is a circuit board that’s actually a very large blade. It includes a multichip module with the processors on it, along with memory and I/O. “We actually drive to keep the power per book about the same going forward from one generation to the next, and translate that down into the power at the MCM level being about the same from one generation to the next. What that does from a design point of view is as we get a new processor technology, we know going in that we’ve got pretty much a fixed power budget to work with at the processor level. And then within that we look to balance what we can from per-processor growth and from total system-capacity growth, and provide balanced growth across those two dimensions for our clients because they care about both of those,” Webb said.

He noted that on System z they’ve tended to tilt that a little more toward per-processor growth because that has been a key metric customers care about: being able to run their transactions faster, and being able to run their batch jobs faster, or at least in the same batch window as those jobs get more complex. “Versus other platforms, we’ve probably tilted our scale a little bit more toward per-thread and per-processor performance. But in the end, we want to be able to grow the system capacity about 50% per generation and grow the per-processor performance as much within that as the technology will allow, given the power/performance tradeoffs.”

The biggest challenge to achieving this system capacity growth is keeping the balance, Webb said. “The processor performance is the one that’s becoming more and more challenging. Growing the raw capacity is driven largely by density and there the biggest challenge is the software scalability and keeping enough cache close to the processors to keep them fed. You can do a system that jams a whole bunch of processors onto a die, you can get nice, raw benchmark capacity but when you go and try to run that on a real, enterprise-scale transaction processing workload, you’re starving the processors for data.”

IBM plays a balancing game among the speed of the processors, the number of processors, and the amount of cache. One of the “big levers,” Webb said, is embedded DRAM (eDRAM) technology. “That lets us put quite large caches very close to the processors. We have a 48MB cache on the processor chip that’s shared among the cores on that processor, and then separately we have even larger caches that are still on the MCM that are shared by all the cores on that module. We have two chips that are basically full of eDRAM that also have the fabric dataflow and the directory and controls. But we end up with a very nice-sized 384MB cache that is shared among those processors, and that keeps the data close. Part of the balance we have to play there is making sure that we can keep all those processors fed in a very data-intensive environment.”
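The payoff of those large shared caches can be seen in a standard average-access-time calculation: the more references the on-chip and MCM caches absorb, the fewer cycles the cores spend stalled on memory. A rough sketch, using the cache sizes from the article but invented hit rates and latencies purely for illustration:

```python
# Illustrative average-access-time model for a shared cache hierarchy.
# Sizes mirror the article (48MB on-chip, 384MB eDRAM on the MCM);
# the hit rates and cycle counts below are made up for illustration.
levels = [
    ("48MB on-chip shared cache", 0.70, 30),    # fraction of accesses, cycles
    ("384MB eDRAM MCM cache",     0.25, 120),
    ("main memory",               0.05, 400),
]

# Weighted sum: each access costs the latency of the level that serves it.
amat = sum(fraction * cycles for _, fraction, cycles in levels)
print(f"average access: {amat:.0f} cycles")  # shrinking the 5% that goes
# to memory is what "keeping the processors fed" buys.
```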

The maximum system configuration of IBM’s zEC12 mainframe has 120 active cores, of which 101 are customer-configurable. The processors are manufactured on IBM’s 32nm SOI process.

Power management is particularly important in large server farms because the enormous number of servers maps directly to the huge cost of powering and cooling the machines. The same concerns apply in desktop machines and in mobile devices, where battery technology has been relatively stagnant for years. Nvidia’s strategy for its Tegra 2 SoCs includes using SMP technology, which allows cores to be power gated off when not needed.

Nvidia’s engineering team uses DVFS (dynamic voltage and frequency scaling) technology to run the cores based on workload. “We were symmetric, we weren’t asynchronous, so we couldn’t run the cores at different operating voltages and thus frequencies, but we scaled them together based on the workload,” said Nick Stam, director of technical marketing at Nvidia.
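The reason DVFS pays off so well is that dynamic power scales roughly with C·V²·f, so lowering voltage along with frequency saves power superlinearly. A small sketch of that relationship; the operating points and capacitance value are invented for illustration, since real voltage/frequency tables come from silicon characterization:

```python
# Dynamic CMOS power is approximately C * V^2 * f: halving frequency
# alone halves power, but dropping voltage with it saves much more.
def dynamic_power(capacitance_f, voltage_v, freq_hz):
    return capacitance_f * voltage_v**2 * freq_hz

opp_table = [  # hypothetical (voltage V, frequency Hz) operating points
    (0.85, 500e6),
    (1.00, 1000e6),
    (1.20, 1500e6),
]

C = 1e-9  # effective switched capacitance in farads (made up)
for v, f in opp_table:
    print(f"{f/1e6:.0f} MHz @ {v:.2f} V -> {dynamic_power(C, v, f)*1e3:.0f} mW")
```

Note that the top operating point burns far more than 1.5× the power of the middle one for 1.5× the frequency, which is exactly why a governor drops to lower points whenever the workload allows.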

“Our GPU cores in Tegra 2 and in Tegra 3 are able to do a lot of the things you see in CPUs and GPUs on the desktops—but on the mobile side, of course—and that is to power gate many of the blocks within the core. You could independently power gate sections of those cores that aren’t being used. In 3D, for example, there are times when you might be doing one sort of operation but not another, like fetching pixels but not texturing or processing vertices. Generally you’re going to be doing everything at some level of operation because you’re pipelined, but there can be times when you’re not fetching textures or doing heavy-duty processing with vertex or pixel shaders. You can scale them down. So we can turn off units as needed, or power down an entire core or multiple cores and power up those cores based on workload,” Stam said.

Moving to Tegra 3, Nvidia took a more intensive approach, power gating each of the cores, both GPU and CPU, and implemented a technology similar in concept to ARM’s big.LITTLE. “We did it first,” Stam said. “We have 4-PLUS-1, which we call our variable SMP architecture, where we have four Cortex-A9s as the main complex. You can power gate each of those cores individually, turn them off, and have only one core working on a workload if the workload doesn’t require much to be done. Just turn on one main core and then selectively turn on cores two, three and four as needed. They are running in a symmetric mode, so they’ll all be running at the same GHz or MHz.”

The “PLUS-1” is the fifth core called the Power Saver core, which can be used dynamically. “We can run either the main set of cores or the Power Saver core depending on workload. Say you’re doing video streaming or audio streaming or background syncs of Facebook or email—things like that that aren’t necessarily intensive like processing a Web page or playing a game. You use this fifth core and that’s a significant savings,” he said.
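The decision Stam describes can be sketched as a simple governor: very light loads run on the companion Power Saver core, while heavier loads switch to the main cluster and bring cores online one at a time. The threshold below is invented, and a real governor would also weigh latency, thermal limits, and the cost of the cluster switch itself:

```python
import math

# Hypothetical policy sketch for a 4-PLUS-1 arrangement. "load" is
# expressed in cores' worth of runnable work (0.0 to 4.0 and beyond).
LIGHT_LOAD = 0.3  # invented threshold for streaming/background work

def choose_cores(load):
    if load < LIGHT_LOAD:
        # Audio streaming, background syncs, etc. stay on the
        # low-power companion core.
        return ("power-saver", 1)
    # Main cores run symmetrically (one shared frequency), so the
    # governor only varies how many are powered up.
    return ("main", min(4, max(1, math.ceil(load))))

print(choose_cores(0.1))  # light background sync -> companion core
print(choose_cores(2.4))  # game workload -> three main cores online
```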

Making Tradeoffs
Once IBM’s engineers have a design point established, Webb said, they model it and see where they’re sitting. “Usually the answer is the power is higher than we want it to be. That becomes one of the challenges for the team to work on as we refine the design—to use lower-power circuits on paths that aren’t timing-critical, to really tune things so we’re only spending the power where we need it to get the performance, and to make those tradeoffs.”

The tradeoff may be as fundamental as saving cycles vs. adding circuits. “We really look hard and ask, ‘How much power am I going to add for this circuit that I need there versus how much performance is it going to deliver?’ At the end of the day, when we finally get the real silicon, we’re not just modeling anymore. We’re modeling and measuring. We look at the actual silicon that’s coming out of the fab and measure the distribution that we’re getting on the power/performance characteristics and such. In the end we have to make a call at what frequency can we run this and stay within our power envelope,” he said.

Nevertheless, power and frequency are somewhat fungible, because the voltage and frequency at which the chip runs can be adjusted against each other. “From a design point of view, working on driving down the cycle time and working on driving down the power go hand in hand. We constantly review all of those metrics and what we track through the design process, so that in the end we’ve got something that’s balanced and that we’ll have the flexibility to get whatever the silicon process can yield and deliver the products. Sometimes we don’t make that final call on what frequency we’re going to ship at until two or three months before we’re ready to go out the door,” Webb added.
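That final frequency call is, in effect, a binning decision: given measured power for parts coming out of the fab at candidate frequencies, ship at the highest frequency that stays inside the power envelope. A minimal sketch, with all numbers invented for illustration:

```python
# Hypothetical measured data: candidate frequency (GHz) -> power (W)
# observed on real silicon out of the fab. Values are made up.
measured = {
    5.0: 250,
    5.2: 280,
    5.5: 330,
}

POWER_ENVELOPE_W = 300  # fixed budget carried across generations

# Pick the fastest operating point that stays within the envelope.
ship_freq = max(f for f, p in measured.items() if p <= POWER_ENVELOPE_W)
print(f"ship at {ship_freq} GHz")
```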
