IBM’s Power7+ Processor

It’s not all about performance. The gain in energy efficiency is the real architectural breakthrough.

By Barry Pangrle
Hot Chips 24 was held Aug. 27-29, with tutorials on the first day and 30-minute technical presentations plus keynote addresses on the second and third days. There were a lot of great presentations, and Hot Chips is definitely one of my favorite conferences.

So out of all of the presentations, why did I choose the one by IBM’s Scott Taylor on Power7+? Well, it’s likely that I’ll write some future articles on other presentations, too, but I wanted to focus on energy efficiency. To reach the goals of the Challenges in Exascale Computing initiative, it will be necessary to drastically reduce the amount of energy needed for every operation performed.

Remember when performance was measured in MIPS? Well, maybe in some cases it still is, but we now have off-the-shelf graphics cards that are capable of a teraflop/s (10^12 flop/s) of computing power, and we talk about supercomputers with petaflop/s (10^15 flop/s) of performance. IBM’s Sequoia supercomputer (a BlueGene/Q system) shot to the top of the Top500 supercomputing list this past June, clocking in at 16.32 petaflop/s on the Linpack benchmark using 1,572,864 cores (and that was using IBM’s 45nm parts, by the way). Systems listed as “BlueGene/Q, Power BQC 16C 1.60GHz, Custom” also currently dominate the top of the Green500 list. Sequoia places 20th in the megaflop/s/watt category with 19 of its brethren in front of it, and the difference between 1 and 20 is minuscule compared with the drop to the first non-BlueGene/Q system at 21.

To put this into better perspective, Sequoia is getting just a bit above 10 gigaflop/s per core and the whole system reportedly draws 7.9 megawatts. That may seem like a lot of power, and it is, but it’s still only about 63% of the power of Fujitsu’s “K Computer,” which it pushed off the perch while benchmarking about 1.55 times faster. Sequoia is achieving a bit better than 2 gigaflop/s per watt, or 2 gigaflops per joule. Equivalently, that is just below 500 picojoules per flop. So if we want to, say, hit 1 exaflop/s and keep it within a 100 megawatt envelope, we would need to achieve 10 gigaflop/s per watt, or 100 picojoules per flop. That still leaves about a factor of 5 in efficiency to reach the magical 1 exaflop/s target.
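For readers who want to check the arithmetic, here is a small Python sketch that recomputes the per-core rate, the flop/s per watt and the picojoules per flop from the Linpack and power figures quoted in this article. The function and variable names are my own, and the 100 picojoule target is simply the one used in the text.

```python
# Figures quoted in the article for Sequoia's Linpack run.
LINPACK_FLOPS = 16.32e15    # 16.32 petaflop/s
CORES = 1_572_864
SYSTEM_POWER_W = 7.9e6      # 7.9 megawatts, as reported

TARGET_PJ_PER_FLOP = 100.0  # target used in the text


def flops_per_watt(flops_per_second, watts):
    """Sustained flop/s delivered per watt (same as flops per joule)."""
    return flops_per_second / watts


def picojoules_per_flop(flops_per_second, watts):
    """Energy spent per floating-point operation, in picojoules."""
    return watts / flops_per_second * 1e12


per_core = LINPACK_FLOPS / CORES
efficiency = flops_per_watt(LINPACK_FLOPS, SYSTEM_POWER_W)
energy_pj = picojoules_per_flop(LINPACK_FLOPS, SYSTEM_POWER_W)
gap = energy_pj / TARGET_PJ_PER_FLOP

print(f"per core:   {per_core / 1e9:.2f} gigaflop/s")
print(f"efficiency: {efficiency / 1e9:.2f} gigaflop/s per watt")
print(f"energy:     {energy_pj:.0f} picojoules per flop")
print(f"gap to {TARGET_PJ_PER_FLOP:.0f} pJ/flop: about {gap:.1f}x")
```

Because a watt is a joule per second, flop/s per watt and flops per joule are the same number, which is why the two can be quoted interchangeably.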

At this point, some readers might be thinking that some high-end graphics cards at, say, 200 watts and 2 teraflop/s look really close to that 100 picojoules per flop. Remember, though, that the speed has to be measured on the Linpack benchmark, and the power is for the whole system and not just the card. Start adding in memory, disk drives and power supplies and it adds up quickly, which makes the IBM system all the more impressive. [Interested readers can check out the Green500 list to see where the graphics-card-based systems start to come into play.]
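To make the card-versus-system point concrete, here is a rough Python sketch. The 200 watts and 2 teraflop/s come from the paragraph above. The Linpack fraction and the host power per card are made-up illustration values, not measurements of any real system, so treat the second result as a shape of the argument rather than a data point.

```python
def picojoules_per_flop(watts, flops_per_second):
    """Energy per floating-point operation, in picojoules."""
    return watts / flops_per_second * 1e12


# From the text: a hypothetical high-end graphics card.
CARD_WATTS = 200.0
CARD_PEAK_FLOPS = 2e12

# ASSUMED values for illustration only (not from the article or any benchmark):
LINPACK_FRACTION = 0.6       # share of peak actually sustained on Linpack
HOST_WATTS_PER_CARD = 150.0  # CPU, memory, disks, power-supply losses per card

card_only = picojoules_per_flop(CARD_WATTS, CARD_PEAK_FLOPS)
whole_system = picojoules_per_flop(
    CARD_WATTS + HOST_WATTS_PER_CARD,
    CARD_PEAK_FLOPS * LINPACK_FRACTION,
)

print(f"card alone, peak rate:      {card_only:.0f} pJ/flop")
print(f"whole system, Linpack rate: {whole_system:.0f} pJ/flop")
```

With these assumed numbers the per-flop energy nearly triples once the rest of the system and a realistic benchmark rate are counted, which is the effect described above.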

So what’s new about the Power7+? First, it’s built on IBM’s 32nm copper SOI process with 13 levels of metal and IBM’s eDRAM. The chip contains 80MB of shared L3 eDRAM, which undoubtedly helps from an energy-efficiency standpoint by keeping more data closer to the computational units. The chip is 567mm² and has 8 processor cores (shown below in Figure 1) with 4-way simultaneous multithreading (SMT) and 256KB of L2 cache per core. The chip also contains new hardware accelerators and enhanced power management while maintaining binary compatibility with the previous Power6 and Power7 generations. It has 2.1B transistors, but if eDRAM weren’t used for all of that L3 cache, it would instead take about 5.4B transistors.

Figure 1. IBM Power7+ Processor Chip (source: IBM)
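The 2.1B versus 5.4B transistor comparison can be sanity-checked with a back-of-the-envelope sketch. I am assuming a conventional six-transistor SRAM cell versus a one-transistor eDRAM cell, 1MB = 2^20 bytes, and no allowance for tags, ECC or peripheral logic. Those assumptions are mine and not from IBM’s presentation.

```python
L3_MEGABYTES = 80                        # from the text
L3_BITS = L3_MEGABYTES * 2**20 * 8       # assumes 1MB = 2**20 bytes

# ASSUMED cell sizes (standard textbook values, not quoted by IBM here):
SRAM_TRANSISTORS_PER_BIT = 6             # 6T SRAM cell
EDRAM_TRANSISTORS_PER_BIT = 1            # 1T1C eDRAM cell

sram_transistors = L3_BITS * SRAM_TRANSISTORS_PER_BIT
edram_transistors = L3_BITS * EDRAM_TRANSISTORS_PER_BIT
estimated_savings = sram_transistors - edram_transistors

# Difference implied by the two chip totals quoted in the article.
article_savings = 5.4e9 - 2.1e9

print(f"L3 as SRAM:        {sram_transistors / 1e9:.2f}B transistors")
print(f"L3 as eDRAM:       {edram_transistors / 1e9:.2f}B transistors")
print(f"estimated savings: {estimated_savings / 1e9:.2f}B")
print(f"article implies:   {article_savings / 1e9:.2f}B")
```

Under those assumptions the estimate lands within a couple of percent of the roughly 3.3B difference implied by the quoted totals.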

Figure 2 below shows a picture of the core “chiplet” that’s outlined in white in Figure 1. It also shows the inclusion of critical path monitors and buffers that are used to perform “real-time chip guardbanding.” As we’ve seen in past articles, processors are often more energy efficient when running at lower voltages, but as the voltage is reduced, variability becomes more pronounced. If designers have to statically guardband for worst-case possibilities, then the price paid is higher voltage levels and less energy-efficient chips. Given that versions of the 8-core Power7 parts were clocked at 4GHz, we might be able to infer something about the voltage level of the Sequoia BlueGene/Q parts, which are clocked at only 1.6GHz. IBM claims up to a 25% frequency gain for Power7+ due to mapping into the 32nm technology and power-management improvements. Presumably, that frequency gain could be turned into an energy-efficiency gain by running at lower voltages, as long as the advantage isn’t guardbanded away. It looks like IBM has taken steps to make sure it’s getting the most out of the energy pumped into its processors.

Figure 2. IBM Power7+ Processor Core Chiplet (source: IBM)
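To see why shaving guardband voltage matters, recall the usual first-order CMOS approximation that dynamic energy per operation scales with the square of the supply voltage. The sketch below applies that rule to two voltages that I picked purely for illustration. They are not Power7+ operating points, and leakage is ignored.

```python
def relative_dynamic_energy(v_new, v_old):
    """Dynamic energy per operation at v_new relative to v_old.

    Uses the first-order CMOS model E ~ C * V**2, with switched
    capacitance held constant and leakage ignored.
    """
    return (v_new / v_old) ** 2


# HYPOTHETICAL supply voltages, chosen only to illustrate the scaling:
V_STATIC_GUARDBAND = 1.10    # volts, worst-case margin baked in
V_REALTIME_GUARDBAND = 1.00  # volts, margin tracked at run time

ratio = relative_dynamic_energy(V_REALTIME_GUARDBAND, V_STATIC_GUARDBAND)
savings_percent = (1.0 - ratio) * 100.0

print(f"energy per operation: {ratio:.3f} of the static-guardband case")
print(f"dynamic energy saved: {savings_percent:.1f}%")
```

Under this simple model, a voltage reduction of about 9% is worth roughly 17% in dynamic energy per operation, which is why margin that would otherwise be guardbanded away is so valuable.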

IBM has implemented a digital PLL (DPLL) per core and aggregates the readings of the critical path monitors (CPMs) across the core to determine how much margin is present. Figure 3 below shows a diagram detailing the different types of variation and how much margin is reclaimed using the CPM scheme.
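As a purely conceptual illustration of that idea, and not a description of IBM’s actual control logic, the toy Python sketch below takes the worst-case reading from a set of monitors and nudges a per-core clock up or down to hold a target margin. The margin units, target, step size and frequency limits are all hypothetical.

```python
def aggregate_margin(cpm_readings):
    """Worst-case (smallest) timing margin across all monitors in a core."""
    return min(cpm_readings)


def next_frequency_mhz(freq_mhz, cpm_readings, target_margin=2,
                       step_mhz=25, f_min_mhz=3000, f_max_mhz=4400):
    """One step of a toy per-core clock controller (all values hypothetical).

    Too little margin: slow down. Spare margin: speed up.
    The result is clamped to the allowed frequency range.
    """
    margin = aggregate_margin(cpm_readings)
    if margin < target_margin:
        freq_mhz -= step_mhz
    elif margin > target_margin:
        freq_mhz += step_mhz
    return max(f_min_mhz, min(f_max_mhz, freq_mhz))


# Made-up monitor readings, in arbitrary margin units, over four intervals.
freq = 4000
for readings in ([5, 4, 6], [4, 3, 5], [3, 2, 4], [2, 1, 3]):
    freq = next_frequency_mhz(freq, readings)
    print(f"worst margin {aggregate_margin(readings)} -> {freq} MHz")
```

The same spare margin could instead be spent on lowering the supply voltage at a fixed frequency, which is the energy-efficiency angle this article cares about.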

Figure 3. Advantage in Real-Time Guardbanding (source: IBM)

The diagram shows the advantage of the IBM implementation: margin that static guardbanding would give away is reclaimed in real time. Moving to smaller nodes should help improve the energy per operation, and stacking memory on top of the processor dies could offer further energy benefits. It looks like IBM will have competitive parts as it attempts to retain its current number one ranking on the supercomputing list, and there’s still plenty more to do to get to that 1 exaflop/s mark. Energy efficiency will be a key factor in getting there.

—Barry Pangrle is an independent power architect in Silicon Valley.