Change in focus highlights fundamental problems in processor design at advanced process nodes; workarounds and innovation are key.
By Pallab Chatterjee
The game has changed for processors. The goal now is data throughput, not higher gigahertz and more watts.
That shift dominated the presentations at the Hot Chips conference this week. In previous years, the theme was higher single-core performance, more power and smaller process geometries. This year it was all about multicore and multi-power options as the realities of process technology and economic viability, along with the need to keep higher power density designs from melting the silicon and package, begin taking hold.
The new server platforms in the enterprise arena are targeting multicore, multithreaded processors with a high-speed multi-chip data bus. AMD’s Opteron processor family, for example, now includes the new “Magny-Cours.” This is a two-die MCM configuration of the 45nm SOI six-core Istanbul processor using the Greyhound architecture. It pairs 64-bit cores with four 6.4GT/s HyperTransport links and a new low-power DDR3 memory interface.
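For a rough sense of what those link rates mean, the sketch below converts a transfer rate into peak one-direction bandwidth. The rate and link count come from the figures above; the 16-bit link width is an assumption that the presentation summary does not state, and the function name is purely illustrative.

```python
def link_bandwidth_gb_per_s(transfers_gt_per_s, link_width_bits):
    """Peak one-direction bandwidth in GB/s for a parallel link.

    Each transfer moves `link_width_bits` bits, so bytes per second is
    transfers per second times width in bits, divided by 8.
    """
    return transfers_gt_per_s * link_width_bits / 8


LINK_RATE_GT_S = 6.4      # per-link transfer rate quoted in the article
NUM_LINKS = 4             # number of links quoted in the article
ASSUMED_WIDTH_BITS = 16   # ASSUMPTION: link width is not given in the article

per_link = link_bandwidth_gb_per_s(LINK_RATE_GT_S, ASSUMED_WIDTH_BITS)
aggregate = per_link * NUM_LINKS

print(f"Per link, one direction: {per_link:.1f} GB/s")
print(f"All {NUM_LINKS} links, one direction: {aggregate:.1f} GB/s")
```

These are theoretical peaks under the stated width assumption; protocol overhead and narrower link configurations would lower the usable figure.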
The core architecture of the die (see figure 1) includes an L2 cache for each of the six cores, a common L3 cache and a centralized DDR memory controller interface.
The finalized two-die configuration for the MCM (see figure 2) is a new processor optimized for the enterprise by supporting a 2P/4P configuration in a standard rack unit. It can support high-speed DRAM access without needing to probe the caches of the other cores for their status.
Intel highlighted the architecture of the Nehalem-EX enterprise processor. This is a 45nm eight-core processor with four QPI channels and a DDR interface. Like the AMD part, the Nehalem-EX is aimed at minimizing latency and maximizing data throughput without raising the clock rate. The die photo (see figure 3) shows the cache distributed among the eight cores, and a common “uncore” area. The chip is flanked by the QPI and SMI interfaces.
To reduce memory latency, there are two DDR3 channels, a ring protocol for the distributed last-level cache (LLC), and a scheduler that can support 32 simultaneous requests (see figure 4).
The QPI interface allows high-speed communication between dies and can result in a server module with very high thread and core counts (see figure 5). This configuration contains eight sockets and four I/O hub support chips.
TI presented an OMAP processor, a multimedia core design that features distributed processing and high-speed memory interfaces. For multicore processing, it is a large structure with both analog and digital modules. The chip architecture is optimized for standard-function, non-graphic content creation tasks. It features two Cortex-A9 cores, two Cortex-M3 task processors, and a general-purpose 64x-Lite DSP core. To support the processor cores and the multimedia (audio and video) functions, the chip includes a power management and scheduling function that makes sure unused sections are powered down.
Nvidia presented a keynote about the new uses of the GPU as a general-purpose data processing engine. The company presented a high-performance platform, and in later sessions executives presented a minimum-configuration engine and a full PC feature set IC called the ION. The ION is available now as an option for small-format embedded PCs and as a companion chip to the Intel Atom processor. The ION features both hardware and software power management control. This includes having the software turn parts of the core block on and off without being tied to a fixed, clock-time-based application. In this scenario, the power conditioning is data-based, rather than based on a fixed task.
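The difference between data-based and fixed-schedule power control can be sketched with a toy model. Everything below is hypothetical and for illustration only: the `GatedBlock` class, the block names and the step loop are assumptions, not a description of how the ION or its driver software is actually implemented.

```python
from collections import deque


class GatedBlock:
    """Hypothetical functional block whose power follows pending data."""

    def __init__(self, name):
        self.name = name
        self.queue = deque()   # work items waiting for this block
        self.powered = False

    def submit(self, item):
        self.queue.append(item)

    def step(self):
        # Data-based gating: the block is on only while data is pending,
        # rather than for a fixed time slice tied to a task.
        self.powered = bool(self.queue)
        if self.powered:
            return self.queue.popleft()
        return None


def run(blocks, steps):
    """Step every block and record which ones were powered at each step."""
    trace = []
    for _ in range(steps):
        for block in blocks:
            block.step()
        trace.append({block.name: block.powered for block in blocks})
    return trace


video = GatedBlock("video")
audio = GatedBlock("audio")
video.submit("frame0")
video.submit("frame1")
audio.submit("sample0")

trace = run([video, audio], steps=3)
for i, state in enumerate(trace):
    print(i, state)
```

In this model the audio block drops out after one step and the video block after two, because each block's power state depends only on whether it still has data to process.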
Day one continued with additional processor applications and more details on the Intel design, including multicore processor architectures. The common trend in these designs with embedded cache is the focus on switched and gated power for unused blocks.
The highlight for day two of the conference was the introduction of the IBM Power7 processor, a monolithic 45nm SOI eDRAM process design with eight cores and 32 threads per chip. The design features distributed L2 caches with a centralized L3 cache and a common DDR3 interface. Rather than the QPI interface, it has its own instruction pre-fetch interface for high-bandwidth communication. The design is targeted at standard single-die applications and quad-chip MCMs for compute-intensive datacenters. The hardware is architected to support up to 32 of the chips (eight quad-chip MCMs) in a single configuration and memory structure.
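The scale of that maximum configuration follows directly from the per-chip figures. The short calculation below uses only the numbers quoted above; the variable names are illustrative.

```python
CORES_PER_CHIP = 8      # from the article
THREADS_PER_CHIP = 32   # from the article
CHIPS_PER_MCM = 4       # quad-chip MCM
MAX_CHIPS = 32          # largest single configuration described

threads_per_core = THREADS_PER_CHIP // CORES_PER_CHIP
mcm_count = MAX_CHIPS // CHIPS_PER_MCM
total_cores = MAX_CHIPS * CORES_PER_CHIP
total_threads = MAX_CHIPS * THREADS_PER_CHIP

print(f"Threads per core: {threads_per_core}")
print(f"MCMs in the largest configuration: {mcm_count}")
print(f"Total cores: {total_cores}")
print(f"Total hardware threads: {total_threads}")
```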
All of the processors and chips discussed showed off their low-power characteristics. These included the standard, synthesizable low-power solutions: multiple synchronous clock paths derived from a master clock, gated power, multimode and power-down features under software control, and a reduced operating voltage on the core relative to the I/O. No truly innovative custom power-handling solutions were built into the architectures, just widespread use of the known solutions of the last few years. The trend appears to be multicore and multi-die data passing as the first goal; once this is figured out, more extensive power handling will be addressed. The GPU as a compute engine works on a different power/performance curve than the standard multicore engines, and its power solutions will improve along different paths as products move forward.