Inside The Hybrid Memory Cube

A look at how to break through the memory bandwidth wall.


The memory bandwidth requirements of today’s high-performance computing and next-generation networking applications have grown beyond what conventional memory architectures can provide. For example, in a typical 400G networking application, the packet buffer bandwidth requirement can be as high as 2,000 Gb/s.

Achieving this level of bandwidth with the latest DDR4 memory technology would require at least 16 72-bit DIMMs (DDR4-2400) and more than 2,000 pins. The high pin count and large form factor make this solution unattractive and possibly impractical.
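The DIMM count can be sanity-checked with a few lines of arithmetic. The sketch below assumes that 64 of the 72 bits on each DIMM carry data (the remaining 8 being ECC) and that the bus runs at its peak transfer rate; neither assumption is stated in the text above, so treat the result as a rough check rather than a design figure.

```python
import math

# Assumptions (not from the article): 64 of the 72 bits carry data, the other
# 8 are ECC, and every transfer is useful (100% bus efficiency).
TRANSFERS_PER_SEC = 2400e6   # DDR4-2400: 2,400 mega-transfers per second
DATA_BITS_PER_DIMM = 64      # data bits on a 72-bit ECC DIMM (assumed)
REQUIRED_GBPS = 2000         # packet buffer requirement quoted in the text

peak_gbps_per_dimm = TRANSFERS_PER_SEC * DATA_BITS_PER_DIMM / 1e9
min_dimms_at_peak = math.ceil(REQUIRED_GBPS / peak_gbps_per_dimm)
efficiency_needed_with_16 = REQUIRED_GBPS / (16 * peak_gbps_per_dimm)

print(f"Peak data bandwidth per DIMM: {peak_gbps_per_dimm:.1f} Gb/s")
print(f"DIMMs needed at 100% efficiency: {min_dimms_at_peak}")
print(f"Bus efficiency needed with 16 DIMMs: {efficiency_needed_with_16:.0%}")
```

Under these assumptions, even a perfectly utilized bus needs 14 DIMMs, and 16 DIMMs must sustain roughly four fifths of their peak rate, which is consistent with the "at least 16" figure.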

Current DDRx systems add memory modules to increase capacity and bandwidth, but a wide parallel data bus aggravates crosstalk and other signal integrity problems. Overcoming them requires running the system at lower frequencies, with a resulting loss of bandwidth. DDRx technology therefore cannot be scaled to deliver both the high bandwidth and the high capacity required by today’s high-end applications.

Breaking through this “memory wall” problem requires a new architecture that can provide increased density and bandwidth at significantly reduced power consumption. The revolutionary Hybrid Memory Cube (HMC) architecture provides solutions for this problem.

A single HMC device, operating four HMC links at a SerDes speed of 30 Gb/s, is enough to solve the 400G packet buffer problem described above. This solution provides up to 2,560 Gb/s of memory access bandwidth with only 276 pins.
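To put the pin counts in perspective, the figures quoted in this article can be turned into a bandwidth-per-pin comparison. The sketch below only divides the numbers already given; it treats "more than 2,000 pins" as exactly 2,000, which flatters the DDR4 case, so the ratio is a rough lower bound.

```python
# Figures quoted in the article; only the divisions are added here.
ddr4_gbps, ddr4_pins = 2000, 2000   # "more than 2,000 pins" taken as 2,000
hmc_gbps, hmc_pins = 2560, 276

ddr4_per_pin = ddr4_gbps / ddr4_pins
hmc_per_pin = hmc_gbps / hmc_pins
advantage = hmc_per_pin / ddr4_per_pin

print(f"DDR4: {ddr4_per_pin:.2f} Gb/s per pin")
print(f"HMC:  {hmc_per_pin:.2f} Gb/s per pin")
print(f"HMC delivers about {advantage:.1f}x the bandwidth per pin")
```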

An HMC is a single package containing either four or eight DRAM dies and one logic die, stacked together using through-silicon via (TSV) technology. Stacking several DRAM dies on top of each other increases memory density without any additional area penalty. Portions of each memory die are combined to form a memory vault, and each vault is controlled by a vault controller implemented in the logic die. TSV technology for 3D die stacking also helps increase bandwidth, because TSVs can run at much higher frequencies and thereby provide high data transfer rates to and from the DRAM dies.

Each HMC consists of 16 or 32 vaults, with each vault built from a DRAM stack four to eight dies high. Each vault can provide a maximum bandwidth of 10 GB/s, which gives 320 GB/s for a 32-vault memory architecture. Because the vaults work independently, a high level of memory access parallelism is achieved, comparable to multiple DDRx systems running in parallel.
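The aggregate figure follows directly from the per-vault number. The short sketch below multiplies it out for both vault counts and converts bytes to bits; the function name is hypothetical and exists only for illustration.

```python
VAULT_GBYTES_PER_SEC = 10   # maximum per-vault bandwidth quoted in the text

def aggregate_bandwidth(vaults: int) -> tuple[int, int]:
    """Return (GB/s, Gb/s) for the given number of independent vaults."""
    gbytes = vaults * VAULT_GBYTES_PER_SEC
    return gbytes, gbytes * 8

for vaults in (16, 32):
    gbytes, gbits = aggregate_bandwidth(vaults)
    print(f"{vaults} vaults: {gbytes} GB/s = {gbits} Gb/s")
```

For 32 vaults the result is 2,560 Gb/s, the same "up to" figure quoted for a single HMC device in the 400G packet buffer example.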

To exploit this architecture and achieve high bandwidth, system designers should take advantage of the HMC’s user-defined address-mapping scheme. For instance, if the requested address stream is sequential in the low-order address bits, system designers can use one of the default address map modes offered in the HMC. This mode maps the vault address to the less significant address bits, with the bank address just above the vault address. The mapping is referred to as “low interleave”: it forces sequential addresses to be spread first across different vaults and then across different banks within a vault, thus providing memory access parallelism. Open-Silicon’s HMC Controller lets system designers use the flexible addressing schemes of the HMC protocol.
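A low-interleave mapping of this kind can be sketched as a simple bit-field decode. The field widths below (a 32-byte block offset, 32 vaults, 8 banks per vault) and the function name are hypothetical choices made for illustration; the actual widths depend on the device configuration and the selected address map mode.

```python
# Hypothetical field widths for illustration only; real widths depend on the
# HMC configuration and the address map mode in use.
BLOCK_OFFSET_BITS = 5   # 32-byte block (assumed)
VAULT_BITS = 5          # 32 vaults
BANK_BITS = 3           # 8 banks per vault (assumed)

def decode_low_interleave(addr: int) -> dict:
    """Split an address into vault, bank, and remaining DRAM address bits."""
    vault = (addr >> BLOCK_OFFSET_BITS) & ((1 << VAULT_BITS) - 1)
    bank = (addr >> (BLOCK_OFFSET_BITS + VAULT_BITS)) & ((1 << BANK_BITS) - 1)
    dram = addr >> (BLOCK_OFFSET_BITS + VAULT_BITS + BANK_BITS)
    return {"vault": vault, "bank": bank, "dram": dram}

# Walk a sequential stream of 32-byte blocks and watch where each one lands.
for block in (0, 1, 2, 31, 32, 33):
    fields = decode_low_interleave(block * 32)
    print(f"block {block:>2}: vault {fields['vault']:>2}, bank {fields['bank']}")
```

With these assumed widths, consecutive blocks rotate through all 32 vaults before the bank number advances, so a sequential stream keeps every vault busy.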

Each HMC vault includes a memory controller for the DRAM stack associated with it. Because the DRAM controller functionality is kept inside the HMC, host-side system designers need not worry about DRAM access timings and refresh rates. This also allows memory vendors to improve the access timings between the DRAM controller and the DRAM dies without affecting the host HMC controller architecture. With previous DDRx technologies, by contrast, any change to a memory timing parameter would require a change in the memory controller design.

The host HMC controller uses a packetized protocol over high-speed serial links to access the HMC memory. This protocol is highly optimized to maximize user data bandwidth. For example, unlike other serial protocols, it requires no additional metadata to maintain link-level synchronization, which is achieved through a one-time link-level training at initialization. The Open-Silicon HMC controller provides a seamless, low-latency implementation of the HMC link protocol, helping to extract maximum bandwidth from the HMC.
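The effect of packet overhead on user data bandwidth can be illustrated with a small calculation. The sketch below assumes a 16-byte flow unit and one flow unit of combined header and tail overhead per data-carrying packet; these figures are not given in this article and should be checked against the HMC specification before being relied on. The function name is likewise hypothetical.

```python
FLIT_BYTES = 16       # assumed flow-unit size, not stated in the article
OVERHEAD_FLITS = 1    # assumed header-plus-tail overhead per packet

def payload_efficiency(payload_bytes: int) -> float:
    """Fraction of a data-carrying packet's bytes that are user data."""
    data_flits = -(-payload_bytes // FLIT_BYTES)   # ceiling division
    total_bytes = (data_flits + OVERHEAD_FLITS) * FLIT_BYTES
    return payload_bytes / total_bytes

for size in (16, 32, 64, 128):
    print(f"{size:>3}-byte payload: {payload_efficiency(size):.1%} user data")
```

Under these assumptions, larger payloads amortize the fixed per-packet overhead better, so request size is one of the levers a designer has for approaching the peak figures.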

Systems using HMC should use source-synchronous clocking between the host and the HMC. Open-Silicon’s HMC controller exploits this to achieve a low-latency architecture.