Bigger Pipes, New Priorities

Micron architects talk about the benefits and challenges of stacked die, and how the packaging approach will change the fundamentals of design.


By Ann Steffora Mutschler
From the impact of stacking on memory subsystems to advances in computing architecture, Micron Technology is at the forefront of the memory industry. System-Level Design sat down with Scott Graham, general manager for Micron’s Hybrid Memory Cube (HMC), and Joe Jeddeloh, whose team developed the logic portion of the HMC, to discuss the challenges that plague memory subsystem architects, as well as some possible solutions. What follows are excerpts from that discussion.

SLD: As the industry moves to employ stacking techniques, what are some of the overall impacts on the memory subsystem?
Jeddeloh: The TSV itself enables a very low-power interconnect, and as we move up and down in the Z direction, one of the objectives, to take the greatest advantage of that, is to try not to move in the X and Y directions very much. For us, we create a DRAM architecture that is tiled. Instead of a DRAM die being one large device that has one set of I/Os on it, we break it into, say, 16 separate DRAMs, in essence much like a multicore processor. Each of those DRAMs has its own interface, so when you go to access data, you go to a very local area of DRAM. Instead of lighting up a really large DRAM array or page or row, your cache line comes from a more localized area. At the start, it’s a more efficient access from the DRAM itself because we’re not moving bits very far in an X, Y direction. We’re not lighting up an extra number of transistors and capacitors. It’s a more directed access.

Then we take that access and, because it is coming from a tile or partition that is, say, 1/16th the size of a normal DRAM, we move it down the Z direction on a TSV in a very localized access. There are two big themes. We lit up fewer transistors and we moved the data a shorter distance, so it becomes a very efficient transfer of that data down that Z pipe.
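The tiling Jeddeloh describes can be pictured as an address being split into a partition, a bank within that partition, and a row. The sketch below is only an illustration under stated assumptions: the 16 partitions come from his "say, 16" figure, while the 64-byte cache line, the 8 banks per partition, the interleaving order and the names `Location` and `locate` are hypothetical and are not taken from any Micron specification.

```python
from typing import NamedTuple

CACHE_LINE_BYTES = 64     # assumption: 64-byte cache line
NUM_PARTITIONS = 16       # "say, 16 separate DRAMs" per the interview
BANKS_PER_PARTITION = 8   # assumption, chosen only for illustration


class Location(NamedTuple):
    partition: int  # which tile (and its own interface) serves the access
    bank: int       # bank inside that tile
    row: int        # row inside that bank


def locate(address: int) -> Location:
    """Map a byte address to a (partition, bank, row) triple.

    Hypothetical interleaving: consecutive cache lines rotate across
    partitions first, then across banks, so each access touches only
    one small, local region of the die.
    """
    line = address // CACHE_LINE_BYTES
    partition = line % NUM_PARTITIONS
    line //= NUM_PARTITIONS
    bank = line % BANKS_PER_PARTITION
    row = line // BANKS_PER_PARTITION
    return Location(partition, bank, row)


if __name__ == "__main__":
    for addr in (0, 64, 1024, 8192):
        print(addr, locate(addr))
```

With this mapping, neighboring cache lines land in different partitions, so independent requests can be served by different tiles at once. A real device may order the fields differently.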

SLD: What does that do to cache? Can you get rid of some levels of cache?
Jeddeloh: Not necessarily. That will be a conversation on the longer-range implications. Cache hierarchies are created because of not only latencies but bandwidth deficiencies of going to a memory subsystem. When you can have thousands of these TSVs with a very low power profile you can get a tremendous amount of bandwidth now moving in a cube. The cache structures that we rely on today…you think of them differently when there is so much bandwidth that is potentially available so they can begin to be rethought.

SLD: Is Micron supporting Wide I/O?
Graham: Which version? There is a low-power Wide I/O and then there is a Wide I/O derivative that basically spawned from that group that is in the JEDEC task group right now. It is being explored and they are actually calling it high-bandwidth memory, but it is essentially a Wide I/O effort. Micron is actively participating in both of those.

SLD: What is the impact of stacking on performance of memory subsystems?
Jeddeloh: Two aspects to that. One is just raw bandwidth. You can put so many TSVs into that stack that you can generate more than an order-of-magnitude increase in bandwidth. The other is tiling, or the partitioning into concurrent resources. In a traditional memory subsystem, there’s always a bank conflict that has to be managed and there are a limited number of banks. When you think of a DIMM, maybe it has 4, 8, 16 banks in it, and that fundamental access to that bank isn’t going to come down latency-wise a great deal. That just can’t be pushed.

The physics can’t push it that far unless you go to something like an RLDRAM and pay a big die cost to get that. In a traditional DRAM, the banking concept is still going to be there, but once you go into a memory cube where you have these tiles and partitions, each of those has its own bank structure. So instead of 8 banks, you have 128 banks, 256 banks and each of these are put into parallel DRAM structures so you have a tremendous amount of concurrency available. You can think of a many-core processor coming at a many tiled memory system that marries up and can handle a lot of concurrent transactions.
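One rough way to see why more banks mean more concurrency is to ask how often a handful of simultaneous requests collide on the same bank. The sketch below is a simplified model, not a description of how the HMC schedules traffic: it assumes requests are independent and uniformly spread over banks, and the function name `conflict_probability` is hypothetical.

```python
def conflict_probability(num_banks: int, in_flight: int) -> float:
    """Probability that at least two of `in_flight` requests hit the same bank.

    Assumes each request picks a bank independently and uniformly at
    random (the classic birthday-problem calculation).
    """
    if in_flight <= 1:
        return 0.0
    if in_flight > num_banks:
        return 1.0  # pigeonhole: some bank must receive two requests
    no_conflict = 1.0
    for i in range(in_flight):
        no_conflict *= (num_banks - i) / num_banks
    return 1.0 - no_conflict


if __name__ == "__main__":
    # Bank counts taken from the interview: 8 for a traditional part,
    # 128 or 256 for a tiled cube. Four requests in flight is an
    # arbitrary illustrative choice.
    for banks in (8, 128, 256):
        print(banks, round(conflict_probability(banks, 4), 4))
```

Under these assumptions, the chance of a collision among a few in-flight requests falls sharply as the bank count rises from 8 to 128 or 256, which is the concurrency benefit described above. Real access patterns are not uniform, so actual conflict rates will differ.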

SLD: How do architects make the system design trade-offs in terms of memory subsystems?
Jeddeloh: One is the processor needs to get bandwidth onto it. As we go to more and more cores, it’s becoming more and more bandwidth-hungry. In this generation, you can’t stack the DRAM on top of the processor because the processor is too hot. That means the processor has to go off chip to get that bandwidth, and that just becomes an I/O pin, power and silicon density issue. You need to connect a pipe to that processor that can bring in as much bandwidth as possible at the lowest amount of power for that investment, and with something like a memory cube you can put more density in a very local area and put that right next to the processor. When you think about that topology, then once again we are reducing distances such that we can create a power advantage. And power is really the No. 1 theme going forward. Once we reduce that power, we can create a smaller, more efficient I/O structure when the processor and the memory system are right next to each other.

As an architect, you start thinking about this not only from a logical perspective but the physical perspective—being able to stack these 3D structures in a very localized area and create a very dense, low-power interconnect. This is also going to mean new materials, perhaps, like MCMs, silicon interposers, but not the traditional 10 inches of FR-4 because it just consumes way too much power to ship the bits over that structure.

As a memory architect, you start thinking about this from many different facets. You have the materials, the locality, and then of course you have a thermal issue. As we move close to that processor, we start looking into the processor’s thermal solutions space.

And, then you start thinking about the concurrency of having resources like this. If you have, say, 8 cubes hooked up to a processor, there’s a tremendous amount of bandwidth and concurrency that can happen in a very small area.
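The bandwidth-for-power trade-off in this answer reduces to simple arithmetic: interconnect power is the bit rate multiplied by the energy spent moving each bit. The sketch below shows only that relationship. The function name `io_power_watts` is hypothetical, and the energy-per-bit values in the example are placeholders chosen to illustrate the calculation, not measurements of FR-4 traces, interposers or any Micron part.

```python
def io_power_watts(bandwidth_gbytes_per_s: float, picojoules_per_bit: float) -> float:
    """Power needed to move data at a given rate and energy cost per bit.

    power [W] = (bytes/s * 8 bits/byte) * (joules/bit)
    """
    bits_per_second = bandwidth_gbytes_per_s * 1e9 * 8
    return bits_per_second * picojoules_per_bit * 1e-12


if __name__ == "__main__":
    # Placeholder numbers only: the same bandwidth over a link that
    # costs more energy per bit versus one that costs less.
    bandwidth = 100.0  # GB/s, hypothetical
    for label, pj_per_bit in (("longer link", 10.0), ("shorter link", 2.0)):
        print(label, io_power_watts(bandwidth, pj_per_bit), "W")
```

Because power scales linearly with energy per bit, any reduction in the distance and capacitance each bit has to drive translates directly into either lower power or more bandwidth for the same power budget.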

SLD: How do you deal with the heat issue in the HMC?
Jeddeloh: In many of the early instances, we’re going to be part of the processor’s cooling complex. We put a top or lid on it. DRAM doesn’t like heat. It messes up the refresh. If we are not on top of the processor, the heat is manageable. Once you create that low power I/O, which is enabled by changing the locality of the overall system topology—and we’re not creating as much power within the cube itself—then we stack it up and pull the heat out the top.

Source: Micron Technology

SLD: How far out is the HMC from being able to be used in production designs?
Graham: Our plan of record is for production to begin in the second half of 2013.

SLD: What trends do you see happening in computing architecture?
Jeddeloh: It used to be megahertz. Then it was multicore. We believe that the trend is low-power memory bandwidth. That prime piece of real estate is what sits down at the bottom of those TSVs. That’s a very, very valuable piece of real estate and there’s going to be a struggle over how that value proposition is going to be established.