
What Happened To Execute-in-Place?

The concept as it was originally conceived no longer applies. Here’s why.


Executing code directly from non-volatile memory, where it is stored, greatly simplifies compute architectures — especially for simple embedded devices like microcontrollers (MCUs). However, the divergence of memory and logic processes has made that nearly impossible today.

The term “execute-in-place,” or “XIP,” originated with the embedded NOR memory in MCUs that made XIP viable. That term is still used, but its meaning has become vague and confused.

“There’s definitely this concept of execute-in-place,” said Mark Reiten, vice president of license marketing for silicon storage technology at Microchip. “Today it’s really more like executing from a caching scheme that allows you to page stuff in rapidly from an external flash. Otherwise, you can’t really operate at a high enough clock speed.”

Memory processes have not been able to keep pace with logic processes. Processor performance scales with each new generation, but non-volatile memories can no longer deliver instructions at the rate processors consume them. Strategies for buffering or caching instructions, once common only in high-speed processors, are becoming prevalent at all levels of embedded processing. And despite all of the options, there are questions about whether XIP is still a meaningful notion.

NOR on-chip
While servers and desktop computers historically have relied on hard drives (or solid-state drives) for storing code, embedded systems have looked to flash memory – specifically, NOR flash. In earlier times, and even today in lower-performance MCUs, embedding NOR flash meant that, at least for boot code, instructions were delivered directly to the processor, with no need for buffering or caching. “Anything greater than 45nm still uses NOR internally,” said Jim Handy, memory analyst at Objective Analysis.


Fig. 1: NOR flash structure and layout. Source: Wikipedia

That makes NOR flash much more attractive than NAND flash for code storage. NOR is randomly accessible at the word level, while NAND must retrieve a full page at a time – a poor fit for instruction streams, which branches and function calls frequently redirect. When embedding flash, MCUs therefore have used NOR flash. As long as the operating frequency is slow enough for the flash access times to keep up, instructions can be fetched and executed directly out of the flash – otherwise known as executing in place. The original idea was that you didn’t need to buffer the instructions somewhere else before executing them.


Fig. 2: NAND flash structure and layout. Source: Wikipedia

Embedded NOR memory is still alive and well, although today’s higher volumes are on nodes far from what would be considered leading-edge for logic. “There’s a lot of volume on 55 and 65nm,” said Reiten. “My peak royalty node is still 180nm, but 250, 350 and 500nm are still delivering. 90nm was kind of a weird node. TSMC did it, but a lot of my other partners jumped to 55/65. But we have a nice volume at 90, and 55/65 is going to eclipse it soon. 40 is starting to ramp.”

Capacity for embedded NOR is also less than what is available in external NOR or NAND. But there’s a range of capacities that embedded NOR can satisfy. “The bigger memory implementations we see are 64 Mb,” said Reiten. “There are 128 Mb embedded flash designs with tens of megabits of SRAM, as well.”

With access times of around 8ns or less, the more aggressive embedded NOR implementations can support clock speeds of around 120MHz, according to Reiten. That’s not bleeding-edge, but it is a viable frequency for some designs.
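As a rough sanity check on that figure, the relationship is simple arithmetic: with single-cycle fetches and no wait states, the clock period can be no shorter than the flash access time. A small sketch (the 8ns figure is from above; the single-cycle, zero-wait-state assumption is ours):

```python
# Maximum clock for direct, single-cycle, zero-wait-state fetch from flash.
access_time_ns = 8                     # aggressive embedded NOR, per the article

# The clock period cannot be shorter than the flash access time.
max_clock_mhz = 1e3 / access_time_ns   # 1 / 8ns = 125 MHz
print(max_clock_mhz)                   # 125.0, in line with the ~120MHz cited
```

Real parts add margin for address setup and data capture, which is why the practical figure lands a bit below the theoretical maximum.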

Reiten also pointed out that, when it comes to security and safety, more integration has value. “For safety, having everything on the die is a much better approach because you have full control,” he said. “You don’t have to worry about package-level qualifications.”

NOR flash falling behind
But embedded NOR flash must maintain compatibility with the CMOS logic process from which the MCU is built, and that becomes a limiting factor. “Embedded flash is typically limited to about 40nm,” said Gideon Intrater, CTO at Adesto. “There’s some early availability of 28nm, but it’s very expensive.”

Microchip is trying to continue scaling, however. “We have an integration scheme proposed for finFET,” said Reiten. Even so, as embedded NOR has been limited by capacity constraints and process evolution barriers, many designers have moved their code into external NOR flash units.

The cost of external NOR flash is about double the cost of NAND flash, according to Syed Hussain, director of segment marketing at Winbond. “NOR flash is less attractive to big business because it’s coming to the end of its Moore’s Law life,” Hussain said. In fact, the main benefits of NOR flash for this type of application are its familiarity and its finer-grained availability of data.

That higher cost, however, is on a per-bit basis, and NAND flash comes only in large capacities. If less capacity is needed than the smallest available NAND device provides, an external NOR flash device may be less expensive overall – even if it costs more for each bit.
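To make the per-bit versus per-device distinction concrete, here is a toy comparison. The roughly 2x per-bit ratio comes from the article; the absolute prices and the minimum NAND capacity are invented purely for illustration:

```python
# Illustrative only: hypothetical prices showing how NOR can be cheaper
# per device despite costing roughly 2x per bit (the article's ratio).
nand_cost_per_gb = 1.0       # hypothetical $/Gb for NAND
nor_cost_per_gb = 2.0        # ~2x per bit, per the article

nand_min_capacity_gb = 1.0   # hypothetical smallest available NAND part
code_size_gb = 0.128         # application only needs 128 Mb of code

nor_device_cost = code_size_gb * nor_cost_per_gb             # buy just what's needed
nand_device_cost = nand_min_capacity_gb * nand_cost_per_gb   # forced to overbuy

assert nor_device_cost < nand_device_cost   # 0.256 < 1.0
```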

“External NOR is also cheaper than external NAND for super simple, cost-conscious apps,” said Handy.

Another subtle change came about through the use of serial interfaces instead of parallel interfaces between the external NOR flash and the processor. “When you think of execute-in-place, the mind goes back to the parallel NOR timeframe where typically we used parallel NOR for XIP,” observed Hussain.

That creates other issues. “By going serial, you get a smaller pinout,” said Larry Song, product marketing manager at Synopsys. “With the old parallel bus, you had one bus for the address and one for the data. While reading the data — in this case, an instruction — you could be applying the next address, which didn’t have to be the next sequential address.”

A serial bus replaces the two ports with a single serial port that’s half-duplex. So the approach there is largely to load an instruction and then turn the bus around, streaming instructions out until something outside of that stream is needed. Then you need to stop the stream, turn the bus around again to load a new address, and then resume streaming.
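The penalty pattern can be sketched with a toy cost model. The cycle counts here are invented; only the shape matters: sequential fetches stream cheaply, while every out-of-stream branch pays the turnaround and address-phase overhead.

```python
# Toy model of fetching over a half-duplex serial NOR interface.
# Sequential words stream cheaply; any branch outside the stream forces
# a bus turnaround plus a new command/address phase. Numbers are invented.
STREAM_CYCLES = 1     # cycles per sequential instruction word (assumed)
REFETCH_CYCLES = 20   # assumed cost: turnaround + command + address + latency

def fetch_cost(addresses):
    cycles, expected = 0, None
    for addr in addresses:
        if addr == expected:
            cycles += STREAM_CYCLES                    # keep streaming
        else:
            cycles += REFETCH_CYCLES + STREAM_CYCLES   # restart the stream
        expected = addr + 1
    return cycles

straight_line = fetch_cost(range(100))              # one restart, then streaming
branchy = fetch_cost([0, 1, 200, 201, 0, 1] * 25)   # a branch every few words
assert branchy > straight_line
```

Even though the branchy trace fetches only 50% more words than the straight-line one, its cycle count is more than ten times higher under these assumptions, which is the behavior that pushes designers toward caching.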

This is a great way to take large blocks of memory and transfer them elsewhere. It’s not so great for managing an execution stream, which may involve frequent branches and loops – and rapid changes of address. So that moves things yet one step further away from the original XIP notion.

Caching changes everything
External NOR flash is unlike embedded NOR flash in one major regard — it’s optimized for size and cost, not for performance. Whereas slower processors could operate directly out of internal NOR flash, external NOR flash latency moved out into the tens-of-microseconds realm. Once a fetched block is streaming, throughput is much faster, and some companies offer a continuous streaming mode with two internal buffers: while one streams out, the other fetches the next page, giving a more or less continuous stream.

But if you were executing directly from the memory, then any branch or loop to an address outside that continuously fed block would incur microsecond latency. That makes executing directly out of memory a complete non-starter.

As a result, most new MCUs have cache. So by definition, you are no longer executing directly out of any non-volatile memory. You’re executing out of the cache. You can keep external NOR in the picture, but the instructions go through the cache en route to the processor, which decouples the memory speed from the processor speed. The cache is SRAM, which is built out of CMOS logic, so it remains in sync with the selected logic process node.
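A minimal sketch of that decoupling, using an invented direct-mapped cache geometry and illustrative access times, shows how repeat fetches are served at SRAM speed while the slow external flash is touched only on misses:

```python
# Sketch: a direct-mapped instruction cache in front of slow external NOR.
# Geometry and timings are invented for illustration.
LINE_WORDS = 16              # words per cache line (assumed)
NUM_LINES = 64               # number of cache lines (assumed)
FLASH_NS, SRAM_NS = 100, 1   # illustrative access times

tags = [None] * NUM_LINES    # one tag per line; None = line is empty

def fetch_ns(addr):
    line = (addr // LINE_WORDS) % NUM_LINES
    tag = addr // (LINE_WORDS * NUM_LINES)
    if tags[line] == tag:
        return SRAM_NS                 # hit: served from SRAM
    tags[line] = tag                   # miss: fill the line from flash
    return FLASH_NS + SRAM_NS

# A tight loop over 32 words, run 10 times: one miss per line, then hits.
total = sum(fetch_ns(a) for a in list(range(32)) * 10)
assert total < 320 * FLASH_NS          # far cheaper than all-flash fetches
```

With two line fills and 318 hits, the average fetch lands near SRAM speed, which is the whole point of putting the cache in the path.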

Meanwhile, NAND has continued to evolve down to about the 15nm node, and the 3D NAND transformation has it moving much farther down the cost and capacity path than would be possible with NOR. The 3D approach is not consistent with monolithic integration onto a high-performance CMOS chip, but even with planar single-level-cell NAND, the cost per bit remains attractive. “Cost is always the first priority,” noted Song.

This has resulted in an approach called “shadowing.” In cell phones, for example, NOR flash has been replaced by NAND flash. Because one can’t execute code directly from NAND flash, the code is first transferred from the NAND array into DRAM, and the code is executed from DRAM (via a cache). “NAND flash can be extremely fast,” said Intrater. “We’re talking about NAND flash devices that have interfaces at the gigahertz level.”

This might sound like a simple virtual-memory exercise, but in this case it’s not. According to Handy, smartphones were originally going to use a demand-paging OS to manage the storage of instructions.

“Demand-paging virtual memory is nothing else than a cache,” noted Michael Frank, fellow and chief architect at Arteris IP. But then Android became available for free, unlike the planned OSes. So the strategy changed from demand-paging to moving the entire code base from flash to DRAM, and then using the SRAM cache mechanism to further manage instruction access times — all in the interest of lower cost.

Shadowing adds to the time required for a system to boot, because the first step is to transfer the code. Applications that must be up and running quickly need an alternative strategy for the critical start-up code while the larger main application set loads.
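A back-of-the-envelope sketch of that boot penalty, with invented image-size and NAND-bandwidth figures:

```python
# Rough shadowing boot-delay estimate. Both figures are hypothetical.
image_mb = 64          # size of the code image to copy into DRAM
nand_mb_per_s = 400    # assumed sustained NAND read bandwidth

copy_time_ms = image_mb / nand_mb_per_s * 1000
print(copy_time_ms)    # 160.0 ms spent copying before the main code runs
```

This is why designs that need instant-on behavior keep the critical start-up code somewhere it can execute immediately, such as on-chip flash or ROM, while a copy like this proceeds.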

Following the MCU trajectory to this point raises a question. If the device has no on-chip NOR, or other code storage, and also uses external DRAM, is it really an MCU? Microchip, for example, refers to the processing chips that no longer have everything internally as microprocessors, not microcontrollers.

“Down at that more bleeding edge, the microcontroller products are now looking more like a microprocessor plus external flash,” said Reiten. “There may be some class of products now that are really ‘microcontroller-lite,’ meaning all they have externally is the NOR. Most of the products that have an external NOR memory also have NAND, and maybe they have DDR4. They run more than just a bare bones RTOS. They’re running a little bit heavier-duty Linux, bare-bones Linux, and they just need a lot more memory.”

This becomes an even bigger distinction when a processor has multiple cores and considerations like cache coherency. “Most of the stuff at 28nm has more than one core,” said Reiten. “They’re doing caching and coherency between the processors, and they’re doing look-ahead and multi-threading.” In such a situation, the whole notion of XIP breaks down, because there are multiple instruction streams for multiple cores.

NAND as NOR and other non-volatile memories
Yet another strategy being used by several manufacturers is to build NAND memory in a fashion that behaves like NOR memory, at least to an extent. NAND flash will always require data access by larger pages, but if that can be hidden with internal buffers, and if the interface and the command set look like NOR, then the processor can treat it as if it were a NOR flash chip – and the latency times might even improve a bit, although they remain in the tens-of-microseconds range for the initial access. NAND devices are becoming available with dual, quad, and octal SPI interfaces.


Fig. 3: The transition from external NOR to NAND. Source: Winbond

New non-volatile memories (NVMs) also show promise for byte addressability, access times that are closer to DRAM, and a more symmetric approach to reading and writing than flash has.

“Some of the novel architectures are offering at least the promise of prices close to DRAM,” said Marc Greenberg, group director of product marketing, IP Group at Cadence. “It has the benefit of non-volatility, which DRAM doesn’t have, and access times that are one or two orders of magnitude better than flash.”

Examples are the various flavors of MRAM, ReRAM, and PCRAM. These memories, while not built out of pure CMOS logic, are at least compatible with CMOS in a way that’s less demanding than flash is. So the promise there would be the ability to go back to monolithic integration and “true” XIP.

But the code bases even for embedded systems have grown dramatically – especially if a system runs Linux. If enough memory were placed on-chip to satisfy all of those needs, eliminating both DRAM and external NVM, it would add significantly to the die area. On aggressive silicon nodes, that would be expensive silicon.

“Space on your logic die is always at a premium compared to the space on a memory die,” noted Greenberg. “On the logic die, it’s still going to have 10-plus layers of metal on top of it. In a DRAM process, they typically use a three-layer metal process. So there’s a lot less processing of the wafer that happens when you build a memory die.”

Two solutions have been proposed. One is to limit the on-chip code to boot code, or to anything that absolutely must be available for immediate startup. That might include small functions that must remain available while the rest of the system is nominally off. That portion of the system could then be executed straight out of the on-chip memory.

Alternatively, Arteris IP’s Frank pointed out that advances in 3D monolithic integration may get us to the point where we layer the memory atop the logic die – not through packaging, but as a single die. The extra memory may then not have such an impact on the cost of the die. But that has yet to be proven both feasible and economical.

So where does that leave XIP?
Engineers who have been around long enough to remember the original XIP notion still understand its meaning, at least as it was understood at the time. It meant fetching instructions directly from the non-volatile memory and sending them directly to the processor.

“XIP meant a device that could be addressed to boot the system and then never look at it again,” said Bill Gervasi, principal systems architect at Nantero. That did not include stopping at any kind of cache.

Today, definitions are more varied. Two versions initially sound different, but in the end they amount to the same thing. The simple version of the definition today is that pretty much everything qualifies as XIP, except DRAM shadowing. “When we say execute in place, it says that you don’t need DRAM, as only non-volatile memory is required,” said Hussain.

Adds Greenberg: “I’m thinking of execute-in-place more in the context of bypassing DRAM so that you might still hit into a local cache even if you’re executing in place.”

That might sound somewhat arbitrary, but Frank had a different – and yet isomorphic – way of describing that. He said it’s any execution where the instruction address used by the processor reflects the original NVM address. “My definition of execute in place is where you do not have an address change, where you execute in a cached way, and your original source of the code or the data is still at the same address that you are executing at.”

That means that any caching strategies do not affect the status of XIP. DRAM caching (i.e., demand paging) and SRAM caching still have the processor using the original memory addresses. It’s just that the intervening caching architecture does hidden address translations so that the processor may think it’s getting instructions from the address given, when instead the instructions are coming out of cache.

The one architecture that doesn’t meet this criterion is DRAM shadowing. In that case, you bulk copy flash contents into some location in DRAM. Now the processor is working with the DRAM address, not the original flash address. Because the address base has changed, this would no longer be XIP.
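Frank’s address-based distinction can be sketched in a few lines. The memory map and instruction contents below are hypothetical; the point is only which address the processor issues:

```python
# Sketch of the address-preservation definition of XIP. The memory map
# and "instructions" are hypothetical.
FLASH_BASE, DRAM_BASE = 0x0800_0000, 0x2000_0000
flash = {FLASH_BASE + i: f"insn{i}" for i in range(4)}

# XIP with a cache: a hidden lookup, but the CPU-visible address is
# still the original flash address.
cache = {}
def xip_fetch(cpu_addr):
    if cpu_addr not in cache:
        cache[cpu_addr] = flash[cpu_addr]   # fill from flash, same address
    return cache[cpu_addr]

# Shadowing: bulk-copy the image into DRAM; from then on the CPU issues
# DRAM addresses, so the address base has changed.
dram = {DRAM_BASE + (a - FLASH_BASE): v for a, v in flash.items()}
def shadow_fetch(cpu_addr):
    return dram[cpu_addr]

assert xip_fetch(FLASH_BASE + 2) == "insn2"     # still XIP by this definition
assert shadow_fetch(DRAM_BASE + 2) == "insn2"   # same code, new address base
```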

So is XIP an important notion anymore? Engineers are still familiar with the concept, but they’ve had to adjust the meaning to meet modern situations. “There are still systems that use it if they have the requirement of being able to do instant on,” said Greenberg. But, most importantly, it doesn’t seem to be an important consideration for system architects. When building a memory hierarchy, engineers do what they need to do. Whether or not it qualifies as XIP seems to have become somewhat academic.


