SPONSOR BLOG

Symmetric Multiprocessing (SMP) RTOS On Xtensa Multicore

Automatic load balancing for multi-threaded embedded applications in a multicore environment.

February 13th, 2025 - By: Nayan Gaywala

An increasing number of multi-threaded embedded applications want to leverage multicore designs. Symmetric Multiprocessing (SMP) RTOS provides automatic load balancing of multiple threads in a multicore environment. Also, numerous legacy multi-threaded embedded applications are deployed on a single-core RTOS that customers want to move to a multicore environment. For these reasons, application developers often need an RTOS that supports SMP.

FreeRTOS merged SMP support into the main kernel as of FreeRTOS v11.0.0. This blog uses a FreeRTOS SMP example to discuss porting and running an SMP RTOS on an Xtensa multicore cluster. It covers two popular topics, interrupt latency and context switching, and presents an example of an Xtensa SMP platform and memory map.

A system design example

Figure 1 depicts an Xtensa LX8-based dual-core system running FreeRTOS SMP.

Fig. 1: Dual-core example.

Core configuration

Xtensa cores are configurable. Here’s the list of core configuration options to support SMP:

AXI4 Manager exclusive access (required) – Implements the basic locking mechanism used by the FreeRTOS SMP critical section implementation.
A timer with an interrupt option (required) – The RTOS tick timer provides the time base.
External interrupts (required) – N-1 inter-core interrupts in an N-core system.
Processor ID (required) – In an N-core system, assign PRIDs ranging from 0 to N-1. Each core in an SMP system uses its PRID to get the current task handle (pxCurrentTCBs[]) and state variables. Core#0 initializes the shared .bss section and C library global context to set up the C runtime.
Local Data Memory – Stores the stack and per-CPU context. Note that it is possible to place each core’s stack in a separate shared memory segment for a single-image multicore build.
Local Instruction Memory – Holds interrupt vectors and per-core interrupt handlers. Note that it is possible to relocate interrupt vectors to shared memory.
Xclib C Standard library.
No Cache – Cache-based configuration will be covered in a future blog post.

Shared memory (uncached)

Each core in SMP picks the highest priority ready-to-run task from the shared task list (pxReadyTasksLists – figure 2 pink highlight). When no affinity (i.e., the task is not dedicated to a specific CPU in the cluster) is defined, tasks can run on any core. Task control block structures (tskTCB) must reside in shared memory. Tasks might also share data. A chunk of shared memory that looks the same from each core’s view is required. Spinlocks used to implement SMP critical sections are also allocated in a shared system memory region.
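The selection rule described above can be illustrated with a small host-side model. This is only a sketch, not the kernel's code: it uses a flat task array instead of the per-priority pxReadyTasksLists, assumes tasks of different priorities may run at the same time, and all names (task_t, select_task, NO_AFFINITY) are hypothetical. The real kernel performs this selection while holding its SMP locks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_CORES   2u                /* assumed; mirrors configNUMBER_OF_CORES */
#define NO_AFFINITY ((uint32_t)~0u)   /* hypothetical: task may run on any core */

typedef struct {
    const char *name;
    unsigned    priority;       /* larger value = higher priority */
    uint32_t    affinity_mask;  /* bit n set => task may run on core n */
    int         running_on;     /* -1 while ready but not running */
} task_t;

/* Pick the highest-priority ready task that core_id is allowed to run
 * and mark it as running on that core. Returns NULL if none qualifies. */
static task_t *select_task(task_t *tasks, size_t count, unsigned core_id)
{
    assert(core_id < NUM_CORES);
    task_t *best = NULL;
    for (size_t i = 0; i < count; i++) {
        task_t *t = &tasks[i];
        if (t->running_on >= 0) {
            continue;  /* already running on some core */
        }
        if ((t->affinity_mask & (1u << core_id)) == 0u) {
            continue;  /* affinity excludes this core */
        }
        if (best == NULL || t->priority > best->priority) {
            best = t;
        }
    }
    if (best != NULL) {
        best->running_on = (int)core_id;
    }
    return best;
}
```

With a task pinned to core 1 at the highest priority, core 0 skips it and takes the next-highest task that has no affinity restriction.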

Inter-core interrupts

In an SMP environment, a core performs a yield request, portYIELD_CORE(), to another core by asserting an interrupt. Memory-mapped input-output (MMIO) registers allow each core to assert an interrupt to another core.
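As a rough illustration of the mechanism, the sketch below models the inter-core interrupt registers as ordinary memory so that it runs on a host. On hardware the array would be replaced by the platform's MMIO addresses, which are not given here. The names fake_mmio_ipi, yield_core and ipi_handler are hypothetical and are not part of the FreeRTOS port.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CORES 2u   /* assumed; mirrors configNUMBER_OF_CORES */

/* Stand-in for one memory-mapped interrupt request register per core. */
static volatile uint32_t fake_mmio_ipi[NUM_CORES];

/* Ask target_core to reschedule: writing the register asserts its interrupt. */
static void yield_core(unsigned target_core)
{
    assert(target_core < NUM_CORES);
    fake_mmio_ipi[target_core] = 1u;
}

/* Body of the inter-core interrupt handler on this_core.
 * Returns 1 if a yield was requested (the caller would then run the
 * scheduler), 0 if no request was pending. */
static int ipi_handler(unsigned this_core)
{
    assert(this_core < NUM_CORES);
    if (fake_mmio_ipi[this_core] == 0u) {
        return 0;
    }
    fake_mmio_ipi[this_core] = 0u;   /* acknowledge and clear the request */
    return 1;
}
```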

xtsc-run is an Xtensa SDK application that allows users to run SystemC simulations using a pre-built library of SystemC components. Users can specify a system model that reflects their multicore hardware in a readable text format. For more details, please refer to Cadence’s Xtensa SystemC (XTSC) User’s Guide. Minor differences exist between the images running on the simulation and hardware platforms.

RTOS configuration

FreeRTOSConfig.h tailors the RTOS kernel to the application being built. For a detailed description of configuration options, see Customizations. SMP-specific configuration options are described here. Subsequent sections refer to a few of the RTOS configurations listed below.

configNUMBER_OF_CORES – Number of SMP cores.
configRUN_MULTIPLE_PRIORITIES – Controls whether tasks with different priority levels may run simultaneously on different cores.
configAPPLICATION_ALLOCATED_HEAP – The application provides the heap space.
configKERNEL_INTERRUPT_PRIORITY – Sets the interrupt priority used by the FreeRTOS kernel itself.
configMAX_SYSCALL_INTERRUPT_PRIORITY – Sets the highest interrupt priority from which interrupt-safe APIs can be called.
configUSE_CORE_AFFINITY – Enables the core affinity feature, which controls which cores a task can run on. It helps pin a task to a particular core.
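A FreeRTOSConfig.h excerpt for the dual-core example might look like the following. The option names are the ones listed above; every value is an assumption chosen for illustration, and the compile-time checks at the end are not part of FreeRTOS.

```c
/* FreeRTOSConfig.h (excerpt): illustrative values only */
#include <assert.h>

#define configNUMBER_OF_CORES                 2  /* dual-core cluster, as in figure 1 */
#define configRUN_MULTIPLE_PRIORITIES         1  /* different priorities may run at once */
#define configUSE_CORE_AFFINITY               1  /* allow pinning tasks to cores */
#define configAPPLICATION_ALLOCATED_HEAP      1  /* application supplies the heap */
#define configKERNEL_INTERRUPT_PRIORITY       1  /* assumed: lowest level, used by the tick */
#define configMAX_SYSCALL_INTERRUPT_PRIORITY  3  /* assumed: highest level that may call interrupt-safe APIs */

/* Sanity checks added for this sketch only. */
static_assert(configNUMBER_OF_CORES >= 1, "at least one core is required");
static_assert(configMAX_SYSCALL_INTERRUPT_PRIORITY >= configKERNEL_INTERRUPT_PRIORITY,
              "syscall ceiling should not be below the kernel interrupt level");
```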

Differences from a single-core operation

Multi-threading on a single core differs from multi-threading in an SMP environment. FreeRTOS SMP Change describes these complications in more detail. An application written for a single core needs to account for the following SMP changes.

Multiple tasks can run simultaneously

In a single-core multi-threaded environment, only one task, typically the highest-priority ready-to-run task, runs at any given time. In a single-core environment, the assumption is that lower-priority tasks will not run simultaneously with a higher-priority task. This is not true in an SMP environment, so the multi-threaded application might need to be modified for an SMP environment.

Critical section – disabling interrupts is not enough

When sharing data with another task or an ISR, a task enters a critical section. Typically, this is accomplished in a single-core environment by disabling interrupts. Disabling interrupts does not create a critical section in an SMP environment. When interrupts are disabled on one core, a task or ISR on another core can still access the shared data. It is now necessary for ISRs to also enter critical sections by acquiring a spinlock. The port adds two spinlocks for this purpose. Please refer to Cadence’s System Software Reference Manual for spinlock APIs. The FreeRTOS APIs taskENTER_CRITICAL() and taskEXIT_CRITICAL() now use two spinlocks to provide the critical section in an SMP environment.
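To show why a lock is needed in addition to interrupt masking, here is a minimal spinlock sketch built on C11 atomics. It is a generic stand-in, not the Cadence spinlock API or the port's implementation, and it uses a single lock rather than the two the port adds. The names spinlock_t, spin_lock, spin_try_lock, spin_unlock and increment_shared are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    atomic_flag locked;
} spinlock_t;

/* Try once to take the lock. test-and-set returns the previous value,
 * so "false" means the lock was free and now belongs to the caller. */
static bool spin_try_lock(spinlock_t *l)
{
    assert(l != NULL);
    return !atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire);
}

/* Busy-wait until the lock is acquired. */
static void spin_lock(spinlock_t *l)
{
    while (!spin_try_lock(l)) {
        /* spin: another core holds the lock */
    }
}

static void spin_unlock(spinlock_t *l)
{
    assert(l != NULL);
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

static spinlock_t data_lock = { ATOMIC_FLAG_INIT };
static int shared_counter;

/* Critical section around shared data. On a real core, local interrupts
 * would be disabled before taking the lock and restored after releasing it. */
static void increment_shared(void)
{
    spin_lock(&data_lock);
    shared_counter++;
    spin_unlock(&data_lock);
}
```

Disabling interrupts only stops preemption on the local core; the lock is what keeps a task or ISR on the other core out of the same section.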

Memory map

Figure 2 shows an example memory map. In this example, each core has its own local (Tightly Coupled Memory) instruction and data memory, but both cores share system memory. Please refer to Cadence’s Linker Support Packages (LSP) Reference Manual for code and data placement.

Fig. 2: Memory map.

Memory management

FreeRTOS objects can be allocated statically or dynamically. Figure 2 shows an RTOS-managed dynamic allocation scheme, heap_4. It uses ucHeap[] space as a heap separate from the Standard C Library heap. The configTOTAL_HEAP_SIZE parameter specifies the heap size in FreeRTOSConfig.h.
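When configAPPLICATION_ALLOCATED_HEAP is set to 1, heap_4 expects the application to define the ucHeap[] array itself. A minimal sketch follows, assuming a 64 KB heap. The size is an arbitrary example, and the linker placement of the array in shared system memory is left out because it depends on the LSP in use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define configAPPLICATION_ALLOCATED_HEAP 1
#define configTOTAL_HEAP_SIZE ((size_t)(64u * 1024u))  /* assumed size */

/* Heap storage used by heap_4. In an SMP build this array would be placed
 * in shared memory by the linker configuration (not shown), since objects
 * allocated from it are visible to every core. */
uint8_t ucHeap[configTOTAL_HEAP_SIZE];

/* Sanity check added for this sketch only. */
static_assert(sizeof(ucHeap) == configTOTAL_HEAP_SIZE, "heap array must match the configured size");
```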

Interrupts

The RTOS must be aware when an interrupt routine makes a non-blocking RTOS API call. Several schemes exist to inform the RTOS that a call is made in an interrupt context. FreeRTOS uses a scheme that provides a separate set of interrupt-safe APIs.

Interrupt latency

In a core configuration with a single interrupt priority level, application interrupt latency will be adversely affected as the RTOS system tick and application interrupts are at the same priority level. Figure 3 depicts a typical Xtensa core configuration with multiple interrupt priority levels.

Fig. 3: Interrupt priority levels.

The RTOS system tick is set to the lowest priority level. The critical section will mask the interrupts that make RTOS API calls. RTOS activity will affect the interrupt latency of these interrupts. Also, the interrupt latency will be higher since these interrupts will save a full task context to prepare for a possible context switch. On an Xtensa core, a configurable option (EXCMLEVEL) sets the priority level of exception processing. To avoid masking exceptions by RTOS critical sections, users should keep the priority level of interrupts that can make interrupt-safe API calls equal to or below the EXCMLEVEL level. Note that interrupt latency above the configMAX_SYSCALL_INTERRUPT_PRIORITY priority level is unaffected by RTOS activity. However, these interrupts cannot make RTOS API calls.
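The rule in the last two sentences can be written as a small classification helper. This is a sketch under assumed values: a ceiling of 3 for configMAX_SYSCALL_INTERRUPT_PRIORITY and six interrupt levels, with Level 6 the highest as mentioned for figure 3. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define configMAX_SYSCALL_INTERRUPT_PRIORITY 3  /* assumed value */
#define HIGHEST_INTERRUPT_LEVEL              6  /* assumed: Level 6 is the highest */

/* Interrupts at or below the syscall ceiling are masked by RTOS critical
 * sections, which is what makes it safe for them to call interrupt-safe APIs. */
static bool level_may_call_rtos_api(int level)
{
    assert(level >= 1 && level <= HIGHEST_INTERRUPT_LEVEL);
    return level <= configMAX_SYSCALL_INTERRUPT_PRIORITY;
}

/* The same masking is why RTOS activity adds latency to those levels only. */
static bool level_latency_affected_by_rtos(int level)
{
    return level_may_call_rtos_api(level);
}
```

With these values, levels 1 to 3 may call interrupt-safe APIs and see RTOS-induced latency, while levels 4 to 6 do neither.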

Xtensa cores can be configured to place interrupt vectors in local instruction RAM. Vector space for each interrupt priority level is configurable but limited. Typically, an interrupt service routine (ISR) does not fit in the vector space, so a jump to the ISR is needed. Such ISRs can be allocated in local memory to minimize interrupt latency. Because of the nested interrupt architecture, lower-priority interrupts will have a higher variability in interrupt latency than higher-priority interrupts. The highest priority interrupt, Level 6 in figure 3, will have deterministic latency.

Interrupt stack

A separate interrupt stack is more efficient when the RTOS takes an interrupt or an exception while executing a task. Each task stack then only needs to allow space for one interrupt stack frame, and nested interrupt stack frames go onto the separate interrupt stack. Each core allocates a separate interrupt stack in local data memory – refer to the figure 2 cyan highlight. Note that placing an interrupt stack in shared memory is possible. The “configISR_STACK_SIZE” parameter specifies the interrupt stack size in FreeRTOSConfig.h.

Broadcast interrupts

In an SMP environment, broadcast interrupts propagate to all cores. This results in overhead as each core needs to save the current context and load the interrupt context. Typically, an application might want one core to handle the interrupt and pin a handler task to the same core. Broadcast interrupts might be handled differently on each individual core. A separate per-core interrupt vector and interrupt handlers provide this flexibility—refer to the orange highlight in figure 2.

Context

A typical controller context includes general-purpose and status registers, the program counter, and the stack pointer. However, Xtensa-based DSPs are SIMD machines that include vector registers, so the context for these machines also includes the coprocessor state. Saving and restoring the coprocessor state of these machines can be quite costly, so Xtensa supports a lazy context-switching mechanism. The idea is for the RTOS to save the coprocessor context only when the task it has just switched to actually uses the coprocessor’s resources. If the current task does not use a coprocessor, the RTOS avoids saving and restoring coprocessor registers. In an SMP environment, the use of this feature imposes a restriction: application developers must pin a task that uses a coprocessor to a particular core.
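The lazy scheme can be modeled on a host as follows. This is a conceptual sketch, not the port's code: the coprocessor register file is reduced to a single int, and the types and functions (task_t, core_t, switch_to, on_coprocessor_use) are hypothetical. On hardware, on_coprocessor_use would typically be triggered by an exception raised on the task's first coprocessor instruction after being switched in.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const char *name;
    int         cp_save_area;  /* stand-in for the saved vector registers */
} task_t;

typedef struct {
    task_t  *current;    /* task running on this core */
    task_t  *cp_owner;   /* task whose state is live in the coprocessor */
    int      cp_regs;    /* stand-in for the live coprocessor registers */
    unsigned cp_saves;   /* number of costly save operations performed */
} core_t;

/* Ordinary context switch: coprocessor registers are left untouched. */
static void switch_to(core_t *core, task_t *next)
{
    assert(core != NULL && next != NULL);
    core->current = next;
}

/* Called when the current task first touches the coprocessor. State is
 * saved and restored only if a different task owns the live registers. */
static void on_coprocessor_use(core_t *core)
{
    assert(core != NULL && core->current != NULL);
    task_t *t = core->current;
    if (core->cp_owner == t) {
        return;                                        /* state already live */
    }
    if (core->cp_owner != NULL) {
        core->cp_owner->cp_save_area = core->cp_regs;  /* save previous owner */
        core->cp_saves++;
    }
    core->cp_regs  = t->cp_save_area;                  /* restore new owner */
    core->cp_owner = t;
}
```

The model also hints at the reason for the pinning restriction: the owner's latest state may exist only in one core's coprocessor registers, so a task that migrated to another core would not find it there.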

Task stack

Execution uses three stacks: Interrupt, Task, and Main. The interrupt stack was described earlier. Each task has its own stack, which the core uses while executing that task. In an SMP system, task stacks must reside in shared memory (figure 2 green highlight). The user specifies the task stack size when creating a task. The main function and the scheduler running on each core use the Main stack (figure 2 yellow highlight). It is initialized to the bottom of the core’s local data memory. Note that placing the main stack in shared memory is possible.

Thread-safe C library

Xtensa FreeRTOS task creation initializes the C library context per thread in the task control block (figure 2, red highlight). The scheduler switches the C library context on a task switch.

The Core#0 C runtime initializes the global C library context (figure 2 blue highlight) allocated in shared memory. Each core initializes a per-core C library context (figure 2 purple highlight) allocated in local data RAM. The main function and the scheduler running on each core use these C library contexts.

Nayan Gaywala

Nayan Gaywala is a senior principal application engineer at Cadence.
