Automatic load balancing for multi-threaded embedded applications in a multicore environment.
An increasing number of multi-threaded embedded applications want to leverage multicore designs. A symmetric multiprocessing (SMP) RTOS provides automatic load balancing of multiple threads in a multicore environment. In addition, customers want to move numerous legacy multi-threaded embedded applications, currently deployed on a single-core RTOS, to a multicore environment. For these reasons, application developers often need an RTOS that supports SMP.
FreeRTOS merged SMP support into the main kernel in FreeRTOS v11.0.0. This blog uses a FreeRTOS SMP example to discuss topics related to porting and running an SMP RTOS on an Xtensa multicore cluster. The blog covers two popular topics: interrupt latency and context switching. It also presents an example of an Xtensa SMP platform and memory map.
Figure 1 depicts an Xtensa LX8-based dual-core system running FreeRTOS SMP.
Fig. 1: Dual-core example.
Xtensa cores are configurable. Here’s the list of core configuration options to support SMP:
Each core in an SMP system picks the highest-priority ready-to-run task from the shared task list (pxReadyTasksLists, pink highlight in figure 2). A task with no affinity defined (i.e., a task that is not dedicated to a specific CPU in the cluster) can run on any core. Task control block structures (tskTCB) must therefore reside in shared memory. Tasks might also share data, so a region of shared memory that looks the same from each core's view is required. Spinlocks used to implement SMP critical sections are also allocated in a shared system memory region.
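The selection rule described above can be sketched in plain C. This is a simplified model, not the FreeRTOS scheduler: task_t, select_task(), and the bit-per-core affinity mask are hypothetical names chosen for illustration, and the locking that protects the shared list in a real kernel is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_CORES 2u
#define ANY_CORE  ((1u << NUM_CORES) - 1u)   /* no affinity: every core allowed */

/* Hypothetical, simplified stand-in for a task control block. */
typedef struct {
    const char *name;
    unsigned    priority;       /* larger value = higher priority */
    uint32_t    affinity_mask;  /* bit n set = task may run on core n */
    int         running_on;     /* core index, or -1 when ready but not running */
} task_t;

/* Return the highest-priority ready task that core_id is allowed to run,
 * and mark it as running there. Returns NULL if nothing is eligible. */
static task_t *select_task(task_t *tasks, size_t count, unsigned core_id)
{
    task_t *best = NULL;

    assert(core_id < NUM_CORES);
    for (size_t i = 0; i < count; i++) {
        task_t *t = &tasks[i];
        if (t->running_on >= 0) {
            continue;                               /* already on some core */
        }
        if ((t->affinity_mask & (1u << core_id)) == 0u) {
            continue;                               /* pinned elsewhere */
        }
        if (best == NULL || t->priority > best->priority) {
            best = t;
        }
    }
    if (best != NULL) {
        best->running_on = (int)core_id;
    }
    return best;
}
```

With one task pinned to core 1 and two unpinned tasks, core 0 skips the pinned task even if it has the highest priority, while core 1 picks it. Because both cores walk the same array, the array (like the real task list and TCBs) has to live in memory that both cores can reach.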
In an SMP environment, a core performs a yield request, portYIELD_CORE(), to another core by asserting an interrupt. Memory-mapped input-output (MMIO) registers allow each core to assert an interrupt to another core.
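A minimal sketch of this inter-core yield request follows. The register layout, the YIELD_IRQ_BIT value, and the function names are assumptions made for illustration; they are not the Xtensa port's actual registers or code. An ordinary array stands in for the MMIO registers so that the sketch runs on a host machine.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CORES     2u
#define YIELD_IRQ_BIT 0x1u   /* hypothetical bit meaning "please reschedule" */

/* Stand-in for the memory-mapped interrupt registers, one per core.
 * On hardware this would be a volatile pointer to a platform-specific address. */
static volatile uint32_t ipi_regs[NUM_CORES];

/* Sender side: ask target_core to run its scheduler. */
static void request_yield_on_core(unsigned target_core)
{
    assert(target_core < NUM_CORES);
    ipi_regs[target_core] = YIELD_IRQ_BIT;   /* asserts the interrupt */
}

/* Receiver side: returns 1 if a yield was pending (and clears it), else 0.
 * A real handler would go on to select and switch to the next task. */
static int yield_irq_handler(unsigned this_core)
{
    assert(this_core < NUM_CORES);
    if ((ipi_regs[this_core] & YIELD_IRQ_BIT) == 0u) {
        return 0;
    }
    ipi_regs[this_core] = 0u;                /* acknowledge the interrupt */
    return 1;
}
```

The point of the sketch is only the direction of the write: the requesting core writes a register that belongs to the other core, and the other core reacts in its own interrupt handler.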
xtsc-run is an Xtensa SDK application that allows users to run SystemC simulations using a pre-built library of SystemC components. Users can specify a system model that reflects their multicore hardware in a readable text format. For more details, please refer to Cadence’s Xtensa SystemC (XTSC) User’s Guide. Minor differences exist between the images running on the simulation and hardware platforms.
FreeRTOSConfig.h tailors the RTOS kernel to the application being built. For a detailed description of configuration options, see Customizations. SMP-specific configuration options are described here. Subsequent sections refer to a few of the RTOS configurations listed below.
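As an illustration, an SMP-related excerpt of FreeRTOSConfig.h for a dual-core cluster might look like the fragment below. The option names are FreeRTOS v11 configuration macros; the values are assumptions for a two-core example, not the settings used by the Xtensa port.

```c
/* FreeRTOSConfig.h (excerpt): illustrative values only. */
#define configNUMBER_OF_CORES              2  /* cores managed by the SMP scheduler */
#define configRUN_MULTIPLE_PRIORITIES      1  /* tasks of different priorities may run at the same time */
#define configUSE_CORE_AFFINITY            1  /* allow pinning tasks to specific cores */
#define configUSE_TASK_PREEMPTION_DISABLE  0  /* no per-task preemption control */
#define configUSE_PASSIVE_IDLE_HOOK        0  /* no hook on the additional cores' idle tasks */
```

Setting configRUN_MULTIPLE_PRIORITIES to 0 instead restricts the cores to running tasks of equal priority at the same time, which can ease the migration of code written with single-core priority assumptions, at the cost of leaving cores idle.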
Multi-threading on a single core differs from multi-threading in an SMP environment. FreeRTOS SMP Change describes these differences in more detail. An application written for a single core needs to pay attention to the following SMP changes.
In a single-core multi-threaded environment, only one task, typically the highest-priority ready-to-run task, runs at any given time. In a single-core environment, the assumption is that lower-priority tasks will not run simultaneously with a higher-priority task. This is not true in an SMP environment, so the multi-threaded application might need to be modified for an SMP environment.
When sharing data with another task or an ISR, a task enters a critical section. Typically, this is accomplished in a single-core environment by disabling interrupts. Disabling interrupts alone does not create a critical section in an SMP environment: when interrupts are disabled on one core, a task or ISR on another core can still access the shared data. ISRs must therefore also enter critical sections by acquiring a spinlock. The port adds two spinlocks for this purpose. Please refer to Cadence’s System Software Reference Manual for spinlock APIs. The FreeRTOS APIs taskENTER_CRITICAL() and taskEXIT_CRITICAL() now use these two spinlocks to provide the critical section in an SMP environment.
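The idea of combining local interrupt masking with a spinlock can be sketched with C11 atomics. This is a generic illustration, not the Xtensa port's implementation: it uses a single lock rather than the port's two, spinlock_t and the function names are hypothetical, and the irq_enabled variable merely stands in for the interrupt-enable state of the calling core.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_flag locked;
} spinlock_t;

static spinlock_t data_lock = { ATOMIC_FLAG_INIT };
static bool irq_enabled = true;   /* stand-in for this core's interrupt state */
static int  shared_counter;       /* data shared between cores */

/* Mask local interrupts first, then spin until the lock is free.
 * Returns the previous interrupt state so it can be restored on exit. */
static bool enter_critical(spinlock_t *lock)
{
    bool previous = irq_enabled;

    irq_enabled = false;   /* keeps ISRs on this core out */
    while (atomic_flag_test_and_set_explicit(&lock->locked,
                                             memory_order_acquire)) {
        /* busy-wait: another core holds the lock */
    }
    return previous;
}

/* Release the lock, then restore the caller's interrupt state. */
static void exit_critical(spinlock_t *lock, bool previous)
{
    atomic_flag_clear_explicit(&lock->locked, memory_order_release);
    irq_enabled = previous;
}

static void increment_shared(void)
{
    bool previous = enter_critical(&data_lock);

    assert(!irq_enabled);   /* interrupts stay masked while the lock is held */
    shared_counter++;
    exit_critical(&data_lock, previous);
}
```

Masking interrupts before taking the lock matters: if an ISR on the same core tried to take a lock that the interrupted task already holds, it would spin forever.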
Figure 2 shows an example memory map. In this example, each core has its own local (Tightly Coupled Memory) instruction and data memory, but both cores share system memory. Please refer to Cadence’s Linker Support Packages (LSP) Reference Manual for code and data placement.
Fig. 2: Memory map.
FreeRTOS objects can be allocated statically or dynamically. Figure 2 shows an RTOS-managed dynamic allocation scheme, heap_4. It uses ucHeap[] space as a heap separate from the Standard C Library heap. The configTOTAL_HEAP_SIZE parameter specifies the heap size in FreeRTOSConfig.h.
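A hedged sketch of the related configuration is shown below. The macro names are standard FreeRTOS options, while the 64 KB size is an arbitrary example value. With configAPPLICATION_ALLOCATED_HEAP set to 1, the application declares ucHeap[] itself, which gives it control over where the linker places the array (for instance in shared system memory); the placement mechanism itself is not shown here.

```c
#include <stdint.h>

/* FreeRTOSConfig.h (excerpt): illustrative values only. */
#define configSUPPORT_DYNAMIC_ALLOCATION  1
#define configTOTAL_HEAP_SIZE             (64u * 1024u)  /* bytes managed by heap_4 */
#define configAPPLICATION_ALLOCATED_HEAP  1              /* application provides ucHeap[] */

/* In one application source file: the storage that heap_4 carves up.
 * Placement in a particular memory region is left to the linker setup. */
uint8_t ucHeap[configTOTAL_HEAP_SIZE];
```

Since dynamically created tasks take their TCBs and stacks from this heap, it is presumably a good candidate for the shared region in an SMP system.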
The RTOS must be aware when an interrupt routine makes a non-blocking RTOS API call. Several schemes exist to inform the RTOS that a call is made in an interrupt context. FreeRTOS uses a scheme that provides a separate set of interrupt-safe APIs, identified by the FromISR suffix in their names.
In a core configuration with a single interrupt priority level, application interrupt latency will be adversely affected as the RTOS system tick and application interrupts are at the same priority level. Figure 3 depicts a typical Xtensa core configuration with multiple interrupt priority levels.
Fig. 3: Interrupt priority levels.
The RTOS system tick is set to the lowest priority level. A critical section masks the interrupts that make RTOS API calls, so RTOS activity affects the latency of these interrupts. Their latency is also higher because they save a full task context to prepare for a possible context switch. On an Xtensa core, a configurable option (EXCMLEVEL) sets the priority level of exception processing. To avoid masking exceptions by RTOS critical sections, users should keep the priority level of interrupts that can make interrupt-safe API calls equal to or below EXCMLEVEL. Note that the latency of interrupts above the configMAX_SYSCALL_INTERRUPT_PRIORITY priority level is unaffected by RTOS activity. However, these interrupts cannot make RTOS API calls.
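The rule above lends itself to a simple check over an interrupt table. The sketch below is a hypothetical helper, not part of FreeRTOS or the Xtensa tools: irq_desc_t, count_misplaced_irqs(), the interrupt names, and the level values are all invented for illustration, with higher numbers meaning higher priority as in figure 3.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one interrupt source. */
typedef struct {
    const char *name;
    int         level;            /* interrupt priority level */
    bool        calls_rtos_api;   /* true if its ISR uses interrupt-safe RTOS APIs */
} irq_desc_t;

/* Count interrupts that call RTOS APIs from a level above excm_level.
 * Such interrupts are not masked by RTOS critical sections, so they
 * must not touch RTOS state. A result of 0 means the table is consistent. */
static int count_misplaced_irqs(const irq_desc_t *irqs, size_t count,
                                int excm_level)
{
    int misplaced = 0;

    assert(irqs != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (irqs[i].calls_rtos_api && irqs[i].level > excm_level) {
            misplaced++;
        }
    }
    return misplaced;
}
```

Run against a table where only the tick and a UART interrupt call RTOS APIs at levels 1 and 2, with an assumed EXCMLEVEL of 3, the check passes; moving the UART interrupt to level 5 would flag it.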
Xtensa cores can be configured to place interrupt vectors in local instruction RAM. Vector space for each interrupt priority level is configurable but limited. Typically, an interrupt service routine (ISR) does not fit in the vector space, so the vector must jump to the ISR. Such ISRs can be placed in local memory to minimize interrupt latency. Because interrupts nest, lower-priority interrupts will have a higher variability in interrupt latency than higher-priority interrupts. The highest-priority interrupt, Level 6 in figure 3, will have deterministic latency.
A separate interrupt stack is more efficient when the RTOS takes an interrupt or an exception while executing a task. With it, each task stack needs space for only one interrupt stack frame; nested interrupt stack frames go onto the separate interrupt stack. Each core allocates a separate interrupt stack in local data memory (refer to the cyan highlight in figure 2). Note that placing an interrupt stack in shared memory is also possible. The configISR_STACK_SIZE parameter specifies the interrupt stack size in FreeRTOSConfig.h.
In an SMP environment, broadcast interrupts propagate to all cores. This results in overhead as each core needs to save the current context and load the interrupt context. Typically, an application might want one core to handle the interrupt and pin a handler task to the same core. Broadcast interrupts might be handled differently on each individual core. A separate per-core interrupt vector and interrupt handlers provide this flexibility—refer to the orange highlight in figure 2.
A typical controller context includes the general-purpose and status registers, the program counter, and the stack pointer. However, Xtensa-based DSPs are SIMD machines that include vector registers, so the context for these machines also includes the coprocessor state. Saving and restoring the coprocessor state of these machines can be quite costly. Xtensa supports a lazy context-switching mechanism. The idea is for the RTOS to save the coprocessor context only when the task it has just switched to actually uses the coprocessor's resources. If the current task does not use a coprocessor, the RTOS avoids saving and restoring coprocessor registers. In an SMP environment, the use of this feature imposes a restriction: application developers must pin a task that uses a coprocessor to a particular core.
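The lazy scheme can be modeled in a few lines of C. This is a host-side simulation under stated assumptions, not the Xtensa port's code: core_t, task_t, and the function names are hypothetical, a four-entry array stands in for the vector register file, and cp_first_use() plays the role of the exception that fires when a task touches a disabled coprocessor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define CP_REGS 4   /* tiny stand-in for a vector register file */

typedef struct {
    const char *name;
    int         cp_save_area[CP_REGS];   /* per-task coprocessor save area */
} task_t;

typedef struct {
    int      cp_regs[CP_REGS];   /* live coprocessor registers of this core */
    bool     cp_enabled;         /* false: first use traps to cp_first_use() */
    task_t  *cp_owner;           /* task whose state is live in cp_regs */
    task_t  *current;            /* task running on this core */
    unsigned cp_saves;           /* number of coprocessor saves performed */
} core_t;

/* Task switch: the coprocessor registers are NOT saved here. The unit is
 * simply disabled unless the incoming task already owns the live state. */
static void context_switch(core_t *core, task_t *next)
{
    core->current    = next;
    core->cp_enabled = (core->cp_owner == next);
}

/* Models the exception raised on first coprocessor use after a switch. */
static void cp_first_use(core_t *core)
{
    if (core->cp_owner != core->current) {
        if (core->cp_owner != NULL) {
            /* Save the previous owner's state only now that it is needed. */
            memcpy(core->cp_owner->cp_save_area, core->cp_regs,
                   sizeof core->cp_regs);
            core->cp_saves++;
        }
        memcpy(core->cp_regs, core->current->cp_save_area,
               sizeof core->cp_regs);
        core->cp_owner = core->current;
    }
    core->cp_enabled = true;
}

/* A coprocessor instruction issued by the current task. */
static void cp_write(core_t *core, size_t reg, int value)
{
    assert(reg < CP_REGS);
    if (!core->cp_enabled) {
        cp_first_use(core);
    }
    core->cp_regs[reg] = value;
}
```

Switching from a DSP task to a control task and back costs no coprocessor save at all in this model; a save happens only when a second coprocessor user runs. The model also hints at why pinning is required: the owner's state may still sit in one core's registers rather than in its save area, so the task cannot simply resume on another core.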
Execution uses three stacks: Interrupt, Task, and Main. The interrupt stack was described earlier. Each task has its own stack, which the core uses while executing that task. In an SMP environment, task stacks must reside in shared memory (figure 2 green highlight). The user specifies the task stack size when creating a task. The main function and the scheduler running on each core use the Main Stack (figure 2 yellow highlight). It is initialized to the bottom of the core’s local data memory. Note that placing the main stack in shared memory is also possible.
Xtensa FreeRTOS task creation initializes the C library context per thread in the task control block (figure 2, red highlight). The scheduler switches the C library context on a task switch.
The C runtime on core 0 initializes the global C library context (figure 2 blue highlight), which is allocated in shared memory. Each core initializes a per-core C library context (figure 2 purple highlight), which is allocated in local data RAM. The main function and the scheduler running on each core use these C library contexts.