Using Multicore Processors To Accelerate Your High-Performance Embedded Linux Applications

The challenge is to improve performance within a constantly shrinking power budget. What’s the solution?


The adoption of Linux is accelerating as it becomes the operating system of choice for a variety of embedded applications. However, designers of performance-intensive embedded SoCs running Linux or other virtual-memory operating systems face increasing performance requirements within constant or shrinking power budgets. Most processors either achieve the performance goals while exceeding the power budget, or fit within the power budget but lack the necessary performance.

The traditional path of increasing performance by raising the processor clock speed is not always viable because dynamic power consumption rises at least linearly with clock frequency, and faster still when the supply voltage must also be raised to reach the higher frequency. A better alternative for achieving higher performance is a multicore processor that supports symmetric multiprocessing (SMP). Applications that require higher performance but low power consumption benefit from running on multiple CPU cores, especially when the software can efficiently distribute workloads across a multicore cluster. For some applications that run high-end operating systems, a single core is enough to achieve the required performance. However, when more performance is needed, a dual- or quad-core processor implemented in a symmetric configuration, with the operating system distributing the load across the cores, can make the difference in achieving the required performance while staying within the power budget.
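
The power argument can be made concrete with the standard CMOS approximation for switching power, P = C × V² × f per core. The sketch below compares one fast core against two slower cores that can run at a lower supply voltage. The capacitance, voltage, and frequency values are hypothetical and chosen only for illustration; the model ignores leakage power and assumes the workload splits perfectly across both cores.

```python
def dynamic_power(c_eff, voltage, freq_hz, cores=1):
    """Switching-power approximation: P = C * V^2 * f for each core."""
    return cores * c_eff * voltage ** 2 * freq_hz


# Hypothetical values, not measurements of any real processor.
C_EFF = 1.0e-9  # effective switched capacitance per core, in farads

# One core pushed to a high clock, which needs a higher supply voltage.
single = dynamic_power(C_EFF, voltage=1.1, freq_hz=2.0e9)

# Two cores at half the clock, which tolerate a lower supply voltage.
dual = dynamic_power(C_EFF, voltage=0.9, freq_hz=1.0e9, cores=2)

print(f"single core @ 2.0 GHz, 1.1 V: {single:.2f} W")
print(f"dual core   @ 1.0 GHz, 0.9 V: {dual:.2f} W")
```

Under these assumptions both configurations offer the same aggregate clock cycles per second, but the dual-core case draws roughly a third less switching power because voltage enters the formula squared. Real savings depend on how well the software parallelizes and on static power, which this sketch leaves out.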

Multicore Implementation for Higher Performance
High-performance processors offer a number of cache-related features to enable implementation of multicore systems. Level 1 (L1) cache coherence is important for multicore SMP. When two or more CPUs work on shared data, their cached copies must be kept coherent; otherwise each CPU could modify its own copy independently, leaving the cores with conflicting values. Maintaining coherence in software is difficult and consumes many clock cycles; in contrast, implementing the mechanism in hardware is efficient. Hardware coherency uses ‘snooping’: logic watches the read and write operations of all the L1 caches in a cluster and keeps each core’s cached data coherent with the other cores so problems don’t develop.

Likewise, an I/O coherency unit that keeps input/output traffic coherent with the L1 caches can automatically handle complex bookkeeping and eliminate the need for the application programmer to focus on these details.
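
The snooping idea described above can be illustrated with a toy software model. This is a deliberately simplified write-invalidate, write-through scheme, not the protocol of any particular processor; real hardware typically uses richer state machines such as MESI with write-back caches. The class name `SnoopingBus` and its methods are hypothetical names introduced for this sketch.

```python
class SnoopingBus:
    """Toy write-invalidate snooping model: one shared memory, one L1 per core."""

    def __init__(self, num_cores):
        self.memory = {}                                  # address -> value
        self.caches = [dict() for _ in range(num_cores)]  # per-core L1 contents

    def read(self, core, addr):
        cache = self.caches[core]
        if addr not in cache:
            # Miss: fill the line from shared memory (0 if never written).
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, core, addr, value):
        # Every other L1 "snoops" the write and drops its now-stale copy.
        for other, cache in enumerate(self.caches):
            if other != core:
                cache.pop(addr, None)
        self.caches[core][addr] = value
        self.memory[addr] = value  # write-through keeps the model short


bus = SnoopingBus(num_cores=2)
bus.write(0, 0x100, 1)
first = bus.read(1, 0x100)    # core 1 caches the value written by core 0
bus.write(0, 0x100, 2)        # core 1's cached copy is invalidated here
second = bus.read(1, 0x100)   # core 1 misses and refetches the new value
print(first, second)
```

Without the invalidation loop in `write`, core 1 would keep returning its stale cached copy after core 0's second write, which is exactly the inconsistency that hardware coherence prevents.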

User-Configurable L2 Cache
Processors with a user-configurable Level 2 (L2) cache reduce the number of main-memory accesses caused by L1 misses, improving performance and reducing power consumption. An L2 cache can include several features that ensure high performance while consuming minimal power. For example, an L2 designed to run at the same clock frequency as the processor and shared by all the cores in a multicore cluster can keep up with the CPUs. In addition, a cache that is tightly connected to the cores through a separate low-latency bus keeps traffic between the CPU cores and the L2 off the AXI bus, avoiding the latency that bus would introduce and further improving performance.
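
The benefit of an L2 cache can be estimated with the textbook average memory access time (AMAT) formula: hit time plus miss rate times miss penalty, applied once per cache level. The sketch below uses hypothetical hit times, miss rates, and memory latency chosen only to show the shape of the calculation; they are not figures for any specific processor.

```python
def amat_cycles(l1_hit, l1_miss_rate, mem_latency, l2_hit=None, l2_miss_rate=None):
    """Average memory access time in cycles, with or without an L2 cache."""
    if l2_hit is None:
        # Every L1 miss goes straight to main memory.
        return l1_hit + l1_miss_rate * mem_latency
    # An L1 miss first pays the L2 hit time; only L2 misses reach main memory.
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_latency)


# Hypothetical values for illustration only.
without_l2 = amat_cycles(l1_hit=1, l1_miss_rate=0.05, mem_latency=100)
with_l2 = amat_cycles(l1_hit=1, l1_miss_rate=0.05, mem_latency=100,
                      l2_hit=10, l2_miss_rate=0.2)

print(f"AMAT without L2: {without_l2:.1f} cycles")
print(f"AMAT with L2:    {with_l2:.1f} cycles")
```

With these assumed numbers the average access drops from about 6 cycles to about 2.5, and only one in five L1 misses still reaches main memory, which is also where the power saving comes from. A lower L2 hit time, as with an L2 running at core frequency on a dedicated bus, shrinks the result further.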

Processors with a high degree of configurability provide excellent power and performance benefits. Besides being able to configure the L2 cache’s clock speed, memory size, and AXI interfaces, designers using configurable processors can customize the core and make use of different capabilities to save power.

ARC HS38 Processor: Designed for High Performance

Figure 1. The ARC HS38 processor enables designers to implement dual- and quad-core clusters supporting symmetric multiprocessing and supports full L1 cache coherency and up to 8MB of L2 cache.

Synopsys designed the ARC HS family to maximize performance efficiency for embedded applications, offering very high performance at size and power consumption levels that are less than half of what competitive cores require. Designing high-performance processors is not difficult when the power and transistor budgets are unlimited. It is much more difficult to design a small, power-efficient processor that has enough performance for today’s applications and can also support more performance-intensive designs in the future. The ARC HS38 processor enables designers to implement dual- and quad-core clusters supporting symmetric multiprocessing, with full L1 cache coherency and up to 8MB of L2 cache. Running at 2.2 GHz (typical in 28nm) and reaching more than 4,200 Dhrystone MIPS and 7,700 CoreMarks per core, the HS38 processor delivers the performance needed for today’s high-end embedded applications with room for greater performance in future designs. With the ARC HS family, Synopsys is expanding its DesignWare IP portfolio to meet the growing high-performance needs of SoC developers while avoiding unnecessary features that would compromise today’s tightening cost and power budgets.