The big challenge for next-generation embedded processors is to provide a significant performance boost at the same or lower power.
The Changing World
Technology is shaping and altering the world around us. Reality is being augmented and “virtual reality” is becoming the norm. Video is becoming more immersive, offering 3D effects and 4K resolution, with 8K on the horizon. Cars are technology showcases that could conceivably take over the driving for us within a few years. Our ability to interact with technology through touch, speech, and gesture is growing rapidly, and information exchange is accelerating from person to person and machine to machine.
Power-Performance Paradox
Advances in technology are being enabled by the next generation of embedded microprocessors, which deliver levels of performance that were inconceivable just a few years ago. Increasingly, new designs are being developed with multiple processors to achieve the performance required. But this higher performance comes at the price of increased power consumption, which is a problem for chip designers who have to increase the performance of their products while staying within the same or lower power budgets. Battery-powered mobile products require longer battery life, and even products that are plugged in are subject to power constraints, both to minimize heat and to satisfy the developer’s desire to be green.
This power-performance paradox is made even more challenging because the typical approach to delivering more performance from a processor is to increase the transistor budget for the design, which increases both its power consumption and its size. Many current processors implement superscalar or multithreading schemes to achieve higher performance. These architectures can deliver on total performance, but they lack performance efficiency (DMIPS/mW, DMIPS/mm²), so they use a lot more power. Also, because of their size, they achieve only modest gains in maximum clock speed over the previous generation of processors that they replace. What is needed is a processor that delivers on total performance, can be clocked at gigahertz speeds, and uses power sparingly. It is a difficult task to design a small, efficient processor that offers enough performance for today plus headroom for future design growth.
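The two efficiency metrics named above are simple ratios, and a small calculation shows why a larger core can win on total performance yet lose on efficiency. The sketch below is illustrative only: the `CoreProfile` type, the two core names, and every number in it are hypothetical values chosen for the example, not measurements of any real processor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreProfile:
    """Headline figures for one processor core (all values hypothetical)."""
    name: str
    dmips: float     # total Dhrystone MIPS at the target clock
    power_mw: float  # power at that operating point, in milliwatts
    area_mm2: float  # silicon area of the core, in square millimeters

    def dmips_per_mw(self) -> float:
        """Power efficiency: performance delivered per milliwatt."""
        return self.dmips / self.power_mw

    def dmips_per_mm2(self) -> float:
        """Area efficiency: performance delivered per square millimeter."""
        return self.dmips / self.area_mm2


# Hypothetical numbers, chosen only to illustrate the trade-off.
wide_core = CoreProfile("wide superscalar", dmips=6000, power_mw=400, area_mm2=1.0)
lean_core = CoreProfile("lean scalar", dmips=4000, power_mw=100, area_mm2=0.25)

for core in (wide_core, lean_core):
    print(f"{core.name}: {core.dmips_per_mw():.1f} DMIPS/mW, "
          f"{core.dmips_per_mm2():.0f} DMIPS/mm2")
```

In this made-up comparison the wide core delivers 50% more total DMIPS, but the lean core delivers more than twice the performance per milliwatt and per square millimeter, which is the gap the efficiency metrics are meant to expose.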
A New Generation of Embedded Processors
A new generation of high-performance processors is now available to help embedded SoC designers address the power-performance challenges that they face. These new processors offer multi-gigahertz performance but are also built with performance efficiency (DMIPS/mW) as a key design goal. When performance efficiency is a design goal, increasing the transistor budget, the typical route to higher performance, is taken off the table. This compels processor designers to reconsider how they will achieve the high levels of performance needed. It forces them to take a different approach, one that requires innovation and a new pipeline implementation that supports fine-grained tuning of the stages to balance the pipeline.
Synopsys has been wrestling with this balancing act for some time. Its ARC processors run at up to 2.2 GHz on 28nm processes (typical silicon) and consume only 80 mW of power. In many cases the new cores offer more than twice the performance at less than half the power consumption of older processors. This is welcome news for embedded system-on-a-chip (SoC) designers, who are facing the challenges of the power-performance paradox. If even higher performance is required, the new processors are available in dual-core and quad-core versions.
The processors are designed for use in high-end applications such as solid-state drives, connected appliances, automotive controllers, media players, digital TV, set-top boxes and home networking. They are highly configurable, enabling users to tailor them specifically for each instance on an SoC to maximize performance while minimizing power consumption and area. They support the addition of custom instructions that let users integrate their proprietary hardware accelerators into the processor to further increase performance and add competitive differentiation to their SoC product.
Under the covers is a power-efficient 10-stage pipeline (Figure 1). This single-issue, scalar pipeline minimizes size and power consumption. To improve performance, the processor supports limited out-of-order execution for long-latency instructions. The processors have sophisticated branch prediction that combines high accuracy with early detection of mispredicted branches. They also have a late-stage ALU in the 9th stage, which allows the processing of some instructions to be deferred from the early ALU in the 6th stage, for example when a branch or interrupt requires the pipeline to be flushed. In these cases, the pipeline continues to process instructions in the back end while the front end is being reloaded. This arrangement significantly reduces load-to-use latency, and for many instructions can eliminate it, which is a big benefit for many embedded applications.
Figure 1: A power-efficient 10-stage pipeline minimizes area and power consumption
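To see why a second ALU late in the pipeline can shrink or remove the load-to-use penalty, consider a simple stall model. This is a rough sketch, not a description of the actual ARC pipeline: the function `load_use_stall` is hypothetical, and the stage at which load data becomes forwardable (stage 8 here) is an assumption made for the example. Only the ALU positions, stages 6 and 9, come from the text.

```python
def load_use_stall(data_ready_stage: int, alu_stage: int, distance: int = 1) -> int:
    """Stall cycles for an ALU instruction that consumes a load result.

    data_ready_stage: stage at whose end the load data can be forwarded
    alu_stage:        stage in which the consuming instruction executes
    distance:         issue gap in instructions (1 = back-to-back)
    """
    if distance < 1:
        raise ValueError("distance must be at least 1")
    # The consumer trails the load by `distance` stages, so it reaches its
    # ALU stage (alu_stage - 1 + distance) cycles after the load entered
    # stage 1, while the data is usable data_ready_stage cycles after that.
    return max(0, data_ready_stage - alu_stage + 1 - distance)


# ASSUMPTION: load data becomes forwardable at the end of stage 8.
DATA_READY_STAGE = 8
EARLY_ALU_STAGE = 6  # from the text
LATE_ALU_STAGE = 9   # from the text

early = load_use_stall(DATA_READY_STAGE, EARLY_ALU_STAGE)
late = load_use_stall(DATA_READY_STAGE, LATE_ALU_STAGE)
print(f"early ALU stall: {early} cycles, late ALU stall: {late} cycles")
```

Under these assumptions, a dependent instruction issued directly behind a load would wait two cycles if it had to execute in the early ALU, and would not wait at all if it could be deferred to the late ALU.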
Moving data is very important in many higher-end embedded applications, so the new processors have a parallel load/store pipeline starting at the 6th stage to improve data handling performance. They support 64-bit loads and stores to and from register pairs to move data faster. They feature non-aligned load and store accesses that use banked close-coupled data memories and data cache, allowing them to complete data moves without extra cycle penalties. The processors also have an optional low-latency memory port for fast access to peripherals and memory. This port supports single-cycle access to all peripheral registers or memory on an SoC and reduces system latency by moving this traffic off the multilayer AMBA bus. The processors further improve data movement by supporting I/O coherency with data cache snooping and a programmable address space that keeps the cache coherent with the shared memory of peripherals without interfering with normal cache operations.
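One way banking can remove the penalty for non-aligned accesses is word interleaving: if consecutive words live in different banks, the two words that an unaligned access straddles can be read in the same cycle. The model below sketches that idea under stated assumptions. The two-bank, 32-bit, little-endian organization and the `BankedMemory` class are hypothetical and are not taken from the processor's documentation.

```python
WORD_BYTES = 4
NUM_BANKS = 2  # ASSUMPTION: two word-interleaved banks


class BankedMemory:
    """Toy model: word i is stored in bank (i % NUM_BANKS)."""

    def __init__(self, data: bytes):
        if len(data) % (WORD_BYTES * NUM_BANKS):
            raise ValueError("size must be a multiple of one row of banks")
        words = [data[i:i + WORD_BYTES] for i in range(0, len(data), WORD_BYTES)]
        self.banks = [words[b::NUM_BANKS] for b in range(NUM_BANKS)]

    def _read_word(self, word_index: int):
        bank = word_index % NUM_BANKS
        return bank, self.banks[bank][word_index // NUM_BANKS]

    def load32(self, addr: int):
        """Return (value, banks_used) for a little-endian 32-bit load."""
        first, offset = divmod(addr, WORD_BYTES)
        bank_a, word_a = self._read_word(first)
        if offset == 0:
            return int.from_bytes(word_a, "little"), (bank_a,)
        # Unaligned: the access straddles two consecutive words, which
        # always sit in different banks and can be read in parallel.
        bank_b, word_b = self._read_word(first + 1)
        raw = (word_a + word_b)[offset:offset + WORD_BYTES]
        return int.from_bytes(raw, "little"), (bank_a, bank_b)


mem = BankedMemory(bytes(range(16)))
value, banks = mem.load32(2)  # unaligned: covers bytes 2, 3, 4, 5
print(hex(value), banks)
```

Because the two halves of the unaligned load come from different banks, neither bank is asked for two words at once, so in this model the access needs no extra cycle. The toy class does not handle an unaligned load that runs past the end of memory.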
Embedded Multicore Support
For high-performance applications, the new processors are available in dual-core and quad-core versions (Figure 2). The multicore versions feature inter-core hardware that facilitates message passing, interrupt handling, semaphores and debug. The inter-core message passing uses a centralized SRAM that is shared by all cores, with round-robin arbitration to manage simultaneous accesses. The inter-core interrupt capability allows each core to generate interrupts to the other cores and to receive acknowledgments from any other core. The inter-core semaphores provide synchronization across shared resources. The inter-core debugger can simultaneously or selectively halt, run or reset any combination of the cores. To support synchronization of multiple threads, the multicore implementations also have a 64-bit global real-time counter.
Figure 2: Dual-core and quad-core versions are capable of delivering more than 16,000 DMIPS of total performance on 28nm processes.
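The round-robin arbitration mentioned for the shared message-passing SRAM can be sketched in a few lines. This is a generic behavioral model of a round-robin arbiter, not the actual hardware design; the `RoundRobinArbiter` class and its one-grant-per-cycle behavior are assumptions made for illustration.

```python
class RoundRobinArbiter:
    """Grants one requester per cycle and rotates priority after each grant."""

    def __init__(self, num_cores: int):
        if num_cores < 1:
            raise ValueError("need at least one core")
        self.num_cores = num_cores
        self._next = 0  # core that holds highest priority in the next cycle

    def grant(self, requests):
        """Return the granted core id, or None if no core is requesting."""
        for step in range(self.num_cores):
            core = (self._next + step) % self.num_cores
            if core in requests:
                # The winner drops to lowest priority for the next cycle.
                self._next = (core + 1) % self.num_cores
                return core
        return None


arbiter = RoundRobinArbiter(num_cores=4)

# All four cores request the shared SRAM on every cycle.
grants = [arbiter.grant({0, 1, 2, 3}) for _ in range(6)]
print(grants)
```

With every core requesting continuously, the grants cycle through cores 0, 1, 2, 3 and then wrap around, so no core can be starved by a busier neighbor. That fairness property is the usual reason to choose round-robin over fixed priority for a shared resource.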
The challenges faced by embedded SoC designers are growing because of the demand for increasing performance within power budgets that are fixed or declining. This power-performance paradox requires a new generation of high-performance processors that offer high speed and unrivaled performance efficiency. An interesting future awaits us, and much of it will be enabled by highly efficient next-generation embedded processors.