How to achieve 32-bit processor performance with the energy consumption of an 8-bit core.
For many years, the 8-bit microcontroller has been the workhorse of embedded systems. Design teams favor the size and power benefits that a tightly coupled processor, such as the 8051 microcontroller, brings to their designs. The compact and ultra-low power 8-bit architecture improves battery life and reduces bill-of-material costs.
However, embedded systems increasingly require higher performance, driven by the need to complete more complex tasks even within deeply embedded processors. Across a broad range of applications, from IoT to industrial control systems, embedded systems must support increased workloads. To enable this, many design teams are migrating to 32-bit processor architectures.
Adopting 32-bit cores satisfies higher performance needs, but designers remain reluctant to give up the power and area benefits that the previous generation of 8-bit microcontrollers has provided. They want an embedded system with the performance of a 32-bit processor and the small area footprint and low power of an 8-bit microcontroller with tightly coupled peripherals and memories. With the right implementation, embedded designers can get what they want.
Configuring the Subsystem
The ability to configure the core is critical to achieving the optimal balance of power, performance and area. Synopsys, as an example, supports the ability to tightly integrate memories and peripherals in DesignWare ARC processors by using ARC Processor EXtension technology (APEX). APEX can be used to configure processor cores to incorporate closely coupled memories (CCMs) for both instructions and data, removing the need to access memory over an AHB bus. It also allows designers to tightly couple peripherals by using the processor’s auxiliary registers and external interfaces, so that the core can access the peripherals directly, eliminating the need for the slave bus (typically APB).
Tightly coupling both memories and peripherals reduces the overall area and power of the embedded system, and significantly boosts performance compared to a bus-based implementation. The performance benefits and power savings can be substantial for workloads that require frequent memory and peripheral register accesses. In addition to supporting a tightly-coupled architecture, design teams can extend the ARC processor by adding (or removing) instructions and co-processors to match their specific needs.
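Why frequent memory and register accesses matter can be seen with a toy cycle model. The sketch below is not a model of the ARC pipeline or of AHB/APB timing; the function name, the workload size and the per-access penalties are all assumptions chosen only for illustration.

```python
def total_cycles(instructions, data_accesses, fetch_penalty, access_penalty):
    """Toy cycle model: one base cycle per instruction plus stall cycles.

    fetch_penalty:  extra cycles per instruction fetch (0 for single-cycle memory)
    access_penalty: extra cycles per data or peripheral register access
    """
    return instructions * (1 + fetch_penalty) + data_accesses * access_penalty


# Hypothetical workload: 10,000 instructions, 3,000 of which touch data or registers.
instructions = 10_000
data_accesses = 3_000

# Assumed penalties, for illustration only (not measured values).
bus_based = total_cycles(instructions, data_accesses, fetch_penalty=2, access_penalty=3)
tightly_coupled = total_cycles(instructions, data_accesses, fetch_penalty=0, access_penalty=0)

speedup = bus_based / tightly_coupled
print(f"bus-based: {bus_based} cycles, tightly coupled: {tightly_coupled} cycles, "
      f"ratio: {speedup:.1f}x")
```

In this model the gap between the two implementations grows with both the per-access penalty and the share of instructions that access memory or peripheral registers, which is the qualitative point made above.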
Building an Optimized Sensor Subsystem
Design teams are using sensors in clever ways to expand the capabilities of applications for a whole range of sectors, including consumer, industrial and healthcare. For each of these market segments, the embedded processor will need to (1) collect sensor data, (2) process the data and (3) send results to the host processor. By building more intelligence into an integrated sensor subsystem, typically by pre-processing and filtering the data, design teams can remove the processing burden from the host processor, which can help to extend battery life.
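The three steps above can be sketched as a small pipeline. This is a minimal illustration, not the firmware of the sensor hub described in this article: the function names, the sample values and the choice of a moving-average filter are assumptions, and real code would read the samples from sensor interfaces rather than from a list.

```python
from statistics import fmean


def collect(samples):
    """Step 1: stand-in for reading one burst of raw samples from a sensor."""
    return list(samples)


def process(raw, window=4):
    """Step 2: smooth the raw samples with a simple moving average."""
    return [fmean(raw[i:i + window]) for i in range(len(raw) - window + 1)]


def send(filtered):
    """Step 3: reduce the filtered data to a small report for the host."""
    return {"mean": fmean(filtered), "count": len(filtered)}


# Hypothetical raw readings containing one outlier (50).
raw = collect([10, 12, 11, 13, 50, 12, 11, 10])
report = send(process(raw))
print(report)
```

The host receives one short report instead of every raw sample, which is how pre-processing in the subsystem can reduce the work left to the host processor.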
Figure 1 depicts the bus-based implementation of such a sensor hub, implemented with an ARC EM4 core without closely coupled memories and with bus-based peripherals. The sensor hub collects sensor data from a magnetometer, accelerometer and a gyroscope, filters and processes the collected data to determine the orientation of a device, and sends the results to a host. The system contains two I2C masters and one SPI master to collect the sensor data from the different sensor transducers concurrently, a GPIO to act upon other events in the system, and a UART to communicate with the host.
Figure 1: Bus-based implementation of a sensor hub
Figure 2 shows the tightly-coupled implementation of the same sensor hub. Compared to the typical bus-based implementation provided in Figure 1, the external memories are replaced by closely coupled instruction (ICCM) and closely coupled data (DCCM) memories and the peripherals are also tightly coupled to the ARC processor using APEX technology. CCMs and tightly coupled peripherals have made the external bus infrastructure (interconnect, interfaces, and master/slave bridges) superfluous.
Figure 2: ARC+APEX tightly coupled implementation of a sensor hub
For a typical 40nm design, the tightly-coupled implementation shown in Figure 2 saves thousands of gates and provides an order of magnitude reduction in the overall energy consumed when processing sensor data.
The bus-based implementation needs to run at a higher clock frequency, as it suffers from the higher latency of memory fetches and peripheral accesses. The impact of tightly-coupled memories on performance and energy consumption is illustrated in Figure 3. In the second step of the sensor application, when the sensor data is processed by the ARC EM4 core, there is no interaction with the peripherals. The cycle counts spent on processing are independent of the clock frequency since both core and memories run at the same frequency. This allows a direct comparison of the cycle counts of the bus-based implementation with the cycle counts of the tightly coupled implementation.
Figure 3 shows the number of cycles the ARC EM4 core spends in the processing stage relative to the number of cycles spent by the bus-based implementation. The tightly coupled implementation requires 4.2x fewer cycles for the processing than the bus-based implementation. The tightly coupled implementation fetches from the memories with single cycle latency while the AHB read transactions of the bus-based implementation cost several additional cycles due to the AHB bus infrastructure. For the processing stage, the tightly coupled implementation results in an energy reduction (excluding memories) of 2.1x.
Figure 3: ARC+APEX tightly coupled implementation results for sensor processing
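The two ratios reported for the processing stage can be turned into relative figures with a few lines of arithmetic. In the sketch below only the 4.2x and 2.1x ratios come from the text; the absolute cycle count and the deadline are hypothetical values used to show how a cycle-count difference translates into a clock-frequency requirement.

```python
CYCLE_REDUCTION = 4.2   # from the text: 4.2x fewer processing cycles
ENERGY_REDUCTION = 2.1  # from the text: 2.1x less energy (excluding memories)

# Tightly coupled figures relative to the bus-based implementation (= 1.0).
relative_cycles = 1 / CYCLE_REDUCTION
relative_energy = 1 / ENERGY_REDUCTION


def required_clock_hz(cycles, deadline_s):
    """Minimum clock frequency needed to finish `cycles` within `deadline_s`."""
    return cycles / deadline_s


# Hypothetical numbers, for illustration only.
bus_cycles = 420_000   # assumed processing cycles, bus-based implementation
deadline_s = 0.010     # assumed time budget for the processing stage

tc_cycles = bus_cycles / CYCLE_REDUCTION
bus_clock_hz = required_clock_hz(bus_cycles, deadline_s)
tc_clock_hz = required_clock_hz(tc_cycles, deadline_s)

print(f"relative cycles: {relative_cycles:.3f}, relative energy: {relative_energy:.3f}")
print(f"bus-based needs about {bus_clock_hz / 1e6:.0f} MHz, "
      f"tightly coupled about {tc_clock_hz / 1e6:.0f} MHz")
```

Under these assumed numbers the tightly coupled implementation can meet the same deadline at a clock frequency 4.2 times lower, which is consistent with the earlier observation that the bus-based implementation needs to run at a higher clock frequency.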
Summary
The shift from tightly coupled embedded systems built around an 8-bit microcontroller to bus-based embedded systems built around a 32-bit processor gives designers performance headroom at the cost of power and area. ARC APEX technology provides a means to tightly couple memories and peripherals to an ARC processor core, making the bus infrastructure, which is expensive in both area and latency, redundant. This reduces both the power consumption and area costs of the embedded system without sacrificing performance.
More details on the cycle count and energy consumption of the sensor subsystem design example above can be found in the whitepaper, “Building an Efficient, Tightly Coupled Embedded System Using an Extensible Processor”.