Achieving Flexible Processing Requirements For IoT End-Node Devices

Tight power and cost constraints make it important to reduce the number of processors in an IoT design.


The term Internet of Things (IoT), once used broadly to describe almost all connected devices, now covers two segments: Critical IoT and Massive IoT. Critical IoT refers to mission-critical applications such as automotive communication, industrial machines, and medical procedures, where low latency is crucial. Massive IoT refers to the billions of connected devices, including end-node devices. These end-node devices usually have tight power and cost constraints.

End-node SoC device operation can be summarized in four main functional areas:

Sensing, which ranges from environmental sensing (for example temperature, humidity, and chemical composition), requiring very low sample rates of one sample per minute or per second, to motion, audio, voice, and vision, which can have up to 100B samples per second.

Computation, which includes system control, synchronization, machine learning/artificial intelligence (AI), digital signal processing, data encryption, and running of the operating system (OS).

Communication, which includes support for a range of wireless communications standards, as shown in Figure 1.

Figure 1: Wireless communications standards supported by end-node devices

Security, which is driven by increasing concerns about data breaches and other security risks in end-node devices. Common security features in these devices include tamper resistance, prevention of side channel accesses, execution of a Trusted Execution Environment (TEE), encryption, and implementation of safety islands.

With the tight cost and power budgets of end-node IoT devices, reducing the number of processors in the design is important. A single core that can provide all the required functionality, including controller functionality for system synchronization, a real-time operating system (RTOS), a communications PHY interface, security, and encryption, would be ideal. However, the core should also perform DSP operations such as front-end signal processing, sensor data filtering, wireless communications, and PHY computation. Finding a single processor that can meet all these requirements is a challenge, and while an end-node device might require any of these functions, it is not common for a single application to require every function. Therefore, a highly configurable core that can be tailored to meet the performance and computation throughput requirements while consuming very little power and area is the best solution.

Single-core implementation
Synopsys’ DesignWare ARC EM9D core is uniquely positioned to deliver both controller and DSP functionality within a very small footprint. Based on a three-stage pipeline micro-architecture, the EM9D processor can achieve up to 4.0 CoreMark/MHz and 1.8 DMIPS/MHz.

With the ARC EM9D processor, multiple operations are fused into one instruction and executed in a single cycle. This yields high computation throughput and a very small instruction memory footprint. Fused instructions can, for example, load multiple data vectors from memory, perform an operation on this data (e.g., multiply-accumulate), auto-update memory pointers, and store the data, all in one instruction. This enables the core to perform up to seven operations in a single cycle. The processor can perform two MACs per cycle, allowing high vector data computation throughput. The ARC MetaWare Compiler fully supports the fused instructions and automatically maps C code to them.
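As a rough illustration of the kind of C loop this refers to, the sketch below computes a fixed-point dot product. It is plain portable C, not MetaWare-specific code; the function name `dot_q15` is hypothetical, and whether a particular compiler and core configuration actually fuses the loop body into a single instruction is an assumption, not something this snippet guarantees.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Dot product of two 16-bit fixed-point vectors (hypothetical helper).
 * Each iteration contains the operations a fused instruction could
 * combine: two loads, one multiply, one accumulate, and the index
 * (pointer) updates for both operands.
 */
int64_t dot_q15(const int16_t *x, const int16_t *y, size_t n)
{
    int64_t acc = 0; /* wide accumulator so long vectors do not overflow */

    for (size_t i = 0; i < n; i++) {
        /* 16 x 16 -> 32-bit product, accumulated into 64 bits */
        acc += (int32_t)x[i] * (int32_t)y[i];
    }
    return acc;
}
```

Keeping the loop body to a single multiply-accumulate over two independent input arrays leaves the compiler free to schedule the loads, the MAC, and the address updates together.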

The processor has an optimized instruction set architecture (ISA) for end-node IoT applications. For example, a set of instructions for data streaming in and out of core data memory allows data bits to be read and written directly from/to data memory without pre-packaging bits into words, which is ideal for connecting to low data-rate sensor interfaces.
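For contrast, the sketch below shows the software bit handling that a processor without such instructions typically needs: extracting an arbitrary number of bits from a byte buffer one bit at a time. It is a portable software model only; the `bit_reader` type and `read_bits` function are hypothetical names and do not represent the EM9D streaming instructions or any MetaWare API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software bit-stream reader over a byte buffer. */
typedef struct {
    const uint8_t *buf; /* packed sensor data */
    size_t bit_pos;     /* index of the next bit to read, MSB-first */
} bit_reader;

/*
 * Read nbits (1..32) from the stream, most significant bit first.
 * The caller must ensure the buffer holds enough bits.
 */
uint32_t read_bits(bit_reader *r, unsigned nbits)
{
    uint32_t value = 0;

    for (unsigned i = 0; i < nbits; i++) {
        size_t byte = r->bit_pos >> 3;                     /* which byte */
        unsigned shift = 7u - (unsigned)(r->bit_pos & 7u); /* which bit  */
        value = (value << 1) | ((uint32_t)(r->buf[byte] >> shift) & 1u);
        r->bit_pos++;
    }
    return value;
}
```

Every bit costs a shift, a mask, and an index update here, which is the per-sample overhead that direct bit-level access to data memory is meant to remove.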

Together, this architecture, the optimized ISA, and the high data throughput make the EM9D a computationally powerful DSP. For example, the EM9D core can execute a facial-detection CNN software algorithm in only 40 kcycles.

Getting the most with processor configurability and extensibility
When looking at optimizing a core in terms of performance, size, and power consumption, data memory interfaces are key. The data memory interface (load/store units) defines the amount of data loaded and stored and the frequency of these operations. Also, these units are quite large in terms of physical implementation. The ability to optimize this interface offers an advantage by giving designers the ability to balance power consumption and area against performance requirements.

The EM9D processor has a fully configurable data memory interface, supporting from one to three closely coupled data memories (DCCM, XCCM, and YCCM). These memory regions are fully supported by the MetaWare Compiler, which eliminates the need for manual data vector allocation. Fused instructions can combine a computation with parallel accesses to all three memory regions in a single cycle, offering very high performance if needed. The configurability allows the SoC developer to tune the core memory interface to meet the computation throughput, area, and power requirements. For example, configuring the EM9D with three physical data memories will offer three times the computation performance, with a reduction of core/memory power consumption by up to 40%.

Along with data memory size and configuration, instruction memory size is another important factor affecting system area and power consumption. A highly efficient ISA, coupled with efficient instruction mapping and scheduling by the compiler, provides around 15% to 20% smaller code size. On top of this, the fused instructions significantly reduce code size, and hence the required instruction memory size.

In addition to optimizing the core and memories, SoC system integration of the DSP is important for optimal performance, power, and area. End-node IoT SoCs can range from quite simple to highly complex, and the traditional modular SoC interconnect sometimes adds gate count, power, and cycle overhead that can also be optimized away.

Peripheral hardware blocks can be connected to the processor via a dedicated peripheral interface for a ‘bus-less’ design that enables zero latency for data throughput intensive blocks. The core register bank can be extended in size and hardware blocks can directly connect to these registers, allowing core software control/status update of these hardware blocks. In addition, by using ARC Processor EXtension (APEX) technology, designers can add custom registers and interfaces in the form of an RTL description to the ISA. These connection schemes give SoC developers further flexibility to once again tune the system architecture to meet performance, power, and area goals.

To further optimize performance, an optional mDMA controller can be added to the processor. This mDMA engine is controlled directly from the processor but operates in parallel with core execution, offloading heavy data movement.
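A common way to exploit a DMA engine that runs in parallel with the core is double (ping-pong) buffering: while the core processes one buffer, the DMA fills the other. The sketch below models that pattern in portable C. The `dma_start` and `dma_wait` functions are hypothetical stand-ins, implemented here as a synchronous `memcpy` so the example runs on any host; they are not the mDMA programming interface, and `BLOCK` and `sum_blocks` are likewise illustrative.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 4 /* samples per transfer (illustrative size) */

/* Hypothetical DMA stand-ins: a real driver would start the transfer
 * and return immediately; this host model copies synchronously. */
static void dma_start(int16_t *dst, const int16_t *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst);
}

static void dma_wait(void)
{
    /* A real driver would block here until the transfer completes. */
}

/* Sum nblocks blocks of BLOCK samples using two alternating buffers. */
int32_t sum_blocks(const int16_t *src, size_t nblocks)
{
    int16_t buf[2][BLOCK];
    int32_t total = 0;

    if (nblocks == 0) {
        return 0;
    }

    dma_start(buf[0], src, BLOCK); /* prefetch the first block */

    for (size_t b = 0; b < nblocks; b++) {
        dma_wait(); /* block b is now in buf[b & 1] */

        /* Start fetching the next block into the other buffer. */
        if (b + 1 < nblocks) {
            dma_start(buf[(b + 1) & 1], src + (b + 1) * BLOCK, BLOCK);
        }

        /* Process the current block while the next transfer runs. */
        for (size_t i = 0; i < BLOCK; i++) {
            total += buf[b & 1][i];
        }
    }
    return total;
}
```

With a real asynchronous engine, the processing loop and the next transfer overlap, so the core only stalls in `dma_wait` when data movement is slower than computation.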

Figure 2 shows an example of how this system architecture optimization can greatly improve performance, power consumption, and area.

Figure 2: Implementing a bus-less design using ARC and an APEX interface improves PPA

With all these features and configuration options, the ARC EM9D processor is ideal for IoT end-node applications that require both control and digital signal processing capabilities.

The ARC EM9D offers industry-leading performance per area, coupled with tight integration options for system peripherals and hardware accelerator blocks. This gives end-node device SoC developers the ability to tune the ARC EM9D configuration, memory size, and system connectivity to meet their performance, area, and power requirements.

