The Real Value In Customizing Instructions

How to enable order-of-magnitude power savings for IoT applications.


One thing that distinguishes devices for the emerging IoT market from the mobile devices of the mature handheld market is power. Specifically, while the latter can accept a battery recharge every few days, the former must run for years between battery recharges or replacements.

Where the two kinds of device resemble one another is in their need for high performance. Embedded CPU cores have concentrated on conventional power-saving techniques such as power islands. Another method that meets both the power and the performance requirements is the use of custom instructions to accelerate compute-intensive tasks. Some embedded CPU cores allow custom instructions, but the hurdle that has kept such instructions from being developed is the engineering overhead of accommodating the custom logic in the software and hardware verification process.

This article illustrates the effectiveness of custom instructions in saving power and explains how the barrier to incorporating a customized embedded core has been broken. The example chosen to illustrate the power of custom instructions is the CRC32 polynomial commonly used in communications applications.

Most designs compute the polynomial in optimized C code. A second, more power-efficient method adds a CRC lookup table to the optimized C code. Figure 1 compares these two methods with a custom instruction that solves the polynomial in a custom engine. As can be seen, combining a CRC table with optimized C code uses 19 times the energy the custom engine needs, and optimized C code alone uses 114 times as much. A system that uses custom instructions for compute-intensive tasks, such as a DSP function or data encryption and decryption, can therefore not only run the function faster but also consume far less power than it would running the same function in optimized C code: one to two orders of magnitude less in the CRC32 example.

Figure 1. Comparison of Techniques to Solve CRC32 Polynomial. (See notes below)
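
For readers who want to see what the two software methods look like, here is a minimal C sketch. The article does not say which CRC32 variant was measured, so this assumes the common reflected CRC-32 (polynomial constant 0xEDB88320, initial value and final XOR of 0xFFFFFFFF). The function names are illustrative, and the table is built at startup here, whereas a ROM-based design would store it precomputed.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: reflected form of the common CRC-32 polynomial. */
#define CRC32_POLY 0xEDB88320u

/* Method 1: plain C, one bit at a time (eight shift/XOR steps per byte). */
static uint32_t crc32_bitwise(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 1u) ? (crc >> 1) ^ CRC32_POLY : (crc >> 1);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Method 2: a table of 256 entries, 4 bytes each, and one lookup per byte. */
static uint32_t crc32_table[256];

/* Fill the table once; must be called before crc32_tabled(). */
static void crc32_init_table(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int bit = 0; bit < 8; bit++) {
            c = (c & 1u) ? (c >> 1) ^ CRC32_POLY : (c >> 1);
        }
        crc32_table[n] = c;
    }
}

static uint32_t crc32_tabled(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc = crc32_table[(crc ^ data[i]) & 0xFFu] ^ (crc >> 8);
    }
    return crc ^ 0xFFFFFFFFu;
}
```

Called on the nine ASCII bytes "123456789", both functions return 0xCBF43926, the standard check value for this CRC variant. The table version trades 1 KB of memory for replacing the eight-step inner loop with a single lookup per byte.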

Figure 2 illustrates the Andes embedded core with a custom engine that solves the CRC polynomial. The CPU’s internal pipeline is brought outside the CPU data path, where the custom logic for the custom instruction can be added. This involves creating an execute interface to the CPU’s pipeline, which lets the designer focus on the new logic that implements the custom instruction. What makes this customization easier than on other CPU cores that offer custom instructions is the Andes Custom Extension (ACE) framework, which allows SoC designers to create instructions specific to their applications and to optimize performance and power consumption in a much shorter timeframe.

Figure 2. Block diagram of custom engine addition.
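
From the software side, the effect of such a custom engine can be sketched as follows. This is only an illustration under stated assumptions: `crc32_step_custom` is a hypothetical stand-in for a custom instruction, emulated here in plain C so the sketch compiles anywhere. Its name, its one-byte-per-call interface, and the CRC-32 variant (reflected polynomial constant 0xEDB88320) are assumptions, not ACE syntax or the actual Andes design.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * HYPOTHETICAL stand-in for a custom CRC instruction. On real hardware the
 * body would be one instruction executed by the custom engine; it is emulated
 * in software here purely so the example is runnable.
 */
static inline uint32_t crc32_step_custom(uint32_t crc, uint8_t byte)
{
    crc ^= byte;
    for (int bit = 0; bit < 8; bit++) {
        crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
    return crc;
}

/* With such an instruction, software issues one operation per byte and needs no table. */
static uint32_t crc32_with_custom_step(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc = crc32_step_custom(crc, data[i]);
    }
    return crc ^ 0xFFFFFFFFu;
}
```

The point of the sketch is the shape of the outer loop: the per-byte work that plain C spreads over several shift, test, and XOR instructions collapses into a single call, and no lookup table has to be stored or fetched.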

The ACE framework includes tools that simplify the instruction design process and provides optional performance enhancement features, such as branch prediction, a return address stack, and a 3-read/2-write register port. Custom instructions can have single- or multi-cycle latency, and logic can be shared among them to reduce cost.

For SoC developers who need programmability and efficiency, this kind of approach directly addresses their needs. Another benefit of custom instructions is increased chip security: because the hardware and software are proprietary, it is much harder for hackers to reverse engineer or attack them without knowing how the custom instructions are implemented.

SoC developers can also define their own instructions in a way that simplifies extending the RTL and the simulator, which eases instruction creation while avoiding tedious and error-prone design work. Custom instructions make a chip more performance-efficient and help protect proprietary software IP. They can be used in applications ranging from DSP acceleration and high-volume data processing to emerging applications whose features and specifications are still evolving, such as IoT, wearable devices, smart sensor devices, medical devices, storage, packet processing, intelligent household appliances, touch panels, wireless charging, fingerprint identification, SSD and encryption security chips.

For a presentation on the methods Andes employs for power savings, please view the video of a recent webinar on the topic.

a. A 90LP process is used; 1 B of ROM ≈ 2.5 gates.
b. The C code is compiled at O3 optimization with no loop unrolling, to keep the ROM size small.
c. The CRC table has 256 entries of 4 bytes each.
d. ROM/SRAM power isn’t included, so total power for the non-ACE versions will be even higher.