How to get the performance of dedicated cryptographic engines with the flexibility of software implementations.
In the previous blog post, “Embedded Security Using Cryptography”, we looked at how cryptography can be used to secure assets in embedded systems and to ensure confidentiality, integrity and authenticity, or “CIA” for short. In this post, we will explore the Advanced Encryption Standard (AES) and how to implement an AES engine on Cadence Tensilica Xtensa application-specific processors.
The Data Encryption Standard (DES) cipher has been widely used since the 1970s. However, with technological advancements, the time required to recover a 56-bit DES key by exhaustive search has decreased significantly. One countermeasure is triple DES, which applies the DES cipher three times in series, each pass with its own key. This provides an effective security level of about 112 bits, but at the expense of roughly tripling the computation per block.
In 1997, the U.S. National Institute of Standards and Technology (NIST), part of the Department of Commerce, opened a public competition to select the AES cipher, a successor to the aging DES cipher. This effort involved the cooperation of the U.S. government, the private sector, and academia from around the world. In October 2000, the Rijndael block cipher was chosen for the AES, based on its simplicity, security and flexibility. Rijndael itself supports a range of block and key lengths; AES, however, fixes the block size at 128 bits and limits keys to 128, 192, or 256 bits. The Rijndael cipher can be implemented easily in both hardware and software. However, a software-only implementation consumes many processor cycles, which may not be acceptable in performance-critical applications. Implementing the AES cipher on Xtensa processors can offer performance comparable to dedicated cryptographic engines while retaining the flexibility and ease of design of a software implementation.
The AES cipher processes data one block at a time. Each block is 128 bits (16 bytes) and is stored as a 4×4 byte array called the state array.
Note the order in which data is stored into the state array. Plaintext bytes are stored column by column: the first byte goes in the upper left square, the following bytes move downward in the left column, and the process continues through the next columns until the lower right square is filled. Byte placement is important because AES cipher transformations operate on rows or columns of the state array.
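The column-by-column ordering can be made concrete with a short C sketch. This is only an illustration of the byte placement described above; the function names block_to_state and state_to_block are hypothetical and are not part of any Xtensa or TIE API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy a 16-byte block into a 4x4 state array, column by column:
 * input byte i lands in row (i % 4), column (i / 4). */
void block_to_state(const uint8_t in[16], uint8_t state[4][4])
{
    assert(in != NULL && state != NULL);
    for (int i = 0; i < 16; i++) {
        state[i % 4][i / 4] = in[i];
    }
}

/* Inverse mapping: read the state back out in the same column-major order. */
void state_to_block(uint8_t state[4][4], uint8_t out[16])
{
    assert(state != NULL && out != NULL);
    for (int i = 0; i < 16; i++) {
        out[i] = state[i % 4][i / 4];
    }
}
```

With this layout, bytes 0 to 3 of the block form the first column and bytes 0, 4, 8 and 12 form the first row, which is why a row-oriented transformation touches one byte from each 32-bit column.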
AES uses a 128-, 192-, or 256-bit key to encrypt the plaintext into ciphertext. The same key must be used to decrypt the ciphertext back into plaintext.
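The key length also determines how many rounds the cipher performs and how much expanded key material must be held in memory. The round counts below come from the AES standard (FIPS 197) rather than from this post, and the helper names are hypothetical; treat this as a minimal sketch.

```c
#include <assert.h>

/* Number of rounds for a given AES key length, per FIPS 197.
 * Returns 0 for a key length that AES does not allow. */
int aes_rounds(int key_bits)
{
    switch (key_bits) {
    case 128: return 10;
    case 192: return 12;
    case 256: return 14;
    default:  return 0;
    }
}

/* Size of the expanded key in bytes: one 16-byte round key per round,
 * plus the initial round key. The caller must pass a valid key length. */
int aes_expanded_key_bytes(int key_bits)
{
    int nr = aes_rounds(key_bits);
    assert(nr != 0);
    return 16 * (nr + 1);
}
```

For a 128-bit key this gives 10 rounds and 176 bytes of round keys, which is the amount of key material an engine needs to keep in local data memory for that key size.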
The Xtensa processor is configurable, allowing you to create a processor that has a feature set optimized for your application. Moreover, the Xtensa processor is also extensible, allowing you to extend the processor by defining application-specific instructions in the Tensilica Instruction Extension (TIE) language. TIE is compiled into hardware inside the Xtensa processor, and the Xtensa software tools are automatically extended to support the new instructions. Let’s consider the design of an application-specific processor for AES by taking advantage of the configurability and extensibility of the Xtensa processor. A block diagram of the AES engine is shown in Figure 2.
Fig. 2: The Xtensa processor extended with AES cipher functionality.
Figure 2 shows an Xtensa processor block diagram, configured and extended for AES functions. Most blocks represent basic features that are part of every Xtensa processor, such as instruction fetch, the execution pipeline, and the base ISA ALU. Many other features are configurable, such as the number of tightly coupled RAM or ROM elements and the presence of an instruction or data cache, along with cache attributes like total size, line size, and associativity.
Among the configuration options, the data load/store unit may be configured with data widths wider than the default 32 bits. For the custom operations needed for AES, a 128-bit load/store unit enables text blocks and keys to be loaded from or stored to local RAM or the data cache using a single memory operation.
The base Xtensa processor has a 32-bit wide register file. Given that AES performs transformations on a 128-bit state array, it would be ideal if the processor had a 128-bit register file specifically for AES transformations. A 128-bit register file allows instructions to operate on all 128 bits of a state array concurrently, as opposed to 32 bits at a time. Combined with the wide data bus, the wide register file also enables the state array to be transferred to and from memory with a single memory access, rather than four 32-bit accesses. The TIE language is used to create a 128-bit AES register file ❶ to handle 128-bit quantities. AES-specific operations ❷ are designed to perform the various transformations on the 128-bit data types stored in the AES register file, including wide loads and stores.
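To see what the 32-bit baseline costs, here is a portable C model of a 128-bit value handled as four 32-bit words. It is not TIE code and does not use any Xtensa intrinsic; the type aes128_t and the helper names are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-in for one entry of a 128-bit register file. */
typedef struct {
    uint32_t w[4];
} aes128_t;

/* Models a wide load: 16 bytes of memory into one 128-bit value.
 * With a 32-bit load/store unit this costs four word loads. */
aes128_t aes128_load(const uint8_t *mem)
{
    aes128_t v;
    assert(mem != NULL);
    memcpy(v.w, mem, sizeof v.w);
    return v;
}

/* Models a wide store: one 128-bit value back to 16 bytes of memory. */
void aes128_store(uint8_t *mem, aes128_t v)
{
    assert(mem != NULL);
    memcpy(mem, v.w, sizeof v.w);
}

/* XOR of two 128-bit values, as used when a round key is mixed into
 * the state. On a 32-bit datapath this takes four word operations. */
aes128_t aes128_xor(aes128_t a, aes128_t b)
{
    aes128_t r;
    for (int i = 0; i < 4; i++) {
        r.w[i] = a.w[i] ^ b.w[i];
    }
    return r;
}
```

Each helper loops over or copies four words. With a 128-bit register file and a 128-bit load/store unit, each of these helpers would correspond to a single operation on the whole state.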
The TIE language is used to describe processor extensions that are directly compiled by the TIE Compiler (TC) into hardware descriptions and integrated into the RTL of the Xtensa core. The instruction fetch and decode functions of the processor are extended to include the new operations defined in TIE, as well as operand data paths and register file ports. In addition, the Xtensa software tools are automatically extended to support the new operations and C-types defined in the TIE design source.
An encryption key is initially loaded to the local data RAM. Encryption begins after blocks of plaintext are loaded into the local data memory. The Xtensa processor then loads the 128-bit plaintext using a single TIE load instruction into the AES register file. Encryption rounds will operate on the AES register file until encryption is completed. Finally, the 128-bit ciphertext is written from the AES register file to the local data memory using a single TIE store instruction.
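The flow above (key set up once, block loaded, rounds applied to the 128-bit state, result stored) can be followed in a plain C reference model of AES-128 encryption. This is a software-only sketch written from the AES standard, not the TIE implementation discussed in this post; the names aes128_set_key and aes128_encrypt_block are hypothetical, and the code favors brevity over speed and makes no attempt at side-channel resistance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint8_t sbox[256];

/* Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t gmul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0));
        b >>= 1;
    }
    return p;
}

static uint8_t rotl8(uint8_t x, int n)
{
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

/* Build the S-box: multiplicative inverse followed by the affine transform. */
static void init_sbox(void)
{
    for (int a = 0; a < 256; a++) {
        uint8_t inv = 0;                      /* the inverse of 0 is defined as 0 */
        for (int b = 1; b < 256 && a != 0; b++) {
            if (gmul((uint8_t)a, (uint8_t)b) == 1) { inv = (uint8_t)b; break; }
        }
        sbox[a] = (uint8_t)(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^
                            rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63);
    }
}

/* Key setup, done once: expand a 128-bit key into 11 round keys (176 bytes). */
void aes128_set_key(const uint8_t key[16], uint8_t rk[176])
{
    uint8_t rcon = 1;
    assert(key != NULL && rk != NULL);
    init_sbox();
    memcpy(rk, key, 16);
    for (int i = 16; i < 176; i += 4) {
        uint8_t t[4];
        memcpy(t, rk + i - 4, 4);
        if (i % 16 == 0) {                    /* rotate, substitute, add round constant */
            uint8_t t0 = t[0];
            t[0] = sbox[t[1]] ^ rcon;
            t[1] = sbox[t[2]];
            t[2] = sbox[t[3]];
            t[3] = sbox[t0];
            rcon = gmul(rcon, 2);
        }
        for (int j = 0; j < 4; j++) rk[i + j] = rk[i - 16 + j] ^ t[j];
    }
}

/* Encrypt one 16-byte block. s[r + 4*c] is the state byte at row r, column c. */
void aes128_encrypt_block(const uint8_t rk[176], const uint8_t in[16], uint8_t out[16])
{
    uint8_t s[16], t[16];
    assert(rk != NULL && in != NULL && out != NULL);
    for (int i = 0; i < 16; i++) s[i] = in[i] ^ rk[i];   /* load block, add first round key */
    for (int round = 1; round <= 10; round++) {
        /* SubBytes and ShiftRows together: row r rotates left by r columns. */
        for (int c = 0; c < 4; c++)
            for (int r = 0; r < 4; r++)
                t[r + 4 * c] = sbox[s[r + 4 * ((c + r) % 4)]];
        if (round < 10) {                     /* MixColumns, skipped in the final round */
            for (int c = 0; c < 4; c++) {
                const uint8_t *a = t + 4 * c;
                s[4 * c + 0] = gmul(a[0], 2) ^ gmul(a[1], 3) ^ a[2] ^ a[3];
                s[4 * c + 1] = a[0] ^ gmul(a[1], 2) ^ gmul(a[2], 3) ^ a[3];
                s[4 * c + 2] = a[0] ^ a[1] ^ gmul(a[2], 2) ^ gmul(a[3], 3);
                s[4 * c + 3] = gmul(a[0], 3) ^ a[1] ^ a[2] ^ gmul(a[3], 2);
            }
        } else {
            memcpy(s, t, 16);
        }
        for (int i = 0; i < 16; i++) s[i] ^= rk[16 * round + i];   /* AddRoundKey */
    }
    memcpy(out, s, 16);                       /* store the ciphertext block */
}
```

Calling aes128_set_key once and then aes128_encrypt_block per block mirrors the sequence described above. With the FIPS 197 example key 00 01 02 ... 0f and plaintext 00 11 22 ... ff, the output block is 69 c4 e0 d8 6a 7b 04 30 d8 cd b7 80 70 b4 c5 5a. Every inner loop here walks the state a byte at a time, which is the per-round work that AES-specific instructions operating on a 128-bit register are meant to collapse.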
The decryption process is quite similar to the AES encryption process mentioned earlier. The same techniques can be applied to develop a new set of instructions to support decryption and achieve throughput equivalent to encryption.
The table below provides cycle count estimates for AES encryption and decryption implemented in software alone, compared with an implementation on Xtensa processors using custom TIE extensions. As the table shows, both encryption and decryption performance improve dramatically, while the total gate count increases by a modest 47K gates.
Table 1: Cycle count estimates for AES cipher on Tensilica processors. †Cycle counts for encryption/decryption/key expansion include all subroutine overhead, including call/return.
The AES cipher offers a robust way to protect assets that are prone to attack, by encrypting the software image and data at rest. Although the AES algorithm can be implemented in software alone, the number of cycles required makes that approach less attractive. Running the AES algorithm on Xtensa processors with TIE extensions offers a significant improvement in performance while retaining the flexibility and ease of design of a software implementation.
Please refer to the application note about implementing AES encryption on Xtensa processors for more details.