# Achieving Numerical Precision And Design Customization With Flexible Floating-Point IP

A better approach for meeting power, performance and area objectives.

Floating-point operations in application-specific hardware have gained in popularity, mostly because they are easier to use than fixed-point operations and are a better match to the numerical behavior of software algorithms. Fixed-point operations present design challenges in the definition of input/output ranges and internal precision for each operation. Floating-point operations, on the other hand, use fixed formats with predefined precision. This article describes the floating-point operation and how designers can achieve a desired numerical precision and build custom floating-point operations with flexible floating-point IP to meet power, performance and area objectives.

Anatomy of a Floating-Point Operation
As defined by the IEEE 754 standard, floating-point values are represented in three fields: a significand or mantissa, a sign bit for the significand and an exponent field. The exponent is stored in biased form: a fixed unsigned bias, roughly the midpoint of the range that can be represented in the exponent field, is added to the signed exponent so that the stored value is never negative. The significand field carries the fractional bits, and depending on the exponent, a unit bit or hidden bit (1 for normal values, 0 for subnormals) is added to the significand to get the actual floating-point value. The value represented in floating-point format can be calculated by a simple expression: for normal values, the signed significand (hidden bit included) is multiplied by two raised to the stored exponent minus the bias. Special representations are defined for values such as infinity, not-a-number, and denormals or subnormals.
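As an illustration of the simple expression mentioned above, the following sketch decodes the IEEE 754 single-precision (binary32) format, which has a 1-bit sign, an 8-bit exponent with a bias of 127 and 23 stored significand bits. The function name `decode_binary32` is a hypothetical helper, not part of any IP library.

```python
import struct

def decode_binary32(x: float):
    """Split x (rounded to binary32) into its fields and rebuild the value."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                 # 1-bit sign
    exponent = (bits >> 23) & 0xFF    # 8-bit biased exponent
    fraction = bits & 0x7FFFFF        # 23 stored significand bits
    bias = 127
    if exponent == 0xFF:              # infinity or not-a-number
        magnitude = float("nan") if fraction else float("inf")
    elif exponent == 0:               # zero or subnormal: hidden bit is 0
        magnitude = (fraction / 2**23) * 2.0 ** (1 - bias)
    else:                             # normal value: hidden bit is 1
        magnitude = (1 + fraction / 2**23) * 2.0 ** (exponent - bias)
    return sign, exponent, fraction, (-magnitude if sign else magnitude)

print(decode_binary32(-6.5))   # sign, biased exponent, fraction, rebuilt value
```

For -6.5, which is -1.625 x 2^2, the stored exponent is 2 + 127 = 129 and the fraction field holds 0.625 scaled by 2^23.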

Floating-point operators can be divided into four categories:

1. Basic or atomic operators are commonly used and usually include addition, subtraction, multiplication and elementary functions.
2. Compound operators, which are the focus of this article, are formed by dependent basic operations where data flows from one operation to the next to get the floating-point result.
3. Shared operators are formed by independent functions that share common hardware. The functions may be concurrent or mutually exclusive. An example is a component that performs addition and subtraction on two input values at the same time.
4. A combination of basic, compound and shared operators is possible.

Designers are usually interested in compound operators because they can either be fused for full accuracy or trade some accuracy for better performance, power and area (PPA). Compound operators also leave room for customization, allowing designers to tap into intermediate results or control accuracy.

Implementing a Compound Floating-Point Operation
A compound floating-point operation can be implemented in several ways. A network of basic or atomic operators is the most straightforward: it is usually smaller and easier to put together than the alternatives. However, the designer can only change the network topology, so there is limited control to experiment with design variations. There are three other paths, each serving a different design objective (Figure 1).

Figure 1: Design goals for compound floating-point operators.

In path A, compound operators are used for the best accuracy. They are available as components in floating-point libraries and use unique architectures and algorithms to deliver the best possible accuracy. As a consequence, their PPA is usually worse than that of the network of atomic operators.
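The accuracy difference between a network of atomic operators and a fully accurate compound operator can be seen in a small software model of a multiply-add. This is only a sketch in double precision: `mul_add_network` and `mul_add_fused` are hypothetical names, and the fused version is modeled with exact rational arithmetic followed by a single rounding, which is an assumption about behavior rather than a description of any particular hardware architecture.

```python
from fractions import Fraction

def mul_add_network(a: float, b: float, c: float) -> float:
    """Two atomic operators: the product is rounded before the addition."""
    return a * b + c

def mul_add_fused(a: float, b: float, c: float) -> float:
    """Exact intermediate result, rounded only once at the end."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0
network = mul_add_network(a, b, c)   # product rounds to 1.0, so the sum is 0.0
fused = mul_add_fused(a, b, c)       # keeps the exact result, -2**-60
print(network, fused)
```

The exact product is 1 - 2^-60, which is not representable in double precision. The network rounds it to 1.0 and loses the entire result in the subtraction, while the single final rounding preserves it.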

Path B occurs when compound operations are used in parallel and can potentially share hardware. By using this technique, the designer gets a smaller area and lower power, with a small impact on performance, while the accuracy is maintained. For example, consider a component that performs addition and subtraction of two operands at the same time.
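The add/subtract example can be modeled in a few lines. In this sketch a value is a hypothetical `(exponent, significand)` pair of integers representing `significand * 2**exponent`, and the alignment step stands in for the hardware that both results share. Python integers are unbounded, so the alignment here is exact; a real datapath would have a fixed width and guard bits.

```python
def align(a, b):
    """Shared sub-function: express both significands at the smaller exponent."""
    (exp_a, sig_a), (exp_b, sig_b) = a, b
    exp = min(exp_a, exp_b)
    return exp, sig_a << (exp_a - exp), sig_b << (exp_b - exp)

def add_sub_shared(a, b):
    """Return (a + b, a - b); the alignment is computed once and reused."""
    exp, sig_a, sig_b = align(a, b)
    return (exp, sig_a + sig_b), (exp, sig_a - sig_b)

def to_float(v):
    """Convert an (exponent, significand) pair to a Python float."""
    exp, sig = v
    return sig * 2.0 ** exp

# 5 * 2**3 = 40 and 3 * 2**1 = 6
total, diff = add_sub_shared((3, 5), (1, 3))
print(to_float(total), to_float(diff))
```

Only the final adder and subtractor are duplicated; the exponent comparison and the shift are performed once for both results.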

Path C is an alternative for design solutions that do not require the highest accuracy. In this instance the designer must have ways to eliminate intermediate operations or control the internal precision of intermediate operations. It’s used when there is a long series of operations to reach a result. This alternative is not possible with conventional floating-point IP.
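To give a feel for what controlling internal precision means, the sketch below models a dot product whose intermediate results are kept to a configurable number of significand bits. The helper `round_bits` and the parameter `internal_bits` are assumptions made for illustration. The model runs on top of double-precision arithmetic, so it only approximates a true narrow datapath (each intermediate value is rounded to double first).

```python
import math

def round_bits(x: float, bits: int) -> float:
    """Round x to `bits` significant bits, ties to even."""
    if x == 0.0 or not math.isfinite(x):
        return x
    m, e = math.frexp(x)                 # x = m * 2**e with 0.5 <= |m| < 1
    scaled = round(m * (1 << bits))      # power-of-two scaling is exact
    return math.ldexp(scaled, e - bits)

def dot(xs, ys, internal_bits: int) -> float:
    """Dot product with every intermediate result limited to internal_bits."""
    acc = 0.0
    for x, y in zip(xs, ys):
        product = round_bits(x * y, internal_bits)
        acc = round_bits(acc + product, internal_bits)
    return acc

xs = [1.0, 2.0**-12, 2.0**-12]
ys = [1.0, 1.0, 1.0]
low = dot(xs, ys, internal_bits=8)     # small terms are rounded away
high = dot(xs, ys, internal_bits=24)   # small terms are retained
print(low, high)
```

With 8 internal bits the two small terms fall below the rounding threshold and the result is 1.0; with 24 bits the result is 1 + 2^-11. The narrower setting corresponds to a smaller datapath at the cost of accuracy.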

IP-Based Floating-Point Design
Designers want solutions that deliver the best possible Quality-of-Results (QoR), ideally from IP libraries that cover many common basic, compound and shared floating-point operations. However, such solutions comply with the floating-point standard and cannot be modified, which limits the ability to control precision. Designers also look for ways to create custom designs that may require some adjustment of accuracy in the calculation, slightly different interfaces or functionality that is not supported by the available library components.

The IP that supports floating-point designs must be based on the sub-functions that make up floating-point algorithms. To identify sub-functions, let’s look at a simple block diagram of a floating-point adder (Figure 2). There are two main paths in the floating-point adder: exponent calculation and significand calculation. Those paths exchange information to generate the final result.

Figure 2: IP-based floating-point design based on sub-functions.

If a significand is scaled, the exponent is adjusted accordingly. The main sub-functions are alignment, fixed-point addition, normalization and rounding. Alignment is specific to floating-point addition and is not required in a floating-point multiplier. Normalization is a big portion of a floating-point adder, and eliminating it would leave room for hardware optimization. The lack of normalization, however, may cause some loss of significand bits during operations, which is something that must be considered.
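The four sub-functions can be sketched as separate steps of a toy adder. This is a software model under stated assumptions, not the adder architecture of any library: values are hypothetical `(exponent, significand)` integer pairs representing `significand * 2**exponent`, the significand width `P` is set to 8 bits for readability, and alignment is exact because Python integers are unbounded (hardware would use a fixed width with guard bits).

```python
P = 8  # significand width in bits (assumed toy value)

def align(a, b):
    """Alignment: express both significands at the smaller exponent."""
    (exp_a, sig_a), (exp_b, sig_b) = a, b
    exp = min(exp_a, exp_b)
    return exp, sig_a << (exp_a - exp), sig_b << (exp_b - exp)

def normalize(exp, sig):
    """Normalization: move the leading one to bit P-1.
    Returns (exp, sig, drop); drop counts excess low bits left for rounding."""
    excess = abs(sig).bit_length() - P
    if sig != 0 and excess < 0:            # short result, e.g. after cancellation
        return exp + excess, sig << -excess, 0
    return exp, sig, max(excess, 0)

def round_nearest_even(exp, sig, drop):
    """Rounding: remove `drop` low bits, ties to even."""
    if drop == 0:
        return exp, sig
    q, r = divmod(abs(sig), 1 << drop)
    half = 1 << (drop - 1)
    if r > half or (r == half and q & 1):
        q += 1
    if q.bit_length() > P:                 # rounding carried out: renormalize
        q >>= 1
        drop += 1
    return exp + drop, (q if sig > 0 else -q)

def fp_add(a, b):
    exp, sig_a, sig_b = align(a, b)
    sig = sig_a + sig_b                    # fixed-point addition
    return round_nearest_even(*normalize(exp, sig))

def to_float(v):
    exp, sig = v
    return sig * 2.0 ** exp

print(to_float(fp_add((0, 200), (-4, 130))))   # 200 + 8.125 with an 8-bit result
```

In this model 200 + 8.125 becomes 208, because the 8-bit significand cannot hold the fractional part. Note how the exponent is adjusted in every step that shifts the significand.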

Internal values of the significand are represented in fixed-point, and they are tightly connected with the exponent value in the exponent calculation channel. To implement sub-functions, we need to have a way to carry exponent and fixed-point significand together, as it’s done in the standard floating-point system, but with fewer constraints and special cases.

In a sequence of floating-point additions, the normalization and rounding steps account for a significant portion of the critical path. When creating compound operations, it is usually recommended to leave normalization and rounding to the final stage of a series of operations. Directly accessing the internal fixed-point values of a floating-point operation, without normalization or rounding, reduces complexity, area and the critical path, and it gives the designer control over internal precision.

By eliminating normalization and rounding, the numerical behavior changes. First, the exponent calculation is somewhat simplified in the first floating-point adder. Second, the representation of the floating-point output no longer complies with the IEEE standard. Third, the second floating-point adder in the chain needs to understand this new representation, and it needs some information about what happened in the previous calculation to make decisions (Figure 3).
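The effect of deferring normalization and rounding can be illustrated with a chain of additions. The sketch below is a software model with assumed names and an assumed 8-bit significand width `P`: `add_raw` returns the unnormalized, unrounded internal value as an `(exponent, significand)` integer pair, and `normalize_and_round` is applied either after every addition or only once at the end. The internal significand is unbounded here, whereas a hardware implementation would choose a fixed internal width.

```python
P = 8  # significand width of the final result (assumed toy value)

def add_raw(a, b):
    """Addition without normalization or rounding."""
    (exp_a, sig_a), (exp_b, sig_b) = a, b
    exp = min(exp_a, exp_b)
    return exp, (sig_a << (exp_a - exp)) + (sig_b << (exp_b - exp))

def normalize_and_round(v):
    """Normalize to P significant bits, rounding ties to even."""
    exp, sig = v
    drop = abs(sig).bit_length() - P
    if sig == 0:
        return 0, 0
    if drop <= 0:                          # short result: shift left, exact
        return exp + drop, sig << -drop
    q, r = divmod(abs(sig), 1 << drop)
    half = 1 << (drop - 1)
    if r > half or (r == half and q & 1):
        q += 1
    if q.bit_length() > P:                 # rounding carried out: renormalize
        q >>= 1
        drop += 1
    return exp + drop, (q if sig > 0 else -q)

def to_float(v):
    exp, sig = v
    return sig * 2.0 ** exp

terms = [(0, 200)] + [(-9, 128)] * 4       # 200 + 0.25 + 0.25 + 0.25 + 0.25

stepwise = terms[0]
for t in terms[1:]:                        # normalize and round after every add
    stepwise = normalize_and_round(add_raw(stepwise, t))

deferred = terms[0]
for t in terms[1:]:                        # keep the raw internal value
    deferred = add_raw(deferred, t)
deferred = normalize_and_round(deferred)   # single final normalization/rounding

print(to_float(stepwise), to_float(deferred))
```

Rounding after every addition discards each 0.25 increment and returns 200, while the deferred version accumulates them and returns 201. The intermediate values in the deferred chain are not in a standard format, which is why the next stage has to understand the new representation.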

Figure 3: Compound Functions from Simpler Sub-Blocks.

The idea of using flexible sub-functions for the design of compound or custom floating-point operations forces the definition of a new number format and the identification of core functions that are essential to implement floating-point operators. The core sub-functions are specified to reduce the overhead caused by normalization and rounding on floating-point values and to give the designer more flexibility to vary exponent and significand field sizes.

The most important core functions to consider are addition without normalization and rounding, multiplication of floating-point operands without normalization and rounding, conversion from one number format to another, separate normalization, and separate rounding and conversion.

Controlling Numerical Precision and Meeting Design Requirements
To achieve design goals, designers require an IP library of floating-point components that deliver the best possible power, performance and area. The library must include C models for all the floating-point components, provide building blocks for the customization of floating-point designs and provide source code for the synthesis models of all the components in the library. Libraries that meet such requirements increase portability of synthesis models, improve the verification of floating-point designs and deliver the best features to floating-point designers.

The optimal IP library includes components with high-performance architectures, fused operators with maximum accuracy and compound operators with relaxed accuracy for better QoR. When pre-designed compound operators are not enough, the flexible floating-point IP gives designers a lot of freedom to adjust precision or customize functionality while still enabling QoR gain on the implementation of specialized designs. Using flexible floating-point, more functionality can be introduced at a lower logic level, registers can be incorporated to form a pipeline solution, or logic can be added to save dynamic power. Once a design is described using flexible floating-point, the designer has much more control over what can be done with it. Designers can meet design requirements by making tradeoffs in power, performance and area with a comprehensive library of floating-point components provided in Synopsys’ DesignWare Foundation Cores.