A Floating-Point Fused Dot-Product Unit

Abstract

A floating-point fused dot-product unit is presented that performs single-precision floating-point multiplication and addition operations on two pairs of data in a time that is only 150% of the time required for a conventional floating-point multiplication. When placed and routed in a 45 nm process, the fused dot-product unit occupied about 70% of the area needed to implement a parallel dot-product unit using conventional floating-point adders and multipliers. The fused dot-product unit is 27% faster than the conventional parallel approach. The numerical result of the fused unit is also more accurate, because only one rounding operation is needed versus at least three for other approaches.
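The accuracy claim can be illustrated with a small software model. The sketch below is not the hardware design; it only emulates single-precision rounding in Python to compare a dot product a*b + c*d computed with three roundings (two multiplies and one add) against the same expression rounded once. The function names `to_f32`, `dot2_discrete`, and `dot2_fused` are hypothetical and chosen for illustration.

```python
import struct
from fractions import Fraction


def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot2_discrete(a: float, b: float, c: float, d: float) -> float:
    """a*b + c*d with three roundings: one per multiply, one for the add."""
    p1 = to_f32(a * b)  # rounding 1
    p2 = to_f32(c * d)  # rounding 2
    return to_f32(p1 + p2)  # rounding 3


def dot2_fused(a: float, b: float, c: float, d: float) -> float:
    """a*b + c*d evaluated exactly, then rounded to binary32 at the end.

    Assumption of this sketch: the exact value passes through binary64 on
    its way to binary32, which can differ from a true single rounding in
    rare cases. It does not matter for the example below.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c) * Fraction(d)
    return to_f32(float(exact))


# Operands chosen so that the two products nearly cancel.
a = b = 1.0 + 2.0**-12
c = -1.0
d = 1.0 + 2.0**-11

discrete = dot2_discrete(a, b, c, d)
fused = dot2_fused(a, b, c, d)
print(discrete, fused)
```

With these operands the exact result is 2^-24. The discrete version loses it entirely, because the low bit of a*b is rounded away before the addition, while the fused version returns it.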

Existing System

• An 8-bit multiplication computed on a 32-bit Booth multiplier would result in unnecessary switching activity and power loss. Several works investigated this word length optimization.
• Each pair of incoming operands is routed to the smallest multiplier that can compute the result to take advantage of the lower energy consumption of the smaller circuit.
• This ensemble of point systems is reported to consume the least power but this came at the cost of increased chip area given the used ensemble structure.
• When adjusting the voltage, the actual performance of the multiplier running under scaled voltage has to be characterized to guarantee a fail-safe operation.
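The operand-routing idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the set of multiplier widths (8, 16, and 32 bits) and the name `select_multiplier` are hypothetical, and the operands are treated as unsigned integers.

```python
# Hypothetical ensemble of multiplier widths, smallest first.
MULTIPLIER_WIDTHS = (8, 16, 32)


def select_multiplier(a: int, b: int) -> int:
    """Return the width of the smallest multiplier that fits both operands.

    Operands are assumed to be non-negative (unsigned) integers.
    """
    if a < 0 or b < 0:
        raise ValueError("this sketch handles unsigned operands only")
    needed = max(a.bit_length(), b.bit_length())
    for width in MULTIPLIER_WIDTHS:
        if needed <= width:
            return width
    raise ValueError("operands exceed the widest available multiplier")


print(select_multiplier(200, 17))    # both operands fit in 8 bits
print(select_multiplier(300, 5))     # 300 needs 9 bits
print(select_multiplier(70000, 1))   # 70000 needs 17 bits
```

Keeping one multiplier per width is what drives the area cost noted above: each width in the ensemble is a separate circuit.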

Disadvantages

• Higher power consumption
• Larger area requirement
• Complex structure
• Less flexibility

Proposed System

• The proposed designs are implemented for single precision and synthesized with a 45 nm standard cell library.
• The proposed dual-path design reduces the latency by 35% compared to the traditional floating-point fused dot-product unit.
• The proposed dual-path floating-point fused dot-product unit is split into three stages.
• The proposed system reduces the shift amount, and normalization is applied to reduce the size of the significand addition, the LZA, and the reduction tree.
• The proposed MBE multiplier combines the advantages of both of these two approaches to produce a very regular partial product array.
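For readers unfamiliar with Booth-style partial product generation, the following sketch shows textbook radix-4 Booth recoding, assuming that "MBE" here refers to modified Booth encoding. It is not the proposed multiplier; it only shows how recoding halves the number of partial products. The function names are hypothetical.

```python
def booth_radix4_digits(y: int, bits: int) -> list[int]:
    """Recode a signed `bits`-wide multiplier into radix-4 Booth digits.

    Each digit is in {-2, -1, 0, 1, 2} and is derived from three
    overlapping bits: d_i = y[2i-1] + y[2i] - 2*y[2i+1], with y[-1] = 0.
    """
    if bits % 2:
        bits += 1  # sign-extend to an even width
    digits = []
    prev = 0  # the implicit bit y[-1]
    for i in range(0, bits, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(x: int, y: int, bits: int = 8) -> int:
    """Multiply x by y by summing one shifted partial product per digit."""
    digits = booth_radix4_digits(y, bits)
    return sum((d * x) << (2 * i) for i, d in enumerate(digits))


print(booth_radix4_digits(-7, 8))
print(booth_multiply(13, -7))
```

An 8-bit multiplier yields four digits, so four partial products are summed instead of eight. Each partial product is a simple multiple of x (0, ±x, or ±2x), which is what keeps the array regular.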

Advantages

• A high-performance 45 nm process was used for the implementation, with a standard cell library designed for high-speed applications.
• FMA units are used in embedded signal processing and graphics applications and to perform division and argument reduction, which is why the FMA has become an integral unit of many commercial processors, such as those of IBM, HP, and Intel.
• As with the operations performed by an FMA, calculating the sum of the products of two sets of operands (the dot product) is a frequently used operation in many DSP algorithms and in other fields.
• Both the fused FDP unit and the FAS unit are more efficient than the older designs. The fused approach makes more sense because only one rounding is performed, versus three roundings in parallel approaches.
