World’s Fastest FFT Architectures: Breaking the Barrier of 100 GS/s

Abstract

This paper presents the fastest fast Fourier transform (FFT) hardware architectures reported to date. The architectures are based on a fully parallel implementation of the FFT algorithm. To obtain the highest throughput while keeping resource utilization low, our design uses advanced shift-and-add techniques to implement the rotators and selects the FFT algorithms best suited to these architectures. Apart from high throughput and resource efficiency, the proposed architectures also guarantee high accuracy. For the implementation, we have developed an automatic tool that generates the architectures as a function of the FFT size, the input word length, and the accuracy of the rotations. We provide experimental results covering various FFT sizes, FFT algorithms, and field-programmable gate array (FPGA) boards. These results show that it is possible to break the barrier of 100 GS/s for FFT calculation.
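
As a rough illustration of why a fully parallel design can reach such rates: an architecture that accepts all N samples of an N-point FFT in every clock cycle has a throughput of N times the clock frequency. The sketch below works through that arithmetic. The FFT size and clock frequency are hypothetical values chosen for illustration, not results from the paper, and the function names are assumptions.

```python
def parallel_fft_throughput(fft_size, clock_hz):
    """Samples per second for a fully parallel FFT that accepts
    fft_size samples in every clock cycle."""
    return fft_size * clock_hz


def min_clock_for_target(fft_size, target_sps):
    """Clock frequency (Hz) needed to reach target_sps samples per second."""
    return target_sps / fft_size


# Hypothetical operating point, for illustration only.
rate = parallel_fft_throughput(256, 400e6)
needed_clock = min_clock_for_target(256, 100e9)
```

Under these assumed numbers, a 256-point fully parallel FFT clocked at 400 MHz would exceed 100 GS/s, and a clock of about 390.6 MHz would be the minimum for that size.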

Existing System

• Although a number of FFT hardware architectures with non-power-of-two sizes exist, power-of-two FFTs are significantly more developed and optimized.
• The zero-padded FFT offers increased apparent frequency resolution (a more finely sampled spectrum) by extending the input data sequence in the time domain, padding zeros at the tail of the discrete-time signal.
• The radix-2 and radix-4 algorithms are the most widely used for implementing FFT processors because of their simple architectures.
• Single-path delay feedback (SDF) pipeline FFT architectures are commonly used because they have the smallest number of non-trivial multiplications compared with other pipeline architectures, such as single-path delay commutator (SDC) and multi-path delay commutator (MDC).
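
The radix-2 algorithm and zero padding mentioned above can be sketched in software as follows. This is a minimal recursive decimation-in-time model using only the Python standard library, meant to illustrate the arithmetic rather than any hardware pipeline. The names `fft_radix2` and `zero_pad` are assumptions introduced here.

```python
import cmath


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT.
    len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft_radix2(x[0::2])  # FFT of even-indexed samples
    odd = fft_radix2(x[1::2])   # FFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor W_N^k
        out[k] = even[k] + w * odd[k]           # butterfly, upper output
        out[k + n // 2] = even[k] - w * odd[k]  # butterfly, lower output
    return out


def zero_pad(x, total_len):
    """Append zeros at the tail so the sequence has total_len samples."""
    if total_len < len(x):
        raise ValueError("total_len must be at least len(x)")
    return list(x) + [0] * (total_len - len(x))


signal = [1, 2, 3, 4]
spectrum = fft_radix2(signal)
padded_spectrum = fft_radix2(zero_pad(signal, 8))
```

Padding from 4 to 8 samples doubles the number of frequency bins: every second bin of `padded_spectrum` coincides with a bin of `spectrum`, and the bins in between are interpolated values of the same underlying spectrum.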

Disadvantages

• To address this gap, it is a good time to apply the knowledge gained on power-of-two FFTs to non-power-of-two ones.
• In the proposed hardware architecture, an input data sequence of length N is delayed using N delay elements; the delayed data sequence is fed back to the delay elements and simultaneously transferred to a complex multiplier, where it is multiplied by the twiddle factor (TF).
• Most notably, compared with the conventional hardware architecture (in which the number of delay elements grows rapidly with the FFT length and the number of data paths), the proposed hardware architecture reduces the number of delay elements significantly.

Proposed System

• In the proposed structure, in an N-point real-valued FFT (RFFT), exactly N signal values are computed at the output of each FFT stage and at the output.
• Although this property is satisfied by only one prior architecture proposed in, general approaches for designing canonic RFFT computations have not been presented.
• The proposed canonic DIF RFFT computation has fewer twiddle factor operations than the computation in, while it has the same performance as the work.
• The proposed canonic structures are not necessarily canonic with respect to the twiddle factor operations and are non-unique.
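
The claim that N values suffice for an N-point real-valued FFT follows from the conjugate symmetry of the spectrum of a real signal, X[N-k] = conj(X[k]). The sketch below checks this at the final output only (not at each internal stage) by packing the spectrum into N real numbers and rebuilding it. It uses a direct DFT for clarity, assumes N is even, and the helper names are assumptions introduced here, not part of the proposed architecture.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT, used here only as a reference."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def pack_real_spectrum(x):
    """Return N real numbers that fully describe the DFT of the
    real sequence x (N even): X[0], X[N/2], then Re/Im of X[1..N/2-1]."""
    n = len(x)
    spec = dft(x)
    packed = [spec[0].real, spec[n // 2].real]  # these two bins are real
    for k in range(1, n // 2):
        packed += [spec[k].real, spec[k].imag]
    return packed


def unpack_real_spectrum(packed):
    """Rebuild all N complex bins using X[N-k] = conj(X[k])."""
    n = len(packed)
    spec = [0j] * n
    spec[0] = complex(packed[0], 0.0)
    spec[n // 2] = complex(packed[1], 0.0)
    for k in range(1, n // 2):
        value = complex(packed[2 * k], packed[2 * k + 1])
        spec[k] = value
        spec[n - k] = value.conjugate()
    return spec


samples = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, -2.0, 1.5]
packed = pack_real_spectrum(samples)
rebuilt = unpack_real_spectrum(packed)
```

Here `packed` holds exactly as many real numbers as there are input samples, and `rebuilt` matches the full complex spectrum up to floating-point error.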

Advantages

• An efficient way to implement these filters is to make use of fully parallel FFTs.
• A hardware-efficient fully parallel FFT can significantly reduce the hardware cost of the processing element (PE) in the iterative FFT.
• With the aim of designing the most efficient rotators, we also exploit different approaches to implementing rotators in hardware.
• To achieve accurate FFTs, it is necessary to select coefficients with a small rotator error.
• To achieve these goals, we take the coefficient selection into account, explore different architectures for the rotators, and make use of advanced shift-and-add algorithms.
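
To make the ideas of rotator error and shift-and-add multiplication concrete, the following sketch quantizes a rotation to integer coefficients, measures the resulting error, and multiplies by those constants using only shifts, additions, and subtractions. It is a simplified model under stated assumptions: the coefficients are obtained by plain rounding and the constant multiplier uses a signed-digit (non-adjacent form) recoding, which is a standard technique and not necessarily the coefficient selection or shift-and-add algorithm used in the paper. All function names are hypothetical.

```python
import cmath
import math


def quantize_rotation(angle, bits):
    """Approximate e^{j*angle} by (c + j*s) / 2**bits with integer c, s.
    Returns the coefficients and the magnitude of the rotation error."""
    scale = 1 << bits
    c = round(scale * math.cos(angle))
    s = round(scale * math.sin(angle))
    error = abs(complex(c, s) / scale - cmath.exp(1j * angle))
    return c, s, error


def csd_digits(c):
    """Signed-digit recoding (non-adjacent form) of the integer c,
    least significant digit first; each digit is -1, 0, or +1."""
    digits = []
    while c != 0:
        if c & 1:
            d = 2 - (c & 3)  # +1 if c mod 4 == 1, -1 if c mod 4 == 3
            c -= d
        else:
            d = 0
        digits.append(d)
        c >>= 1
    return digits


def shift_add_multiply(x, c):
    """Multiply integer x by constant c using shifts and adds/subtracts."""
    acc = 0
    for shift, digit in enumerate(csd_digits(c)):
        if digit == 1:
            acc += x << shift
        elif digit == -1:
            acc -= x << shift
    return acc


def rotate_fixed(xr, xi, c, s, bits):
    """Rotate the fixed-point complex value (xr + j*xi) by (c + j*s) / 2**bits."""
    yr = shift_add_multiply(xr, c) - shift_add_multiply(xi, s)
    yi = shift_add_multiply(xr, s) + shift_add_multiply(xi, c)
    return yr >> bits, yi >> bits  # arithmetic shift removes the scaling


C, S, ERR = quantize_rotation(math.pi / 4, 8)
rotated = rotate_fixed(256, 0, C, S, 8)
```

With 8 fractional bits, the rotation by π/4 is represented by the coefficient pair (181, 181). Increasing `bits` lowers the rotation error at the cost of more adders, which is the accuracy versus resource trade-off the bullets above describe.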
