And yet there is a vast gap between this basic mathematical theory and actual practice: highly optimized FFT packages are often an order of magnitude faster than the textbook subroutines, and the internal structure needed to achieve this performance is radically different from the typical textbook presentation of the “same” Cooley-Tukey algorithm. For example, [link] plots the ratio of benchmark speeds between a highly optimized FFT [link], [link] and a typical textbook radix-2 implementation [link]; the former is faster by a factor of 5–40, with a larger ratio as $n$ grows. Here, we will consider some of the reasons for this discrepancy, and some techniques that can be used to address the difficulties faced by a practical high-performance FFT implementation. We will not address parallelization on multi-processor machines, which makes FFT implementation even more difficult. Although multi-processors are increasingly important, achieving good serial performance is a basic prerequisite for optimized parallel code, and is already hard enough!
In particular, in this chapter we will discuss some of the lessons learned and the strategies adopted in the FFTW library. FFTW [link], [link] is a widely used free-software library that computes the discrete Fourier transform (DFT) and its various special cases. Its performance is competitive even with manufacturer-optimized programs [link], and this performance is portable thanks to the structure of the algorithms employed, self-optimization techniques, and highly optimized kernels (FFTW's codelets) generated by a special-purpose compiler.
This chapter is structured as follows. First, in "Review of the Cooley-Tukey FFT", we briefly review the basic ideas behind the Cooley-Tukey algorithm and define some common terminology, focusing especially on the many degrees of freedom that the abstract algorithm allows to implementations. Next, in "Goals and Background of the FFTW Project", we provide some context for FFTW's development and stress that performance, while it receives the most publicity, is not necessarily the most important consideration in the implementation of a library of this sort. Third, in "FFTs and the Memory Hierarchy", we consider a basic theoretical model of the computer memory hierarchy and its impact on FFT algorithm choices: quite general considerations push implementations towards large radices and explicitly recursive structure. Unfortunately, general considerations are not sufficient in themselves, so we will explain in "Adaptive Composition of FFT Algorithms" how FFTW self-optimizes for particular machines by selecting its algorithm at runtime from a composition of simple algorithmic steps. Furthermore, "Generating Small FFT Kernels" describes the utility and the principles of automatic code generation used to produce the highly optimized building blocks of this composition, FFTW's codelets. Finally, in "Numerical Accuracy in FFTs", we will briefly consider an important non-performance issue.
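The (forward, one-dimensional) discrete Fourier transform (DFT) of an array $\mathbf{X}$ of $n$ complex numbers is the array $\mathbf{Y}$ given by

$$\mathbf{Y}[k] = \sum_{\ell=0}^{n-1} \mathbf{X}[\ell]\, \omega_n^{\ell k},$$

where $0 \le k < n$ and $\omega_n = \exp(-2\pi i/n)$ is a primitive root of unity.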
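As a point of reference, the DFT definition can be evaluated directly by summation. The following Python sketch assumes the usual forward sign convention, with $\omega_n = \exp(-2\pi i/n)$ and zero-based indices $0 \le k < n$. The function name `naive_dft` is purely illustrative and is not part of FFTW. Direct evaluation costs $O(n^2)$ operations, so it is useful only as a correctness baseline against which a fast algorithm can be checked.

```python
import cmath


def naive_dft(x):
    """Forward DFT by direct summation (O(n^2) operations).

    Assumes the forward convention omega_n = exp(-2*pi*i/n).
    `x` is a sequence of real or complex numbers; returns a list of complex.
    """
    n = len(x)
    if n == 0:
        return []
    # Precompute the n distinct powers of omega_n. Because omega_n^n = 1,
    # the exponent l*k can be reduced modulo n, which also limits rounding error.
    twiddle = [cmath.exp(-2j * cmath.pi * m / n) for m in range(n)]
    return [
        sum(x[l] * twiddle[(l * k) % n] for l in range(n))
        for k in range(n)
    ]
```

For instance, `naive_dft([1, 0, 0, 0])` yields four outputs that are each equal to 1 up to rounding, since a unit impulse at index 0 has a flat spectrum.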