
Franchetti and Püschel argue that vectorization is best performed at the algorithm level of abstraction by manipulating Kronecker product expressions through mathematical identities  [link] , and this is the basis for a rewriting system  [link] that vectorizes for short vector machines  [link] , [link] , [link] .
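As a concrete instance of the kind of identity being manipulated, the Cooley-Tukey decomposition of the DFT can be written in Kronecker product form (following the standard formulation used in the SPIRAL literature; the notation here is that of Van Loan):

```latex
\mathrm{DFT}_{mn} \;=\; (\mathrm{DFT}_m \otimes I_n)\, T^{mn}_n\, (I_m \otimes \mathrm{DFT}_n)\, L^{mn}_m
```

where \(T^{mn}_n\) is a diagonal matrix of twiddle factors and \(L^{mn}_m\) is a stride permutation. Terms of the form \(A \otimes I_n\) apply \(A\) pointwise across \(n\) independent data streams, which maps naturally onto \(n\)-way SIMD instructions; the rewriting system works by transforming expressions toward this shape.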

In  [link] , SPIRAL is slower than FFTW 3.1 for 2-way double-precision power-of-two transforms, but SPIRAL is fastest for 4-way single-precision power-of-two transforms of size 16 ≤ n ≤ 128. SPIRAL generates code characterized by large basic blocks, and its single-threaded performance does not scale beyond sizes of about 4096 points. Indeed, source code is only publicly available for sizes 2 through 8192 points  [link] .

UHFFT

UHFFT  [link] , [link] , [link] , [link] , [link] , like FFTW, generates a library of codelets which are assembled into transforms by a planner. The planner uses dynamic programming to search an exponential space of possible algorithms, factors and schedules, relying on codelet timings to predict transform execution times  [link] .
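The planner's dynamic-programming search can be sketched as follows. This is a minimal illustration, not UHFFT's actual code: the codelet timings, the cost model (one radix-r step costs m codelets of size r plus r sub-transforms of size m = n/r), and all names are assumptions made for the example.

```python
from functools import lru_cache

# Hypothetical codelet timings (size -> cost units); in UHFFT these come
# from benchmarking the generated codelets on the target machine.
CODELET_TIME = {2: 1.0, 3: 1.6, 4: 1.8, 5: 2.5, 7: 3.2, 8: 3.0}

def factors(n):
    """Codelet sizes that divide n (excluding n itself)."""
    return [f for f in CODELET_TIME if n % f == 0 and f < n]

@lru_cache(maxsize=None)
def best_plan(n):
    """Return (predicted_cost, plan) for an n-point transform.

    Memoization turns the exponential space of factorizations into a
    dynamic program: each sub-size is solved once and reused.
    """
    if n in CODELET_TIME:
        return CODELET_TIME[n], (n,)
    best = None
    for r in factors(n):
        m = n // r
        sub_cost, sub_plan = best_plan(m)
        # One radix-r step: m codelets of size r, then r transforms of size m.
        cost = m * CODELET_TIME[r] + r * sub_cost
        if best is None or cost < best[0]:
            best = (cost, (r,) + sub_plan)
    if best is None:
        raise ValueError(f"no codelet factors {n}")
    return best
```

For example, `best_plan(32)` compares plans such as radix-2 recursion against a single 4 x 8 split and returns the cheapest under the timing table.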

UHFFT uses the mixed-radix and split-radix  [link] , [link] algorithms for power-of-two sizes, the prime-factor algorithm  [link] , [link] for sizes that factor into co-primes, and the Rader  [link] algorithm for sizes that are prime.
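The dispatch between these algorithms follows from the factorization of the transform size: the prime-factor algorithm requires a co-prime split n = a·b with gcd(a, b) = 1, while Rader's algorithm handles primes. A small sketch of that selection logic (the function names and the fallback ordering are illustrative assumptions, not UHFFT's API):

```python
from math import gcd

def is_prime(n):
    """Trial-division primality test (fine for transform-sized n)."""
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def pick_algorithm(n):
    """Illustrative sketch of choosing an FFT algorithm by size."""
    if n & (n - 1) == 0:
        return "split-radix"      # power of two
    if is_prime(n):
        return "rader"            # prime size
    # Look for a co-prime factorization n = a * b, gcd(a, b) == 1.
    for a in range(2, int(n ** 0.5) + 1):
        if n % a == 0 and gcd(a, n // a) == 1:
            return "prime-factor"
    return "mixed-radix"          # e.g. prime powers such as 9 or 27
```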

UHFFT generates a schedule by topologically sorting the computation's DAG, mainly to optimize memory reuse distance  [link] . The schedule is then unparsed to C code.
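The idea of choosing among valid topological orders to shorten reuse distance can be sketched with Kahn's algorithm plus a greedy tie-break: among the ready nodes, schedule the one whose inputs were produced most recently, so values are consumed while still likely to be in cache. UHFFT's actual heuristic is more elaborate; this is only an illustration.

```python
from collections import defaultdict

def schedule(dag):
    """Greedy topological schedule of a codelet DAG.

    dag: node -> list of predecessor nodes.  Among ready nodes, pick the
    one whose most recent input was produced latest (a simple heuristic
    for short reuse distance).
    """
    preds = {n: set(ps) for n, ps in dag.items()}
    succs = defaultdict(list)
    for n, ps in dag.items():
        for p in ps:
            succs[p].append(n)
    produced_at = {}
    order = []
    ready = [n for n, ps in preds.items() if not ps]
    t = 0
    while ready:
        # Sort so the node with the freshest inputs is last, then pop it.
        ready.sort(key=lambda n: max((produced_at[p] for p in dag[n]),
                                     default=-1))
        n = ready.pop()
        order.append(n)
        produced_at[n] = t
        t += 1
        for s in succs[n]:
            preds[s].discard(n)
            if not preds[s]:
                ready.append(s)
    return order
```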

Scalar results on Itanium2 and Opteron show that UHFFT's dynamic programming approach can choose a plan having performance within 10% of the actual optimal plan. For power of two sizes, UHFFT's performance was typically worse than FFTW or Intel MKL, while UHFFT was faster than FFTW for prime-factor and prime sizes (ibid.).

Intel Integrated Performance Primitives (IPP)

Of the closed source FFT implementations, the IPP library  [link] provides the best results for most sizes of DFT on machines with Intel processors. IPP includes a number of different FFT implementations that appear to be hand optimized for different machine configurations, and in contrast to FFTW, IPP deterministically chooses the best code to run based on the capabilities of the machine and the OS – achieving results that are typically superior to FFTW.

Because IPP is closed source, there is no publicly available information regarding the algorithms and techniques used.

Apple vDSP

The Apple Accelerate libraries contain a wide range of computationally intensive functions that have been optimized for vector computation on PowerPC, x86 and ARM architectures. Within the Accelerate library, vDSP is a collection of DSP functions that includes the FFT.

The vDSP implementation of the FFT is distinctive among the other libraries reviewed in this chapter in that it only operates on data that is stored in split format (where the real and imaginary parts of complex numbers are stored in separate arrays). However, many applications have data that is already in interleaved format (where the real and imaginary part of each complex number are stored adjacent in memory), or require data in interleaved format, and so vDSP provides un-zip/zip functions for converting data to/from split format.
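The two layouts and the conversion between them can be shown in a few lines. This is a plain-Python sketch of what the un-zip/zip step does; the function names merely echo the roles of vDSP's converters and are not its API.

```python
def unzip(interleaved):
    """Interleaved [re0, im0, re1, im1, ...] -> split ([re...], [im...]).

    This is the role vDSP's un-zip conversion plays before its
    split-format FFT can run.
    """
    return interleaved[0::2], interleaved[1::2]

def zip_complex(reals, imags):
    """Split ([re...], [im...]) -> interleaved [re0, im0, re1, im1, ...]."""
    out = []
    for re, im in zip(reals, imags):
        out += [re, im]
    return out
```

Round-tripping through the two functions returns the original array, so an application holding interleaved data can convert, transform in split format, and convert back.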

The Apple vDSP library is notable for having very good FFT performance on ARM NEON devices, while its x86 performance is average (comparable with FFTW “estimate” mode performance).

As with IPP, vDSP is only distributed in binary form and thus little can be said about the algorithms and techniques employed.

MatrixFFT

MatrixFFT is a library for efficiently computing large transforms of more than 2^18 points on Apple hardware, with sustained processing rates reportedly as high as 40 CTGs (Cooley-Tukey gigaflops) for very large single-precision transforms. Large-scale FFTs have been used in areas such as image processing (with images of over 10^9 pixels) and experimental mathematics (for extreme-precision computation of π).

MatrixFFT uses the four-step algorithm to decompose a transform into smaller sub-transforms that fit in the cache  [link] , and computes the sub-transforms with Apple vDSP. Interestingly, MatrixFFT often achieves better performance when using interleaved format to store the data, even though interleaved data must be converted to split format before using vDSP  [link] .
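The four-step decomposition itself can be sketched directly from its definition: for N = n1·n2, perform n1 transforms of length n2 over strided columns, multiply by twiddle factors, perform n2 transforms of length n1, and write the results out transposed. The sketch below is a minimal illustration of the decomposition, not Apple's implementation; a naive O(n^2) DFT stands in for the cache-sized sub-transforms that MatrixFFT delegates to vDSP.

```python
import cmath

def dft(x):
    """Naive O(n^2) reference DFT (stand-in for the sub-transforms)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                for j in range(n))
            for k in range(n)]

def four_step_fft(x, n1, n2):
    """Four-step FFT of len(x) == n1*n2 points.

    Indexing: input element x[j1 + n1*j2], output element X[k2 + n2*k1].
    """
    n = n1 * n2
    assert len(x) == n
    # Step 1: n1 column transforms of length n2 (stride-n1 access).
    a = [dft(x[j1::n1]) for j1 in range(n1)]
    # Step 2: pointwise twiddle-factor multiplication.
    for j1 in range(n1):
        for k2 in range(n2):
            a[j1][k2] *= cmath.exp(-2j * cmath.pi * j1 * k2 / n)
    # Step 3: n2 row transforms of length n1.
    out = [0j] * n
    for k2 in range(n2):
        col = dft([a[j1][k2] for j1 in range(n1)])
        # Step 4: transposed write-back.
        for k1 in range(n1):
            out[k2 + n2 * k1] = col[k1]
    return out
```

In a real implementation each sub-transform fits in cache and the transpose is blocked; that locality, rather than a reduced operation count, is the point of the algorithm for very large N.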

MatrixFFT includes a calibration utility that evaluates the various implementation parameters for each size of transform on a given machine, which can then be used to re-compile the library so that it achieves best performance on that particular machine.

MatrixFFT is freely available and distributed in source code form by Apple  [link] .

Source:  OpenStax, Computing the fast fourier transform on simd microprocessors. OpenStax CNX. Jul 15, 2012 Download for free at http://cnx.org/content/col11438/1.2