<< Chapter < Page Chapter >> Page >

The pattern exhibited by [link] can be exploited to access the data stored in the input array with better locality, as Figures [link] and [link] show. [link] depicts the memory access pattern of an FFT with size-16 hard-coded leaves, while [link] depicts the same FFT with sorted hard-coded leaves.

To compute the FFT with sorted leaves, the leaf sub-transforms and the body sub-transforms are split into two separate lists, and the entire list of leaf sub-transforms is computed before any of the body sub-transforms. There is, however, a cost associated with this re-arrangement: each leaf sub-transform's offset into the output array is not easy to compute because the offsets are now essentially decimated-in-frequency, and thus they are now pre-computed. Overall, the trade-off is justified because the output memory accesses within each leaf sub-transform are still linear.

Memory access pattern of the straight line blocks of code in a VL-1 size-128 hard-coded leaf FFT
Memory access pattern of the straight line blocks of code in a VL-1 size-128 hard-coded leaf FFT after leaf node sorting

The leaf transforms can be computed in three loops. The first and third loops compute size- N l e a f sub-transforms, while the second loop computes size- N l e a f / 2 sub-transforms. The size of the three loops i 0 , i 1 and i 2 are:

i 0 = N 3 × N l e a f + 1
i 1 = N 3 × N l e a f + N N l e a f mod 3 × 1 2

and

i 2 = N 3 × N l e a f

The transform can now be elaborated without leaf nodes, and the code for the three loops emitted in the place of calls to individual leaf sub-transforms.

Example

[link] is the main function for the FFT that corresponds to the leaf node list in [link] . The first and third loops invoke size-16 sub-transforms at lines 8 and 16, and the second loop invokes 2x size-8 sub-transforms at line 12. Following the leaf sub-transforms, the body sub-transforms are called at lines 19-23.

void sfft_dcf128_shl16_4(sfft_plan_t *p,const void *vin,void *vout){   const SFFT_D *in = vin;  SFFT_D *out = vout;   offset_t *is = p->is_base;   offset_t *offsets = p->offsets_base;   int i;  for(i=3;i>0;--i) {     sfft_dcf128_shl16_4_e(is, in, out+offsets[0]);     is += 16; offsets += 1;  }   for(i=3;i>0;--i) {     sfft_dcf128_shl16_4_o(is, in, out+offsets[0]);     is += 16; offsets += 1;  }   for(i=2;i>0;--i) {     sfft_dcf128_shl16_4_e(is, in, out+offsets[0]);     is += 16; offsets += 1;  }   sfft_dcf128_shl16_4_X_4(out+0, 32, p->ws[0]);  sfft_dcf128_shl16_4_X_4(out+0, 64, p->ws[1]);  sfft_dcf128_shl16_4_X_4(out+128, 32, p->ws[0]);  sfft_dcf128_shl16_4_X_4(out+192, 32, p->ws[0]);  sfft_dcf128_shl16_4_X_4(out+0, 128, p->ws[2]);}
Hard-coded VL-1 size-128 FFT with size-16 leaves (sub-transforms omitted)

Scalability

In terms of code size, computing the leaf sub-transforms with three loops is economical. As the size of the transform grows, the code size attributed to the leaf sub-transforms remains constant. However, as the size of the transform begins to grow large (e.g., 65 , 536 ), the instructions required for the body sub-transform calls (lines 19-23 in [link] ) begins to dominate the overall program size. "Optimizing the hierarchical structure" describes a method for compressing the code size of the body sub-transform calls while maintaining performance.

Get Jobilize Job Search Mobile App in your pocket Now!

Get it on Google Play Download on the App Store Now




Source:  OpenStax, Computing the fast fourier transform on simd microprocessors. OpenStax CNX. Jul 15, 2012 Download for free at http://cnx.org/content/col11438/1.2
Google Play and the Google Play logo are trademarks of Google Inc.

Notification Switch

Would you like to follow the 'Computing the fast fourier transform on simd microprocessors' conversation and receive update notifications?

Ask