<< Chapter < Page Chapter >> Page >

The most basic form of loop optimization is loop unrolling. It is so basic that most of today’s compilers do it automatically if it looks like there’s a benefit. There has been a great deal of clutter introduced into old dusty-deck FORTRAN programs in the name of loop unrolling that now serves only to confuse and mislead today’s compilers.

We’re not suggesting that you unroll any loops by hand. The purpose of this section is twofold. First, once you are familiar with loop unrolling, you might recognize code that was unrolled by a programmer (not you) some time ago and simplify the code. Second, you need to understand the concepts of loop unrolling so that when you look at generated machine code, you recognize unrolled loops.

The primary benefit in loop unrolling is to perform more computations per iteration. At the end of each iteration, the index value must be incremented, tested, and the control is branched back to the top of the loop if the loop has more iterations to process. By unrolling the loop, there are less “loop-ends” per loop execution. Unrolling also reduces the overall number of branches significantly and gives the processor more instructions between branches (i.e., it increases the size of the basic blocks).

For illustration, consider the following loop. It has a single statement wrapped in a do-loop:


DO I=1,N A(I) = A(I) + B(I) * CENDDO

You can unroll the loop, as we have below, giving you the same operations in fewer iterations with less loop overhead. You can imagine how this would help on any computer. Because the computations in one iteration do not depend on the computations in other iterations, calculations from different iterations can be executed together. On a superscalar processor, portions of these four statements may actually execute in parallel:


DO I=1,N,4 A(I) = A(I) + B(I) * CA(I+1) = A(I+1) + B(I+1) * C A(I+2) = A(I+2) + B(I+2) * CA(I+3) = A(I+3) + B(I+3) * C ENDDO

However, this loop is not exactly the same as the previous loop. The loop is unrolled four times, but what if N is not divisible by 4? If not, there will be one, two, or three spare iterations that don’t get executed. To handle these extra iterations, we add another little loop to soak them up. The extra loop is called a preconditioning loop :


II = IMOD (N,4) DO I=1,IIA(I) = A(I) + B(I) * C ENDDODO I=1+II,N,4A(I) = A(I) + B(I) * C A(I+1) = A(I+1) + B(I+1) * CA(I+2) = A(I+2) + B(I+2) * C A(I+3) = A(I+3) + B(I+3) * CENDDO

The number of iterations needed in the preconditioning loop is the total iteration count modulo for this unrolling amount. If, at runtime, N turns out to be divisible by 4, there are no spare iterations, and the preconditioning loop isn’t executed.

Speculative execution in the post-RISC architecture can reduce or eliminate the need for unrolling a loop that will operate on values that must be retrieved from main memory. Because the load operations take such a long time relative to the computations, the loop is naturally unrolled. While the processor is waiting for the first load to finish, it may speculatively execute three to four iterations of the loop ahead of the first load, effectively unrolling the loop in the Instruction Reorder Buffer.

Get Jobilize Job Search Mobile App in your pocket Now!

Get it on Google Play Download on the App Store Now




Source:  OpenStax, High performance computing. OpenStax CNX. Aug 25, 2010 Download for free at http://cnx.org/content/col11136/1.5
Google Play and the Google Play logo are trademarks of Google Inc.

Notification Switch

Would you like to follow the 'High performance computing' conversation and receive update notifications?

Ask