
One reason to generate the code using this simplistic approach is to guarantee that the program will produce the correct results. Looking at the above code, it’s pretty easy to argue that it indeed does exactly what the FORTRAN code does. You can track every single assembly statement directly back to part of a FORTRAN statement.

It’s pretty clear that you don’t want to execute this code in a high-performance production environment without further optimization.

Moderate optimization

In this example, we enable some optimization (-O1):


        save %sp,-120,%sp             ! Rotate the register window
        add %i0,-4,%o0                ! Address of A(0)
        st %o0,[%fp-12]               ! Store on the stack
        add %i1,-4,%o0                ! Address of B(0)
        st %o0,[%fp-4]                ! Store on the stack
        add %i2,-4,%o0                ! Address of C(0)
        st %o0,[%fp-8]                ! Store on the stack
        sethi %hi(GPB.addem.i),%o0    ! Address of I (top portion)
        add %o0,%lo(GPB.addem.i),%o2  ! Address of I (lower portion)
        ld [%i3],%o0                  ! %o0 = N (fourth parameter)
        or %g0,1,%o1                  ! %o1 = 1 (for addition)
        st %o0,[%fp-20]               ! store N on the stack
        st %o1,[%o2]                  ! Set memory copy of I to 1
        ld [%o2],%o1                  ! o1 = I (kind of redundant)
        cmp %o1,%o0                   ! Check I>N (zero-trip?)
        bg .L12                       ! Don’t do loop at all
        nop                           ! Delay Slot
        ld [%o2],%o0                  ! Pre-load for Branch Delay Slot
.L900000110:                          ! Top of the loop
        ld [%fp-4],%o1                ! o1 = Address of B(0)
        sll %o0,2,%o0                 ! Multiply I by 4
        ld [%o1+%o0],%f2              ! f2 = B(I)
        ld [%o2],%o0                  ! Load I from memory
        ld [%fp-8],%o1                ! o1 = Address of C(0)
        sll %o0,2,%o0                 ! Multiply I by 4
        ld [%o1+%o0],%f3              ! f3 = C(I)
        fadds %f2,%f3,%f2             ! Register-to-register add
        ld [%o2],%o0                  ! Load I from memory (not again!)
        ld [%fp-12],%o1               ! o1 = Address of A(0)
        sll %o0,2,%o0                 ! Multiply I by 4 (yes, again)
        st %f2,[%o1+%o0]              ! A(I) = f2
        ld [%o2],%o0                  ! Load I from memory
        add %o0,1,%o0                 ! Increment I in register
        st %o0,[%o2]                  ! Store I back into memory
        ld [%o2],%o0                  ! Load I back into a register
        ld [%fp-20],%o1               ! Load N into a register
        cmp %o0,%o1                   ! I>N ??
        ble,a .L900000110
        ld [%o2],%o0                  ! Branch Delay Slot

This is a significant improvement over the previous example. Some loop-constant computations (subtracting 4) were hoisted out of the loop. We loaded I only four times during a loop iteration. Strangely, the compiler didn’t choose to keep the addresses of A(0), B(0), and C(0) in registers at all, even though there were plenty of registers available. Even more perplexing is the fact that it loaded a value from memory immediately after it had stored it from the exact same register!
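To make the memory traffic easier to picture, here is a rough C sketch of the same loop written two ways. It is an illustration only, not the compiler's output: the function names addem_reload and addem_regs are hypothetical, and volatile is used purely as a stand-in that forces the index to be reloaded from memory at every use, similar in spirit to the repeated ld [%o2],%o0 instructions. The sketch indexes with i - 1, whereas the listing folds that offset into the base addresses once, before the loop.

```c
/* Assumption: both functions compute A(I) = B(I) + C(I) for I = 1..N,
   as the comments in the listing indicate. Names are hypothetical. */

/* Memory-resident index, loosely modeled on the -O1 listing.
   "volatile" forces i to be read from memory at each use and written
   back after the increment; it is a modeling device, not what the
   FORTRAN compiler actually did internally. */
void addem_reload(float *a, const float *b, const float *c, int n)
{
    volatile int i;                    /* stands in for the memory copy of I */

    for (i = 1; i <= n; i = i + 1) {   /* zero-trip when n < 1 */
        float t = b[i - 1] + c[i - 1]; /* i is reloaded for each operand */
        a[i - 1] = t;                  /* and reloaded again for the store */
    }
}

/* Register-friendly form: i and the three base addresses are ordinary
   locals, so an optimizing compiler is free to keep them in registers. */
void addem_regs(float *a, const float *b, const float *c, int n)
{
    int i;

    for (i = 1; i <= n; i++) {
        a[i - 1] = b[i - 1] + c[i - 1];
    }
}
```

Both functions produce the same results; they differ only in how often the index has to travel between memory and a register, which is the inefficiency the listing above still exhibits.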





Source:  OpenStax, High performance computing. OpenStax CNX. Aug 25, 2010 Download for free at http://cnx.org/content/col11136/1.5
