int add (int n, int a []) { int r = 0; while (n-- > 0) r += a [n]; return r; }Compiled separately, this procedure could then be called as follows:
# include <stdio.h> extern int add (int n, int a []); # define DIM(a) (sizeof a / sizeof a[0]) static int a [] = { 1, 2, 3, 4, 5 }; static int b [] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 }; int main (void) { printf ("%d %d\n", add (DIM (a), a), add (DIM (b), b)); return 0; }When executed, the program will print print:
15 55In order to write the code for the add function in Sparc assembly language I recommend starting with extensive comments. (Comments, for example, could include a conceptually equivalent fragment of C code.)
%l0 will be our variable r, and %l1 serves as a temporary register for loading array element from memory into a register, so they can be added to %l0.
This leaves us with the following setup:
i0 -> n i1 -> pointer to first element of a l0 -> r l1 -> temporary (place for loading values from memory)
In our case it is possible to remedy this situation by multiplying %i0 by four once and for all times before we enter the loop. Of course, decrementing n will then correspond to subtracting 4, not 1, from %i0!
.seg "text" .global _add .align 4 .proc 4 _add: save %sp, -96, %spNow we can establish our loop invariants (i.e., make sure the registers contain the correct initial values):
!! i0 -> n !! i1 -> pointer to first element of a !! l0 -> r !! l1 -> temporary (place for loading values from memory) set 0, %l0 ! initialize r sll %i0, 2, %i0 ! n *= 4
The add instruction was moved into the delay-slot of the ba branch, so no cyles are wasted for no-ops at this point.
loop: cmp %i0, %g0 ! has n reached 0? be done nop sub %i0, 4, %i0 ! n-- ld [%i1+%i0], %l1 ! fetch a [n] from memory ba loop add %l0, %l1, %l0 ! r += a [n] in delay slot
done: mov %l0, %i0 ret restore
!! int add (int n, int a []) !! { !! int r = 0; !! while (n-- > 0) !! r += a [n]; !! return r; !! } !! .seg "text" .global _add .align 4 .proc 4 _add: save %sp, -96, %sp !! i0 -> n !! i1 -> pointer to first element of a !! l0 -> r !! l1 -> temporary (place for loading values from memory) set 0, %l0 sll %i0, 2, %i0 ! n *= 4 loop: cmp %i0, %g0 ! has n reached 0? be done nop sub %i0, 4, %i0 ! n-- ld [%i1+%i0], %l1 ! fetch a [n] from memory ba loop add %l0, %l1, %l0 ! r += a [n] in delay slot done: mov %l0, %i0 ret restore
Here is C code conceptually equivalent to what we have in mind:
# include <stdarg.h> int add (int n, ...) { va_list v; int r; va_start (v, n); while (n-- > 0) r += va_arg (v, int); va_end (v); return r; }The function could be called like this:
# include <stdio.h> extern int add (int n, ...); int main (void) { printf ("%d %d\n", add (5, 1, 2, 3, 4, 5), add (10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)); return 0; }That program, when run, also prints
15 55
%fp -> 64 byes to save register window %fp + 64 -> 4 bytes for structure return pointer %fp + 68 -> space reserved for saving %i0 (arg 1) %fp + 72 -> space reserved for saving %i1 (arg 2) %fp + 76 -> space reserved for saving %i2 (arg 3) %fp + 80 -> space reserved for saving %i3 (arg 4) %fp + 84 -> space reserved for saving %i4 (arg 5) %fp + 88 -> space reserved for saving %i5 (arg 6)The space reserved for arguments 1 through 6 is always there, even if the function has less than 6 arguments.
If there are more than 6 arguments, then the caller will have allocated extra stack space for them. Argument 7 appears on the stack at address %fp+92, argument 8 at %fp+96, and so on. In general, there is a stack location reserved for every argument, and those locations are contiguous! This is very convenient.
In order to establish a contiguous array of all arguments all we need to do is store the contents of %i1 through %i5 into their respective reserved locations.
.seg "text" .global _add .align 4 .proc 4 _add: save %sp, -96, %sp !! i0 -> n !! i1, ..., i5 -> first 5 arguments !! l0 -> r !! l1 -> temporary set 0, %l0 sll %i0, 2, %i0 ! n *= 4 !! store first five arguments in memory (so we can treat all arguments !! uniformely) st %i5, [%fp+88] st %i4, [%fp+84] st %i3, [%fp+80] st %i2, [%fp+76] st %i1, [%fp+72]The array of numbers to add up now starts at address %fp+72. If we store that value in %i1, then we have established exactly the same loop invariants that we also used in the previous example.
!! argument vector starts at %fp+72 !! save this value in %i1 add %fp, 72, %i1Therefore, the rest of the code is an exact copy of what we have seen before.
!! # include <stdarg.h> !! int add (int n, ...) !! { !! va_list v; !! int r; !! va_start (v, n); !! while (n-- > 0) !! r += va_arg (v, int); !! va_end (v); !! return r; !! } !! .seg "text" .global _add .align 4 .proc 4 _add: save %sp, -96, %sp !! i0 -> n !! i1, ..., i5 -> first 5 arguments !! l1 -> temporary !! l0 -> r set 0, %l0 sll %i0, 2, %i0 ! n *= 4 !! store first five arguments in memory (so we can treat all arguments !! uniformely) st %i5, [%fp+88] st %i4, [%fp+84] st %i3, [%fp+80] st %i2, [%fp+76] st %i1, [%fp+72] !! argument vector starts at %fp+72 !! save this value in %i1 add %fp, 72, %i1 loop: cmp %i0, %g0 ! has n reached 0? be done nop sub %i0, 4, %i0 ! n-- ld [%i1+%i0], %l1 ! fetch arg n from memory ba loop add %l0, %l1, %l0 ! r += a [n] in delay slot done: mov %l0, %i0 ret restore
The code for the loop and the return sequence still contains a no-op after the branch instruction for the loop-termination test. We will now try to eliminate the nop by filling the delay-slot with a useful instruction.
loop: cmp %i0, %g0 ! has n reached 0? be,a done ! annulled brach to exit loop mov %l0, %i0 ! first instruction after exit from loop sub %i0, 4, %i0 ! n-- ld [%i1+%i0], %l1 ! fetch arg n from memory ba loop add %l0, %l1, %l0 ! r += a [n] in delay slot done: ret restoreThis code is not only harder to read, it also doesn't really buy us much. The instruction will be annulled (and in effect be treated like a nop) every time we go around the loop. Only the last time, upon exit from the loop, it will perform useful work. In effect, we will save one machine cycle in total (and we use one less word of memory for the instruction sequence). This isn't much.
loop: cmp %i0, %g0 ! has n reached 0? bne,a more ! annulled branch to continue loop sub %i0, 4, %i0 ! n-- in branch delay slot (annulled) mov %l0, %i0 ! prepare return value ret restore more: ld [%i1+%i0], %l1 ! fetch a [n] from memory ba loop add %l0, %l1, %l0 ! r += a [n] in delay slotIn fact, since %i0 gets assigned a new value as part of the return sequence anyway, we don't really have to use an annulled branch, because the so-annulled instruction doesn't do any harm even if executed.
Remember: Be careful with ``micro-optimizations'', they are often not worth the trouble and sometimes even harmful.