We will go through two examples of assembly code functions. Both functions add up some integers, the number of which will only be known at runtime. The first example uses an array for storing and passing those integers. It demonstrates the basics of argument passing in registers. The more challenging second function accepts a variable number of arguments and requires deeper understanding of stack frames.

Passing an array of dynamic size

Let us start with a function that takes an array of integers together with its length and calculates the sum of all the elements. In C this would look like this:

int add (int n, int a [])
{
  int r = 0;
  while (n-- > 0)
    r += a [n];
  return r;
}

Compiled separately, this procedure could then be called as follows:

# include <stdio.h>

extern int add (int n, int a []);

# define DIM(a) (sizeof a / sizeof a[0])

static int a [] = { 1, 2, 3, 4, 5 };
static int b [] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };

int main (void)
{
  printf ("%d %d\n", add (DIM (a), a), add (DIM (b), b));
  return 0;
}

When executed, the program will print print:

15 55

In order to write the code for the add function in Sparc assembly language I recommend starting with extensive comments. (Comments, for example, could include a conceptually equivalent fragment of C code.)

Register assignments

Let's decide how to use registers. We know that n will be available as the first argument -- in register %i0. Furthermore, since arrays are passed by passing a pointer to their first element, %i1 will be the base address for the array argument a. We will need this base address for fetching array elements from memory.

%l0 will be our variable r, and %l1 serves as a temporary register for loading array element from memory into a register, so they can be added to %l0.

This leaves us with the following setup:

i0 -> n
i1 -> pointer to first element of a
l0 -> r
l1 -> temporary (place for loading values from memory)

Addressing the array

Array elements are addressed using the array's base address %i1 and the value of n in %i0. However, since we are adding 4-byte quantities, the correct address of a[n] is [%i1+4*%i0]. Unfortunately, this expression cannot be used directly in a load instruction (ld), so we would have to multiply the contents of %i0 by four every time around. This would also require using another temporary register.

In our case it is possible to remedy this situation by multiplying %i0 by four once and for all times before we enter the loop. Of course, decrementing n will then correspond to subtracting 4, not 1, from %i0!

Function entry sequence

First we need to execute a save instruction, which remaps registers in a way such that o-registers become i-registers, and a fresh set of l- and o-registers becomes available.

	.seg "text"
	.global _add
	.align 4
	.proc 4

_add:	save %sp, -96, %sp

Now we can establish our loop invariants (i.e., make sure the registers contain the correct initial values):

	!! i0 -> n
	!! i1 -> pointer to first element of a
	!! l0 -> r
	!! l1 -> temporary (place for loading values from memory)
	set 0, %l0              ! initialize r
	sll %i0, 2, %i0		! n *= 4

Adding it all up

After all the preliminaries are taken care of we can now write the code for the actual loop. Notice, that we test the value of %i0 before we decrement it! Also, we were lucky, because our "strength reduction" optimization that was applied to n didn't affect the loop-termination test. (This works, because 4*0=0.)

The add instruction was moved into the delay-slot of the ba branch, so no cyles are wasted for no-ops at this point.

loop:	cmp %i0, %g0		! has n reached 0?
	be done
	nop
	sub %i0, 4, %i0		! n--
	ld [%i1+%i0], %l1	! fetch a [n] from memory
	ba loop
	add %l0, %l1, %l0	! r += a [n] in delay slot

Returning the result

We are besically done, %l0 has our result. All that's left to do is to put it into the right register and return. Return values show up in %o0, but since the restore instruction will rename all i-registers back into o-registers (while recovering the previous sets of i- and l-registers), we must move the result to %i0.

done:	mov %l0, %i0
	ret
	restore

The complete code

	!! int add (int n, int a [])
	!! {
	!!   int r = 0;
	!!   while (n-- > 0)
	!!     r += a [n];
	!!   return r;
	!! }
	!!

	.seg "text"
	.global _add
	.align 4
	.proc 4

_add:	save %sp, -96, %sp
	!! i0 -> n
	!! i1 -> pointer to first element of a
	!! l0 -> r
	!! l1 -> temporary (place for loading values from memory)
	set 0, %l0
	sll %i0, 2, %i0		! n *= 4
	
loop:	cmp %i0, %g0		! has n reached 0?
	be done
	nop
	sub %i0, 4, %i0		! n--
	ld [%i1+%i0], %l1	! fetch a [n] from memory
	ba loop
	add %l0, %l1, %l0	! r += a [n] in delay slot

done:	mov %l0, %i0
	ret
	restore

Variable argument lists

Let us now write a very similar function, which instead of using an array for passing multiple values accepts an arbitrary number of arguments! There is no good way of finding out at runtime how many arguments have actually been supplied. Therefore, we will require the caller of our function to pass an additional first argument n, which specifies how many other integer values follow.

Here is C code conceptually equivalent to what we have in mind:

# include <stdarg.h>
int add (int n, ...)
{
  va_list v;
  int r;
  va_start (v, n);
  while (n-- > 0)
    r += va_arg (v, int);
  va_end (v);
  return r;
}

The function could be called like this:

# include <stdio.h>

extern int add (int n, ...);

int main (void)
{
  printf ("%d %d\n",
	  add (5, 1, 2, 3, 4, 5),
	  add (10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10));
  return 0;
}

That program, when run, also prints

15 55

Reducing the problem to a known one

We will play around with the arguments and registers in such a way, that we can use the loop and the return sequence from the previous example unchanged. For this it is necessary to have all arguments (other than the very first one) be stored in an array-like contiguous piece of memory. For arguments beyond the sixth this is already the case, but arguments 1 though 6 are passed in registers.

The stack frame

After the save %sp, -96, %sp instruction has been executed for a function, the stack frame has the following layout:

%fp        -> 64 byes to save register window
%fp + 64   -> 4 bytes for structure return pointer
%fp + 68   -> space reserved for saving %i0 (arg 1)
%fp + 72   -> space reserved for saving %i1 (arg 2)
%fp + 76   -> space reserved for saving %i2 (arg 3)
%fp + 80   -> space reserved for saving %i3 (arg 4)
%fp + 84   -> space reserved for saving %i4 (arg 5)
%fp + 88   -> space reserved for saving %i5 (arg 6)

The space reserved for arguments 1 through 6 is always there, even if the function has less than 6 arguments.

If there are more than 6 arguments, then the caller will have allocated extra stack space for them. Argument 7 appears on the stack at address %fp+92, argument 8 at %fp+96, and so on. In general, there is a stack location reserved for every argument, and those locations are contiguous! This is very convenient.

In order to establish a contiguous array of all arguments all we need to do is store the contents of %i1 through %i5 into their respective reserved locations.

	.seg "text"
	.global _add
	.align 4
	.proc 4

_add:	save %sp, -96, %sp
	!! i0 -> n
	!! i1, ..., i5 -> first 5 arguments
	!! l0 -> r
	!! l1 -> temporary
	set 0, %l0
	sll %i0, 2, %i0		! n *= 4

	!! store first five arguments in memory (so we can treat all arguments
	!! uniformely)
	st %i5, [%fp+88]
	st %i4, [%fp+84]
	st %i3, [%fp+80]
	st %i2, [%fp+76]
	st %i1, [%fp+72]

The array of numbers to add up now starts at address %fp+72. If we store that value in %i1, then we have established exactly the same loop invariants that we also used in the previous example.

	!! argument vector starts at %fp+72
	!! save this value in %i1
	add %fp, 72, %i1

Therefore, the rest of the code is an exact copy of what we have seen before.

The complete code

	!! # include <stdarg.h>
	!! int add (int n, ...)
	!! {
	!!   va_list v;
	!!   int r;
	!!   va_start (v, n);
	!!   while (n-- > 0)
	!!     r += va_arg (v, int);
	!!   va_end (v);
	!!   return r;
	!! }
	!!

	.seg "text"
	.global _add
	.align 4
	.proc 4

_add:	save %sp, -96, %sp
	!! i0 -> n
	!! i1, ..., i5 -> first 5 arguments
	!! l1 -> temporary
	!! l0 -> r
	set 0, %l0
	sll %i0, 2, %i0		! n *= 4

	!! store first five arguments in memory (so we can treat all arguments
	!! uniformely)
	st %i5, [%fp+88]
	st %i4, [%fp+84]
	st %i3, [%fp+80]
	st %i2, [%fp+76]
	st %i1, [%fp+72]

	!! argument vector starts at %fp+72
	!! save this value in %i1
	add %fp, 72, %i1

loop:	cmp %i0, %g0		! has n reached 0?
	be done
	nop
	sub %i0, 4, %i0		! n--
	ld [%i1+%i0], %l1	! fetch arg n from memory
	ba loop
	add %l0, %l1, %l0	! r += a [n] in delay slot

done:	mov %l0, %i0
	ret
	restore

Making the code tighter

(Optional reading -- but if you read it, then read it to the end!)

The code for the loop and the return sequence still contains a no-op after the branch instruction for the loop-termination test. We will now try to eliminate the nop by filling the delay-slot with a useful instruction.

Approach 1

We can turn the be into an annulled branch. This means, that the instruction in the delay slot is only executed if the branch is actually taken. Therefore, we could move the first instruction that follows the loop into that slot.

loop:	cmp %i0, %g0		! has n reached 0?
	be,a done		! annulled brach to exit loop
	mov %l0, %i0		! first instruction after exit from loop
	sub %i0, 4, %i0		! n--
	ld [%i1+%i0], %l1	! fetch arg n from memory
	ba loop
	add %l0, %l1, %l0	! r += a [n] in delay slot

done:	ret
	restore

This code is not only harder to read, it also doesn't really buy us much. The instruction will be annulled (and in effect be treated like a nop) every time we go around the loop. Only the last time, upon exit from the loop, it will perform useful work. In effect, we will save one machine cycle in total (and we use one less word of memory for the instruction sequence). This isn't much.

Approach 2

It seems like it would be better, if in the course of n times around the loop we could make the delay slot perform useful work n-1 times and have it annulled only once upon exit. Annulled branches execute their delay slot when they are taken. This means we need to reverse the logic of our loop-termination test and move the return sequence into the middle of the loop code.

loop:	cmp %i0, %g0		! has n reached 0?
	bne,a more		! annulled branch to continue loop
	sub %i0, 4, %i0		! n-- in branch delay slot (annulled)

	mov %l0, %i0		! prepare return value
	ret
	restore

more:	ld [%i1+%i0], %l1	! fetch a [n] from memory
	ba loop
	add %l0, %l1, %l0	! r += a [n] in delay slot

In fact, since %i0 gets assigned a new value as part of the return sequence anyway, we don't really have to use an annulled branch, because the so-annulled instruction doesn't do any harm even if executed.

Approach 3 = Approach 0

If we go and write a little test program to assess the improvement in timing behavior, then we might be unpleasantly surprised: The ``improved'' version is actually slower than the original one that had an explicit nop! The reason for this is that branches that are taken are actually quite a nuisance for some RISC machines, because the instruction pipeline will have to be flushed. Our ``optimization'' involved turning a branch that was taken ``almost never'' into one that is taken ``almost always''. It turns out that the negative impact eats up all improvement we get from eliminating the nop and then some.

Remember: Be careful with ``micro-optimizations'', they are often not worth the trouble and sometimes even harmful.

Matthias Blume, CS Department, Princeton University