A guide to VSCM internals

By Matthias Blume

Preface

You are asking how to add extensions to VSCM? Unfortunately, I haven't written a memo on this subject so far - maybe I should - because this seems to be a recurring question. Some people went and figured it out by themselves, but this shouldn't be the norm.

The trouble is that I am forced to run the VSCM project with pretty low priority - work on SML/NJ being my main project now. This leaves little time to fix bugs and to further the development of the code itself, much less does it allows me to write documentation.

VSCM is an experiment. I do not want my hands to be tied by past decisions of mine - decisions other people's code has begun to rely on. I do not want to discourage anybody from making extensions to VSCM, but be warned that your code might not be compatible with future releases of the system. The internal mechanisms have changed radically in the past (and not only once). Right now (when the workload permits) I'm working on a brand new compiler (even though this particular change doesn't affect the internals of types and the calling conventions for built-in procedures).


Types

You should look at the code which implements types in VSCM. Types are implemented by source files starting with a capital letter. The .c file contains the actual code, while the .h file specifies the interface for this type. There are a few very complicated types like Code and Cont, which heavily interact with the internal workings of the virtual machine. Other types like Port, Boolean, Cons, and so forth are very straightforward.

When you look at those files you will discover that everything is extremely stylized. Not only a few C preprocessor tricks (shudder!) have been played to make maintenance of this stuff as easy as possible. But if you are not used to the idioms then it might be hard to catch on.

Let me explain the Vector type, because it is one of the more interesting ones:


Vector.h:

There is a bit of #ifndef mumbo-jumbo at the beginning of the file. This prevents multiple inclusions of the same file.

A Scheme type named Xy usually resides in files Xy.c and Xy.h. It is implemented as a C struct type. The first element of the structure must be of type MEM_descriptor (imported from "storage.h"). The name of this member usually is _.

The structure uses a tag named ScmXy and is also typedef'd to ScmXy. This convention is important, because there are many macros which rely on it. If elements of the type are heap-allocated (vectors are!) then they are permitted to contain pointers to other heap elements.

The definition of ScmVector uses the infamous struct hack of C, where you declare a one-element array as the last member of the structure, but at runtime you allocate more space - thereby making the array as long as you need. Vector.c contains the code which tells the memory manager about the size of each individual vector and heap-pointers within the vectors.

Note that it is not necessary to expose the actual layout of the structure in the .h file - unless you want to access the members of this structure directly. Numeric.h declares the numeric types without exposing their actual definition.

Once you are done declaring the type ScmXy you add the line

DCL_MEM_TYPE (Xy);
This enables you to use the ScmType macro from "type.h" as in:
void *x;
...
if (ScmTypeOf (x) == ScmType (Xy))
  ...
else
  ...
The SCM_NEW_VECTOR declaration is a convenient short-hand for allocating storage for a new vector.

Vector.c:

Somewhere near the end of the file you will find a call to the MEM_VECTOR macro, the first argument of which is the name of the type (i.e. Xy).

For types of fixed size (like Cons cells) you specify

MEM_UNITS (sizeof (ScmCons))
as the second argument and
MEM_NULL_measure
as the third argument to the MEM_VECTOR macro (see Cons.c). For variable-sized types like Vector you must write a measure function to find out what the size of a runtime-object is. Look at the definition of measure in Vector.c to see an example.

Having this procedure you specify

0
as the second and
measure
as the third argument to MEM_VECTOR.

In the unlikely case that your type is not heap-allocated (which implies that it cannot contain heap-pointers!) then you write 0 and MEM_NULL_measure as the second and third arguments to MEM_VECTOR, respectively. See Boolean.[ch] and Character.[ch] for examples of this (#f, #t, (), the end-of-file object, and all characters) are non-heap objects with fixed addresses, which among other things makes comparisons with those objects more efficient).

The next task is to write a procedure, which applies a given function to the addresses of heap-pointers within a given object. The generic name for this function is iterator. Again, Vector.c and Cons.c contain good examples for this. The name iterator ends up as the fourth argument to MEM_VECTOR. If you don't have heap-pointers in your type then use MEM_NULL_iterator in place of the fourth MEM_VECTOR argument!

VSCM has a mechanism to dump the contents of the entire heap in a system-independent format to a file and to later resume execution from the state represented by such a file. For a new type you will have to write a few helper routines to make this work.

For every types there is a unique small integer tag representing objects of this type within the dump file. You need to edit "ident.tab" where you add the line:

IDENTIFIER (Xy)
and "identifier.c" where you supply the line
# include "Xy.h"
at the obvious places. (You might want to consult "identifier.h" to understand how this works.)

If your new type does not contain anything besides heap-pointers and is of fized size, then you are almost done. Write MEM_NULL_dumper, MEM_NULL_excavator, and MEM_NULL_revisor as the fifth, sixth, and seventh arguments to MEM_VECTOR, respectively.

When you need a dumper then you will also need an excavator and vice versa. The dumper procedure is responsible for writing all non-heap-pointer members of the type to a file (in some fixed portable format!). In particular, if you deal with variable-sized types then you must write size information to the file in one or the other form.

excavator is the opposite of dumper. When called it must read what was written by the corresponding dumper, allocate an object of the correct size, fill in the non-heap-pointer members, and return the newly constructed object.

At this point there must be enough information in the object to allow for the iterator procedure to work properly. The memory manager now goes and supplies the missing heap-pointers. Afterwards it calls the revisor routine (if provided). I haven't used the revisor mechanism myself anywhere in the current system, but it might give you an opportunity to do some fixup work after the heap-pointers are in place.

Arguments 8 through 10 are procedures (or MEM_NULL_task), which are called at system startup (8), at the beginning of a GC run (just before the heap gets traversed) (9), and at the end of a GC run (after the heap has been copied and all pointers are updated) (10), respectively. You can use these procedures to your benefit (see Port.c for an example of how you achieve so-called finalization using this mechanism).

So far we have been dealing with a generic memory management system, which is not necessarily geared towards an application in a Scheme system. The 11th and last argument to MEM_VECTOR, however, contains Scheme-specific information. It always has the form:

EXT (num-cat, cvt-real, display, write, equal, eqv)
Unless you are implementing a new numerical type (which is extremely tricky) num-cat will always be SCM_NO_NUMBER. Likewise, cvt-real is just cannot_cvt_real in this case.

display must be provided - this is called when the Scheme procedure DISPLAY tries to write an object of your type. write is the same for the case of a Scheme WRITE. See the original source code for examples on how to write these procedures.

equal is the procedure to implement Scheme EQUAL? and - you've guessed it - eqv is the same for EQV?. You can use NULL_eq in either place if the predicate in question coincides with pointer equality.


Allocating space

To allocate space for your objects you should use the macros provided by "type.h". They all correctly initialize the _ field of your objects.

For fized-size types use

ScmXy *xy;

MEM_NEW (xy, Xy);
After completion xy points to a fresh object of type ScmXy.

Variable-sized objects of the struct hack kind are easily allocated with:

MEM_VNEW (xy, Xy, l, t);
where l is the number of elements in the final array member, which was declared of size 1 but which really embodies l elements of type t. See the SCM_NEW_VECTOR macro in Vector.h for an example of how to use MEM_VNEW.

WARNING! WARNING! WARNING! WARNING! WARNING! WARNING!

Any memory allocation can trigger the garbage collector to run. Since it is a copying collector it will change pointers in the process. All pointers which the collector doesn't know about will point to garbage afterwards! Be careful not to get the rug pulled out under your feet while allocating memory. This can lead to very subtle bugs which are hard to find.

If you need an extra root pointer (which must be a global or static variable), then it must be registered with the collector at system startup time. There is an abundance of examples of this throughout the code. Use the MEM_root_var macro at system startup (exactly once!) on each variable you want to be a root pointer.

The other danger with GC is that all pointers the GC might encounter during its heap traversal (including all root pointers!) must contain a valid pointer or NULL. Any uninitialized or otherwise invalid heap pointer will cause the collector to break badly.

These problems will be the major obstacle when writing code which interfaces VSCM with other C libraries and such.


Writing new built-in procedures

Procedure naming and calling conventions

The Scheme arguments sit on a stack which can be manipulated using PEEK, POP, SET_TOP, Push, and PUSH. The first (i.e. leftmost) argument will be the topmost stack element.

It is the procedure's responsibility to POP all arguments from the stack. Push is a safe version of PUSH, which might allocate a bigger stack frame if necessary. PUSH may only be used if you are sure that the stack is allocated big enough already (perhaps because you POPed 10 elements from the stack earlier, so you know there is space for at least 10 PUSHes). PUSH can never trigger a GC, but Push can (remember the warning in the previous section!).

Calling conventions

There are three cases, which are listed in order of increasing complexity below:
  1. Scheme calls a C function. The C function does not call any other Scheme function.
    • Leave the result on the stack and return 0.
  2. Scheme calls a C function. The C function wants to make a tail-call to a Scheme function.
    • Push the arguments to the tail-call (right-to-left!) onto the stack.
    • Push the function to be called (i.e. a pointer to a Procedure, a Primitive, or a Cont objects) onto the stack.
    • Return the number of elements you have pushed (this is at least 1!).
  3. Scheme calls a C funtion. The C funtion wants to call a Scheme function and get control back.

Installation

To make a new built-in procedure available from within Scheme it must be registered with builtins.tab. For every new procedure add a call to either the BUILTIN or the BUILTIN_CONT macro (the latter is necessary in the case of a procedure of the complicated last kind).
BUILTIN (ScmPrimitiveXyz, "xyz", argcnt)
ScmPrimitiveXyz is the name of the C function, xyz is the name under which it will be known from within Scheme (must be all-lowercase!). argcnt is the (fixed) number of arguments the procedure expects or -1 for variable-size argument lists. In the former case the argument count is checked automatically while in the latter case you must do it yourself.
BUILTIN_CONT (ScmPrimitiveXyz, ScmPrimitiveXyxC, "xyz", argcnt)
is the same for procedures split into initial and continuation parts. ScmPrimitiveXyzC is the name of the C function for the continuatiuon part.
This interface will definitely change in the next release - but only in appearance, not in spirit. The new format of builtins.tab replaces entries of the form
BUILTIN (ScmPrimitiveXyz, "xyz", argcnt)
or
BUILTIN_CONT (ScmPrimitiveXyz, ScmPrimitiveXyxC, "xyz", argcnt)
by
BI_ (Xyz, "xyz", argcnt, user)
or
BIC (Abc, AbcC, "abc", argcnt, user)
respectively.

The ScmPrimitive prefix is implied (which means that you *must* name your functions this way!) and there is another extra argument, which essentially describes which namespace the procedure name belongs to. (The next version of VSCM will have a module system [.dvi], and extensions beyond R4RS will be encapsulated within special built-in modules.)


Re-building the system

Any addition to the type system, any added or removed procedure, as well as any other change which affects the format of dump files must be carried out with extreme care. You must have a working version of the system, and there must be a current version of compiler/vscmc.asm.

The reason for the last requirement is that it will not be possible (in general) to use old boot images of the compiler or the run-time system. They need to be re-built from the assembly source.

Before you start make simply say

touch format.time
which will cause the boot-images to be recomputed.
Matthias Blume, CS Department, Princeton University