Programming Languages: Garbage Collection

COS 441 - Garbage Collection - Mar 28, 1996

Garbage Collection

The heap is a collection of records that represent environments, continuations, and data. Data includes closures, pairs, boxes, etc. The heap can be viewed as a directed graph whose nodes are records and whose edges are pointers. Many heap records are not needed after program execution passes a certain point. Consider: (map add1 (map sub1 '(1 2 3))). The intermediate list created by the inner map (the one applying sub1) is not needed once the outer map (the one applying add1) has finished. This example leads us to the question: when do we want to reclaim heap records that will never be used again? Put another way: when is data garbage?
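To make the example concrete, here is a minimal Python sketch that counts the records the expression allocates. It is an illustration under stated assumptions, not the course's interpreter: the names Pair, cons, allocated, map_list, and run_example are hypothetical, and the quoted list is assumed to be heap-allocated like any other list.

```python
class Pair:
    """A heap record with two pointer fields (a cons cell)."""
    def __init__(self, car, cdr):
        self.car = car
        self.cdr = cdr

allocated = []  # every record allocated so far, standing in for the heap

def cons(car, cdr):
    cell = Pair(car, cdr)
    allocated.append(cell)
    return cell

def from_list(items):
    """Build a chain of pairs; None plays the role of the empty list."""
    result = None
    for item in reversed(items):
        result = cons(item, result)
    return result

def to_list(cell):
    out = []
    while cell is not None:
        out.append(cell.car)
        cell = cell.cdr
    return out

def map_list(f, cell):
    """Like Scheme's map on one list: allocates one new pair per element."""
    if cell is None:
        return None
    return cons(f(cell.car), map_list(f, cell.cdr))

def run_example():
    allocated.clear()
    quoted = from_list([1, 2, 3])                      # '(1 2 3)
    intermediate = map_list(lambda n: n - 1, quoted)   # (map sub1 ...)
    result = map_list(lambda n: n + 1, intermediate)   # (map add1 ...)
    return to_list(result), len(allocated)
```

Under these assumptions, run_example() allocates nine pairs in total. Three of them form the intermediate list, which nothing refers to once the outer map returns.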

At one end of the spectrum, we never collect garbage. This is a simple answer, but impractical because it requires an arbitrarily large memory. At the other end of the spectrum, we reclaim data immediately after its last use. Unfortunately, this cannot be implemented, because determining whether a given use is the last one is undecidable. Consider:

A:  (use x)
...
B:  (if P x Q)

Point A could be the last use of x if P evaluates to false. But predicting the value of P is undecidable in general, hence whether A is the last use is undecidable as well. For this reason, we must pick a decidable point on the spectrum.

Consider the state of the CPS, first-order, registerized interpreter when stopped at some point during execution. To resume execution, we need to know the contents of the registers, namely e, env, k and everything to which they refer. Nothing else.

e refers to parts of the abstract syntax tree (i.e., the program), which refer only to other parts of the program. These are static: allocated once before program execution begins, and never again. Let's forget about them.

env and k hold records that refer to other environment records, continuations, and values (i.e., heap records). We need only those heap records that are reachable from env and k. The machine can reach a record only by following pointers from its registers, so heap records that are not reachable will never be used again.

We now need some definitions.

roots - registers such as env and k
live data - data that is reachable from roots
garbage - data that is not reachable from roots
collector - an algorithm that makes garbage available for reuse
mutator - the program itself that allocates data
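
These definitions can be illustrated with a small Python sketch. It assumes a hypothetical encoding in which the heap is a dictionary from record names to the names of the records they point to. The record names and the function reachable are made up for illustration.

```python
def reachable(heap, roots):
    """Return the set of record names reachable from the roots.

    heap maps each record name to the list of record names it points to.
    """
    live = set()
    pending = list(roots)
    while pending:
        name = pending.pop()
        if name in live:
            continue  # already visited; this also handles cycles
        live.add(name)
        pending.extend(heap[name])
    return live

# A hypothetical heap: six records and the pointers between them.
heap = {
    "env1": ["v1", "env0"],
    "env0": [],
    "k1": ["env1"],
    "v1": [],
    "old_list": ["old_tail"],
    "old_tail": [],
}
roots = ["env1", "k1"]          # contents of the env and k registers

live = reachable(heap, roots)   # live data
garbage = set(heap) - live      # everything else is garbage
```

Here old_list and old_tail are garbage: old_list still points to old_tail, but no chain of pointers from a root leads to either of them.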

The abstract algorithm for garbage collection follows.

(1) Stop the machine.
(2) Partition the heap into live data and garbage.
(3) Mark or rearrange heap so that garbage can be reused.
(4) Restart the machine.

When to stop the machine?

when unable to allocate.
when remaining free space is low.
periodically.
when user program pauses for terminal or disk I/O.

The policy we choose depends on the collector algorithm, whether the application program is interactive, how much memory is available on the machine, the allocation behavior of the program, etc.
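
As a rough illustration, the first two policies can be combined into one check made at allocation time. This is only a sketch under assumptions: the function reason_to_collect, its parameters, and the idea of measuring space in words are hypothetical, and a real policy would depend on the factors listed above.

```python
def reason_to_collect(free_words, request_words, low_water_mark):
    """Return why the machine should stop for a collection, or None.

    free_words:     space currently available for allocation
    request_words:  size of the record the mutator wants to allocate
    low_water_mark: assumed threshold below which free space counts as low
    """
    if request_words > free_words:
        return "unable to allocate"
    if free_words - request_words < low_water_mark:
        return "free space low"
    return None
```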

Mark and Sweep

The idea behind Mark/Sweep collection is to mark each record as live or dead and place the dead records on a "free list" for reuse. We allocate a mark bit in each record.

(0) Set all mark bits dead.
(1) Starting from the roots, mark the records they point to live. Follow pointers in those records to children. For each child c:
- if c is marked live, do nothing
- if c is marked dead, mark c live and follow pointers in c
(2) For each record in heap, if marked dead, place in the free list.

Step 1 is called the mark phase, and step 2 is called the sweep phase. If we merge step 0 with step 2, resetting the mark bit of each live record as the sweep visits it, this becomes a two-pass algorithm. The running time of the mark phase is O(live-data), but the running time of the sweep phase is O(heap-size). Thus the running time of the entire algorithm is O(heap-size).
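
The steps above can be sketched in Python as follows. This is a minimal illustration, not a production collector. The names Record, mark, sweep, and collect are hypothetical, the heap is assumed to be a plain list of all records, and the mark phase uses an explicit stack in place of recursion. Step 0 is merged into the sweep, so every mark bit is assumed to be dead when a collection begins.

```python
class Record:
    def __init__(self, name):
        self.name = name
        self.fields = []    # pointers to other records
        self.live = False   # the mark bit; False means dead

def mark(roots):
    """Mark phase: visits only live data, so it costs O(live-data)."""
    pending = list(roots)
    while pending:
        rec = pending.pop()
        if rec.live:
            continue                 # already marked live: do nothing
        rec.live = True              # was dead: mark it live ...
        pending.extend(rec.fields)   # ... and follow its pointers

def sweep(heap):
    """Sweep phase: visits every record, so it costs O(heap-size)."""
    free_list = []
    for rec in heap:
        if rec.live:
            rec.live = False         # merged step 0: reset for next time
        else:
            rec.fields.clear()       # drop stale pointers before reuse
            free_list.append(rec)
    return free_list

def collect(heap, roots):
    mark(roots)
    return sweep(heap)

# Usage: a and b form a live cycle; c and d are unreachable.
a, b, c, d = (Record(n) for n in "abcd")
a.fields.append(b)
b.fields.append(a)
c.fields.append(d)
free_list = collect([a, b, c, d], roots=[a])
free_names = [rec.name for rec in free_list]
```

In this sketch the sweep rebuilds the free list from scratch on every collection, so a record that is already free is not added twice.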