Scheme from Scratch - Bootstrap v0.22 - Garbage Collection

Garbage collection is an essential part of a production Scheme system. Bootstrap Scheme is not a production Scheme system. It is intended to be just enough of a Scheme interpreter to execute a Scheme compiler once. If your computer has enough memory that it won’t run out during the execution of a compiler than you don’t need to implement a garbage collector in Bootstrap Scheme.

I think that the simplest garbage collection system you could add to the interpreter is a mark-and-sweep system. Mark-and-sweep is easier to implement than many collectors, isn’t foiled by circular references, and doesn’t require hardware-specific knowledge. The garbage collection pauses in a mark-and-sweep system can be noticeably long but that doesn’t matter for a bootstrap interpreter.

The C function alloc_object in the interpreter currently calls malloc for every object allocated and doesn’t record anything about what memory has been allocated. The fundamental goal is to replace alloc_object so a new version of it manipulates a heap of preallocated objects that can be recycled.

This page describes one way to implement a mark-and-sweep algorithm. This is basically the mark-and-sweep algorithm I’ve seen in use in some interpreters.

When the interpreter begins, the C init function in the model layer is executed. This is the right place to set up a heap of Scheme objects. You can statically allocate or dynamically malloc enough space for thousands of Scheme objects (i.e. struct object). Each object needs a new field, char mark, that must be set to 0 during initialization. Each object also needs a struct object *next field to be used for tracking which memory is free or active. At the beginning, all memory is free. Link all the heap objects together into a linked list with the next field. The last object in the list can point to C’s NULL. Assign the head of this linked list to a C global variable free_list. There is also a global active_list that is set to NULL now and that will hold all objects in use by the program.

When alloc_object is called, it looks to see if the free_list has an available object. If the list does have an available object then alloc_object moves that object from the free_list to the active_list and returns that object. If the free_list is empty then alloc_object calls a gc function to run the mark-and-sweep algorithm. When gc returns, alloc_object can recheck for available objects on the free_list and return one if there is one or print an error message and exit the program.

The gc function is reasonably simple with its two phases: mark and sweep.

In the mark phase, the garbage collector starts at all the “root” objects (described below) and walks all reachable memory (e.g. pairs can point to other objects) and marks each object. Each reachable object will have its mark field set to 1. When all reachable objects are marked, then move on to the sweep phase.

In the sweep phase, the garbage collector starts at the top of the active_list and examines each object in that list. Objects that have mark set to 0 are moved to the free_list. (Symbols and strings also need to have their data.symbol.value field, for example, freed.) Objects in the active_list that have mark set to 1 have their mark set back to 0 in preparation for the next mark phase.

All of that is pretty easy and nicely contained to a very small area of the C source code. Where everything becomes messy is keeping track of the roots for the mark-and-sweep algorithm. The root objects are the ones where the mark phase starts. These are objects that may not have any other objects pointing to them but should still not be garbage collected. the_global_environment is an example of a known root object. It is easy to have the sweep phase start at all the known roots.

What isn’t easy is managing the stack roots. Have a look at the following C function, for example,

object *make_if(object *predicate, object *consequent,
                object *alternative) {
    return cons(if_symbol,
                cons(predicate,
                     cons(consequent,
                          cons(alternative,
                               the_empty_list))));
}

There are a total of four calls to cons. Each call to cons will call alloc_object. Each call to alloc_object can potentially trigger a round of garbage collection. The objects returned by the three inner cons calls above are not rooted. That is, there is no known garbage collection root, that when followed, will lead to those returned objects. We need a way to protect the returned objects so that an outer call to cons doesn’t move the previous returned value back to the free_list.

We need to maintain a global stack_root list of these otherwise unprotected objects. The mark phase can then use these stack_root objects as additional roots. In order to do this, the make_if function needs to be changed to push and pop from the stack_root list.

object *make_if(object *predicate, object *consequent,
                object *alternative) {
    object *result;
    
    push_stack_root(&result);
    result = cons(alternative, the_empty_list);
    result = cons(consequent, result);
    result = cons(predicate, result);
    result = cons(if_symbol, result);
    pop_stack_root();
    return result;
}

Ugg. Not pretty, is it? What is neat, and keeps the implementation from being even uglier, is that it is a pointer to the result pointer that is pushed onto the stack_root list. That way, anything pointed to by result is protected. You’ll notice that the result isn’t protected when it is returned to make_if’s caller. It is up to that caller to protect the returned object.

If you forget to protect a stack variable you can have dangling pointer. If you forget to pop a stack root you will have a memory leak. The garbage collection code is complex and invades almost all functions. Code with care!

I think you can see why I didn’t cloud the implementation up to now with garbage collection worries. Focusing on the model, reader and evaluation layers was plenty as they were being introduced. I think worrying about garbage collection along the way would have made things intimidating and less fun.

Other types of garbage collectors, with hardware-specific knowledge, allow you to avoid making this mess. The Bohem conservative garbage collector is one such library you could link to your interpreter. If you have a look in the source code of that library you will probably be more frightened by it than you are by the approach I’ve presented.

One more article tomorrow to wrap up things.

Previous article: Standard Library
Next article: Conclusion

Comments

Have something to write? Comment on this article.

Nick Fitzgerald January 27, 2010

Ok, I have been really strapped for time these last few days, but after an evening at the keyboard, I am up to v0.10 and ifs!

I expect things to start really picking up now that the parser is done!

Peter Michaux January 27, 2010

Nick,

Yes I think you’ll be able to do quite a few of the later versions like let, cond, and, etc in a single sitting.

Sam January 30, 2010

Objects in the active_list that have mark set to 1 have their mark set back to 0 in preparation for the next mark phase.

One technique that I have seen in some mark-sweep implementations is instead of visiting all of the active objects again to reset their marks, they just invert the sense of what is marked and what is unmarked.

some C pseudocode:

// in heap initialization:
int current_mark = 1;

// checking marks and setting marks
#define marked_p(o) (o->mark == current_mark)
#define mark_obj(o) (o->mark = current_mark)

// after GC is complete swap marks
current_mark = (current_mark == 1) ? 0 : 1;
Steven Obua March 11, 2010

I very much like this series of articles about Bootstrap Scheme, as I am faced with similar decisions for my new language Babel-17 v0.2, which is also a functional, dynamically typed language.

I also did not want to worry too much about garbage collection when coding an interpreter for it; and I wanted tail-recursion without growing control context, closures, continuations.

Java seems to be a better implementation choice then than C. It is also portable, and has already important features like garbage collection. What were your reasons to choose C over Java as an implementation language?

Peter Michaux March 11, 2010

Steven,

For this series, I choose C because that is close enough to “from scratch”. I discussed the decision a bit in the introductory article. The idea of implementing from scratch is to avoid missing out on implementing features (or at least thinking about them) that a higher-level implementation language would provide for me.

Have something to write? Comment on this article.