Bootstrapping Caml with Format

 
Camel bike at Burning Man Camel bikecycle at Burning Man. Unrelated to the contents of this post... sort of...

The OCaml programming language embodies a variety of languages: a core functional language, an object-oriented language, a module language, and a formatting language. As with any self-respecting compiler, the OCaml compiler is capable of bootstrapping itself, i.e. the compiler is capable of compiling its own source. A typical user of OCaml need not worry about bootstrapping, unless, you happen to be developing the OCaml compiler. Recently, the formatting language and the bootstrap process made me scratch my head — multiple times.

Though, first a bit of background: I am currently at OCaml Labs in Cambridge working with Leo White, KC Sivaramakrishnan, Stephen Dolan, and Anil Madhavapeddy on effect typing for Multicore OCaml. This work is based on previous work by White et al. [1]. The main goal of this work is to switch from nominal effects to structural effects. The motivation for this is worth its own post (in other words I will write about this in a future post). However, to kick things off I began by rebasing Leo's previous work on top of the Multicore OCaml master branch.

A little technical background

The following two subsections provide very brief primers to the bootstrapping process and the format language.

Bootstrapping OCaml

A hard bootstrapping pass is required whenever you add or remove a primitive in the OCaml compiler source. Usually, the bootstrapping process is rather straightforward, the OCaml Makefile even provides a simple recipe for bootstrapping the compiler

# make coreboot     [old system -- you were in a stable state]
# <change the source>
# make clean runtime coreall
# <debug your changes>
# make clean runtime coreall
# make coreboot [new system -- now in a stable state]

The command make clean runtime coreall builds the base bootstrapping compiler including the standard library. The final command make coreboot builds the actual bootstrapping compiler and thereby lifts any changes to the source into the compiler.

Typed format expressions

OCaml has built-in support for formatting which enables OCaml to type check format expressions which in turn means that OCaml can provide type-safe format expressions, e.g.

# let say = Printf.printf "%s %s%c";;
val say : string -> string -> char -> unit = <fun>

For this code OCaml has inferred, based on the string "%s %s%c", that say is function that takes three arguments of types string, string, and char respectively, and produces a unit value, meaning that we can say

# say "Hello" "World" '!';;
Hello World!- : unit = ()

While OCaml will not let us get away with providing an integer instead of a string, i.e.

# say "Hello" 0 '!';;
Error: This expression has type int but an expression was expected of type
         string

Internally format expressions get compiled into the datatype format6 whose definition is

and ('a, 'b, 'c, 'd, 'e, 'f) format6 =
  Format of ('a, 'b, 'c, 'd, 'e, 'f) fmt * string

The type fmt is an algebraic datatype which has a data constructor for every possible format modifier (such as %s and %S for strings). The code below is part of its definition

and ('a, 'b, 'c, 'd, 'e, 'f) fmt =
| String :                                                 (* %s *)
    ('x, string -[!p]-> 'a) padding
    * ('a, 'b, 'c, 'd, 'e, 'f) fmt ->
      ('x, 'b, 'c, 'd, 'e, 'f) fmt
| Caml_string :                                            (* %S *)
    ('x, string -[!p]-> 'a) padding
    * ('a, 'b, 'c, 'd, 'e, 'f) fmt ->
      ('x, 'b, 'c, 'd, 'e, 'f) fmt
...

As you can imagine the whole handling of format expressions involve a fair bit of effectful code.

An effective format

The aforementioned rebasing went rather smoothly until I got to the commit which ascribes effect types to the whole standard library. The standard library comprises the Pervasives module, List module, Array module, and so forth, including the built-in formatting library.

Rather crucically this commit changes the definition of format6 such that it takes an additional parameter !p which is an effect variable

and ('a, 'b, 'c, 'd, 'e, 'f, !p) format6e =
  Format of ('a, 'b, 'c, 'd, 'e, 'f, !p) fmt * string

Note the commits also renames the datatype to format6e (I am guessing e for effective).

Actually, just rebasing that particular commit was rather painless as there were only a few merge conflicts which were easy to fix, but doing it right turned out to be somewhat difficult. After the initial rebase the compiler could no longer bootstrap! Yikes! Immediate loss of all self-respect! The bootstrapping process failed at the second step with the following error

File "camlinternalFormat.ml", line 1854, characters 42-70:
Error: This expression has type string but an expression was expected of type
         ('a -[!p]-> 'b, unit, string, 'c, 'd, 'e, !q wio)
         CamlinternalFormatBasics.format6e

The error arose while type checking the following innocuous looking code

let invalid_box () = failwith_message "invalid box description %S" str in

The function failwith_message expects a format string as its argument, and based on the shape of the string it generates the corresponding format function which in this particular case a function that takes a string as its only argument.

However, the error is a red herring as the cause of the problem has little to do that function. The problem is that there is a cyclic dependency between the type checker and the format definitions in the standard library. The type checker depends on format definitions for compilation of format expressions and therefore expects those definitions to have a particular shape. In turn the format definitions depend on the type checker for type checking. So the question is which should we build first?

The problematic code is in the type checker for the core language

| Pexp_constant(Const_string (str, _) as cst) -> (
    (* Terrible hack for format strings *)
    let ty_exp = expand_head env ty_expected in
    let fmt6_path =
      Path.(Pdot (Pident (Ident.create_persistent "CamlinternalFormatBasics"),
                  "format6", 0)) in
    let is_format = match ty_exp.desc with
      | Tconstr(path, _, _, _) when Path.same path fmt6_path ->
        if !Clflags.principal && ty_exp.level <> generic_level then
          Location.prerr_warning loc
            (Warnings.Not_principal "this coercion to format6");
        true
      | _ -> false
    in

The name of the format6 (or format6e in the commit) datatype is hardwired into the compiler which is used to determine whether a string literal is format expression. The rather telling comment should have alerted us of a code hazard coming up ahead.

An effective hack

I tried a variety of build strategies to get the compiler to build. My common approach was to separate out the problematic parts: the type checker and the format library. I tried carefully staging the changes to each part by building the base compiler one component at a time without the new standard library. However, this approach fails when you get to stage where you have to lift changes into the bootstrapping compiler as its construction depends on the standard library which now does not type check. I tried various variations of this strategy, but with no luck.

Meanwhile Leo had an eureka. How about we counter one hack with another hack? Leo suggested that we allow both datatypes format6 and format6e simultaneously. Thus, I made fmt6e_path, an alternative version fmt6_path, which would point to the new format6e datatype and extended the case guard accordingly. The code fragment below contains both changes (annotated with (* change *)).

| Pexp_constant(Const_string (str, _) as cst) -> (
    (* Terrible hack for format strings *)
    let ty_exp = expand_head env ty_expected in
    let fmt6_path =
      Path.(Pdot (Pident (Ident.create_persistent "CamlinternalFormatBasics"),
                  "format6", 0)) in
    let fmt6e_path = (* change *)
      Path.(Pdot (Pident (Ident.create_persistent "CamlinternalFormatBasics"),
                  "format6e", 0)) in
    let is_format = match ty_exp.desc with
      | Tconstr(path, _, _, _) when Path.same path fmt6_path
                                 || Path.same path fmt6e_path -> (* change *)
        if !Clflags.principal && ty_exp.level <> generic_level then
          Location.prerr_warning loc
            (Warnings.Not_principal "this coercion to format6");
        true
      | _ -> false
    in

This hack allows format6 and format6e names to coexist — this is only intended as a temporary hack while bootstrapping the compiler. Still some care is required to build the compiler. The bootstrapping process still fails if we try to apply the standard bootstrapping recipe at this stage. Rather I proceeded as follows

  1. Patch the type checker to include fmt6e_path.
  2. Build base system without effect types in the standard library via make clean runtime coreall.
  3. Lift the patched typechecker into the compiler via make coreboot.
  4. Cherry pick Leo's commit which ascribes effect types to the standard library.
  5. Revert the type checker to my patched version.
  6. Build the base system with effect types in the standard library via make clean runtime coreall.
  7. Patch the type checker to use the new format6 datatype.
  8. Lift the changes into the compiler via make coreboot.

After step 8 I had a bootstrapping compiler with an effect-typed standard library which I could use to build a native version with effect typing enabled. Success!

Conclusion

On one hand compiler bootstrapping can be somewhat mind bending. In particular when your dependency graph has cycles. On the other hand, going through this process gave me a lot of insight into the more delicate and intricate details of compiler bootstrapping.

I am hoping to have a working prototype with structural effect typing by the end of next week.

References

[1] L. White, S. Dolan, M. Pretnar, and K. Sivaramakrishnan, “Effective programming: Bringing algebraic effects and handlers to OCaml.” Keynote at HOPE, Nara, Japan, 2016.