is complete enough (allows the compiled form of all well-type source programs), and
is decidable.
Why Verify Haskell?
Compiler sanity check
Checking compiler output is easier than proving the compiler is correct
Allows us to safely run untrusted code
Web applets
Distributed grid computing
Sand-boxed plug-in support (desktop applications, OS kernels, smart phones, etc.)
Untrusted code? How does that work?
Haskell is purely functional. Side effects only happen in the IO monad.
Any code which type-checks and is not in the IO monad is always safe to run.*
Hey, but wait! I need side effects to do anything useful!
Define an IO-like monad (e.g., ProxyIO) which only exposes the
functionality needed by client code. Interpret
ProxyIO in a trusted security kernel that performs I/O actions on behalf of the
client only if the requested actions are allowed by the security policy.
Because verification guarantees that untrusted client code cannot perform IO calls directly,
our security policies have a "default deny" flavor. Compare with the "default allow"
behavior of Java, where the standard libraries have to be carefully coded to check security
rights at every potentially sensitive method call.
* Provided you eliminate bad actors like unsafePerformIO.
Yhc (York Haskell Compiler)
Project kicked off in late 2005 by Tom Shackell and Neil Mitchell
Implements the vast majority of Haskell98, as well as the FFI
Main runtime implementation is written in portable C
However, current Yhc lacks a bytecode verifier and is vulnerable to
malicious bytecode programs which can crash and perhaps compromise the VM.
Example malicious program
So what's wrong with this program?
The bogus POP 255 instruction breaks
assumptions that the VM makes that programs will only access in-bounds stack locations.
This program pops off a bunch of the stack (which it doesn't own), writes a constant into
whatever location that is, and then calls itself building a new stack frame lower down
in the stack. The stack pointer will eventually be driven off the bottom end of the stack and
into an unallocated memory region. When the VM attempts to write to this location, it
will segfault.
Type inference for Yhc bytecode is probably undecidable, even
just for programs corresponding to Haskell98. If we wish to
support some of the popular extensions
(e.g., polymorphic recursion, higher-rank polymorphism), it becomes
certainly undecidable.
Therefore, we need additional type information to do verification.
We call this extra information the type certificate.
The compiler has the information we need after it typechecks the source
program...
...but this information is discarded during type erasure!
We need a compiler that retains type information all the way through
bytecode generation.
A Type-certifying Haskell compilation system
Haskell is Hard to Parse and Typecheck!
Solution: define a more tractable intermediate language and use that.
Haskell
IL
Layout
No layout
LR grammar
LL(1) grammar
Fixity resolution
No infix operators
Defaulting
No defaulting
Compound pattern matching
Primitive case analysis
Implicit type abstraction/application
Explicit type abstraction/application
Curry-style "external" types
Church-style "internal" types
Complicated type-inference
Top-down typechecking
Typeclasses
Dictionary-passing
Design of the (simplified) IL
Start with the solid base of System
, and add:
n-ary (unlifted) products and projections,
n-ary sum types with data constructors and case analysis,
iso-recursive types, and
recursive let.
You end up with an intermediate language capable of encoding
most of Haskell which has a strong metatheory.
We conjecture that this language is confluent and type-safe and
that type-checking is decidable.
Compiler Block Diagram
Stage 1 Bytecode
G-Machine execution model
Program is a collection of combinator definitions
Bytecodes are based on a stack machine
Types are manipulatable data
Each stack slot contains:
A pointer to heap data
The type currently assigned to the pointer
Types in the Stage 1 machine have a Curry flavor, in that
stack slots can be assigned several types; in a Church system,
types are unique
A special REWRITE_TYPE instruction
changes the type of a stack slot by applying a type rewrite rule
Type rewrite rules
Type rewrite rules take the place of "non-computational" IL
syntactic constructs
Type rewrite rules are chosen so that they are always safe to apply
Examples:
Replace a polymorphic type with a specialization of that type
Hoist a universal binder through the right side of arrow types
Replace a recursive type with its one-step unrolling
Add variants to a sum type
Stage 2 bytecode
Also, G-Machine, stack-based, etc.
We drop the idea that the machine explicitly manipulates types
All REWRITE_TYPE instructions are removed
They get turned into a (possibly empty) list of type rewrite rules
associated with each remaining bytecode instruction
We call this process type segregation
Final steps
Housekeeping
Zapping analysis
Tail-call elimination
Stack / heap usage analysis
Calculate constant and string tables
Verify! (Turned up a major off-by-one error in the code generator)
Serialize to disk
Verification algorithm
foreach f in combinator-definitions
let code be the bytecode sequence for f
let args be the types of arguments to f
let result be the the result type of f
let s be an array of length (len(code) + 1)
s[0] := empty_stack; s[1..len(code)+1] := void
for i in 0 .. len(code)
if (s[i] = void) fail
if (! preconditions(s[i],code[i],args,result)) fail
foreach j in next-locations(code[i])
s[j] := calculate-effect(s[i],code[i],j) or fail
foreach r in rewrite-rules[i]
s[j] := rewrite-type(r,s[j]) or fail
done
done
done
done
Properties of the verification algorithm
Decidable (requires no unification)
Linear time (n = program size + certificate size)
Fairly small (hopefully amenable to formal verification)
Correct and complete enough (we hope!)
Future Directions
Develop a file format for type certificates:
Investigate integrating with the Yhc bytecode file format
Bonus points for finding a representation that makes verification
fast and easy to code in C
Prove the important theorems about the IL and verification:
IL: confluence, type-safety, soundness of type-rewrite rules
Verification: correctness, complete-enough-ness
Extend the IL and verification to handle currently missing features
and some popular extensions:
This presentation is coded entirely in XHTML+SVG+MathML+Javascript!
You too can code your presentations in a web-accessible,
standards-compliant, cross-platform way!
Recent versions
of Mozilla Suite and Firefox are able to directly render SVG and MathML embedded
content. Hopefully other browser vendors will follow Mozilla's lead in
the future.