Programming Languages: Parsing

COS 441 - Parsing - Feb 8, 1996

Abstract Syntax

Abstract syntax is a representation of a program that:

abstracts away unnecessary details of the concrete syntax;
retains only enough information to let us assign meaning (semantics) to terms; and
parallels the structure of the language's BNF.

Consequently, two expressions (of the same programming language) that have different concrete syntax but the same abstract syntax must have the same semantics.

Parsing means interpreting the input stream as terms in the language at hand. Recall that we view a language's syntax as consisting of three layers: lexical elements, context-free syntax, and context-sensitive syntax. Consequently we'll parse a language by considering these three layers separately.

A lexical analyzer or tokenizer takes the input stream of characters and breaks it into tokens. For this course, we'll use Scheme's tokenizer to do this for us.

A parser takes the token stream produced by the lexical analyzer and constructs a representation of the program's abstract syntax called an abstract syntax tree or parse tree. As you can see, the term parsing is often (ab)used to mean simply interpreting the token stream into context-free syntax.

Let us return to the query example. A query is:

query ::= Word
        | NOT query
        | ( query AND query )

To parse queries, we have to fix a representation for tokens and a representation for queries, i.e. for the abstract syntax of queries. For tokens, we will use the following representations:

Word - symbol
NOT  - 'NOT
AND  - 'AND
(    - "("
)    - ")"

We assume that we have a function tokenize : input -> list of tokens that converts the input stream into a list of such tokens. We will also assume that the functions make-Word, make-Not, and make-And build appropriate representations of queries.
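The notes leave these helpers unspecified. As a minimal sketch, assuming a case-sensitive Scheme reader, input given as a string, and queries represented as tagged lists (all three are assumptions made here for illustration, not choices stated in the notes), they might look like this:

```scheme
;; Hypothetical representation: each query is a list whose head is a tag.
(define make-Word (lambda (w) (list 'Word w)))
(define make-Not  (lambda (q) (list 'Not q)))
(define make-And  (lambda (q1 q2) (list 'And q1 q2)))

;; tokenize : string -> list of tokens
;; Parentheses become the strings "(" and ")"; every other run of
;; non-whitespace characters becomes a symbol (so NOT and AND are symbols).
(define tokenize
  (lambda (str)
    (let loop ((chars (string->list str)) (word '()) (tokens '()))
      ;; flush pushes the word collected so far (if any) onto tokens
      (let ((flush (lambda ()
                     (if (null? word)
                         tokens
                         (cons (string->symbol (list->string (reverse word)))
                               tokens)))))
        (cond ((null? chars) (reverse (flush)))
              ((char=? (car chars) #\()
               (loop (cdr chars) '() (cons "(" (flush))))
              ((char=? (car chars) #\))
               (loop (cdr chars) '() (cons ")" (flush))))
              ((char-whitespace? (car chars))
               (loop (cdr chars) '() (flush)))
              (else
               (loop (cdr chars) (cons (car chars) word) tokens)))))))
```

With these definitions, (tokenize "NOT (a AND b)") yields a list equal to (list 'NOT "(" 'a 'AND 'b ")"), which matches the token representations listed above.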

We can now write a function parse to parse querys. This function will take a list of tokens as input, and return a pair of an abstract query and the remainder of the input.

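;; expect : token * list of tokens -> list of tokens
;; Checks that the next token is tok and returns the tokens after it.
(define expect
  (lambda (tok input)
    (if (and (pair? input) (equal? tok (car input)))
        (cdr input)
        (error "Bad input"))))

(define parse
  (lambda (input)
    (cond ((null? input) (error "Unexpected end of input"))
          ;; NOT and AND are symbols too, so test for them before Word.
          ((equal? 'NOT (car input))
           (let* ((r (parse (cdr input)))
                  (q (car r))
                  (rest (cdr r)))
             (cons (make-Not q) rest)))
          ((equal? 'AND (car input)) (error "Bad input"))
          ((symbol? (car input))
           (cons (make-Word (car input)) (cdr input)))
          ((equal? "(" (car input))
           (let* ((r1 (parse (cdr input)))
                  (q1 (car r1))
                  (rest1 (cdr r1))
                  (rest2 (expect 'AND rest1))  ; skip AND
                  (r2 (parse rest2))
                  (q2 (car r2))
                  (rest3 (cdr r2))
                  (rest4 (expect ")" rest3)))  ; skip ")"
             (cons (make-And q1 q2) rest4)))
          (else (error "Bad input")))))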
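Because parse returns a pair of a query and the remaining tokens, a caller that wants exactly one complete query still has to check that nothing is left over. The wrapper below is a hypothetical sketch of that check (the name parse-all is not from the notes); it takes the parser as an argument so that it works with any function of the shape described above.

```scheme
;; parse-all : (list of tokens -> pair of query and rest) * list of tokens
;;             -> query
;; Runs parser on input and insists that every token was consumed.
(define parse-all
  (lambda (parser input)
    (let* ((r (parser input))
           (q (car r))
           (rest (cdr r)))
      (if (null? rest)
          q
          (error "Trailing tokens after query")))))
```

For example, (parse-all parse (list 'NOT 'a)) would return just the abstract query, while (parse-all parse (list 'a 'b)) would signal an error, because b is left over after the Word a has been parsed.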

This is pretty simple because the grammar for queries is LL(1): the first token of the remaining input determines which production applies. But we can make it even easier by taking advantage of a built-in parser that Scheme has for a language called s-expressions. S-expressions are defined as follows:

sexp ::= #t | #f | number | char | symbol | () | string 
            | (sexp . sexp) | #(sexp*) | (sexp*)

An s-expression of the form (sexp . sexp) is a pair; one of the form #(sexp*) is a vector; and one of the form (sexp*) is a list. Lists are represented using pairs and the empty list (). S-expressions are built by read and by (quote sexp), which can be abbreviated 'sexp.
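A few expressions that can be tried at the Scheme prompt illustrate these points. Each one should evaluate to #t; the particular data used are only illustrative.

```scheme
;; A list is nested pairs ending in the empty list.
(equal? '(a b c) '(a . (b . (c . ()))))

;; 'sexp is an abbreviation for (quote sexp), so quoting either
;; form yields the same two-element list.
(equal? ''x '(quote x))

;; The reader has already grouped nested parentheses for us:
;; the second element of this list is itself a list.
(equal? (cadr '(NOT (AND a b))) '(AND a b))

;; #(sexp*) denotes a vector, which is distinct from a list.
(and (vector? '#(1 2 3)) (not (pair? '#(1 2 3))))
```

The third expression is the important one for parsing: once read has built the s-expression, there are no parenthesis tokens left to skip.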

If we now change the syntax of queries slightly so that queries are a subset of s-expressions, we can use the s-expression parser to do some of the parsing for us. Let's redefine queries as follows:

q ::= word | (NOT q) | (AND q q)

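Note the parentheses that are now required around a NOT query, and that AND is now written in prefix position. Our new parse function takes an s-expression (as produced by read or quote) rather than a list of tokens, and simply returns a parsed query: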

(define parse
  (lambda (sexp)
    (cond ((symbol? sexp) (make-Word sexp))
          ((pair? sexp)
           (cond ((equal? 'NOT (car sexp))
                  (make-Not (parse (cadr sexp))))
                 ((equal? 'AND (car sexp))
                  (make-And (parse (cadr sexp)) (parse (caddr sexp))))
                 (else (error "Bad input"))))
          (else (error "Bad input")))))
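The parser above trusts its input in two places: it accepts the symbols NOT and AND as words, and it ignores missing or extra operands, so that (NOT a b) parses like (NOT a). The following is a sketch of a stricter variant, not the version used in these notes. The names parse-strict and has-length? and the tagged-list constructors are assumptions introduced for this example.

```scheme
;; Hypothetical tagged-list constructors, standing in for the assumed ones.
(define make-Word (lambda (w) (list 'Word w)))
(define make-Not  (lambda (q) (list 'Not q)))
(define make-And  (lambda (q1 q2) (list 'And q1 q2)))

;; has-length? : any * number -> boolean
;; True when x is a proper list of exactly n elements.
(define has-length?
  (lambda (x n)
    (and (list? x) (= (length x) n))))

;; parse-strict : sexp -> abstract query
;; Like parse, but checks the number of operands and reserves NOT and AND.
(define parse-strict
  (lambda (sexp)
    (cond ((memq sexp '(NOT AND)) (error "Keyword used as a word"))
          ((symbol? sexp) (make-Word sexp))
          ((and (has-length? sexp 2) (eq? 'NOT (car sexp)))
           (make-Not (parse-strict (cadr sexp))))
          ((and (has-length? sexp 3) (eq? 'AND (car sexp)))
           (make-And (parse-strict (cadr sexp))
                     (parse-strict (caddr sexp))))
          (else (error "Bad input")))))
```

With these definitions, (parse-strict '(AND a (NOT b))) returns a value equal to (make-And (make-Word 'a) (make-Not (make-Word 'b))), while (parse-strict '(NOT a b)) signals an error instead of silently dropping b.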

Let us now build a parser for a subset of Scheme. We'll consider the following subset:

e ::= #t | #f | () | number | ...
         | x
         | (lambda (x*) e)
         | (if e e e)
         | (cond (e e)* [(else e)])
         | (e e*)

We'll represent tokens exactly as Scheme represents them in s-expressions. We use the define-record facility to build representations of the abstract syntax:

(define-record Const (value))
(define-record Var   (name))
(define-record Lam   (formals body))
(define-record If    (test then else))
(define-record Cond  (clauses else))
(define-record Ap    (fun args))

Each (define-record Foo (field1 ... fieldN)) expression builds the following procedures: make-Foo, Foo?, and Foo->field1 through Foo->fieldN. These are called the constructor, the predicate, and the selectors (or accessors) for data of type Foo. The following identities will hold:

(Foo? (make-Foo v1 ... vN)) = #t
(Foo->fieldM (make-Foo v1 ... vN)) = vM

for any values v1 ... vN. Now let's parse Scheme.
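Since define-record is not part of standard Scheme, it may help to see hand-written procedures that satisfy the same identities for one record type, so that the parser below can be tried in an implementation without it. This is only a sketch: the tagged-vector layout is an assumption made here, and the real define-record may represent records differently.

```scheme
;; Hand-written stand-ins for (define-record Lam (formals body)).
;; The tagged-vector layout is an assumption, not define-record's output.
(define make-Lam
  (lambda (formals body) (vector 'Lam formals body)))

(define Lam?
  (lambda (v)
    (and (vector? v)
         (= (vector-length v) 3)
         (eq? (vector-ref v 0) 'Lam))))

(define Lam->formals (lambda (r) (vector-ref r 1)))
(define Lam->body    (lambda (r) (vector-ref r 2)))

;; The identities from the text, checked on one example.
(define example (make-Lam '(x y) 'x))
(Lam? example)                          ; #t
(equal? (Lam->formals example) '(x y))  ; #t
(equal? (Lam->body example) 'x)         ; #t
(Lam? '(x y))                           ; #f
```

The other record types (Const, Var, If, Cond, Ap) could be written the same way, with one selector per field.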

(define parse
  (lambda (sexp)
    (cond ((member sexp '(#t #f ()))
           (make-Const sexp))
          ((or (number? sexp) (string? sexp) (char? sexp))
           (make-Const sexp))
          ((symbol? sexp)
           (make-Var sexp))
          ((pair? sexp) 
           (cond ((equal? 'lambda (car sexp))
                  (make-Lam (cadr sexp) (parse (caddr sexp))))
                 ((equal? 'if (car sexp))
                  (make-If (parse (cadr sexp))
                           (parse (caddr sexp))
                           (parse (cadddr sexp))))
                 ((equal? 'cond (car sexp))
                  ...  ...)
                 (else
                   (make-Ap (parse (car sexp)) (map parse (cdr sexp))))))
          (else (error "Bad input")))))

Reading

The Seasoned Schemer, Chapters 11, 12, 13
EOPL Chapter 2