## COS 441 - Syntax - Feb 6, 1996

A programming language is a formal language used to communicate algorithms both from programmer to programmer and from programmer to machine. A formal language consists of:

• a set of symbols;
• rules for forming terms;
• rules for transforming terms to terms.

Some general-purpose programming languages include C, C++, Pascal, and Ada. Some special-purpose languages are TeX, PostScript, Java bytecode, TCP/IP, and perhaps the Win32s API.

To use a programming language effectively we must study and understand it from three perspectives:

• Syntax - the set of symbols and rules for forming terms.
• Semantics - the rules for transforming terms to terms.
• Pragmatics - how the particular constructs of the language are used in practice.

Here are three ways of expressing "increment the i-th element of array x" in different programming languages.

```
(a) x[i] = x[i] + 1;                              [C]
(b) (vector-set! x i (+ (vector-ref x i) 1))      [Scheme]
(c) x[i] = x[i] + 1;                              [Java]
```

These expressions have approximately the following semantics.

```
(a) if i in bounds of x then x[i] <- (x[i] + 1) mod 2^32
    else who knows?
(b) if x is not a vector then ERROR
    else if i is not an integer then ERROR
    else if i is not in bounds of x then ERROR
    else if x[i] is not an integer then ERROR
    else x[i] <- x[i] + 1
(c) if i is not in bounds of x then ERROR
    else x[i] <- (x[i] + 1) mod 2^32
```

Despite the apparent similarity of the C and Java expressions, the semantics of the Java expression is closer to that of the Scheme expression than to that of the C expression.
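To make the gap between semantics (a) and (c) concrete, here is a minimal sketch of the checked behavior written in C itself. The function name `checked_increment` and the explicit `len` parameter are assumptions made for illustration (a C array does not carry its own length), and `uint32_t` is chosen so that the increment wraps mod 2^32 as in the description above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: increments x[i] the way semantics (c) describes.
   Returns false (the "ERROR" branch) when i is out of bounds, instead of
   touching memory outside the array as plain C would. */
bool checked_increment(uint32_t *x, size_t len, size_t i)
{
    if (i >= len) {
        return false;       /* i is not in bounds of x */
    }
    x[i] = x[i] + 1;        /* stored result wraps mod 2^32 */
    return true;
}
```

Plain C omits the bounds test entirely, which is where the "who knows?" in (a) comes from; Java and Scheme perform the test on every access.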

Now consider expressing "increment x[0] through x[N]". In C we write:

```
for (i = N; i >= 0; i--)
    x[i] = x[i] + 1;
```
In Scheme we write something rather different:
```
; Apply f to n, n-1, ..., 0.
(define natural-foreach
  (lambda (f n)
    (cond ((>= n 0) (begin
                      (f n)
                      (natural-foreach f (- n 1)))))))
(define inc-x (lambda (i)
                (vector-set! x i (+ (vector-ref x i) 1))))
(natural-foreach inc-x N)
```
Finally, in Java we would probably write something that looks the same (has the same syntax) as in C. C/Java pragmatics suggest the use of iteration, while in Scheme we use recursion.

Most programming language courses survey a variety of programming languages, covering mostly syntax, with only a short time left for semantics. Instead, we will use only Scheme, which will allow us to move on quickly to semantic issues. We will use definitional interpreters and spend a little time looking at pragmatic issues.

This course will NOT teach you:

• any practical programming languages; nor
• how to implement high performance programming languages.

But it will teach you:

• how to learn a new programming language quickly;
• how to choose a programming language for a particular task;
• how to design and build interpreters;
• more about programming languages than the designers of most popular languages will ever know.

### Syntax

To simplify understanding and analyzing a language's syntax, we separate syntax into three levels: lexical elements, context-free syntax, and context-sensitive syntax. In English, letters form words, which form sentences. In programming languages, characters form tokens, which form terms. Tokens are lexical elements.

#### Lexical Analysis

The following are some of Scheme's tokens:

• strings of digits
• "characters ..."
• ' ` #f #t '() #\a
• strings of letters, digits, and characters such as - + * @ $, etc.

Tokens of the last kind are called identifiers. Scheme's syntax for identifiers is more liberal than that of many other languages; for example, `+`, `a+b`, and `-a*2` are all identifiers.

Aside: Comments are usually discarded by a language processor during lexical analysis; that is, while the language processor is converting the stream of input characters into a stream of tokens. Scheme's comments begin with a semicolon and extend to the end of the line.
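The conversion from characters to tokens can be sketched as a small C function. This is a simplified illustration, not a complete Scheme lexer: the names `TokenKind` and `next_token` are hypothetical, strings have no escape handling, and tokens such as `'`, `#f`, and `#\a` are not given kinds of their own (they fall into the identifier case). It does show the two points made above: comments are discarded during lexical analysis, and anything that is not all digits, such as `a+b`, is read as a single identifier.

```c
#include <ctype.h>
#include <stddef.h>

typedef enum {
    TOK_EOF, TOK_LPAREN, TOK_RPAREN, TOK_NUMBER, TOK_STRING, TOK_IDENT
} TokenKind;

/* Characters that end a number or an identifier. */
static int is_delimiter(int c)
{
    return c == '\0' || isspace(c) || c == '(' || c == ')' ||
           c == '"' || c == ';';
}

/* Reads one token starting at *src and advances *src past it.
   *start and *len describe the token's text inside the input. */
TokenKind next_token(const char **src, const char **start, size_t *len)
{
    const char *p = *src;
    TokenKind kind;

    /* Skip whitespace and comments (semicolon to end of line). */
    for (;;) {
        while (isspace((unsigned char)*p)) p++;
        if (*p != ';') break;
        while (*p != '\0' && *p != '\n') p++;
    }

    *start = p;
    if (*p == '\0') {
        kind = TOK_EOF;
    } else if (*p == '(') {
        kind = TOK_LPAREN;
        p++;
    } else if (*p == ')') {
        kind = TOK_RPAREN;
        p++;
    } else if (*p == '"') {
        kind = TOK_STRING;
        p++;
        while (*p != '\0' && *p != '"') p++;   /* no escape handling */
        if (*p == '"') p++;
    } else {
        int all_digits = 1;
        while (!is_delimiter((unsigned char)*p)) {
            if (!isdigit((unsigned char)*p)) all_digits = 0;
            p++;
        }
        kind = all_digits ? TOK_NUMBER : TOK_IDENT;
    }
    *len = (size_t)(p - *start);
    *src = p;
    return kind;
}
```

Calling `next_token` repeatedly until it returns `TOK_EOF` yields the token stream that the later syntax levels work on.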

#### Context Free Syntax

Consider a simple query language. In English, we define a query to be a list of words, NOT query, or (query AND query). To be more precise, let's define queries using mathematics (specifically set theory):
```
Query = { w1 ... wN | w1,...,wN in Word }
      U { NOT q | q in Query }
      U { (q1 AND q2) | q1, q2 in Query }
```
To define the context-free syntax of programming languages, we often use a special, more concise language called BNF (Backus-Naur Form):
```
Query ::= Word* | NOT Query | (Query AND Query)
```
In BNF, terminals (or tokens) are symbols that do not appear on the left of the ::= operator. In the example above, AND, (, ), and NOT are the terminals. Query is the only non-terminal. Well, almost. Word is really a non-terminal that we haven't bothered to define.
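A BNF definition like this one translates fairly directly into a recursive recognizer, with one branch per alternative. The sketch below is in C and rests on an assumption the notes leave open: since Word is undefined, it is taken here to be a run of lowercase letters, which keeps words distinct from the keywords NOT and AND. The function names are hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

static const char *skip_spaces(const char *p)
{
    while (*p == ' ') p++;
    return p;
}

/* If the terminal kw comes next (after spaces), consume it. */
static bool accept(const char **p, const char *kw)
{
    const char *q = skip_spaces(*p);
    size_t n = strlen(kw);
    if (strncmp(q, kw, n) != 0) return false;
    *p = q + n;
    return true;
}

/* Query ::= Word* | NOT Query | (Query AND Query) */
static bool parse_query(const char **p)
{
    if (accept(p, "NOT")) {
        return parse_query(p);
    }
    if (accept(p, "(")) {
        return parse_query(p) && accept(p, "AND") &&
               parse_query(p) && accept(p, ")");
    }
    /* Word*: zero or more words (assumed: lowercase letters only). */
    for (;;) {
        const char *q = skip_spaces(*p);
        if (!islower((unsigned char)*q)) return true;
        while (islower((unsigned char)*q)) q++;
        *p = q;
    }
}

/* True when the whole string s is a Query. */
bool is_query(const char *s)
{
    const char *p = s;
    return parse_query(&p) && *skip_spaces(p) == '\0';
}
```

Under these assumptions, `is_query("(cats AND NOT dogs)")` holds, while `is_query("cats AND dogs")` does not, because the grammar requires parentheses around an AND. Note that the empty string is also a Query, since Word* allows zero words.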

BNF can only describe context-free languages. The following set of terms is impossible to describe in BNF.

```
Kwery = { w1 ... wN | w1,...,wN in Word }
      U { NOT q | q in Kwery }
      U { (q1 AND q1) | q1 in Kwery }
```
An AND-Kwery requires that both of its arms be the same. This set of terms is context-sensitive: selecting a Kwery to place in the hole of the term `(q1 AND [])` (where [] denotes a hole) cannot be done without knowing the context surrounding the hole. Specifically, we must know what q1 is, because to get a valid Kwery we can only place q1 in the hole.
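Although BNF cannot express the Kwery constraint, a hand-written recognizer can check it by remembering the context. The C sketch below parses each arm as usual, records the text that each arm spans, and then compares the two spans. It makes two assumptions that are not in the notes: a Word is a run of lowercase letters, and "the same" means identical character for character (so arms that differ only in spacing are rejected). The function names are hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

static const char *skip_spaces(const char *p)
{
    while (*p == ' ') p++;
    return p;
}

/* If the terminal kw comes next (after spaces), consume it. */
static bool accept(const char **p, const char *kw)
{
    const char *q = skip_spaces(*p);
    size_t n = strlen(kw);
    if (strncmp(q, kw, n) != 0) return false;
    *p = q + n;
    return true;
}

static bool parse_kwery(const char **p)
{
    if (accept(p, "NOT")) {
        return parse_kwery(p);
    }
    if (accept(p, "(")) {
        const char *left = skip_spaces(*p);    /* start of q1 */
        *p = left;
        if (!parse_kwery(p)) return false;
        size_t left_len = (size_t)(*p - left);

        if (!accept(p, "AND")) return false;

        const char *right = skip_spaces(*p);   /* start of the hole */
        *p = right;
        if (!parse_kwery(p)) return false;
        size_t right_len = (size_t)(*p - right);

        /* The context-sensitive step: the hole must contain q1 itself. */
        if (left_len != right_len || memcmp(left, right, left_len) != 0) {
            return false;
        }
        return accept(p, ")");
    }
    /* Word*: zero or more words (assumed: lowercase letters only). */
    for (;;) {
        const char *q = skip_spaces(*p);
        if (!islower((unsigned char)*q)) return true;
        while (islower((unsigned char)*q)) q++;
        *p = q;
    }
}

/* True when the whole string s is a Kwery. */
bool is_kwery(const char *s)
{
    const char *p = s;
    return parse_kwery(&p) && *skip_spaces(p) == '\0';
}
```

The comparison of the two arms is the only part that goes beyond what a BNF rule can say; everything else follows the shape of the Kwery definition.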

### Reading

• The Little Schemer (whole book)
• EOPL (Essentials of Programming Languages) Chapter 1

### Exercise

• Write a C program that declares an array x, initializes it, and increments each element.
• Write a Scheme program that does the same.
• Write a C program that does the same using recursion.