7.1   Formal Languages


This section under major construction.

The theory that we consider in this chapter is based on the following simple abstract notions. An alphabet is a set of symbols. A string over an alphabet is a sequence of symbols taken from that alphabet. A formal language over an alphabet is a set of strings over that alphabet. When addressing theoretical questions, we often use the binary alphabet that consists of the two symbols 0 and 1. In practice, we use whatever alphabet is suitable to the task at hand: the Roman alphabet for processing text; decimal digits for processing numbers, and the characters acgt for processing genetic data.


Alphabet Symbols Symbol
Name
String
Name
binary 01 bit bitstring
Roman abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
letter word
decimal 0123456789 digit integer
special ~!@#$%^&*()_-+={[}]|\:;"'<,>.?/
keyboard Roman + decimal + special keystroke typescript
DNA actg gene genome
monomer ACDEFGHIKLMNPQRSTVWY amino acid protein
ASCII see Appendix byte String
UNICODE see Appendix char String


The simplest way to specify a formal language is to enumerate its strings. For example, we can enumerate all bitstrings of length 3:

000 001 010 011 100 101 110 111

One complication is that languages can be large or infinite. For the moment, we will use informal English descriptions to describe (possibly) infinite sets of strings.


Language true false
Tail(2)

next-to-last bit is 0
00
11101
111111010101
0
000010
11111111111
Equal

equal number of 0s and 1s
10
110010
000011111100
0
11100
01010101010101010


Why not just work with precise, complete definitions? This is the crux of the matter! Our goal is to work with precise, complete definitions. This is known as the specification problem for formal languages. How is formal language related to computation? Each formal language leads to the precise statement of several specific computational problems. Perhaps the most basic is the following: given a language L and a bitstring x, determine whether x is in L or not. This task is called the recognition problem for formal languages. With suitable interpretations of the meaning of the bitstrings, we can define formal languages that relate to nontrivial computational problems. For example, we might define Prime to be the set of all bitstrings that are the binary representation of a prime number. Solving the recognition problem for this language involves understanding some mathematics, and probably using a computer.

Formal languages are very important tools that are widely used in practical applications. Natural languages like English, programming languages like Java, and genetic codes are all examples of formal languages. Specification, recognition, and related problems for these sorts of languages are of critical importance in modern computing.