COS 126 Lecture 16: Formal Languages

NOTE: See also Notes on Formal Language

An alphabet is a finite, non-empty set of symbols. We've been looking at the alphabet consisting of the two symbols, 0 and 1.

Strings are an ordered, finite set of elements from the alphabet (including the empty set.)

A language is a subset of all the possible strings over an alphabet. For the binary alphabet, every binary number is a string over the alphabet. (We say 'over' the alphabet to mean that the elements of the string consist only of alphabet symbols.) A language is a particular subset of these binary strings, such as all 4-bit strings, or all strings with an even number of 0's, or all strings with more 1's than 0's.

This slide summarizes the concepts introduced in lecture 15.


slide 2 - Grammars

For each of the examples given, how would you write the corresponding regular expression?

  1. all bit strings that begin with 0 and end with 1
  2. all bit strings whose number of 0's is a multiple of 5
  3. all bit strings with more 1's than 0's
  4. all bit strings with no consecutive 1's

ANSWERS


slide 3 - Finite State Automata

In precept, I referred to a finite state machine. This is the same thing: an FSA. Each FSA has a fixed number of states (hence the 'finite' part of the name.) There is always a start state. One or more states may be designated as 'accept' states. (The start state may be an accept state.) An FSA accepts a given input string if and only if you are 'in' an accept state when you finish reading the string. How do you use an FSA to read a string?

You read the input string one character at a time (left to right) and start in the 'start' state of the FSA. That start state will have arrows labeled with different characters in the alphabet. (Let's assume the FSA is "deterministic" - I'll explain what this means, shortly.) In a deterministic FSA, there will be exactly one arrow for each possible input character. So, for the binary alphabet we've been using, there will be two arrows coming out of each state - one labeled '0,' the other labeled '1.' For concreteness, let's assume that the first digit of the string is a 0. We'll read this 0 and move to the state the '0' arrow points to. Now, read the next character in the input string and follow the corresponding arrow to the next state. Continue like this until you have read all the characters in the input string. If you 'end up' in an accept state, we say that the FSA recognizes the input string, or, equivalently, that the input string is part of th elanguage described by the FSA. (This is just like saying that a string is in a language defined by a regular expression.

If you look at the examples in this lecture slide, you can see the order of the states encountered (or 'traveled through') as we read the input string. The states in this example are numbered. By default, state 0 is the start state. Accept states are indicated by double circles.

You should learn to figure out what language a particular FSA represents. The first example recognizes the language of all strings with the pattern 10 (10, 1010, 101010,...) The second example recognizes the language with an odd number of 0's. Trying out a few example input strings and looking for a pattern is a good way to start if you're stuck. Also, always be sure to check whether the empty set is in the language.


slide 4 - An application

This slide demonstrates the computational power of an FSA. On each arrow, the first number represents a character read in the input string. The second number (after the /) is the output generated. The chart on the right shows the output generated when the FSA is applied to the input string: 01010101 - the string 00011101 is generated. Isolated 0's and 1's within the string are eliminated. Notice that this FSA doesn't really have an accept state. It's not recognizing a language; rather, it's performing a computation. This example demonstrates how powerful even a simple FSA can be.

Pay particular attention to the state interpretations chart. If you're asked (on an exam, for example) to draw an FSA representing a given language, it's good to try to formuate the meaning of each state.

In the first example on slide 3, here's the interpretation of each state:

0: nothing read so far
1: a 1 was just read
2: two consective 1's read, OR two consecutive 0's read OR a 0 was read initially
3: a 0 was read after state 1
   i.e., a 0 was read after a 1 was read

How do the state interpretations help you understand the FSA as a whole? State 3 is the accept state, because only strings with the pattern 10(10)* will end up in this state. State 2 has arrows ("transitions" is the correct terminology) for both 0 and 1 returning to it. This corresponds to the fact that once you've reached state 2, you've read input characters which are not in the language represented by the regular expression 10(10)*. So, you can never get to the accept state. In state 1, you've read strings of the form 10(10)*1. For an input string to be in the language, you must read a 0 next, and go to state 3.

Question: what are the interpretations of the states in the second example on slide 3?


slide 5 - C program to Simulate FSAs

This C program simulates FSAs that operate on the binary alphabet: 0 and 1. The C program opens a file representing the FSA. (The file name is passed as a command line argument. See K&R section 5.10) The format of the file is a list of states, one line per state. For each state (line of the file), the destination state is given for each of the two possible inputs, 0 and 1. So, each line in the file will have two numbers. The first number corresponds to the '0 arrow.' The second number corresponds to the '1 arrow.' The value of the numbers is the number of the state the arrow points to. For the example given, the file will look like this:

0 1
2 0
1 2

Then, an input string is read from standard input. To run this program, you would type something like:

a.out FSA_filename 11101010

In each iteration of the while loop, the value of state is updated based on the current state and whether a 0 or a 1 is read. At the end, if the state is 0 (the accept state), Accepted is printed. Otherwise, Rejected is printed. You should try tracing through the code while tracing through the FSA to make sure you understand it.


slide 6 - A language that is not regular

FSA's have fundamental limitations. It is not possible to construct a FSA that accepts exactly those bit strings with an equal number of 0's and 1's.

Basically the reason is that FSA's have no ability to remember, i.e., they can't remember how many 0's and 1's they've seen so far. This makes it impossible to count. (A natural approach would be to include a state i to represent the fact that you have seen i more 1's than 0's. The accept state would be 0. Unfortunately, the FSA has to have only finitely many states - this makes it impossible.)

In contrast, FSA's can determine whether a bit string has an even number of 0's. Here we only need a state to represent the parity, i.e., whether we've seen an even or odd number of 0's already. Only two states are needed for this - a finite number.

A proof by contradiction is given to show that it is not possible to design a FSA to recognize whether or not a bit string has an equal number of 0's and 1's.


slide 7 - Pushdown Automata

Think of a Pushdown Automata (PDA) as an FSA + a stack. The stack stores extra information - it is the "memory" added to the FSA. In an FSA, the next state depended only on the current state and whether the input bit was a zero or a one. In a PDA, the next state will depend on the current state and the input bit AND the value of the bit on the top of the stack. Also, in addition to 'moving' to the next state, we can also update the stack by either POPPing the top item off the stack or PUSHing the current input bit onto the stack.

Each arrow on the PDA has a 3 part value on it instead of just the input bit (like in an FSA.) The first part is the input bit (exactly the same as the FSA.) The second part is the value of the bit on the top of the stack. Instead of an ACCEPT state, we accept if the stack is empty, when we finish reading the string. If we ever try to POP an empty stack, we crash, meaning that the string is not in the language (rejected.)

In the example given, we use the stack to hold the extra 0's or 1's. If, for example, we read 10111..., we'll use the stack to hold the extra 1's until enough 0's come along to POP them off the stack.


slide 8 - Nondeterministic Machines

A defining feature of the FSAs we've looked at so far is that they've all been deterministic. The American Heritage Dictionary defines determinism as "The philosophical doctrine that every event, act, and decision is the inevitable consequence of antecedents that are independent of the human will." In computer science, we use the adjective "deterministic" to describe machines, or automata in which decisions are forced. There is no real choice. In the FSA we've looked at, we never had to choose which state to go to next: we just followed the only arrow (transition) which was designated for the given input bit. In a nondeterministic FSA, instead of one arrow for each input bit, there may be 0 or more arrows. If there are more than one arrow, we need to 'choose' which arrow to follow. How can we interpret nondeterministic FSA?

The basic concept is the same. We read the input string one bit at a time and move to the next state accordingly. If there is no arrow for the input bit, we have 'crashed.' The input string is rejected. If there is only one arrow for the input bit, follow it to the next state (just like in a deterministic FSA.) If there is more than one, just choose one, follow it and continue with the next input bit. Here's the tricky part: we say that the string is recognized by the FSA (accepted) if there is any way to get to an accept state. Let's look at the example given on the lecture slide:

01 -we start in state 0. then, we have the choice of going to state 1 or state 2. If we choose state2, we then stay in state 2 when we read the '1.' (result=rejected.) However, if we choose state 1, we move to state 3 when we read the 1, so the result is ACCEPT. Since there is a way to end up in the accept state, 01 is recognized by this FSA

0111110101 - the sequence of states that leads to the accept state is: 0,1,3,2,2,2,2,0,0,1,3

01000010110 - the sequence of states that leads to the accept state is: 0,1,3,1,1,1,1,3,1,3,2,3

01000 - there is no sequence of states that leads to the accept state. We start in state 0, stay in state 0, and then if we move to state 1, we'll stay there and never leave. If we instead choose to move to state 2, we still won't end up in state 3...


slide 9 - Nondeterminism doesn't help in FSAs

This slide demonstrates a method for converting a nondeterministic FSA to a deterministic one. This is the procedure:

If there are N states, list all the N digit binary numbers. Each binary number represents some combination of the states. 0000 represents none of the states. 0110 represents states 1 and 2 (not 0 and 3.) 1110 represents states 0 and 1 and 2. etc. Each of these binary permutations will represent a state in our deterministic FSA (but we might not use all of them.) For each new state, you need to determine its 'arrows.' That is, you need to determine what the next state will be for a '0' input and a '1' input. To make it easier, we'll use the decimal representation of each of our binary numbers to indicate the number of the corresponding state.

It's not as complicated as it sounds, but it does take a while. Let's start constructing the table in the example:

0001 - in the n-FSA, state 3 had a 0-arrow going to state 1, and a 1-arrow going to state 2. The binary representation of state 1 is 0100. The binary representation of state 2 is 0010. So, the 0-arrow for our state 1 (0001) will go to 4 (0100) and the 1-arrow will go to 2 (0010).

0010 - in the n-FSA, state 2 has a 0-arrow going to states 0 (1000) and 3 (0001), and a 1-arrow going to state 2 (0010). So, in the deterministic FSA, our state 2 (0010) will have a 0-arrow going to state 9 (1001), and a 1-arrow going to state 2. Notice that state 9 represents the combination of states 0 and 3 - (1001).

0011 - this state (3) represents the combination of states 2 and 3. So, we need to look at the transitions (arrows) of both states in the n-FSA. For the 0-arrow, the destinations are states 1 and 0 and 3. For the 1-arrow, the destinations are 2 for both states. In the deterministic FSA, then, for state 3, we'll have a 0-arrow going to state 13 (1101) (wrong on the slide) and the 1-arrow going to state 2 (0010).

If any state doesn't have any arrows for a particular input bit in the n-FSA, we'll make the corresponding arrow go to state 0 in the deterministic FSA.

Notice that each state in the deterministic FSA will have exactly 2 arrows going from it - one for 0 and one for 1. If any states in the deterministic FSA we create don't have any arrows going to it (besides the start state), we can eliminate those states.

One more thing: we haven't specified which states are the start and accept states. The start state will remain the same. If state 0 was the start state, then state 8 will be the new start state (1000). Any states in the new FSA which include an accept state as part of their combinations of states will be an accept state. In our example, if state 3 (in the n-FSA) is the accept state, then the accept states will be 1 (0001), 3 (0011), 5 (0101), 7 (0111), 9 (1001), 11 (1011), 13 (1101), and 15 (1111).

You can probably see that this would take a long time to do, especially if there are many states. The main point is that, in principle, it can always be done.


slide 10 - FSAs are equivalent to REs

A FSA is a type of machine that accepts certain strings and rejects others. Those that it accepts are in the language. Regular expressions play a dual role. Given a regular expression, we can generate different strings that satisfy the pattern. The set of all strings that matches the RE forms a language. It turns out there is a remarkable connection between the two. For any FSA, it is always possible to create a RE that generates precisely the same strings that the FSA accepts. Conversely, for any RE it is always possbile to create a FSA that accepts precisely the same strings that the RE generates.

This slide indicates how we can claim such a remarkable fact. The proof that NFSAs and FSAs are equivalent is by construction: given any NFSA, we can construct an equivalent FSA. This slide shows that for any RE, we can construct an NFSA. [Note that for a N-FSA, transitions between states which correspond to the null character (neither 0 or 1) are allowed. You have the choice of following these arrows before reading the next character in the input string. The connecting lines in the rules listed are these null-arrows.]

For example, for the simple regular expression: 10 + 1

(sorry, again, no graphic yet) the NFSA will be:

             state 1 ---> state 2
           /          1          \0
start state                       accept state 
           \            1        /
            state 3 -------------

Let's explain the picture. Coming out of the start state are two arrows (pretend), neither of which as any input bit associated with it. Initially, you can just choose which one to follow. Going the top route, you need to read a 1 followed by a 0 to get to the accept state. Going the bottom direction, you need to read a 1 to get to the accept state.


slide 11 - Nondeterminism does help in PDAs

The main thing to remember is that non-deterministic FSAs and deterministic FSAs are equivallent. Non-deterministic PDAs are more powerful than deterministic ones; that is, you can recognize a broader class of languages with a nondeterministic PDA. Turing machines can recognize an even more broad class of languages.


slide 12 Turing Machines

Turing machines are the same type of thing that FSAs and PDAs are. You can still think of reading an input string and 'moving' through the states accordingly. Like the PDAs, you have additional information to consider. Instead of a stack, you have a 'tape.' Think of the tape as a long strip of paper onto which you can write. The initial input is written on the tape so that you can read it. Depending on your current state and what you read, you can write a bit, move left or right, and move to a new state.

A Turing Machine has no restrictions on the tape. You can even think of having multiple tapes, if you need them (but you don't.) This type of automata is so powerful that non-determinism doesn't make it any more powerful. (Well, that's not the whole story. While nondeterminism does not enable a Turing machine to recognize new languages, it appears that it makes the Turing machine capable of recognizing languages much faster. Stay tuned for Lecture 24 - NP Completeness.)