In this module we are going to introduce the formal definition of a regular expression and discuss what it means. The idea behind regular expressions is to have a way of describing a language. If you’ve done any programming I’m sure you’ve come across regular expressions before.
Lets recall the regular operations we discussed previously; union, concatenation, and star. If we are given a symbol a from the alphabet Σ then we can define a string a*, which is the string containing only the symbol a. This defines a language because there is certainly a finite automaton that accepts the string consisting only of a.
That’s a fairly trivial example though. If we have two symbols a and b we can crate another language. Consider the language a*b which is the set of all strings that end with the symbol b. This is again a regular language because we can construct a finite automaton that accepts this language.
Exercise: Construct the finite automaton that accepts the regular language a*b.
Now lets consider the union operation. The union operation is an or operation. So one condition or the other condition must be true. Therefore if we have (a ∪ b)*a we have the language that consists of all sequences containing a‘s and b‘s that end in a. The order of operations places the parenthesis before the * operation.
If we replace union with concatenation in the last construction (ab)* we have the set of all sequences of alternating a‘s and b‘s. We may then combine all three operations using parenthesis to set the order of operations if we want to change it from star, concatenation, union.
R is a regular expression if R is:
- a for a symbol a in the alphabet Σ.
- ε, which is the alphabet consisting of only the empty string.
- ∅, the alphabet consisting of no strings at all.
- The union of regular expressions.
- The concatenation of regular expressions.
- The star of a regular expression.
So we can build new regular expressions using the regular operations we have learned about already. There is also a language of a regular expression R, denoted by L(R).
Lets look at some properties of regular languages.
- If R is a regular language then R ∪ ∅ = R. Think about what the union operator means. We have that a symbol is either in R or the symbol comes from the empty set. But there are no symbols in the empty set. Therefore we are only left with R.
- If R is a regular language then R • ε = R. We can think about what the empty string does, which is being empty. It’s not the same as the empty set but it has the same effect in this context under composition. Think of it like this abε = ab. As you can see the result is as though it was never included.
Exercise: See if you can find out what goes wrong if we exchange the operators. That is R ∪ ε and R • ∅. Equality does not always hold in these cases. To help with the exercise pick a simple regular expression and add the symbols via the operators.
Equivalence to Finite Automata
The proof that regular expressions and finite automata have the same descriptive power is done in stages. I suggest reading through the section in the book to get the formal pieces of the proof. We are going to discuss what it means for regular expressions and finite automata to be equivalent.
The representations presented by finite automata and regular expressions are very different. So what does it mean for them to have the same descriptive power? In our case we have that the language a regular expression describes can be represented and accepted by a finite automaton. Because the language can be accepted by a finite automaton means that it is a regular language. Even stronger, we have that if a language is regular it can be represented by a regular expression. This is given by the following theorem: A language is regular if and only if some regular expression describes it.
Exercise: Define three regular expressions of increasing complexity. Then determine the language that they represent. Once you have done that construct a finite automaton that accepts each language. For example if the first regular expression is R = a*. Then the language is all finite sequences that contain only the label a. (Make sure that you understand why that is.) The finite automaton has a single entry state with a self loop with label a that is also the accept state.
We now have a new way of representing regular languages. We can represent a regular language with a regular expression that is a more compact representation than a finite automaton. We can represent the operations on regular languages with an easier syntax than we could with finite automata.
Next we will take a look at nonregular languages. This class of languages will show us limitations of finite automata. Once we have done that we can move on to context-free languages.