1. Introduction

In this tutorial, we’ll talk about context-free languages. In particular, we’ll cover some techniques for proving that a language is (not) context-free.

2. What’s a Context-Free Language?

Before answering that question, let’s first remember what formal languages and grammars are.

2.1. Languages and Grammars

A formal language is a set of words we form using a non-empty alphabet of symbols. For example, L_1 = \{a^k b^k \mid k = 0,1,2,\ldots\} is a language over the alphabet \{a, b\}.

Every language has its grammar. Informally, a language’s grammar describes how to produce the words of that language. Formally, it’s a tuple (V, \Sigma, S, P) where:

  • V is the set of all symbols the grammar uses
  • \Sigma is the alphabet; we call its symbols terminals, while the symbols in V \setminus \Sigma are known as non-terminals or variables
  • S is a unique variable that starts the derivation of all the words
  • P is the set of production rules

Production rules derive the words of the language at hand. Therefore, we can think of them as the rules guiding the transformation of the word from S over its transitional forms to its final shape.

For example, these rules generate \{ a^k b^k \mid k\geq 0 \}:

(1)  \begin{equation*}  \begin{aligned} S &\rightarrow aSb \\ S &\rightarrow \varepsilon \end{aligned} \end{equation*}

where \varepsilon is the empty word. Consequently, the derivation of aaaabbbb is:

    [S \rightarrow aSb \rightarrow aaSbb    \rightarrow aaaSbbb \rightarrow aaaaSbbbb \rightarrow aaaabbbb]

Depending on the form of the production rules, we differentiate between several grammar and language types.

2.2. Context-Free Languages

If the production rules P are of the form:

(2)  \begin{equation*}  X \rightarrow w \qquad (X \in V\setminus \Sigma,\quad w \in V^*) \end{equation*}

we say that the grammar is context-free (CFG). So, a language is context-free if there’s a CFG generating it.

The language L_1 from the previous section is context-free. Its grammar is:

    [G_1 = \left( \{S,a,b\}, \{a,b\}, S, P \right)]

where P is the set of rules (1). Both rules are of the CFG form (2), so G_1 is context-free, and so is the language it generates.

3. How to Prove That a Language Is Context-Free

A language can be stated implicitly as a general pattern of its words (such as a^kb^k for k \geq 1). Alternatively, we can get an explicit corpus containing the whole language.

3.1. Constructing a Grammar

No matter how our language is stated, we can prove it’s context-free by constructing a CFG grammar that generates its words.

That’s sometimes more easily said than done. If the language isn’t complex, we’ll be able to derive the rules by hand (provided it’s context-free). That was the case in the above example.

But, if the language has a complicated structure, we may struggle with designing the corresponding CFG. And it’s not just the rules that will be hard to come by. We’ll also have to decide which non-terminal symbols to use. In such cases, we can apply the techniques for automated grammar learning. They can handle even the partially stated languages, where only a sample of the words is available.

3.2. Proving That the Grammar Is Correct

Devising a CFG isn’t sufficient. We also have to prove two statements: that the grammar generates the language and that all the words of the language can be generated by the grammar. Proving only the latter shows that the grammar’s language contains the target language, but not that they’re equal.

Let’s first prove that by applying the rules of G_1 from Section 2.2. k times, we either get a word a^{k-1}b^{k-1} from L_1 or a combination a^k S b^K. The latter isn’t a word because it contains a non-terminal symbol. We’ll prove this hypothesis by induction by the number of applied rules.

In the base case (k=1), we can apply S \rightarrow \varepsilon = a^0 b^0 or S \rightarrow a S b = a^1 S b^1. Since \varepsilon \in L_1 and a^1 b^1 is of the form a^k b^k, we’ve covered the induction base.

In the step case, we assume that the induction hypothesis holds for some k\geq 1, so we have a^{k-1} b^{k-1} \in L_1 and a^{k} S b^{k}. Does the hypothesis hold for k+1?

Applying S \rightarrow \varepsilon to a^{k} S b^{k}, we get a^k b^k = a^{(k+1)-1} b^{(k+1)-1} \in L_1. Similarly, applying the other rule, we get a^{k+1} S b^{k+1}.

Thus we showed that G_1 generated only the words in L_1.

Proving the converse is easier. A word of L_1 is of the form a^k b^k. If we apply S \rightarrow a S b k times and S \rightarrow \varepsilon afterwards, we’ll get a^k b^k for any k.

3.3. Pushdown Automata

A pushdown automaton (PDA) is a finite state machine equivalent to a context-free grammar. For any language generated by a CFG, there’s a PDA that recognizes it and vice versa.

Sometimes, it’s easier to construct an automaton than grammar. Informally, a PDA is a finite state machine that transitions between states while reading a word symbol by symbol from its input and writing symbols onto its stack. If it reads the whole word and ends in an accepting state with an empty stack, we say that it’s accepted the word it read. The language a PDA recognizes consists of the words it accepts.

We won’t cover PDAs in this article, but they’re certainly an alternative to consider.

3.4. The Question of Proof

Failing to construct a CFG or a PDA by hand doesn’t mean that the language isn’t context-free. The absence of proof isn’t the same as the proof of absence. A CFG generating our language may exist, but we couldn’t find it.

If we suspect that we failed at constructing an appropriate grammar because the language isn’t context-free, we can try to prove precisely that: that the language isn’t context-free.

4. How to Prove That a Language Isn’t Context-Free

Ogden’s lemma describes a property of all context-free languages. So, if we show that the language at hand doesn’t have that property, we’ll prove that it isn’t context-free.

4.1. Ogden’s Lemma

Let’s state the lemma. Let L be a context-free language. Then, there exists a natural number n_L such that every word w \in L (|w| \geq n_L) in which at least n_L positions (indexed symbols) are marked, can be split into five parts:

(3)  \begin{equation*}  w = xuyvz \end{equation*}

so that the following holds:

  • uv contains at least one marked position.
  • uyv contains at most n_L marked positions.
  • xu^kyv^kz \in L for all k \geq 0.

The lemma shows a regularity in generating the words of a context-free language. Pumping the selected sub-words (\boldsymbol{u^k} and \boldsymbol{v^k}) produces the new words of the language. If any word with at least n_L symbols breaks this regularity, the language isn’t context-free.

4.2. The Pumping Lemma for Context-Free Languages

Usually, we apply a special case of Ogden’s lemma, where all the positions are marked. This case is known as the pumping lemma for context-free languages:

For every context-free language L, there exists a natural number n_L such that every word w \in L (|w| \geq n_L) can be written as

    [w = xuyvz]

so that the following holds:

  • |uv| \geq 1
  • |uyv| \leq n_L
  • xu^kyv^kz \in L for all k \geq 0

4.3. Example

Let’s prove that the language L_2 = \{a^n b^n c^n \mid n \geq 0\} isn’t context-free.

So, we first suppose the opposite, that \boldsymbol{L_2} is context-free. Then, by the pumping lemma, there must exist n_L that fulfills the lemma’s conditions about splitting all the words not shorter than n_L in L_2.

Let’s focus on the word a^{n_L} b^{n_L} c^{n_L}. Applying the lemma, we can write it as w = xuyvz so that we have:

  • |uv| \geq 1
  • |uyv| \leq n_L
  • xu^kyv^kz \in L for all k \geq 0

Let’s analyze the last condition. There are two options.

  • The sub-word uyv contains only as. So, pumping u and v increases the number of as in the word, while the numbers of bs and cs stay the same. Therefore, the new pumped words won’t be in L_2.
  • Alternatively, uyv can contain both as and bs. By pumping the sub-words u and v, we increase the number of as, the number of bs, or both. At the same time, the number of cs stays the same. So, the pumped words don’t belong to L_2 in this case too.

That’s a contradiction because we assumed that \boldsymbol{L_2} was context-free. Hence, the assumption was wrong, and \boldsymbol{L_2} cannot be a context-free language.

4.4. The Question of Proof, Revisited

Just as was the case with the proofs by constructing the corresponding CFGs and PDAs, failing to prove that a language isn’t context-free doesn’t mean that it is. If we couldn’t find a word that breaks the pumping property, that doesn’t imply that no such word exists.

That opens the question of what to do if both strategies fail. The best course of action would be to use software tools such as automated learners and theorem provers.

5. Conclusion

In this article, we talked about context-free languages. We can prove that a language is context-free if we construct a context-free grammar that generates it. Alternatively, we can create a pushdown automaton that recognizes the language. On the other hand, we use Ogden’s lemma and the pumping lemma for context-free languages to prove that a language isn’t context-free.


« 上一篇: B树数据结构
» 下一篇: 红黑树的应用