1.1. Derivations and Parse Trees¶

1.1.1. Parse Trees¶

Topics in this section include:

Grammar terminology

Derivations and parse trees

Grammars provide a formalism for expressing the syntax of programming languages. That syntax is consequently used to parse, that is, determine the syntactical correctness of, a “program” in the language. A grammar is composed of the following three elements.

A set of terminals. These terminals represent the tokens or groups of characters that logically belong together, such as operator symbols, delimiters, keywords, and variable names that ultimately comprise the program or expression being parsed. In the case of algebraic expressions, the terminals would be variables, numeric constants, parentheses, and the various operators that are allowed.

A set of non-terminals. These non-terminals represent the various grammatical constructs within the language we are parsing. In particular, one non-terminal is designated as the start symbol for the grammar.

A set of productions. The productions are formal rules defining the syntactical composition of the non-terminals from the previous point. The productions take the form:

\[\begin{split}\begin{eqnarray*} <non\mathrm{-}terminal> &::=& String \; of \; terminals \; and/or \; non\mathrm{-}terminals\\ \end{eqnarray*}\end{split}\]

We say that the non-terminal on the left-hand side (LHS) of such a production derives the string on the right-hand side (RHS).

The type of grammars we are using in this course (i.e., those with only one non-terminal on the LHS of each production) is more precisely called a context-free grammar or BNF grammar (short for “Backus-Naur Form). Since we are not going to discuss other types of formal grammars in this course, we can safely refer to them simply as grammars.

An example of a grammar should help to clarify the three-part definition above. By convention, the non-terminal on the LHS of the first production is the start symbol, and that is what ultimately must be parsed to have a complete expression in the language. Hence in the example below, \(<exp>\) is the start symbol.

1.1.1.1. First Sample Grammar¶

\[\begin{split}\begin{eqnarray*} <exp> &::=& <term>\\ &|& <exp> + <term> \\ &|& <exp> - <term> \\ &|& <exp> * <term> \\ &|& <exp> / <term> \\ <term> &::=& <var> \\ &|& ( <exp> ) \\ <var> &::=& A\ |\ B\ |\ C\ |\ \ldots\ |\ X\ |\ Y\ |\ Z \end{eqnarray*}\end{split}\]

This is essentially a grammar for algebraic expressions with variables (that is, the \(<var>\) non-terminal) allowed to be a single uppercase letter. When reading a grammar, the vertical bar \(|\) means “or”. Hence \(<var>\) can be A or B or C … The \(<term>\) non-terminal must either be a \(<var>\) or a parenthesized \(<exp>\). A derivation of the expression \(A + B * C\) according to this grammar proceeds as illustrated in the following slide show, with the final result being a parse tree. You should step through all of the slides, making sure that at each step you understand the production that is being applied to “grow” the parse tree.

Settings

Saving...

Server Error
Resubmit

Note that, in a complete parse tree, leaf nodes are always terminals, and a traversal of the tree that would output these leaf nodes would reproduce the expression being parsed. This is indicated by the red highlighting in the above slide show.

The following set of four review problems for this section should be completed before you go on. In these review problems, the symbol \(\epsilon\) is used to represent the empty string. When \(\epsilon\) appears on the RHS of a production, it means that one of the possibilities for the non-terminal on the LHS the production is for it to derive the empty string, that is, the string with no characters. This is typically used when the syntax for the language being parsed allows the option of the non-terminal not appearing at all. Often with productions that are recursive, it provides a way for the recursion to bottom out, similarly to the way a recursive termination condition would work in a recursive algorithm.

The first problem below is about building a parse tree given a grammar and a string.

1.1.2. Strings Generated by a Grammar¶

The following problem is about determining whether a given string can be generated by a given grammar.

1.1.3. Characterizing a Language: Example 1¶

The following problem is about identifying properties of all of the strings in a language defined by a given grammar.

1.1.4. Characterizing a Language: Example 2¶

The following problem is about precisely characterizing the whole language generated by a given grammar.

Programming Languages

Chapter 1 Grammars