1.2. Overview
1.2.1. Languages, Grammars, and Machines
A language is simply a collection of strings (specifically, a subset of the set of all strings that can be formed from some alphabet). A fundamental thing to be able to do with a string is to determine whether or not it is part of a given language. This semester, we are studying various types of machines to see which languages they can reliably recognize (that is, for which languages they can reliably determine whether a given string is in the language or not). We will also study other ways of expressing or defining languages. These include grammars, and other “language-defining languages” like regular expressions.
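To make the idea of membership concrete, here is a minimal Python sketch (the language and function names here are invented for illustration): it recognizes the language of strings over the alphabet {a, b} that contain an even number of a’s.

    # A language over the alphabet {a, b}: all strings with an even number of a's.
    # "Recognizing" the language means answering membership questions like these.
    ALPHABET = {"a", "b"}

    def in_language(s: str) -> bool:
        """Return True if s is in the language, False otherwise."""
        if any(ch not in ALPHABET for ch in s):
            return False                      # not a string over this alphabet
        return s.count("a") % 2 == 0

    print(in_language("abba"))                # True: two a's
    print(in_language("ab"))                  # False: one a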
1.2.2. Language Hierarchy
This picture shows what we will do for most of the semester. By the end, you will know how the things in this picture relate to each other, how they apply to compilers, and how hard some related problems are to solve.
You are familiar with programming languages. Now we’ll look in finer detail at how programming languages are defined, using grammars and automata (which are simpler computing devices than your laptop). An “automaton” is just another word for “machine”.
We start in the very center of the diagram, which represents the smallest set of languages. We’ll examine simpler languages and grammars first, and then build up to how programming languages are formed. We start with finite automata (also called finite state machines). We will see that any finite automaton represents a simple language, and that a type of grammar (the regular grammar) can represent the same language. We will also look at regular expressions. These all turn out to describe languages that can be recognized by a program with no extra memory.
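As a small illustration (a sketch with invented names, not the formal definition we will use later), here is the even-number-of-a’s language from above recognized by a finite automaton. Notice that the program keeps only its current state and no other memory:

    # A deterministic finite automaton with two states, "even" and "odd",
    # tracking the parity of the number of a's seen so far.
    TRANSITIONS = {
        ("even", "a"): "odd",  ("even", "b"): "even",
        ("odd",  "a"): "even", ("odd",  "b"): "odd",
    }

    def dfa_accepts(s: str) -> bool:
        state = "even"                        # start state
        for ch in s:
            if (state, ch) not in TRANSITIONS:
                return False                  # symbol outside the alphabet
            state = TRANSITIONS[(state, ch)]
        return state == "even"                # "even" is the accepting state

    print(dfa_accepts("abba"))                # True
    print(dfa_accepts("ab"))                  # False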
Then we will add memory in a very simple way—a stack—using a machine (the pushdown automaton, or PDA) that can represent a larger set of languages, along with their corresponding grammars.
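To see why a stack adds power, here is a sketch (hand-written Python, not a formal PDA): balanced parentheses form a language that no finite automaton can recognize, but a single stack is enough.

    def balanced(s: str) -> bool:
        stack = []
        for ch in s:
            if ch == "(":
                stack.append(ch)              # push on every open paren
            elif ch == ")":
                if not stack:
                    return False              # a close with nothing to match
                stack.pop()                   # pop to match the close
            else:
                return False                  # outside the alphabet {(, )}
        return not stack                      # accept iff nothing left unmatched

    print(balanced("(()())"))                 # True
    print(balanced("(()"))                    # False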
Then we will add more memory and capabilities, bringing us to another machine (the Turing machine), the type of languages that it can represent, and its corresponding type of grammar.
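To give a flavor of the model (this simulator is a simplified sketch, not the formal definition the course will use), a Turing machine is just a table mapping (state, symbol) to (next state, symbol to write, head move), run on a tape that can grow as needed:

    # A minimal Turing machine simulator.  delta maps (state, symbol) to
    # (next state, symbol to write, head move "L" or "R").
    def run_tm(delta, tape, state="q0", blank="_"):
        tape, head = list(tape), 0
        while state not in ("accept", "reject"):
            if head == len(tape):
                tape.append(blank)            # extend the tape with blanks
            symbol = tape[head]
            state, write, move = delta[(state, symbol)]
            tape[head] = write
            head = head + 1 if move == "R" else max(head - 1, 0)
        return state, "".join(tape)

    # Example machine: flip every bit, then accept when the blank is reached.
    FLIP = {("q0", "0"): ("q0", "1", "R"),
            ("q0", "1"): ("q0", "0", "R"),
            ("q0", "_"): ("accept", "_", "R")}

    print(run_tm(FLIP, "0110"))               # ('accept', '1001_')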
At the end, we will briefly discuss languages that you cannot write a program to reliably recognize (a topic referred to as computability theory), and also talk a bit about the difference between problems that are (relatively) cheap to solve versus ones that are (relatively) expensive to solve (a topic referred to as complexity theory).
1.2.3. The Power of Machines
We will be going over all of this information in detail during the semester. But this table gives you a quick overview.
FA - can recognize integers and real numbers, but can’t recognize a valid arithmetic expression (see the sketch after this list).
PDA - can recognize a valid arithmetic expression, but can’t evaluate it and give you the answer.
TM - can evaluate expressions, but you can’t write a program for a Turing machine that takes an arbitrary Java program and some input and tells you whether that program will halt on that input or not.
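For instance (a sketch using Python’s re module; the patterns are illustrative), regular expressions have the same recognizing power as finite automata, so they can pick out integer and real-number literals, but no regular expression can check arbitrarily nested parentheses:

    import re

    # Integer or real-number literals form a regular language.
    NUMBER = re.compile(r"[0-9]+(\.[0-9]+)?")

    def is_number(s: str) -> bool:
        return NUMBER.fullmatch(s) is not None

    print(is_number("42"))                    # True
    print(is_number("3.14"))                  # True
    print(is_number("(1+2)*3"))               # False: recognizing a valid
                                              # expression needs more power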
1.2.4. Application: Compilers
Question: Given a program in some language (say, Java or C++), is it valid? That is, is it a syntactically correct program? This is something that certain automata can check, if the grammar for the language is defined in the right way.
If the program is syntactically correct, then the compiler will go ahead and generate code to execute the program efficiently. We won’t talk about that part of a compiler—to learn about doing that, you would need to take a compiler course.
You might think that understanding how to write a grammar to recognize a language (or how to design the language so that it is indeed possible to write such a grammar) is an unnecessary skill. But in reality, a lot of programmers write “little languages” as part of their job. For example, you might work for a company that makes robots, and you could need a little language to control the robot. Or you might write a web page with an input box whose contents must follow a certain structure.
1.2.4.1. Stages of a Compiler
The following figure gives a rough overview of how a compiler works, by performing three basic tasks. In this class we will be learning about the first two of the three major tasks: recognizing tokens, and determining if the tokens fit together in an acceptable way.
PART 1: Identifying the tokens in a program. Regular languages are the foundation for this. Lexical analysis identifies the pieces (tokens) of the program. Tokens are things like integers, keywords, variable names, and special symbols such as \(+\).
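As a sketch of lexical analysis (the token set and names here are invented for illustration), each token class can be described by a regular expression:

    import re

    # One regular expression per token class, combined with named groups.
    TOKEN_SPEC = [
        ("INT",  r"[0-9]+"),
        ("ID",   r"[A-Za-z_][A-Za-z0-9_]*"),
        ("PLUS", r"\+"),
        ("SKIP", r"\s+"),                     # whitespace, discarded below
    ]
    MASTER = re.compile("|".join(f"(?P<{name}>{pattern})"
                                 for name, pattern in TOKEN_SPEC))

    def tokenize(text):
        for m in MASTER.finditer(text):
            if m.lastgroup != "SKIP":
                yield (m.lastgroup, m.group())

    print(list(tokenize("count + 42")))
    # [('ID', 'count'), ('PLUS', '+'), ('INT', '42')]

A real lexer would also report characters that match no token class; finditer in this sketch simply skips them.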
PART 2: Identifying whether the tokens fit together in the correct way, so that the program is syntactically valid. This is called syntax analysis. We will learn the theory for this in our unit on context-free languages, which will involve studying several parsing algorithms.
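One simple parsing strategy we will meet is recursive descent. Here is a sketch of a validity checker for a tiny arithmetic-expression grammar (the grammar and names are illustrative, not the ones used later in the course):

    # Grammar (illustrative):
    #   expr   -> term   (("+" | "-") term)*
    #   term   -> factor (("*" | "/") factor)*
    #   factor -> DIGITS | "(" expr ")"
    def is_valid_expression(s: str) -> bool:
        tokens = s.replace(" ", "")
        pos = 0

        def peek():
            return tokens[pos] if pos < len(tokens) else ""

        def expr():
            nonlocal pos
            if not term():
                return False
            while peek() in {"+", "-"}:
                pos += 1
                if not term():
                    return False
            return True

        def term():
            nonlocal pos
            if not factor():
                return False
            while peek() in {"*", "/"}:
                pos += 1
                if not factor():
                    return False
            return True

        def factor():
            nonlocal pos
            if peek().isdigit():
                while peek().isdigit():
                    pos += 1
                return True
            if peek() == "(":
                pos += 1
                if not expr() or peek() != ")":
                    return False
                pos += 1
                return True
            return False

        return expr() and pos == len(tokens)

    print(is_valid_expression("(1 + 2) * 3"))    # True
    print(is_valid_expression("1 + * 2"))        # False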
PART 3: Creating the parse tree and using it to execute the program. An interpreter walks through the parse tree and immediately executes the program (it does not generate code to execute the program). A compiler instead takes the parse tree and generates a version of the program (one that is not so nice for a human to read) that can be executed quickly.
1.2.5. Some Mindbending Ideas
There are a lot of “meta” concepts related to Formal Languages. Here are a few things to think about.
The descriptions of languages are themselves just strings. This means that, for example, the set of (strings that are) regular expressions is, itself, a language. This leads to some questions like:
What type of language (from our hierarchy) is the set of regular expressions?
What type of language (from our hierarchy) is Java?
What type of language is the set of (strings that are) context-free grammars?
Here are some other interesting “meta” questions and claims.
For any given language \(L\), define the language co-\(L\) to be all strings not in \(L\). Is co-\(L\) always the same type of language (in our hierarchy) as \(L\)?
Turing machines can do anything (at least, anything that any computer can do).
And, how can it be that you can’t always tell if a loop will halt?!