In this interactive website, we're going to provide an introduction to the theory behind programming languages. To make a programming language, we're going to need to make a compiler or an interpreter. If you have some experience with programming, you might have heard these words thrown around and developed an understanding of their purpose, but you might be wondering what's going on under the hood? If you haven't, don't worry, keep reading! You should have a better understanding by the end of this, even if some of the more technical aspects get lost on you. Plus, maybe you'll come back to this after a while and understand more.
To understand the purpose of compilers, we should take a very brief look at the history of computing.
The first programs were directly written in machine code, a machine-specific set of binary instructions that correspond to actions on the CPU. A little later, programs were written in assembly, and then converted to machine code using an assembler. Assembly is more human-readable, but not by much, and still specific to computer architecture. Programmers needed a programming language that was easier to write and understand and could be run on any machine.
In order to use these more high-level, abstract programming languages, you need some way to get from those languages to the low-level instructions that actually run on your computer. This is where compilers come into play. A compiler will translate a universal high-level program to machine code specific to the machine.
We can break this compilation process into five steps: Design and Grammars, Scanning, Parsing, Semantic Analysis, and Code Generation.
If you're going to make a programming language, you first need to figure out what that language is going to look like. You might have an idea of what the syntax will look like, but we need that translate that vision into cold, hard, formal mathematics. For that, we use grammars, specifically context-free grammars. Grammars can pop up a lot in more theoretical computer science, and for good reason, they are really helpful tools. As the name implies, there are parallels to natural language. You can think of a grammar as a formal representation of the rules of some language.
Note: Some of the more astute and contrarian among you might be thinking, does that mean there's some sort of context-sensitive or context-full grammar. Don't worry, there are various classes of grammars in The Chomsky Hierarchy, but the restrictions imposed on context-free grammars are useful for our purposes. Interestingly, the theory of computation and the theory of programming languages especially have a lot of interesting overlap at this point. If you're interested, definitely research the topic and take the relevant coursework.
All that talk is well and good, but what does a grammar actually look like?
Now, if this is your first time with grammars, it can be pretty hard to wrap your head around, but here's an explanation. A grammar consists of production rules like this:
S | ⟶ | a |
On the left-hand side of the production arrow, we have a nonterminal. It is what it sounds like, you have to replace it with some production rules, you can't terminate with a non-terminal there.
On the right-hand side of the production, we have a string of terminals and nonterminals. Terminals are what the grammar is eventually generating. You can replace a non-terminal on the left with whatever is on the right.
Sometimes you want to have one nonterminal able to evaluate to multiple things:
S | ⟶ | a |
S | ⟶ | b |
Using the vertical bar to represent 'or,' we can simplify the above two productions into a single line:
S | ⟶ | a | b |
Nonterminals can also evaluate to nothing. This is expressed using an epsilon (ϵ) to represent the empty string:
S | ⟶ | a | ϵ |
Putting this all together, we can construct a toy grammar to play around with:
S | ⟶ | a S a | b S b | a | b | ϵ |
Now, let's play around with the palindrome grammar using the interactive demo. Click the S tag in the box below to reduce it to one of the options described by the grammar. (Remember: ϵ is the empty string, so it evaluates to nothing) Try to make the following:
Before we get to the grammar for the language, let's create a grammar to generate arithmetic expressions.
Let's start with the simple, an expression can be an id (variable) or a number literal.
Note: You can construct grammars to recognize ids and numbers, but we omit them here for simplicity's sake.expr | ⟶ | id | number |
An expression can also be two expressions as operands for some operator (addition, subtraction, multiplication, and division). We can also use parenthesis to nest expressions.
expr | ⟶ | expr op expr | ( expr ) |
op | ⟶ | + | - | * | / |
Putting that all together, we get a naive grammar for arithmetic expressions.
expr | ⟶ | id | number | ( expr ) | expr op expr |
op | ⟶ | + | - | * | / |
However, this grammar is ambiguous, we could evaluate it in different ways to get the same expression. We want to make the evaluation (parse tree) unambiguous (think order of operations multiplication and division should be in a different "tier" than addition and subtraction). To achieve this, we can reorganize the grammar using term and factor nonterminals.
expr | ⟶ | term | expr add_op term |
term | ⟶ | factor | term mult_op factor |
factor | ⟶ | ( expr ) | id | number |
add_op | ⟶ | + | - |
mult_op | ⟶ | * | / |
To get a hang of this grammar, use the interactive demo below to try to make the following:
Hint: This grammar evaluates right to left, not left to right, so try to match the rightmost first and move inwards.
Now that we have some background on context free grammars, let's design a simple programming language.
Our language will be a simple calculator language that supports simple computations, variable assignment, reading input, and writing output.
Now, we can put that into a grammar. First, a program should be a list of staments marked with an end of program marker (we'll use $$).
program | ⟶ | stmt_list $$ |
A statement list can be how ever long the user wants, and a statement can either be a variable assignment, a read, or a write.
stmt_list | ⟶ | stmt stmt_list | stmt |
stmt | ⟶ | id := expr | read id | write expr |
Putting that all together with the arithmetic expression grammar from before we get:
program | ⟶ | stmt_list $$ |
stmt_list | ⟶ | stmt stmt_list | stmt |
stmt | ⟶ | id := expr | read id | write expr |
expr | ⟶ | term | expr add_op term |
term | ⟶ | factor | term mult_op factor |
factor | ⟶ | ( expr ) | id | number |
add_op | ⟶ | + | - |
mult_op | ⟶ | * | / |
This is the grammar that will define the rules for our calculator programming language. To really get an intuition for this grammar, try to generate the following. Don't worry if gets a little messy, it'll have to.
We did it! However, figuring out whether a program can be generated by our grammar by hand is a pain. Now, like a good computer scientist, let's make a machine do it for us.
Now that we've defined the calculator language, we want to be able to turn some calculator language program into machine code instructions that can run on our computer. The first step is lexical analysis, recognizing each of the parts of our program.
Let's say we have some text that should be a valid calculator language program. The computer needs to be able to determine the meaning of each part of the program. We need to split up the program into meaningful tokens. To do this, we create a scanner (also known as a tokenizer).
There is a sample program in the input below.
You can also write your own code and the site will use it going forward (you can come back and change it at any time).
Note: We can't have nice syntax highlighting yet, because we don't know what to highlight yet. We need tokens.
ERROR
What we want to do now is recognize each unit of the program and break them into tokens. For example, we would take the sample program from earlier and break it into tokens. Notice how they align with the syntax highlighting.
We could formalize token detection using regular expressions and finite automata. In practice language designers do, and use tools to generate a scanner (see LEX). However, the calculator language is simple enough to write an ad hoc scanner. An ad hoc scanner simply iterates over the input program and peeks ahead when it needs to.
Let's build up an ad hoc scanner. The simplest thing to recognize are single-character tokens: parentheses and most operators '()+-*'. We cannot recognize division here because with only a single character, we cannot distinguish between the division operator and a comment ('/' vs '//' or '/*').
Now let's recognize division and comments. It gets a little trickier here, we need to peek ahead to see if the next character makes the token a comment. If so, ignore it, otherwise it is division.
Assignment is similar.
We need to recognize ids (variables), but because they are a string of characters, we need to check if an id is a reserved keyword (read, write)
Now all we need to do is recognize numbers. There are two cases: it starts with a digit or it is a floating point number starting with a decimal point.
Putting that all together, we can now tokenize a program. Plus, we can get some nice syntax highlighting. Play around with this scanning demo to try to understand the process better:
Great! We have a tokenized program. Next, we need the computer to understand the structure of those tokens within the program.
We've converted our program into tokens, but these tokens don't have any meaningful structure. We need to organize them in some way. Conveniently, we have a context-free grammar that describes the structure of calculator language program. To structure our tokens, we create a parser. Together, the scanner and the parser perform syntactic analysis, recognizing the structure of a program.
This form is inherently nested, so we use the ever-popular structure, a tree. More specifically, the parser generates a parse tree of tokens (similar to what you might create when evaluating a grammar, but in reverse). This parse tree is integral to the rest of compilation. Parsers can be generated from grammars, and in practice, they usually are (see YACC). Although we can generate a parser from any grammar, we need a compiler to run quickly. Therefore, language designers use specific classes of grammars. Two of these kinds of grammars are LL grammars and LR grammars.
These classes are important in practice, but we'll just cover them briefly here. LL stands for Left-to-right, Left-most derivation. LL parsers can be written by hand or generated. LR stands for Left-to-right, Right-most derivation. LR parsers are almost always generated. For this site, we use a variant of LR called an SLR(1) parser (hand written driven by a generated table).
The bottom line is: a parser takes in a list of tokens and spits out a tree that shows the relation of tokens according to a grammar. For example, the code that we tokenized becomes:
Alright! We have a parse tree. Now that the compiler understands the structure of the program, it needs to make sure that the meaning of the program is valid.
Semantic Analysis is the next step in the compiler, which takes in the original parse tree and returns an annotated parse tree with other information stored. The goal of semantic analysis is to do as much as possible before runtime. Effectively, a semantic analyzer wants to determine all the information that will aid the execution of the program that can be figured out without running it.
So, what can a semantic analyzer do during the compilation process?
It can do type checking
- In any language where types are specified, it can make sure all functions and operations are called on things of the correct type.
- It can also store the type of variables so the compiler knows how to treat them beforehand.
It can annotate the parse tree
- Figure out any other information necessary to a compiler, whether that involves splitting the tree or storing separate values.
- In the semantic analysis of this calculator language, it adds a unique id to each value in the parse tree for better memory handling.
What can't it do?
It cannot figure out if the program will loop infinitely
- This is due to an interesting result in the theory of computation called the Halting Problem, no program can determine if a program will run forever.
It cannot check for all errors
- In the vein as the previous note, several types of errors can not be determined without knowing inputs or other things that will take place during the execution of the program, and so there will always be potential runtime errors.
Here's a simple annotated version of the previous parse tree:
Lexical, syntactic, and semantic analysis make up the front-end of the compiler, taking source code and creating an annotated parse tree and an intermediate code representation. The back-end of the compiler takes these and generates the final executable code.
The first step is code optimization. The compiler uses various tricks to make the intermediate code more efficient. If you've heard people talking about compiler optimizations to code (e.g., statements being converted to switch statements), this step is likely what they are referring to.
The final step in the compilation process is code generation. Here's where the compiler coverts the code to actual machine code. Much of the preceding steps were relatively machine independent, but code generation is very architecture specific. The compiler has to reference actual memory locations and registers, translating the intermediate representation to instructions that will be run on the CPU. Then we have machine code to execute on our computer.
Note: Interestingly, contrary to what their names might lead you to believe, compilers and interpreters have a lot in common. They both have a sort of front-end that generates an intermediate representation of the code. A compiler then generates machine code while an interpreter acts on that code.
Woohoo! That's a compiler! Now you should have a good sense of the steps of compilation. Now you can play around with this dummy console that can run your calculator language code (from the input at the beginning of scanning). Try the 'help' command to get started.
Hint: Try the 'run' command on the 'program.calc' file.Much of what we learned to create this explanation, including the original calculator language and grammar, came from Programming Language Pragmatics by Michael L. Scott. It's a really great textbook, definitely check it out if you want a deeper dive into PL theory.