so you want to make a

programming language

In this interactive website, we're going to provide an introduction to the theory behind programming languages. To make a programming language, we're going to need to make a compiler or an interpreter. If you have some experience with programming, you might have heard these words thrown around and developed an understanding of their purpose, but you might be wondering what's going on under the hood? If you haven't, don't worry, keep reading! You should have a better understanding by the end of this, even if some of the more technical aspects get lost on you. Plus, maybe you'll come back to this after a while and understand more.


AN INTRODUCTION

To understand the purpose of compilers, we should take a very brief look at the history of computing.

The first programs were directly written in machine code, a machine-specific set of binary instructions that correspond to actions on the CPU. A little later, programs were written in assembly, and then converted to machine code using an assembler. Assembly is more human-readable, but not by much, and still specific to computer architecture. Programmers needed a programming language that was easier to write and understand and could be run on any machine.

In order to use these more high-level, abstract programming languages, you need some way to get from those languages to the low-level instructions that actually run on your computer. This is where compilers come into play. A compiler will translate a universal high-level program to machine code specific to the machine.


#include <stdio.h> int main() { printf("Hello World!"); return 0; }
helloworld.c
a simple C hello world program
COMPILATION
00000000 cf fa ed fe 07 00 00 01 03 00 00 00 02 00 00 00 00000010 10 00 00 00 58 05 00 00 85 00 20 00 00 00 00 00 00000020 19 00 00 00 48 00 00 00 5f 5f 50 41 47 45 5a 45 00000030 52 4f 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000040 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000060 00 00 00 00 00 00 00 00 19 00 00 00 d8 01 00 00 00000070 5f 5f 54 45 58 54 00 00 00 00 00 00 00 00 00 00 00000080 00 00 00 00 01 00 00 00 00 40 00 00 00 00 00 00 00000090 00 00 00 00 00 00 00 00 00 40 00 00 00 00 00 00 000000a0 05 00 00 00 05 00 00 00 05 00 00 00 00 00 00 00 000000b0 5f 5f 74 65 78 74 00 00 00 00 00 00 00 00 00 00 000000c0 5f 5f 54 45 58 54 00 00 00 00 00 00 00 00 00 00 000000d0 65 3f 00 00 01 00 00 00 1d 00 00 00 00 00 00 00 000000e0 65 3f 00 00 00 00 00 00 00 00 00 00 00 00 00 00 000000f0 00 04 00 80 00 00 00 00 00 00 00 00 00 00 00 00 ...
a.out
part of the GCC compiled machine code


We can break this compilation process into five steps: Design and Grammars, Scanning, Parsing, Semantic Analysis, and Code Generation.


DESIGN and GRAMMARS

If you're going to make a programming language, you first need to figure out what that language is going to look like. You might have an idea of what the syntax will look like, but we need that translate that vision into cold, hard, formal mathematics. For that, we use grammars, specifically context-free grammars. Grammars can pop up a lot in more theoretical computer science, and for good reason, they are really helpful tools. As the name implies, there are parallels to natural language. You can think of a grammar as a formal representation of the rules of some language.

Note: Some of the more astute and contrarian among you might be thinking, does that mean there's some sort of context-sensitive or context-full grammar. Don't worry, there are various classes of grammars in The Chomsky Hierarchy, but the restrictions imposed on context-free grammars are useful for our purposes. Interestingly, the theory of computation and the theory of programming languages especially have a lot of interesting overlap at this point. If you're interested, definitely research the topic and take the relevant coursework.

All that talk is well and good, but what does a grammar actually look like?
Now, if this is your first time with grammars, it can be pretty hard to wrap your head around, but here's an explanation. A grammar consists of production rules like this:

Sa
a simple grammar production that only generates the character 'a'

On the left-hand side of the production arrow, we have a nonterminal. It is what it sounds like, you have to replace it with some production rules, you can't terminate with a non-terminal there.
On the right-hand side of the production, we have a string of terminals and nonterminals. Terminals are what the grammar is eventually generating. You can replace a non-terminal on the left with whatever is on the right.
Sometimes you want to have one nonterminal able to evaluate to multiple things:

Sa
Sb
a simple grammar that only generates the character 'a' or the character 'b'

Using the vertical bar to represent 'or,' we can simplify the above two productions into a single line:

Sa | b
a simple grammar that only generates the character 'a' or the character 'b'

Nonterminals can also evaluate to nothing. This is expressed using an epsilon (ϵ) to represent the empty string:

Sa | ϵ
a simple grammar that only generates the character 'a' or an empty string

Putting this all together, we can construct a toy grammar to play around with:

Sa S a | b S b | a | b | ϵ
a simple grammar that generates palindromes that consist only of the letters 'a' and 'b'

Now, let's play around with the palindrome grammar using the interactive demo. Click the S tag in the box below to reduce it to one of the options described by the grammar. (Remember: ϵ is the empty string, so it evaluates to nothing) Try to make the following:

a b b a
a b a a b a
a a b a b a a
S

Before we get to the grammar for the language, let's create a grammar to generate arithmetic expressions.

Let's start with the simple, an expression can be an id (variable) or a number literal.

Note: You can construct grammars to recognize ids and numbers, but we omit them here for simplicity's sake.
exprid | number
an expression can be an id or number

An expression can also be two expressions as operands for some operator (addition, subtraction, multiplication, and division). We can also use parenthesis to nest expressions.

exprexpr op expr | ( expr )
op+ | - | * | /

Putting that all together, we get a naive grammar for arithmetic expressions.

exprid | number | ( expr ) | expr op expr
op+ | - | * | /
a simple arithmetic expression grammar

However, this grammar is ambiguous, we could evaluate it in different ways to get the same expression. We want to make the evaluation (parse tree) unambiguous (think order of operations multiplication and division should be in a different "tier" than addition and subtraction). To achieve this, we can reorganize the grammar using term and factor nonterminals.

exprterm | expr add_op term
termfactor | term mult_op factor
factor( expr ) | id | number
add_op+ | -
mult_op* | /
a grammar to generate arithmetic expressions without ambiguity

To get a hang of this grammar, use the interactive demo below to try to make the following:
Hint: This grammar evaluates right to left, not left to right, so try to match the rightmost first and move inwards.

number + number
number * number + number
( number - number ) / number + number
expr

Now that we have some background on context free grammars, let's design a simple programming language.

Our language will be a simple calculator language that supports simple computations, variable assignment, reading input, and writing output.

/* this is some calculator language code */ read a b := a + 10.1 + (10-1)/3 * .7 // wow it works! write b
what we want a calculator language program to look like

Now, we can put that into a grammar. First, a program should be a list of staments marked with an end of program marker (we'll use $$).

programstmt_list $$

A statement list can be how ever long the user wants, and a statement can either be a variable assignment, a read, or a write.

stmt_liststmt stmt_list | stmt
stmtid := expr | read id | write expr

Putting that all together with the arithmetic expression grammar from before we get:

programstmt_list $$
stmt_liststmt stmt_list | stmt
stmtid := expr | read id | write expr
exprterm | expr add_op term
termfactor | term mult_op factor
factor( expr ) | id | number
add_op+ | -
mult_op* | /

This is the grammar that will define the rules for our calculator programming language. To really get an intuition for this grammar, try to generate the following. Don't worry if gets a little messy, it'll have to.

read id write expr $$
id := id + number / number $$
read id id := id * number write id $$
program


We did it! However, figuring out whether a program can be generated by our grammar by hand is a pain. Now, like a good computer scientist, let's make a machine do it for us.




SCANNING

Now that we've defined the calculator language, we want to be able to turn some calculator language program into machine code instructions that can run on our computer. The first step is lexical analysis, recognizing each of the parts of our program.

Let's say we have some text that should be a valid calculator language program. The computer needs to be able to determine the meaning of each part of the program. We need to split up the program into meaningful tokens. To do this, we create a scanner (also known as a tokenizer).

There is a sample program in the input below.
You can also write your own code and the site will use it going forward (you can come back and change it at any time).
Note: We can't have nice syntax highlighting yet, because we don't know what to highlight yet. We need tokens.

ERROR

What we want to do now is recognize each unit of the program and break them into tokens. For example, we would take the sample program from earlier and break it into tokens. Notice how they align with the syntax highlighting.

/* this is some calculator language code */ read a b := a + 10.1 + (10-1)/3 * .7 // wow it works! write b
a sample calculator language program

read a b := a + 10.1 + ( 10 - 1 ) / 3 * .7 write b
tokenized sample program

We could formalize token detection using regular expressions and finite automata. In practice language designers do, and use tools to generate a scanner (see LEX). However, the calculator language is simple enough to write an ad hoc scanner. An ad hoc scanner simply iterates over the input program and peeks ahead when it needs to.

Let's build up an ad hoc scanner. The simplest thing to recognize are single-character tokens: parentheses and most operators '()+-*'. We cannot recognize division here because with only a single character, we cannot distinguish between the division operator and a comment ('/' vs '//' or '/*').

iterating over program: if curr_char in '()+-*': add token

Now let's recognize division and comments. It gets a little trickier here, we need to peek ahead to see if the next character makes the token a comment. If so, ignore it, otherwise it is division.

iterating over program: ... if curr_char == '/': look ahead at the next character if next_char in '*/': read characters until matching */ or newline, ignoring them else: add division token

Assignment is similar.

iterating over program: ... if curr_char == ':': read the next character if next_char == '=': add assignment token else: raise error

We need to recognize ids (variables), but because they are a string of characters, we need to check if an id is a reserved keyword (read, write)

iterating over program: ... if curr_char is letter: read following numbers or letters if it is read or write: add respective read or write token else: add id token

Now all we need to do is recognize numbers. There are two cases: it starts with a digit or it is a floating point number starting with a decimal point.

iterating over program: ... if curr_char is number: read following numbers add number token if curr_char == '.': read following numbers ensure there is at most one decimal point add number token

Putting that all together, we can now tokenize a program. Plus, we can get some nice syntax highlighting. Play around with this scanning demo to try to understand the process better:

/* this is some calculator language code */ read a variable := a + 10.1+(10-1)/3*.7 write variable // wow it works! read a



Great! We have a tokenized program. Next, we need the computer to understand the structure of those tokens within the program.

PARSING

We've converted our program into tokens, but these tokens don't have any meaningful structure. We need to organize them in some way. Conveniently, we have a context-free grammar that describes the structure of calculator language program. To structure our tokens, we create a parser. Together, the scanner and the parser perform syntactic analysis, recognizing the structure of a program.

This form is inherently nested, so we use the ever-popular structure, a tree. More specifically, the parser generates a parse tree of tokens (similar to what you might create when evaluating a grammar, but in reverse). This parse tree is integral to the rest of compilation. Parsers can be generated from grammars, and in practice, they usually are (see YACC). Although we can generate a parser from any grammar, we need a compiler to run quickly. Therefore, language designers use specific classes of grammars. Two of these kinds of grammars are LL grammars and LR grammars.

These classes are important in practice, but we'll just cover them briefly here. LL stands for Left-to-right, Left-most derivation. LL parsers can be written by hand or generated. LR stands for Left-to-right, Right-most derivation. LR parsers are almost always generated. For this site, we use a variant of LR called an SLR(1) parser (hand written driven by a generated table).

The bottom line is: a parser takes in a list of tokens and spits out a tree that shows the relation of tokens according to a grammar. For example, the code that we tokenized becomes:


Alright! We have a parse tree. Now that the compiler understands the structure of the program, it needs to make sure that the meaning of the program is valid.



SEMANTIC ANALYSIS

Semantic Analysis is the next step in the compiler, which takes in the original parse tree and returns an annotated parse tree with other information stored. The goal of semantic analysis is to do as much as possible before runtime. Effectively, a semantic analyzer wants to determine all the information that will aid the execution of the program that can be figured out without running it.

So, what can a semantic analyzer do during the compilation process?

It can do type checking - In any language where types are specified, it can make sure all functions and operations are called on things of the correct type. - It can also store the type of variables so the compiler knows how to treat them beforehand. It can annotate the parse tree - Figure out any other information necessary to a compiler, whether that involves splitting the tree or storing separate values. - In the semantic analysis of this calculator language, it adds a unique id to each value in the parse tree for better memory handling.

What can't it do?

It cannot figure out if the program will loop infinitely - This is due to an interesting result in the theory of computation called the Halting Problem, no program can determine if a program will run forever. It cannot check for all errors - In the vein as the previous note, several types of errors can not be determined without knowing inputs or other things that will take place during the execution of the program, and so there will always be potential runtime errors.

Here's a simple annotated version of the previous parse tree:




CODE GENERATION

Lexical, syntactic, and semantic analysis make up the front-end of the compiler, taking source code and creating an annotated parse tree and an intermediate code representation. The back-end of the compiler takes these and generates the final executable code.
The first step is code optimization. The compiler uses various tricks to make the intermediate code more efficient. If you've heard people talking about compiler optimizations to code (e.g., statements being converted to switch statements), this step is likely what they are referring to.

The final step in the compilation process is code generation. Here's where the compiler coverts the code to actual machine code. Much of the preceding steps were relatively machine independent, but code generation is very architecture specific. The compiler has to reference actual memory locations and registers, translating the intermediate representation to instructions that will be run on the CPU. Then we have machine code to execute on our computer.

Note: Interestingly, contrary to what their names might lead you to believe, compilers and interpreters have a lot in common. They both have a sort of front-end that generates an intermediate representation of the code. A compiler then generates machine code while an interpreter acts on that code.



Woohoo! That's a compiler! Now you should have a good sense of the steps of compilation. Now you can play around with this dummy console that can run your calculator language code (from the input at the beginning of scanning). Try the 'help' command to get started.

Hint: Try the 'run' command on the 'program.calc' file.
pl@dummy-console:~$

Much of what we learned to create this explanation, including the original calculator language and grammar, came from Programming Language Pragmatics by Michael L. Scott. It's a really great textbook, definitely check it out if you want a deeper dive into PL theory.