The Life of a Programmer

Unwrapping intermediate representations

Abstract image by AnthonyArnaud

Intermediate representations (IRs) are in-between language steps on the path from source code to execution on a target machine. My article on why we use intermediate representations explains what IRs are and how we use them. Here I want to show what IR actually looks like, with some examples. Understanding IR at this level is helpful when talking about implementing compilers and virtual machines.

I’ll mention “machines” a lot in this article. In brief, an “abstract machine” is a theoretical construct that defines a set of instructions and their behaviour. A “virtual machine” is software that can execute the code of a particular abstract machine. Hardware is a physical machine that can actually run code: either the software of a virtual machine, or the code of a sufficiently low-level abstract machine directly.

Simple Code Converted to an Intermediate Representation

To understand IR, it’s best to take simple high-level examples and convert them to IR. I’m going to use a simple language and convert to a simple IR to keep the examples intelligible. Real languages and IRs get complex quickly.

Consider a simple functional language that calculates and outputs a result. We’ll start with the below example:

5 + 7

This code will output the value 12, the sum of 5 and 7.

Intermediate representations can take a variety of forms depending on the abstract machine they target. A key distinction lies in how the machine handles expression evaluation. Let’s examine how two approaches, stack-based and then register-based, handle this code.

Converted to a Stack-Based IR

A stack is an area of memory to which values can be pushed and popped. An IR that targets this abstract machine will use the stack to store the results of expressions.

This IR can represent our example:

push 5
push 7
add

The push command puts a value on the stack. The add command takes two values off the stack, computes their sum, and pushes the result back onto the stack. After the program finishes, it prints the top stack value to the console.

I’ll augment the above code by showing the state of the stack at each step.

// [] <-- The stack, it starts empty
push 5  // [5]
push 7 // [5, 7] 
add // [12]
// machine emits the top of the stack, "12"

It’s easy to implement a stack-based virtual machine: a simple loop and switch statement can take the above style of IR and execute it. The IR is defined by its operation on the abstract machine; the virtual machine is the “real” software that executes this code. With an IR this simple, we can implement the rules of the abstract machine directly, which makes the IR usable as-is, without further transformations.
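To make that concrete, here is a minimal Python sketch of such a virtual machine loop for the push/add IR above. The instruction names come from the article; the tuple encoding is my own assumption:

```python
def run(program):
    """Execute a tiny stack-based IR: a loop plus a dispatch on each opcode."""
    stack = []
    for instr in program:
        op = instr[0]
        if op == "push":
            stack.append(instr[1])
        elif op == "add":
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
        else:
            raise ValueError(f"unknown instruction: {op}")
    return stack[-1]  # the machine emits the top of the stack

# push 5 / push 7 / add
print(run([("push", 5), ("push", 7), ("add",)]))  # prints 12
```

A real virtual machine adds many more opcodes to this dispatch, but the loop-and-switch shape stays the same.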

Converted to a Register-Based IR

Another option, instead of a stack, is to track all expressions explicitly. This is like pretending our machine has an infinite number of named registers. No machine has infinite registers, but it makes for a good abstraction.

The same code in a register-based IR:

%a = 5
%b = 7
%c = add %a %b
@result %c

In our register-based machine, all values have to be assigned a “register”. This is simply a name that refers to a constant value or a simple expression.

I’ll use a different symbol for @result as it denotes something special. One property of many register-based machines is that registers are single-use: since they’re infinite, we never need to reuse any of them. Thus @result is not a register, but a way to tell our machine which value to emit.

Obviously, no hardware supports this model of computation. It’d also be tricky, and inefficient, to write a virtual machine that uses this IR directly. This IR requires an additional transformation before execution.
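Still, as a sketch of the model (not a real IR format), here is a direct Python evaluator for this register IR, with a dictionary standing in for the infinite register file; the opcode names and tuple encoding are my own:

```python
def run(program):
    """Evaluate the register IR directly: each register is assigned exactly once."""
    registers = {}
    result = None
    for instr in program:
        op = instr[0]
        if op == "set":        # %a = 5
            registers[instr[1]] = instr[2]
        elif op == "add":      # %c = add %a %b
            registers[instr[1]] = registers[instr[2]] + registers[instr[3]]
        elif op == "result":   # @result %c
            result = registers[instr[1]]
    return result

# %a = 5 / %b = 7 / %c = add %a %b / @result %c
print(run([("set", "%a", 5), ("set", "%b", 7),
           ("add", "%c", "%a", "%b"), ("result", "%c")]))  # prints 12
```

The dictionary grows without bound, one entry per register, which hints at why this form needs a register-allocation pass before it can run on real hardware.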

An Example with a Heap and a Call

Consider another example, with its corresponding IR for both the stack and register machines. In this example we introduce the concept of a heap, where we’ll store a value permanently. We’ve removed the default output, so we must call an explicit print function. We also introduce basic typing to ease into that concept.

a: Integer = 5
a += 7
print( a )

Converted to a Stack-Based IR

Here is the stack-based IR, along with the contents of the stack at each step.

// a: Integer = 5
allocate_integer @a
push_integer 5  // [5]
pop_integer_to @a  // []

// a += 7
push_integer_from @a  // [5]
push_integer 7  // [5, 7]
add_integer  // [12]
pop_integer_to @a  // []

// print( a )
push_integer_from @a // [12]
call print // eventually -> []

What we see in this code is that our high-level expressions have been reduced to step-wise operations involving the stack. Even the call to the print function has lost its argument; instead, the value is pushed to the stack and an argumentless call is made.
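Extending the earlier loop idea, here is a hedged Python sketch of a virtual machine with a heap for these instructions. The instruction names follow the article; the tuple encoding and the treatment of print as a built-in are my assumptions:

```python
def run(program):
    """Stack-based IR with a heap: @names map to heap slots in a dictionary."""
    stack, heap, output = [], {}, []
    for instr in program:
        op = instr[0]
        if op == "allocate_integer":
            heap[instr[1]] = 0                 # reserve a heap slot
        elif op == "push_integer":
            stack.append(instr[1])
        elif op == "pop_integer_to":
            heap[instr[1]] = stack.pop()
        elif op == "push_integer_from":
            stack.append(heap[instr[1]])
        elif op == "add_integer":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "call" and instr[1] == "print":
            output.append(stack.pop())         # print takes its argument from the stack
    return output

program = [
    ("allocate_integer", "@a"),
    ("push_integer", 5), ("pop_integer_to", "@a"),        # a: Integer = 5
    ("push_integer_from", "@a"), ("push_integer", 7),
    ("add_integer",), ("pop_integer_to", "@a"),           # a += 7
    ("push_integer_from", "@a"), ("call", "print"),       # print( a )
]
print(run(program))  # prints [12]
```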

Converted to a Register-Based IR

Here the same code in our register based IR.

// a : Integer = 5
allocate_integer @a
%ra = 5
store_integer_to @a %ra

// a += 7
%rb = load_integer_from @a
%rc = 7
%rd = add_integer %rb %rc
store_integer_to @a %rd

// print( a )
%re = load_integer_from @a
call (Integer) print %re

This is like what LLVM uses for its IR; LLVM is a compiler toolchain with a well-defined IR. The % names are known as registers, and in this machine we assume an infinite number of them. Optimizers often prefer this form as input; the graph structure of relationships between values and expressions allows for easier reordering than a stack.

In the final line, call (Integer) print %re, the (Integer) piece is the argument signature of the function we’re calling. We’re relying on the abstract machine to create a suitable calling convention. Some machines use registers, some use the stack, some use a combination. This part differs between operating systems, even those running on the same hardware. Thus, it’s a detail best left to the final transformation, and not our IR.

In practice, both real machines and virtual machines have an assortment of registers, a stack, and a heap area. Which approach an intermediate representation takes depends on the goal of the IR, and on where you want your trouble points. Though the Java VM is stack-based, its JIT compiler can optimize for register-based machines; and though LLVM IR is register-based, LLVM has a JIT that quickly converts it for mixed register and stack machines.

I’ll use a register-based IR from here on. Though it seems complex at first, it makes the flow of execution and the expression values easier to see.

Many toolchains will let you emit this format from source code. For example, search for “emitting LLVM IR”. Try playing around with very, very simple files and emitting their IR. Make sure to turn off optimizations, otherwise it’ll be hard to recognize how the source was converted.

Representing Branch Blocks in IR

Branching is a common feature of languages, and “flattening” is a common feature of IR. The source language expresses blocks as nested sections attached to conditionals or loops. At some point these structures have to disappear. In the IR, you’ll see that those nested sections have been flattened to a labelled series of sequential instructions.

In this example I’ll skip some setup to focus on the conditional part.

if a > 5:
	b = 1

Converted to IR:

	%r1 = cmp_integer_gt @a 5
	branch %r1 label_true label_false

label_true:
	assign_integer @b 1
	jump label_end
	
label_false:
	jump label_end

label_end:
	...

This is a form similar to LLVM IR. The labels introduce blocks of code. Every block is complete, issuing either a return statement or a branch statement as its final instruction, and only that one branching statement is allowed per block. This means the order of these blocks isn’t important; like the registers, they form a graph. This is a good form for applying optimizations.

Also note that label_false does nothing but jump to label_end. I could have specified label_end directly in the branch statement instead. You’ll see soon why this form is helpful.
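Here is one way a virtual machine might execute such order-independent blocks, sketched in Python with my own minimal encoding: the blocks live in a dictionary keyed by label, and every block ends in a terminator naming the next block.

```python
def run_blocks(blocks, env, entry="entry"):
    """Execute labelled blocks; only the labels matter, never storage order."""
    label = entry
    while label is not None:
        for instr in blocks[label]:
            op = instr[0]
            if op == "cmp_gt":      # %dest = cmp_integer_gt @src constant
                env[instr[1]] = env[instr[2]] > instr[3]
            elif op == "assign":    # assign_integer @dest constant
                env[instr[1]] = instr[2]
            elif op == "jump":      # jump label
                label = instr[1]
                break
            elif op == "branch":    # branch %cond true_label false_label
                label = instr[2] if env[instr[1]] else instr[3]
                break
            elif op == "end":       # terminate execution
                label = None
                break
    return env

blocks = {
    "label_end": [("end",)],       # dictionary order is irrelevant
    "entry": [("cmp_gt", "%r1", "@a", 5),
              ("branch", "%r1", "label_true", "label_false")],
    "label_true": [("assign", "@b", 1), ("jump", "label_end")],
    "label_false": [("jump", "label_end")],
}
print(run_blocks(blocks, {"@a": 7, "@b": 0})["@b"])  # prints 1
```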

If you’ve seen old assembler code, or are familiar with Java byte-code, you might be expecting simpler branching instructions, where you jump to an offset in the code. I can rewrite the above to look like that.

	%r1 = cmp_integer_gt @a 5
	branch_not %r1 label_false
	assign_integer @b 1

label_false:
	...

Here we rely on fallthrough behaviour and the negated condition. If the expression is false, we skip the following code. In bytecode the labels will also be gone, likely replaced with a byte offset instead: for example, branch_not %r1 +16, assuming label_false is 16 bytes away.
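The fallthrough form can be sketched as flat bytecode in Python, where the label is replaced by an instruction offset (an index here, simplifying the byte offset; the encoding is my own):

```python
def run(code, env):
    """Bytecode-style execution: a program counter and relative forward jumps."""
    pc = 0
    while pc < len(code):
        instr = code[pc]
        op = instr[0]
        if op == "cmp_gt":
            env[instr[1]] = env[instr[2]] > instr[3]
        elif op == "branch_not":           # skip forward when the condition is false
            if not env[instr[1]]:
                pc += instr[2]
                continue
        elif op == "assign":
            env[instr[1]] = instr[2]
        pc += 1
    return env

code = [
    ("cmp_gt", "%r1", "@a", 5),   # %r1 = cmp_integer_gt @a 5
    ("branch_not", "%r1", 2),     # branch_not %r1 +2 (past the assign)
    ("assign", "@b", 1),          # assign_integer @b 1
]                                 # label_false: ...
print(run(code, {"@a": 7, "@b": 0})["@b"])  # prints 1
```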

Compared to the always-branching form, this fallthrough form is closer to how the machine code will look. However, if you have a function with hundreds of these blocks (potentially thousands), you don’t want to rely on order, especially if an optimizer wants to move the blocks around.

Branching with a finally Block

Understanding the relationship of the generated IR to the source code quickly becomes challenging. We naturally nest conditions, loops, and switches in our code, all of which need to be flattened in the IR. We also have implied branches: for example, C++ object destructors are called when objects go out of scope, and Java has a finally keyword. I’ll use the finally keyword for an example.

function proc:
	try:
		a = get_value()
		if a > 10:
			return 1
		
	finally:
		print_notice()
		
	return 0

I need to keep these examples trivial, as typical IR grows rapidly in size.

entry:
	alloc_integer @return_code 0
	
	%r1 = call get_value
	%r2 = cmp_integer_gt %r1 10
	branch %r2 label_true label_false
	
label_true:
	store_integer_to @return_code 1
	jump label_finally
	
label_false:
	jump label_finally
	
label_finally:
	call print_notice
	%r3 = load_integer_from @return_code
	branch %r3 post_cond end_block
	
post_cond:
	return 1
	
end_block:
	return 0

Here I want to show how statements that interrupt the normal flow quickly create overwhelming IR. Consider that a return could exist inside any nesting of blocks, with many finally blocks, or many object destructors. On top of this, we add exceptional flow, which is like wrapping every block in another block of conditional flow. Yikes!

When I wrote the Leaf compiler, I found going from the abstract syntax tree directly to LLVM’s IR too difficult: it was too complicated to translate instructions and flatten them into blocks simultaneously. So I did the most logical thing and introduced another IR, the Leaf IR. The most significant difference between Leaf IR and LLVM IR was that Leaf IR retained the tree-like structure of the initial code. In the initial IR generation I could focus on non-flow construction, then in a second phase focus on flattening the flow.

IR looks like Assembler

If you have experience with assembly programming, you will recognize the similarities. IR is effectively assembly code for an abstract machine. It uses a reduced set of instructions and maps well to the way we think a computer behaves.

We think of assembly code as something that maps precisely to a specific machine; there is a near 1:1 relationship between the statements of the assembly code and the machine code. Yet most IRs, in compiler toolchains and virtual machines, have instructions that are higher level than these machine-specific assembly languages.

Except for a few special cases, we rarely write in machine assembly languages, but we can use them to debug the output of a compiler. That’s true of IRs as well: we rarely write anything in the IR directly. The only reason we want a visual form of the language is for debugging; otherwise, we store the IR in a compact byte-encoded form.

In the strictest sense, an intermediate representation, or intermediate language, is any representation between source code and what executes on real hardware. We tend not, however, to refer to parse trees or abstract syntax trees as IRs. We keep IR to mean a reduced form, where most of the high-level constructs have been converted to a simpler series of instructions. This form is fit for quick transformation by a final compilation layer and then execution.

It’s this reduced form of IR that I’ve shown in this article. Virtual machines and compilers use this type of IR as a base, and it will help you understand some of my other articles that talk about IR. Ambiguity surrounds what truly constitutes an IR; you can read more about that in my higher-level article “Why do we use intermediate representations / languages?”.

Please join me on Discord to discuss, or ping me on Mastodon.
