The Life of a Programmer

Why language interpreters and virtual machines are slow

Interpreters and virtual machines are programs that execute code without having it compiled to native machine code. Like a CPU, they step through instructions and execute them, but in software rather than hardware. This gives them flexibility in how they implement the language’s abstract machine, but results in them being slower than native code. In this article, I want to look at the reasons they are slow, and some of the common approaches taken to speed them up.

Do not consider this a criticism of interpreters and virtual machines. They're an undeniable part of our technical ecosystem. This article intends only to describe their key performance challenges.

Intermediate Representation (IR)

Regardless of the source format, some compilation is necessary before the system can execute the code. Parsing involves several steps, and some of them are quite slow. Though this contributes to the startup cost of interpreted languages, it’s not what I want to focus on here.

We’re going to assume the code has already been compiled to an intermediate representation (IR), in particular, one suitable for running on a virtual machine. This is a reduced form that can look something like assembly code. A virtual machine reads this version of the code and executes it. If you’re unfamiliar with IRs, refer to my article about “unwrapped intermediate representations”. Processing the IR is the major factor in virtual machine performance.

There is one other form of interpreter that doesn’t use an IR, instead walking the abstract syntax tree directly. These tend to be reserved for domain-specific languages and operate under different constraints. For general-purpose code, they perform worse than IR-based interpreters. They are an interesting case, and quite valuable, but outside the scope of this article.

The Virtual Machine

An interpreter implements a virtual machine that runs the IR. It reads the instructions and executes them, stepping through in order and branching as required. The interpreter also needs to provide memory facilities like the heap, stack, and registers. It has to provide everything the IR expects of the target platform — it’s an implementation of the abstract machine.

At the heart of this virtual machine are a loop and a switch statement. The pair operate similarly to how a CPU works, with a pointer to the next instruction, and the ability to dispatch the instruction.

Let’s look at how this works through a simple bit of IR targeting a stack-based virtual machine. We’ll assume we already have a variable a defined in memory, without worrying about where.

push_integer_from a
push_integer 5
add_integer
pop_integer_to a

Without worrying about the byte format, assume these commands are packed into an array, where the first item is the command, followed by the arguments. For example, the first instruction is [push_integer_from, a] and the second one [push_integer, 5]. The commands are specific to the exact type of their arguments and know how to interpret them.

Let’s assume a is a string here. Something might have already converted it into an offset in memory, or a numbered virtual register. As a string though, we’ll need to look up the memory by named index. We’ll come back to this later.
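To make the packing concrete, here’s a sketch of the whole example as a single flat Python list, keeping a as a string name for now. This is only an illustrative encoding; a real VM would use a compact byte format rather than strings.

```python
# The example IR packed into one flat array: each opcode is followed by
# its arguments inline. add_integer takes no arguments, so it occupies
# a single slot.
instruction_list = [
    "push_integer_from", "a",
    "push_integer", 5,
    "add_integer",
    "pop_integer_to", "a",
]

print(len(instruction_list))  # 7
```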

Given that, let’s look at a snippet of a core virtual machine loop.

while( instruction_index < instruction_list.length ):
	switch( instruction_list[instruction_index] ):
		...
		case push_integer_from:
			stack.push_integer( memory.get_integer( instruction_list[instruction_index+1] ) )
			instruction_index += 2
			
		case push_integer:
			stack.push_integer( instruction_list[instruction_index+1] )
			instruction_index += 2
			
		case add_integer:
			tmp_a = stack.pop_integer()
			tmp_b = stack.pop_integer()
			stack.push_integer( tmp_a + tmp_b )
			instruction_index += 1
			
		case pop_integer_to:
			tmp_a = stack.pop_integer()
			memory.set_integer( instruction_list[instruction_index+1], tmp_a )
			instruction_index += 2
		...

If you step through the example IR, you can see how this ends up adding the value of a and 5 and storing the result back in a.
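As a concrete sketch, here’s how that loop could look in runnable Python, assuming the flat-array packing described earlier and a simple dictionary standing in for named memory. The stack is a plain list; a real VM would use something far more compact.

```python
# A minimal, hypothetical Python rendering of the core loop. Memory is a
# name-to-value dictionary; the stack is a plain list. Opcodes with one
# argument advance the index by 2, add_integer by 1.
def run(instruction_list, memory):
    stack = []
    instruction_index = 0
    while instruction_index < len(instruction_list):
        op = instruction_list[instruction_index]
        if op == "push_integer_from":
            stack.append(memory[instruction_list[instruction_index + 1]])
            instruction_index += 2
        elif op == "push_integer":
            stack.append(instruction_list[instruction_index + 1])
            instruction_index += 2
        elif op == "add_integer":
            tmp_a = stack.pop()
            tmp_b = stack.pop()
            stack.append(tmp_a + tmp_b)
            instruction_index += 1
        elif op == "pop_integer_to":
            memory[instruction_list[instruction_index + 1]] = stack.pop()
            instruction_index += 2

memory = {"a": 10}
run(["push_integer_from", "a", "push_integer", 5,
     "add_integer", "pop_integer_to", "a"], memory)
print(memory["a"])  # 15
```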

Instruction stepping

One piece of overhead of a virtual machine is the need to track an instruction pointer, step through the instructions, and dispatch them. These are all things that the CPU does natively. While it may not seem like much, these add instructions the CPU needs to execute between every statement of our IR. The magnification of instruction count is significant on its own.

The command dispatch can also be a problem. While a CPU has hardwired pathways, we need to write a bit of code for each of our instructions. This code has to be in memory, and will end up in the lowest level caches of the CPU. A virtual machine must keep this code size tiny to ensure it never swaps out of those caches. Regular cache misses stall the execution of our code.

The memory our core loop occupies is the most scarce memory in the system. Having the virtual machine code reside there reduces the amount available. The pointer to our instructions, as well as the pointer to our stack memory, are likely also in registers, which are also in short supply.

Don’t worry about the functions

One thing not to worry about is the overhead of those functions I’ve used. The virtual machine itself will be compiled to machine code, and any reasonably optimizing compiler will inline all of those functions. For example, the add_integer case would likely end up as only 2-4 machine-code instructions that directly access memory via pointers in registers.

CPUs are also quite good at optimizing small pieces of code, which they’ll do on our core loop. This sounds good, but it’s actually a performance problem. Because we have our own instruction loop, the true code we’re executing, our IR, is somewhat invisible to the CPU. The optimizations about branching, instruction reordering, and memory lookup that apply to compiled programs will not apply to the executing IR — or at least not to the same degree.

The core virtual machine loop will run fast. It’ll take advantage of all the compiler optimizations used to compile it, as well as the CPU optimizations. But we’re one step removed from the code we are actually running, that which is encoded in the IR. And this one bit of abstraction can have a significant impact on overall speed.

Just-in-time Compilation

The clearest answer to this overhead is to compile the IR into machine code. This is what just-in-time compilation does. It takes IR code and compiles it to the target machine. Allowing code to run directly on the CPU, without the VM loop, is a huge performance improvement. It eliminates the most expensive part of VM execution.

Granted, you likely can’t compile the entire program, or you don’t want to make the user wait while this happens. The VM will then compile parts of the program as they’re needed. It will then deal with going in and out of its core loop, which introduces new issues of synchronization. Unless the code is trivially small though, it’s hard to imagine how this new overhead outweighs running fragments natively on the CPU.

It’s unlikely the JIT will spend as much time optimizing as a native compiler would. It may also miss out on some global optimizations if it’s working in chunks. This can leave JIT code significantly slower than natively compiled components. It depends a lot on what that code is doing. As a rough way to think about this, consider a program compiled to native code with no optimizations, versus the same program compiled at the highest optimization level. The difference can be staggering. Nonetheless, even unoptimized native code will significantly outperform the VM loop.
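To get a feel for what removing the dispatch loop buys, here’s a toy sketch: it “compiles” the IR once into a chain of Python closures, so execution no longer consults a switch per instruction. This only illustrates the idea of eliminating per-instruction dispatch; a real JIT emits native machine code, which is where the large gains come from.

```python
# A toy dispatch-removal "compiler": each IR instruction is translated
# once into a closure, and execution just runs the chain of closures.
# (Illustrative only; not actual machine-code generation.)
def compile_ir(instruction_list):
    steps = []
    i = 0
    while i < len(instruction_list):
        op = instruction_list[i]
        if op == "push_integer_from":
            name = instruction_list[i + 1]
            steps.append(lambda stack, memory, name=name: stack.append(memory[name]))
            i += 2
        elif op == "push_integer":
            value = instruction_list[i + 1]
            steps.append(lambda stack, memory, value=value: stack.append(value))
            i += 2
        elif op == "add_integer":
            steps.append(lambda stack, memory: stack.append(stack.pop() + stack.pop()))
            i += 1
        elif op == "pop_integer_to":
            name = instruction_list[i + 1]
            steps.append(lambda stack, memory, name=name: memory.__setitem__(name, stack.pop()))
            i += 2

    def run(memory):
        stack = []
        for step in steps:  # no per-instruction decoding remains here
            step(stack, memory)
    return run

program = compile_ir(["push_integer_from", "a", "push_integer", 5,
                      "add_integer", "pop_integer_to", "a"])
memory = {"a": 10}
program(memory)
print(memory["a"])  # 15
```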

Variable Access

In our example, we referenced the variable a by name. While a virtual machine can certainly do this, it comes at a significant cost: every time we wish to access the variable a, we’d have to do a lookup in a dictionary, or hash table, to get its address. This would be a major speed issue.

Just like natively compiled code though, an IR can encode its variables as offsets, either from the local function stack, or a base global pointer. When the IR runs, it must handle address rewriting or placement, similar to what native executables do. So the instruction that reads a, push_integer_from a, could end up as something like push_integer_from 0x04502, where 0x04502 is the address of a in the heap.
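Here’s a sketch of that rewriting step, assuming a simple one-pass resolver that assigns each name a numbered slot on first use. The slot numbers stand in for real memory offsets; after this pass, execution can index a flat array instead of hashing a name on every access.

```python
# Hypothetical address resolution: replace string operands with integer
# slots, so variable access becomes array indexing rather than a
# dictionary lookup on every instruction.
def resolve_addresses(instruction_list):
    slots = {}      # name -> slot number, used only at "compile" time
    resolved = []
    i = 0
    while i < len(instruction_list):
        op = instruction_list[i]
        resolved.append(op)
        if op in ("push_integer_from", "pop_integer_to"):
            name = instruction_list[i + 1]
            if name not in slots:
                slots[name] = len(slots)
            resolved.append(slots[name])
            i += 2
        elif op == "push_integer":
            resolved.append(instruction_list[i + 1])
            i += 2
        else:
            i += 1
    return resolved, slots

resolved, slots = resolve_addresses(
    ["push_integer_from", "a", "push_integer", 5,
     "add_integer", "pop_integer_to", "a"])
print(resolved)
# ['push_integer_from', 0, 'push_integer', 5, 'add_integer', 'pop_integer_to', 0]
```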

What the virtual machine can miss out on though are addressing modes. Native instructions often have ways to indirectly access memory via a register, address and increment, or a variety of other optimized memory access forms. An IR could offer a series of optimized forms, but the more you have, the larger the virtual machine gets, and the more critical cache memory it consumes.

A virtual machine also cannot allocate registers efficiently. When compiling to native code, the optimizer knows a lot about the target and can lay out local memory to make good use of the registers. However, the VM doesn’t yet know which code it will execute, so the best it can do is allocate registers in the best way to run its own code. Perhaps the IR could have some register logic encoded in it, but if it intends to be cross-platform, it won’t know which registers are available. Thus, a virtual machine will end up spending more time reading and writing values from the stack or heap while executing code.

The Stack Memory

If we create a naive implementation of the stack in the above code, we’ll have another source of problems. If we wish to offer memory protection, which we should, then any time we push to, or pop from, the stack, we need to check whether the address is in bounds. Adding a simple comparison to every memory operation can take a significant toll on performance — though how much depends somewhat on the CPU’s branch prediction ability.
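As a sketch of what that naive implementation looks like, here’s a minimal bounds-checked stack; the comparison on every push and pop is the cost in question. The class and method names here are just illustrative, mirroring the pseudocode above.

```python
# A naive bounds-checked stack: every push and pop pays a comparison
# branch before touching memory.
class CheckedStack:
    def __init__(self, size):
        self.data = [0] * size
        self.top = 0  # index of the next free slot

    def push_integer(self, value):
        if self.top >= len(self.data):   # the per-operation check
            raise MemoryError("stack overflow")
        self.data[self.top] = value
        self.top += 1

    def pop_integer(self):
        if self.top <= 0:                # the per-operation check
            raise MemoryError("stack underflow")
        self.top -= 1
        return self.data[self.top]

s = CheckedStack(4)
s.push_integer(7)
print(s.pop_integer())  # 7
```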

A native stack doesn’t incur this cost. An initial amount is allocated when the program starts. Any time there is an overflow, an additional amount is automatically allocated. If there is an underflow, we get a memory access fault.

The operating system should provide this native functionality to a virtual machine as well. For example, in Linux, the mmap call can use the MAP_GROWSDOWN flag to produce an area of memory that functions like a stack. I’d hope that all modern OS’s expose all the memory features we’d want, including growable stacks, as well as no-execute flags and write protection for the IR instruction memory.

So while the stack could be a problem, it shouldn’t be with the right virtual machine implementation. What will be a problem is that the VM has to use the stack more often than compiled code does. This goes back to not knowing how best to allocate the registers, meaning it will push to and pop from the stack more often. Even if the stack use is as fast as native, it can’t be as fast as just not using it.

Other memory, like the heap, or global constants, shouldn’t be an issue. The heap is allocated dynamically by any program, so doing it via the VM core loop won’t make much of a difference. Static data is also the same: a block of data given an address at startup. Though again, the VM may not take advantage of all addressing modes, so while memory management adds no overhead, iterating over an array might.

Almost as fast, but not quite

It’s not unreasonable to think, then, that a VM could just take the time to precalculate all memory addresses, use a native stack implementation, and then JIT all the code into native code. That’d effectively be delayed compilation, not really a virtual machine. It would be fast, but still not as fast as precompiled code. It’d nonetheless be highly useful, since you’d have a nice cross-platform IR that works well enough for many domains.

That cross-platform IR is where the remaining difficulty comes from. An IR built for native compilation will attempt to maintain as much information as possible from the source language. An IR built for a VM will have a form suited for fast VM execution or JIT compilation. That difference in an IR will make a difference in how well it can be compiled to native code.

For added fun and confusion, it’s possible for an optimizer to recreate some of that missing information, or the VM-loop and JIT system to dynamically detect optimization paths. If optimizing at runtime, you can optimize for what is actually executing, rather than what might be executing. For example, CPUs themselves do branch prediction to improve execution speed. It doesn’t always work, but it often works well enough to make a difference.

And the not-so-fine details

All of the previous shies away from which language is being executed: this overhead applies regardless of the source language. The instruction loop, variable access, and stack management have a core set of responsibilities that don’t change between languages. A virtual machine carries a significant cost no matter what it runs.

But we know some languages perform worse than others. Languages offer widely different constructs which can affect execution time, whether they’re run in a virtual machine or compiled natively. The biggest one is dynamic typing: a huge cost. But I’m going to spin that off into a separate article, as it’s an orthogonal problem. It just happens that many of the common VM-based languages also offer costly constructs.

Please join me on Discord to discuss, or ping me on Mastodon.
