
Custom arena allocator #197

@overlookmotel

Description


I feel we should investigate replacing Bumpalo with a custom allocator. Notes on why, and some ideas:

Make it faster - remove alignment calculations

Oxc's AST has an unusual property: every single AST type is aligned on 8 (because they all contain a Vec, Box, Atom, or &str).

We can exploit this property to remove alignment calculations when allocating into the arena. The bump pointer is always aligned on 8 after the last allocation, so no need to re-align each time you allocate.

This will only remove a couple of instructions (and possibly one branch), but given that allocating is a very hot path, it might well make a measurable difference.

Bumpalo is designed for a general use case, and so cannot perform this optimization.
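As a rough illustration of the idea (a minimal sketch with hypothetical names, not Oxc's actual implementation): if the cursor starts 8-aligned and every allocation size is rounded up to a multiple of 8, the cursor stays 8-aligned forever, so the per-allocation alignment maths disappears:

```rust
use std::cell::Cell;

/// Hypothetical bump allocator for types whose alignment is <= 8.
struct FixedAlignBump {
    buf: Vec<u64>,       // backing storage; naturally 8-aligned
    cursor: Cell<usize>, // next free offset in bytes; always a multiple of 8
}

impl FixedAlignBump {
    fn with_capacity(bytes: usize) -> Self {
        Self { buf: vec![0u64; (bytes + 7) / 8], cursor: Cell::new(0) }
    }

    /// Allocate `size` bytes. Because the cursor is always a multiple of 8,
    /// no per-allocation re-alignment is needed - only a round-up of the size,
    /// which preserves the invariant for the next allocation.
    fn alloc(&self, size: usize) -> *mut u8 {
        let rounded = (size + 7) & !7; // keep cursor 8-aligned
        let offset = self.cursor.get();
        assert!(offset + rounded <= self.buf.len() * 8, "arena exhausted");
        self.cursor.set(offset + rounded);
        unsafe { (self.buf.as_ptr() as *mut u8).add(offset) }
    }
}
```

(A real implementation would bump a raw pointer and handle chunk growth; the point here is just that the align-then-bump step collapses to a single add.)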

NB: There are 2 exceptions to the "everything aligned on 8" rule:

  1. String data is aligned on 1. Either store strings in a separate memory region, at the start of arena chunks (see below), or pad them to 8-byte alignment manually.
  2. BooleanLiteral is aligned on 4. Easy fix by making it #[repr(align(8))].

Reference: ser_raw performs this optimization.
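For the second exception, the fix is a one-line attribute (illustrative struct, not the real node definition):

```rust
// Hypothetical stand-in for the real AST node: #[repr(align(8))] raises the
// struct's alignment from 4 (or lower) to 8, so it satisfies the
// "everything aligned on 8" invariant.
#[repr(align(8))]
struct BooleanLiteral {
    value: bool, // real node also carries a span; omitted here
}
```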

Make it faster - alternative design

Blink allocator claims to be much faster than Bumpalo. There was an attempt to replace Bumpalo with Blink in Oxc last year (oxc-project/oxc#106), but it didn't show performance improvements. Nonetheless, I think it's worth investigating why, and whether there's anything we can learn from Blink.

I have some questions myself about Bumpalo's design. I don't understand why it uses double indirection (a pointer to a pointer) to get to the current chunk's cursor. I wonder if reducing indirection could yield a speed boost.

Some other possible optimizations here: fitzgen/bumpalo#234

Fill chunks from both ends

  • Fill from end for AST nodes (bumping downwards, same as Bumpalo does).
  • Fill from start for strings (bumping upwards).

#37

This would also have the nice property of grouping all string data together in one place in memory, which is probably advantageous for AST transfer.
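A sketch of the dual-ended scheme (hypothetical names, single fixed-size chunk, no growth handling): strings bump upwards from the start, nodes bump downwards from the end, and the chunk is exhausted when the two cursors meet:

```rust
use std::cell::Cell;

/// Hypothetical chunk filled from both ends.
struct DualEndChunk {
    buf: Vec<u64>,
    low: Cell<usize>,  // next free byte for string data (grows upwards)
    high: Cell<usize>, // one past the last free byte for nodes (grows downwards)
}

impl DualEndChunk {
    fn new(bytes: usize) -> Self {
        let words = (bytes + 7) / 8;
        Self { buf: vec![0; words], low: Cell::new(0), high: Cell::new(words * 8) }
    }

    /// Strings: 1-aligned, packed with no padding, bumping upwards.
    fn alloc_str(&self, len: usize) -> *mut u8 {
        let offset = self.low.get();
        assert!(offset + len <= self.high.get(), "chunk full");
        self.low.set(offset + len);
        unsafe { (self.buf.as_ptr() as *mut u8).add(offset) }
    }

    /// Nodes: 8-aligned, bumping downwards from the end (as Bumpalo does).
    fn alloc_node(&self, size: usize) -> *mut u8 {
        let rounded = (size + 7) & !7;
        let new_high = self.high.get().checked_sub(rounded).expect("chunk full");
        assert!(new_high >= self.low.get(), "chunk full");
        self.high.set(new_high);
        unsafe { (self.buf.as_ptr() as *mut u8).add(new_high) }
    }
}
```

Since the chunk size is a multiple of 8, the downward cursor also stays 8-aligned for free, so this composes with the alignment optimization above.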

Align chunks on 4 GiB

AST transfer can work with the existing allocator, but it could be much faster if chunks were always allocated on 4 GiB boundaries. JS's Number type cannot hold a u64, but if chunks are aligned on 4 GiB and there is only 1 chunk, only the bottom 32 bits of pointers are significant, so pointer offset maths can be done with 32-bit ints, which is faster.

fitzgen declined to support this in Bumpalo. fitzgen/bumpalo#237
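To illustrate the arithmetic (a hedged sketch, assuming a single chunk whose base address is a multiple of 1 << 32): masking a pointer's low 32 bits recovers its offset into the chunk directly, with no 64-bit subtraction of the base needed:

```rust
/// If the chunk base is 4 GiB-aligned (base & 0xFFFF_FFFF == 0), then for any
/// pointer into the chunk, the low 32 bits ARE the offset from the base.
/// This is the operation a JS consumer could perform entirely in 32-bit ints.
fn offset_in_chunk(ptr: u64) -> u32 {
    (ptr & 0xFFFF_FFFF) as u32
}
```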

Cap max size at 4 GiB

In environments with virtual memory (all the targets Oxc supports except WASM), allocating a large chunk of memory only consumes virtual memory space. Actual memory pages are only consumed when they are written to.

So always allocate chunks at 4 GiB size (except on WASM).

If it's possible to impose a size limit of 4 GiB on ASTs (#27), then we only ever need 1 chunk, which would:

  • Massively simplify and streamline the allocator.
  • That simplicity might help the compiler see that AST nodes can be constructed directly in arena, rather than constructing on stack and then memcpy into arena.
  • Allow completely branchless allocation (fitzgen's suggestion).
  • Open the door to storing pointers as u32s in Box, Vec and Atom (like V8 does - pointer compression).

If a 4 GiB size limit for ASTs is not viable, we could at least limit each individual chunk to 4 GiB. This seems like a workable limit - it would cap Vec<Statement> at 268 million entries ((1 << 32) / 16). I think we can assume even the largest application, bundled into a single file, will not contain more than 268 million top-level statements?!

Either way, this would allow reducing Vec to 16 bytes (#18).
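A sketch of what the pointer compression could look like (hypothetical CompressedBox, not a proposal for the actual API): with one chunk capped at 4 GiB, a u32 offset from the chunk base is enough to reconstruct any pointer, halving the size of every Box-like field:

```rust
use std::marker::PhantomData;

/// Hypothetical compressed pointer: 4 bytes instead of 8, valid because the
/// whole arena fits in a single chunk of at most 4 GiB.
struct CompressedBox<T> {
    offset: u32, // byte offset from the chunk base
    _marker: PhantomData<*mut T>,
}

impl<T> CompressedBox<T> {
    fn compress(base: *const u8, ptr: *const T) -> Self {
        let off = (ptr as usize) - (base as usize);
        debug_assert!(off <= u32::MAX as usize, "offset exceeds 4 GiB chunk");
        Self { offset: off as u32, _marker: PhantomData }
    }

    fn decompress(&self, base: *const u8) -> *const T {
        unsafe { base.add(self.offset as usize) as *const T }
    }
}
```

The same trick would shrink the pointer fields of Vec and Atom, which is what makes the 16-byte Vec (#18) possible.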
