I feel we should investigate replacing Bumpalo with a custom allocator. Notes on why, and some ideas:
Make it faster - remove alignment calculations
Oxc's AST has an unusual property: every single AST type is aligned on 8 (because they all contain a `Vec`, `Box`, `Atom` or `&str`).
We can exploit this property to remove alignment calculations when allocating into the arena. The bump pointer is always aligned on 8 after the last allocation, so no need to re-align each time you allocate.
This will only remove a couple of instructions (and possibly one branch), but given that allocating is a very hot path, it might well make a measurable difference.
Bumpalo is designed for a general use case, and so cannot perform this optimization.
NB: There are 2 exceptions to the "everything aligned on 8" rule:
- String data is aligned on 1. Either store strings in a separate memory space, or at start of arena chunks (see below), or align them on 8 manually.
- `BooleanLiteral` is aligned on 4. Easy fix by making it `#[repr(align(8))]`.
Reference: ser_raw performs this optimization.
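The alignment trick can be sketched as a minimal downward-bumping arena. This is an illustration under the stated assumptions, not a proposed implementation (names like `Arena` and `alloc_bytes` are hypothetical, it handles only a single chunk, and it leaks that chunk):

```rust
use std::alloc::{alloc, Layout};
use std::cell::Cell;

const ALIGN: usize = 8;

struct Arena {
    start: *mut u8,        // lowest address of the chunk
    cursor: Cell<*mut u8>, // bump pointer, always 8-aligned
}

impl Arena {
    fn new(capacity: usize) -> Arena {
        // Round capacity up to a multiple of 8 so the initial cursor is 8-aligned.
        let capacity = (capacity + ALIGN - 1) & !(ALIGN - 1);
        let layout = Layout::from_size_align(capacity, ALIGN).unwrap();
        let start = unsafe { alloc(layout) };
        assert!(!start.is_null());
        Arena {
            start,
            cursor: Cell::new(unsafe { start.add(capacity) }),
        }
    }

    // Because the cursor starts 8-aligned and every size is rounded up to a
    // multiple of 8, the cursor stays 8-aligned forever - so there is no
    // per-allocation re-alignment step, unlike a general-purpose allocator.
    fn alloc_bytes(&self, size: usize) -> *mut u8 {
        let size = (size + ALIGN - 1) & !(ALIGN - 1);
        let cursor = self.cursor.get();
        assert!(size <= cursor as usize - self.start as usize, "arena full");
        let new_cursor = unsafe { cursor.sub(size) };
        self.cursor.set(new_cursor);
        new_cursor
    }
}
```

Note the size round-up is a compile-time no-op whenever the type's size is already a multiple of 8, which is exactly the property the AST guarantees.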
Make it faster - alternative design
Blink allocator claims to be much faster than Bumpalo. There was an attempt to replace Bumpalo with Blink in Oxc last year (oxc-project/oxc#106), but it didn't show performance improvements. Nonetheless, I think it's worth investigating why, and whether there's anything we can learn from Blink.
I have some questions myself about Bumpalo's design. I don't understand why it uses double-indirection (pointer to a pointer) to get to the current chunk cursor. I wonder if reducing indirection could yield a speed boost.
Some other possible optimizations here: fitzgen/bumpalo#234
Fill chunks from both ends
- Fill from end for AST nodes (bumping downwards, same as Bumpalo does).
- Fill from start for strings (bumping upwards).
This would also have the nice property of grouping all string data together in one place in memory, which is probably advantageous for AST transfer.
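The two-ended scheme could look something like this minimal sketch (hypothetical, not an existing Oxc type - `Chunk`, `alloc_node` and `alloc_str` are illustrative names):

```rust
// One chunk filled from both ends: AST nodes bump downward from the top,
// string bytes bump upward from the bottom. The chunk is exhausted when
// the two cursors would cross.
struct Chunk {
    lo: *mut u8, // next free byte for string data (grows upward)
    hi: *mut u8, // bump pointer for AST nodes (grows downward)
}

impl Chunk {
    fn alloc_node(&mut self, size: usize) -> Option<*mut u8> {
        let size = (size + 7) & !7; // keep the node cursor 8-aligned
        if (self.hi as usize - self.lo as usize) < size {
            return None;
        }
        self.hi = unsafe { self.hi.sub(size) };
        Some(self.hi)
    }

    fn alloc_str(&mut self, len: usize) -> Option<*mut u8> {
        // String data is 1-aligned, so no rounding is needed here.
        if (self.hi as usize - self.lo as usize) < len {
            return None;
        }
        let ptr = self.lo;
        self.lo = unsafe { self.lo.add(len) };
        Some(ptr)
    }
}
```

This also sidesteps the "string data is aligned on 1" exception above: unaligned string bytes never disturb the 8-aligned node cursor.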
Align chunks on 4 GiB
AST transfer can work with the existing allocator, but it can be much faster if chunks are always allocated on 4 GiB boundaries. JS's `Number` cannot contain a `u64`, but if chunks are aligned on 4 GiB and there is only 1 chunk, only the bottom 32 bits of pointers are relevant, so pointer offset maths can be done with 32-bit ints, which is faster.
fitzgen declined to support this in Bumpalo. fitzgen/bumpalo#237
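To illustrate the arithmetic (this is just a demonstration of the idea, not AST transfer's actual code): if the single chunk is aligned on 4 GiB, any two pointers into it share their high 32 bits, so offsets can be computed entirely in 32-bit integers.

```rust
// Truncate a pointer to its offset-relevant low 32 bits.
fn low_bits(ptr: u64) -> u32 {
    ptr as u32
}

// For two pointers in the same 4 GiB-aligned chunk, the high bits cancel,
// so 32-bit subtraction yields the correct offset - the cheap maths that
// a JS Number (or a u32 on the Rust side) can handle.
fn offset_between(a: u64, b: u64) -> u32 {
    low_bits(a).wrapping_sub(low_bits(b))
}
```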
Cap max size at 4 GiB
In environments with virtual memory (all the targets Oxc supports except WASM), allocating a large chunk of memory only consumes virtual memory space. Actual memory pages are only consumed when they are written to.
So always allocate chunks of 4 GiB (except in WASM).
If it's possible to impose a size limit of 4 GiB on ASTs (#27), then we only ever need 1 chunk, which would:
- Massively simplify and streamline the allocator.
- That simplicity might help the compiler see that AST nodes can be constructed directly in the arena, rather than constructed on the stack and then memcpy-ed into the arena.
- Allow completely branchless allocation (fitzgen's suggestion).
- Open the door to storing pointers as `u32`s in `Box`, `Vec` and `Atom` (like V8 does - pointer compression).
If a 4 GiB size limit for ASTs is not viable, we could at least limit each individual chunk to 4 GiB. This seems like a viable limit - it would cap `Vec<Statement>` at 268 million entries (`(1 << 32) / 16`). I think we can assume even the largest application, bundled into a single file, will not contain more than 268 million top-level statements?!
Either way, this would allow reducing Vec to 16 bytes (#18).
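A minimal sketch of how a 16-byte Vec header could look under the 4 GiB cap (the field layout here is my assumption, not necessarily what #18 proposes):

```rust
use std::marker::PhantomData;

// With chunks capped at 4 GiB, length and capacity always fit in a u32,
// so the header shrinks to ptr (8) + len (4) + cap (4) = 16 bytes,
// versus 24 for std::vec::Vec (plus any allocator handle an arena Vec carries).
struct ArenaVec<T> {
    ptr: *mut T,
    len: u32,
    cap: u32,
    _marker: PhantomData<T>,
}
```

Combined with pointer compression, `ptr` could itself shrink to a `u32` chunk offset, though that trades away free interop with native pointers.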