The fastest way to match characters on ARM processors?

Consider the following problem. Given a string, you must match all of the ASCII white-space characters (\t, \n, \r, and the space) and some characters important in JSON (:, ,, [, ], {, }). JSON is a text-based data format used for web services. A toy JSON document looks as follows.

{
  "name": "Alice",
  "age": 30,
  "email": "[email protected]",
  "tags": ["developer", "python", "open-source"],
  "active": true
}

We want to solve this problem using SIMD (single-instruction-multiple-data) instructions. With these instructions, you can compare a block of 16 bytes with another block of 16 bytes in one instruction.

It is a subproblem of indexing a JSON document in the fast simdjson library. We call this task vectorized classification. We also use the same technique when parsing DNS records, and so forth. In the actual simdjson library, we must also handle strings and quotes, and it gets more complicated.

I need to define what I mean by ‘matching’ the characters. In my case, it is enough to get, for each block of 64 bytes, two 64-bit masks: one for spaces and one for important characters. To illustrate, let me consider a 16-byte variant:

{"name": "Ali" }
1000000100000001 // important characters
0000000010000010 // spaces

Thus, I want to get back the numbers 0b1000000100000001 and 0b0000000010000010 in binary format (they are 33025 and 130 in decimal).
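To make the target concrete, here is a scalar sketch (not the vectorized code) that computes these masks. It maps the first character to the most significant bit, matching the illustration above; the function name char_mask is my own, and production code typically uses the opposite, least-significant-bit-first convention.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Scalar reference: set one bit per input position whenever the byte
// belongs to the given character set. The first character maps to the
// most significant bit, matching the illustration above.
uint16_t char_mask(std::string_view input, std::string_view set) {
    uint16_t mask = 0;
    for (size_t i = 0; i < input.size() && i < 16; i++) {
        if (set.find(input[i]) != std::string_view::npos) {
            mask |= uint16_t(1) << (15 - i);
        }
    }
    return mask;
}
```

With the input {"name": "Ali" } and the set :,[]{}, this returns 33025; with the white-space set, it returns 130, as expected.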

I refer you to Langdale and Lemire (2019) for how to do it using the conventional SIMD instructions available on ARM processors (NEON). Their key idea is a table-driven, branch-free classifier: for each byte, use SIMD table lookups to map each nibble to a bitmask, and compare to decide whether the byte belongs to a target set (whitespace or structural JSON characters). This avoids doing many separate equality comparisons per character.

There is now a better way on recent ARM processors.

The 128-bit version of NEON was introduced in 2011 with the ARMv8-A architecture (AArch64). Apple played an important role, and ARMv8-A first shipped in the Apple A7 chip of the iPhone 5S. You can count on all 64-bit ARM processors to support NEON, which is convenient. (There are 32-bit ARM processors, but they are mostly used for embedded systems, not mainstream computing.)

ARM NEON is good but getting old. It is no match for the AVX-512 instruction set available on x64 (AMD and Intel) processors. Not only do the AVX-512 instructions support wider registers (64 bytes as opposed to ARM NEON’s 16 bytes), but they also have more powerful instructions.

But ARM has something else to offer: the Scalable Vector Extension (SVE) and its successor, SVE2. Though SVE was first introduced in 2016, it took until 2022 before we had practical access. The Neoverse V1 core used by the Amazon Graviton 3 is the first one I had access to. Soon after, we got SVE2 with the Neoverse V2 and N2 cores. Today it is readily available: the Graviton 4 on AWS, the Microsoft Cobalt 100 on Azure, the Google Axion on Google Cloud (and newer Google Cloud ARM CPUs), the NVIDIA Grace CPU, as well as several chips from Qualcomm, MediaTek, and Samsung. Notice who I am not including? Apple. For unclear reasons, Apple has not yet adopted SVE2.

I have mixed feelings about SVE/SVE2. Like the RISC-V vector extension, it breaks with the approach of ARM NEON and x64 SIMD, which use fixed-length registers (16 bytes, 32 bytes, 64 bytes). This means that you are expected to write your code without knowing how wide the registers are.

This is convenient for chip makers because it gives them the option of adjusting the register size to better suit their market. Yet it seems to have failed: while the Graviton 3 processor from Amazon had 256-bit registers, all the commodity chips since then have had 128-bit registers.

On the plus side, SVE/SVE2 has masks (predicates) a bit like AVX-512, so you can load and process data in only a subset of the register lanes. This solves a long-standing problem with earlier SIMD instruction sets, namely handling inputs whose length is not a multiple of the register size, and it makes tail handling much nicer. Being able to operate on only part of the register also allows clever optimizations. Sadly, unlike AVX-512, SVE/SVE2 does not let you move masks to and from a general-purpose register efficiently. That is a direct consequence of the variable-length register design: even though your registers might always be 128-bit and contain 16 bytes, the instruction set is not allowed to assume that a mask fits in a 16-bit word.

I was pessimistic regarding SVE/SVE2 until I learned that it is designed to be interoperable with ARM NEON. Thus you can use the SVE/SVE2 instructions with your ARM NEON code. This works especially well if you know that the SVE/SVE2 registers match the ARM NEON registers (16 bytes).

For the work I do, there are two SVE2 instructions that are important: match and nmatch. In their 8-bit versions, they work as follows: given two vectors a and b, each containing up to 16 bytes, match sets a predicate bit to true for each position i where a[i] equals any of the bytes in b. In other words, b acts as a small lookup set, and match tests set membership for every byte of a simultaneously. The nmatch instruction is the logical complement: it sets a predicate bit to true wherever a[i] does not match any byte in b. A single instruction thus replaces a series of equality comparisons and OR-reductions that would otherwise be needed. In the code below, op_chars holds the six structural JSON characters and ws_chars holds the four whitespace characters; calling svmatch_u8 once on a 16-byte chunk d produces a predicate that has a true bit exactly where the input byte is a structural character. The code uses SVE2 intrinsics: compiler-provided C/C++ functions that map almost one-to-one to CPU SIMD instructions, so you get near-assembly control without writing assembly.

// : , [ ] { }
uint8_t op_chars_data[16] = {
    0x3a, 0x2c, 0x5b, 0x5d, 0x7b, 0x7d, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0
};
// \t \n \r ' '
uint8_t ws_chars_data[16] = {
    0x09, 0x0a, 0x0d, 0x20, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0
};

// load the characters in SIMD registers
svuint8_t op_chars = svld1_u8(svptrue_b8(), op_chars_data);
svuint8_t ws_chars = svld1_u8(svptrue_b8(), ws_chars_data);

// load data
// const char * input = ...
svbool_t pg = svptrue_pat_b8(SV_VL16);
svuint8_t d = svld1_u8(pg, input);

// matching
svbool_t op = svmatch_u8(pg, d, op_chars);
svbool_t ws = svmatch_u8(pg, d, ws_chars);

In this code snippet, svuint8_t is an SVE vector type containing unsigned 8-bit lanes (bytes). svbool_t is an SVE predicate (mask) type. svptrue_b8() builds a predicate where all 8-bit lanes are active, and svld1_u8(pg, ptr) loads bytes from memory into an SVE vector, using predicate pg to decide which lanes are actually read.

If you paid attention thus far, you might have noticed that my code is slightly wrong since I am including 0 in the character sets. But it is fine as long as I assume that the zero byte is not present in the input. In practice, I could just repeat one of the characters, or use a bogus character that I do not expect to see in my inputs (such as the byte value 0xFF, which cannot appear in a valid UTF-8 string).

In standard SVE/SVE2, op and ws are predicates, not integer masks. A practical trick is to materialize each predicate as bytes (0xFF for true, 0x00 for false), for example with svdup_n_u8_z.

svuint8_t opm = svdup_n_u8_z(op, 0xFF);
svuint8_t wsm = svdup_n_u8_z(ws, 0xFF);

When SVE vectors are 128 bits, this byte vector maps naturally to a NEON uint8x16_t via svget_neonq_u8, and from there we can build scalar bitmasks efficiently with NEON operations (masking plus pairwise additions). Repeating this over four 16-byte chunks gives the two 64-bit masks needed for a 64-byte block.
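For illustration, here is a portable scalar emulation of that NEON movemask step (my own sketch of the recipe, not the simdjson code): AND each byte of the 0xFF/0x00 mask with a per-lane bit weight, then collapse the lanes with pairwise additions, as vandq_u8 followed by three vpaddq_u8 would. Note that this convention maps the first byte to the least significant bit of the result.

```cpp
#include <cstdint>

// Portable emulation of the NEON "movemask" trick: AND each byte of the
// 0xFF/0x00 mask with a per-lane bit weight, then reduce with pairwise
// additions until only two bytes remain (low byte, high byte).
uint16_t movemask16(const uint8_t mask[16]) {
    const uint8_t weights[16] = {1, 2, 4, 8, 16, 32, 64, 128,
                                 1, 2, 4, 8, 16, 32, 64, 128};
    uint8_t t[16];
    for (int i = 0; i < 16; i++) t[i] = mask[i] & weights[i];
    // three rounds of pairwise additions collapse 16 lanes to 2 bytes
    for (int width = 16; width > 2; width /= 2) {
        for (int i = 0; i < width / 2; i++) t[i] = t[2 * i] + t[2 * i + 1];
    }
    return uint16_t(t[0]) | (uint16_t(t[1]) << 8);
}
```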

How does it compare with pure NEON code? I compiled the routine processing blocks of 64 bytes using different compilers.

method GCC 16 LLVM clang 20
simdjson (NEON) 69 66
SVE/SVE2 (new!) 42  52

Interestingly, GCC 16 even adopts SVE instructions in the pure NEON code, which suggests that recompiling old NEON code while targeting SVE/SVE2 could be beneficial.

I was hoping to benchmark both compilers, but I wanted to run my benchmarks quickly on an AWS Graviton 4, and I did not want to compile GCC 16 from source. So I kept LLVM clang 20, which was readily available in the images that AWS provides (I picked Red Hat 10).

The AWS Graviton 4 processor is a Neoverse V2 processor. Google has its own Neoverse V2 processors in its cloud. In my tests, it ran at 2.8 GHz.

My benchmark generates a random string of 1 MiB and computes the bitmaps indicating the positions of the characters. It is available on GitHub. My results are as follows.

method GB/s instructions/byte instructions/cycle
simdjson (NEON) 11.4 0.94 3.5
SVE/SVE2 (new!) 14.4  0.67 3.8

So the SVE/SVE2 approach is about 25% faster than the NEON equivalent and uses 30% fewer instructions, and that’s without any kind of fancy optimization. Importantly, the code is relatively simple thanks to the match instruction.

It might be that the SVE2 match instruction is the fastest way to match characters on ARM processors.

Credit: This post was motivated by a sketch by user liuyang-664 on GitHub.

References

Langdale, G., & Lemire, D. (2019). Parsing gigabytes of JSON per second. The VLDB Journal, 28(6), 941-960.

Koekkoek, J., & Lemire, D. (2025). Parsing millions of DNS records per second. Software: Practice and Experience, 55(4), 778-788.

Lemire, D. (2025). Scanning HTML at Tens of Gigabytes Per Second on ARM Processors. Software: Practice and Experience, 55(7), 1256-1265.

A brief history of C/C++ programming languages

Initially, we had languages like Fortran (1957), Pascal (1970), and C (1972). Fortran was designed for number crunching and scientific computing. Pascal was restrictive with respect to low-level access (it was deliberately “safe”, as meant for teaching structured programming). So C won out as a language that allowed low-level/unsafe programming (pointer arithmetic, direct memory access) while remaining general-purpose enough for systems work like Unix. To be fair, Pascal had descendants that are still around, but C clearly dominated.
Object-oriented programming became viewed as the future in the 1980s and 1990s. It turned into some kind of sect.
But C was not object-oriented.
So we got C++, which began as “C with Classes”. C++ had templates, enabling generic programming and compile-time metaprogramming. This part of the language makes C++ quite powerful, but somewhat difficult to master (with crazy error messages).
Both C and C++ became wildly successful, but writing portable applications remained difficult — you often had to target Windows or a specific Unix variant. This was a problem for a company like Sun Microsystems that sold Unix boxes and wanted to compete against the juggernaut that Microsoft was becoming.
So Java came along in 1995. It was positioned as a safe, portable alternative to C++: it eliminated raw pointer arithmetic, added mandatory garbage collection, array bounds checking everywhere, and ran on a virtual machine (JVM) with just-in-time compilation for performance.
The “write once, run anywhere” promise addressed C/C++ portability pain points directly. To this day, Java remains a strong solution for writing portable enterprise and server-side code.
We also got JavaScript in 1995. Despite the name, it has almost nothing in common with Java semantically. It is best viewed as separate from the C/C++ branch. Python is similarly quite different.
Microsoft would eventually come up with C# in 2000. It belongs to the same C-family syntax tradition as C++ and Java, but with support for ahead-of-time compilation in modern .NET. It also allows guarded pointer access within explicitly marked unsafe scopes. At this point, C# can be seen as “C++ with garbage collection” in spirit. It even competes against C++ in the game industry thanks to Unity.
Google came up with Go. It is much like a simpler, modern C: garbage-collected, with built-in bounds checking on slices/arrays, and pointers allowed but without arbitrary arithmetic in safe code (the unsafe package exists for low-level needs).
Later, Apple came up with Swift. It has C++-like performance and syntax goals but adds modern safety features (bounds checking by default, integer overflow panics in debug mode) and uses Automatic Reference Counting (ARC) for memory management. Swift replaced Objective-C but I still view it as a C++ successor.
At about the same time, we got Rust. Like Swift, it forgoes the tracing garbage collection used by Java, C#, and Go. It relies instead on compile-time ownership and borrowing rules, with the tradeoff that you can still leak memory through reference cycles. We also got Zig, which makes memory usage fully explicit.
I think that it is fairer to describe Rust and Zig as descendants of C rather than C++. Both are much more powerful than C, of course, and the evolution of programming languages is complex. Still, they are C-like programming languages.
To this day, in much of the industry, the dominant programming languages for performance-critical, systems, enterprise, and infrastructure work remain C, C++, Java, and C#. By the Lindy effect (the longer something has survived, the longer it is likely to continue surviving), these languages, especially C, now over 50 years old, are still going to be around for a long time.

Can your AI rewrite your code in assembly?

Suppose you have several strings and you want to count the number of instances of the character ! in your strings. In C++, you might solve the problem as follows if you are an old-school programmer.

size_t c = 0;
for (const auto &str : strings) {
    c += std::count(str.begin(), str.end(), '!');
}

You can also get fancier with ranges.

for (const auto &str : strings) {
    c += std::ranges::count(str, '!');
}

And so forth.

But what if you want to go faster? Maybe you’d want to rewrite this function in assembly. I decided to do so, and to have fun using both Grok and Claude as my AIs, setting up a friendly competition.

I started with my function and then I asked AIs to optimize it in assembly. Importantly, they knew which machine I was on, so they started to write ARM assembly.

By repeated prompting, I got the following functions.

  • count_classic: Uses C++ standard library std::count for reference.
  • count_assembly: A basic ARM64 assembly loop (byte-by-byte comparison). Written by Grok.
  • count_assembly_claude: Claude’s SIMD-optimized version using NEON instructions (16-byte chunks).
  • count_assembly_grok: Grok’s optimized version (32-byte chunks).
  • count_assembly_claude_2: Claude’s further optimized version (64-byte chunks with multiple accumulators).
  • count_assembly_grok_2: Grok’s latest version (64-byte chunks with improved accumulator handling).
  • count_assembly_claude_3: Claude’s most advanced version with additional optimizations.

You get the idea.
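The AI-generated assembly is not reproduced here, but the core trick in the later versions can be sketched in portable C++: process 64-byte chunks with several independent accumulators so the processor can overlap the additions, and finish the tail byte by byte. This is my own sketch of the idea, not Grok’s or Claude’s code.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Chunked counting with four independent accumulators: each accumulator
// handles a 16-byte quarter of every 64-byte chunk, so the additions can
// proceed in parallel; the tail is handled byte by byte.
size_t count_bangs(std::string_view s) {
    size_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    size_t i = 0;
    for (; i + 64 <= s.size(); i += 64) {
        for (size_t j = 0; j < 16; j++) {
            c0 += (s[i + j] == '!');
            c1 += (s[i + 16 + j] == '!');
            c2 += (s[i + 32 + j] == '!');
            c3 += (s[i + 48 + j] == '!');
        }
    }
    size_t c = c0 + c1 + c2 + c3;
    for (; i < s.size(); i++) c += (s[i] == '!');
    return c;
}
```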

So, how is the performance? I use random strings of up to 1 kilobyte. In all cases, I test that the functions provide the correct count. I did not closely examine the code, so it is possible that mistakes could be hiding in the code.

I record the average number of instructions per string.

name instructions/string
classic C++ 1200
claude assembly 250
grok assembly 204
claude assembly 2 183
grok assembly 2 176
claude assembly 3 154

By repeated optimization, I reduced the number of instructions by a factor of eight. The running time decreases similarly.

Can we get the AIs to rewrite the best option in C? Yes, although you need SIMD intrinsics. So there is no benefit to leaving the code in assembly in this instance.

An open question is whether the AIs could find optimizations that are not possible if we use a higher-level language like C or C++. It is an intriguing question that I will seek to answer later. For the time being, the AIs can beat my C++ compiler!

My source code is available.

A Fast Immutable Map in Go

Consider the following problem. You have a large set of strings, maybe millions. You need to map these strings to 8-byte integers (uint64). These integers are given to you.

If you are working in Go, the standard solution is to create a map. The construction is trivial, something like the following loop.

m := make(map[string]uint64, N)
for i, k := range keys {
    m[k] = values[i]
}

One downside is that the map may use over 50 bytes per entry.

In important scenarios, we might have the following conditions: the map is large (a million entries or more), you do not need to modify it dynamically (it is immutable), and all queried keys are in the set. In such conditions, you can reduce the memory usage down to almost the size of the values, so about 8 bytes per entry. One fast technique is the binary fuse filter.

I implemented it as a Go library called constmap that provides an immutable map from strings to uint64 values using binary fuse filters. This data structure is ideal when you have a fixed set of keys at construction time and need fast, memory-efficient lookups afterward. You can even construct the map once and save it to disk, so you do not pay the construction cost each time you need it.

The usage is just as simple.

package main

import (
    "fmt"
    "log"

    "github.com/lemire/constmap"
)

func main() {
    keys := []string{"apple", "banana", "cherry"}
    values := []uint64{100, 200, 300}

    cm, err := constmap.New(keys, values)
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println(cm.Map("banana")) // 200
}

The construction time is higher (as expected for any compact data structure), but lookups are optimized for speed. I ran benchmarks on my Apple M4 Max processor to compare constmap lookups against Go’s built-in map[string]uint64. The test uses 1 million keys.

Data Structure Lookup Time Memory Usage
ConstMap 7.4 ns/op 9 bytes/key
Go Map 20 ns/op 56 bytes/key

ConstMap is nearly 3 times faster than Go’s standard map for lookups! And we reduced the memory usage by a factor of 6.

The ConstMap may not always be faster, but it should always use significantly less memory. If it can reside in CPU cache while the map cannot, then it will be significantly faster.

Source Code The implementation is available on GitHub: github.com/lemire/constmap.

JSON and C++26 compile-time reflection: a talk

The next C++ standard (C++26) is getting exciting new features. One of these features is compile-time reflection. It is ideally suited to serialize and deserialize data at high speed. To test it out, we extended our fast JSON library (simdjson) and we gave a talk at CppCon 2025. The video is out on YouTube.

Our slides are also available.

How many branches can your CPU predict?

Modern processors have the ability to execute many instructions per cycle, on a single core. To be able to execute many instructions per cycle in practice, processors predict branches. I have made the point over the years that modern CPUs have an incredible ability to predict branches.

It makes benchmarking difficult because if you test on small datasets, you can get surprising results that might not work on real data.

My go-to benchmark is a function like so:

while (howmany != 0) {
    val = generate_random_value();
    if (val & 1) { *buffer++ = val; } // branch taken for odd values
    howmany--;
}

The processor tries to predict the branch (if clause). Because we use random values, the processor should mispredict one time out of two.

However, if we repeat the benchmark multiple times, always using the same random values, the processor learns the branches. How many branches can processors learn? I tested three recent processors.

  • The AMD Zen 5 processor can predict perfectly 30,000 branches.
  • The Apple M4 processor can predict perfectly 10,000 branches.
  • Intel Emerald Rapids can predict perfectly 5,000 branches.

Once more I am disappointed by Intel. AMD is doing wonderfully well on this benchmark.

My source code is available.

Prefix sums at tens of gigabytes per second with ARM NEON

Suppose that you have a record of your sales per day. You might want to get a running record where, for each day, you are told how many sales you have made since the start of the year.

day sales per day running sales
1 10$ 10$
2 15$ 25$
3 5$ 30$

Such an operation is called a prefix sum or a scan.

Implementing it in C is not difficult. It is a simple loop.

  for (size_t i = 1; i < length; i++) {
    data[i] += data[i - 1];
  }

How fast can this function be? We can derive a speed limit rather simply: to compute the current value, you must have computed the previous one, and so forth.

data[0] -> data[1] -> data[2] -> ...

At best, you require one CPU cycle per entry in your table. Thus, on a 4 GHz processor, you might process 4 billion integer values per second. It is an upper bound but you might be able to reach close to it in practice on many modern systems. Of course, there are other instructions involved such as loads, stores and branching, but our processors can execute many instructions per cycle and they can predict branches effectively. So you should be able to process billions of integers per second on most processors today.

Not bad! But can we do better?

We can use SIMD instructions. SIMD instructions are special instructions that process several values at once. All 64-bit ARM processors support NEON instructions. NEON instructions can process four integers at once, if they are packed in one SIMD register.

But how do you do the prefix sum on a 4-value register? You can do it with two shifts and two additions. In theory, it scales as log(N) where N is the number of elements in a vector register.

input   = [A B   C     D]
shift1  = [0 A   B     C]
sum1    = [A A+B B+C   C+D]
shift2  = [0 0   A     B+A]
result  = [A A+B A+B+C A+B+C+D]

You can then extract the last value (A+B+C+D) and broadcast it to all positions so that you can add it to the next value.
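In portable C++ (emulating one 4-lane register with an array), the recipe above looks as follows; a real implementation would use NEON lane shifts (vextq_u32 with a zero vector) and additions instead of loops.

```cpp
#include <array>
#include <cstdint>

// Prefix sum within a 4-lane "register": at each step, shift the lanes
// up by k positions (inserting zeros) and add, doubling k each time.
std::array<uint32_t, 4> prefix4(std::array<uint32_t, 4> v) {
    for (int k = 1; k < 4; k *= 2) {
        std::array<uint32_t, 4> shifted{}; // zero-filled
        for (int i = k; i < 4; i++) shifted[i] = v[i - k];
        for (int i = 0; i < 4; i++) v[i] += shifted[i];
    }
    return v;
}
```

For the input {A, B, C, D} = {1, 2, 3, 4}, the result is {1, 3, 6, 10}.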

Is this faster than the scalar approach? We have 4 instructions in sequence, plus at least one instruction if you want to use the total sum in the next block of four values.

Thus the SIMD approach might be worse. It is disappointing.

A solution might be to scale up over many more integer values.

ARM NEON has interleaved load and store instructions. They let you load 16 values at once while gathering all of the first values together, all of the second values together, and so forth.

original data : ABCD EFGH IJKL MNOP
loaded data   : AEIM BFJN CGKO DHLP

Then I can do a prefix sum over the four blocks in parallel. It takes three instructions. At the end of the three instructions, we have one register which contains the local sums:

A+B+C+D E+F+G+H I+J+K+L M+N+O+P

And then we can apply our prefix sum recipe on this register (4 instructions). You might end up with something like 8 sequential instructions per block of 16 values.

It is theoretically twice as fast as the scalar approach.

In C with intrinsics, you might code it as follows.

void neon_prefixsum_fast(uint32_t *data, size_t length) {
  uint32x4_t zero = {0, 0, 0, 0};
  uint32x4_t prev = {0, 0, 0, 0};

  for (size_t i = 0; i < length / 16; i++) {
    uint32x4x4_t vals = vld4q_u32(data + 16 * i);

    // Prefix sum inside each transposed ("vertical") lane
    vals.val[1] = vaddq_u32(vals.val[1], vals.val[0]);
    vals.val[2] = vaddq_u32(vals.val[2], vals.val[1]);
    vals.val[3] = vaddq_u32(vals.val[3], vals.val[2]);

    // Now vals.val[3] contains the four local prefix sums:
    //   vals.val[3] = [s0=A+B+C+D, s1=E+F+G+H, 
    //                  s2=I+J+K+L, s3=M+N+O+P]

    // Compute prefix sum across the four local sums 
    uint32x4_t off = vextq_u32(zero, vals.val[3], 3);
    uint32x4_t ps = vaddq_u32(vals.val[3], off);       
    off = vextq_u32(zero, ps, 2);                      
    ps = vaddq_u32(ps, off);

    // Now ps contains cumulative sums across the four groups
    // Add the incoming carry from the previous 16-element block
    ps = vaddq_u32(ps, prev);

    // The add vector to apply to the original lanes is the prefix
    // up to the previous group; its first lane must be the old
    // carry, so compute it before updating prev
    uint32x4_t add = vextq_u32(prev, ps, 3);

    // Prepare carry for next block: broadcast the last lane of ps
    prev = vdupq_laneq_u32(ps, 3);

    // Apply carry/offset to each of the four transposed lanes
    vals.val[0] = vaddq_u32(vals.val[0], add);
    vals.val[1] = vaddq_u32(vals.val[1], add);
    vals.val[2] = vaddq_u32(vals.val[2], add);
    vals.val[3] = vaddq_u32(vals.val[3], add);

    // Store back the four lanes (interleaved)
    vst4q_u32(data + 16 * i, vals);
  }

  scalar_prefixsum_leftover(data, length, 16);
}

Let us try it out on an Apple M4 processor (4.5 GHz).

method billions of values/s
scalar 3.9
naive SIMD 3.6
fast SIMD 8.9

So the SIMD approach is about 2.3 times faster than the scalar approach. Not bad.

My source code is available on GitHub.

Appendix. Intrinsics

Intrinsic What it does
vld4q_u32 Loads 16 consecutive 32-bit unsigned integers from memory and deinterleaves them into 4 separate uint32x4_t vectors (lane 0 = elements 0,4,8,12,…; lane 1 = 1,5,9,13,… etc.).
vaddq_u32 Adds corresponding 32-bit unsigned integer lanes from two vectors (a[i] + b[i] for each of 4 lanes).
vextq_u32 Extracts a vector from a pair: concatenates a and b, then takes 4 lanes starting from lane n of the 8-lane concatenation. With a zero vector as the first argument, it implements a lane shift that inserts zeros.
vdupq_laneq_u32 Broadcasts (duplicates) the value from the specified lane (0–3) of the input vector to all 4 lanes of the result.
vdupq_n_u32 Sets all 4 lanes of the result to the same scalar value (not used directly above; a common way to build zero or broadcast vectors).

Text formats are everywhere. Why?

The Internet relies on text formats. Thus, we spend a lot of time producing and consuming data encoded in text.

Your web pages are HTML. The code running in them is JavaScript, sent as text (JavaScript source), not as already-parsed code. Your emails, including their attachments, are sent as text (your binary files are sent as text).

It does not stop there. The Python code that runs your server is stored as text. It queries data by sending text queries. It often gets back the answer as text that must then be decoded.

JSON is the universal data interchange format online today. We share maps as JSON (GeoJSON).

Not everything is text, of course. There is no common video or image format that is shared as text. Transmissions over the Internet are routinely compressed to binary formats. There are popular binary formats that compete with JSON.
But why is text dominant?

It is not because, back in the 1970s, programmers did not know about binary formats.

In fact, we did not start with text formats. Initially, we worked with raw binary data. Those of us old enough will remember programming in assembly using raw byte values.

Why did text win?

1. Text is efficient.

In the XML era, when everything had to be XML, there were countless proposals for binary formats. People were sometimes surprised to find that the binary approach was not much faster in practice. Remember that many text formats date back to an era when computers were much slower. Had text been a performance bottleneck, it would not have spread. Of course, there are cases where text makes things slower. You then have a choice: optimize your code further or transition to another format. Often, both are viable.

It is easy to make wrong assumptions about binary formats, such as that you can consume them without any parsing or validation. If you pick up data from the Internet, you must assume that it could have been sent by an adversary or someone who does not follow your conventions.

2. Text is easy to work with.

If you receive text from a remote source, you can often transform it, index it, search it, quote it, version it… with little effort and without in-depth knowledge of the format. Text is often self-documenting.

In an open world, when you will never speak with the person producing the data, text often makes everything easier and smoother.

If there is an issue to report and the data is in text, you can usually copy-paste the relevant section into a message. Things are much harder with a binary format.

You can use newline characters in URLs

We locate web content using special addresses called URLs. We are all familiar with addresses like https://google.com. Sometimes, URLs can get long and become difficult to read. Thus, we might be tempted to format them in HTML using newline and tab characters, like so:

<a href="https://lemire.me/blog/2026/02/21/
        how-fast-do-browsers-correct-utf-16-strings/">my blog post</a>

It will work.

Let us refer to the WHATWG URL specification that browsers follow. It makes two statements in sequence.

  1. If input contains any ASCII tab or newline, invalid-URL-unit validation error.
  2. Remove all ASCII tab or newline from input.

Notice how it reports an error if there is a tab or newline character, but continues anyway? The specification says that “a validation error does not mean that the parser terminates”, and it encourages systems to report errors somewhere. Effectively, the error is ignored, although it might be logged. Thus our HTML is fine in practice.
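In other words, a parser is expected to note the error and strip the characters. A minimal sketch of that preprocessing step might look like this (the function name is mine, not from any URL library; ASCII tab or newline means TAB, LF, and CR):

```cpp
#include <string>

// Sketch of the WHATWG preprocessing step: report (here, count) any
// ASCII tab or newline as a validation error, then strip them all.
std::string strip_tab_newline(const std::string& input, int* errors) {
    std::string out;
    for (char c : input) {
        if (c == '\t' || c == '\n' || c == '\r') {
            if (errors) (*errors)++;  // the error is noted, not fatal
        } else {
            out += c;
        }
    }
    return out;
}
```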

The following is also fine:

<a href="https://go
ogle.c
om" class="button">Visit Google</a>

You can also use tabs. But you cannot arbitrarily insert any other whitespace.

Yet there are cases when you can use any ASCII whitespace character: data URLs. Data URLs (also called data URIs) embed small files—like images, text, or other content—directly inside a URL string, instead of linking to an external resource. Data URLs are a special kind of URL and they follow different rules.

A typical data URL might look like data:image/png;base64,iVBORw0KGgoAAAANSUhEUg... where the string iVBORw0KGgoAAAANSUhEUg... is the binary data of the image that has been encoded with base64. Base64 is a text format that can represent any binary content: we use 64 ASCII characters so that each character encodes 6 bits. Your binary email attachments are base64 encoded.

On the web, when decoding a base64 string, you ignore all ASCII whitespaces (including the space character itself). Thus you can embed a PNG image in HTML as follows.

<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAA
                                 QAAAAECAIAAAAmkwkpAAAAEUl
                                 EQVR4nGP8z4AATEhsPBwAM9EB
                                 BzDn4UwAAAAASUVORK5CYII=" />

This HTML code is valid and will insert a tiny image in your page.
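For illustration, here is a sketch of such a whitespace-forgiving base64 decoder (a simplified toy of my own, not a spec-complete implementation): skip ASCII whitespace, map each remaining character back to its 6-bit value, and emit a byte whenever 8 bits have accumulated.

```cpp
#include <cstdint>
#include <string>

// Forgiving base64 decoding: ASCII whitespace is skipped before each
// character is mapped back to its 6-bit value; padding ends the data.
std::string base64_decode_forgiving(const std::string& in) {
    static const std::string alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    uint32_t buffer = 0;
    int bits = 0;
    for (char c : in) {
        if (c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f') continue;
        if (c == '=') break;                 // padding: end of data
        size_t v = alphabet.find(c);
        if (v == std::string::npos) break;   // invalid character: stop (sketch)
        buffer = (buffer << 6) | uint32_t(v);
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out += char((buffer >> bits) & 0xFF);
        }
    }
    return out;
}
```

Decoding "aGVs\n bG8=" yields "hello" despite the embedded newline and space.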

But there is more. A data URL can also be used to insert an SVG image. SVG (Scalable Vector Graphics) is an XML-based vector image format that describes 2D graphics using mathematical paths, shapes, and text instead of pixels.
The following should draw a very simple sunset:

<img src='data:image/svg+xml,
<svg width="200" height="200" 
     xmlns="http://www.w3.org/2000/svg">
  <rect width="100%" height="100%" fill="blue" /> 
  <!-- the sky -->
  <circle cx="100" cy="110" r="50" fill="yellow" />  
  <!-- the sun -->
  <rect x="0" y="120" width="200" height="80" fill="brown" />  
  <!-- the ground -->
</svg>' />

Observe how I was able to format the SVG code so that it is readable.

Further reading: Nizipli, Y., & Lemire, D. (2024). Parsing millions of URLs per second. Software: Practice and Experience, 54(5), 744-758.

How fast do browsers correct UTF-16 strings?

JavaScript represents strings using Unicode, like most programming languages today. Each character in a JavaScript string is stored using one or two 16-bit words. The following JavaScript code might surprise some programmers because a single character becomes two 16-bit words.

> t="🧰"
'🧰'
> t.length
2
> t[0]
'\ud83e'
> t[1]
'\uddf0'

The convention is that \uddf0 is the 16-bit value 0xDDF0 also written U+DDF0.

The UTF-16 standard is relatively simple. There are three types of code units: high surrogates (the range U+D800 to U+DBFF), low surrogates (U+DC00 to U+DFFF), and all other code units (U+0000–U+D7FF together with U+E000–U+FFFF). A high surrogate must always be followed by a low surrogate, and a low surrogate must always be preceded by a high surrogate.

What happens if you break the rules and have, say, a high surrogate followed by another high surrogate? Then you have an invalid string. We can correct such strings by patching them: we replace the bad values with the replacement character (\ufffd). The replacement character is often rendered as a question mark inside a diamond.

To correct a broken string in JavaScript, you can call the toWellFormed method.

> t = '\uddf0\uddf0'
'\uddf0\uddf0'
> t.toWellFormed()
'��'
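JavaScript engines implement this repair in native code, but the logic itself is simple. Here is a minimal Python sketch that applies the same rules to a list of 16-bit code units (the function name is mine, mimicking toWellFormed):

```python
REPLACEMENT = 0xFFFD

def to_well_formed(units):
    """Replace unpaired surrogates in a list of UTF-16 code units,
    mimicking JavaScript's String.prototype.toWellFormed."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:  # high surrogate
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                out.append(u)            # properly paired with a low surrogate
                out.append(units[i + 1])
                i += 2
                continue
            out.append(REPLACEMENT)      # lone high surrogate
        elif 0xDC00 <= u <= 0xDFFF:
            out.append(REPLACEMENT)      # lone low surrogate
        else:
            out.append(u)
        i += 1
    return out

# two lone low surrogates, as in the example above:
print(to_well_formed([0xDDF0, 0xDDF0]))  # both become U+FFFD
```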

How fast is it?

I wrote a small benchmark that you can test online to measure its speed. I use broken strings of various sizes up to a few kilobytes. I run the benchmarks on my Apple M4 processor using different browsers.

Browser Speed
Safari 18.6 1 GiB/s
Firefox 147 3 GiB/s
Chrome 145 15 GiB/s

Quite a range of performance! The speed of other Chromium-based browsers (Brave and Edge) is much the same as Chrome.

I also tested with JavaScript runtimes.

Engine Speed
Node.js v25.5.0 16 GiB/s
Bun 1.3.9 8.4 GiB/s

Usually Bun is faster than Node, but in this instance, Node is twice as fast as Bun.

Thus, we can correct strings in JavaScript at over ten gigabytes per second if you use Chromium-based browsers.

How bad can Python stop-the-world pauses get?

When programming, we need to allocate memory, and then deallocate it. If you program in C, you get used to malloc/free functions. Sadly, this leaves you vulnerable to memory leaks: unrecovered memory. Most popular programming languages today use automated memory management: Java, JavaScript, Python, C#, Go, Swift and so forth.

There are essentially two types of automated memory management. The simplest method is reference counting: you track how many references there are to each object, and when an object has no more references, you can free the memory associated with it. Swift and Python use reference counting. The downside of reference counting is circular references. Your main program may reference object A; you then add object B, which references object A, and you make object A reference object B in turn. Object B now has one reference while object A has two. If your main program drops its reference to object A, both objects A and B still have a reference count of one, yet they should be freed. To solve this problem, you could visit all of your objects to detect which are unreachable, including A and B. However, it takes time to do so.

Hence the other popular approach to automated memory management: generational garbage collection. It exploits the fact that most memory gets released soon after allocation: you track young objects and visit them from time to time, and, more rarely, you do a full scan. The downside of generational garbage collection is that typical implementations stop the world to scan the memory. In many instances, your entire program is stopped. There are many variations on the implementation, with decades of research behind them.
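The cycle problem is easy to reproduce in Python, where gc.collect() exposes the cycle detector directly:

```python
import gc

class Node:
    def __init__(self):
        self.other = None

gc.collect()           # start from a clean slate

a = Node()
b = Node()
a.other = b
b.other = a            # cycle: each object keeps the other's count above zero

del a, b               # the cycle is now unreachable, but reference
                       # counting alone will never free it...
collected = gc.collect()  # ...the cycle detector will
print(collected)       # number of unreachable objects reclaimed (at least 2)
```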

The standard Python implementation (CPython) uses both: reference counting and generational garbage collection. The generational garbage collection component can trigger pauses. A lot of servers are written in Python, which means that your service might just become unavailable for a time. We often call these ‘stop the world’ pauses. How long can such a pause get?

To test this out, I wrote a Python function to create a classical linked list:

class Node:
    def __init__(self, value):
        self.value = value
        self.next = None
    def add_next(self, node):
        self.next = node

def create_linked_list(limit):
    """ create a linked list of length 'limit' """
    head = Node(0)
    current = head
    for i in range(1, limit):
        new_node = Node(i)
        current.add_next(new_node)
        current = new_node
    return head

I then create one large linked list and, in a tight loop, create small linked lists that are immediately discarded.

x = create_linked_list(50_000_000)
for i in range(1000000):
    create_linked_list(1000)

A key characteristic of my code is the 50-million-node linked list. It does not get released until the end of the program, but the garbage collector may still examine it.

And I record the maximum delay between two iterations in the loop (using time.time()).
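A minimal sketch of that measurement, with a stand-in list builder and far fewer iterations so it runs quickly; the real benchmark uses the Node class and loop shown above:

```python
import time

def create_linked_list(limit):
    # stand-in for the Node-based builder above: a chain of two-element lists
    head = None
    for i in range(limit):
        head = [i, head]
    return head

max_delay = 0.0
last = time.time()
for _ in range(2_000):            # the real benchmark runs 1,000,000 iterations
    create_linked_list(1000)
    now = time.time()
    if now - last > max_delay:
        max_delay = now - last
    last = now

print(f"max delay: {max_delay * 1000:.3f} ms")
```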

How bad can it get? The answer depends on the Python version, and it is not consistent from run to run. So I ran it once and picked whatever result I got. I express the delay in milliseconds.

python version system max delay
3.14 macOS (Apple M4) 320 ms
3.12 Linux (Intel Ice Lake)  2,200 ms

Almost all of this delay (say 320 ms) is due to the garbage collection. Creating a linked list with 1000 elements takes less than a millisecond.

How long is 320 ms? It is a third of a second, so it is long enough for human beings to notice it. For reference, a video game drawing the screen 60 times per second has less than 17 ms to draw the screen. The 2,200 ms delay could look like a server crash from the point of view of a user, and would likely trigger a time-out (failed request).

I ported the Python program to Go. It is the same algorithm, but a direct comparison is likely unfair. Still, it gives us a reference.

go version system max delay
1.25 macOS (Apple M4) 50 ms
1.25 Linux (Intel Ice Lake)  33 ms

Thus Go has pauses that are several times shorter than Python, and there is no catastrophic 2-second pause.

Should these pauses be a concern? Most Python programs do not create so many objects in memory at the same time. Thus you are not likely to see these long pauses if you have a simple web app or a script. Python gives you a few options, such as gc.set_threshold and gc.freeze which could help you tune the behaviour.

My source code is available.

 

Video

AI: Igniting the Spark to End Stagnation

Much of the West has been economically stagnant. Countries like Canada have failed to improve their productivity and standard of living of late: there has been no progress in Canadian living standards, as measured by per-person GDP, over the past five years. It is hard to overstate how anomalous this is: the USSR collapsed in part because it could only sustain a growth rate of about 1%, far below what the West was capable of. Canada is more stagnant than the USSR was.

Late in 2022, some of us got access to a technical breakthrough: AI. In three years, it has become part of our lives. Nearly all students use AI to do research or write essays.

Dallas Fed economists projected the most credible effect that AI might have on our economies: AI should help reverse the post-2008 slowdown and deliver higher living standards in line with historical technological progress.

It will imply a profound, rapid but gradual transformation of our economy. There will still be teachers, accountants, and even translators in the future… but their work will change as it has changed in the past. Accountants do far less arithmetic today; that part of their work has been replaced by software. Even more of their work is about to be replaced by software, thus improving their productivity further. We will still have teachers, but all our kids, including the poorest ones, will have dedicated always-on tutors: this will not be just available in Canada or the USA, but everywhere. It is up to us to decide who is allowed to build this technology.

AI empowers the individual. An entrepreneur with a small team can get faster access to quality advice, copywriting, and so forth. Artists with an imagination can create more with fewer constraints.

I don’t have to prove these facts: they are fast becoming obvious to the whole world.

New jobs are created. Students of mine work as AI specialists. One of them helps build software providing AI assistance to pharmacists. One of my sons is an AI engineer. These are great jobs.

The conventional explanation for Canada’s stagnation is essentially that we have already harvested all the innovation we are ever going to get. The low-hanging fruit has been picked. Further progress has become inherently difficult because we are already so advanced; there is simply not much room left to improve. In this view, there is no need to rethink our institutions. Yet a sufficiently large breakthrough compels us to reconsider where we stand and what is still possible. It forces us to use our imagination again. It helps renew the culture.

We often hear claims that artificial intelligence will consume vast amounts of energy and water in the coming years. It is true that data centers, which host AI workloads along with many other computing tasks, rely on water for cooling.

But let’s look at the actual water numbers. In 2023, U.S. data centers directly consumed roughly 17.4 billion gallons of water—a figure that could potentially double or quadruple by 2028 as demand grows. By comparison, American golf courses use more than 500 billion gallons every year for irrigation, often in arid regions where this usage is widely criticized as wasteful. Even if data-center water demand were to grow exponentially, it would take decades to reach the scale of golf-course irrigation.

On the energy side, data centers are indeed taking a larger share of electricity demand. According to the International Energy Agency’s latest analysis, they consumed approximately 415 TWh in 2024—about 1.5% of global electricity consumption. This is projected to more than double to around 945 TWh by 2030 (just under 3% of global electricity). However, even this rapid growth accounts for less than 10% (roughly 8%) of the total expected increase in worldwide electricity demand through 2030. Data centers are therefore not the main driver of the much larger rise in overall energy use.

If we let engineers in Australia, Canada, or Argentina free to innovate, we will surely see fantastic developments.

You might also have heard about the possibility that ChatGPT might decide to kill us all. Nobody can predict the future, but you are surely more likely to be killed by cancer than by a rogue AI. And AI might help you with your cancer.

We always have a choice. Nations can try to regulate AI out of existence. We can set up new government bodies to prevent the application of AI. This will surely dampen the productivity gains and marginalize some nations economically.

The European Union showed it could be done. By some reports, Europeans make more money by fining American software companies than by building their own innovation enterprises. Countries like Canada have economies dominated by finance, mining and oil (with a side of Shopify).

If you are already well off, stopping innovation sounds good. It’s not if you are trying to get a start.

AI is likely to help young people who need it so much. They, more than any other group, will find it easier to occupy the new jobs, start the new businesses.

If you are a politician and you want to lose the vote of young people: make it difficult to use AI. It will crater your credibility.

It is time to renew our prosperity. It is time to create new exciting jobs.

 

References:

Wynne, M. A., & Derr, L. (2025, June 24). Advances in AI will boost productivity, living standards over time. Federal Reserve Bank of Dallas.

Fraser Institute. (2025, December 16). Canada’s recent economic growth performance has been awful.

DemandSage. (2026, January 9). 75 AI in education statistics 2026 (Global trends & facts).

MIT Technology Review. (2026, January 21). Rethinking AI’s future in an augmented workplace.

Davis, J. H. (2025). Coming into view: How AI and other megatrends will shape your investments. Wiley.

Choi, J. H., & Xie, C. (2025, June 26). AI is reshaping accounting jobs by doing the boring stuff. Stanford Graduate School of Business.

International Energy Agency. (n.d.). Energy demand from AI.

University of Colorado Anschutz Medical Campus. (2025, May 19). Real talk about AI and advancing cancer treatments.

International Energy Agency. (2025). Global energy review 2025.

The cost of a function call

When programming, we chain functions together. Function A calls function B. And so forth.

You do not have to program this way; you could write an entire program using a single function. It would be a fun exercise to write a non-trivial program as a single function… as long as you delegate the code writing to AI, because human beings quickly struggle with long functions.

A key compiler optimization is ‘inlining’: the compiler takes your function definition and it tries to substitute it at the call location. It is conceptually quite simple. Consider the following example where the function add3 calls the function add.

int add(int x, int y) {
    return x + y;
}

int add3(int x, int y, int z) {
    return add(add(x, y), z);
}

You can manually inline the call as follows.

int add3(int x, int y, int z) {
    return x + y + z;
}

A function call is reasonably cheap performance-wise, but not free. If the function takes non-trivial parameters, you might need to save and restore them on the stack, so you get extra loads and stores. You need to jump into the function, and then jump out at the end. And depending on the function call convention on your system, and the type of instructions you are using, there are extra instructions at the beginning and at the end.
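The overhead is not specific to compiled languages. As a rough illustration (not comparable to the C++ numbers in this post), here is the same sum written with and without a helper function in Python, where the interpreter never inlines calls:

```python
import timeit

def add(x, y):
    return x + y

numbers = list(range(1000))

def with_call():
    s = 0
    for x in numbers:
        s = add(s, x)   # one function call per element
    return s

def inlined():
    s = 0
    for x in numbers:
        s = s + x       # the addition written out directly
    return s

assert with_call() == inlined() == 499500
print("with call:", timeit.timeit(with_call, number=1000))
print("inlined:  ", timeit.timeit(inlined, number=1000))
```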

If a function is sufficiently simple, such as my add function, it should always be inlined when performance is critical. Let us examine a concrete example. Let me sum the integers in an array.

for (int x : numbers) {
  sum = add(sum, x);
}

I am using my MacBook (M4 processor with LLVM).

function ns/int
regular 0.7
inline 0.03

Wow. The inline version is over 20 times faster.

Let us try to see what is happening. The call site of the ‘add’ function is just a straight loop with a call to the function.

ldr    w1, [x19], #0x4
bl     0x100021740    ; add(int, int)
cmp    x19, x20
b.ne   0x100001368    ; <+28>

The function itself is as cheap as it can be: just two instructions.

add    w0, w1, w0
ret

So, we spend 6 instructions for each addition. It takes about 3 cycles per addition.

What about the inline function?

ldp    q4, q5, [x12, #-0x20]
ldp    q6, q7, [x12], #0x40
add.4s v0, v4, v0
add.4s v1, v5, v1
add.4s v2, v6, v2
add.4s v3, v7, v3
subs   x13, x13, #0x10
b.ne   0x1000013fc    ; <+104>

It is entirely different. The compiler has converted the addition to advanced (SIMD) instructions processing blocks of 16 integers using 8 instructions. So we are down to half an instruction per integer (from 6 instructions). So we use 12 times fewer instructions. On top of having fewer instructions, the processor is able to retire more instructions per cycle, for a massive performance boost.

What if we prevented the compiler from using these fancy instructions while still inlining? We still get a significant performance boost (about 10x faster).

function ns/int
regular 0.7
inline 0.03
inline (no SIMD) 0.07

Ok. But the add function is a bit extreme. We know it should always be inlined. What about something less trivial, like a function that counts the number of spaces in a string?

size_t count_spaces(std::string_view sv) {
    size_t count = 0;
    for (char c : sv) {
        if (c == ' ') ++count;
    }
    return count;
}

If the string is reasonably long, then the overhead of the function call should be negligible.
Let us pass a string of 1000 characters.

function ns/string
regular 111
inline 115

The inline version is not only not faster, but it is even slightly slower. I am not sure why.

What if I use short strings (say between 0 and 6 characters)? Then the inline function is measurably faster.

function ns/string
regular 1.6
inline 1.0

Takeaways:

  1. Short and simple functions should be inlined when possible if performance is a concern. The benefits can be impressive.
  2. For functions that can be fast or slow, the decision as to whether to inline or not depends on the input. For string processing functions, the size of the string may determine whether inlining is necessary for best performance.

Note: My source code is available.

Converting data to hexadecimal outputs quickly

Given any string of bytes, you can convert it to a hexadecimal string by mapping the least significant and the most significant 4 bits of each byte to characters in 0…9, a…f. There are more efficient encodings, like base64, that map 3 bytes to 4 characters. However, hexadecimal output is easier to understand and often sufficiently concise.
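For reference, Python ships this exact conversion in its standard library, which is handy for checking any hand-rolled implementation:

```python
import binascii

data = b"\x00\xff\x2a"
print(binascii.hexlify(data))  # b'00ff2a'
print(data.hex())              # 00ff2a (equivalent shortcut since Python 3.5)
```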

A simple function to do the conversion using a short lookup table is as follows:

static const char hex[] = "0123456789abcdef";
for (size_t i = 0, k = 0; k < dlen; i += 1, k += 2) {
    uint8_t val = src[i];
    dst[k + 0] = hex[val >> 4];
    dst[k + 1] = hex[val & 15];
}

This code snippet implements a straightforward byte-to-hexadecimal string conversion loop in C++. It iterates over an input byte array (src), processing one byte at a time using index i, while simultaneously building the output string in dst with index k that advances twice as fast (by 2) since each input byte produces two hexadecimal characters. For each byte, it extracts the value as an unsigned 8-bit integer (val), then isolates the high 4 bits (via right shift by 4) and low 4 bits (via bitwise AND with 15) to index into a static lookup table (hex) containing the characters ‘0’ through ‘9’ and ‘a’ through ‘f’. The loop continues until k reaches the expected output length (dlen), which should be twice the input length, ensuring all bytes are converted without bounds errors.

This lookup table approach is used in the popular Node.js JavaScript runtime. Skovoroda recently proposed to replace this lookup table approach with an arithmetic version.

char nibble(uint8_t x) { return x + '0' + ((x > 9) * 39); }
for (size_t i = 0, k = 0; k < dlen; i += 1, k += 2) {
    uint8_t val = src[i];
    dst[k + 0] = nibble(val >> 4);
    dst[k + 1] = nibble(val & 15);
}

Perhaps surprisingly, this approach is much faster and uses far fewer instructions. At first glance, this result might be puzzling: a table lookup is cheap, and the new nibble function seemingly does more work.

The trick that Skovoroda relies upon is that compilers are smart: they will ‘autovectorize’ such number-crunching functions (if you are lucky). That is, instead of using regular instructions that process single byte values, they will use SIMD instructions that process 16 bytes at once or more.
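Before reaching for intrinsics, it is easy to convince yourself the arithmetic is right. Here is the nibble function transcribed to Python (the transcription is mine):

```python
def nibble(x):
    # 0..9   -> '0'..'9' (ASCII 48..57)
    # 10..15 -> 'a'..'f' (ASCII 97..102); the +39 bridges the gap,
    # since ord('a') - ord('0') - 10 == 39.
    return chr(x + ord('0') + (x > 9) * 39)

encoded = ''.join(nibble(b >> 4) + nibble(b & 15) for b in b'\x00\xff\x2a')
print(encoded)  # 00ff2a
```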

Of course, instead of relying on the compiler, you can manually invoke SIMD instructions through SIMD intrinsic functions. Let us assume that you have an ARM processor (e.g., on Apple Silicon). Then you can process blocks of 32 bytes as follows.

uint8x16_t table = vld1q_u8((const uint8_t *)"0123456789abcdef"); // hex digits
size_t i = 0;
size_t maxv = (slen - (slen % 32));
for (; i < maxv; i += 32) {
    uint8x16_t val1 = vld1q_u8((uint8_t*)src + i);
    uint8x16_t val2 = vld1q_u8((uint8_t*)src + i + 16);
    uint8x16_t high1 = vshrq_n_u8(val1, 4);
    uint8x16_t low1 = vandq_u8(val1, vdupq_n_u8(15));
    uint8x16_t high2 = vshrq_n_u8(val2, 4);
    uint8x16_t low2 = vandq_u8(val2, vdupq_n_u8(15));
    uint8x16_t high_chars1 = vqtbl1q_u8(table, high1);
    uint8x16_t low_chars1 = vqtbl1q_u8(table, low1);
    uint8x16_t high_chars2 = vqtbl1q_u8(table, high2);
    uint8x16_t low_chars2 = vqtbl1q_u8(table, low2);
    uint8x16x2_t zipped1 = {high_chars1, low_chars1};
    uint8x16x2_t zipped2 = {high_chars2, low_chars2};
    vst2q_u8((uint8_t*)dst + i*2, zipped1);
    vst2q_u8((uint8_t*)dst + i*2 + 32, zipped2);
}

This SIMD code leverages ARM NEON intrinsics to accelerate hexadecimal encoding by processing 32 input bytes at once. It begins by loading two 16-byte vectors (val1 and val2) from the source array using vld1q_u8. For each vector, it extracts the high nibbles (via a right shift by 4 with vshrq_n_u8) and the low nibbles (via a bitwise AND with 15 using vandq_u8 and vdupq_n_u8). The nibbles are then used as indices into a preloaded hex table via vqtbl1q_u8 to fetch the corresponding ASCII characters. Finally, the high and low character vectors are paired up (uint8x16x2_t) and stored with vst2q_u8, which interleaves the two vectors as it writes them to the destination array.

You could do similar work on other systems like x64. The same code with AVX-512 for recent Intel and AMD processors would probably be insanely efficient.

Benchmarking these implementations on a dataset of 10,000 random bytes reveals significant performance differences. The basic lookup table version achieves around 3 GB/s, while the arithmetic version, benefiting from compiler autovectorization, reaches 23 GB/s. The manual SIMD NEON versions push performance further: I reach 42 GB/s in my tests.

method speed instructions per byte
table 3.1 GB/s 9
Skovoroda 23 GB/s 0.75
intrinsics 42 GB/s 0.69

One lesson is that intuition can be a poor guide when trying to assess performance.

My source code is available.

Converting floats to strings quickly

When serializing data to JSON, CSV or when logging, we convert numbers to strings. Floating-point numbers are stored in binary, but we need them as decimal strings. The first formally published algorithm is Steele and White’s Dragon4, from 1990. Since then, faster methods have emerged: Grisu3, Ryū, Schubfach, Grisu-Exact, and Dragonbox. Since C++17, we have a standard function called std::to_chars for this purpose. A common objective is to generate the shortest string that still uniquely identifies the original number.
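This shortest-round-trip objective is also what Python’s repr implements for floats (since Python 3.1): it emits the shortest decimal string that parses back to the same double.

```python
x = 1.0 / 3.0
s = repr(x)          # shortest decimal string that round-trips
print(s)             # 0.3333333333333333
assert float(s) == x # parses back to the exact same double

# When a short string suffices, Python emits it:
print(repr(0.1))     # 0.1
```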

We recently published Converting Binary Floating-Point Numbers to Shortest Decimal Strings. We examine the full conversion, from the floating-point number to the string. In practice, the conversion involves two steps: we take the number and compute the significand and the power of 10 (step 1), and then we generate the string (step 2). E.g., for the number pi, you might compute 31415927 and -7 (step 1) before generating the string 3.1415927. The string generation requires placing the dot at the right location and switching to exponential notation when needed. Generating the string is relatively cheap and was probably a negligible cost for older schemes, but as the software got faster, it has become a more important component (taking 20% to 35% of the time).

The results vary quite a bit depending on the numbers being converted. But we find that two implementations tend to do best: Dragonbox by Jeon and Schubfach by Giulietti. The Ryū implementation by Adams is close behind or just as fast. All of these techniques are about 10 times faster than the original Dragon4 from 1990. A tenfold gain in performance over three decades is equivalent to a gain of about 8% per year, entirely due to better implementations and algorithms.

Efficient algorithms use between 200 and 350 instructions for each string generated. We find that the standard std::to_chars function under Linux uses more instructions than needed (up to nearly 2 times too many). So there is room to improve common implementations. The popular C++ library fmt is slightly less efficient still.

A fun fact is that we found that none of the available functions generates the shortest possible string. The std::to_chars C++ function renders the number 0.00011 as 0.00011 (7 characters), while the shorter scientific form 1.1e-4 would do. But, by convention, when switching to scientific notation, implementations must pad the exponent to two digits (so 1.1e-04, which is no shorter). Beyond this technicality, we found that no implementation always generates the shortest string.

All our code, datasets, and raw results are open-source. The benchmarking suite is at https://github.com/fastfloat/float_serialization_benchmark, test data at https://github.com/fastfloat/float-data.

Reference: Converting Binary Floating-Point Numbers to Shortest Decimal Strings: An Experimental Review, Software: Practice and Experience (to appear).

Optimizing Python scripts with AI

One of the first steps we take when we want to optimize software is to look
at profiling data. Software profilers are tools that try to identify where
your software spends its time. Though the exact approach can vary, a typical profiler samples your software (stopping it at regular intervals) and collects statistics. If your software is routinely stopped in a given function, that function is likely using a lot of time. In turn, it might be where you should put your optimization efforts.
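In Python, the built-in cProfile module serves this purpose. It is a tracing (deterministic) profiler rather than a sampling one, but the output is similar: a ranking of functions by time. A minimal programmatic use (function names are my own):

```python
import cProfile
import io
import pstats

def hot():
    # a deliberately busy function that should dominate the profile
    return sum(i * i for i in range(200_000))

def main():
    for _ in range(5):
        hot()

profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
print(report)  # 'hot' should appear near the top of the cumulative listing
```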

Matteo Collina recently shared with me his work on feeding profiler data to an AI for software optimization purposes in JavaScript. Essentially, Matteo takes the profiling data and prepares it in a way that an AI can comprehend. The insight is simple but intriguing: tell an AI how it can capture profiling data and then let it optimize your code, possibly by repeatedly profiling the code. The idea is not entirely original, since AI tools will, on their own, figure out that they can get profiling data.

How well does it work? I had to try it.

Case 1. Code amalgamation script

For the simdutf software library, we use an amalgamation script: it collects all of the C++ files on disk, does some shallow parsing and glues them together according to some rules.

I first asked the AI to optimize the script without access to profiling data. What it did immediately was to add a file cache: the script repeatedly loads the same files from disk (the script is a bit complex). This saved about 20% of the running time.

Specifically, the AI replaced this naive code…

def read_file(file):
    with open(file, 'r') as f:
        for line in f:
            yield line.rstrip()

by this version with caching…

def read_file(file):
    if file in file_cache:
        for line in file_cache[file]:
            yield line
    else:
        lines = []
        with open(file, 'r') as f:
            for line in f:
                line = line.rstrip()
                lines.append(line)
                yield line
        file_cache[file] = lines

Could the AI do better with profiling data? I instructed it to run the Python profiler: python -m cProfile -s cumtime myprogram.py. It found two additional optimizations:

1. It precompiled the regular expressions (re.compile). It replaced

  if re.match('.*generic/.*.h', file):
    # ...

by

if generic_pattern.match(file):
    # ...

where elsewhere in the code, we have…

generic_pattern = re.compile(r'.*generic/.*\.h')

2. Instead of repeatedly calling re.sub to do a regular expression substitution, it filtered the strings by checking for the presence of a keyword in the string first.

if 'SIMDUTF_IMPLEMENTATION' in line: # This IF is the optimization
  print(uses_simdutf_implementation.sub(context.current_implementation+"\\1", line), file=fid)
else:
  print(line, file=fid) # Fast path

These two optimizations could probably have been arrived at by looking at the code directly, and I cannot be certain that they were driven by the profiling data. But I can tell that they do appear in the profile data.

Unfortunately, the low-hanging fruit, caching the file access, represented the bulk of the gain. The AI was not able to further optimize the code. So the profiling data did not help much.

Case 2: Check Link Script

When I design online courses, I often use a lot of links. These links break over time. So I have a simple Python script that goes through all the links, and verifies them.

I first asked my AI to optimize the code. It did the same regex trick, compiling the regular expression. It also created a thread pool and made the link checks concurrent.

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    url_results = {url: executor.submit(check_url, url) for url in urls_to_check}
    for url, future in url_results.items():
        url_cache[url] = future.result()

This parallelization more than doubled the speed of the script.

It cached the URL checks in an interesting way, using functools:

from functools import lru_cache

@lru_cache(maxsize=None)
def check(link):
    # ...

I did not know about this nice trick. It proved useless in my context, however, because I rarely encounter the same link more than once.

I then started again, and told it to use the profiler. It did much the same thing, except for the optimization of the regular expression.

As far as I can tell all optimizations were in vain, except for the multithreading. And it could do this part without the profiling data.

Conclusion so far

The Python scripts I tried were not heavily optimized, as their performance was not critical. They are relatively simple.

For the amalgamation, I got a 20% performance gain for ‘free’ thanks to the file caching. The link checker is going to be faster now that it is multithreaded. Both optimizations are valid and useful, and will make my life marginally better.

In neither case was I able to discern benefits from the profiler data. I was initially hoping to get the AI busy optimizing the code in a loop, continuously re-running the profiler, but that did not happen in these simple cases. The AI optimized code segments that contributed little to the running time, as per the profiler data.

To be fair, profiling data is often of limited use. The real problems are often architectural and not related to narrow bottlenecks. Even when there are real bottlenecks, a simple profiling run can fail to make them clearly visible. Further, profilers become more useful as the code base grows, while my test cases are tiny.

Overall, I expect that the main reason for my relative failure is that I did not have the right use cases. I think that collecting profiling data and asking an AI to have a look might be a reasonable first step at this point.

A new way to call C from Java: how fast is it?

Irrespective of your programming language of choice, calling C functions is often a necessity. For the longest time, the only standard way to call C from Java was the Java Native Interface (JNI). But it was so painful that few dared to use it. I have heard it said that it was deliberately painful so that people would be enticed to use pure Java as much as possible.

Since Java 22, there is a new approach called the Foreign Function & Memory API in java.lang.foreign. Let me go through it step by step.

You need a Linker and a SymbolLookup instance from which you will build a MethodHandle that will capture the native function you want to call.

The linker is easy:

Linker linker = Linker.nativeLinker();

To load the SymbolLookup instance for your library (called mylibrary), you may do so as follows:

System.loadLibrary("mylibrary");
SymbolLookup lookup = SymbolLookup.loaderLookup();

The native library file should be on your java.library.path, or somewhere on the default library paths. (You can pass it to your java executable as -Djava.library.path=something.)

Alternatively, you can use SymbolLookup.libraryLookup or other means of loading
the library, but System.loadLibrary should work well enough.

Once you have the lookup, you can grab the address of a function like so:

lookup.find("myfunction")

This returns an Optional<MemorySegment>. You can grab the MemorySegment like so:

MemorySegment mem = lookup.find("myfunction").orElseThrow();

Once you have your MemorySegment, you can pass it to your linker to get a MethodHandle which is close to a callable function:

 MethodHandle myfunc = linker.downcallHandle(
     mem,
     functiondescr
 );

The functiondescr must describe the returned value and the parameters that your function takes. If you pass a pointer and get back a long value, you might proceed as follows:

 MethodHandle myfunc = linker.downcallHandle(
     mem,
     FunctionDescriptor.of(
        ValueLayout.JAVA_LONG,
        ValueLayout.ADDRESS
    )
 );

That is, the first argument of FunctionDescriptor.of describes the return value; the remaining arguments describe the parameters.

For functions returning nothing, you use FunctionDescriptor.ofVoid.

The MethodHandle can be called almost like a normal Java function:
myfunc.invokeExact(parameters). Its declared return type is Object, so a cast is necessary: with invokeExact, the cast is mandatory and must match the handle's type exactly (e.g., (long) for a function returning a long).
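To see the whole pipeline at once without a custom library, here is a minimal sketch that calls strlen from the C standard library through the linker's default lookup. The class and variable names are mine; the API calls are those of java.lang.foreign described above. (On recent JDKs you may see a warning about restricted native access; --enable-native-access=ALL-UNNAMED silences it.)

```java
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

public class StrlenExample {
    public static void main(String[] args) throws Throwable {
        Linker linker = Linker.nativeLinker();
        // The default lookup exposes the symbols of the C runtime.
        SymbolLookup stdlib = linker.defaultLookup();
        MemorySegment strlenAddr = stdlib.find("strlen").orElseThrow();

        // C signature: size_t strlen(const char *s)
        MethodHandle strlen = linker.downcallHandle(
            strlenAddr,
            FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS)
        );

        try (Arena arena = Arena.ofConfined()) {
            // Copy the Java string to native memory as a NUL-terminated C string.
            MemorySegment cstr = arena.allocateFrom("hello world");
            long length = (long) strlen.invokeExact(cstr);
            System.out.println(length); // prints 11
        }
    }
}
```

Note how the cast to (long) in the invokeExact call matches the JAVA_LONG layout in the function descriptor.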

It is a bit painful, but thankfully, there is a tool called jextract that can automate this task. It generates Java bindings from native library headers.

You can allocate C data structures from Java that you can pass to your native code by using an Arena. Let us say that you want to create an instance of a structure described by the following layout:

MemoryLayout mystruct = MemoryLayout.structLayout(
        ValueLayout.JAVA_LONG.withName("age"),
        ValueLayout.JAVA_INT.withName("friends"));

Given an Arena instance (discussed below), you could do it in this manner:

MemorySegment myseg = arena.allocate(mystruct);

You can then pass myseg as a pointer to a data structure in C.
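To then read and write the fields, you can derive a VarHandle for each named field from the layout. Here is a minimal sketch; note that since Java 22, these var handles take an extra base-offset coordinate (0 here).

```java
import java.lang.foreign.*;
import java.lang.invoke.VarHandle;

public class StructExample {
    public static void main(String[] args) {
        MemoryLayout mystruct = MemoryLayout.structLayout(
                ValueLayout.JAVA_LONG.withName("age"),
                ValueLayout.JAVA_INT.withName("friends"));
        // A VarHandle gives typed access to a named field of the layout.
        VarHandle ageHandle = mystruct.varHandle(
                MemoryLayout.PathElement.groupElement("age"));
        VarHandle friendsHandle = mystruct.varHandle(
                MemoryLayout.PathElement.groupElement("friends"));

        try (Arena arena = Arena.ofConfined()) {
            MemorySegment myseg = arena.allocate(mystruct);
            ageHandle.set(myseg, 0L, 42L);
            friendsHandle.set(myseg, 0L, 3);
            System.out.println((long) ageHandle.get(myseg, 0L));    // 42
            System.out.println((int) friendsHandle.get(myseg, 0L)); // 3
        }
    }
}
```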

You typically create an arena with a try-with-resources clause like so:

try (Arena arena = Arena.ofConfined()) {
       //
}

There are several kinds of arenas: confined, shared, global, and automatic. A confined arena is accessible from a single thread, while shared and global arenas are accessible from several threads. An automatic arena is managed by the Java garbage collector and the global arena lives for the duration of the program, whereas the confined and shared arenas are closed explicitly, which gives them a deterministic lifetime.
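To illustrate what that explicit lifetime means in practice, here is a minimal sketch: a segment allocated in a confined arena becomes inaccessible as soon as the arena is closed.

```java
import java.lang.foreign.*;

public class ArenaLifetime {
    public static void main(String[] args) {
        MemorySegment seg;
        try (Arena arena = Arena.ofConfined()) {
            seg = arena.allocate(ValueLayout.JAVA_INT);
            seg.set(ValueLayout.JAVA_INT, 0, 7);
            System.out.println(seg.get(ValueLayout.JAVA_INT, 0)); // 7
        }
        // The arena is now closed: any further access throws IllegalStateException.
        try {
            seg.get(ValueLayout.JAVA_INT, 0);
        } catch (IllegalStateException e) {
            System.out.println("segment no longer accessible");
        }
    }
}
```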

So, it is fairly complicated but manageable. Is it fast? To find out, I call from Java a C library I wrote with support for binary fuse filters. They are a fast alternative to Bloom filters.

You don’t need to know what any of this means, however. Keep in mind that I wrote a Java library called jfusebin which calls a C library. Then I also have a pure Java implementation and I can compare the speed.

I should first point out that even if calling the C function did not include any overhead, it might still be slower because the Java compiler is unlikely to inline a native function. However, if you have a pure Java function, and it is relatively small, it can get inlined and you get all sorts of nice optimizations like constant folding and so forth.

Thus I can overestimate the cost of the overhead. But that’s ok. I just want a ballpark measure.

In my benchmark, I check for the presence of a key in a set. I have one million keys in the filter. I then query keys that are absent from the filter.

I find that the library calling C can issue 44 million calls per second using the 8-bit binary fuse filter. I reach about 400 million calls per second using the pure Java implementation.

method time per query in nanoseconds
Java-to-C 22.7 ns
Pure Java 2.5 ns

Thus I measure an overhead of about 20 ns per C function call from Java on a MacBook (Apple M4 processor).

We can do slightly better by marking functions that are expected to be short-running as critical. You achieve this by passing an option to the linker.downcallHandle call.

binary_fuse8_contain = linker.downcallHandle(
    lookup.find("xfuse_binary_fuse8_contain").orElseThrow(),
    binary_fuse8_contain_desc,
    Linker.Option.critical(false)
);

You save about 15% of the running time in my case.

method time per query in nanoseconds
Java-to-C 22.7 ns
Java-to-C (critical) 19.5 ns
Pure Java 2.5 ns

Obviously, in my case, because the Java library is so fast, the 20 ns becomes too much. But it is otherwise a reasonable overhead.

I did not compare with the old approach (JNI), but other folks did and they find that the new foreign function approach can be measurably faster (e.g., 50% faster). In particular, it has been reported that calling a Java function from C is now relatively fast: I have not tested this functionality myself.

One of the cool features of the new interface is that you can pass data from the Java heap directly to your C function with relative ease.

Suppose you have the following C function:

int sum_array(int* data, int count) {
    int sum = 0;
    for(int i = 0; i < count; i++) {
        sum += data[i];
    }
    return sum;
}

And you want the following Java array to be passed to C without a copy:

int[] javaArray = {10, 20, 30, 40, 50};

It is as simple as the following code.

System.loadLibrary("sum");
Linker linker = Linker.nativeLinker();
SymbolLookup lookup = SymbolLookup.loaderLookup();
MemorySegment sumAddress = lookup.find("sum_array").orElseThrow();

// C Signature: int sum_array(int* data, int count)
MethodHandle sumArray = linker.downcallHandle(
    sumAddress,
    FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.ADDRESS, ValueLayout.JAVA_INT),
    Linker.Option.critical(true)
);

int[] javaArray = {10, 20, 30, 40, 50};

try (Arena arena = Arena.ofConfined()) {
    MemorySegment heapSegment = MemorySegment.ofArray(javaArray);
    int result = (int) sumArray.invoke(heapSegment, javaArray.length);
    System.out.println("The sum from C is: " + result);
}

I created a complete example in a few minutes. One trick is to make sure that Java finds the native library. If it is not at a standard library path, you can specify the location with -Djava.library.path like so:

java -Djava.library.path=target -cp target/classes IntArrayExample

Further reading: When Does Java’s Foreign Function & Memory API Actually Make Sense? by A N M Bazlur Rahman.

How stagnant is CPU technology?

Sometimes, people tell me that there is no more progress in CPU performance.

Consider these three processors which had comparable prices at release time.

  1. The AMD Ryzen 7 9800X3D (Zen 5, with up to 5.3 GHz boost) was released in 2024.
  2. The AMD Ryzen 7 7800X3D (Zen 4, with up to 5.1 GHz boost) was released in 2023.
  3. The AMD Ryzen 7 5800X3D (Zen 3, with up to 4.5 GHz boost) was released in 2022.

Let us consider their results on the PartialTweets open benchmark (JSON parsing). It is a single-core benchmark.

2024 processor 12.7 GB/s
2023 processor 9 GB/s
2022 processor 5.2 GB/s

In two years, on this benchmark, AMD more than doubled the performance for the same cost.

So what is happening is that processor performance is indeed going up, sometimes dramatically so, but not all of our software can benefit from the improvements. Software developers must track the trends and adapt their software accordingly. Unfortunately, it is hard work and it requires expertise. In the case of this benchmark, the simdjson library is designed to benefit from better processor features.

Not all software can easily run much faster on new processors, and genuine progress is difficult.

Let us be ambitious. Let us move forward!

What I Got Wrong About “Hard Work” in My 20s

When I was younger, in my 20s, I assumed that everyone was working “hard,” meaning a solid 35 hours of work a week. Especially, say, university professors and professional engineers. I’d feel terribly guilty when I was messing around, playing video games on a workday.

Today I realize that most people become very adept at avoiding actual work. And the people you think are working really hard are often just very good at focusing on what is externally visible. They show up to the right meetings but unashamedly avoid the hard work.

It ends up being visible to the people “who know.” Why? Because working hard is how you acquire actual expertise. And lack of actual expertise ends up being visible… but only to those who have the relevant expertise.

And the effect compounds. The difference between someone who has honed their skills for 20 years and someone who has merely shown up to the right meetings becomes enormous. And so, we end up with huge competency gaps between people who are in their 30s, 40s, 50s. It becomes night and day.

A bit of glass and freedom is all you need

Galileo Galilei was the OpenAI of his time. He helped establish modern science by emphasizing experimentation as the primary means to uncover natural truths. To this end, he built his own telescopes. He revealed to the world the moons of Jupiter, thereby changing forever how we viewed the cosmos.
 
How was Galileo able to design better telescopes than others? If you have ever been to Venice, you may know that it was famous for its glassmakers. There is a small island nearby (Murano) where there are still glassmakers. Further, Venice had some of the best merchants in Europe, so they could export their glass worldwide.
 
This is an important lesson about what drives innovation. It is not a linear process. Living near people making fine glass could be the key you need.
 
A common misconception portrays Galileo as persecuted solely for advocating heliocentrism, the idea that Earth orbits the Sun. In reality, he spent most of his career challenging established doctrines and thrived under Church patronage. Galileo overturned the widespread belief that heavier objects fall faster, and this achievement, if nothing else, brought him greater fame. When he gathered strong evidence for heliocentrism, he initially faced only cautions rather than outright condemnation.
 
Pope Urban VIII had personally permitted Galileo to discuss heliocentrism as a hypothesis and even requested that his own arguments on the matter be included. However, Galileo placed these papal views in the mouth of Simplicio, a character portrayed as intellectually inadequate in defending the traditional geocentric position. This was widely interpreted as a mockery.
 
Galileo was sentenced to house arrest, during which he continued productive work. The ban on his Copernican writings applied mainly within Catholic territories, allowing their dissemination elsewhere in Europe.
 
Thus another important element that made Galileo possible was the relative freedom he enjoyed.
 
You want to innovate? Don’t live solely in the world of ideas. Don’t be shy about mixing with commercial interests. And make sure to have a bit of freedom.