~sebsite/printfasm

an assembler that compiles to printf lol

#an assembler that compiles to printf lol

this is an "assembler" that compiles to a printf loop. i'm not sure that "assembler" is even the right word for it. it feels right, but like, idk. call it whatever you want, point is, it generates C code which pretty much looks like this:

/* a bunch of macros and stuff... */

static unsigned char mem[65536];

int main(void)
{
	while (!*EXIT) {
		printf(/*...*/);
	}
}

for those unaware, printf is turing complete (1, 2). the reason for this is the %n specifier, which stores(!) the number of bytes printed thus far through a pointer argument.

this assembler makes it "easy" to write code with printf. the result is slow af and very cpu intensive, but it works! note that the generated code uses POSIX positional specifiers (e.g. %2$d), so you'll need a POSIX-compatible libc to run the resulting code. POSIX only mandates support for 9 positional parameters, but this assembler generates code which usually uses way more, so you'll need a libc that supports a large number of positional parameters (i.e. a large value for NL_ARGMAX). glibc works great; musl and freebsd libc unfortunately only support 9, so they don't work. i haven't tested any other libcs; i have a feeling most should probably work though? in the future i might add a mode which doesn't depend on POSIX, we'll see

the resulting code extensively uses the terminal's alternate screen buffer, so it's not really possible to meaningfully pipe the output. your terminal emulator must support the alternate screen buffer, and ideally also synchronized output.

as a challenge for myself, i imposed the limitation that the entire source code for the assembler must fit in under 1000 lines, while still being readable (so someone who wasn't familiar with the line limit wouldn't suspect anything).

#how to use it

make
./asm < foo.pfs > foo.c
gcc foo.c # or clang or whatever
./a.out

C23 is required to compile the assembler.

.pfs stands for "printf .s", since it's like, an assembly file but for printf. idk

if you wanna try it for yourself, try with mandelbrot.pfs! it computes and prints the mandelbrot set entirely in printf

#an introductory tutorial using fizzbuzz

to get an idea of how the assembler works, let's walk through an example fizzbuzz program:

alias i, fizz, buzz

->fizz ([i] + 1) % 3 == 0
->buzz ([i] + 1) % 5 == 0
->i [i] + 1
->exit [i] == 100

[i] if !![i] & ![fizz] & ![buzz]
"Fizz" if [fizz]
"Buzz" if [buzz]
"\n" if [i]

it's a "declarative" language of sorts; it essentially describes how the state should be mutated on each iteration, and what should be printed. memory loads use the syntax [foo], where foo resolves to an integer for the slot of memory to load from (in C, this is basically mem[foo]). note that memory loads are pre-computed before any stores; memory stores use a different syntax to emphasize this (i'll get to that later). remember that memory loads are just array subscripting expressions passed as arguments to printf, so they're evaluated before printf is actually called.

the alias keyword declares "constants", kinda. the most intuitive form of its syntax is this:

alias foo = 1
alias bar = foo + 1

when the "initializer" is omitted, it defaults to the previous alias's value plus one, or to 0 for the first alias. so alias i, fizz, buzz defines i as 0, fizz as 1, and buzz as 2. these will be used as memory indices; they're just given names to make the code more readable. they have no effect on the semantics of the resulting code; you can think of them sorta like macros which are substituted with their value (note that, unlike C macros, it's not lexical substitution, so you don't need to worry about parentheses or whatever).

moving on:

->fizz ([i] + 1) % 3 == 0
->buzz ([i] + 1) % 5 == 0
->i [i] + 1

the syntax for memory stores is ->foo bar, where bar is an expression whose resulting value is stored to slot foo. so for example, ([i] + 1) % 3 == 0 is evaluated and its result is stored to slot fizz (which resolves to 1). like i mentioned above, this uses a different syntax from loads to emphasize that any later use of [fizz] will still evaluate to the previous value; the stores only "take effect" after the iteration is complete (this is technically not entirely true, but it's good enough for now. there's one kinda-advanced feature which breaks this abstraction, but don't worry about it yet).

->exit [i] == 100

->exit is special: it stores to a special reserved "exit" slot, which is checked at the beginning of each iteration. if its value is non-zero, the program exits. so it's basically an exit condition.

here's the complete list of allowed forms for memory stores (outrefs):

->0
->foo
->exit
->(foo + 1)
->[foo]

the parenthesized form allows any arbitrary expression. the bracketed form is identical to ->([foo]), i.e. it stores to the slot whose value is stored in slot foo.

a couple final notes here:

  • indices after 65530 are reserved and shouldn't be used. for instance, the exit slot is 65534, and there's an internal "accumulator" slot at 65533.
  • there's no protection against out of bounds memory accesses/stores; those will invoke undefined behavior.

moving right along:

[i] if !![i] & ![fizz] & ![buzz]
"Fizz" if [fizz]
"Buzz" if [buzz]
"\n" if [i]

the syntax foo if bar prints the value foo if bar is non-zero. the if clause can be omitted, if you just want to print a value unconditionally.

it's worth noting that && and || are deliberately not supported. the same goes for ternary conditional expressions (?:). these aren't included because they affect control flow, and the entire gimmick is that the resulting code has no explicit control flow besides the loop. because of this, printf assembly code usually uses the bitwise operators & and | instead. because the assembler copies C's operator precedence rules, these bind more loosely than comparisons like ==, just as && and || do, so [a] == 1 & [b] == 2 groups the way you'd hope (most people agree that these precedence rules are a mistake which only exist for historical reasons; this is like the one time where they're actually useful lol). to do boolean arithmetic, stuff like !! is used (converts any non-zero value to 1), as well as unary - on boolean operands of & (to convert 1 to -1, which has every bit set). so like, foo & -!!bar evaluates to foo if bar is non-zero, otherwise it evaluates to 0.

so let's look at the entire fizzbuzz source code again:

alias i, fizz, buzz

->fizz ([i] + 1) % 3 == 0
->buzz ([i] + 1) % 5 == 0
->i [i] + 1
->exit [i] == 100

[i] if !![i] & ![fizz] & ![buzz]
"Fizz" if [fizz]
"Buzz" if [buzz]
"\n" if [i]

memory is initialized to zero, so we start counting i from 0. we store whether the next number is divisible by 3 in fizz, and whether it's divisible by 5 in buzz. we then increment [i] for the next iteration, except the program will terminate if [i] is 100. we then print [i] if it's neither divisible by 3 nor 5 (and if it's non-zero), otherwise "Fizz" and/or "Buzz" is printed. finally, a newline is printed (again checking that [i] is non-zero).

this actually isn't the simplest fizzbuzz program! it's based off the first one i wrote as i was first writing the assembler, before aliases existed. with aliases, the program can be simplified to only load/store a single slot of memory:

alias i
->i [i] + 1
->exit [i] == 99

alias Fizz = ([i] + 1) % 3 == 0
alias Buzz = ([i] + 1) % 5 == 0

[i] + 1 if !Fizz & !Buzz
"Fizz" if Fizz
"Buzz" if Buzz
"\n"

i showed the other fizzbuzz first since i felt like it did a better job of introducing the concepts of the language, but this one shows aliases being used to define "temporary" variables, rather than just describing memory indices. the convention (i decided) is to use lower_case for memory indices, CamelCase for temporaries, and SCREAMING_CASE for constants.

but wait, there's more!

#reading input

if the input keyword is used, then the form of the generated code changes:

/* a bunch of macros and includes... */

static unsigned char mem[65536];

static struct termios termios;

static void cleanup(void)
{
	/* termios restore code... */
}

int main(void)
{
	/* termios initialization code and stuff... */

	for (int input = '\0'; !*EXIT; input = getchar()) {
		printf(/*...*/);
	}
}

basically this enables raw mode on the terminal, and nonblocking mode on stdin, and attempts to read user input every iteration. input is a builtin "constant" which is initially set to 0, and on each following iteration is set to the character which was typed, or to EOF if no input was available (technically the value of EOF is non-portable, but in practice i'm pretty sure it's always -1).

as of right now, there's no way to print input as a character without storing it to memory and using an inref (described below). i'd like to add a way to do this in the future; i just don't know what the syntax should be lol

the intent here is that i'm eventually gonna write tetris in printf. it's a work in progress and may not ever be completed. but the groundwork is all laid out at least :3

#inrefs

this is a sorta advanced feature which breaks the abstraction that memory stores only occur at the end of the iteration.

<-foo
<-foo if bar

this prints the string starting at index foo. unlike memory loads, these do use previously stored memory values.

this makes sense when you think about the resulting code: memory loads are just array subscripts, evaluated outside of the printf call. inrefs are generated as %s specifiers with a pointer argument, so they'll read stuff previously stored with %n.

this is also relevant for aliases. if you wondered why they're called "aliases" and not, like, "constants" or whatever, this is why: aliases aren't pre-computed. they're substituted with their initializer, but that initializer isn't evaluated ahead of time. this only matters for inrefs; otherwise the semantics are the same as though they were only evaluated once.

inrefs have limited use, especially since there's no easy way to store a string without storing each byte individually. i considered removing them, but they're so simple to implement that ultimately i decided to keep them, so you can take advantage of more of printf's functionality.

inrefs have identical syntax to outrefs, except the arrow points in the opposite direction. this means that, technically, <-exit is allowed. the byte after the exit slot is always zero, so like, you could do this i guess; there's just no reason you'd ever actually want to lol

#misc nitty-gritty details

  • comments are prefixed with # and go til the end of the line, like in scripting languages such as sh and python.
  • the assembler assumes that, on both the host and target machines, CHAR_BIT == 8 && sizeof(int) == 4.
  • integer literals can't exceed INT_MAX (2^31 - 1).
  • calculations always use (signed) int precision. note however that stores truncate to unsigned char.
    • this means that, unfortunately, we inherit C's undefined behavior on signed integer overflow. so make sure your calculations don't overflow int. this also means 1 << 31 is UB :/
    • this also means that >> does an arithmetic right shift (technically that's actually implementation-defined, but like, let's be real, you're not gonna ever use this on a system which does logical right shift with signed operands).
  • the assembler currently doesn't check for I/O errors. it probably should tho.
  • the grammar for integer literals is very strict: essentially, it's [1-9][0-9]* for decimal, 0(o[0-7])?[0-7]* for octal, 0x[0-9A-Fa-f]+ for hex, and 0b[01]+ for binary. digit separators aren't supported. this means that e.g. 0foo lexes as two tokens: 0 foo. this detail will literally never matter to anyone but i figured i'd document it anyway.
  • both character and string literals are supported (unprefixed), with the same syntax as C. the assembler doesn't check that they're well-formed though, so like, if you use an invalid escape, that'll only error when trying to compile the C code.
  • floating point isn't supported. i may or may not add support for it in the future, but there's a lot of special considerations to get it to work, and i'm not sure i'd be able to fit it in the 1000 line limit.

by the way, the entire grammar is more thoroughly documented in grammar.txt :)

#vim plugin

a vim plugin for .pfs is included in this repo! it has syntax highlighting and stuff like that.

#that's all i think

bye :3