Unsolicited advice about stackframes

Posted by pulkomandy on Sun Dec 7 13:46:10 2025  •  Comments (0)  • 

Stacks are everywhere

In my daily life I do a lot of debugging, reverse engineering, and software development. For some of these tasks, understanding how computers work in unnecessary detail is, well, sometimes useful. In particular, stacks and stackframes are very central to how most modern computing works. And, annoyingly, the implementation differs slightly depending on CPU architecture, compiler, and operating system, but also on CPU generation and compiler options.

Every time I need to dig into it, I have to refresh my memory or re-understand how it works on a specific setup. I may save a lot of time, for myself and hopefully for others, if I start writing it down. So, here you go, my currently very incomplete guide about how stacks and stackframes work!

This article does not attempt to retrace the history of how all features of a computer call stack work. For that I can recommend Sten Henriksson's A Brief History of the Stack, which will take you all the way to Babylonian folklore, with a brief detour through tax avoidance schemes. Here I will introduce concepts in a way that seems convenient for explaining them.

Note: I start this article from the very beginning. Maybe the relevant bits for you will be near the end. But this allows me to introduce the notations and concepts progressively, make the article somewhat standalone, and answer more questions than it opens. Skip some sections if you feel bored, come back to them later if you are lost, and let me know if more details about some parts are needed!

What's a stack?

A stack is a data structure where you can only access the top element. More precisely, you can:

push
a new element to the top of the stack. It is stored "above" the previous one.
pop
the top element from the stack. The previous "top" becomes visible again.
peek
at the top element to get its value. This does not change the stack.

Of course, popping and peeking are not allowed if the stack is empty.

In its simplest form, a stack is just a pointer to the top element. From now on we will not consider linked stacks, as these are rarely used in modern computers. Instead we will consider that the elements are stored next to each other, in consecutive memory addresses. This results in the following definitions for the 3 operations:

peek
is just dereferencing the stack pointer,
pop
is decrementing the stack pointer,
push
is incrementing the stack pointer and writing the new value there.

From now on, the stack pointer will be notated as SP.

Basic stack operations: pushing adds an item on top of the stack and makes SP point to it. Popping removes an item from the stack and restores SP to the previous top item.
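To make this concrete, here is a minimal C sketch of such an upward-growing stack (the array size and the use of an index instead of a raw pointer are my choices for the example):

int stack[64];	// storage for the stack; 64 is an arbitrary size
int top = -1;	// index of the top element; -1 means the stack is empty

void push(int value) { stack[++top] = value; }	// increment SP, then store
int pop(void) { return stack[top--]; }	// read the top, then decrement SP
int peek(void) { return stack[top]; }	// just dereference SP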

Call stacks

The concept of a call stack may seem obvious, as it is deeply engraved into how modern programming languages and CPUs work. It was not always this way. Going without a stack, or at least allowing yourself other ways to manage control flow in your code, opens up other ways to think about programs, the most popular of these probably being coroutines.

Anyway, let's look at how the functional/imperative programming paradigms of today typically use a stack for control flow.

Usually, a CPU executes instructions one after another, in the order they are stored in memory. But if that were all they could do, computers would be a lot less smart. Some instructions (usually called "control flow instructions") allow the CPU to "jump" around in the code. In particular, let's have a look at the CALL and RETURN instructions (on z80 and x86 these are called CALL and RET; other CPU architectures may use different names such as BSR (branch to subroutine) or implement things slightly differently. More details about specific architectures are in later sections of the article).

These instructions make it possible to implement the notion of a procedure or subroutine call. The idea is simple: there are some tasks in a program that need to be performed several times in different contexts. Repeating the same code over and over in many places would make the software difficult to maintain, as all these copies of the code would have to be kept in sync, and, moreover, it would be a waste of memory space. So, let's have only one copy of this code. Whenever we need to run that code, we can call into it by using the CALL instruction. When the code is finished, it uses the RETURN instruction to tell the CPU, and execution resumes in the initial code flow, just after the CALL.

Control flow diagram of code calling a subroutine. The execution flows linearly into the code until reaching the CALL instruction. It then goes to the subroutine, executes all instructions there until the RETURN instruction. Then it comes back to the caller code, resuming just after the CALL.

One important observation here is that the code in the subroutine does not make any assumptions about where it was called from. The RETURN instruction is just that, RETURN. It does not have any parameters. So how does this work? Well, the CALL instruction stores the return address, and the RETURN instruction retrieves it. Of course, we are not limited to only one level of CALL and RETURN. The subroutine may itself need to call more subroutines, and so on. So, these addresses are stored on a stack, with the CALL instruction doing a PUSH, and RETURN doing a POP.
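Reusing the push() and pop() from the earlier sketch, what a CALL/RETURN pair does boils down to something like this (a sketch only; real CPUs do this in hardware, and "pc" stands for the program counter):

unsigned int pc;	// the program counter

void call(unsigned int target)
{
	push(pc);	// save the return address: the instruction right after the CALL
	pc = target;	// jump to the subroutine
}

void ret(void)
{
	pc = pop();	// resume execution just after the CALL
}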

Parameters and recursion

This is all nice and fine, but it only works if the subroutine needs to perform exactly the same work each time it is called. However, this is unlikely to be the case. Typically it would need to perform the same operations, but work on different data. For this, we need to be able to pass some information to the subroutine so it knows which data to operate on. This is what parameters are for.

There are many ways to achieve this. For example, we could put the parameters in CPU registers or fixed memory locations (agreed in advance between the subroutine and the caller). CPU registers have an obvious limitation: there are only so many of them; what if a function needs more parameters, or parameters too large to fit in registers? So let's look more closely at the other option and see how it would look.

Control flow diagram of code calling a subroutine with arguments. It is similar to the previous diagram, but before the CALL instruction, the caller writes the parameter values to the memory reserved for the subroutine's arguments.

For each subroutine, we will dedicate a small amount of memory for its arguments. Before calling, the caller code will set these to the parameter values, and the subroutine can access them as needed. Problem solved, right?

Unfortunately, there are two limitations with this approach. One is just an inefficiency problem, but the other is more structural and makes some things impossible. Let's start with the first one.

The inefficiency here is that there are usually not that many subroutines running at the same time. However, the memory allocated to each subroutine's parameters is static: it is reserved for the use of that subroutine's parameters, and can't be used for anything else, even when the subroutine isn't running. Usually, subroutines don't take a lot of parameters, so each of them only uses up a small amount of memory, but there will be a lot of subroutines, and this memory usage adds up.

But let's not forget the other, more structural problem: what if we need to run the same subroutine multiple times at once? At first this may seem impossible, since the code execution is generally linear, but there are two cases where this can happen. The first one is recursion: a subroutine may call itself to perform some work. The second one is multitasking. This may happen because you have a system with multiple execution threads, or more simply while handling hardware interrupts, which stop the currently running code to do something else.

Both of these problems can be pretty annoying: one to the hackers who hate wasting even a single byte of memory, and the other to computer theorists who like to express their algorithms in recursive form for elegance, or really want to play with concurrent threads of execution on the same machine. As a result, everyone was annoyed and a solution was found.

Static, dynamic, and automatic variables

The solution is automatic variables, but for completeness I will also introduce the other two types of storage, if only to explain the naming used here.

We have already seen in the previous paragraph what a static variable is, but let's define it clearly. A static variable is one whose address in memory is fixed (hence the name static: it doesn't ever move). The address for these variables is determined at compile time (I am using a very liberal definition of "compile time" here, let's say, "before the program starts executing"). If the programmer wishes to reuse the same address for two different variables, it is possible, but it is up to the programmer to make sure the two variables will never be in use at the same time. This is way too much information to track for a human brain, and so this approach was quickly abandoned.

The second option is dynamic allocation. In this case, the memory address for a variable is determined while the program is executing. This is typically done with the malloc() function (at least in systems using the C programming language). Here, the programmer asks for a piece of memory large enough to hold their variable. When the variable is not in use anymore, they use another function (free()) to give that chunk of memory back to the system. The memory allocator keeps track of which chunks of memory are available, and, as long as the programmer uses the malloc and free functions correctly, the memory is managed cleanly, and there is no risk of reusing the same chunk of memory for two things at once.

There are, once again, two problems with this approach. One of performance, and one of stability and predictability. Let's again start with the first one. Over the years, many approaches were suggested for dynamic memory management. Besides the simple malloc and free, let's mention reference counting and garbage collection. They all suffer from the same problem: a lot of work has to be done "at runtime", while the program is executing, just to keep track of which parts of the memory are or are not in use. The more generic and flexible the memory allocation scheme is, the higher the overhead gets. The algorithms in play are also quite complex and do not lend themselves well to being efficiently optimized by hardware.

The other problem is, of course, more annoying. With the simple malloc/free scheme, there is still a lot of reliance on the programmer to correctly declare what memory they need, and to release it when done. Decades of experience show that programmers will usually not get this right. And since everything is dynamic, this results in bugs that are complex to diagnose and not always predictably reproducible. Dynamic memory is still in use today, but it is too slow and dangerous to use as a basic building block for something as simple as calling a function with parameters.

This is where automatic variables come in. The name "automatic" derives from the fact that there is no need to explicitly allocate and free these. Their lifetime is automatically managed by the control flow of the program: they are allocated by calling a function, and freed when returning. This is so convenient that, in C, it became the default way to allocate memory for local variables. You can still explicitly use the "auto" keyword when declaring a local variable, but no one bothers to do it, since it is the default in pretty much all C compilers.
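For illustration, here is what spelling the keyword out looks like; both declarations below mean exactly the same thing:

void f(void)
{
	auto int x = 1;	// explicitly marked automatic
	int y = 2;	// also automatic: "auto" is the default
}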

For the implementation, the idea is simply to put these variables on the stack. This is typically done using instructions like PUSH and POP (again, exact instruction naming will vary, and we will talk about specific CPUs later, I promise!). The compiler now takes care of pushing the parameters to a function before calling it, and of cleaning up after the function returns.

Let's have a look at how things are organized in memory now (here for a traditional UNIX type memory layout, there are of course other ways to set things up).

Typical memory organization in UNIX programs: executable code and constants at the bottom, then static variables, then dynamic variables, and finally the stack at the very top of memory

I will not get into all the details of why things are in this specific order today. The interesting point here is that both the heap (used for malloc/free) and the stack share the same RAM area (whatever is free after the static code, constants and variables are loaded). The heap gets to use this area from its start, and the stack uses it from the end. The current end of the heap is called the "break" or "brk". This organization allows the two to efficiently share that block of memory without giving either a fixed size, and things will sort of auto-adjust depending on how much stack and heap each execution of the program needs.

Anyway, the main point here is: to achieve this, we had to turn the stack upside-down. Now, pushing things to the stack will decrement SP, and popping will increment it. This is a bit disturbing at first, and sometimes in the literature you still see the stack represented with the topmost item on top. But we will see in the next chapters that having the stack upside down allows some neat tricks, so I think it is important to show it the "right" way, however confusing it may be at first. Maybe picture a stack of helium balloons, or some antigravity powered device?

To avoid further confusion, I will avoid using the notion of a stack "top" from now on. I will instead say "head".
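Here is the earlier C sketch adjusted for the downward-growing stack; only the direction of the index changes:

int stack[64];
int head = 64;	// index of the head element; 64 means the stack is empty

void push(int value) { stack[--head] = value; }	// decrement SP, then store
int pop(void) { return stack[head++]; }	// read the head, then increment SP
int peek(void) { return stack[head]; }	// still just a dereference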

Local variables

Whew! I think that's where the basics introduction ends, and we finally enter the part I wanted to write about in the first place.

I realize that I have not shown a stack diagram in a while. Let's see how it looks now with our complete setup, that is:

  • A main function calling a subroutine,
  • The subroutine has three parameters,
  • The stack grows downwards (towards low memory addresses).

Let's look at it just after the CALL instruction has put the return address on the stack.

Typical stack frame organization: the arguments are at higher addresses, and then the return address at the stack head

We immediately notice a problem: until now, we were only able to access the head of the stack. In this case, that would be the return address, put there by the CALL instruction. But our subroutine will not need the return address until the very end, and surely will use the parameters a lot before reaching that point. Without any other primitives, we would need to pop the return address and store it somewhere else in order to access the parameters, which seems pretty annoying.

Fortunately, the stack is stored in random access memory (at least on the current computer architectures that we care about here). So, we can also access it randomly. Let's introduce a new notation for that: SP-relative addressing. The idea is very simple: let's use our stack pointer SP as a base pointer and treat the stack as if it were an array. This leads to: SP[0] being the return address, SP[1] being the 3rd argument, SP[2] being the second and... wait, this argument order seems not very helpful?

Indeed, if the caller function pushes the arguments in that order (first to last), on a stack growing downwards, and we then access the stack as an array, we get the arguments in the reverse order of the array indices. The C language solves this simply by having the caller push the arguments last-to-first instead.

So let's have another look at a bit of pseudocode and the resulting stackframe with fixed argument order:

PUSH arg3
PUSH arg2
PUSH arg1
CALL subroutine
Resulting stackframe from the above pseudocode: SP[0] is the return address, SP[1] is argument 1, SP[2] is argument 2, and so on
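In C-like pseudocode, the callee's view right after the CALL looks like this ("word" is a stand-in for whatever the machine pushes, an assumption for the example):

typedef unsigned int word;	// whatever the machine pushes; an assumption

void callee_view(word* sp)	// sp: the stack pointer as the CALL left it
{
	word return_address = sp[0];	// pushed by the CALL instruction
	word arg1 = sp[1];	// pushed last by the caller
	word arg2 = sp[2];
	word arg3 = sp[3];	// pushed first by the caller
}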

Now let's talk about local variables. Unlike arguments, these are not set by the caller function, but are handled by the subroutine for its own use. Looking at our stack, we see that there is a lot of space just below SP. So let's just put our local variables there. We could access them as SP[-1], SP[-2], etc. But we have to keep in mind that the stack is still... well, a stack. Any function call will use that same space we just set aside for local variables, to store the parameters and return address of the called function. And, even more annoyingly, interrupts may happen at any time and also make use of the stack to store their own temporary data. Of course, we don't want either of these to alter our local variables.

The solution is to move the stack pointer downwards and "reserve" some space for the local variables. Typically, upon entering a function, you will see an instruction like SUB SP, nn to do so (if your processor has such an instruction; not all of them do, and achieving the same thing without a dedicated instruction can require some tricky code. We are getting into the details of specific implementations soon, I promise!).

SP moved down to point to the first local variable. Now the return address is at SP[3] and the first argument at SP[4]
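Continuing the sketch from above, here is the same view after reserving three slots for locals; every SP-relative offset has shifted by three:

void callee_view_with_locals(word* sp)	// sp: after the SUB SP,3 adjustment
{
	word local1 = sp[0];	// the freshly reserved local variable slots
	word local3 = sp[2];
	word return_address = sp[3];	// shifted by the three local slots
	word arg1 = sp[4];	// and so are all the arguments
}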

Ok, this looks good. We lost the direct mapping between SP[n] and argument n, but the compiler can compensate for that, as the offset is fixed and depends only on the number of local variables. Or is it?

More fun with local variables

Unfortunately, it is not quite so. There are several cases where the stack pointer has to be moved to different addresses.

Note: it's possible that some of these cases were introduced after the solution exposed later in the article was put in place, as they then became trivial to implement. Again, this is not a history of how things developed, just a convenient way to build things up progressively in this article.

Scoped allocations

In the C language it is possible to do something like this:
void subroutine(void)
{
	int aVariable;

	// Some code here...

	{
		int bVariable;

		// Some more code here...
	}

	// Even more code here...
}

Basically, a variable can have a scope (and lifetime) shorter than an entire function. In this situation, it would make sense for the compiler to temporarily allocate space for the new variable at the opening of the corresponding block, and release it at the closing of the block. If implemented strictly in this way, it means the offset to access aVariable would be SP[0] outside the block, but SP[1] inside the block where bVariable exists.

This problem can be avoided, of course, by keeping bVariable allocated from the start to the end of the function, which is what most compilers will do. So let's look at the real problems below instead.

Call frame setup

The call frame setup is what happens when a function needs to call another one. Because of the function arguments, this is a multi-step process: first push the arguments one by one, and then call the function. Each of these PUSH and CALL instructions moves the stack pointer. What if we want to pass the value of a local variable, or of one of our own arguments, to the called function? The compiler then has to keep track of how many things it already pushed, and make sure to adjust all accesses to local variables and arguments accordingly. Relatively easy job for the compiler, but not much fun for the developer, as they also have to track this when reading the generated code (for example in a debugger). What is SP[1] at one place in the code may well become SP[6] a bit further down, and then change again and again. Wouldn't it be much easier if things stayed where they are?

Variable length arrays

This one is possibly the most annoying and difficult to solve for the compiler. Sometimes, the size of a local variable is not known at compile time. Let's say your function needs a temporary array to perform some work. The size of the array depends on a parameter. Something like this:

void function(int count)
{
	int array[count];

	// Perform some work on the array...
}

Or, if you use an older version of C where variable length arrays are not a thing, a similar approach with the alloca() function, which simply allocates memory on the stack:

#include <alloca.h>

void function(int count)
{
	int* array = alloca(count * sizeof(int));	// alloca() takes a size in bytes

	// Perform some work on the array...
}

Here, the compiler does not know at compile time how to adjust the references to local variables, since it depends on a runtime parameter. So, we'll need some additional data at least in this case.

Other uses of the stack: register spills

Finally, the compiler may generate PUSH, POP and other stack access instructions for many other reasons. One of them is "register spills": temporarily storing registers to the stack in the middle of a calculation, and getting them back later. This is how the stack is typically used on, for example, the z80 and 8080, where it was not designed for SP-relative addressing.

Introducing frame pointers (finally!)

Looking carefully at how the stack pointer is used, we see that there are, in fact, two different roles for it.

  • As a stack pointer: with the typical PUSH and POP operations, as well as CALL and RETURN.
  • As an array base pointer: for access to function arguments and local variables.

These two roles tend to work together (for example: the caller uses PUSH to send arguments, but the callee accesses them as an array). But they also work against each other: when SP needs to be moved, it offsets all accesses that use it as an array base.

If you look at it this way, the solution becomes obvious: let's split these two roles into two separate registers! We will keep SP as the stack pointer, and introduce BP to use as the array base pointer (using the x86 name for the register this time, it is sometimes also called FP for frame pointer).

This requires a bit of extra work at the start and end of each function, to point BP at the right place, and restore it to its previous value when returning to the caller. In general, this will involve saving the previous value of BP to the stack itself. This opens an interesting new way to look at the stack memory: in addition to being a stack and an array, the successive values of BP saved to the stack can be used to traverse it as a linked list! This is used in debuggers in order to print a call stack, but it can also find other uses such as unwinding the call stack in languages that support exceptions.
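As a sketch, here is how a debugger could walk that linked list to print a call stack, assuming the layout described here (each saved BP slot holds the caller's BP, with the return address just above it) and a NULL saved BP in the outermost frame as the stop condition:

#include <stdio.h>

void print_backtrace(void** bp)	// bp: the current BP register value
{
	while (bp != NULL) {
		printf("called from %p\n", bp[1]);	// the saved return address
		bp = (void**)bp[0];	// follow the link to the caller's frame
	}
}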

This is where the exact details depend on the exact CPU used, and it is time to look at them in more detail in the next sections of the article. I will complete them progressively as I come across various schemes, and try to build similar quick-reference cards for each of them. So, please have a look at the next sections for the details.

8086 - Unknown compiler used for Philips PM3580 custom disassembler tool

Function prologue
PUSH BP ; Save the caller BP
MOV BP, SP ; Initialize BP to the current frame
MOV AX, frameSize ; Allocate space for local variables
CALL stackAdjust
PUSH r ; As needed if some registers need to be saved
PUSH s ; As needed if some registers need to be saved
Function epilogue
POP s ; Matching what was saved in the prologue
POP r ; Matching what was saved in the prologue
MOV SP, BP
POP BP
RETN
Call template
MOV AX, argN
PUSH AX
MOV AX, argM
PUSH AX
...
MOV AX, arg0
PUSH AX
CALL subroutine
ADD SP, nArg ; Caller is responsible for removing arguments from the stack
Stack adjust helper
stackAdjust:
	POP CX ; Remove address for stackAdjust from stack
	MOV BX, SP
	SUB BX, AX ; Reserve stack space
	JB .stackOverflow
	CMP BX, stackLimit
	JB .stackOverflow
	MOV SP, BX ; Set SP to the adjusted address
	JMP CX ; Return to the caller function

.stackOverflow:
	XOR AX, AX
	JMP runtimeError ; Trigger a runtime error 0 and abort the program
Stack frame on 8086. BP points to saved BP from previous function, BP[1] is the return address, BP[2] to BP[N+1] are arguments, BP[-1] to BP[-n] are local variables

The stack adjust routine is a separate helper function called from the prologue to add bounds checks on the stack. This is optional. In fact, functions from the standard library in this disassembly project use a simpler prologue:

PUSH BP
MOV BP, SP
SUB SP, frameSize ; Direct change of SP without bounds checks
PUSH r ; As needed ...
PUSH s ; As needed ...

Analysis tips

When reading the code, it can be difficult to tell at first glance if a particular PUSH instruction is setting up a function argument for a call, or just saving a register for restoring it later. Sometimes, you can look at the instruction after the CALL to get a hint: it will be an ADD SP to remove the arguments from the stack, and its parameter gives you a hint about the arguments' total size. However, the compiler may also optimize this and make it harder to follow. For example, if the code calls several subroutines in succession, the arguments for several of them may be kept on the stack until a final ADD SP,xx removes them all at once.

68000 - Unknown compiler used for Philips PM3580 custom disassemblers

Function prologue
LINK.W A6, #frameSize
MOVE.L r, -(A7) ; As needed if some registers need to be saved
Function epilogue
MOVE.L -$4(A6), r
UNLK A6
RTS
Call template
	MOVE.L A0,-(A7)
	MOVE.L A1,-(A7)
	JSR subroutine

	ADDQ.W #$8, A7
Stack frame on 68000. It is exactly the same as on 8086, except for the register names.

The 68000 uses A7 as its stack pointer. It does not enforce a specific register to use as a frame pointer, but A6 is a common choice. The LINK instruction does in a single instruction what the 8086 prologue does in three: save the old frame pointer, set the new one, and subtract a constant value from the stack pointer to make space for the stack frame.

The UNLK instruction does the opposite in the function epilogue. Note that, since UNLK resets the stack pointer anyway, restoring the registers is done using A6-relative addresses, without bothering to increment the stack pointer when getting the values.

When multiple registers need to be saved, MOVEM instructions can be used. For a big function, something like MOVEM.L D2-D7/A2-A5, -(A7) in the prologue will be typical, with the matching MOVEM.L -$128(A6), D2-D7/A2-A5 in the epilogue. Note that in this case, D0, D1, A0 and A1 are considered "scratch" registers: the called function is free to modify them, and the caller does not assume their values will be preserved. So simple functions that only need these four registers may end up with a 1-instruction prologue and a 2-instruction epilogue!

The way to call functions is very similar to the 8086: almost the same instructions, just with different names. The 68000 wins because it has an instruction to push a constant directly onto the stack: MOVE.W #$6, -(A7) for example. On the 8086, you have to first load the constant into a register, and then push that register.

Analysis tips

The epilogue does not have any parameters; it is always the same two instructions. This makes the resulting 4-byte sequence (4E 5E 4E 75) easy to identify when looking for function boundaries.

Unlocking a Chromebook

Posted by pulkomandy on Wed Oct 29 17:17:51 2025  •  Comments (0)  • 

Recently a Chromebook landed on my bench. It came from the training center of a supermarket chain, which gives such machines to trainees as a support for coursework and communication, apparently (with everything stored on Google services).

The device is locked by the training center and if you attempt to do a factory reset or enable developer mode, it won't work. It just shows a message asking to send it back to them.

The device is an HP Chromebook 11 G8 EE. If you own an unlocked one, it is apparently easy to reflash the BIOS from ChromeOS directly (you still have to disassemble the machine and unplug the battery, if I understand correctly). But the one I have is locked, and so, I cannot access ChromeOS.

So, the only way is to disassemble the machine, unsolder the flash chip containing the BIOS, and rewrite it with a new firmware. The disassembly is fairly easy, with a few screws to remove and a few clips to unclip. The keyboard has a flat cable long enough that you can easily lift it away and access the connectors to remove it. Then, you have to extract the motherboard from the machine (with a few more things to unplug), because the flash chip we're looking for is on the underside of the motherboard.

You can find Youtube videos showing the disassembly process. I confirmed the flash chip was the one I was looking for using a USB microscope to check its part number (which should start with 25), and that it has the expected pin count (8).

I used a CH341A based programmer. It comes with a voltage adapter (as the chip is 1.8V and the programmer runs on 2.5V). It also comes with a "chip clip" that you can attach to the chip. This seems to work for reading but not for writing, for two reasons: the chip is write-locked by one of its pins (you would have to unlock that using ChromeOS software), and the chip clip lead is too long, making the communication too unreliable (I checked this by putting the chip in the clip after desoldering it). Instead, I had to desolder the chip using a hot air gun, and then solder it onto the adapter board provided with the ROM programmer (mine didn't come with a suitable ZIF adapter; the included one was too narrow).

Once that's done, the process to read and reflash the ROM is quite easy. You have to take care to extract the old content of the chip, because it contains some hardware serial numbers such as the MAC address. Then there is a tool to inject that into the new ROM file. I flashed a coreboot/UEFI firmware provided by Mr. Chromebox. This turns the machine into a regular PC instead of a Chromebook.

This process is slightly annoying, and at first I didn't understand why it was made so complicated (requiring a soldering iron and a few other special tools) while at the same time not being very involved (the chip is relatively easy to desolder, and there is no extra signature validation on it or anything like that). But after discussing it with other people, the reason is clear. The goal is not to protect access to the machine itself. The goal is to protect the user's data. Locking the chip from being written by unprivileged software makes it quite well protected from "rootkits" that could store malicious data and software there. Such attacks now require physical access to the inside of the machine, which is less easy to achieve. And apparently, the firmware checksum is used as part of computing a decryption key for the eMMC flash storage where all the user files are saved. So, despite having full control of the machine, I have no exploitable access to any data left there by the previous owner. This makes sense with the Chromebook system: making the hardware itself a relatively cheap commodity, and instead focusing on the data, which is the most valuable thing here (both for the user, and for Google, who loves all the data they can get their hands on).

After flashing and rebooting, the machine started Haiku just fine from a USB drive. I now have a few drivers to write and debug: for the I2C bus handling the touchpads (interrupts aren't coming in as expected), and for the eMMC storage (I made some fixes to the SDHCI host controller, but for now the storage driver in Haiku only handles SD and SDHC cards, not MMC and eMMC). The machine runs a Celeron N4120 and 8GB of RAM, which is quite reasonable. Not a super high end machine for compiling WebKit and other big projects, sure, but it will be more than enough for a bit of coding, word processing, and web browsing. And it's entirely passively cooled, which is nice.

Some website updates

Posted by pulkomandy on Thu Sep 25 18:56:15 2025  •  Comments (0)  • 

Hello There!

You may have noticed some changes on this website recently. First of all, I joined a webring! With the web becoming full of LLM generated nonsense, I figured it would be a good idea to start having links to other real websites. Fediring is a ring for people who post on the fediverse, so it covers a wide range of topics. Anyway, it means this website now has neighbors! So, go visit pfriedma's and lily's websites.

I also made some changes to the CSS to make it more friendly to mobile devices and narrow/low resolution displays. The sidebar will move to the bottom of the page in that case. This wasn't really possible when I first set up this CSS theme a few years ago, but now, with media queries, it is. I also removed the footer link to Free CSS Template (which was required by the original CSS file I based things on), as that website is now gone and replaced with something farming such links.

Finally, I have recently released a FastCGI app to allow logging in to my Trac bugtracker instance using XMPP accounts. I will be deploying it to all the Trac projects I run over the next few years, and hopefully it will make it easier for people to use these.

I have also fixed some XHTML markup issues in some articles. So if you read this website using ATOM/RSS, you may have seen old articles get re-posted because of these changes. Sorry about that. I hope you enjoyed re-reading them.

That's all for today! The next few weeks will be a bit busy (in a good way) so there may not be more posts and projects here for a while. But I'll surely be back as things get quieter otherwise and I get to spend time in front of a computer.

News about VTech stuff

Posted by pulkomandy on Tue Sep 9 18:45:34 2025  •  Comments (0)  • 

Hello, happy year 2025!

Four years ago, I wrote an article starting with a note on how I had not made any progress on VTech hacking projects since 2017. So it seems it's about time for a new update.

That old article has a few things wrong. First of all, the CQFD Scientus is not a z80 computer as I expected; instead it is a 68HC05. Which is a bit ridiculous given that it has quite a lot of ROM and RAM, and so that poor CPU with a tiny address space is constantly bankswitching, making the ROM not very fun to disassemble. I understood the banking scheme and a few other things, but didn't go much further, because it doesn't seem like a very fun platform to write code for.

The other wrong thing is about the mysterious audio chip. I wrote that it could be just a volume controller, but I think that's not right, and I have since identified a possible TI microcontroller with a matching pinout. But I forgot what it was.

Anyway, a lot has happened as I got into the V.Smile console. This also has a weird CPU but at least no bank switching. I ended up porting an entire compiler to it that is now working acceptably (the code is not optimized, but it works). The next step is figuring out some issues with controller input and writing drivers for the sound system, and then probably trying to make some games for it.

But that's not what I'm writing about today. I also have some news about the IQ Laptop. I had bought a German version of it, but since then I also found a French one, called the VTech Le Manager. The packaging and user manual seem a bit confused about whether it is just a kid's toy or a serious machine. I'd say it fails to deliver on both fronts: not really usable as a serious machine, but also not fun enough to be a good toy. The hardware seems pretty good, anyway, so as long as the software can be replaced, that should be fine.

I mentioned a Linux port in the previous post; well, that is now working. It runs from an external cartridge for now. I would like to replace the internal ROM instead, and I have learnt that the French version of the laptop may have shipped with flash chips internally instead of ROM. Either they needed to update it until the last minute, or the production volume was so low that they didn't bother making proper ROMs for it. I will try to dump that memory for future emulation.

Here are some random notes about the machine:

  • 2MB of RAM and 2MB of ROM (or Flash internally)
  • Grounding one of the CPU pins can switch it into an internal debugger, allowing code to be loaded and executed from the serial port. So it's easy to prototype things on.
  • The cartridge port uses a 45 pin, single row, 1mm spacing connector. Connector references are either JST ICM-A45H-SD14 or DDK MCB-545A, but both of these are out of stock and no longer manufactured. Sullins SFM210 could be somewhat compatible (right number of pins and spacing, at least). To be confirmed when I get to make expansion RAM cartridges.
  • 0000 0000 - 001F FFFF: 2MB internal RAM
  • 0020 0000 - 002F FFFF: Internal ROM or Flash
  • 0300 0000 - 0307 FFFF: Internal data flash
  • 0300 0000 - 03FF FFFF: Internal IO space
  • 0400 0000 - 04FF FFFF: Cartridge port 1
  • 0500 0000 - 05FF FFFF: Cartridge port 2

Each cartridge port gets 22 address lines for up to 16MB of address space. Extra RAM cartridges could push this thing all the way to 34MB of RAM! I received GERBER files from Alexandre for his cartridge and adapter; I can use that to document the cartridge port pinout if I need to make my own cartridges. And I also wonder how much the DragonBall EZ CPU could be overclocked safely?

David Given has ported EmuTOS to the AlphaSmart Dana, which uses the same CPU. With that and Alexandre's Linux work as samples (for the keyboard driver mainly), an EmuTOS port to this machine should be within reach?

Jean Gastinel's Thesis - Design of an Alphanumeric Terminal

Posted by pulkomandy on Tue Sep 9 18:45:10 2025  •  Comments (0)  • 

Most of this happened in March 2025, but I forgot to write an English blog post about it.

I have previously written about the Thomson EFCIS lineup of display controllers.

During that research I had found out about the EF9364 (aka SF.F 96364), designed by Jean Gastinel. I had found a patent, a datasheet, and mentions of it in Philippe Matherat's thesis, as he based parts of his design for the EF9365 on it.

What was missing was Jean Gastinel's thesis, which had not made it onto the internet. So, I subscribed to the nearest university library and asked for an inter-library loan to get it out of the archives, and I digitized it.

That thesis did not disappoint. It is made of 3 parts. The first one is an overview of the integrated circuit manufacturing process at the time, explaining how to make a transistor, the process used to deposit various layers and apply masks on the silicon substrate, and then how to build simple logic gates from that.

At the time, most ICs were entirely manually routed, but not this one. The circuit was quite complex, and so the idea was to instead design a few basic "logic blocks" that could be easily copy-pasted and chained together. There is also a performance evaluation to determine what can and cannot be done inside the chip (for example, the pixel shifter is too fast and had to be moved out to a separate component).

The second part of the thesis documents the design and operation of the alphanumeric terminal circuit and the design choices that led to it being this way. The goal was to build a terminal to access remote machines. Other parts of such a device (such as the keyboard scanner, UART and modem) were already available as integrated circuits at the time, and so the only missing part was the CRT controller.

Moreover, the availability of relatively affordable DRAM opened the way for a discrete design with good performance. Previous terminals used a delay line as a memory, as that was much cheaper, but it made the logic quite complicated and slow.

Finally, the third part of the thesis puts it all together, showing how the desired circuit can be implemented using the basic logic blocks, how it was prototyped on wire wrapping boards, and then put into production.

You can read the entire thesis (in French) in my digitized HTML version. The original was typed on a typewriter, with pictures pasted at the right place, and equations manually touched up. I mention this because it helps in understanding how futuristic the idea of a cheap display terminal for a computer would have sounded at the time.

So that's it for the thesis. It is now available for all to enjoy. However, it also raised some further questions. In the conclusion, Jean Gastinel suggests that the process of building a complex circuit from basic elements could be automated by some kind of compiler, but that people were pushing against it, because this way of designing ICs with pre-made logic blocks was not in use in the USA at the time; instead, over there, they drew transistors one by one on a huge drawing area.

This could be just a thesis student being convinced that his way of doing things must be better, against all evidence. However, he was right: what he describes is exactly how you would design an IC today, and it indeed allows for much more complicated circuits.

Overall, this gives a feeling that these people were, in a way, living in the future (even more so in the conclusion of Philippe Matherat's thesis, where his graphical terminal circuit is used to connect to some remote machines and display 3D wireframe graphics and other cool things, all the way back in the late 1970s). These were things that were not even in movies and popular culture at the time.

So, how did they end up there? It doesn't look quite like your typical research thesis, especially when you consider that this was supposed to be a mathematics thesis (computer science wasn't yet a thing in French universities). Besides the mentions of the future possibility of building a compiler for assembling the basic logic blocks into a manufacturable chip, there is not a lot of mathematics in there.

Fortunately, the story doesn't stop here. After I shared the thesis online, a few people read it, and one of them decided to contact Philippe Matherat, to ask if he could share a bit more backstory about this. You can find the unedited replies posted publicly as comments on the linuxfr forum.

If you don't want to read it in the original French, here is a summary.

The people working on these things are from the 1973 class of the Ecole Normale Supérieure, one of France's most prestigious schools. Jean Gastinel, in particular, is the son of Noël Gastinel, who was one of the early adopters of computers in France. Noël worked at Grenoble university and got the university to set up an IBM 370 after visiting IBM and US universities.

However, that was still a large, room-sized computer. Jean Gastinel was a hacker, playing with electronics and things like remote controlled model airplanes. He got interested in computers and asked his father how to go about building one. His father sent him to François Anceau at Grenoble university, who in turn directed him to Gerard Noguez, at Paris 6 university. Jean then got two of his friends (Jean-Marc Frailong and Jean-Luc Richier) into the adventure and, from 1973 to 1975, they set out to design and build a 12 bit computer using Texas Instruments MSI circuits. So, even before his thesis, he was already experimenting with cutting edge technology.

Overall, it is still a bit of an unlikely story. Essentially, a group of students designed a computer and an integrated terminal for it, and then added pixel graphics with hardware accelerated drawing. What a project already! And then, they signed a contract with a company that would manufacture the chips they had designed. Interestingly, they were not interested in creating a start-up; instead, the money went to the school and allowed the creation of a real computer science research lab.

This created an interesting situation. Initially, EFCIS was more used to working on custom designs for specific customers (what we would call ASICs today). But in this case, all the work was public (although patented) and they were allowed to sell the resulting circuits to the general public. They also owned the patents (well, the parent company Thomson did), which allowed them to be reused later in upgraded and modified versions of the circuits, eventually leading to an entire family of display controllers with various capabilities.

This in turn eventually shaped a lot of the French microcomputer industry in the following years. Not only the Minitel, but also the VG5000, the Alice, the Squale, the Goupil, and the Videopac+ all use chips that derive from this initial work.

Philippe Matherat also says that they continued for a while to work in this way: designing things in a research lab, and hoping that some company would pick it up and industrialize it. Maybe that explains why there is not quite a French Silicon Valley: for that to work, you may need the designers of the new technologies to found their own companies, push their own thing commercially, and raise funds to support it.

Later on, the commercial strategy at EFCIS meant that they stopped collaborating with ENS students. Around 1982, one of the projects was an 8 bit computer called Monocarte-THEMIS, which was designed to be deployed in French high schools. But eventually, some cheaper and simpler designs were selected for that (I guess it would be the Thomson MO and TO machines).

So now, we can only dream about what could have been, and enjoy the nice video chips that resulted from all of this.

This story is not completely over. During my research, I also looked at other theses mentioning alphanumeric terminals, and I have found one that may be closely related to the Minitel. Unfortunately, the place where it is stored is reorganizing its documents and could not make it available, so this one will have to wait a bit more. We may also soon get a picture of a de-capped EF9364. That would be nice, as the thesis opens with a view of the circuit die, but in quite bad quality in the printed copy I got, and insufficient to attempt reverse engineering. Maybe I can soon replace it with an actual picture of the chip. I'm curious whether all the markings will be exactly the same.

I guess the conclusion is: public libraries are awesome and better than the internet. We should use them more.

Mhing - Simplified Mah Jongg

Posted by pulkomandy on Tue Sep 9 18:44:27 2025  •  Comments (0)  • 

Some time ago, I bought a Mah Jongg set. The classic rules that come with it are a bit complicated and ceremonial (building the wall, determining each player's wind with a die roll, etc.), and on top of that they are in English and can only be played with 4 players. So I discovered the game of Mhing, which is simpler and can be played by 2 and up to 6 players.

Object of the game

The goal of the game is to be the first player to reach 500 points. Points are scored when a player announces "Mhing" by forming four sets of 3 tiles each, along with a pair. Sets are formed either as sequences (3 consecutive tiles) or as triplets (3 identical tiles).

Game summary

Each player receives 13 tiles. The rest forms the well, kept face down (for example in a tile bag). The dealer starts by drawing a tile from the well, then discards a tile from their hand. Discarded tiles can be used by the other players to complete a set, or to "go out" and end the round. The game continues this way, clockwise, with each player drawing from the well or picking up the discarded tile, until one of the players announces "Mhing".

When a player announces "Mhing", they score the points corresponding to the combination of sets they have formed. The game becomes more interesting as the players get familiar with the many possible combinations.

The Mah Jongg tiles

The set has 144 tiles: 108 tiles split across 3 suits, 28 honor tiles, and 8 flower tiles.

The suit tiles

There are 3 suits: bamboos, circles, and characters. Each suit has 4 series of 9 tiles going from 1 to 9.

Basic tiles of the set

The honor tiles

The honors are the 4 winds (west, north, south, east) and the 3 dragons (red, green, white). There are 4 tiles of each honor.

Honor tiles of the set

The flower tiles

The 8 flower tiles are not used in your hand to form combinations. When you draw one, place it face up in front of you and draw another tile for your hand. They give extra points at the end of the round.

Setup

Each player draws 13 tiles, then lays out their flower tiles face up and replaces them with the same number of tiles from the well.

Combinations

The goal of the game is to form 4 sets of 3 tiles each, plus a pair. When you reach this goal, you do not discard a tile. The winning combination therefore has 14 tiles. The first player to complete this combination wins the round and scores the points corresponding to their combinations.

Sets

Sets can be formed in 2 ways: either as a sequence (3 consecutive tiles of the same suit), or as a triplet (3 identical tiles, honors included). Sets of 4 or more tiles do not count.

A pair is made of 2 identical tiles.

Playing

On their turn, each player draws a tile, then discards one from their hand. The other players can pick up the discarded tile under 2 conditions:

  • Either to form a set
  • Or to form the pair allowing them to declare "Mhing" and end the round.

If no player wants to take the discard, play passes to the next player, who draws a tile in turn, and so on.

Claiming the discard

A discard can be claimed by any player, even out of turn. Claiming the discard is possible until the next player has started their turn by drawing a tile; after that, the discard is "dead" and can no longer be picked up.

When a player claims the discard to complete a set, they lay that set face up in front of them. They must then discard a tile to keep a hand of 13 tiles (unless they end the round by declaring Mhing).

Only sets completed with a claimed discard are laid out. Sets formed purely by drawing tiles from the well stay in the player's hand.

A claimed discard must be laid out with the set it completes. It can never be kept in the hand.

Scoring

Only the player who declares Mhing scores points. The opponents are not penalized for the tiles left in their hands at that point. The winning player counts their credits, which are then converted into points according to the following table:

1 credit	2 points
2 credits	4 points
3 credits	8 points
4 credits	16 points
5 to 7 credits	32 points
8 to 10 credits	64 points
11 to 13 credits	128 points
14 to 16 credits	256 points
17 to 19 credits	512 points
20 to 22 credits	1024 points
23 to 25 credits	2048 points
26 to 28 credits	4096 points
29 to 31 credits	8192 points
32 to 34 credits	16384 points
35 to 37 credits	32768 points
38 to 40 credits	65536 points
41 credits	131072 points

This principle encourages each player to try to earn more credits by improving their combinations, while taking the risk that an opponent declares Mhing first, making their hand worthless.

Credits

Credits depend on the combinations of sets. There are 18 kinds of them, illustrated on the score sheets. These combinations can be arranged in millions of different ways, multiplying the interest of the game as the players get to know them better.

1 credit

  • 1 credit is awarded for each flower
  • 1 credit is awarded for the pair, if it is a pair of 2s, 5s, or 8s
  • 1 credit is awarded for a double sequence, that is, two sequences of the same rank but in different suits
  • 1 credit is awarded for a double triplet, that is, two triplets of the same value
  • 1 credit is awarded for a broken royal sequence, that is, the combination of the 1-2-3 sequence and the 7-8-9 sequence in the same suit, without the 4-5-6.
  • 1 credit is awarded for each triplet of honors
  • 1 credit is awarded if the hand contains only two different suits (and no honors)
  • 1 credit is awarded for a hand containing only sequences (apart from the mandatory pair), that is, no triplets at all.

3 credits

  • 3 credits are awarded for an identical double sequence, that is, two sequences of the same rank and the same suit
  • 3 credits are awarded for a royal sequence, that is, the combination of the 1-2-3, 4-5-6 and 7-8-9 sequences in the same suit
  • 3 credits are awarded for a "nothing matches" hand (see below)
  • 3 credits are awarded for a hand of a single suit plus honors
  • 3 credits are awarded for a hand containing only triplets (apart from the mandatory pair), that is, no sequences at all.

5 credits

  • 5 credits are awarded for a hand with one combination of bamboos, one of circles, one of characters, one of winds, and one of dragons
  • 5 credits are awarded for a low hand, that is, if no tile is above 5 (and no honors)
  • 5 credits are awarded for a high hand, that is, if no tile is below 5 (and no honors)

8 credits

  • 8 credits are awarded for a hand with three triplets of dragons, one of each color. These eight credits are not cumulative with the credits awarded individually for the dragon triplets.
  • 8 credits are awarded for a single-suit hand, without honors
  • 8 credits are awarded for a "nothing matches" hand with honors (see below)

Nothing matches

The "nothing matches" hand is an alternative to normal play, which consists of trying to get the most mismatched hand possible. This hand is defined not only by the absence of any laid-out combination or combination in hand, but also by the absence of any potential combination (for example a 1 of circles and a 3 of circles, which only need the 2 of circles to form a sequence).

Formally, tiles of the same suit must be at least 3 apart (1-4-7, 2-5-8 or 3-6-9 being the tightest acceptable spacing). When going for "nothing matches", you can never take the discard, since that would force you to lay out a combination with the picked-up tile, except to declare Mhing.

When, after drawing, you end up with a hand in this configuration, you can announce Mhing and win the round.

"Nothing matches with honors" is a variant of "nothing matches" where you additionally hold one dragon of each color and one wind of each direction.

Obscure Soundchip Tuesday: The SGS M114

Posted by pulkomandy on Fri Aug 22 17:28:16 2025  •  Comments (0)  • 

(yes, I'm writing this article on a Monday. But the title sounded better this way. Whatever.)

As you may know, I'm interested in old computers and in sound and music. Recently I was reminded by a youtube video of the Amstrad CKX100, a toy synthesizer from 1988. I had already heard about it but never had an opportunity to disassemble one, so I didn't know how it was built. Well, this video did just that, and got me curious about the soundchip used, the ST M114S. I had not heard about it before and was curious how it worked, and whether I could possibly connect it to a computer to play some music.

So, a few web searches later, I didn't find a lot, but I found what I was looking for: a datasheet and a research paper from the chip creators, both explaining in some detail how the chip operates.

The research paper is dated 1985 and mentions that the chip was developed in cooperation with GEM, another synth keyboard manufacturer (who went on to use it in several of their keyboards). It was designed by SGS before the merger into ST, so it was an opportunity for me to learn a little more about the Italian side of ST Microelectronics. Previously I had known them mostly for the EF9345 (used in the Matra Alice and the Minitel) and some of the chips used in the Thomson TO8 and MO6 machines, mainly the palette circuit that was developed for the TO9 but later became an off-the-shelf component in ST databooks.

The chip uses an 8KB ROM where sound samples are stored. The samples come in power-of-two sizes from 16 to 2048 bytes, and are played in a loop to sustain the sound. They use delta coding, which means they encode the derivative of the actual sound. An analogue integrator is used to turn this back into the original signal, which has the advantage of providing some level of smoothing of the output.

The chip does a bit more than just play the table, however: it can smoothly transition from one sample to another by mixing them together. And it can also play the sample at various speeds by using only one out of every two or four bytes (but this risks having a not perfectly balanced signal, which will drive the integrator into saturation).

Moreover, it is indeed designed to be driven by a CPU. It is not a "keyboard-on-a-chip" system that would handle keyboard scanning directly and generate the sounds from there. Instead there is an interface for a normal CPU bus where you can send commands to trigger notes. The commands are made of 8 6-bit words. This may seem a bit strange; 6 8-bit words may have made more sense (especially as most of the values transmitted are indeed 8 bits), but I guess it allowed them to save some pins and fit the thing in a 40 pin DIP package.

As a result, this chip was apparently used in some pinball and arcade machines. However, MAME doesn't have an emulation for it yet.

During my search for the datasheet, I of course looked at ST databooks, and found that there is a larger version called the M114A. This one has a 48 pin package, which allows it to address 256K of memory instead of 8K. Surely this must allow for much more complex sounds.

Interestingly, the datasheet for the M114S shows a clever hack to increase the ROM space to 16K instead of 8. The chip has an "output enable" pin to signal the ROM when it wants to read a byte. This signal will always trigger twice within a short time since the chip will always read a byte from two samples (remember, it then mixes them together). So, they suggest connecting a flip-flop to this output and resetting it with an RC circuit tuned to a longer delay. The output is then used as an extra address bit, and as a result, the first sample will be from the first half of the ROM, and the second sample from the second half. This simulates the behavior of one of the pins of the M114A, which probably does the same without needing an RC circuit as it knows about the chip internal timing.

Of course I am interested in wiring the chip to RAM. The datasheet says it's possible, but some arbitration will be needed to make sure the M114 and the CPU don't both try to access the RAM at the same time. Should I go with dual-port RAM? Can I simply block the M114 from accessing the bus when the CPU wants to use it? Maybe I can do some fun tricks where the CPU manages to synchronize itself with the chip and feed it data without using the RAM at all? And, first of all, can I even get one of these chips? There seems to be someone selling them ("used") online, but will they be real chips, or some random other chip that has been relabelled?

I'm also considering writing an emulator for this chip, just to hear what it could sound like. However, to do anything like that, I think I need ROM dumps from one or more of the original devices using it, at least to make sure my understanding of the storage format is correct. There is a dump from one of the pinball machines using it, but it uses mainly or only PCM samples, not delta encoding.

Finally, I am surprised that this chip was not more popular, for example in home computers. OK, maybe needing a whole 8K of ROM was part of the reason. But it sounds quite nice, and it would have been interesting to see it alongside the AY3, the SID, and the various options from Yamaha, don't you think?

Programming notes

This information comes directly from the datasheet, presented here in an easier-to-read HTML format.

All communication with the chip is done with 8-byte commands (only 6 bits are used in each byte). The bytes must arrive no more than 128µs apart, otherwise the chip considers itself out of sync and the command is aborted.

The command layout is as follows:

Byte  Fields (bit 5 down to bit 0)
 1    Attenuation
 2    Output, T1 address MSB, T2 address MSB
 3    T2 address LSB
 4    T1 address LSB
 5    T1 length, Mode
 6    Interpolation, Immediate, Octave divider
 7    Channel, Tuning LSB
 8    Note, Tuning MSB
  • Attenuation: higher values for lower volume
  • Output: route the channel to one of the four outputs
  • Table addresses: bits A12-A4 of the ROM address to read from
  • Length: number of bytes in T1 sample, from 16 to 2048
  • Mode: various ways to access the two samples, see below
  • Interpolation: Mixing value K. The generated output is S1 * (K + 1)/16 + S2 * (15 - K)/16. A value of 15 plays only table 1, and lower values add more and more of the second sample to the mix.
  • Immediate: apply the new attenuation immediately (otherwise a ramp from the previous value is generated)
  • Octave divider: divide the frequency by two, to play an octave lower
  • Channel: channel on which the command must be run
  • Tuning: finetune the selected note, see below.
  • Note: from 0 to E, 15 possible notes a semitone apart, covering a bit more than an octave. F is reserved for special commands (see below)
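As a quick illustration, here is what sending one command could look like from the driving CPU's side, in C for readability. This is only a sketch: the bus write function is hypothetical, since it depends entirely on how the chip is wired into your system. The 6-bit masking and the 128µs inter-byte deadline are what the datasheet specifies.

#include <stdint.h>

extern void m114_write(uint8_t value);  /* hypothetical bus write */

/* Send one 8-byte command. Only bits 5..0 of each byte are significant,
 * and consecutive bytes must arrive less than 128µs apart, so the writes
 * are kept back to back. */
void m114_send_command(const uint8_t cmd[8])
{
    for (int i = 0; i < 8; i++)
        m114_write(cmd[i] & 0x3F);
}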

The possible values for the tuning:

  • 0 to 5: detune by -6/12 to -1/12
  • 6: detune by -2/1000
  • 7: detune by -1/1000
  • 8: exact note
  • 9: detune by 1/1000
  • A: detune by 2/1000
  • B to F: detune by +1/12 to +5/12

The special commands are not related to a channel; they apply globally.

  • F8: ROM ID mode. The chip will generate addresses 0 to 8 in a loop and send them to the ROM.
  • F9: SSG, enable synchronous mode globally. In synchronous mode, frequency changes are delayed until the next loop point in the sample
  • FB: RSG, disable synchronous mode. Frequency changes happen immediately.
  • FA: RSS, invert the state of synchronous mode just for the next command.
  • FC: Keep the existing frequency from a previously played note.
  • FF: Immediately stop the channel playing.

TODO: SSG and RSG are swapped in another part of the datasheet. Check which one is correct.

And finally the modes affect how the waveforms are accessed:

Mode  T1 speed  T2 speed  T2 length
000   /2        /2        = T1
001   1         1         = T1
010   /4        /4        = T1
011   1         1         max(T1/2, 16)
100   1         /2        T1/2
101   1         1         max(T1/4, 16)
110   1         /4        T1/4
111   1         1         max(T1/8, 16)

Emulation

It turns out someone already wrote an emulator!
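For a taste of what the inner loop of such an emulator has to compute, here is the interpolation formula from the programming notes above, transcribed to C. The formula itself is from the datasheet; the exact fixed-point rounding behavior of the real chip is my guess.

#include <stdint.h>

/* out = S1 * (K + 1) / 16 + S2 * (15 - K) / 16, with K the 4-bit
 * interpolation value from the command; K = 15 plays table 1 only. */
static int16_t m114_mix(int16_t s1, int16_t s2, uint8_t k)
{
    return (int16_t)((s1 * (k + 1) + s2 * (15 - k)) / 16);
}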

Example ROM

Thanks to Victor Baeza, here is an example sample ROM from a Baldwin/Viscount C400 organ. I have converted it to a WAV file so you can get an idea how the samples are typically organized.

Next steps

I'm looking for ROM dumps of the Amstrad CKX100 (both the samples and the CPU code) so I can learn more about how it uses the chip. Does anyone have one and is willing to dump the ROMs?

Putting a Philips PM3580 logic analyzer back into service

Posted by pulkomandy on Tue Aug 19 20:09:36 2025  •  Comments (1)  • 

For my electronics projects, I occasionally need to probe some logic signals. Until now, I've been using a Saleae 8 clone (don't buy the original: the value is not in the very simple hardware but in the software shipped with it, which you likely don't need), together with sigrok and PulseView (resulting in an entirely free software tool, since the firmware loaded onto the logic analyzer is also open source).

This serves me well, but the 8 channels quickly become limiting if you work with anything other than serial protocols (I2C, SPI, ...). Recently I wanted to probe a printer plugged into an old-style parallel port, and that would need at least one more pin. So I started looking for something with more inputs.

I was surprised by the large gap between the super cheap analyzers with 8 channels, and everything else. This is for good reasons: wider analyzers typically also support higher frequencies, larger voltage ranges, and are not as commonly cloned by cheap manufacturers that sell them at a low cost without any support.

I decided that if I was going to buy something new, I'd rather get at least 32 channels, possibly more. This would allow me to, for example, probe a z80 CPU or other similar 8-bit hardware. Not seeing anything with a reasonable price, I turned to eBay to see what I could get from the used market.

First I saw an HP logic analyzer for less than 100€. However, it was in pretty sad shape, missing the power supply, the top cover, the floppy drive, and also all the probes and cables. All of this can be replaced, and each of these parts individually would probably have been OK. But, adding it all up, I would have to pay more in parts than the logic analyzer was worth. Also, the state of that machine suggests it was used as a parts machine: the parts that remain in it are quite likely to be broken too, and to need replacement as well.

So I kept searching, and eventually I found a Philips PM3580. This one comes with 64 channels and goes up to 50MHz (I think that's more than I'll ever need). Of course you get the downsides of legacy hardware: this thing is big and heavy, includes its own CRT display, and is not so easily interfaced with a modern computer. But you also get a nice purpose-built user interface with a dedicated control panel, from back when people still knew how to do that.

I imported it from the US, but thanks to eBay international shipping service, I didn't have to pay any taxes and the shipping was not too costly (the total was under 400€). It also came very complete, with pod cables, user manual, and system floppy.

Unfortunately, that floppy was not working anymore. I could easily find a floppy image of the system software; however, I do not have much hardware with floppy drives at home to write it! Eventually I figured out how to do it from a Sun Ultra 5 running Solaris 8. I also prepared a few other disks I could find online, including the mouse setup utility (allowing the use of a serial port mouse), the configuration utility disk (for setting the system date, formatting disks, copying files, etc.) and various disassembler software plug-ins.

This analyzer seems less well known than its HP and Tektronix counterparts. Philips made a few other models starting at least in 1985, and this one is from 1992, I think. Anyway, the software for it runs on DOS (at least the versions I could find; supposedly there is also software for UNIX, but that would likely not run so well on a current Linux either).

I first turned my attention to the software used to generate custom disassemblers. This tells the logic analyzer how to parse bit patterns and decode them to show, for example, CPU instructions. It allows tracing CPU execution and doing some high level debugging (that is, high level compared to looking directly at the timing capture).

I started disassembling the software using IDA (version 5 free edition, as the newer free versions don't include support for DOS executables anymore). I first tried using reko, but it is not quite ready for this yet. After a long time staring at the code and identifying and reimplementing it bit by bit, I ended up with a complete reimplementation in C (this was made considerably easier by having the documentation for the compiler, as well as following the very complete error messages in all error cases).

To check my work, since I use Haiku, I ported dosemu, which allows me to run the original software and compare the output (this works so well, in fact, that I didn't really need to reimplement the thing in C in the first place). With the help of dosemu developers, I could even enable the JIT support, and some of my patches also helped improve support for dosemu on FreeBSD and Mac OS (historically it was very tied to Linux, but there is a lot of activity on dosemu version 2 currently, which fixes that).

After compiling some custom disassemblers with both my tool and the original, I wanted to compare them using biodiff, a tool that uses algorithms developed to analyze genetic mutations and variations. If you can compare a sequence of genes, surely you can also compare a sequence of bytes. Unfortunately, it depends (indirectly) on older versions of some Rust libraries that didn't yet have Haiku support. So I had to settle for vbindiff, which is a lot less smart (if one byte is inserted or deleted, it considers all the following bytes until the end of the file to be different).

I got to the point where my tool generates byte-identical output to the original for all the files I could find (mainly the examples provided with the original software, and one or two other custom disassemblers that people have shared on the internet). I even identified some bugs in the original code!

To complete this effort, I would like to try setting up a fuzzer to generate more test files, and check that my code behaves correctly also in error cases.

While studying the code, I found that the output is not just a configuration file or some data for the logic analyzer. Instead, it contains 68000 executable code, which is not entirely surprising as the hardware is built around the 68070 CPU and the same chipset as the CDi console. This means that, in theory, it's possible to also run arbitrary code on the logic analyzer to do... well, whatever I want.

I started studying the generated file, but it is not in a standard executable format (ELF, COFF, a.out or similar). It took me some time to figure out what I think is the layout of the file (which is nothing very original if you are familiar with the aforementioned file formats). There is a section list at the start, then code and data, and then a list of relocation entries (the executable will be loaded at an arbitrary address, and it needs to be patched as it contains references to itself). I was just puzzled for a while because the file header is in little endian, unexpected for a file that will be run on a 68000 family CPU.
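To make that description concrete, here is a purely hypothetical C rendering of the layout; every field name and width is a guess of mine, and only the overall order (section list, then code and data, then relocations) comes from my analysis above.

#include <stdint.h>

/* Hypothetical sketch of the file layout; every name and width here is
 * a guess. The header is little endian, even though the code targets a
 * big-endian 68000-family CPU. */
struct section_entry {
    uint32_t offset;        /* where the section sits in the file */
    uint32_t size;          /* section size in bytes */
};

struct custom_exe {
    uint16_t section_count;             /* section list at the start */
    struct section_entry sections[4];   /* arbitrary count for the sketch */
    /* ...followed by code and data, then a relocation table: a list of
     * offsets of words to patch with the actual load address, since the
     * code contains references to itself. */
};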

I tried to use Rizin and Cutter to disassemble the file, but the version in Haiku is old and seems a bit broken. Begasus (who maintains a lot of packages for Haiku) had already tried to port a newer one, but it is crashing at startup. I looked at the crash and I think it is because of using an old version of tree-sitter. However, updating that requires some fixes to Haiku support in Rustix. I did that, but now I have to wait for the fix to propagate in a Rustix release and then in tree-sitter dependencies.

Until then, I am working with IRA and Rehex (where I found that the 32-bit Windows version of Rehex didn't support 68000 disassembly, which is now fixed). It took a while to get started with the disassembly, in part because I'm not very familiar with the 68000, and in part because there aren't as many error messages and strings in the generated file. But eventually I started to identify some functions (starting with the standard C library string processing, division, modulo and the like). I can also reuse my knowledge of how the compiler generates some data structures, and match that with the code using said structures. So it will take some time, but eventually I hope to figure out the entry point, and that will allow me to write other custom tools for the logic analyzer.

With all that, I didn't even plug it into any of my electronic gadgets yet...

A deep dive into Thomson EFCIS graphic controllers

Posted by pulkomandy on Sun Jun 22 21:34:39 2025  •  Comments (0)  • 

This article was updated after feedback from Philippe Matherat, the designer of the EF9365 GPU. It previously stated that the development happened at the LIENS (that laboratory was created much later), and that the goal of the project was to emulate the Tektronix 4010 display (that was added later in the project as a way to test and validate the design).

So, this week I thought about Thomson EFCIS graphic controllers again. I know that Thomson made at least two families of controllers, the EF934x, which runs the Minitel, and the EF936x, which is designed for pixel graphics, but I did not know as much about this one. It seems a bit more of a "high end" thing, but at the same time, it is loosely similar to the display system used in Thomson TO and MO microcomputers.

I also knew that Thomson EFCIS sold the color palette circuit originally developed for the TO9 as a general color palette chip (the TO9 has it described as a custom chip in its technical manual, but then it became an off-the-shelf chip).

Anyway, I became curious about the other chips and why and how they were designed. And then I read a lot of datasheets and research papers and patents. Here is what I found.

Note: this is based on some random web searches, I wasn't there at the time (in fact I wasn't born at the time) and I could be misunderstanding or misremembering what I read. Don't trust me too much...

What are we talking about, anyways?

EFCIS is a French semiconductor company. It was created by a public research lab which turned part of its activity into, apparently, an integrated circuit design and manufacturing company. At the time, France was a bit behind the USA in terms of semiconductor manufacturing. The government helped solve this by injecting money into the sector, but there were also various mergers of companies to join forces, and partnerships with US companies to first resell their chips, and then manufacture them locally.

In the case of EFCIS, it would merge with several other companies to become Thomson Semiconducteurs, which in turn merged with Italian SGS to become SGS-Thomson, later shortened to ST Microelectronics. That company is still around today.

But our story does not start there. Instead, we have to dig into what was happening in the mid-70s at the École Normale Supérieure (ENS). At that time, the school did not have its own computer; instead it had a "computation center" that was mainly a terminal connected to a computer at the Orsay faculty. Students had access to that system and were also working on electronics projects, such as a custom 12 bit computer.

The terminals at the time were either teletypes, or early CRT terminals. Either of these options would cost quite a lot of money. Not so much when compared to a full-size computer, but with the development of minicomputers and later of microprocessors, the terminal was becoming a quite costly part of the system.

And that's just for text. If you wanted to display graphics, you would have to go for something like the Tektronix 4010.

These terminals are somewhat complex and costly hardware. The 4010 is a strange device by today's standard, it does not use RAM to store display data, instead, the cathode ray tube itself keeps the pixels "on". This makes the system cheaper, but has the major downside that you can only erase the whole screen, not just subsets of it, and erasing takes a second or so. As a result, this is not giving a great user experience for interactive graphics.

So, the people at ENS have an idea: what if we could use a TV and some RAM to make a cheaper terminal? It would allow interactivity, and also maybe color.

Their first try was a text terminal. This was reasonable to do: with 1Kx8bit of RAM and a character ROM, it is possible to display a 64x16 character grid on a TV. This allowed a first go at generating the display timings, clock, and pixel shifting.

There was already prior art in doing this; for example, Don Lancaster's TV Typewriter from 1973 is based on very similar concepts (at only half the resolution, 32x16 characters). That TV Typewriter allowed home computers to get some kind of cheap terminal system, and the video output in early home computers like the Apple I and the SOL-20 was based on this design.

Anyway, with the goal of lowering the cost, one solution is to reduce the number of chips by using an LSI (large scale integration) chip. This would result in the SF.F 96364, with a patent application in early 1977 (FR 77.0524, also known as FR2382049, and its equivalent US patent US4328557A, attributed to Jean Gastinel), I think just a little before similar chips like the MC6845 were announced (but it's hard to find exact dates).

Of course, there would be no point in manufacturing an LSI chip just for use in the research lab, and so SESCOSEM, the chip manufacturer, also introduces it for sale to the general public. I have not introduced SESCOSEM yet, but I don't have much info on it; it seems it merges with EFCIS just a few months after releasing this chip, and the chip is then rebranded as the EF9364.

It's also worth mentioning that SESCOSEM was also a second source for Motorola products. Well, at least that's how they advertised it; it looks like initially, the chips were manufactured by Motorola but rebadged with SESCOSEM part numbers. This means you could get both the 6845 and the 96364 from the same company.

However, the two chips, while similar in the way they drive the CRT, differ in the way they are interfaced on the other side. The 6845 is designed for use with a microprocessor (initially a 6800, but people figured out how to make it work with other CPU families as well) and can be configured into many video modes. The 96364, on the other hand, is designed to interface directly with a UART chip. It is less flexible, with a fixed resolution of 64x16 characters, but unlike the 6845 it manages memory writes and cursor movement itself, with special commands for moving the cursor around. This makes it possible to build a serial terminal ("glass teletype") from a TV set, this chip, a UART, a character generator and 1Kx7 of RAM, plus a bit of glue logic.

So, we now have a chip that makes it easy and affordable for DIY computer builders to assemble a text terminal for their computers. This is taken up by several electronics magazines and kit distributors, one of the well-known ones being the Elektor Elekterminal. Suddenly it is possible to build a terminal that does not require hundreds of TTL chips or a microcontroller, and does not cost more than the computer it is attached to.

Into graphics controllers

While this text terminal is in development, the cost of DRAM chips continues to decrease, and larger capacity chips become available. With 4K chips, it becomes possible to build a 512x512 pixels, black and white bitmap display with 'only' 64 RAM chips. This opens a new world of possibilities, such as working on a CAD application for electronics, replacing the long and tedious process of hand drawing transistors for computer chips.

As the SFF96364 is finally made available to the general public in mid-78, with terminals and VDU cards starting to appear as kits or prebuilt at the end of that year, the design for the next system is published, as four patents and a paper at SIGGRAPH 78 from Philippe Matherat. Here my research was made quite a bit easier because he links to all of this from his personal website :).

So, this next design retains the basic idea of designing a chip for use in a terminal. However, at this time, and for a more complex graphics terminal, it becomes acceptable to include a microcontroller, or maybe even a microprocessor, and turn the terminal into a kind of simple workstation. This design aims to once again reuse a standard TV receiver as the display, but to support Tektronix's PLOT10 API or even the Tektronix serial line protocol. This way, the existing applications will run on the new, cheap terminal as a starting point, and then the terminal can be extended with color display, faster refresh times, etc.

At the time, the main constraint is not so much the memory size: 16K DRAMs are now available at an acceptable price for this application. The main problem is rather the memory speed. It is not possible to output single bits from a single DRAM chip fast enough to match the needed pixel clock (the project aims for a 512x512 interlaced display window). The solution is to put several DRAM chips in parallel (as you do when building a computer, to address them in bytes), and then use a shift register to send the bits to the display circuit. Essentially, this removes the character ROM from the previous text based system, instead feeding the RAM output directly into the already existing shift register.

This way, the shift register is the only thing that needs to be really fast, the RAM will only need to fetch one 8-bit word at a time.

The second problem is fast vector drawing. Microprocessors at the time could not be programmed to do this fast enough to keep up with even the Tektronix terminals. Part of the reason for this is that the RAM is completely busy with sending pixels to the display in this design, probably to keep it simpler. Other machines at the time (such as the Apple ][) are figuring out how to intermix accesses to the RAM for the display and CPU, but here a different direction is taken. The CPU will have only limited access (in fact, it's possible to have no access at all) and instead the hardware will provide high level drawing commands, with the goal of drawing one pixel per clock cycle (during borders and blanking, when the memory is not busy with either display or refresh cycles).

The commands start with vector operations. The display controller keeps track of X and Y pointers and implements Bresenham's algorithm in hardware for drawing straight lines. Additionally, it can also draw characters (from the ASCII table), with optional scaling, rotation, and slanting.
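For reference, this is the classic Bresenham stepping loop, sketched in C; the chip wires up the equivalent increment-and-error logic in silicon, which is how it can plot one pixel per clock cycle. The plot() function is a stand-in for the chip's RAM write.

#include <stdlib.h>

extern void plot(int x, int y);  /* stand-in for the pixel write */

/* Integer-only Bresenham line, working in all octants. */
void line(int x0, int y0, int x1, int y1)
{
    int dx = abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
    int dy = -abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
    int err = dx + dy;

    for (;;) {
        plot(x0, y0);
        if (x0 == x1 && y0 == y1)
            break;
        int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }  /* step in X */
        if (e2 <= dx) { err += dx; y0 += sy; }  /* step in Y */
    }
}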

The chip internally manages a 4096x4096 coordinate space, of which only a part is visible on screen. This allows line vectors to go out of the screen and the drawing to continue correctly with the next vectors. The parts that would be outside the screen are clipped cleanly and the drawing is not distorted.

Access to the memory is done in parallel (all 8 chips) when reading, but writes are done one bit at a time, by enabling the RAS pin for just one of the chips.

The SFF96364 could fit in a 28-pin DIP package, but the new EF9365 needs a 40-pin package, and even that is a bit too limiting, so a few extra chips are needed: a demultiplexer to select the RAM chips, a shift register, and also a PROM to derive some signals from the clock. The EF9366 is a simplified version that can only draw a 256x256 display, but requires a little less support logic. Both versions of the chip include a lightpen controller as well.

Color graphics are possible by adding parallel banks of memory chips, typically one for each of red, green, and blue. Then, the CPU can program an external register to store a 3 bit color, that is loaded to the RAMs when a pixel needs to be plotted.

In one of his later articles, Philippe Matherat explains why this approach turned out not to be such a great idea: it was done with the goal of working similarly to the Tektronix displays available at the time, and it did a decent job at that. But, a few years later, bitmap displays would become more common, and faster CPUs such as the 68000 would be available, which would do a faster job at drawing lines and text than this chip could, and also allowed more advanced operations (scrolling, overlapping windows, etc).

There were attempts to work around these limitations, for example by adding extra operators on the data path between the video controller and the RAM for write operations. That would allow operations like XOR with a cursor, intersection with existing drawings, etc. However, adding more complex drawing primitives requires a larger redesign of the chip; more on that later.

Initially the chip finds some use, both in DIY videocard projects (again with Elektor offering one) and in scientific hardware, where it makes it possible to show realtime, color displays of measurements (for example, I read about its use to display spectrograms). While the initial design was ready by May of 1978 (as you can see in the patents and papers published then), the chip would be available to customers only in late 1979 or early 1980. By that time, computers and workstations with raster graphics were quickly catching up, which means the chip did not really get a chance to be used.

Meanwhile, in Brittany, ...

Meanwhile, other innovations are going on in France. There are experiments to replace the phonebook with an electronic device, basically a terminal that would be plugged into a phone line and allow sending and receiving data. Experiments on this start at about the same time, in 1978. Of course, the plan is to replace the phonebook, which means the terminals will have to be produced in large quantities, and be cheap. This justifies the use of an LSI chip again.

This time, there is no room for complex graphics: in a desire to keep costs low, there are big constraints on the amount of RAM and ROM in the terminal. And so, it's back to the simple character based system. But maybe not quite as simple as the EF9364. The people working on this first experiment with hardware and protocols based on ANTIOPE, the early Teletext system in France. They also want to make the terminal user friendly, and for this they need various character attributes, maybe colors (eventually it will be 8 greyscale levels), and double-size characters. Hardware vertical scrolling ("roll") also seems useful to render long pages of text at an acceptable speed.

These requirements will lead to a new design, initially as a set of two chips based on the earlier combination of a text mode controller with a character generator. Eventually, both chips become a lot more complex than initially planned, and, as seems usual for EFCIS, they are not reserved for the Minitel but added to the general catalog of available ICs. These are the EF9340 and EF9341, also known as VIN and GEN, introduced in 1980. They will also find a use in the Philips Videopac+ consoles, where their video overlay ("incrustation") support is used to mix their video signal with the existing one from the Intel 8244 of the older generation Videopac.

The Minitel hardware will be revised several times over the years, to introduce new features (such as a dialer and local phonebook, and a generic ASCII 80 column terminal mode). This leads to new generations of video controllers: the EF9345 (introduced in 1983, also used in the Matra Alice computers as well as the Philips VG5000) and the TS9347 (introduced in 1988; with EFCIS now renamed Thomson Semiconducteurs, this one gets a new manufacturer prefix).

These chips are quite far from the initial "dumb terminal" usage. They can redefine some custom characters in RAM, have a "compression" system where a short byte sequence can expand into a longer string stored in memory, etc. They also provide a rather large 8x10 font, fitting 40x25 characters in a 320x256 video output.

Ok, back to the pixel based ones

The EF936x family also gets an upgrade in the form of the EF9367. This chip allows doubling the horizontal resolution (to 1024x512 pixels), but it does so in a not so convenient way. It looks like this was first done as an extension to an EF9365 videocard, and then pushed back into the video chip. It also changes the video timing: the initial timing for the 9365/66 resulted in a square, 256x256 or 512x512 window at the center of the display. But, this results in a lot of wasted space to the left and right of that window. The new chip changes the timing, it now operates on a slower clock but keeps the same horizontal frequency for the display. As a result, in 256x256 or 512x512 interlaced modes, the pixels will be slightly wider than high, matching the 4:3 ratio of the display, and at 512x256 or 1024x512 interlaced modes, they will be slightly higher than wide, with a ratio around 2:3, still much better than the 1:2 you would get in the same conditions with the original chip.

Another new feature is the possibility of a 60Hz mode, reducing the video resolution. The datasheet and appnotes also add a few extra tricks to the circuitry, showing how to implement vertical scrolling (useful on the 60Hz version because it reduces the vertical resolution, and so some of the pixels stored in RAM are actually not visible).

Finally, one last iteration is the TS68483. This is a much extended design, made to interface with a 68000 CPU and its 16 bit databus (but 8 bit access is also still possible). It adds more complex drawing primitives for curves, arcs, and so on. It also uses a larger 64-pin DIP package, which allows more of the logic to be integrated in the chip; for example, external shift registers are not needed. It can also produce even higher video resolutions. But, at this point, the design with memory completely isolated from the CPU is not appropriate anymore. Graphics terminals are a thing of the past, and other computer systems have shown that it is possible to get quite fast drawing in software, or, if there is anything to accelerate, it's rather moving blocks of data around, with something like a blitter.

Anyway, the EF936x series remains an ahead-of-its-time graphics accelerator. It would take a few more years to see 2D acceleration implemented in hardware, and 3D acceleration would follow.

But what if 8 colors is not enough?

In the EF936x documentation, it is suggested to use 3 bitplanes for an RGB image. This is similar to what was used in the Thomson TO7, but it felt quite limiting. Later Thomson microcomputer models added a 4th bit of color, either darkening or brightening the other colors. This was implemented by feeding the 4 color bits from the video generator into a lookup ROM, and then a resistor network. Of course, this can be added to an EF936x system.

But Thomson computers did not stop there. The next step, introduced first in the Thomson TO9, was to use a dedicated palette chip. This would have 16 color registers, each storing a 12 bit color (4 bits each for R, G, and B), and replace the ROM used in the previous models. The color palette then becomes fully programmable. Typically, you can switch to 16 grey levels, or mix 8 basic colors and 8 greyscales.

Once again, EFCIS is the company manufacturing that chip, first as a custom chip for the TO9, but later on added to their general catalog as the EF9369.

There is also a later revision, the TS9370, which can handle higher pixel clocks and removes the need for a separate video buffer/amplifier, further simplifying the design. I don't know if there ever was any graphics hardware combining the color palette chip with the graphics processors (or even the text mode one). These products were manufactured by the same company, but for different projects, and it's unclear whether the fact that they fit together quite well is intentional, or just a happy accident.

There is also a TS68494 offering a 256 color palette (out of 4096 colors).

Garbage collectors and how I stopped reinventing the wheel

Posted by pulkomandy on Sun Mar 30 12:02:59 2025  •  Comments (0)  • 

In last month's article I wrote about how I was thinking of a non-incremental but still bank based garbage collector. Since then (and as that article's title said), I continued thinking about it.

The conclusion of that thinking is that my approach is not going to work fast enough. This kind of makes sense: I was trying to go for more efficient memory use (using more than half the memory at a time), but doing that inevitably comes with more complex and less efficient code. And my experiments showed that I do need the code to be fast and also reasonably small, rather than making it memory efficient. The 512K heap is quite large, and "sacrificing" half of it is not that bad.

So I went ahead and reimplemented the Baker two-space garbage collector, as originally used in Little Smalltalk, but in a modified version that works with two groups of 16 banks. There is a big difference that costs me some performance, however: in Little Smalltalk, most of the objects loaded from the initial image are not in fact allocated in the heap, but live in a separate space that's never garbage collected (it's cleaned only when writing back an image to disk). I made something somewhat similar by reserving one bank for that, except it's too small to contain the initial image (that would need three and a half banks). As a result, I have two problems: most of the image gets moved around between the two spaces at each GC cycle, and the remaining part of it still needs to be scanned as it may contain references to things that need to be moved.

Despite that, the results are very good! After a few other small optimizations, I got the GC to run only once for a computation of "4+3", and only for 7 seconds or so. The total computation time for this test is now just under a minute, instead of two and a half minutes before. So, that's 2.5x faster, and now most of the time is spent in the interpreter.

Lesson learned: don't try to reinvent the wheel. Especially if you don't understand very well how wheels work.

This works so well because the Smalltalk interpreter generates a lot of garbage. As a result, when a GC cycle happens, it turns out that most of the objects do not need to be moved. And the Baker GC's complexity is measured only in terms of the "live" objects that get moved; the dead/unreachable ones are completely ignored. So they work quite well together.
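As a rough illustration (in C, not the actual z80 code, and ignoring the bank switching that the real implementation has to do), the core of a Baker-style copy step looks something like this: each live object is moved to to-space once, and its old header is overwritten with a forwarding pointer so later references find the new copy. Garbage is never touched at all.

#include <stdint.h>
#include <string.h>

#define FORWARDED 0x01

struct object {
    uint8_t flags;
    uint16_t size;            /* total object size in bytes */
    struct object *forward;   /* valid when FORWARDED is set */
};

static uint8_t *to_space_free;  /* allocation pointer in to-space */

struct object *gc_copy(struct object *obj)
{
    if (obj->flags & FORWARDED)
        return obj->forward;          /* already moved, follow pointer */

    struct object *copy = (struct object *)to_space_free;
    memcpy(copy, obj, obj->size);     /* move the live object */
    to_space_free += obj->size;

    obj->flags |= FORWARDED;          /* leave a forwarding pointer */
    obj->forward = copy;
    return copy;
}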

I have more ideas to improve on this. For example, now that my ROM code has gotten smaller, I have more than 1K of ROM space free. I could surely move some of the core Smalltalk objects there, making them read-only and making sure they don't reference anything in RAM. Then, these objects can be completely ignored by the GC: no need to scan them, no need to update references to them. The image loading process could also be made a bit simpler: at the moment it's a recursive function that needs a lot of stack space, which ends up eating precious main memory. It could be simpler if the core objects were already in place, because they are the difficult part, with a lot of circular references. With that already done, it may become much easier to load the other objects in order.

Anyway, since the GC is not the main contributor to execution time, let's leave that aside for now, and focus instead on the interpreter. I think acceptable performance for computing "4+3" would be maybe under 1 second. That means we still need to make things a further 60 times faster than they currently are. That will surely need some deep changes.

The Little Smalltalk interpreter is very "by the book" in the sense that it allocates Smalltalk execution contexts, stacks, and other things as normal Smalltalk objects. This works fine for the original Little Smalltalk, but not so well with my banked memory approach, as accessing each of these objects requires a bankswitch.

So my next plan is to move some of that to main (non-banked) memory. Then, the interpreter can access it without bank switches. This means the interpreter code can be more "normal" z80 code, that is not constantly interrupted by bank memory register writes.

A quick examination of the interpreter code reveals that the most accessed object seems to be the current method execution stack (unsurprisingly). So, let's consider how we may move that to main memory.

In Little Smalltalk, the stack is a typical Smalltalk object, it is an array allocated with a size indicated in the method object when preparing a method execution. It is freed by the normal garbage collection cycle. In theory, Smalltalk code can very well grab a reference to the stack to analyze it (this is how you'd write a debugger that steps through the code, probably). And it needs to be released as normal in a GC cycle, when no references to it exist anymore. In most cases the only reference would be the method context, but I can't be sure that this is always the case for now.

The space in main memory is relatively limited. I may have up to 16K of it if I manage to reorganize things efficiently, but probably a bit less than that (and it's a good idea to also keep some RAM for other future uses anyway). This means setting up a two-space solution is not going to work well here. Some other strategy will be needed.

In any case, the stacks can contain references to other objects, which means the GC will have to know about them for two reasons: to scan them for references, and to free them when not needed anymore.

I also know that this will be a relatively "hot" memory area, as there are a lot of allocations of stacks and contexts during the execution of Smalltalk code. In fact, it's quite likely that this space will run out much before we need to do a full GC cycle.

I can think of many ways to handle this. For example, this space could be handled as a simple cache, where frequently needed objects are simply copied or temporarily moved from their normal place in the heap. This would be similar to the "working set" (objects stored in RAM, as opposed to disk) in the Alto Smalltalk. But moving an object isn't that cheap in our case: the Alto used an object indirection table, which means it had a single place to update to redirect an object to RAM or disk.

So maybe let's try something similar to that: when the object is moved to main RAM, its original place in the heap is replaced with a redirection marker, that says "this object is not actually here". That would work, but now all code that references an object has to handle this special case, and so our speedup of the stack access by the interpreter comes with the cost of everything else having to check if an object has been moved to main RAM in this way.
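In code, that cost would look something like the following sketch (names and layout are made up for illustration): every dereference first has to check whether the object has been moved.

#include <stdint.h>

#define MOVED_TO_MAIN_RAM 0x02

/* Hypothetical object header with a redirection marker. */
struct obj_header {
    uint8_t flags;
    void *redirect;             /* new location in main RAM, if moved */
};

static void *resolve(struct obj_header *o)
{
    if (o->flags & MOVED_TO_MAIN_RAM)
        return o->redirect;     /* follow the redirection marker */
    return o;                   /* the object really is here */
}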

Another approach is to allocate the object directly in the main RAM space, without having to mirror it in the actual heap. This seems simpler for the allocation: it's just a special case of allocation, without bank management and at a special address. The object references can then be used normally, but also special-cased by the interpreter code that needs specific objects to be in main memory. In fact, it may be OK to move objects to main memory at the last minute if they were not created there. Certainly slower (since we have to look for all references to an object when moving it), but if we only have to do it in some very special cases, it should not have cost repercussions otherwise.

The main problem, then, is to decide when to move things out of that main memory space. Ideally, that would be never. There is a pretty good chance that when the time comes to make some free space in there, a lot of the objects using it up are in fact already dead/unreferenced. But how can we be sure of that? The obvious solution would be to run a GC cycle, that will automatically move such objects into banked heap space. But that brings us back to a world of very frequent GC cycles, which we don't want.

We are essentially back to the problem of implementing an efficient incremental garbage collector here. Maybe we can benefit from the fact that this is a special tool used only for execution stacks. There may be a way to reference count these, or explicitly free them at the end of a method execution unless any reference to them was established besides the method context.

I had already written down some notes yesterday evening, and went to sleep convinced that I had a workable solution. But then I decided to write an article about it, which helped me see how it doesn't quite work. I guess I will keep thinking about it over the next few days until I find a way that can work...

Leaving Twitter

Posted by pulkomandy on Wed Aug 30 18:25:20 2023  •  Comments (0)  • 

That's it. I'm off Twitter now.

I used it for 5 years. I now have archives of everything I posted there.

Now you can find me on Mastodon instead. See you there!

Website mirroring

Posted by pulkomandy on Sat Jan 1 15:19:32 2022  •  Comments (0)  • 

Please do not mirror my whole website using wget mirroring mode, WebReaper, or other similar tools. It results in a lot of requests to my poor old server, too much memory and CPU use, and the server fan spins up and makes a lot of annoying noise. So I will ban your IP or IP range from accessing the website if you try this.

If you want a full copy of some parts of the website, please contact me so we can set up a better way to share the data (I can set up an ftp or sftp or rsync thing, for example).

Hello World

Posted by pulkomandy on Sun Feb 18 16:30:21 2018  •  Comments (4)  • 

This is the new server of the PulkoTeam! Enjoy your stay here.

On this website you can find various stuff I work on, around the Haiku project, electronics, 8bit computers, and random things depending on the mood of the day.