Submission #39 for LIVE 2024

Diff-based interactive compiler debugging and testing

Abstract

Debugging and testing compilers has long been a strenuous task. Developers need to create a large number of test cases, invoke the compiler on them, and verify the results against the expected output. Since compilers have many stages, developers may find it difficult to observe the behavior of a particular stage, let alone debug it. Moreover, introducing new features to the compiler may affect existing components in subtle ways that traditional tests cannot reveal.

We propose a new perspective on incorporating interactivity into tests. Each test file is divided into smaller blocks by empty lines. The test program sequentially executes each block and appends the inferred types, evaluation results, diagnostic messages, and debugging information as comments to each block. Because each block inherits the context of the previous block, developers can incrementally compose test files in an interactive manner. Additionally, developers can selectively inspect debugging information from different stages by prepending flags to each test block. By monitoring file changes, the test program reruns modified tests and updates the comments in situ. These updates are displayed in editors in real time, which makes composing tests as smooth as using an interactive notebook.

These comments are committed to the version control system and can thus track changes made to the compiler. Because this method relies on the diff operation of version control systems, we name it “diff-based testing.” We have implemented this testing system in the MLscript compiler and developed an interactive editor plugin for exploring debugging information.


Background

Compilers are highly complex software: they usually have many stages, such as lexing, parsing, type checking, code generation, linking, and optimization, in a single pipeline. What makes testing compilers even more difficult is that developers need to write source programs that are then fed into a long process of transformation and lowering. Traditional compiler tests mostly consist of a set of source programs and their expected outputs.

For example, the Rust compiler has a set of UI tests. The following is one such test. The source file is tests/ui/fn/bad-main.rs, which contains an ill-formed main function. The line where the error occurs is annotated with //~ ERROR, which lets the test suite know that this line is expected to produce an error.

fn main(x: isize) { } //~ ERROR: `main` function has wrong type [E0580]

Below is another file tests/ui/fn/bad-main.stderr, which documents the expected output from stderr.

error[E0580]: `main` function has wrong type
  --> $DIR/bad-main.rs:1:1
   |
LL | fn main(x: isize) { }
   | ^^^^^^^^^^^^^^^^^ incorrect number of function parameters
   |
   = note: expected signature `fn()`
              found signature `fn(isize)`

error: aborting due to 1 previous error

For more information about this error, try `rustc --explain E0580`.

The Rust test suite runs the Rust compiler on the source file and compares the output with the expected output. If there is a mismatch, the test case fails.

This kind of test is also prevalent in compilers of other programming languages. For example, Scala 3 has a large number of similar tests. For established programming languages, there are usually existing test suites for validating compilers. For instance, the Plum Hall Validation Suite provides source code test cases for comparing compiler behavior against the ANSI/ISO C++ Standard, and Test262 is a test suite for the ECMAScript (JavaScript) language specification. Developers of compilers for such languages can therefore verify their compilers by running those test cases and comparing the results with the expected output.

However, for emerging programming languages, developers have to create such test cases from scratch. This is not a developer-friendly task, for the following reasons.

  1. Restrictions in file format. A test file can only represent a single test case, which is either expected to succeed or expected to fail. However, we may want to perform multiple checks in the same program context. In such cases, we need to copy and modify test cases, which can result in a large number of test files.
  2. Lack of live feedback. A test file is usually not written in one go. Developers need to go through a write-run-examine cycle, during which they have to switch between editors and command-line tools to view the results.
  3. Only stdout and stderr are checked. These compiler tests only record and compare the output from stdout and stderr. Since the compiler has many stages, we cannot check and compare the intermediate results of a specific stage. This makes the tests completely black-box.

We propose diff-based testing. Although it is still a plain-text-based testing method, we argue that it addresses the aforementioned shortcomings. The name comes from the fact that it makes use of the diff operation of version control systems (e.g., git).

Diff-based testing, explained

In this section, we explain the format of diff-based testing and illustrate how the test program works through a series of key points.

Test files are made of test blocks

Each test file is divided into smaller test blocks separated by empty lines. For example, the following test file has three test blocks.

The first test block defines two functions sum and abs. The second test block defines a recursive function fact. The third test block contains an expression that calls three functions.
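A minimal sketch of what such a test file could look like, assuming MLscript-like surface syntax (the exact definitions and syntax of the original example may differ):

fun sum(a, b) = a + b
fun abs(x) = if x < 0 then 0 - x else x

fun fact(n) = if n <= 1 then 1 else n * fact(n - 1)

abs(sum(fact(3), 0 - 10))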

Test blocks are processed sequentially

The test program instantiates a compiler instance for each test file. It reads the test file and invokes the compiler on each test block sequentially. The compiler parses the test block, type-checks the parsed abstract syntax tree, generates executable code, and finally evaluates the code in a virtual machine.

These results are written back as multiple single-line comments starting with //│ at the end of the block. As a result, the first test block is followed by comments like the following.
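Taking the sketch above, the first test block might end up annotated roughly as follows (the //│ prefix is how the output comments appear, while the inferred types shown here are illustrative):

fun sum(a, b) = a + b
fun abs(x) = if x < 0 then 0 - x else x
//│ fun sum: (Int, Int) -> Int
//│ fun abs: Int -> Int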

It is worth noting that test blocks in the same test file are not independent: the compiler retains the context obtained from the first test block when processing subsequent ones. After the test program has processed all the remaining test blocks, the test file is updated with their inferred types and evaluation results.

Diagnostic messages are also written back

In the example above, all three test blocks are compiled successfully. If the compiler encounters an error, the test block fails, and the error messages are also written back to the test file as comments. For example, we get an error if we append a test block like the following to the test file above.
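A sketch of a failing block and its written-back diagnostics, assuming MLscript-style error messages (the name oops, the line number, and the exact message layout are illustrative):

oops(1)
//│ ╔══[ERROR] identifier not found: oops
//│ ║  l.12: oops(1)
//│ ╙──      ^^^^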

How does the test program work?

The test program of diff-based testing is still built on a traditional unit test suite: each test file corresponds to a test suite, and each test block corresponds to a test case. Therefore, developers can still get an overview of the test results on the command line. This also makes diff-based testing fit into CI (continuous integration) pipelines.

The following is the output from the test program for the erroneous test block above. Note that in the implementation, we have refined the error reporting so that the specific line is indicated when an error occurs.

[info] DiffTests:
[info] - Repro *** FAILED ***
[info]   Unexpected diagnostics (or lack thereof) at: 
[info]          Repro.mls:12 (DiffTests.scala:1106)
[info] Run completed in 374 milliseconds.
[info] Total number of tests run: 2
[info] Suites: completed 2, aborted 0
[info] Tests: succeeded 0, failed 2, canceled 0, ignored 0, pending 0
[info] *** 1 TESTS FAILED ***
[error] Failed tests:
[error]         mlscript.DiffTests

Output comments are overwritten

One might wonder whether running the test program several times would accumulate a large number of output comments in the test file. The answer is no: because the output comments start with //│, they can be easily distinguished from comments written by the user.

The test program ignores such comments when reading the test file; in other words, they are invisible to it. Therefore, a test file remains unchanged even after the test program has run on it twice.

The diff operation can detect changes in the compiler

The diff operation of version control systems plays an important role in diff-based testing. If a component of the compiler changes, it may affect the programs' output and thereby change the test files. Because the output comments are committed to the version control system, any behavioral change to the compiler becomes visible in the diff.

At the same time, developers are also responsible for any modifications to output comments: they should ensure the output comments are reasonable.

Record debugging traces from intermediate stages

Sometimes error messages are superficial, and we still do not know what happens at a certain stage. If we suspect that the compiler's behavior is incorrect and want to inspect the debugging information of a specific part, we can make use of debug flags.

Debug flags are short single-line directives starting with :. For example, :p shows the following test block's parsing information, including the tokens, the parsed abstract syntax tree, and the serialized abstract syntax tree.
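A rough sketch of what a :p block could produce; the exact shape of the token stream and the serialized tree shown in the comments is illustrative:

:p
fact(3)
//│ |fact|(|3|)|
//│ Parsed: fact(3)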

Note that :p does not affect subsequent test blocks. We call debug flags that only take effect on the test block immediately after them local debug flags. Another example of a local debug flag is :ShowRepl, which displays the REPL interaction history between the test program and the JavaScript interpreter.

This debug flag is very convenient when we want to know whether the code generator emits ill-formed JavaScript code.

Besides, some debug flags are global: they take effect until the end of the test file. For instance, the :NoJS flag disables JavaScript code generation for all remaining test blocks. We often use this flag in test files made purely for type checking.
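For example, once :NoJS appears, the remaining blocks only report inferred types and no evaluation results; the block below is a sketch with illustrative type syntax:

:NoJS

fun double(x) = x * 2
//│ fun double: Int -> Int

double(21)
//│ Int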

Document errors and warnings with debug flags

Sometimes we want to document examples that are bound to go wrong. In this case, the :e flag can be used: it indicates that the test block following it must produce an error. Similar flags mark blocks that are expected to produce warnings.
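For example, marking the failing block from earlier with :e turns it into a passing, documented negative test (again with an illustrative error layout):

:e
oops(1)
//│ ╔══[ERROR] identifier not found: oops
//│ ║  l.20: oops(1)
//│ ╙──      ^^^^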

Each test block can have multiple debug flags to document various behaviors of test blocks.

Moreover, if a test block fails because some part of the compiler is not working properly or has not been implemented yet, we can add // TODO or // FIXME comments to the test block. Although they do not start with :, they are also treated as debug flags, and they can be highlighted and located by code editors.

Composing test files with watch mode

The test program can be run in watch mode, in which it monitors changes to the test files. When a test file is modified, the test program re-runs the compiler on it. We argue that this is a more interactive way of testing compilers and greatly improves the developer experience.

This makes composing test files feel similar to using a REPL (read-eval-print loop). Developers can continuously write tests, and when they save the test file, they immediately see the results. We provide a further discussion of this in the following sections.

Why is diff-based testing useful?

Developers have the responsibility to confirm that the output comments are correct before committing them to the version control system. Using these human-verified outputs as the expected output removes the hassle of constructing expected outputs manually.

By deliberately committing the output comments with certain debug flags enabled, test files can track intermediate results from specific stages of the compiler. When developers update or extend the compiler in the future, the diff tool of the version control system indicates which test files have changed, helping to prevent regressions.

However, a careless developer casually adding new features without running the diff-based tests would be a disaster for the project. To ensure that developers actually examine the diff-based test results, we also added diff-based testing to the project's CI on GitHub: a PR can only be merged if running the test suite once leaves no files in the project changed.

Enhanced experience with folding and filtering UI

In the common usage scenario for DiffTests, developers edit test files in a text editor, and after saving, the editor automatically reloads the files updated by DiffTests. However, this workflow has some inconveniences.

When the code in a block is complex, the debugging information can be particularly long. For example, :d shows the debugging traces of type checking, and type checking a simple fact function produces 165 lines of debugging traces!

Debug flags cannot control the level of detail (LoD) of the debug traces. One workaround is to provide parameters for debug flags to filter the traces, but this approach has its own downsides.

Because debug traces are structured according to function calls, we naturally expect to explore them interactively, for example by folding uninteresting calls and filtering for the relevant ones.

We hope such tools integrate naturally with the text editor (just like the Copilot pop-up menu), so that people will not find them obtrusive, think they look complicated, or feel they are difficult to learn.

Therefore, we designed a UI for filtering and searching debug traces and implemented it as an extension for Visual Studio Code.

Demonstration

The video above is a demonstration of the filtering and searching UI, which is located in a side panel in Visual Studio Code. Users can type in the search bar to highlight matching debug traces. Compared to plain-text debugging traces, the UI provides a more interactive and user-friendly experience.

Implementation

We implement diff-based testing, which we call DiffTests, in the compiler of a new programming language called MLscript. The examples in this essay are all written in MLscript, and the outputs are copied from actual test files and console outputs.

The filtering and searching UI is implemented as an extension for Visual Studio Code. We extend DiffTests so that it not only writes comments to test files but also dumps debugging traces to temporary files. The extension reads the debugging traces and displays them in a tree view. The tree view is interactive: it can be filtered, folded, and navigated.

Comparison

Unit tests

Unit tests are a form of testing that isolates a specific component of the program and verifies its behavior. However, unit tests for compiler stages are not very useful because:

  1. The inputs and outputs are often syntax trees, which are long and difficult to write.
  2. Many internal components depend on specific contexts and parameters, and manually constructing them is error-prone and tedious.

Therefore, the fixtures hand-crafted for unit tests are not as reliable as data obtained by running real programs. Diff-based testing can display the structured debugging information produced at each stage through debug flags, and verifying this information improves developers' efficiency.

Moreover, what makes diff-based testing more useful is that the output comments are committed to the version control system. As a result, diff-based testing can subsume characterization tests and regression tests.

Unified test format

The unified test format in Mercurial’s test framework combines shell command inputs and expected outputs into a single file with a .t extension. Input lines start with two spaces followed by $, and the indented lines below them show the expected command output. Lines not starting with two spaces are comments.

For example, the following is a test file for the sort command. It first writes some lines to a file and then sorts the file. The test program runs the commands and compares the output with the expected output.

Dump some lines into a file

  $ cat > colors << HERE
  > red
  > yellow
  > green
  > HERE

sort the file and dump to stdout

  $ sort colors

During the first run, the test program reports an error because the output does not match what is documented in the test file. It also shows the diff and asks the developer to accept the changes.

--- test.t
+++ test.t.err
@@ -10,5 +10,11 @@
 sort the file and dump to stdout

   $ sort colors
+  green
+  red
+  yellow

After the user accepts the changes, the test file will be modified to the following content.

Dump some lines into a file

  $ cat > colors << HERE
  > red
  > yellow
  > green
  > HERE

sort the file and dump to stdout

  $ sort colors
  green
  red
  yellow

From this point on, running the test program will not report any errors unless the behavior of sort changes. One implementation of the unified test format is Cram, so this kind of test is also referred to as “cram” tests. The idea of the unified test format is not limited to shell commands and can be generalized to more scenarios, as we will show in the next section.

Expect tests

Expect tests are a test pattern developed by Jane Street. Developers put the [%expect {||}] instruction in the OCaml source file, and the test program figures out the expected output and fills in the blank.

Because the content of expect statements is plain text, it becomes necessary to choose a format for displaying structured data. By default, structured data are displayed as S-expressions. Jane Street also introduced alternative, more human-readable display formats (e.g., ASCII waveforms).

Snapshot testing

Jest, a JavaScript testing framework, has a feature called snapshot testing. It renders a UI component, serializes the output, and compares it with the previous snapshot. If the output does not match the snapshot, the test fails and a diff is shown.

We argue that snapshot testing is not very intuitive because the serialized snapshots are stored in a separate file with the same name as the test file but with a .snap extension. Developers need to read the diff between the snapshot and the output to understand the changes.

Snapshot testing can be made more interactive in various ways. For example, snapshots could be rendered directly in the editor, allowing for more intuitive comparisons. There exists a VSCode extension that visualizes the diff between the snapshot and the output, but it does not render faithfully because it does not have access to the stylesheets of the original component.

Discussion

Developer practices

We have introduced diff-based testing to many people involved in our compiler development. They only need to remember the following key practices to get started.

  1. Developers should check test outputs before committing them to the version control system, because committed content is considered the expected output.
  2. The diff operation notifies developers that the compiler's behavior has changed. If the changes are intended, developers should commit them; otherwise, developers should mark the affected test blocks as TODO or FIXME.
  3. Developers should add negative test cases to document errors and warnings, marking them as expected to fail. Developers should also emit structured debugging traces while implementing new features, so that the new behavior can be documented by diff-based testing.
  4. Most of the time, the test program should run in watch mode to provide live feedback, thus increasing developers’ productivity.

Evaluation as a playground

In this section, we look at diff-based testing from a perspective other than debugging: it can also be used as a programming language playground, and we evaluate its interactivity as such.

We argue that diff-based testing is more interactive than a REPL, because developers can modify previous test blocks and the test blocks after them are updated accordingly, which cannot be realized in a REPL environment. Because already-evaluated test blocks can be changed, it is also similar to notebook interfaces such as Jupyter Notebook and Observable.

The interactivity of diff-based testing can be compared to that of notebook interfaces but has not yet surpassed them. This is mainly because test blocks can only be processed sequentially, a constraint imposed by the plain-text user interface and the test file format. Some notebook interfaces support out-of-order execution, where blocks are executed in an order determined by the dependencies between cells. Moreover, notebook interfaces can use rich UI to enhance the experience of exploring execution results.

Conclusion

In this essay, we propose diff-based testing as a new way of testing compilers. We demonstrate how test files are organized and how diff-based testing operates. We also provide specific scenarios showing how people can use diff-based testing to test compilers more efficiently than with traditional methods.

We are still exploring how to enhance the developer experience of diff-based testing. The filtering and searching UI is one example. In addition, some possible directions are as follows.

  1. Syntax highlighting for output comments. The information in output comments is useful (e.g., type information), and we can apply syntax highlighting to it in the editor to make it more readable.
  2. Import and export between test files. Identical code blocks often appear across test files. We plan to support importing other test files: the test program processes the imported test files first and then processes the current one.
  3. Editor-level test block operations (with shortcuts). Editors usually provide shortcuts for line operations; for example, Alt plus the up/down arrow keys moves the currently focused line up or down. We can extend code editors to provide similar shortcuts for manipulating test blocks.