Guaranteed Guess: A Language Modeling Approach for CISC-to-RISC Transpilation with Testing Guarantees [EMNLP 2025 🔥]

Ahmed Heakl, Sarim Hashmi, Chaimaa Abi, Celine Lee, Abdulrahman Mahmoud

MBZUAI · Cornell University

🆕 Latest Updates

📢 August 2025: We're thrilled to share that GG has been accepted to EMNLP 2025! 🎊
📢 June 2025: Evaluation code for Bringup-Bench is released. Check out the eval folder!
📢 June 2025: The paper and inference code are released!

Overview

The hardware ecosystem is rapidly evolving, with increasing interest in translating low-level programs across different instruction set architectures (ISAs) in a quick, flexible, and correct way to enhance the portability and longevity of existing code. A particularly challenging class of this transpilation problem is translating between complex- (CISC) and reduced- (RISC) hardware architectures, due to fundamental differences in instruction complexity, memory models, and execution paradigms.

In this work, we introduce GG (Guaranteed Guess), an ISA-centric transpilation pipeline that combines the translation power of pre-trained large language models (LLMs) with the rigor of established software testing constructs. Our method uses an LLM to generate candidate translations from one ISA to another, then embeds those translations within a software-testing framework to build quantifiable confidence in the result. We evaluate GG on two diverse datasets, enforce high code coverage (>98%) across unit tests, and achieve functional/semantic correctness of 99% on HumanEval programs and 49% on BringUpBench programs. Further, we compare our approach to the state-of-the-art Rosetta 2 framework on Apple Silicon: our transpiled code runs 1.73× faster, is 1.47× more energy efficient, and uses 2.41× less memory, demonstrating the effectiveness of GG for real-world CISC-to-RISC translation tasks.
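The guess-then-test idea described above can be sketched in a few lines. This is a minimal illustration, not the repository's API: `guess_and_test`, `Verdict`, the `(stdin, expected stdout)` test format, and the toy `propose`/`run` stand-ins are all hypothetical names introduced here, and the acceptance rule (a candidate must pass every unit test) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Sequence, Tuple

# (stdin, expected stdout) pairs; a hypothetical unit-test format.
TestCase = Tuple[str, str]


@dataclass
class Verdict:
    candidate: Optional[str]  # best translation seen so far, or None
    passed: int               # number of unit tests it passed
    total: int                # number of unit tests executed

    @property
    def accepted(self) -> bool:
        # Assumption: a guess is only "guaranteed" if every test passes.
        return self.candidate is not None and self.passed == self.total


def guess_and_test(
    source_asm: str,
    propose: Callable[[str], Iterable[str]],  # stand-in for the LLM
    run: Callable[[str, str], str],           # stand-in for assemble + execute
    tests: Sequence[TestCase],
) -> Verdict:
    if not tests:
        raise ValueError("at least one unit test is required")
    best = Verdict(None, 0, len(tests))
    for candidate in propose(source_asm):
        passed = sum(1 for stdin, expected in tests if run(candidate, stdin) == expected)
        if best.candidate is None or passed > best.passed:
            best = Verdict(candidate, passed, len(tests))
        if best.accepted:
            break  # stop at the first candidate that passes everything
    return best


# Toy stand-ins so the sketch runs without a model or a cross-toolchain.
def toy_propose(source_asm: str) -> Iterable[str]:
    yield "candidate-that-prints-nothing"
    yield "candidate-that-echoes-uppercase"


def toy_run(candidate: str, stdin: str) -> str:
    return stdin.upper() if candidate.endswith("uppercase") else ""


verdict = guess_and_test("<x86 assembly>", toy_propose, toy_run, [("ab", "AB"), ("c", "C")])
print(verdict.accepted, verdict.passed, verdict.total)
```

In a real setting, `run` would assemble the candidate for the target ISA and execute it (natively or under an emulator), and the pass count gives the quantifiable confidence signal the paragraph refers to.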

🚀 Highlights

First CISC-to-RISC Transpiler: GG is the first CISC-to-RISC transpiler built via a custom-trained, architecture-aware LM achieving a test accuracy of 99.39% on ARMv8 and 89.93% on RISC-V64.
Testing-Driven Validation: A methodology to measure and build confidence into transpilation output via software testing approaches ("guaranteeing" the guess), including detailed analysis of correctness, errors, and hallucinations.
Hardware-Informed Design: An in-depth analysis of the inner workings of our transpiler, including the hardware-informed design decisions used to train an accurate LLM for assembly transpilation.
Real-World Case Study: GG's generated assembly achieves a 1.73× runtime speedup, 1.47× better energy efficiency, and 2.41× better memory efficiency compared to Apple Rosetta 2's x86-to-ARM virtualization engine.

Results

GG models significantly outperform all baseline models across different architectures and optimization levels. Most baseline models achieve 0% accuracy, highlighting the unique difficulty of low-level ISA translation.

Real-World Performance vs Rosetta 2

We conducted a real-world study on an Apple M2 Pro, comparing GG against Rosetta 2 on execution time, CPU energy, and memory usage. GG achieves near-native performance and outperforms Rosetta 2 on all three metrics.

Metric	Rosetta 2	GG (Ours)	Native	Improvement
Execution Time (ms)	13.94	8.03	7.39	1.73× faster
CPU Energy (J)	7.50	5.09	5.07	1.47× better
RAM Usage (MB)	2.49	1.03	1.03	2.41× better
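The Improvement column can be reproduced from the raw measurements with simple ratios. The snippet below is a small sketch of that arithmetic; the helper names `improvement` and `overhead_vs_native` are introduced here for illustration, and the overhead figure is an extra derived quantity, not one reported in the table.

```python
# (Rosetta 2, GG, Native) values copied from the table above.
measurements = {
    "Execution Time (ms)": (13.94, 8.03, 7.39),
    "CPU Energy (J)": (7.50, 5.09, 5.07),
    "RAM Usage (MB)": (2.49, 1.03, 1.03),
}


def improvement(rosetta: float, gg: float) -> float:
    """How many times lower the GG value is than the Rosetta 2 value."""
    return rosetta / gg


def overhead_vs_native(gg: float, native: float) -> float:
    """Relative cost of GG over the native build, as a fraction."""
    return gg / native - 1.0


for metric, (rosetta, gg, native) in measurements.items():
    print(
        f"{metric}: {improvement(rosetta, gg):.3f}x vs Rosetta 2, "
        f"{overhead_vs_native(gg, native):+.1%} vs native"
    )
```

The ratios come out at roughly 1.736, 1.473, and 2.417, so the reported 2.41× for RAM usage appears to be truncated rather than rounded. Against the native build, GG's execution time is about 9% higher, while energy and memory are essentially on par.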

Evaluation Benchmarks

We evaluate GG using two complementary benchmarks: HumanEval-C, with 164 programming problems, and BringUpBench, with 65 bare-metal programs (85 to 5751 lines of code). Together they cover the range from isolated functions to full project structures with internal libraries.

Benchmark	Architecture	Optimization	Data
HumanEval	ARMv5	O0	Link
HumanEval	ARMv5	O2	Link
HumanEval	ARMv8	O0	Link
HumanEval	ARMv8	O2	Link
HumanEval	RISCv64	O0	Link
HumanEval	RISCv64	O2	Link
BringUpBench	ARMv8	O0	Link
BringUpBench	ARMv8	O2	Link

Inference

Check out inference.py for a simple script that runs inference with the GG models. The script takes an input assembly file and outputs the transpiled assembly code.

Here are all the available GG models:

Model	Architecture	Optimization	Link
GG-ARMv5	ARMv5	O0	Link
GG-ARMv5	ARMv5	O2	Link
GG-ARMv8	ARMv8	O0	Link
GG-ARMv8	ARMv8	O2	Link
GG-RISCv64	RISCv64	O0	Link
GG-RISCv64	RISCv64	O2	TBR (to be released)

ISA Similarity Analysis

We observe that transpilation accuracy correlates directly with ISA similarity to x86: ARMv8 exhibits the highest similarity (40.19%), followed by ARMv5 (25.09%) and RISC-V64 (21.41%).

Additionally, we analyze how compiler optimization levels affect opcode usage patterns in ARMv8. At -O2 optimization, mov instructions become dominant (+14.8%), indicating more register reuse and reduced memory traffic, which makes the learning task more challenging for the model.
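A rough way to look at opcode usage shifts like the one described above is to count mnemonics in two assembly listings and compare their shares. The sketch below is illustrative only: `opcode_shares` and `share_shift` are hypothetical helpers, the parsing assumes GNU-style ARMv8 assembly (one instruction per line, `//` comments, labels on their own line), the two snippets are hand-written toys rather than real compiler output, and it assumes the reported figure is a change in share measured in percentage points.

```python
from collections import Counter
from typing import Dict


def opcode_shares(asm: str) -> Dict[str, float]:
    """Fraction of instructions per mnemonic in an assembly listing."""
    counts: Counter = Counter()
    for line in asm.splitlines():
        code = line.split("//", 1)[0].strip()  # drop trailing comments
        # Skip blanks, assembler directives (.text, .globl, ...) and labels.
        if not code or code.startswith(".") or code.endswith(":"):
            continue
        counts[code.split()[0].lower()] += 1
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()} if total else {}


def share_shift(asm_o0: str, asm_o2: str) -> Dict[str, float]:
    """Change in each mnemonic's share from -O0 to -O2, in percentage points."""
    before, after = opcode_shares(asm_o0), opcode_shares(asm_o2)
    return {
        op: (after.get(op, 0.0) - before.get(op, 0.0)) * 100.0
        for op in sorted(set(before) | set(after))
    }


# Hand-written toy listings, not real compiler output.
TOY_O0 = """
main:
    sub sp, sp, #16
    str w0, [sp, 12]   // spill to the stack
    ldr w0, [sp, 12]   // reload from the stack
    add sp, sp, #16
    ret
"""

TOY_O2 = """
main:
    mov w1, w0
    mov w0, w1
    ret
"""

for op, delta in share_shift(TOY_O0, TOY_O2).items():
    print(f"{op:>4}: {delta:+.1f} pp")
```

In the toy pair, the loads and stores disappear and `mov` takes over, which mirrors the direction of the shift the paragraph describes; running the same helpers over real -O0 and -O2 compiler output would be needed to check the actual magnitude.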

Todos

Release training and evaluation scripts.
Release dataset compilation scripts.

Citation

If you use this code or the dataset in your research, please cite our paper:

@article{heakl2025guaranteed,
  title={Guaranteed Guess: A Language Modeling Approach for CISC-to-RISC Transpilation with Testing Guarantees},
  author={Heakl, Ahmed and Hashmi, Sarim and Abi, Chaimaa and Lee, Celine and Mahmoud, Abdulrahman},
  journal={arXiv preprint arXiv:2506.14606},
  year={2025}
}
