Ahmed Heakl, Sarim Hashmi, Chaimaa Abi, Celine Lee, Abdulrahman Mahmoud
MBZUAI · Cornell University
- August 2025: We're thrilled to share that GG has been accepted to EMNLP 2025!
- June 2025: Evaluation code for Bringup-Bench is released. Check out the `eval` folder!
- June 2025: Paper and inference code are released!
The hardware ecosystem is rapidly evolving, with increasing interest in translating low-level programs across different instruction set architectures (ISAs) in a quick, flexible, and correct way to enhance the portability and longevity of existing code. A particularly challenging class of this transpilation problem is translating between complex- (CISC) and reduced- (RISC) hardware architectures, due to fundamental differences in instruction complexity, memory models, and execution paradigms.
In this work, we introduce GG (Guaranteed Guess), an ISA-centric transpilation pipeline that combines the translation power of pre-trained large language models (LLMs) with the rigor of established software testing constructs. Our method generates candidate translations using an LLM from one ISA to another, and embeds such translations within a software-testing framework to build quantifiable confidence in the translation. We evaluate our GG approach over two diverse datasets, enforce high code coverage (>98%) across unit tests, and achieve functional/semantic correctness of 99% on HumanEval programs and 49% on BringupBench programs, respectively. Further, we compare our approach to the state-of-the-art Rosetta 2 framework on Apple Silicon, showcasing 1.73× faster runtime performance, 1.47× better energy efficiency, and 2.41× better memory usage for our transpiled code, demonstrating the effectiveness of GG for real-world CISC-to-RISC translation tasks.
- First CISC-to-RISC Transpiler: GG is the first CISC-to-RISC transpiler built via a custom-trained, architecture-aware LM achieving a test accuracy of 99.39% on ARMv8 and 89.93% on RISC-V64.
- Testing-Driven Validation: A methodology to measure and build confidence into transpilation output via software testing approaches ("guaranteeing" the guess), including detailed analysis of correctness, errors, and hallucinations.
- Hardware-Informed Design: An in-depth analysis of the inner workings of our transpiler, including hardware-informed design decisions to best train an accurate LLM for assembly transpilation.
- Real-World Case Study: GG's generated assembly achieves 1.73× runtime speedup, 1.47× better energy efficiency, and 2.41× memory efficiency compared to Apple Rosetta's x86-to-ARM virtualization engine.
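
The testing-driven validation idea can be illustrated with a minimal sketch. This is not the repository's implementation: candidates are modeled as plain Python callables (a hypothetical stand-in for assembling, linking, and running LLM-generated assembly), and the loop over several candidates is an assumption made for illustration. The core point is that a guess is accepted only if it passes every unit test.

```python
from typing import Callable, Iterable, Optional, Sequence, Tuple

# A test case pairs input arguments with the expected result.
TestCase = Tuple[tuple, object]
# Hypothetical stand-in: a real pipeline would wrap the assembled binary here.
Candidate = Callable[..., object]


def passes_all(candidate: Candidate, tests: Sequence[TestCase]) -> bool:
    """Return True only if the candidate matches every expected output."""
    for args, expected in tests:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            # A crash (for example, an invalid translation) counts as a failure.
            return False
    return True


def first_verified(
    candidates: Iterable[Candidate], tests: Sequence[TestCase]
) -> Optional[Candidate]:
    """Return the first candidate that passes the whole test suite, else None."""
    for candidate in candidates:
        if passes_all(candidate, tests):
            return candidate
    return None


# Toy unit tests for a two-argument addition routine.
tests = [((2, 3), 5), ((-1, 1), 0), ((0, 0), 0)]


def bad_guess(a, b):
    return a - b  # plausible-looking but wrong translation


def good_guess(a, b):
    return a + b


accepted = first_verified([bad_guess, good_guess], tests)
print(accepted.__name__ if accepted else "no candidate verified")
```

Passing the tests builds confidence in proportion to how thoroughly the suite exercises the program, which is why test coverage matters for this kind of validation.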
GG models significantly outperform all baseline models across different architectures and optimization levels. Most baseline models achieve 0% accuracy, highlighting the unique difficulty of low-level ISA translation.
We conducted a real-world study on an Apple M2 Pro comparing GG against Rosetta 2 across execution time, CPU energy, and memory usage. GG achieves near-native performance while significantly outperforming Rosetta 2 across all metrics.
| Metric | Rosetta 2 | GG (Ours) | Native | Improvement |
|---|---|---|---|---|
| Execution Time (ms) | 13.94 | 8.03 | 7.39 | 1.73× faster |
| CPU Energy (J) | 7.50 | 5.09 | 5.07 | 1.47× better |
| RAM Usage (MB) | 2.49 | 1.03 | 1.03 | 2.41× better |
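
As a quick sanity check, the improvement factors can be recomputed from the table, assuming each factor is the Rosetta 2 value divided by the GG value (lower is better for all three metrics). The function names below are illustrative. Recomputing from the rounded table entries agrees with the reported factors to within 0.01; small differences in the last digit are expected, since the reported factors were presumably derived from unrounded measurements.

```python
# Measurements copied from the table above: (Rosetta 2, GG, native).
measurements = {
    "Execution Time (ms)": (13.94, 8.03, 7.39),
    "CPU Energy (J)": (7.50, 5.09, 5.07),
    "RAM Usage (MB)": (2.49, 1.03, 1.03),
}


def improvement(rosetta: float, gg: float) -> float:
    """Ratio of the Rosetta 2 cost to the GG cost (lower cost is better)."""
    return rosetta / gg


def overhead_vs_native(gg: float, native: float) -> float:
    """Fractional overhead of GG relative to the native binary."""
    return gg / native - 1.0


ratios = {name: improvement(r, g) for name, (r, g, _) in measurements.items()}
overheads = {name: overhead_vs_native(g, n) for name, (_, g, n) in measurements.items()}

for name in measurements:
    print(f"{name}: {ratios[name]:.3f}x vs Rosetta 2, "
          f"{overheads[name]:+.1%} vs native")
```

The second column of the output quantifies the "near-native" claim: the GG overhead relative to native is under 10% for execution time and under 1% for energy and memory.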
We evaluate GG using two complementary benchmarks: HumanEval-C with 164 programming problems and BringUpBench with 65 bare-metal programs (85-5751 lines of code), providing comprehensive coverage from isolated functions to full project structures with internal libraries.
| Benchmark | Architecture | Optimization | Data |
|---|---|---|---|
| HumanEval | ARMv5 | O0 | Link |
| HumanEval | ARMv5 | O2 | Link |
| HumanEval | ARMv8 | O0 | Link |
| HumanEval | ARMv8 | O2 | Link |
| HumanEval | RISCv64 | O0 | Link |
| HumanEval | RISCv64 | O2 | Link |
| BringUpBench | ARMv8 | O0 | Link |
| BringUpBench | ARMv8 | O2 | Link |
Check out `inference.py` for a simple script to run inference on the GG models. The script takes an input assembly file and outputs the transpiled assembly code.
Here are all the available GG models:
| Model | Architecture | Optimization | Link |
|---|---|---|---|
| GG-ARMv5 | ARMv5 | O0 | Link |
| GG-ARMv5 | ARMv5 | O2 | Link |
| GG-ARMv8 | ARMv8 | O0 | Link |
| GG-ARMv8 | ARMv8 | O2 | Link |
| GG-RISCv64 | RISCv64 | O0 | Link |
| GG-RISCv64 | RISCv64 | O2 | TBR |
We observe that transpilation accuracy correlates directly with ISA similarity to x86. ARMv8 exhibits the highest similarity (40.19%), followed by ARMv5 (25.09%) and RISC-V64 (21.41%).
Additionally, we analyze how compiler optimization levels affect opcode usage patterns in ARMv8. At `-O2` optimization, `mov` instructions become dominant (+14.8%), indicating more register reuse and reduced memory traffic, which makes the learning task more challenging for the model.
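
An opcode-usage comparison of this kind can be sketched as follows. This is an illustrative approximation, not the analysis script used for the paper: the two snippets are hand-written toy listings rather than real compiler output, the parser handles only simple one-instruction-per-line AArch64 assembly, and the function names are hypothetical. The sketch reports the absolute change in an opcode's share of all instructions; whether the +14.8% figure above is absolute or relative is not stated here.

```python
from collections import Counter

# Hand-written toy listings (NOT real compiler output), for illustration only.
O0_SNIPPET = """
add_two:
    sub sp, sp, #16        // reserve stack space
    str w0, [sp, #12]
    str w1, [sp, #8]
    ldr w0, [sp, #12]
    ldr w1, [sp, #8]
    add w0, w0, w1
    add sp, sp, #16
    ret
"""

O2_SNIPPET = """
add_two:
    mov w2, w0
    add w0, w2, w1
    ret
"""


def opcode_shares(asm: str) -> dict:
    """Map each opcode to its fraction of all instructions in the listing."""
    counts = Counter()
    for line in asm.splitlines():
        code = line.split("//")[0].strip()  # drop AArch64-style line comments
        # Skip blank lines, assembler directives, and standalone labels.
        if not code or code.startswith(".") or code.endswith(":"):
            continue
        counts[code.split()[0].lower()] += 1
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()} if total else {}


def share_delta(before: dict, after: dict, opcode: str) -> float:
    """Absolute change in an opcode's share between two listings."""
    return after.get(opcode, 0.0) - before.get(opcode, 0.0)


shares_o0 = opcode_shares(O0_SNIPPET)
shares_o2 = opcode_shares(O2_SNIPPET)
mov_delta = share_delta(shares_o0, shares_o2, "mov")
print(f"mov share change from O0 to O2: {mov_delta:+.1%}")
```

Applied to real `-O0` and `-O2` listings of the same programs, the same counting would show which opcodes gain or lose share under optimization.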
- Release training and evaluation scripts.
- Release dataset compilation scripts.
If you use this code or the dataset in your research, please cite our paper:
@article{heakl2025guaranteed,
title={Guaranteed Guess: A Language Modeling Approach for CISC-to-RISC Transpilation with Testing Guarantees},
author={Heakl, Ahmed and Hashmi, Sarim and Abi, Chaimaa and Lee, Celine and Mahmoud, Abdulrahman},
journal={arXiv preprint arXiv:2506.14606},
year={2025}
}



