ahmedheakl/Guaranteed-Guess

Guaranteed Guess: A Language Modeling Approach for CISC-to-RISC Transpilation with Testing Guarantees [EMNLP 2025 🔥]

Website arXiv dataset model

Ahmed Heakl, Sarim Hashmi, Chaimaa Abi, Celine Lee, Abdulrahman Mahmoud

MBZUAI · Cornell University


🆕 Latest Updates

  • 📢 August 2025: We're thrilled to share that GG has been accepted to EMNLP 2025! 🎊
  • 📢 June 2025: Evaluation code for Bringup-Bench is released. Check out the eval folder!
  • 📢 June 2025: Paper and inference code are released!

Overview

The hardware ecosystem is rapidly evolving, with increasing interest in translating low-level programs across different instruction set architectures (ISAs) in a quick, flexible, and correct way to enhance the portability and longevity of existing code. A particularly challenging class of this transpilation problem is translating between complex- (CISC) and reduced- (RISC) hardware architectures, due to fundamental differences in instruction complexity, memory models, and execution paradigms.

In this work, we introduce GG (Guaranteed Guess), an ISA-centric transpilation pipeline that combines the translation power of pre-trained large language models (LLMs) with the rigor of established software testing constructs. Our method generates candidate translations using an LLM from one ISA to another, and embeds such translations within a software-testing framework to build quantifiable confidence in the translation. We evaluate our GG approach over two diverse datasets, enforce high code coverage (>98%) across unit tests, and achieve functional/semantic correctness of 99% on HumanEval programs and 49% on BringupBench programs, respectively. Further, we compare our approach to the state-of-the-art Rosetta 2 framework on Apple Silicon, showcasing 1.73× faster runtime performance, 1.47× better energy efficiency, and 2.41× better memory usage for our transpiled code, demonstrating the effectiveness of GG for real-world CISC-to-RISC translation tasks.
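The generate-then-test idea described above can be illustrated with a minimal sketch. It is not the GG implementation: the function name `guaranteed_guess`, the `translate` callable, and the toy test are hypothetical stand-ins. In a real pipeline the translator would be an LLM and each test would assemble and run the candidate against a unit test.

```python
from typing import Callable, Iterable, Optional, Sequence

# Hypothetical signatures for illustration; none of these names come from the GG codebase.
Translator = Callable[[str], Iterable[str]]  # source assembly -> candidate translations
TestCase = Callable[[str], bool]             # candidate assembly -> pass/fail


def guaranteed_guess(
    source_asm: str, translate: Translator, tests: Sequence[TestCase]
) -> Optional[str]:
    """Return the first candidate translation that passes every test, else None."""
    for candidate in translate(source_asm):
        if all(test(candidate) for test in tests):
            return candidate
    return None


# Toy stand-ins: a "model" that guesses twice and a "test" that checks the immediate.
def toy_translate(source_asm: str) -> Iterable[str]:
    yield "add w0, w0, #2"  # incorrect guess, rejected by the test
    yield "add w0, w0, #1"  # correct guess, accepted


def toy_test(candidate: str) -> bool:
    # A real test would assemble the candidate and compare its behavior to the source.
    return candidate.endswith("#1")


result = guaranteed_guess("addl $1, %eax", toy_translate, [toy_test])
print(result)
```

The point of the structure is that a translation is only accepted once it has passed the tests, so confidence in the output comes from the test suite rather than from the model alone.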

GG Overview

🚀 Highlights

  • First CISC-to-RISC Transpiler: GG is the first CISC-to-RISC transpiler built via a custom-trained, architecture-aware LM, achieving a test accuracy of 99.39% on ARMv8 and 89.93% on RISC-V64.
  • Testing-Driven Validation: A methodology to measure and build confidence in transpilation output via software testing approaches ("guaranteeing" the guess), including detailed analysis of correctness, errors, and hallucinations.
  • Hardware-Informed Design: An in-depth analysis of the inner workings of our transpiler, including hardware-informed design decisions to best train an accurate LLM for assembly transpilation.
  • Real-World Case Study: GG's generated assembly achieves 1.73× runtime speedup, 1.47× better energy efficiency, and 2.41× memory efficiency compared to Apple Rosetta's x86-to-ARM virtualization engine.

Results

GG models significantly outperform all baseline models across different architectures and optimization levels. Most baseline models achieve 0% accuracy, highlighting the unique difficulty of low-level ISA translation.

GG Results

Real-World Performance vs Rosetta 2

We conducted a real-world study on an Apple M2 Pro comparing GG against Rosetta 2 on execution time, CPU energy, and memory usage. GG achieves near-native performance while outperforming Rosetta 2 on all three metrics.

| Metric | Rosetta 2 | GG (Ours) | Native | Improvement |
| --- | --- | --- | --- | --- |
| Execution Time (ms) | 13.94 | 8.03 | 7.39 | 1.73× faster |
| CPU Energy (J) | 7.50 | 5.09 | 5.07 | 1.47× better |
| RAM Usage (MB) | 2.49 | 1.03 | 1.03 | 2.41× better |
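As a worked check of the Improvement column, the sketch below assumes each factor is the Rosetta 2 measurement divided by the GG measurement. Under that assumption the reported factors match the ratios truncated (not rounded) to two decimals. The helper names are illustrative only.

```python
import math

# Measurements copied from the table above: (Rosetta 2, GG, native).
measurements = {
    "Execution Time (ms)": (13.94, 8.03, 7.39),
    "CPU Energy (J)": (7.50, 5.09, 5.07),
    "RAM Usage (MB)": (2.49, 1.03, 1.03),
}


def truncate2(x: float) -> float:
    """Truncate a positive number to two decimal places."""
    return math.floor(x * 100) / 100


def improvement(rosetta: float, gg: float) -> float:
    """Assumed definition: how many times lower the GG measurement is."""
    return rosetta / gg


def native_overhead_pct(gg: float, native: float) -> float:
    """Relative gap between GG and native, in percent."""
    return (gg - native) / native * 100


for metric, (rosetta, gg, native) in measurements.items():
    factor = truncate2(improvement(rosetta, gg))
    overhead = native_overhead_pct(gg, native)
    print(f"{metric}: {factor:.2f}x vs Rosetta 2, {overhead:.1f}% above native")
```

The second number makes the "near-native" claim concrete: GG's execution time is under 9% above native, and its energy and memory figures are within 1% of native.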

Evaluation Benchmarks

We evaluate GG using two complementary benchmarks: HumanEval-C with 164 programming problems and BringUpBench with 65 bare-metal programs (85-5751 lines of code), providing comprehensive coverage from isolated functions to full project structures with internal libraries.

GG Benchmarks

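| Benchmark | Architecture | Optimization | Data |
| --- | --- | --- | --- |
| HumanEval | ARMv5 | O0 | Link |
| HumanEval | ARMv5 | O2 | Link |
| HumanEval | ARMv8 | O0 | Link |
| HumanEval | ARMv8 | O2 | Link |
| HumanEval | RISCv64 | O0 | Link |
| HumanEval | RISCv64 | O2 | Link |
| BringUpBench | ARMv8 | O0 | Link |
| BringUpBench | ARMv8 | O2 | Link |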

Inference

Check out inference.py for a simple script to run inference on the GG models. The script takes an input assembly file and outputs the transpiled assembly code.

Here are all the available GG models:

| Model | Architecture | Optimization | Link |
| --- | --- | --- | --- |
| GG-ARMv5 | ARMv5 | O0 | Link |
| GG-ARMv5 | ARMv5 | O2 | Link |
| GG-ARMv8 | ARMv8 | O0 | Link |
| GG-ARMv8 | ARMv8 | O2 | Link |
| GG-RISCv64 | RISCv64 | O0 | Link |
| GG-RISCv64 | RISCv64 | O2 | TBR |

ISA Similarity Analysis

We observe a direct correlation between ISA similarity and transpilation accuracy. ARMv8 exhibits the highest similarity to x86 (40.19%), followed by ARMv5 (25.09%) and RISC-V64 (21.41%), and model accuracy across these architectures correlates with that similarity.

Additionally, we analyze how compiler optimization levels affect opcode usage patterns in ARMv8. At -O2 optimization, mov instructions become dominant (+14.8%), indicating more register reuse and reduced memory traffic, which makes the learning task more challenging for the model.
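One way to measure a shift in opcode usage like the one described above is to compare each opcode's share of all instructions at the two optimization levels. The sketch below is an assumption about how such a figure could be computed, not the analysis used for GG, and the README does not say whether +14.8% is a percentage-point change or a relative one. The two assembly snippets are toy examples, not taken from the dataset.

```python
from collections import Counter
from typing import Dict


def opcode_shares(asm: str) -> Dict[str, float]:
    """Map each opcode to its fraction of all instructions in an assembly listing."""
    counts: Counter = Counter()
    for raw in asm.splitlines():
        line = raw.split("//")[0].strip()  # drop trailing // comments
        # Skip blanks, assembler directives (.text, .globl) and labels (main:).
        if not line or line.startswith(".") or line.endswith(":"):
            continue
        counts[line.split()[0].lower()] += 1
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()} if total else {}


def share_shift_points(asm_o0: str, asm_o2: str, opcode: str) -> float:
    """Change in an opcode's share from -O0 to -O2, in percentage points."""
    o0, o2 = opcode_shares(asm_o0), opcode_shares(asm_o2)
    return (o2.get(opcode, 0.0) - o0.get(opcode, 0.0)) * 100


# Toy ARMv8 listings for illustration only.
toy_o0 = """
main:
    sub sp, sp, #16
    str w0, [sp, #12]
    ldr w8, [sp, #12]
    add w0, w8, #1
    add sp, sp, #16
    ret
"""
toy_o2 = """
main:
    mov w8, w0
    add w0, w8, #1
    mov w1, w8
    ret
"""

print(f"mov share shift: {share_shift_points(toy_o0, toy_o2, 'mov'):+.1f} points")
```

In the toy listings the stack loads and stores disappear at -O2 and register moves take their place, which is the same qualitative pattern the paragraph describes.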

ISA Similarity Analysis

Todos

  • Release training and evaluation scripts.
  • Release dataset compilation scripts.

Citation

If you use this code or the dataset in your research, please cite our paper:

@article{heakl2025guaranteed,
  title={Guaranteed Guess: A Language Modeling Approach for CISC-to-RISC Transpilation with Testing Guarantees},
  author={Heakl, Ahmed and Hashmi, Sarim and Abi, Chaimaa and Lee, Celine and Mahmoud, Abdulrahman},
  journal={arXiv preprint arXiv:2506.14606},
  year={2025}
}
