This repository contains the implementation of SLM-MUX, a method that orchestrates multiple small language models (SLMs) to improve reasoning performance.
```
slm_mux_code/
├── slm_mux_orchestrator/        # Stage 2: SLM-MUX Orchestration
│   ├── math_benchmark.py
│   ├── gpqa_benchmark.py
│   └── gsm8k_benchmark.py
├── single_model_inference/      # Stage 1: Single Model Inference
│   ├── collect_math.py
│   ├── collect_gpqa.py
│   └── collect_gsm8k.py
├── data/
├── evaluation/
├── utils/
├── config/
├── requirements.txt
└── README.md
```
```bash
# Clone the repository
git clone https://github.com/slm-mux/slm-mux.github.io.git
cd slm-mux.github.io/slm_mux_code

# Install dependencies
pip install -r requirements.txt
```

Set up your API keys as environment variables:

```bash
export TOGETHER_API_KEY="your_together_api_key"
export OPENAI_API_KEY="your_openai_api_key"  # Optional, for verification
```

Example: Run SLM-MUX on MATH with 2 models
```bash
python slm_mux_orchestrator/math_benchmark.py \
    --data_path data/math_500.json \
    --output results/math_slmmux_results.json \
    --models \
        "mistralai/Mistral-7B-Instruct-v0.3" \
        "Qwen/Qwen2.5-7B-Instruct" \
    --extra_calls 3
```

Example: Run SLM-MUX on GPQA
```bash
python slm_mux_orchestrator/gpqa_benchmark.py \
    --data_path data/gpqa_shuffled.json \
    --output results/gpqa_slmmux_results.json \
    --models \
        "mistralai/Mistral-7B-Instruct-v0.3" \
        "Qwen/Qwen2.5-7B-Instruct" \
    --extra_calls 3
```

Example: Run SLM-MUX on GSM8K
```bash
python slm_mux_orchestrator/gsm8k_benchmark.py \
    --data_path data/gsm8k_500.json \
    --output results/gsm8k_slmmux_results.json \
    --models \
        "mistralai/Mistral-7B-Instruct-v0.3" \
        "Qwen/Qwen2.5-7B-Instruct" \
    --extra_calls 3
```

This will:
- Query all specified models for each problem
- Run each model multiple times (controlled by `--extra_calls`)
- Apply the SLM-MUX algorithm to select the best answer
- Save results with accuracy metrics
To measure baseline performance for a single model:

```bash
python single_model_inference/collect_math.py \
    --dataset data/math_500.json \
    --model "mistralai/Mistral-7B-Instruct-v0.3" \
    --output collected_outputs/math_mistral7b_baseline.json \
    --num_llm_sub_iterations 3
```

Note: This is optional and only needed for comparison with single-model baselines.
Each benchmark script outputs a JSON file containing:
- Individual model responses and extracted answers
- Vote counts for each unique answer
- Selected final answer based on confidence
- Token usage statistics
- Overall accuracy
Example output structure:
```json
{
  "problem_id": "...",
  "problem": "...",
  "reference_answer": "...",
  "models": [
    {
      "model_name": "...",
      "calls": [...],
      "best_answer": "...",
      "confidence": 0.67
    }
  ],
  "final_answer": "...",
  "is_correct": true
}
```
The `single_model_inference/` directory contains scripts for collecting single-model baseline performance. This is optional and only needed for comparison purposes.
```bash
# Example: Collect MATH responses from a single model
python single_model_inference/collect_math.py \
    --dataset data/math_500.json \
    --model "mistralai/Mistral-7B-Instruct-v0.3" \
    --output collected_outputs/math_mistral7b_baseline.json \
    --num_llm_sub_iterations 3
```

This is useful for comparing SLM-MUX results against individual model performance.
For MATH and GSM8K, we provide scripts to verify answer equivalence using GPT-4o:
```bash
# Check MATH answers
python evaluation/check_equivalence_math.py \
    --results results/math_results.json \
    --output results/math_verified.json

# Check GSM8K answers
python evaluation/check_equivalence_gsm8k.py \
    --results results/gsm8k_results.json \
    --output results/gsm8k_verified.json
```
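Under the hood, an equivalence check of this kind reduces to a single judgment call to GPT-4o. The snippet below is an illustrative sketch, not the repository's actual prompt or script; it assumes `OPENAI_API_KEY` is set and uses the `openai` Python client:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answers_equivalent(predicted: str, reference: str) -> bool:
    """Ask GPT-4o whether two final answers are mathematically equivalent."""
    prompt = (
        "Are the following two answers mathematically equivalent? "
        "Reply with only YES or NO.\n"
        f"Answer A: {predicted}\n"
        f"Answer B: {reference}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")

# Example: "1/2" and "0.5" should be judged equivalent.
print(answers_equivalent("1/2", "0.5"))
```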
The core of SLM-MUX involves the following steps (a sketch follows the list):
- Multiple Sampling: Each model generates multiple responses (controlled by `--extra_calls`)
- Confidence Estimation: Count the frequency of each unique answer
- Model Selection: Choose the answer with the highest confidence across models
- Tie Breaking: Use validation set performance when confidence scores tie
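The selection step can be summarized in a short sketch. The function and argument names below (`slm_mux_select`, `validation_accuracy`) are illustrative, not the repository's actual API:

```python
from collections import Counter

def slm_mux_select(answers_per_model, validation_accuracy):
    """Illustrative SLM-MUX selection: pick the answer backed by the most
    self-consistent model, breaking ties with validation-set accuracy.

    answers_per_model: dict of model name -> list of extracted answers
                       from repeated calls to that model.
    validation_accuracy: dict of model name -> held-out accuracy (tie-breaker).
    """
    best_key, best_answer = None, None
    for model, answers in answers_per_model.items():
        answer, count = Counter(answers).most_common(1)[0]
        confidence = count / len(answers)  # frequency of the modal answer
        key = (confidence, validation_accuracy.get(model, 0.0))
        if best_key is None or key > best_key:
            best_key, best_answer = key, answer
    return best_answer

# Model A is split across calls (confidence 0.67); model B is unanimous
# (confidence 1.0), so model B's answer is selected.
print(slm_mux_select(
    {"model_a": ["42", "42", "7"], "model_b": ["41", "41", "41"]},
    validation_accuracy={"model_a": 0.71, "model_b": 0.65},
))  # -> 41
```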
The single-model collection scripts are optional and used only for baseline comparison. They:
- Collect responses from a single model for performance benchmarking
- Generate prompts for each problem
- Call the specified model via the Together API (see the sketch after this list)
- Save raw responses and token usage for analysis
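For reference, a single chat-completion call against Together AI can be made through its OpenAI-compatible endpoint, roughly as sketched below. This is a standalone illustration; the repository's own `api_client.py` wraps such calls with retries and token tracking (see the next section):

```python
import os
from openai import OpenAI

# Together AI exposes an OpenAI-compatible API; the exact client used by the
# repository may differ -- this is only an illustrative standalone call.
client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url="https://api.together.xyz/v1",
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "What is 12 * 7? Answer with a single number."}],
    temperature=0.7,
)

print(response.choices[0].message.content)   # raw model response
print(response.usage.total_tokens)           # token usage reported per call
```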
Each dataset has specialized answer extraction logic (illustrated after this list):
- MATH: Extracts content within `\boxed{...}` and normalizes LaTeX
- GPQA: Uses multiple-choice extraction (A/B/C/D)
- GSM8K: Extracts numerical answers marked with `####`
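The snippet below is only a rough illustration of these three patterns; the repository's actual code also normalizes LaTeX and handles edge cases this sketch ignores:

```python
import re

def extract_answer(text: str, dataset: str) -> str | None:
    """Rough per-dataset answer extraction (illustrative only)."""
    if dataset == "math":
        # Take the contents of the last \boxed{...} (non-nested case only).
        matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
        return matches[-1].strip() if matches else None
    if dataset == "gpqa":
        # Take the last standalone multiple-choice letter A-D.
        match = re.search(r"\b([ABCD])\b(?!.*\b[ABCD]\b)", text, re.DOTALL)
        return match.group(1) if match else None
    if dataset == "gsm8k":
        # GSM8K marks the final numeric answer with "####".
        match = re.search(r"####\s*([-+]?[\d,.]+)", text)
        return match.group(1).replace(",", "") if match else None
    return None

print(extract_answer(r"... so the answer is \boxed{42}.", "math"))  # 42
```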
The `api_client.py` module supports:
- Together AI API for running open-source models
- OpenAI API for verification (optional)
- Automatic retry with exponential backoff (sketched after this list)
- Token usage tracking
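A generic retry-with-exponential-backoff wrapper looks roughly like this (a sketch, not the repository's actual implementation):

```python
import random
import time

def call_with_retries(request_fn, max_retries: int = 5, base_delay: float = 1.0):
    """Retry a flaky API call with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return request_fn()
        except Exception as exc:  # in practice, catch the client's rate-limit/timeout errors
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            print(f"Request failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)

# Usage:
#   call_with_retries(lambda: client.chat.completions.create(...))
```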
When contributing code, please:
- Follow the existing code style (PEP 8)
- Add type hints to function signatures
- Include docstrings for public functions
- Update README if adding new features
This code is released under the MIT License. See LICENSE file for details.
If you use this code in your research, please cite:
```bibtex
@article{slm-mux-2025,
  title={SLM-MUX: Orchestrating Small Language Models for Reasoning},
  author={Wang, Chenyu and Wan, Zishen and Kang, Hao and Chen, Emma and Xie, Zhiqiang and Krishna, Tushar and Reddi, Vijay Janapa and Du, Yilun},
  year={2025}
}
```

- Project Page: https://slm-mux.github.io
- Paper: [Under Review]
- GitHub: https://github.com/slm-mux/SLM-MUX
For questions or issues, please:
- Open an issue on GitHub
- Contact: [email protected]