Note
The MicroRemed benchmark and the LLM backbone are designed to run within the same repository. In other words, if you intend to use this setup, you must ensure that both the microservice environment and the LLM can be executed on the same machine. This can be challenging on many non-bare-metal servers (e.g., DSW, DLC), where users deploying large models often lack the permissions needed to create containers (e.g., via Docker or containerd) for launching microservices.
To address this, we provide an alternative setup via MicroRemed-CS, which supports a client-server mode. In this configuration, microservices are deployed on one cluster, exposing operational interfaces, while the LLM runs on a separate cluster and interacts with the microservices remotely. Despite this separation, the overall benchmarking methodology remains consistent with the original MicroRemed design.
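As a rough sketch of the client side of this split (the host, endpoint name, and payload below are illustrative assumptions, not the actual MicroRemed-CS interface):

```python
import json
import urllib.request

# Hypothetical illustration of the client-server split: the LLM-side
# cluster talks to the microservice-side cluster over HTTP. Endpoint
# names and payloads are assumptions, not the actual MicroRemed-CS API.
MICROSERVICE_HOST = "http://ms-cluster.example.com:8080"

def remote_call(path: str, payload: dict) -> dict:
    """POST a JSON payload to the microservice cluster and return the reply."""
    req = urllib.request.Request(
        f"{MICROSERVICE_HOST}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# e.g., ask the remote side to execute a generated playbook and report status
result = remote_call("/execute-playbook", {"playbook": "..."})
```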
MicroRemed is a benchmark for assessing large language models in end-to-end microservice remediation: given a diagnosis report, a model must directly produce executable Ansible playbooks and autonomously restore the faulty system.
MicroRemed's design adheres to the following principles:
Unlike most LLM benchmarks that collect static data to form fixed datasets, MicroRemed is designed as a live and interactive execution environment.
It actively launches real microservice systems, injects controlled failures, and interacts dynamically with running services.
This design enables the benchmark to capture real-time behaviors, system dynamics, and contextual dependencies that static datasets cannot represent.
Evaluation is not determined by the linguistic or structural similarity of generated outputs, but by execution outcomes. Each generated playbook is executed within the microservice environment, and the benchmark verifies success by checking whether the system has fully recovered to its normal operational state.
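For intuition, this kind of execution-based verification amounts to polling the cluster until every pod reports Ready. The minimal sketch below does this with `kubectl` via `subprocess`; the benchmark's own check lives in `chaos/check_status.py` and may differ in detail.

```python
import json
import subprocess
import time

def all_pods_ready(namespace: str) -> bool:
    """Return True if every pod in the namespace reports a Ready condition."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    pods = json.loads(out)["items"]
    for pod in pods:
        conditions = pod.get("status", {}).get("conditions", [])
        if not any(c["type"] == "Ready" and c["status"] == "True" for c in conditions):
            return False
    return bool(pods)

def wait_for_recovery(namespace: str, timeout: int = 300, interval: int = 10) -> bool:
    """Poll the cluster until recovery or timeout; mirrors execution-based scoring."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if all_pods_ready(namespace):
            return True
        time.sleep(interval)
    return False
```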
Built on these foundations, the benchmark is designed to be:
- Method-scalable: supports diverse LLM-based remediation methods.
- LLM-scalable: allows plug-and-play replacement of remediation models.
- Failure-scalable: accommodates various failure scenarios.
- System-scalable: easily extendable to new microservice systems with minimal configuration effort.
MicroRemed currently includes seven representative failure types, three real-world microservice systems, and two reference methodologies.
| No. | Category | Failure Type |
|---|---|---|
| 1 | Resource-Level | CPU Saturation |
| 2 | Resource-Level | Memory Saturation |
| 3 | Resource-Level | IO Saturation |
| 4 | Network-Level | Network Loss |
| 5 | Network-Level | Network Delay |
| 6 | Application-Level | Pod Failure |
| 7 | Application-Level | Configuration Error |
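As a rough illustration of how such faults are injected, the sketch below renders a Chaos Mesh manifest (e.g., `chaos/templates/cpu-stress.yaml`) and applies it with `kubectl`. The placeholder fields are assumptions; the benchmark's actual injection logic lives in `chaos/injection.py`.

```python
import subprocess
from string import Template

# Hypothetical illustration: render a Chaos Mesh template and apply it.
# The placeholder names ($namespace, $duration) are assumptions; the real
# templates under chaos/templates/ may use a different substitution scheme.
def inject_failure(template_path: str, namespace: str, duration: str = "60s") -> None:
    with open(template_path) as f:
        manifest = Template(f.read()).safe_substitute(
            namespace=namespace, duration=duration
        )
    subprocess.run(["kubectl", "apply", "-f", "-"],
                   input=manifest, text=True, check=True)

inject_failure("chaos/templates/cpu-stress.yaml", namespace="train-ticket")
```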
MicroRemed integrates three microservice systems:
- Train-Ticket [Zhou et al., 2018]: a comprehensive benchmark simulating a railway ticketing platform.
- Online-Boutique [Google, 2025]: an e-commerce microservice application widely used in reliability and observability studies.
- Simple-Micro: a self-developed lightweight system designed for controlled experimentation and fine-grained analysis.
Although MicroRemed supports arbitrary combinations of injected failures, we define three standardized difficulty levels to ensure fair and structured comparison across remediation methods:
| Level | #Cases | Included Failure Types |
|---|---|---|
| 🟢 Easy | 23 | CPU Saturation, Memory Saturation, Network Loss, Network Delay |
| 🟡 Medium | 49 | CPU Saturation, Memory Saturation, IO Saturation, Network Loss, Network Delay, Pod Failure |
| 🔴 Hard | 80 | CPU Saturation, Memory Saturation, IO Saturation, Network Loss, Network Delay, Pod Failure, Configuration Error |
To facilitate fair evaluation and comparison across different LLMs, we introduce two reference methodologies: SoloGen and ThinkRemed.
SoloGen represents a straightforward one-shot generation baseline. It replaces the candidate remediation LLM in the MicroRemed pipeline with a pre-trained large language model that receives all relevant contextual information in a single prompt and directly outputs the final Ansible playbook. This approach eliminates intermediate reasoning or iterative refinement, serving as a minimal yet effective baseline for debugging and evaluating the benchmark setup.
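Conceptually, SoloGen reduces to a single LLM call. The sketch below is only a schematic rendering of that idea; the function and prompt wording are illustrative, not the code in `methods/SoloGen/generator.py`.

```python
# Schematic of one-shot generation; not the actual SoloGen implementation.
def sologen(llm, context: str, failure_report: str) -> str:
    """Bundle all context into one prompt and return an Ansible playbook."""
    prompt = (
        "You are a site reliability engineer.\n"
        f"Cluster context:\n{context}\n"
        f"Failure report:\n{failure_report}\n"
        "Output a single executable Ansible playbook that remediates the fault."
    )
    return llm.generate(prompt)  # one call, no iteration or reflection
```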
While SoloGen performs direct generation without adaptive reasoning, it often struggles with complex multi-service dependencies and incomplete contextual information. To address these limitations, we propose ThinkRemed, a multi-agent framework designed to emulate the SRE-like remediation process in microservice systems.
As illustrated in the figure above, ThinkRemed comprises four cooperative agents (Coordinator, Probe, Execution, and Verification) that operate within an iterative reasoning-action-reflection loop.
- The Coordinator first receives the auxiliary context `C_0` and failure report `R_0`, and adaptively determines whether to invoke the Probe Agent to collect additional runtime information from the system.
- It then synthesizes a candidate Ansible playbook `p_t`, which is executed by the Execution Agent to remediate the faulty microservice system.
- The Verification Agent subsequently assesses the remediation result, producing a binary outcome `v_t ∈ {0, 1}` indicating success or failure.
- If remediation fails, the system enters a reflection phase, and control returns to the Coordinator for iterative refinement based on feedback.

To ensure timely remediation and accommodate LLM context limitations, the iteration loop is bounded by a maximum trial budget `T_max`.
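In pseudocode, the loop described above looks roughly as follows; the agent interfaces are illustrative, and the actual implementations live under `methods/ThinkRemed/`.

```python
# Illustrative pseudocode for the reason-act-reflect loop, bounded by T_max.
def thinkremed(coordinator, probe, executor, verifier, C0, R0, T_max: int):
    context, report = C0, R0
    for t in range(T_max):
        # Coordinator decides whether extra runtime information is needed.
        if coordinator.needs_probe(context, report):
            context += probe.collect(report)
        playbook = coordinator.synthesize_playbook(context, report)  # p_t
        result = executor.run(playbook)
        v_t = verifier.assess(result)  # v_t in {0, 1}
        if v_t == 1:
            return playbook  # system recovered
        # Reflection: feed execution feedback back to the Coordinator.
        report = coordinator.reflect(report, playbook, result)
    return None  # trial budget exhausted
```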
The MicroRemed benchmark repository is organized as follows:
- 📁 `chaos/`: Contains all configurations and scripts for failure injection and recovery, including both Chaos Mesh-based faults and configuration-driven dynamic modifications.
- 🧩 `envs/`: Hosts the built-in microservice systems (Train-Ticket, Online-Boutique, and Simple-Micro). Additional systems can easily be added to this directory.
- 📊 `experiments/`: Provides a collection of predefined failure sequences used for fair and reproducible evaluations.
- 🔧 `methods/`: Includes implementations of different microservice remediation methods. Currently, it contains two representative approaches: SoloGen and ThinkRemed.
- 🤖 `models/`: Defines the base LLM invocation logic for various remediation agents.
- ⚙️ `scripts/`: Contains essential scripts for installation, startup, and shutdown of benchmark components.
```
├── README.md
├── chaos
│   ├── __init__.py
│   ├── check_status.py
│   ├── deployment.py
│   ├── failures.py
│   ├── injection.py
│   └── templates
│       ├── cpu-stress.yaml
│       ├── disk-io.yaml
│       ├── memory-stress.yaml
│       ├── network-delay.yaml
│       ├── network-loss.yaml
│       └── pod-fail.yaml
├── config.py
├── envs
│   ├── __init__.py
│   ├── env.py
│   ├── online-boutique
│   │   ├── deploy.sh
│   │   └── kubernetes-manifests.yaml
│   ├── simple-micro
│   │   ├── deploy.sh
│   │   └── k8s-deploy.yaml
│   ├── source-config
│   │   ├── online-boutique-config.yaml
│   │   ├── simple-micro-config.yaml
│   │   └── train-ticket-config.yaml
│   └── train-ticket
│       ├── CHANGELOG-1.0.md
│       ├── LICENSE
│       ├── Makefile
│       ├── README.md
│       ├── build_upload_image.py
│       ├── deploy.sh
│       ├── deployment
│       ├── docker-compose.yml
│       ├── hack
│       └── pom.xml
├── experiments
│   ├── cpu-stress.txt
│   ├── disk-io.txt
│   ├── easy.txt
│   ├── hard.txt
│   ├── medium.txt
│   ├── memory-stress.txt
│   ├── network-delay.txt
│   ├── network-loss.txt
│   ├── pod-config-error.txt
│   └── pod-fail.txt
├── inject_and_remediate.py
├── inventory.ini
├── logs
├── methods
│   ├── ThinkRemed
│   │   ├── __init__.py
│   │   ├── coordinator.py
│   │   ├── probe_agent.py
│   │   ├── tools.py
│   │   └── verification_agent.py
│   ├── SoloGen
│   │   ├── __init__.py
│   │   ├── generator.py
│   │   └── tools.py
│   ├── __init__.py
│   ├── execution_agent.py
│   └── remediate.py
├── models
│   ├── __init__.py
│   └── llm.py
├── requirement.txt
└── scripts
    ├── install.sh
    ├── install_ansible.sh
    ├── install_chaos.sh
    ├── install_k3s.sh
    ├── run.sh
    ├── run_all.sh
    ├── start_chaos.sh
    ├── stop.sh
    └── stop_chaos.sh
```
This section describes how to set up and run MicroRemed for evaluating LLM-based microservice remediation methods.
To prepare the environment, execute the installation script and install Python dependencies:
```bash
cd MicroRemed
./scripts/install.sh
pip install -r requirement.txt
```

This will automatically install and configure:
- k3s (lightweight Kubernetes)
- Chaos Mesh (for controlled failure injection)
- Ansible (for automated remediation)
- Python packages required for running the benchmark
MicroRemed provides two usage modes: quick-start and custom.
If no additional configuration is required, you can directly run:
```bash
./scripts/run_all.sh <env_name> <log_dir> <model>
```

Example:

```bash
./scripts/run_all.sh train-ticket logs/qwen-plus qwen-plus
```

- `<env_name>`: Target microservice environment (e.g., `train-ticket`)
- `<log_dir>`: Directory to store experiment logs
- `<model>`: Backbone LLM to use for remediation (e.g., `qwen-plus`)
This will automatically:
- Launch the specified environment
- Inject failures as defined in the benchmark
- Use the specified LLM to generate remediation playbooks
- Save logs under the specified directory
For fine-grained control over experiments, use `inject_and_remediate.py`:
```bash
python3 inject_and_remediate.py \
    --experiments 50 \
    --namespace train-ticket \
    --wait-interval 10 \
    --injection-timeout 60 \
    --env train-ticket \
    --save-path conversations \
    --manifest-path envs/source-config/train-ticket-config.yaml \
    --remediate-method SoloGen \
    --experiment-path experiments/easy.txt \
    --model qwen-plus
```

🔧 Key Arguments:
| Argument | Type | Default | Description |
|---|---|---|---|
| `--experiments` | int | `100` | Number of experiments to run. |
| `--namespace` | str | `default` | Kubernetes namespace to target. |
| `--wait-interval` | int | `10` | Interval (seconds) between system status checks. |
| `--injection-timeout` | int | `30` | Timeout (seconds) for failure injection. |
| `--env` | str | `train-ticket` | Target environment identifier. |
| `--save-path` | str | `conversations` | Directory to store conversation logs. |
| `--manifest-path` | str | N/A | Path to the original service configuration, used for restoration. |
| `--remediate-method` | str | `ThinkRemed` | Remediation method to use (`ThinkRemed` / `SoloGen`). |
| `--experiment-path` | str | `None` | Path to the experiment configuration file. |
| `--enable-strict-restart` | bool | `False` | Attempt a restart on every injection timeout. |
| `--model` | str | N/A | Backbone LLM applied for remediation. |
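For batch experiments, the CLI can be scripted. Below is a small convenience sketch (not part of the repository) that sweeps the three difficulty levels using only the flags documented above:

```python
import subprocess

# Convenience sketch (not part of the repository): run all three
# difficulty levels back to back with the documented CLI flags.
for level in ["easy", "medium", "hard"]:
    subprocess.run([
        "python3", "inject_and_remediate.py",
        "--env", "train-ticket",
        "--namespace", "train-ticket",
        "--manifest-path", "envs/source-config/train-ticket-config.yaml",
        "--remediate-method", "ThinkRemed",
        "--experiment-path", f"experiments/{level}.txt",
        "--save-path", f"conversations/{level}",
        "--model", "qwen-plus",
    ], check=True)
```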
Coming soon.

