MicroRemed: Benchmarking LLMs in Microservices Remediation

Paper

Note

The MicroRemed benchmark and the LLM backbone are designed to run together from this repository; in other words, both the microservice environment and the LLM must be executable on the same machine. This can be challenging on many non-bare-metal servers (e.g., DSW, DLC), where large-model deployments often lack the permissions needed to create containers (e.g., Docker or containerd) for launching microservices.

To address this, we provide an alternative setup via MicroRemed-CS, which supports a client-server mode. In this configuration, microservices are deployed on one cluster, exposing operational interfaces, while the LLM runs on a separate cluster and interacts with the microservices remotely. Despite this separation, the overall benchmarking methodology remains consistent with the original MicroRemed design.

👋 Overview

MicroRemed is a benchmark for assessing large language models on end-to-end microservice remediation: given a diagnosis report, a model must directly produce an executable Ansible playbook and autonomously restore the faulty system.

MicroRemed Benchmark

MicroRemed's design adheres to the following principles:


🧩 Dynamic Execution Benchmark

Unlike most LLM benchmarks that collect static data to form fixed datasets, MicroRemed is designed as a live and interactive execution environment.
It actively launches real microservice systems, injects controlled failures, and interacts dynamically with running services.
This design enables the benchmark to capture real-time behaviors, system dynamics, and contextual dependencies that static datasets cannot represent.
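
For intuition, the sketch below shows one way a controlled failure could be injected into a running system, and later cleared, by applying one of the bundled Chaos Mesh templates with kubectl. It is an illustration only, not the logic in chaos/injection.py; the target namespace and the entry points are assumptions.

import subprocess

CHAOS_TEMPLATE_DIR = "chaos/templates"


def inject_failure(template: str, namespace: str = "train-ticket") -> None:
    """Apply one of the bundled Chaos Mesh experiment manifests to a live system."""
    subprocess.run(
        ["kubectl", "apply", "-n", namespace, "-f", f"{CHAOS_TEMPLATE_DIR}/{template}"],
        check=True,
    )


def clear_failure(template: str, namespace: str = "train-ticket") -> None:
    """Delete the same manifest, which ends the injected fault."""
    subprocess.run(
        ["kubectl", "delete", "-n", namespace, "-f", f"{CHAOS_TEMPLATE_DIR}/{template}"],
        check=True,
    )


if __name__ == "__main__":
    inject_failure("cpu-stress.yaml")  # one of the templates under chaos/templates/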


⚙️ Execution-based Evaluation

Evaluation is determined not by the linguistic or structural similarity of generated outputs but by execution outcomes. Each generated playbook is executed within the microservice environment, and the benchmark verifies success by checking whether the system has fully recovered to its normal operational state.
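
As a rough picture of what "verified by execution outcome" can mean in practice, the sketch below polls pod readiness until the target namespace reports healthy or a timeout expires. It is not the benchmark's actual checker (see chaos/check_status.py), and treating pod readiness as the success signal is an assumption.

import json
import subprocess
import time


def pod_ready(pod: dict) -> bool:
    """A pod counts as healthy when its Ready condition is True."""
    conditions = pod.get("status", {}).get("conditions", [])
    return any(c.get("type") == "Ready" and c.get("status") == "True" for c in conditions)


def system_recovered(namespace: str, timeout: int = 300, interval: int = 10) -> bool:
    """Poll the cluster until every pod in the namespace is Ready, or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        out = subprocess.run(
            ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
            capture_output=True, text=True, check=True,
        ).stdout
        pods = json.loads(out)["items"]
        if pods and all(pod_ready(p) for p in pods):
            return True
        time.sleep(interval)
    return False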


🌐 Comprehensive Scalability

Built on these foundations, the benchmark is designed to be:

  • Method-scalable – supports diverse LLM-based remediation methods.
  • LLM-scalable – allows plug-and-play replacement of remediation models.
  • Failure-scalable – accommodates various failure scenarios.
  • System-scalable – easily extendable to new microservice systems with minimal configuration effort.

💜 Statistics

MicroRemed currently includes seven representative failure types, three real-world microservice systems, and two reference methodologies.


🧩 Failure Types

No. Category Failure Types
1 Resource-Level CPU Saturation
2 Memory Saturation
3 IO Saturation
4 Network-Level Network Loss
5 Network Delay
6 Application-Level Pod Failure
7 Configuration Error

🏗️ Microservice Systems

MicroRemed integrates three microservice systems:

  • Train-Ticket [Zhou et al., 2018]: a comprehensive benchmark simulating a railway ticketing platform.
  • Online-Boutique [Google, 2025]: an e-commerce microservice application widely used in reliability and observability studies.
  • Simple-Micro: a self-developed lightweight system designed for controlled experimentation and fine-grained analysis.

⚖️ Difficulty Levels

Although MicroRemed supports arbitrary combinations of injected failures, we define three standardized difficulty levels to ensure fair and structured comparison across remediation methods:

| Level | #Cases | Included Failure Types |
| --- | --- | --- |
| 🟢 Easy | 23 | CPU Saturation, Memory Saturation, Network Loss, Network Delay |
| 🟡 Medium | 49 | CPU Saturation, Memory Saturation, IO Saturation, Network Loss, Network Delay, Pod Failure |
| 🔴 Hard | 80 | CPU Saturation, Memory Saturation, IO Saturation, Network Loss, Network Delay, Pod Failure, Configuration Error |

🧩 Reference Methodologies

To facilitate fair evaluation and comparison across different LLMs, we introduce two reference methodologies: SoloGen and ThinkRemed.

⚑ SoloGen

SoloGen represents a straightforward one-shot generation baseline. It replaces the candidate remediation LLM in the MicroRemed pipeline with a pre-trained large language model that receives all relevant contextual information in a single prompt and directly outputs the final Ansible playbook. This approach eliminates intermediate reasoning or iterative refinement, serving as a minimal yet effective baseline for debugging and evaluating the benchmark setup.
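
A minimal sketch of this one-shot pattern is shown below. It is not the code in methods/SoloGen/generator.py; the prompt wording and the use of an OpenAI-compatible client are assumptions.

from openai import OpenAI  # any OpenAI-compatible endpoint can serve qwen-plus-style backbones

client = OpenAI()  # API key and base URL are taken from the environment


def one_shot_remediate(context: str, report: str, model: str = "qwen-plus") -> str:
    """Single prompt in, Ansible playbook out: no probing, no iteration."""
    prompt = (
        "You are a site reliability engineer for a microservice system.\n\n"
        f"Cluster context:\n{context}\n\n"
        f"Diagnosis report:\n{report}\n\n"
        "Return only an executable Ansible playbook (YAML) that remediates the failure."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content  # candidate playbook handed to the execution step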

🤖 ThinkRemed

While SoloGen performs direct generation without adaptive reasoning, it often struggles with complex multi-service dependencies and incomplete contextual information. To address these limitations, we propose ThinkRemed, a multi-agent framework designed to emulate the SRE-like remediation process in microservice systems.

ThinkRemed

As illustrated in the figure above, ThinkRemed comprises four cooperative agents (Coordinator, Probe, Execution, and Verification) that operate within an iterative reasoning–action–reflection loop.

  1. The Coordinator first receives the auxiliary context C_0 and failure report R_0, and adaptively determines whether to invoke the Probe Agent to collect additional runtime information from the system.
  2. It then synthesizes a candidate Ansible playbook p_t, which is executed by the Execution Agent to remediate the faulty microservice system.
  3. The Verification Agent subsequently assesses the remediation result, producing a binary outcome v_t ∈ {0,1} indicating success or failure.
  4. If remediation fails, the system enters a reflection phase, and control returns to the Coordinator for iterative refinement based on feedback.

To ensure timely remediation and accommodate LLM context limitations, the iteration loop is bounded by a maximum trial budget T_max.
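
In pseudocode, the loop looks roughly like the sketch below; the agent objects and their methods (plan, collect, generate_playbook, run, verify) are placeholders for illustration, not the classes in methods/ThinkRemed/.

def thinkremed_loop(context, report, coordinator, probe, executor, verifier, t_max: int = 5) -> bool:
    """Reasoning-action-reflection loop bounded by the trial budget T_max."""
    feedback = None
    for _ in range(t_max):
        # The Coordinator decides whether extra runtime information is needed.
        plan = coordinator.plan(context, report, feedback)
        if plan.needs_probe:
            context = probe.collect(plan.probe_request)  # augment C_0 with live observations
        # Synthesize a candidate playbook p_t and execute it against the faulty system.
        playbook = coordinator.generate_playbook(context, report, feedback)
        result = executor.run(playbook)
        # The Verification agent returns v_t in {0, 1}.
        if verifier.verify(result) == 1:
            return True  # system recovered
        feedback = result  # reflection: carry the failure context into the next trial
    return False  # budget T_max exhausted without recovery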


📦 Repository Structure

The MicroRemed benchmark repository is organized as follows:

  • 📁 chaos/ – Contains all configurations and scripts for failure injection and recovery, including both Chaos Mesh–based faults and configuration-driven dynamic modifications.
  • 🧩 envs/ – Hosts the built-in microservice systems (Train-Ticket, Online-Boutique, and Simple-Micro). Additional systems can be easily added to this directory.
  • 📊 experiments/ – Provides a collection of predefined failure sequences used for fair and reproducible evaluations.
  • 🧠 methods/ – Includes implementations of different microservice remediation methods. Currently, it contains two representative approaches: SoloGen and ThinkRemed.
  • 🤖 models/ – Defines the base LLM invocation logic for the various remediation agents.
  • ⚙️ scripts/ – Contains essential scripts for installation, startup, and shutdown of benchmark components.

🗂️ Directory Tree

├── README.md
├── chaos
│   ├── __init__.py
│   ├── check_status.py
│   ├── deployment.py
│   ├── failures.py
│   ├── injection.py
│   └── templates
│       ├── cpu-stress.yaml
│       ├── disk-io.yaml
│       ├── memory-stress.yaml
│       ├── network-delay.yaml
│       ├── network-loss.yaml
│       └── pod-fail.yaml
├── config.py
├── envs
│   ├── __init__.py
│   ├── env.py
│   ├── online-boutique
│   │   ├── deploy.sh
│   │   └── kubernetes-manifests.yaml
│   ├── simple-micro
│   │   ├── deploy.sh
│   │   └── k8s-deploy.yaml
│   ├── source-config
│   │   ├── online-boutique-config.yaml
│   │   ├── simple-micro-config.yaml
│   │   └── train-ticket-config.yaml
│   └── train-ticket
│       ├── CHANGELOG-1.0.md
│       ├── LICENSE
│       ├── Makefile
│       ├── README.md
│       ├── build_upload_image.py
│       ├── deploy.sh
│       ├── deployment
│       ├── docker-compose.yml
│       ├── hack
│       └── pom.xml
├── experiments
│   ├── cpu-stress.txt
│   ├── disk-io.txt
│   ├── easy.txt
│   ├── hard.txt
│   ├── medium.txt
│   ├── memory-stress.txt
│   ├── network-delay.txt
│   ├── network-loss.txt
│   ├── pod-config-error.txt
│   └── pod-fail.txt
├── inject_and_remediate.py
├── inventory.ini
├── logs
├── methods
│   ├── ThinkRemed
│   │   ├── __init__.py
│   │   ├── coordinator.py
│   │   ├── probe_agent.py
│   │   ├── tools.py
│   │   └── verification_agent.py
│   ├── SoloGen
│   │   ├── __init__.py
│   │   ├── generator.py
│   │   └── tools.py
│   ├── __init__.py
│   ├── execution_agent.py
│   └── remediate.py
├── models
│   ├── __init__.py
│   └── llm.py
├── requirement.txt
├── scripts
│   ├── install.sh
│   ├── install_ansible.sh
│   ├── install_chaos.sh
│   ├── install_k3s.sh
│   ├── run.sh
│   ├── run_all.sh
│   ├── start_chaos.sh
│   ├── stop.sh
│   └── stop_chaos.sh

💽 How to Use

This section describes how to set up and run MicroRemed for evaluating LLM-based microservice remediation methods.

1. Setup

To prepare the environment, execute the installation script and install Python dependencies:

cd MicroRemed
./scripts/install.sh
pip install -r requirement.txt

This will automatically install and configure:

  • k3s (lightweight Kubernetes)
  • Chaos Mesh (for controlled failure injection)
  • Ansible (for automated remediation)
  • Python packages required for running the benchmark

2. Usage

MicroRemed provides two usage modes: quick-start and custom.

2.1 Quick-start

If no additional configuration is required, you can directly run:

./scripts/run_all.sh <env_name> <log_dir> <model>

Example:

./scripts/run_all.sh train-ticket logs/qwen-plus qwen-plus

  • <env_name>: Target microservice environment (e.g., train-ticket)
  • <log_dir>: Directory to store experiment logs
  • <model>: Backbone LLM to use for remediation (e.g., qwen-plus)

This will automatically:

  1. Launch the specified environment
  2. Inject failures as defined in the benchmark
  3. Use the specified LLM to generate remediation playbooks
  4. Save logs under the specified directory

2.2 Custom Mode

For fine-grained control over experiments, use inject_and_remediate.py:

python3 inject_and_remediate.py \
  --experiments 50 \
  --namespace train-ticket \
  --wait-interval 10 \
  --injection-timeout 60 \
  --env train-ticket \
  --save-path conversations \
  --manifest-path envs/source-config/train-ticket-config.yaml \
  --remediate-method SoloGen \
  --experiment-path experiments/easy.txt \
  --model qwen-plus

🔧 Key Arguments:

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| --experiments | int | 100 | Number of experiments to run. |
| --namespace | str | default | Kubernetes namespace to target. |
| --wait-interval | int | 10 | Interval (seconds) between system status checks. |
| --injection-timeout | int | 30 | Timeout (seconds) for failure injection. |
| --env | str | train-ticket | Target environment identifier. |
| --save-path | str | conversations | Directory to store conversation logs. |
| --manifest-path | str | N/A | Path to the original service configuration used for restoration. |
| --remediate-method | str | ThinkRemed | Remediation method to use (ThinkRemed / SoloGen). |
| --experiment-path | str | None | Path to the experiment configuration file. |
| --enable-strict-restart | bool | False | Attempt a restart on every injection timeout. |
| --model | str | N/A | Backbone LLM applied for remediation. |
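
For example, a sweep over the three standardized difficulty levels could be scripted as below. This is a sketch that reuses the documented arguments; whether per-level --save-path directories are created automatically is an assumption.

import subprocess

# Hypothetical sweep over the three difficulty levels using the CLI documented above.
for level in ["easy", "medium", "hard"]:
    subprocess.run(
        [
            "python3", "inject_and_remediate.py",
            "--namespace", "train-ticket",
            "--env", "train-ticket",
            "--manifest-path", "envs/source-config/train-ticket-config.yaml",
            "--remediate-method", "ThinkRemed",
            "--experiment-path", f"experiments/{level}.txt",
            "--save-path", f"conversations/{level}",
            "--model", "qwen-plus",
        ],
        check=True,
    )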

Citation

Coming soon.
