Note
This repository, MicroRemed-CS, is designed for non–bare-metal environments (e.g., DSW, DLC) where users deploying large models often lack the permissions required to create containers (e.g., via Docker or containerd) for running microservices.
If your machine does not have such restrictions, it is recommended to use the main repository, MicroRemed, which provides the full benchmark design, detailed configurations, and end-to-end workflows.
The MicroRemed-CS benchmark repository adopts a client–server architecture.
- The Server side provides core system-level services, including microservice deployment, runtime operations, failure injection, and failure recovery.
- The Client side is responsible for interacting with the LLM, performing reasoning and decision-making, and invoking the Server’s APIs to retrieve system information or perform remote actions.
This design enables MicroRemed-CS to support distributed deployment, where the LLM can operate on a different node or cluster from the microservice system.
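As a rough illustration of this split, the Client Machine only ever talks to the Server Machine over HTTP, and everything cluster-facing stays behind the Server's API. The endpoint name in the sketch below is hypothetical and only shows the request pattern; the actual routes are defined in `MicroRemed-Server/server.py` and wrapped by `MicroRemed-Client/remote/client.py`.

```bash
# Hypothetical sketch: the Client Machine queries the remote Server over HTTP.
# "/status" is an invented route for illustration only; see
# MicroRemed-Server/server.py for the endpoints actually exposed.
SERVER_URL="http://<server-machine-ip>:5000"
curl -s "$SERVER_URL/status"
```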
## 📂 Repository Structure

```
├── MicroRemed-Client
│ ├── config.py
│ ├── envs
│ │ ├── __init__.py
│ │ └── env.py
│ ├── experiments
│ │ ├── cpu-stress.txt
│ │ ├── disk-io.txt
│ │ ├── easy.txt
│ │ ├── hard.txt
│ │ ├── medium.txt
│ │ ├── memory-stress.txt
│ │ ├── network-delay.txt
│ │ ├── network-loss.txt
│ │ ├── pod-config-error.txt
│ │ └── pod-fail.txt
│ ├── inject_and_remediate.py
│ ├── logs
│ ├── methods
│ │ ├── SoloGen
│ │ ├── ThinkRemed
│ │ ├── __init__.py
│ │ ├── execution_agent.py
│ │ └── remediate.py
│ ├── models
│ │ ├── __init__.py
│ │ └── llm.py
│ ├── remote
│ │ ├── __init__.py
│ │ └── client.py
│ └── scripts
│ ├── run.sh
│ ├── run_all.sh
│ └── stop.sh
├── MicroRemed-Server
│ ├── README.md
│ ├── chaos
│ │ ├── __init__.py
│ │ ├── check_status.py
│ │ ├── deployment.py
│ │ ├── failures.py
│ │ ├── injection.py
│ │ └── templates
│ ├── envs
│ │ ├── __init__.py
│ │ ├── env.py
│ │ ├── online-boutique
│ │ ├── simple-micro
│ │ ├── source-config
│ │ └── train-ticket
│ ├── inventory.ini
│ ├── logs
│ ├── operators
│ │ ├── __init__.py
│ │ ├── execution_agent.py
│ │ └── probe_agent.py
│ ├── requirement.txt
│ ├── scripts
│ │ ├── install.sh
│ │ ├── install_ansible.sh
│ │ ├── install_chaos.sh
│ │ ├── install_k3s.sh
│ │ ├── run_server.sh
│ │ ├── start_chaos.sh
│ │ ├── stop_chaos.sh
│ │ └── stop_server.sh
│ └── server.py
```
## 💽 How to Use
This section describes how to set up and run MicroRemed-CS for evaluating LLM-based microservice remediation methods.
### 1. Setup
- Deploy **MicroRemed-Client** on the machine that has access to the large language model (hereafter referred to as the *Client Machine*).
- Deploy **MicroRemed-Server** on the machine that can control and manage the microservice system (hereafter referred to as the *Server Machine*).
---
#### On the Client Machine
Install Python dependencies:
```bash
cd MicroRemed-Client
pip install -r requirements.txt
```
---
#### On the Server Machine
Run the installation script and install the Python dependencies:
```bash
cd MicroRemed-Server
./scripts/install.sh
pip install -r requirement.txt
```
This will automatically install and configure:
- k3s (lightweight Kubernetes)
- Chaos Mesh (for controlled failure injection)
- Ansible (for automated remediation)
- Python packages required for running the server
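To sanity-check the installation before starting the server, the commands below confirm that k3s, Chaos Mesh, and Ansible are available. This is a sketch: it assumes Chaos Mesh was installed into its default `chaos-mesh` namespace.

```bash
# Verify k3s is up and the node is Ready
kubectl get nodes

# Verify Chaos Mesh components are running (assumes the default chaos-mesh namespace)
kubectl get pods -n chaos-mesh

# Verify Ansible is on the PATH
ansible --version
```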
### 2. Launch the Server
To launch the MicroRemed server on the Server Machine:
```bash
cd MicroRemed-Server
python server.py
```
By default, the server will start at: 👉 http://127.0.0.1:5000
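Once the server is up, a quick connectivity check from the Client Machine might look like the sketch below; the address is the one you will later pass as `<server_url>`, and any HTTP status code simply confirms the port is answering (the exact response depends on the routes in `server.py`).

```bash
# From the Client Machine: confirm the server port answers at all.
# Replace 8.130.181.45 with the address of your Server Machine.
curl -s -o /dev/null -w "HTTP %{http_code}\n" http://8.130.181.45:5000/
```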
### 3. Run Experiments
MicroRemed-Client provides two usage modes: quick-start and custom.
#### Quick Start
If no additional configuration is required, you can directly run:
```bash
./scripts/run_all.sh <env_name> <log_dir> <model> <server_url>
```
Example:
```bash
./scripts/run_all.sh train-ticket logs/qwen-plus qwen-plus http://8.130.181.45:5000
```
- `<env_name>`: Target microservice environment (e.g., `train-ticket`)
- `<log_dir>`: Directory to store experiment logs (e.g., `logs/qwen-plus`)
- `<model>`: Backbone LLM to use for remediation (e.g., `qwen-plus`)
- `<server_url>`: URL of the remote Server Machine (e.g., `http://8.130.181.45:5000`)
This will automatically:
- Launch the specified environment
- Inject failures as defined in the benchmark
- Use the specified LLM to generate remediation playbooks
- Save logs under the specified directory
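For instance, to sweep the three environments bundled under `MicroRemed-Server/envs` with the same backbone model, a simple loop over `run_all.sh` could look like the following. This is a sketch: it assumes each environment can be deployed on the Server Machine and reuses the example model and server address from above.

```bash
# Sketch: run the quick-start pipeline for each bundled environment.
# Assumes the Server Machine at 8.130.181.45 can deploy all three environments.
for env in train-ticket online-boutique simple-micro; do
    ./scripts/run_all.sh "$env" "logs/qwen-plus/$env" qwen-plus http://8.130.181.45:5000
done
```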
#### Custom Mode
For fine-grained control over experiments, use `inject_and_remediate.py`:
```bash
python3 inject_and_remediate.py \
    --experiments 50 \
    --namespace train-ticket \
    --wait-interval 10 \
    --injection-timeout 60 \
    --env train-ticket \
    --save-path conversations \
    --manifest-path envs/source-config/train-ticket-config.yaml \
    --remediate-method SoloGen \
    --experiment-path experiments/easy.txt \
    --model qwen-plus \
    --server-url http://8.130.181.45:5000
```
🔧 Key Arguments:
| Argument | Type | Default | Description |
|---|---|---|---|
| `--experiments` | int | 100 | Number of experiments to run. |
| `--namespace` | str | default | Kubernetes namespace to target. |
| `--wait-interval` | int | 10 | Interval (seconds) between system status checks. |
| `--injection-timeout` | int | 30 | Timeout (seconds) for failure injection. |
| `--env` | str | train-ticket | Target environment identifier. |
| `--save-path` | str | conversations | Directory to store conversation logs. |
| `--manifest-path` | str | N/A | Path to the original service configuration for restoration. |
| `--remediate-method` | str | ThinkRemed | Remediation method to use (ThinkRemed / SoloGen). |
| `--experiment-path` | str | None | Path to the experiment configuration file. |
| `--enable-strict-restart` | bool | False | Attempt a restart on every injection timeout. |
| `--model` | str | N/A | Backbone LLM used for remediation. |
| `--server-url` | str | N/A | URL of the remote server. |
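As another hedged example, the invocation below switches to the default `ThinkRemed` method and the `experiments/hard.txt` set while relying on the defaults listed above for the remaining arguments; the model name and server address are illustrative and should be adapted to your deployment.

```bash
# Sketch: same pipeline, but with ThinkRemed and the "hard" experiment set.
python3 inject_and_remediate.py \
    --experiments 50 \
    --namespace train-ticket \
    --env train-ticket \
    --manifest-path envs/source-config/train-ticket-config.yaml \
    --remediate-method ThinkRemed \
    --experiment-path experiments/hard.txt \
    --model qwen-plus \
    --server-url http://8.130.181.45:5000
```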
Coming