Note
This repository, MicroRemed-CS, is designed for non–bare-metal environments (e.g., DSW, DLC) where users deploying large models often lack the permissions required to create containers (e.g., via Docker or containerd) for running microservices.
If your machine does not have such restrictions, it is recommended to use the main repository, MicroRemed, which provides the full benchmark design, detailed configurations, and end-to-end workflows.
The MicroRemed-CS benchmark repository adopts a client–server architecture.
- The Server side provides core system-level services, including microservice deployment, runtime operations, failure injection, and failure recovery.
- The Client side is responsible for interacting with the LLM, performing reasoning and decision-making, and invoking the Server’s APIs to retrieve system information or perform remote actions.
This design enables MicroRemed-CS to support distributed deployment, where the LLM can operate on a different node or cluster from the microservice system.
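As a rough illustration of this split, the Client Machine only ever talks to the Server Machine over HTTP, and everything cluster-facing stays behind the Server's API. The endpoint name in the sketch below is hypothetical and only shows the request pattern; the actual routes are defined in `MicroRemed-Server/server.py` and wrapped by `MicroRemed-Client/remote/client.py`.

```bash
# Hypothetical sketch: the Client Machine queries the remote Server over HTTP.
# "/status" is an invented route for illustration only; see
# MicroRemed-Server/server.py for the endpoints actually exposed.
SERVER_URL="http://<server-machine-ip>:5000"
curl -s "$SERVER_URL/status"
```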
## 📂 Repository Structure

```
├── MicroRemed-Client
│ ├── config.py
│ ├── envs
│ │ ├── __init__.py
│ │ └── env.py
│ ├── experiments
│ │ ├── cpu-stress.txt
│ │ ├── disk-io.txt
│ │ ├── easy.txt
│ │ ├── hard.txt
│ │ ├── medium.txt
│ │ ├── memory-stress.txt
│ │ ├── network-delay.txt
│ │ ├── network-loss.txt
│ │ ├── pod-config-error.txt
│ │ └── pod-fail.txt
│ ├── inject_and_remediate.py
│ ├── logs
│ ├── methods
│ │ ├── SoloGen
│ │ ├── ThinkRemed
│ │ ├── __init__.py
│ │ ├── execution_agent.py
│ │ └── remediate.py
│ ├── models
│ │ ├── __init__.py
│ │ └── llm.py
│ ├── remote
│ │ ├── __init__.py
│ │ └── client.py
│ └── scripts
│ ├── run.sh
│ ├── run_all.sh
│ └── stop.sh
├── MicroRemed-Server
│ ├── README.md
│ ├── chaos
│ │ ├── __init__.py
│ │ ├── check_status.py
│ │ ├── deployment.py
│ │ ├── failures.py
│ │ ├── injection.py
│ │ └── templates
│ ├── envs
│ │ ├── __init__.py
│ │ ├── env.py
│ │ ├── online-boutique
│ │ ├── simple-micro
│ │ ├── source-config
│ │ └── train-ticket
│ ├── inventory.ini
│ ├── logs
│ ├── operators
│ │ ├── __init__.py
│ │ ├── execution_agent.py
│ │ └── probe_agent.py
│ ├── requirement.txt
│ ├── scripts
│ │ ├── install.sh
│ │ ├── install_ansible.sh
│ │ ├── install_chaos.sh
│ │ ├── install_k3s.sh
│ │ ├── run_server.sh
│ │ ├── start_chaos.sh
│ │ ├── stop_chaos.sh
│ │ └── stop_server.sh
│ └── server.py
```
## 💽 How to Use
This section describes how to set up and run MicroRemed-CS for evaluating LLM-based microservice remediation methods.
### 1. Setup
- Deploy **MicroRemed-Client** on the machine that has access to the large language model (hereafter referred to as the *Client Machine*).
- Deploy **MicroRemed-Server** on the machine that can control and manage the microservice system (hereafter referred to as the *Server Machine*).
---
#### On the Client Machine
Install Python dependencies:
```bash
cd MicroRemed-Client
pip install -r requirements.txt
```
---
#### On the Server Machine
Run the installation script and install the Python dependencies:
```bash
cd MicroRemed-Server
./scripts/install.sh
pip install -r requirement.txt
```
This will automatically install and configure:
- k3s (lightweight Kubernetes)
- Chaos Mesh (for controlled failure injection)
- Ansible (for automated remediation)
- Python packages required for running the server
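To sanity-check the installation before starting the server, the commands below confirm that k3s, Chaos Mesh, and Ansible are available. This is a sketch: it assumes Chaos Mesh was installed into its default `chaos-mesh` namespace.

```bash
# Verify k3s is up and the node is Ready
kubectl get nodes

# Verify Chaos Mesh components are running (assumes the default chaos-mesh namespace)
kubectl get pods -n chaos-mesh

# Verify Ansible is on the PATH
ansible --version
```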
### 2. Launch the Server
To launch the MicroRemed server on the Server Machine:
```bash
cd MicroRemed-Server
python server.py
```
By default, the server will start at: 👉 http://127.0.0.1:5000
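Once the server is up, a quick connectivity check from the Client Machine might look like the sketch below; the address is the one you will later pass as `<server_url>`, and any HTTP status code simply confirms the port is answering (the exact response depends on the routes in `server.py`).

```bash
# From the Client Machine: confirm the server port answers at all.
# Replace 8.130.181.45 with the address of your Server Machine.
curl -s -o /dev/null -w "HTTP %{http_code}\n" http://8.130.181.45:5000/
```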
### 3. Run Experiments
MicroRemed-Client provides two usage modes: quick-start and custom.
#### Quick Start
If no additional configuration is required, you can directly run:
```bash
./scripts/run_all.sh <env_name> <log_dir> <model> <server_url>
```
Example:
```bash
./scripts/run_all.sh train-ticket logs/qwen-plus qwen-plus http://8.130.181.45:5000
```
- `<env_name>`: Target microservice environment (e.g., `train-ticket`)
- `<log_dir>`: Directory to store experiment logs (e.g., `logs/qwen-plus`)
- `<model>`: Backbone LLM to use for remediation (e.g., `qwen-plus`)
- `<server_url>`: URL of the remote Server Machine (e.g., `http://8.130.181.45:5000`)
This will automatically:
- Launch the specified environment
- Inject failures as defined in the benchmark
- Use the specified LLM to generate remediation playbooks
- Save logs under the specified directory
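For instance, to sweep the three environments bundled under `MicroRemed-Server/envs` with the same backbone model, a simple loop over `run_all.sh` could look like the following. This is a sketch: it assumes each environment can be deployed on the Server Machine and reuses the example model and server address from above.

```bash
# Sketch: run the quick-start pipeline for each bundled environment.
# Assumes the Server Machine at 8.130.181.45 can deploy all three environments.
for env in train-ticket online-boutique simple-micro; do
    ./scripts/run_all.sh "$env" "logs/qwen-plus/$env" qwen-plus http://8.130.181.45:5000
done
```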
#### Custom Mode
For fine-grained control over experiments, use `inject_and_remediate.py`:
```bash
python3 inject_and_remediate.py \
    --experiments 50 \
    --namespace train-ticket \
    --wait-interval 10 \
    --injection-timeout 60 \
    --env train-ticket \
    --save-path conversations \
    --manifest-path envs/source-config/train-ticket-config.yaml \
    --remediate-method SoloGen \
    --experiment-path experiments/easy.txt \
    --model qwen-plus \
    --server-url http://8.130.181.45:5000
```
🔧 Key Arguments:
| Argument | Type | Default | Description |
|---|---|---|---|
| `--experiments` | int | 100 | Number of experiments to run. |
| `--namespace` | str | default | Kubernetes namespace to target. |
| `--wait-interval` | int | 10 | Interval (seconds) between system status checks. |
| `--injection-timeout` | int | 30 | Timeout (seconds) for failure injection. |
| `--env` | str | train-ticket | Target environment identifier. |
| `--save-path` | str | conversations | Directory to store conversation logs. |
| `--manifest-path` | str | N/A | Path to the original service configuration for restoration. |
| `--remediate-method` | str | ThinkRemed | Remediation method to use (ThinkRemed / SoloGen). |
| `--experiment-path` | str | None | Path to the experiment configuration file. |
| `--enable-strict-restart` | bool | False | Attempt a restart on every injection timeout. |
| `--model` | str | N/A | Backbone LLM used for remediation. |
| `--server-url` | str | N/A | URL of the remote server. |
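As another hedged example, the invocation below switches to the default `ThinkRemed` method and the `experiments/hard.txt` set while relying on the defaults listed above for the remaining arguments; the model name and server address are illustrative and should be adapted to your deployment.

```bash
# Sketch: same pipeline, but with ThinkRemed and the "hard" experiment set.
python3 inject_and_remediate.py \
    --experiments 50 \
    --namespace train-ticket \
    --env train-ticket \
    --manifest-path envs/source-config/train-ticket-config.yaml \
    --remediate-method ThinkRemed \
    --experiment-path experiments/hard.txt \
    --model qwen-plus \
    --server-url http://8.130.181.45:5000
```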
Coming