LLM4AIOps/MicroRemed-CS

MicroRemed: Benchmarking LLMs in Microservices Remediation (MicroRemed-CS)

Paper

Note

This repository, MicroRemed-CS, targets non-bare-metal servers (e.g., DSW, DLC) on which large models are deployed but users often lack the permissions to create containers (e.g., with Docker or containerd) for running microservices.

If your machine does not have such restrictions, it is recommended to use the main repository, MicroRemed, which provides the full benchmark design, detailed configurations, and end-to-end workflows.

📦 Repository Structure

The MicroRemed-CS benchmark repository adopts a client–server architecture.

  • The Server side provides core system-level services, including microservice deployment, runtime operations, failure injection, and failure recovery.
  • The Client side is responsible for interacting with the LLM, performing reasoning and decision-making, and invoking the Server’s APIs to retrieve system information or perform remote actions.

This design enables MicroRemed-CS to support distributed deployment, where the LLM can operate on a different node or cluster from the microservice system.
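
As a rough illustration of this split, the sketch below shows how a client-side helper might prepare an HTTP request for the Server. The endpoint path `/status`, the JSON payload shape, and the helper name `build_request` are hypothetical; they are not taken from `remote/client.py`, whose actual interface may differ.

```python
import json
import urllib.request


def build_request(server_url: str, endpoint: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) a JSON POST request to the remote server.

    NOTE: endpoint names and payload fields are assumptions made for
    illustration; consult remote/client.py for the real interface.
    """
    # Join base URL and endpoint without doubling or dropping the slash.
    url = server_url.rstrip("/") + "/" + endpoint.lstrip("/")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical call: ask the server about one namespace.
req = build_request("http://127.0.0.1:5000", "/status", {"namespace": "train-ticket"})
print(req.get_method(), req.full_url)
```

Sending the prepared request would be a matter of passing it to `urllib.request.urlopen(req, timeout=10)` once the real endpoint is known.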


🗂️ Directory Tree

├── MicroRemed-Client
│   ├── config.py
│   ├── envs
│   │   ├── __init__.py
│   │   └── env.py
│   ├── experiments
│   │   ├── cpu-stress.txt
│   │   ├── disk-io.txt
│   │   ├── easy.txt
│   │   ├── hard.txt
│   │   ├── medium.txt
│   │   ├── memory-stress.txt
│   │   ├── network-delay.txt
│   │   ├── network-loss.txt
│   │   ├── pod-config-error.txt
│   │   └── pod-fail.txt
│   ├── inject_and_remediate.py
│   ├── logs
│   ├── methods
│   │   ├── SoloGen
│   │   ├── ThinkRemed
│   │   ├── __init__.py
│   │   ├── execution_agent.py
│   │   └── remediate.py
│   ├── models
│   │   ├── __init__.py
│   │   └── llm.py
│   ├── remote
│   │   ├── __init__.py
│   │   └── client.py
│   └── scripts
│       ├── run.sh
│       ├── run_all.sh
│       └── stop.sh
├── MicroRemed-Server
│   ├── README.md
│   ├── chaos
│   │   ├── __init__.py
│   │   ├── check_status.py
│   │   ├── deployment.py
│   │   ├── failures.py
│   │   ├── injection.py
│   │   └── templates
│   ├── envs
│   │   ├── __init__.py
│   │   ├── env.py
│   │   ├── online-boutique
│   │   ├── simple-micro
│   │   ├── source-config
│   │   └── train-ticket
│   ├── inventory.ini
│   ├── logs
│   ├── operators
│   │   ├── __init__.py
│   │   ├── execution_agent.py
│   │   └── probe_agent.py
│   ├── requirement.txt
│   ├── scripts
│   │   ├── install.sh
│   │   ├── install_ansible.sh
│   │   ├── install_chaos.sh
│   │   ├── install_k3s.sh
│   │   ├── run_server.sh
│   │   ├── start_chaos.sh
│   │   ├── stop_chaos.sh
│   │   └── stop_server.sh
│   └── server.py

## 💽 How to Use

This section describes how to set up and run MicroRemed-CS for evaluating LLM-based microservice remediation methods.

### 1. Setup  

- Deploy **MicroRemed-Client** on the machine that has access to the large language model (hereafter referred to as the *Client Machine*).  
- Deploy **MicroRemed-Server** on the machine that can control and manage the microservice system (hereafter referred to as the *Server Machine*).  

---

#### On the Client Machine

Install Python dependencies:  

```bash
cd MicroRemed-Client
pip install -r requirements.txt
```

#### On the Server Machine

Run the installation script and install Python dependencies:

```bash
cd MicroRemed-Server
./scripts/install.sh
pip install -r requirement.txt
```

This will automatically install and configure:

  • k3s (lightweight Kubernetes)
  • Chaos Mesh (for controlled failure injection)
  • Ansible (for automated remediation)
  • Python packages required for running the server

#### Start the Server

To launch the MicroRemed server on the Server Machine:

```bash
cd MicroRemed-Server
python server.py
```

By default, the server will start at: 👉 http://127.0.0.1:5000
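
Before starting the client, it can help to confirm that the server port is reachable from the Client Machine. The default address above is a loopback address, so from a separate Client Machine you would use the Server Machine's routable address instead. The sketch below is a generic TCP check and not part of the repository; the helper names `split_host_port` and `is_reachable` are illustrative.

```python
import socket
from urllib.parse import urlparse


def split_host_port(server_url: str) -> tuple[str, int]:
    """Extract host and port from a URL such as http://127.0.0.1:5000."""
    parsed = urlparse(server_url)
    if parsed.hostname is None:
        raise ValueError(f"URL needs a scheme and host: {server_url!r}")
    # Fall back to the scheme's default port when none is given.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    return parsed.hostname, port


def is_reachable(server_url: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the server can be opened."""
    host, port = split_host_port(server_url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    url = "http://127.0.0.1:5000"  # default address from this README
    print(url, "reachable" if is_reachable(url) else "not reachable")
```

A successful TCP connection only shows that something is listening on the port; it does not verify that the MicroRemed server itself is healthy.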

### 2. Client Usage

MicroRemed-Client provides two usage modes: quick-start and custom.

#### 2.1 Quick-start

If no additional configuration is required, you can directly run:

```bash
./scripts/run_all.sh <env_name> <log_dir> <model> <server_url>
```

Example:

```bash
./scripts/run_all.sh train-ticket logs/qwen-plus qwen-plus http://8.130.181.45:5000
```

- `<env_name>`: Target microservice environment (e.g., `train-ticket`)
- `<log_dir>`: Directory to store experiment logs (e.g., `logs/qwen-plus`)
- `<model>`: Backbone LLM to use for remediation (e.g., `qwen-plus`)
- `<server_url>`: URL of the remote Server Machine, including scheme and port (e.g., `http://8.130.181.45:5000`)

This will automatically:

  1. Launch the specified environment
  2. Inject failures as defined in the benchmark
  3. Use the specified LLM to generate a remediation playbook
  4. Save logs under the specified directory

#### 2.2 Custom Mode

For fine-grained control over experiments, use `inject_and_remediate.py`:

```bash
python3 inject_and_remediate.py \
  --experiments 50 \
  --namespace train-ticket \
  --wait-interval 10 \
  --injection-timeout 60 \
  --env train-ticket \
  --save-path conversations \
  --manifest-path envs/source-config/train-ticket-config.yaml \
  --remediate-method SoloGen \
  --experiment-path experiments/easy.txt \
  --model qwen-plus \
  --server-url http://8.130.181.45:5000
```

🔧 Key Arguments:

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| `--experiments` | int | 100 | Number of experiments to run. |
| `--namespace` | str | default | Kubernetes namespace to target. |
| `--wait-interval` | int | 10 | Interval (seconds) between system status checks. |
| `--injection-timeout` | int | 30 | Timeout (seconds) for failure injection. |
| `--env` | str | train-ticket | Target environment identifier. |
| `--save-path` | str | conversations | Directory to store conversation logs. |
| `--manifest-path` | str | N/A | Path to the original service configuration for restoration. |
| `--remediate-method` | str | ThinkRemed | Remediation method to use (ThinkRemed / SoloGen). |
| `--experiment-path` | str | None | Path to the experiment configuration file. |
| `--enable-strict-restart` | bool | False | Attempt a restart on every injection timeout. |
| `--model` | str | N/A | Backbone LLM applied for remediation. |
| `--server-url` | str | N/A | URL of the remote server. |
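
To run the custom mode over several experiment files and both remediation methods, the invocation can be scripted. The sketch below only assembles command lines from the flags documented above; the helper `build_command` and the particular sweep are illustrative and not part of the repository, and the namespace, manifest path, model, and server URL are copied from the example in this README.

```python
import shlex

# Values copied from the example invocation in this README.
SERVER_URL = "http://8.130.181.45:5000"
MANIFEST = "envs/source-config/train-ticket-config.yaml"


def build_command(experiment_file: str, method: str, model: str = "qwen-plus") -> list[str]:
    """Assemble an inject_and_remediate.py invocation from documented flags."""
    return [
        "python3", "inject_and_remediate.py",
        "--env", "train-ticket",
        "--namespace", "train-ticket",
        "--manifest-path", MANIFEST,
        "--experiment-path", f"experiments/{experiment_file}",
        "--remediate-method", method,
        "--model", model,
        "--server-url", SERVER_URL,
    ]


# Sweep the three difficulty files with both documented methods.
commands = [
    build_command(exp, method)
    for exp in ("easy.txt", "medium.txt", "hard.txt")
    for method in ("SoloGen", "ThinkRemed")
]

for cmd in commands:
    # Print a copy-pasteable shell line; nothing is executed here.
    print(shlex.join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` from inside `MicroRemed-Client` to execute the runs one after another; flags that are omitted fall back to the defaults listed in the table.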

Citation

Coming soon.
