[ICSE 2025, Research Track] MIDLog: Weakly-supervised Log-based Anomaly Detection with Inexact Labels via Multi-instance Learning


🔥MIDLog: Weakly-supervised Log-based Anomaly Detection with Inexact Labels via Multi-instance Learning (ICSE 2025)


This repository contains the implementation of our ICSE 2025 (Research Track) paper: MIDLog: Weakly-supervised Log-based Anomaly Detection with Inexact Labels via Multi-instance Learning.

📌 Description

Log-based anomaly detection is essential for maintaining software availability. However, existing approaches rely heavily on fine-grained, exact labels of individual log entries, which are very hard to obtain in real-world systems. This creates a key problem: anomaly detection models require supervision signals, yet labeled log entries are unavailable. To address this problem, we propose a new labeling strategy called inexact labeling: instead of labeling a single log entry, system experts label a bag of log entries within a time span. Building on this, we propose MIDLog, a weakly-supervised log-based anomaly detection approach that works with inexact labels. We leverage the multi-instance learning paradigm to explicitly separate anomalous log entries from inexactly labeled anomalous log sets, thereby deducing exact anomalous log labels from inexactly labeled log sets. Extensive evaluation on three public datasets shows that our approach achieves an F1 score of over 85% with inexact labels.
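As a rough illustration of the multi-instance idea (this is not MIDLog's actual model, which uses an attention-based patch encoder; see src/models.py): under the standard MIL assumption, a bag is anomalous if at least one of its entries is, so a bag-level score can be derived from per-entry anomaly scores by top-k pooling:

```python
import numpy as np

def bag_score(instance_scores, top_k=1):
    """Aggregate per-entry anomaly scores into a bag-level score.

    Under the standard MIL assumption, a bag (a time span of log
    entries) is anomalous if at least one entry is anomalous, so the
    bag score is the mean of the top-k instance scores."""
    scores = np.sort(np.asarray(instance_scores, dtype=float))[::-1]
    return float(scores[:top_k].mean())

normal_bag = [0.10, 0.05, 0.20, 0.15]      # every entry looks normal
anomalous_bag = [0.10, 0.90, 0.05, 0.20]   # one clearly anomalous entry
print(bag_score(normal_bag))     # -> 0.2
print(bag_score(anomalous_bag))  # -> 0.9
```

A single high-scoring entry is enough to flag the whole bag, which mirrors how inexact bag labels still constrain instance-level predictions.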

🔍 Project Structure

```
├─checkpoint      # Saved models
├─data            # Log data
├─glove           # Pre-trained language models for log embedding
├─src
|  ├─dataset.py   # Dataset loading
|  ├─models.py    # Attention-based Patch Encoder and PatchMixer model definitions
|  └─utils.py     # Log embedding
├─main.py         # Entry point
└─process.py      # Data preprocessing
```
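The GloVe file used for log embedding is plain text with one word and its 300 coordinates per line. A minimal sketch of how such a file might be parsed into a dictionary (illustrative only; the actual embedding code lives in src/utils.py):

```python
import numpy as np

def load_glove(lines, dim=300):
    """Parse GloVe text lines ("<word> v1 v2 ... v_dim") into a
    {word: vector} dictionary, skipping malformed lines."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip lines with the wrong number of fields
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# Usage with the real file:
# with open("glove/glove.6B.300d.txt", encoding="utf-8") as f:
#     embeddings = load_glove(f)
```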

📑 Datasets

We used three open-source log datasets for evaluation: Spirit, Thunderbird, and Hadoop.

| Software System | Description | Time Span | # Messages | Data Size | Link |
| --- | --- | --- | --- | --- | --- |
| Spirit | Spirit (ICC2) supercomputer log | 558 days | 272,298,969 | 30.289 GB | Usenix-CFDR Data |
| Thunderbird | Thunderbird supercomputer log | 244 days | 211,212,192 | 27.367 GB | Usenix-CFDR Data |
| Hadoop | Hadoop MapReduce job log | N.A. | 394,308 | 48.61 MB | LogHub |

Note: Considering the huge scale of the Spirit and Thunderbird datasets, we followed the settings of the previous study LogADEmpirical and selected the earliest 1 GB and 10 million log messages from the Spirit and Thunderbird datasets, respectively, for experimentation.
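Subsampling the earliest messages can be done with a short helper; the function name and paths below are illustrative, not part of this repository:

```python
from itertools import islice

def take_earliest(in_path, out_path, n_lines):
    """Copy the earliest n_lines log messages into a smaller file,
    in the spirit of the LogADEmpirical-style subsampling above."""
    with open(in_path, errors="ignore") as src, open(out_path, "w") as dst:
        for line in islice(src, n_lines):
            dst.write(line)

# e.g. take_earliest("Thunderbird.log", "Thunderbird_10m.log", 10_000_000)
```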

⚙️ Environment

Key Packages:

```
numpy==1.20.3
pandas==1.3.5
pytorch-lightning==1.1.2
scikit-learn==0.24.2
torch==1.13.1+cu116
tqdm==4.62.3
drain3
```

📜 Preparation

Follow these steps to fully run MIDLog.

  • Step 1: Download the log data and put it under the data folder.
  • Step 2: Use Drain to parse the unstructured logs.
  • Step 3: Download glove.6B.300d.txt from the Stanford NLP word embeddings and put it under the glove folder.
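Log parsing (Step 2) turns free-form log messages into templates by masking variable tokens. The toy regex below is only a stand-in to show what a template looks like; the actual pipeline should use Drain (e.g. the Drain3 library), which clusters messages far more robustly:

```python
import re

def naive_template(log_line):
    """Very rough stand-in for Drain parsing: mask numeric and hex
    tokens with <*> to recover a log template."""
    return re.sub(r"\b(0x[0-9a-fA-F]+|\d+)\b", "<*>", log_line)

print(naive_template("Failed to read block 481 on node 0x1f"))
# -> Failed to read block <*> on node <*>
```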

🚀 Quick Start

You can run MIDLog on the Spirit dataset with the following commands:

👉 Stage 1: Preprocessing data for training and evaluation.

```shell
python process.py --log_file data/log_structured --temp_file data/log_templates
```

👉 Stage 2: Global Normality Center Learning (Pre-train on Inexact Labeled Logs)

Step 1: Perform Unsupervised Local Normality Learning.

```shell
python main.py --stage ulnl --mode train --epochs 10 --batch_size 256 --num_keys 1278
```

Step 2: Preprocess data for Global Normality Distillation.

```shell
python main.py --stage ulnl --mode gen --epochs 10 --batch_size 256 --num_keys 1278 --candidates 150 --load_checkpoint True
```

Step 3: Perform Global Normality Distillation.

```shell
python main.py --stage gnd --mode train --epochs 10 --batch_size 256
```

👉 Stage 3: Performing Multi-Instance Deviation Learning.

```shell
python main.py --stage midl --mode train --epochs 150 --batch_size 32 --num_keys 1278
```

👉 Stage 4: Evaluation on Spirit Dataset.

```shell
python main.py --stage midl --mode auc_eval --epochs 150 --batch_size 32 --num_keys 1278 --load_checkpoint True
```
