# 🔥MIDLog: Weakly-supervised Log-based Anomaly Detection with Inexact Labels via Multi-instance Learning (ICSE 2025)
This is the basic implementation of our paper in ICSE 2025 (Research Track): Weakly-supervised Log-based Anomaly Detection with Inexact Labels via Multi-instance Learning
Log-based anomaly detection is essential for maintaining software availability. However, existing log-based anomaly detection approaches rely heavily on fine-grained, exact labels of individual log entries, which are very hard to obtain in real-world systems. This creates a key problem: anomaly detection models require supervision signals, yet labeled log entries are unavailable. To address this problem, we propose a new labeling strategy called inexact labeling: instead of labeling a single log entry, system experts label a bag of log entries within a time span. Furthermore, we propose MIDLog, a weakly-supervised log-based anomaly detection approach that works with inexact labels. We leverage the multi-instance learning paradigm to explicitly separate anomalous log entries from inexactly labeled anomalous log sets, thereby deducing exact per-entry anomaly labels from inexact bag labels. Extensive evaluation on three public datasets shows that our approach achieves an F1 score of over 85% with inexact labels.
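For context on the multi-instance learning paradigm mentioned above, here is a minimal, hypothetical sketch (toy parameters, not the paper's actual model) of attention-based MIL pooling: per-entry features in a bag are aggregated with learned attention weights into one bag-level representation, so a single bag label can supervise the model while the attention weights point at the entries most likely to be anomalous.

```python
import numpy as np

def attention_mil_pool(instance_feats, w, v):
    """Attention-based MIL pooling (illustrative sketch only).

    instance_feats: (n, d) features for the n log entries in one bag.
    w: (d,) projection vector, v: (d,) scoring vector -- toy parameters.
    Returns the bag-level feature and per-instance attention weights.
    """
    # Unnormalized attention score per instance: v . tanh(h * w)
    scores = np.tanh(instance_feats * w) @ v          # shape (n,)
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()                                # softmax over instances
    bag_feat = attn @ instance_feats                  # (d,) weighted average
    return bag_feat, attn

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 8))   # 5 log entries, 8-dim toy features
w = rng.normal(size=8)
v = rng.normal(size=8)
bag, attn = attention_mil_pool(feats, w, v)
print(bag.shape, attn.shape)
```

Entries with high attention weight dominate the bag representation, which is what lets a bag-level anomaly label be traced back to individual entries.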
```
├─checkpoint   # Saved models
├─data         # Log data
├─glove        # Pre-trained language models for log embedding
├─src
|  ├─dataset.py  # Load dataset
|  ├─models.py   # Attention-based Patch Encoder, PatchMixer model definition
|  └─utils.py    # Log embedding
├─main.py      # Entry point
└─process.py   # Data preprocessing
```
We used three open-source log datasets for evaluation: Spirit, Thunderbird, and Hadoop.
| Software System | Description | Time Span | # Messages | Data Size | Link |
|---|---|---|---|---|---|
| Spirit | Spirit (ICC2) supercomputer log | 558 days | 272,298,969 | 30.289 GB | Usenix-CFDR Data |
| Thunderbird | Thunderbird supercomputer log | 244 days | 211,212,192 | 27.367 GB | Usenix-CFDR Data |
| Hadoop | Hadoop MapReduce job log | N.A. | 394,308 | 48.61 MB | LogHub |
Note: Considering the huge scale of the Spirit and Thunderbird datasets, we followed the settings of the previous study LogADEmpirical and selected the earliest 1 GB and 10 million log messages from the Spirit and Thunderbird datasets, respectively, for experimentation.
Key packages:

```
numpy==1.20.3
pandas==1.3.5
pytorch_lightning==1.1.2
scikit_learn==0.24.2
torch==1.13.1+cu116
tqdm==4.62.3
```
Follow these steps to run MIDLog end to end:
- Step 1: Download the log data and put it under the `data` folder.
- Step 2: Use Drain to parse the unstructured logs.
- Step 3: Download `glove.6B.300d.txt` from the Stanford NLP word embeddings and put it under the `glove` folder.
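As a rough illustration of what Step 2 produces, log parsing turns each raw message into a template by masking variable tokens. The toy masker below is not Drain itself (Drain uses a fixed-depth parse tree and token-similarity clustering), but it shows the input/output shape of the step:

```python
import re

def to_template(message):
    """Toy log parser: mask any token containing a digit with <*>.

    Real parsers like Drain are far more robust, but the result has the
    same form: a constant template plus extracted variable parts.
    """
    tokens = message.split()
    masked = ["<*>" if re.search(r"\d", t) else t for t in tokens]
    return " ".join(masked)

print(to_template("Received block blk_3587 of size 67108864 from 10.0.0.1"))
# → "Received block <*> of size <*> from <*>"
```

The template strings are what `process.py` expects in `data/log_templates`.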
You can run MIDLog on the Spirit dataset with the following commands:
```shell
python process.py --log_file data/log_structured --temp_file data/log_templates
python main.py --stage ulnl --mode train --epochs 10 --batch_size 256 --num_keys 1278
python main.py --stage ulnl --mode gen --epochs 10 --batch_size 256 --num_keys 1278 --candidates 150 --load_checkpoint True
python main.py --stage gnd --mode train --epochs 10 --batch_size 256
python main.py --stage midl --mode train --epochs 150 --batch_size 32 --num_keys 1278
python main.py --stage midl --mode auc_eval --epochs 150 --batch_size 32 --num_keys 1278 --load_checkpoint True
```
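The GloVe step above boils down to giving each log template a fixed-size vector, typically by averaging the word vectors of its tokens. The following is a minimal sketch with a toy 3-dimensional vector table standing in for `glove.6B.300d.txt` (the function name and behavior are illustrative, not the repo's `utils.py`):

```python
import numpy as np

def embed_template(template, glove, dim=300):
    """Average the GloVe vectors of a template's words.

    Words missing from the vocabulary (e.g. the <*> wildcard) are
    skipped; an all-OOV template maps to the zero vector.
    """
    vecs = [glove[w] for w in template.lower().split() if w in glove]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# Toy stand-in for the real 300-dim GloVe table
glove = {"error": np.array([1.0, 0.0, 0.0]),
         "connection": np.array([0.0, 1.0, 0.0])}
vec = embed_template("Error connection <*>", glove, dim=3)
print(vec)  # → [0.5 0.5 0. ]
```

With the real `glove.6B.300d.txt`, the table would be built by reading one `word v1 v2 ... v300` line at a time into the dict.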
