This is the basic implementation of our paper in ISSRE 2024 (Research Track): LLMeLog: An Approach for Anomaly Detection based on LLM-enriched Log Events
Log-based anomaly detection is an essential task in maintaining software reliability. Existing log-based anomaly detection approaches often consist of three key phases: log parsing, event embedding, and model construction. Event embedding extracts semantic information from log events and produces vector representations of them. However, existing event embedding methods suffer from two key problems. First, semantic noise buried in log events leads to inevitable gaps between the semantics obtained from log events and their essential meaning. Second, there is a gap between general-purpose semantic embedding and the specific embedding requirements of anomaly detection tasks. To mitigate these problems and improve the quality of log event representations, we propose a novel anomaly detection approach named LLMeLog. It leverages the capabilities of large language models (LLMs) to enrich the contents of log events with in-context learning techniques. It then uses the enriched log events to fine-tune a pre-trained BERT model. Finally, it trains a transformer-based anomaly detection model with the event representations produced by the pre-trained BERT model. Evaluation results on three public log datasets show that LLMeLog achieves the best performance across all datasets, with F1-scores exceeding 99%. Moreover, when using only 10% of the labeled data for training, our approach still achieves F1-scores above 90%.
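For reference, the F1-score quoted above is the harmonic mean of precision and recall on the anomaly class. The snippet below is a minimal sketch of that formula, assuming binary labels where 1 marks an anomalous sequence; the function name `f1_score` and the toy labels are illustrative and are not taken from this repository.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Return the F1-score for the positive (anomalous, label 1) class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # anomalies caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # anomalies missed
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, for illustration only.
labels = [1, 0, 1, 1, 0, 0]
preds = [1, 0, 0, 1, 1, 0]
print(f1_score(labels, preds))
```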
├─checkpoint # Saved models
├─bert-base-en # Pretrained BERT model
├─new_encoder # Fine-tuned BERT model
├─data # Log data
├─src
| ├─dataset.py # Load dataset
| ├─models.py # Transformer-Based Anomaly Detection model
| └─utils.py # Log Embedding
├─main.py # Entry point
└─predata.py # Data preprocess
We used three open-source log datasets for evaluation: HDFS, BGL, and Thunderbird.
| Software System | Description | Time Span | # Messages | Data Size | Link |
|---|---|---|---|---|---|
| HDFS | Hadoop distributed file system log | 38.7 hours | 11,175,629 | 1.47GB | Loghub |
| BGL | Blue Gene/L supercomputer log | 214.7 days | 4,747,963 | 708.76MB | Usenix-CFDR Data |
| Thunderbird | Thunderbird supercomputer log | 244 days | 211,212,192 | 27.367GB | Usenix-CFDR Data |
Note: Considering the huge scale of the Thunderbird dataset, we followed the settings of the previous study LogADEmpirical and selected the earliest 10 million log messages from the Thunderbird dataset for experimentation.
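The truncation described in the note can be sketched as follows. This is only an illustration, assuming the raw log is a single file with one message per line in chronological order; the helper name `take_first_messages` and the file paths in the usage note are hypothetical and are not part of `predata.py`.

```python
from itertools import islice
from pathlib import Path


def take_first_messages(src: Path, dst: Path, limit: int = 10_000_000) -> int:
    """Copy the first `limit` lines of `src` to `dst` and return the count.

    Files are opened in binary mode so that any non-UTF-8 bytes in the raw
    log are copied through unchanged.
    """
    written = 0
    with src.open("rb") as fin, dst.open("wb") as fout:
        # islice stops after `limit` lines without reading the rest of the file.
        for line in islice(fin, limit):
            fout.write(line)
            written += 1
    return written
```

A call such as `take_first_messages(Path("data/Thunderbird.log"), Path("data/Thunderbird_10M.log"))` would then keep the earliest 10 million messages; both paths are placeholders.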
Key Packages:
Numpy==1.20.3
Pandas==1.3.5
Pytorch_lightning==1.1.2
torch==1.13.1+cu116
tqdm==4.62.3
transformers==4.15.0
Follow these steps to run LLMeLog end to end.
- Step 1: Download the log data and put it under the `data` folder.
- Step 2: Use Drain to parse the unstructured logs.
- Step 3: Download `bert-base` from Hugging Face and put it under the `bert-base-en` folder.
- Step 4: Enrich the log events with the provided prompt; we recommend using ChatGPT.
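To give a feel for Step 4, the sketch below assembles an in-context-learning prompt for one parsed log event. It is not the prompt provided with this repository: the instruction text, the two demonstrations, their explanations, and the function name `build_enrichment_prompt` are all assumptions made for illustration. The resulting string would be sent to the LLM of your choice.

```python
from typing import List, Sequence, Tuple

# Hypothetical demonstrations: (parsed log event, enriched explanation).
# The explanations are written for this example only.
DEMONSTRATIONS: List[Tuple[str, str]] = [
    (
        "Receiving block <*> src: <*> dest: <*>",
        "A DataNode starts receiving a data block sent from a source "
        "address to a destination address.",
    ),
    (
        "PacketResponder <*> for block <*> terminating",
        "The thread that acknowledges received packets for a block is "
        "shutting down after the transfer.",
    ),
]

# Hypothetical instruction text, not the wording used by the authors.
INSTRUCTION = (
    "You are an expert in system logs. Explain the meaning of the final "
    "entry in one plain sentence, expanding abbreviations and keeping <*> "
    "as a placeholder for variable parts."
)


def build_enrichment_prompt(
    event: str,
    demonstrations: Sequence[Tuple[str, str]] = DEMONSTRATIONS,
) -> str:
    """Build a few-shot prompt that ends where the model should answer."""
    parts = [INSTRUCTION, ""]
    for demo_event, demo_explanation in demonstrations:
        parts += [
            f"Log event: {demo_event}",
            f"Explanation: {demo_explanation}",
            "",
        ]
    # The query event comes last, with the explanation left open.
    parts += [f"Log event: {event}", "Explanation:"]
    return "\n".join(parts)


print(build_enrichment_prompt("Deleting block <*> file <*>"))
```

Keeping the demonstrations in a plain list makes it easy to swap in examples from the dataset being processed.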
You can run LLMeLog on the HDFS dataset with the following commands:
python predata.py --dataset hdfs
python main.py --mode train --encoder 1 --dataset hdfs --lr 0.0002
python main.py --mode gen --encoder 1 --dataset hdfs --lr 0.0002
python main.py --mode train --dataset hdfs --batch_size 256 --lr 0.0003
python main.py --mode eval --dataset hdfs --batch_size 256 --lr 0.0003 --load_checkpoint True
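If you prefer to launch the five commands from one script, the sketch below runs them in the listed order and stops at the first failure. It is a convenience wrapper under stated assumptions, not part of the repository: the function names are hypothetical, the flags and values are copied verbatim from the HDFS commands above, and whether the same learning rates and batch size suit other datasets is not stated here.

```python
import shlex
import subprocess
import sys
from typing import List


def pipeline_commands(dataset: str = "hdfs") -> List[List[str]]:
    """Return the five commands from the README, in order."""
    py = sys.executable  # use the interpreter that runs this script
    main = [py, "main.py"]
    return [
        [py, "predata.py", "--dataset", dataset],
        main + ["--mode", "train", "--encoder", "1",
                "--dataset", dataset, "--lr", "0.0002"],
        main + ["--mode", "gen", "--encoder", "1",
                "--dataset", dataset, "--lr", "0.0002"],
        main + ["--mode", "train", "--dataset", dataset,
                "--batch_size", "256", "--lr", "0.0003"],
        main + ["--mode", "eval", "--dataset", dataset,
                "--batch_size", "256", "--lr", "0.0003",
                "--load_checkpoint", "True"],
    ]


def run_pipeline(dataset: str = "hdfs", dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in pipeline_commands(dataset):
        print(shlex.join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so later stages never run on top of a failed one.
            subprocess.run(cmd, check=True)


# Dry run: only prints the commands. Pass dry_run=False to execute them
# from the repository root.
run_pipeline("hdfs", dry_run=True)
```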
If you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:
@inproceedings{he2024llmelog,
title={LLMeLog: An Approach for Anomaly Detection based on LLM-enriched Log Events},
author={He, Minghua and Jia, Tong and Duan, Chiming and Cai, Huaqian and Li, Ying and Huang, Gang},
booktitle={2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE)},
pages={132--143},
year={2024},
organization={IEEE}
}
