
🧠 LLM-ITL: Neural Topic Modeling with Large Language Models in the Loop

This repository contains the official implementation of the paper:

📄 Neural Topic Modeling with Large Language Models in the Loop
🏆 Accepted to the Main Conference of ACL 2025
🔗 Read the paper


📂 Table of Contents

  • 📖 Overview
  • ✨ Features
  • 🛠 Installation
  • 📁 External Files
  • 🚀 Usage
  • 📌 Examples
  • 🧾 Citation
  • 📄 License

📖 Overview

[Figure] Overview of the LLM-ITL framework.

LLM-ITL is a novel LLM-in-the-loop framework that creates a synergistic integration between Large Language Models (LLMs) and Neural Topic Models (NTMs), enhancing topic interpretability while preserving efficient topic and document representation learning.

In this framework:

  • NTMs are used to learn global topics and document representations.
  • An LLM refines the learned topics via an Optimal Transport (OT)-based alignment objective that adjusts dynamically based on the LLM's confidence in suggesting topical words (see the illustrative sketch below).

LLM-ITL is modular and compatible with many existing NTMs. It significantly improves topic interpretability while maintaining the quality of document representations, as demonstrated through extensive experiments.
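
To make the alignment objective concrete, below is a minimal, illustrative sketch of a confidence-weighted OT loss. It is not the repository's actual implementation: the Sinkhorn routine, the cost matrix, and all variable names are assumptions chosen for illustration. It computes an entropic-regularized OT distance between a topic's word distribution from the NTM and a word distribution suggested by the LLM, then scales it by the LLM's confidence so that low-confidence suggestions contribute less to training.

# Illustrative sketch only -- not the repository's implementation.
import torch

def sinkhorn_ot(p, q, cost, reg=0.1, n_iters=50):
    """Entropic-regularized OT distance between distributions p and q."""
    K = torch.exp(-cost / reg)                # Gibbs kernel
    u = torch.ones_like(p)
    for _ in range(n_iters):                  # Sinkhorn iterations
        v = q / (K.t() @ u)
        u = p / (K @ v)
    plan = torch.diag(u) @ K @ torch.diag(v)  # transport plan
    return (plan * cost).sum()                # transport cost

# Toy example: one topic's word distribution from the NTM vs. the LLM's
# suggested word distribution, both over the same (tiny) vocabulary.
ntm_topic = torch.tensor([0.5, 0.3, 0.2])
llm_topic = torch.tensor([0.4, 0.4, 0.2])
cost = 1.0 - torch.eye(3)                     # e.g., 1 - similarity between words
llm_confidence = 0.8                          # LLM's confidence in its suggestion

alignment_loss = llm_confidence * sinkhorn_ot(ntm_topic, llm_topic, cost)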


✨ Features

  • 🔄 Flexible integration: Seamlessly combine a wide range of Large Language Models (LLMs) with various Neural Topic Models (NTMs) through a unified interface.

  • ⚙️ Modular design: Swap LLMs, NTMs, datasets, or evaluation metrics independently, which makes the framework well suited to research experiments and benchmarking.

  • 🧠 Supports various LLMs: Integrates with LLMs such as Llama 3, Mistral, Yi, Phi-3, and Qwen through a flexible, extensible design.

  • 📊 Supports various NTMs: Integrates with NTMs such as NVDM, PLDA, SCHOLAR, ETM, NSTM, CLNTM, WeTe, and ECRTM through the same extensible design.

  • 📈 Built-in evaluation metrics: Supports topic coherence, topic alignment, and topic diversity for comprehensive evaluation of topic models.

  • 🗂️ Dataset flexibility: Compatible with widely used datasets such as 20News, AGNews, DBpedia, and R8, and easy to extend to custom datasets.


🛠 Installation

# Clone the repository
git clone https://github.com/Xiaohao-Yang/LLM-ITL.git
cd LLM-ITL

# Create a virtual environment
python3.9 -m venv venv
source venv/bin/activate  

# Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
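
After installing, you can optionally sanity-check that the key dependencies import and that PyTorch can see your GPU (a quick check; exact versions come from requirements.txt):

# optional sanity check
import torch, transformers
print('torch:', torch.__version__, '| CUDA available:', torch.cuda.is_available())
print('transformers:', transformers.__version__)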

πŸ“ External Files

Some required files are too large to be hosted. Please download them from the links below, place them in the specified directories, and unzip them if needed.

File Name         | Description                            | Path in Project      | Download Link
Wikipedia_bd.zip  | Reference corpus needed for evaluation | ./                   | 🔗
glove.6B.100d.txt | Word embeddings needed by WeTe         | ./topic_models/WeTe/ | 🔗
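
For example, once Wikipedia_bd.zip has been downloaded to the project root, it can be extracted in place with a few lines of Python (a small convenience snippet, assuming you run it from the repository root):

# extract the reference corpus into the project root
import zipfile
with zipfile.ZipFile('Wikipedia_bd.zip') as zf:
    zf.extractall('.')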

🚀 Usage

🔹 Run a Topic Model

python main.py --dataset 20News --model nvdm --n_topic 50 --random_seed 1

🔹 Run a Topic Model with LLM-ITL

python main.py --dataset 20News --model nvdm --n_topic 50 --random_seed 1 --llm_itl

🔹 Run Evaluation

python eval.py --dataset 20News --model nvdm --n_topic 50 --eval_topics
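
To average results over several random seeds, the runs can also be scripted, for example (an illustrative helper, not part of the repository; the flags mirror the commands above):

# run LLM-ITL over several random seeds
import subprocess

for seed in (1, 2, 3):
    subprocess.run(['python', 'main.py',
                    '--dataset', '20News', '--model', 'nvdm',
                    '--n_topic', '50', '--random_seed', str(seed),
                    '--llm_itl'], check=True)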

πŸ“ Output Files

πŸ”Έ Model checkpoints β†’ ./save_models/

πŸ”Έ Topic files β†’ ./save_topics/

πŸ”Έ Evaluation results β†’ ./evaluation_output/

📌 Examples

πŸ” Refine topics and get confidence ( Jupyter Notebook )
# load functions
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from generate import generate_one_pass, generate_two_step

We support the LLMs listed in the code cell below. Some of them are gated on Hugging Face, so please request access on the corresponding model pages if necessary.

We are not limited to these LLMs. Feel free to experiment with other models and modify the prompts in the create_messages_xx functions within generate.py.

# load the LLM

model_name = 'meta-llama/Meta-Llama-3-8B-Instruct'
# model_name = 'mistralai/Mistral-7B-Instruct-v0.3'
# model_name = '01-ai/Yi-1.5-9B-Chat'
# model_name = 'microsoft/Phi-3-mini-128k-instruct'

# Larger models:
# model_name = 'Qwen/Qwen1.5-32B-Chat'
# model_name = 'meta-llama/Meta-Llama-3-70B-Instruct'

# load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             trust_remote_code=True,
                                             torch_dtype=torch.float16
                                             ).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
# example topics
topic1 = ['book', 'university', 'bank', 'science', 'vote', 'gordon', 'surrender', 'intellect', 'skepticism', 'shameful']
topic2 = ['game', 'team', 'hockey', 'player', 'season', 'year', 'league', 'nhl', 'playoff', 'fan']
topic3 = ['written', 'performance', 'creation', 'picture', 'chosen', 'clarify', 'second', 'appreciated', 'position', 'card']
topics = [topic1, topic2, topic3]
# some configurations
voc = None                        # A list of words. 
                                  # The refined words will be filtered to retain only those that are present in the vocabulary.
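                                  # For example (hypothetical path, adjust to your dataset), voc could be
                                  # loaded from the dataset's vocabulary file:
                                  # voc = open('path/to/vocab.txt').read().split()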

inference_bs = 5                  # Batch size: the number of topics sent to the LLM for refinement at once.
                                  # Increase or reduce this number depending on your GPU memory.


instruction_type = 'refine_labelTokenProbs'    

# Different ways to get the confidence score; we support the following options:
# 'refine_labelTokenProbs'    -- Label token probability
# 'refine_wordIntrusion'      -- Word intrusion confidence
# 'refine_askConf'            -- Ask for confidence
# 'refine_seqLike'            -- Length-normalized sequence likelihood
# 'refine_twoStep_Score'      -- Self-reflective confidence score
# 'refine_twoStep_Boolean'    -- p(True)

# For more details about these confidence scores, please refer to our paper.
# generate topics
if instruction_type in ['refine_labelTokenProbs', 'refine_wordIntrusion', 'refine_askConf', 'refine_seqLike']:
    topic_probs, word_prob = generate_one_pass(model,
                                               tokenizer,
                                               topics,
                                               voc=voc,
                                               batch_size=inference_bs,
                                               instruction_type=instruction_type)

elif instruction_type in ['refine_twoStep_Score', 'refine_twoStep_Boolean']:
    topic_probs, word_prob = generate_two_step(model,
                                               tokenizer,
                                               topics,
                                               voc=voc,
                                               batch_size=inference_bs,
                                               instruction_type=instruction_type)
print('Topic label and confidence:')
for i in range(len(topic_probs)):
    print('Topic %s: ' % i, topic_probs[i])

print()
print('Topic words and probabilities:')
for i in range(len(word_prob)):
    print('Topic %s: ' % i, word_prob[i])
Topic label and confidence:
Topic 0:  {'Higher Learning': 0.17292044166298481}
Topic 1:  {'Ice Sport': 0.39517293597115355}
Topic 2:  {'Artistic Expression': 0.056777404880380314}

Topic words and probabilities:
Topic 0:  {'university': 0.1, 'degrees': 0.1, 'curriculum': 0.1, 'book': 0.1, 'research': 0.1, 'skepticism': 0.1, 'education': 0.1, 'intellect': 0.1, 'knowledge': 0.1, 'science': 0.1}
Topic 1:  {'nhl': 0.1, 'league': 0.1, 'season': 0.1, 'hockey': 0.1, 'match': 0.1, 'player': 0.1, 'rival': 0.1, 'playoff': 0.1, 'game': 0.1, 'team': 0.1}
Topic 2:  {'creative': 0.1, 'written': 0.1, 'picture': 0.1, 'appreciated': 0.1, 'artist': 0.1, 'imagination': 0.1, 'clarify': 0.1, 'creation': 0.1, 'chosen': 0.1, 'performance': 0.1}
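
As a follow-up, the returned word probabilities can be converted into a dense distribution over a vocabulary, which is the form an OT-style alignment target would take. The snippet below is a hedged sketch; the toy vocabulary and variable names are illustrative, not the repository's:

# turn one topic's suggested words into a distribution over a vocabulary
import torch

voc = ['university', 'book', 'science', 'hockey', 'team']    # toy vocabulary
word2id = {w: i for i, w in enumerate(voc)}

suggested = {'university': 0.1, 'book': 0.1, 'science': 0.1}  # e.g., word_prob[0]
target = torch.zeros(len(voc))
for w, p in suggested.items():
    if w in word2id:                        # keep only in-vocabulary words
        target[word2id[w]] = p
target = target / target.sum()              # renormalize to a distribution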

🧾 Citation

@article{yang2024neural,
  title={Neural Topic Modeling with Large Language Models in the Loop},
  author={Yang, Xiaohao and Zhao, He and Xu, Weijie and Qi, Yuanyuan and Lu, Jueqing and Phung, Dinh and Du, Lan},
  journal={arXiv preprint arXiv:2411.08534},
  year={2024}
}

📄 License

This work is licensed under the Creative Commons Attribution 4.0 International License.
