This repository provides the official implementation of MARIOH, a supervised method for reconstructing hyperedges in hypergraphs by leveraging edge multiplicity. MARIOH integrates several key components: a theoretically guaranteed filtering step to identify true size-2 hyperedges, a multiplicity-aware classifier for scoring hyperedge candidates, and a bidirectional search strategy that explores both high- and low-confidence cliques. These components work together to achieve accurate and efficient hypergraph reconstruction. For further details, please refer to our accompanying research paper.
- `main.py`: The entry point script to run the hyperedge reconstruction pipeline.
- `params.py`: Parameter dictionaries for various datasets and modes (reduced or preserved); a hypothetical sketch of the structure follows this list.
- `utils/`: A directory containing modularized code for data processing, feature extraction, graph operations, model training, evaluation, and input/output utilities.
- `data/`: A directory that should contain the dataset-specific training and testing files.
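For orientation, the entries in `params.py` are plain Python dictionaries keyed by dataset and mode. The sketch below is purely illustrative: the key names (`lr`, `epochs`, `hidden_dim`) are assumptions, not the actual schema shipped in the repository.

```python
# Hypothetical shape of a params.py entry; consult the shipped file for
# the real key names and values.
params = {
    "hschool": {
        "reduced":   {"lr": 1e-3, "epochs": 100, "hidden_dim": 64},
        "preserved": {"lr": 1e-3, "epochs": 100, "hidden_dim": 64},
    },
}
```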
- Python version: 3.8+ recommended
- Dependencies: `numpy`, `networkx`, `torch`, `joblib`, `argparse`
- Additional Python dependencies can be installed via `pip install -r requirements.txt`. Adjust `requirements.txt` or the installation commands as needed for your environment.
Place your datasets in the `data/` directory. Each dataset should have its own subdirectory, for example:

```
data/
|-- {dataset_name}/
    |-- train.txt      # Training data (reduced mode)
    |-- test.txt       # Testing data (reduced mode)
    |-- train_dup.txt  # Training data (preserved mode)
    +-- test_dup.txt   # Testing data (preserved mode)
```
- Reduced mode uses `train.txt` and `test.txt`.
- Preserved mode uses `train_dup.txt` and `test_dup.txt`.
Please refer to the related publication for details on dataset formats and preprocessing steps.
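The paper remains the authoritative description of the input format; as a rough aid, the sketch below assumes each line of a train/test file lists one hyperedge as comma-separated node IDs, matching the convention used for the output files. The helper name `load_hyperedges` is ours, not part of the repository.

```python
# Minimal reader sketch, assuming one comma-separated hyperedge per line
# (a format assumption; see the paper for the authoritative description).
def load_hyperedges(path):
    hyperedges = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            hyperedges.append([int(v) for v in line.split(",")])
    return hyperedges

# Example: edges = load_hyperedges("data/hschool/train.txt")
```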
To run the pipeline, navigate to the directory containing `main.py` and execute:

```
python main.py --data {dataset_name} --gpu 0 --seed 42 --output_dir output
```

- `--data {dataset_name}`: Name of the dataset folder located under `data/`.
- `--gpu {int}`: GPU device number. If no GPU is available or you wish to run on CPU, set `--gpu` to a non-existent GPU ID (e.g., `--gpu 99`) and the pipeline will fall back to CPU (see the sketch after this list).
- `--seed {int}`: Random seed for reproducibility.
- `--output_dir {path}`: Directory in which to store the output hyperedge predictions and results.
- `--preserved`: Optional flag. If set, the pipeline runs in "preserved" mode using `train_dup.txt` and `test_dup.txt`; if omitted, it runs in "reduced" mode using `train.txt` and `test.txt`.
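The CPU fallback described for `--gpu` behaves roughly like the following PyTorch device-selection pattern. This is an illustrative sketch of the idea, not the repository's exact code, and `select_device` is a hypothetical helper name.

```python
import torch

def select_device(gpu_id: int) -> torch.device:
    # Fall back to CPU when CUDA is unavailable or the requested
    # device index does not exist (e.g., --gpu 99).
    if torch.cuda.is_available() and gpu_id < torch.cuda.device_count():
        return torch.device(f"cuda:{gpu_id}")
    return torch.device("cpu")

# Example: device = select_device(99)  # -> cpu on most machines
```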
Reduced mode (default):

```
python main.py --data hschool --gpu 0 --seed 123 --output_dir output
```

Preserved mode:

```
python main.py --data hschool --gpu 0 --seed 123 --output_dir output --preserved
```

In these examples, the code will:
- Load and preprocess the graph data.
- Extract features and prepare a training dataset.
- Train a classifier network with the best parameters specified in `params.py`.
- Use the trained classifier to reconstruct hyperedges in the test graph.
- Save the reconstructed hyperedges to `output/reconstructed_hyp_reduced/{dataset_name}_{seed}.txt` (in reduced mode) or `output/reconstructed_hyp_preserved/{dataset_name}_{seed}.txt` (in preserved mode).
- Output Files: The final reconstructed hyperedges are stored as comma-separated node IDs, one hyperedge per line.
- Evaluation Metrics: The code prints evaluation metrics such as Jaccard similarity and multiset Jaccard similarity during execution; these assess the quality of the reconstruction relative to the ground truth (a generic sketch of both measures follows this list).
- Performance & Reproducibility: By setting the random seed (`--seed`) and controlling hyperparameters through `params.py`, you can reproduce the experimental results reported in the associated research paper.
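For reference, a generic sketch of both similarity measures is given below, following their textbook definitions over collections of hyperedges (encoded here as frozensets of node IDs). The repository's own implementation in `utils/evaluation.py` is authoritative and may differ in how it pairs and aggregates hyperedges.

```python
from collections import Counter

def jaccard(a, b):
    # Set Jaccard |A ∩ B| / |A ∪ B| between two collections of
    # hashable items (e.g., hyperedges encoded as frozensets).
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0

def multiset_jaccard(a, b):
    # Multiset Jaccard: respects duplicates via min/max multiplicities.
    ca, cb = Counter(a), Counter(b)
    union = sum((ca | cb).values())
    return sum((ca & cb).values()) / union if union else 1.0

# Hyperedges as frozensets of node IDs:
pred = [frozenset({1, 2, 3}), frozenset({4, 5})]
true = [frozenset({1, 2, 3}), frozenset({4, 6})]
print(jaccard(pred, true))  # 1 common / 3 total = 0.333...
```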
- Modify `params.py` to add or change hyperparameters for different datasets.
- Adjust or add dataset loaders in `utils/data_processing.py` if your input format differs.
- Add new evaluation metrics in `utils/evaluation.py` (a hypothetical example follows).
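For instance, a new metric added to `utils/evaluation.py` could follow the pattern below; the function name `exact_match_rate` and its criterion are hypothetical, chosen only to illustrate the extension point.

```python
def exact_match_rate(predicted, ground_truth):
    # Hypothetical metric: fraction of ground-truth hyperedges that are
    # recovered exactly (as node sets) among the predictions.
    pred_sets = {frozenset(e) for e in predicted}
    true_sets = [frozenset(e) for e in ground_truth]
    if not true_sets:
        return 1.0
    return sum(e in pred_sets for e in true_sets) / len(true_sets)
```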