This page describes in detail how to reproduce the experimental results reported in the SIGMOD 2020 paper #125, "Factorized Graph Representations for Semi-Supervised Learning from Sparse Data," as submitted to the ACM SIGMOD 2021 Reproducibility Track.
The official paper is available in the ACM Digital Library (https://dl.acm.org/doi/10.1145/3318464.3380577). The full version is available on arXiv (arXiv:2003.02829). To cite our work, we suggest using the DBLP bib file.
All code is implemented in Python 3 with the following dependencies
(also listed in requirements.txt):
jupyter>=1.0.0
matplotlib>=1.4.2
numpy>=1.9.1
networkx>=1.11
pandas>=0.19.0
pyamg>=2.2.1
pytest>=2.8.0
scipy>=0.15.1
scikit-learn>=0.18
sklearn
seaborn>=0.8.0
To install all dependencies, simply run: pip install -r requirements.txt
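If you want to double-check the installation before starting, a quick sanity check such as the one below (our suggestion, not part of the repository) can be run from a Python 3 interpreter:
# Confirm that the core dependencies import and print their installed versions.
import matplotlib, networkx, numpy, pandas, pyamg, scipy, seaborn, sklearn
for module in (matplotlib, networkx, numpy, pandas, pyamg, scipy, seaborn, sklearn):
    print(module.__name__, module.__version__)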
Synthetic data generator:
Our GitHub repository includes our synthetic data generator,
/sslh/graphGenerator.py.
It is not required to separately generate data, as the experimental scripts auto-generate the necessary data.
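If you would like to inspect the generator module directly, the snippet below (our suggestion; it assumes the repository root is on your PYTHONPATH and that /sslh is importable as a package) lists its public entry points without calling any of them:
# List the public names exposed by the synthetic graph generator module
# (assumes the repository root is on PYTHONPATH).
import importlib
generator = importlib.import_module("sslh.graphGenerator")
print([name for name in dir(generator) if not name.startswith("_")])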
Real dataset repository: We used 8 real datasets in our experiments; they are available as
16 CSV files totaling 1.2 GB in a separate Google Drive folder.
Please download those real datasets and copy them into the following directory before running the experiments on real data:
/experiments_sigmod20/realData/
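Before starting the real-data experiments, you may want to verify that the files were copied correctly. The snippet below is a quick sanity check of our own (the expected counts are taken from the description above); run it from the repository root:
# Confirm the downloaded real datasets are in place
# (expected: 16 CSV files totaling roughly 1.2 GB).
from pathlib import Path
csv_files = sorted(Path("experiments_sigmod20/realData").glob("*.csv"))
total_gb = sum(f.stat().st_size for f in csv_files) / 1e9
print(f"{len(csv_files)} CSV files, {total_gb:.2f} GB total")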
Experiments were primarily run on a 2016 MacBook Pro 13-inch with the configuration below ("Hardware 1"). Some experiments on the larger real-world datasets were run on a cluster, detailed below ("Hardware 2").
Hardware 1: used for all timing experiments (including Figures 3, 5, and 6 in the paper):
- Processor: 2.5 GHz Intel Core i5
- Memory: 16 GB
- Secondary Storage: 1 TB SSD
Hardware 2: used for accuracy estimation experiments on real-world datasets (Figure 7) (for detailed specs, please refer to the Discovery Cluster at MGHPCC):
- Processor: 2.4 GHz Intel E5-2680 v4 CPUs
- Memory: 256 GB/node
- Secondary Storage: GPFS parallel file system (disk type not documented; we assume SSDs, as is typical for an HPC facility of this class)
- Network: InfiniBand (IB) interconnect running at 100 Gbps
You can repeat the experiments and produce all figures from the paper by using two Jupyter notebooks, one for synthetic data and the other for real datasets:
- /experiments_sigmod20/Figures_syntheticdata_sigmod20.ipynb: all experiments with synthetic datasets
- /experiments_sigmod20/Figures_realdata_sigmod20.ipynb: all experiments on real-world datasets
Open the relevant notebook and jump to the corresponding functions to see how a given figure in the paper was generated.
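If you have not used Jupyter before, each notebook can be opened from the repository root with, for example:
jupyter notebook experiments_sigmod20/Figures_syntheticdata_sigmod20.ipynb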
The code allows two levels of granularity to reproduce all results:
- Plot figures from the paper using our saved experimental traces: We have cached all the intermediate results required to produce the figures in our paper in the /experiments_sigmod20/datacache subfolder. You can simply run the cells in the Jupyter notebooks to generate all figures. By default the code uses our stored results, and we recommend this mode during a first pass.
- Run all the experiments from scratch: Use the option create_data = True in the provided Jupyter notebooks. On a 2016 MacBook Pro, this takes approximately 200 hours for the synthetic experiments. Each cell also provides a mini-version of our experiments with about one tenth of the sample points; you can uncomment those mini-versions to run the synthetic experiments much faster.
We thus suggest reproducing the graphs with lower accuracy (fewer data samples and thus more wiggly curves) by toggling the commented and uncommented lines in the notebooks.
Example Use:
To generate Figure 5(a) using cached intermediate data, run:
Fig_Backtracking_Advantage.run(choice=31, variant=0, create_data=False, show_plot=True, create_pdf=True)
To ignore the cached data and recreate the figure from scratch, run:
Fig_Backtracking_Advantage.run(choice=31, variant=0, create_data=True, show_plot=True, create_pdf=True)
Please note that the choice and variant parameter values are already set to those used in the paper.
A note about timing:
Many of the accuracy experiments on real datasets (Figure 7) were run at MGHPCC using parallelized code
on high-performance compute infrastructure. Although the plots can be reproduced instantaneously from the cached data provided in the repository,
if you wish to recreate the experimental data points from scratch (i.e. using the create_data=True flag),
it is highly recommended to run the real data experiments with a large number of CPU cores and plenty of memory.
We estimate this to take at least 30 days on commodity hardware. To simplify matters, the most costly
baseline method can be omitted for the larger graphs, which we estimate reduces the runtime to about one day.
The experiments are feasible to run on a home computer, but it will likely take several days to produce plots with variance comparable
to those presented in the paper.
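Before launching such a from-scratch run, it may be worth checking how many CPU cores the parallelized code will be able to use; a minimal check (our suggestion) is:
# Report the number of CPU cores visible to Python; the from-scratch
# real-data experiments benefit from as many cores as possible.
import os
print("CPU cores available:", os.cpu_count())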