DINI is a tool for imputing tabular multi-input/multi-output data whose features can be continuous, categorical, or a mix of both. DINI takes in data with missing values and iteratively imputes it while training a surrogate model that can be leveraged for downstream tasks. This enables machine learning on corrupted or incomplete data through state-of-the-art imputation. It works with any dataset and any PyTorch model.
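To give a rough intuition for this loop, the snippet below is a minimal conceptual sketch of alternately training a surrogate and updating missing entries by gradient descent. It is not the procedure implemented in dini.py; the function name, arguments, and hyperparameters (`sketch_impute`, `x_mask`, `num_rounds`, learning rates) are all illustrative assumptions.

```python
# Conceptual sketch only (NOT the code in dini.py): alternately train a surrogate
# and update the missing input entries by gradient descent.
import torch
import torch.nn as nn

def sketch_impute(x, y, x_mask, model: nn.Module, num_rounds: int = 50):
    """x: inputs with missing entries pre-filled (e.g., with column means);
    x_mask: boolean tensor, True where an entry was originally missing;
    y: outputs, assumed fully observed in this simplified sketch."""
    x_free = x.clone().requires_grad_(True)        # missing entries become free parameters
    opt_model = torch.optim.Adam(model.parameters(), lr=1e-3)
    opt_data = torch.optim.Adam([x_free], lr=1e-2)
    for _ in range(num_rounds):
        # (a) Fit the surrogate to the current, partly imputed data.
        opt_model.zero_grad()
        loss = nn.functional.mse_loss(model(torch.where(x_mask, x_free, x)), y)
        loss.backward()
        opt_model.step()
        # (b) "Invert" the surrogate: nudge the missing entries to better explain y.
        opt_data.zero_grad()
        loss = nn.functional.mse_loss(model(torch.where(x_mask, x_free, x)), y)
        loss.backward()
        opt_data.step()
    return torch.where(x_mask, x_free, x).detach(), model
```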
```shell
git clone --recurse-submodules https://github.com/jha-lab/dini.git
cd dini
```
The Python environment setup is based on conda. The script below creates a new environment named `dini` or updates an existing environment, on the macOS-arm64 platform:
```shell
source setup/env_step.sh
```
For any other platform, you can use the environment files. For pip installation:
```shell
pip install --requirement setup/requirements.txt
```
For conda installation:
```shell
conda env create --file setup/environment.yaml
conda activate dini
```
To generate corrupt data:
```shell
python3 corrupt.py --dataset <dataset> --strategy <strategy>
```
where `<dataset>` can be one of `breast`, `diabetes`, `diamonds`, `energy`, `flights`, or `yacht`, and `<strategy>` can be any one of `MCAR`, `MAR`, `MNAR`, `MSAR`, or `MPAR`.
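As an illustration of the simplest of these strategies, the sketch below drops entries uniformly at random (MCAR). It is only a conceptual example, not the logic in corrupt.py, and the function name and parameters are assumptions.

```python
# Illustrative MCAR corruption (NOT the implementation in corrupt.py):
# drop each entry independently with probability p.
import numpy as np
import pandas as pd

def corrupt_mcar(df: pd.DataFrame, p: float = 0.2, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    mask = rng.random(df.shape) < p   # True where a value is removed
    return df.mask(mask)              # masked entries become NaN
```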
To run the DINI model:
```shell
python3 dini.py --model <model> --dataset <dataset> --retrain
```
where `<model>` can be one of `FCN`, `FCN2`, `LSTM2`, or `TXF2`; `FCN2` is the one used in the paper. To model uncertainties using an MC dropout layer, use the `--model_unc` flag. You can also set the fraction of data to start imputing on using `--impute_fraction <fraction>`, where `<fraction>` is a number between 0 and 1 (see Table 3 in the paper).
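For context, MC-dropout-style uncertainty estimation generally keeps dropout active at inference time and aggregates several stochastic forward passes, as in the generic sketch below. This is not the code behind `--model_unc`; the function name and sample count are assumptions.

```python
# Generic MC-dropout sketch (not the DINI implementation): keep dropout layers
# active at inference and report the mean and spread of several forward passes.
import torch
import torch.nn as nn

@torch.no_grad()
def mc_dropout_predict(model: nn.Module, x: torch.Tensor, num_samples: int = 30):
    model.train()                               # keeps nn.Dropout layers active
    preds = torch.stack([model(x) for _ in range(num_samples)])
    model.eval()
    return preds.mean(dim=0), preds.std(dim=0)  # predictive mean and uncertainty
```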
To run imputation using all baselines, including DINI:
```shell
python3 impute.py --dataset <dataset> --strategy <strategy>
```
To run surrogate modeling on imputed data, for the three case studies:
```shell
python3 model.py --dataset <case_dataset> --strategy <strategy>
```
where `<case_dataset>` can be one of `gas`, `swat`, or `covid_cxr`. Note that the swat dataset is not public and has to be downloaded into the `data/swat/` directory. To do this, get access to the dataset using this link, then save `SWaT_Dataset_Attack_v0.csv` to the `data/swat/` directory.
To run any PyTorch model, you can modify the `src/models.py` file; see the examples (namely the `FCN`, `FCN2`, `LSTM2`, and `TXF2` models) in that file. To use any dataset, convert it to a `data.csv` file placed in the `data/<dataset>/` directory. Then, the following lines can be added to the `process` function in `corrupt.py`:
```python
elif dataset == <dataset>:
    def split(df):
        return df.iloc[:, :-<out_col>].values, df.iloc[:, -<out_col>:].values
```
where `<dataset>` is the name of the dataset and `<out_col>` is the number of output columns in the chosen dataset.
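For example, for a hypothetical dataset named `housing` with two output columns (both the name and the column count are placeholders, not files shipped with the repository), the added branch would read:

```python
# Hypothetical example: a dataset "housing" whose last 2 columns are outputs.
elif dataset == 'housing':
    def split(df):
        return df.iloc[:, :-2].values, df.iloc[:, -2:].values
```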
Shikhar Tuli. For any questions, comments or suggestions, please reach me at [email protected].
Cite our work using the following BibTeX entry:
```bibtex
@article{tuli2022sr,
  title={{DINI}: Data Imputation using Neural Inversion for Edge Applications},
  author={Tuli, Shikhar and Jha, Niraj K.},
  journal={Scientific Reports},
  volume={12},
  pages={20210},
  year={2022},
  publisher={Nature Publishing Group}
}
```
BSD-3-Clause. Copyright (c) 2021, Shikhar Tuli and JHA-Lab. All rights reserved.
See License file for more details.