Hierarchical Attention Propagation (HAP) is a medical ontology embedding framework which generalizes GRAM by hierarchically propagating attention across the entire ontology structure, where a medical concept adaptively learns its embedding from all other concepts in the hierarchy instead of only its ancestors.
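As a rough illustration of this difference, the sketch below combines candidate embeddings with simple dot-product attention (the actual model learns its attention scores with a neural scoring function; all names and sizes here are hypothetical):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_combine(target, candidates):
    # Attention-weighted sum of candidate embeddings, scored by dot
    # product against the target's basic embedding (a stand-in for
    # the model's learned scoring network).
    scores = np.array([target @ c for c in candidates])
    return softmax(scores) @ np.stack(candidates)

# Toy hierarchy with one basic embedding per node. In GRAM a leaf
# attends only to itself and its ancestors; HAP propagates attention
# so every node can draw on all other nodes in the hierarchy.
rng = np.random.default_rng(0)
basic = {n: rng.normal(size=4) for n in ["root", "parent", "leaf", "sibling"]}
gram_style = attention_combine(basic["leaf"], [basic["leaf"], basic["parent"], basic["root"]])
hap_style = attention_combine(basic["leaf"], list(basic.values()))
```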
For more information, please check our paper:
M. Zhang, C. King, M. Avidan, and Y. Chen, Hierarchical Attention Propagation for Healthcare Representation Learning, Proc. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD-20), 2020. [PDF]
Like GRAM, the code trains an RNN (Gated Recurrent Units) to predict, at each timestep (i.e. visit), the diagnosis/procedure codes occurring in the next visit. The code uses Multi-level Clinical Classification Software for ICD-9-CM as the domain knowledge.
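The prediction task above can be sketched as follows, using a hypothetical toy encoding (one multi-hot vector per visit) rather than the repo's actual data format:

```python
import numpy as np

def next_visit_targets(visits, n_codes):
    # Build, for one patient, multi-hot inputs x_t (codes of visit t)
    # and targets y_t (codes of visit t+1) for next-visit prediction.
    X, Y = [], []
    for t in range(len(visits) - 1):
        x = np.zeros(n_codes)
        x[visits[t]] = 1.0
        y = np.zeros(n_codes)
        y[visits[t + 1]] = 1.0
        X.append(x)
        Y.append(y)
    return np.stack(X), np.stack(Y)

# Three visits yield two (input, target) timesteps for the RNN.
visits = [[0, 2], [1], [2, 3]]
X, Y = next_visit_targets(visits, n_codes=4)
```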
STEP 1: Installation
- Install Python and Theano. We use Python 2.7 and Theano 0.8.2. Theano can be easily installed on Ubuntu as suggested here.
- If you plan to use GPU computation, install CUDA.
STEP 2: Run on MIMIC-III
- You will first need to request access to MIMIC-III, a publicly available electronic health record dataset collected from ICU patients over 11 years.
- You can use "process_mimic.py", located in "data/mimic3/", to process the MIMIC-III dataset and generate a training dataset suitable for HAP. Place the script in the same directory as the MIMIC-III CSV files, and run it with:
python process_mimic.py ADMISSIONS.csv DIAGNOSES_ICD.csv mimic
More instructions are described inside the script. You may use the already processed files included in "data/mimic3/"; otherwise, please copy your generated "mimic.*" files to "data/mimic3/".
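For intuition, here is a simplified sketch of the kind of grouping "process_mimic.py" performs, using toy inputs rather than the real MIMIC-III CSV schema (the real script additionally maps codes to integer ids and serializes the result):

```python
from collections import defaultdict

def group_visits(admissions, diagnoses):
    # admissions: visit id -> (patient id, admit time)
    # diagnoses:  visit id -> list of ICD-9 codes
    # Returns each patient's visits as lists of codes, in time order.
    per_patient = defaultdict(list)
    for hadm, (subject, admittime) in admissions.items():
        per_patient[subject].append((admittime, diagnoses.get(hadm, [])))
    return {subject: [codes for _, codes in sorted(vs)]
            for subject, vs in per_patient.items()}

# Toy stand-ins for ADMISSIONS.csv and DIAGNOSES_ICD.csv rows.
admissions = {101: (1, "2130-01-01"), 102: (1, "2131-06-01"), 201: (2, "2129-03-03")}
diagnoses = {101: ["4019"], 102: ["25000", "4019"], 201: ["41401"]}
seqs = group_visits(admissions, diagnoses)
```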
- Use "build_trees.py" in "data/mimic3/" to build files that contain the ancestor information of each medical code. This requires "ccs_multi_dx_tool_2015.csv" (Multi-level CCS for ICD-9-CM), which can be downloaded from here. We also include it in "data/mimic3/".
Run "build_trees.py" with:
python build_trees.py ccs_multi_dx_tool_2015.csv mimic.seqs mimic.types remap
Running this script re-maps the integer codes assigned to all medical codes, so you also need the ".seqs" file and the ".types" file created by "process_mimic.py". The general form of the command is:
python build_trees.py ccs_multi_dx_tool_2015.csv <seqs file> <types file> <output path>
This builds five files "remap.level#.pk" and a file "remap.p2c", which contain the level information and the parent-to-children mapping extracted from the hierarchy, and replaces the old "mimic.seqs" and "mimic.types" files with correctly re-mapped ones.
- Run HAP using the "remap.seqs" and "remap.p2c" files generated by "build_trees.py". The ".seqs" file contains the sequence of visits for each patient, where each visit consists of multiple diagnosis codes. The command is:
python hap.py data/mimic3/ remap.seqs remap.seqs remap result/mimic3/HAP/ --p2c_file remap.p2c --sep_attention --L2 0 --n_epochs 50
More commands for generating the experimental results are contained in "run_mimic.sh".
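For intuition about the "remap.p2c" file passed via "--p2c_file", the sketch below derives a parent-to-children mapping from CCS-style multi-level category paths. This is a simplification: "build_trees.py" additionally re-maps integer code ids, and the dotted-path format here is only illustrative.

```python
from collections import defaultdict

def build_parent_to_children(paths):
    # For each hierarchy path such as "7.1.2", record every
    # (parent category -> child category) edge along the path.
    p2c = defaultdict(set)
    for path in paths:
        parts = path.split(".")
        for i in range(1, len(parts)):
            parent = ".".join(parts[:i])
            child = ".".join(parts[:i + 1])
            p2c[parent].add(child)
    return {p: sorted(c) for p, c in p2c.items()}

p2c = build_parent_to_children(["7.1.2", "7.1.3", "7.2.1"])
```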
STEP 3: How to pretrain the code embedding
For sequential diagnoses prediction, it is very effective to pretrain the code embeddings with a co-occurrence-based algorithm such as word2vec or GloVe. To pretrain the code embeddings with GloVe, do the following:
- Use "create_glove_comap.py" with the ".seqs" file generated by "build_trees.py". The execution command is:
python create_glove_comap.py remap.seqs remap
This will create a file "cooccurrenceMap.pk" that contains the co-occurrence information of codes and ancestors.
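The counting involved can be sketched as follows. This simplified stand-in only counts pairs of codes appearing in the same visit, whereas "create_glove_comap.py" also counts code-ancestor co-occurrences:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(seqs):
    # seqs: per-patient lists of visits, each visit a list of code ids.
    # Counts how often each unordered pair of codes shares a visit.
    counts = Counter()
    for patient in seqs:
        for visit in patient:
            for a, b in combinations(sorted(set(visit)), 2):
                counts[(a, b)] += 1
    return counts

# Two toy patients: codes 0 and 1 co-occur twice, 0 and 2 once.
seqs = [[[0, 1], [1, 2]], [[0, 1, 2]]]
counts = cooccurrence_counts(seqs)
```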
- Use "glove.py" on the co-occurrence file generated by "create_glove_comap.py". The execution command is:
python glove.py cooccurrenceMap.pk remap pretrained_embedding
- Use the pretrained embeddings when you train HAP by appending "--embed_file pretrained_embedding.npz" to your command.
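A minimal sketch of writing and reading such a ".npz" embedding file with NumPy is shown below; the array key "w" and the embedding shape are assumptions here, so check "glove.py" for the exact format "hap.py" expects:

```python
import os
import tempfile

import numpy as np

# Hypothetical embedding matrix: 10 codes, 8 dimensions.
emb = np.random.default_rng(0).normal(size=(10, 8)).astype("float32")

# Save under the key "w" (an assumption, not the repo's documented key).
path = os.path.join(tempfile.mkdtemp(), "pretrained_embedding.npz")
np.savez(path, w=emb)

# Reload the array by the same key.
loaded = np.load(path)["w"]
```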
If you find the code useful, please cite our paper:
@inproceedings{zhang2020hierarchical,
title={Hierarchical Attention Propagation for Healthcare Representation Learning},
author={Zhang, Muhan and King, Christopher R and Avidan, Michael and Chen, Yixin},
booktitle={Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining},
pages={249--256},
year={2020}
}
Muhan Zhang, Washington University in St. Louis [email protected] 11/2/2020