This submodule contains the data preprocessing pipelines and scripts used for building the MR-RATE dataset, a novel dataset of brain and spine MRI volumes paired with corresponding radiology text reports and metadata.
➡️ To start using the dataset right away, refer to the Dataset Guide and Downloading Dataset.
In MR-RATE, brain and spine MRI examinations are acquired from patients and organized into multiple imaging sequence categories, including T1-weighted, T2-weighted, FLAIR, SWI, and MRA, which constitute the series of a study. Each study is paired with associated metadata and a radiology report produced by radiologists during clinical interpretation. Together, these components make up the MR-RATE dataset for multimodal brain and spine MRI research. The preprocessing steps in this submodule convert raw, heterogeneous clinical data into a clean, anonymized, and spatially standardized collection that is ready for downstream machine learning and neuroscientific research.
```
data-preprocessing/
├── README.md
├── pyproject.toml
├── environment.yml
├── data/
│   ├── raw/                  # Raw PACS CSVs, DICOMs, NIfTIs, mapping Excels
│   ├── interim/              # Intermediate outputs from each step
│   └── processed/            # Final processed studies
├── logs/                     # Per-batch log files
├── run/
│   ├── run_mri_preprocessing.py          # Orchestrates steps 1–5
│   ├── run_mri_upload.py                 # Orchestrates steps 6–7
│   ├── utils.py                          # Shared runner utilities
│   └── configs/
│       └── mri_batch00.yaml              # Batch config template
├── scripts/
│   └── hf/
│       ├── download.py                   # Download MR-RATE batches from Hugging Face
│       └── merge_downloaded_repos.py     # Merge derivative repos into MR-RATE repo on study level
├── src/
│   └── mr_rate_preprocessing/
│       ├── configs/
│       │   ├── config_mri_preprocessing.py    # Pipeline constants and thresholds
│       │   └── config_metadata_columns.json   # DICOM metadata column definitions
│       ├── mri_preprocessing/
│       │   ├── dcm2nii.py                     # Step 1: DICOM-to-NIfTI conversion
│       │   ├── pacs_metadata_filtering.py     # Step 2: metadata filtering
│       │   ├── series_classification.py       # Step 3: series classification
│       │   ├── modality_filtering.py          # Step 4: modality filtering
│       │   ├── brain_segmentation_and_defacing.py   # Step 5: HD-BET + Quickshear
│       │   ├── zip_and_upload.py              # Step 6: zip & upload MRI to HF
│       │   ├── prepare_metadata.py            # Step 7: metadata preparation & upload to HF
│       │   ├── hdbet.py                       # HD-BET brain segmentation wrapper
│       │   ├── quickshear.py                  # Quickshear defacing wrapper
│       │   └── utils.py                       # Shared logging and helper utilities
│       ├── registration/
│       │   ├── registration.py                # ANTs co-registration and atlas registration
│       │   └── upload.py                      # Zip registration outputs and upload to HF
│       └── reports_preprocessing/             # Report anonymization, translation, structuring, QC
│           ├── 01_anonymization/
│           ├── 02_translation/
│           ├── 03_translation_qc/
│           ├── 04_structuring/
│           ├── 05_structure_qc/
│           └── utils/
├── tests/                    # Coming soon
└── figures/                  # Figures for submodule
```
Raw DICOM exports from PACS are noisy, heterogeneous, and contain patient-identifiable information. This stage converts them into clean, anonymized NIfTI volumes, classifies each series by modality, filters out low-quality acquisitions, removes facial features, and uploads the processed volumes along with a cleaned metadata table to Hugging Face.
- DICOM to NIfTI Conversion — Reads a CSV of DICOM folder paths, extracts the `AccessionNumber` from each folder's first DICOM file, and runs `dcm2niix` to produce gzip-compressed NIfTI files and JSON sidecars organized into per-accession subfolders.
- PACS Metadata Filtering — Loads raw DICOM metadata exports from PACS, enforces required columns, retains as many optional columns as possible, and removes rows with missing critical identifiers or duplicate series.
- Series Classification — Assigns each series a modality label (T1w, T2w, SWI, …) and additional flags (`is_derived`, `sequence_family`, …) using a 5-level rule hierarchy: DICOM diffusion tags → vendor-specific sequence IDs → scanning sequence parameters → description keyword matching → numeric fallback.
- Modality Filtering — Filters classified series against acceptance criteria (modality type, acquisition plane, image shape/FOV, patient age) defined in the MRI preprocessing config. Reads NIfTI headers in parallel to measure shape and spacing, constructs standardized modality IDs (e.g. `t1w-raw-sag`), and designates one T1w series per study as the center modality, used later in registration and segmentation.
- Brain Segmentation & Defacing — Using an adapted and parallelized version of the BrainLesion Suite Preprocessing Module (BrainLes-Preprocessing Toolkit), a binary brain mask is predicted for each series with HD-BET, and defacing is then applied with Quickshear to remove identifiable facial features. Brain masks and defacing masks are saved alongside the defaced volumes. For details on adaptations to BrainLes-Preprocessing, see Why is this specific MRI preprocessing?.
- Upload MRI to HF — Validates that all expected modality files (image, brain mask, defacing mask) are present for each study, anonymizes study IDs to de-identified UIDs, zips each processed study folder, and uploads the zip files to the Hugging Face dataset repository in parallel. Supports the Xet high-performance transfer backend.
- Upload Metadata to HF — Validates that all expected modality files are present for each study, merges patient IDs and anonymized study dates from mapping files, drops sensitive columns, and uploads a clean metadata CSV to Hugging Face.
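The 5-level fallback hierarchy used in series classification can be illustrated with a toy rule chain. The tag names, vendor IDs, and TR/TE thresholds below are illustrative stand-ins, not the project's actual rules (those live in `series_classification.py`):

```python
def classify_series(tags: dict) -> str:
    """Toy sketch of a 5-level classification fallback: each level is only
    consulted if the previous one could not decide."""
    # Level 1: DICOM diffusion tags take precedence
    if tags.get("DiffusionBValue") is not None:
        return "dwi"
    # Level 2: vendor-specific sequence IDs (hypothetical mapping)
    vendor_map = {"tfl3d1": "t1w", "spc3d1": "t2w"}
    seq_id = tags.get("SequenceName", "").lower()
    if seq_id in vendor_map:
        return vendor_map[seq_id]
    # Level 3: scanning sequence parameters (toy TR/TE heuristic, ms)
    tr, te = tags.get("RepetitionTime"), tags.get("EchoTime")
    if tr is not None and te is not None:
        if tr < 800 and te < 30:
            return "t1w"
        if tr > 2000 and te > 80:
            return "t2w"
    # Level 4: description keyword matching
    desc = tags.get("SeriesDescription", "").lower()
    for keyword, label in (("flair", "flair"), ("swi", "swi"), ("mra", "mra")):
        if keyword in desc:
            return label
    # Level 5: fallback when nothing matched
    return "unknown"
```

The ordering matters: hard evidence (diffusion tags, sequence IDs) is trusted before soft evidence (free-text descriptions), which keeps misleading series descriptions from overriding acquisition parameters.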
Raw Turkish radiology reports are converted to structured English through an iterative LLM-based pipeline using Qwen3-30B-A3B-FP8 via vLLM. Each step follows a run → automated QC → retry → manual review loop until quality thresholds are met. See reports_preprocessing/README.md for full pipeline documentation.
- Anonymization — Replaces patient names, dates, hospitals, and other PHI with deterministic tokens (`[patient_1]`, `[date_1]`, etc.). Validated to ensure no PHI leakage.
- Translation — Turkish-to-English translation preserving medical terminology, anonymization tokens, and report structure.
- Translation QC — LLM-based quality check for translation completeness and accuracy, rule-based detection of remaining Turkish text, and automated retranslation of failures.
- Structuring — Extracts four sections from each report: `clinical_information`, `technique`, `findings`, and `impression`. Uses a two-pass approach with a no-think fallback for reports where chain-of-thought reasoning exhausts the token budget.
- Structure QC — LLM-based verification comparing structured output against the raw report, checking for missing content, hallucinations, and misplaced sections.
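The deterministic-token idea behind anonymization can be sketched in a few lines: each distinct PHI value maps to a stable numbered token, so repeated mentions get the same token. The regex pattern below is illustrative only; the actual pipeline is LLM-based, not regex-based:

```python
import re

def anonymize(text: str, patterns: dict) -> tuple:
    """Toy sketch of deterministic PHI tokenization: every distinct match
    of a PHI pattern is replaced by a stable token such as [date_1], and
    the value->token mapping is returned for later re-identification
    checks. Patterns here are illustrative stand-ins."""
    mapping = {}
    counters = {}

    def make_sub(kind):
        def _sub(match):
            value = match.group(0)
            if value not in mapping:            # reuse token for repeats
                counters[kind] = counters.get(kind, 0) + 1
                mapping[value] = f"[{kind}_{counters[kind]}]"
            return mapping[value]
        return _sub

    for kind, pattern in patterns.items():
        text = re.sub(pattern, make_sub(kind), text)
    return text, mapping

# Example: dates written as DD.MM.YYYY (hypothetical pattern)
patterns = {"date": r"\b\d{2}\.\d{2}\.\d{4}\b"}
out, mapping = anonymize("MRI on 01.02.2023, follow-up 05.06.2023.", patterns)
```

Determinism is what makes the downstream QC tractable: because the same value always yields the same token, a validator can confirm that no raw PHI string survives in the output.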
Because different modalities within a study are acquired in different orientations and resolutions, they must be spatially aligned before any cross-modal analysis can be performed. Co-registration to a shared T1w reference and subsequent normalization to the MNI152 atlas put all volumes into a common coordinate space. This enables direct voxel-wise comparisons across modalities and subjects, and lets researchers apply the registered data to their downstream tasks without additional alignment steps.
After MRI & Metadata Preprocessing is run, processed and uploaded studies are downloaded to a separate server where registration is performed independently.
- Registration — Following a similar approach to BrainLesion Suite Preprocessing Module (BrainLes-Preprocessing Toolkit), within each study, moving modalities are co-registered to the T1-weighted center modality using ANTs. The center modality is then registered to the MNI152 (ICBM 2009c Nonlinear Symmetric) atlas, and all co-registered modalities are transformed to atlas space.
For details on adaptations to BrainLes-Preprocessing, see Why is this specific MRI preprocessing?.
- Upload to HF — Zips each registered study folder and uploads the zip files to the Hugging Face dataset repository in parallel.
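The per-study registration flow can be summarized as a small orchestration sketch. Here `reg(fixed, moving)` stands in for an ANTs registration call and returns an opaque transform handle; file I/O, transform files on disk, and resampling are all omitted:

```python
def register_study(center, movings, atlas, reg):
    """Sketch of the per-study registration flow described above:
    co-register every moving modality to the T1w center modality, register
    the center to the atlas, then express each volume's path into atlas
    space as a composition of those transforms."""
    # Step 1: moving -> center co-registration, one transform per modality
    coreg = {m: reg(center, m) for m in movings}
    # Step 2: center -> atlas registration (single transform per study)
    to_atlas = reg(atlas, center)
    # Step 3: in atlas space, each moving modality is warped by composing
    # its moving->center transform with the shared center->atlas transform
    atlas_space = {m: [coreg[m], to_atlas] for m in movings}
    atlas_space[center] = [to_atlas]
    return coreg, atlas_space
```

Composing transforms instead of re-registering each modality to the atlas keeps the expensive atlas registration to one call per study and guarantees that all modalities share exactly the same center-to-atlas warp.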
(Coming soon) Similar to registration, after MRI & Metadata Preprocessing is run, processed and uploaded studies are downloaded to a separate server where segmentation is performed independently. Voxel-wise anatomical segmentations are predicted for center modality volumes in native space using the NV-Segment-CTMR model based on VISTA3D, supporting region-of-interest analysis and various downstream tasks.
- Download Repos — Downloads data from any combination of the four MR-RATE Hugging Face repositories (Forithmus/MR-RATE, Forithmus/MR-RATE-coreg, Forithmus/MR-RATE-atlas, Forithmus/MR-RATE-vista-seg) into per-repo output directories under a shared base, with optional concurrent on-the-fly unzipping and zip deletion. Supports resumable batch-level downloads via `snapshot_download` and the Xet high-performance transfer backend.
- Merge Downloaded Repos — Merges extracted study folders from downloaded derivative repos (`MR-RATE-coreg/`, `MR-RATE-atlas/`, `MR-RATE-vista-seg/`) into the base `MR-RATE/` directory by moving each batch in place. Mirrors the interface of `download.py`: same `--output-base`, same modality flags (`--coreg`, `--atlas`, `--vista-seg`), and same `--batches` selector. Filenames across repos are non-colliding by design, so subdirectories that don't yet exist in the destination are renamed wholesale (an instant move), while subdirectories that already exist (e.g. `transform/`) are merged file-by-file.
1. Clone the repository:

   ```bash
   git clone https://github.com/Forithmus/MR-RATE.git
   cd MR-RATE/data-preprocessing
   ```

2. Create and activate the conda environment, then install the package in editable mode:

   ```bash
   conda env create -f environment.yml
   conda activate mr-rate-preprocessing
   pip install -e .
   ```

3. (Optional) Set up Hugging Face credentials. Required for uploading to or downloading from the Hugging Face dataset repositories:

   ```bash
   hf auth login
   # or: export HF_TOKEN=<your_token>
   ```
All MRI pipeline steps are driven by a single YAML config file. Batch configs are located at run/configs/. A template config for batch00 is provided at run/configs/mri_batch00.yaml.
1. Set up your config file by copying `mri_batch00.yaml` and filling in your input/output paths, Hugging Face repo ID, and processing parameters.

2. Run MRI preprocessing (steps 1–5: DICOM conversion → metadata filtering → classification → modality filtering → segmentation & defacing):

   ```bash
   python run/run_mri_preprocessing.py --config run/configs/<your_config>.yaml
   ```
Output structure after step 5:

```
data/raw/
└── batchXX/
    └── batchXX_raw_niftis/
        └── <AccessionNumber>/
            └── <SeriesNumber>_<SeriesDescription>.nii.gz       # Step 1: series of a study as NIfTI files

data/interim/
└── batchXX/
    ├── batchXX_raw_metadata.csv                                # Step 2: filtered PACS metadata
    ├── batchXX_raw_metadata_classified.csv                     # Step 3: per-series modality labels
    ├── batchXX_raw_metadata_classification_summary.csv         # Step 3: classification summary
    ├── batchXX_modalities_to_process.json                      # Step 4: accepted modality list
    └── batchXX_modalities_to_process_metadata.csv              # Step 4: accepted modality metadata

data/processed/
└── batchXX/
    └── <study_id>/
        ├── img/
        │   └── <study_id>_<modality_id>.nii.gz                 # Defaced native images (uint16 or float32)
        └── seg/
            ├── <study_id>_<modality_id>_brain-mask.nii.gz      # Brain masks (uint8)
            └── <study_id>_<modality_id>_defacing-mask.nii.gz   # Defacing masks (uint8)
```
3. Run upload (steps 6–7: zip & upload studies → prepare & upload metadata):

   ```bash
   python run/run_mri_upload.py --config run/configs/<your_config>.yaml
   ```
Output structure after step 7:

```
data/processed/
├── batchXX_metadata.csv          # Anonymized metadata CSV for the batch
└── MR-RATE_batchXX/
    └── mri/
        └── batchXX/
            └── <study_uid>.zip   # Each zip preserves the internal folder structure:
                                  # <study_uid>/
                                  #   img/
                                  #     <study_uid>_<series_id>.nii.gz
                                  #   seg/
                                  #     <study_uid>_<series_id>_brain-mask.nii.gz
                                  #     <study_uid>_<series_id>_defacing-mask.nii.gz
```

Hugging Face repository structure after step 7:

```
<repo_id> (Hugging Face dataset)
├── mri/
│   └── batchXX/
│       └── <study_uid>.zip           # Uploaded by step 6
└── metadata/
    └── batchXX_metadata.csv          # Uploaded by step 7
```
Intermediate outputs are written to the paths defined in your config file, following the data/interim/ → data/processed/ convention.
Each pipeline step is a standalone parallel script designed for SLURM execution. Scripts are located in src/mr_rate_preprocessing/reports_preprocessing/ and documented in its own README.md.
Steps are run sequentially, with each step's output feeding the next. Within each step, the iterative QC loop is repeated until quality thresholds are met:
```bash
# 1. Anonymize raw Turkish reports
srun python src/mr_rate_preprocessing/reports_preprocessing/01_anonymization/anonymize_reports_parallel.py \
    --input_file data/raw/turkish_reports.csv --output_dir anonymized_shards

# 2. Translate to English
srun python src/mr_rate_preprocessing/reports_preprocessing/02_translation/translate_reports_parallel.py \
    --input_file anonymized_reports.csv --output_dir translated_shards

# 3. QC translations, retranslate failures, repeat
srun python src/mr_rate_preprocessing/reports_preprocessing/03_translation_qc/quality_check_parallel.py \
    --input_file translated_reports.csv --output_dir qc_shards

# 4. Structure into sections
srun python src/mr_rate_preprocessing/reports_preprocessing/04_structuring/structure_reports_parallel.py \
    --input_file translated_reports.csv --output_dir structure_shards

# 5. Verify structured output
srun python src/mr_rate_preprocessing/reports_preprocessing/05_structure_qc/qc_llm_verify.py \
    --input_file structured_reports.csv --output_dir qc_verify_shards

# Merge any step's shards
python src/mr_rate_preprocessing/reports_preprocessing/utils/merge_shards.py \
    --shard_dir <shard_dir> --output <merged.csv>
```

There are no runner scripts or config files for the registration pipeline, as there are two blocks to be run independently.
1. Download a batch from Hugging Face:

   ```bash
   python scripts/hf/download.py \
       --batches XX --unzip --delete-zips --no-reports --xet-high-perf
   ```

   See `python scripts/hf/download.py --help` for the full list of options (workers, timeout, output base, etc.).

   Output structure after step 1:

   ```
   data/MR-RATE/
   ├── mri/
   │   └── batchXX/
   │       └── <study_uid>/
   │           ├── img/
   │           │   └── <study_uid>_<series_id>.nii.gz
   │           └── seg/
   │               ├── <study_uid>_<series_id>_brain-mask.nii.gz
   │               └── <study_uid>_<series_id>_defacing-mask.nii.gz
   └── metadata/
       └── batchXX_metadata.csv
   ```
2. Run registration (co-registration to center modality + atlas registration to MNI152):

   ```bash
   python src/mr_rate_preprocessing/registration/registration.py \
       --input-dir data/MR-RATE/mri/batchXX \
       --metadata-csv data/MR-RATE/metadata/batchXX_metadata.csv \
       --output-dir data/MR-RATE-reg \
       --num-processes 4 \
       --threads-per-process 4 \
       --verbose
   ```

   For large batches, studies can be split across multiple independent jobs (e.g., SLURM array jobs) using `--total-partitions` and `--partition-index`.

   Output structure after step 2:

   ```
   data/MR-RATE-reg/
   ├── MR-RATE-coreg_batchXX/
   │   └── mri/
   │       └── batchXX/
   │           └── <study_uid>/
   │               ├── coreg_img/
   │               │   ├── <study_uid>_<center_series_id>.nii.gz        # Center modality (unchanged copy from native) (uint16 or float32)
   │               │   └── <study_uid>_coreg_<moving_series_id>.nii.gz  # Moving modalities warped to center space (float32)
   │               ├── coreg_seg/
   │               │   ├── <study_uid>_<center_series_id>_brain-mask.nii.gz     # Center modality brain mask (unchanged copy from native) (uint8)
   │               │   └── <study_uid>_<center_series_id>_defacing-mask.nii.gz  # Center modality defacing mask (unchanged copy from native) (uint8)
   │               └── transform/
   │                   └── M_coreg_<moving_series_id>.mat               # Moving→center ANTs transform (one per moving modality)
   └── MR-RATE-atlas_batchXX/
       └── mri/
           └── batchXX/
               └── <study_uid>/
                   ├── atlas_img/
                   │   ├── <study_uid>_atlas_<center_series_id>.nii.gz  # Center modality in atlas space (float32)
                   │   └── <study_uid>_atlas_<moving_series_id>.nii.gz  # Moving modalities in atlas space (float32)
                   ├── atlas_seg/
                   │   ├── <study_uid>_atlas_<center_series_id>_brain-mask.nii.gz     # Brain mask in atlas space (uint8)
                   │   └── <study_uid>_atlas_<center_series_id>_defacing-mask.nii.gz  # Defacing mask in atlas space (uint8)
                   └── transform/
                       └── M_atlas_<center_series_id>.mat               # Center→atlas ANTs transform
   ```
3. Zip and upload to Hugging Face:

   ```bash
   # Zip and upload co-registration outputs
   python src/mr_rate_preprocessing/registration/upload.py \
       --input-dir data/MR-RATE-reg/MR-RATE-coreg_batchXX \
       --zip-suffix _coreg \
       --repo-id <repo_id_coreg> \
       --zip-workers 8 --hf-workers 16 --xet-high-perf --verbose

   # Zip and upload atlas outputs
   python src/mr_rate_preprocessing/registration/upload.py \
       --input-dir data/MR-RATE-reg/MR-RATE-atlas_batchXX \
       --zip-suffix _atlas \
       --repo-id <repo_id_atlas> \
       --zip-workers 8 --hf-workers 16 --xet-high-perf --verbose
   ```

   Upload progress is tracked in a `.cache/` folder inside each zipped directory for resumability. Use `--delete-zips` to remove the zipped folder after a successful upload. See `python src/mr_rate_preprocessing/registration/upload.py --help` for the full list of options.

   Output structure after step 3:

   ```
   data/MR-RATE-reg/
   ├── MR-RATE-coreg_batchXX_zipped/
   │   └── mri/
   │       └── batchXX/
   │           └── <study_uid>_coreg.zip   # zip root: <study_uid>/coreg_img/, coreg_seg/, transform/
   └── MR-RATE-atlas_batchXX_zipped/
       └── mri/
           └── batchXX/
               └── <study_uid>_atlas.zip   # zip root: <study_uid>/atlas_img/, atlas_seg/, transform/
   ```

   Hugging Face repository structure after step 3:

   ```
   <repo_id_coreg> (Hugging Face dataset)
   └── mri/
       └── batchXX/
           └── <study_uid>_coreg.zip       # Uploaded by coreg upload

   <repo_id_atlas> (Hugging Face dataset)
   └── mri/
       └── batchXX/
           └── <study_uid>_atlas.zip       # Uploaded by atlas upload
   ```
(Coming soon)
For a detailed overview, refer to the MR-RATE dataset and the Dataset Guide.
1. Follow ⚙️ Installation steps 1 & 3 (if you haven't done so already).

   All four repositories are gated. Make sure you have access.
2. Download Repos

   `scripts/hf/download.py` is a standalone script that downloads any combination of data from the four MR-RATE repositories. Each repo is written to its own subdirectory under `--output-base` (default: `./data`).

   | Flag | Default | Repository | Zip suffix | Output directory |
   | --- | --- | --- | --- | --- |
   | `--native` | on | Forithmus/MR-RATE | — | `./data/MR-RATE/` |
   | `--coreg` | off | Forithmus/MR-RATE-coreg | `_coreg` | `./data/MR-RATE-coreg/` |
   | `--atlas` | off | Forithmus/MR-RATE-atlas | `_atlas` | `./data/MR-RATE-atlas/` |
   | `--vista-seg` | off | Forithmus/MR-RATE-vista-seg | `_vista-seg` | `./data/MR-RATE-vista-seg/` |

   Pass `--no-mri` to disable all MRI downloads (metadata/reports only). Metadata and reports are always fetched from Forithmus/MR-RATE into `./data/MR-RATE/`. Pass `--no-metadata` and/or `--no-reports` to disable metadata and/or reports downloads. Pass `--xet-high-perf` to enable Hugging Face's high-performance Xet transfer backend, which uses all available CPUs and maximum bandwidth. If you haven't deleted zip files, downloads are resumable: `snapshot_download` skips zip files already present locally. After every run, the script compares downloads with the remote repo files and prints a per-batch download status table.

   ```bash
   # Some examples:

   # Download native MRI plus metadata and reports for all batches, unzip and free disk as you go:
   python scripts/hf/download.py \
       --batches all --unzip --delete-zips --xet-high-perf

   # Download metadata plus co-registered and atlas-registered MRI for specific batches, no native, no reports:
   python scripts/hf/download.py \
       --batches 00,01 --no-native --coreg --atlas \
       --no-reports --unzip --delete-zips

   # Download all MRI derivatives with a custom output base:
   python scripts/hf/download.py \
       --native --coreg --atlas --vista-seg \
       --no-metadata --no-reports --output-base /data

   # Check download status for all batches without downloading anything:
   python scripts/hf/download.py --batches all --no-mri --no-metadata --no-reports
   ```

   See `python scripts/hf/download.py --help` for the full list of options (workers, timeout, output base, etc.).

Output structure after downloading all data for batch XX, unzipping and deleting zips:

```
./data/
├── MR-RATE/
│   ├── mri/
│   │   └── batchXX/
│   │       └── <study_uid>/
│   │           ├── img/
│   │           │   └── <study_uid>_<series_id>.nii.gz               # Defaced native-space image (uint16 or float32)
│   │           └── seg/
│   │               ├── <study_uid>_<series_id>_brain-mask.nii.gz    # Brain mask (uint8)
│   │               └── <study_uid>_<series_id>_defacing-mask.nii.gz # Defacing mask (uint8)
│   ├── metadata/
│   │   └── batchXX_metadata.csv
│   └── reports/
│       └── batchXX_reports.csv
├── MR-RATE-coreg/
│   └── mri/
│       └── batchXX/
│           └── <study_uid>/
│               ├── coreg_img/
│               │   ├── <study_uid>_<center_series_id>.nii.gz        # Center modality (unchanged copy from native) (uint16 or float32)
│               │   └── <study_uid>_coreg_<moving_series_id>.nii.gz  # Moving modalities warped to center space (float32)
│               ├── coreg_seg/
│               │   ├── <study_uid>_<center_series_id>_brain-mask.nii.gz     # Center modality brain mask (unchanged copy from native) (uint8)
│               │   └── <study_uid>_<center_series_id>_defacing-mask.nii.gz  # Center modality defacing mask (unchanged copy from native) (uint8)
│               └── transform/
│                   └── M_coreg_<moving_series_id>.mat               # Moving→center ANTs transform (one per moving modality)
├── MR-RATE-atlas/
│   └── mri/
│       └── batchXX/
│           └── <study_uid>/
│               ├── atlas_img/
│               │   ├── <study_uid>_atlas_<center_series_id>.nii.gz  # Center modality in atlas space (float32)
│               │   └── <study_uid>_atlas_<moving_series_id>.nii.gz  # Moving modalities in atlas space (float32)
│               ├── atlas_seg/
│               │   ├── <study_uid>_atlas_<center_series_id>_brain-mask.nii.gz     # Brain mask in atlas space (uint8)
│               │   └── <study_uid>_atlas_<center_series_id>_defacing-mask.nii.gz  # Defacing mask in atlas space (uint8)
│               └── transform/
│                   └── M_atlas_<center_series_id>.mat               # Center→atlas ANTs transform
└── MR-RATE-vista-seg/
    └── mri/
        └── batchXX/
            └── <study_uid>/
                └── seg/
                    └── <study_uid>_<center_series_id>_vista-seg.nii.gz    # Multi-label brain segmentation map
```

Per-batch download status table example printed after downloads:

```
Download Status
════════════════════════════════════════════════════════════════════════════════════════════
Batch     │ native        │ coreg         │ atlas         │ vista-seg     │ metadata│ reports
──────────┼───────────────┼───────────────┼───────────────┼───────────────┼─────────┼────────
batchXX   │ ✅ 120/120    │ ☑️ 120/120 *  │ 🔄 45/120     │ ❌ 0/120      │ ✅      │ ✅
════════════════════════════════════════════════════════════════════════════════════════════
* mixed zip/folder: all studies are present but there are both zips and extracted folders in the same batch
```

Legend: ✅ complete · ☑️ complete but mix of zips and extracted folders · 🔄 partial · ❌ missing · ⚠️ remote listing unavailable
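The per-batch status logic behind a table like the one above can be sketched as a small classifier. The counts and labels are illustrative; the actual script derives them by comparing local files against the remote repository listing:

```python
def batch_status(n_expected: int, n_zips: int, n_folders: int) -> str:
    """Toy sketch of per-batch download status: a batch is complete when
    every expected study is present as either a zip or an extracted
    folder (each study counted once); a complete batch containing both
    forms is flagged as mixed."""
    present = n_zips + n_folders
    if present == 0:
        return "missing"        # ❌
    if present < n_expected:
        return "partial"        # 🔄
    if n_zips and n_folders:
        return "mixed"          # ☑️ complete, but zips and folders coexist
    return "complete"           # ✅
```

Distinguishing "mixed" from "complete" matters in practice: a mixed batch usually means an unzip pass was interrupted, so re-running with `--unzip --delete-zips` converges it to a clean extracted state.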
3. (Optional) Merge Downloaded Repos

   After downloading and unzipping, `scripts/hf/merge_downloaded_repos.py` can consolidate derivative repo contents into `MR-RATE/` on a per-study basis. Each selected derivative repo must already exist under `--output-base`. At least one of `--coreg`, `--atlas`, or `--vista-seg` must be passed.

   ```bash
   # Merge coreg and atlas into native for all batches
   python scripts/hf/merge_downloaded_repos.py --coreg --atlas

   # Merge all derivatives for specific batches only
   python scripts/hf/merge_downloaded_repos.py --coreg --atlas --vista-seg --batches 00,01

   # Custom output base
   python scripts/hf/merge_downloaded_repos.py --coreg --atlas --output-base /data
   ```

   Output structure after merging all derivatives for batch XX:

   ```
   ./data/
   └── MR-RATE/
       └── mri/
           └── batchXX/
               └── <study_uid>/
                   ├── img/                                             # from MR-RATE/
                   ├── seg/
                   │   ├── <study_uid>_<series_id>_brain-mask.nii.gz    # from MR-RATE/
                   │   ├── <study_uid>_<series_id>_defacing-mask.nii.gz # from MR-RATE/
                   │   └── <study_uid>_<series_id>_vista-seg.nii.gz     # merged from MR-RATE-vista-seg/
                   ├── coreg_img/                                       # merged from MR-RATE-coreg/
                   ├── coreg_seg/                                       # merged from MR-RATE-coreg/
                   ├── atlas_img/                                       # merged from MR-RATE-atlas/
                   ├── atlas_seg/                                       # merged from MR-RATE-atlas/
                   └── transform/                                       # merged from MR-RATE-coreg/ and MR-RATE-atlas/
   ```

   See `python scripts/hf/merge_downloaded_repos.py --help` for the full list of options.
4. Quick reference for common operations:

   ```python
   import pandas as pd

   # Load metadata for a batch
   meta = pd.read_csv("data/MR-RATE/metadata/batch00_metadata.csv",
                      dtype={"patient_uid": str}, low_memory=False)

   # Load reports
   reports = pd.read_csv("data/MR-RATE/reports/batch00_reports.csv", low_memory=False)

   # Load patient-level assigned study splits
   splits = pd.read_csv("data/MR-RATE/splits.csv",
                        usecols=["study_uid", "split"], low_memory=False)

   # Apply patient-level assigned study splits
   meta_with_split = meta.merge(splits, on="study_uid")
   train_meta = meta_with_split[meta_with_split["split"] == "train"]

   # Find all series for a study in the train split
   study_series = train_meta[train_meta["study_uid"] == "<study_uid>"]

   # Find the report for a study
   study_report = reports[reports["study_uid"] == "<study_uid>"]

   # Find the center modality series for a study (used in coreg/atlas/segmentation)
   center = meta[(meta["study_uid"] == "<study_uid>") & (meta["is_center_modality"] == True)]
   ```
