CLEAR-evaluator is an end-to-end evaluation pipeline for radiology reports, built on the taxonomy introduced in the CLEAR paper (EMNLP Findings 2025). The pipeline pairs label-level reasoning with fine-grained feature extraction so you can score generated reports against radiologist-grade annotations.
- Covers both condition classification and detailed feature extraction with consistent schema enforcement.
- Supports open-source (vLLM) and closed-source (Azure OpenAI) backends via declarative model configs.
- Ships with orchestration scripts that stage inference, evaluation, and intermediate data hand-offs.
- Produces granular metrics (per-condition F1, QA/IE scores, optional LLM-based scoring) for auditability.
- CLEAR-Bench (coming soon): our expert evaluation dataset, to be released on PhysioNet.
- vLLM: see the official documentation for deployment and performance tuning.
- Responsible AI Use: follow the Responsible Use of MIMIC Data with Online Services like GPT guidelines. We recommend the Azure OpenAI Service for secure commercial model access.
```
.
├── environment.yaml   # Conda environment definition
├── feature/           # Feature extraction prompts, configs, processors
├── label/             # Label extraction prompts, configs, processors
├── main.py            # Orchestrates label + feature pipelines end-to-end
├── models/            # Instructions for fine-tuning bespoke evaluators
├── run.bash           # Convenience wrapper around main.py
├── data/              # (User-provided) report and label CSVs
├── runs/              # Default output directory created at runtime
└── README.md
```
- Linux environment with Python 3.10+ (matching `environment.yaml`).
- Conda (recommended) or an equivalent virtual environment manager.
- GPU with CUDA drivers when running vLLM backends.
- Azure OpenAI subscription when using the Azure processors.
```bash
cd CLEAR-evaluator
conda env create -f environment.yaml
conda activate clear-evaluator
```

Input data is provided as CSV files:

- Generated reports (`--gen-reports`): CSV with at least `study_id` and `report` columns. The `report` field should include both FINDINGS and IMPRESSION sections.
- Reference reports (`--gt-reports`, optional): CSV with the same schema as generated reports. When supplied, the pipeline will derive ground-truth labels and features from the reference run.
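As a quick sanity check before launching the pipeline, you can verify that a reports CSV matches this schema. The snippet below is only a sketch: the file path is a placeholder and the FINDINGS/IMPRESSION check is a heuristic, not something the pipeline enforces verbatim.

```python
import pandas as pd

# Placeholder path -- point this at your own generated-reports CSV.
reports = pd.read_csv("data/generated_reports.csv")

# The pipeline expects at least study_id and report columns.
missing = {"study_id", "report"} - set(reports.columns)
assert not missing, f"Missing required columns: {missing}"

# Reports should carry both FINDINGS and IMPRESSION text; flag rows that
# appear to lack one of the section headers.
has_findings = reports["report"].str.contains("FINDINGS", case=False, na=False)
has_impression = reports["report"].str.contains("IMPRESSION", case=False, na=False)
incomplete = reports[~(has_findings & has_impression)]
print(f"{len(incomplete)} of {len(reports)} reports appear to be missing a section")
```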
Model definitions live in `label/configs/models.py` and `feature/configs/models.py`.

- Azure entries must include `api_key`, `api_version`, `endpoint`, and `deployment`. Optional fields such as `max_tokens` can be added per deployment.
- vLLM entries must include `model_path`, `temperature`, `max_tokens`, and `tensor_parallel_size`. Ensure the model weights are accessible on disk and compatible with your hardware.
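For orientation, entries along the following lines would cover the required fields. This is only a sketch: the actual layout of `models.py` may differ, and every entry name, key value, and path below is a placeholder.

```python
# Illustrative sketch only -- the real label/configs/models.py and
# feature/configs/models.py may organize their entries differently.
MODELS = {
    "gpt-4o-azure": {                        # placeholder Azure entry
        "api_key": "<AZURE_OPENAI_API_KEY>",
        "api_version": "<api-version>",
        "endpoint": "https://<your-resource>.openai.azure.com/",
        "deployment": "<your-deployment-name>",
        "max_tokens": 1024,                  # optional, per deployment
    },
    "llama-3-70b-instruct": {                # placeholder vLLM entry
        "model_path": "/path/to/local/weights",
        "temperature": 0.0,
        "max_tokens": 1024,
        "tensor_parallel_size": 4,           # match your available GPUs
    },
}
```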
Prompts for each stage are defined in the paired `prompts.py` files. You can extend or adjust them to suit new conditions or features.
`run.bash` orchestrates the entire pipeline. Open the script and edit the configuration block at the top to point to your data, models, and preferred output directory before running it.

Key variables inside `run.bash`:

- `GEN_REPORTS` / `GT_REPORTS`: CSVs containing generated and (optional) reference reports with `study_id` and `report` columns. Leave `GT_REPORTS` empty if you do not have references.
- `LABEL_BACKBONE` / `FEATURE_BACKBONE`: choose `azure` or `vllm` for each stage.
- `LABEL_MODEL` / `FEATURE_MODEL`: model identifiers defined in `label/configs/models.py` and `feature/configs/models.py`.
- `OUTPUT_ROOT`: directory where the pipeline writes outputs (`runs/<timestamp>` by default).
- `ENABLE_LLM` and `SCORING_LLM`: toggle the optional LLM-based IE metrics and choose the scoring model.
- `PYTHON_BIN`: interpreter used to run `main.py` (defaults to the active environment).
After updating those values, launch the pipeline with:
```bash
bash run.bash
```

Each run builds `runs/<timestamp>/` with the following structure:
- `generated/labels/tmp/output_labels_<MODEL>.json`: raw label predictions.
- `generated/output_labels_<MODEL>.csv`: normalized label table used for evaluation.
- `generated/filtered_tp_labels_<MODEL>.csv`: positive-condition filter passed to the feature stage.
- `generated/features/tmp/output_feature_<MODEL>.json`: extracted feature set.
- `generated/features/results_qa_avg_<MODEL>.csv`, `results_ie_avg_<MODEL>.csv`: quantitative metrics per feature type.
- `generated/label_metrics_<MODEL>.csv`: label evaluation summary.
When a reference dataset is provided, the same sub-directories are created under `reference/` for comparison and for deriving ground-truth annotations.
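Once a run finishes, the per-model CSVs can be inspected directly. In this sketch the run directory and model tag are placeholders, and the column layout is whatever the evaluators wrote.

```python
from pathlib import Path

import pandas as pd

# Placeholder run directory and model tag -- adjust to your own run.
run_dir = Path("runs/2024-01-01_00-00-00/generated")
model = "gpt-4o-azure"

label_metrics = pd.read_csv(run_dir / f"label_metrics_{model}.csv")
qa_metrics = pd.read_csv(run_dir / "features" / f"results_qa_avg_{model}.csv")
ie_metrics = pd.read_csv(run_dir / "features" / f"results_ie_avg_{model}.csv")

print(label_metrics.head())
print(qa_metrics.head())
print(ie_metrics.head())
```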
- `label/run_label.bash` and `feature/run_feature.bash` show minimal examples for invoking processors in isolation.
- `label/processor/eval.py` reports per-condition positive/negative F1 scores, including `Pos F1`, `Pos F1_5`, `Neg F1`, and micro variants.
- `feature/processor/eval.py` reports QA metrics (`Acc. micro/macro`, `F1 micro/macro`) plus IE metrics (`o1-mini score`, `ROUGE-L`, `BLEU-4`). Pass `--enable_llm_metric` and `--scoring_llm` to compute the LLM-based IE score. A sketch of how the surface-form IE metrics are conventionally computed follows below.
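The following is not the repository's implementation; it only illustrates how ROUGE-L and BLEU-4 are typically computed, assuming the `rouge-score` and `nltk` packages are available.

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge_score import rouge_scorer

reference = "small left pleural effusion"        # example ground-truth feature text
prediction = "left pleural effusion is small"    # example extracted feature text

# ROUGE-L F-measure between reference and prediction.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, prediction)["rougeL"].fmeasure

# BLEU-4 with uniform n-gram weights and smoothing for short strings.
bleu_4 = sentence_bleu(
    [reference.split()],
    prediction.split(),
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)

print(f"ROUGE-L: {rouge_l:.3f}  BLEU-4: {bleu_4:.3f}")
```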
- Ensure vLLM model definitions specify `temperature`, `max_tokens`, and `tensor_parallel_size`; missing fields will trigger runtime errors.
- When using Azure, double-check that environment keys match your active subscription and that the deployment name aligns with the configured model.
- Reports must contain a `report` column with combined FINDINGS and IMPRESSION text; missing sections degrade model performance.
- CLEAR assumes the label schema `{0: negative, 1: positive, -1: unclear}`; normalize upstream data before ingestion to avoid misaligned metrics (a minimal normalization sketch follows below).
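A minimal sketch of that normalization, assuming upstream labels arrive as strings; the file paths, column names, and source values here are hypothetical.

```python
import pandas as pd

# Hypothetical upstream label file; adjust paths and the mapping to your source.
labels = pd.read_csv("data/upstream_labels.csv")

LABEL_MAP = {"negative": 0, "positive": 1, "unclear": -1, "uncertain": -1}

# Map every condition column into the {0: negative, 1: positive, -1: unclear}
# schema that CLEAR expects; unrecognized values fall back to -1.
condition_cols = [c for c in labels.columns if c != "study_id"]
labels[condition_cols] = labels[condition_cols].apply(
    lambda col: col.map(LABEL_MAP).fillna(-1).astype(int)
)
labels.to_csv("data/normalized_labels.csv", index=False)
```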
If you use CLEAR in academic work, please cite the original CLEAR paper linked above.
