🌐 Homepage | 🤗 Dataset | 📑 Paper | 💻 Code | 🤗 Model
- [2025-12-27] We have released the Training and Evaluation Scripts!
- [2025-09-18] RadarQA has been accepted by NeurIPS 2025!
- [2025-09-15] We have released the Model and Dataset!
- [2025-08-12] We have released the Paper!
We introduce RadarQA, an MLLM-based weather forecast analysis method that integrates key physical attributes with detailed assessment reports. We propose a novel and comprehensive task paradigm for multi-modal quality analysis, encompassing both single-frame and sequence inputs under both rating and assessment scenarios. To support training and benchmarking, we design a hybrid annotation pipeline that combines human expert labeling with automated heuristics. With this annotation method, we construct RQA-70K, a large-scale dataset with varying difficulty levels for radar forecast quality evaluation. We further design a multi-stage training strategy that iteratively improves model performance at each stage. Extensive experiments show that RadarQA outperforms existing general MLLMs across all evaluation settings, highlighting its potential for advancing quality analysis in weather prediction.
We conduct training and inference using ms-swift, an awesome framework that supports fine-tuning and deployment of large-scale and multimodal models.
conda create -n ms-swift python=3.10
conda activate ms-swift
git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e .
For evaluation, first set up the environment using the following commands:
conda create -n radarqa_eval python=3.10
conda activate radarqa_eval
pip install openai tqdm bert_score rouge_score nltk evaluate
Then, download the additional tokenizer resources required by the scorers:
| Model | Download |
|---|---|
| bert-base-uncased | google-bert/bert-base-uncased |
| nltk_data | python /RadarQA/eval/download_nltk.py |
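If you prefer to fetch the nltk_data resources by hand instead of running the repository's download_nltk.py, a minimal sketch along these lines should work; the exact resource names (e.g. punkt) are an assumption based on what the metric packages typically require, not taken from the script itself:

```python
# Hypothetical stand-in for eval/download_nltk.py: fetch tokenizer data that
# rouge_score / nltk-based metrics commonly need. Resource names are assumptions.
import nltk

for resource in ["punkt", "wordnet", "omw-1.4"]:
    nltk.download(resource)
```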
bert_score does not support loading models from a local path by default, so a manual modification is required. Please modify the model-loading logic in /miniconda3/envs/RadarQA_eval/lib/python3.10/site-packages/bert_score/scorer.py as follows:
# self._model = get_model(self.model_type, self.num_layers, self.all_layers) # the original code that needs to be commented out
from transformers import AutoModel
self._model = AutoModel.from_pretrained(model_path) # the new code that needs to be added
Also, add model_path to the parameter list of the scorer's __init__ function.
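Put together, the patched loading logic might look roughly like the sketch below. This is an abbreviated stand-in, not the real BERTScorer class; only the lines relevant to the patch are shown, and model_path is the new argument pointing to your local copy of bert-base-uncased:

```python
# Abbreviated sketch of the patched bert_score/scorer.py loading logic.
from transformers import AutoModel

class BERTScorer:                                   # stand-in for the real class
    def __init__(self, model_type=None, num_layers=None, all_layers=False,
                 model_path=None):                  # <- add model_path to the parameter list
        self.model_type = model_type
        self.num_layers = num_layers
        self.all_layers = all_layers
        # self._model = get_model(self.model_type, self.num_layers, self.all_layers)
        self._model = AutoModel.from_pretrained(model_path)  # load from the local path instead
```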
You can fetch the full dataset from RQA-70K. After downloading, unzip the images folder. A valid directory structure should look as follows:
RQA-70K/
└──images/
├── 6k_img_brief/
├── 6k_seq_brief_v1/
├── 6k_seq_frame_v1/
├── 15k_img_detail/
├── 15k_seq_detail_v1/
└── 15k_seq_frame_v1/
The organized data files are provided in the RadarQA/data directory with relative image/video paths. To convert them to absolute paths, use /data/add_prefix.py and set the prefix to the absolute path of RQA-70K.
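If you want to script this step yourself, the conversion amounts to prepending the dataset root to every relative path. The sketch below assumes JSONL files with "images"/"videos" fields and uses hypothetical file names; the actual field names and layout may differ, and /data/add_prefix.py remains the authoritative tool:

```python
# Minimal sketch of the path-prefixing step. Field names and file names are
# assumptions; use /data/add_prefix.py for the real conversion.
import json
import os

PREFIX = "/abs/path/to/RQA-70K"   # absolute path of the unzipped dataset

def add_prefix(in_path: str, out_path: str) -> None:
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            sample = json.loads(line)
            for key in ("images", "videos"):
                if key in sample:
                    sample[key] = [os.path.join(PREFIX, p) for p in sample[key]]
            fout.write(json.dumps(sample, ensure_ascii=False) + "\n")

add_prefix("data/img_brief_train.jsonl", "data/img_brief_train_abs.jsonl")  # hypothetical file names
```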
You can fetch our model from RadarQA-7B, which is a 7B model fine-tuned on Qwen2.5-VL-7B-Instruct.
RadarQA adopts a three-stage training pipeline. In the first stage, we perform supervised fine-tuning on large-scale multimodal data to equip the model with basic task solving capabilities. In the second stage, we use reinforcement learning and carefully design two reward functions for the rating task. In the third stage, we apply post-training with a small set of samples to further refine performance.
All the training scripts are in the train folder. You need to specify the corresponding model path and output path for training. After completing each stage, you need to run merge_lora.sh to merge the LoRA weights and use the merged model path as the input for the next training stage.
| Training Script | Description |
|---|---|
| train_stage_1.sh | Supervised fine-tuning on RQA-70K. |
| train_stage_2.sh | Reinforcement learning based on GRPO for rating tasks. |
| train_stage_3.sh | Post-training to further refine performance. |
| merge_lora.sh | Merge the LoRA weights with the pre-trained weights. |
After merging the LoRA weights, you can directly follow the method provided by ms-swift to load the model for inference. We also provide a batch inference script to facilitate evaluation on the test set:
cd inference
bash inference_img.sh # for image
bash inference_seq.sh # for sequence
For inference with closed-source models, we provide scripts to generate predictions on the test sets of the four tasks. The resulting outputs are saved under the /inference/close_sourced directory. For each task, the files in its corresponding subdirectory are used as follows:
| Script | Description |
|---|---|
| generate.py | The main logic for generating predictions, performing inference in parallel with concurrent.futures. |
| auto_gen.sh | Can be run directly to ensure all samples are generated; handles API connection errors by repeatedly running generate.py. |
| quality_control.py | Format validation to ensure the quality of the generated outputs and maintain fairness in evaluation. |
Run auto_gen.sh to generate all samples first, then run quality_control.py to delete any invalid samples. Repeat this process until all samples pass validation.
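For reference, the parallel pattern used in generate.py (concurrent.futures with an API client) looks roughly like the sketch below. The model name, prompt construction, and output fields here are placeholders rather than the repository's actual settings:

```python
# Illustrative sketch of parallel prediction generation with concurrent.futures.
# Model name, prompt, and output format are placeholders; see
# inference/close_sourced/generate.py for the actual implementation.
import json
from concurrent.futures import ThreadPoolExecutor, as_completed

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY (or a compatible endpoint) is configured

def predict(sample: dict) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": sample["query"]}],
    )
    return {"id": sample["id"], "response": response.choices[0].message.content}

def run(samples: list[dict], out_path: str, workers: int = 8) -> None:
    with ThreadPoolExecutor(max_workers=workers) as pool, open(out_path, "w") as fout:
        futures = [pool.submit(predict, s) for s in samples]
        for future in as_completed(futures):
            fout.write(json.dumps(future.result(), ensure_ascii=False) + "\n")
```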
For rating tasks, we directly calculate accuracy. You need to organize your inference results into a valid JSONL format, with each sample formatted as follows:
{"response": "...", "labels": "...", ...}
Each sample should include both response and labels. Execute the following script to compute accuracy for open-source models:
cd eval/brief/open_source
python img_brief.py # for image
python seq_brief.py # for sequence
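Conceptually, the accuracy computation reduces to comparing each response against its labels over the JSONL file. A minimal sketch is shown below; the exact answer-parsing rules of img_brief.py / seq_brief.py may differ, and the substring match here is only an illustration:

```python
# Minimal sketch of rating accuracy over a JSONL results file.
# The actual parsing rules live in img_brief.py / seq_brief.py.
import json

def rating_accuracy(jsonl_path: str) -> float:
    correct, total = 0, 0
    with open(jsonl_path) as f:
        for line in f:
            sample = json.loads(line)
            pred = sample["response"].strip().lower()
            label = sample["labels"].strip().lower()
            correct += int(label in pred)   # count a hit if the gold rating appears in the response
            total += 1
    return correct / max(total, 1)

print(rating_accuracy("results_img_brief.jsonl"))  # hypothetical file name
```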
For assessment tasks, the evaluation is divided into two types:
- Metric-based evaluation
- LLM-as-a-Judge
Similar to rating tasks, predictions need to be organized into a valid JSONL format. Run the following code to generate scores for the different metrics:
cd eval/detail/open_source
python detail_all.py
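As an illustration of the metric-based side (not the actual detail_all.py), text metrics such as BERTScore and ROUGE-L can be computed over the response/labels pairs like this; the file name is hypothetical:

```python
# Illustrative computation of BERTScore and ROUGE-L over response/labels pairs.
# This is not detail_all.py; it only sketches the metric-based evaluation.
import json

from bert_score import score as bert_score
from rouge_score import rouge_scorer

def load_pairs(jsonl_path):
    preds, refs = [], []
    with open(jsonl_path) as f:
        for line in f:
            sample = json.loads(line)
            preds.append(sample["response"])
            refs.append(sample["labels"])
    return preds, refs

preds, refs = load_pairs("results_img_detail.jsonl")  # hypothetical file name

# BERTScore (the repository loads bert-base-uncased locally via the patch above)
P, R, F1 = bert_score(preds, refs, lang="en")
print("BERTScore F1:", F1.mean().item())

# ROUGE-L
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = sum(scorer.score(r, p)["rougeL"].fmeasure for r, p in zip(refs, preds)) / len(preds)
print("ROUGE-L:", rouge_l)
```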
For LLM-as-a-Judge, we first need to run the following script to generate the LLM evaluation results:
bash auto_gen.sh
After all samples are evaluated, run parse_gpt4_score.py to compute the GPT4-Score.
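How the judge outputs are aggregated depends on the prompt format used, and parse_gpt4_score.py is the authoritative script. Purely as an illustration, extracting a numeric score from each judge response and averaging could look like:

```python
# Illustration only: pull a numeric score out of each judge response and average.
# The actual output format and parsing rules live in parse_gpt4_score.py.
import json
import re

def average_judge_score(jsonl_path: str) -> float:
    scores = []
    with open(jsonl_path) as f:
        for line in f:
            sample = json.loads(line)
            match = re.search(r"\b(\d+(?:\.\d+)?)\b", sample["response"])  # first number in the reply
            if match:
                scores.append(float(match.group(1)))
    return sum(scores) / max(len(scores), 1)

print(average_judge_score("judge_outputs.jsonl"))  # hypothetical file name
```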
📌 For closed-source models, evaluation can be performed by directly using the evaluation scripts in the corresponding closed_source directory.
If you find RadarQA helpful, please cite:
@article{he2025radarqa,
title={RadarQA: Multi-modal Quality Analysis of Weather Radar Forecasts},
author={He, Xuming and You, Zhiyuan and Gong, Junchao and Liu, Couhua and Yue, Xiaoyu and Zhuang, Peiqin and Zhang, Wenlong and Bai, Lei},
journal={arXiv preprint arXiv:2508.12291},
year={2025}
}
For questions or submissions, please open an issue or email [email protected].
