The WorldMedQA team is on a mission to elevate medical AI by refining the benchmarks used to evaluate vision and language models for healthcare. Here's how medical AI is evaluated today, and where current benchmarks fall short:
- 📝 MedQA Datasets: Medical knowledge in AI models is typically evaluated with medical QA datasets such as MedQA, built from multiple-choice questions drawn from licensing exams like the USMLE.
- 📚 Big Data Training: LLMs such as GPT are trained on vast corpora that include medical content from sources like PubMed and other scholarly literature.
- 🩺 Real-world Validity: Existing datasets contain errors that undermine their clinical relevance.
- 🌍 Linguistic Diversity: Many benchmarks lack proper representation of non-English languages.
- 🖼️ Imaging Data: Most medical QA benchmarks don't account for multimodal (text + image) data.
- 🕰️ Training Data Contamination: Older datasets may overlap with LLM training corpora, leading to biased evaluation.
We’ve launched our first dataset, WorldMedQA-V, to help bridge these gaps.
WorldMedQA-V is a multilingual, multimodal, clinically validated dataset of 568 image-based medical question-answer pairs from Brazil, Israel, Japan, and Spain, designed to evaluate vision and language models for healthcare.
It is available now on Hugging Face and GitHub.
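If you want to try it right away, a minimal sketch of loading it with the Hugging Face `datasets` library is below. The repo ID `WorldMedQA/V`, the config/split names, and the field names are assumptions here, so check the dataset card on the Hub for the exact identifiers:

```python
# Minimal loading sketch, assuming the dataset lives on the Hugging Face Hub
# under the repo ID "WorldMedQA/V" (an assumption -- verify on the dataset
# card, which also lists any per-country/language configs).
from datasets import load_dataset

ds = load_dataset("WorldMedQA/V")   # repo ID assumed; may also need a config name
print(ds)                           # inspect available splits

split = list(ds.keys())[0]          # grab the first split
example = ds[split][0]
print(example.keys())               # field names (question, options, answer, image) assumed
```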
Let’s build more equitable, effective, and representative health AI together!