I build retrieval systems that understand images, text, video, and sound - not just literal matches.
I'm a PhD researcher at Virginia Tech, working on vision-language models (VLMs), retrieval-augmented generation (RAG), and ranking/reranking. My focus is multi-prompt (multi-vector) embeddings: many small, controllable "views" of meaning that make search richer, more interpretable, and less prone to collapse (a minimal sketch follows the list below).
- Reasoning in VLMs.
- Cross-modal retrieval across images, text, video, and audio.
- Structured information extraction from multimodal data.
- Knowledge representation for multimodal reasoning.
- Room acoustics: room impulse responses (RIRs) as spatial signals for learning geometry-aware representations.
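
Here is a minimal, self-contained sketch of the multi-prompt idea. The prompts and the `embed` stand-in are hypothetical (a real system would use a trained text/VLM encoder); the point is the shape of the approach: one input becomes several prompt-conditioned vectors, and a ColBERT-style max-sim score lets each query view match its best document view.

```python
import numpy as np

# Hypothetical prompts; each steers the encoder toward a different "view" of meaning.
PROMPTS = [
    "Describe the literal content:",
    "Describe the figurative or idiomatic meaning:",
    "Describe the emotional tone:",
]

def embed(text: str) -> np.ndarray:
    """Stand-in for a real encoder: a random projection seeded by the text,
    deterministic within a process, just to keep the sketch runnable."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

def multi_prompt_embed(text: str) -> np.ndarray:
    # One input -> a matrix of prompt-conditioned embeddings, one row per view.
    return np.stack([embed(f"{p} {text}") for p in PROMPTS])

def score(query: str, doc: str) -> float:
    # Late-interaction (max-sim) scoring: each query view matches its best
    # document view, so no single vector has to carry every meaning.
    q, d = multi_prompt_embed(query), multi_prompt_embed(doc)
    return float((q @ d.T).max(axis=1).sum())
```

Because the views stay separate until scoring, a document can rank highly for its figurative meaning even when its literal embedding is a poor match.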
Real-world queries are polysemous: idioms, metaphor, culture, and context often matter more than surface similarity. I design retrieval pipelines that surface the right connections, not only the nearest neighbor.
- Multi-Prompt Embedding for Retrieval
  - One input -> multiple focused embeddings to boost recall and reduce length/bias collapse.
- RAG + Reranker for Multimodal Search
  - Lightweight bi-encoder retrieval + VLM reader + cross-encoder reranker for better final ranking (see the retrieve-then-rerank sketch after this list).
- Diversity-Aware VLM Retrieval
  - Retrieves multiple perspectives (literal/figurative/emotional/abstract/background) instead of forcing a single vector.
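
As a rough sketch of the retrieve-then-rerank stage (omitting the VLM reader), assuming the `sentence-transformers` library; the model names and the `search` helper are illustrative, not the project's actual configuration:

```python
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

# Illustrative model choices: a cheap bi-encoder for recall,
# a cross-encoder for precise re-scoring of the shortlist.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def search(query: str, corpus: list[str], k: int = 50, top: int = 5) -> list[str]:
    # Stage 1: bi-encoder recall over the whole corpus (fast, approximate).
    doc_emb = bi_encoder.encode(corpus, normalize_embeddings=True)
    q_emb = bi_encoder.encode(query, normalize_embeddings=True)
    idx = np.argsort(-(doc_emb @ q_emb))[:k]
    # Stage 2: cross-encoder rerank of the k candidates only (slow, accurate).
    pairs = [(query, corpus[i]) for i in idx]
    scores = reranker.predict(pairs)
    order = np.argsort(-scores)[:top]
    return [corpus[idx[i]] for i in order]
```

The split matters because the cross-encoder reads the query and each document together, which is far more accurate than comparing precomputed vectors but too slow to run over the full corpus.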
If you are working on diversity-aware retrieval, interpretable VLMs, or multimodal reasoning benchmarks, let's talk.
- Website: https://hanialomari.github.io/
- Google Scholar: https://scholar.google.com/citations?user=Ft_qTcwAAAAJ&hl=en
- LinkedIn: https://www.linkedin.com/in/hanialomari/
- Email: [email protected]
