Mahsa Khosh

Multimodal AI, Gen AI, Reasoning


I am a first-year PhD student in Computer Science at Georgetown University, advised by Dr. Sarah Adel Bargal in the Georgetown University Computer Vision (GUCV) Lab. I also collaborate closely with Dr. Michael Saxon in the UCSB NLP group. My research focuses on multimodal AI systems that perform explicit reasoning over visual and linguistic inputs. I work on vision-language models that go beyond pattern matching to construct interpretable reasoning chains: models that ground their decisions in visual evidence, articulate their intermediate steps, and align their outputs with human cognitive processes.

Specifically, I focus on:

  • Multimodal reasoning: How models can systematically integrate visual scenes, spatial relationships, and textual context to solve complex reasoning tasks requiring geometric awareness and positional inference.
  • Visual grounding and interpretability: Methods for making vision-language models explain their predictions through attention mechanisms, reasoning graphs, or natural language justifications that make spatial and relational structure explicit.
  • Compositional spatial reasoning in VLMs: Training methodologies and architectural designs that decompose visual understanding into explicit spatial primitives, relational graphs, and geometric transformations to enable transparent, verifiable reasoning chains.

My long-term goal is to build multimodal AI systems whose reasoning over visual and linguistic information is explicit, so that their behavior is transparent, interpretable, and aligned with human understanding.