Chih Yao Hu²*
Yang-Sen Lin¹*
Yuna Lee¹
Chih-Hai Su¹
Jie-Ying Lee¹
Shr-Ruei Tsai¹
Chin-Yang Lin¹
Kuan-Wen Chen¹
Tsung-Wei Ke²
Yu-Lun Liu¹
¹National Yang Ming Chiao Tung University   ²National Taiwan University
*Indicates Equal Contribution
CoRL 2025
Zero-shot language-guided UAV control. See, Point, Fly (SPF) enables UAVs to navigate to any goal based on free-form natural language instructions in any environment, without task-specific training. The system demonstrates robust performance across diverse scenarios including obstacle avoidance, long-horizon planning, and dynamic target following.
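In outline (per the paper), SPF treats action prediction as a 2D spatial grounding task: a VLM annotates a 2D waypoint on the current camera frame and predicts a traveling distance, which is converted into a 3D displacement command and executed in a closed loop. The sketch below is only a conceptual illustration of that idea, not SPF's actual code; the drone interface, `query_vlm_for_waypoint`, and the camera intrinsics are hypothetical placeholders.

```python
# Conceptual sketch of SPF's closed-loop idea: a VLM grounds the instruction as a
# 2D waypoint on the current frame plus a travel distance, which is back-projected
# into a 3D displacement command. All names here are hypothetical placeholders.

import numpy as np

def waypoint_to_displacement(u, v, distance, fx, fy, cx, cy):
    """Back-project a 2D pixel waypoint (u, v) at a predicted travel distance
    into a 3D displacement in the camera frame (x right, y down, z forward)."""
    ray = np.array([(u - cx) / fx, (v - cy) / fy, 1.0])
    ray /= np.linalg.norm(ray)
    return distance * ray  # displacement along the viewing ray

def control_loop(drone, instruction, query_vlm_for_waypoint, intrinsics, max_steps=50):
    fx, fy, cx, cy = intrinsics
    for _ in range(max_steps):
        frame = drone.get_frame()                        # current onboard image
        wp = query_vlm_for_waypoint(frame, instruction)  # VLM picks a 2D waypoint + distance
        if wp is None:                                   # e.g. goal reached
            break
        dx, dy, dz = waypoint_to_displacement(wp["u"], wp["v"], wp["distance"], fx, fy, cx, cy)
        drone.move_relative(dx, dy, dz)                  # closed loop: re-observe after each step
```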
- [Oct 17, 2025] See, Point, Fly now supports open-source Microsoft AirSim simulator! Check out the AirSim Mode Configuration section for details.
- [Oct 07, 2025] PMLR has published the proceedings of CoRL 2025, with our paper available here.
- [Sep 29, 2025] Our paper is submitted to 🤗 Hugging Face's daily papers! Check it out and upvote here!
- [Sep 29, 2025] The preprint of our paper is now available on arXiv.
- [Sep 27, 2025] The codebase is now public! Check out the Installation section to get started.
- [Sep 17, 2025] Our paper is now visible to everyone on OpenReview.
- [Aug 29, 2025] Our project page is now live! Check it out here.
- [Aug 01, 2025] Our paper has been accepted by CoRL 2025! We will present our work at the conference in September 2025.
- uv (Python package manager)
- Python 3.13+
- Google Gemini API key or OpenAI-compatible API key
- DJI Tello drone (for real-world testing in `tello` mode)
- DRL Simulator (for simulation testing in `sim` mode)
- Microsoft AirSim simulator (for simulation testing in `airsim` mode)
- Make sure uv is installed. If not, follow the instructions at the uv docs.

- Make sure Python 3.13 is installed. If not, run the following command to install it via uv:

  ```bash
  uv python install 3.13
  ```

- Clone this repository and navigate to the project directory:

  ```bash
  git clone https://github.com/Hu-chih-yao/see-point-fly.git
  cd see-point-fly
  ```

- Sync the project dependencies and activate the virtual environment:

  ```bash
  uv sync
  source .venv/bin/activate
  ```

- Test the installation by running:

  ```bash
  spf --help
  ```

- Follow the steps in the next section to set up the environment variables and configuration files.
After setting up, you can start the system in `tello`, `sim`, or `airsim` mode:

```bash
# Start in tello mode
spf tello

# Start in simulator mode
spf sim

# Start in AirSim mode
spf airsim
```

Copy the `env.example` file to `.env` and fill in the required API keys. You only need to provide the key for the provider you want to use (either Gemini or an OpenAI-compatible API):

```bash
# Gemini API Configuration
GEMINI_API_KEY=your_gemini_api_key_here
# OpenAI compatible API Configuration
OPENAI_API_KEY=your_openai_api_key_here
OPENAI_BASE_URL=https://example.com/api/v1
```

There are three modes (tello, sim, and airsim) available in this project. You can switch between them in the command line when starting the system.
Each mode has its own configuration file (config_tello.yaml, config_sim.yaml, and config_airsim.yaml).
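The exact way SPF wires the `.env` keys and the per-mode YAML files together isn't shown here; the sketch below is one plausible reading of the configuration described in this section, assuming python-dotenv and PyYAML, with `select_mode_config` as a hypothetical helper name.

```python
# Minimal sketch of loading the .env keys together with a per-mode YAML config.
# python-dotenv and PyYAML are assumptions; the actual SPF entry point may differ.

import os
import yaml                      # PyYAML
from dotenv import load_dotenv   # python-dotenv

def select_mode_config(mode: str) -> dict:
    load_dotenv()                                # reads GEMINI_API_KEY / OPENAI_API_KEY from .env
    with open(f"config_{mode}.yaml") as f:       # config_tello.yaml, config_sim.yaml, or config_airsim.yaml
        cfg = yaml.safe_load(f)

    provider = cfg.get("api_provider", "gemini")
    key_var = "GEMINI_API_KEY" if provider == "gemini" else "OPENAI_API_KEY"
    if not os.environ.get(key_var):
        raise RuntimeError(f"{key_var} must be set in .env for api_provider={provider}")
    return cfg

# e.g. cfg = select_mode_config("tello")
```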
Update config_tello.yaml as needed:

```yaml
# Tello Drone Configuration
# This file controls the operational mode and behavior of the Tello drone system
# API Provider Configuration
# Choose between "gemini" or "openai" (OpenAI compatible API)
api_provider: "gemini" # or "openai"
# Model Configuration
# Specify the exact model name to use (overrides operational_mode defaults)
# Leave empty to use operational_mode defaults
model_name: "" # e.g., "gemini-2.5-flash", "gemini-2.5-pro", "openai/gpt-4.1"
# Operational Mode Configuration
# adaptive_mode: Original version with depth estimation and adaptive navigation
# obstacle_mode: Enhanced version with obstacle detection and intensive keepalive
operational_mode: "adaptive_mode" # Change to "obstacle_mode" for enhanced obstacle detection
#operational_mode: "obstacle_mode"
# Processing Configuration
command_loop_delay: 2 # Delay in seconds between processing cycles
```

| Mode | Best For | Default AI Model | Safety Features |
|---|---|---|---|
| `adaptive_mode` | Indoor precision tasks | Gemini 2.5 Flash | Standard error handling |
| `obstacle_mode` | Complex environments | Gemini 2.5 Pro | Enhanced safety + obstacle detection |
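In tello mode, the `command_loop_delay` setting above paces the sense-plan-act cycle. The sketch below shows what such a loop could look like using the djitellopy library; djitellopy and `plan_rc_command` are assumptions for illustration, and SPF's actual Tello backend and planner interface may differ.

```python
# Sketch of how command_loop_delay might pace a tello-mode control loop.
# djitellopy is an assumed drone interface; plan_rc_command is a hypothetical planner.

import time
from djitellopy import Tello  # assumed third-party dependency

def run_tello_loop(plan_rc_command, command_loop_delay=2.0, steps=30):
    tello = Tello()
    tello.connect()
    tello.streamon()
    tello.takeoff()
    try:
        for _ in range(steps):
            frame = tello.get_frame_read().frame      # latest camera frame
            lr, fb, ud, yaw = plan_rc_command(frame)  # VLM-derived RC velocities (-100..100)
            tello.send_rc_control(lr, fb, ud, yaw)
            time.sleep(command_loop_delay)            # throttle between processing cycles
    finally:
        tello.send_rc_control(0, 0, 0, 0)
        tello.land()
```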
Simulator mode uses the Gemini 2.5 Flash model by default, but you can specify a different model if desired.
Update config_sim.yaml as needed:

```yaml
# Simulator Navigation Configuration
# API Provider Configuration
# Choose between "gemini" or "openai" (OpenAI compatible API)
api_provider: "gemini" # or "openai"
# Model Configuration
# Specify the exact model name to use
# Leave empty to use default model for the provider
model_name: "" # e.g., "gemini-2.5-flash", "gemini-2.5-pro", "openai/gpt-4.1"
# Navigation Mode
adaptive_mode: false # Enable adaptive depth-based movement (true/false)
# Processing Configuration
command_loop_delay: 0 # seconds between processing cycles
# Display Configuration
monitor: 1 # monitor index to capture (1=primary monitor)
```
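The `monitor` option above implies that sim mode grabs frames from the screen where the DRL simulator is displayed. A minimal capture sketch using the mss library follows; mss and `grab_sim_frame` are assumptions for illustration, not necessarily SPF's actual capture backend.

```python
# Minimal screen-capture sketch for sim mode. The mss library is an assumption;
# monitor index 1 corresponds to the primary monitor, matching the config above.

import numpy as np
import mss

def grab_sim_frame(monitor_index: int = 1) -> np.ndarray:
    with mss.mss() as sct:
        mon = sct.monitors[monitor_index]   # index 1 = primary monitor
        shot = sct.grab(mon)
        return np.array(shot)[:, :, :3]     # BGRA -> BGR image array for downstream processing
```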
AirSim mode requires the Microsoft AirSim simulator to be installed and running.

- Install AirSim: Follow the AirSim installation guide for your platform.

- Configure Camera Settings: AirSim's default camera resolution (256x144) is too low for effective navigation. You need to configure a higher resolution:

  - Copy the example settings file from `src/spf/airsim/settings.json.example` to the directory where the AirSim executable is launched, and rename it to `settings.json`.
  - The example provides 1920x1080 resolution, which is recommended for best results.
  - Restart AirSim after updating settings.

  Refer to AirSim Settings for details; a minimal sketch of the relevant camera fields is shown after this list.
- Launch AirSim: Start AirSim with your preferred environment before running SPF.
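For reference, the camera-resolution fields the steps above refer to look roughly like this. The bundled `src/spf/airsim/settings.json.example` is the authoritative file; the Python helper below is only an illustrative way to produce an equivalent `settings.json`.

```python
# Illustration of the camera-resolution fields that matter in AirSim's settings.json.
# The repository's settings.json.example remains the authoritative reference.

import json
from pathlib import Path

def write_airsim_settings(path: Path, width: int = 1920, height: int = 1080) -> None:
    settings = {
        "SettingsVersion": 1.2,
        "SimMode": "Multirotor",
        "CameraDefaults": {
            "CaptureSettings": [
                {"ImageType": 0, "Width": width, "Height": height}  # scene camera resolution
            ]
        },
    }
    path.write_text(json.dumps(settings, indent=2))

# e.g. write_airsim_settings(Path("settings.json"))  # placed where AirSim is launched, per the steps above
```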
Update config_airsim.yaml as needed:

```yaml
# AirSim Navigation Configuration
# API Provider Configuration
# Choose between "gemini" or "openai" (OpenAI compatible API)
api_provider: "gemini" # or "openai"
# Model Configuration
# Specify the exact model name to use
# Leave empty to use default model for the provider
model_name: "" # e.g., "gemini-2.5-flash", "gemini-2.5-pro", "openai/gpt-4.1"
# Navigation Mode
adaptive_mode: false # Enable adaptive depth-based movement (true/false)
# Processing Configuration
command_loop_delay: 0 # seconds between processing cycles
# Movement Configuration
base_velocity: 2.0 # base velocity in m/s for drone movement
base_yaw_rate: 30.0 # base yaw rate in degrees/s for rotation
min_command_duration: 2.0 # minimum duration for movement commands in seconds
# AirSim Configuration
camera_name: "0" # Camera name/ID in AirSim
# Wind Configuration (NED frame: North, East, Down in m/s)
wind_x: 0.0 # Wind in North direction
wind_y: 0.0 # Wind in East direction
wind_z: 0.0 # Wind in Down direction
```

This research was funded by the National Science and Technology Council, Taiwan, under Grants NSTC 113-2628-E-A49-023- and 111-2628-E-A49-018-MY4. The authors are grateful to Google, NVIDIA, and MediaTek Inc. for their generous donations. Yu-Lun Liu acknowledges the Yushan Young Fellow Program by the MOE in Taiwan.
Read the LICENSE file for details.
```bibtex
@inproceedings{pmlr-v305-hu25e,
title = {See, Point, Fly: A Learning-Free VLM Framework for Universal Unmanned Aerial Navigation},
author = {Hu, Chih Yao and Lin, Yang-Sen and Lee, Yuna and Su, Chih-Hai and Lee, Jie-Ying and Tsai, Shr-Ruei and Lin, Chin-Yang and Chen, Kuan-Wen and Ke, Tsung-Wei and Liu, Yu-Lun},
year = 2025,
month = {27--30 Sep},
booktitle = {Proceedings of The 9th Conference on Robot Learning},
publisher = {PMLR},
series = {Proceedings of Machine Learning Research},
volume = 305,
pages = {4697--4708},
url = {https://proceedings.mlr.press/v305/hu25e.html},
editor = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
pdf = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/hu25e/hu25e.pdf},
abstract = {We present See, Point, Fly (SPF), a training-free aerial vision-and-language navigation (AVLN) framework built atop vision-language models (VLMs). SPF is capable of navigating to any goal based on any type of free-form instructions in any kind of environment. In contrast to existing VLM-based approaches that treat action prediction as a text generation task, our key insight is to consider action prediction for AVLN as a 2D spatial grounding task. SPF harness VLMs to decompose vague language instructions into iterative annotation of 2D waypoints on the input image. Along with the predicted traveling distance, SPF transforms predicted 2D waypoints into 3D displacement vectors as action commands for UAVs. Moreover, SPF also adaptively adjusts the traveling distance to facilitate more efficient navigation. Notably, SPF performs navigation in a closed-loop control manner, enabling UAVs to follow dynamic targets in dynamic environments. SPF sets a new state of the art in DRL simulation benchmark, out performing the previous best method by an absolute margin of 63%. In extensive real-world evaluations, SPF outperforms strong baselines by a large margin. We also conduct comprehensive ablation studies to highlight the effectiveness of our design choice. Lastly, SPF shows remarkable generalization to different VLMs.}
}
```