Chih Yao Hu²*
Yang-Sen Lin¹*
Yuna Lee¹
Chih-Hai Su¹
Jie-Ying Lee¹
Shr-Ruei Tsai¹
Chin-Yang Lin¹
Kuan-Wen Chen¹
Tsung-Wei Ke²
Yu-Lun Liu¹
¹National Yang Ming Chiao Tung University   ²National Taiwan University
*Indicates Equal Contribution
CoRL 2025
Zero-shot language-guided UAV control. See, Point, Fly (SPF) enables UAVs to navigate to any goal based on free-form natural language instructions in any environment, without task-specific training. The system demonstrates robust performance across diverse scenarios including obstacle avoidance, long-horizon planning, and dynamic target following.
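In outline (per the paper), SPF treats action prediction as a 2D spatial grounding task: a VLM annotates a 2D waypoint on the current camera frame and predicts a traveling distance, which is converted into a 3D displacement command and executed in a closed loop. The sketch below is only a conceptual illustration of that idea, not SPF's actual code; the drone interface, `query_vlm_for_waypoint`, and the camera intrinsics are hypothetical placeholders.

```python
# Conceptual sketch of SPF's closed-loop idea: a VLM grounds the instruction as a
# 2D waypoint on the current frame plus a travel distance, which is back-projected
# into a 3D displacement command. All names here are hypothetical placeholders.

import numpy as np

def waypoint_to_displacement(u, v, distance, fx, fy, cx, cy):
    """Back-project a 2D pixel waypoint (u, v) at a predicted travel distance
    into a 3D displacement in the camera frame (x right, y down, z forward)."""
    ray = np.array([(u - cx) / fx, (v - cy) / fy, 1.0])
    ray /= np.linalg.norm(ray)
    return distance * ray  # displacement along the viewing ray

def control_loop(drone, instruction, query_vlm_for_waypoint, intrinsics, max_steps=50):
    fx, fy, cx, cy = intrinsics
    for _ in range(max_steps):
        frame = drone.get_frame()                        # current onboard image
        wp = query_vlm_for_waypoint(frame, instruction)  # VLM picks a 2D waypoint + distance
        if wp is None:                                   # e.g. goal reached
            break
        dx, dy, dz = waypoint_to_displacement(wp["u"], wp["v"], wp["distance"], fx, fy, cx, cy)
        drone.move_relative(dx, dy, dz)                  # closed loop: re-observe after each step
```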
- [Oct 17, 2025] See, Point, Fly now supports open-source Microsoft AirSim simulator! Check out the AirSim Mode Configuration section for details.
- [Oct 07, 2025] PMLR has published the proceedings of CoRL 2025, with our paper available here.
- [Sep 29, 2025] Our paper is submitted to 🤗 Hugging Face's daily papers! Check it out and upvote here!
- [Sep 29, 2025] The preprint of our paper is now available on arXiv.
- [Sep 27, 2025] The codebase is now public! Check out the Installation section to get started.
- [Sep 17, 2025] Our paper is now visible to everyone on OpenReview.
- [Aug 29, 2025] Our project page is now live! Check it out here.
- [Aug 01, 2025] Our paper has been accepted by CoRL 2025! We will present our work at the conference in September 2025.
- uv (Python package manager)
- Python 3.13+
- Google Gemini API key or OpenAI-compatible API key
- DJI Tello drone (for real-world testing in `tello` mode)
- DRL Simulator (for simulation testing in `sim` mode)
- Microsoft AirSim simulator (for simulation testing in `airsim` mode)
- Make sure uv is installed. If not, follow the instructions at the uv docs.

- Make sure Python 3.13 is installed. If not, run the following command to install it via uv:

  ```bash
  uv python install 3.13
  ```

- Clone this repository and navigate to the project directory:

  ```bash
  git clone https://github.com/Hu-chih-yao/see-point-fly.git
  cd see-point-fly
  ```

- Sync the project dependencies and activate the virtual environment:

  ```bash
  uv sync
  source .venv/bin/activate
  ```

- Test the installation by running:

  ```bash
  spf --help
  ```

- Follow the steps in the next section to set up the environment variables and configuration files.
After setting up, you can start the system in `tello`, `sim`, or `airsim` mode:

```bash
# Start in tello mode
spf tello

# Start in simulator mode
spf sim

# Start in AirSim mode
spf airsim
```

Copy the `env.example` file to `.env` and fill in the required API keys. You only need to provide the key for the provider you want to use (either Gemini or an OpenAI-compatible API):

```bash
# Gemini API Configuration
GEMINI_API_KEY=your_gemini_api_key_here
# OpenAI compatible API Configuration
OPENAI_API_KEY=your_openai_api_key_here
OPENAI_BASE_URL=https://example.com/api/v1
```

There are three modes (tello, sim, and airsim) available in this project. You can switch between them in the command line when starting the system.
Each mode has its own configuration file (config_tello.yaml, config_sim.yaml, and config_airsim.yaml).
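The exact way SPF wires the `.env` keys and the per-mode YAML files together isn't shown here; the sketch below is one plausible reading of the configuration described in this section, assuming python-dotenv and PyYAML, with `select_mode_config` as a hypothetical helper name.

```python
# Minimal sketch of loading the .env keys together with a per-mode YAML config.
# python-dotenv and PyYAML are assumptions; the actual SPF entry point may differ.

import os
import yaml                      # PyYAML
from dotenv import load_dotenv   # python-dotenv

def select_mode_config(mode: str) -> dict:
    load_dotenv()                                # reads GEMINI_API_KEY / OPENAI_API_KEY from .env
    with open(f"config_{mode}.yaml") as f:       # config_tello.yaml, config_sim.yaml, or config_airsim.yaml
        cfg = yaml.safe_load(f)

    provider = cfg.get("api_provider", "gemini")
    key_var = "GEMINI_API_KEY" if provider == "gemini" else "OPENAI_API_KEY"
    if not os.environ.get(key_var):
        raise RuntimeError(f"{key_var} must be set in .env for api_provider={provider}")
    return cfg

# e.g. cfg = select_mode_config("tello")
```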
Update config_tello.yaml as needed:

```yaml
# Tello Drone Configuration
# This file controls the operational mode and behavior of the Tello drone system
# API Provider Configuration
# Choose between "gemini" or "openai" (OpenAI compatible API)
api_provider: "gemini" # or "openai"
# Model Configuration
# Specify the exact model name to use (overrides operational_mode defaults)
# Leave empty to use operational_mode defaults
model_name: "" # e.g., "gemini-2.5-flash", "gemini-2.5-pro", "openai/gpt-4.1"
# Operational Mode Configuration
# adaptive_mode: Original version with depth estimation and adaptive navigation
# obstacle_mode: Enhanced version with obstacle detection and intensive keepalive
operational_mode: "adaptive_mode" # Change to "obstacle_mode" for enhanced obstacle detection
#operational_mode: "obstacle_mode"
# Processing Configuration
command_loop_delay: 2 # Delay in seconds between processing cycles
```

| Mode | Best For | Default AI Model | Safety Features |
|---|---|---|---|
| `adaptive_mode` | Indoor precision tasks | Gemini 2.5 Flash | Standard error handling |
| `obstacle_mode` | Complex environments | Gemini 2.5 Pro | Enhanced safety + obstacle detection |
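In tello mode, the `command_loop_delay` setting above paces the sense-plan-act cycle. The sketch below shows what such a loop could look like using the djitellopy library; djitellopy and `plan_rc_command` are assumptions for illustration, and SPF's actual Tello backend and planner interface may differ.

```python
# Sketch of how command_loop_delay might pace a tello-mode control loop.
# djitellopy is an assumed drone interface; plan_rc_command is a hypothetical planner.

import time
from djitellopy import Tello  # assumed third-party dependency

def run_tello_loop(plan_rc_command, command_loop_delay=2.0, steps=30):
    tello = Tello()
    tello.connect()
    tello.streamon()
    tello.takeoff()
    try:
        for _ in range(steps):
            frame = tello.get_frame_read().frame      # latest camera frame
            lr, fb, ud, yaw = plan_rc_command(frame)  # VLM-derived RC velocities (-100..100)
            tello.send_rc_control(lr, fb, ud, yaw)
            time.sleep(command_loop_delay)            # throttle between processing cycles
    finally:
        tello.send_rc_control(0, 0, 0, 0)
        tello.land()
```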
Simulator mode uses the Gemini 2.5 Flash model by default, but you can specify a different model if desired.
Update config_sim.yaml as needed:

```yaml
# Simulator Navigation Configuration
# API Provider Configuration
# Choose between "gemini" or "openai" (OpenAI compatible API)
api_provider: "gemini" # or "openai"
# Model Configuration
# Specify the exact model name to use
# Leave empty to use default model for the provider
model_name: "" # e.g., "gemini-2.5-flash", "gemini-2.5-pro", "openai/gpt-4.1"
# Navigation Mode
adaptive_mode: false # Enable adaptive depth-based movement (true/false)
# Processing Configuration
command_loop_delay: 0 # seconds between processing cycles
# Display Configuration
monitor: 1 # monitor index to capture (1=primary monitor)
```
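The `monitor` option above implies that sim mode grabs frames from the screen where the DRL simulator is displayed. A minimal capture sketch using the mss library follows; mss and `grab_sim_frame` are assumptions for illustration, not necessarily SPF's actual capture backend.

```python
# Minimal screen-capture sketch for sim mode. The mss library is an assumption;
# monitor index 1 corresponds to the primary monitor, matching the config above.

import numpy as np
import mss

def grab_sim_frame(monitor_index: int = 1) -> np.ndarray:
    with mss.mss() as sct:
        mon = sct.monitors[monitor_index]   # index 1 = primary monitor
        shot = sct.grab(mon)
        return np.array(shot)[:, :, :3]     # BGRA -> BGR image array for downstream processing
```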
AirSim mode requires the Microsoft AirSim simulator to be installed and running.

- Install AirSim: Follow the AirSim installation guide for your platform.

- Configure Camera Settings: AirSim's default camera resolution (256x144) is too low for effective navigation. You need to configure a higher resolution:

  - Copy the example settings file from `src/spf/airsim/settings.json.example` to the directory where the AirSim executable is launched, and rename it to `settings.json`.
  - The example provides 1920x1080 resolution, which is recommended for best results.
  - Restart AirSim after updating settings.

  Refer to AirSim Settings for details; a minimal sketch of the relevant camera fields is shown after this list.
- Launch AirSim: Start AirSim with your preferred environment before running SPF.
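For reference, the camera-resolution fields the steps above refer to look roughly like this. The bundled `src/spf/airsim/settings.json.example` is the authoritative file; the Python helper below is only an illustrative way to produce an equivalent `settings.json`.

```python
# Illustration of the camera-resolution fields that matter in AirSim's settings.json.
# The repository's settings.json.example remains the authoritative reference.

import json
from pathlib import Path

def write_airsim_settings(path: Path, width: int = 1920, height: int = 1080) -> None:
    settings = {
        "SettingsVersion": 1.2,
        "SimMode": "Multirotor",
        "CameraDefaults": {
            "CaptureSettings": [
                {"ImageType": 0, "Width": width, "Height": height}  # scene camera resolution
            ]
        },
    }
    path.write_text(json.dumps(settings, indent=2))

# e.g. write_airsim_settings(Path("settings.json"))  # placed where AirSim is launched, per the steps above
```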
Update config_airsim.yaml as needed:

```yaml
# AirSim Navigation Configuration
# API Provider Configuration
# Choose between "gemini" or "openai" (OpenAI compatible API)
api_provider: "gemini" # or "openai"
# Model Configuration
# Specify the exact model name to use
# Leave empty to use default model for the provider
model_name: "" # e.g., "gemini-2.5-flash", "gemini-2.5-pro", "openai/gpt-4.1"
# Navigation Mode
adaptive_mode: false # Enable adaptive depth-based movement (true/false)
# Processing Configuration
command_loop_delay: 0 # seconds between processing cycles
# Movement Configuration
base_velocity: 2.0 # base velocity in m/s for drone movement
base_yaw_rate: 30.0 # base yaw rate in degrees/s for rotation
min_command_duration: 2.0 # minimum duration for movement commands in seconds
# AirSim Configuration
camera_name: "0" # Camera name/ID in AirSim
# Wind Configuration (NED frame: North, East, Down in m/s)
wind_x: 0.0 # Wind in North direction
wind_y: 0.0 # Wind in East direction
wind_z: 0.0 # Wind in Down direction
```

This research was funded by the National Science and Technology Council, Taiwan, under Grants NSTC 113-2628-E-A49-023- and 111-2628-E-A49-018-MY4. The authors are grateful to Google, NVIDIA, and MediaTek Inc. for their generous donations. Yu-Lun Liu acknowledges the Yushan Young Fellow Program by the MOE in Taiwan.
Read the LICENSE file for details.
```bibtex
@inproceedings{pmlr-v305-hu25e,
title = {See, Point, Fly: A Learning-Free VLM Framework for Universal Unmanned Aerial Navigation},
author = {Hu, Chih Yao and Lin, Yang-Sen and Lee, Yuna and Su, Chih-Hai and Lee, Jie-Ying and Tsai, Shr-Ruei and Lin, Chin-Yang and Chen, Kuan-Wen and Ke, Tsung-Wei and Liu, Yu-Lun},
year = 2025,
month = {27--30 Sep},
booktitle = {Proceedings of The 9th Conference on Robot Learning},
publisher = {PMLR},
series = {Proceedings of Machine Learning Research},
volume = 305,
pages = {4697--4708},
url = {https://proceedings.mlr.press/v305/hu25e.html},
editor = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
pdf = {https://raw.githubusercontent.com/mlresearch/v305/main/assets/hu25e/hu25e.pdf},
abstract = {We present See, Point, Fly (SPF), a training-free aerial vision-and-language navigation (AVLN) framework built atop vision-language models (VLMs). SPF is capable of navigating to any goal based on any type of free-form instructions in any kind of environment. In contrast to existing VLM-based approaches that treat action prediction as a text generation task, our key insight is to consider action prediction for AVLN as a 2D spatial grounding task. SPF harness VLMs to decompose vague language instructions into iterative annotation of 2D waypoints on the input image. Along with the predicted traveling distance, SPF transforms predicted 2D waypoints into 3D displacement vectors as action commands for UAVs. Moreover, SPF also adaptively adjusts the traveling distance to facilitate more efficient navigation. Notably, SPF performs navigation in a closed-loop control manner, enabling UAVs to follow dynamic targets in dynamic environments. SPF sets a new state of the art in DRL simulation benchmark, out performing the previous best method by an absolute margin of 63%. In extensive real-world evaluations, SPF outperforms strong baselines by a large margin. We also conduct comprehensive ablation studies to highlight the effectiveness of our design choice. Lastly, SPF shows remarkable generalization to different VLMs.}
}
```