[NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering.]
Loick Chambon, Paul Couairon, Eloi Zablocki, Alexandre Boulch, Nicolas Thome, Matthieu Cord.
Valeo.ai, Sorbonne University, CNRS.
Upsample features from any Vision Foundation Model (VFM), zero-shot, to high resolution with NAF (Neighborhood Attention Filtering), and obtain state-of-the-art results on multiple downstream tasks across VFM families, model sizes, and datasets:
| Method | Semantic Seg. | Depth Est. | Open Vocab. | Video Prop. | ⚡ FPS | 📏 Max Ratio |
|---|---|---|---|---|---|---|
| FeatUp | 4th | 4th | 3rd | 4th | 🥈 | 🥈 |
| JAFAR | 🥈 | 3rd | 🥈 | 🥇 | 3rd | 4th |
| AnyUp | 3rd | 🥈 | 4th | 3rd | 3rd | 3rd |
| NAF (ours) | 🥇 | 🥇 | 🥇 | 🥈 | 🥇 | 🥇 |
🏆 Performance Summary: Ranks (🥇 First · 🥈 Second)
Three simple steps:
- Select any Vision Foundation Model (DINOv3, DINOv2, RADIO, FRANCA, PE-CORE, CLIP, SAM, etc.)
- Choose your target resolution (up to 2K)
- Upsample features with NAF — zero-shot, no retraining needed
Why it works: NAF combines classical filtering theory with modern attention mechanisms, learning adaptive kernels through Fourier space transformations.
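For intuition, here is a minimal sketch of attention-based filtering in plain PyTorch. It is not the official NAF implementation: it attends densely over the whole low-resolution map instead of using NATTEN's neighborhood attention, it omits RoPE, the `guidance` tensor simply stands in for an encoding of the high-resolution image, and the function name is ours.

```python
import torch

def attention_filter_upsample(guidance: torch.Tensor, lr_feats: torch.Tensor) -> torch.Tensor:
    """Upsample lr_feats (B, C, h, w) to the spatial size of guidance (B, C, H, W).

    Each high-resolution pixel builds a query from the guidance encoding, attends to the
    low-resolution features, and the softmax weights act as an adaptive filtering kernel.
    """
    B, C, H, W = guidance.shape
    q = guidance.flatten(2).transpose(1, 2)                        # (B, H*W, C) queries
    k = lr_feats.flatten(2).transpose(1, 2)                        # (B, h*w, C) keys
    v = k                                                          # values are the low-res features
    attn = torch.softmax(q @ k.transpose(1, 2) / C**0.5, dim=-1)   # adaptive per-pixel weights
    out = attn @ v                                                 # (B, H*W, C)
    return out.transpose(1, 2).reshape(B, C, H, W)

# Toy example with random tensors.
guidance = torch.randn(1, 64, 128, 128)   # stands in for an encoding of the high-res image
lr_feats = torch.randn(1, 64, 16, 16)     # low-resolution VFM features
print(attention_filter_upsample(guidance, lr_feats).shape)  # torch.Size([1, 64, 128, 128])
```

In NAF, restricting this attention to a local neighborhood around each query (via NATTEN) is what keeps the operation efficient at high output resolutions.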
Usage: to upsample features from any VFM to any resolution, simply run the following code (note that natten must be installed; see INSTALL.md):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pretrained NAF upsampler from torch.hub
naf = torch.hub.load("valeoai/NAF", "naf", pretrained=True, device=device)
naf.eval()

# High-resolution image (B, 3, H, W)
image = ...
# Low-resolution features (B, C, h, w)
lr_features = ...
# Desired output size (H_o, W_o)
target_size = ...

# High-resolution features (B, C, H_o, W_o)
upsampled = naf(image, lr_features, target_size)
```
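As a quick sanity check, the snippet below continues from the code above and runs the loaded `naf` module on random tensors; the concrete shapes (e.g. the 384-channel ViT-S feature dimension) are only illustrative.

```python
import torch

# Random tensors, shapes only; replace with a real image and real VFM features.
B, C = 1, 384                                            # e.g. a ViT-S feature dimension
image = torch.rand(B, 3, 448, 448, device=device)        # high-resolution guidance image
lr_features = torch.randn(B, C, 32, 32, device=device)   # low-resolution VFM features
target_size = (448, 448)                                 # desired output resolution (H_o, W_o)

with torch.no_grad():
    upsampled = naf(image, lr_features, target_size)
print(upsampled.shape)                                   # expected: torch.Size([1, 384, 448, 448])
```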
To do:
- Release trained checkpoints for NAF++.

News:
- [2025-11-31] Added a HuggingFace demo.
- [2025-11-25] NAF has been uploaded to arXiv.
- [2025-11-24] The NAF code has been publicly released.
Vision Foundation Models produce spatially downsampled features, which makes pixel-level tasks challenging.
❌ Traditional upsampling methods:
- Classical filters – fast, generic, but fixed (bilinear, bicubic, joint bilateral, guided)
- Learnable VFM-specific upsamplers – accurate, but need retraining (FeatUp, LiFT, JAFAR, LoftUp)
✅ NAF (Neighborhood Attention Filtering):
- Learns adaptive spatial-and-content weights using Cross-Scale Neighborhood Attention + RoPE
- Works zero-shot for any VFM
- Outperforms existing upsamplers on multiple downstream tasks
- Efficient: scales up to 2K features, ~18 FPS for intermediate resolutions
- Also effective for image restoration
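To make the fixed-versus-adaptive trade-off above concrete, the classical baselines in the first list amount to a single content-independent call; the sketch below uses illustrative shapes.

```python
import torch
import torch.nn.functional as F

lr_features = torch.randn(1, 384, 32, 32)   # low-resolution VFM features

# Fixed classical filter: the same bilinear kernel is applied everywhere, regardless of
# image content, so features blur across object boundaries. NAF instead predicts a
# different kernel per output pixel from the high-resolution image.
hr_bilinear = F.interpolate(lr_features, size=(448, 448), mode="bilinear", align_corners=False)
print(hr_bilinear.shape)                    # torch.Size([1, 384, 448, 448])
```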
Vision Foundation Models (VFMs) extract spatially downsampled representations, posing challenges for pixel-level tasks. Existing upsampling approaches face a fundamental trade-off: classical filters are fast and broadly applicable but rely on fixed forms, while modern upsamplers achieve superior accuracy through learnable, VFM-specific forms at the cost of retraining for each VFM. We introduce Neighborhood Attention Filtering (NAF), which bridges this gap by learning adaptive spatial-and-content weights through Cross-Scale Neighborhood Attention and Rotary Position Embeddings (RoPE), guided solely by the high-resolution input image. NAF operates zero-shot: it upsamples features from any VFM without retraining, making it the first VFM-agnostic architecture to outperform VFM-specific upsamplers and achieve state-of-the-art performance across multiple downstream tasks. It maintains high efficiency, scaling to 2K feature maps and reconstructing intermediate-resolution maps at 18 FPS. Beyond feature upsampling, NAF demonstrates strong performance on image restoration, highlighting its versatility.
We provide Jupyter notebooks to easily run NAF for inference and visualize attention maps:
- Inference: notebooks/inference.ipynb runs the NAF upsampler on any VFM, seamlessly upsampling its low-resolution features to high resolution.
- Attention Maps: notebooks/attention_maps.ipynb visualizes NAF's neighborhood attention maps: given a query point and a kernel size, it computes and displays the corresponding attention map.
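As a stand-alone illustration of what such a map looks like (a hypothetical sketch, not the notebook's code), the snippet below computes one query pixel's softmax attention over a k x k neighborhood of random features and plots it.

```python
import torch
import matplotlib.pyplot as plt

C, h, w, k = 64, 32, 32, 7                  # channels, feature height/width, kernel size
feats = torch.randn(C, h, w)                # stand-in for low-resolution features
qy, qx = 16, 16                             # query location on the feature grid
half = k // 2                               # neighborhood radius

patch = feats[:, qy - half:qy + half + 1, qx - half:qx + half + 1]   # (C, k, k) neighborhood
query = feats[:, qy, qx]                                             # (C,) query descriptor
scores = (patch * query[:, None, None]).sum(dim=0) / C**0.5          # (k, k) similarity scores
attn = torch.softmax(scores.flatten(), dim=0).reshape(k, k)          # normalized attention map

plt.imshow(attn.numpy(), cmap="viridis")
plt.title("Neighborhood attention around the query point")
plt.colorbar()
plt.show()
```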
See the docs folder for detailed setup instructions covering installation, datasets, training, and evaluation.
If you want to retrain NAF, training takes less than 2 hours and consumes less than 8 GB of GPU memory on a single NVIDIA A100. Otherwise, we provide pretrained weights for direct evaluation. We can also share evaluation logs upon request.
Many thanks to these excellent open source projects:
- https://github.com/SHI-Labs/NATTEN
- https://github.com/PaulCouairon/JAFAR
- https://github.com/mhamilton723/FeatUp
- https://github.com/saksham-s/lift/tree/main
- https://github.com/mc-lan/ProxyCLIP
To structure our code we used:
Do not hesitate to check out and support our previous feature upsampling work, JAFAR: https://github.com/PaulCouairon/JAFAR
If this work is helpful for your research, please consider citing it with the BibTeX entry below and starring this repository. Feel free to open an issue for any questions.
```bibtex
@misc{chambon2025nafzeroshotfeatureupsampling,
  title={NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering},
  author={Loick Chambon and Paul Couairon and Eloi Zablocki and Alexandre Boulch and Nicolas Thome and Matthieu Cord},
  year={2025},
  url={https://arxiv.org/abs/2511.18452},
}
```




