
Official Implementation of NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering.

NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering
Loick Chambon, Paul Couairon, Eloi Zablocki, Alexandre Boulch, Nicolas Thome, Matthieu Cord.
Valeo.ai, Sorbonne University, CNRS.


Upsample any Vision Foundation Model features, zero-shot, to high resolution with NAF (Neighborhood Attention Filtering).

NAF demo (▶️ full quality: here)

NAF obtains state-of-the-art results on multiple downstream tasks, across VFM families, model sizes, and datasets:

| Method | Semantic Seg. | Depth Est. | Open Vocab. | Video Prop. | ⚡ FPS | 📏 Max Ratio |
|---|---|---|---|---|---|---|
| FeatUp | 4th | 4th | 3rd | 4th | 🥈 | 🥈 |
| JAFAR | 🥈 | 3rd | 🥈 | 🥇 | 3rd | 4th |
| AnyUp | 3rd | 🥈 | 4th | 3rd | 3rd | 3rd |
| NAF (ours) | 🥇 | 🥇 | 🥇 | 🥈 | 🥇 | 🥇 |

🏆 Performance Summary: per-task ranks (🥇 first · 🥈 second; 3rd and 4th spelled out)

🎯 TL;DR

Three simple steps:

  1. Select any Vision Foundation Model (DINOv3, DINOv2, RADIO, FRANCA, PE-CORE, CLIP, SAM, etc.)
  2. Choose your target resolution (up to 2K)
  3. Upsample features with NAF — zero-shot, no retraining needed

Why it works: NAF combines classical filtering theory with modern attention mechanisms, learning adaptive kernels through Fourier space transformations.
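
To make the mechanism concrete, here is a minimal, illustrative PyTorch sketch of neighborhood filtering with content-adaptive weights. It is not NAF's actual implementation (the guidance projection is random here, whereas NAF learns it, and NAF's attention additionally uses RoPE): each high-resolution pixel gathers a k × k neighborhood of low-resolution features and blends them with softmax weights derived from the guidance image.

import torch
import torch.nn.functional as F

def neighborhood_filter(guidance, lr_features, kernel_size=3):
    """Toy cross-scale neighborhood filtering (illustration only).

    guidance:    (B, D, H, W) per-pixel query embeddings from the HR image
    lr_features: (B, C, h, w) low-resolution VFM features
    returns:     (B, C, H, W) filtered high-resolution features
    """
    B, D, H, W = guidance.shape
    C, k = lr_features.shape[1], kernel_size

    # Resample LR features to the HR grid so each HR pixel sees a k x k
    # neighborhood of LR content (keys and values coincide in this toy).
    keys = F.interpolate(lr_features, size=(H, W), mode="nearest")

    # Unfold k x k neighborhoods: (B, C * k * k, H * W) -> (B, C, k * k, H, W)
    neigh = F.unfold(keys, k, padding=k // 2).view(B, C, k * k, H, W)

    # Match guidance to the feature dimension; NAF learns this mapping,
    # a fixed random projection stands in here.
    proj = torch.randn(C, D, device=guidance.device) / D ** 0.5
    queries = torch.einsum("cd,bdhw->bchw", proj, guidance)

    # Content-adaptive weights: softmax over each pixel's neighborhood.
    logits = (queries.unsqueeze(2) * neigh).sum(1) / C ** 0.5  # (B, k*k, H, W)
    weights = logits.softmax(dim=1)

    # Weighted blend of neighborhood features.
    return (weights.unsqueeze(1) * neigh).sum(2)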

Usage: To upsample any features to any resolution with NAF, run the following code (natten must be installed; see INSTALL.md):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
naf = torch.hub.load("valeoai/NAF", "naf", pretrained=True, device=device)
naf.eval()

# High-resolution image (B, 3, H, W)
image = ...
# Low-resolution features (B, C, h, w)
lr_features = ...   
# Desired output size (H_o, W_o)
target_size = ...                                

# High-resolution features (B, C, H_o, W_o)
upsampled = naf(image, lr_features, target_size)
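
As a concrete end-to-end sketch, the low-resolution features can come from any VFM; below we assume DINOv2 ViT-S/14 loaded through its own torch.hub entrypoint (dinov2_vits14 and forward_features are DINOv2's API, not NAF's) and a dummy input image:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load NAF and an example backbone (DINOv2 ViT-S/14).
naf = torch.hub.load("valeoai/NAF", "naf", pretrained=True, device=device)
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").to(device).eval()
naf.eval()

# Dummy image; in practice, normalize with ImageNet statistics.
# Side lengths must be multiples of the patch size (14).
image = torch.rand(1, 3, 448, 448, device=device)

with torch.no_grad():
    # (B, N, C) patch tokens -> (B, C, h, w) feature map, h = w = 448 / 14 = 32
    tokens = dino.forward_features(image)["x_norm_patchtokens"]
    lr_features = tokens.transpose(1, 2).reshape(1, -1, 32, 32)

    # Upsample back to full image resolution with NAF.
    upsampled = naf(image, lr_features, (448, 448))

print(upsampled.shape)  # expected: torch.Size([1, 384, 448, 448])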

⚡ News & Updates

  • [Upcoming] Release trained checkpoints for NAF++.
  • [2025-11-31] Added the HuggingFace demo.
  • [2025-11-25] NAF was released on arXiv.
  • [2025-11-24] The NAF code was publicly released.

📜 Abstract

Summary

Vision Foundation Models produce downsampled spatial features, which are challenging for pixel-level tasks.

❌ Traditional upsampling methods:

  • Classical filters – fast, generic, but fixed (bilinear, bicubic, joint bilateral, guided)
  • Learnable VFM-specific upsamplers – accurate, but need retraining (FeatUp, LiFT, JAFAR, LoftUp)

NAF (Neighborhood Attention Filtering):

  • Learns adaptive spatial-and-content weights using Cross-Scale Neighborhood Attention + RoPE
  • Works zero-shot for any VFM
  • Outperforms existing upsamplers on multiple downstream tasks
  • Efficient: scales up to 2K features, ~18 FPS for intermediate resolutions
  • Also effective for image restoration

[Overview figure]

Full abstract

Vision Foundation Models (VFMs) extract spatially downsampled representations, posing challenges for pixel-level tasks. Existing upsampling approaches face a fundamental trade-off: classical filters are fast and broadly applicable but rely on fixed forms, while modern upsamplers achieve superior accuracy through learnable, VFM-specific forms at the cost of retraining for each VFM. We introduce Neighborhood Attention Filtering (NAF), which bridges this gap by learning adaptive spatial-and-content weights through Cross-Scale Neighborhood Attention and Rotary Position Embeddings (RoPE), guided solely by the high-resolution input image. NAF operates zero-shot: it upsamples features from any VFM without retraining, making it the first VFM-agnostic architecture to outperform VFM-specific upsamplers and achieve state-of-the-art performance across multiple downstream tasks. It maintains high efficiency, scaling to 2K feature maps and reconstructing intermediate-resolution maps at 18 FPS. Beyond feature upsampling, NAF demonstrates strong performance on image restoration, highlighting its versatility.
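
For readers unfamiliar with RoPE, the idea is to rotate query/key channel pairs by position-dependent angles so that attention logits depend only on relative offsets. A minimal 1D illustration follows (NAF applies a 2D variant inside its neighborhood attention; this is a generic sketch, not NAF's code):

import torch

def rope_1d(x, positions, base=10000.0):
    """Rotate channel pairs of x by position-dependent angles (1D RoPE).

    x:         (..., N, C) queries or keys, C even
    positions: (N,) token positions
    """
    C = x.shape[-1]
    freqs = base ** (-torch.arange(0, C, 2, dtype=x.dtype) / C)  # (C/2,)
    angles = positions[:, None] * freqs[None, :]                 # (N, C/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Attention logits depend only on the relative offset between positions:
q, k = torch.randn(1, 1, 8), torch.randn(1, 1, 8)
a = (rope_1d(q, torch.tensor([3.0])) * rope_1d(k, torch.tensor([5.0]))).sum()
b = (rope_1d(q, torch.tensor([10.0])) * rope_1d(k, torch.tensor([12.0]))).sum()
print(torch.allclose(a, b, atol=1e-5))  # True: both pairs are offset by +2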

🔄 Notebooks

We provide Jupyter notebooks to easily run NAF for inference and visualize attention maps:


  • Inference: NAF enables zero-shot feature upsampling across any Vision Foundation Model, seamlessly turning low-resolution features into high-resolution ones.
  • Attention visualization: given a query point and a kernel size, we compute and show its neighborhood attention map.
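
To reproduce a similar visualization outside the notebooks, here is a self-contained sketch (the helper name is hypothetical; the notebooks ship their own utilities) that plots softmax-normalized similarities between one query pixel and its k × k neighborhood of a dense feature map:

import torch
import matplotlib.pyplot as plt

def show_neighborhood_attention(features, y, x, kernel_size=7):
    """Plot softmax similarities between pixel (y, x) and its neighborhood.

    features: (C, H, W) dense feature map, e.g. NAF's upsampled output;
    (y, x) must be at least kernel_size // 2 pixels away from the borders.
    """
    C = features.shape[0]
    r = kernel_size // 2
    query = features[:, y, x]                                  # (C,)
    patch = features[:, y - r:y + r + 1, x - r:x + r + 1]      # (C, k, k)
    logits = (query[:, None, None] * patch).sum(0) / C ** 0.5  # (k, k)
    attn = logits.flatten().softmax(0).view(kernel_size, kernel_size)
    plt.imshow(attn.detach().cpu(), cmap="viridis")
    plt.title(f"Neighborhood attention around ({y}, {x})")
    plt.colorbar()
    plt.show()

# Example with the upsampled features from the usage snippet above:
# show_neighborhood_attention(upsampled[0], y=100, x=150, kernel_size=7)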

🔨 Setup

See the docs folder for detailed setup instructions covering installation, datasets, training, and evaluation.

If you want to retrain NAF, training takes less than 2 hours and uses less than 8 GB of GPU memory on a single NVIDIA A100. Otherwise, we provide pretrained weights for direct evaluation, and we can share evaluation logs upon request.

👍 Acknowledgements

Many thanks to these excellent open-source projects:

To structure our code we used:

Do not hesitate to check out and support our previous feature upsampling work:

✏️ Bibtex

If this work is helpful for your research, please consider citing the following BibTeX entry and starring this repository. Feel free to open an issue for any questions.

@misc{chambon2025nafzeroshotfeatureupsampling,
      title={NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering}, 
      author={Loick Chambon and Paul Couairon and Eloi Zablocki and Alexandre Boulch and Nicolas Thome and Matthieu Cord},
      year={2025},
      url={https://arxiv.org/abs/2511.18452}, 
}
