
Segmenting France Across Four Centuries

Marta López-Rauhut, Hongyu Zhou, Loic Landrieu, Mathieu Aubry

Official implementation of Segmenting France Across Four Centuries (ICDAR 2025).

Historical Map Segmentation

Historical maps pre-date satellite imagery, offering insights into centuries of landscape transformation before the 1950s. Unlocking this information can, for example, support the development of sustainable long-term policies.

Visually comparing maps from different eras is a complex task. Segmenting the maps into Land Use and Land Cover (LULC) classes makes it easier, but segmenting maps manually is still extremely time-consuming. That is why several works have explored training models to segment historical maps automatically. However, the lack of annotated training data remains a major limitation.

We introduce FRAx4, a new dataset of historical maps tailored for analyzing long-term LULC evolution with limited annotations. Using our dataset, we train three segmentation baselines:

Baselines

  • A: fully-supervised direct segmentation of historical maps, trained on the labeled subset.
  • B: weakly-supervised direct segmentation of historical maps.
  • C: fully-supervised segmentation of modern maps, used together with CycleGAN image-to-image translation.


Install

First of all, clone the repository:

git clone --recurse-submodules https://github.com/Archiel19/FRAx4.git
cd FRAx4

Conda with QGIS

We provide an environment definition with QGIS to run scripts that require manipulating geographical data. Create the environment by running:

conda env create -f pyqgis_env.yml

and activate it with:

conda activate pyqgis_env

Specifically, activate this environment if you want to:

  • build the dataset from scratch (see Build from scratch), or
  • merge the per-tile forest density images produced by the --log_densities testing option (see Testing).

Pip

If you are not planning to run any QGIS-related functionality, you can simply install the dependencies with pip:

pip install -r requirements.txt

Dataset

FRAx4 spans metropolitan France (548,305 km²) across the 18th, 19th, 20th and 21st centuries, and considers four LULC segmentation classes plus a background class:

  • Map collections:
    • Cassini
    • État-Major
    • SCAN50
    • Plan IGN
  • Segmentation classes:
    • background
    • forest
    • buildings
    • hydrography
    • road

Dataset content

For each historical map collection, we provide:

  • 10,952 historical map tiles
  • 470 historical label tiles*
  • Modern Plan IGN maps adapted to resemble the style of the historical map collection:
    • 10,952 modern map tiles
    • 10,952 modern label tiles

Additionally, settings.json stores the exact settings that were used to generate the dataset, and tile_extents.csv provides the west, south, east and north coordinates of each tile.

[Example map and label tiles for each collection, in both historical and modern styles: Cassini, État-Major and SCAN50. Historical labels are not available for every collection; see the note below.]

*Historical labels are available for a subset of map sheets, for Cassini and État-Major. Namely: Cambrai, Clermont, Dunkerque, LaonNoyon, LePuy, SaintMalo, SaintMalo2 and SaintOmer.

Dataset structure

<root dir>/
├─ <map collection>/
│  ├─ historical/
│  │  ├─ labels/
│  │  ├─ raster/
│  │  ├─ raster_labeled/
│  ├─ modern/
│  │  ├─ labels/
│  │  ├─ raster/
...

For convenience, historical/raster_labeled/ contains a duplicate of the labeled subset of historical/raster/.

Data sources

The raw data sources that we use to build our dataset are detailed in Data sources.

Build from scratch

First, install the repository and set up the pyqgis_env environment with conda (see Install). Then, generate the dataset by running:

python3 dataset/generate_dataset.py --hist_maps --hist_labels --mod_maps --mod_labels --num_workers <num_workers>

Replace <num_workers> with the number of concurrent processes you want to use for downloading and processing the data. --num_workers 0 will use the main process only. Run python3 dataset/generate_dataset.py -h for a complete list of options.
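
For example, to generate all four dataset components with eight worker processes (eight is only an illustrative value; adjust it to your machine):

python3 dataset/generate_dataset.py --hist_maps --hist_labels --mod_maps --mod_labels --num_workers 8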

By default, the dataset is generated in the root directory of the repository.

If the dataset generation fails for whatever reason, run the code again: only the missing tiles will be generated.

Customize

The code, data and settings that are used to generate the dataset can be found in the dataset folder. Please refer to the README in that folder for instructions on how to build your own historical map segmentation dataset.

Download

You can download the dataset directly from HuggingFace by running:

python3 scripts/download_dataset.py

Training

Tip

The paths to the dataset, checkpoints and data directories can be modified in scripts/constants.sh.

Dataset splits

Dataset splits are defined in src/dataset_splits.json. Map sheets listed in the test or val partitions are excluded from training; the rest are used for training.

A subset of the dataset can be isolated by specifying an all partition (map sheets in test and val are still excluded from training). val_seed can be used instead of val to define a random train/val split.

For cross-validation, the splits for each fold are nested inside a main split definition.

For a custom dataset

Create your own JSON file based on src/dataset_splits.json. For each split, you may optionally specify the partitions all, test and val or val_seed (val_seed takes precedence over val). To define cross-validation folds, the name of the root key does not matter, but the folds inside should be named fold1 ... foldN, and the all partition should be outside of any folds.

The values of each partition should be map sheet names, or alternatively the identifiers of the rectangles used to delimit the dataset area.
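
As an illustration, a split file could be structured as follows (a minimal sketch: the split keys my_split and my_cv_split, the seed value and the exact nesting are hypothetical, so check src/dataset_splits.json for the authoritative schema):

{
  "my_split": {
    "all": ["Cambrai", "Clermont", "Dunkerque"],
    "test": ["SaintMalo"],
    "val_seed": 42
  },
  "my_cv_split": {
    "all": ["Cambrai", "Clermont", "SaintMalo"],
    "fold1": {"test": ["Cambrai"], "val": ["Clermont"]},
    "fold2": {"test": ["Clermont"], "val": ["SaintMalo"]}
  }
}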

U-Net

Train a U-Net segmentation model by running:

scripts/train_unet.sh <map style> <baseline> [<fold number>]
  • map style: Cassini, État-Major or SCAN50.
  • baseline: A, B or C.
    • A: fully-supervised direct segmentation. Trains on the labeled subset, excluding one map sheet for testing.
    • B: weakly-supervised direct segmentation.
    • C: fully-supervised modern map segmentation. To be used after applying image-to-image translation with CycleGAN.
  • fold number: for baseline A, the fold number between 1 and 7 that defines the train/val/test split in src/dataset_splits.json.
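
For example, to train baseline A on Cassini maps using the third cross-validation fold:

scripts/train_unet.sh Cassini A 3

The fold number only applies to baseline A.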

Train on a custom dataset

To train on a custom dataset, run src/segmentation.py directly:

python3 src/segmentation.py \
        --input_path <input path> \
        --labels_path <labels path> \
        --split_config_path <dataset splits configuration file> \
        --split_key <dataset split key> \
        --output_dir <output dir> \
        --unet_checkpoint_dir <U-Net checkpoint dir> \
        --crop_size <crop size> \
        train --n_epochs <epochs>

The main options that you will need to adjust are:

  • input_path: path to the directory containing the input raster images.
  • labels_path: path to the directory containing ground truth labels.
  • split_config_path: path to a JSON file containing dataset split definitions. Unused if split_key is empty.
  • split_key: key identifying a dataset split definition in the file at split_config_path. Leave empty to train on the whole dataset.
  • output_dir: output directory for logs and figures.
  • unet_checkpoint_dir: directory where U-Net model checkpoints will be saved.
  • crop_size: size of the crop that is taken from each dataset tile during training. Should be adapted depending on the scale of the map collection.
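
For instance, a hypothetical training run on an État-Major-style custom dataset could look like this (all paths, the split key and the epoch count are placeholders, not files shipped with the repository):

python3 src/segmentation.py \
        --input_path data/my_maps/historical/raster \
        --labels_path data/my_maps/historical/labels \
        --split_config_path my_splits.json \
        --split_key my_split \
        --output_dir out/my_maps \
        --unet_checkpoint_dir checkpoints/unet_my_maps \
        --crop_size 500 \
        train --n_epochs 100

Here crop_size is set to 500, matching the value used for État-Major and SCAN50 maps elsewhere in this README.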

Run python3 src/segmentation.py -h to display a full list of options.

CycleGAN

First, prepare the dataset to respect the format expected by the CycleGAN training code:

scripts/prepare_dataset_for_cyclegan.sh

Once the dataset has been formatted properly, you may train CycleGAN models by running:

scripts/train_cyclegan.sh <map style> [<experiment name suffix>]
  • map style: Cassini, État-Major or SCAN50.
  • experiment name suffix: optional naming suffix to avoid overwriting the output of independent training runs.
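
For example, to train a CycleGAN for the SCAN50 style and tag the run with a suffix:

scripts/train_cyclegan.sh SCAN50 run1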

Train on a custom dataset

To train on a custom dataset, you first need to format it properly: adapt prepare_dataset_for_cyclegan.sh with the name of your dataset and its map collections ("styles"), then run it.

To avoid some typing, run source scripts/constants.sh to initialize the $GAN_DIR variable.

Then, run $GAN_DIR/train.py. For example, to train with the settings used for the paper:

python3 $GAN_DIR/train.py \
        --dataset_mode frax4 \
        --dataroot $GAN_DIR/datasets/<dataset name>/<style> \
        --checkpoints_dir <GAN checkpoint dir> \
        --name <experiment name> \
        --preprocess crop_and_resize \
        --crop_size <crop size> \
        --load_size 256 \
        --lambda_identity 0.5 \
        --lambda_translation 0.5 \
        --lambda_cycle 1.0 \
        --lambda_gan 1.0 \
        --n_epochs 100 \
        --n_epochs_decay 0 \
        --pool_size 0 \
        --save_by_iter

The values you should change are:

  • dataset name: name of your custom dataset.
  • style: name of the target map collection.
  • GAN checkpoint dir: directory where CycleGAN checkpoints will be stored.
  • experiment name: a name used to distinguish independent training runs.
  • crop size: size of the crop that is taken from each dataset tile during training. Should be adapted depending on the scale of the map collection.
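
For instance, for a hypothetical Cassini-style dataset named mymaps, the placeholders above could be filled in as follows (the names are illustrative; crop size 1000 matches the Cassini scale recommended later in this README):

        --dataroot $GAN_DIR/datasets/mymaps/Cassini \
        --checkpoints_dir checkpoints/gan \
        --name mymaps_cassini_run1 \
        --crop_size 1000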

Run python3 $GAN_DIR/train.py -h to display a full list of options.

Testing

To reproduce the main results of the paper, first download the model checkpoints from HuggingFace:

python3 scripts/download_checkpoints.py

and then run:

scripts/test_all.sh [--log_figures] [--log_metrics_per_tile] [--log_densities]
  • --log_figures: save images, labels and predictions for each tile.
  • --log_metrics_per_tile: log individual per-tile metrics in addition to the final metrics over the whole test set.
  • --log_densities: predict forest densities over France. This option requires QGIS to merge the per-tile density images, so remember to activate pyqgis_env (see Conda with QGIS).
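
For example, to save prediction figures and compute forest density maps (activating the QGIS environment first, as required by --log_densities):

conda activate pyqgis_env
scripts/test_all.sh --log_figures --log_densities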

The expected results are:

Map collection  Baseline  OA     Mean dIoU  Forest dIoU  Buildings dIoU  Hydrography dIoU  Road dIoU
Cassini         A         96.70  76.43      88.34        69.76           85.03             62.59
Cassini         B         79.80  26.06      35.80         5.34           63.10              0.00
Cassini         C0        84.77  36.83      59.24         3.29           71.20             13.57
Cassini         C1        84.69  34.62      49.59         4.70           72.50             11.70
Cassini         C2        86.55  36.89      59.46         6.21           68.93             12.94
État-Major      A         91.33  61.22      77.38        56.25           65.35             45.92
État-Major      B         83.32  38.56      49.63        39.63           58.64              6.34
État-Major      C0        77.74  29.76      41.87        14.13           57.25              5.80
État-Major      C1        82.12  32.37      49.67        16.14           59.12              4.57
État-Major      C2        75.21  24.02      37.36        16.57           36.60              5.57

Inference

To run inference on an image with specific model checkpoints, call the main Python script directly:

python3 src/segmentation.py --input_path <image path> [--labels_path <labels path>] --output_dir <out dir> \
                            --unet_checkpoint_dir <U-Net checkpoint dir> [--gan_checkpoint_dir <CycleGAN checkpoint dir>] \
                            --crop_size <prediction window size> --log_figures test

Run python3 src/segmentation.py -h to display more options. crop_size should be 1000 for Cassini maps and 500 for État-Major and SCAN50.
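
For example, a hypothetical inference run on a Cassini sheet (the input image and checkpoint directory below are placeholders):

python3 src/segmentation.py --input_path my_cassini_sheet.tif --output_dir out/inference \
                            --unet_checkpoint_dir checkpoints/unet_cassini \
                            --crop_size 1000 --log_figures test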

Citation

@inproceedings{lopez2025segmenting,
  title={Segmenting France Across Four Centuries},
  author={L{\'o}pez-Rauhut, Marta and Zhou, Hongyu and Aubry, Mathieu and Landrieu, Loic},
  booktitle={International Conference on Document Analysis and Recognition},
  pages={3--22},
  year={2025},
  organization={Springer}
}

Acknowledgements

This work was supported by the European Research Council (ERC project DISCOVER, number 101076028) and by ANR project sharp ANR-23-PEIA-0008 in the context of the PEPR IA. This work was also granted access to the HPC resources of IDRIS under the allocation 2024-AD011015314 made by GENCI.
