Geo-ER

Code for 'Geospatial Entity Resolution' paper (WWW 2022)

Geo-ER is an Entity Resolution (ER) framework to match geospatial entities. It allows to integrate geospatial databases, performing deduplication of the entries. Geo-ER leverages Transformer-based Language Models (LMs), Distance embedding and a novel Neighbourhood Attention component, based on Graph Attention (GAT).

UPDATE

Now Geo-ER supports JSON-formatted neighbors for entities, to make it easier to use it for other tasks and dataset. The JSON format is the following:

# list of python dictionaries
[
    {
        "name1": "...",               # name of entity1
        "name2": "...",               # name of entity2
        "neigh1": [
                                      # list of entity1 neighbors' names
        ],
        "neigh2": [
                                      # list of entity2 neighbors' names
        ],
        "dist1": [
                                      # list of entity1 neighbors' distances from entity1
        ],
        "dist2": [
                                      # list of entity2 neighbors' distances from entity2
        ]
    }
]

Requirements

Python 3.7.7
PyTorch 1.9
HuggingFace Transformers 4.9.2

Install required packages

pip install -r requirements.txt

Geospatial ER

The following image (left) depicts an example of Geospatial ER, in which two sources are being joined. Geo-ER uses textual information, distance and spatial neighbours information (right) to infer if two records, from the two sources, refer to the same real-world entity.

Each entity is pre-serialized as follows:

COL name VAL Wine and Spirits COL latitude VAL 40.4535 COL longitude VAL -80.009 COL address VAL Ohio Street COL postalCode VAL NULL

Each entity pair <e_1, e_2> is serialized as follows:

<e_1> \t <e_2> \t <label>

where <e_i> is the serialized version of entity i and <label> is the either 0 (no-match) or 1 (match).

Training

To train Geo-ER, please use:

python main.py -c pit -s osm_fsq

The meaning of the flags and their possible values are listed here:

-c, --city: Specify the dataset of which city you wish to use for training. Possible values are sin, edi, tor, pit.
-s, --source: Specify the sources to be joined to create the dataset. Possible values are osm_fsq, osm_yelp.

Datasets

Please use the search_neighbors.py script to generate the neighbourhood data for any city. The files containing such neighbourhood information are quite large, being composed of BERT encodings.

This paper introduces 8 real-world datasets, joining 3 different sources (OpenStreetMap, Foursquare and Yelp), in 4 different cities (Singapore, Edinburgh, Toronto and Pittsburgh). The statistics of the datasets are the following:

Source	City	Size	Positive (%)
OSM-FSQ	Singapore	19,243	2,116 (11.0%)
	Edinburgh	17,386	3,350 (19.3%)
	Toronto	17,858	3,862 (21.6%)
	Pittsburgh	5,001	1,454 (29.1%)

OSM-Yelp	Singapore	21,588	2,941 (13.6%)
	Edinburgh	18,733	2,310 (12.3%)
	Toronto	27,969	5,426 (19.4%)
	Pittsburgh	5,116	1,622 (31.7%)

The column Positive (%) shows the number of positive samples. The 8 datasets can be found in data/train_valid_test/

If you wish to use any of our datasets for your research, please cite our paper

Citation Information

@inproceedings{10.1145/3485447.3512026,
  author = {Balsebre, Pasquale and Yao, Dezhong and Cong, Gao and Hai, Zhen},
  title = {Geospatial Entity Resolution},
  year = {2022},
  isbn = {9781450390965},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  doi = {10.1145/3485447.3512026},
  booktitle = {Proceedings of the ACM Web Conference 2022},
  location = {Virtual Event, Lyon, France},
  series = {WWW '22}
}

Name		Name	Last commit message	Last commit date
Latest commit History 97 Commits
__pycache__		__pycache__
data		data
imgs		imgs
saved_models		saved_models
.DS_Store		.DS_Store
.gitattributes		.gitattributes
README.md		README.md
config.py		config.py
functions.py		functions.py
main.py		main.py
model.py		model.py
requirements.txt		requirements.txt
search_neighbors.py		search_neighbors.py
train.py		train.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Geo-ER

Code for 'Geospatial Entity Resolution' paper (WWW 2022)

UPDATE

Requirements

Geospatial ER

Training

Datasets

If you wish to use any of our datasets for your research, please cite our paper

Citation Information

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

PasqualeTurin/Geo-ER

Folders and files

Latest commit

History

Repository files navigation

Geo-ER

Code for 'Geospatial Entity Resolution' paper (WWW 2022)

UPDATE

Requirements

Geospatial ER

Training

Datasets

If you wish to use any of our datasets for your research, please cite our paper

Citation Information

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages