Positional Encoder Graph Neural Networks for Geographic Data
a joint embedding with a data-dependent, secondary encoder (e.g., text encoder), we use the output of PE concatenated with other node features to directly predict an outcome variable. PE learns through backpropagation on the main regression loss in an end-to-end fashion. Training PE thus takes into account not only the eventual variable of interest, but also further contextual information at the current location and its relation to other points. Within PE-GNN, spatial information is thus represented both through the constructed graph and the learned PE embeddings.

• We expand the Moran's I auxiliary task learning framework proposed by Klemmer and Neill (2021) to continuous spatial coordinates.

• Our training strategy involves the creation of a new training graph at each training step from the current, random point batch. This enables learning of a more generalizable PE embedding and allows computation of a "shuffled" Moran's I, which accounts for different neighbors at different training steps, thus tackling the well-known scale sensitivity of Moran's I.

• To the best of our knowledge, PE-GNN is the first GNN-based approach that is competitive with Gaussian Processes on pure spatial interpolation tasks, i.e., predicting a (continuous) output based solely on spatial coordinates, as well as substantially improving GNN performance on all predictive tasks.

2 Related work

2.1 Traditional and neural-network-based spatial regression modeling

Our work considers the problem of modeling geospatial data. This poses a distinct challenge, as standard regression models (such as OLS) fail to address the spatial nature of the data, which can result in spatially correlated residuals. To address this, spatial lag models (Anselin et al., 2001) add a spatial lag term to the regression equation that is proportional to the dependent variable values of nearby observations, assigned by a weight matrix. Likewise, kernel regression takes a weighted average of nearby points when predicting the dependent variable. The most popular off-the-shelf methods for modeling continuous spatial data are based on Gaussian processes (Datta et al., 2016). Recently, there has been a rise in research on applications of neural network models to spatial modeling tasks. More specifically, graph neural networks (GNNs) are often used for these tasks, with the spatial data represented as a graph. In particular, they offer flexibility and scalability advantages over traditional spatial modeling approaches. Specific GNN operators, including Graph Convolutions (Kipf and Welling, 2017), Graph Attention (Veličković et al., 2018) and GraphSAGE (Hamilton et al., 2017), are powerful methods for inference and representation learning with spatial data. Recently, GNN approaches tailored to the specific complexities of geospatial data have been developed. The authors of Kriging Convolutional Networks (Appleby et al., 2020) propose using GNNs to perform a modified kriging task. Hamilton et al. (2017) apply GNNs to a spatio-temporal kriging task, recovering data from unsampled nodes on an input graph. We look to extend this line of research by providing stronger, explicit capacities for GNNs to learn spatial structures. Additionally, our proposed method is highly modular and can be combined with any GNN backbone.

2.2 Spatial context embeddings for geographic data

Through many decades of research on spatial patterns, a myriad of measures, metrics, and statistics have been developed to cover a broad range of spatial interactions. All of these measures seek to transform spatial locations, with optional associated features, into some meaningful embedding, for example a theoretical distribution of the locations or a measure of spatial association. The most common metric for continuous geographic data is the Moran's I statistic, whose local variant was developed by Anselin (1995). Moran's I measures local and global spatial autocorrelation and acts as a detector of spatial clusters and outliers. The metric has also motivated several methodological expansions, like local spatial heteroskedasticity (Ord and Getis, 2012) and local spatial dispersion (Westerholt et al., 2018). Measures of spatial autocorrelation have already been shown to be useful for improving neural network models through auxiliary task learning (Klemmer and Neill, 2021), model selection (Klemmer et al., 2019), embedding losses (Klemmer et al., 2022) and localized representation learning (Fu et al., 2019). Beyond these traditional metrics, recent years have seen the emergence of neural-network-based embeddings for geographic information. Wang et al. (2017) use kernel embeddings to learn social media user locations. Fu et al. (2019) devise an approach using local point-of-interest (POI) information to learn region embeddings and integrate similarities between neighboring regions to model mobile check-ins. Yin et al. (2019) develop GPS2Vec, an embedding approach for latitude-longitude coordinates based on a grid cell encoding and spatial context (e.g., tweets and images). Mai et al. (2020b) develop Space2Vec, another latitude-longitude embedding, which does not require further context like tweets or POIs. Space2Vec transforms the input coordinates using sinusoidal functions and then re-projects them into a desired output space using linear layers. In follow-up work, Mai et al. (2020a) first propose the direct integration of Space2Vec into downstream tasks and show its potential with experiments on spatial semantic lifting and geographic question answering. In this study, we propose to generalize their approach to any geospatial regression task by conveniently integrating Space2Vec embeddings into GNNs.
3 Method

3.1 Graph Neural Networks with Geographic Data

We now present PE-GNN, using Graph Convolutional Networks (GCNs) as an example backbone. Let us first define a datapoint p_i = {y_i, x_i, c_i}, where y_i is a continuous target variable (scalar), x_i is a vector of predictive features and c_i is a vector of point coordinates (latitude / longitude pairs). We use the great-circle distance d_ij = haversine(c_i, c_j) between point coordinates to create a graph of all points in the set, using a k-nearest-neighbor approach to define each point's neighborhood. The graph G = (V, E) consists of a set of vertices (or nodes) V = {v_1, ..., v_n} and a set of edges E = {e_1, ..., e_m} as assigned by the adjacency matrix A. Each vertex i ∈ V has respective node features x_i and target variable y_i. While the adjacency matrix A usually comes as a binary matrix (with values of 1 indicating adjacency and values of 0 otherwise), one can account for different distances between nodes and use point distances d_ij, or kernel transformations thereof (Appleby et al., 2020), to weight A. Given a degree matrix D and an identity matrix I, the normalized adjacency matrix Ā is defined as:

Ā = D^{-1/2} (A + I) D^{-1/2}    (1)

As proposed by Kipf and Welling (2017), a GCN layer can now be defined as:

H^{(l)} = σ(Ā H^{(l-1)} W^{(l)}),   l = 1, ..., L    (2)

where σ describes an activation function (e.g., ReLU) and W^{(l)} is a weight matrix parametrizing GCN layer l. The input to the first GCN layer, H^{(0)}, is given by the feature matrix X containing all node feature vectors x_1, ..., x_n. The assembled GCN predicts the output Ŷ = GCN(X, Θ_GCN), parametrized by Θ_GCN.

3.2 Context-aware spatial coordinate embeddings

Traditionally, the only intuition for spatial context in GCNs stems from connections between nodes, which allow for graph convolutions akin to pixel convolutions with image data. This can restrict the capacity of the GCN to capture spatial patterns: While defining good neighborhood structures can be crucial for GCN performance, this often comes down to somewhat arbitrary choices like selecting the k nearest neighbors of each node. Without prior knowledge of the underlying data, the process of setting the right neighborhood parameters may require extensive testing. Furthermore, a single value of k might not be best for all nodes: different locations might be more or less dependent on their neighbors. Assuming that no underlying graph connecting point locations is known, one would typically construct a graph using the distance (Euclidean or other) between pairs of points. In many real-world settings (e.g., points-of-interest along a road network) this assumption is unrealistic and may lead to poorly defined neighborhoods. Lastly, GCNs contain no intrinsic tool to transform point coordinates into a different (latent) space that might be more informative for representing the spatial structure, with respect to the particular problem the GCN is trying to solve.

As such, GCNs can struggle with tasks that explicitly require learning of complex spatial dependencies, as we confirm in our experiments. We propose a novel approach to overcome these difficulties by devising a new positional encoder module, learning a flexible spatial context encoding for each geographic location. Given a batch of datapoints, we create the spatial coordinate matrix C from individual point coordinates c_1, ..., c_n and define a positional encoder PE(C, σ_min, σ_max, Θ_PE) = NN(ST(C, σ_min, σ_max), Θ_PE), consisting of a sinusoidal transform ST(σ_min, σ_max) and a fully-connected neural network NN(Θ_PE), parametrized by Θ_PE. Following the intuition of transformers (Vaswani et al., 2017) for geographic coordinates (Mai et al., 2020b), the sinusoidal transform is a concatenation of scale-sensitive sinusoidal functions at different frequencies, so that

ST(C, σ_min, σ_max) = [ST_0(C, σ_min, σ_max); ...; ST_{S-1}(C, σ_min, σ_max)]    (3)

with S being the total number of grid scales and σ_min and σ_max setting the minimum and maximum grid scale (comparable to the lengthscale parameter of a kernel). The scale-specific encoder ST_s(C, σ_min, σ_max) = [ST_{s,1}(C, σ_min, σ_max); ST_{s,2}(C, σ_min, σ_max)] processes the spatial dimensions v (e.g., latitude and longitude) of C separately, so that

ST_{s,v}(C, σ_min, σ_max) = [cos(C^{[v]} / (σ_min · g^{s/(S-1)})); sin(C^{[v]} / (σ_min · g^{s/(S-1)}))],   ∀s ∈ {0, ..., S-1}, ∀v ∈ {1, 2},    (4)

where g = σ_max / σ_min. The output from ST is then fed through the fully connected neural network NN(Θ_PE) to transform it into the desired vector space shape, creating the coordinate embedding matrix C_emb = PE(C, σ_min, σ_max, Θ_PE).
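To make the graph setup of Section 3.1 concrete, the following is a minimal sketch, not the authors' released implementation: a k-nearest-neighbor graph built from great-circle distances, the normalized adjacency of Equation (1), and a single GCN propagation step as in Equation (2). The toy coordinates, the value of k, and the layer sizes are illustrative choices.

```python
# Sketch of Section 3.1: kNN graph from haversine distances, Eq. (1) normalization,
# and one GCN propagation step per Eq. (2). All sizes here are illustrative.
import numpy as np

def haversine(c1, c2, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (c1[0], c1[1], c2[0], c2[1]))
    a = np.sin((lat2 - lat1) / 2) ** 2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

def knn_adjacency(coords, k=5):
    """Binary adjacency A: an edge from each point to its k nearest neighbors."""
    n = len(coords)
    d = np.array([[haversine(coords[i], coords[j]) for j in range(n)] for i in range(n)])
    A = np.zeros((n, n))
    for i in range(n):
        A[i, np.argsort(d[i])[1:k + 1]] = 1.0   # skip the point itself (distance 0)
    return A

def normalized_adjacency(A):
    """Eq. (1): A_bar = D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    A_tilde = A + np.eye(len(A))
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# One GCN layer, Eq. (2): H^(1) = ReLU(A_bar H^(0) W^(1)), with H^(0) = X.
rng = np.random.default_rng(0)
coords = rng.uniform(low=[32.0, -124.0], high=[42.0, -114.0], size=(8, 2))  # toy lat/lon
X = rng.normal(size=(8, 4))          # toy node features
W = 0.1 * rng.normal(size=(4, 16))   # layer weights
A_bar = normalized_adjacency(knn_adjacency(coords, k=3))
H1 = np.maximum(A_bar @ X @ W, 0.0)  # shape (8, 16)
```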
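The positional encoder of Section 3.2 can likewise be sketched in a few lines of PyTorch. This is an assumption-laden illustration rather than the released module: the number of scales S, the σ values, and the hidden sizes of NN(Θ_PE) are arbitrary choices, and the exact concatenation order of Equation (3) is not reproduced.

```python
# Sketch of PE(C, sigma_min, sigma_max, Theta_PE) = NN(ST(C, sigma_min, sigma_max), Theta_PE)
# following Eqs. (3)-(4); scales, sigma values and hidden sizes are illustrative.
import torch
import torch.nn as nn

def sinusoidal_transform(coords, sigma_min=1e-3, sigma_max=1.0, n_scales=16):
    """ST: coords is an (n, 2) tensor of lat/lon; returns an (n, 4 * n_scales) tensor."""
    g = sigma_max / sigma_min
    feats = []
    for s in range(n_scales):
        scale = sigma_min * g ** (s / (n_scales - 1))   # Eq. (4) denominator
        feats.append(torch.cos(coords / scale))
        feats.append(torch.sin(coords / scale))
    return torch.cat(feats, dim=-1)

class PositionalEncoder(nn.Module):
    """NN(Theta_PE): projects the sinusoidal features into the embedding space."""
    def __init__(self, emb_dim=64, n_scales=16, sigma_min=1e-3, sigma_max=1.0):
        super().__init__()
        self.n_scales, self.sigma_min, self.sigma_max = n_scales, sigma_min, sigma_max
        self.net = nn.Sequential(
            nn.Linear(4 * n_scales, emb_dim), nn.ReLU(), nn.Linear(emb_dim, emb_dim))

    def forward(self, coords):
        return self.net(sinusoidal_transform(
            coords, self.sigma_min, self.sigma_max, self.n_scales))

# c_emb = PositionalEncoder()(torch.rand(32, 2))  # -> (32, 64) coordinate embedding
```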
Figure 1: PE-GCN compared to the GCN baseline: PE-GCN contains (1) a positional encoder network, which learns a spatial context embedding throughout training that is concatenated with the node-level features, and (2) an auxiliary learner, which predicts the spatial autocorrelation of the outcome variable simultaneously with the main regression task.
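Putting the pieces together, a two-headed PE-GCN along the lines of Figure 1 might be assembled as below. This is our reading of the figure, not the authors' code; it reuses the PositionalEncoder sketched above and assumes PyTorch Geometric's GCNConv as the graph operator, with an illustrative hidden size and dropout rate.

```python
# Sketch of the two-headed PE-GCN from Figure 1: embed coordinates with PE,
# concatenate with node features, apply two GCN layers, and predict y and I(y)
# with separate linear heads.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class PEGCN(nn.Module):
    def __init__(self, n_features, emb_dim=64, hidden=64, dropout=0.5):
        super().__init__()
        self.pe = PositionalEncoder(emb_dim=emb_dim)   # from the previous sketch
        self.conv1 = GCNConv(n_features + emb_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.head_y = nn.Linear(hidden, 1)       # main regression head
        self.head_moran = nn.Linear(hidden, 1)   # auxiliary Moran's I head
        self.dropout = dropout

    def forward(self, x, coords, edge_index):
        h = torch.cat([x, self.pe(coords)], dim=-1)          # concat(x, c_emb)
        h = F.relu(self.conv1(h, edge_index))
        h = F.dropout(h, p=self.dropout, training=self.training)
        h = F.relu(self.conv2(h, edge_index))
        return self.head_y(h).squeeze(-1), self.head_moran(h).squeeze(-1)
```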
3.3 Auxiliary learning of spatial autocorrelation

Geographic data often exhibit spatial autocorrelation: observations are related, in some shape or form, to their geographic neighbors. Spatial autocorrelation can be measured using the Moran's I metric of local spatial autocorrelation (Anselin, 1995). Moran's I captures localized homogeneity and outliers, functioning as a detector of spatial clustering and spatial change patterns. In the context of our problem, the Moran's I measure of spatial autocorrelation for outcome variable y_i is defined as:

I_i = (n − 1) · (y_i − ȳ_i) / (Σ_{j=1}^{n} (y_j − ȳ_j)²) · Σ_{j=1, j≠i}^{n} a_{i,j} (y_j − ȳ_j)    (5)

At each training step, we compute this metric on a random subset of the training data as batch B. A graph with corresponding adjacency matrix A_B is constructed for the batch and the Moran's I metric of the outcome variable, I(Y_B), is computed. This approach brings a unique advantage: When training with (randomly shuffled) batches, points may have different neighbors in different training iterations. The Moran's I for point i can thus change throughout iterations, reflecting a differing set of more distant or closer neighbors. This also naturally helps to tackle Moran's I scale sensitivity. Altogether, we refer to this altered Moran's I as "shuffled Moran's I".
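A direct rendering of Equation (5), computed per batch, is sketched below. We take ȳ to be the batch mean and use the binary kNN adjacency A_B as the spatial weights, which is one plausible choice rather than a statement of the authors' exact weighting.

```python
# Local Moran's I of Eq. (5) for one batch:
# I_i = (n-1) * (y_i - ybar) / sum_j (y_j - ybar)^2 * sum_{j != i} a_ij (y_j - ybar)
import numpy as np

def local_morans_i(y, A):
    """y: (n,) outcome values of the batch; A: (n, n) spatial weight (e.g., kNN) matrix."""
    n = len(y)
    z = y - y.mean()             # deviations from the batch mean
    A = A.copy()
    np.fill_diagonal(A, 0.0)     # exclude j = i, as in Eq. (5)
    return (n - 1) * z / (z ** 2).sum() * (A @ z)

# "Shuffled" Moran's I: recompute this on every randomly sampled batch, so a point's
# neighborhood, and hence its local Moran's I target, changes across training steps.
```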
Our use of coordinate embeddings differs from that of Mai et al. (2020a), who learn a specific joint embedding between the geographic coordinates and potential other inputs (e.g., text data). Our approach allows for separate treatment of geographic coordinates and potential other predictors, allowing a higher degree of flexibility: PE-GCN can be deployed for any regression task geo-referenced in the form of latitude-longitude coordinates. Lastly, to integrate the Moran's I auxiliary task, we compute the metric I(Y_B) for our outcome variable Y_B at the beginning of each training step according to Equation 5, using spatial weights from A_B. Prediction is then facilitated by creating two prediction heads, here linear layers, while the graph operation layers (e.g., GCN layers) are shared between tasks. Finally, we obtain predicted values Ŷ_B and I(Ŷ_B). The loss of PE-GCN can be computed with any regression criterion, for example mean squared error (MSE):

L = MSE(Ŷ_B, Y_B) + λ · MSE(I(Ŷ_B), I(Y_B))    (7)

where λ denotes the auxiliary task weight. The final model is denoted as M_{Θ_PE, Θ_GCN}. Algorithm 1 describes a training cycle.

Algorithm 1 PE-GNN Training
Require: M, λ, k, tsteps, nbatch hyper-parameters
1: Initialize model M with random weights and hyper-parameters
2: Set optimizer with hyper-parameters
3: for number of training steps (tsteps) do
4:   Sample minibatch B of nbatch points with features X_B, coordinates C_B and outcome Y_B
5:   Construct a spatial graph with adjacency matrix A_B from coordinates C_B using k-nearest neighbors
6:   Using spatial adjacency A_B, compute Moran's I of the output as I(Y_B)
7:   Predict outcome [Ŷ_B, I(Ŷ_B)] = M_{Θ_PE, Θ_GCN}(X_B, C_B, A_B)
8:   Compute loss L(Y_B, I(Y_B), Ŷ_B, I(Ŷ_B), λ)
9:   Update the parameters Θ_GCN, Θ_PE of model M using stochastic gradient descent
10: return M

We begin training by initializing our model M, for example a PE-GCN, with random weights and potential hyper-parameters (e.g., PE embedding dimension) and defining our optimizer. We then start the training cycle: At each training step, we first sample a minibatch B of points from our training data. These points come as features X_B, point coordinates C_B and outcome variables Y_B. We construct a graph from the spatial coordinates C_B using k-nearest neighbors, obtaining an adjacency matrix A_B. Next, we use A_B as the spatial weight matrix to compute local Moran's I values I(Y_B) from Y_B. As minibatches are randomly sampled, this creates a "shuffled" version of the metric. We then run the inputs X_B, C_B, A_B through the two-headed model M_{Θ_PE, Θ_GCN}, obtaining predictions Ŷ_B, I(Ŷ_B). We then compute the loss L(Y_B, I(Y_B), Ŷ_B, I(Ŷ_B), λ), weighing the Moran's I auxiliary task according to the weight parameter λ. Lastly, we use the loss L to update our model parameters Θ_GCN, Θ_PE according to stochastic gradient descent. Training is conducted for tsteps steps, after which the final model M is returned.

PE-GNN, with any GNN backbone, helps to tackle many of the particular challenges of geographic data: While our approach still includes the somewhat arbitrary choice of k-nearest neighbors to define the spatial graph, the proposed positional encoder network is not bound by this restriction, as it does not operate on the graph. This enables a separate learning of context-aware embeddings for each coordinate, accounting for neighbors at any potential distance within the batch. While the spatial graph still relies on a predefined distance measure, the positional encoder embeds latitude and longitude values in a high-dimensional latent space. These high-dimensional coordinates are able to reflect spatial complexities much more flexibly and, added as node features, can communicate these throughout the learning process. Batched PE-GNN training is not conducted on a single graph, but on a new graph consisting of randomly sampled training points at each iteration. As such, at different iterations, focus is put on the relationships between different clusters of points. This helps our method to generalize better, rather than just memorizing neighborhood structures. Lastly, the differing training batches also help us to compute a "shuffled" version of the Moran's I metric, capturing autocorrelation at the same location for different (closer or more distant), random neighborhoods.

4 Experiments

4.1 Data

We evaluate PE-GNN and baseline competitors on four real-world geographic datasets of different spatial resolutions (regional, continental and global):

California Housing: This dataset contains the prices of over 20,000 California houses from the 1990 U.S. census (Kelley Pace and Barry, 2003). The regression task at hand is to predict house prices y using features x (e.g., house age, number of bedrooms) and location c. California Housing is a standard dataset for the assessment of spatial autocorrelation.

Election: This dataset contains the election results of over 3,000 counties in the United States (Jia and Benson, 2020). The regression task here is to predict election outcomes y using socio-demographic and economic features (e.g., median income, education) x and county locations c.

Air temperature: The air temperature dataset (Hooker et al., 2018) contains the coordinates of 3,000 weather stations around the globe.
(b) Test error curves of GCN, GAT and GraphSAGE based models, measured by the MSE metric.
Table 1: Spatial Interpolation: Test MSE and MAE scores from four different datasets, using four different GNN backbones with and without our proposed architecture.
For this regression task we seek to predict mean temperatures y from a single node feature x, mean precipitation, and location c.

3d Road: The 3d road dataset (Kaul et al., 2013) provides 3-dimensional spatial coordinates (latitude, longitude, and altitude) of the road network in Jutland, Denmark. The dataset comprises over 430,000 points and can be used for interpolating altitude y using only latitude and longitude coordinates c (no node features x).

4.2 Experimental setup

We compare PE-GNN with four different graph neural network backbones: the original GCN formulation (Kipf and Welling, 2017), graph attention mechanisms (GAT) (Veličković et al., 2018) and GraphSAGE (Hamilton et al., 2017). We also use Kriging Convolutional Networks (KCN) (Appleby et al., 2020), which differs from GCN primarily in two ways: it transforms the distance-weighted adjacency matrix A using a Gaussian kernel and adds the outcome variable and features of neighboring points to the features of each node.
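As a rough illustration of the kernel-weighted adjacency just mentioned (and of the kernel transformations of distances noted in Section 3.1), edge weights can be made to decay with distance through a Gaussian kernel; the bandwidth below is an arbitrary placeholder, not a value taken from the KCN paper.

```python
# Gaussian-kernel weighting of a distance-based adjacency (illustrative bandwidth).
import numpy as np

def gaussian_kernel_adjacency(dists, A_knn, bandwidth=100.0):
    """Weight existing kNN edges by exp(-d^2 / (2 * bandwidth^2));
    dists must be in the same units as the bandwidth (here kilometers)."""
    return A_knn * np.exp(-(dists ** 2) / (2.0 * bandwidth ** 2))
```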
Table 2: Spatial Regression: Test MSE and MAE scores from three different datasets, using four different GNN backbones
with and without our proposed architecture.
(b) 3d Road.
Figure 3: MSE bar plots of mean performance and 2σ confidence intervals obtained from 10 different training checkpoints.
Test set points can only access neighbors from the training set to extract these features. We compare the naive version of all these approaches to the same four backbone architectures augmented with our PE-GNN modules. Beyond GNN-based approaches, we also compare PE-GNN to the most popular method for modeling continuous spatial data: Gaussian processes. For all approaches, we compare a range of different training settings and hyperparameters, as discussed below.

To allow for a fair comparison between the different approaches, we equip all models with the same architecture, consisting of two GCN / GAT / GraphSAGE layers with ReLU activation and dropout, followed by linear-layer regression heads. The KCN model also uses GCN layers, following the author specifications. We found that adding additional layers to the GNNs did not increase their capacity for processing raw latitude / longitude coordinates. We test four different auxiliary task weights λ = {0, 0.25, 0.5, 0.75}, where λ = 0 implies no auxiliary task.
Spatial graphs are constructed assuming k = 5 nearest neighbors, following rigorous testing. This also confirms findings from previous work (Appleby et al., 2020; Jia and Benson, 2020). We include a sensitivity analysis of the k parameter and of different batch sizes in our results section. Training for the GNN models is conducted using PyTorch (Paszke et al., 2019) and PyTorch Geometric (Fey and Lenssen, 2019). We use the Adam algorithm (Kingma and Ba, 2015) to optimize our models and the mean squared error (MSE) loss. Gaussian process models (exact and approximate) are trained using GPyTorch (Gardner et al., 2018). Due to the size of the dataset, we only provide an approximate GP result for 3d Road. All training is conducted on a single CPU. On the Cali. Housing dataset (n > 20,000), training times for one step (no batched training) are as follows: PE-GCN = 0.23s (with aux. task 0.24s), PE-GAT = 0.38s, PE-GraphSAGE = 0.33s, PE-KCN = 0.41s, exact GP = 0.77s. Results are averaged over 100 training steps. The code for PE-GNN and our experiments can be accessed here: https://github.com/konstantinklemmer/pe-gnn.

4.3 Results

4.3.1 Predictive performance

We test our methods on two tasks: Spatial Interpolation, predicting outcomes from spatial coordinates alone, and Spatial Regression, where other node features are available in addition to the latitude / longitude coordinates. The results of our experiments are shown in Tables 1 and 2. For all models, we provide mean squared error (MSE) and mean absolute error (MAE) metrics on held-out test data. For the spatial interpolation task, we observe that the PE-GNN approaches consistently and vastly improve performance for all four backbone architectures across the California Housing, Air Temperature and 3d Road datasets and, by a small margin, for the Election dataset. For the spatial regression task, we observe that the PE-GNN approaches consistently and substantially improve performance for all four backbone architectures on the California Housing and Air Temperature datasets. Performance remains unchanged or decreases by very small margins on the Election dataset, except for the KCN backbone, which benefits tremendously from the PE-GNN approach, particularly with auxiliary tasks.

Generally, PE-GNN substantially improves over baselines in regression and interpolation settings. Most of the improvement can be attributed to the positional encoder; however, the auxiliary task learning also has substantial beneficial effects in some settings, especially for the KCN models. The best setting for the task weight hyperparameter λ seems to depend heavily on the data, which confirms findings by Klemmer and Neill (2021). To our knowledge, PE-GNN is the first GNN-based learning approach that can compete with Gaussian Processes on simple spatial interpolation baselines, though especially exact GPs still sometimes have the edge. PE-GNN is substantially more scalable than exact GPs, which rely on expensive pair-wise distance calculations across the full training dataset. Due to this problem, we do not run an exact GP baseline for the high-dimensional 3d Road dataset. For KCN models, we observe a proneness to overfitting. As the authors of KCN mention, this effect diminishes in large enough data domains (Appleby et al., 2020). For example, KCNs are the best performing method on the 3d Road dataset, by far our largest experimental dataset. Here, we also observe that in cases when KCN learns well, PE-KCN can still improve its performance. The KCN experiments also highlight the strongest effects of the Moran's I auxiliary tasks: In cases where KCN overfits (Election, Cali. Housing datasets), PE-KCN without the auxiliary task (λ = 0) is not sufficient to overcome the problem. However, adding the auxiliary task can mitigate most of the overfitting issue. This directly confirms a theory of Klemmer and Neill (2021) on the beneficial effects of auxiliary learning of spatial autocorrelation.

Regarding the question of spatial scale, we find no systematic variation in PE-GNN performance between applications with regional (California Housing, 3d Road), continental (Election) and global (Air Temperature) spatial coverage. PE-GNN performance depends on the difficulty of the task at hand and the complexity of the present spatial dependencies.

We also assess the robustness of PE-GNN training cycles. Figure 3 highlights the confidence intervals of PE-GNN models with GCN, GAT and GraphSAGE backbones trained on the California Housing and 3d Road datasets, obtained from 10 different training cycles. We can see that training runs exhibit little variability. These findings thus confirm that PE-GNN can consistently outperform naive GNN baselines.

Figure 4: Predictive performance of PE-GCN and PE-GAT models on the California Housing dataset, using different values of k for constructing nearest-neighbor graphs and different batch sizes (bs).
4.3.2 Sensitivity analyses

Figure 4 highlights some results from our sensitivity analyses with the k and nbatch (batch size) parameters. After rigorous testing, we opt for a k = 5 nearest-neighbor approach to create the spatial graph and compute the shuffled Moran's I across all models. We chose nbatch = 2048 for the Cali. Housing and 3d Road datasets and nbatch = 1024 for the Election and Air Temperature datasets. Note that while our experiments focus on batched training to highlight the applicability of PE-GNN to high-dimensional geospatial datasets, we also tested our approach with non-batched training on the smaller datasets (Election, Air Temperature, California Housing). We found only marginal performance differences between these settings.

Figure 5: Automatic learning of loss weights via task uncertainty on the Air Temp. dataset with PE-GCN. The left graphic shows the training loss (MSE), while the right graphic shows the main and auxiliary task weight parameters σ_main and σ_aux. The training steps are given on the x-axis.

4.3.3 Learning auxiliary loss weights using task uncertainty

Lastly, following work by Cipolla et al. (2018) and Klemmer and Neill (2021), we provide an intuition for automatically selecting the Moran's I auxiliary task weights using task uncertainty. This eliminates the need to manually tune and select the λ parameter. The approach, first proposed by Cipolla et al. (2018), formalizes the idea by first defining a probabilistic multi-task regression problem with a main and auxiliary task as:

p(Ŷ_main, Ŷ_aux | f(X)) = p(Ŷ_main | f(X)) · p(Ŷ_aux | f(X))    (8)

with Ŷ_main, Ŷ_aux giving the main and auxiliary task predictions. Following maximum likelihood estimation, the regression objective is given by minimizing L(σ_main, σ_aux):

L(σ_main, σ_aux) = −log p(Ŷ_main, Ŷ_aux | f(X))
                 = (1 / (2σ²_main)) L_main + (1 / (2σ²_aux)) L_aux + (log σ_main + log σ_aux)    (9)

with σ_main and σ_aux defining the model noise parameters. By minimizing this objective, we learn the relative weight, or contribution, of the main and auxiliary task to the combined loss. The last term of the loss prevents it from moving towards infinity and acts as a regularizer. While this approach performs on par with a well-selected λ, it eliminates the need to manually tune and select λ. Figure 5 highlights the learning of the main and auxiliary loss weights using PE-GCN and the Air Temperature dataset.

5 Conclusion

With PE-GNN, we introduce a flexible, modular GNN-based learning framework for geographic data. PE-GNN leverages recent findings in embedding spatial context into neural networks to improve predictive models. Our empirical findings confirm a strong performance. This study highlights how domain expertise can help improve machine learning models for applications with distinct characteristics. We hope to build on the foundations of PE-GNN to develop further methods for geospatial machine learning.

References

Luc Anselin. 1995. Local Indicators of Spatial Association—LISA. Geographical Analysis 27, 2 (sep 1995), 93–115. https://doi.org/10.1111/j.1538-4632.1995.tb00338.x arXiv:1011.1669

Luc Anselin et al. 2001. Spatial econometrics. A companion to theoretical econometrics 310330 (2001).

Gabriel Appleby, Linfeng Liu, and Li Ping Liu. 2020. Kriging convolutional networks. In AAAI 2020 - 34th AAAI Conference on Artificial Intelligence, Vol. 34. AAAI Press, 3187–3194. https://doi.org/10.1609/aaai.v34i04.5716

Cen Chen, Kenli Li, Sin G. Teo, Xiaofeng Zou, Kang Wang, Jie Wang, and Zeng Zeng. 2019. Gated residual recurrent graph neural networks for traffic prediction. In 33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019 and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Vol. 33. AAAI Press, 485–492. https://doi.org/10.1609/aaai.v33i01.3301485

Roberto Cipolla, Yarin Gal, and Alex Kendall. 2018. Multi-task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. https://doi.org/10.1109/CVPR.2018.00781 arXiv:1705.07115

Abhirup Datta, Sudipto Banerjee, Andrew O. Finley, and Alan E. Gelfand. 2016. Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geostatistical Datasets. J. Amer. Statist. Assoc. 111, 514 (apr 2016).
Spatial Semantic Lifting. Transactions in GIS 24 (6 2020), 623–655. Issue 3. https://doi.org/10.1111/TGIS.12629

Gengchen Mai, Krzysztof Janowicz, Bo Yan, Rui Zhu, Ling Cai, and Ni Lao. 2020b. Multi-Scale Representation Learning for Spatial Feature Distributions using Grid Cells. In International Conference on Learning Representations (ICLR). arXiv:2003.00824 http://arxiv.org/abs/2003.00824

Yan Meng, Chao Lin, Weihong Cui, and Jian Yao. 2014. Scale selection based on Moran's I for segmentation of high resolution remotely sensed images. In International Geoscience and Remote Sensing Symposium (IGARSS). Institute of Electrical and Electronics Engineers Inc., 4895–4898. https://doi.org/10.1109/IGARSS.2014.6947592

J. Keith Ord and Arthur Getis. 2012. Local spatial heteroscedasticity (LOSH). Annals of Regional Science 48, 2 (apr 2012), 529–539. https://doi.org/10.1007/s00168-011-0492-y

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, Vol. 32. arXiv:1912.01703

S. C. Suddarth and Y. L. Kergosien. 1990. Rule-injection hints as a means of improving network performance and learning time. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 412 LNCS. Springer Verlag, 120–129. https://doi.org/10.1007/3-540-52255-7_33

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 2017-Decem. 5999–6009. arXiv:1706.03762 https://research.google/pubs/pub46201/

Petar Veličković, Arantxa Casanova, Pietro Liò, Guillem Cucurull, Adriana Romero, and Yoshua Bengio. 2018. Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings. International Conference on Learning Representations, ICLR. arXiv:1710.10903 https://arxiv.org/abs/1710.10903v3

Fengjiao Wang, Chun Ta Lu, Yongzhi Qu, and Philip S. Yu. 2017. Collective geographical embedding for geolocating social network users. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 10234 LNAI. Springer Verlag, 599–611. https://doi.org/10.1007/978-3-319-57454-7_47

Rene Westerholt, Bernd Resch, Franz Benjamin Mocnik, and Dirk Hoffmeister. 2018. A statistical test on the local effects of spatially structured variance. International Journal of Geographical Information Science 32, 3 (mar 2018), 571–600. https://doi.org/10.1080/13658816.2017.1402914

Yifang Yin, Zhenguang Liu, Ying Zhang, Sheng Wang, Rajiv Ratn Shah, and Roger Zimmermann. 2019. GPS2Vec: Towards generating worldwide GPS embeddings. In SIGSPATIAL: Proceedings of the ACM International Symposium on Advances in Geographic Information Systems. Association for Computing Machinery, New York, NY, USA, 416–419. https://doi.org/10.1145/3347146.3359067

Di Zhu, Fan Zhang, Shengyin Wang, Yaoli Wang, Ximeng Cheng, Zhou Huang, and Yu Liu. 2020. Understanding Place Characteristics in Geographic Contexts through Graph Convolutional Neural Networks. Annals of the American Association of Geographers 110, 2 (mar 2020), 408–420. https://doi.org/10.1080/24694452.2019.1694403