
RelDiff: Relational Data Generative Modeling with Graph-Based Diffusion Models

Valter Hudovernik¹*, Minkai Xu²*, Juntong Shi²,
Lovro Šubelj¹, Stefano Ermon², Erik Štrumbelj¹, Jure Leskovec²

¹University of Ljubljana   ²Stanford University   *Equal Contribution

arXiv:2506.00710v1 [Link] 31 May 2025

Preprint. Under review.

Abstract

Real-world databases are predominantly relational, comprising multiple interlinked tables that contain complex structural and statistical dependencies. Learning generative models on relational data has shown great promise in generating synthetic data and imputing missing values. However, existing methods often struggle to capture this complexity, typically reducing relational data to conditionally generated flat tables and imposing limiting structural assumptions. To address these limitations, we introduce RelDiff, a novel diffusion generative model that synthesizes complete relational databases by explicitly modeling their foreign key graph structure. RelDiff combines a joint graph-conditioned diffusion process across all tables for attribute synthesis, and a 2K+SBM graph generator based on the Stochastic Block Model for structure generation. The decomposition of graph structure and relational attributes ensures both high fidelity and referential integrity, two crucial aspects of synthetic relational database generation. Experiments on 11 benchmark datasets demonstrate that RelDiff consistently outperforms prior methods in producing realistic and coherent synthetic relational databases. Code is available at [Link]

1 Introduction
Relational databases, which organize information into multiple interconnected tables governed by
foreign key references, underpin over 70% of global data management systems (DB-Engines, 2024)
and form the foundation for much of today’s digital infrastructure. However, unlocking access to
high-quality real-world datasets is often limited by fairness and privacy concerns (Ntoutsi et al.,
2020; Hernandez et al., 2022; Breugel van, Schaar van der, 2023), particularly in sensitive domains
like healthcare (Appenzeller et al., 2022; Gonzales et al., 2023) and finance (Assefa et al., 2020;
Potluru et al., 2024). Thus, synthetic data generation emerges as a promising solution, offering a way to preserve crucial statistical properties while effectively mitigating privacy risks (Raghunathan, 2021). Moreover, synthetic data can unlock access to valuable enterprise and healthcare databases, facilitating the creation of powerful relational and tabular foundation models (Breugel van, Schaar van der, 2024), and has shown promise in missing value imputation (You et al., 2020; Zhang et al., 2025) and data augmentation (Fonseca, Bacao, 2023).
Unlike image data, which comprises pure continuous pixel values with local spatial correlations, or
text data, which comprises tokens that share the same vocabulary, tabular data includes complex
and varied distributions (Xu et al., 2019), making it challenging to learn joint probabilities across
multiple columns. Moreover, relational databases exhibit complex structural hierarchies and statistical
dependencies across their interconnected tables, often stored at varying levels of normalization (Codd,
1970; Delobel, 1978), which further intensifies these inherent difficulties.
A common simplification involves flattening relational schemas into single tables (Ge et al., 2021; Ghazi et al., 2023), but this approach quickly becomes impractical for the large-scale and complex schemas characteristic of real-world databases (Pang et al., 2024). Current approaches have struggled with two key limitations: (1) the inability to generate arbitrary relational schemas, and (2) the failure to effectively preserve inter-table correlations between attributes linked by foreign key relationships (Hudovernik et al., 2024). While recent progress in tabular diffusion models has demonstrated significant success in preserving within-table attribute correlations (Shi et al., 2025; Zhang et al., 2024), their direct application to relational contexts remains limited, with only a few existing methods. Furthermore, a critical limitation of the majority of current techniques lies in modeling relational databases as a sequence of conditionally generated tabular datasets. This necessitates pre-specified orderings of tables and often involves limiting assumptions, hindering the preservation of the overall relational structure and statistical interdependencies inherent in real-world databases.
In this paper, we propose RelDiff, a novel generative framework for relational databases. RelDiff enables the synthesis of arbitrary relational schemas by explicitly modeling the underlying database structure with graphs. RelDiff first utilizes a specifically designed 2K+SBM graph generator to preserve the cardinality of foreign key relationships and the hierarchical dependencies inherent in the relational structure. Built upon this faithful structural representation, we further define a joint graph-based diffusion model for attribute synthesis across interconnected tables, powered by graph neural networks (GNNs) to explicitly capture both inter- and intra-table dependencies.
The key innovations of our approach include: (1) A principled formulation for generating foreign
key structures in relational databases, incorporating hard constraints to ensure referential integrity
through a novel application of Bayesian stochastic blockmodels. (2) A joint diffusion model for
synthesizing mixed-type attributes, conditioned on graph structure using GNNs, to better capture
global dependencies. (3) We explicitly consider dimension tables, a fundamental component of
real-world relational databases (Garcia-Molina, 2008), as a distinct data type and define our diffusion
model in data space. These innovations allow us to model relational databases with arbitrarily
complex schemas and preserve both statistical and structural dependencies of the data.
We conduct comprehensive experiments to justify the effectiveness of our proposed method. Empirical results across two benchmarks, covering 11 datasets and 8 metrics, demonstrate that RelDiff consistently outperforms previous methods, with up to 80% improvement over the state of the art in preserving column correlations between connected tables. These significant improvements highlight the superior generative capacity of our approach on relational data.

2 Related Work

Relational Database Synthesis. Patki et al. (2016) were the first to propose a learning-based method for relational database synthesis: the Synthetic Data Vault (SDV). Recent methods broadly fall into neural network-based (prioritizing fidelity) and marginal-based (focusing on differential privacy) approaches. While marginal-based methods are established for single-table synthesis (Zhang et al., 2017; McKenna et al., 2022), their extension to relational data is more recent, with PrivLava (Cai et al., 2023) and several newer methods emerging (Alimohammadi et al., 2025; Kapenekakis et al., 2024; Cai et al., 2025).
We focus on neural network-based relational database synthesis, preserving fidelity and utility for
arbitrary schemas. Graph variational autoencoder-based methods (Mami et al., 2022) have been
investigated but encounter scalability issues with real-world databases. Generative adversarial network
(GAN)-based methods, such as RCTGAN (Gueye et al., 2023) and IRG (Li, Tay, 2023), extend
the successful CTGAN (Xu et al., 2019) architecture from single-table synthesis to relational data.
However, recent advancements have demonstrated the superior performance of diffusion models over
GANs in various generative tasks. Autoregressive approaches, leveraging language models (Solatorio,
Dupriez, 2023) and any-order networks (Tiwald et al., 2025), have also been investigated, but their
autoregressive nature makes them better suited for simpler relational structures, particularly those
with single-parent schemas. In contrast, the work of Xu et al. (2023) proposes a method for generating
many-to-many datasets using bipartite 2K random graphs. Similar to our approach, they decouple the
generation of database structure and attribute synthesis. However, their graph generation method does
not capture hierarchical relational structures and primarily focuses on many-to-many relationships,
generating tables sequentially. Building on the success of diffusion models for single-table data
generation (Kotelnikov et al., 2023; Zhang et al., 2024; Shi et al., 2025), two diffusion-based methods
for relational data have emerged: ClavaDDPM (Pang et al., 2024) and RGCLD (Hudovernik, 2024).
Both of these approaches largely treat relational database synthesis as a series of conditional single-table generation tasks, relying on a pre-specified table ordering and introducing limiting assumptions about the relational dependencies. A detailed overview of related work is provided in Appendix A.
Graph Structure Generation. Realistic and efficient graph generation models originate from the degree sequence problem. dK-random graphs (Mahadevan et al., 2006) preserve the degrees of nodes in d-sized connected subgraphs: 0K graphs preserve the density (by convention), 1K graphs the degree distribution, 2K graphs the joint degree distribution of neighbors, and 3K graphs the degrees of connected triplets of nodes. While there exist efficient methods to generate up to 2K graphs, generating 3K graphs is NP-hard. Therefore, the 2K+ graph construction framework (Tillman et al., 2019) proposes heuristic approaches for targeting additional properties such as connected components and clustering. We extend the 2K+ framework by preserving (not only targeting) hierarchical and any other modular organization present in relational databases. We employ the Stochastic Block Model (SBM) (Holland et al., 1983), where nodes are partitioned into blocks which define their connectivity (blocks are subsets of relational tables). The degree-corrected SBM (Dasgupta et al., 2004) accounts for the variance in node degrees, while the microcanonical version enforces hard constraints on the edges (Peixoto, 2017). A hierarchy of nested SBMs reduces the minimum detectable size of blocks from O(√n) down to O(log n), where n is the number of nodes (Peixoto, 2014).
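To make the dK statistics concrete, the following minimal sketch counts a graph's joint degree distribution, the quantity that 2K-random graphs preserve (networkx is used here purely for illustration; it is not part of our pipeline):

```python
from collections import Counter
import networkx as nx

def joint_degree_counts(G: nx.Graph) -> Counter:
    """Count edges by the (sorted) degree pair of their endpoints (the 2K statistics)."""
    deg = dict(G.degree())
    return Counter(tuple(sorted((deg[u], deg[v]))) for u, v in G.edges())

# Two graphs are 2K-equivalent exactly when these counters match.
```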
Deep generative models for graphs, such as diffusion (Liu et al., 2019; Vignac et al., 2023) and autoregressive models (You et al., 2018; Liao et al., 2019), exist but often assume a dense representation, scaling poorly to large relational database graphs (Jang et al., 2024; Li et al., 2024). Unlike our model, they also typically do not enforce the hard structural constraints relevant to our setting.

3 Preliminaries
Notation. We begin by introducing a formal definition of relational databases, following the RDL framework (Fey et al., 2024), which provides a principled abstraction of relational data as heterogeneous graphs. This formulation enables us to decouple the generative process into two components: (1) the generation of database structure via a schema-consistent relational entity graph, and (2) the joint synthesis of entity-level attributes conditioned on this structure and local neighborhoods.
A relational database (T, L) consists of a collection of tables T = {T_1, ..., T_n} and links between tables L ⊆ T × T. A link L = (T_fkey, T_pkey) exists if a foreign key column in T_fkey references a primary key column in T_pkey. Each table is a set T = {v_1, ..., v_{n_T}}, whose elements v_i ∈ T are called rows or entities. Each entity v ∈ T is defined as a tuple v = (p_v, K_v, x_v), where: p_v is the primary key, uniquely identifying the entity v; K_v ⊆ {p_{v′} : v′ ∈ T′ and (T, T′) ∈ L} is the set of foreign keys, establishing links from v ∈ T to entities v′ ∈ T′, where p_{v′} is the primary key of v′ in table T′; and x_v contains the attributes, representing the informational content of the entity.
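For intuition, the entity tuple v = (p_v, K_v, x_v) maps directly onto a simple record type; the sketch below is an illustrative encoding of the notation, not part of the RDL framework itself:

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class Entity:
    pk: int                                               # primary key p_v
    fks: Dict[str, int] = field(default_factory=dict)     # foreign keys K_v, keyed by parent table
    attrs: Dict[str, Any] = field(default_factory=dict)   # attributes x_v
```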
The two central objects of RDL as defined by Fey et al. (2024) are the schema graph and the relational entity graph. The schema graph defines the table-level structure of the data. Given a relational database (T, L) and the inverse set of links L⁻¹ = {(T_pkey, T_fkey) | (T_fkey, T_pkey) ∈ L}, the schema graph is the graph (T, R) with node set T and edge set R = L ∪ L⁻¹. The nodes of the schema graph serve as type definitions for the heterogeneous relational entity graph.

The relational entity graph is a heterogeneous graph G = (V, E, φ, ψ), with node set V, edge set E ⊆ V × V, and type mapping functions φ : V → T and ψ : E → R, where each node v ∈ V belongs to a node type φ(v) ∈ T and each edge e ∈ E belongs to an edge type ψ(e) ∈ R. Specifically, the sets T and R from the schema graph define the node and edge types of our relational entity graph.
Real-world relational databases contain diverse data types. Following prior work, we focus on
numerical, categorical, and datetime attributes, representing each as either a continuous or discrete
random variable for unified modeling. We handle dimension tables explicitly as fixed-size vocabulary
lookups to ensure schema consistency and improve sample quality.
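Concretely, a dimension table such as Products can be treated as a fixed vocabulary with one learnable embedding per row, referenced by index rather than regenerated attribute-by-attribute; a minimal sketch (the sizes are illustrative choices, not our exact configuration):

```python
import torch

n_products = 5000                  # rows of the dimension table (fixed vocabulary)
product_emb = torch.nn.Embedding(num_embeddings=n_products, embedding_dim=64)

# A fact-table row then conditions on its parent via a simple index lookup.
fk = torch.tensor([42])            # foreign key into Products
context = product_emb(fk)          # learnable representation of the referenced row
```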
Gaussian Diffusion. For a numerical attribute $z$, the forward diffusion process gradually perturbs the data with increasing Gaussian noise: $q_{\mathrm{num}}(z_t \mid z_0) = \mathcal{N}(z_0, (\sigma^{\mathrm{num}}(t))^2)$, where $\sigma^{\mathrm{num}}(t) : [0, 1] \to \mathbb{R}^+$ is an increasing function that governs the cumulative noise level over time. The marginal distribution $p(z_0)$ at time $t = 0$ corresponds to the data distribution, while $p(z_1)$ converges to the known noise distribution $\mathcal{N}(0, (\sigma^{\mathrm{num}}(1))^2)$, from which we can easily sample. Following the framework of Karras et al. (2022), the true reverse distribution $q_{\mathrm{num}}(z_s \mid z_t)$ can be formulated by the solutions to the ordinary differential equation (ODE)
$$\mathrm{d}z^{\mathrm{num}} = -\Big[\tfrac{\mathrm{d}}{\mathrm{d}t}\sigma^{\mathrm{num}}(t)\Big]\,\sigma^{\mathrm{num}}(t)\,\nabla_z \log p_t(z)\,\mathrm{d}t,$$

(Figure 1; panels: Structural Learning, Network Learning, Graph Generation.)
Figure 1: RelDiff framework overview. RelDiff applies forward diffusion to mixed-type attributes within each relational table and performs joint reverse denoising across tables, conditioned on the relational entity graph and node neighborhoods. Learnable embeddings handle dimension tables (e.g., Products), and a sampled synthetic entity graph guides the generation process.

where $0 \le s < t \le 1$. To learn the generative model, we approximate the true score function $\nabla_z \log p_t(z^{\mathrm{num}})$ using a neural network $\mu_\theta^{\mathrm{num}}$, which is trained via the following denoising loss:
$$\mathcal{L}_{\mathrm{num}}(\theta) = \mathbb{E}_{z_0 \sim p(z_0)}\, \mathbb{E}_{t \sim U[0,1]}\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} \big\| \mu_\theta^{\mathrm{num}}(z_t, t) - \epsilon \big\|_2^2. \quad (1)$$
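In code, the forward perturbation and the loss in Eq. (1) reduce to a few lines; the sketch below assumes an EDM-style geometric schedule and a caller-supplied network mu_theta, both illustrative choices rather than our exact configuration:

```python
import torch

def sigma_num(t, sigma_min=0.002, sigma_max=80.0):
    # Increasing noise schedule on [0, 1] (geometric interpolation, as in EDM).
    return sigma_min * (sigma_max / sigma_min) ** t

def gaussian_denoising_loss(mu_theta, z0):
    t = torch.rand(z0.shape[0], 1)                 # t ~ U[0, 1], one level per row
    eps = torch.randn_like(z0)                     # eps ~ N(0, I)
    zt = z0 + sigma_num(t) * eps                   # forward process q_num(z_t | z_0)
    return ((mu_theta(zt, t) - eps) ** 2).mean()   # denoising loss, Eq. (1)
```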
Masked Diffusion. For a categorical attribute $c$ with $K$ categories, we introduce an additional $(K+1)$-th category to represent the special [MASK] state and denote it as $m = (0, \ldots, 0, 1) \in \{0, 1\}^{K+1}$ in the one-hot representation. Let $\mathrm{cat}(\cdot\,; \pi)$ denote the categorical distribution parameterized by the probability vector $\pi$. The forward diffusion process operates in continuous time by gradually masking the original values: $q_{\mathrm{cat}}(c_t \mid c_0) = \mathrm{cat}(c_t; \alpha_t c_0 + (1 - \alpha_t) m)$, where $\alpha_t = \exp(-\sigma^{\mathrm{cat}}(t))$ is a decreasing function between 0 and 1 controlling the masking rate. Following Sahoo et al. (2024), the true reverse transition distribution $q_{\mathrm{cat}}(c_s \mid c_t)$ is given as:
$$q_{\mathrm{cat}}(c_s \mid c_t) = \begin{cases} \mathrm{cat}(c_s;\, c_t) & c_t \neq m, \\[4pt] \mathrm{cat}\!\Big(c_s;\, \frac{(1-\alpha_s)m + (\alpha_s - \alpha_t)\, q_{\mathrm{cat}}(c_0 \mid c_t)}{1-\alpha_t}\Big) & c_t = m. \end{cases} \quad (2)$$

To model this generative process, we train a neural network $\mu_\theta^{\mathrm{cat}}$ to predict the original category $c_0$ from the masked input $c_t$, i.e., to estimate $q_{\mathrm{cat}}(c_0 \mid c_t)$. The model is optimized using a cross-entropy loss under the continuous-time limit:
$$\mathcal{L}_{\mathrm{cat}}(\theta) = \mathbb{E}_q \int_{t=0}^{t=1} \frac{\alpha_t'}{1-\alpha_t}\, \mathbb{1}\{c_t = m\} \log \big\langle \mu_\theta^{\mathrm{cat}}(c_t, t),\, c_0 \big\rangle\, \mathrm{d}t. \quad (3)$$
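The forward masking process is equally simple to simulate: each categorical value is independently replaced by the [MASK] index with probability $1 - \alpha_t$. A minimal sketch (the concrete schedule $\sigma^{\mathrm{cat}}$ is an illustrative choice):

```python
import torch

def mask_forward(c0: torch.Tensor, t: float, K: int) -> torch.Tensor:
    """Sample c_t ~ q_cat(c_t | c_0), with the mask token at index K."""
    alpha_t = torch.exp(torch.tensor(-5.0 * t))   # alpha_t = exp(-sigma_cat(t)); illustrative schedule
    keep = torch.rand(c0.shape) < alpha_t         # keep the original value with probability alpha_t
    return torch.where(keep, c0, torch.full_like(c0, K))
```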

4 Method
This section details RelDiff, our framework for generating synthetic relational databases. We provide a high-level overview of our generative model in Section 4.1. First, we describe how we model the relational graph structure using Bayesian stochastic blockmodels, ensuring referential integrity and preserving relationship cardinalities and hierarchical dependencies (Section 4.2). Next, we present our joint relational diffusion model for synthesizing attributes across the database schema, parameterized by a single heterogeneous graph neural network with tabular transformer-based encoders and decoders (Section 4.3). Finally, we detail the training and sampling procedures (Section 4.4).

4.1 Overview

In line with Probabilistic Relational Models (PRMs) (Friedman et al., 1999; Getoor et al., 2001), we
decompose the generative process into modeling the relational graph structure and the attributes of
the tables. We treat the relational entity graph G = (V, E) as a single sample from some unknown
joint distribution p(V, E). Our objective is to sample from this distribution, ensuring adherence to
referential integrity constraints and the preservation of statistical dependencies. We formalize this
joint sampling through the following conditional decomposition p(V, E) = p(E)p(V | E).
To model graph structure p(E), we present a novel approach based on established models from
graph theory. Within our joint diffusion framework, we formulate p(V | E) to explicitly model the
dependencies between nodes representing entities across different tables during generation.

(a) Original dataset (b) 2K+SBM (Ours) (c) Bipartite 2K (d) ClavaDDPM
Figure 2: Hierarchical structure of the F1 dataset. Our SBM-based method preserves the joint degree distribution and hierarchy of the F1 dataset's foreign key graph. In contrast, the bipartite 2K-graph approach (Xu et al., 2023) loses this structure despite matching degree distributions, and ClavaDDPM, by modeling the structure implicitly, retains some hierarchy but not the degrees.

4.2 Graph Structure Generation

To generate realistic synthetic relational structures, we focus on sampling graphs that preserve
the original database size, enforce referential integrity, and respect exact table and relationship
cardinalities. Modern deep generative approaches are not applicable here due to their scalability
limitations with large, sparse graphs (Jang et al., 2024). Instead, we base our approach on classical
random graph models, which provide principled, efficient mechanisms for sampling large, structured
graphs while preserving referential integrity by design.
The graph structure of a relational database is the heterogeneous entity graph G = (V, E, φ, ψ) defined above. Our objective is to learn a distribution p(E) over such graphs, conditioned on a fixed set of nodes v ∈ V, the number of edges $m_r = |E_r|$ of each type $r \in R$, and the node degree sequence $k_v^r$, counting the edges of type $r$ incident to $v$, where $\sum_{v \in V} k_v^r = 2m_r$. This corresponds to fixed row counts and exact entity and relationship cardinalities. We also preprocess the entity graph to convert tables with two parents and no children into many-to-many edges. This one-to-one transformation avoids sampling edges inconsistent with the original database and is reverted afterwards.
To generate samples from p(E), we utilize a nonparametric Bayesian Stochastic Block Model (SBM) (Peixoto, 2019) as a model of graphs with the above structural constraints. The microcanonical degree-corrected SBM (Peixoto, 2017) defines the distribution p(E | b, m, k), where b : V → Z is a partition of nodes into some latent (disjoint) blocks. By setting b = φ, the model is equivalent to the above, a.k.a. 2K-random graphs in the literature (Mahadevan et al., 2006). 2K-random graphs preserve the node degree sequence and the degree correlations of neighbouring nodes. To also preserve the global hierarchical and other modular organization of the database, we use the maximum likelihood partition b* that minimizes the description length of a nested hierarchy of SBMs (Peixoto, 2014). Note that we constrain the partition b* by node types φ such that b*(v) = b*(u) implies φ(v) = φ(u) for all v, u ∈ V. We refer to this model as 2K+SBM graphs, consistent with the literature (Tillman et al., 2019).
The generation process proceeds in three stages.

1. We employ an efficient Markov Chain Monte Carlo approach (Peixoto, 2014) to infer the most likely partition b* of the edge set E.
2. For each relationship r ∈ R, we sample a new edge set S_r independently from p(E_r | b*_r, m_r, k^r), where b*_r is the partition of nodes induced by E_r. When E_r induces a simple graph, we ensure to sample S_r only from simple graphs (unless stated otherwise).
3. The final generated graph is induced by the edge set ∪_{r∈R} S_r.

Our approach preserves the joint degree distribution by construction and, as illustrated in Figure 2,
retains the hierarchical structure of the relational entity graph.
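In practice, stages 1 and 2 can be realized with off-the-shelf SBM tooling. The following is a minimal sketch using graph-tool, following its documented recipe for resampling from a fitted block state; treating the lowest hierarchy level as the sampling partition and omitting the node-type constraints are simplifications of this sketch, not our exact implementation:

```python
import graph_tool.all as gt

def sample_2k_sbm(g):
    """Fit a nested degree-corrected SBM to graph g and sample a new edge set."""
    # Stage 1: infer the most likely nested partition b* via MCMC (Peixoto, 2014).
    state = gt.minimize_nested_blockmodel_dl(g)
    base = state.levels[0]                                 # lowest-level block state
    # Stage 2: resample edges microcanonically, preserving the block-to-block
    # edge counts and the exact degree sequence (degree-corrected SBM).
    ers = gt.adjacency(base.get_bg(), base.get_ers()).T    # edge counts between blocks
    degs = g.degree_property_map("total").a                # exact degree sequence
    return gt.generate_sbm(base.b.a, ers, out_degs=degs,
                           micro_ers=True, micro_degs=True)
```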

4.3 Joint Multi-Relational Diffusion

To model the distribution over node attributes, we define a hybrid diffusion process that applies forward noise independently across entries and independently across attribute types (numerical and categorical) within each entry. Let $\mathcal{V}_t$ denote the set of all entities at time $t$, and $x_t^v$ the attributes of a single entry $v$. This forward process is given by:
$$q(\mathcal{V}_t \mid \mathcal{V}_0) = \prod_{v_t \in \mathcal{V}_t} q_{\mathrm{num}}^{\phi(v)}(x_t^{\mathrm{num}} \mid x_0^{\mathrm{num}}) \cdot q_{\mathrm{cat}}^{\phi(v)}(x_t^{\mathrm{cat}} \mid x_0^{\mathrm{cat}}), \qquad x_t \in v_t. \quad (4)$$

The true reverse process is then modeled as the joint posterior:
$$q(\mathcal{V}_s \mid \mathcal{G}_t) = \prod_{v_s \in \mathcal{V}_s} q_{\mathrm{num}}^{\phi(v_s)}(x_s^{\mathrm{num}} \mid x_t^{\mathrm{num}}, \mathcal{G}_t) \cdot q_{\mathrm{cat}}^{\phi(v_s)}(x_s^{\mathrm{cat}} \mid x_t^{\mathrm{cat}}, \mathcal{G}_t), \qquad x_s \in v_s, \quad (5)$$
which factorizes over entities but allows each denoising step to condition on the full database context $\mathcal{G}_t$ at a given timestep $t$. Note that to simplify notation, we omit defining a common space on which $q$ is defined, but make a distinction between the $q^{\phi(v)}$, which are defined on subsets of the joint space.
We formulate the learning objective as a graph-based denoising task, where the goal is to train a model $p_\theta(x_s^v \mid x_t^v, \mathcal{G}_t)$ that reconstructs clean attributes from noisy inputs. However, conditioning on the full graph at every step can be computationally prohibitive, as real-world databases typically contain millions of entries.

To improve efficiency while still capturing the interactions between attributes across connected tables, we assume conditional independence of each node $v \in \mathcal{V}$ given its noisy k-hop neighborhood at timestep $t$, denoted $\mathcal{N}_k(v)_t$. Under this assumption, the model is approximated as $p_\theta(x_s^v \mid x_t^v, \mathcal{N}_k(v)_t)$ and parameterized using a GNN that operates locally over the neighborhood $\mathcal{N}_k(v)_t$.
By plugging in the objective functions corresponding to the Gaussian and masked diffusion processes (defined in Eqs. (1) and (3)), we obtain the following optimization objective with two weight terms $\lambda_{\mathrm{num}}$ and $\lambda_{\mathrm{cat}}$:
$$\mathcal{L}_{\mathrm{RelDiff}}(\theta) = \sum_{T_i \in \mathcal{T}} \big(\lambda_{\mathrm{num}} \mathcal{L}_{\mathrm{num}}(\theta) + \lambda_{\mathrm{cat}} \mathcal{L}_{\mathrm{cat}}(\theta)\big) = \sum_{T_i \in \mathcal{T}} \mathbb{E}_{t \sim U(0,1)}\, \mathbb{E}_{x_t^v \sim q^{T_i}(x_t^v, x_0^v)} \Bigg[ \lambda_{\mathrm{num}} \big\| \mu_\theta^{\mathrm{num}}(x_t^v, t, \mathcal{N}_k(v))^{T_i} - \epsilon_{\mathrm{num}} \big\|_2^2 + \sum_{c_t \in x_t^{v,\mathrm{cat}}} \frac{\lambda_{\mathrm{cat}}\, \alpha_t'}{1 - \alpha_t}\, \mathbb{1}\{c_t = m\} \log \big\langle \mu_\theta^{\mathrm{cat}}(x_t^v, t, \mathcal{N}_k(v))^{T_i},\, c_0 \big\rangle \Bigg] \quad (6)$$
We parametrize our model with a heterogeneous graph neural network with transformer encoders
and decoders and one MLP backbone per table. We use a heterogeneous variant of the GraphSAGE
network (Hamilton et al., 2017).
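A minimal version of such a denoiser can be assembled with PyTorch Geometric, whose to_hetero utility lifts a homogeneous GraphSAGE stack to the relational entity graph's node and edge types; the sketch omits the transformer encoders/decoders, per-table MLPs, and time conditioning, and its hyperparameters are illustrative:

```python
import torch
from torch_geometric.nn import SAGEConv, to_hetero

class SAGEBackbone(torch.nn.Module):
    def __init__(self, hidden_dim: int = 128):
        super().__init__()
        # Lazy (-1, -1) input sizes let each node type keep its own feature width.
        self.conv1 = SAGEConv((-1, -1), hidden_dim)
        self.conv2 = SAGEConv((-1, -1), hidden_dim)

    def forward(self, x, edge_index):
        h = self.conv1(x, edge_index).relu()
        return self.conv2(h, edge_index)   # per-node denoising representation

# metadata = (node_types, edge_types) comes from the relational entity graph:
# model = to_hetero(SAGEBackbone(), metadata, aggr="sum")
```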

4.4 Training and Inference

With the forward process defined in Eq. (4), we present the training procedure for our joint diffusion model in Algorithm 1. We begin each training step by sampling a subgraph that maintains the proportional representation of nodes from each table in the original database. Subsequently, we sample a timestep t ∼ U(0, 1) using a low-discrepancy sampler similar to the one proposed by Kingma et al. (2021) and apply the corresponding noise schedules to perturb the numerical and categorical attributes. The noisy subgraph, along with t, is then fed into our model. The model jointly denoises the attributes across all tables, and we perform a gradient step on the combined loss function defined in Eq. (6).

Algorithm 1 Training
1: repeat
2:   Sample batch G_batch
3:   Sample t ∼ U(0, 1)
4:   for T_i ∈ T do
5:     x_0 ← G_batch.x_0^{T_i}
6:     Sample ε_num ∼ N(0, I_{M_{T_i}^{num}})
7:     x_t^{num} = x_0^{num} + σ_num(t) · ε_num
8:     Sample x_t^{cat} ∼ q(x_t | x_0)
9:     G_batch.x_t^{T_i} ← [x_t^{num}, x_t^{cat}]
10:  end for
11:  Take gradient descent step on ∇_θ L_RelDiff
12: until converged
Relational databases often consist of millions of entities, making it infeasible to load and process the
full relational graph in memory during training. Training on the entire graph would be ideal from a
computational perspective, since it provides complete access to the computational graph and enables
loss computation across all nodes. However, it would constrain us to sampling a single noise level
per training iteration, as the entire graph must be denoised simultaneously. In contrast, minibatch
training provides a practical trade-off: it reduces memory consumption and enables sampling across
different noise levels, at the cost of reusing only a portion of the computation graph per step.

Subgraph Sampling for Efficient Training. For databases organized into multiple disjoint subgraphs
(e.g., those following a snowflake schema), minibatch construction is straightforward. We sample n
subgraphs independently at each training step and compute the loss across all nodes within them. In
more general settings, where foreign key relationships form a connected network rather than isolated
components, we adapt the neighbor sampling procedure of Hamilton et al. (2017). Specifically, we
begin by selecting a set of seed nodes for each table, then sample their k-hop neighborhoods. The
resulting subgraphs are merged to form a single, connected minibatch subgraph used for training.
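In our setting this corresponds to standard heterogeneous neighbor sampling; the sketch below uses PyTorch Geometric's NeighborLoader as one possible realization (the seed table name and fanouts are illustrative assumptions):

```python
from torch_geometric.loader import NeighborLoader

loader = NeighborLoader(
    data,                                # relational entity graph as a HeteroData object
    num_neighbors=[10, 10],              # sample 2-hop neighborhoods (k = 2)
    batch_size=1024,
    input_nodes=("transactions", None),  # hypothetical seed table; seeds are drawn per table
)
for batch in loader:
    ...  # perturb at a sampled t, denoise jointly, take a gradient step
```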
Sampling. During the sampling process, we first generate a new relational entity graph using our
2K+SBM graph generator, capturing the structural properties of the original database. To initiate the
reverse diffusion, we sample initial values for numerical attributes from a standard Gaussian prior
distribution and set all categorical attributes to the masked state. At each subsequent denoising step
of the reverse diffusion, we jointly denoise the attributes of all tables using our learned diffusion
model, allowing for simultaneous refinement and capturing inter-table dependencies. For the reverse
diffusion process, we utilize a stochastic sampler, similar to the one proposed by Shi et al. (2025), to
introduce stochasticity and diversity into the generated samples.
After completing the denoising process, we transform the generated graph entities back into a tabular
format, reconstructing the table structure. Finally, we add foreign keys to the tables based on the
edges of the generated relational entity graph, ensuring the generated data adheres to the structural
relationships defined by the schema.
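Putting the pieces together, sampling follows the two-stage recipe sketched below; this is high-level pseudocode with placeholder helpers (sample_entity_graph, denoise_step, to_tables, and attach_foreign_keys are hypothetical names), not our exact implementation:

```python
import torch

def sample_database(model, sbm, schema, n_steps: int = 50):
    graph = sbm.sample_entity_graph()                     # stage 1: 2K+SBM structure
    x = {}
    for T in schema.tables:                               # stage 2: initialize attributes
        n = graph.num_nodes(T)
        x[T] = {"num": torch.randn(n, schema.n_num[T]),   # Gaussian prior for numerical
                "cat": torch.full((n, schema.n_cat[T]),   # all categoricals start [MASK]
                                  schema.mask_idx[T])}
    for t in torch.linspace(1.0, 0.0, n_steps):           # joint reverse diffusion
        x = model.denoise_step(x, graph, t)               # denoises all tables at once
    tables = to_tables(x, schema)                         # back to tabular format
    return attach_foreign_keys(tables, graph)             # FKs from the generated edges
```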

5 Experiments

We evaluate RelDiff by comparing it against 6 generative methods using two relational database generation benchmarks consisting of 11 real-world datasets totaling 67 tables and 64 foreign key relationships. We focus on multi-table fidelity and downstream task performance. We provide an additional fidelity and privacy evaluation in Appendix D.

5.1 Experimental Setup

Real-world datasets. We experiment with eleven real-world relational databases from two benchmarks from related work. The datasets include: Biodegradability, Berka, a relational version of the Cora dataset, Walmart Recruiting - Store Sales Forecasting, Airbnb Bookings, Rossmann Store Sales, CCS, Instacart 05, and F1. These datasets vary in the number of tables, the maximum depth, the number of foreign-key relationships, and structural complexity. Details can be found in Appendix B.1.
Baselines. We compare our method with state-of-the-art methods from each benchmark. These
include ClavaDDPM (Pang et al., 2024), RCTGAN (Gueye et al., 2023), RealTabFormer (Solatorio,
Dupriez, 2023), SDV (Patki et al., 2016) and TabularARGN (Tiwald et al., 2025) on the SyntheRela
benchmark and ClavaDDPM, SDV and PrivLava (Cai et al., 2023) on the ClavaDDPM benchmark.
Evaluation metrics. We follow the protocols of Jurkovič et al. (2025) and Pang et al. (2024) and
use the same evaluation metrics: 1) Fidelity: Shape, Trend, C2ST, C2ST-Agg, k-hop correlation
and cardinality similarity assess how well the synthetic data can faithfully recover the ground-truth
data distribution; 2) Downstream tasks: Machine learning efficiency using RDL utility evaluates the
models’ potential to power downstream tasks; 3) Privacy: The Distance to Closest Record (DCR)
score evaluates the level of privacy protection by measuring how close the synthetic data is to the
training data. We provide detailed descriptions of all metrics in Appendix B.2.
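As one illustration of the privacy metric, DCR can be computed as the nearest-neighbor distance from each synthetic row to the preprocessed training rows; the sketch below assumes an L2 distance on encoded single-table rows, which may differ from the benchmark's exact implementation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dcr(train: np.ndarray, synthetic: np.ndarray) -> np.ndarray:
    """Distance to Closest Record per synthetic row (larger = farther from training data)."""
    nn = NearestNeighbors(n_neighbors=1).fit(train)
    dist, _ = nn.kneighbors(synthetic)
    return dist.ravel()
```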
Implementation Details. All reported experiment results are the average of 3 randomly sampled
synthetic data samples. Additional implementation details, such as the hardware and software
information as well as hyperparameter settings, are in Appendix C.

5.2 Multi-Table Fidelity

In all fidelity experiments we evaluate two versions of our model: RelDiff and RelDiffORIG, which preserves the original database structure (a PRM with attribute uncertainty).
We first evaluate multi-table fidelity on the SyntheRela benchmark datasets.

Table 1: Multi-table results on the SyntheRela benchmark. For each dataset we report the average detection accuracy for C2ST-Agg (lower is better) and k-hop correlation similarity (higher is better). The number of k-hop results is determined by the maximum depth of the dataset. We report the best result in bold (excluding RelDiffORIG, as it always achieves the strongest results). DNC denotes Did Not Converge. (Baselines from Jurkovič et al. (2025).)

| Dataset | Metric | TabARGN | ClavaDDPM | RCTGAN | REALTABF. | SDV | RelDiff | Improv. | RelDiffORIG |
|---|---|---|---|---|---|---|---|---|---|
| Airbnb | C2ST-Agg (↓) | 63.47±0.88 | ≈100.0 | 98.22±0.08 | 99.13±0.02 | 99.94±0.01 | 55.68±0.66 | 12.26 | 55.68±0.66 |
| | Cardinality (↑) | 98.59±0.32 | 99.65±0.06 | 95.45±0.62 | 76.38±0.47 | 26.36±0.03 | 100.0 | 0.35 | 100.0 |
| | 1-HOP (↑) | 79.66±0.36 | 86.69±0.14 | 68.78±0.54 | 33.99±5.76 | 24.58±0.03 | 89.37±0.38 | 3.10 | 89.37±0.38 |
| Rossmann | C2ST-Agg (↓) | 60.43±0.63 | 85.77±0.07 | 86.11±1.01 | 85.90±1.33 | 98.37±0.23 | 51.06±1.39 | 15.51 | 51.06±1.39 |
| | Cardinality (↑) | 94.17±1.84 | 99.19±0.29 | 82.69±1.95 | 41.82±10.29 | 99.16±0.15 | 100.0 | 0.81 | 100.0 |
| | 1-HOP (↑) | 92.95±0.78 | 82.81±0.47 | 87.02±0.17 | 80.25±0.84 | 73.84±0.34 | 96.73±0.18 | 4.06 | 96.73±0.18 |
| Walmart | C2ST-Agg (↓) | 94.81±1.68 | 73.33±2.92 | 94.81±1.68 | 90.0±0.91 | 88.52±1.60 | 66.30±1.68 | 9.60 | 66.30±1.68 |
| | Cardinality (↑) | 65.93±1.98 | 93.33±2.28 | 88.15±1.51 | 85.56±4.57 | 86.30±1.09 | 100.0 | 7.14 | 100.0 |
| | 1-HOP (↑) | 75.40±1.49 | 86.40±1.73 | 79.02±0.15 | 74.99±0.20 | 76.64±1.07 | 91.87±0.42 | 6.34 | 91.87±0.42 |
| Berka | C2ST-Agg (↓) | 80.56±1.86 | 69.12±0.63 | 76.86±2.22 | - | 77.43±0.14 | 55.69±0.82 | 19.43 | 49.02±0.08 |
| | Cardinality (↑) | 85.17±0.84 | 96.43±0.36 | 81.28±1.07 | - | 80.53±0.72 | 100.0 | 3.70 | 100.0 |
| | 1-HOP (↑) | 72.82±0.38 | 87.92±1.66 | 78.87±0.91 | - | 59.09±0.49 | 96.88±0.06 | 10.20 | 97.58±0.51 |
| | 2-HOP (↑) | 65.51±0.31 | 84.41±2.46 | 77.98±0.95 | - | 23.09±0.21 | 95.79±0.02 | 13.49 | 97.33±0.61 |
| | 3-HOP (↑) | 59.34±0.62 | 80.67±2.18 | 78.65±0.69 | - | 58.23±0.58 | 90.19±0.22 | 11.81 | 91.41±0.65 |
| F1 | C2ST-Agg (↓) | 95.90±0.94 | 82.52±0.25 | 91.23±0.39 | - | 94.55±0.24 | 64.85±0.11 | 21.41 | 49.0±0.28 |
| | Cardinality (↑) | 58.17±3.71 | 88.45±3.05 | 56.82±1.55 | - | 71.88±0.12 | 100.0 | 13.06 | 100.0 |
| | 1-HOP (↑) | 77.37±0.26 | 79.35±0.03 | 79.14±0.72 | - | 68.45±0.20 | 93.46±0.10 | 17.78 | 97.71±0.15 |
| | 2-HOP (↑) | 76.25±0.32 | 84.18±0.12 | 83.50±0.82 | - | 76.93±0.24 | 95.91±0.03 | 13.94 | 98.46±0.02 |
| IMDB | C2ST-Agg (↓) | 73.76±1.78 | 65.0±0.34 | 81.56±2.0 | - | DNC | 53.29±0.45 | 18.01 | 52.26±0.06 |
| | Cardinality (↑) | 81.19±0.80 | 98.95±0.03 | 79.53±1.27 | - | DNC | 100.0 | 1.06 | 100.0 |
| | 1-HOP (↑) | 88.64±0.70 | 91.57±1.25 | 81.76±0.20 | - | DNC | 94.84±0.35 | 3.57 | 95.27±0.05 |
| Biodeg. | C2ST-Agg (↓) | 88.86±0.26 | - | 83.82±3.35 | - | 98.02±0.06 | 47.04±0.27 | 43.88 | 44.37±0.57 |
| | Cardinality (↑) | 79.53±0.24 | - | 85.22±0.50 | - | 61.17±0.36 | 100.0 | 17.35 | 100.0 |
| | 1-HOP (↑) | 61.36±0.47 | - | 75.80±1.46 | - | 49.09±0.59 | 89.37±3.76 | 17.90 | 95.81±0.30 |
| | 2-HOP (↑) | 60.54±0.44 | - | 77.04±1.96 | - | 47.80±2.16 | 86.59±5.02 | 12.40 | 95.07±0.43 |
| Cora | C2ST-Agg (↓) | 68.80±0.67 | - | 73.74±0.47 | - | 99.59±0.03 | 69.30±0.52 | 0.0 | 66.08±0.57 |
| | Cardinality (↑) | 96.27±0.13 | - | 90.48±2.16 | - | 68.82±0.29 | 100.0 | 3.87 | 100.0 |
| | 1-HOP (↑) | 80.42±0.34 | - | 68.39±0.08 | - | 4.95±0.12 | 72.16±1.19 | 0.0 | 79.11±0.34 |

We focus on two key metrics: C2ST-Agg, which probes the preservation of higher-order interactions and aggregations across connected tables, and k-hop similarity, which evaluates correlations between columns of paired tables at varying depths within the database schema. In line with previous work, we also report cardinality similarity; however, our approach consistently achieves a perfect score, as it preserves the degree distributions. Table 1 shows that RelDiff with the original structure consistently achieves or matches the best performance across all datasets and metrics. Even when employing generated structures, RelDiff outperforms all baselines in most experiments, securing second-best performance in all but three. Notably, RelDiff exhibits a significantly smaller performance drop when transitioning from single-table (Table 6) to multi-table C2ST evaluation. Figure 3 illustrates that this degradation is 7× lower than that of the closest competitor, ClavaDDPM, highlighting RelDiff's superior ability to maintain data fidelity in relational contexts. Additional single-table fidelity and privacy results are in Appendices D.1 and D.2.
Next, we evaluate the results on the ClavaDDPM (Pang et al., 2024) benchmark. Following the
original evaluation protocol, we report single-table metrics (Trend and Shape) and multi-table metrics
(cardinality and k-hop similarity). We omit the MovieLens and Berka datasets as they are already
included in the SyntheRela benchmark. Results in Table 2 show that RelDiff is the best on all multi-table fidelity metrics, with an average improvement of 25.3% in preserving k-hop correlations. RelDiff outperforms other methods on all but two single-table evaluations.

5.3 Performance on Downstream Tasks

High-quality synthetic data offers the key advantage of replacing real data for analysis and effective
learning on downstream tasks like classification and regression. We evaluate this capacity using
Machine Learning Efficiency (ML-E) on RDL tasks (Robinson et al., 2024).
According to the RDL results presented in Table 3, RelDiff achieves the best performance on four out of five datasets and the second best on the remaining one. This demonstrates our method's competitive capacity to capture and replicate key features of the real data that are most relevant to learning downstream machine learning tasks. We observe that methods with lower performance on data fidelity sometimes outperform stronger methods on utility, highlighting that fidelity and utility are two distinct aspects of synthetic data quality (Hansen et al., 2023). Despite this nuance, RelDiff consistently achieves strong performance across both aspects.

Table 2: End-to-end results on the ClavaDDPM benchmark. We follow the evaluation protocol by Pang et al. (2024) and report the cardinality similarity, column shapes, trend scores, and correlations between columns in connected tables. DNC denotes Did Not Converge.

| Dataset | Metric | PrivLava | SDV | ClavaDDPM | RelDiff | Improv. | RelDiffORIG |
|---|---|---|---|---|---|---|---|
| California | Cardinality | 99.90±0.03 | 71.45±0.0 | 99.19±0.29 | 100.0 | 0.10 | 100.0 |
| | Shape | 99.71±0.02 | 72.32±0.0 | 98.77±0.02 | 99.52±0.02 | 0.0 | 99.52±0.02 |
| | Trend | 98.49±0.05 | 50.23±0.0 | 97.65±0.05 | 98.72±0.01 | 0.24 | 98.73±0.03 |
| | 1-HOP | 97.46±0.12 | 54.89±0.0 | 95.16±0.39 | 98.72±0.0 | 1.29 | 98.72±0.02 |
| | AVG 2-WAY | 97.97±0.09 | 52.56±0.0 | 96.41±0.20 | 98.72±0.01 | 0.77 | 98.72±0.01 |
| Instacart 05 | Cardinality | DNC | DNC | 95.30±0.79 | 100.0 | 4.93 | 100.0 |
| | Shape | DNC | DNC | 89.84±0.29 | 96.85±0.85 | 7.80 | 95.61±0.81 |
| | Trend | DNC | DNC | 99.62±0.04 | 95.71±0.42 | 0.0 | 94.92±0.75 |
| | 1-HOP | DNC | DNC | 76.42±0.39 | 85.83±1.20 | 12.31 | 89.98±0.06 |
| | 2-HOP | DNC | DNC | 39.29±3.38 | 70.74±0.14 | 80.05 | 94.51±0.17 |
| | AVG 2-WAY | DNC | DNC | 76.02±0.78 | 85.78±0.81 | 12.84 | 92.06±0.26 |
| CCS | Cardinality | DNC | 74.36±8.40 | 99.25±0.16 | 100.0 | 0.76 | 100.0 |
| | Shape | DNC | 69.04±4.38 | 92.37±2.30 | 98.29±0.04 | 6.41 | 98.57±0.09 |
| | Trend | DNC | 94.84±1.0 | 98.47±0.79 | 98.74±0.14 | 0.28 | 98.80±0.27 |
| | 1-HOP | DNC | 21.74±9.62 | 83.15±4.22 | 89.48±4.01 | 7.62 | 87.60±0.40 |
| | AVG 2-WAY | DNC | 41.68±6.73 | 87.33±3.12 | 92.01±2.95 | 5.36 | 90.65±0.22 |

Table 3: RDL-utility results. We report ROC-AUC (higher is better) for classification and MAE (lower is better) for regression tasks. We report the naive baseline scores (mean or majority class) in parentheses. "-" denotes that the utility pipeline could not be used. We highlight the best results for each dataset and report the mean and standard error for each metric.

| Dataset | Metric | ORIG. | TabularARGN | ClavaDDPM | RCTGAN | REALTABF. | SDV | RelDiff | Improv. |
|---|---|---|---|---|---|---|---|---|---|
| Rossmann | MAE (↓) | 156 (324) | 271±10 | 194±2 | 218±2 | 249±12 | 3341±17 | 218±3 | 0.0 |
| Walmart | MAE (↓) | 9531 (14.7k) | 13848±14 | 11426±1052 | 13435±416 | 13862±300 | 13679±87 | 10475±1379 | 8.32 |
| Airbnb | AUC (↑) | 0.69 (0.5) | 0.66±0.01 | 0.51±0.03 | 0.63±0.01 | - | 0.57±0.00 | 0.66±0.01 | 0.0 |
| Berka | AUC (↑) | 0.81 (0.5) | 0.59±0.23 | 0.52±0.16 | - | - | - | 0.84±0.02 | 42.4 |
| F1 | AUC (↑) | 0.77 (0.5) | 0.38±0.06 | 0.45±0.06 | 0.48±0.01 | - | 0.52±0.06 | 0.72±0.01 | 38.5 |

6 Conclusion

In this work, we introduced RelDiff, a novel diffusion-based generative framework designed for synthesizing complete relational databases by explicitly modeling their inherent foreign key graph structure. Our approach uniquely combines a joint graph-conditioned diffusion process for attribute synthesis across all interconnected tables with a 2K+SBM graph generator for structure creation. This principled decomposition ensures both high fidelity in the generated data and strict adherence to referential integrity, addressing key limitations of existing relational data synthesis methods that often flatten the relational structure or impose restrictive assumptions.

Through extensive experiments on 11 benchmark datasets, RelDiff consistently outperforms state-of-the-art methods in generating realistic and coherent synthetic relational databases. Our framework effectively captures complex structural and statistical dependencies, leading to synthetic data that better reflects the intricacies of real-world relational data. This advancement holds significant promise for various downstream applications, including privacy-preserving data sharing, data augmentation for relational learning tasks, and imputation of missing values in complex relational datasets.

Figure 3: Comparing single and multi-table C2ST performance. As opposed to previous methods, our approach incurs only a slight degradation between average multi-table (C2ST-Agg) and single-table (C2ST) detection accuracy, indicated by the relational ∆ = Acc(C2ST-Agg) − Acc(C2ST).
References
Alimohammadi Kaveh, Wang Hao, Gulati Ojas, Srivastava Akash, Azizan Navid. Differentially
Private Synthetic Data Generation for Relational Databases. 2025.
Appenzeller Arno, Leitner Moritz, Philipp Patrick, Krempel Erik, Beyerer Jürgen. Privacy and utility
of private synthetic data for medical data analyses // Applied Sciences. 2022. 12, 23. 12320.
Assefa Samuel A, Dervovic Danial, Mahfouz Mahmoud, Tillman Robert E, Reddy Prashant, Veloso
Manuela. Generating synthetic data in finance: opportunities, challenges and pitfalls // Proceedings
of the First ACM International Conference on AI in Finance. 2020. 1–8.
Berka Petr, others. Guide to the financial data set // PKDD2000 Discovery Challenge. 2000.
Blockeel Hendrik, Džeroski Sašo, Kompare Boris, Kramer Stefan, Pfahringer Bernhard, Laer Wim.
Experiments In Predicting Biodegradability // Applied Artificial Intelligence. 06 1999. 18.
Breugel Boris van, Schaar Mihaela van der. Beyond Privacy: Navigating the Opportunities and
Challenges of Synthetic Data. 2023.
Breugel Boris van, Schaar Mihaela van der. Position: Why Tabular Foundation Models Should Be a
Research Priority // Forty-first International Conference on Machine Learning. 2024.
Cai Kuntai, Xiao Xiaokui, Cormode Graham. PrivLava: Synthesizing Relational Data with Foreign
Keys under Differential Privacy // Proc. ACM Manag. Data. jun 2023. 1, 2.
Cai Kuntai, Xiao Xiaokui, Yang Yin. PrivPetal: Relational Data Synthesis via Permutation Relations.
2025.
Chawla Nitesh V, Bowyer Kevin W, Hall Lawrence O, Kegelmeyer W Philip. SMOTE: synthetic
minority over-sampling technique // Journal of artificial intelligence research. 2002. 16. 321–357.
Codd Edgar F. A relational model of data for large shared data banks // Communications of the
ACM. 1970. 13, 6. 377–387.
DB-Engines . DBMS popularity broken down by database model. 2024.
Dasgupta Anirban, Hopcroft John E, McSherry Frank. Spectral analysis of random graphs with
skewed degree distributions // 45th Annual IEEE Symposium on Foundations of Computer Science.
2004. 602–610.
Delobel Claude. Normalization and hierarchical dependencies in the relational data model // ACM
Transactions on Database Systems (TODS). 1978. 3, 3. 201–222.
F1 . F1 DB. 2021.
Fey Matthias, Hu Weihua, Huang Kexin, Lenssen Jan Eric, Ranjan Rishabh, Robinson Joshua, Ying
Rex, You Jiaxuan, Leskovec Jure. Position: relational deep learning-graph representation learning
on relational databases // Proceedings of the 41st International Conference on Machine Learning.
2024. 13592–13607.
FlorianKnauer, Will Cukierski. Rossmann Store Sales. 2015. Kaggle competition dataset.
Fonseca Joao, Bacao Fernando. Tabular and latent space synthetic data generation: a literature
review // Journal of Big Data. 2023. 10, 1. 115.
Friedman Nir, Getoor Lise, Koller Daphne, Pfeffer Avi. Learning probabilistic relational models //
IJCAI. 99. 1999. 1300–1309.
Garcia-Molina Hector. Database systems: the complete book. 2008.
Ge Chang, Mohapatra Shubhankar, He Xi, Ilyas Ihab F. KAMINO: Constraint-aware differentially
private data synthesis // Proceedings of the VLDB Endowment. 2021. 14, 10. 1886–1899.
Getoor Lise, Friedman Nir, Koller Daphne, Pfeffer Avi. Learning probabilistic relational models //
Relational data mining. 2001. 307–335.

Ghazi Badih, Hu Xiao, Kumar Ravi, Manurangsi Pasin. Differentially private data release over
multiple tables // Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on
Principles of Database Systems. 2023. 207–219.
Gonzales Aldren, Guruswamy Guruprabha, Smith Scott R. Synthetic data in health care: A narrative
review // PLOS Digital Health. 2023. 2, 1. e0000082.
Gueye Mohamed, Attabi Yazid, Dumas Maxime. Row Conditional-TGAN for generating synthetic
relational databases // ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). 2023. 1–5.
Hamilton Will, Ying Zhitao, Leskovec Jure. Inductive representation learning on large graphs //
Advances in neural information processing systems. 2017. 30.
Hansen Lasse, Seedat Nabeel, Schaar Mihaela van der, Petrovic Andrija. Reimagining synthetic
tabular data generation through data-centric AI: A comprehensive benchmark // Advances in neural
information processing systems. 2023. 36. 33781–33823.
Harper F. Maxwell, Konstan Joseph A. The MovieLens Datasets: History and Context // ACM Trans.
Interact. Intell. Syst. dec 2015. 5, 4.
Hernandez Mikel, Epelde Gorka, Alberdi Ane, Cilla Rodrigo, Rankin Debbie. Synthetic data
generation for tabular health records: A systematic review // Neurocomputing. 2022. 493. 28–45.
Holland Paul W, Laskey Kathryn Blackmond, Leinhardt Samuel. Stochastic blockmodels: First steps
// Social networks. 1983. 5, 2. 109–137.
Hudovernik Valter. Relational Data Generation with Graph Neural Networks and Latent Diffusion
Models // NeurIPS 2024 Third Table Representation Learning Workshop. 2024.
Hudovernik Valter, Jurkovič Martin, Štrumbelj Erik. Benchmarking the Fidelity and Utility of
Synthetic Relational Data. 2024.
Jang Yunhui, Lee Seul, Ahn Sungsoo. A Simple and Scalable Representation for Graph Generation //
The Twelfth International Conference on Learning Representations. 2024.
Jurkovič Martin, Hudovernik Valter, Štrumbelj Erik. SyntheRela: A Benchmark For Synthetic
Relational Database Generation // Will Synthetic Data Finally Solve the Data Access Problem?
2025.
Kapenekakis Antheas, Dell’Aglio Daniele, Vesteghem Charles, Poulsen Laurids, Bøgsted Martin,
Garofalakis Minos, Hose Katja. Synthesizing Accurate Relational Data under Differential Privacy
// 2024 IEEE International Conference on Big Data (BigData). 2024. 433–439.
Karras Tero, Aittala Miika, Aila Timo, Laine Samuli. Elucidating the Design Space of Diffusion-Based
Generative Models // Advances in Neural Information Processing Systems. 2022.
Kingma Diederik, Salimans Tim, Poole Ben, Ho Jonathan. Variational Diffusion Models // Advances
in Neural Information Processing Systems. 34. 2021. 21696–21707.
Kotelnikov Akim, Baranchuk Dmitry, Rubachev Ivan, Babenko Artem. Tabddpm: Modelling tabular
data with diffusion models // International Conference on Machine Learning. 2023. 17564–17579.
Li Jiayu, Tay YC. IRG: Generating Synthetic Relational Databases using GANs // arXiv preprint
arXiv:2312.15187. 2023.
Li Mufei, Kreacic Eleonora, Potluru Vamsi K., Li Pan. GraphMaker: Can Diffusion Models Generate
Large Attributed Graphs? // Transactions on Machine Learning Research. 2024.
Liao Renjie, Li Yujia, Song Yang, Wang Shenlong, Hamilton Will, Duvenaud David K, Urtasun Raquel,
Zemel Richard. Efficient graph generation with graph recurrent attention networks // Advances in
neural information processing systems. 2019. 32.
Liu Jenny, Kumar Aviral, Ba Jimmy, Kiros Jamie, Swersky Kevin. Graph normalizing flows //
Advances in Neural Information Processing Systems. 2019. 32.

Minnesota Population Center. Integrated Public Use Microdata Series, International: Version 7.3 [data set]. Minneapolis, MN, 2020. [Link]
Mahadevan Priya, Krioukov Dmitri, Fall Kevin, Vahdat Amin. Systematic topology analysis and
generation using degree correlations // ACM SIGCOMM Computer Communication Review. 2006.
36, 4. 135–146.
Mami Ciro Antonio, Coser Andrea, Boudewijn Alexander Theodorus Petrus, Volpe Marco, Whitworth Michael, Panfilo Daniele, Saccani Sebastiano. Generating Realistic Synthetic Relational Data through Graph Variational Autoencoders // NeurIPS 2022 Workshop on Synthetic Data for Empowering ML Research. 2022.
McCallum Andrew Kachites, Nigam Kamal, Rennie Jason, Seymore Kristie. Automating the construction of internet portals with machine learning // Information Retrieval. 2000. 3. 127–163.
McKenna Ryan, Mullins Brett, Sheldon Daniel, Miklau Gerome. AIM: an adaptive and iterative
mechanism for differentially private synthetic data // Proceedings of the VLDB Endowment. 2022.
15, 11.
Montoya Anna, LizSellier , O’Connell Meghan, Kan Wendy, alokgupta . Airbnb New User Bookings.
2015.
Motl Jan, Schulte Oliver. The CTU prague relational learning repository // arXiv preprint
arXiv:1511.03086. 2015.
Ntoutsi Eirini, Fafalios Pavlos, Gadiraju Ujwal, Iosifidis Vasileios, Nejdl Wolfgang, Vidal Maria-
Esther, Ruggieri Salvatore, Turini Franco, Papadopoulos Symeon, Krasanakis Emmanouil, others .
Bias in data-driven artificial intelligence systems—An introductory survey // Wiley Interdisciplinary
Reviews: Data Mining and Knowledge Discovery. 2020. 10, 3. e1356.
Pang Wei, Shafieinejad Masoumeh, Liu Lucy, Hazlewood Stephanie, He Xi. ClavaDDPM: Multi-
relational Data Synthesis with Cluster-guided Diffusion Models // The Thirty-eighth Annual
Conference on Neural Information Processing Systems. 2024.
Patki Neha, Wedge Roy, Veeramachaneni Kalyan. The Synthetic Data Vault // 2016 IEEE International
Conference on Data Science and Advanced Analytics (DSAA). 2016. 399–410.
Peixoto Tiago P. Hierarchical block structures and high-resolution model selection in large networks
// Physical Review X. 2014. 4, 1. 011047.
Peixoto Tiago P. Nonparametric Bayesian inference of the microcanonical stochastic block model //
Physical Review E. 2017. 95, 1. 012317.
Peixoto Tiago P. Bayesian stochastic blockmodeling // Advances in network clustering and blockmodeling. 2019. 289–332.
Potluru Vamsi K., Borrajo Daniel, Coletta Andrea, Dalmasso Niccolò, El-Laham Yousef, Fons
Elizabeth, Ghassemi Mohsen, Gopalakrishnan Sriram, Gosai Vikesh, Kreačić Eleonora, Mani
Ganapathy, Obitayo Saheed, Paramanand Deepak, Raman Natraj, Solonin Mikhail, Sood Srijan,
Vyetrenko Svitlana, Zhu Haibei, Veloso Manuela, Balch Tucker. Synthetic Data Applications in
Finance. 2024.
Raghunathan Trivellore E. Synthetic data // Annual review of statistics and its application. 2021. 8,
1. 129–140.
Robinson Joshua, Ranjan Rishabh, Hu Weihua, Huang Kexin, Han Jiaqi, Dobles Alejandro, Fey
Matthias, Lenssen Jan E., Yuan Yiwen, Zhang Zecheng, He Xinwei, Leskovec Jure. RelBench: A
Benchmark for Deep Learning on Relational Databases. 2024.
Sahoo Subham Sekhar, Arriola Marianne, Gokaslan Aaron, Marroquin Edgar Mariano, Rush
Alexander M, Schiff Yair, Chiu Justin T, Kuleshov Volodymyr. Simple and Effective Masked
Diffusion Language Models // The Thirty-eighth Annual Conference on Neural Information
Processing Systems. 2024.

Shi Juntong, Xu Minkai, Hua Harper, Zhang Hengrui, Ermon Stefano, Leskovec Jure. TabDiff:
a Mixed-type Diffusion Model for Tabular Data Generation // The Thirteenth International
Conference on Learning Representations. 2025.

Solatorio Aivin V., Dupriez Olivier. REaLTabFormer: Generating Realistic Relational and Tabular
Data using Transformers. 2023.

Stanley Jeremy, Risdal M., Sharathrao, Cukierski W. Instacart Market Basket Analysis. 2017. Kaggle competition dataset.

Tillman Bálint, Markopoulou Athina, Gjoka Minas, Butts Carter T. 2K+ graph construction framework: Targeting joint degree matrix and beyond // IEEE/ACM Transactions on Networking. 2019. 27, 2. 591–606.

Tiwald Paul, Krchova Ivona, Sidorenko Andrey, Vargas-Vieyra Mariana, Scriminaci Mario, Platzer
Michael. TabularARGN: A Flexible and Efficient Auto-Regressive Framework for Generating
High-Fidelity Synthetic Data. 2025.

Vignac Clement, Krawczuk Igor, Siraudin Antoine, Wang Bohan, Cevher Volkan, Frossard Pascal. Di-
Gress: Discrete Denoising diffusion for graph generation // The Eleventh International Conference
on Learning Representations. 2023.

Walmart, Will Cukierski. Walmart Recruiting - Store Sales Forecasting. 2014. Kaggle competition dataset.

Xu Kai, Ganev Georgi, Joubert Emile, Davison Rees, Acker Olivier Van, Robinson Luke. Synthetic
Data Generation of Many-to-Many Datasets via Random Graph Generation // The Eleventh
International Conference on Learning Representations. 2023.

Xu Lei, Skoularidou Maria, Cuesta-Infante Alfredo, Veeramachaneni Kalyan. Modeling tabular data
using conditional gan // Advances in neural information processing systems. 2019. 32.

You Jiaxuan, Ma Xiaobai, Ding Yi, Kochenderfer Mykel J, Leskovec Jure. Handling missing data
with graph representation learning // Advances in Neural Information Processing Systems. 2020.
33. 19075–19087.

You Jiaxuan, Ying Rex, Ren Xiang, Hamilton William, Leskovec Jure. Graphrnn: Generating realistic
graphs with deep auto-regressive models // International conference on machine learning. 2018.
5708–5717.

Zein EL Hacen, Urvoy Tanguy. Tabular Data Generation: Can We Fool XGBoost ? // NeurIPS 2022
First Table Representation Workshop. 2022.

Zhang Hengrui, Fang Liancheng, Wu Qitian, Yu Philip S. DiffPuter: An EM-Driven Diffusion Model for Missing Data Imputation // The Thirteenth International Conference on Learning Representations. 2025.

Zhang Hengrui, Zhang Jiani, Shen Zhengyuan, Srinivasan Balasubramaniam, Qin Xiao, Faloutsos
Christos, Rangwala Huzefa, Karypis George. Mixed-Type Tabular Data Synthesis with Score-based
Diffusion in Latent Space // The Twelfth International Conference on Learning Representations.
2024.

Zhang Jun, Cormode Graham, Procopiuc Cecilia M, Srivastava Divesh, Xiao Xiaokui. Privbayes:
Private data release via bayesian networks // ACM Transactions on Database Systems (TODS).
2017. 42, 4. 1–41.

Zhao Zilong, Kunar Aditya, Birke Robert, Chen Lydia Y. Ctab-gan: Effective table data synthesizing
// Asian Conference on Machine Learning. 2021. 97–112.

A Related Work Overview

Here we present a more detailed overview of synthetic relational database generation approaches. The Synthetic Data Vault (SDV) (Patki et al., 2016) introduced the first learning-based method for generating relational databases. It uses a hierarchical modeling approach based on Gaussian Copulas, incorporating recursive conditional parameter aggregation to preserve relational structure.
Mami et al. (2022) proposed GraphVAE, a graph-based approach leveraging graph variational autoencoders. They represent relational databases as a single homogeneous graph where all rows become nodes. This contrasts with our approach, which uses a heterogeneous graph to better reflect the distinct structure and relationships of the tables. Foreign keys are implicitly handled by establishing edges between primary and secondary table rows within this homogeneous graph representation. The GraphVAE then uses message-passing layers during both encoding and decoding to model inter-table interactions and generate attribute values for synthetic rows.
Building upon the Conditional Tabular GAN (CTGAN) method by Xu et al. (2019), two GAN-based methods have been proposed for relational data synthesis. Row Conditional-TGAN (RCTGAN) (Gueye et al., 2023) extends CTGAN by integrating hierarchical dependencies, enabling the conditional synthesis of child tables based on their parent and grandparent rows. The Incremental Relational Generator (IRG) (Li, Tay, 2023) synthesizes relational databases through a sequential table generation process that follows a topological ordering. It constructs an extended table by integrating context from all relevant previously generated or related tables.
The Transformer-based approach REaLTabFormer (Solatorio, Dupriez, 2023) focuses on single-
parent relational databases. It employs a GPT-2 encoder with a causal language model head to
independently model the parent table. For dependent tables, a sequence-to-sequence (Seq2Seq)
transformer is utilized, leveraging the frozen parent model as context. All attributes are transformed
into a common vocabulary; however, this approach inherits the limitations of language models,
particularly concerning the accurate modeling of numerical data.
Xu et al. (2023) introduced a framework for modeling many-to-many datasets using multipartite
graphs under differential privacy. Their method utilizes a factorization of the joint data distribution,
combining techniques from random graph generation to model structure with graph representation
learning methods to conditionally generate tables based on node embeddings. Our approach builds
upon their work by specifically capturing hierarchical characteristics during graph generation, modeling tables jointly rather than sequentially, and addressing general relational databases beyond
many-to-many relationships.
Diffusion models have also been adapted for relational synthesis. ClavaDDPM (Pang et al., 2024)
integrates clustering-guided diffusion models to preserve foreign-key dependencies, utilizing Gaussian
mixture models to encode inter-table dependencies. Similarly, RGCLD (Hudovernik, 2024) employs
conditional latent diffusion models with a heterogeneous graph representation and GNNs to encode
table relationships, which then guide the diffusion process within the latent space. A key limitation
of both methods is their sequential modeling of tables, which introduces implicit assumptions on
inter-table dependencies and may allow errors to propagate down the hierarchy during generation.
Finally, auto-regressive models have been explored for tabular and sequential relational synthesis.
TabularARGN (Tiwald et al., 2025) employs an any-order auto-regressive network, trained on
discretized attributes, to model conditional dependencies. While it specializes in single-table and
sequential database modeling, TabularARGN also supports multi-parent schemas by preserving
certain dependencies using context tables and maintaining referential integrity for the remaining
relationships.
The marginal-based approaches for synthetic relational database generation primarily focus on
preserving marginal queries, typically with differential privacy guarantees. PrivLava (Cai et al., 2023)
synthesizes relational databases by modeling foreign key relationships as a directed acyclic graph with
latent variables, generating tables incrementally. MARE (Kapenekakis et al., 2024), specializing in
medical relational data, employs correlation partially directed acyclic graphs (CPDAGs) for selective
correlation modeling and orchestrates two-phase data sampling. Alimohammadi et al. (2025) propose
an approach to adapt single-table differentially private generators to relational data by learning
a weighted bi-adjacency matrix to generate the relational structure. Finally, PrivPetal (Cai et al.,

2025) synthesizes a flattened relational database using normalized permutation marginals and then
iteratively decomposes it by sampling attributes of reference relations.
We classify related work based on how it approaches attribute and structure generation in Table 4.

Table 4: Classification of related work based on the mechanisms for attribute and structure modeling
as well as the general synthetic data generation (SDG) classification.
Method | Attribute Modeling | Structure Modeling | SDG Class
SDV (Patki et al., 2016) | conditional tabular | as attributes | statistical
GraphVAE (Mami et al., 2022) | homogeneous graph-based | retains original structure | neural
RCTGAN (Gueye et al., 2023) | conditional tabular | as attributes | neural
IRG (Li, Tay, 2023) | conditional tabular | as attributes | neural
REaLTabFormer (Solatorio, Dupriez, 2023) | conditional tabular | sequential modeling | neural
ClavaDDPM (Pang et al., 2024) | conditional tabular | conditional modeling + matching | neural
RGCLD (Hudovernik, 2024) | conditional tabular | retains original structure | neural
TabularARGN (Tiwald et al., 2025) | conditional tabular | sequential modeling | neural
BayesM2M, NeuralM2M (Xu et al., 2023) | conditional tabular | random graphs (BJDD) | neural & marginal
PrivLava (Cai et al., 2023) | conditional tabular | as attributes | marginal
MARE (Kapenekakis et al., 2024) | conditional tabular | sequential modeling | marginal
DP-Relational (Alimohammadi et al., 2025) | independent tabular | weighted bi-adjacency matrix | marginal
PrivPetal (Cai et al., 2025) | flattened tabular | flattening connected tables | marginal
Ours | heterogeneous graph-based | 2k+SBM random graphs | neural

B Detailed Experiment Setup

This section provides a comprehensive overview of our experimental setup, detailing the datasets
used (Appendix B.1) and the evaluation metrics employed (Appendix B.2).

B.1 Datasets

Here we describe the datasets used in our evaluation. Table 5 provides detailed statistics for each
dataset. The MovieLens and Berka datasets are used in both benchmarks, so we describe them only
once.
Rossmann Store Sales: The Rossmann Store Sales dataset (FlorianKnauer, 2015) features historical
sales data for 1115 stores, organized into two connected tables.
Airbnb: The Airbnb dataset (Montoya et al., 2015) contains anonymized user interactions and
demographics for predicting travel destinations. It comprises multiple tables detailing user sessions
and summary information.
Walmart: The Walmart dataset (Walmart, 2014) contains historical sales data for 45 stores across
three connected tables, including store details, features, and department sales.
Cora: The Cora dataset (McCallum et al., 2000) is a graph benchmark of 2708 academic papers
classified into seven categories, linked by a citation network of 5429 relationships. Unlike the graph
representation learning version with one-hot encoded node features, this relational version stores
paper content in a separate table connected via a foreign key, and citation links are represented in a
dedicated foreign-key-only table.
Biodegradability: The Biodegradability dataset (Blockeel et al., 1999) is a collection of 328 chemical
compounds with biodegradation half-life labels, intended for regression analysis based on chemical
features.
IMDB MovieLens: The IMDB MovieLens dataset (Harper, Konstan, 2015) includes information on
movies, actors, directors, user ratings, and related details across seven tables.
Berka: The Berka dataset (Berka, others, 2000) is a real-world financial dataset focused on loan
outcomes, encompassing loan details and transaction histories across multiple tables. For the
SyntheRela benchmark, this dataset is split temporally to facilitate the evaluation of RDL utility.
F1: The F1 dataset (F1, 2021) contains historical Formula 1 racing data and statistics from 1950
onwards, covering drivers, races, and results across numerous tables.

California: The California dataset is a real-world anonymized census database (M. Center, 2020)
containing household information. It consists of two tables forming a basic parent-child relationship.
Instacart 05: The Instacart 05 dataset is created by downsampling 5 percent of the Kaggle competition
dataset Instacart (Stanley et al., 2017), a real-world transaction dataset of Instacart orders.
This dataset consists of 6 tables in total with a maximum depth of 3.
CCS: The CCS dataset (Motl, Schulte, 2015) is a real-world transactional dataset from a Czech debit
card company. It consists of 5 tables with a maximum depth of 2.

Table 5: Summary of the 11 benchmark datasets. The number of columns represents the number
of non-id columns (The MovieLens and Berka datasets appear in both benchmarks). The collection is
diverse and covers all types of relational structures.
Dataset Name | # Tables | # Rows | # Columns | # Relationships | Max Depth | Hierarchy Type
Rossmann | 2 | 59,085 | 16 | 1 | 2 | Linear
AirBnB | 2 | 57,217 | 20 | 1 | 2 | Linear
Walmart | 3 | 15,317 | 17 | 2 | 2 | Multi Child
Cora | 3 | 57,353 | 2 | 3 | 2 | Multi Child
Biodegradability | 5 | 21,895 | 6 | 5 | 4 | Multi Child & Parent
IMDB MovieLens | 7 | 1,249,411 | 14 | 6 | 2 | Multi Child & Parent
Berka | 8 | 757,722 | 37 | 8 | 4 | Multi Child & Parent
F1 | 9 | 74,063 | 33 | 13 | 3 | Multi Child & Parent
California | 2 | 2,076,141 | 25 | 1 | 2 | Linear
CCS | 5 | 423,050 | 11 | 4 | 2 | Multi Child & Parent
Instacart 05 | 6 | 1,906,353 | 12 | 6 | 3 | Multi Child & Parent
Berka | 8 | 1,079,680 | 41 | 8 | 4 | Multi Child & Parent
MovieLens | 7 | 1,249,411 | 14 | 6 | 2 | Multi Child & Parent

B.2 Metrics

B.2.1 Shape and Trend Scores


Shape and Trend are proposed by SDMetrics¹. They are used to measure the column-wise density
estimation performance and the pair-wise column correlation estimation performance, respectively. Shape
uses the Kolmogorov-Smirnov Test (KST) for numerical columns and the Total Variation Distance (TVD)
for categorical columns to quantify column-wise density estimation. Trend uses the Pearson correlation
for numerical columns and contingency similarity for categorical columns to quantify pair-wise
correlation.
Shape. Kolmogorov-Smirnov Test (KST): Given two (continuous) distributions $p_r(x)$ and $p_s(x)$ ($r$
denotes real and $s$ denotes synthetic), KST quantifies the distance between the two distributions using
the upper bound of the discrepancy between the two corresponding Cumulative Distribution Functions
(CDFs):
\[ \mathrm{KST} = \sup_{x} \lvert F_r(x) - F_s(x) \rvert, \tag{7} \]

where $F_r(x)$ and $F_s(x)$ are the CDFs of $p_r(x)$ and $p_s(x)$, respectively:
\[ F(x) = \int_{-\infty}^{x} p(t)\,dt. \tag{8} \]

Total Variation Distance: TVD is defined as half the sum of the absolute differences between the real
and synthetic probabilities across all categories:
\[ \mathrm{TVD} = \frac{1}{2} \sum_{\omega \in \Omega} \lvert R(\omega) - S(\omega) \rvert, \tag{9} \]

where $\omega$ ranges over all possible categories $\Omega$ of a column, and $R(\cdot)$ and $S(\cdot)$ denote the real
and synthetic frequencies of these categories. To comply with previous work on relational data synthesis,
we report the complement of the KST and TVD distances as $1.0 - D_{\mathrm{KST/TVD}}(P \,\|\, Q)$.
¹ [Link]
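For concreteness, the following is a minimal sketch of how a single column's Shape score could be computed with pandas and SciPy; the function name and the dtype-based dispatch are our own illustrative choices, not the SDMetrics implementation.

```python
import pandas as pd
from scipy.stats import ks_2samp

def shape_score(real: pd.Series, synth: pd.Series) -> float:
    """Complement of KST (numerical) or TVD (categorical) for one column."""
    if pd.api.types.is_numeric_dtype(real):
        # Kolmogorov-Smirnov statistic: sup_x |F_r(x) - F_s(x)|
        distance = ks_2samp(real.dropna(), synth.dropna()).statistic
    else:
        # Total Variation Distance over the union of observed categories
        r = real.value_counts(normalize=True)
        s = synth.value_counts(normalize=True)
        cats = r.index.union(s.index)
        distance = 0.5 * sum(abs(r.get(c, 0.0) - s.get(c, 0.0)) for c in cats)
    return 1.0 - distance
```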

Trend. Pearson correlation coefficient: The Pearson correlation coefficient measures whether two
continuous distributions are linearly correlated and is computed as:
\[ \rho_{x,y} = \frac{\mathrm{Cov}(x, y)}{\sigma_x \sigma_y}, \tag{10} \]
where $x$ and $y$ are two continuous columns, $\mathrm{Cov}$ is the covariance, and $\sigma$ is the standard deviation.
Then, the performance of correlation estimation is measured by the average differences between the
real data correlations and the synthetic data correlations:
\[ \text{Pearson Score} = 1.0 - \frac{1}{2}\,\mathbb{E}_{x,y} \lvert \rho^R(x, y) - \rho^S(x, y) \rvert, \tag{11} \]
where $\rho^R(x, y)$ and $\rho^S(x, y)$ denote the Pearson correlation coefficient between column $x$ and
column $y$ of the real data and synthetic data, respectively. As $\rho \in [-1, 1]$, the average distance of
Pearson coefficients is divided by 2, to ensure that it falls in the range $[0, 1]$, and subtracted from 1
such that the larger the score, the better the estimation.
Contingency similarity: For a pair of categorical columns $A$ and $B$, the contingency similarity score
computes the difference between the contingency tables using the Total Variation Distance:
\[ \text{Contingency Score} = 1.0 - \frac{1}{2} \sum_{\alpha \in A} \sum_{\beta \in B} \lvert R_{\alpha,\beta} - S_{\alpha,\beta} \rvert, \tag{12} \]
where $\alpha$ and $\beta$ range over all possible categories in column $A$ and column $B$, respectively. $R_{\alpha,\beta}$
and $S_{\alpha,\beta}$ are the joint frequencies of $\alpha$ and $\beta$ in the real data and synthetic data, respectively. The
distance is again subtracted from 1.
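As an illustration, both Trend components can be sketched for a single column pair as follows, assuming pandas; the helper names are hypothetical, and missing category pairs are treated as having zero frequency, per Equation (12).

```python
import pandas as pd

def pearson_score(real: pd.DataFrame, synth: pd.DataFrame, x: str, y: str) -> float:
    """1 - |rho_real - rho_synth| / 2 for a pair of numerical columns."""
    rho_r = real[x].corr(real[y])   # Pearson correlation by default
    rho_s = synth[x].corr(synth[y])
    return 1.0 - abs(rho_r - rho_s) / 2.0

def contingency_score(real: pd.DataFrame, synth: pd.DataFrame, a: str, b: str) -> float:
    """1 - TVD between the normalized contingency tables of columns a and b."""
    r = pd.crosstab(real[a], real[b], normalize=True)
    s = pd.crosstab(synth[a], synth[b], normalize=True)
    r, s = r.align(s, fill_value=0.0)  # union of category pairs, 0 where absent
    return 1.0 - 0.5 * (r - s).abs().to_numpy().sum()
```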

B.2.2 K-hop Trend


The k-hop Trend metric, proposed by Pang et al. (2024), extends the Trend metric to evaluate the
preservation of correlations across multiple tables connected by foreign keys. The 0-hop Trend is
equivalent to the standard Trend metric, measuring pairwise column correlations within a single table.
For k > 0, the k-hop Trend assesses correlations across tables reachable within k foreign key hops.
This is achieved through a series of join operations:

• 1-hop: Refers to the correlation between one column in a table and another column in a
table that is directly linked to it by a foreign key (either its "parent" table that it references,
or its "child" table that references it). To calculate this, we join the two related tables based
on the foreign key and then compute the Trend metric on the resulting joined table.
• (k > 1)-hop: Extends the process iteratively. Based on a foreign key sequence of length
k, the tables are recursively joined. The Trend metric is then computed on the final joined
table.

As in the Trend metric, the Pearson correlation coefficient and contingency similarity are used to
quantify the differences in joint distributions. The final k-hop Trend score is the average of these
correlation/similarity scores across all relevant k-hop relationships within the database schema.
Similar to the 0-hop Trend, a higher score indicates a better preservation of inter-table correlations up
to k hops.
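To make the join procedure concrete, below is a minimal sketch of the 1-hop case, assuming pandas DataFrames and hypothetical fk/pk column names; the Trend metric from Appendix B.2.1 is then computed on cross-table column pairs of the joined result, and the k > 1 case applies this join recursively along a foreign-key path.

```python
import pandas as pd

def one_hop_table(child: pd.DataFrame, parent: pd.DataFrame,
                  fk: str, pk: str) -> pd.DataFrame:
    """Join a child table to its parent along one foreign key (1 hop)."""
    # Inner join on the foreign key; suffixes disambiguate shared column names
    return child.merge(parent, left_on=fk, right_on=pk,
                       suffixes=("_child", "_parent"))
```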

B.2.3 Cardinality Similarity


Given a parent table $P$ with a primary key $pk$ and a child table $C$ with a foreign key $fk$ referencing
$pk$, this metric evaluates the similarity of the cardinality distribution between the real and synthetic
datasets. The cardinality of a parent row $p \in P$ is defined as the number of child rows $c \in C$ for
which $c.fk = p.pk$.
Let $\mathrm{card}_R(p)$ and $\mathrm{card}_S(p)$ denote the cardinality of a parent row $p$ in the real and synthetic
datasets, respectively. This yields two numerical distributions: $D_R = \{\mathrm{card}_R(p) \mid p \in P_{\mathrm{real}}\}$ and
$D_S = \{\mathrm{card}_S(p) \mid p \in P_{\mathrm{synthetic}}\}$.

The cardinality similarity score is computed as the complement of the Kolmogorov-Smirnov statistic,
defined as $1.0 - \mathrm{KST}(D_R, D_S)$, where KST refers to the Kolmogorov-Smirnov Test statistic as
defined in Appendix B.2.1. Cardinality similarity ranges from 0.0 to 1.0, where a score of 1.0
indicates identical cardinality distributions in the real and synthetic data, and a score of 0.0 indicates
maximally different distributions.
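A minimal sketch of this computation, assuming pandas and SciPy, is given below; including parents with no children as cardinality 0 is our own assumption about the counting convention.

```python
import pandas as pd
from scipy.stats import ks_2samp

def cardinality_similarity(real_child: pd.DataFrame, synth_child: pd.DataFrame,
                           real_parent_ids, synth_parent_ids, fk: str) -> float:
    """1 - KS statistic between real and synthetic child-count distributions."""
    # Count child rows per parent key; parents absent from the child table get 0
    card_r = real_child[fk].value_counts().reindex(real_parent_ids, fill_value=0)
    card_s = synth_child[fk].value_counts().reindex(synth_parent_ids, fill_value=0)
    return 1.0 - ks_2samp(card_r, card_s).statistic
```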

B.2.4 C2ST
The Classifier Two-Sample Test (C2ST) evaluates the fidelity of synthetic data by training a
discriminator (in our case, an XGBoost model) to distinguish it from real data. This detection-based approach,
rooted in two-sample testing, uses the classifier’s performance as a proxy for distributional similarity.
If the discriminator achieves better-than-random accuracy, it indicates discernible differences between
the real and synthetic datasets, suggesting lower fidelity. C2ST offers a comprehensive assessment
of single table fidelity that captures complex higher-order dependencies between features beyond
simple correlations, as highlighted in Zein, Urvoy (2022).
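A minimal sketch of this detection test, assuming scikit-learn and XGBoost, is shown below; the one-hot encoding and classifier settings are illustrative simplifications rather than the exact benchmark configuration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def c2st_accuracy(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    """Cross-validated accuracy of a real-vs-synthetic discriminator.
    Accuracy near 0.5 means the two datasets are indistinguishable."""
    X = pd.get_dummies(pd.concat([real, synth], ignore_index=True)).astype(float)
    y = np.r_[np.ones(len(real)), np.zeros(len(synth))]  # 1 = real, 0 = synthetic
    clf = XGBClassifier(n_estimators=100, eval_metric="logloss")
    return cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
```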

B.2.5 C2ST-Agg
The Classifier Two-Sample Test with Aggregations (C2ST-Agg) (Jurkovič et al., 2025) extends
detection-based fidelity evaluation to relational data by capturing inter-table relationships. C2ST-Agg
functions by augmenting parent tables with aggregated features derived from their connected child
tables. This propositionalization approach, drawing inspiration from relational reasoning techniques,
enables the C2ST to evaluate the preservation of interactions between columns in connected tables,
as well as relationship cardinalities. Conceptually, C2ST-Agg assesses how well fundamental SQL
operations such as join and groupby are maintained alongside complex interactions both within
and between tables. By summarizing child-table information using aggregation functions (e.g.,
mean, count, max), C2ST-Agg effectively accounts for both relationship cardinality and high-level
interactions across related tables, offering a comprehensive assessment of relational data fidelity.
In our evaluation, we use the aggregation functions used by the original authors: mean for numerical
attributes, count of connected rows, and count distinct for categorical variables. We use an XGBoost
model with k=5-fold cross-validation as the discriminative model.
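The feature-augmentation step can be sketched for one parent-child pair as follows, assuming pandas; the dtype-based column split and the feature names are our own simplifications.

```python
import pandas as pd

def augment_with_child_aggregates(parent: pd.DataFrame, child: pd.DataFrame,
                                  pk: str, fk: str) -> pd.DataFrame:
    """Attach per-parent aggregates of a child table before running C2ST."""
    num_cols = child.select_dtypes("number").columns.drop(fk, errors="ignore")
    cat_cols = child.select_dtypes(exclude="number").columns
    g = child.groupby(fk)
    feats = pd.DataFrame({"child_count": g.size()})       # relationship cardinality
    for c in num_cols:
        feats[f"{c}_mean"] = g[c].mean()                  # mean for numerical
    for c in cat_cols:
        feats[f"{c}_nunique"] = g[c].nunique()            # count distinct for categorical
    out = parent.merge(feats, left_on=pk, right_index=True, how="left")
    return out.fillna({c: 0 for c in feats.columns})      # childless parents
```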

B.2.6 RDL Utility


The relational deep learning utility (RDL utility) (Jurkovič et al., 2025) metric evaluates the capacity
of synthetic relational data to support effective learning on downstream tasks. This metric adheres to
the RelBench (Robinson et al., 2024) framework, which transforms relational databases into temporal
heterogeneous graphs with predefined tasks specifically designed for GNN training.
To assess utility, the RDL utility metric employs the RelBench GNN pipeline. A heterogeneous
variant of the GraphSage model (Hamilton et al., 2017) is trained on both the real and synthetic data.
These trained models are subsequently evaluated on a dedicated test set composed entirely of real
data.
Data splitting is handled via a time-based splitting strategy. Our evaluation incorporates five datasets
from the SyntheRela benchmark that possess a temporal feature, with the following predictive tasks:

• Rossmann: Prediction of daily number of customers for each store and date.
• Walmart: Prediction of weekly sales for each department within each store.
• Airbnb: Binary prediction of whether a user has previously made a booking.
• Berka: Prediction of the binary loan status (successful or unsuccessful).
• F1: Prediction of whether a driver will qualify in the top-3 for a race in the next month.

All datasets and models utilize the default RelBench hyperparameters.

B.2.7 Distance to Closest Record


The distance to closest record (DCR) (Zhao et al., 2021) evaluates privacy protection by measuring
how closely synthetic data resembles the training data. The DCR score (Zhang et al., 2024) quantifies

the fraction of synthetic records whose nearest neighbor is in the training set; a value near 0.5 suggests
the model samples from the true distribution rather than overfitting.
Nearest neighbor distances are calculated using the $\ell_2$ norm of synthetic, training, and holdout records.
For a synthetic record $i$, distances to its nearest neighbors in the training ($N_{trn}$) and holdout ($N_{hold}$)
datasets are:
\[ d(i)_{trn} = \min_{j \in N_{trn}} \lVert syn_i - trn_j \rVert_2, \qquad d(i)_{hold} = \min_{j \in N_{hold}} \lVert syn_i - hold_j \rVert_2 . \]

An indicator function, $I(i)_{trn}$, determines if the nearest neighbor of synthetic record $i$ is in the training
set:
\[ I(i)_{trn} = \begin{cases} 1 & \text{if } d(i)_{trn} < d(i)_{hold}, \\ 0 & \text{if } d(i)_{trn} > d(i)_{hold}, \\ 0.5 & \text{if } d(i)_{trn} = d(i)_{hold}. \end{cases} \]
The DCR score is then computed as:
\[ \text{DCR score} = \frac{1}{N_{syn}} \sum_{i=1}^{N_{syn}} I(i)_{trn}. \]
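A minimal sketch of this computation, assuming scikit-learn nearest-neighbor queries over already-encoded numerical records, is given below.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dcr_score(syn: np.ndarray, trn: np.ndarray, hold: np.ndarray) -> float:
    """Fraction of synthetic rows whose nearest neighbor is a training row."""
    d_trn = NearestNeighbors(n_neighbors=1).fit(trn).kneighbors(syn)[0].ravel()
    d_hold = NearestNeighbors(n_neighbors=1).fit(hold).kneighbors(syn)[0].ravel()
    # 1 if closer to train, 0 if closer to holdout, 0.5 on ties
    ind = np.where(d_trn < d_hold, 1.0, np.where(d_trn > d_hold, 0.0, 0.5))
    return float(ind.mean())
```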

C Implementation Details
We implemented R EL D IFF in PyTorch. We performed our experiments on two Nvidia H100 GPUs
with 80G memory.
Data preprocessing. Raw relational datasets often contain missing values. Our initial preprocessing
step involves imputing these, following approaches in (Pang et al., 2024; Shi et al., 2025): numerical
missing values are replaced by the column average, and categorical missing values are treated as a
distinct new category. For the SyntheRela benchmark, we adopt a more nuanced approach for missing
values, similar to (Patki et al., 2016; Hudovernik, 2024). We introduce an additional binary indicator
variable for each attribute to explicitly denote whether a value was originally missing; this indicator
is modeled as a separate categorical variable, allowing us to recover the original missingness pattern
after sampling. To mitigate training instability caused by the diverse ranges of numerical columns,
we transform the numerical values with the QuantileTransformer² and recover the original values
after sampling.
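For illustration, the numerical branch of this preprocessing could look as follows, assuming scikit-learn; the function is a hypothetical sketch and omits the categorical case.

```python
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

def preprocess_numeric(col: pd.Series):
    """Impute, record missingness, and quantile-transform one numerical column."""
    missing = col.isna().astype(int)  # indicator, modeled as a categorical variable
    filled = col.fillna(col.mean()).to_numpy().reshape(-1, 1)
    qt = QuantileTransformer(output_distribution="normal",
                             n_quantiles=min(1000, len(col)))
    z = qt.fit_transform(filled)      # invert after sampling via qt.inverse_transform
    return z, missing, qt
```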
Hyperparameters Setting. R EL D IFF employs a consistent hyperparameter setting across all datasets,
with the sole exception of the number of epochs and the batch size, which are mainly dependent on
the graph structure. We train our models for 10,000 epochs on most datasets. For larger network
datasets, specifically Instacart 05 and MovieLens, we utilize a reduced number of epochs (400 and
4000, respectively) to manage computational load. We use the AdamW optimizer with learning rate
γ = 6e-4 and weight decay w = 1e-5 in all experiments.
Regarding the specific hyperparameters within R EL D IFF, the values for σmin and σmax are set to
0.002 and 80.0, respectively, referencing the optimal setting in (Karras et al., 2022). The parameter
δ is set to 1e-3. For the loss weightings, we fix λcat to 1.0 and linearly decay λnum from 1.0 to
0.0 as training proceeds. In all our experiments, the number of GNN layers is set to k = 2. During
inference, we select the checkpoint with the lowest training loss and utilize 100 discretization steps
(T = 100) during sampling.
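For reference, these settings can be collected in a single (hypothetical) configuration dictionary:

```python
CONFIG = {
    "epochs": 10_000,          # 400 for Instacart 05, 4000 for MovieLens
    "optimizer": "AdamW",
    "lr": 6e-4,
    "weight_decay": 1e-5,
    "sigma_min": 0.002,        # following Karras et al. (2022)
    "sigma_max": 80.0,
    "delta": 1e-3,
    "lambda_cat": 1.0,         # fixed
    "lambda_num": (1.0, 0.0),  # linearly decayed over training
    "gnn_layers": 2,           # k
    "sampling_steps": 100,     # T
}
```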
Model Architecture Our model is parameterized by a heterogeneous graph neural network, featuring
transformer encoders and decoders and an MLP backbone. Each column is initially projected into
a d-dimensional vector using a linear layer, with d = 4, matching the size used in Zhang et al.
(2024) and Shi et al. (2025). These tokenized columns are then processed by a two-layer transformer.
Subsequently, the concatenated columns are projected to a dimension of dimh = 128, to which noise
embeddings of the same dimensionality are added, consistent with the approach of Pang et al. (2024).
The embeddings are then processed by a heterogeneous variant of the GraphSAGE network (Hamilton
et al., 2017), which is used as the RDL baseline in Fey et al. (2024). For databases containing records
² [Link]

at fixed time intervals, and given our use of a permutation-invariant GNN, positional encodings are
added to the embeddings before message passing to preserve record order. The GNN embeddings
for each table are then further processed by five-layer MLPs, conditioned on a time embedding. The
size of these MLPs for each table is comparable to those used in experiments by Kotelnikov et al.
(2023). Finally, the hidden representation is decoded back into the data space by another two-layer
transformer. It is worth noting that the MLP backbone accounts for the majority of the model’s
parameters, and the memory consumed by these parameters is typically less than that used by the
intermediate data representations, especially since relational databases often contain larger tables
than typical tabular datasets.
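A heavily simplified sketch of the message-passing core, assuming PyTorch Geometric, is given below; it omits the transformer tokenizers and decoders, the noise and positional embeddings, and the time-conditioned MLP backbone described above, and all names are illustrative.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import HeteroConv, SAGEConv

class HeteroDenoiserCore(nn.Module):
    """k = 2 rounds of heterogeneous GraphSAGE over per-row embeddings,
    followed by a small per-table head (stand-in for the MLP backbone)."""
    def __init__(self, edge_types, table_names, d_h: int = 128):
        super().__init__()
        self.convs = nn.ModuleList([
            HeteroConv({et: SAGEConv((-1, -1), d_h) for et in edge_types},
                       aggr="sum")
            for _ in range(2)
        ])
        self.heads = nn.ModuleDict({
            t: nn.Sequential(nn.Linear(d_h, d_h), nn.ReLU(), nn.Linear(d_h, d_h))
            for t in table_names
        })

    def forward(self, x_dict, edge_index_dict):
        # x_dict maps each table name to its [n_rows, d_h] noisy row embeddings
        for conv in self.convs:
            x_dict = {k: torch.relu(v)
                      for k, v in conv(x_dict, edge_index_dict).items()}
        return {t: self.heads[t](h) for t, h in x_dict.items()}
```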

D Additional Experiments
D.1 Single Table Fidelity Results

In this section, we present detailed single-table fidelity results on the SyntheRela benchmark. We
evaluate the Shape, Trend, and C2ST scores. As shown in Table 6, R EL D IFF consistently demonstrates
the strongest performance. When employing the original structure, R EL D IFF is only outperformed on
the IMDB and Cora datasets, indicating a trade-off where some single-table fidelity is exchanged for
superior multi-table fidelity. This behavior on the IMDB dataset can be attributed to structural motifs
that may cause bottlenecks in GNN message passing. Notably, our approach can achieve near-perfect
performance on Cora if its schema is normalized to third normal form.

Table 6: Single-table results. For each dataset and metric we report the average detection accuracy
(C2ST - lower is better), column shapes and column pair trends (Shape, Trend - higher is better)
across all tables for three independent samples. DNC denotes Did Not Converge and "-" denotes a
method is unable to generate the dataset. The best result (excluding R EL D IFF ORIG ) is bolded. We
report the percentage improvement of R EL D IFF over the state-of-the-art in blue.
Dataset | Metric | TabARGN | ClavaDDPM | RCTGAN | REALTABF. | SDV | RelDiff | Improv. | RelDiffORIG
Airbnb | C2ST (↓) | 64.23±0.20 | 78.10±0.03 | 88.37±0.14 | 83.97±4.36 | 99.75±5e-3 | 54.11±0.34 | 15.76 | 54.11±0.34
Airbnb | Shape (↑) | 95.70±0.05 | 94.42±0.01 | 89.18±0.17 | 71.66±0.92 | 59.37±0.04 | 98.14±0.07 | 2.55 | 98.14±0.07
Airbnb | Trend (↑) | 93.48±0.33 | 87.78±0.12 | 79.37±0.29 | 53.90±1.26 | 49.03±0.08 | 95.76±0.25 | 2.43 | 95.76±0.25
Rossmann | C2ST (↓) | 56.07±0.58 | 66.77±0.14 | 88.02±0.50 | 74.70±0.55 | 96.90±0.21 | 52.46±0.32 | 6.44 | 52.46±0.32
Rossmann | Shape (↑) | 96.96±0.19 | 94.05±0.07 | 91.31±0.04 | 90.65±0.38 | 81.05±0.19 | 98.04±0.07 | 1.12 | 98.04±0.07
Rossmann | Trend (↑) | 91.34±0.08 | 84.78±0.80 | 84.38±0.40 | 84.58±0.88 | 67.77±0.25 | 95.93±0.65 | 5.02 | 95.93±0.65
Walmart | C2ST (↓) | 83.54±0.84 | 53.50±1.95 | 76.40±0.55 | 70.87±1.07 | 87.02±0.81 | 60.30±0.50 | 0.0 | 60.30±0.50
Walmart | Shape (↑) | 89.09±0.33 | 92.21±0.52 | 82.31±0.51 | 81.71±0.40 | 81.80±0.10 | 94.04±0.53 | 1.99 | 94.04±0.53
Walmart | Trend (↑) | 83.89±0.21 | 94.02±0.14 | 86.60±0.25 | 83.10±0.46 | 87.61±0.23 | 95.42±0.45 | 1.49 | 95.42±0.45
Berka | C2ST (↓) | 72.31±0.17 | 54.48±0.11 | 68.12±0.44 | - | 82.40±0.33 | 50.23±0.05 | 7.80 | 49.10±0.25
Berka | Shape (↑) | 82.20±0.26 | 91.62±0.10 | 81.90±0.38 | - | 56.27±0.29 | 97.72±0.03 | 6.65 | 98.34±0.34
Berka | Trend (↑) | 70.43±0.25 | 88.54±1.19 | 74.22±0.28 | - | 64.01±0.11 | 98.81±0.02 | 11.59 | 98.82±0.41
F1 | C2ST (↓) | 81.93±0.49 | 71.42±0.46 | 80.67±0.31 | - | 89.84±0.22 | 63.50±0.03 | 11.09 | 62.21±0.03
F1 | Shape (↑) | 84.71±1.15 | 84.63±0.28 | 89.68±0.40 | - | 52.62±0.57 | 94.89±0.08 | 5.81 | 97.62±0.09
F1 | Trend (↑) | 81.31±0.45 | 84.65±0.05 | 90.17±0.03 | - | 73.05±0.19 | 95.10±0.12 | 5.46 | 97.10±0.19
IMDB | C2ST (↓) | 50.92±0.22 | 49.83±0.07 | 55.38±0.11 | - | DNC | 52.07±0.36 | 0.0 | 51.88±0.06
IMDB | Shape (↑) | 98.40±0.14 | 99.01±0.05 | 92.70±0.09 | - | DNC | 96.91±0.49 | 0.0 | 96.94±0.17
IMDB | Trend (↑) | 97.80±0.13 | 98.66±0.10 | 81.65±0.03 | - | DNC | 93.88±0.87 | 0.0 | 94.15±0.27
Biodeg. | C2ST (↓) | 58.79±0.19 | - | 58.26±0.15 | - | 68.59±0.14 | 48.28±0.20 | 17.13 | 46.73±0.50
Biodeg. | Shape (↑) | 90.85±0.14 | - | 90.91±0.40 | - | 79.46±0.47 | 95.95±0.11 | 5.55 | 96.82±0.07
Biodeg. | Trend (↑) | 74.82±0.35 | - | 85.44±2.19 | - | 97.58±0.50 | 99.38±0.13 | 1.85 | 99.58±0.09
Cora | C2ST (↓) | 50.94±0.37 | - | 48.97±0.14 | - | 75.45±0.16 | 54.03±0.61 | 0.0 | 50.18±0.45
Cora | Shape (↑) | 92.65±0.12 | - | 96.38±0.15 | - | 50.24±0.17 | 87.85±1.22 | 0.0 | 94.33±0.66

D.2 Privacy Sanity Check

We follow related work (Kotelnikov et al., 2023; Zhang et al., 2024; Shi et al., 2025; Pang et al.,
2024) to perform a privacy sanity check against SMOTE (Chawla et al., 2002), an interpolation-based
method that generates new data through convex combinations of real data points. To quantify the
privacy level, we evaluate the distance to closest record (DCR) (Zhao et al., 2021). Specifically,
we compare the DCR distributions of R EL D IFF against SMOTE on two datasets, adhering to
the evaluation protocol of Pang et al. (2024): California, a real-world census dataset containing
anonymized household and individual information, and a subset of tables from the Berka dataset,
which holds anonymized financial information from a Czech bank. The results of the DCR score
(Zhang et al., 2024) are presented in Table 7.

Table 7: DCR score: the probability that a synthetic example is closer to the training set than to the
holdout set (%; a score closer to 50% is better). Household and Individual are tables of the California
dataset; Transaction and Order are tables of the Berka dataset.
Method | Household | Individual | Transaction | Order
SMOTE | 77.22±0.0 | 76.25±0.0 | 99.94±0.0 | 99.40±0.0
RelDiff | 50.54±0.0 | 50.45±0.0 | 50.71±0.0 | 52.38±0.0

R EL D IFF consistently achieves DCR scores around 50%. This outcome is indicative of the model’s
ability to sample from the underlying data distribution rather than memorizing the training data.


[Figure 4: four panels, (a) Household, (b) Individual, (c) Transaction, (d) Order, showing log-frequency histograms of DCR values for RelDiff Holdout, RelDiff Train, and SMOTE Train.]
Figure 4: DCR distributions on the California and Berka datasets (log-transformed y-axis). R EL D IFF
exhibits DCR values for the training set that are significantly higher than SMOTE, indicating enhanced
privacy protection. The distribution of DCR values for the held-out data remains consistent with that
of the training data.

E Broader impacts
This research introduces a novel method for generating synthetic relational databases, which presents
potential benefits in fields with privacy restrictions, such as healthcare, finance, and education, and in
scenarios involving limited or biased data. However, there are potential negative impacts to consider.
While our empirical privacy analysis does not raise immediate concerns, our method is not equipped
with provable privacy guarantees like differential privacy. Additionally, due to the method’s ability
to generate datasets that closely resemble the original data, it might inadvertently amplify biases
already present in the original data. Furthermore, synthetic data that closely mirrors real data could
be misused. Consequently, we believe that future work should prioritize research directions focused
on enhancing privacy protection and developing effective bias reduction techniques for synthetic
relational data.
