
Auto-BI: Automatically Build BI-Models

Leveraging Local Join Prediction and Global Schema Graph


Yiming Lin∗, University of California, Irvine
Yeye He, Microsoft Research
Surajit Chaudhuri, Microsoft Research

∗Work done at Microsoft.

ABSTRACT

Business Intelligence (BI) is crucial in modern enterprises and a billion-dollar business. Traditionally, technical experts like database administrators would manually prepare BI-models (e.g., in star or snowflake schemas) that join tables in data warehouses, before less-technical business users can run analytics using end-user dashboarding tools. However, the popularity of self-service BI (e.g., Tableau and Power-BI) in recent years creates a strong demand for less-technical end-users to build BI-models themselves.

We develop an Auto-BI system that can accurately predict BI models given a set of input tables, using a principled graph-based optimization problem we propose called k-Min-Cost-Arborescence (k-MCA), which holistically considers both local join prediction and global schema-graph structures, leveraging a graph-theoretical structure called arborescence. While we prove k-MCA is intractable and inapproximable in general, we develop novel algorithms that solve k-MCA optimally, which are shown to be efficient in practice with sub-second latency, and scale to the largest BI-models we encounter (with close to 100 tables).

Auto-BI is rigorously evaluated on a unique dataset with over 100K real BI models we harvested, as well as on 4 popular TPC benchmarks. It is shown to be both efficient and accurate, achieving over 0.9 F1-score on both real and synthetic benchmarks.

PVLDB Reference Format:
Yiming Lin, Yeye He, and Surajit Chaudhuri. Auto-BI: Automatically Build BI-Models Leveraging Local Join Prediction and Global Schema Graph. PVLDB, 16(10): XXX-XXX, 2023. doi:XX.XX/XXX.XX

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 16, No. 10 ISSN 2150-8097. doi:XX.XX/XXX.XX

1 INTRODUCTION

Business Intelligence (BI) is increasingly important in modern enterprises for data-driven decision making, and has grown into a multi-billion dollar business [24]. In traditional BI settings, database administrators (DBAs) typically need to manually prepare BI-models (table schemas and join relationships) in data warehouses, so that less-technical business users can perform ad-hoc analysis using tools like dashboards [15].

In recent years, in a growing trend called "self-service BI" [25] that is popularized by vendors like Tableau [10] and Power-BI [8], less-technical business users are increasingly expected to set up and perform BI analysis themselves, without relying on DBAs or central IT. The goal is to democratize BI, so that business users can make agile data-driven decisions themselves, without depending on technical users.

Building BI models: still a pain point. At a high level, there are two main steps in a BI project: (1) building BI models, and (2) performing ad-hoc analysis by querying against BI models. While querying BI-models was made simple by vendors like Tableau and Power-BI (through intuitive user interfaces and dashboards) [27, 38], the first step of building "BI-models", a prerequisite for ad-hoc analysis, remains a key pain point for non-technical users.

In the context of self-service BI tools like Tableau and Power-BI, "BI-modeling" refers to the process of preparing raw data and establishing relationships between tables, where a central task is to establish join relationships for a given set of input tables.¹ This closely relates to foreign-key (FK) detection [17, 30, 58] but works specifically in the context of BI, where the resulting schema graphs from the modeling step frequently correspond to structures known as star-schema and snowflake-schema studied in data warehouses [15], as shown in Figure 1.

¹We note that more advanced BI modeling can involve additional steps such as schema redesign and performance optimization, which is usually out of scope for non-technical users in self-service BI tools, and thus not considered in this work.

While self-service BI tools also attempt to improve the usability of the BI-modeling step through better GUIs (e.g., allowing users to specify join columns using drag-and-drop) [9, 11], building BI models remains a key pain point. This is because when faced with a large number of tables, even experienced technical users like DBAs can find the task of identifying all possible join relationships challenging and time-consuming. For less-technical enterprise users who are not familiar with concepts like fact/dimension tables, building BI models from scratch can be a daunting challenge.

Foreign-key detection: not yet sufficient. It would clearly be useful if the join relationships in BI models could be automatically predicted on given input tables (without requiring users to specify them manually). Since joins in BI models are often primary-key (PK) foreign-key (FK) joins, existing FK detection algorithms [17, 30, 58] would seem to apply.

To study this systematically, we harvested over 100K real BI models built using self-service BI tools, from public sources like GitHub and search engines. For each BI model file, we programmatically extract the input tables used in the model, as well as the ground-truth join relationships specified by users, thus creating a real BI benchmark for large-scale evaluation for the first time (prior work mostly used synthetic benchmarks for evaluation instead).
[Figure 1 omitted: three example schema diagrams over tables such as Fact_Sales, Fact_Supplies, Customer, Product, Stores, Date, Cust_Segment, Cust_Address, Supplier, and Warehouse]
Figure 1: Types of common BI schemas. (a) Star schema, which can be seen as a simplified version of Snowflake. (b) Snowflake schema, where dimension tables like "Customers" can connect to additional dimension tables. (c) Constellation schema, which can be seen as multiple Snowflake schemas.

Our large-scale evaluation using these real BI datasets suggests that existing FK-detection algorithms are still insufficient for the task, as they frequently produce incorrect results (with ~0.6 precision for all methods we tested). This is because real BI data in the wild tend to be considerably more challenging (e.g., column headers can often be cryptic with generic names like "name" or "id", and column values can often overlap accidentally). Crucially, most existing approaches are "local" in nature: they consider two tables at a time to make greedy local decisions, without a principled global optimization, and are thus unable to produce accurate predictions on challenging real BI test cases (Section 2).

Auto-BI: global optimization using schema graphs. A key insight we develop in Auto-BI is that because we know users are finding join relationships specifically for building BI-models, and we know the typical structure that schema-graphs in BI-models should generally follow (e.g., variants of snowflake), this gives us a unique opportunity to leverage the global graph structure of the resulting schema-graph to predict joins much more accurately.

We thus formulate the Auto-BI problem as a global optimization problem that holistically considers both the local decision of pairwise joinability, as well as the global decision of graph structures. Specifically, we show that the snowflake schema popular in BI-modeling corresponds to a graph-theoretical concept called arborescence [47]. Leveraging this structure, we formulate a new graph problem called k-MCA (k-Minimum-Cost-Arborescence), which finds the most probable k snowflakes, by considering both local joinability and global graph structures, all using a precise probabilistic interpretation based on calibrated probabilities.

We prove that the new k-MCA problem and its variants are in general intractable (NP-hard) and inapproximable (Exp-APX-complete). Nevertheless, we develop novel algorithms based on branch-and-bound principles, which are surprisingly efficient on real test cases and can scale to the largest BI models we encounter, while still being provably optimal. Our extensive evaluations using real BI models and synthetic TPC benchmarks suggest that the proposed Auto-BI is substantially more accurate when compared to existing solutions in the literature.

Contributions. We make the following contributions:
• We are the first to harvest and leverage over 100K real BI models in the wild, for the problem of predicting BI models. This enables a data-driven algorithm design and makes it possible to rigorously evaluate different algorithms on real BI data.
• We are the first to exploit the snowflake-like schema structure in BI settings for our predictions, by mapping snowflakes to a less-known graph-theoretical concept called arborescence. We formulate a set of novel graph-based optimization problems that we call k-MCA, which have precise probabilistic foundations.
• We study the theoretical hardness of k-MCA variants, and propose efficient algorithms that are provably optimal. Extensive evaluations show that our algorithms are both effective (with over 0.9 F1 scores) and efficient (scaling to the largest BI models we encounter and running in a few seconds).

2 RELATED WORKS

BI and dashboarding tools. There are a wide variety of BI and dashboarding tools that aim to help users perform ad-hoc data analysis, with Tableau [10] and Power-BI [8] being the leading vendors [24]. These tools use visual drag-and-drop interfaces [38] (without requiring users to write SQL), and are particularly popular among non-technical users.

Foreign key detection. Foreign key (FK) detection is an important problem in database settings, with many influential methods developed in the literature [17, 30, 48, 58].

MC-FK [58] is a pioneering effort to detect FKs using an EMD-based randomness metric for distribution similarity between two columns, which makes it more reliable at predicting true relationships when unrelated key columns have overlapping value ranges.

Fast-FK [17] develops an efficient method that selects FKs with the best pre-defined scores until all input tables are connected.

HoPF [30] improves upon prior work by considering not only FK-scores but also PK-scores, which are combined using a predefined scoring function, making this also a global method in spirit. The algorithm enumerates all PK/FK combinations and returns the combination that has the highest total score.

ML-FK [48] proposes an ML-based classifier to predict FKs, trained on known FKs. As we will show analytically and experimentally, without principled global optimizations, ML classification alone is still local in nature and not sufficient to achieve high accuracy.

While FK detection methods are clearly related to our problem (we experimentally compare with these methods), there are also a few key distinctions that we would like to highlight.

First, while FK detection targets general database settings, we focus on predicting joins in BI models, which gives us a unique opportunity to exploit the likely graph structure (e.g., snowflake) to make more accurate predictions, a direction not considered in prior work.

Second, unlike FK-detection methods that typically consider the canonical PK/FK (1:N) joins, in Auto-BI the types of joins we consider are more general, because real BI models in the wild frequently employ 1:1 joins as well as joins that are not perfectly 1:N, making the prediction problem more challenging.

Lastly, with the exception of [48], most prior FK detection methods primarily rely on hand-crafted scoring functions to make local join predictions (pairwise between two tables). In comparison, we leverage the harvested BI models to first predict local joinability in a data-driven way, which is then combined into a principled global optimization for optimal global decisions (at the graph level, for all tables). This also makes our technique different from prior work.
Detect inclusion dependency. Inclusion dependency (IND) is closely related to FK, and there is an influential body of work focusing on efficiently enumerating inclusion dependencies (IND) in large databases [12, 33, 34, 39, 43, 50, 51, 53]. The focus on efficiency and scalability in this line of work makes it orthogonal to typical FK-detection methods [17, 30, 58], where the focus is on accurately predicting meaningful FKs (from a large collection of IND candidates). And like prior FK-detection methods that employ efficient IND detection [30, 58], we also use IND-detection as a pre-processing step to enumerate possible FK candidates efficiently.

Complex non-equi joins. Beyond equi-joins, techniques to automatically detect and handle complex join relationships have also been studied, e.g., transformation-based join [42, 55, 60], fuzzy-join [16, 37, 52], search-join [36], semantic lookup-join [29], etc., which is an interesting area of future work in the context of BI.

3 PROBLEM STATEMENT

In this section, we first describe preliminaries and the real BI models we harvest, before introducing the Auto-BI problem.

3.1 Preliminary: BI models

Business Intelligence is closely related to topics like data warehousing and decision support systems, and has been extensively studied. We provide a brief preliminary here and refer readers to surveys like [15, 32, 40] for more information.

Fact and dimension tables. In data warehousing and BI terminologies, tables involved in BI modeling can be categorized into two types: fact tables and dimension tables [32]. A fact table contains key metrics and measurements of business processes that one intends to analyze (e.g., the revenue of sales transactions). In addition, a fact table contains foreign keys that can reference multiple dimension tables, where each dimension table contains detailed information associated with the measurements from a unique facet (e.g., a "Product" dimension table contains details of products sold, whereas a "Date" dimension table has detailed day/month/year info of transactions, etc.). Figure 1 shows a few examples of BI schemas, with fact and dimension tables in different colors.

Such a separation of fact/dimension tables has many benefits, such as storage/query efficiency, ease of maintenance, etc. [15, 32, 40]. The fact/dimension design is a de-facto standard in BI modeling.

Star, snowflake, and constellation schemas. In BI modeling, fact/dimension tables are frequently organized into what is known as star/snowflake/constellation schemas [32], as shown in Figure 1.

Star-schema refers to the cases where there is one fact table, whose foreign-key columns refer to primary-keys from one or more (non-hierarchical) dimension tables, as illustrated by Figure 1(a).

Snowflake-schema generalizes the star-schema, with dimension tables referring to each other in a hierarchical manner. For example, in Figure 1(b), the "Customer" dimension refers to a coarser-grained dimension "Cust-Segment". Similarly, an "Address" dimension can refer to a coarser-grained "City", which in turn refers to "Country".

While there is only one fact table in star and snowflake schemas, constellation-schema generalizes to the cases with multiple fact tables, as shown in Figure 1(c).

We note that these three types of schemas are extensively studied in the literature [15, 32, 40] and widely adopted in practice.
3.2 Harvest Real BI Models

In order to understand real BI models created in the wild, and to systematically evaluate the effectiveness of Auto-BI, we harvest over 100K real BI models from public sources that are created using a popular tool, Power BI [8], whose model files have the suffix ".pbix".

We crawl these ".pbix" model files from two sources. The first is GitHub, where developers and data analysts upload their Power BI models for sharing and versioning. We crawl a total of 27K such model files from GitHub. As a second source, we go through the URLs crawled by a commercial search engine that can lead to ".pbix" model files. Most of these URLs are not crawled by the search engine, so we crawl them ourselves and obtain 86K model files.

Combining data from the two sources gives us a large collection of 100K+ real BI models, which cover diverse BI use cases, including financial reporting, inventory management, among many others.

These real BI models become a valuable dataset for rigorous evaluation – specifically, from each ".pbix" model file, we programmatically extract all tables used in the model, as well as the ground-truth BI model (join relationships) manually specified by users. This enables us to thoroughly evaluate different algorithms using real BI models in the wild, which turn out to be more challenging than the synthetic TPC benchmarks used in prior work, which tend to be clean and simple.

3.3 Problem Statement: Auto-BI

We now define the high-level Auto-BI problem as follows.

Definition 1. [Auto-BI]. Given a set of input tables T = {T_1, T_2, ..., T_n} used for BI modeling, where each table T_i consists of a list of columns T_i = (c_i1, c_i2, ..., c_im), predict a desired BI model M(T) that consists of a set of joins M(T) = {J_ij, J_pq, ...}, where each join J_ij is a join between T_i and T_j, in the form of J_ij = ((c_ik, c_il, ...), (c_jr, c_js, ...)).

Note that the output M(T) is restricted to equi-joins for now, which can be single-column joins, or multi-column joins in the form of J_ij = ((c_ik, c_il, ...), (c_jr, c_js, ...)). Also note that we only state the high-level problem so far, and will defer the exact graph-based formulation to Section 4.3.

In addition to Definition 1, a more general version of the problem goes beyond equi-joins to detect and handle more complex forms of joins, such as transformation-joins [60] and fuzzy-joins [37], which we leave as future work.

4 AUTO-BI

We will start with an overview of our Auto-BI architecture, before discussing how we predict joins holistically on graphs.

4.1 Architecture Overview

Figure 2 shows the overall architecture of our Auto-BI system. It consists of an offline component and an online component, as depicted by the two boxes shown in the figure.
[Figure 2 omitted: offline training of local join prediction models from 100K+ crawled BI models on the web; online optimization over the global join graph, taking user tables as input and producing a predicted BI model as output]
Figure 2: Architecture overview of Auto-BI
In the offline step, using real BI models we harvested and their ground-truth BI join models, we perform offline training to learn what we call "local join models" that can predict, for a given pair of table columns, whether the two are likely joinable. We use the term "local" here because the predictions are pair-wise between two columns, which is in contrast to the "global" decision at the entire graph level (the focus of this work).

Since the problem of predicting local joinability using data has been studied in other contexts (e.g., [48, 56]), we do not regard this as our key contribution in Auto-BI, so we will only briefly highlight the important optimizations we make here (e.g., label transitivity, and splitting 1:1 and N:1 joins) in Section 4.2.

The online step is the key of Auto-BI. Here, given a set of tables (modeled as vertices in a graph), we leverage the local-join prediction models trained offline to "score" the joinability of each pair of columns/tables, using calibrated probabilities (which are then modeled as edges in the graph). We formulate Auto-BI on the resulting graph as a novel graph problem k-MCA, which finds the most probable sub-graph that maximizes the joint probability of all edges, subject to graph-structure constraints. We prove the hardness of the problem and develop efficient algorithms. We will describe this online part in Section 4.3.
4.2 Join Prediction: Train Local Classifier

In this step, given two lists of columns C_i ⊆ T, C_j ⊆ T′, we need to predict the probability that (C_i, C_j) is joinable. (Note that C_i, C_j can be single columns, but are in general lists of columns for multi-column joins.) This problem of predicting local joins has been studied in other contexts [48, 56]. We do not regard this as a new contribution, and will only briefly describe the overall process.

The prediction task. At a high level, given any two candidate columns (C_i, C_j), our task is to predict the corresponding "joinability" label, denoted by L_ij, where L_ij = 1 if (C_i, C_j) joins, and L_ij = 0 otherwise. Since we harvested large amounts of real BI models, each of which contains both data tables and ground-truth joins (programmed by human users), we can use this rich collection of data to produce training data of the form {(C_i, C_j), L_ij}, where L_ij corresponds to the actual joined vs. not-joined ground-truth between the two columns, which can be programmatically extracted from the BI models we harvest.

This naturally leads to a supervised ML formulation, where we featurize (C_i, C_j) both at the schema-level (e.g., column-header similarity using standard string distance functions such as Jaccard and Edit, as well as pre-trained embedding-based similarity such as SentenceBERT [5]), and at the content-level (e.g., column-value overlap based on Jaccard and Containment), for a total of 21 features. In the interest of space, we leave detailed descriptions of these features, as well as two unique optimizations we develop in this work: (1) separate N:1/1:1 joins, and (2) apply label transitivity, to Appendix A and Appendix B.
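To make this step concrete, the sketch below shows one way such schema-level and content-level features could be computed and fed to an off-the-shelf classifier. It is a minimal illustration only: it uses a handful of features rather than the 21 described above, omits the embedding-based and multi-column cases, and all helper names are hypothetical.

```python
# Minimal sketch of the local-join classifier, assuming a scikit-learn stack.
# Only a few illustrative features are shown; all helper names are hypothetical.
from sklearn.ensemble import GradientBoostingClassifier

def header_jaccard(name_a: str, name_b: str) -> float:
    """Jaccard similarity between tokenized column headers."""
    a, b = set(name_a.lower().split("_")), set(name_b.lower().split("_"))
    return len(a & b) / len(a | b) if (a | b) else 0.0

def value_jaccard(values_a, values_b) -> float:
    """Jaccard similarity between the two columns' value sets."""
    a, b = set(values_a), set(values_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def value_containment(values_a, values_b) -> float:
    """Fraction of the left column's values contained in the right column."""
    a, b = set(values_a), set(values_b)
    return len(a & b) / len(a) if a else 0.0

def featurize(name_a, values_a, name_b, values_b):
    return [
        header_jaccard(name_a, name_b),
        value_jaccard(values_a, values_b),
        value_containment(values_a, values_b),
        float(len(set(values_a)) == len(values_a)),  # left column is key-like
        float(len(set(values_b)) == len(values_b)),  # right column is key-like
    ]

# training_pairs: [(name_a, values_a, name_b, values_b, label), ...] extracted
# from harvested BI models, where label = 1 iff the pair is joined in ground truth.
def train_local_classifier(training_pairs):
    X = [featurize(na, va, nb, vb) for na, va, nb, vb, _ in training_pairs]
    y = [label for *_, label in training_pairs]
    return GradientBoostingClassifier().fit(X, y)
```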
Calibrate classifier scores into probabilities. After we train the feature-based model using data extracted from real BI models, given a new pair of columns (C_i, C_j), we can use the model to produce classifier scores and predict the joinability of (C_i, C_j). However, the scores so produced are still heuristic in nature – e.g., a 0.5 classifier score does not necessarily correspond to a true join-probability of 0.5, or a 50% chance of the join being correct. In order to make a principled global decision at the graph level, we "calibrate" the classifier scores into true probabilities, using calibration techniques from the ML literature [41]. For any column pair (C_i, C_j), the calibration step produces P(C_i, C_j) that corresponds to the probability of the pair being joinable – e.g., P(C_i, C_j) = 0.5 really means that a join prediction between the two columns has a 50% chance of being correct. As we will see, this gives a precise probabilistic interpretation, which is important when we reason about the most probable global join graph.
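As an illustration of the calibration step, the sketch below fits an isotonic-regression calibrator on held-out (score, label) pairs; isotonic regression is one standard calibration technique, and the paper does not commit to a specific method, so treat this choice as an assumption.

```python
# Minimal calibration sketch, assuming held-out (raw_score, label) pairs
# collected from the harvested BI models.
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_calibrator(raw_scores, labels):
    """Map raw classifier scores to calibrated join probabilities."""
    calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    calibrator.fit(np.asarray(raw_scores), np.asarray(labels))
    return calibrator

# Usage: after calibration, a value of 0.5 really means "50% chance the join
# is correct", which is what the graph-level reasoning below relies on.
# calibrator = fit_calibrator(validation_scores, validation_labels)
# p_join = calibrator.predict([raw_score_for_pair])[0]
```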
4.3 Auto-BI: Exploit Global Join Graph

We are now ready to solve the Auto-BI problem. We first introduce how we represent tables and candidate joins on a global join graph, before describing our graph formulation.

4.3.1 Representing relationships in a global graph.
Given a set of tables T and possible joins, we can construct a directed graph G = (V, E) as follows. We represent each input table T ∈ T as a vertex v(T) ∈ V, and each possible join candidate between columns (C_i, C_j) as a weighted edge e_ij ∈ E, where the edge weight w(e_ij) is simply the calibrated join probability P(C_i, C_j), produced by our local classifier (Section 4.2), which scores every pair of candidate columns whose containment is over a threshold. We follow the convention of using a directed edge e_ij to represent N:1 joins, which points from the N-side (FK) columns C_i to the 1-side (PK) columns C_j. We represent 1:1 joins as bi-directional edges.
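The sketch below shows one way to materialize this graph with networkx, assuming a list of scored candidate joins from the calibrated local classifier; it keeps at most one candidate edge per ordered table pair for simplicity, and the edge weights already apply the −log transformation introduced in Equation (5) below.

```python
# Minimal sketch of the global join graph. candidate_joins is assumed to be a
# list of tuples (fk_table, fk_col, pk_table, pk_col, p_join, is_one_to_one)
# produced by the calibrated local classifier; names here are hypothetical.
import math
import networkx as nx

def build_join_graph(tables, candidate_joins):
    G = nx.DiGraph()
    G.add_nodes_from(tables)                      # one vertex per input table
    for fk_t, fk_c, pk_t, pk_c, p, one_to_one in candidate_joins:
        w = -math.log(p)                          # w(e_ij) = -log P(C_i, C_j)
        G.add_edge(fk_t, pk_t, weight=w, prob=p, columns=(fk_c, pk_c))
        if one_to_one:                            # 1:1 joins become bi-directional
            G.add_edge(pk_t, fk_t, weight=w, prob=p, columns=(pk_c, fk_c))
    return G
```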
Example 1. Figure 3 shows a graph representation of the tables in Figure 1(b), where each vertex corresponds to a table. We mark the ground-truth joins in Figure 1(b) as solid green edges in Figure 3, while other candidate joins not in the ground truth are dotted red edges. For instance, the dotted edge (e5: 0.8) represents a candidate join between the column "Customer-ID" (in table "Cust-Details") and column "Customer-Segment-ID" (in table "Cust-Segments"). Note that the column pair ("Customer-ID", "Customer-Segment-ID") should not join because they refer to two semantically different types of IDs, which however may appear like a plausible join to the Local-Classifier (because of high name-similarity and value-overlap), which leads to a high Local-Classifier score (0.8). A greedy method that focuses on promising edges locally can incorrectly predict this false-positive join, which is a mistake that global methods like Auto-BI can prevent.

[Figure 3 omitted: join-candidate graph over the tables of Figure 1(b), with edge probabilities e1:0.9, e2:0.7, e3:0.6, e4:0.7, e5:0.8, e6:0.4, e7:0.8, e8:0.9]
Figure 3: Solve 1-MCA on the graph representation of the tables in Figure 1(b), with join candidates as edges.

[Figure 4 omitted: join-candidate graph over the tables of Figure 1(c), with edge probabilities e1:0.9, e2:0.7, e3:0.6, e4:0.7, e7:0.8, e8:0.9, e9:0.9, e10:0.7, e11:0.8, e12:0.8, plus an artificial root]
Figure 4: Solve k-MCA on the graph representation of the tables in Figure 1(c), with join candidates as edges.
4.3.2 Precision Mode (k-MCA-CC).
We now introduce our formulation using the graph. At a high level, we operate in two steps: (1) a "precision-mode" stage where we focus on finding the salient snowflake-like structures that are the "backbones" of the underlying schema graph (which ensures high precision thanks to the graph-structure constraints it imposes); and (2) a "recall-mode" stage that complements the precision-mode by finding additional joins beyond typical snowflakes. We will introduce both in turn below.

We first introduce our precision-mode formulation, referred to as k-MCA-CC. For ease of exposition, we will illustrate the thought process of arriving at k-MCA-CC in 3 steps: (1) We will start with a simplistic assumption that the schema graph has exactly one snowflake, which we model with a graph-theoretical problem called 1-MCA. (2) We then generalize this to arbitrary numbers of snowflakes, using a formulation we call k-MCA. (3) Finally, we generalize k-MCA to include additional graph-structure constraints to arrive at our final formulation, which we call k-MCA-CC. We summarize our key results in this section in Table 1 for reference.

Table 1: Summary of results for MCA problem variants.
Problem      | 1-MCA (known results)        | k-MCA (new results)             | k-MCA-CC (new results)
Description  | find the most probable       | find the most probable          | most probable k-snowflakes
             | 1-snowflake                  | k-snowflakes                    | w/ constraints
Hardness     | Poly-time solvable [18]      | Poly-time solvable (Theorem 2)  | EXP-APX-hard (Theorem 3)
Algorithm    | Chu-Liu/Edmonds'             | ours (Algorithm 2)              | ours (Algorithm 3)
             | algorithm [18]               |                                 |

(1) Exactly one snowflake: 1-MCA. To start with, we discuss the simplistic case where we know there is exactly one snowflake in the underlying graph (e.g., Figure 1(b)). Note that this is only for ease of illustration, and not an actual assumption that we rely on (we will show how this can be relaxed next).

We are given a graph G = (V, E) where candidate joins are marked as edges, like in Figure 3. Since we know there is one snowflake schema, intuitively we want to select edges (joins) J ⊆ E, such that:
(a) The graph induced by J, G′ = (V, J), is a snowflake that connects all vertices in V;
(b) If more than one such snowflake-structure exists, find the most probable snowflake, based on the joint probability of all joins selected in J, P(J) = Π_{e_ij ∈ J} P(C_i, C_j).
These two considerations can be written as the following optimization problem, which we refer to as 1-Most-Probable-Snowflake (1-MPS):

(1-MPS)   max_{J ⊆ E}  Π_{e_ij ∈ J} P(C_i, C_j)        (1)
          s.t.  G′ = (V, J) is a snowflake              (2)

Note that the constraint in Equation (2) of 1-MPS corresponds to requirement (a), while the objective function in Equation (1) maps to requirement (b).

Since "snowflake" is used informally in BI, to make it more formal we map it to a structure in graph theory called arborescence [22].

Definition 2. [Arborescence]. A directed graph G = (V, E) is called an arborescence, if there is a unique vertex r ∈ V known as the root, such that there is exactly one directed path from r to every other v ∈ V, v ≠ r. Equivalently, a directed graph G is an arborescence if all its vertices have in-degree of 1, except a unique root vertex r ∈ V that has in-degree of 0 [22].

Intuitively, an arborescence is a directed rooted tree where all its vertices are pointing away from the root. We use an example to show the relationship between snowflakes and arborescences.

Example 2. Consider the sub-graph G induced by all green edges in Figure 3, where each green edge corresponds to a ground-truth join in the snowflake schema of Figure 1(b). This sub-graph G is an arborescence, because if we take the vertex marked as "Fact_Sales" as the root r, then there is exactly one directed path from r to every other vertex in G. Equivalently, we can check that this root r has in-degree of 0, while all other vertices in G have in-degree of exactly 1, which also ensures that G is an arborescence.

Recall that we know there is exactly one snowflake in 1-MPS, which is equivalent to saying that the sub-graph induced by J ⊆ E is an arborescence. We rewrite 1-MPS into 1-MPA (1-Most-Probable-Arborescence), defined as follows.

(1-MPA)   max_{J ⊆ E}  Π_{e_ij ∈ J} P(C_i, C_j)        (3)
          s.t.  G′ = (V, J) is an arborescence          (4)

Example 3. We revisit the graph in Figure 3. Using the formulation of 1-MPA, the best arborescence with the highest joint probability (correspondingly, the most probable snowflake) is the set of solid edges marked in green, denoted by J* = {e1, e2, e3, e4, e7, e8}, whose joint probability is 0.9 ∗ 0.7 ∗ 0.6 ∗ 0.7 ∗ 0.8 ∗ 0.9. It can be verified that J* is the most probable arborescence among all arborescences (that span the entire vertex set), which corresponds to the ground-truth snowflake schema shown in Figure 1(b).

Consider an alternative arborescence, such as J′ = {e1, e3, e5, e6, e7, e8}, which removes e2, e4 from J* and adds e5, e6 (marked in dotted red lines). Observe that this new J′ also forms a 1-arborescence based on Definition 2. However, J′ has a lower joint probability than J* (because e5, e6 in J′ have a joint probability of 0.8 ∗ 0.4 = 0.32, while e2, e4 in J* have a joint probability of 0.7 ∗ 0.7 = 0.49), and it will thus not be selected based on the 1-MPA formulation above.

We note that in 1-MPA, J* is selected based on a global decision of optimality, which avoids locally-greedy decisions – e.g., the incorrect e5 will not be selected despite its high score, as we mentioned in Example 1.

While we can now use arborescence to formalize 1-MPA, it still has a cross-product that is not amenable to optimization. We thus perform the following transformation: instead of assigning the join probability P(C_i, C_j) as the edge weight for each edge e_ij, we perform a logarithmic transformation and set the edge weight of each e_ij as:

w(e_ij) = −log(P(C_i, C_j))        (5)
Using the new transformed w(e_ij), for each instance of the 1-MPA problem, we construct a new minimization problem below that we term 1-MCA, which uses summation in the objective function:

(1-MCA)   min_{J ⊆ E}  Σ_{e_ij ∈ J} w(e_ij)             (6)
          s.t.  G′ = (V, J) is an arborescence           (7)

It can be shown that solving 1-MPA is equivalent to solving the corresponding 1-MCA.

Lemma 1. A solution J* ⊆ E is an optimal solution to 1-MPA, if and only if J* is an optimal solution to the corresponding 1-MCA.

Proof. First, observe that 1-MPA and 1-MCA have identical feasible regions, as they are subject to the same constraints.
Next, we show that an optimal solution J* that maximizes the objective function Π_{e_ij ∈ J*} P(C_i, C_j) in Equation (3) of 1-MPA will simultaneously minimize the objective function Σ_{e_ij ∈ J} w(e_ij) in Equation (6) of 1-MCA. This is the case because the objective function of 1-MCA is Σ_{e_ij ∈ J*} w(e_ij) = Σ_{e_ij ∈ J*} −log P(C_i, C_j) = −log(Π_{e_ij ∈ J*} P(C_i, C_j)), where the term inside −log() is exactly the objective function of 1-MPA, thus ensuring that 1-MCA is minimized if and only if 1-MPA is maximized, which completes the proof. □
The reason we construct 1-MCA for each instance of 1-MPA is that 1-MCA relates to a known problem in graph theory called "Minimum-Cost-Arborescence" (abbreviated as MCA²) [20], which finds a spanning arborescence (covering all vertices) in a directed graph G that has the smallest edge-weights. Note that in the simplistic setting where we know there is exactly 1 snowflake (arborescence), our 1-MCA problem directly translates to the MCA problem [20]. Since MCA is known in graph theory with polynomial-time solutions called the Chu–Liu/Edmonds' algorithm [18, 20], our construction allows us to solve 1-MPA efficiently by leveraging the same algorithms.

²Although MCA is not a well-known concept, we note that MCA is directly analogous to a better-known problem in graph theory called "Minimum-Spanning-Tree (MST)" [31], with the only difference that MST is defined on undirected graphs whereas MCA is on directed graphs.

We use our running example to show the connection between 1-MPA and 1-MCA.

Example 4. Continue with Example 3. To solve the 1-MPA problem for the graph in Figure 3, we use the transformation in Equation (5) and construct an instance of 1-MCA on the same graph, where all edge-weights are now w(e_ij) = −log(P(C_i, C_j)). It can be verified that J* = {e1, e2, e3, e4, e7, e8} is the minimizer of Equation (6) in 1-MCA, with the smallest objective-value −(log(0.9) + log(0.7) + log(0.6) + log(0.7) + log(0.8) + log(0.9)), which can be efficiently solved using the Chu–Liu/Edmonds algorithm. Note that J* is the same optimal solution to 1-MPA, as we saw in Example 3.

Algorithm for 1-MCA. Given a set of input tables T (for which a BI-model needs to be built), the complete pseudo-code to construct a graph G for 1-MCA is shown in Algorithm 1.

Algorithm 1: Construct graph with edge-weights
  input:  all input tables T in a BI-model
  output: Graph G = (V, E) that represents T
  1  V ← {v_T | T ∈ T}, with v_T representing each T ∈ T
  2  E ← {}
  3  foreach (C_i, C_j) satisfying Inclusion-Dependency in T do
  4      P(C_i, C_j) ← Local-Classifier(C_i, C_j)
  5      w(e_ij) ← −log(P(C_i, C_j))
  6      E ← E ∪ {e_ij}, with edge-weight w(e_ij)
  7  return G(V, E)

Since we are given tables T, in Line 3 we enumerate column-pairs (C_i, C_j) in T for which Inclusion-Dependencies (IND) [14] hold approximately, which are possible joins that we should consider (note that efficient IND-enumeration is a standard step [19], so we invoke existing techniques here). In Line 4, we "score" each (C_i, C_j) using our Local-Classifier to obtain calibrated probabilities P(C_i, C_j) (Section 4.2), which are transformed in Line 5 to become edge-weights w(e_ij) (Equation (5)).

Using the resulting graph G constructed from T, we can invoke the Chu–Liu/Edmonds' algorithm, which yields the optimal solution J* to 1-MCA (and thus also 1-MPA).
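For illustration, networkx ships an implementation of Edmonds' algorithm, so 1-MCA can be solved directly on a graph built as in the earlier sketch (assuming a spanning arborescence exists; networkx raises an exception otherwise):

```python
# Minimal 1-MCA sketch using networkx's Edmonds implementation. Assumes G was
# built as in the earlier sketch, with 'weight' = -log(P).
import networkx as nx

def solve_1_mca(G: nx.DiGraph) -> nx.DiGraph:
    # Returns the spanning arborescence minimizing total edge weight, i.e.,
    # the most probable single snowflake (1-MPA) by Lemma 1.
    return nx.minimum_spanning_arborescence(G, attr="weight")
```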
We summarize our main result for 1-MCA in the first column of Table 1. We note that although we leverage known results from graph theory here, to the best of our knowledge, we are the first to connect the popular concept of snowflakes in BI with arborescence in graph theory.

In the following, we will extend 1-MCA to general cases, and develop new results not known in graph theory or databases.

(2) Arbitrary number of snowflakes: k-MCA. So far we use 1-MCA to solve the simplistic case where we know there is exactly 1 snowflake. In general, there can be multiple snowflakes, and we do not know the exact number beforehand (e.g., the ground-truth schema of Figure 1(c) has 2 snowflakes, which is unknown a priori). In this section, we extend 1-MCA to the more general k-MCA, where the proposed k-MCA formulation will not only find the most probable k snowflakes like in 1-MCA, but also infer the right number of snowflakes k at the same time.

We first extend the concept of arborescence in graph theory (Definition 2) to the more general k-arborescence below.

Definition 3. [k-arborescence]. A directed graph G = (V, E) is a k-arborescence, if its underlying undirected graph has a total of k disjoint connected components, written as {G_i = (V_i, E_i) | i ∈ [k]}, such that ∪_{i ∈ [k]} V_i = V and ∪_{i ∈ [k]} E_i = E. Furthermore, each G_i is an arborescence, for all i ∈ [k].

Note that when k = 1, the notion of 1-arborescence degenerates into the arborescence described in Definition 2. We will henceforth use 1-arborescence and arborescence interchangeably when the context is clear.

Example 5. Consider the graph in Figure 4, which is the graph representation of the constellation (multi-snowflake) schema in Figure 1(c). The sub-graph with green solid edges (ignoring the dotted edges for now) is a 2-arborescence. It is a 2-arborescence because both of its two connected components are arborescences (Definition 2), rooted at "Fact_Sales" and "Fact_Supplies", respectively.
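For illustration, checking whether a selected edge set forms a k-arborescence over a given vertex set (and recovering k) is straightforward with networkx:

```python
# Minimal check for Definition 3: every weakly connected component of the
# selected subgraph must itself be an arborescence; k is the component count.
import networkx as nx

def k_arborescence_k(G: nx.DiGraph, edges: set):
    """Return k if 'edges' induce a k-arborescence over G's vertices, else None."""
    sub = nx.DiGraph(list(edges))
    sub.add_nodes_from(G.nodes())
    components = list(nx.weakly_connected_components(sub))
    for comp in components:
        if not nx.is_arborescence(sub.subgraph(comp)):
            return None
    return len(components)
```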
Intuitively, we use the k-arborescence structure to force the k underlying snowflakes to emerge, which reveals the most salient "backbones" of the underlying schema. It should be noted that in the precision-mode of Auto-BI (this section), because we focus exclusively on finding snowflake/arborescence structures to ensure high precision, some desired joins may be missing in the k-arborescence (e.g., the orange dotted edges in Figure 4), which is something that we will get to in the recall-mode of Auto-BI next (Section 4.3.3).

Given that we want to use k-arborescence to let the k snowflakes emerge, the next question is how to find the right k. Conceptually, we can iterate over all possible k, and pick the "best" k, using a suitable "goodness" function (e.g., revealing the right number of snowflakes).

First, recall that a k-arborescence on graph G = (V, E) has exactly k connected components, and is bound to have exactly (|V| − k) edges (because every vertex except the root has in-degree of 1, per Definition 2).

Given these, we can see that while the "sum of edge-weights" objective function in Equation (6) for 1-MCA is a suitable "goodness" function to compare two 1-arborescences (both with k = 1), it is no longer suitable to compare a k1-arborescence and a k2-arborescence (with k1 ≠ k2), because the two will have different numbers of edges, making the comparison using the sum of edge-weights in Equation (6) "unfair". In fact, using Equation (6) with a flexible k is guaranteed to lead to |V| disconnected vertices (a trivial |V|-arborescence) as the optimal, because it has no edges and thus no edge-weights.

To make the comparison "fair" between two k-arborescences with different values of k, and prevent disconnected components from having artificial advantages, we modify the objective function as follows. Since a k-arborescence has exactly k connected components and (|V| − k) edges, we imagine that there are (k − 1) "virtual edges", each with a parameterized edge-weight p, that connect the k connected components into one, such that a k-arborescence always has the same number of edges as a 1-arborescence, regardless of k (because (|V| − k) + (k − 1) = (|V| − 1)). Accounting for the (k − 1) new virtual edges in Equation (6) leads to Equation (8) below, which is the new objective function we use for k-MCA:

(k-MCA)   min_{J ⊆ E, k ≤ |V|}  Σ_{e_ij ∈ J} w(e_ij) + (k − 1) · p        (8)
          s.t.  G′ = (V, J) is a k-arborescence                            (9)
The parameter p we introduce effectively controls the number of snowflakes (e.g., a larger p would "penalize" having more disconnected snowflakes). The question is how to set p appropriately. Recall that we use the calibrated join-probability P(C_i, C_j) to produce the edge weight w(e_ij) = −log(P(C_i, C_j)); then naturally the virtual edges should be imagined as edges with a join-probability of exactly 0.5, which means a 50% chance of being joinable, and thus a coin-toss that is ok to either include or exclude (which makes them "virtual").

As another way to look at this, consider when we drop a regular edge e_ij from a k-arborescence to create a (k + 1)-arborescence – the objective function of the latter in Equation (8) will have the edge-weight w(e_ij) removed, but incur a penalty cost of p from the additional virtual edge we introduce. When the penalty-weight p on the virtual edge corresponds to a join-probability of 0.5, it would discourage e_ij from being dropped if its P(C_i, C_j) is over 0.5 (50% chance of being joinable), and encourage e_ij to be dropped if its P(C_i, C_j) is under 0.5, which is exactly the right thing to do.

We therefore use p = −log(0.5) as our natural choice of penalty weight in k-MCA. Our experiments suggest that empirically 0.5 is indeed the right choice (Section 5), thanks to the fact that we use true calibrated probabilities (Section 4.2).

Observe that if we know k = 1 a priori, our k-MCA degenerates exactly into 1-MCA like before. When k is unknown, however, the objective function in Equation (8) helps reveal the best k-snowflakes, as well as the right number k. We show this intuitively using the example below.

Example 6. We revisit Figure 4, where all edges are possible join candidates, and solve k-MCA on this example graph.
First, observe that there is no 1-arborescence in this graph. Let J* be all the green edges; then the subgraph induced by J* is a 2-arborescence (rooted at the two fact-tables). Since k = 2, the penalty term in this case is (2 − 1)(−log(0.5)) = 1. It can be verified that J* has the lowest cost among all 2-arborescences.
It is possible to have 3-arborescences here too – for example, if we remove e1 from J*, then the remaining green edges in J′ = J* \ {e1} induce a 3-arborescence. It can be verified that the cost of J′ is higher than that of J*, because J′ removes e1 from J* and thus lowers its cost by w(e1) = −log(0.9) = 0.13, but incurs a higher penalty-cost of (3 − 1)(−log(0.5)) = 2, which makes J′ less desirable than J*.

Algorithm for k-MCA. A naive way to solve k-MCA that builds on top of 1-MCA above is to enumerate different values of k, and for each k, exhaustively enumerate all k-way partitions of G, then invoke 1-MCA on each resulting graph partition, to find the optimal k-MCA. This approach is straightforward but inefficient (because k can be as large as |V|, making it exponential in |V|).

We design an algorithm that solves k-MCA using a graph-based construction that reduces any instance of the new k-MCA problem into one instance of 1-MCA (which admits efficient solutions [18]). Specifically, given a graph G = (V, E) on which k-MCA needs to be solved, we introduce a new vertex r that is an "artificial root", and connect r with all v ∈ V using edges e(r, v), with edge-weight w(e(r, v)) = p. This leads to a new constructed graph G′ = (V′, E′) where V′ = V ∪ {r}, E′ = E ∪ {e(r, v) | v ∈ V}.

We solve 1-MCA on the new G′, and let J1* be the optimal 1-MCA. Let Jk* = J1* \ {e(r, v) | v ∈ V} be the edges in J1* not incident with the artificial root r. We can show that Jk* is the optimal solution to k-MCA on the original G. The complete steps of this algorithm are shown in Algorithm 2.

Algorithm 2: Solve k-MCA for constellation schema
  input:  Graph G = (V, E)
  output: optimal k-MCA (k-snowflakes)
  1  V′ ← V ∪ {r}
  2  E′ ← E ∪ {e(r, v) | v ∈ V}, with w(e(r, v)) = p
  3  J1* ← solve 1-MCA on G′ = (V′, E′) with Chu-Liu/Edmonds' algorithm
  4  Jk* = J1* \ {e(r, v) | v ∈ V}
  5  return Jk*
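A minimal sketch of this construction (Algorithm 2), reusing the 1-MCA solver from the earlier sketch; the artificial-root label is hypothetical:

```python
# Minimal k-MCA sketch (Algorithm 2): add an artificial root connected to every
# vertex with weight p = -log(0.5), solve 1-MCA, then drop the root edges.
import math
import networkx as nx

def solve_k_mca(G: nx.DiGraph, p: float = -math.log(0.5)) -> set:
    root = "__root__"                      # hypothetical name for the artificial root
    G_prime = G.copy()
    G_prime.add_node(root)
    for v in G.nodes():
        G_prime.add_edge(root, v, weight=p)
    arb = nx.minimum_spanning_arborescence(G_prime, attr="weight")
    # J_k* = J_1* minus all edges incident to the artificial root
    return {(u, v) for u, v in arb.edges() if u != root}
```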
Theorem 2. Algorithm 2 solves k-MCA optimally, in time polynomial in the input size.

Proof. Given a graph G = (V, E), let Jk* be the optimal k-MCA of G, with objective function value c_k(Jk*), where c_k is the objective function defined in Equation (8). Because Jk* is a k-arborescence, by Definition 3, Jk* has exactly k disjoint components {G1, G2, ..., Gk}, all of which are 1-arborescences, with R = {r1, r2, ..., rk} as their roots.
We show that on our constructed graph G′ = (V′, E′), the solution J1* = Jk* ∪ {(r, r_i) | i ∈ [k]} must be an optimal solution to the 1-MCA problem on G′.

We show this by contradiction. Suppose J1* is not an optimal solution on G′, which means that there is another J1′ with a smaller cost as defined in the 1-MCA objective function. Letting the objective function in Equation (6) be c_1, we can write this as

c_1(J1′) < c_1(J1*)        (10)

In the solution J1′, let R′ be the set of vertices incident with the artificial root r. Removing from J1′ the set of edges incident with r produces Jk′ = J1′ \ {e(r, r′) | r′ ∈ R′}. Jk′ is an |R′|-arborescence, and thus a feasible solution to k-MCA on G. Similarly, removing from J1* the set of edges incident with r produces Jk* = J1* \ {e(r, r′) | r′ ∈ R}, which is also a feasible solution to k-MCA on G.

Since Jk* is an |R|-arborescence and Jk′ is an |R′|-arborescence, by the definition of c_k in Equation (8), we know

c_k(Jk*) = Σ_{e ∈ Jk*} w(e) + (|R| − 1)p        (11)
c_k(Jk′) = Σ_{e ∈ Jk′} w(e) + (|R′| − 1)p        (12)

Because we know Jk* is an optimal solution to k-MCA on G, we know c_k(Jk*) ≤ c_k(Jk′). Combining this with Equations (11) and (12), we get:

Σ_{e ∈ Jk*} w(e) + (|R| − 1)p ≤ Σ_{e ∈ Jk′} w(e) + (|R′| − 1)p        (13)

Adding p to both sides of Equation (13), we get Σ_{e ∈ Jk*} w(e) + |R|·p ≤ Σ_{e ∈ Jk′} w(e) + |R′|·p. Note that the left-hand side is exactly the objective-value of J1* on G′, or c_1(J1*), while the right-hand side is the objective-value of J1′ on G′. This gives c_1(J1*) ≤ c_1(J1′), contradicting our assumption in Equation (10), thus proving that J1* must be an optimal solution to 1-MCA on G′.

Since J1* can be solved optimally in Line 3 of Algorithm 2 for 1-MCA in polynomial time, and all other steps in Algorithm 2 also take time polynomial in the input size, we conclude that Algorithm 2 solves k-MCA optimally, in polynomial time. □

Example 7. We continue with Example 6. Using Algorithm 2, we would construct a new graph G′ by adding an artificial root r (shown on the right of Figure 4), which connects to all existing vertices v_i with an edge e(r, v_i) with the same penalty weight w(e) = p. Using Line 3 of Algorithm 2 to solve 1-MCA on G′ produces the optimal solution J1*, which consists of all solid green edges plus two artificial edges connecting r with the two fact-tables, and which can be verified to be the optimal 1-MCA. Line 4 of Algorithm 2 would then produce Jk* corresponding to all green edges (removing the two artificial edges connecting fact-tables to the artificial root). It can be verified that Jk* in this example is the optimal k-MCA, which is also the ground-truth schema in Figure 1(c).

(3) Arbitrary number of snowflakes with constraints: k-MCA-CC. The k-MCA formulation we considered so far can handle arbitrary numbers of snowflakes. This, however, can be further improved by adding an additional constraint on the graph that we call "FK-once".³

³Other possible constraints that can benefit the inference, such as "no cycles" of join paths in the inferred schema graph, are implicit from the definition of arborescence.

FK-once refers to the property that the same FK column in a fact-table should likely not refer to two different PK columns in two dimension tables (a common property known in the database literature [17, 30]). On a graph, such a structure would correspond to two edges (C_i, C_j), (C_i, C_m), pointing from the same C_i to two separate C_j and C_m, which usually indicates that one of the joins is incorrect. For example, the same FK column "Customer-ID" in a "Sales" table may appear joinable with both (1) the PK column "C-ID" of the "Customers" table, and (2) the PK "Customer-Segment-ID" of the "Customer-Segments" table (because both have high column-header similarity and value-overlap). However, we know that an FK should likely only join one PK (or otherwise we would have two redundant dimension tables).

We add the FK-once constraint as a cardinality-constraint (CC) to k-MCA, which leads to k-MCA-CC below:

(k-MCA-CC)   min_{J ⊆ E, k ≤ |V|}  Σ_{e_ij ∈ J} w(e_ij) + (k − 1) · p        (14)
             s.t.  G′ = (V, J) is a k-arborescence                            (15)
                   i ≠ l,  ∀ e_ij ∈ J, e_lm ∈ J, j ≠ m                         (16)

Note that Equation (16) is the new FK-once constraint, which states that no two edges e_ij, e_lm in the selected edge-set J should share the same starting column-index, or (i ≠ l).⁴

⁴Note that the constraint is on column-index and at a column-level – one table can still have multiple different FK columns pointing from the same table/vertex to different PK columns of different tables/vertices.

Theorem 3. k-MCA-CC is NP-hard. Furthermore, it is Exp-APX-complete, making it inapproximable in polynomial time.

We prove Theorem 3 using a reduction from non-metric min-TSP [21]. The proof can be found in Appendix D. It is interesting to see that adding one constraint in k-MCA-CC makes it considerably more difficult than k-MCA (which, however, is important in terms of result quality, as we will show in experiments).
Algorithm for k-MCA-CC. Despite the hardness, we develop a novel algorithm that solves k-MCA-CC optimally, leveraging the branch-and-bound principle [35] and the sparsity of join edges pointing from the same columns.

Algorithm 3: Solve k-MCA-CC
  input:  Graph G = (V, E)
  output: optimal solution to k-MCA-CC (k-snowflakes)
  1  J ← Solve-k-MCA(G) using Algorithm 2
  2  if J is a feasible solution to k-MCA-CC(G) then
  3      return J
  4  else if cost c(J) is higher than the best k-MCA-CC found so far then
  5      return null
  6  else
  7      C_s ← edges in J with the same source column index s that violate the FK-once constraint (Equation (16))
  8      C_s^1, C_s^2, ..., C_s^{|C_s|} ← disjoint subsets of C_s, each with exactly one edge from C_s
  9      E_i = E \ C_s ∪ C_s^i, ∀ i ∈ [|C_s|]
 10      G_i = (V, E_i), ∀ i ∈ [|C_s|]
 11      J_i = call Solve-k-MCA-CC(G_i) recursively, ∀ i ∈ [|C_s|]
 12      J* = argmin_{J_i, i ∈ [|C_s|]} c(J_i)
 13      return J*

Algorithm 3 shows the pseudo-code of this recursive procedure. We first invoke Algorithm 2 to solve the unconstrained version k-MCA and obtain J (Line 1). We check J for constraint violations (Line 2) – if there is no violation, we are done and J is the optimal solution to k-MCA-CC. Alternatively, if J has constraint violations, but the cost of the unconstrained version c(J) is already higher than the best solution to k-MCA-CC found so far (Line 4), we know adding the constraints will only degrade solution quality, so we return null for this solution space. Otherwise, in Line 7, let C_s = {e_sj, e_sk, ...} ⊆ J be one set of conflicting edges in J from the same column (thus violating the FK-once constraint). We partition C_s into |C_s| subsets C_s^1, C_s^2, ..., C_s^{|C_s|}, each with exactly one edge in C_s (Line 8). We then construct |C_s| k-MCA-CC problem instances, each with a new graph G_i = (V_i, E_i), where V_i = V, E_i = E \ C_s ∪ C_s^i. We recursively solve k-MCA-CC on each graph G_i to get J_i (Line 11). Letting c(J) be the objective function in Equation (14), the J_i that minimizes the cost function, J* = argmin_{J_i} c(J_i), is the optimal solution to the original k-MCA-CC problem on G (Line 12).
We, therefore, introduce the “recall-mode” of Auto-BI, where
in Equation (14), the 𝐽𝑖 that minimizes the cost function, 𝐽 ∗ = we “grow” additional edges on top of the “backbone” identified in
argmin 𝐽𝑖 𝑐 (𝐽𝑖 ), is the optimal solution to the original 𝑘-MCA-cc the precision-mode.
problem on 𝐺 (Line 12). Specifically, given the original graph 𝐺 = (𝑉 , 𝐸), let 𝐽 ∗ be the
Note that our recursive call (Line 11) partitions and reduces the optimal solution to 𝑘-MCA-CC. Let 𝑅 = {𝑒𝑖 𝑗 |𝑒𝑖 𝑗 ∈ (𝐸 \ 𝐽 ∗ ), 𝑃 (𝑒𝑖 𝑗 ) ≥
solution space into disjoint subsets without losing optimal solutions, 𝜏 } be the remaining edges that are promising (meeting a precision
while the pruning step (Line 4) quickly prunes away unpromising threshold 𝜏 5 ) but not yet selected by 𝐽 ∗ . For our recall-oriented
solution spaces leveraging efficient 𝑘-MCA algorithms. This two solution of Auto-BI, we want to select as many as edges 𝑆 ⊆ 𝑅,
steps can conceptually be seen as a form of branch-and-bound used subject to certain graph-structure constraints below. We write this
in the optimization literature [35], but is specifically tailored to as EMS (Edge-Maximizing Schema) below:
our graph problem (leveraging the mutually-exclusive nature of (EMS) argmax |𝑆 | (17)
𝑆 ⊆𝑅
conflicting edges based on the FK-once property).
s.t. 𝑖 ≠ 𝑙, ∀𝑒𝑖 𝑗 , 𝑒𝑙𝑚 ∈ 𝑆 ∪ 𝐽 ∗, 𝑗 ≠ 𝑚 (18)
Theorem 4. Algorithm 3 solves 𝑘-MCA-CC optimally.

𝑆 ∪ 𝐽 is cycel-free (19)
A proof of the theorem can be found in Appendix E. Intuitively,
The new constraint in Equation (19) ensures that there should
this algorithm ensures optimality, because at most one edge in
be no cycles in the resulting graph induced by 𝑆 ∪ 𝐽 ∗ , as circular
𝐶𝑠 may appear in the optimal solution to 𝑘-MCA-CC (otherwise
joins between columns are uncommon at a schema level.
the FK-once constraint is violated), making edges in 𝐶𝑠 mutually
The EMS problem is NP-hard, and is 1/2-inapproximable using
exclusive. This allows us to partition the conflicting edges in 𝐶𝑠 ,
a reduction from Max-Sub-DAG [26]. However, because this step
and recursively construct instances of the problem with smaller
operates on top of the optimal solution 𝐽 ∗ to 𝑘-MCA-CC, there is
feasible regions, without losing the optimal solution in the process.
typically limited flexibility in selecting from 𝑅. We find different
Given the hardness of 𝑘-MCA-CC, we clearly cannot hope Algo-
solutions have very similar results (Section 5), and we thus solve
rithm 3 to solve 𝑘-MCA-CC in polynomial time. In practice, how-
EMS greedily by picking the most confident edges (based on 𝑃 (𝑒𝑖 𝑗 )),
ever, our branch-and-bound partitioning exploits the sparseness of
without using more expensive solutions.
edges (there are few edges pointing from the same columns), which
This now completes the overall Auto-BI. To recap, we first solve
turns out to be highly efficient. In our experiments (Section 5), we
𝑘-MCA-CC optimally using Algorithm 3 in the precision-mode, to
observe the mean and median latency of Algorithm 3 on real-world
get 𝐽 ∗ (the salient 𝑘-snowflakes). Then in the recall-mode, we solve
BI schemas is 0.11 and 0.02 second, respectively, with the max be-
EMS using 𝐽 ∗ and obtain 𝑆 ∗ . The sub-graph induced by 𝐽 ∗ ∪ 𝑆 ∗ is
ing 11 seconds on a case with 88 data-tables (the largest case we
the final solution of Auto-BI.
encounter in the 100K+ real BI models harvested). We note that
this is encouraging, and analogous to many classical combinatorial
problems (e.g., Euclidean TSP and Knapsack) that are intractable in
5 EXPERIMENTS
theory but reliably solvable in practice. We perform large-scale evaluations to test Auto-BI and related
4.3.3 Recall Mode (MaxEdge-CC). methods in the literature, in both quality and efficiency.
At this point, we did with the precision-mode of Auto-BI, which is
precision-oriented as we focus on finding the salient 𝑘-snowflakes 5.1 Evaluation Setup
to reveal the “backbone” of the underlying schema. However, not all Benchmarks. We use two benchmarks to evaluate Auto-BI.
desired joins are included in 𝑘-snowflakes/arboresence. For exam-
ple, the green edges in Figure 4 correspond to the optimal solution 5 Bydefault, we threshold with 𝜏 = 0.5 here, since our 𝑃 (𝑒𝑖 𝑗 ) are calibrated probability,
to 𝑘-MCA-CC, but we are still missing two joins in the ground truth, and 0.5 is a natural cutoff for joinability.
Real (OLAP) Synthetic (OLAP) Synthetic (OLTP)
Benchmark 1000-case Real benchmark TPC-H TPC-DS TPC-C TPC-E
Method Category Method 𝑃𝑒𝑑𝑔𝑒 𝑅𝑒𝑑𝑔𝑒 𝐹𝑒𝑑𝑔𝑒 𝑃𝑐𝑎𝑠𝑒 𝑃𝑒𝑑𝑔𝑒 𝑅𝑒𝑑𝑔𝑒 𝐹𝑒𝑑𝑔𝑒 𝑃𝑒𝑑𝑔𝑒 𝑅𝑒𝑑𝑔𝑒 𝐹𝑒𝑑𝑔𝑒 𝑃𝑒𝑑𝑔𝑒 𝑅𝑒𝑑𝑔𝑒 𝐹𝑒𝑑𝑔𝑒 𝑃𝑒𝑑𝑔𝑒 𝑅𝑒𝑑𝑔𝑒 𝐹𝑒𝑑𝑔𝑒
Auto-BI-P 0.98 0.664 0.752 0.92 1 0.88 0.93 0.99 0.28 0.43 1 0.6 0.75 1 0.4 0.58
Auto-BI Auto-BI 0.973 0.879 0.907 0.853 1 1 1 0.96 0.91 0.93 1 0.8 0.89 0.96 0.93 0.95
Auto-BI-S 0.951 0.848 0.861 0.779 1 1 1 0.92 0.89 0.91 1 0.7 0.82 0.93 0.94 0.94
Commercial System-X 0.916 0.584 0.66 0.754 0 0 0 0 0 0 0 0 0 0 0 0
MC-FK 0.604 0.616 0.503 0.289 1 1 1 0.73 0.65 0.68 0.46 0.8 0.63 0.57 0.79 0.48
Fast-FK 0.647 0.585 0.594 0.259 0.71 0.88 0.79 0.62 0.35 0.44 0.62 0.57 0.6 0.73 0.84 0.78
Baselines
HoPF 0.684 0.714 0.67 0.301 0.86 0.75 0.8 0.87 0.51 0.65 0.75 0.7 0.72 0.71 0.91 0.81
ML-FK 0.846 0.77 0.773 0.557 0.6 0.75 0.667 0.369 0.589 0.454 0.273 0.3 0.286 0.694 0.756 0.723
Language model GPT-3.5 0.73 0.64 0.67 0.43 0.75 0.75 0.75 0 0 0 0.16 0.2 0.15 0.78 0.56 0.65
Table 5: Quality comparison on the 1000-case Real benchmark and 4 TPC benchmarks.
Denormalized (OLAP-like) Normalized (OLTP-like)
Benchmark Foodmart Northwind AdventureWork WorldWideImporters Foodmart Northwind AdventureWork WorldWideImporters
Method Category Method 𝑃𝑒 𝑅𝑒 𝐹𝑒 𝑃𝑒 𝑅𝑒 𝐹𝑒 𝑃𝑒 𝑅𝑒 𝐹𝑒 𝑃𝑒 𝑅𝑒 𝐹𝑒 𝑃𝑒 𝑅𝑒 𝐹𝑒 𝑃𝑒 𝑅𝑒 𝐹𝑒 𝑃𝑒 𝑅𝑒 𝐹𝑒 𝑃𝑒 𝑅𝑒 𝐹𝑒
Auto-BI-P 1 0.5 0.67 1 1 1 1 0.34 0.51 1 0.31 0.47 1 0.63 0.77 1 1 1 1 0.44 0.62 1 0.25 0.4
Auto-BI Auto-BI 0.75 1 0.86 1 1 1 0.98 0.93 0.97 1 0.83 0.91 0.8 1 0.89 1 1 1 0.97 0.82 0.89 0.97 0.86 0.91
Auto-BI-S 0.68 1 0.81 0.9 1 0.95 0.94 0.91 0.92 1 0.79 0.88 0.8 1 0.88 1 1 1 0.93 0.8 0.86 0.92 0.84 0.88
Commercial System-X 0.75 0.6 0.67 0.85 0.79 0.82 0.97 0.66 0.78 1 0.59 0.74 0.75 0.6 0.67 0.9 1 0.95 0.9 0.34 0.5 1 0.09 0.17
MC-FK 0.42 0.9 0.57 0.28 1 0.44 0.33 0.71 0.45 0.21 0.86 0.34 0.54 0.88 0.67 0.43 0.8 0.56 0.2 0.87 0.33 0.2 0.52 0.3
Fast-FK 0.38 0.6 0.46 0.53 0.64 0.65 0.33 0.34 0.34 0.27 0.47 0.35 0.55 0.75 0.63 0.55 0.6 0.57 0.43 0.69 0.53 0.45 0.14 0.2
Baselines HoPF 0.44 0.4 0.42 0.53 0.5 0.52 0.78 0.73 0.75 0.89 0.72 0.8 0.43 0.38 0.4 0.8 0.8 0.8 0.55 0.71 0.62 0.75 0.51 0.61
ML-FK 0.43 0.8 0.56 0.35 0.86 0.5 0.84 0.8 0.82 0.89 0.75 0.81 0.57 0.88 0.69 0.5 0.9 0.64 0.95 0.83 0.89 0.9 0.5 0.64
Language model GPT-3.5 1 0.6 0.75 1 0.71 0.83 1 0.32 0.48 0.77 0.69 0.73 1 0.75 0.86 1 1 1 0.87 0.54 0.67 0.88 0.61 0.72
Table 6: Quality comparison on FoodMart, NorthWind, AdventureWorks and WorldWideImporters. (𝑃𝑒, 𝑅𝑒, 𝐹𝑒) = (𝑃𝑒𝑑𝑔𝑒, 𝑅𝑒𝑑𝑔𝑒, 𝐹𝑒𝑑𝑔𝑒).
# of tables 4 5 6 7 8 9 10 [11,15] [16,20] 21+
Case-type statistics (ST,SN,C,O) (50,18,31,1) (49,12,37,2) (36,14,48,2) (19,23,49,9) (10,22,57,11) (7,25,50,18) (2,39,40,19) (7,14,60,19) (1,7,72,20) (9,5,62,24)
Auto-BI 0.97 (0.99,0.95) 0.97 (0.98,0.96) 0.96 (0.99,0.94) 0.95 (0.98,0.91) 0.95 (0.99,0.92) 0.96 (1.00,0.93) 0.94 (0.98,0.90) 0.90 (0.97,0.85) 0.84 (0.95,0.75) 0.79 (0.94,0.69)
Auto-BI Auto-BI-P 0.91 (1.00,0.84) 0.91 (0.99,0.85) 0.88 (1.00,0.79) 0.81 (0.98,0.69) 0.83 (0.98,0.71) 0.83 (1.00,0.70) 0.81 (0.99,0.69) 0.71 (0.98,0.56) 0.60 (0.96,0.44) 0.55 (0.95,0.39)
Auto-BI-S 0.95 (0.98,0.93) 0.95 (0.96,0.94) 0.94 (0.97,0.92) 0.94 (0.97,0.90) 0.95 (0.98,0.91) 0.93 (0.97,0.89) 0.92 (0.96,0.88) 0.85 (0.92,0.79) 0.78 (0.90,0.70) 0.74 (0.89,0.63)
Commercial System-X 0.76 (0.94,0.66) 0.67 (0.91,0.55) 0.76 (0.94,0.66) 0.76 (0.93,0.66) 0.75 (0.91,0.65) 0.78 (0.91,0.70) 0.77 (0.92,0.67) 0.74 (0.90,0.65) 0.65 (0.80,0.56) 0.66 (0.88,0.54)
MC-FK 0.69 (0.88,0.57) 0.65 (0.93,0.49) 0.63 (0.70,0.58) 0.65 (0.67,0.63) 0.62 (0.70,0.56) 0.56 (0.46,0.72) 0.54 (0.48,0.63) 0.54 (0.49,0.61) 0.52 (0.42,0.69) 0.42 (0.30,0.68)
Fast-FK 0.76 (0.79,0.72) 0.76 (0.79,0.74) 0.68 (0.69,0.66) 0.53 (0.53,0.52) 0.65 (0.67,0.63) 0.47 (0.46,0.47) 0.45 (0.48,0.42) 0.49 (0.53,0.46) 0.47 (0.49,0.46) 0.42 (0.40,0.44)
Baselines
HoPF 0.83 (0.86,0.81) 0.77 (0.77,0.76) 0.72 (0.73,0.71) 0.63 (0.58,0.70) 0.68 (0.67,0.70) 0.55 (0.49,0.64) 0.62 (0.55,0.70) 0.61 (0.58,0.64) 0.57 (0.55,0.60) 0.49 (0.44,0.56)
ML-FK 0.87(0.91,0.84) 0.86 (0.91,0.85) 0.8 (0.94,0.79) 0.83 (0.88,0.83) 0.84 (0.89,0.84) 0.8 (0.84,0.81) 0.8 (0.88,0.79) 0.72 (0.84,0.71) 0.64 (0.69,0.65) 0.56 (0.65,0.6)
Language model GPT-3.5 0.79 (0.86,0.76) 0.75 (0.83,0.7) 0.8 (0.83,0.78) 0.74 (0.78,0.74) 0.72 (0.77,0.69) 0.69 (0.76,0.66) 0.69 (0.76,0.66) 0.56 (0.62,0.54) 0.49 (0.56,0.46) 0.39 (0.46,0.36)
Table 7: Edge-level quality reported as “F-1 (precision, recall)”, by number of input tables in Real benchmark. In the first row, we report “case
type statistics”, denoted as (𝑆𝑇 , 𝑆𝑁 , 𝐶, 𝑂), which stand for the number of cases in (Star, Snowflake, Constellation, Others), respectively.
# of tables 4 5 6 7 8 9 10 [11,15] [16,20] 21+
Auto-BI 1.00 0.96 0.95 0.89 0.95 0.97 0.85 0.78 0.64 0.55
Auto-BI-P 1.00 0.98 0.99 0.94 0.96 0.99 0.95 0.89 0.83 0.67
Auto-BI-S 0.99 0.95 0.93 0.93 0.95 0.80 0.76 0.69 0.49 0.31
System-X 0.91 0.87 0.81 0.85 0.77 0.78 0.74 0.75 0.52 0.54
MC-FK 0.76 0.68 0.41 0.34 0.19 0.14 0.1 0.13 0.09 0.05
Fast-FK 0.68 0.65 0.39 0.14 0.32 0.08 0.09 0.12 0.09 0.03
HoPF 0.78 0.57 0.42 0.21 0.32 0.06 0.18 0.19 0.18 0.11
ML-FK 0.87 0.86 0.68 0.65 0.7 0.55 0.53 0.42 0.18 0.12
GPT-3.5 0.67 0.61 0.61 0.6 0.44 0.42 0.42 0.21 0.1 0.05
Table 8: Case-level precision, by number of tables.

- Real. We sample 1000 real BI models crawled in the wild (Section 3.2) as our first benchmark, henceforth referred to as Real. In order to account for different levels of difficulty in predicting BI models, we perform stratified sampling as follows – we bucketize models into 10 groups based on the number of input tables as {4, 5, 6, 7, 8, 9, 10, [11-15], [16-20], 21+}, and randomly select 100 cases in each group, to create this 1000-case benchmark that covers the entire spectrum in terms of levels of difficulty, from easy (with only a few tables) to hard (with over 20 tables). Table 3 summarizes the characteristics of the 1000-case Real benchmark. Note that these test cases are held out and never used when training our local classifier, which is consistent with the standard practice of machine learning to avoid data leakage.

In comparison, Table 2 shows the characteristics of all BI models harvested in the entire collection, which are substantially simpler (e.g., with a small number of tables). This is not entirely surprising, when these BI models are programmed by non-technical business users using GUI tools, which is why we perform the stratified sampling described above to create a more balanced and challenging test set.

- Synthetic. In addition to evaluation on real BI models, we perform tests on 4 TPC benchmarks, henceforth referred to as Synthetic. Specifically, TPC-H and TPC-DS are two popular benchmarks for BI/OLAP workloads. We evaluate predicted joins against the ground-truth relationships in TPC specifications. We further perform tests on TPC-C and TPC-E, two popular OLTP benchmarks. While Auto-BI is not designed for general OLTP databases, we perform the tests nevertheless, in order to understand Auto-BI's ability to detect foreign-keys in general (beyond the snowflake-like schemas that Auto-BI was initially designed for).

We further perform tests on 4 commonly-used synthetic databases: FoodMart [3], Northwind [4], AdventureWorks [2] and WorldWideImporters [6]. We use both their OLAP and OLTP versions to stress test our algorithms, for a total of 8 additional test databases.

Metrics. We compare the predicted joins of different methods against the ground truth (which in the case of Real, are human-specified relationships that we programmatically extract from BI models we crawled). We evaluate the prediction quality of algorithms both at the edge-level and the case-level, defined as below:

Edge-level quality. For each BI test case 𝐶, we evaluate the fraction of predicted join relationships (edges on the graph) that are identical to ground-truth relationships. We use standard precision/recall/F metrics for the edge-level evaluation, defined as precision 𝑃𝑒𝑑𝑔𝑒(𝐶) = num-of-correctly-predicted-edges / num-of-total-predicted-edges, and recall 𝑅𝑒𝑑𝑔𝑒(𝐶) = num-of-correctly-predicted-edges / num-of-total-ground-truth-edges. The F-1 score, denoted by 𝐹𝑒𝑑𝑔𝑒(𝐶), is then the harmonic mean of precision and recall, or 2𝑃𝑒𝑑𝑔𝑒(𝐶)𝑅𝑒𝑑𝑔𝑒(𝐶) / (𝑃𝑒𝑑𝑔𝑒(𝐶) + 𝑅𝑒𝑑𝑔𝑒(𝐶)).

We report precision/recall/F-1 for the entire Real benchmark, as the average across all 1000 test cases.

Case-level quality. Since it is difficult for non-technical users to identify incorrectly-predicted relationships, high precision is crucial for Auto-BI (Section 4.3.2). We report case-level precision for each test case 𝐶, denoted by 𝑃𝑐𝑎𝑠𝑒(𝐶), defined as:

𝑃𝑐𝑎𝑠𝑒(𝐶) = 1 if 𝑃𝑒𝑑𝑔𝑒(𝐶) = 1, and 0 otherwise    (20)

In other words, 𝑃𝑐𝑎𝑠𝑒(𝐶) is 1 only if no incorrect edge is predicted, and 0 even if a single false-positive edge is predicted (for it is unlikely that non-technical users can identify and correct the incorrect predictions, making us "fail" in such a situation). The case-level precision on the entire benchmark is also the average across all 1000 cases.⁶
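To make the two metrics concrete, the following is a small sketch of how they can be computed for one test case, assuming predicted and ground-truth joins are given as hashable (FK-column, PK-column) pairs. This is an illustration rather than our actual evaluation harness, and it ignores the semantic-equivalence handling described in footnote 6.

```python
# Sketch of the per-case metrics above; an illustration, not the actual
# evaluation harness (semantic-equivalence credit from footnote 6 is ignored).

def edge_metrics(predicted, ground_truth):
    predicted, ground_truth = set(predicted), set(ground_truth)
    correct = len(predicted & ground_truth)
    p = correct / len(predicted) if predicted else 0.0         # P_edge(C)
    r = correct / len(ground_truth) if ground_truth else 0.0   # R_edge(C)
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0             # F_edge(C)
    return p, r, f1

def case_precision(predicted, ground_truth):
    p, _, _ = edge_metrics(predicted, ground_truth)
    return 1.0 if p == 1.0 else 0.0                            # Equation (20)

# Benchmark-level numbers are simple averages of these per-case scores
# over all 1000 test cases.
```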
Latency. We report the latency of all methods using wall-clock time. All methods are implemented using Python 3.8 and tested on a Windows 11 machine with a 64-core Intel CPU and 128 GB memory.

5.2 Methods compared
We compare Auto-BI with the following methods in the literature. Our code and data are available for reproducibility [1].
MC-FK [58]. MC-FK pioneered a method to accurately detect multi-column FKs, using an EMD-based randomness metric that measures the distribution similarity of two columns.
Fast-FK [17]. Fast-FK makes FK predictions based on a scoring function that combines column-value and column-name similarity. This method employs fast pruning and focuses on ensuring interactive predictions.
HoPF [30]. HoPF is a recent method that detects FKs and PKs together, in order to ensure better prediction accuracy. It employs hand-tuned scoring functions as well as structural constraints (e.g., no cycles) to improve accuracy.
ML-FK [48]. ML-FK is an ML-based approach to predict FKs, which is similar in spirit to our local-classifiers, and we feed ML-FK with the same training data used by our local-classifiers.
System-X. System-X is a commercial system from a leading BI vendor that has a feature to detect joins. We anonymize the name of the system, in keeping with benchmarking traditions in the database literature [7, 23, 44, 49].
GPT-3.5. GPT is a family of language models [13] capable of performing many downstream tasks such as NL-to-SQL [45, 54] and schema-matching [59]. We use GPT as a baseline with few-shot learning [13], where we provide few-shot demonstrations of the join task in the prompt, and then ask the model to predict for new test cases. We use the latest publicly available version of GPT (GPT-3.5-turbo),⁷ and in the prompt we provide both the table-schema (column names) as well as sample data rows. Note that because synthetic benchmarks (e.g., TPC) are heavily represented on the web, it is likely that GPT has seen such cases in its pre-training (e.g., TPC queries on the web), causing data leakage and inflated results. We report these numbers nevertheless, for reference purposes.
Auto-BI. This is our proposed method using global 𝑘-MCA optimization. We evaluate three variants of our method: (1): Auto-BI-Precision (Auto-BI-P) is the precision-mode of Auto-BI (Section 4.3.2) that focuses on high precision (by finding snowflake-like "backbones" from the schema). (2): Auto-BI is our full approach with both precision and recall modes (Section 4.3.3). (3): Furthermore, we test a light-weight version of Auto-BI called Auto-BI-Schema-Only (Auto-BI-S), which uses only schema-level metadata (table/column names) for local-classifiers (without using column-values). This method is thus efficient, with sub-second latency in most cases, while still being surprisingly accurate.

5.3 Quality comparisons
Overall comparison. Table 5 compares the overall quality of all methods, on both the 1000-case Real benchmark and the 4 Synthetic TPC benchmarks. As we can see, Auto-BI-based methods consistently outperform alternatives, with Auto-BI-P always being the best in precision (very close to 1), while the full Auto-BI is the best overall in F-scores.
On the 1000-case Real benchmark, we note that the difference between Auto-BI and the best baselines is significant – the difference is 13 percentage points for edge-level precision (0.98 vs. 0.846), and over 10 percentage points for edge-level recall (0.879 vs. 0.77). The difference is even more pronounced at the case-level, where it is over 35 percentage points (0.92 vs. 0.557), showing the strong benefit on quality when using a principled global optimization in our 𝑘-MCA.
Even the light-weight Auto-BI-S (with the same 𝑘-MCA but using schema-only features) is surprisingly accurate (losing only 2-3 percentage points in precision/recall). As we will see, it is much more efficient, making it a strong contender for practical use when latency becomes important.
For the Synthetic TPC benchmarks, we observe similar trends consistent with the Real benchmark. It is interesting to note that, not only is Auto-BI the best on OLAP benchmarks (TPC-H/TPC-DS), it turns out to be the best on OLTP benchmarks too (TPC-C/TPC-E). This is somewhat surprising, because OLTP databases do not usually follow snowflake-like schemas, and are not the use cases that Auto-BI is originally designed for. We inspected TPC-C and TPC-E carefully to understand why Auto-BI is effective. It turns out that while these OLTP databases do not have snowflake designs, they nevertheless have clusters of tables (e.g., a cluster of tables on "Customers", and another cluster on "Market", etc., for TPC-E), where each cluster loosely follows a hub-and-spoke pattern with related tables joining through a few "central" tables (e.g., a cluster of tables relating to customers, such as "Customers-Account" and "Customers-Tax-Rate", all join through a central "Customers" table in TPC-E, as nicely depicted in Figure 2 of prior work on schema summarization [57]). Similar results are also observed in additional synthetic benchmarks, as shown in Table 6. Exploring the applicability of Auto-BI-like global optimization in FK-detection for general OLTP databases is an interesting area for future work.
Among all baselines, the commercial System-X produces high precision at the cost of a low recall. Among the four FK methods from the literature, ML-FK produces the best quality (since we feed it with the same training data harvested from real BI models), while the remaining three have comparable precision/recall.

⁶We note that it is possible that a predicted join graph may differ syntactically from the ground truth when the two are semantically equivalent. For example, the ground truth may have a fact-table 𝐹 join a dimension-table 𝐴 (N:1), which in turn 1:1 joins with another dimension-table 𝐵. If it is predicted that 𝐹 joins with 𝐵 (N:1), and then 𝐵 joins with 𝐴 (1:1), while the ground truth is F-(N:1)-A-(1:1)-B, then the two are in fact identical (except that the join order is different). We account for semantic equivalence in our evaluation and mark both as correct, which is fair for all methods tested.
⁷https://platform.openai.com/docs/models/gpt-3-5
Figure 5: Comparison of end-to-end latency. (a) 50/90/95-th percentiles of average runtime (seconds); (b) breakdown of average runtime by components (UCC, IND, Local-Inference, Global-Predict).

6
Figure 6: Latency distribution of k-MCA-CC on 1000-case Real. The 50/90/95-th p-tiles are 0.02/0.06/0.17 seconds, respectively.

Detailed breakdown. Table 7 and Table 8 show a more detailed comparison of edge-level and case-level quality, respectively, bucketized by the number of input tables. The overall trend is consistent with Table 5 – Auto-BI has the best F-1, while Auto-BI-P has the best precision across the board. We note that Auto-BI works reasonably well on the most challenging slice of the test cases (with 21+ tables), maintaining high precision (0.94) and reasonable recall. We report additional experiments and analysis (comparing with baselines enhanced using components from Auto-BI to better understand its component-wise benefits) in Appendix C.

5.4 Efficiency comparison
Overall comparison. We show the 50/90/95-th percentile latency of all methods in Figure 5(a), on the 1000-case Real benchmark. Overall, Auto-BI-S and Fast-FK are the fastest, taking 2-3 seconds even in the largest cases. Auto-BI is 2-3x slower but the latency is still acceptable. HoPF takes the most time in comparison.
We also report the latency breakdown of all methods in Figure 5(b). Overall, there are 4 main components for all methods: (1) UCC: generate unique column combinations; (2) IND: generate inclusion dependencies; (3) Local-Inference: generate features and perform local-classifier inference for joins in Auto-BI (while baselines use predefined scoring functions here); (4) Global-Predict: Auto-BI uses 𝑘-MCA, whereas baselines use various other algorithms. For Auto-BI, the most expensive part is Local-Inference (specifically, feature generation for inference). Auto-BI-S is effective in reducing the latency of this part (by using metadata-only features), thus achieving significant speedup. IND-generation is also expensive for all methods, but it is a standard step and the latency is comparable across all methods. In Global-Predict, we see that Auto-BI (using the 𝑘-MCA algorithm) is highly efficient, while baselines like MC-FK and HoPF are very expensive in comparison.
Efficiency of k-MCA-CC. Our key innovation in Auto-BI is the 𝑘-MCA-CC formulation and its efficient solutions. Despite the hardness of the problem, we solve it optimally (Algorithm 3), which is efficient as shown by the purple Global-Predict component in Figure 5(b). It is interesting to drill down and study the latency of this key step in detail.
Figure 6 shows the distribution of latency when we solve k-MCA-CC on the 1000-case Real. We can see that the latency for the vast majority of cases is sub-second. In fact, the 50-th, 90-th and 95-th percentile latencies are 0.02, 0.06, and 0.17 seconds, respectively. This confirms that our Algorithm 3 is not only optimal but also real-time and practical. There are a few cases where the latency is over 1 second, which are mostly cases with more than 40 input tables, which we argue is acceptable considering the size of the data. We also report that the largest case we encounter among all 10K+ BI models has 88 input tables, which k-MCA-CC still solves optimally in about 11 seconds (not shown in the figure as it is not sampled in the 1000-case Real). This confirms that our algorithm is practical in solving large real-world BI cases both optimally and efficiently.
Figure 7 shows the benefit of the two key optimization techniques we used in solving k-MCA-CC: (1) the artificial root in 𝑘-MCA (Algorithm 2) and (2) branch-and-bound in 𝑘-MCA-CC (Algorithm 3). We compare the number of 1-MCA calls with and without the optimizations. (Note that we use the number of 1-MCA calls instead of wall-clock time, because the algorithm can time out without our optimizations.) We can see that the two optimization techniques give 5 and 4 orders of magnitude improvement in efficiency, respectively (the y-axis is in log-scale). Compared to a brute-force approach that solves k-MCA without optimization, the combined benefit is a staggering 10 orders of magnitude, again showing the importance of our algorithms.

5.5 Ablation Studies
We perform an ablation study to analyze the benefits of various ideas developed in Auto-BI, using the 1000-case Real benchmark. Our results are summarized in Figure 8.
The effect of FK-once constraint. The bar marked as "no-FK-once-constraint" in Figure 8 shows the result quality when we remove the FK-once constraint (i.e., using k-MCA instead of k-MCA-CC). There is a noticeable drop in precision/recall, showing the importance of this constraint in k-MCA-CC.
The effect of precision-mode. The "no-precision-mode" bar shows the result quality if we omit the precision-mode step (Section 4.3.2) and use the recall-mode directly (Section 4.3.3). We see a substantial drop in precision (6 and 13 percentage points at edge-level and case-level, respectively), again showing the importance of k-MCA-CC.
The effect of N-1/1-1 separation in local classifier. The "no-N-1/1-1-seperation" bar shows the results when we use one local classifier, instead of splitting N-1 and 1-1 joins into two prediction tasks (Section 4.2). The drop in quality is again evident.
The effect of label transitivity in local classifier. The bar for "no-label-transitivity" shows the drop in quality when we do not apply label transitivity in local classification (Section 4.2). The effect is noticeable but diluted across 1000 cases, as the benefit is more pronounced in large cases and less obvious in smaller ones.
Figure 7: Effect of our efficiency optimization techniques, as measured by the number of 1-MCA invocations.
Figure 8: Ablation study on the effect of Auto-BI components on result quality (average of 1000-case Real).
Figure 9: Sensitivity study on Real benchmark. (a) Sensitivity to 𝑝 in 𝑘-MCA-CC; (b) Sensitivity to 𝜏 in EMS.
The effect of no data-value features in local classifier. The "no-data-features" bar shows the effect of removing features relating to data-value in our local classifier (we use meta-data features only instead). This directly corresponds to our lightweight Auto-BI-S (Section 5.3). The drop in quality is apparent but not too significant, which gives us a useful trade-off between quality and latency.
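As an illustration of the schema-only signals that Auto-BI-S relies on, the sketch below computes two of the name-based features described in Appendix B (name Jaccard similarity and containment) after splitting identifiers on camel-case and common delimiters. The tokenizer details here are assumptions rather than the exact feature pipeline.

```python
# Illustration (not the actual Auto-BI feature pipeline) of two schema-only
# features: tokenize column names on camel-case and delimiters, then compute
# name Jaccard similarity and containment.
import re

def tokenize(name):
    tokens = []
    for part in re.split(r"[_\-\s]+", name):                  # split delimiters
        tokens += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return {t.lower() for t in tokens if t}

def name_jaccard(col_a, col_b):
    a, b = tokenize(col_a), tokenize(col_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def name_containment(col_a, col_b):
    a, b = tokenize(col_a), tokenize(col_b)
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0

# e.g., name_jaccard("sales_emp_id", "EmpID") is 2/3, since both names share
# the tokens {"emp", "id"} out of three distinct tokens overall.
```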
The effect of using local classifier only. Lastly, we perform an im-
portant ablation study, to understand the benefit of our graph-based
k-MCA algorithm on top of local classifier results. The “LC-only”
bar shows the performance if we employ the same local classifier at
the edge-level (keeping edges that have calibrated probability over
0.5), without using a holistic k-MCA optimization. The difference in
performance is significant (25 percentage points in case-precision),
underscoring the importance of our graph-based k-MCA.

5.6 Sensitivity Analysis


Figure 9 shows a sensitivity analysis of the two parameters in
Auto-BI, the penalty-term 𝑝 in 𝑘-MCA (Section 4.3.2), and the
edge-weight threshold 𝜏 in EMS (Section 4.3.3). Recall that because
we use calibrated true probabilities, 0.5 is the natural choice in both
cases. This is confirmed in our study. Figure 9(a) shows that when
varying 𝑝 from 0 to 1, the region around 0.5 indeed gives the best
result. Figure 9(b) shows the sensitivity to 𝜏, which is used to prune
unpromising edges, where a lower 𝜏 naturally leads to better recall
at the cost of precision. We see that 𝜏 = 0.5 strikes a reasonable
balance between precision/recall and produces high F-1, which is
also our default setting in Auto-BI.
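The calibration procedure itself is not restated in this section; as one standard recipe (an assumption about the setup, not necessarily what Auto-BI uses), calibrated join probabilities can be obtained by wrapping the local classifier with Platt scaling, after which 0.5 becomes a meaningful default for both 𝑝 and 𝜏:

```python
# One standard calibration recipe (an assumption; the exact Auto-BI setup is
# not restated here): wrap the local join classifier with Platt scaling so
# that predict_proba outputs well-calibrated P(e_ij).
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier

def train_calibrated_join_classifier(X_train, y_train):
    base = GradientBoostingClassifier()            # placeholder base model
    calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=5)
    calibrated.fit(X_train, y_train)               # y: 1 = joinable, 0 = not
    return calibrated

# calibrated.predict_proba(X)[:, 1] then gives P(e_ij), used as edge weights
# in k-MCA-CC and thresholded at tau = 0.5 in the EMS recall mode.
```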

6 CONCLUSIONS AND FUTURE WORK


In this work we develop an Auto-BI system that can accurately pre-
dict join relationships in BI models, leveraging a novel graph-based
formulation called k-MCA that exploits snowflake-like structures
common in BI models.
Future directions include improving the end-to-end efficiency of
Auto-BI, and extending the system to synthesize transformation
steps, in order to automate BI model building end-to-end.
REFERENCES [33] Sebastian Kruse, Thorsten Papenbrock, Christian Dullweber, Moritz Finke,
[1] [n.d.]. Benchmark data will be released after an internal review:. https://github. Manuel Hegner, Martin Zabel, Christian Zollner, and Felix Naumann. 2017.
com/yiminl18/Auto-BI. Fast approximate discovery of inclusion dependencies. Datenbanksysteme für
[2] 2022. AdventureWorks Database. https://github.com/Microsoft/sql-server- Business, Technologie und Web (BTW 2017) (2017).
samples/releases/tag/adventureworks. [34] Sebastian Kruse, Thorsten Papenbrock, and Felix Naumann. 2015. Scaling out the
[3] 2022. FoodMart Database. https://begincodingnow.com/microsofts-foodmart- discovery of inclusion dependencies. Datenbanksysteme für Business, Technologie
database/. und Web (BTW 2015) (2015).
[4] 2022. NorthWind Database. https://github.com/Microsoft/sql-server-samples/ [35] Ailsa H Land and Alison G Doig. 2010. An automatic method for solving discrete
tree/master/samples/databases/northwind-pubs. programming problems. In 50 Years of Integer Programming 1958-2008. Springer,
[5] 2022. SentenceTransformers. https://www.sbert.net/. 105–132.
[6] 2022. WorldWide-Importers Database. https://github.com/Microsoft/sql-server- [36] Oliver Lehmberg, Dominique Ritze, Petar Ristoski, Robert Meusel, Heiko Paul-
samples/releases/tag/wide-world-importers-v1.0. heim, and Christian Bizer. 2015. The mannheim search join engine. Journal of
[7] Retrieved in 2023-01. DeWitt Clause (retrieved 2022-09). https://en.wikipedia. Web Semantics 35 (2015), 159–166.
org/wiki/David_DeWitt#DeWitt_Clause. [37] Peng Li, Xiang Cheng, Xu Chu, Yeye He, and Surajit Chaudhuri. 2021. Auto-
[8] Retrieved in 2023-01. Power BI. https://powerbi.microsoft.com/en-us/. FuzzyJoin: Auto-Program Fuzzy Similarity Joins Without Labeled Examples. In
[9] Retrieved in 2023-01. Power BI: Create and manage relationships. Proceedings of the 2021 International Conference on Management of Data. 1064–
https://learn.microsoft.com/en-us/power-bi/transform-model/desktop- 1076.
create-and-manage-relationships. [38] Jock Mackinlay, Pat Hanrahan, and Chris Stolte. 2007. Show me: Automatic
[10] Retrieved in 2023-01. Tableau. https://www.tableau.com/. presentation for visual analysis. IEEE transactions on visualization and computer
[11] Retrieved in 2023-01. Tableau: Relate your data by drag-and-drop. https://help. graphics 13, 6 (2007), 1137–1144.
tableau.com/current/pro/desktop/en-us/relate_tables.htm. [39] Fabien De Marchi, Stéphane Lopes, and Jean-Marc Petit. 2009. Unary and n-ary
[12] Jana Bauckmann, Ulf Leser, Felix Naumann, and Véronique Tietz. 2007. Efficiently inclusion dependency discovery in relational databases. Journal of Intelligent
detecting inclusion dependencies. In 2007 IEEE 23rd International Conference on Information Systems 32, 1 (2009), 53–73.
Data Engineering. IEEE, 1448–1450. [40] Solomon Negash and Paul Gray. 2008. Business intelligence. In Handbook on
[13] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, decision support systems 2. Springer, 175–193.
Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda [41] Alexandru Niculescu-Mizil and Rich Caruana. 2005. Predicting good probabilities
Askell, et al. 2020. Language models are few-shot learners. Advances in neural with supervised learning. In Proceedings of the 22nd international conference on
information processing systems 33 (2020), 1877–1901. Machine learning. 625–632.
[14] Marco A Casanova, Ronald Fagin, and Christos H Papadimitriou. 1982. Inclusion [42] Arash Dargahi Nobari and Davood Rafiei. 2022. Efficiently transforming tables
dependencies and their interaction with functional dependencies. In Proceedings for joinability. In 2022 IEEE 38th International Conference on Data Engineering
of the 1st ACM SIGACT-SIGMOD symposium on Principles of database systems. (ICDE). IEEE, 1649–1661.
171–176. [43] Thorsten Papenbrock, Sebastian Kruse, Jorge-Arnulfo Quiané-Ruiz, and Felix
[15] Surajit Chaudhuri, Umeshwar Dayal, and Vivek Narasayya. 2011. An overview Naumann. 2015. Divide & conquer-based inclusion dependency discovery. Pro-
of business intelligence technology. Commun. ACM 54, 8 (2011), 88–98. ceedings of the VLDB Endowment 8, 7 (2015), 774–785.
[16] Surajit Chaudhuri, Venkatesh Ganti, and Raghav Kaushik. 2006. A primitive [44] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J Abadi, David J DeWitt,
operator for similarity joins in data cleaning. In 22nd International Conference on Samuel Madden, and Michael Stonebraker. 2009. A comparison of approaches to
Data Engineering (ICDE’06). IEEE, 5–5. large-scale data analysis. In Proceedings of the 2009 ACM SIGMOD International
[17] Zhimin Chen, Vivek Narasayya, and Surajit Chaudhuri. 2014. Fast foreign-key Conference on Management of data. 165–178.
detection in Microsoft SQL server PowerPivot for Excel. Proceedings of the VLDB [45] Nitarshan Rajkumar, Raymond Li, and Dzmitry Bahdanau. 2022. Evaluating the
Endowment 7, 13 (2014), 1417–1428. text-to-sql capabilities of large language models. arXiv preprint arXiv:2204.00498
[18] Yoeng-Jin Chu. 1965. On the shortest arborescence of a directed graph. Scientia (2022).
Sinica 14 (1965), 1396–1400. [46] Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings
[19] Fabien De Marchi, Stéphane Lopes, and Jean-Marc Petit. 2002. Efficient al- using siamese bert-networks. arXiv preprint arXiv:1908.10084 (2019).
gorithms for mining inclusion dependencies. In International Conference on [47] Kenneth H Rosen. 2008. ITS APPLICATIONS. (2008).
Extending Database Technology. Springer, 464–476. [48] Alexandra Rostin, Oliver Albrecht, Jana Bauckmann, Felix Naumann, and Ulf
[20] Jack Edmonds et al. 1967. Optimum branchings. Journal of Research of the Leser. 2009. A machine learning approach to foreign key discovery.. In WebDB.
national Bureau of Standards B 71, 4 (1967), 233–240. [49] Albrecht Schmidt, Florian Waas, Martin Kersten, Michael J Carey, Ioana
[21] Bruno Escoffier and Vangelis Th Paschos. 2006. Completeness in approximation Manolescu, and Ralph Busse. 2002. XMark: A benchmark for XML data manage-
classes beyond apx. Theoretical computer science 359, 1-3 (2006), 369–377. ment. In VLDB’02: Proceedings of the 28th International Conference on Very Large
[22] Jean-Claude Fournier. 2013. Graphs theory and applications: with exercises and Databases. Elsevier, 974–985.
problems. John Wiley & Sons. [50] Nuhad Shaabani and Christoph Meinel. 2015. Scalable inclusion dependency dis-
[23] Florian Funke, Alfons Kemper, and Thomas Neumann. 2011. Benchmarking covery. In International Conference on Database Systems for Advanced Applications.
hybrid oltp&olap database systems. Datenbanksysteme für Business, Technologie Springer, 425–440.
und Web (BTW) (2011). [51] Nuhad Shaabani and Christoph Meinel. 2018. Improving the efficiency of in-
[24] Gartner. [n.d.]. The Gartner 2022 Analytics BI Platforms Magic Quadrant clusion dependency detection. In Proceedings of the 27th ACM International
Highlights. ([n. d.]). Conference on Information and Knowledge Management. 207–216.
[25] Gartner. [n.d.]. How to Enable Self-Service Analytics. ([n. d.]). [52] Yasin N Silva, Walid G Aref, and Mohamed H Ali. 2010. The similarity join
[26] Venkatesan Guruswami, Johan Håstad, Rajsekar Manokaran, Prasad Raghaven- database operator. In 2010 IEEE 26th International Conference on Data Engineering
dra, and Moses Charikar. 2011. Beating the random ordering is hard: Every (ICDE 2010). IEEE, 892–903.
ordering CSP is approximation resistant. SIAM J. Comput. 40, 3 (2011), 878–914. [53] Fabian Tschirschnitz, Thorsten Papenbrock, and Felix Naumann. 2017. Detecting
[27] Pat Hanrahan. 2006. Vizql: a language for query, analysis and visualization. In inclusion dependencies on very many tables. ACM Transactions on Database
Proceedings of the 2006 ACM SIGMOD international conference on Management of Systems (TODS) 42, 3 (2017), 1–29.
data. 721–721. [54] Lihan Wang, Bowen Qin, Binyuan Hui, Bowen Li, Min Yang, Bailin Wang, Binhua
[28] Juris Hartmanis. 1982. Computers and intractability: a guide to the theory of Li, Jian Sun, Fei Huang, Luo Si, et al. 2022. Proton: Probing schema linking infor-
np-completeness (michael r. garey and david s. johnson). Siam Review 24, 1 mation from pre-trained language models for text-to-sql parsing. In Proceedings
(1982), 90. of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
[29] Yeye He, Kris Ganjam, and Xu Chu. 2015. Sema-join: joining semantically-related 1889–1898.
tables using big table corpora. Proceedings of the VLDB Endowment 8, 12 (2015), [55] Robert H Warren and Frank Wm Tompa. 2006. Multi-column substring matching
1358–1369. for database schema translation. In Proceedings of the 32nd international conference
[30] Lan Jiang and Felix Naumann. 2020. Holistic primary key and foreign key on Very large data bases. 331–342.
detection. Journal of Intelligent Information Systems 54, 3 (2020), 439–461. [56] Cong Yan and Yeye He. 2020. Auto-suggest: Learning-to-recommend data prepa-
[31] David R Karger, Philip N Klein, and Robert E Tarjan. 1995. A randomized linear- ration steps using data science notebooks. In Proceedings of the 2020 ACM SIG-
time algorithm to find minimum spanning trees. Journal of the ACM (JACM) 42, MOD International Conference on Management of Data. 1539–1554.
2 (1995), 321–328. [57] Xiaoyan Yang, Cecilia M Procopiuc, and Divesh Srivastava. 2009. Summarizing
[32] Ralph Kimball and Margy Ross. 2011. The data warehouse toolkit: the complete relational databases. Proceedings of the VLDB Endowment 2, 1 (2009), 634–645.
guide to dimensional modeling. John Wiley & Sons. [58] Meihui Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, Cecilia M Procopiuc,
and Divesh Srivastava. 2010. On multi-column foreign key discovery. Proceedings
of the VLDB Endowment 3, 1-2 (2010), 805–814.
[59] Yunjia Zhang, Avrilia Floratou, Joyce Cahoon, Subru Krishnan, Andreas Mueller,
Dalitso Banda, Fotis Psallidas, and Jignesh Patel. 2023. Schema Matching using
Pre-Trained Language Models. In ICDE 2023.
[60] Erkang Zhu, Yeye He, and Surajit Chaudhuri. 2017. Auto-join: Joining tables by
leveraging transformations. Proceedings of the VLDB Endowment 10, 10 (2017),
1034–1045.
A LOCAL JOIN: TWO OPTIMIZATIONS N:1 classifier. We design the following features for the N:1
Separate N-1 and 1-1 classifiers. Unlike key/foreign-key that are classifier. We list the name of each feature, as well as its description.
typically N-1 joins, joins in real BI models can also be 1-1 joins, espe- Metadata-features. These are features relating to column-names
cially between related dimension tables describing the same logical and table-names. We pre-process all column/table names, by first
entity (e.g., “employee-salaries” and “employee-details”). standardizing all names into tokens, based on camel-casing and
Initially, we use one generic classifier to predict joinability be- delimiters (e.g., dash or underscores).
tween (𝐶𝑖 , 𝐶 𝑗 ) regardless of 1:1 or N:1. While performing an error • Jaccard_similarity, Jaccard_containment, Edit_distance,
analysis on incorrect predictions, we realized that while both 1:1 Jaro_winkler: These are standard string similarity between names.
or N:1 are joins, they represent conceptually different relation- We compute these similarities between two column names (𝐶𝑖 , 𝐶 𝑗 ).
ships with disparate characteristics. For example, 1:1 joins tend In addition, assuming 𝐶𝑖 is the “1” side, we also compute the
to be between the primary-keys of two dimension tables on the similarities between (𝑇𝑖 + 𝐶𝑖 , 𝐶 𝑗 ), where 𝑇𝑖 is the name of the
same logical entity (e.g. “employees”), often with the same number table from the “1” side, where the observation is that while
of rows from both tables, producing perfect 1-to-1 matches (N:1 column-names from fact-tables are often fully descriptive (e.g.,
joins are, on the other hand, very different). Furthermore, because “Employee-ID”), sometimes column-names from dimension ta-
1:1 joins tend to be between tables about the same logical entity bles are simplified (e.g., “ID” and ‘Name”), with the central entity
with complementary information, the column-headers for both (e.g., “Employees”) only mentioned in table-names, thus making
tables that do not join will also have high overlap (e.g., with the it necessary to piece together table-names and column-names to
prefix ”employees_” in the example of “employee-salaries” and recover complete metadata. We use 𝑚𝑎𝑥 (𝑠𝑖𝑚(𝐶𝑖 , 𝐶 𝑗 ), 𝑠𝑖𝑚(𝑇𝑖 +
“employee-details”). 𝐶𝑖 , 𝐶 𝑗 )) as features for metadata similarity (where 𝑠𝑖𝑚 can be
We thus treat N:1 and 1:1 as two separate classification tasks. different forms of similarity functions mentioned above, for a
This separation of is unique in Auto-BI and we will show the effect total of 4 features).
of this in our ablation studies. • Embedding_similarity: In addition to string similarity, we also use
Enhance labels with transitivity. Given that we deal with embedding similarity, specifically SentenceBERT [5, 46] trained
entire BI models that have multiple tables, using the pair-wise on top of all-mpnet-base-v2), to compute the column name simi-
join/not-join labels is no longer sufficient, because joinability can larity, using a setup similar to above.
be “transitive”. • Token_count, Char_count: the number of tokens and characters in
Consider, for instance, a table “Fact_Sales” has a FK column column names, for both 𝐶𝑖 , 𝐶 𝑗 . These serve as auxiliary features
called “sales_emp_id”, that N:1 joins with table “Dim_Employee” on top of similarity scores above (e.g., longer column-names
on its PK column “emp_id”, which in turn 1:1 joins with table with high similarity may be a more reliable indicator of match,
“Dim_Employee_Salaries” also on its PK “emp_id”. In such cases, compared to shorter ones with the same similarity scores).
even though “Fact_Sales.sales_emp_id” does not directly join • Col_frequency: the frequency of the column name 𝐶𝑖 (and 𝐶 𝑗 ) in
with “Dim_Employee_Salaries.emp_id”, and the pair would ordi- all BI-models we collect in the training set. Intuitively, matches
narily be considered a negative example (should not join), the two of common names (e.g., Code and Index) are less reliable, so this
actually refer to the same semantic concept (employee_ids) and are feature works similarly to Inverse-Document-Frequency (IDF)
logically joinable. Formally, given the joinability label 𝐿𝑖 𝑗 = 1 for in TF-IDF.
(𝐶𝑖 , 𝐶 𝑗 ) and 𝐿 𝑗𝑘 = 1 for (𝐶 𝑗 , 𝐶𝑘 ), we mark 𝐿𝑖𝑘 = 1 for (𝐶𝑖 , 𝐶𝑘 ) in • Col_position: This feature keeps the positional index of 𝐶𝑖 and
training, even if there is no explicit join between (𝐶𝑖 , 𝐶𝑘 ). Applying 𝐶 𝑗 , counting from left (e.g., the 2nd column from left). This is
label transitivity overcomes incomplete ground-truth and makes it based on the observation that columns on the left are more likely
easier for ML models to fit the ground-truth, which leads to better to join.
quality end-to-end. • Col_relative_position: This is the same feature as Col_position
We note that the transitivity-based optimization is unique when above, except that it is measured in relative terms, defined as:
Col_position
we deal with general types of joins on a large graph, which is not Num_of_total_columns .
needed when only pair-wise joins are considered, or when only • Unique_col_position: Similar to Col_position, except here we count
PK-FK (N:1) joins are used as they are typically not transitive. unique columns, where the observation is that the first few
unique column from the left of a table is more likely to be PK for
join.
Data-features. These features relate to the actual data content
B LOCAL JOIN CLASSIFIER: DETAILS
in columns, such as value overlap. Data-features provide comple-
We give details on the Local Classifier (LC) in Section 4.2, which mentary signals to metadata-features above, but are generally more
takes two columns (𝐶𝑖 , 𝐶 𝑗 ) and predicts the joinability label 𝐿𝑖 𝑗 . expensive to process.
As discussed in Section 4.2, we separate the prediction problem • Left_containment, Right_containment, Max_containment: value
for N:1 and 1:1 joins, since the two types of joins are different at a containment is an important signal for PK/FK joins [17, 30, 58].
conceptual level. For each classifier, we featurize each column pair we compute containment in both directions (left and right), and
(𝐶𝑖 , 𝐶 𝑗 ), which can be categorized as metadata-features (column- also the max of the two, for a total of three features.
names, table-names) as well as actual data-features (e.g., data value • Value_distinct_ratio: This feature calculates the distinctness of
overlap). In the following, we will describe the features of both the columns, or the fraction of values in 𝐶𝑖 and 𝐶 𝑗 that are distinct.
N:1 and 1:1 classifiers.
Methods Average 50%tile 90%tile 95%tile
• Range_overlap: computes the overlap of the min/max value ranges, Auto-BI-P 2.21 0.61 4.91 10.36
if both 𝐶𝑖 and 𝐶 𝑗 are of numeric types. Auto-BI Auto-BI 2.23 0.61 4.92 10.38
Auto-BI-S 0.45 0.09 2.02 2.65
• EMD_score: This is a feature based on distributional-similarity MC-FK 1.21 0.33 2.79 3.59
Original
and proposed in [58], we use this to complement the overlap- Baselines
Fast-FK 0.56 0.08 1.84 2.93
HoPF 4.31 0.25 6.74 18.72
based features above. MC-FK+LC 2.13 0.60 4.91 10.34
Enhanced
• Value_length: we cast all values in 𝐶𝑖 and 𝐶 𝑗 to string type, and Fast-FK+LC 2.15 0.61 4.92 10.36
Baselines HoPF+LC 6.41 1.13 11.88 27.23
compute the average value length (longer values tend to produce
Table 9: Comparison of end-to-end latency. Enhanced base-
more reliable matches).
lines (+LC) pay the cost of classifiers, which have latency
• Value_type: We one-hot encode column value types, such as
comparable to Auto-BI methods.
integer, float, string.
• Row_cnt: We use the number of rows in both tables as an auxiliary At a high level, the advantage of Auto-BI main comes from
feature, where the observation is that fact table tends to have two sources: (1) the local-classifier step (Section 4.2), that uses
more rows compared to dimension tables. data-driven ML scoring together with principled probability cali-
• Row_ratio, Col_ratio, Cell_ratio: These use similar intuition as bration, and (2) the graph-based k-MCA step for global optimization
Row_cnt above, except that we featurize the ratio of the number (Section 4). In order to understand the contributions of these two
of rows/cols/cells explicitly between two tables. sources, we “enhance” existing baselines, by injecting our local-
Feature importance. Our results suggest that, for the N:1 classi- classifier scores from step (1) into their algorithm as follows.
Enhanced baselines: MC-FK+LC, Fast-FK+LC, HoPF+LC.
fier, the following features are the most important (in that order,
Since MC-FK, Fast-FK and HoPF are not competitive on the Real
based on sklearn output): Max_containment, Jaccard_similarity,
benchmark, we inject our Local-Classifier (LC) scores for join-
Col_relative_position, Edit_distance, Jaro_winkler, Range_overlap,
likelihood into these baselines, replacing their heuristic scores with
EMD_score, Embedding_similarity, Col_frequency.
our calibrated classifier scores. This leads to stronger baselines,
1:1 classifier. Our 1:1 classifier share many of the same features
and allows us to see the benefit attributable to better local clas-
as the N:1 classifier (except the Row_ratio, Col_ratio, Cell_ratio
sifer scores, vs. the global k-MCA algorithm. To differentiate from
features, which are applicable to N:1 fact-dimension joins, but would
the original baselines, we use a "+LC" suffix for these enhanced
be as useful for 1:1 joins). We omit these identical features, and only
baselines, which become MC-FK+LC, Fast-FK+LC, and HoPF+LC,
describe features that are unique to the 1:1 classifier below.
respectively.
Metadata-features. These are features based only on column-
Quality Comparisons. Table 10 shows the additional compari-
names and table-names, like in the N:1 classifier.
son with MC-FK+LC, Fast-FK+LC, HoPF+LC, on top of the results
• Table_embedding: We measure the SentenceBERT embedding
reported in Table 5. As we can see, on the Real benchmark, the
similarity of table names where 𝑇𝑖 and 𝑇 𝑗 , based on the intuition
enhanced baselines improve substantially over the original base-
that two tables with 1-1 join should likely refer to the same
lines. However, all baselines still lag behind Auto-BI, especially
entity (e.g., Employees and Employee-Details, or Country and
in terms of precision: the edge-level error rate of Auto-BI is 5x
Country-Code, etc.), making high table-name similarity a useful
smaller than the best baseline (2% vs. 10%), while the case-level
signal.
error rate of Auto-BI is 4x smaller (8% vs. 34%). Because all the
• Header_jaccard: measures the jaccard similarity between all column-
enhanced baselines use the same LC classifiers as Auto-BI, the
names of 𝑇𝑖 and 𝑇 𝑗 . This is based on our observation that two
precision benefit can be attributable to the global optimization in
overlapping fact tables with highly similar column names do not
the k-MCA step, underscoring the importance of our graph-based
1:1 join in BI models, since such joins produce mostly redundant
formulation. On the Synthetic TPC benchmarks, we have similar
information. Higher Header_jaccard is thus inversely correlated
observations consistent with the Real benchmark. With the help
to joinability.
of the better scores from the local classifier, the enhanced baselines
Data-features. These are features relating to column-values.
improve over the original baselines on both OLAP benchmarks
• Min_containment: Instead of using Max_containment between
(TPC-H and TPC-DS) and OLTP benchmarks. (TPC-C and TPC-E)
the Left_containment and Right_containment as in the N:1 clas-
But overall, Auto-BI still gains the best performance.
sifer above, we use Min_containment for as the feature 1:1, based
Table 11 and Table 12 report more detailed numbers, bucketized
on the observation that 1:1 joins tend to join tuple-for-tuple
based on the number of input tables. Similar trends are observed
between two tables, unlikely PK/FK joins.
here, as we see Auto-BI is still the best method overall in terms
Feature importance. Our results suggest that for the 1:1 classifier,
of quality, across the spectrum of large and small test cases. The
the following feature are the most important (in that order, based on
advantage is the most significant in precision, especially on larger
sklearn output): Min_containment, Col_position, Jaccard_similarity,
test cases, which tend to be more difficult to predict correctly.
Col_relative_position, EMD_score, Header_jaccard, Col_frequency
Latency comparisons. As we can see from Table 9, introducing
and Embedding_similarity.
the LC classifier step into baselines increases their latency, though
not substantially. The enhanced baselines have latency comparable
C ADDITIONAL EXPERIMENTS to Auto-BI methods, with the exception of HoPF+LC, which is
We present additional experimental results in this section, mainly more expensive in terms of latency.
focusing on understanding the reason behind the quality advantage
of Auto-BI over baselines.
Real (OLAP) Synthetic (OLAP) Synthetic (OLTP)
Benchmark 1000-case Real benchmark TPC-H TPC-DS TPC-C TPC-E
Metric 𝑃𝑒𝑑𝑔𝑒 𝑅𝑒𝑑𝑔𝑒 𝐹𝑒𝑑𝑔𝑒 𝑃𝑐𝑎𝑠𝑒 𝑃𝑒𝑑𝑔𝑒 𝑅𝑒𝑑𝑔𝑒 𝐹𝑒𝑑𝑔𝑒 𝑃𝑒𝑑𝑔𝑒 𝑅𝑒𝑑𝑔𝑒 𝐹𝑒𝑑𝑔𝑒 𝑃𝑒𝑑𝑔𝑒 𝑅𝑒𝑑𝑔𝑒 𝐹𝑒𝑑𝑔𝑒 𝑃𝑒𝑑𝑔𝑒 𝑅𝑒𝑑𝑔𝑒 𝐹𝑒𝑑𝑔𝑒
Auto-BI-P 0.98 0.664 0.752 0.92 1 0.88 0.93 0.99 0.28 0.43 1 0.6 0.75 1 0.4 0.58
Auto-BI Auto-BI 0.973 0.879 0.907 0.853 1 1 1 0.96 0.91 0.93 1 0.8 0.89 0.96 0.93 0.95
Auto-BI-S 0.951 0.848 0.861 0.779 1 1 1 0.92 0.89 0.91 1 0.7 0.82 0.93 0.94 0.94
Commercial System-X 0.916 0.584 0.66 0.754 0 0 0 0 0 0 0 0 0 0 0 0
MC-FK 0.604 0.616 0.503 0.289 1 1 1 0.73 0.65 0.68 0.46 0.8 0.63 0.57 0.79 0.48
Original
Fast-FK 0.647 0.585 0.594 0.259 0.71 0.88 0.79 0.62 0.35 0.44 0.62 0.57 0.6 0.73 0.84 0.78
Baselines
HoPF 0.684 0.714 0.67 0.301 0.86 0.75 0.8 0.87 0.51 0.65 0.75 0.7 0.72 0.71 0.91 0.81
MC-FK+LC 0.903 0.872 0.887 0.636 1 1 1 0.89 0.87 0.88 0.56 1 0.57 0.92 0.83 0.88
Enhanced Fast-FK+LC 0.898 0.879 0.883 0.631 1 0.88 0.93 0.94 0.5 0.6 1 0.7 0.82 0.94 0.87 0.91
Baselines HoPF+LC 0.738 0.765 0.726 0.524 1 0.88 0.93 0.93 0.53 0.68 1 0.7 0.82 0.91 0.88 0.9
LC 0.885 0.864 0.87 0.631 1 0.88 0.93 0.85 0.91 0.88 1 0.6 0.75 1 0.6 0.75
Table 10: Quality comparison on the 1000-case Real benchmark and 4 TPC benchmarks.

# of tables 4 5 6 7 8 9 10 [11,15] [16,20] 21+


Auto-BI 0.97 (0.99,0.95) 0.97 (0.98,0.96) 0.96 (0.99,0.94) 0.95 (0.98,0.91) 0.95 (0.99,0.92) 0.96 (1.00,0.93) 0.94 (0.98,0.90) 0.90 (0.97,0.85) 0.84 (0.95,0.75) 0.79 (0.94,0.69)
Auto-BI Auto-BI-P 0.91 (1.00,0.84) 0.91 (0.99,0.85) 0.88 (1.00,0.79) 0.81 (0.98,0.69) 0.83 (0.98,0.71) 0.83 (1.00,0.70) 0.81 (0.99,0.69) 0.71 (0.98,0.56) 0.60 (0.96,0.44) 0.55 (0.95,0.39)
Auto-BI-S 0.95 (0.98,0.93) 0.95 (0.96,0.94) 0.94 (0.97,0.92) 0.94 (0.97,0.90) 0.95 (0.98,0.91) 0.93 (0.97,0.89) 0.92 (0.96,0.88) 0.85 (0.92,0.79) 0.78 (0.90,0.70) 0.74 (0.89,0.63)
Commercial System-X 0.76 (0.94,0.66) 0.67 (0.91,0.55) 0.76 (0.94,0.66) 0.76 (0.93,0.66) 0.75 (0.91,0.65) 0.78 (0.91,0.70) 0.77 (0.92,0.67) 0.74 (0.90,0.65) 0.65 (0.80,0.56) 0.66 (0.88,0.54)
MC-FK 0.69 (0.88,0.57) 0.65 (0.93,0.49) 0.63 (0.70,0.58) 0.65 (0.67,0.63) 0.62 (0.70,0.56) 0.56 (0.46,0.72) 0.54 (0.48,0.63) 0.54 (0.49,0.61) 0.52 (0.42,0.69) 0.42 (0.30,0.68)
Baselines Fast-FK 0.76 (0.79,0.72) 0.76 (0.79,0.74) 0.68 (0.69,0.66) 0.53 (0.53,0.52) 0.65 (0.67,0.63) 0.47 (0.46,0.47) 0.45 (0.48,0.42) 0.49 (0.53,0.46) 0.47 (0.49,0.46) 0.42 (0.40,0.44)
HoPF 0.83 (0.86,0.81) 0.77 (0.77,0.76) 0.72 (0.73,0.71) 0.63 (0.58,0.70) 0.68 (0.67,0.70) 0.55 (0.49,0.64) 0.62 (0.55,0.70) 0.61 (0.58,0.64) 0.57 (0.55,0.60) 0.49 (0.44,0.56)
MC-FK+LC 0.86 (0.90,0.83) 0.85 (0.91,0.80) 0.87 (0.87,0.86) 0.89 (0.87,0.90) 0.88 (0.85,0.91) 0.90 (0.87,0.92) 0.87 (0.83,0.91) 0.84 (0.81,0.87) 0.78 (0.79,0.77) 0.74 (0.73,0.74)
Enhanced Fast-FK+LC 0.90 (0.89,0.90) 0.91 (0.91,0.92) 0.87 (0.87,0.88) 0.85 (0.85,0.85) 0.87 (0.88,0.86) 0.86 (0.87,0.84) 0.85 (0.85,0.85) 0.83 (0.82,0.84) 0.76 (0.73,0.78) 0.71 (0.68,0.75)
Baselines HoPF+LC 0.85 (0.87,0.83) 0.81 (0.82,0.81) 0.77 (0.78,0.75) 0.70 (0.65,0.75) 0.75 (0.74,0.76) 0.62 (0.57,0.67) 0.64 (0.58,0.72) 0.70 (0.67,0.72) 0.64 (0.60,0.69) 0.53 (0.46,0.62)
LC 0.89 (0.90,0.88) 0.89 (0.90,0.88) 0.87 (0.87,0.87) 0.88 (0.89,0.87) 0.86 (0.87,0.86) 0.89 (0.91,0.87) 0.86 (0.87,0.84) 0.83 (0.84,0.82) 0.77 (0.79,0.75) 0.72 (0.72,0.71)

Table 11: Edge-level quality reported as “F-1 (precision, recall)”, by number of input tables in Real benchmark.

# of tables 4 5 6 7 8 9 10 [11,15] [16,20] 21+


Auto-BI 1.00 0.96 0.95 0.89 0.95 0.97 0.85 0.78 0.64 0.55 We now show that by this construction, a solution 𝑃 to Hamilton-
Auto-BI Auto-BI-P
Auto-BI-S
1.00
0.99
0.98
0.95
0.99
0.93
0.94
0.93
0.96
0.95
0.99
0.80
0.95
0.76
0.89
0.69
0.83
0.49
0.67
0.31
path on 𝐺, will also be a solution to the 𝑘-MCA-CC we constructed.
Commercial System-X 0.91 0.87 0.81 0.85 0.77 0.78 0.74 0.75 0.52 0.54 This is the case because if 𝑃 is a Hamilton-path of 𝐺, it must be
MC-FK 0.76 0.68 0.41 0.34 0.19 0.14 0.1 0.13 0.09 0.05
Baselines Fast-FK 0.68 0.65 0.39 0.14 0.32 0.08 0.09 0.12 0.09 0.03 a feasible solution on 𝐺 ′ to 𝑘-MCA-CC, because first of all, 𝑃 is
HoPF 0.78 0.57 0.42 0.21 0.32 0.06 0.18 0.19 0.18 0.11
MC-FK+LC 0.89 0.92 0.75 0.62 0.67 0.70 0.63 0.53 0.35 0.31
a single path, which is a trivial 1-MCA, satisfying Equation (15).
Enhanced Fast-FK+LC 0.88 0.92 0.77 0.64 0.78 0.73 0.67 0.49 0.22 0.21 Furthermore, since 𝑃 passes through each vertex exactly once, it
Baselines HoPF+LC 0.87 0.74 0.66 0.62 0.57 0.49 0.48 0.34 0.28 0.19
LC 0.87 0.85 0.72 0.66 0.68 0.81 0.65 0.53 0.28 0.26 satisfies our FK-once constraint in Equation (16). Lastly, its cost in
Table 12: Case-level precision, by the number of input tables Equation (14) is exactly |𝑉 | − 1 (with |𝑉 | − 1 unit-weight edges in
in Real benchmark. the path 𝑃), which is guaranteed to be lower than any 𝑘-MCA with
D PROOF OF THEOREM 3 𝑘 > 1 (whose penalty cost alone is |𝑉 |). All of these above guarantee
that 𝑃 is an optimal solution to the 𝑘-MCA-CC we constructed on
Proof. We show the hardness and inapproximability of 𝑘-MCA-
𝐺 ′.
CC. We will first show the hardness of the problem using a reduction
Assume for a moment that we can solve 𝑘-MCA-CC efficiently
from Hamilton path [28]. We then show inapproximability using a
in polynomial time, then by this construction, we can solve any
reduction from non-metric min-TSP [21].
instance of Hamilton path using the reduction above, also in poly-
Recall that given a graph 𝐺 = (𝑉 , 𝐸), the Hamilton path problem
nomial time, which contradicts with existing complexity result of
looks for a path that visits each vertex exactly once. We reduce
Hamilton path [28]. We can thus conclude that 𝑘-MCA-CC is also
Hamilton path to 𝑘-MCA-CC. Given an instance of Hamilton path
NP-hard.
on 𝐺 = (𝑉 , 𝐸), we construct an instance of 𝑘-MCA-CC as follows.
We now show the inapproximability of 𝑘-MCA-CC using a re-
We construct a new graph 𝐺 ′ = (𝑉 , 𝐸 ′ ) on the same set of vertices 𝑉 ,
duction from non-metric min-TSP [21]. Recall that the non-metric
where for each 𝑣𝑖 ∈ 𝑉 we construct a table 𝑇𝑖 with a single column
min-TSP on a graph 𝐺 = (𝑉 , 𝐸) finds the min-cost cycle that visits
𝐶𝑖 in 𝑘-MCA-CC. For each directed edge 𝑒 (𝑣𝑖 , 𝑣 𝑗 ) ∈ 𝐸, we construct
each vertex exactly once, and returns to the origin, where cost
a possible N:1 FK/PK join between 𝐶𝑖 of 𝑇𝑖 and 𝐶 𝑗 of 𝑇 𝑗 in 𝑘-MCA-
is defined as the sum of edge-weights (which in the non-metric
CC, which would correspondingly create an edge 𝑒 (𝑣𝑖 , 𝑣 𝑗 ) ∈ 𝐸 ′
version of min-TSP, can be arbitrary numbers).
in the graph representation for 𝑘-MCA-CC. Note that because we
Recall that the Hamilton cycle problem (a special version of min-
construct each table 𝑇𝑖 to have a single column 𝐶𝑖 , and given the
TSP) can be reduced to Hamilon path, by adding an artificial source
FK-once constraint (Equation (16)), when there are multiple edges
and sink vertex 𝑠 and 𝑡 to the input graph, which is connected to
pointing away from a vertex 𝑣 in 𝐺 ′ , a solution to 𝑘-MCA-CC
a vertex 𝑣 and its cleaved copy 𝑣 ′ with the same neighborhood as
is forced to pick only one such edge, ensuring that the 𝑘-MCA
𝑣 [28]. We perform a similar construction from non-metric TSP
returned by the solution contains either single paths, or isolated
(cycle) to non-metric min-cost Hamilton path, using the same ar-
vertices (which are trivial forms of single paths). Furthermore, in our
tificial source and sink vertex 𝑠 and 𝑡, together with a cleaved 𝑣 ′ .
constructed 𝑘-MCA-CC, we set all edge-weight to be unit weight,
For any instance of the non-metric min-TSP problem, we can thus
and set the penalty weight 𝑝 to be a large constant |𝑉 | + 1. This
construct a min-cost Hamilton path problem. Because the edges
large penalty weight ensures that if a 1-MCA exists on 𝐺 ′ , it is
(𝑠, 𝑣), (𝑡, 𝑣 ′ ) are constructed to have 0 cost, the reduction is value
guaranteed to have a lower cost than any 𝑘-MCA with 𝑘 > 1,
preserving as an optimal solution to min-TSP is also the optimal
because the penalty term |𝑉 | is already larger than the cost of
min-cost Hamilton-path (modulo 𝑠 and 𝑡), and the two solutions
1-MCA (which is at most |𝑉 | − 1).
have the same objective function values. Furthermore, using the
same reduction from Hamilton path to 𝑘-MCA-CC above, we can
then reduce an instance of non-metric min-TSP first to min-cost 𝐶𝑠 = {𝑒𝑠 𝑗 , 𝑒𝑠𝑘 , . . .} ⊆ 𝐽 that causes violations, where all edges in
Hamilton path, and then to 𝑘-MCA-CC, all with the same objective 𝐶𝑠 point from the same vertex with the same column-index 𝑠. We
|𝐶 |
values for optimal solutions. partition 𝐶𝑠 into |𝐶𝑠 | number of subsets 𝐶𝑠1 , 𝐶𝑠2 , . . . , 𝐶𝑠 𝑠 , each with
Given that the non-metric min-TSP is EXP-APX-complete [21], exactly one edge from 𝐶𝑠 , and then construct |𝐶𝑠 | number of 𝑘-MCA-
we know 𝑘-MCA-CC is also EXP-APX-complete (for otherwise CC problem instances, each with a new graph 𝐺𝑖 = (𝑉 , 𝐸𝑖 ), where
given any instance of non-metric min-TSP, we can use the reduction 𝑉𝑖 = 𝑉 , 𝐸𝑖 = 𝐸\𝐶𝑠 ∪𝐶𝑠𝑖 , which we then recurse and solve the 𝑘-MCA-
above to solve the corresponding 𝑘-MCA-CC efficiently, and thus CC on each graph 𝐺𝑖 in Line 11. To show that 𝐽 ∗ = argmin 𝐽𝑖 𝑐 (𝐽𝑖 )
non-metric min-TSP, contradicting with the complexity result of is the optimal solution to the original 𝑘-MCA-cc problem on 𝐺, we
non-metric min-TSP). □ need to show that the optimal solution to the original 𝑘-MCA-cc
problem on 𝐺, 𝐽 + , is still not pruned when we split edges in 𝐶𝑠 and
E PROOF OF THEOREM 4 create |𝐶𝑠 | number of smaller problems with 𝐺𝑖 . In order to see this,
Proof. We show that Algorithm 3 solves 𝑘-MCA-CC optimally. notice that the optimal solution to the original 𝑘-MCA-cc problem
If the optimal solution to 𝑘-MCA, 𝐽 , does not violate FK-once con- on 𝐺, 𝐽 + , will have at most one edge in 𝐶𝑠 (otherwise it violates
straint (Equation (16), then in Line 2 we know 𝐽 must be a feasible FK-once constraint and would not have been a feasible solution). If
solution to 𝑘-MCA-CC on the same graph. Furthermore, because 𝐽 + has no edge in 𝐶𝑠 , then it has not been pruned away when we
the feasible region of 𝑘-MCA is strictly no smaller than that of partition 𝐶𝑠 and reduce the solution space, as it is still a feasible
𝑘-MCA-CC, we know if 𝐽 is an optimal solution to 𝑘-MCA and it solution in each 𝐺𝑖 . Alternatively, 𝐽 + has one edge in 𝐶𝑠 , and let it
is feasible for 𝑘-MCA-CC, it must also be an optimal solution to be the edge in 𝐶𝑠𝑙 , then all the edges in 𝐽 + is still in the graph 𝐺𝑙
𝑘-MCA-CC. Thus we return 𝐽 in Line 3 and we are done. (which only pruned edges in 𝐶𝑠 \ 𝐶𝑠𝑙 ). As such, 𝐽 + is still a feasible
If instead, the optimal solution to 𝑘-MCA, 𝐽 , does violate FK- solution in 𝐺𝑙 , and will be returned in step Line 12. □
once constraint in Line 2, because there are at least one edge set

You might also like