Figure 1: Types of common BI schemas. (a) Star schema, which can be seen as a simplified version of Snowflake. (b) Snowflake
schema, where dimension tables like “Customers” can connect to additional dimension tables. (c) Constellation schema, which
can be seen as multiple Snowflake schemas.
task, as they frequently produce incorrect results (with ~0.6 precision for all methods we tested). This is because real BI data in the wild tend to be considerably more challenging (e.g., column headers can often be cryptic, with generic names like "name" or "id", and column values can often overlap accidentally). Crucially, most existing approaches are "local" in nature: they consider two tables at a time and make greedy local decisions, without a principled global optimization, and are thus unable to produce accurate predictions on challenging real BI test cases (Section 2).

Auto-BI: global optimization using schema graphs. A key insight we develop in Auto-BI is that because we know users are finding join relationships specifically for building BI-models, and we know the typical structure that schema-graphs in BI-models should generally follow (e.g., variants of snowflake), we have a unique opportunity to leverage the global graph structure of the resulting schema-graph to predict joins much more accurately.

We thus formulate the Auto-BI problem as a global optimization problem that holistically considers both the local decision of pairwise joinability and the global decision of graph structures. Specifically, we show that the snowflake schema popular in BI-modeling corresponds to a graph-theoretical concept called Arborescence [47]. Leveraging this structure, we formulate a new graph problem called k-MCA (k-Minimum-Cost Arborescence), which finds the most probable k-snowflakes by considering both local joinability and global graph structures, all using a precise probabilistic interpretation based on calibrated probabilities.

We prove that the new k-MCA problem and its variants are in general intractable (NP-hard) and inapproximable (EXP-APX-complete). Nevertheless, we develop novel algorithms based on branch-and-bound principles, which are surprisingly efficient on real test cases and can scale to the largest BI models we encounter, while still being provably optimal. Our extensive evaluations using real BI models and synthetic TPC benchmarks suggest that the proposed Auto-BI is substantially more accurate than existing solutions in the literature.

Contributions. We make the following contributions:
• We are the first to harvest and leverage over 100K real BI models in the wild, for the problem of predicting BI models. This enables a data-driven algorithm design and makes it possible to rigorously evaluate different algorithms on real BI data.
• We are the first to exploit the snowflake-like schema structure in BI settings for our predictions, by mapping snowflakes to a less-known graph-theoretical concept called arborescence. We formulate a set of novel graph-based optimization problems that we call k-MCA, which have precise probabilistic foundations.
• We study the theoretical hardness of k-MCA variants, and propose efficient algorithms that are provably optimal. Extensive evaluations show that our algorithms are both effective (with over 0.9 F1 scores) and efficient (they scale to the largest BI models we encounter and run in a few seconds).

2 RELATED WORKS
BI and dashboarding tools. There are a wide variety of BI and dashboarding tools that aim to help users perform ad-hoc data analysis, with Tableau [10] and Power-BI [8] being the leading vendors [24]. These tools use visual drag-and-drop interfaces [38] (without requiring users to write SQL), and are particularly popular among non-technical users.
Foreign key detection. Foreign key (FK) detection is an important problem in database settings, with many influential methods developed in the literature [17, 30, 48, 58].
MC-FK [58] is a pioneering effort to detect FKs using an EMD-based randomness metric for distribution similarity between two columns, which is more reliable in predicting true relationships when unrelated key columns have accidentally overlapping ranges.
Fast-FK [17] develops an efficient method that selects FKs with the best pre-defined scores until all input tables are connected.
HoPF [30] improves upon prior work by considering not only FK-scores but also PK-scores, which are combined using a predefined scoring function, making this also a global method in spirit. The algorithm enumerates all PK/FK combinations and returns the combination that has the highest total score.
ML-FK [48] proposes an ML-based classifier to predict FKs, trained on known FKs. As we will show analytically and experimentally, without principled global optimizations, ML classification alone is still local in nature and not sufficient to achieve high accuracy.
While FK detection methods are clearly related to our problem (we experimentally compare with these methods), there are also a few key distinctions that we would like to highlight.
First, while FK detection targets general database settings, we focus on predicting joins in BI models, which gives us a unique opportunity to exploit the likely graph structure (e.g., snowflake) to make more accurate predictions, a direction not considered in prior work.
Second, unlike FK-detection methods that typically consider the canonical PK/FK (1:N) joins, in Auto-BI the types of joins we consider are more general, because real BI models in the wild frequently employ 1:1 joins as well as joins that are not perfectly 1:N, making the prediction problem more challenging.
Lastly, with the exception of [48], most prior FK detection methods primarily rely on hand-crafted scoring functions to make local join predictions (pairwise between two tables). In comparison, we leverage the harvested BI models to first predict local joinability in a data-driven way, which is then combined into a principled global
optimization for optimal global decisions (at the graph-level for all tables). This also makes our technique different from prior work.
Inclusion dependency detection. Inclusion dependency (IND) is closely related to FK, and there is an influential body of work focusing on efficiently enumerating inclusion dependencies in large databases [12, 33, 34, 39, 43, 50, 51, 53]. The focus on efficiency and scalability of this line of work makes it orthogonal to typical FK-detection methods [17, 30, 58], where the focus is on accurately predicting meaningful FKs (from a large collection of IND candidates). Like prior FK-detection methods that employ efficient IND detection [30, 58], we also use IND-detection as a pre-processing step to enumerate possible FK candidates efficiently.
Complex non-equi joins. Beyond equi-joins, techniques to automatically detect and handle complex join relationships have also been studied, e.g., transformation-based join [42, 55, 60], fuzzy-join [16, 37, 52], search-join [36], semantic lookup-join [29], etc., which is an interesting area of future work in the context of BI.

3 PROBLEM STATEMENT
In this section, we first describe preliminaries and the real BI models we harvest, before introducing the Auto-BI problem.

3.1 Preliminary: BI models
Business Intelligence is closely related to topics like data warehousing and decision support systems, and has been extensively studied. We provide a brief preliminary here and refer readers to surveys like [15, 32, 40] for more information.
Fact and dimension tables. In data warehousing and BI terminologies, tables involved in BI modeling can be categorized into two types: fact tables and dimension tables [32]. A fact table contains key metrics and measurements of business processes that one intends to analyze (e.g., the revenue of sales transactions). In addition, a fact table contains foreign keys that can reference multiple dimension tables, where each dimension table contains detailed information associated with the measurements from a unique facet (e.g., a "Product" dimension table contains details of products sold, whereas a "Date" dimension table has detailed day/month/year info of transactions, etc.). Figure 1 shows a few examples of BI schemas, with fact and dimension tables in different colors.
Such a separation of fact/dimension tables has many benefits, such as storage/query efficiency, ease of maintenance, etc. [15, 32, 40]. The fact/dimension design is a de-facto standard in BI modeling.
Star, snowflake, and constellation schemas. In BI modeling, fact/dimension tables are frequently organized into what is known as star/snowflake/constellation schemas [32], as shown in Figure 1.
Star-schema refers to the cases where there is one fact table, whose foreign-key columns refer to primary-keys from one or more (non-hierarchical) dimension tables, as illustrated by Figure 1(a).
Snowflake-schema generalizes the star-schema, with dimension tables referring to each other in a hierarchical manner. For example, in Figure 1(b), the "Customer" dimension refers to a coarser-grained dimension "Cust-Segment". Similarly, an "Address" dimension can refer to a coarser-grained "City", which in turn refers to "Country".
While there is only one fact table in star and snowflake schemas, constellation-schema generalizes to the cases with multiple fact tables, as shown in Figure 1(c).
We note that these three types of schemas are extensively studied in the literature [15, 32, 40] and widely adopted in practice.

3.2 Harvest Real BI Models
In order to understand real BI models created in the wild, and to systematically evaluate the effectiveness of Auto-BI, we harvest over 100K real BI models from public sources that are created using a popular tool, Power BI [8], whose model files have the suffix ".pbix".
We crawl these ".pbix" model files from two sources. The first is GitHub, where developers and data analysts upload their Power BI models for sharing and versioning. We crawl a total of 27K such model files from GitHub. As a second source, we go through the URLs crawled by a commercial search engine that can lead to ".pbix" model files. Most of these URLs are not crawled by the search engine, so we crawl them ourselves and obtain 86K model files.
Combining data from the two sources gives us a large collection of 100K+ real BI models, which cover diverse BI use cases, including financial reporting, inventory management, among many others.
These real BI models become a valuable dataset for rigorous evaluation – specifically, from each ".pbix" model file, we programmatically extract all tables used in the model, as well as the ground-truth BI model (join relationships) manually specified by users. This enables us to thoroughly evaluate different algorithms using real BI models in the wild, which turn out to be more challenging than the synthetic TPC benchmarks used in prior work, which tend to be clean and simple.

3.3 Problem Statement: Auto-BI
We now define the high-level Auto-BI problem as follows.
Definition 1. [Auto-BI]. Given a set of input tables T = {T1, T2, . . . , Tn} used for BI modeling, where each table Ti consists of a list of columns Ti = (ci1, ci2, . . . , cim), predict a desired BI model M(T) that consists of a set of joins M(T) = {Jij, Jpq, . . .}, where each join Jij is a join between Ti and Tj, in the form Jij = ((cik, cil, . . .), (cjr, cjs, . . .)).
Note that the output M(T) is restricted to equi-joins for now, which can be single-column joins, or multi-column joins in the form Jij = ((cik, cil, . . .), (cjr, cjs, . . .)). Also note that we only state the high-level problem so far, and will defer the exact graph-based formulation to Section 4.3.
In addition to Definition 1, a more general version of the problem goes beyond equi-joins to detect and handle more complex forms of joins, such as transformation-joins [60] and fuzzy-joins [37], which we leave as future work.

4 AUTO-BI
We will start with an overview of our Auto-BI architecture, before discussing how we predict joins holistically on graphs.

4.1 Architecture Overview
Figure 2 shows the overall architecture of our Auto-BI system. It consists of an offline component and an online component, as depicted by the two boxes shown in the figure.
In the offline step, using real BI models we harvested and their ground-truth BI join models, we perform offline training to learn
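To make the input and output of Definition 1 concrete, the sketch below shows one possible in-memory representation of the input tables T and a predicted model M(T). This is illustrative only: the class and field names are hypothetical, not part of Auto-BI; the example joins the "Fact_Sales" and "Customers" tables from Figure 1.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Table:
    # A table T_i: a name plus an ordered list of column names.
    name: str
    columns: List[str]

@dataclass
class Join:
    # An equi-join J_ij between T_i and T_j, given as a pair of
    # (possibly multi-column) key lists, per Definition 1.
    left_table: str
    right_table: str
    left_cols: Tuple[str, ...]
    right_cols: Tuple[str, ...]

@dataclass
class BIModel:
    # The predicted BI model M(T): simply a set of joins over the input tables.
    tables: List[Table]
    joins: List[Join] = field(default_factory=list)

# Example usage (table and column names are illustrative):
sales = Table("Fact_Sales", ["Sale-ID", "Customer-ID", "Product-ID", "Revenue"])
customers = Table("Customers", ["C-ID", "Name", "Customer-Segment-ID"])
model = BIModel(tables=[sales, customers],
                joins=[Join("Fact_Sales", "Customers", ("Customer-ID",), ("C-ID",))])
```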
Figure 3: Solve 1-MCA on the graph representation of the tables in Figure 1(b), with join candidates as edges.
Figure 4: Solve k-MCA on the graph representation of the tables in Figure 1(c), with join candidates as edges.

Table 1: Summary of results for MCA problem variants.
Problem:      1-MCA (known results) | k-MCA (new results) | k-MCA-CC (new results)
Description:  find the most probable 1-snowflake | find the most probable k-snowflakes | most probable k-snowflakes w/ constraints
Hardness:     Poly-time solvable [18] | Poly-time solvable (Theorem 2) | EXP-APX-hard (Theorem 3)
Algorithm:    Chu-Liu/Edmonds' algorithm [18] | ours (Algorithm 2) | ours (Algorithm 3)

by finding additional joins beyond typical snowflakes. We will introduce both in turn below.
We first introduce our precision-mode formulation, referred to as k-MCA-CC. For ease of exposition, we will illustrate the thought process of arriving at k-MCA-CC in 3 steps: (1) We will start with a simplistic assumption that the schema graph has exactly one snowflake, which we model with a graph-theoretical problem called 1-MCA. (2) We then generalize this to arbitrary numbers of snowflakes, using a formulation we call k-MCA. (3) Finally, we generalize k-MCA to include additional graph-structure constraints to arrive at our final formulation, which we call k-MCA-CC. We summarize our key results in this section in Table 1 for reference.
(1) Exactly one snowflake: 1-MCA. To start with, we discuss the simplistic case where we know there is exactly one snowflake in the underlying graph (e.g., Figure 1(b)). Note that this is only for ease of illustration, and not an actual assumption that we rely on (we will show how this can be relaxed next).
Consider a graph G = (V, E) where candidate joins are marked as edges, like in Figure 3. Since we know there is one snowflake schema, intuitively we want to select edges (joins) J ⊆ E such that:
(a) The graph induced by J, G′ = (V, J), is a snowflake that connects all vertices in V;
(b) If more than one such snowflake-structure exists, we find the most probable snowflake, based on the joint probability of all joins selected in J, P(J) = ∏_{eij ∈ J} P(Ci, Cj).
These two considerations can be written as the following optimization problem, which we refer to as 1-Most-Probable-Snowflake (1-MPS):
(1-MPS)   max_{J ⊆ E}  ∏_{eij ∈ J} P(Ci, Cj)    (1)
          s.t.  G′ = (V, J) is a snowflake    (2)
Note that the constraint in Equation (2) of 1-MPS corresponds to requirement (a), while the objective function in Equation (1) maps to requirement (b).
Since "snowflake" is used informally in BI, to make it more formal we map it to a structure in graph theory called arborescence [22].
Definition 2. [Arborescence]. A directed graph G = (V, E) is called an arborescence if there is a unique vertex r ∈ V, known as the root, such that there is exactly one directed path from r to every other v ∈ V, v ≠ r. Equivalently, a directed graph G is an arborescence if all its vertices have in-degree of 1, except a unique root vertex r ∈ V that has in-degree of 0 [22].
Intuitively, an arborescence is a directed rooted tree where all its vertices point away from the root. We use an example to show the relationship between snowflakes and arborescences.
Example 2. Consider the sub-graph G induced by all green edges in Figure 3, where each green edge corresponds to a ground-truth join in the snowflake schema of Figure 1(b). This sub-graph G is an arborescence, because if we take the vertex marked as "Fact_Sales" as the root r, then there is exactly one directed path from r to every other vertex in G. Equivalently, we can check that this root r has in-degree of 0, while all other vertices in G have in-degree of exactly 1, which also ensures that G is an arborescence.
Recall that we know there is exactly one snowflake in 1-MPS, which is equivalent to saying that the sub-graph induced by J ⊆ E is an arborescence. We rewrite 1-MPS into 1-MPA (1-Most-Probable-Arborescence), defined as follows.
(1-MPA)   max_{J ⊆ E}  ∏_{eij ∈ J} P(Ci, Cj)    (3)
          s.t.  G′ = (V, J) is an arborescence    (4)
Example 3. We revisit the graph in Figure 3. Using the formulation of 1-MPA, the best arborescence with the highest joint probability (correspondingly, the most probable snowflake) is the set of solid edges marked in green, denoted by J* = {e1, e2, e3, e4, e7, e8}, whose joint probability is 0.9 * 0.7 * 0.6 * 0.7 * 0.8 * 0.9. It can be verified that J* is the most probable arborescence among all arborescences (that span the entire vertex set), which corresponds to the ground-truth snowflake schema shown in Figure 1(b).
Consider an alternative arborescence, such as J′ = {e1, e3, e5, e6, e7, e8}, which removes e2, e4 from J* and adds e5, e6 (marked in dotted red lines). Observe that this new J′ also forms a 1-arborescence based on Definition 2. However, J′ has a lower joint probability than J* (because e5, e6 in J′ have a joint probability of 0.8 * 0.4 = 0.32, while e2, e4 in J* have a joint probability of 0.7 * 0.7 = 0.49), and it will thus not be selected by the 1-MPA formulation above.
We note that in 1-MPA, J* is selected based on a global decision of optimality, which avoids locally-greedy decisions – e.g., the incorrect e5 will not be selected despite its high score, as we mentioned in Example 1.
While we can now use arborescence to formalize 1-MPA, its objective still has a product term that is not amenable to optimization. We thus perform the following transformation: instead of assigning the join probability P(Ci, Cj) as the edge weight of each edge eij, we perform a logarithmic transformation and set the edge weight of each eij as:
w(eij) = −log(P(Ci, Cj))    (5)
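The effect of the transformation in Equation (5) can be checked with a small numeric sketch (using the edge probabilities of J* from Example 3): because −log is monotonically decreasing, maximizing the product of calibrated probabilities is the same as minimizing the sum of the transformed weights, which is what lets 1-MPA be restated as a minimum-cost problem.

```python
import math

# Calibrated join probabilities of the edges in J* from Example 3.
probs = [0.9, 0.7, 0.6, 0.7, 0.8, 0.9]

# Edge weights per Equation (5): w(e_ij) = -log P(C_i, C_j).
weights = [-math.log(p) for p in probs]

joint_prob = math.prod(probs)   # objective of 1-MPA (to be maximized)
total_weight = sum(weights)     # objective of 1-MCA (to be minimized)

# The sum of -log weights equals -log of the joint probability, so the two
# objectives rank candidate edge sets identically.
assert math.isclose(total_weight, -math.log(joint_prob))
```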
Using the new transformed w(eij), for each instance of the 1-MPA problem we construct a new minimization problem below, which we term 1-MCA and which uses summation in the objective function:
(1-MCA)   min_{J ⊆ E}  ∑_{eij ∈ J} w(eij)    (6)
          s.t.  G′ = (V, J) is an arborescence    (7)
It can be shown that solving 1-MPA is equivalent to solving the corresponding 1-MCA.
Lemma 1. A solution J* ⊆ E is an optimal solution to 1-MPA, if and only if J* is an optimal solution to the corresponding 1-MCA.
Proof. First, observe that 1-MPA and 1-MCA have identical feasible regions, as they are subject to the same constraints. Next, we show that an optimal solution J* that maximizes the objective function ∏_{eij ∈ J*} P(Ci, Cj) in Equation (3) of 1-MPA simultaneously minimizes the objective function ∑_{eij ∈ J} w(eij) in Equation (6) of 1-MCA. This is the case because the objective function of 1-MCA is ∑_{eij ∈ J*} w(eij) = ∑_{eij ∈ J*} −log P(Ci, Cj) = −log(∏_{eij ∈ J*} P(Ci, Cj)), where the term inside −log() is exactly the objective function of 1-MPA, thus ensuring that 1-MCA is minimized if and only if 1-MPA is maximized, which completes the proof. □
The reason we construct 1-MCA for each instance of 1-MPA is that 1-MCA relates to a known problem in graph theory called "Minimum-Cost-Arborescence" (abbreviated as MCA²) [20], which finds a spanning arborescence (covering all vertices) in a directed graph G that has the smallest total edge-weight. Note that in the simplistic setting where we know there is exactly 1 snowflake (arborescence), our 1-MCA problem directly translates to the MCA problem [20]. Since MCA admits polynomial-time solutions in graph theory, namely the Chu–Liu/Edmonds' algorithm [18, 20], our construction allows us to solve 1-MPA efficiently by leveraging the same algorithms.
2 Although MCA is not a well-known concept, we note that MCA is directly analogous to a better-known problem in graph theory called "Minimum-Spanning-Tree (MST)" [31], with the only difference that MST is defined on undirected graphs whereas MCA is on directed graphs.
We use our running example to show the connection between 1-MPA and 1-MCA.
Example 4. Continuing with Example 3, to solve the 1-MPA problem for the graph in Figure 3, we use the transformation in Equation (5) and construct an instance of 1-MCA on the same graph, where all edge-weights are now w(eij) = −log(P(Ci, Cj)). It can be verified that J* = {e1, e2, e3, e4, e7, e8} is the minimizer of Equation (6) in 1-MCA, with the smallest objective-value −(log(0.9) + log(0.7) + log(0.6) + log(0.7) + log(0.8) + log(0.9)), which can be efficiently solved using the Chu–Liu/Edmonds algorithm. Note that J* is the same optimal solution to 1-MPA, as we saw in Example 3.

Algorithm 1: Construct graph with edge-weights
input: all input tables T in a BI-model
output: Graph G = (V, E) that represents T
1  V ← {vT | T ∈ T}, with vT representing each T ∈ T
2  E ← {}
3  foreach (Ci, Cj) satisfying Inclusion-Dependency in T do
4      P(Ci, Cj) ← Local-Classifier(Ci, Cj)
5      w(eij) ← −log(P(Ci, Cj))
6      E ← E ∪ {eij}, with edge-weight w(eij)
7  return G(V, E)

Algorithm for 1-MCA. Given a set of input tables T (for which a BI-model needs to be built), the complete pseudo-code to construct a graph G for 1-MCA is shown in Algorithm 1. Since we are given tables T, in Line 3 we enumerate column-pairs (Ci, Cj) in T for which Inclusion-Dependencies (IND) [14] hold approximately, which are the possible joins that we should consider (note that efficient IND-enumeration is a standard step [19], so we invoke existing techniques here). In Line 4, we "score" each (Ci, Cj) using our Local-Classifier to obtain calibrated probabilities P(Ci, Cj) (Section 4.2), which are transformed in Line 5 to become edge-weights w(eij) (Equation (5)).
Using the resulting graph G constructed from T, we can invoke the Chu–Liu/Edmonds' algorithm, which yields the optimal solution J* to 1-MCA (and thus also 1-MPA).
We summarize our main result for 1-MCA in the first column of Table 1. We note that although we leverage known results from graph theory here, to the best of our knowledge we are the first to connect the popular concept of snowflakes in BI with arborescences in graph theory.
In the following, we will extend 1-MCA to general cases, and develop new results not known in graph theory or databases.
(2) Arbitrary number of snowflakes: k-MCA. So far we use 1-MCA to solve the simplistic case where we know there is exactly 1 snowflake. In general, there can be multiple snowflakes, and we do not know their exact number beforehand (e.g., the ground-truth schema of Figure 1(c) has 2 snowflakes, which is unknown a priori).
In this section, we extend 1-MCA to the more general k-MCA, where the proposed k-MCA formulation will not only find the most probable k snowflakes like in 1-MCA, but also infer the right number of snowflakes k at the same time.
We first extend the concept of arborescence in graph theory (Definition 2) to the more general k-arborescence below.
Definition 3. [k-arborescence]. A directed graph G = (V, E) is a k-arborescence if its underlying undirected graph has a total of k disjoint connected components, written as {Gi = (Vi, Ei) | i ∈ [k]}, such that ∪_{i ∈ [k]} Vi = V and ∪_{i ∈ [k]} Ei = E. Furthermore, each Gi is an arborescence, for all i ∈ [k].
Note that when k = 1, the notion of 1-arborescence degenerates into the arborescence described in Definition 2. We will henceforth use 1-arborescence and arborescence interchangeably when the context is clear.
Example 5. Consider the graph in Figure 4, which is the graph representation of the constellation (multi-snowflake) schema in Figure 1(c). The sub-graph with green solid edges (ignoring the dotted edges for now) is a 2-arborescence. It is a 2-arborescence because both of its two connected components are arborescences (Definition 2), rooted at "Fact_Sales" and "Fact_Supplies", respectively.
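Before moving on to k-MCA, the sketch below illustrates the 1-MCA pipeline just described: Algorithm 1's graph construction followed by Chu–Liu/Edmonds. It is a minimal sketch under stated assumptions, not the Auto-BI implementation: it assumes networkx is available, the local_classifier stand-in is a crude placeholder for the calibrated classifier of Section 4.2, and it keeps only one candidate edge per table pair for simplicity.

```python
import math
import networkx as nx

def local_classifier(col_i, col_j):
    # Stand-in for the calibrated local classifier of Section 4.2; here a crude
    # name-equality heuristic, only so that the sketch runs end to end.
    return 0.9 if col_i.lower() == col_j.lower() else 0.6

def build_graph(candidate_pairs):
    """Algorithm 1 sketch: one vertex per table, one directed edge per IND candidate,
    weighted by -log of the calibrated join probability (Equation (5))."""
    G = nx.DiGraph()
    for (table_i, col_i), (table_j, col_j) in candidate_pairs:
        p = local_classifier(col_i, col_j)
        # Note: a DiGraph keeps a single edge per (table_i, table_j) pair; Algorithm 1
        # would keep every candidate column pair.
        G.add_edge(table_i, table_j, weight=-math.log(p), prob=p, cols=(col_i, col_j))
    return G

def solve_1_mca(G):
    """Solve 1-MCA with Edmonds' (Chu-Liu/Edmonds) algorithm, here via networkx.
    Raises an exception if G admits no spanning arborescence."""
    return nx.minimum_spanning_arborescence(G, attr="weight")
```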
Intuitively, we use the k-arborescence structure to force the k underlying snowflakes to emerge, which reveals the most salient "backbones" of the underlying schema. It should be noted that in the precision-mode of Auto-BI (this section), because we focus exclusively on finding snowflake/arborescence structures to ensure high precision, some desired joins may be missing from the k-arborescence (e.g., the orange dotted edges in Figure 4), which is something we will get to in the recall-mode of Auto-BI next (Section 4.3.3).
Given that we want to use the k-arborescence structure to let the k snowflakes emerge, the next question is how to find the right k. Conceptually, we can iterate over all possible k and pick the "best" k using a suitable "goodness" function (e.g., one that reveals the right number of snowflakes).
First, recall that a k-arborescence on a graph G = (V, E) has exactly k connected components, and is bound to have exactly (|V| − k) edges (because every vertex except the k roots has in-degree of 1, per Definition 2).
Given this, we can see that while the "sum of edge-weights" objective function in Equation (6) for 1-MCA is a suitable "goodness" function for comparing two 1-arborescences (both with k = 1), it is no longer suitable for comparing a k1-arborescence and a k2-arborescence (with k1 ≠ k2), because the two will have different numbers of edges, making a comparison based on the sum of edge-weights in Equation (6) "unfair". In fact, using Equation (6) with a flexible k is guaranteed to yield |V| disconnected vertices (a trivial |V|-arborescence) as the optimum, because it has no edges and thus no edge-weights.
To make the comparison "fair" between arborescences with different values of k, and to prevent disconnected components from having artificial advantages, we modify the objective function as follows. Since a k-arborescence has exactly k connected components and (|V| − k) edges, we imagine that there are (k − 1) "virtual edges", each with a parameterized edge-weight p, that connect the k connected components into one, such that a k-arborescence always has the same number of edges as a 1-arborescence, regardless of k (because (|V| − k) + (k − 1) = (|V| − 1)). Accounting for the (k − 1) new virtual edges in Equation (6) leads to Equation (8) below, which is the new objective function we use for k-MCA:
(k-MCA)   min_{J ⊆ E, k ≤ |V|}  ∑_{eij ∈ J} w(eij) + (k − 1) · p    (8)
          s.t.  G′ = (V, J) is a k-arborescence    (9)
The parameter p we introduce effectively controls the number of snowflakes (e.g., a larger p would "penalize" having more disconnected snowflakes). The question is how to set p appropriately. Recall that we use calibrated join-probabilities P(Ci, Cj) to produce edge weights w(eij) = −log(P(Ci, Cj)); naturally, the virtual edges should then be imagined as edges with a join-probability of exactly 0.5, which means a 50% chance of being joinable and thus a coin-toss that is equally fine to include or exclude (which is what makes them "virtual").
As another way to look at this, consider what happens when we drop a regular edge eij from a k-arborescence to create a (k + 1)-arborescence – the objective function of the latter in Equation (8) will have the edge-weight w(eij) removed, but incur a penalty cost of p from the additional virtual edge we introduce. When the penalty-weight p on the virtual edge corresponds to a join-probability of 0.5, it discourages eij from being dropped if its P(Ci, Cj) is over 0.5 (more than a 50% chance of being joinable), and encourages eij to be dropped if its P(Ci, Cj) is under 0.5, which is exactly the right thing to do.
We therefore use p = −log(0.5) as our natural choice of penalty weight in k-MCA. Our experiments suggest that empirically 0.5 is indeed the right choice (Section 5), thanks to the fact that we use true calibrated probabilities (Section 4.2).
Observe that if we know k = 1 a priori, our k-MCA degenerates exactly into 1-MCA as before. When k is unknown, however, the objective function in Equation (8) helps reveal the best k-snowflakes, as well as the right number k. We show this intuitively using the example below.
Example 6. We revisit Figure 4, where all edges are possible join candidates, and solve k-MCA on this example graph.
First, observe that there is no 1-arborescence in this graph. Let J* be all the green edges; the subgraph induced by J* is a 2-arborescence (rooted at the two fact-tables). Since k = 2, the penalty term in this case is (2 − 1)(−log(0.5)) = 1. It can be verified that J* has the lowest cost among all 2-arborescences.
It is possible to have 3-arborescences here too – for example, if we remove e1 from J*, then the remaining green edges in J′ = J* \ {e1} induce a 3-arborescence. It can be verified that the cost of J′ is higher than that of J*, because J′ removes e1 from J* and thus lowers its cost by w(e1) = −log(0.9) = 0.13, but incurs a higher penalty cost of (3 − 1)(−log(0.5)) = 2, which makes J′ less desirable than J*.

Algorithm 2: Solve k-MCA for constellation schema
input: Graph G = (V, E)
output: optimal k-MCA (k-snowflakes)
1  V′ ← V ∪ {r}
2  E′ ← E ∪ {e(r, v) | v ∈ V}, with w(e(r, v)) = p
3  J1* ← solve 1-MCA on G′ = (V′, E′) with Chu-Liu/Edmonds' algorithm
4  Jk* ← J1* \ {e(r, v) | v ∈ V}
5  return Jk*

Algorithm for k-MCA. A naive way to solve k-MCA, building on top of 1-MCA above, is to enumerate different values of k and, for each k, exhaustively enumerate all k-way partitions of G, then invoke 1-MCA on each resulting graph partition, to find the optimal k-MCA. This approach is straightforward but inefficient (because k can be as large as |V|, making it exponential in |V|).
We design an algorithm that solves k-MCA using a graph-based construction that reduces any instance of the new k-MCA problem into one instance of 1-MCA (which admits efficient solutions [18]). Specifically, given a graph G = (V, E) on which k-MCA needs to be solved, we introduce a new vertex r that is an "artificial root", and connect r with all v ∈ V using edges e(r, v), with edge-weight w(e(r, v)) = p. This leads to a new constructed graph G′ = (V′, E′) where V′ = V ∪ {r}, E′ = E ∪ {e(r, v) | v ∈ V}.
We solve 1-MCA on the new G′, and let J1* be the optimal 1-MCA. Let Jk* = J1* \ {e(r, v) | v ∈ V} be the edges in J1* not incident with the artificial root r. We can show that Jk* is the optimal solution to k-MCA on the original G. The complete steps of this algorithm are shown in Algorithm 2.
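A minimal sketch of this artificial-root reduction is shown below, reusing the hypothetical solve-1-MCA helper from the earlier sketch (again via networkx; the artificial-root label is illustrative).

```python
import math
import networkx as nx

P_VIRTUAL = 0.5                     # virtual-edge join probability (Section 4.3.2)
PENALTY = -math.log(P_VIRTUAL)      # p = -log(0.5), the per-snowflake penalty weight

def solve_k_mca(G, root="__artificial_root__"):
    """Algorithm 2 sketch: add an artificial root connected to every vertex with
    weight p, solve 1-MCA on the augmented graph, then drop the artificial edges."""
    G_aug = G.copy()
    for v in list(G.nodes):
        G_aug.add_edge(root, v, weight=PENALTY)
    # A spanning arborescence always exists in G_aug (rooted at the artificial root).
    J1 = nx.minimum_spanning_arborescence(G_aug, attr="weight")
    # Edges not incident to the artificial root form the optimal k-arborescence;
    # the number of retained (root, v) edges reveals the inferred number of snowflakes k.
    Jk = [(u, v) for u, v in J1.edges if u != root]
    k = sum(1 for u, _ in J1.edges if u == root)
    return Jk, k
```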
Theorem 2. Algorithm 2 solves k-MCA optimally, in time polynomial in the input size.
Proof. Given a graph G = (V, E), let Jk* be the optimal k-MCA of G, with objective function value ck(Jk*), where ck is the objective function defined in Equation (8). Because Jk* is a k-arborescence, by Definition 3, Jk* has exactly k disjoint components {G1, G2, . . . , Gk}, all of which are 1-arborescences, with R = {r1, r2, . . . , rk} as their roots.
We show that on our constructed graph G′ = (V′, E′), the solution J1* = Jk* ∪ {(r, ri) | i ∈ [k]} must be an optimal solution to the 1-MCA problem on G′.
We show this by contradiction. Suppose J1* is not an optimal solution on G′, which means that there is another J1′ with a smaller cost as defined by the 1-MCA objective function. Letting the objective function in Equation (6) be c1, we can write this as
c1(J1′) < c1(J1*)    (10)
In the solution J1′, let R′ be the set of vertices incident with the artificial root r. Removing from J1′ the set of edges incident with r produces Jk′ = J1′ \ {e(r, r′) | r′ ∈ R′}. Jk′ is an |R′|-arborescence, and thus a feasible solution to k-MCA on G. Similarly, removing from J1* the set of edges incident with r produces Jk* = J1* \ {e(r, r′) | r′ ∈ R}, which is also a feasible solution to k-MCA on G.
Since Jk* is an |R|-arborescence and Jk′ is an |R′|-arborescence, by the definition of ck in Equation (8) we know
ck(Jk*) = ∑_{e ∈ Jk*} w(e) + (|R| − 1)p    (11)
ck(Jk′) = ∑_{e ∈ Jk′} w(e) + (|R′| − 1)p    (12)
Because we know Jk* is an optimal solution to k-MCA on G, we know ck(Jk*) ≤ ck(Jk′). Combining this with Equations (11) and (12), we get:
∑_{e ∈ Jk*} w(e) + (|R| − 1)p  ≤  ∑_{e ∈ Jk′} w(e) + (|R′| − 1)p    (13)
Adding p to both sides of Equation (13), we get ∑_{e ∈ Jk*} w(e) + |R|·p ≤ ∑_{e ∈ Jk′} w(e) + |R′|·p. Note that the left-hand side is exactly the objective value of J1* on G′, or c1(J1*), while the right-hand side is the objective value of J1′ on G′. This gives c1(J1*) ≤ c1(J1′), contradicting our assumption in Equation (10), thus proving that J1* must be an optimal solution to 1-MCA on G′.
Since J1* can be computed optimally for 1-MCA in polynomial time (Line 3 of Algorithm 2), and all other steps in Algorithm 2 also take time polynomial in the input size, we conclude that Algorithm 2 solves k-MCA optimally, in polynomial time. □
Example 7. We continue with Example 6. Using Algorithm 2, we construct a new graph G′ by adding an artificial root r (shown on the right of Figure 4), which connects to all existing vertices vi with edges e(r, vi), all with the same penalty weight w(e) = p. Solving 1-MCA on G′ in Line 3 of Algorithm 2 produces the optimal solution J1*, consisting of all solid green edges plus two artificial edges connecting r with the two fact-tables; it can be verified that this is the optimal 1-MCA. Line 4 of Algorithm 2 then produces Jk*, corresponding to all green edges (removing the two artificial edges connecting the fact-tables to the artificial root). It can be verified that Jk* in this example is the optimal k-MCA, which is also the ground-truth schema in Figure 1(c).
(3) Arbitrary number of snowflakes with constraints: k-MCA-CC. The k-MCA formulation we considered so far can handle arbitrary numbers of snowflakes. This, however, can be further improved by adding an additional constraint on the graph that we call "FK-once".³
3 Other possible constraints that can benefit the inference, such as "no-cycles" of join paths in the inferred schema graph, are implicit from the definition of arborescence.
FK-once refers to the property that the same FK column in a fact-table should likely not refer to two different PK columns in two dimension tables (a common property known in the database literature [17, 30]). On a graph, such a structure would correspond to two edges (Ci, Cj), (Ci, Cm), pointing from the same Ci to two separate columns Cj and Cm, which usually indicates that one of the joins is incorrect. For example, the same FK column "Customer-ID" in a "Sales" table may appear joinable with both (1) the PK column "C-ID" of the "Customers" table, and (2) the PK "Customer-Segment-ID" of the "Customer-Segments" table (because both have high column-header similarity and value-overlap). However, we know that an FK should likely only join one PK (or otherwise we would have two redundant dimension tables).
We add the FK-once constraint as a cardinality-constraint (CC) to k-MCA, which leads to k-MCA-CC below:
(k-MCA-CC)   min_{J ⊆ E, k ≤ |V|}  ∑_{eij ∈ J} w(eij) + (k − 1) · p    (14)
             s.t.  G′ = (V, J) is a k-arborescence    (15)
                   i ≠ l,  ∀ eij ∈ J, elm ∈ J, j ≠ m    (16)
Note that Equation (16) is the new FK-once constraint, which states that no two edges eij, elm in the selected edge-set J should share the same starting column-index, or (i ≠ l).⁴
4 Note that the constraint is on column-index and at a column-level – one table can still have multiple different FK columns pointing from the same table/vertex to different PK columns of different tables/vertices.
Theorem 3. k-MCA-CC is NP-hard. Furthermore, it is Exp-APX-complete, making it inapproximable in polynomial time.
We prove Theorem 3 using a reduction from non-metric min-TSP [21]; the proof can be found in Appendix D. It is interesting to see that adding one constraint in k-MCA-CC makes it considerably more difficult than k-MCA (which, however, is important in terms of result quality, as we will show in experiments).
Algorithm for k-MCA-CC. Despite the hardness, we develop a novel algorithm that solves k-MCA-CC optimally, leveraging the branch-and-bound principle [35] and the sparsity of join edges pointing from the same columns.
Algorithm 3 shows the pseudo-code of this recursive procedure. We first invoke Algorithm 2 to solve the unconstrained version k-MCA and obtain J (Line 1). We check J for constraint violations (Line 2) – if there is no violation, we are done and J is the optimal solution to k-MCA-CC. Alternatively, if J has constraint violations, but the cost c(J) of the unconstrained version is already higher than the best solution to k-MCA-CC found so far (Line 4), we know adding the constraints can only degrade solution quality, so we return null for this solution space. Otherwise, in Line 7, let Cs = {esj, esk, . . .} ⊆ J be one set of conflicting edges in J from the same column (thus violating the FK-once constraint). We partition Cs into |Cs| subsets Cs1, Cs2, . . . , Cs|Cs|, each with exactly one edge from Cs (Line 8). We then construct |Cs| k-MCA-CC problem instances, each with a new graph Gi = (Vi, Ei), where Vi = V, Ei = E \ Cs ∪ Csi. We recursively solve k-MCA-CC on each graph Gi to get Ji (Line 11). Letting c(J) be the objective function in Equation (14), the Ji that minimizes the cost function, J* = argmin_{Ji, i ∈ [|Cs|]} c(Ji), is the optimal solution to the original k-MCA-CC problem on G (Line 12).

Algorithm 3: Solve k-MCA-CC
input: Graph G = (V, E)
output: optimal solution to k-MCA-CC (k-snowflakes)
1  J ← Solve-k-MCA(G) using Algorithm 2
2  if J is a feasible solution to k-MCA-CC(G) then
3      return J
4  else if cost c(J) is higher than the best k-MCA-CC found so far then
5      return null
6  else
7      Cs ← edges in J with the same source column index s that violate the FK-once constraint (Equation (16))
8      Cs1, Cs2, . . . , Cs|Cs| ← disjoint subsets of Cs, each with exactly one edge from Cs
9      Ei = E \ Cs ∪ Csi, ∀i ∈ [|Cs|]
10     Gi = (V, Ei), ∀i ∈ [|Cs|]
11     Ji = call Solve-k-MCA-CC(Gi) recursively, ∀i ∈ [|Cs|]
12     J* = argmin_{Ji, i ∈ [|Cs|]} c(Ji)
13     return J*

Note that our recursive call (Line 11) partitions and reduces the solution space into disjoint subsets without losing optimal solutions, while the pruning step (Line 4) quickly prunes away unpromising solution spaces by leveraging efficient k-MCA algorithms. These two steps can conceptually be seen as a form of branch-and-bound used in the optimization literature [35], but are specifically tailored to our graph problem (leveraging the mutually-exclusive nature of conflicting edges based on the FK-once property).
Theorem 4. Algorithm 3 solves k-MCA-CC optimally.
A proof of the theorem can be found in Appendix E. Intuitively, the algorithm ensures optimality because at most one edge in Cs may appear in the optimal solution to k-MCA-CC (otherwise the FK-once constraint is violated), making edges in Cs mutually exclusive. This allows us to partition the conflicting edges in Cs and recursively construct instances of the problem with smaller feasible regions, without losing the optimal solution in the process.
Given the hardness of k-MCA-CC, we clearly cannot hope for Algorithm 3 to solve k-MCA-CC in polynomial time. In practice, however, our branch-and-bound partitioning exploits the sparseness of edges (there are few edges pointing from the same columns), which turns out to be highly efficient. In our experiments (Section 5), we observe that the mean and median latency of Algorithm 3 on real-world BI schemas is 0.11 and 0.02 seconds, respectively, with the maximum being 11 seconds on a case with 88 data-tables (the largest case we encounter in the 100K+ real BI models harvested). We note that this is encouraging, and analogous to many classical combinatorial problems (e.g., Euclidean TSP and Knapsack) that are intractable in theory but reliably solvable in practice.

4.3.3 Recall Mode (MaxEdge-CC).
At this point, we are done with the precision-mode of Auto-BI, which is precision-oriented as we focus on finding the salient k-snowflakes to reveal the "backbone" of the underlying schema. However, not all desired joins are included in the k-snowflakes/arborescences. For example, the green edges in Figure 4 correspond to the optimal solution to k-MCA-CC, but we are still missing two joins in the ground truth, marked as dotted orange edges (these are joins that reference the same dimension-table from the multiple fact tables).
We, therefore, introduce the "recall-mode" of Auto-BI, where we "grow" additional edges on top of the "backbone" identified in the precision-mode.
Specifically, given the original graph G = (V, E), let J* be the optimal solution to k-MCA-CC. Let R = {eij | eij ∈ (E \ J*), P(eij) ≥ τ} be the remaining edges that are promising (meeting a precision threshold τ⁵) but not yet selected by J*. For our recall-oriented solution of Auto-BI, we want to select as many edges S ⊆ R as possible, subject to the graph-structure constraints below. We write this as EMS (Edge-Maximizing Schema):
(EMS)   argmax_{S ⊆ R}  |S|    (17)
        s.t.  i ≠ l,  ∀ eij, elm ∈ S ∪ J*, j ≠ m    (18)
              S ∪ J* is cycle-free    (19)
5 By default, we threshold with τ = 0.5 here, since our P(eij) are calibrated probabilities, and 0.5 is a natural cutoff for joinability.
The new constraint in Equation (19) ensures that there are no cycles in the resulting graph induced by S ∪ J*, as circular joins between columns are uncommon at a schema level.
The EMS problem is NP-hard, and is 1/2-inapproximable, using a reduction from Max-Sub-DAG [26]. However, because this step operates on top of the optimal solution J* to k-MCA-CC, there is typically limited flexibility in selecting from R. We find that different solutions have very similar results (Section 5), and we thus solve EMS greedily by picking the most confident edges (based on P(eij)), without resorting to more expensive solutions.
This now completes the overall Auto-BI. To recap, we first solve k-MCA-CC optimally using Algorithm 3 in the precision-mode, to get J* (the salient k-snowflakes). Then, in the recall-mode, we solve EMS using J* and obtain S*. The sub-graph induced by J* ∪ S* is the final solution of Auto-BI.

5 EXPERIMENTS
We perform large-scale evaluations to test Auto-BI and related methods in the literature, in both quality and efficiency.

5.1 Evaluation Setup
Benchmarks. We use two benchmarks to evaluate Auto-BI.

Table 2: Characteristics of all BI models harvested.
                                     Average | 50-th p% | 90-th p% | 95-th p%
# of rows per table                  1053.4  | 50.1     | 1925.4   | 13981.2
# of columns per table               8.1     | 4.2      | 11       | 17.1
# of tables (nodes) per case         3.2     | 2        | 6.9      | 9.1
# of relationships (edges) per case  3.9     | 3        | 8.2      | 11.6

Table 3: Characteristics of the 1000-case Real benchmark.
                                     Average | 50-th p% | 90-th p% | 95-th p%
# of rows per table                  7730    | 80       | 10992    | 33125
# of columns per table               9       | 5        | 20       | 27
# of tables (nodes) per case         12.9    | 9        | 25       | 32
# of relationships (edges) per case  11.7    | 8.6      | 19       | 29

Table 4: Characteristics of 4 TPC benchmarks.
                                 TPC-C  | TPC-E   | TPC-H   | TPC-DS
average # of rows per table      288590 | 7850832 | 1082655 | 814889
average # of columns per table   10.2   | 5.8     | 7.6     | 17.7
# of tables (nodes)              9      | 32      | 8       | 24
# of relationships (edges)       10     | 45      | 8       | 107
Real (OLAP) Synthetic (OLAP) Synthetic (OLTP)
Benchmark 1000-case Real benchmark TPC-H TPC-DS TPC-C TPC-E
Method Category Method 𝑃𝑒𝑑𝑔𝑒 𝑅𝑒𝑑𝑔𝑒 𝐹𝑒𝑑𝑔𝑒 𝑃𝑐𝑎𝑠𝑒 𝑃𝑒𝑑𝑔𝑒 𝑅𝑒𝑑𝑔𝑒 𝐹𝑒𝑑𝑔𝑒 𝑃𝑒𝑑𝑔𝑒 𝑅𝑒𝑑𝑔𝑒 𝐹𝑒𝑑𝑔𝑒 𝑃𝑒𝑑𝑔𝑒 𝑅𝑒𝑑𝑔𝑒 𝐹𝑒𝑑𝑔𝑒 𝑃𝑒𝑑𝑔𝑒 𝑅𝑒𝑑𝑔𝑒 𝐹𝑒𝑑𝑔𝑒
Auto-BI-P 0.98 0.664 0.752 0.92 1 0.88 0.93 0.99 0.28 0.43 1 0.6 0.75 1 0.4 0.58
Auto-BI Auto-BI 0.973 0.879 0.907 0.853 1 1 1 0.96 0.91 0.93 1 0.8 0.89 0.96 0.93 0.95
Auto-BI-S 0.951 0.848 0.861 0.779 1 1 1 0.92 0.89 0.91 1 0.7 0.82 0.93 0.94 0.94
Commercial System-X 0.916 0.584 0.66 0.754 0 0 0 0 0 0 0 0 0 0 0 0
MC-FK 0.604 0.616 0.503 0.289 1 1 1 0.73 0.65 0.68 0.46 0.8 0.63 0.57 0.79 0.48
Fast-FK 0.647 0.585 0.594 0.259 0.71 0.88 0.79 0.62 0.35 0.44 0.62 0.57 0.6 0.73 0.84 0.78
Baselines
HoPF 0.684 0.714 0.67 0.301 0.86 0.75 0.8 0.87 0.51 0.65 0.75 0.7 0.72 0.71 0.91 0.81
ML-FK 0.846 0.77 0.773 0.557 0.6 0.75 0.667 0.369 0.589 0.454 0.273 0.3 0.286 0.694 0.756 0.723
Language model GPT-3.5 0.73 0.64 0.67 0.43 0.75 0.75 0.75 0 0 0 0.16 0.2 0.15 0.78 0.56 0.65
Table 5: Quality comparison on the 1000-case Real benchmark and 4 TPC benchmarks.
Denormalized (OLAP-like) Normalized (OLTP-like)
Benchmark Foodmart Northwind AdventureWork WorldWideImporters Foodmart Northwind AdventureWork WorldWideImporters
Method Category Method 𝑃𝑒 𝑅𝑒 𝐹𝑒 𝑃𝑒 𝑅𝑒 𝐹𝑒 𝑃𝑒 𝑅𝑒 𝐹𝑒 𝑃𝑒 𝑅𝑒 𝐹𝑒 𝑃𝑒 𝑅𝑒 𝐹𝑒 𝑃𝑒 𝑅𝑒 𝐹𝑒 𝑃𝑒 𝑅𝑒 𝐹𝑒 𝑃𝑒 𝑅𝑒 𝐹𝑒
Auto-BI-P 1 0.5 0.67 1 1 1 1 0.34 0.51 1 0.31 0.47 1 0.63 0.77 1 1 1 1 0.44 0.62 1 0.25 0.4
Auto-BI Auto-BI 0.75 1 0.86 1 1 1 0.98 0.93 0.97 1 0.83 0.91 0.8 1 0.89 1 1 1 0.97 0.82 0.89 0.97 0.86 0.91
Auto-BI-S 0.68 1 0.81 0.9 1 0.95 0.94 0.91 0.92 1 0.79 0.88 0.8 1 0.88 1 1 1 0.93 0.8 0.86 0.92 0.84 0.88
Commercial System-X 0.75 0.6 0.67 0.85 0.79 0.82 0.97 0.66 0.78 1 0.59 0.74 0.75 0.6 0.67 0.9 1 0.95 0.9 0.34 0.5 1 0.09 0.17
MC-FK 0.42 0.9 0.57 0.28 1 0.44 0.33 0.71 0.45 0.21 0.86 0.34 0.54 0.88 0.67 0.43 0.8 0.56 0.2 0.87 0.33 0.2 0.52 0.3
Fast-FK 0.38 0.6 0.46 0.53 0.64 0.65 0.33 0.34 0.34 0.27 0.47 0.35 0.55 0.75 0.63 0.55 0.6 0.57 0.43 0.69 0.53 0.45 0.14 0.2
Baselines HoPF 0.44 0.4 0.42 0.53 0.5 0.52 0.78 0.73 0.75 0.89 0.72 0.8 0.43 0.38 0.4 0.8 0.8 0.8 0.55 0.71 0.62 0.75 0.51 0.61
ML-FK 0.43 0.8 0.56 0.35 0.86 0.5 0.84 0.8 0.82 0.89 0.75 0.81 0.57 0.88 0.69 0.5 0.9 0.64 0.95 0.83 0.89 0.9 0.5 0.64
Language model GPT-3.5 1 0.6 0.75 1 0.71 0.83 1 0.32 0.48 0.77 0.69 0.73 1 0.75 0.86 1 1 1 0.87 0.54 0.67 0.88 0.61 0.72
Table 6: Quality comparison on FoodMart, NorthWind, AdventureWork and WorldWideImporters . (𝑃𝑒 , 𝑅𝑒 , 𝐹𝑒 ) = (𝑃𝑒𝑑𝑔𝑒 , 𝑅𝑒𝑑𝑔𝑒 , 𝐹𝑒𝑑𝑔𝑒 ).
# of tables 4 5 6 7 8 9 10 [11,15] [16,20] 21+
Case-type statistics (ST,SN,C,O) (50,18,31,1) (49,12,37,2) (36,14,48,2) (19,23,49,9) (10,22,57,11) (7,25,50,18) (2,39,40,19) (7,14,60,19) (1,7,72,20) (9,5,62,24)
Auto-BI 0.97 (0.99,0.95) 0.97 (0.98,0.96) 0.96 (0.99,0.94) 0.95 (0.98,0.91) 0.95 (0.99,0.92) 0.96 (1.00,0.93) 0.94 (0.98,0.90) 0.90 (0.97,0.85) 0.84 (0.95,0.75) 0.79 (0.94,0.69)
Auto-BI Auto-BI-P 0.91 (1.00,0.84) 0.91 (0.99,0.85) 0.88 (1.00,0.79) 0.81 (0.98,0.69) 0.83 (0.98,0.71) 0.83 (1.00,0.70) 0.81 (0.99,0.69) 0.71 (0.98,0.56) 0.60 (0.96,0.44) 0.55 (0.95,0.39)
Auto-BI-S 0.95 (0.98,0.93) 0.95 (0.96,0.94) 0.94 (0.97,0.92) 0.94 (0.97,0.90) 0.95 (0.98,0.91) 0.93 (0.97,0.89) 0.92 (0.96,0.88) 0.85 (0.92,0.79) 0.78 (0.90,0.70) 0.74 (0.89,0.63)
Commercial System-X 0.76 (0.94,0.66) 0.67 (0.91,0.55) 0.76 (0.94,0.66) 0.76 (0.93,0.66) 0.75 (0.91,0.65) 0.78 (0.91,0.70) 0.77 (0.92,0.67) 0.74 (0.90,0.65) 0.65 (0.80,0.56) 0.66 (0.88,0.54)
MC-FK 0.69 (0.88,0.57) 0.65 (0.93,0.49) 0.63 (0.70,0.58) 0.65 (0.67,0.63) 0.62 (0.70,0.56) 0.56 (0.46,0.72) 0.54 (0.48,0.63) 0.54 (0.49,0.61) 0.52 (0.42,0.69) 0.42 (0.30,0.68)
Fast-FK 0.76 (0.79,0.72) 0.76 (0.79,0.74) 0.68 (0.69,0.66) 0.53 (0.53,0.52) 0.65 (0.67,0.63) 0.47 (0.46,0.47) 0.45 (0.48,0.42) 0.49 (0.53,0.46) 0.47 (0.49,0.46) 0.42 (0.40,0.44)
Baselines
HoPF 0.83 (0.86,0.81) 0.77 (0.77,0.76) 0.72 (0.73,0.71) 0.63 (0.58,0.70) 0.68 (0.67,0.70) 0.55 (0.49,0.64) 0.62 (0.55,0.70) 0.61 (0.58,0.64) 0.57 (0.55,0.60) 0.49 (0.44,0.56)
ML-FK 0.87(0.91,0.84) 0.86 (0.91,0.85) 0.8 (0.94,0.79) 0.83 (0.88,0.83) 0.84 (0.89,0.84) 0.8 (0.84,0.81) 0.8 (0.88,0.79) 0.72 (0.84,0.71) 0.64 (0.69,0.65) 0.56 (0.65,0.6)
Language model GPT-3.5 0.79 (0.86,0.76) 0.75 (0.83,0.7) 0.8 (0.83,0.78) 0.74 (0.78,0.74) 0.72 (0.77,0.69) 0.69 (0.76,0.66) 0.69 (0.76,0.66) 0.56 (0.62,0.54) 0.49 (0.56,0.46) 0.39 (0.46,0.36)
Table 7: Edge-level quality reported as “F-1 (precision, recall)”, by number of input tables in Real benchmark. In the first row, we report “case
type statistics”, denoted as (𝑆𝑇 , 𝑆𝑁 , 𝐶, 𝑂), which stand for the number of cases in (Star, Snowflake, Constellation, Others), respectively.
# of tables 4 5 6 7 8 9 10 [11,15] [16,20] 21+
- Real. We sample 1000 real BI models crawled in the wild Auto-BI 1.00 0.96 0.95 0.89 0.95 0.97 0.85 0.78 0.64 0.55
(Section 3.2) as our first benchmark, henceforth referred to as Real. Auto-BI Auto-BI-P
Auto-BI-S
1.00
0.99
0.98
0.95
0.99
0.93
0.94
0.93
0.96
0.95
0.99
0.80
0.95
0.76
0.89
0.69
0.83
0.49
0.67
0.31
In order to account for different levels of difficulty in predicting BI Commercial System-X 0.91 0.87 0.81 0.85 0.77 0.78 0.74 0.75 0.52 0.54
MC-FK 0.76 0.68 0.41 0.34 0.19 0.14 0.1 0.13 0.09 0.05
models, we perform stratified sampling as follows – we bucketize Fast-FK 0.68 0.65 0.39 0.14 0.32 0.08 0.09 0.12 0.09 0.03
models into 10 groups based on the number of input tables as {4, 5, Baselines HoPF 0.78 0.57 0.42 0.21 0.32 0.06 0.18 0.19 0.18 0.11
ML-FK 0.87 0.86 0.68 0.65 0.7 0.55 0.53 0.42 0.18 0.12
6, 7, 8, 9, 10, [11-15], [16-20], 21+}, and randomly select 100 cases Language model GPT-3.5 0.67 0.61 0.61 0.6 0.44 0.42 0.42 0.21 0.1 0.05
in each group, to create this 1000-case benchmark that covers the Table 8: Case-level precision, by number of tables.
entire spectrum in terms of levels of difficulty, from easy (with only We further perform tests on 4 commonly-used synthetic databases:
a few tables) to hard (with over 20 tables). Table 3 summarizes the FoodMart [3], Northwind [4], AdventureWorks [2] and WorldWide-
characteristics of the 1000-case Real benchmark. Note that these Importers [6]. We use both of their OLAP and OLTP versions to
test cases are held out and never used when training of our local stress test our algorithms, for a total of 8 additional test databases.
classifier, which is consistent with the standard practice of machine Metrics. We compare the predicted joins of different methods,
learning to avoid data leakage. against the ground truth (which in the case of Real, are human-
In comparison, Table 2 shows the characteristics of all BI models specified relationships that we programmatically extract from BI
harvested in the entire collection, which are substantially simpler models we crawled). We evaluate prediction quality of algorithms
(e.g., with a small number of tables). This is not entirely surprising, both at the edge-level and case-level, defined as below:
when these BI models are programmed by non-technical business Edge-level quality. For each BI test case 𝐶, we evaluate the frac-
users using GUI tools, which is why we perform stratified sampling
tion of predicted join relationships (edges on the graph) that is
described above to create a more balanced and challenging test set.
identical to ground-truth relationships. We use standard preci-
- Synthetic. In addition to evaluation on real BI models, we
sion/recall/F metrics for the edge-level evaluation, defined as pre-
perform tests on 4 TPC benchmarks, henceforth referred to as Syn- num-of-correctly-predicted-edges
thetic. Specifically, TPC-H and TPC-DS are two popular bench- cision 𝑃𝑒𝑑𝑔𝑒 (𝐶) = num-of-total-predicted-edges , recall 𝑅𝑒𝑑𝑔𝑒 (𝐶) =
num-of-correctly-predicted-edges
marks for BI/OLAP workloads. We evaluate predicted joins with num-of-total-ground-truth-edges . The F-1 score, denoted by 𝐹𝑒𝑑𝑔𝑒 (𝐶), is
the ground-truth relationships in TPC specifications. We further 2𝑃 (𝐶 )𝑅 (𝐶 )
then the harmonic mean of precision and recall, or 𝑃 𝑒𝑑𝑔𝑒(𝐶 )+𝑅𝑒𝑑𝑔𝑒 (𝐶 ) .
perform tests on TPC-C and TPC-E, two popular OLTP benchmarks. 𝑒𝑑𝑔𝑒 𝑒𝑑𝑔𝑒
While Auto-BI is not designed for general OLTP databases, we We report precision/recall/F-1 for the entire Real benchmark,
perform the tests nevertheless, in order to understand Auto-BI’s as the average across all 1000 test cases.
ability to detect foreign-keys in general (beyond the snowflake-like Case-level quality. Since it is difficult for non-technical users
schemas that Auto-BI was initially designed for). to identify incorrectly-predicted relationships, high precision is
crucial for Auto-BI (Section 4.3.2). We report case-level precision
for each test case 𝐶, denoted(by 𝑃𝑐𝑎𝑠𝑒 (𝐶), defined as: with both precision and recall modes (Section 4.3.3). (3): Further-
1, if P𝑒𝑑𝑔𝑒 (𝐶) = 1 more, we test a light-weight version of Auto-BI called Auto-
𝑃𝑐𝑎𝑠𝑒 (𝐶) = (20) BI-Schema-Only (Auto-BI-S), which uses only schema-level
0, otherwise
metadata (table/column names) for local-classifiers (without us-
In other words, 𝑃𝑐𝑎𝑠𝑒 (𝐶) is 1 only if no incorrect edge is predicted;
ing column-values). This method is thus efficient, with sub-second
and 0 even if a single false-positive edge is predicted (for it is un-
latency in most cases, while still being surprisingly accurate.
likely that non-technical users can identify and correct the incorrect
predictions, making us to “fail” in such a situation). The case-level
precision on the entire benchmark is also the average across all 5.3 Quality comparisons
1000 cases. 6
Overall comparison. Table 5 compares the overall quality of all
Latency. We report the latency of all methods using wall-clock
methods, on both the 1000-case Real benchmark and 4 Synthetic
time. All methods are implemented using Python 3.8 and tested on a
TPC benchmarks. As we can see, Auto-BI-based methods consis-
Windows 11 machine with 64-core Intel CPU and 128 GB memory.
5.2 Methods compared tently outperform alternatives, with Auto-BI-P always being the
We compare Auto-BI with the following methods in the literature. best in precision (very close to 1), while the full Auto-BI being the
Our code and data are available for reproducibility [1]. best overall in F-scores.
MC-FK [58]. MC-FK pioneered a method to accurately detect On the 1000-case Real benchmark, we note that the difference
multi-column FKs, using an EMD-based randomness metric that between Auto-BI and the best baselines is significant – the differ-
measures the distribution similarity of two columns. ence is 13 percentage points for edge-level precision (0.98 vs. 0.846),
Fast-FK [17]. Fast-FK makes FK predictions based on a scoring and over 10 percentage points for edge-level recall (0.879 vs. 0.77).
function that combines column-value and column-name similar- The difference is even more pronounced at the case-level, which
ity. This method employs fast pruning and focuses on ensuring is over 35 percentage points (0.92 vs. 0.557), showing the strong
interactive predictions. benefit on quality when using a principled global optimization in
HoPF [30]. HoPF is a recent method that detects FKs and PKs our 𝑘-MCA.
together, in order to ensure better prediction accuracy. It employs Even the light-weight Auto-BI-S (with the same 𝑘-MCA but
hand-tuned scoring functions as well as structural constraints (e.g., using schema-only features) is surprisingly accurate (losing only
no cycles) to improve accuracy. 2-3 percentage points in precision/recall). As we will see, it is much
ML-FK [48]. ML-FK is an ML-based approach to predict FK, more efficient, however, making it a strong contender for practical
which is similar in spirit to our local-classifiers, and we feed ML-FK use when latency becomes important.
with the same training data used by our local-classifiers. For the Synthetic TPC benchmarks, we observe similar trends
System-X. System-X is a commercial system from a leading consistent with the Real benchmark. It is interesting to note that,
BI vendor that has a feature to detect joins. We anonymize the not only is Auto-BI the best on OLAP benchmarks (TPC-H/TPC-
name of the system, in keeping with benchmarking traditions in DS), it turns out to be the best on OLTP benchmarks too (TPC-
the database literature [7, 23, 44, 49]. C/TPC-E). This is somewhat surprising, because OLTP databases
GPT-3.5. GPT is a family of language models [13] capable of performing many downstream tasks such as NL-to-SQL [45, 54] and schema-matching [59]. We use GPT as a baseline with few-shot learning [13], where we provide few-shot demonstrations of the join task in the prompt, and then ask the model to predict joins for new test cases. We use the latest publicly available version of GPT (GPT-3.5-turbo),7 and in the prompt we provide both the table schema (column names) as well as sample data rows. Note that because synthetic benchmarks (e.g., TPC) are heavily represented on the web, it is likely that GPT has seen such cases in its pre-training (e.g., TPC queries on the web), causing data leakage and inflated results. We report these numbers nevertheless, for reference purposes.
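As a rough illustration (not the exact prompt used in our experiments), such a few-shot baseline can be invoked as sketched below with the legacy openai Python SDK; the demonstration tables, column names, and prompt wording are hypothetical.

    # Illustrative sketch of the GPT-3.5 few-shot baseline described above. The exact
    # prompt and demonstrations used in our experiments are not reproduced here;
    # the table/column names below are hypothetical.
    import openai  # legacy (<1.0) SDK; expects the OPENAI_API_KEY environment variable

    FEW_SHOT_DEMO = """Task: predict join relationships between the given tables.
    Example tables (with one sample row each):
      Orders(order_id, customer_id, order_date); sample: (1001, 7, 2020-01-05)
      Customers(customer_id, name, country); sample: (7, Contoso, US)
    Joins: Orders.customer_id -> Customers.customer_id
    """

    def predict_joins(tables_description: str) -> str:
        """Ask the model to predict joins for a new test case (schema plus sample rows)."""
        prompt = (FEW_SHOT_DEMO
                  + "\nNow predict the joins for these tables:\n"
                  + tables_description + "\nJoins:")
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return response["choices"][0]["message"]["content"]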
Auto-BI. This is our proposed method using global 𝑘-MCA optimization. We evaluate three variants of our method: (1) Auto-BI-Precision (Auto-BI-P) is the precision-mode of Auto-BI (Section 4.3.2) that focuses on high precision (by finding snowflake-like “backbones” from the schema); (2) Auto-BI is our full approach; and (3) Auto-BI-S is a lightweight variant that runs the same 𝑘-MCA optimization but uses schema-only features.
6 We note that it is possible that a predicted join graph may differ syntactically from the ground truth when the two are semantically equivalent. For example, the ground truth may have a fact-table 𝐹 join a dimension-table 𝐴 (N:1), which in turn joins 1:1 with another dimension-table 𝐵. If it is predicted that 𝐹 joins with 𝐵 (N:1), and then 𝐵 joins with 𝐴 (1:1), while the ground truth is F-(N:1)-A-(1:1)-B, then the two are in fact identical (except that the join order is different). We account for semantic equivalence in our evaluation and mark both as correct, which is fair for all methods tested.
7 https://platform.openai.com/docs/models/gpt-3-5
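To make the semantic-equivalence check in footnote 6 concrete, the following is a minimal sketch of one way such a check could be implemented, assuming each join is represented as a (from-table, to-table, kind) triple with kind in {"N:1", "1:1"}; the actual evaluation script may differ.

    # Sketch of a semantic-equivalence check: tables connected by 1:1 joins are collapsed
    # into one group (union-find), and N:1 joins are then compared at the group level,
    # so that F-(N:1)-A-(1:1)-B and F-(N:1)-B-(1:1)-A compare as equivalent.
    def canonicalize(joins, tables):
        parent = {t: t for t in tables}
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path halving
                x = parent[x]
            return x
        for u, v, kind in joins:
            if kind == "1:1":
                parent[find(u)] = find(v)        # merge tables linked by 1:1 joins
        group_of = {t: frozenset(s for s in tables if find(s) == find(t)) for t in tables}
        partition = frozenset(group_of.values())
        n_to_1 = frozenset((group_of[u], group_of[v]) for u, v, kind in joins if kind == "N:1")
        return partition, n_to_1

    def semantically_equivalent(joins_a, joins_b, tables):
        return canonicalize(joins_a, tables) == canonicalize(joins_b, tables)

    # Example from the footnote: both graphs are marked correct.
    tables = {"F", "A", "B"}
    ground_truth = [("F", "A", "N:1"), ("A", "B", "1:1")]
    predicted = [("F", "B", "N:1"), ("B", "A", "1:1")]
    assert semantically_equivalent(predicted, ground_truth, tables)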
Overall, Auto-BI methods consistently outperform alternatives, with Auto-BI-P always being the best in precision (very close to 1), while the full Auto-BI is the best overall in F-scores.
On the 1000-case Real benchmark, we note that the difference between Auto-BI and the best baselines is significant – the difference is 13 percentage points for edge-level precision (0.98 vs. 0.846), and over 10 percentage points for edge-level recall (0.879 vs. 0.77). The difference is even more pronounced at the case-level, where it is over 35 percentage points (0.92 vs. 0.557), showing the strong benefit on quality when using a principled global optimization in our 𝑘-MCA.
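For concreteness, edge-level scores and a case-level score can be computed as in the sketch below, assuming each join (predicted or ground-truth) is canonicalized to a hashable edge identifier; the case-level definition used here (a case counts only if every predicted join in it is correct) is our reading of the metric rather than the exact evaluation code.

    # Minimal sketch of edge-level precision/recall/F-1, plus one plausible reading of
    # case-level precision. Joins are assumed to be canonicalized, hashable edges.
    def edge_scores(predicted: set, ground_truth: set):
        tp = len(predicted & ground_truth)
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(ground_truth) if ground_truth else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    def case_level_precision(cases):
        """cases: list of (predicted_set, ground_truth_set) pairs, one per BI model."""
        correct = sum(1 for pred, gold in cases if pred and pred <= gold)
        return correct / len(cases) if cases else 0.0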
Even the light-weight Auto-BI-S (with the same 𝑘-MCA but using schema-only features) is surprisingly accurate (losing only 2-3 percentage points in precision/recall). As we will see, however, it is much more efficient, making it a strong contender for practical use when latency is important.
For the Synthetic TPC benchmarks, we observe trends consistent with the Real benchmark. It is interesting to note that not only is Auto-BI the best on OLAP benchmarks (TPC-H/TPC-DS), it turns out to be the best on OLTP benchmarks too (TPC-C/TPC-E). This is somewhat surprising, because OLTP databases do not usually follow snowflake-like schemas, and are not the use cases that Auto-BI is originally designed for. We inspected TPC-C and TPC-E carefully to understand why Auto-BI is effective. It turns out that while these OLTP databases do not have snowflake designs, they nevertheless have clusters of tables (e.g., a cluster of tables on “Customers”, and another cluster on “Market”, etc., for TPC-E), where each cluster loosely follows a hub-and-spoke pattern with related tables joining through a few “central” tables (e.g., a cluster of tables relating to customers, such as “Customers-Account” and “Customers-Tax-Rate”, all join through a central “Customers” table in TPC-E, as nicely depicted in Figure 2 of prior work on schema summarization [57]). Similar results are also observed on additional synthetic benchmarks, as shown in Table 6. Exploring the applicability of Auto-BI-like global optimization in FK-detection for general OLTP databases is an interesting area for future work.
Among all baselines, the commercial System-X produces high precision at the cost of low recall. Among the four FK methods from the literature, ML-FK produces the best quality (since we feed it with the same training data harvested from real BI models), while the remaining three have comparable precision/recall.
Detailed breakdown. Table 7 and Table 8 show a more detailed comparison of edge-level and case-level quality, respectively, bucketized by the number of input tables. The overall trend is consistent with Table 5 – Auto-BI has the best F-1, while Auto-BI-P has the best precision across the board. We note that Auto-BI works reasonably well on the most challenging slice of the test cases (with 21+ tables), maintaining high precision (0.94) and reasonable recall. We report additional experiments and analysis (comparing with baselines enhanced using components from Auto-BI to better understand its component-wise benefits) in Appendix C.
[Figure 5: latency of all methods on the 1000-case Real benchmark. (a) 50/90/95-th percentile latency for Auto-BI-P, Auto-BI, Auto-BI-S, MC-FK, Fast-FK and HoPF; (b) latency breakdown by component (UCC, IND, Local-Inference, Global-Predict).]
Figure 6: Latency distribution of k-MCA-CC on the 1000-case Real benchmark (x-axis: number of input tables). The 50/90/95-th percentiles are 0.02/0.06/0.17 seconds, respectively.
5.4 Efficiency comparison
Overall comparison. We show the 50/90/95-th percentile latency of all methods in Figure 5(a), on the 1000-case Real benchmark. Overall, Auto-BI-S and Fast-FK are the fastest, taking 2-3 seconds even in the largest cases. Auto-BI is 2-3x slower, but the latency is still acceptable. HoPF takes the most time in comparison.
We also report the latency breakdown of all methods in Figure 5(b). Overall, there are 4 main components for all methods: (1) UCC: generate unique column combinations; (2) IND: generate inclusion dependencies; (3) Local-Inference: generate features and perform local-classifier inference for joins in Auto-BI (while baselines use predefined scoring functions here); (4) Global-Predict: Auto-BI uses 𝑘-MCA, whereas baselines use various other algorithms. For Auto-BI, the most expensive part is Local-Inference (specifically, feature generation for inference). Auto-BI-S is effective in reducing the latency of this part (by using metadata-only features), thus achieving significant speedup. IND-generation is also expensive for all methods, but it is a standard step and the latency is comparable across all methods. In Global-Predict, we see that Auto-BI (using the 𝑘-MCA algorithm) is highly efficient here. In comparison, baselines like MC-FK and HoPF are very expensive.
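To make the distinction between metadata-only and data-value features concrete, the sketch below lists a few illustrative features of each kind; these are representative examples only, not the exact feature set of our local classifiers (Section 4.2).

    # Illustrative (not exhaustive) features for a candidate join between an FK-side
    # column and a PK-side column. Schema-only features use just metadata; data-value
    # features require scanning column values, which is why they dominate inference latency.
    def schema_only_features(fk_name: str, pk_name: str, fk_type: str, pk_type: str) -> dict:
        fk_tokens = set(fk_name.lower().split("_"))
        pk_tokens = set(pk_name.lower().split("_"))
        return {
            "name_jaccard": len(fk_tokens & pk_tokens) / max(len(fk_tokens | pk_tokens), 1),
            "name_exact_match": float(fk_name.lower() == pk_name.lower()),
            "type_match": float(fk_type == pk_type),
            "fk_name_ends_with_id": float(fk_name.lower().endswith("id")),
        }

    def data_value_features(fk_values: list, pk_values: list) -> dict:
        fk_set, pk_set = set(fk_values), set(pk_values)
        return {
            "containment_fk_in_pk": len(fk_set & pk_set) / max(len(fk_set), 1),
            "pk_uniqueness": len(pk_set) / max(len(pk_values), 1),
            "value_overlap_jaccard": len(fk_set & pk_set) / max(len(fk_set | pk_set), 1),
        }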
Efficiency of k-MCA-CC. Our key innovation in Auto-BI is the 𝑘-MCA-CC formulation and its efficient solutions. Despite the hardness of the problem, we solve it optimally (Algorithm 3), which is efficient as shown by the purple Global-Predict component in Figure 5(b). It is interesting to drill down and study the latency of this key step in detail.
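For reference, a single 1-MCA call boils down to one minimum-cost arborescence computation; the sketch below solves it on a toy join graph with networkx, using negative log-probabilities as one natural choice of edge weights. It is only an illustration of the underlying primitive, not of Algorithm 3 itself (which adds the FK-once constraint and branch-and-bound on top); the table names and probabilities are made up.

    # A toy 1-MCA (minimum-cost arborescence) call using networkx. Nodes are tables,
    # directed edges are candidate joins, and one natural weighting is the negative
    # log of each join's (calibrated) probability, so a minimum-cost spanning
    # arborescence corresponds to a most-probable snowflake-like join tree.
    import math
    import networkx as nx

    candidate_joins = {            # (from_table, to_table): estimated join probability
        ("Sales", "Customers"): 0.90,
        ("Sales", "Products"): 0.80,
        ("Customers", "Products"): 0.30,   # a spurious candidate
    }

    G = nx.DiGraph()
    for (u, v), p in candidate_joins.items():
        G.add_edge(u, v, weight=-math.log(p))

    # One 1-MCA call: the result keeps Sales->Customers and Sales->Products,
    # and drops the spurious Customers->Products edge.
    arborescence = nx.minimum_spanning_arborescence(G, attr="weight")
    print(sorted(arborescence.edges()))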
Figure 6 shows the distribution of latency when we solve k-MCA-CC on the 1000-case Real benchmark. We can see that the latency for the vast majority of cases is sub-second. In fact, the 50-th, 90-th and 95-th percentile latencies are only 0.02, 0.06 and 0.17 seconds, respectively. Only a small number of cases take more than a second; these are mostly cases with more than 40 input tables, which we argue is acceptable considering the size of the data. We also report that the largest case we encounter among all 10K+ BI models has 88 input tables, which k-MCA-CC still solves optimally in about 11 seconds (not shown in the figure as it is not sampled in the 1000-case Real). This confirms that our algorithm is practical in solving large real-world BI cases both optimally and efficiently.
Figure 7 shows the benefit of the two key optimization techniques we used in solving k-MCA-CC: (1) the artificial root in 𝑘-MCA (Algorithm 2) and (2) branch-and-bound in 𝑘-MCA-CC (Algorithm 3). We compare the number of 1-MCA calls with and without these optimizations. (Note that we use the number of 1-MCA calls instead of wall-clock time, because the algorithm can time out without our optimizations.) We can see that the two optimization techniques give 5 and 4 orders of magnitude of improvement in efficiency, respectively (the y-axis is in log-scale). Compared to a brute-force approach that solves k-MCA without optimization, the combined benefit is a staggering 10 orders of magnitude, again showing the importance of our algorithms.
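Since the comparison in Figure 7 is reported in 1-MCA invocations rather than wall-clock time, a simple way to collect such a measurement is to wrap the 1-MCA solver in a counter; the sketch below is one such harness (the solver itself is a placeholder here, not our actual implementation).

    # Minimal instrumentation sketch for counting 1-MCA invocations, as used to
    # produce measurements like those in Figure 7. `solve_1mca` is a placeholder
    # for whatever 1-MCA routine is plugged in (e.g., the networkx call shown earlier).
    import functools

    def count_calls(fn):
        """Wrap a solver so that each invocation increments a counter."""
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            wrapper.calls += 1
            return fn(*args, **kwargs)
        wrapper.calls = 0
        return wrapper

    @count_calls
    def solve_1mca(graph):
        ...  # placeholder: one minimum-cost arborescence computation

    # After running the k-MCA search (with or without the optimizations),
    # solve_1mca.calls gives the number of 1-MCA invocations to report.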
5.5 Ablation Studies
We perform an ablation study to analyze the benefits of various ideas developed in Auto-BI, using the 1000-case Real benchmark. Our results are summarized in Figure 8.
The effect of the FK-once constraint. The bar marked “no-FK-once-constraint” in Figure 8 shows the result quality when we remove the FK-once constraint (i.e., use k-MCA instead of k-MCA-CC). There is a noticeable drop in precision/recall, showing the importance of this constraint in k-MCA-CC.
The effect of precision-mode. The “no-precision-mode” bar shows the result quality if we omit the precision-mode step (Section 4.3.2) and use the recall-mode directly (Section 4.3.3). We see a substantial drop in precision (6 and 13 percentage points at the edge-level and case-level, respectively), again showing the importance of k-MCA-CC.
The effect of N-1/1-1 separation in the local classifier. The “no-N-1/1-1-seperation” bar shows the results when we use one local classifier, instead of splitting N-1 and 1-1 joins into two prediction tasks (Section 4.2). The drop in quality is again evident.
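As a rough illustration of this design choice, the sketch below trains separate classifiers for N-1 and 1-1 candidates instead of one pooled model; the gradient-boosted model and the feature representation are placeholder assumptions, not the exact configuration described in Section 4.2.

    # Illustrative sketch of splitting N-1 and 1-1 join prediction into two classifiers,
    # rather than pooling all candidates into one model. The choice of
    # GradientBoostingClassifier is an assumption for illustration only.
    from sklearn.ensemble import GradientBoostingClassifier

    def train_local_classifiers(candidates, labels, is_one_to_one):
        """candidates: feature vectors; labels: 1 if the candidate is a true join;
        is_one_to_one: True for 1-1 candidates, False for N-1 candidates."""
        clf_n1 = GradientBoostingClassifier().fit(
            [x for x, o in zip(candidates, is_one_to_one) if not o],
            [y for y, o in zip(labels, is_one_to_one) if not o],
        )
        clf_11 = GradientBoostingClassifier().fit(
            [x for x, o in zip(candidates, is_one_to_one) if o],
            [y for y, o in zip(labels, is_one_to_one) if o],
        )
        return clf_n1, clf_11

    def predict_join_probability(clf_n1, clf_11, features, one_to_one: bool) -> float:
        clf = clf_11 if one_to_one else clf_n1
        return clf.predict_proba([features])[0][1]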
The effect of label transitivity in the local classifier. The bar for “no-label-transitivity” shows the drop in quality when we do not apply label transitivity in local classification (Section 4.2). The effect is noticeable but diluted across the 1000 cases, as the benefit is more pronounced in large cases and less obvious in smaller ones.
The effect of no data-value features in the local classifier. The “no-data-features” bar shows the effect of removing features relating
Figure 7: Effect of our efficiency optimization techniques, as measured by the number of 1-MCA invocations.
Figure 8: Ablation study on the effect of Auto-BI components on result quality (average over the 1000-case Real benchmark).
[Two sensitivity panels: Auto-BI edge precision, edge recall, edge F-1 and case precision as a function of the parameter p (x-axis from 0 to 1).]
Table 11: Edge-level quality reported as “F-1 (precision, recall)”, by number of input tables in Real benchmark.