Big Data Group Assignment - Report
Big Data Group Assignment - Report
Table of Contents
INTRODUCTION....................................................................................2
PROGRAM DOCUMENTATION.................................................................6
PROGRAM ORGANIZATION......................................................................................... 6
Task 1 Nearest Neighbor Search.....................................................................6
Task 1 Output................................................................................................. 7
Task 2 Skyline Search.....................................................................................9
Task 2 Output............................................................................................... 10
FUNCTION DESCRIPTION.......................................................................................... 11
Task 1 Nearest Neighbor Search...................................................................11
Task 2 Skyline Search...................................................................................13
CONCLUSION......................................................................................35
1
Introduction
The increasing volume of spatial data across industries—ranging from urban planning and
navigation systems to real estate analytics—has created a growing demand for efficient
spatial query processing techniques. Traditional linear search methods often fail to scale when
dealing with high-dimensional or large-volume datasets. This report investigates the use
of spatial indexing through R-tree structures to solve two core spatial problems: Nearest
Neighbor (NN) search and Skyline computation.
In the first part of the project, the focus lies on identifying the closest facility (e.g.,
restaurant) for each user query point using three different algorithms: sequential scan, R-
tree-based Best-First (BF) search, and a divide-and-conquer variant. This task emphasizes
proximity detection in dense spatial datasets, leveraging the hierarchical structure of R-trees
to significantly reduce computation time.
In the second part, the objective is to determine skyline points from a large dataset of
property listings. A skyline point is one that is not dominated by any other in terms of both
cost and size—providing the best trade-offs. Here, the Branch-and-Bound Skyline (BBS)
algorithm and its divide-and-conquer extension are implemented and compared against a
baseline brute-force approach. R-trees again serve as the foundational structure for efficient
region-based pruning and prioritization.
Together, these tasks demonstrate how spatial data structures like R-trees and algorithms such
as BF and BBS can be adapted to support fast and scalable spatial analysis. The report
includes detailed documentation of the implementation, system requirements, algorithmic
flow, and experimental results—showcasing practical insights into handling spatial queries at
scale.
2
Group Member Information
Parth Jayesh
48356573 [email protected] 1
Doshi
3
Hardware Requirements
Python Requirements
4
Input Files and Parameters (directory settings and other parameters)
The project files are organized in the following directory structure:
“/Users/arjunkrishangimahawar/Desktop/MQ BA/Sem 3/Big Data/ Group Assignment/”
● Task1_Parth
restaurant_dataset.txt
query_points.txt
● Task2_Callanth
city2.txt
All files should be placed in their respective task folders for the programs to run correctly.
The working directory should be set to the task folder (e.g., Task1 or Task2) when running
the programs.
Additional Requirements
As discussed in 2.1.2.4, the additional requirements will be required for installing r-tree.
5
Program Documentation
Program Organization
6
7
Task 1 Output
8
9
10
Task 2 Skyline Search
11
Task 2 Output
12
13
Function Description
14
p is added to node. geometrically.
is_leaf() Checks if node is a None Returns True if node has no
leaf node. children.
is_root() Checks if node is None Returns True if node’s
the root of the tree. parent is None.
is_overflow() Checks if node has None Returns True if node holds
exceeded max more entries than B=4.
capacity.
perimeter() Computes the None Used to compare MBR
perimeter of a compactness during split.
node’s MBR.
bf_search_nn(query) Best-First search to query: point Traverses tree using priority
find nearest dictionary queue sorted by min_dist to
neighbor using R- MBR. Returns closest point.
tree.
main() Runs Task 1 None Loads files, builds R-tree,
algorithms and performs all three NN
writes output. methods and writes results
to task1_output.txt.
15
Task 2 Skyline Search
16
Analyzing BF Algorithm based NN Search
In this section, a Best First (BF) algorithm will be used to search for the closest point to the
query point of (4, 7) as given in the sample dataset. An R-tree will be constructed, and
examples of the BF algorithm will be shown, first by a standard BF search, then using divide
and conquer. The points used in the R-tree construction can be found in the sample data set is
shown in the table below:
Sample Facility Points:
ID X Y
1 9 2
2 5 4
3 6 7
4 8 6
5 1 9
6 4 7
7 2 5
8 1 4
9 4 7
10 6 2
Query Point:
ID X Y
1 4 7
17
The Process of R-Tree Construction
The R-tree is a height-balanced spatial indexing structure used to efficiently organize and
search spatial data such as geographical coordinates. It groups nearby objects and represents
them with minimum bounding rectangles (MBRs) at each level. Each internal node in the tree
stores entries that act as pointers to child nodes, with their corresponding MBRs. Leaf nodes
contain the actual data points. R-trees help reduce the number of distance computations by
enabling region-based pruning, which is crucial for nearest neighbor searches using the Best-
First (BF) algorithm.
In this task, the R-tree is constructed using a sample dataset of 10 facility points and 1 query
point. The R-tree parameter B is set to 4, meaning a node can hold a maximum of four entries
before triggering a split. The dataset and query are as follows:
The very first point in the dataset is (9, 2). Since the R-tree is empty, this point is inserted as
the root node of the tree. The Minimum Bounding Rectangle (MBR) of the root initially
covers only this single point.
New Root: Contains (9, 2)
MBR: (9, 9, 2, 2)
18
Step 2: Insert (5, 4)
The point (5, 4) is inserted into the current root node. Since the root can still hold more
entries, the MBR is expanded to include both (9, 2) and (5, 4), covering the smallest possible
rectangle that contains them.
Root Entries: [(9, 2), (5, 4)]
MBR: (5, 9, 2, 4)
Adding (6, 7) to the root node is still allowed, as the capacity hasn't been exceeded. The MBR
is updated to reflect the inclusion of this point.
Root Entries: [(9, 2), (5, 4), (6, 7)]
MBR: (5, 9, 2, 7)
This point is also added directly to the root node, which now contains four entries (still within
limit). The updated MBR is recalculated accordingly.
Root Entries: [(9, 2), (5, 4), (6, 7), (8, 6)]
MBR: (5, 9, 2, 7)
This point causes an overflow as the root already contains four entries. A split is triggered.
The existing points are sorted using X and Y coordinates to find the most optimal split. The
node is divided into two child nodes based on minimum perimeter expansion.
R2: [(8, 6), (9, 2)] → MBR = (8, 9, 2, 6)
R3: [(1, 9), (5, 4), (6, 7)] → MBR = (1, 6, 4, 9)
New Root: Contains R2 and R3
Root MBR: (1, 9, 2, 9)
19
Step 6: Insert (4, 7)
The point (4, 7) is closest to MBR of R3 and is inserted there. R3 still has capacity. MBR is
updated accordingly.
R3 Entries: [(1, 9), (5, 4), (6, 7), (4, 7)]
MBR R3: (1, 6, 4, 9)
Point (2, 5) is also routed to R3. Now R3 exceeds the maximum capacity (B = 4), triggering
another split.
R3 Split:
o R3: [(1, 9), (5, 4), (6, 7)] → MBR = (1, 6, 4, 9)
o R4: [(2, 5), (1, 4), (4, 7)] → MBR = (1, 4, 4, 7)
This point is routed to R4 as it lies within or near its MBR. R4 can still accept more points.
R4 Updated Entries: [(2, 5), (1, 4), (4, 7)]
MBR R4: (1, 4, 4, 7)
Point (4, 3) is inserted into R4. MBR is expanded to accommodate the new lower y-
coordinate.
R4 Entries: [(2, 5), (1, 4), (4, 7), (4, 3)]
MBR R4: (1, 4, 3, 7)
This point is closest to R2 and is inserted there. R2’s MBR is updated accordingly.
R2 Entries: [(8, 6), (9, 2), (6, 2)]
MBR R2: (6, 9, 2, 6)
20
The Process of BF Algorithm
Now that the R-tree has been constructed, we can begin querying points using the Best-First
(BF) algorithm. This approach efficiently finds the nearest neighbor by exploring the tree in
order of increasing minimum distance between the query point and the Minimum Bounding
Rectangles (MBRs) of the R-tree nodes.
R2:
o X distance = |4 - 6| = 2
o Y distance = |7 - 2| = 5
o Distance = √(2² + 5²) ≈ 5.39
R3:
o X distance = |4 - 6| = 2
o Y is within the MBR’s Y-range → Y distance = 0
o Distance = √(2² + 0²) = 2.00
R4:
o X is within the MBR → X distance = 0
o Y distance = |7 - 5| = 2
o Distance = √(0² + 2²) = 2.00
21
22
R5:
o X distance = |4 - 2| = 2
o Y is within MBR range → Y distance = 0
o Distance = √(2² + 0²) = 2.00
Sorted by distance:
1. R3 – 2.00
2. R4 – 2.00
3. R5 – 2.00
4. R2 – 5.39
Step 3: Traverse Nodes by Priority
Result
After processing all candidates, the closest point to query (4,7) is: → d3 (6,7) with
distance 2.00
Thus, the Best-First algorithm correctly identifies d3 (6,7) as the nearest neighbor.
23
The Process of Divide-and-Conquer
The divide-and-conquer strategy for Nearest Neighbor Search involves splitting the dataset
based on a vertical boundary to build two separate R-trees, and applying the Best-First
algorithm independently in each region. For our dataset of 10 points, we split based on X = 5,
dividing the dataset into left and right spatial regions:
Each region will construct its own R-tree and be queried independently.
o X Range: 1 to 4
o Y Range: 3 to 9
o X Range: 5 to 9
o Y Range: 2 to 7
24
d5 (1,9) → dist ≈ 3.61
25
2. Expand Right Region (d1, d2, d3, d4, d10):
Step 5: Result
After examining both regions, the closest point to the query (4,7) is:
26
Analyzing the BBS Algorithm based Skyline Search
Step 1:
As we can see in the image, after inserting the data points from D1 to D6,, we have organized
the data points into two Minimum Bounding Rectangles (R2 and R3), where R2 contains
27
points D1 and D4 in the upper region, while R3 encompasses points D2, D3, D5, and D6 in
the lower region, setting up the initial R-tree structure for the BBS Algorithm.
Step 2:
In the image 5.2.a , it is subdivided region R3 by adding a new region R4 around point D6,
and included a new point D7 in the lower part of R3, continuing to refine the R-tree structure
for more precise spatial partitioning. This is done to handle the overflow and now the new
MBRs are as follows
● R2 = 3,5,10,12
● R3 = 9,12,4,8
● R4 = 6,8,6,7
Step 3:
28
(Fig 5.1c: Showing creation of 4 MBRs and addition of data points)
In Figure 3, after inserting new data points D8, D9, and D10, the R-tree structure needed
reorganization to handle overflow. The MBRs were split and redefined with new coordinates
as follows:
● R2 = 2,5,10,4
● R3 = 9,12,4,8
● R4 = 6,8,6,7
● R5 = 13,15,2,3
Step 4:
30
Step 5:
1. A new point D14 has been added, which led to the creation of a new region R6 within
R2 to handle the overflow in the upper left area. R6 specifically contains D14 and D4.
2. A new point D15 has been added to region R5 in the lower right area, expanding the
existing R5 region to accommodate this new point alongside D9 and D10.
The R-tree now has six distinct regions (R2 through R6), with R2 being subdivided to include
the nested R6 region and R7 region.
31
The Process of BBS Algorithm
The Branch-and-Bound Skyline (BBS) algorithm is applied to identify the most optimal
facility points from a 2D dataset, where each point represents a property defined by cost (X)
and size (Y). A skyline point in this scenario is one that is not dominated by any other facility
—that is, no other point is simultaneously cheaper and larger. The algorithm operates on the
R-tree constructed in Section 5.1, efficiently traversing regions based on their proximity to the
origin (0,0), and dynamically eliminating dominated entries. The use of the BBS algorithm
allows for rapid skyline computation while minimizing unnecessary evaluations by pruning
entire subspaces using dominance logic.
Calculate the minimum distance from the origin (0,0) to each Minimum Bounding
Rectangle (MBR) at the root level of the R-tree.
Insert MBRs into a priority queue based on ascending distance. Closer regions like
R6 and R4 are prioritized first.
32
Step 2: Insert Closest MBRs to the Queue and Begin Traversal
The algorithm begins traversal with the most promising MBRs (e.g., R6, R4), as
shown via directional arrows in visual references.
This selection determines which spatial partitions will be expanded first to look for
skyline candidates.
33
Step 3: Identify Initial Skyline Points – D6 and D13
D6 is identified as the first skyline point; it is closest to the origin and not dominated
by any other region.
D13 is then selected as the next skyline point, having also passed the domination test.
Both are added to the skyline list.
34
(Fig 5.2d: Prune Dominated Regions and Assess Unresolved Entries)
35
(Fig 5.2e: Evaluation of new candidates and elimination)
36
Step 6: Assess R6 for Additional Skylines
Region R6 is processed. Points D14 and D4 are discovered and added to the skyline
list.
D8 is removed, as it is dominated by D14.
Status Update:
o Skyline points: D6, D13, D11, D14, D4
37
Step 7: Final Evaluation of Region R5
38
(Fig 5.2h: Final Output and Skyline Completion)
After evaluating all MBRs and their data points, the algorithm finalizes the skyline.
Final Skyline Points: D14, D4, D11, D6, D13, D15
These points are not dominated by any other in both X and Y dimensions.
The BBS algorithm can also be extended using a divide-and-conquer strategy to simplify and
accelerate skyline computation on larger datasets. This approach breaks the problem into
smaller sub-problems and then merges the results, taking advantage of spatial dominance
relationships. The process involves the following three key phases:
1. Dividing the dataset into two regions along the X-axis (typically using a vertical
split at the median X-value), forming left and right halves.
2. Applying the BBS algorithm independently on each sub-region to compute local
skylines.
3. Merging the local skylines by performing a dominance check along the Y-axis.
The core principle behind this technique is that points in the left half cannot be dominated
by points in the right half due to their strictly better (lower) X-values. Therefore:
All skyline points from the left region are automatically included in the final result.
Skyline points from the right region are only included if their Y-values are lower than
all Y-values from the left region’s skyline.
This reduces the complexity of skyline comparison and makes the process more efficient.
39
40
Step 1: Splitting the Dataset
The dataset is divided into two halves using a vertical partition line at approximately X = 8.
41
Within each half, skyline points are identified by checking whether any point is dominated by
another point along the X-axis. This creates a filtered list of non-dominated points for both
the left and right halves.
1. All left-half skyline points are automatically accepted into the global skyline set.
2. Each right-half point is then compared against the Y-values of the left-half skylines:
o If a right-half point has a higher Y-value, it is discarded.
o If it has a lower Y-value than all left-half points, it is included.
o In this case, only D15 satisfies this condition and is added to the final skyline.
Final Output
The final skyline consists of the union of all left-half skyline points and any qualifying right-
half points.
Thus, the result is: D14, D4, D11, D6, D13, D15
42
These points form the skyline path under the divide-and-conquer model, and the remaining
right-half entries are excluded as they are dominated on the Y-axis by existing left-half
skyline members.
Conclusion
This report explored and evaluated advanced spatial search techniques using R-tree structures,
with a focus on two critical tasks: Nearest Neighbor Search (Task 1) and Skyline
Computation (Task 2). Each task incorporated multiple algorithmic approaches and was
designed to demonstrate the efficiency of spatial indexing on large-scale, two-dimensional
datasets.
In Task 1, we implemented three different strategies for Nearest Neighbor (NN) search: a
naïve Sequential Scan, a spatially optimized Best-First (BF) search using R-trees, and
a Divide-and-Conquer variant of the BF method. The R-tree construction followed best
practices such as balanced insertion, MBR updating, and perimeter-based splitting. The BF
algorithm showcased a significant improvement in performance by prioritizing search regions
based on proximity to the query point. The divide-and-conquer extension allowed for
localized R-tree construction, reducing overall query complexity and making the method
scalable for larger datasets.
Together, both tasks confirmed the superiority of R-tree-enhanced algorithms over brute-force
methods in terms of computational speed and resource usage. Execution time comparisons
validated these findings, with BBS and BF methods performing significantly faster than their
sequential counterparts—without compromising accuracy.
In summary, this assignment highlighted how spatial data structures (R-trees) and algorithmic
optimizations (BF, BBS, Divide-and-Conquer) can effectively solve complex real-world
problems in location-based services, recommendation systems, and decision-making support.
These methods not only scale well with data size but also maintain precision, making them
essential tools in the big data and spatial analytics domain.
43
Video Presentation Link:
https://mqoutlook-my.sharepoint.com/:p:/r/personal/
parthjayesh_doshi_students_mq_edu_au/_layouts/15/Doc.aspx?sourcedoc=%7B62198463-
8DF3-428B-9C82-679DCA74D292%7D&file=Big%20Data%20Group
%20Assignment.pptx&action=edit&mobileredirect=true
44