0% found this document useful (0 votes)
81 views44 pages

Big Data Group Assignment - Report

This report details the implementation of spatial query processing techniques using R-tree structures to address Nearest Neighbor (NN) search and Skyline computation. It outlines the algorithms used, including Best-First (BF) and Branch-and-Bound Skyline (BBS), and provides documentation on program requirements, execution, and performance analysis. The findings demonstrate significant efficiency improvements in spatial data handling through optimized algorithms and indexing methods.

Uploaded by

mahawark513
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
81 views44 pages

Big Data Group Assignment - Report

This report details the implementation of spatial query processing techniques using R-tree structures to address Nearest Neighbor (NN) search and Skyline computation. It outlines the algorithms used, including Best-First (BF) and Branch-and-Bound Skyline (BBS), and provides documentation on program requirements, execution, and performance analysis. The findings demonstrate significant efficiency improvements in spatial data handling through optimized algorithms and indexing methods.

Uploaded by

mahawark513
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

COMP3210/6210: Report for Assignment 2

Table of Contents
INTRODUCTION....................................................................................2

GROUP MEMBER INFORMATION.............................................................3

PROGRAM EXECUTION REQUIREMENTS..................................................3

PROGRAM ENVIRONMENT (E.G., OS, DATABASE, CPU, ETC.)...........................................3


Test Environment Used...................................................................................3
Recommended System Requirements............................................................3
INPUT FILES AND PARAMETERS (DIRECTORY SETTINGS AND OTHER PARAMETERS)..................5
ADDITIONAL REQUIREMENTS......................................................................................5

PROGRAM DOCUMENTATION.................................................................6

PROGRAM ORGANIZATION......................................................................................... 6
Task 1 Nearest Neighbor Search.....................................................................6
Task 1 Output................................................................................................. 7
Task 2 Skyline Search.....................................................................................9
Task 2 Output............................................................................................... 10
FUNCTION DESCRIPTION.......................................................................................... 11
Task 1 Nearest Neighbor Search...................................................................11
Task 2 Skyline Search...................................................................................13

ANALYZING BF ALGORITHM BASED NN SEARCH....................................14

THE PROCESS OF R-TREE CONSTRUCTION..................................................................15


THE PROCESS OF BF ALGORITHM.............................................................................18
THE PROCESS OF DIVIDE-AND-CONQUER....................................................................20

ANALYZING THE BBS ALGORITHM BASED SKYLINE SEARCH...................22

THE PROCESS OF R-TREE CONSTRUCTION..................................................................22


THE PROCESS OF BBS ALGORITHM...........................................................................26
THE PROCESS OF DIVIDE-AND-CONQUER....................................................................32

CONCLUSION......................................................................................35

VIDEO PRESENTATION LINK:...............................................................35

1
Introduction

The increasing volume of spatial data across industries—ranging from urban planning and
navigation systems to real estate analytics—has created a growing demand for efficient
spatial query processing techniques. Traditional linear search methods often fail to scale when
dealing with high-dimensional or large-volume datasets. This report investigates the use
of spatial indexing through R-tree structures to solve two core spatial problems: Nearest
Neighbor (NN) search and Skyline computation.

In the first part of the project, the focus lies on identifying the closest facility (e.g.,
restaurant) for each user query point using three different algorithms: sequential scan, R-
tree-based Best-First (BF) search, and a divide-and-conquer variant. This task emphasizes
proximity detection in dense spatial datasets, leveraging the hierarchical structure of R-trees
to significantly reduce computation time.

In the second part, the objective is to determine skyline points from a large dataset of
property listings. A skyline point is one that is not dominated by any other in terms of both
cost and size—providing the best trade-offs. Here, the Branch-and-Bound Skyline (BBS)
algorithm and its divide-and-conquer extension are implemented and compared against a
baseline brute-force approach. R-trees again serve as the foundational structure for efficient
region-based pruning and prioritization.

Together, these tasks demonstrate how spatial data structures like R-trees and algorithms such
as BF and BBS can be adapted to support fast and scalable spatial analysis. The report
includes detailed documentation of the implementation, system requirements, algorithmic
flow, and experimental results—showcasing practical insights into handling spatial queries at
scale.

2
Group Member Information

Name Student ID Email: Assigned Task


Coordinator:
Arjun Krishangi 43841924 [email protected] 3
Mahawar

Parth Jayesh
48356573 [email protected] 1
Doshi

Quynh Nhi Vu 46863370 [email protected] 2

Program Execution Requirements

Program Environment (e.g., OS, database, CPU, etc.)

Test Environment Used


Category Specification
Operating System macOS Sequoia (version 15)
CPU 8-Core
RAM 8GB
Storage 2TB
IDE Visual Studio Code with Python extensions
Python Version 3.12.4 (base)

Recommended System Requirements


Operating Systems

Operating System Details


MacOS Apple M2 chip
Linux Distributions Asahi Linux
Virtualization
Compatibility Any OS supporting Python 3.x environment

3
Hardware Requirements

Category Minimum Requirement Recommended Requirement


CPU Core Count: 8-core. Quad-core processor at 2.5 GHz
Performance Cores: 4 or higher
Efficiency Cores: 4
Architecture Apple M2 chip
Memory (RAM) 8GB Configurable up to 16GB 8GB or higher
to 24GB
Storage 256GB Configurable: Up to SSD recommended for
512GB, 1TB, or 2TB faster data processing

Python Requirements

Python Environment Details


Python Version Python 3.7 or higher (3.9+ recommended)
Package Manager pip
Virtual Environment venv/conda recommended

Required Python Libraries

Library Type Library Description


Built-in math Used for mathematical calculations
Libraries
time Used for performance measurements
sys Used for system-specific parameters
heapq Used for maintaining priority queues
External Library  Version 1.0.0 or higher
 Installation: pip install rtree
 Dependencies: libspatialindex C++ library
 Required for R-tree based algorithms and spatial indexing

4
Input Files and Parameters (directory settings and other parameters)
The project files are organized in the following directory structure:
“/Users/arjunkrishangimahawar/Desktop/MQ BA/Sem 3/Big Data/ Group Assignment/”

● Task1_Parth
 restaurant_dataset.txt
 query_points.txt

● Task2_Callanth
 city2.txt

All files should be placed in their respective task folders for the programs to run correctly.
The working directory should be set to the task folder (e.g., Task1 or Task2) when running
the programs.

Additional Requirements
As discussed in 2.1.2.4, the additional requirements will be required for installing r-tree.

5
Program Documentation

Program Organization

Task 1 Nearest Neighbor Search

Class/File Name Description


Text file containing 99,999 restaurant locations, formatted as ID, X-
coordinate (longitude), and Y-coordinate (latitude). Each line
represents a unique restaurant with a spatial position on a 2D map.
These restaurant points are treated as candidate facilities against
restaurant_dataset.txt which the 200 user query points are evaluated. This file is essential
for the R-tree index construction and forms the search space for all
three nearest neighbor algorithms. The dataset is large enough to
demonstrate the scalability and performance difference between
sequential scanning and spatial indexing using R-tree structures.
Text file containing 200 query points in the format ID, X-coordinate,
and Y-coordinate. Each line represents a user query point with a
unique ID and geographical coordinates (longitude and latitude).
query_points.txt
These points serve as inputs to the nearest neighbor search
algorithms, which identify the closest matching facility for each user
based on spatial proximity.
This Python script executes Task 1 for Nearest Neighbor Search
using three methods: Sequential Scan, Best-First (BF) with R-tree,
and BF with Divide-and-Conquer. It contains all the necessary
functions to load 99,999 restaurant data points and 200 query points,
task1.py construct the R-tree index, and apply each search method to
determine the nearest facility to every query. The results, including
nearest neighbor IDs and coordinates, are written to an output file
along with total and average execution times for performance
comparison.
This file contains the complete output of Task 1, recording the
results of all three nearest neighbor algorithms: Sequential Scan, BF
using R-tree, and BF with Divide-and-Conquer. For each of the 200
query points, it logs the ID and coordinates of the nearest restaurant
task1_output.txt identified by each method. Additionally, it reports the total execution
time and the average time per query for each algorithm. These
metrics are used to compare and evaluate the runtime performance
and efficiency gains of using spatial indexing and optimization
techniques in nearest neighbor search.

6
7
Task 1 Output

8
9
10
Task 2 Skyline Search

Class/File Name Description


This text file contains exactly 300,000 entries representing
home listings. Each line follows the format: ID, Cost (X), and
Size (Y). Cost represents the price of the home, while Size
represents the square meter area. The skyline algorithms aim to
city2.txt identify listings that offer a better balance of size and cost—
i.e., listings that are not dominated by others with both lower
price and larger area. The dataset is large enough to highlight
performance gains from optimized skyline computation
methods.
This Python script performs the Skyline Search task using
three methods: Sequential Scan, Branch-and-Bound Skyline
(BBS), and Divide-and-Conquer. It loads home listings from
the city2.txt dataset and determines which homes are part of
the skyline—i.e., not dominated in terms of cost (X) and size
task2.py
(Y). The script includes efficient skyline implementations
optimized for 2D datasets, including a sorted greedy BBS
approach and a recursive skyline merger for the divide-and-
conquer method. It computes execution time for each
algorithm and writes all skyline results to the output file.
This file contains the result of executing all three skyline
search methods on the dataset: Sequential Scan, BBS, and
Divide-and-Conquer. For each method, it lists the homes that
form the skyline, including their ID, cost, and size. At the end
skyline_output.txt
of each section, it records the total execution time, allowing
performance comparison. For example, BBS completed in
~0.19 seconds, significantly faster than the sequential scan
which took ~1.76 seconds on the same data.

11
Task 2 Output

12
13
Function Description

Task 1 Nearest Neighbor Search

Function Name Purpose Parameters Description


load_points(filename) Loads spatial point filename: File path Reads each line from the
data from a file as a string file, extracting ID, X, and
(restaurant or query Y. Returns a list of
points). dictionaries with the parsed
values.
euclidean_dist_sq(p1, Computes squared p1, p2: point Calculates (x1 - x2)^2 + (y1
p2) Euclidean distance dictionaries with 'x' - y2)^2. Avoids square root
between two points. and 'y' for efficiency during
comparison.
min_dist_sq(query, Computes minimum query: dict, mbr: Determines the squared
mbr) squared distance dict with x1, x2, y1, minimum distance from the
from query point to y2 point to the MBR edges.
a bounding Used for pruning in BF
rectangle. search.
insert(u, p) Inserts point p into u: Node object, p: Navigates down tree, adds
R-tree starting at point dictionary point to correct leaf,
node u. updates MBRs. Handles
node overflow via split if
needed.
split(u) Handles overflow u: Node object Evaluates all split
by splitting node u combinations (x/y sort),
into two valid chooses the one with least
nodes. total MBR perimeter.
add_child(node, Adds a child node node: parent Node, Appends child to parent’s
child) and updates parent's child: Node child_nodes and expands
MBR. the MBR to include the
child.
add_data_point(node, Adds a point to a node: Node, p: Adds point to data_points
p) node and updates its point dict and adjusts MBR to include
MBR. it.
update_mbr(node) Recalculates the node: Node object Updates the node's MBR
MBR of a node. based on contained child
MBRs or point coordinates.
choose_subtree(u, p) Selects best child u: Node, p: point Chooses child node with
node for inserting dict smallest perimeter increase
point p. from insertion.
peri_increase(node, Calculates node: Node, p: Used within choose_subtree
p) perimeter increase if point dict to compare insertion impact

14
p is added to node. geometrically.
is_leaf() Checks if node is a None Returns True if node has no
leaf node. children.
is_root() Checks if node is None Returns True if node’s
the root of the tree. parent is None.
is_overflow() Checks if node has None Returns True if node holds
exceeded max more entries than B=4.
capacity.
perimeter() Computes the None Used to compare MBR
perimeter of a compactness during split.
node’s MBR.
bf_search_nn(query) Best-First search to query: point Traverses tree using priority
find nearest dictionary queue sorted by min_dist to
neighbor using R- MBR. Returns closest point.
tree.
main() Runs Task 1 None Loads files, builds R-tree,
algorithms and performs all three NN
writes output. methods and writes results
to task1_output.txt.

15
Task 2 Skyline Search

Function Name Purpose Parameters Description


load_dataset(file_path) Loads home file_path: String Reads each line and
listing data (cost path to city2.txt extracts ID, X (cost), and
and size) from Y (size) into a list of
file. tuples. Used as the input
for skyline computation.
dominates(a, b) Checks whether a, b: Tuples (ID, X, Returns True if a has
point a Y) lower or equal cost (X)
dominates point and greater or equal size
b. (Y), and is strictly better
in at least one dimension.
skyline_sequential(points) Performs skyline points: List of (ID, Compares each point
search using X, Y) tuples with all others to
brute-force eliminate dominated
method. ones. Returns a sorted list
of skyline points.
bbs_skyline(points) Efficient skyline points: List of (ID, Sorts points by cost (X)
using sort-based X, Y) tuples ascending and size (Y)
2D BBS method. descending. Selects
skyline by checking if
size exceeds max seen so
far.
divide_and_conquer(points) Performs skyline points: List of (ID, Recursively splits dataset
search using X, Y) tuples into halves by cost (X),
divide-and- computes skyline for
conquer each, and merges them
algorithm. using 1D dominance
check.
merge_skylines_2d(left, Merges two left, right: Sorted Iterates over both lists,
right) skyline lists skyline lists keeping only points that
during divide- improve on max size seen
and-conquer. so far. Assumes lists are
sorted by X.
write_results(filename, Writes skyline filename: Output Writes the skyline points
results) results and file path (ID, X, Y) for each
execution times results: List of method to the file, along
to output file. tuples with runtime in seconds.
(method_name, Used to evaluate
skyline, time) performance.

16
Analyzing BF Algorithm based NN Search
In this section, a Best First (BF) algorithm will be used to search for the closest point to the
query point of (4, 7) as given in the sample dataset. An R-tree will be constructed, and
examples of the BF algorithm will be shown, first by a standard BF search, then using divide
and conquer. The points used in the R-tree construction can be found in the sample data set is
shown in the table below:
Sample Facility Points:
ID X Y
1 9 2
2 5 4
3 6 7
4 8 6
5 1 9
6 4 7
7 2 5
8 1 4
9 4 7
10 6 2
Query Point:
ID X Y
1 4 7

17
The Process of R-Tree Construction

The R-tree is a height-balanced spatial indexing structure used to efficiently organize and
search spatial data such as geographical coordinates. It groups nearby objects and represents
them with minimum bounding rectangles (MBRs) at each level. Each internal node in the tree
stores entries that act as pointers to child nodes, with their corresponding MBRs. Leaf nodes
contain the actual data points. R-trees help reduce the number of distance computations by
enabling region-based pruning, which is crucial for nearest neighbor searches using the Best-
First (BF) algorithm.
In this task, the R-tree is constructed using a sample dataset of 10 facility points and 1 query
point. The R-tree parameter B is set to 4, meaning a node can hold a maximum of four entries
before triggering a split. The dataset and query are as follows:

Principles Followed During R-Tree Construction


While constructing the R-tree for this task, the following principles were applied:
1. Points were inserted one by one in the order of their ID (insertion order).
2. The tree remains height-balanced, meaning all leaf nodes are at the same depth.
3. Each node can contain up to B = 4 entries. When a node exceeds this limit, it is split.
4. Splitting is performed using the minimum perimeter sum heuristic to ensure spatial
efficiency.
5. After a split, if the parent node exceeds its capacity, the overflow is propagated upward
recursively.
6. The Minimum Bounding Rectangle (MBR) of each node is updated to fully contain all its
child entries.
7. The final R-tree is organized such that the root node may have multiple children (R2, R3,
R4), each representing a subspace of the original dataset.

Step-by-Step R-Tree Insertion

Step 1: Insert (9, 2)

The very first point in the dataset is (9, 2). Since the R-tree is empty, this point is inserted as
the root node of the tree. The Minimum Bounding Rectangle (MBR) of the root initially
covers only this single point.
 New Root: Contains (9, 2)
 MBR: (9, 9, 2, 2)

18
Step 2: Insert (5, 4)

The point (5, 4) is inserted into the current root node. Since the root can still hold more
entries, the MBR is expanded to include both (9, 2) and (5, 4), covering the smallest possible
rectangle that contains them.
 Root Entries: [(9, 2), (5, 4)]
 MBR: (5, 9, 2, 4)

Step 3: Insert (6, 7)

Adding (6, 7) to the root node is still allowed, as the capacity hasn't been exceeded. The MBR
is updated to reflect the inclusion of this point.
 Root Entries: [(9, 2), (5, 4), (6, 7)]
 MBR: (5, 9, 2, 7)

Step 4: Insert (8, 6)

This point is also added directly to the root node, which now contains four entries (still within
limit). The updated MBR is recalculated accordingly.
 Root Entries: [(9, 2), (5, 4), (6, 7), (8, 6)]
 MBR: (5, 9, 2, 7)

Therefore, Suboptimal MBRS are as follows:


 R2: 16
 R3: 18
 Total Parameter: 16+ 18 = 34

Step 5: Insert (1, 9) → Causes Overflow

This point causes an overflow as the root already contains four entries. A split is triggered.
The existing points are sorted using X and Y coordinates to find the most optimal split. The
node is divided into two child nodes based on minimum perimeter expansion.
 R2: [(8, 6), (9, 2)] → MBR = (8, 9, 2, 6)
 R3: [(1, 9), (5, 4), (6, 7)] → MBR = (1, 6, 4, 9)
 New Root: Contains R2 and R3
 Root MBR: (1, 9, 2, 9)

Therefore, optimal MBRS are as follows:


 R2 :10
 R3:20
 Total Parameter: 20+10 = 30

19
Step 6: Insert (4, 7)

The point (4, 7) is closest to MBR of R3 and is inserted there. R3 still has capacity. MBR is
updated accordingly.
 R3 Entries: [(1, 9), (5, 4), (6, 7), (4, 7)]
 MBR R3: (1, 6, 4, 9)

Step 7: Insert (2, 5)

Point (2, 5) is also routed to R3. Now R3 exceeds the maximum capacity (B = 4), triggering
another split.
 R3 Split:
o R3: [(1, 9), (5, 4), (6, 7)] → MBR = (1, 6, 4, 9)
o R4: [(2, 5), (1, 4), (4, 7)] → MBR = (1, 4, 4, 7)

Step 8: Insert (1, 4)

This point is routed to R4 as it lies within or near its MBR. R4 can still accept more points.
 R4 Updated Entries: [(2, 5), (1, 4), (4, 7)]
 MBR R4: (1, 4, 4, 7)

Step 9: Insert (4, 3)

Point (4, 3) is inserted into R4. MBR is expanded to accommodate the new lower y-
coordinate.
 R4 Entries: [(2, 5), (1, 4), (4, 7), (4, 3)]
 MBR R4: (1, 4, 3, 7)

Step 10: Insert (6, 2)

This point is closest to R2 and is inserted there. R2’s MBR is updated accordingly.
 R2 Entries: [(8, 6), (9, 2), (6, 2)]
 MBR R2: (6, 9, 2, 6)

Final Tree Structure:


 Root R1 has three children: R2, R3, R4
 Each child contains up to 4 entries
 Leaf nodes hold the actual data points

20
The Process of BF Algorithm
Now that the R-tree has been constructed, we can begin querying points using the Best-First
(BF) algorithm. This approach efficiently finds the nearest neighbor by exploring the tree in
order of increasing minimum distance between the query point and the Minimum Bounding
Rectangles (MBRs) of the R-tree nodes.

The BF algorithm operates as follows:


1. Calculate the minimum distance (mindist) from the query point to each MBR at the
root level.
2. Insert the MBRs and their mindist values into a priority queue, sorted by ascending
distance.
3. Expand the node with the smallest mindist.
4. If the node is internal, insert its children (MBRs) into the queue.
5. If the node is a leaf, compute actual distances to the contained points.
6. The search terminates once the top of the queue has a distance greater than the current
best point found.

Query Point: (4, 7)


We begin by calculating the mindist from the query point to each child MBR of the root.

Based on the final R-tree, the root has three children:


 R2: Contains points d1 (9,2), d10 (6,2) → MBR = X: (6–9), Y: (2–2)
 R3: Contains points d3 (6,7), d4 (8,6) → MBR = X: (6–8), Y: (6–7)
 R4: Contains points d2 (5,4), d6 (3,5), d9 (4,3) → MBR = X: (3–5), Y: (3–5)
 R5: Contains points d5 (1,9), d7 (2,5), d8 (1,4) → MBR = X: (1–2), Y: (4–9)

Step 1: Calculate mindist from (4, 7) to each MBR

 R2:
o X distance = |4 - 6| = 2
o Y distance = |7 - 2| = 5
o Distance = √(2² + 5²) ≈ 5.39

 R3:
o X distance = |4 - 6| = 2
o Y is within the MBR’s Y-range → Y distance = 0
o Distance = √(2² + 0²) = 2.00

 R4:
o X is within the MBR → X distance = 0
o Y distance = |7 - 5| = 2
o Distance = √(0² + 2²) = 2.00

21
22
 R5:
o X distance = |4 - 2| = 2
o Y is within MBR range → Y distance = 0
o Distance = √(2² + 0²) = 2.00

Step 2: Insert all MBRs into Priority Queue

Sorted by distance:
1. R3 – 2.00
2. R4 – 2.00
3. R5 – 2.00
4. R2 – 5.39
Step 3: Traverse Nodes by Priority

1. Expand R3 (contains d3 and d4)


 d3 (6,7): distance = √((6–4)² + (7–7)²) = 2.00
 d4 (8,6): distance = √((8–4)² + (6–7)²) ≈ 4.12 → Best: d3 (6,7), dist = 2.00

2. Expand R4 (d2, d6, d9)


 d2 (5,4): ≈ 3.16
 d6 (3,5): ≈ 2.24
 d9 (4,3): = 4.00 → Closest in R4: d6 (3,5), dist ≈ 2.24

3. Expand R5 (d5, d7, d8)


 d5 (1,9): ≈ 3.61
 d7 (2,5): ≈ 2.83
 d8 (1,4): ≈ 4.24 → Closest in R5: d7 (2,5), dist ≈ 2.83

4. Expand R2 (d1, d10)


 d1 (9,2): ≈ 7.07
 d10 (6,2): ≈ 5.39 → Closest in R2: d10 (6,2), dist ≈ 5.39

Result
After processing all candidates, the closest point to query (4,7) is: → d3 (6,7) with
distance 2.00
Thus, the Best-First algorithm correctly identifies d3 (6,7) as the nearest neighbor.

23
The Process of Divide-and-Conquer
The divide-and-conquer strategy for Nearest Neighbor Search involves splitting the dataset
based on a vertical boundary to build two separate R-trees, and applying the Best-First
algorithm independently in each region. For our dataset of 10 points, we split based on X = 5,
dividing the dataset into left and right spatial regions:

Step 1: Data Division Based on X = 5


 Left Region (X < 5):

o Points: d5 (1,9), d8 (1,4), d7 (2,5), d6 (3,5), d9 (4,3)

 Right Region (X ≥ 5):

o Points: d2 (5,4), d3 (6,7), d10 (6,2), d4 (8,6), d1 (9,2)

Each region will construct its own R-tree and be queried independently.

Step 2: Minimum Distance from Query to Region MBRs


 Left Region MBR:

o X Range: 1 to 4

o Y Range: 3 to 9

o Since (4,7) is inside this MBR → mindist = 0

 Right Region MBR:

o X Range: 5 to 9

o Y Range: 2 to 7

o X Distance: |4 – 5| = 1, Y is inside → mindist = 1

Step 3: Insert Regions into Priority Queue


The queue is initialized as:

1. Left Region MBR – mindist = 0

2. Right Region MBR – mindist = 1

Step 4: Query Execution (Conquer Phase)

1. Expand Left Region (d5, d6, d7, d8, d9):

24
 d5 (1,9) → dist ≈ 3.61

 d6 (3,5) → dist ≈ 2.24

 d7 (2,5) → dist ≈ 2.83

 d8 (1,4) → dist ≈ 4.24

 d9 (4,3) → dist = 4.00 → Closest in Left: d6 (3,5), dist ≈ 2.24

25
2. Expand Right Region (d1, d2, d3, d4, d10):

 d1 (9,2) → dist ≈ 7.07

 d2 (5,4) → dist ≈ 3.16

 d3 (6,7) → dist = 2.00

 d4 (8,6) → dist ≈ 4.12

 d10 (6,2) → dist ≈ 5.39 → Closest in Right: d3 (6,7), dist = 2.00

Step 5: Result
After examining both regions, the closest point to the query (4,7) is:

→ d3 (6,7) with distance 2.00

The Divide-and-Conquer approach successfully identifies the correct nearest neighbor by


efficiently narrowing down the search within smaller subspaces.

26
Analyzing the BBS Algorithm based Skyline Search

The Process of R-Tree Construction


Sample Facility Points:
ID X Y
1 5 10
2 8 7
3 12 5
4 3 12
5 10 8
6 6 6
7 9 4
8 2 14
9 13 3
10 15 2
11 4 9
12 11 6
13 7 3
14 1 13
15 14 1

Step 1:

(Fig 5.1a: Showing creation of 2 MBRs)

As we can see in the image, after inserting the data points from D1 to D6,, we have organized
the data points into two Minimum Bounding Rectangles (R2 and R3), where R2 contains

27
points D1 and D4 in the upper region, while R3 encompasses points D2, D3, D5, and D6 in
the lower region, setting up the initial R-tree structure for the BBS Algorithm.

Step 2:

(Fig 5.1b: Showing creation of 3 MBRs and addition of data points)

In the image 5.2.a , it is subdivided region R3 by adding a new region R4 around point D6,
and included a new point D7 in the lower part of R3, continuing to refine the R-tree structure
for more precise spatial partitioning. This is done to handle the overflow and now the new
MBRs are as follows
● R2 = 3,5,10,12
● R3 = 9,12,4,8
● R4 = 6,8,6,7

Step 3:

28
(Fig 5.1c: Showing creation of 4 MBRs and addition of data points)

In Figure 3, after inserting new data points D8, D9, and D10, the R-tree structure needed
reorganization to handle overflow. The MBRs were split and redefined with new coordinates
as follows:
● R2 = 2,5,10,4
● R3 = 9,12,4,8
● R4 = 6,8,6,7
● R5 = 13,15,2,3

Step 4:

(Fig 5.1d: Adding data points)


29
In image 4, new data points D11, D12, and D13 have been added to the existing R-tree
structure. Here's the breakdown of how the regions are now organized:
● R2: Contains D1, D4, D8, and D11 (upper left region)
● R3: Contains D2, D3, D5, D7, and D12 (middle region)
● R4: Contains D6 and D13 (middle-left region)
● R5: Contains D9 and D10 (lower right region)

30
Step 5:

(Fig 5.1e: Showing Final MBRs)

In image 5, two significant changes have occurred in the R-tree structure:

1. A new point D14 has been added, which led to the creation of a new region R6 within
R2 to handle the overflow in the upper left area. R6 specifically contains D14 and D4.
2. A new point D15 has been added to region R5 in the lower right area, expanding the
existing R5 region to accommodate this new point alongside D9 and D10.

The R-tree now has six distinct regions (R2 through R6), with R2 being subdivided to include
the nested R6 region and R7 region.

The Final MBRs are as follows:


 R2 = 1,5,9,14
 R3 = 6,15,1,8
 R4 = 6,8,3,7
 R5 = 13,15,1,3
 R6 = 1,13,12,14
 R7 = 4,5,9,10

31
The Process of BBS Algorithm
The Branch-and-Bound Skyline (BBS) algorithm is applied to identify the most optimal
facility points from a 2D dataset, where each point represents a property defined by cost (X)
and size (Y). A skyline point in this scenario is one that is not dominated by any other facility
—that is, no other point is simultaneously cheaper and larger. The algorithm operates on the
R-tree constructed in Section 5.1, efficiently traversing regions based on their proximity to the
origin (0,0), and dynamically eliminating dominated entries. The use of the BBS algorithm
allows for rapid skyline computation while minimizing unnecessary evaluations by pruning
entire subspaces using dominance logic.

Step 1: Initialize the Priority Queue Based on MBR Distances

(Fig 5.2a: Priority Queue based on MBR Distances)

 Calculate the minimum distance from the origin (0,0) to each Minimum Bounding
Rectangle (MBR) at the root level of the R-tree.
 Insert MBRs into a priority queue based on ascending distance. Closer regions like
R6 and R4 are prioritized first.

32
Step 2: Insert Closest MBRs to the Queue and Begin Traversal

(Fig 5.2b: Closest MBR to the Queue)

 The algorithm begins traversal with the most promising MBRs (e.g., R6, R4), as
shown via directional arrows in visual references.
 This selection determines which spatial partitions will be expanded first to look for
skyline candidates.

33
Step 3: Identify Initial Skyline Points – D6 and D13

(Fig 5.2c: Initial Skyline Points)

 D6 is identified as the first skyline point; it is closest to the origin and not dominated
by any other region.
 D13 is then selected as the next skyline point, having also passed the domination test.
 Both are added to the skyline list.

Step 4: Prune Dominated Regions and Assess Unresolved Entries

34
(Fig 5.2d: Prune Dominated Regions and Assess Unresolved Entries)

 R8 and D2 are removed, as they are dominated by D6 and D13.


 D9 is tentatively retained for now; although dominated by D13, its region (R5) might
contain other non-dominated points.
 Status Update:
o Queue contains: {(R2, 9), (R5, 13)}
o Skyline points: D6, D13

Step 5: Evaluate New Candidates – D11 and Eliminate D1

35
(Fig 5.2e: Evaluation of new candidates and elimination)

 Region R2 is expanded. D11 is identified as undominated and added to the skyline.


 D1 is evaluated and found to be dominated by D11. It is removed.

36
Step 6: Assess R6 for Additional Skylines

(Fig 5.2f: Additional Skylines)

 Region R6 is processed. Points D14 and D4 are discovered and added to the skyline
list.
 D8 is removed, as it is dominated by D14.
 Status Update:
o Skyline points: D6, D13, D11, D14, D4

37
Step 7: Final Evaluation of Region R5

(Fig 5.2g: Final Evaluation)

 Region R5 is now processed.


 D9 is eliminated as it is dominated by D13.
 D15 is added to the skyline (undominated).
 D10 is removed, as it is dominated by D15.

Step 8: Final Output and Skyline Completion

38
(Fig 5.2h: Final Output and Skyline Completion)

 After evaluating all MBRs and their data points, the algorithm finalizes the skyline.
 Final Skyline Points: D14, D4, D11, D6, D13, D15
 These points are not dominated by any other in both X and Y dimensions.

The Process of Divide-and-Conquer

The BBS algorithm can also be extended using a divide-and-conquer strategy to simplify and
accelerate skyline computation on larger datasets. This approach breaks the problem into
smaller sub-problems and then merges the results, taking advantage of spatial dominance
relationships. The process involves the following three key phases:

1. Dividing the dataset into two regions along the X-axis (typically using a vertical
split at the median X-value), forming left and right halves.
2. Applying the BBS algorithm independently on each sub-region to compute local
skylines.
3. Merging the local skylines by performing a dominance check along the Y-axis.

The core principle behind this technique is that points in the left half cannot be dominated
by points in the right half due to their strictly better (lower) X-values. Therefore:

 All skyline points from the left region are automatically included in the final result.
 Skyline points from the right region are only included if their Y-values are lower than
all Y-values from the left region’s skyline.

This reduces the complexity of skyline comparison and makes the process more efficient.

39
40
Step 1: Splitting the Dataset

(Fig 5.3a: Splitting the Dataset)

The dataset is divided into two halves using a vertical partition line at approximately X = 8.

 Left half: D8, D14, D4, D1, D11, D6, D13


 Right half: D2, D5, D12, D7, D3, D9, D10, D15
These two subsets are then processed independently.

Step 2: X-Axis Domination Screening

(Fig 5.3b: Domination Screening)

41
Within each half, skyline points are identified by checking whether any point is dominated by
another point along the X-axis. This creates a filtered list of non-dominated points for both
the left and right halves.

Step 3: Y-Axis Domination Screening and Merge

In the final merge phase:

1. All left-half skyline points are automatically accepted into the global skyline set.
2. Each right-half point is then compared against the Y-values of the left-half skylines:
o If a right-half point has a higher Y-value, it is discarded.
o If it has a lower Y-value than all left-half points, it is included.
o In this case, only D15 satisfies this condition and is added to the final skyline.

Final Output

(Fig 5.3c: Final Output)

The final skyline consists of the union of all left-half skyline points and any qualifying right-
half points.

Thus, the result is: D14, D4, D11, D6, D13, D15

42
These points form the skyline path under the divide-and-conquer model, and the remaining
right-half entries are excluded as they are dominated on the Y-axis by existing left-half
skyline members.

Conclusion

This report explored and evaluated advanced spatial search techniques using R-tree structures,
with a focus on two critical tasks: Nearest Neighbor Search (Task 1) and Skyline
Computation (Task 2). Each task incorporated multiple algorithmic approaches and was
designed to demonstrate the efficiency of spatial indexing on large-scale, two-dimensional
datasets.

In Task 1, we implemented three different strategies for Nearest Neighbor (NN) search: a
naïve Sequential Scan, a spatially optimized Best-First (BF) search using R-trees, and
a Divide-and-Conquer variant of the BF method. The R-tree construction followed best
practices such as balanced insertion, MBR updating, and perimeter-based splitting. The BF
algorithm showcased a significant improvement in performance by prioritizing search regions
based on proximity to the query point. The divide-and-conquer extension allowed for
localized R-tree construction, reducing overall query complexity and making the method
scalable for larger datasets.

In Task 2, the Branch-and-Bound Skyline (BBS) algorithm was applied to a dataset of


300,000 home listings. Using an R-tree to spatially partition data, BBS efficiently identified
all non-dominated points—those offering optimal trade-offs between cost and size. Through
MBR-based pruning and distance-sorted traversal, the algorithm minimized unnecessary
comparisons. Furthermore, the Divide-and-Conquer skyline method demonstrated how
regional segmentation and intelligent merging could yield the same skyline output with
improved efficiency, particularly in high-volume data environments.

Together, both tasks confirmed the superiority of R-tree-enhanced algorithms over brute-force
methods in terms of computational speed and resource usage. Execution time comparisons
validated these findings, with BBS and BF methods performing significantly faster than their
sequential counterparts—without compromising accuracy.

In summary, this assignment highlighted how spatial data structures (R-trees) and algorithmic
optimizations (BF, BBS, Divide-and-Conquer) can effectively solve complex real-world
problems in location-based services, recommendation systems, and decision-making support.
These methods not only scale well with data size but also maintain precision, making them
essential tools in the big data and spatial analytics domain.

43
Video Presentation Link:

https://mqoutlook-my.sharepoint.com/:p:/r/personal/
parthjayesh_doshi_students_mq_edu_au/_layouts/15/Doc.aspx?sourcedoc=%7B62198463-
8DF3-428B-9C82-679DCA74D292%7D&file=Big%20Data%20Group
%20Assignment.pptx&action=edit&mobileredirect=true

44

You might also like