0% found this document useful (0 votes)

81 views44 pages

Big Data Group Assignment - Report

This report details the implementation of spatial query processing techniques using R-tree structures to address Nearest Neighbor (NN) search and Skyline computation. It outlines the algorithms used, including Best-First (BF) and Branch-and-Bound Skyline (BBS), and provides documentation on program requirements, execution, and performance analysis. The findings demonstrate significant efficiency improvements in spatial data handling through optimized algorithms and indexing methods.

Uploaded by

mahawark513

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

81 views44 pages

Big Data Group Assignment - Report

Uploaded by

mahawark513

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

COMP3210/6210: Report for Assignment 2

Table of Contents
INTRODUCTION....................................................................................2

GROUP MEMBER INFORMATION.............................................................3

PROGRAM EXECUTION REQUIREMENTS..................................................3

PROGRAM ENVIRONMENT (E.G., OS, DATABASE, CPU, ETC.)...........................................3

Test Environment Used...................................................................................3
Recommended System Requirements............................................................3
INPUT FILES AND PARAMETERS (DIRECTORY SETTINGS AND OTHER PARAMETERS)..................5
ADDITIONAL REQUIREMENTS......................................................................................5

PROGRAM DOCUMENTATION.................................................................6

PROGRAM ORGANIZATION......................................................................................... 6
Task 1 Nearest Neighbor Search.....................................................................6
Task 1 Output................................................................................................. 7
Task 2 Skyline Search.....................................................................................9
Task 2 Output............................................................................................... 10
FUNCTION DESCRIPTION.......................................................................................... 11
Task 1 Nearest Neighbor Search...................................................................11
Task 2 Skyline Search...................................................................................13

ANALYZING BF ALGORITHM BASED NN SEARCH....................................14

THE PROCESS OF R-TREE CONSTRUCTION..................................................................15

THE PROCESS OF BF ALGORITHM.............................................................................18
THE PROCESS OF DIVIDE-AND-CONQUER....................................................................20

ANALYZING THE BBS ALGORITHM BASED SKYLINE SEARCH...................22

THE PROCESS OF R-TREE CONSTRUCTION..................................................................22

THE PROCESS OF BBS ALGORITHM...........................................................................26
THE PROCESS OF DIVIDE-AND-CONQUER....................................................................32

CONCLUSION......................................................................................35

VIDEO PRESENTATION LINK:...............................................................35

1
Introduction

The increasing volume of spatial data across industries—ranging from urban planning and
navigation systems to real estate analytics—has created a growing demand for efficient
spatial query processing techniques. Traditional linear search methods often fail to scale when
dealing with high-dimensional or large-volume datasets. This report investigates the use
of spatial indexing through R-tree structures to solve two core spatial problems: Nearest
Neighbor (NN) search and Skyline computation.

In the first part of the project, the focus lies on identifying the closest facility (e.g.,
restaurant) for each user query point using three different algorithms: sequential scan, R-
tree-based Best-First (BF) search, and a divide-and-conquer variant. This task emphasizes
proximity detection in dense spatial datasets, leveraging the hierarchical structure of R-trees
to significantly reduce computation time.

In the second part, the objective is to determine skyline points from a large dataset of
property listings. A skyline point is one that is not dominated by any other in terms of both
cost and size—providing the best trade-offs. Here, the Branch-and-Bound Skyline (BBS)
algorithm and its divide-and-conquer extension are implemented and compared against a
baseline brute-force approach. R-trees again serve as the foundational structure for efficient
region-based pruning and prioritization.

Together, these tasks demonstrate how spatial data structures like R-trees and algorithms such
as BF and BBS can be adapted to support fast and scalable spatial analysis. The report
includes detailed documentation of the implementation, system requirements, algorithmic
flow, and experimental results—showcasing practical insights into handling spatial queries at
scale.

2
Group Member Information

Name Student ID Email: Assigned Task

Coordinator:
Arjun Krishangi 43841924 [email protected] 3
Mahawar

Parth Jayesh
48356573 [email protected] 1
Doshi

Quynh Nhi Vu 46863370 [email protected] 2

Program Execution Requirements

Program Environment (e.g., OS, database, CPU, etc.)

Test Environment Used

Category Specification
Operating System macOS Sequoia (version 15)
CPU 8-Core
RAM 8GB
Storage 2TB
IDE Visual Studio Code with Python extensions
Python Version 3.12.4 (base)

Recommended System Requirements

Operating Systems

Operating System Details

MacOS Apple M2 chip
Linux Distributions Asahi Linux
Virtualization
Compatibility Any OS supporting Python 3.x environment

3
Hardware Requirements

Category Minimum Requirement Recommended Requirement

CPU Core Count: 8-core. Quad-core processor at 2.5 GHz
Performance Cores: 4 or higher
Efficiency Cores: 4
Architecture Apple M2 chip
Memory (RAM) 8GB Configurable up to 16GB 8GB or higher
to 24GB
Storage 256GB Configurable: Up to SSD recommended for
512GB, 1TB, or 2TB faster data processing

Python Requirements

Python Environment Details

Python Version Python 3.7 or higher (3.9+ recommended)
Package Manager pip
Virtual Environment venv/conda recommended

Required Python Libraries

Library Type Library Description

Built-in math Used for mathematical calculations
Libraries
time Used for performance measurements
sys Used for system-specific parameters
heapq Used for maintaining priority queues
External Library  Version 1.0.0 or higher
 Installation: pip install rtree
 Dependencies: libspatialindex C++ library
 Required for R-tree based algorithms and spatial indexing

4
Input Files and Parameters (directory settings and other parameters)
The project files are organized in the following directory structure:
“/Users/arjunkrishangimahawar/Desktop/MQ BA/Sem 3/Big Data/ Group Assignment/”

● Task1_Parth
 restaurant_dataset.txt
 query_points.txt

● Task2_Callanth
 city2.txt

All files should be placed in their respective task folders for the programs to run correctly.
The working directory should be set to the task folder (e.g., Task1 or Task2) when running
the programs.

Additional Requirements
As discussed in 2.1.2.4, the additional requirements will be required for installing r-tree.

5
Program Documentation

Program Organization

Task 1 Nearest Neighbor Search

Class/File Name Description

Text file containing 99,999 restaurant locations, formatted as ID, X-
coordinate (longitude), and Y-coordinate (latitude). Each line
represents a unique restaurant with a spatial position on a 2D map.
These restaurant points are treated as candidate facilities against
restaurant_dataset.txt which the 200 user query points are evaluated. This file is essential
for the R-tree index construction and forms the search space for all
three nearest neighbor algorithms. The dataset is large enough to
demonstrate the scalability and performance difference between
sequential scanning and spatial indexing using R-tree structures.
Text file containing 200 query points in the format ID, X-coordinate,
and Y-coordinate. Each line represents a user query point with a
unique ID and geographical coordinates (longitude and latitude).
query_points.txt
These points serve as inputs to the nearest neighbor search
algorithms, which identify the closest matching facility for each user
based on spatial proximity.
This Python script executes Task 1 for Nearest Neighbor Search
using three methods: Sequential Scan, Best-First (BF) with R-tree,
and BF with Divide-and-Conquer. It contains all the necessary
functions to load 99,999 restaurant data points and 200 query points,
task1.py construct the R-tree index, and apply each search method to
determine the nearest facility to every query. The results, including
nearest neighbor IDs and coordinates, are written to an output file
along with total and average execution times for performance
comparison.
This file contains the complete output of Task 1, recording the
results of all three nearest neighbor algorithms: Sequential Scan, BF
using R-tree, and BF with Divide-and-Conquer. For each of the 200
query points, it logs the ID and coordinates of the nearest restaurant
task1_output.txt identified by each method. Additionally, it reports the total execution
time and the average time per query for each algorithm. These
metrics are used to compare and evaluate the runtime performance
and efficiency gains of using spatial indexing and optimization
techniques in nearest neighbor search.

6
7
Task 1 Output

8
9
10
Task 2 Skyline Search

Class/File Name Description

This text file contains exactly 300,000 entries representing
home listings. Each line follows the format: ID, Cost (X), and
Size (Y). Cost represents the price of the home, while Size
represents the square meter area. The skyline algorithms aim to
city2.txt identify listings that offer a better balance of size and cost—
i.e., listings that are not dominated by others with both lower
price and larger area. The dataset is large enough to highlight
performance gains from optimized skyline computation
methods.
This Python script performs the Skyline Search task using
three methods: Sequential Scan, Branch-and-Bound Skyline
(BBS), and Divide-and-Conquer. It loads home listings from
the city2.txt dataset and determines which homes are part of
the skyline—i.e., not dominated in terms of cost (X) and size
task2.py
(Y). The script includes efficient skyline implementations
optimized for 2D datasets, including a sorted greedy BBS
approach and a recursive skyline merger for the divide-and-
conquer method. It computes execution time for each
algorithm and writes all skyline results to the output file.
This file contains the result of executing all three skyline
search methods on the dataset: Sequential Scan, BBS, and
Divide-and-Conquer. For each method, it lists the homes that
form the skyline, including their ID, cost, and size. At the end
skyline_output.txt
of each section, it records the total execution time, allowing
performance comparison. For example, BBS completed in
~0.19 seconds, significantly faster than the sequential scan
which took ~1.76 seconds on the same data.

11
Task 2 Output

12
13
Function Description

Task 1 Nearest Neighbor Search

Function Name Purpose Parameters Description

load_points(filename) Loads spatial point filename: File path Reads each line from the
data from a file as a string file, extracting ID, X, and
(restaurant or query Y. Returns a list of
points). dictionaries with the parsed
values.
euclidean_dist_sq(p1, Computes squared p1, p2: point Calculates (x1 - x2)^2 + (y1
p2) Euclidean distance dictionaries with 'x' - y2)^2. Avoids square root
between two points. and 'y' for efficiency during
comparison.
min_dist_sq(query, Computes minimum query: dict, mbr: Determines the squared
mbr) squared distance dict with x1, x2, y1, minimum distance from the
from query point to y2 point to the MBR edges.
a bounding Used for pruning in BF
rectangle. search.
insert(u, p) Inserts point p into u: Node object, p: Navigates down tree, adds
R-tree starting at point dictionary point to correct leaf,
node u. updates MBRs. Handles
node overflow via split if
needed.
split(u) Handles overflow u: Node object Evaluates all split
by splitting node u combinations (x/y sort),
into two valid chooses the one with least
nodes. total MBR perimeter.
add_child(node, Adds a child node node: parent Node, Appends child to parent’s
child) and updates parent's child: Node child_nodes and expands
MBR. the MBR to include the
child.
add_data_point(node, Adds a point to a node: Node, p: Adds point to data_points
p) node and updates its point dict and adjusts MBR to include
MBR. it.
update_mbr(node) Recalculates the node: Node object Updates the node's MBR
MBR of a node. based on contained child
MBRs or point coordinates.
choose_subtree(u, p) Selects best child u: Node, p: point Chooses child node with
node for inserting dict smallest perimeter increase
point p. from insertion.
peri_increase(node, Calculates node: Node, p: Used within choose_subtree
p) perimeter increase if point dict to compare insertion impact

14
p is added to node. geometrically.
is_leaf() Checks if node is a None Returns True if node has no
leaf node. children.
is_root() Checks if node is None Returns True if node’s
the root of the tree. parent is None.
is_overflow() Checks if node has None Returns True if node holds
exceeded max more entries than B=4.
capacity.
perimeter() Computes the None Used to compare MBR
perimeter of a compactness during split.
node’s MBR.
bf_search_nn(query) Best-First search to query: point Traverses tree using priority
find nearest dictionary queue sorted by min_dist to
neighbor using R- MBR. Returns closest point.
tree.
main() Runs Task 1 None Loads files, builds R-tree,
algorithms and performs all three NN
writes output. methods and writes results
to task1_output.txt.

15
Task 2 Skyline Search

Function Name Purpose Parameters Description

load_dataset(file_path) Loads home file_path: String Reads each line and
listing data (cost path to city2.txt extracts ID, X (cost), and
and size) from Y (size) into a list of
file. tuples. Used as the input
for skyline computation.
dominates(a, b) Checks whether a, b: Tuples (ID, X, Returns True if a has
point a Y) lower or equal cost (X)
dominates point and greater or equal size
b. (Y), and is strictly better
in at least one dimension.
skyline_sequential(points) Performs skyline points: List of (ID, Compares each point
search using X, Y) tuples with all others to
brute-force eliminate dominated
method. ones. Returns a sorted list
of skyline points.
bbs_skyline(points) Efficient skyline points: List of (ID, Sorts points by cost (X)
using sort-based X, Y) tuples ascending and size (Y)
2D BBS method. descending. Selects
skyline by checking if
size exceeds max seen so
far.
divide_and_conquer(points) Performs skyline points: List of (ID, Recursively splits dataset
search using X, Y) tuples into halves by cost (X),
divide-and- computes skyline for
conquer each, and merges them
algorithm. using 1D dominance
check.
merge_skylines_2d(left, Merges two left, right: Sorted Iterates over both lists,
right) skyline lists skyline lists keeping only points that
during divide- improve on max size seen
and-conquer. so far. Assumes lists are
sorted by X.
write_results(filename, Writes skyline filename: Output Writes the skyline points
results) results and file path (ID, X, Y) for each
execution times results: List of method to the file, along
to output file. tuples with runtime in seconds.
(method_name, Used to evaluate
skyline, time) performance.

16
Analyzing BF Algorithm based NN Search
In this section, a Best First (BF) algorithm will be used to search for the closest point to the
query point of (4, 7) as given in the sample dataset. An R-tree will be constructed, and
examples of the BF algorithm will be shown, first by a standard BF search, then using divide
and conquer. The points used in the R-tree construction can be found in the sample data set is
shown in the table below:
Sample Facility Points:
ID X Y
1 9 2
2 5 4
3 6 7
4 8 6
5 1 9
6 4 7
7 2 5
8 1 4
9 4 7
10 6 2
Query Point:
ID X Y
1 4 7

17
The Process of R-Tree Construction

The R-tree is a height-balanced spatial indexing structure used to efficiently organize and
search spatial data such as geographical coordinates. It groups nearby objects and represents
them with minimum bounding rectangles (MBRs) at each level. Each internal node in the tree
stores entries that act as pointers to child nodes, with their corresponding MBRs. Leaf nodes
contain the actual data points. R-trees help reduce the number of distance computations by
enabling region-based pruning, which is crucial for nearest neighbor searches using the Best-
First (BF) algorithm.
In this task, the R-tree is constructed using a sample dataset of 10 facility points and 1 query
point. The R-tree parameter B is set to 4, meaning a node can hold a maximum of four entries
before triggering a split. The dataset and query are as follows:

Principles Followed During R-Tree Construction

While constructing the R-tree for this task, the following principles were applied:
1. Points were inserted one by one in the order of their ID (insertion order).
2. The tree remains height-balanced, meaning all leaf nodes are at the same depth.
3. Each node can contain up to B = 4 entries. When a node exceeds this limit, it is split.
4. Splitting is performed using the minimum perimeter sum heuristic to ensure spatial
efficiency.
5. After a split, if the parent node exceeds its capacity, the overflow is propagated upward
recursively.
6. The Minimum Bounding Rectangle (MBR) of each node is updated to fully contain all its
child entries.
7. The final R-tree is organized such that the root node may have multiple children (R2, R3,
R4), each representing a subspace of the original dataset.

Step-by-Step R-Tree Insertion

Step 1: Insert (9, 2)

The very first point in the dataset is (9, 2). Since the R-tree is empty, this point is inserted as
the root node of the tree. The Minimum Bounding Rectangle (MBR) of the root initially
covers only this single point.
 New Root: Contains (9, 2)
 MBR: (9, 9, 2, 2)

18
Step 2: Insert (5, 4)

The point (5, 4) is inserted into the current root node. Since the root can still hold more
entries, the MBR is expanded to include both (9, 2) and (5, 4), covering the smallest possible
rectangle that contains them.
 Root Entries: [(9, 2), (5, 4)]
 MBR: (5, 9, 2, 4)

Step 3: Insert (6, 7)

Adding (6, 7) to the root node is still allowed, as the capacity hasn't been exceeded. The MBR
is updated to reflect the inclusion of this point.
 Root Entries: [(9, 2), (5, 4), (6, 7)]
 MBR: (5, 9, 2, 7)

Step 4: Insert (8, 6)

This point is also added directly to the root node, which now contains four entries (still within
limit). The updated MBR is recalculated accordingly.
 Root Entries: [(9, 2), (5, 4), (6, 7), (8, 6)]
 MBR: (5, 9, 2, 7)

Therefore, Suboptimal MBRS are as follows:

 R2: 16
 R3: 18
 Total Parameter: 16+ 18 = 34

Step 5: Insert (1, 9) → Causes Overflow

This point causes an overflow as the root already contains four entries. A split is triggered.
The existing points are sorted using X and Y coordinates to find the most optimal split. The
node is divided into two child nodes based on minimum perimeter expansion.
 R2: [(8, 6), (9, 2)] → MBR = (8, 9, 2, 6)
 R3: [(1, 9), (5, 4), (6, 7)] → MBR = (1, 6, 4, 9)
 New Root: Contains R2 and R3
 Root MBR: (1, 9, 2, 9)

Therefore, optimal MBRS are as follows:

 R2 :10
 R3:20
 Total Parameter: 20+10 = 30

19
Step 6: Insert (4, 7)

The point (4, 7) is closest to MBR of R3 and is inserted there. R3 still has capacity. MBR is
updated accordingly.
 R3 Entries: [(1, 9), (5, 4), (6, 7), (4, 7)]
 MBR R3: (1, 6, 4, 9)

Step 7: Insert (2, 5)

Point (2, 5) is also routed to R3. Now R3 exceeds the maximum capacity (B = 4), triggering
another split.
 R3 Split:
o R3: [(1, 9), (5, 4), (6, 7)] → MBR = (1, 6, 4, 9)
o R4: [(2, 5), (1, 4), (4, 7)] → MBR = (1, 4, 4, 7)

Step 8: Insert (1, 4)

This point is routed to R4 as it lies within or near its MBR. R4 can still accept more points.
 R4 Updated Entries: [(2, 5), (1, 4), (4, 7)]
 MBR R4: (1, 4, 4, 7)

Step 9: Insert (4, 3)

Point (4, 3) is inserted into R4. MBR is expanded to accommodate the new lower y-
coordinate.
 R4 Entries: [(2, 5), (1, 4), (4, 7), (4, 3)]
 MBR R4: (1, 4, 3, 7)

Step 10: Insert (6, 2)

This point is closest to R2 and is inserted there. R2’s MBR is updated accordingly.
 R2 Entries: [(8, 6), (9, 2), (6, 2)]
 MBR R2: (6, 9, 2, 6)

Final Tree Structure:

 Root R1 has three children: R2, R3, R4
 Each child contains up to 4 entries
 Leaf nodes hold the actual data points

20
The Process of BF Algorithm
Now that the R-tree has been constructed, we can begin querying points using the Best-First
(BF) algorithm. This approach efficiently finds the nearest neighbor by exploring the tree in
order of increasing minimum distance between the query point and the Minimum Bounding
Rectangles (MBRs) of the R-tree nodes.

The BF algorithm operates as follows:

1. Calculate the minimum distance (mindist) from the query point to each MBR at the
root level.
2. Insert the MBRs and their mindist values into a priority queue, sorted by ascending
distance.
3. Expand the node with the smallest mindist.
4. If the node is internal, insert its children (MBRs) into the queue.
5. If the node is a leaf, compute actual distances to the contained points.
6. The search terminates once the top of the queue has a distance greater than the current
best point found.

Query Point: (4, 7)

We begin by calculating the mindist from the query point to each child MBR of the root.

Based on the final R-tree, the root has three children:

 R2: Contains points d1 (9,2), d10 (6,2) → MBR = X: (6–9), Y: (2–2)
 R3: Contains points d3 (6,7), d4 (8,6) → MBR = X: (6–8), Y: (6–7)
 R4: Contains points d2 (5,4), d6 (3,5), d9 (4,3) → MBR = X: (3–5), Y: (3–5)
 R5: Contains points d5 (1,9), d7 (2,5), d8 (1,4) → MBR = X: (1–2), Y: (4–9)

Step 1: Calculate mindist from (4, 7) to each MBR

 R2:
o X distance = |4 - 6| = 2
o Y distance = |7 - 2| = 5
o Distance = √(2² + 5²) ≈ 5.39

 R3:
o X distance = |4 - 6| = 2
o Y is within the MBR’s Y-range → Y distance = 0
o Distance = √(2² + 0²) = 2.00

 R4:
o X is within the MBR → X distance = 0
o Y distance = |7 - 5| = 2
o Distance = √(0² + 2²) = 2.00

21
22
 R5:
o X distance = |4 - 2| = 2
o Y is within MBR range → Y distance = 0
o Distance = √(2² + 0²) = 2.00

Step 2: Insert all MBRs into Priority Queue

Sorted by distance:
1. R3 – 2.00
2. R4 – 2.00
3. R5 – 2.00
4. R2 – 5.39
Step 3: Traverse Nodes by Priority

1. Expand R3 (contains d3 and d4)

 d3 (6,7): distance = √((6–4)² + (7–7)²) = 2.00
 d4 (8,6): distance = √((8–4)² + (6–7)²) ≈ 4.12 → Best: d3 (6,7), dist = 2.00

2. Expand R4 (d2, d6, d9)

 d2 (5,4): ≈ 3.16
 d6 (3,5): ≈ 2.24
 d9 (4,3): = 4.00 → Closest in R4: d6 (3,5), dist ≈ 2.24

3. Expand R5 (d5, d7, d8)

 d5 (1,9): ≈ 3.61
 d7 (2,5): ≈ 2.83
 d8 (1,4): ≈ 4.24 → Closest in R5: d7 (2,5), dist ≈ 2.83

4. Expand R2 (d1, d10)

 d1 (9,2): ≈ 7.07
 d10 (6,2): ≈ 5.39 → Closest in R2: d10 (6,2), dist ≈ 5.39

Result
After processing all candidates, the closest point to query (4,7) is: → d3 (6,7) with
distance 2.00
Thus, the Best-First algorithm correctly identifies d3 (6,7) as the nearest neighbor.

23
The Process of Divide-and-Conquer
The divide-and-conquer strategy for Nearest Neighbor Search involves splitting the dataset
based on a vertical boundary to build two separate R-trees, and applying the Best-First
algorithm independently in each region. For our dataset of 10 points, we split based on X = 5,
dividing the dataset into left and right spatial regions:

Step 1: Data Division Based on X = 5

 Left Region (X < 5):

o Points: d5 (1,9), d8 (1,4), d7 (2,5), d6 (3,5), d9 (4,3)

 Right Region (X ≥ 5):

o Points: d2 (5,4), d3 (6,7), d10 (6,2), d4 (8,6), d1 (9,2)

Each region will construct its own R-tree and be queried independently.

Step 2: Minimum Distance from Query to Region MBRs

 Left Region MBR:

o X Range: 1 to 4

o Y Range: 3 to 9

o Since (4,7) is inside this MBR → mindist = 0

 Right Region MBR:

o X Range: 5 to 9

o Y Range: 2 to 7

o X Distance: |4 – 5| = 1, Y is inside → mindist = 1

Step 3: Insert Regions into Priority Queue

The queue is initialized as:

1. Left Region MBR – mindist = 0

2. Right Region MBR – mindist = 1

Step 4: Query Execution (Conquer Phase)

1. Expand Left Region (d5, d6, d7, d8, d9):

24
 d5 (1,9) → dist ≈ 3.61

 d6 (3,5) → dist ≈ 2.24

 d7 (2,5) → dist ≈ 2.83

 d8 (1,4) → dist ≈ 4.24

 d9 (4,3) → dist = 4.00 → Closest in Left: d6 (3,5), dist ≈ 2.24

25
2. Expand Right Region (d1, d2, d3, d4, d10):

 d1 (9,2) → dist ≈ 7.07

 d2 (5,4) → dist ≈ 3.16

 d3 (6,7) → dist = 2.00

 d4 (8,6) → dist ≈ 4.12

 d10 (6,2) → dist ≈ 5.39 → Closest in Right: d3 (6,7), dist = 2.00

Step 5: Result
After examining both regions, the closest point to the query (4,7) is:

→ d3 (6,7) with distance 2.00

The Divide-and-Conquer approach successfully identifies the correct nearest neighbor by

efficiently narrowing down the search within smaller subspaces.

26
Analyzing the BBS Algorithm based Skyline Search

The Process of R-Tree Construction

Sample Facility Points:
ID X Y
1 5 10
2 8 7
3 12 5
4 3 12
5 10 8
6 6 6
7 9 4
8 2 14
9 13 3
10 15 2
11 4 9
12 11 6
13 7 3
14 1 13
15 14 1

Step 1:

(Fig 5.1a: Showing creation of 2 MBRs)

As we can see in the image, after inserting the data points from D1 to D6,, we have organized
the data points into two Minimum Bounding Rectangles (R2 and R3), where R2 contains

27
points D1 and D4 in the upper region, while R3 encompasses points D2, D3, D5, and D6 in
the lower region, setting up the initial R-tree structure for the BBS Algorithm.

Step 2:

(Fig 5.1b: Showing creation of 3 MBRs and addition of data points)

In the image 5.2.a , it is subdivided region R3 by adding a new region R4 around point D6,
and included a new point D7 in the lower part of R3, continuing to refine the R-tree structure
for more precise spatial partitioning. This is done to handle the overflow and now the new
MBRs are as follows
● R2 = 3,5,10,12
● R3 = 9,12,4,8
● R4 = 6,8,6,7

Step 3:

28
(Fig 5.1c: Showing creation of 4 MBRs and addition of data points)

In Figure 3, after inserting new data points D8, D9, and D10, the R-tree structure needed
reorganization to handle overflow. The MBRs were split and redefined with new coordinates
as follows:
● R2 = 2,5,10,4
● R3 = 9,12,4,8
● R4 = 6,8,6,7
● R5 = 13,15,2,3

Step 4:

(Fig 5.1d: Adding data points)

29
In image 4, new data points D11, D12, and D13 have been added to the existing R-tree
structure. Here's the breakdown of how the regions are now organized:
● R2: Contains D1, D4, D8, and D11 (upper left region)
● R3: Contains D2, D3, D5, D7, and D12 (middle region)
● R4: Contains D6 and D13 (middle-left region)
● R5: Contains D9 and D10 (lower right region)

30
Step 5:

(Fig 5.1e: Showing Final MBRs)

In image 5, two significant changes have occurred in the R-tree structure:

1. A new point D14 has been added, which led to the creation of a new region R6 within
R2 to handle the overflow in the upper left area. R6 specifically contains D14 and D4.
2. A new point D15 has been added to region R5 in the lower right area, expanding the
existing R5 region to accommodate this new point alongside D9 and D10.

The R-tree now has six distinct regions (R2 through R6), with R2 being subdivided to include
the nested R6 region and R7 region.

The Final MBRs are as follows:

 R2 = 1,5,9,14
 R3 = 6,15,1,8
 R4 = 6,8,3,7
 R5 = 13,15,1,3
 R6 = 1,13,12,14
 R7 = 4,5,9,10

31
The Process of BBS Algorithm
The Branch-and-Bound Skyline (BBS) algorithm is applied to identify the most optimal
facility points from a 2D dataset, where each point represents a property defined by cost (X)
and size (Y). A skyline point in this scenario is one that is not dominated by any other facility
—that is, no other point is simultaneously cheaper and larger. The algorithm operates on the
R-tree constructed in Section 5.1, efficiently traversing regions based on their proximity to the
origin (0,0), and dynamically eliminating dominated entries. The use of the BBS algorithm
allows for rapid skyline computation while minimizing unnecessary evaluations by pruning
entire subspaces using dominance logic.

Step 1: Initialize the Priority Queue Based on MBR Distances

(Fig 5.2a: Priority Queue based on MBR Distances)

 Calculate the minimum distance from the origin (0,0) to each Minimum Bounding
Rectangle (MBR) at the root level of the R-tree.
 Insert MBRs into a priority queue based on ascending distance. Closer regions like
R6 and R4 are prioritized first.

32
Step 2: Insert Closest MBRs to the Queue and Begin Traversal

(Fig 5.2b: Closest MBR to the Queue)

 The algorithm begins traversal with the most promising MBRs (e.g., R6, R4), as
shown via directional arrows in visual references.
 This selection determines which spatial partitions will be expanded first to look for
skyline candidates.

33
Step 3: Identify Initial Skyline Points – D6 and D13

(Fig 5.2c: Initial Skyline Points)

 D6 is identified as the first skyline point; it is closest to the origin and not dominated
by any other region.
 D13 is then selected as the next skyline point, having also passed the domination test.
 Both are added to the skyline list.

Step 4: Prune Dominated Regions and Assess Unresolved Entries

34
(Fig 5.2d: Prune Dominated Regions and Assess Unresolved Entries)

 R8 and D2 are removed, as they are dominated by D6 and D13.

 D9 is tentatively retained for now; although dominated by D13, its region (R5) might
contain other non-dominated points.
 Status Update:
o Queue contains: {(R2, 9), (R5, 13)}
o Skyline points: D6, D13

Step 5: Evaluate New Candidates – D11 and Eliminate D1

35
(Fig 5.2e: Evaluation of new candidates and elimination)

 Region R2 is expanded. D11 is identified as undominated and added to the skyline.

 D1 is evaluated and found to be dominated by D11. It is removed.

36
Step 6: Assess R6 for Additional Skylines

(Fig 5.2f: Additional Skylines)

 Region R6 is processed. Points D14 and D4 are discovered and added to the skyline
list.
 D8 is removed, as it is dominated by D14.
 Status Update:
o Skyline points: D6, D13, D11, D14, D4

37
Step 7: Final Evaluation of Region R5

(Fig 5.2g: Final Evaluation)

 Region R5 is now processed.

 D9 is eliminated as it is dominated by D13.
 D15 is added to the skyline (undominated).
 D10 is removed, as it is dominated by D15.

Step 8: Final Output and Skyline Completion

38
(Fig 5.2h: Final Output and Skyline Completion)

 After evaluating all MBRs and their data points, the algorithm finalizes the skyline.
 Final Skyline Points: D14, D4, D11, D6, D13, D15
 These points are not dominated by any other in both X and Y dimensions.

The Process of Divide-and-Conquer

The BBS algorithm can also be extended using a divide-and-conquer strategy to simplify and
accelerate skyline computation on larger datasets. This approach breaks the problem into
smaller sub-problems and then merges the results, taking advantage of spatial dominance
relationships. The process involves the following three key phases:

1. Dividing the dataset into two regions along the X-axis (typically using a vertical
split at the median X-value), forming left and right halves.
2. Applying the BBS algorithm independently on each sub-region to compute local
skylines.
3. Merging the local skylines by performing a dominance check along the Y-axis.

The core principle behind this technique is that points in the left half cannot be dominated
by points in the right half due to their strictly better (lower) X-values. Therefore:

 All skyline points from the left region are automatically included in the final result.
 Skyline points from the right region are only included if their Y-values are lower than
all Y-values from the left region’s skyline.

This reduces the complexity of skyline comparison and makes the process more efficient.

39
40
Step 1: Splitting the Dataset

(Fig 5.3a: Splitting the Dataset)

The dataset is divided into two halves using a vertical partition line at approximately X = 8.

 Left half: D8, D14, D4, D1, D11, D6, D13

 Right half: D2, D5, D12, D7, D3, D9, D10, D15
These two subsets are then processed independently.

Step 2: X-Axis Domination Screening

(Fig 5.3b: Domination Screening)

41
Within each half, skyline points are identified by checking whether any point is dominated by
another point along the X-axis. This creates a filtered list of non-dominated points for both
the left and right halves.

Step 3: Y-Axis Domination Screening and Merge

In the final merge phase:

1. All left-half skyline points are automatically accepted into the global skyline set.
2. Each right-half point is then compared against the Y-values of the left-half skylines:
o If a right-half point has a higher Y-value, it is discarded.
o If it has a lower Y-value than all left-half points, it is included.
o In this case, only D15 satisfies this condition and is added to the final skyline.

Final Output

(Fig 5.3c: Final Output)

The final skyline consists of the union of all left-half skyline points and any qualifying right-
half points.

Thus, the result is: D14, D4, D11, D6, D13, D15

42
These points form the skyline path under the divide-and-conquer model, and the remaining
right-half entries are excluded as they are dominated on the Y-axis by existing left-half
skyline members.

Conclusion

This report explored and evaluated advanced spatial search techniques using R-tree structures,
with a focus on two critical tasks: Nearest Neighbor Search (Task 1) and Skyline
Computation (Task 2). Each task incorporated multiple algorithmic approaches and was
designed to demonstrate the efficiency of spatial indexing on large-scale, two-dimensional
datasets.

In Task 1, we implemented three different strategies for Nearest Neighbor (NN) search: a
naïve Sequential Scan, a spatially optimized Best-First (BF) search using R-trees, and
a Divide-and-Conquer variant of the BF method. The R-tree construction followed best
practices such as balanced insertion, MBR updating, and perimeter-based splitting. The BF
algorithm showcased a significant improvement in performance by prioritizing search regions
based on proximity to the query point. The divide-and-conquer extension allowed for
localized R-tree construction, reducing overall query complexity and making the method
scalable for larger datasets.

In Task 2, the Branch-and-Bound Skyline (BBS) algorithm was applied to a dataset of

300,000 home listings. Using an R-tree to spatially partition data, BBS efficiently identified
all non-dominated points—those offering optimal trade-offs between cost and size. Through
MBR-based pruning and distance-sorted traversal, the algorithm minimized unnecessary
comparisons. Furthermore, the Divide-and-Conquer skyline method demonstrated how
regional segmentation and intelligent merging could yield the same skyline output with
improved efficiency, particularly in high-volume data environments.

Together, both tasks confirmed the superiority of R-tree-enhanced algorithms over brute-force
methods in terms of computational speed and resource usage. Execution time comparisons
validated these findings, with BBS and BF methods performing significantly faster than their
sequential counterparts—without compromising accuracy.

In summary, this assignment highlighted how spatial data structures (R-trees) and algorithmic
optimizations (BF, BBS, Divide-and-Conquer) can effectively solve complex real-world
problems in location-based services, recommendation systems, and decision-making support.
These methods not only scale well with data size but also maintain precision, making them
essential tools in the big data and spatial analytics domain.

43
Video Presentation Link:

https://mqoutlook-my.sharepoint.com/:p:/r/personal/
parthjayesh_doshi_students_mq_edu_au/_layouts/15/Doc.aspx?sourcedoc=%7B62198463-
8DF3-428B-9C82-679DCA74D292%7D&file=Big%20Data%20Group
%20Assignment.pptx&action=edit&mobileredirect=true

Spatial Data Indexing Techniques Explained
No ratings yet
Spatial Data Indexing Techniques Explained
56 pages
Project in DSA Java
No ratings yet
Project in DSA Java
5 pages
Spatial Data Management
No ratings yet
Spatial Data Management
7 pages
Performance of Nearest Neighbor Queries in R-Trees
No ratings yet
Performance of Nearest Neighbor Queries in R-Trees
32 pages
Ranking Spatial Data by Quality Preferences: Abstract
0% (1)
Ranking Spatial Data by Quality Preferences: Abstract
16 pages
Manual of Applied Spatial Ecology
No ratings yet
Manual of Applied Spatial Ecology
190 pages
Discover Qgis3 Toc
No ratings yet
Discover Qgis3 Toc
7 pages
Kurt Menke - Discover QGIS 3.x - Second Edition A Workbook For Classroom or Independent Study-Locate Press (2022)
83% (6)
Kurt Menke - Discover QGIS 3.x - Second Edition A Workbook For Classroom or Independent Study-Locate Press (2022)
430 pages
Vor-Tree: R-Trees With Voronoi Diagrams For Efficient Processing of Spatial Nearest Neighbor Queries
No ratings yet
Vor-Tree: R-Trees With Voronoi Diagrams For Efficient Processing of Spatial Nearest Neighbor Queries
12 pages
Arnab Bhattacharya - Fundamentals of Database Indexing and Searching
No ratings yet
Arnab Bhattacharya - Fundamentals of Database Indexing and Searching
280 pages
Calculating The Longest Dominated Locations: Abstract
No ratings yet
Calculating The Longest Dominated Locations: Abstract
3 pages
1.1 About Spatial Mining
No ratings yet
1.1 About Spatial Mining
53 pages
NoSQL Databases Course Guide
No ratings yet
NoSQL Databases Course Guide
10 pages
Spatial R
No ratings yet
Spatial R
75 pages
Optimizing Pandas for Performance
No ratings yet
Optimizing Pandas for Performance
48 pages
An Effective Framework For Skyline Queries Using PCA
No ratings yet
An Effective Framework For Skyline Queries Using PCA
5 pages
Efficient Ranking On Road Network by Quality Preferences
No ratings yet
Efficient Ranking On Road Network by Quality Preferences
10 pages
Unit 2
No ratings yet
Unit 2
25 pages
Postgis-1 4 0
No ratings yet
Postgis-1 4 0
274 pages
Spatial Data Science With Applications in R 1st Edition All Chapters Included
100% (13)
Spatial Data Science With Applications in R 1st Edition All Chapters Included
17 pages
ML Lab - BCSL606
No ratings yet
ML Lab - BCSL606
67 pages
Analysis PDF
No ratings yet
Analysis PDF
135 pages
Raste Rio
No ratings yet
Raste Rio
269 pages
Spatial Analysis
100% (3)
Spatial Analysis
133 pages
A Dive Into Spatial Search Algorithms - by Vladimir Agafonkin - Maps For Developers
No ratings yet
A Dive Into Spatial Search Algorithms - by Vladimir Agafonkin - Maps For Developers
19 pages
Spatial Data
No ratings yet
Spatial Data
75 pages
Postgis-3 1 1
No ratings yet
Postgis-3 1 1
874 pages
Rasterio Readthedocs Io en Latest
No ratings yet
Rasterio Readthedocs Io en Latest
273 pages
Hijmans 2021 - Spatial Data in R
No ratings yet
Hijmans 2021 - Spatial Data in R
85 pages
Business Space Recommendation System
No ratings yet
Business Space Recommendation System
80 pages
Irange
No ratings yet
Irange
15 pages
ANNmanual
No ratings yet
ANNmanual
36 pages
Installing Faiss
No ratings yet
Installing Faiss
51 pages
Python 4 Data Science
No ratings yet
Python 4 Data Science
561 pages
CT 1
No ratings yet
CT 1
7 pages
Query Processing Optimization
No ratings yet
Query Processing Optimization
20 pages
Postgis-1 5 3 PDF
No ratings yet
Postgis-1 5 3 PDF
334 pages
Radian Measure of One Revolution
No ratings yet
Radian Measure of One Revolution
334 pages
(The R Series) Edzer Pebesma, Roger Bivand - Spatial Data Science - With Applications in R-CRC Press (2023)
100% (1)
(The R Series) Edzer Pebesma, Roger Bivand - Spatial Data Science - With Applications in R-CRC Press (2023)
315 pages
Multidimensional Search Trees
No ratings yet
Multidimensional Search Trees
100 pages
Postgis-1 5 PDF
No ratings yet
Postgis-1 5 PDF
326 pages
Postgis-1 5 PDF
No ratings yet
Postgis-1 5 PDF
326 pages
EliteQuant High-Frequency Trading System
No ratings yet
EliteQuant High-Frequency Trading System
3 pages
C++ City Class and Menu System Design
No ratings yet
C++ City Class and Menu System Design
4 pages
Manual Biblioteca SF Do R
No ratings yet
Manual Biblioteca SF Do R
68 pages
Final Project
No ratings yet
Final Project
14 pages
G3 - R-Tree, R+-Tree
No ratings yet
G3 - R-Tree, R+-Tree
47 pages
PostGIS 1.5.1 User Manual
No ratings yet
PostGIS 1.5.1 User Manual
3 pages
QGIS 2.2 QGISTrainingManual en
No ratings yet
QGIS 2.2 QGISTrainingManual en
625 pages
QGIS 2.14 QGISTrainingManual en
100% (1)
QGIS 2.14 QGISTrainingManual en
639 pages
Fiona
No ratings yet
Fiona
83 pages
Level 2 MIT Exercises
No ratings yet
Level 2 MIT Exercises
5 pages
Arduino and Raspberry Pi Lab Guide
No ratings yet
Arduino and Raspberry Pi Lab Guide
1 page
QGIS 2.8 Training Manual Guide
No ratings yet
QGIS 2.8 Training Manual Guide
627 pages
Swiggy Recommendation System
No ratings yet
Swiggy Recommendation System
6 pages
ApacheCon 2018 Geo 3 Calcite Spatial Hyde
No ratings yet
ApacheCon 2018 Geo 3 Calcite Spatial Hyde
48 pages
DG8245V 10
No ratings yet
DG8245V 10
1 page
PySide Introduction
No ratings yet
PySide Introduction
16 pages
CV - Luu Huy Hoang
No ratings yet
CV - Luu Huy Hoang
2 pages
Data Engineering Learning Path Guide
No ratings yet
Data Engineering Learning Path Guide
3 pages
Network Video Decoder User Manual
No ratings yet
Network Video Decoder User Manual
109 pages
Laptop Basics for Beginners
No ratings yet
Laptop Basics for Beginners
4 pages
Microsoft Dynamics Nav:: 23 Tips and Tricks For The New Dynamics Nav User
No ratings yet
Microsoft Dynamics Nav:: 23 Tips and Tricks For The New Dynamics Nav User
9 pages
Dhaanish Ahmed College of Engineering
100% (1)
Dhaanish Ahmed College of Engineering
2 pages
Computer Knowledge Quiz Questions
No ratings yet
Computer Knowledge Quiz Questions
6 pages
Computer Programming 1 Module Final 2024-2025 - First Year
No ratings yet
Computer Programming 1 Module Final 2024-2025 - First Year
129 pages
Be Electronics and Telecommunication Engineering Semester 5 2023 November Microcontrollers Pattern 2019
No ratings yet
Be Electronics and Telecommunication Engineering Semester 5 2023 November Microcontrollers Pattern 2019
2 pages
Summer 22 Model Ans
0% (1)
Summer 22 Model Ans
29 pages
Grade 6 Answer Key - Model QP
No ratings yet
Grade 6 Answer Key - Model QP
3 pages
Aruba Central Personal Wireless Network Guide
No ratings yet
Aruba Central Personal Wireless Network Guide
13 pages
TVL CSS11 Q4 M14
No ratings yet
TVL CSS11 Q4 M14
11 pages
Chapter 4 Fundamental of Classes
No ratings yet
Chapter 4 Fundamental of Classes
26 pages
Audit in CIS Environment Tasks
No ratings yet
Audit in CIS Environment Tasks
4 pages
SQL Injection Lab
No ratings yet
SQL Injection Lab
11 pages
Bugreport CPH2387T2 TP1A.220905.001 2023 11 22 23 48 25 Dumpstate - Log 28114
No ratings yet
Bugreport CPH2387T2 TP1A.220905.001 2023 11 22 23 48 25 Dumpstate - Log 28114
33 pages
Install Open Sans & DesignBall Fonts
No ratings yet
Install Open Sans & DesignBall Fonts
530 pages
Digital Logic Design Syllabus
No ratings yet
Digital Logic Design Syllabus
4 pages
COMP100 2017 Main
No ratings yet
COMP100 2017 Main
9 pages
CN Lab Viva Questions
No ratings yet
CN Lab Viva Questions
4 pages
Comp 3000 - Exam Notes Before Midterm 1
No ratings yet
Comp 3000 - Exam Notes Before Midterm 1
42 pages
Installation: V500 Cordless Notebook Mouse V500 Souris Notebook Sans Fil
No ratings yet
Installation: V500 Cordless Notebook Mouse V500 Souris Notebook Sans Fil
2 pages
2.5 Programming Languages and IDE + Answers
No ratings yet
2.5 Programming Languages and IDE + Answers
23 pages
CS201 Mcqs FinalTerm by Vu Topper RM
No ratings yet
CS201 Mcqs FinalTerm by Vu Topper RM
52 pages
32-Authentication Requirements, Authentication Functions-01-03-2025
No ratings yet
32-Authentication Requirements, Authentication Functions-01-03-2025
39 pages
XJTAG Boundary Scan Tools Overview
No ratings yet
XJTAG Boundary Scan Tools Overview
2 pages
Sem1 Pgdca Syllabus
No ratings yet
Sem1 Pgdca Syllabus
13 pages