Data Structures and Algorithms
Compiled by:
ELIAS A. AUSTRIA
LYDINAR D. DASTAS
MICHAEL B. DELA FUENTE
RIA A. SAGUM
Overview:
Data structure is an essential area of study for computer scientists and for anyone who will ever
undertake any serious programming task. This course deals with the fundamentals of organizing
and manipulating data efficiently using clean conceptual models. In particular, the emphasis is
on the organization of information, the implementation of common data structures and techniques
of data abstraction. Implementations in this course may be carried out in any programming
language, but the principles are generally applicable to most modern programming
environments.
Data structures are containers that hold objects of various data types. There are several common
data structures, and each has its own behavior and applications. Common data structures include
arrays of one or more dimensions, stacks, linked lists (singly and doubly linked), queues, trees
(balanced, binary, and so on), graphs, and hash tables. Understanding data structures helps the
student understand how they behave and when to use each of them.
Course Information:
Course Code: COMP 20063
Course Title: Data Structures and Algorithms
No. of Units: 3
No. of Hours: 3
Course Outcomes:
After successful completion of this instructional material, you should be able to:
• Demonstrate the ability to analyze, design, apply, and use data structures and
algorithms to solve engineering problems and evaluate their solutions.
Course Contents:
Unit 1 – Overview of Data Structures. In this course we will study data structures and learn
how to write efficient programs. Efficiency here has nothing to do with programming tricks,
but rather with good organization of information and good algorithms that save computer
memory and running time.
Unit 2 – Algorithm Analysis. Algorithm analysis is the theoretical study of a computer program's
performance and resource usage. In this unit we present the idea of algorithms and how
performance and resource usage affect the effectiveness of an algorithm.
Unit 3 – Arrays. Most programming languages, in one way or another, have their own
implementation of an array. In this unit we present the array, which is used by many other
data structures to implement their algorithms.
Unit 4 – Lists. Lists provide a convenient way of keeping track of things in our everyday life:
shopping lists, laundry lists, lists of things to be done, and so on. Similarly, lists provide a
convenient way of representing arbitrary nonnumeric information in a computer. The
remainder of this unit will deal with the list as a formal data structure and will explain how it
is represented and manipulated in a computer.
Unit 5 – Stacks. Stacks are dynamic data structures that follow the Last In First Out (LIFO)
principle. The last item to be inserted into a stack is the first one to be deleted from it. In
this unit we will explore stacks and their applications and learn how to implement them.
Unit 6 – Queues. Queues are data structures that follow the First In First Out (FIFO) principle,
wherein the first element that is added to the queue is the first one to be removed. In
this unit we will explore queues and their applications and learn how to implement them.
Unit 7 – Trees. We have already considered two data structures, the stack and the queue. Both
are linear in nature and are represented in the memory either by a sequentially allocated
vector (array) or by a linked linear list. In this unit we now focus on two important non-linear
structures: the tree and the binary tree.
Unit 8 – Balanced Trees. The search time in a binary search tree depends on the form of the
tree, that is, on the order in which its nodes were inserted. In practice, however, one cannot
always guarantee that binary search trees are built randomly. Binary search trees
are thus only interesting if they are “relatively complete.” So we must look for
specializations of binary search trees whose worst-case performance on the basic tree
operations can be guaranteed to be good, that is, O(log₂ n) time. In this unit we will talk
about balanced trees as a solution to the problem of search time. In particular, we will look
into the balance properties of binary search trees and m-way search trees.
Unit 9 – Graphs. In computer science, a graph is an abstract data type that is meant to
implement the undirected graph and directed graph concepts from the field of graph
theory within mathematics. In this unit we will focus on the basic concepts of graphs, their
representations, and searching in graphs.
Unit 10 – Hashing and Collision. As data structures deal with how data is represented in
memory (primary storage), file structures deal more with how data is represented on
secondary storage, with emphasis on file access. In this unit we will discuss the different
types of file structures, the different techniques used in file access, and how to resolve
some of the issues related to file access.
References:
Weiss, Mark Allen (2014). Data Structures and Algorithm Analysis in C++ (4th ed.)
Grading System:
• Assignments
• Unit Activities
• Programming Problems
Examination 30%
TOTAL 100%
Table of Contents
Data Structures and Algorithms .................................................................................... i
Overview:............................................................................................................................................ ii
Course Information: .......................................................................................................................... ii
Course Code: COMP 20063 ........................................................................................................... ii
Course Title: Data Structures and Algorithms ............................................................................. ii
No. of Units: 3 ................................................................................................................................. ii
No. of Hours: 3 ................................................................................................................................. ii
Course Outcomes:............................................................................................................................. ii
Unit 1 – Overview of Data Structures ........................................................................... 1
Overview:............................................................................................................................................ 1
Module Objectives: ............................................................................................................................ 1
Course Materials:............................................................................................................................... 1
Unit 2 – Algorithm Analysis .......................................................................................... 8
Overview:............................................................................................................................................ 8
Module Objectives: ............................................................................................................................ 8
Course Materials:............................................................................................................................... 8
Activities/Assessments: ................................................................................................................. 19
Unit 3 – Arrays .............................................................................................................. 20
Overview:.......................................................................................................................................... 20
Module Objectives: .......................................................................................................................... 20
Course Materials:............................................................................................................................. 20
Activities/Assessments: ................................................................................................................. 28
Unit 4 – Linked Lists .................................................................................................... 30
Overview:.......................................................................................................................................... 30
Module Objectives: .......................................................................................................................... 30
Course Materials:............................................................................................................................. 30
Activities/Assessments: ................................................................................................................. 36
Unit 5 – Stacks .............................................................................................................. 40
Overview:.......................................................................................................................................... 40
Module Objectives: .......................................................................................................................... 40
Course Materials:............................................................................................................................. 40
Activities/Assessments: ................................................................................................................. 51
Unit 6 - Queues ............................................................................................................. 53
Overview:.......................................................................................................................................... 53
Module Objectives: .......................................................................................................................... 53
Course Materials:............................................................................................................................. 53
Activities/Assessments: ................................................................................................................. 61
Unit 7 - Trees ................................................................................................................ 62
Overview:.......................................................................................................................................... 62
Module Objectives: .......................................................................................................................... 62
Course Materials:............................................................................................................................. 62
Activities/Assessments: ................................................................................................................. 83
Unit 8 – Balanced Trees ............................................................................................... 86
Overview:.......................................................................................................................................... 86
Module Objectives: .......................................................................................................................... 86
Course Materials:............................................................................................................................. 86
Activities/Assessments: ............................................................................................................... 102
Unit 9 - Graphs ........................................................................................................... 103
Overview:........................................................................................................................................ 103
Module Objectives: ........................................................................................................................ 103
Course Materials:........................................................................................................................... 103
Activities/Assessments: ............................................................................................................... 109
Unit 10 – Hashing and Collision ............................................................................... 111
Overview:........................................................................................................................................ 111
Module Objectives: ........................................................................................................................ 111
Course Materials:........................................................................................................................... 111
Activities/Assessments: ............................................................................................................... 128
Unit 1 – Overview of Data Structures
Overview:
In this course we will study data structures and learn how to write efficient programs. Efficiency here
has nothing to do with programming tricks, but rather with good organization of information and good
algorithms that save computer memory and running time.
Module Objectives:
After successful completion of this module, you should be able to:
• Define and differentiate data structures, data type and abstract data type
• Discuss how one will select a data structure
• Describe the different data structures
• Knowledgeably converse about the concepts of abstract data types
Course Materials:
Learning about computers does not stop at learning how to write programs that run on them. It
is also important to know how data is represented in memory (data structures) and how it is
represented on disk (file structures). A data structure is a way of collecting and organizing data
in such a way that we can perform operations on these data in an effective and sometimes efficient
way.
Writing efficient programs requires efficient data structures. By efficient we mean that a problem
has to be solved within the given time and space constraints. Each problem puts constraints on
time and space. For example, in banking, opening or starting an account will require a few
minutes. Doing transactions, say withdrawing or depositing cash, requires a few seconds. But
closing an account normally takes overnight. Here we see constraints on time and space. A
solution is efficient if it solves the problem within its space and time constraints. The cost of a
solution equates to the amount of resources it consumes.
Selecting a data structure for a problem typically involves the following steps:
1. Analyze the problem to determine the resource constraints a solution must meet.
2. Determine the operations that must be supported (e.g., record search, insertion, deletion,
etc.).
3. Quantify the constraints for each operation (e.g., search operations must be very fast).
4. Select the data structure that best meets these requirements.
Cost and Benefits
Each data structure requires space for each data item it stores, time to perform each
operation, and some programming effort to implement it. Every data structure has costs and
benefits. Rarely is one data structure better than another in all situations. One may permit faster
search (or insertion or deletion) operations than another, but that does not mean it is better
than the rest. Consider also whether all operations are of equal importance.
Earlier we defined a data structure as a way of collecting and organizing data in such a way
that we can perform operations on these data in an effective and sometimes efficient way. This
notion of data structures involves both the data being stored and the operations that can be
performed on them.
From this notion, data structures are structures programmed to store ordered data so that
various operations can be performed on them easily. A data structure represents the knowledge of
how data is to be organized in memory, both primary and secondary. It should be designed and
implemented in such a way that it reduces complexity and increases efficiency.
Data type is another term usually used synonymously with data structure, but it is in itself
different from a data structure. In programming, a data type is a set of values from which a variable,
constant, function, or other expression may take its value. For example, an integer takes on whole
numbers, real numbers represent fixed-point and floating-point values, a character takes on
alphanumeric values, a Boolean represents a true or false value, while date/time represents a range
of values depicting dates and times. A data type can be thought of as a set of the “same kind” of data
processed by the computer. Data has to be examined carefully so that the most appropriate data
type can be used.
Basically, anything that can store data can be called a data structure; hence Integer,
Float, Boolean, Char, etc., are all data structures. They are known as Primitive Data Structures
or Basic Types. Figure 1 and Table 1 represent two ways of looking at data structures. Though
differently labeled, with varying categories or classifications, you can see that they show almost
the same thing.
Basic Data Types or Primitive Data Structures represent individual data items and are
frequently used to create a program. They are sometimes called atomic data structures, as they
represent a form where data can no longer be divided or has no parts. This group can be further
divided into Simple Types and Pointer Types.
Simple Type
Simple type is the most basic data type which is usually declared according to the
syntax rule of a programming language. Most common data types fall under this group.
• Logical type – sometimes referred to as Boolean type where the values are used
in performing logical operations such as AND, OR, and NOT.
• Enumeration type – a data type that enumerates all possible values of variables.
Pointer Type
Pointer types are addresses of locations allocated in main memory. Pointer types
are used to refer to variables, file records, or functions.
(Figure: a pointer variable holding the address of a variable “b”.)
Structure Type or Simple Data Structure is a data structure that contains a basic data type
or any of the defined data types as its elements. Arrays, strings, and records are examples of
Structure Types. They represent a collection of data elements. An array type, or array, is simply a finite
set of elements having the same type, referenced under a common name. If this type is a character
type, we refer to it as a string, or a collection of character elements. A record type, on the other hand,
is also a set of elements, but this time of different data types referenced under a common name.
This is useful if we would like to organize a collection of data but do not want to manage the items
separately (individually).
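To illustrate the difference in C (a small sketch; the field names and values are made up for this example), an array groups elements of one type, while a record - a struct in C - groups elements of different types under one name:

#include <stdio.h>

/* Array type: a finite set of elements of the same type under one name. */
int scores[3] = { 85, 90, 78 };

/* Record type: elements of different types under one common name. */
struct student {
    char  name[20];   /* character string */
    int   year;       /* integer          */
    float gpa;        /* floating point   */
};

int main(void)
{
    struct student s = { "Ana", 2, 1.75f };
    printf("%d %s %d %.2f\n", scores[0], s.name, s.year, s.gpa);
    return 0;
}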
Abstract Data Type falls under the Basic Data Structures but represents those classified as
Problem-oriented Data Structures. An Abstract Data Type, or ADT, is almost always synonymous with
a data structure, but it represents more of a logical description (an abstraction) rather than an actual
implementation. An ADT is basically a mathematical model where a set of data values and its
associated operations are precisely specified independent of any particular implementation.
An ADT is a kind of data abstraction where a type's internal form is hidden (information hiding)
behind a set of access functions. By data abstraction, we mean any implementation of data in
which the implementation details are hidden (abstracted) from the user. Hiding data at the level
of the data type is often called data encapsulation.
Think of an ADT as a black box where inputs and outputs are known but how things are
programmed is not (ADTs hide implementation details). This idea keeps internal calculations
private, providing better modularity, and allows for localized changes and a better division of labor. A
data structure is the implementation of an ADT, where the operations associated with the ADT are
implemented by one or more functions.
A data type can thus be viewed in two forms:
1. Logical form: the definition of the data item within an ADT (e.g., integers in the mathematical
sense, with operations such as +, -, *, /).
2. Physical form: the implementation of the data item (e.g., 16- or 32-bit integers).
Defining an ADT depends on the application one is building, and there can be different
definitions for the same application. An ADT hides implementation details, allowing different
implementations for the same ADT. Once the ADT is given, its data type can be used by the
programmer (e.g., the string and math libraries in C). When the implementation changes, the
programs that use it need not change.
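A minimal sketch of this idea in C (the Counter type, its functions, and its representation are illustrative, not taken from any particular library): the interface exposes only what callers may use, while the physical form stays hidden in the implementation and can change without affecting those callers.

/* Interface (what users of the ADT see): the internal form of Counter is hidden. */
typedef struct Counter Counter;
Counter *counter_create(void);
void     counter_increment(Counter *c);
int      counter_value(const Counter *c);

/* Implementation (normally kept in a separate source file): one possible physical form. */
#include <stdlib.h>

struct Counter {
    int value;                 /* could later change, e.g., to a wider type */
};

Counter *counter_create(void)
{
    Counter *c = (Counter *) malloc(sizeof(Counter));
    c->value = 0;
    return c;
}

void counter_increment(Counter *c)
{
    c->value++;
}

int counter_value(const Counter *c)
{
    return c->value;
}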
The different data structures can also be classified on the basis of their characteristics.
The following table presents the different characteristics and how they are described.
Table 2 Classification of Data Structures based on Characteristics
Characteristic Description
Linear In linear data structures, the data items are arranged in a linear
sequence. Example: Array
Non-Linear For non-linear data structures, the data items are not in sequence.
Example: Tree, Graph
Homogeneous Homogeneous data structures represent a structure whose elements
are of the same type. Example: Array
Non-Homogeneous In Non-homogeneous data structure, the elements may or may not be
of the same type. Example: Structures
Static Static data structures are those whose sizes, structures, and associated
memory locations are fixed at compile time. Example: Array
Dynamic Dynamic data structures are those which expand or shrink depending
upon the program's needs and its execution. Also, their associated
memory locations change. Example: Linked list created using
pointers.
Data structure types are determined by what types of operations are required or what
kinds of algorithms are going to be applied. These types include:
• Arrays - An array stores a collection of items at adjoining memory locations. Items that
are the same type get stored together so that the position of each element can be
calculated or retrieved easily. Arrays can be fixed or flexible in length.
• Stacks - A stack stores a collection of items in the linear order in which operations are applied.
Items are added and removed at the same end, following the last in, first out (LIFO) principle.
• Queues - A queue stores a collection of items similar to a stack; however, the operation
order can only be first in first out.
• Linked lists - A linked list stores a collection of items in a linear order. Each element, or
node, in a linked list contains a data item as well as a reference, or link, to the next item
in the list.
• Trees - A tree stores a collection of items in an abstract, hierarchical way. Each node is
linked to other nodes and can have multiple sub-values, also known as children.
• Graphs - A graph stores a collection of items in a non-linear fashion. Graphs are made
up of a finite set of nodes, also known as vertices, and lines that connect them, also known
as edges. These are useful for representing real-life systems such as computer networks.
• Tries - A trie, or keyword tree, is a data structure that stores strings as data items organized
in a tree, where keys that share a common prefix share a path from the root.
• Hash tables - A hash table, or a hash map, stores a collection of items in an associative
array that maps keys to values. A hash table uses a hash function to compute an index into
an array of buckets that contain the desired data item.
These are considered complex data structures as they can store large amounts of
interconnected data. Examples of primitive, or basic, data structures are integers, floats, Booleans
and characters.
In general, data structures are used to implement the physical forms of abstract data types.
This can be translated into a variety of applications, such as displaying a relational database as
a binary tree.
In programming languages, data structures are used to organize code and information in
a digital space. For example, Python lists and dictionaries or JavaScript arrays and objects are
common coding structures used for storing and retrieving information. Data structures are also a
crucial part of designing efficient software.
Data structures are essential for managing large amounts of data, such as information
kept in databases or indexing services, efficiently. Proper maintenance of data systems requires
the identification of memory allocation, data interrelationships and data processes, all of which
data structures help with.
Additionally, it is important not only to use data structures but also to choose the
proper data structure for each task. Choosing an ill-suited data structure could result in slow
runtimes or unresponsive code. A few factors to consider when picking a data structure include
what kind of information will be stored, where existing data should be placed, how data should be
sorted, and how much memory should be reserved for the data.
Watch:
Review:
Unit 2 – Algorithm Analysis
Overview:
Algorithm analysis, or analysis of algorithms, is the theoretical study of a computer program's
performance and resource usage. In this unit, we present the idea of algorithms and how
performance and resource usage affect the effectiveness of an algorithm.
Module Objectives:
At the end of this module, you should be able to:
Course Materials:
An algorithm is a finite set of instructions that, if followed, accomplishes a particular task. Every
algorithm must satisfy the following criteria:
1. Input – An algorithm has zero or more input quantities which are externally supplied, taken
from a set of objects called the domain of the algorithm.
2. Output - It has one or more output quantities which generate a set called the range of the
algorithm.
3. Definiteness - Each instruction must be clear and unambiguous, meaning each step of
an algorithm must be precisely defined.
4. Finiteness - If we trace out the instructions of an algorithm, then for all cases the algorithm
will terminate after a finite number of steps.
5. Effectiveness - Every instruction must be basic enough that it can, in principle, be carried
out by a person using only pencil and paper.
To give a concrete example of the concepts of algorithms, consider Euclid's algorithm for
finding the greatest common divisor of two positive integers m and n. Euclid's algorithm is:
Step 1. Divide m by n and let r be the remainder.
Step 2. If r = 0, terminate; n is the greatest common divisor.
Step 3. Set m ← n and n ← r, and go back to Step 1.
Let us try to gain more insight into it by simply trying it out. So, let m = 608, n = 133. Then:
Loop 1: Step 1. 608 / 133 = 4 remainder 76, so r ← 76
        Step 3. m ← 133; n ← 76
Loop 2: Step 1. 133 / 76 = 1 remainder 57, so r ← 57
        Step 3. m ← 76; n ← 57
Loop 3: Step 1. 76 / 57 = 1 remainder 19, so r ← 19
        Step 3. m ← 57; n ← 19
Loop 4: Step 1. 57 / 19 = 3 remainder 0, so r ← 0
        Step 2. r = 0, so the algorithm terminates; the greatest common divisor of 608 and 133 is n = 19.
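Translated directly into C (a sketch; the function name gcd is ours), the three steps become:

#include <stdio.h>

/* Euclid's algorithm: repeatedly replace (m, n) by (n, r) until the remainder r is 0. */
int gcd(int m, int n)
{
    int r = m % n;        /* Step 1: divide m by n; r is the remainder  */
    while (r != 0) {      /* Step 2: when r = 0, n is the answer        */
        m = n;            /* Step 3: m <- n, n <- r, then repeat Step 1 */
        n = r;
        r = m % n;
    }
    return n;
}

int main(void)
{
    printf("%d\n", gcd(608, 133));   /* prints 19, as in the trace above */
    return 0;
}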
Let us now put Euclid's method to the test in terms of the five criteria of an algorithm.
1. Input. The input consists of m and n, which are taken from the set of positive integers.
2. Output. The output is n, which contains the greatest common divisor upon termination of
the algorithm.
3. Definiteness. The stipulation that m and n be positive integers precisely defines step 1,
since there is no ambiguity as to what the remainder is upon dividing a positive integer by
another. If m and n were any real number, then step 1 may not satisfy the criterion of
definiteness.
4. Finiteness. Will the algorithm terminate for any input pair of positive integers m and n?
The answer is yes, it will. In step 1, r is always less than n. This is the value assigned to
n in step 3, unless r is already zero, in which case the algorithm terminates in step 2.
Therefore, each time step 1 is subsequently executed, the value of n is smaller than the
previous. A decreasing sequence of positive integers eventually terminates (at worst, at
n=1), so the algorithm must also terminate.
5. Correctness or Effectiveness. We have been able to carry out the steps of the algorithm
in working out the example. If we are patient enough, we should be able to do the same
for any pair (m, n), even if it takes a thousand loops.
Effectiveness of Programs
Effectiveness is defined as being able to perform an action or produce a result (output) based
on specifications, or on what is meant to be achieved. For a program (or an algorithm, for that matter)
to be considered effective, it must satisfy two criteria.
1. Does it run correctly? It produces no errors and runs in accordance with the given
specifications.
2. Does it run efficiently? It runs within the shortest time possible.
Analysis of Algorithms
An algorithm is said to be efficient and fast, if it takes less time to execute and consumes
less memory space. Algorithms that are equally correct can vary in their utilization of
computational resources. The performance of an algorithm or algorithm efficiency is usually
measured on the basis of the following properties:
1. Space complexity. Space complexity refers to the amount of memory space required
by the algorithm, during the course of its execution. Space complexity must be taken
seriously for multi-user systems and in situations where limited memory is available.
a. Instruction Space: The space required to store the executable version of the
program. This space is fixed for a given program, but it depends on the number of
lines of code in the program.
b. Data Space: The space required to store the values of all constants and variables
(including temporary variables).
c. Environment Stack: The space required to store information about partially
executed functions, such as return addresses and the values of their variables.
For example, if a function A() calls function B() inside it, then all the variables
of function A() will get stored on the system stack temporarily while
function B() is called and executed inside function A().
Space complexity is still important in the field of embedded computing especially for
handheld computer-based equipment like cell phones, palm devices, etc.
2. Time complexity. Time complexity is a way to represent the amount of time required
by the program to run to completion. It is generally good practice to try to keep the
required time to a minimum, so that our algorithm completes its execution in the minimum
time possible.
Note however that time complexity is affected by the complexity of instruction which
depends on the following:
a. machine execution time
b. time to execute the instruction
c. the instruction set
d. translation (the use of compilers)
All of these pertain to the physical hardware that runs the program and the software
tools that are used to translate the program into machine-readable form.
When a solution to a problem is written, some memory is required for it to complete its
execution. For any algorithm, memory may be used for the instructions (instruction space), the
environmental stack, and the data (data space).
As defined earlier, space complexity is the amount of memory used by the algorithm
(including the input values to the algorithm) to execute and produce its result. Sometimes Auxiliary
Space is confused with Space Complexity, but Auxiliary Space is only the extra or temporary space
used by the algorithm during its execution. Thus, we can say:
Space Complexity = Auxiliary Space + space used by the input values
Also, as noted earlier, while executing, an algorithm requires memory space for three
reasons: 1) instruction space; 2) environment space or environmental stack; and 3) data space.
But in calculating the space complexity of any algorithm, we usually consider only the Data Space
and neglect the Instruction Space and Environmental Stack, the latter two being affected by the
type of hardware used to run the algorithm.
For calculating the space complexity, we need to know the amount of memory used by
variables of different data types, which generally varies for different operating systems, but the
method for calculating the space complexity remains the same.
Type Size
bool, char, unsigned char, signed char, __int8 1 byte
__int16, short, unsigned short, wchar_t, __wchar_t 2 bytes
float, __int32, int, unsigned int, long, unsigned long 4 bytes
double, __int64, long double, long long 8 bytes
Now let's learn how to compute space complexity by taking a few examples:
int sum(int a, int b, int c)
{
    int z = a + b + c;
    return(z);
}
In the above code, the variables a, b, c and z are all integer types, hence each will take up 4
bytes, so the total memory requirement will be (4(4) + 4) = 20 bytes, where the additional 4 bytes is
for the return value. Because this space requirement is fixed for the above example, it is
called Constant Space Complexity.
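As a sketch of the kind of function analyzed next (the name sumArray and the summing logic are illustrative assumptions; what matters is the array a[] of n elements and the integer variables x, n, and i):

int sumArray(int a[], int n)
{
    int x = 0;                      /* running total                   */
    for (int i = 0; i < n; i++)     /* i visits each of the n elements */
        x = x + a[i];
    return x;
}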
In the above code, 4*n bytes of space is required for the elements of the array a[], plus 4 bytes
each for x, n, i, and the return value.
Hence the total memory requirement is (4n + 16), which increases linearly with the
input value n; hence it is called Linear Space Complexity. Similarly, we can
have quadratic and other more complex space complexities as the complexity of an algorithm
increases. But we should always focus on writing algorithm code in such a way that we keep the
space complexity to a minimum.
Time Complexity of Algorithms
The best option is to perform asymptotic analysis, where we evaluate the performance of
an algorithm in terms of the input size (problem size) in relation to the processing time (or space)
required to solve the problem. Here we look at three cases:
1. Best Case – if the algorithm is executed, the fewest number of instructions are
executed.
2. Average Case – executing the algorithm produces path lengths that will on average
be the same.
3. Worst Case – executing the algorithm produces path lengths that are always a
maximum.
Of the three cases, the only useful case from the standpoint of program design is that of
the worst case. Worst case helps us answer the software life cycle question of: If it is good enough
today, will it be good enough tomorrow?
For any defined problem, there can be any number of solutions. This is true in general. If I
have a problem and I discuss it with all of my friends, they will all suggest different solutions to me,
and I am the one who has to decide which solution is the best based on the circumstances.
Similarly, for any problem which must be solved using a program, there can be an infinite
number of solutions. Let's take a simple example to understand this. Below we have two different
algorithms to find the square of a number (for the moment, forget that the square of any number n is n*n):
One solution to this problem can be running a loop n times, adding n to a running total each time.
/*
we have to calculate the square of n
*/
square = 0
for i = 1 to n
    do square = square + n
// when the loop ends, square will hold the value n*n
return square
Or, we can simply use a mathematical operator * (multiplication) to find the square.
/*
we have to calculate the square of n
*/
return n*n
In the above two simple algorithms, we saw how a single problem can have many
solutions. While the first solution required a loop, which will execute n times, the
second solution used the mathematical operator * to return the result in one line. So which one is
the better approach? Of course, the second one.
Time complexity of an algorithm signifies the total time required by the program to run till
its completion. The time complexity of algorithms is most commonly expressed using the big O
notation which is an asymptotic notation to represent the time complexity.
Time complexity is most commonly estimated by counting the number of elementary steps
(we call this the frequency count) performed by an algorithm to finish execution. As in the example
above, for the first code the loop will run n times, so the time complexity will be at least n, and as
the value of n increases the time taken will also increase. For the second
code, the time complexity is constant, because it never depends on the value of n; it
always gives the result in one step. And since an algorithm's performance may vary with different
types of input data, we usually use the worst-case time complexity of an
algorithm, because that is the maximum time taken for any input size.
Let us now move on to the next big topic related to time complexity: how to
calculate time complexity. It becomes very confusing sometimes, but we will try to explain it in
the simplest way.
Now the most common metric for calculating time complexity is Big O notation. This
removes all constant factors so that the running time can be estimated in relation to N, as N
approaches infinity. In general you can think of it like this:
statement;
Above we have a single statement. Its Time Complexity will be constant. The running time
of the statement will not change in relation to N.
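Next, consider a loop that runs N times (a sketch; the function name and the counter are illustrative):

void linear(int N)
{
    int count = 0;
    for (int i = 0; i < N; i++)
        count++;                    /* the statement executes N times */
}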
The time complexity for the above algorithm will be Linear. The running time of the loop is
directly proportional to N. When N doubles, so does the running time.
This time, the time complexity for the code below will be Quadratic. The running time of
the two loops is proportional to the square of N. When N doubles, the running time increases by
N * N.
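A sketch of such a pair of nested loops (the names are illustrative):

void quadratic(int N)
{
    int count = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            count++;                /* the statement executes N * N times */
}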
The algorithm below breaks a set of numbers into halves to search a particular field. Now,
this algorithm will have a Logarithmic Time Complexity. The running time of the algorithm is
proportional to the number of times N can be divided by 2 (N is high-low here). This is because
the algorithm divides the working area in half with each iteration.
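A sketch of such a halving search (a standard binary search over a sorted array; the variable names low, high, and mid are conventional):

/* Returns the index of target in the sorted array arr[low..high], or -1 if it is absent. */
int binarySearch(int arr[], int low, int high, int target)
{
    while (low <= high) {
        int mid = (low + high) / 2;
        if (arr[mid] == target)
            return mid;             /* found the field we are searching for */
        else if (arr[mid] < target)
            low = mid + 1;          /* discard the lower half               */
        else
            high = mid - 1;         /* discard the upper half               */
    }
    return -1;                      /* not found                            */
}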
Taking the previous algorithm forward, below we have a small logic of Quick Sort. In Quick
Sort, we divide the list into halves every time, but we repeat the iteration N times (where N is the
size of list). Hence time complexity will be N*log(N). The running time consists of N loops (iterative
or recursive) that are logarithmic, thus the algorithm is a combination of linear and logarithmic.
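A sketch of that logic (the partition step shown is one common way of splitting the list; the names are ours):

/* Partition a[left..right] around a pivot; smaller elements end up on the left. */
static int partition(int a[], int left, int right)
{
    int pivot = a[right], i = left;
    for (int j = left; j < right; j++) {
        if (a[j] < pivot) {
            int t = a[i]; a[i] = a[j]; a[j] = t;
            i++;
        }
    }
    int t = a[i]; a[i] = a[right]; a[right] = t;
    return i;
}

/* Each level does O(N) partitioning work over about log N levels: N*log(N) overall. */
void quickSort(int a[], int left, int right)
{
    if (left < right) {
        int p = partition(a, left, right);
        quickSort(a, left, p - 1);
        quickSort(a, p + 1, right);
    }
}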
In general, doing something with every item in one dimension is linear, doing something
with every item in two dimensions is quadratic, and dividing the working area in half is logarithmic.
Types of Notations for Time Complexity
O(expression) is the set of functions that grow slower than or at the same rate as the
expression. It indicates the maximum time required by an algorithm for all input values. It represents
the worst case of an algorithm's time complexity.
Omega(expression) is the set of functions that grow faster than or at the same rate as
expression. It indicates the minimum time required by an algorithm for all input values. It
represents the best case of an algorithm's time complexity.
Theta(expression) consist of all the functions that lie in both O(expression) and
Omega(expression). It indicates the average bound of an algorithm and it represents the average
case of an algorithm's time complexity.
Generally, there are several algorithmic solutions to every programming problem and we
would like to eliminate the bad ones early on so an analysis is usually required. The ability to do
an analysis usually provides insight into designing efficient algorithms. The analysis also generally
pinpoints the bottlenecks, which are worth coding carefully.
To simplify the analysis, we adopt the convention that there are no particular units of time
thus throwing away leading constants. We will also throw away low-order terms, so what we are
essentially doing is computing a Big-Oh running time. Since Big-Oh is an upper bound, we must
be careful never to underestimate the running time of the program. In effect, the answer provided
is a guarantee that the program will terminate within a certain time period. The program may stop
earlier than this, but never later.
Example. The following function computes the sum of the cubes of the first n natural numbers:
int sum( int n )
{
    int partialSum;
1       partialSum = 0;
2       for( int i = 1; i <= n; i++ )
3           partialSum += i * i * i;
4       return partialSum;
}
The analysis of this fragment is simple. The declarations count for no time. Lines 1 and 4
count for one unit each. Line 3 counts for four units per time executed (two multiplications, one
addition, and one assignment) and is executed N times, for a total of 4N units. Line 2 has the
hidden costs of initializing i, testing i ≤ N, and incrementing i. The total cost of all these is 1 to
initialize, N+1 for all the tests, and N for all the increments, which is 2N + 2. We ignore the costs
of calling the method and returning, for a total of 6N + 4. Thus, we say that this method is O(N).
If we had to perform all this work every time we needed to analyze a program, the task would
quickly become infeasible. Fortunately, since we are giving the answer in terms of Big-Oh, there
are lots of shortcuts that can be taken without affecting the final answer. For instance, line 3 is
obviously an O(1) statement (per execution), so it is silly to count precisely whether it is two, three,
or four units; it does not matter. Line 1 is obviously insignificant compared with the for loop, so it
is silly to waste time here. This leads to several general rules.
Rule 1—for loops
The running time of a for loop is at most the running time of the statements inside the for
loop (including tests) times the number of iterations.
Rule 2—Nested loops
Analyze these inside out. The total running time of a statement inside a group of nested
loops is the running time of the statement multiplied by the product of the sizes of all the
loops.
Rule 3—Consecutive Statements
These just add (which means that the maximum is the one that counts). An example is the
following program fragment, which has O(N) work followed by O(N²) work and is therefore also O(N²):
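A fragment of that shape might look like this (a sketch; the array a and its size n are assumed to be declared in the surrounding code):

for (int i = 0; i < n; i++)          /* O(N) work               */
    a[i] = 0;
for (int i = 0; i < n; i++)          /* followed by O(N^2) work */
    for (int j = 0; j < n; j++)
        a[i] += a[j] + i + j;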
Rule 4—If/Else
if (condition)
S1
else
S2
the running time of an if/else statement is never more than the running time of the test
plus the larger of the running times of S1 and S2.
Other rules are obvious, but a basic strategy of analyzing from the inside (or deepest part)
out works. If there are function calls, these must be analyzed first. If there are recursive functions,
there are several options. If the recursion is really just a thinly veiled for loop, the analysis is
usually trivial.
Watch:
Read:
Review:
1. What is an algorithm?
2. In formal computer science, how do you distinguish between an algorithm and a program?
3. Why is there a need to analyze algorithms?
Activities/Assessments:
1. For each of the following three program fragments:
a. Give an analysis of the running time (Big-Oh will do).
b. Implement the code in the language of your choice, and give the running time for
several values of N.
c. Compare your analysis with the actual running times.
(1) sum = 0;
for( i = 0; i < n; ++i )
++sum;
(2) sum = 0;
for( i = 0; i < n; ++i )
for( j = 0; j < n * n; ++j )
++sum;
(3) sum = 0;
for( i = 0; i < n; ++i )
for( j = 0; j < i * i; ++j )
for( k = 0; k < j; ++k )
++sum;
Unit 3 – Arrays
Overview:
Most programming languages, in one way or another, have their own implementation of an array. In
this unit we present the array, which is used by many other data structures to implement their
algorithms.
Module Objectives:
After successful completion of this module, you should be able to:
Course Materials:
An array type, or array, is defined as a finite set of elements having the same type, referenced
under a common name. Each individual data item is called an array element and must be of the same
type and size. Figure 6 shows a one-dimensional array where data is arranged in a line. Each data
element is accessed using an index.
Arrays can have more than one dimension; these are called multidimensional arrays.
They may have certain limitations as to the number of definable dimensions, depending
on the programming language or compiler used. Figure 7 shows a two-dimensional array,
sometimes called a table, as it represents a series of rows and columns. The figure shows data
lined up in both the vertical and horizontal directions. Data elements are identified by a row index and
a column index. Two-dimensional arrays are also called arrays of arrays.
Figure 7 Two-dimensional array
The figure below represents a three-dimensional array. Notice that aside from the rows
and columns, a third element representing depth can also be seen.
The above image can be looked at as a top-level view of a staircase where you are at the
base of the staircase (index 0). Each element can be uniquely identified by its index in the array
(in a similar way as you could identify your friends by the step on which they were standing in the above
example).
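In C, arrays of one, two, and three dimensions are declared and indexed as follows (a sketch; the sizes and names are illustrative):

#include <stdio.h>

int main(void)
{
    int line[5];        /* one-dimensional: 5 elements, line[0] .. line[4]         */
    int table[3][4];    /* two-dimensional: 3 rows and 4 columns                   */
    int cube[2][3][4];  /* three-dimensional: rows, columns, and a depth index     */

    line[0] = 10;       /* each element is accessed through its index (or indices) */
    table[2][1] = 20;
    cube[1][2][3] = 30;

    printf("%d %d %d\n", line[0], table[2][1], cube[1][2][3]);
    return 0;
}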
The address of any element in a one-dimensional array can be calculated using the following formula:
Address of A [ X ] = B + W * ( X – LB )
Where:
B = base address
W = storage size of one element stored in the array (in bytes)
X = subscript of the element whose address is to be found
LB = lower limit/lower bound of the subscript, if not specified assume 0 (zero)
Example:
Given the base address of an array B[1300…..1900] as 1020 and size of each element is
2 bytes in the memory. Find the address of B[1700].
Address of A [ X ] = B + W * ( X – LB )
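Substituting the given values into the formula (a worked computation):
Address of B[1700] = 1020 + 2 * ( 1700 – 1300 ) = 1020 + 2 * 400 = 1020 + 800 = 1820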
Address Calculation in Two Dimension Array
Row Major System: The address of a location in Row Major System is calculated using
the following formula:
Address of A [ I ][ J ] = B + W * [ N * ( I – Lr ) + ( J – Lc ) ]
Column Major System: The address of a location in Column Major System is calculated
using the following formula:
Address of A [ I ][ J ] = B + W * [ ( I – Lr ) + M * ( J – Lc ) ]
Where:
B = Base address
I = Row subscript of element whose address is to be found
J = Column subscript of element whose address is to be found
W = Storage Size of one element stored in the array (in byte)
Lr = Lower limit of row/start row index of matrix, if not given assume 0 (zero)
Lc = Lower limit of column/start column index of matrix, if not given assume 0
M = Number of rows of the given matrix
N = Number of columns of the given matrix
Note however that the number of rows and columns of a matrix are usually given (like
A[20][30] or A[40][60]), but they may instead be given in the form A[Lr … Ur, Lc … Uc]. In this case
the number of rows and columns are calculated as follows:
M = Ur – Lr + 1 (number of rows)
N = Uc – Lc + 1 (number of columns)
And the rest of the process will remain the same as per the requirement (Row Major Wise or
Column Major Wise).
Example:
Given an array A[-15 … 10, 15 … 40] stored with base address B = 1500 and element size W = 1 byte,
find the address of A[15][20].
Solution:
As you can see, the number of rows and columns are not given directly in the question, so they
are calculated as:
M = Ur – Lr + 1 = 10 – (–15) + 1 = 26
N = Uc – Lc + 1 = 40 – 15 + 1 = 26
(i) Column Major Wise Calculation
The given values are: B = 1500, W = 1 byte, I = 15, J = 20, Lr = –15, Lc = 15, M = 26
Address of A [ I ][ J ] = B + W * [ ( I – Lr ) + M * ( J – Lc ) ]
= 1500 + 1 * [ ( 15 – (–15) ) + 26 * ( 20 – 15 ) ]
= 1500 + [ 30 + 130 ] = 1660
(ii) Row Major Wise Calculation
The given values are: B = 1500, W = 1 byte, I = 15, J = 20, Lr = –15, Lc = 15, N = 26
Address of A [ I ][ J ] = B + W * [ N * ( I – Lr ) + ( J – Lc ) ]
= 1500 + 1 * [ 26 * ( 15 – (–15) ) + ( 20 – 15 ) ]
= 1500 + [ 780 + 5 ] = 2285
Array Representation
As seen in the illustration (Figure 14), the following are the important points to be considered.
In C, when an array is initialized with a size, default values are assigned to its elements:
There are several basic operations supported by an array though the implementation
differs from one programming language to another. In Python, there is an array module that has
separate functions for performing array operations. In C language, the operations are explicitly
defined using loops and assignment statements. Following are the basic operations supported by
an array.
1. Traverse. This operation traverses through the elements of an array meaning we print
all the array elements one by one as we visit them.
2. Insertion. Insert operation is to insert one or more data elements into an array. Based
on the requirement, a new element can be added at the beginning, end, or any given
index of the array.
3. Deletion. Deletion refers to removing an existing element from the array and re-
organizing all elements of an array.
4. Search. With this operation, you can search for an item in an array based on a given
value (or through an index).
5. Update. Update operation refers to updating an existing element from the array at a
given index. This operation is quite similar to the insert method, except that it will
replace the existing value at the given index.
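As a sketch of how these operations look in C on a fixed-size array (the contents, positions, and values below are illustrative):

#include <stdio.h>

int main(void)
{
    int a[10] = { 2, 4, 6, 8, 10 };   /* capacity 10, currently holding 5 elements */
    int n = 5;

    /* Traverse: visit and print every element one by one */
    for (int i = 0; i < n; i++)
        printf("%d ", a[i]);
    printf("\n");

    /* Insertion: place 5 at index 2 by shifting later elements one place right */
    int pos = 2, value = 5;
    for (int i = n; i > pos; i--)
        a[i] = a[i - 1];
    a[pos] = value;
    n++;

    /* Deletion: remove the element at index 3 by shifting later elements one place left */
    pos = 3;
    for (int i = pos; i < n - 1; i++)
        a[i] = a[i + 1];
    n--;

    /* Update: replace the value stored at a given index */
    a[0] = 1;

    for (int i = 0; i < n; i++)
        printf("%d ", a[i]);
    printf("\n");
    return 0;
}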
Table 5 Time Complexity of an Array
Advantages
1. Reading an array element is simple and efficient. As shown in the above table, the read
time of array is O(1) in both best and worst cases. This is because any element can be
instantly read using indexes (base address calculation behind the scene) without
traversing the whole array.
2. Arrays are a foundation for other data structures. For example, other data structures such as
linked lists, stacks, and queues can be implemented using arrays.
3. All the elements of an array can be accessed using a single name (array name) along with
the index, which is readable, user-friendly and efficient rather than storing those elements
in different variables.
Disadvantages
1. While using an array, we must decide the size of the array at the
beginning, so if we are not aware of how many elements we are going to store in the array, it
would make the task difficult.
2. The size of the array is fixed, so if at a later point we need to store more elements in it,
it can't be done. On the other hand, if we store a smaller number of elements than
the declared size, the remaining allocated memory is wasted.
Watch:
Read:
• Arrays in C (https://www.w3resource.com/c-programming/c-array.php)
• Arrays (https://en.wikibooks.org/wiki/Data_Structures/Arrays)
Review:
1. What is an array?
2. What are sequential lists and how are they limited?
3. A one-dimensional array is a collection of what? A two-dimensional array is a collection of
what?
4. If an array holds integers, each of which is four bytes, how many bytes from the base
location of the array is the location of the fifth element?
5. Describe a row major ordering and a column major ordering.
6. On average, how many items must be moved to insert a new item into an unsorted array
with N items?
7. On average, how many items must be moved to delete an item from an unsorted array
with N items?
8. On average, how many items must be examined to find a particular item in an unsorted
array with N items?
Activities/Assessments:
1. How many values can be held by arrays with the following dimensions?
a) X [0..8]
b) a [0..n]
c) b [-1..n, 1..m]
d) c [-n..0, 1..2]
2. When storing a two-dimensional array A with ten rows and ten columns in continuous
memory space in the direction of rows, what is the address where A[5,6] is stored? In this
question, the address is represented in decimal numbers.
Address
100 A[1,1]
101
102 A[1,2]
103
104
3. Assume that the following lists of numbers represents the physical view of a two-
dimensional array in memory. If the array has 3 columns and 3 rows, show the row major
ordering and the column major ordering for each list.
a) 4, 0, 1, 3, 5, 1, 7, 4, 4
b) 7, 7, 7, 3, 3, 3, 1, 1, 1
c) 1, 2, 3, 4, 5, 6, 7, 8, 9
4. Answer the following problems given the following array variable/s assuming that an int is
equal to 2 bytes. int x [4][3] = {5, 2, 9, 0, 3, 1, 4, 12, 6, 8, 10, 11};
a) what will be the total size (in bytes) of the array? ___________
b) what value is found at x [0][0]? _____________
c) what value is found at x [3][2]? _____________
d) what value is found at x [2][1]? _____________
e) what value is found at x [1][2]? _____________
Programming:
Unit 4 – Linked Lists
Overview:
Lists provide a convenient way of keeping track of things in our everyday life: shopping lists,
laundry lists, lists of things to be done, and so on. Similarly, lists provide a convenient way of
representing arbitrary nonnumeric information in a computer. The remainder of this unit will deal
with the list as a formal data structure and will explain how it is represented and manipulated in a
computer.
Module Objectives:
After successful completion of this module, you should be able to:
Course Materials:
A list is a finite, ordered collection of items (elements) of some element type E. Note that
this does not mean that the objects are in sorted order; it just means that each object has a position
in the list, starting with position zero (the first element in the list), then the second element, and so on.
In programming, a list is described as a collection of elements that have a linear relationship with
each other. A linear relationship means that, except for the last one, each element on the list has a
unique successor. Lists also have a property intuitively called size, which is simply the number of
elements on the list; every list has a size.
In computer programs, lists are extremely versatile ADTs that provide storage for
information without imposing any limitations on how that information is added, accessed, or
removed. Recall that when we think about an ADT, we think about both the external and internal
views. The external view includes the "conceptual picture" and the set of "conceptual operations".
The conceptual picture of a List is something like this:
item 0
item 1
item 2
...
item n
and one reasonable set of operations is:
Operation Description
void add(E item) add item to the end of the List
void add(int pos, E item) add item at position pos in the List, moving the items originally
in positions pos through size()-1 one place to the right to make
room (error if pos is less than 0 or greater than size())
boolean contains(E item) return true iff item is in the List (i.e., there is an item x in the List
such that x.equals(item))
int size() return the number of items in the List
boolean isEmpty() return true iff the List is empty
E get(int pos) return the item at position pos in the List (error if pos is less than
0 or greater than or equal to size())
E remove(int pos) remove and return the item at position pos in the List, moving
the items originally in positions pos+1 through size()-1 one
place to the left to fill in the gap (error if pos is less than 0 or
greater than or equal to size())
Many other operations are possible; when designing an ADT, you should try to provide
enough operations to make the ADT useful in many contexts, but not so many that it gets
confusing. It is not always easy to achieve this goal; it will sometimes be necessary to add
operations in order for a new application to use an existing ADT.
List Implementations
In some ways, a List is similar to an array: both Lists and arrays are ordered collections of
data elements and in both cases you can add or access items at a particular position (and in both
cases we consider the first position to be position zero). You can also find out how many items
are in a List and how large an array is.
You can actually implement a list using an array, which is one of the most common
implementations of a list. Unfortunately, one disadvantage of using arrays to store data is that
arrays are static structures and therefore cannot be easily extended or reduced to fit the data set.
Arrays also make insertions and deletions expensive to perform.
A solution is to use another implementation, that of using pointers or memory cells, forming
what we call a linked list. A linked list is a linear data structure where each element is a separate
object, and the elements are connected together via links.
The implementation of a linked list in C is done using pointers. Linked list can be visualized
as a chain of nodes, where every node points to the next node. Each element (node) of a list is
comprised of two items - the data and a reference to the next node. The last node has a reference
to null and is often identified as the tail. The entry point into a linked list is called the head of the list. It
should be noted that the head is not a separate node, but the reference to the first node. If the list is
empty then the head is a null reference.
Implementing a list in the C language requires the use of structures and pointers. Each
structure represents a node having some data and also a pointer to another structure of the
same kind. This pointer holds the address of the next node and creates a link between two nodes.
The following code shows how to define a node in C (this is for a singly linked list).
struct node
{
int data;
struct node *next;
};
The first data member of the structure (named node) is an integer that holds the node's data,
and the second data member is a pointer to a node (the same structure). This means that the second
data member holds the address of the next node, and in this way every node is connected, as
represented in the picture above.
A linked list is a dynamic data structure. Unlike an array, a linked list does not store its
nodes in consecutive memory locations. The number of nodes in a list is not fixed and can grow
and shrink on demand. Any application which has to deal with an unknown number of objects will
need to use a linked list.
Another point of difference between an array and a linked list is that a linked list does not
allow random access of data or direct access to the individual elements. If you want to access a
particular item then you have to start at the head and follow the references until you get to that
item. Nodes in a linked list can be accessed only in a sequential manner. But like an array,
insertions and deletions can be done at any point in the list in constant time. A disadvantage of
a linked list is that it uses more memory compared with an array - we need an extra 4 bytes (on a
32-bit CPU) just to store a reference to the next node.
1. Simple Linked List or Singly Linked List – navigation in the list is forward only. In this
type of linked list, every node stores address or reference of next node in list and the
last node has next address or reference as NULL.
Figure 16 Singly Linked List
2. Doubly Linked List – navigation through the list can be both ways – forward and
backward. In this type of linked list, there are two references associated with each
node: one reference points to the next node and one to the previous node.
An advantage of this data structure is that we can traverse in both directions, and for
deletion we don't need to have explicit access to the previous node.
3. Circular Linked List – the last item (TAIL) in the list contains a link to the first element (HEAD) as its next node and, in the doubly linked variant, the first element links back to the last element as its previous node. A circular linked list is a linked list where all the nodes are connected to form a circle; there is no NULL at the end. A circular linked list can be a singly circular linked list or a doubly circular linked list. The advantage of this data structure is that any node can be made the starting node. This is useful, for example, when implementing a circular queue with a linked list.
Basic Operations
Though there exist various types of linked list, the basic operations supported by a linked list are almost the same, with only slight modifications between types. The operations presented below are those of a singly linked list.
33
• Addition/insertion – adding an element to a list. Note that the process of adding or inserting an item varies depending on where the new element will be placed (at the beginning, at the end of the list, or in between nodes).
• Deletion – removing an element from the list by unlinking it from its predecessor and freeing its memory.
• Search – finding an element in the list. Searching for an item in the list always begins at the head.
Addition/Insertion Operation
The process of adding or inserting a new node to a (singly) linked list depends on the position where the new item is to be placed. The following are the steps for inserting an item after an existing node ‘prev’, assuming the list already has data in it.
Step 1 – Create the new node and store the data in it.
Step 2 – Point the ‘next’ of the new node to the node that currently follows ‘prev’.
Step 3 – Point the ‘next’ of ‘prev’ to the new node.
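As a concrete illustration of these steps, the following sketch uses the struct node defined earlier; the function name insert_after is only illustrative, and it assumes ‘prev’ is a valid node already in the list.

#include <stdlib.h>

/* Insert a new value into the list immediately after the node 'prev'. */
void insert_after(struct node *prev, int value)
{
    /* Step 1: create the new node and store the data in it */
    struct node *newnode = (struct node *) malloc(sizeof(struct node));
    newnode->data = value;
    /* Step 2: point the new node's 'next' to the node that follows 'prev' */
    newnode->next = prev->next;
    /* Step 3: point the 'next' of 'prev' to the new node */
    prev->next = newnode;
}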
Deletion Operation
We delete a node of a linked list by connecting the predecessor of the node to be deleted to the successor of that same node. For example, if we have a linked list a → b → c, then to delete the node ‘b’ we connect ‘a’ to ‘c’, i.e., a → c. This, however, makes the node ‘b’ inaccessible; such inaccessible nodes are called garbage, and we clean up this garbage by using the ‘free’ function in C.
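A matching sketch of the deletion just described, again using the struct node from earlier; delete_after is an illustrative name and assumes ‘prev’ points to the predecessor (‘a’) of the node to be deleted (‘b’).

#include <stdlib.h>

/* Delete the node that follows 'prev', e.g. remove 'b' from a -> b -> c. */
void delete_after(struct node *prev)
{
    struct node *victim = prev->next;   /* the node to be deleted ('b') */
    if (victim == NULL)
        return;                         /* nothing follows, nothing to delete */
    prev->next = victim->next;          /* connect 'a' directly to 'c' */
    free(victim);                       /* clean up the garbage node */
}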
Search Operation
Searching for an item in the list starts at the head; we visit each node and compare its value with the item being searched for until a match is found or we reach null (the end of the list). The head reference itself is never changed. The same process (without the comparison) is used when traversing all entries in the list.
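A sketch of the search operation on the same node structure; the names are illustrative, and the head reference is never modified.

/* Return a pointer to the first node containing 'key', or NULL if absent. */
struct node *search_list(struct node *head, int key)
{
    struct node *current = head;        /* the head reference itself is unchanged */
    while (current != NULL) {
        if (current->data == key)
            return current;             /* found the item */
        current = current->next;        /* move to the next node */
    }
    return NULL;                        /* reached the end of the list */
}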
Watch:
Read:
Review:
1. What is a list?
2. What is the difference between a list and a linked list?
3. How is an array different from a linked list?
4. What are the different types of linked list?
Activities/Assessments:
1. Fill in the values indicated by the pointer variable expressions below, based on the
following linked list:
a) A->next->next->info ___________________
b) B->next->info ___________________
c) B->next->back->back->info ___________________
d) A->back->next->info ________________
2. The figure below is a uni-directional list. Manila is the head of the list and the pointer
contains addresses of data shown below. Batanes is the tail of the list and the pointer
contains 0. Choose one correct way to insert Cebu placed at address 150 between
Boracay and Davao.
a) The pointer for Cebu to be set to 50 and that for Davao to be set to 150
b) The pointer for Cebu to be set to 70 and that for Boracay to be set to 150
c) The pointer for Cebu to be set to 90 and that for Davao to be set to 150
d) The pointer for Cebu to be set to 150 and that for Boracay to be set to 90
b) P = (node) malloc(sizeof(NodeEntry));
P->info = 3;
Q = (node) malloc(sizeof(NodeEntry));
Q->info = 2;
P = (node) malloc(sizeof(NodeEntry));
P->info = Q->info;
Q->info = 0;
printf("%d %d", P->info, Q->info);
4. Swap two adjacent elements by adjusting only the links (and not the data) using:
a) singly linked lists
b) doubly linked lists
Programming:
You were tasked to do programs that will illustrate the use of lists implemented using an
array. One uses a sorted list, the other an unsorted list. Run the programs separately
and answer the following questions.
1. Write a program that will initially create an UNSORTED list containing the following and
then run the program:
Mitch
Diane
Jack
Robbie
Katherine
Add Morrie to the list and run the program again. Then, with the same list above (with Morrie added), delete/remove Jack. Answer and explain the following questions:
a) What is the new list? Identify the elements of the list and its index.
b) What happened to the former location occupied by Jack?
2. Write a program that will initially create a SORTED list containing the following and then
run the program:
Diane
Jack
Katherine
Mitch
Robbie
Add Morrie to the list and run the program again. Then, with the same list above (with Morrie added), delete/remove Jack. Answer and explain the following questions:
a) What is the new list? Identify the elements of the list and its index.
b) What happened to the former location occupied by Jack?
3. In most personal computers, the largest integer is 32,767 and the largest long integer is
2,147,483,647. Some applications, such as the national debt, may require an unbounded
integer. One way to store and manipulate integers of unlimited size is by using a linked
list. Each digit is stored in a node of the list. For example, the figure below shows how
we could store a five-digit number in a linked list.
While the linked list in the figure is represented as moving from right to left, remember that
there is no physical direction in a linked list. We represent it this way to assist us in
understanding the problem.
To add two numbers, we simply add the corresponding digits in the same location in their respective linked lists, together with the carry from the previous addition. With each addition, if the sum is 10 or greater, then we subtract ten and set the carry to one. Otherwise, the carry is set to zero.
Write a program to add two integer linked lists. Design your solution so that the same
logic adds the first numbers (units’ position) as well as the rest of the number. In other
words, do not have special one-time logic for adding the units’ position.
Unit 5 – Stacks
Overview:
Stacks are dynamic data structures that follow the Last In First Out (LIFO) principle. The last item
to be inserted into a stack is the first one to be deleted from it. In this unit we will explore stacks
and its applications and learn how to implement it.
Module Objectives:
After successful completion of this module, you should be able to:
Course Materials:
A stack is an Abstract Data Type (ADT), commonly used in most programming languages.
It is named stack as it behaves like a real-world stack, for example – a deck of cards or a pile of
plates, etc. A real-world stack allows operations at one end only. For example, we can place or
remove a card or plate from the top of the stack only.
Figure 25 Stack
A stack is a container of objects that are inserted and removed according to the last-in
first-out (LIFO) principle. In a pushdown stack only two operations are allowed: push an item onto the stack and pop an item off the stack. A stack is a limited-access data structure, as
elements can be added and removed from the stack only at the top. Push adds an item to the top
of the stack, pop removes the item from the top. A helpful analogy is to think of a stack of books;
you can remove only the top book, also you can add a new book on the top.
Applications of Stacks
1. The simplest application of a stack is to reverse a word. You push a given word to a
stack one character at a time and then pop letters from the stack.
3. Problems involving backtracking also use a stack. Backtracking is a process in which you need to access the most recent data element in a series of elements. Think of a labyrinth or maze - how do you find a way from an entrance to an exit?
Once you reach a dead end, you must backtrack. But backtrack to where? to the
previous choice point. Therefore, at each choice point you store on a stack all possible
choices. Then backtracking simply means popping a next choice from the stack.
Stacks also underlie several language-processing and program-execution mechanisms:
a. space for parameters and local variables is created internally using a stack.
b. a compiler's syntax check for matching braces is implemented using a stack.
c. support for recursion.
Stack Representation and Implementation
A Stack Data Structure can be implemented in two ways: array implementation and linked
list implementation.
Array-based Implementation
In an array-based implementation, the elements are stored in an array and a variable top holds the index of the topmost element. The variable top ranges from -1 to capacity - 1. We say that a stack is empty when top = -1, and the stack is full when top = capacity - 1.
In a fixed-size stack the capacity stays unchanged; therefore, when the stack is full, a further push causes an overflow error. In a dynamic stack, when the stack becomes full we double the stack size instead.
An advantage of an array-based implementation is that it is easy to implement,
and that memory is saved as pointers are not involved. The drawback is that it is not
dynamic. It doesn’t grow and shrink depending on needs at runtime.
Linked List-based Implementation
A linked list provides the best (from the efficiency point of view) dynamic stack implementation. The linked list implementation of a stack can grow and shrink according to the needs at runtime. The trade-off for this advantage is that it requires extra memory due to the pointers involved.
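A minimal sketch of the linked-list implementation described above: a push allocates a new node that becomes the new top, and a pop unlinks and frees the old top. The names (struct snode, push_node, pop_node) are illustrative so they do not clash with the array-based snippets later in this unit, and pop_node assumes the caller has already checked that the stack is not empty.

#include <stdlib.h>

struct snode {
    int data;
    struct snode *next;
};

struct snode *stack_top = NULL;    /* the head of the list is the top of the stack */

void push_node(int value)
{
    struct snode *n = (struct snode *) malloc(sizeof(struct snode));
    n->data = value;
    n->next = stack_top;           /* the new node points to the old top */
    stack_top = n;                 /* and becomes the new top */
}

int pop_node(void)
{
    struct snode *old = stack_top; /* assumes the stack is not empty */
    int value = old->data;
    stack_top = old->next;         /* unlink the old top */
    free(old);                     /* memory shrinks together with the stack */
    return value;
}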
Basic Operations
Stack operations may involve initializing the stack, using it, and then de-initializing it. Apart from these basics, a stack supports the following two primary operations:
• push() − Pushing (storing) an element on the stack.
• pop() − Removing (accessing) an element from the stack.
To use a stack efficiently, we need to check the status of stack as well. For the same
purpose, the following functionality is added to stacks −
• peek() − get the top data element of the stack, without removing it.
• isFull() − check if stack is full.
• isEmpty() − check if stack is empty.
At all times, we maintain a pointer to the last PUSHed data on the stack. Because this pointer always represents the top of the stack, it is named top. The top pointer lets us read the top value of the stack without actually removing it.
peek()
int peek() {
return stack[top];
}
isfull()
procedure isfull
   if top equals MAXSIZE - 1
      return true
   else
      return false
   endif
end procedure
Implementation of isfull() function in C programming language −
bool isfull() {
   if(top == MAXSIZE - 1)
      return true;
   else
      return false;
}
isempty()
procedure isempty
   if top is less than 0
      return true
   else
      return false
   endif
end procedure
bool isempty() {
if(top == -1)
return true;
else
return false;
}
Push Operation
The process of putting a new data element onto the stack is known as a push operation. A push operation involves the following series of steps −
Step 1 − Check if the stack is full.
Step 2 − If the stack is full, produce an error and exit.
Step 3 − If the stack is not full, increment top to point to the next empty space.
Step 4 − Add the data element to the stack location where top is pointing.
Step 5 − Return success.
Figure 29 Stack Push Operation
If the linked list is used to implement the stack, then in step 3, we need to allocate space
dynamically.
procedure push(data)
   if stack is full
      return null
   endif
   top ← top + 1
   stack[top] ← data
end procedure
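A possible C implementation of this pseudocode, following the array-based conventions (stack[], top) used by the other snippets in this unit; the MAXSIZE value and the return codes are illustrative assumptions.

#define MAXSIZE 100               /* an assumed capacity, for illustration */

int stack[MAXSIZE];
int top = -1;                     /* -1 means the stack is empty */

int push(int data)
{
    if (top == MAXSIZE - 1)       /* the isfull() test: no room left */
        return 0;                 /* overflow */
    top = top + 1;                /* move top to the next empty slot */
    stack[top] = data;            /* store the data there */
    return 1;                     /* success */
}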
Pop Operation
Accessing the content while removing it from the stack, is known as a Pop Operation. In
an array implementation of pop() operation, the data element is not actually removed, instead top
is decremented to a lower position in the stack to point to the next value. But in a linked-list
implementation, pop() actually removes the data element and deallocates its memory space. A pop operation may involve the following steps −
Step 1 − Check if the stack is empty.
Step 2 − If the stack is empty, produce an error and exit.
Step 3 − If the stack is not empty, access the data element at which top is pointing.
Step 4 − Decrease the value of top by one.
Step 5 − Return success.
procedure pop
   if stack is empty
      return underflow
   endif
   data ← stack[top]
   top ← top - 1
   return data
end procedure
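Correspondingly, a possible C implementation of pop, continuing from the declarations in the previous sketch; returning -1 on underflow is only an illustrative choice.

int pop(void)
{
    if (top == -1)                /* the isempty() test: nothing to remove */
        return -1;                /* underflow; -1 is an illustrative error value */
    int data = stack[top];        /* read the element at the top */
    top = top - 1;                /* top now points to the next value down */
    return data;
}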
Arithmetic Expression Evaluation
An arithmetic expression can be written in three different but equivalent notations:
• Infix Notation
• Prefix (Polish) Notation
• Postfix (Reverse-Polish) Notation
Infix Notation
Each operator is written between two operands (i.e., A + B). It is easy for us humans to
read, write, and speak in infix notation but the same does not go well with computing devices. An
algorithm to process infix notation could be difficult and costly in terms of time and space
consumption. With this notation, we must distinguish between ( A + B ) * C and A + ( B * C ) by
using either parentheses or some operator-precedence convention. Thus, the order of operators
and operands in an arithmetic expression does not uniquely determine the order in which the
operations are to be performed.
Prefix Notation
In this notation, operator is placed before its two operands. For example, +ab, this is
equivalent to its infix notation a + b. Prefix notation is also known as Polish Notation and no
parentheses are required.
Postfix Notation
This notation style is known as Reverse Polish Notation. It refers to the analogous notation in which the operator is placed after its two operands. For example, ab+ is equivalent to its infix notation a + b. Again, no parentheses are required in Reverse Polish notation.
Expression Parsing
Stack-organized computers are better suited to postfix notation than to the traditional infix notation; thus the infix notation must be converted to postfix notation. An important
application of stacks is in parsing. For example, a compiler must parse arithmetic expressions
written using infix notation:
1 + ((2 + 3) * 4 + 5) * 6
We break the problem of parsing infix expressions into two stages. First, we convert from
infix to a different representation called postfix. Then we parse the postfix expression, which is a
somewhat easier problem than directly parsing infix. The conversion from infix notation to post-
fix notation must take into consideration the operational hierarchy.
For example –
Infix notation: ( A – B ) * [ C / ( D + E ) + F ]
Post-fix notation: A B – C D E + / F + *
Here, we first perform the arithmetic inside the parentheses: (A-B) and (D+E). The division C/(D+E) must be done prior to the addition with F. After that, we multiply the two terms inside the parentheses and the bracket.
Consider the simple expression
2 + 5
Here 2 and 5 are called operands and '+' is the operator. This expression is written in infix notation, since the operator is between the operands. Writing the operator after the operands gives the postfix expression
2 5 +
and writing the operator before the operands gives the prefix expression
+ 2 5
Suppose you want to compute the cost of your shopping trip. To do so, you add a list of numbers and multiply the sum by 1.0725 to include the local sales tax (7.25%):
70 + 150 * 1.0725
Depending on the calculator, the answer would be either 235.95 or 230.875. To avoid this confusion, we shall use the postfix notation:
70 150 + 1.0725 *
3. If a token is an operator, push it onto the stack if the stack is empty. If the stack is not empty, pop entries with higher or equal priority and only then push that token onto the stack.
5. If a token is a right parenthesis ')', pop entries until you meet '('.
6. When you finish reading the string, pop all the tokens that are left on the stack.
To evaluate the resulting postfix expression, we again read the tokens one at a time: if a token is an operand, we push its value onto the stack.
3. If it is a binary operator, pop the top two elements from the stack, apply the operator, and push the result back on the stack.
Consider the following postfix expression
5 9 3 + 4 2 * * 7 + *
Note, that division is not a commutative operation, so 2/3 is not the same as 3/2.
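To tie the two stages together, here is a small, self-contained sketch that evaluates a space-separated postfix expression of single-digit operands; the function name eval_postfix and the fixed-size stack are assumptions made for illustration only.

#include <stdio.h>
#include <ctype.h>

/* Evaluate a postfix expression with single-digit operands, e.g.
   "5 9 3 + 4 2 * * 7 + *" evaluates to 515. */
int eval_postfix(const char *expr)
{
    int stack[64];
    int top = -1;

    for (const char *p = expr; *p != '\0'; p++) {
        if (isdigit((unsigned char)*p)) {
            stack[++top] = *p - '0';            /* operand: push its value */
        } else if (*p == '+' || *p == '-' || *p == '*' || *p == '/') {
            int b = stack[top--];               /* right operand */
            int a = stack[top--];               /* left operand  */
            int r = 0;
            switch (*p) {
                case '+': r = a + b; break;
                case '-': r = a - b; break;     /* order matters: a - b, not b - a */
                case '*': r = a * b; break;
                case '/': r = a / b; break;     /* likewise a / b */
            }
            stack[++top] = r;                   /* push the result back */
        }
        /* spaces are simply skipped */
    }
    return stack[top];                          /* the final value */
}

int main(void)
{
    printf("%d\n", eval_postfix("5 9 3 + 4 2 * * 7 + *"));   /* prints 515 */
    return 0;
}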
Watch:
Read:
Review:
3. How is a stack similar to a list? How is it different?
4. Give two stack operations, and what do these operations do?
5. What is one advantage of using a linked list to implement a stack rather than an array?
What is one disadvantage?
Activities/Assessments:
1. Many real-world situations correspond to a stack. For example, a pile of trays in a cafeteria
is a stack since the last tray put on the pile is the first tray used. Think of other real-world
stacks and describe the push and pop operations on these.
1. Infix: 4 * B + C – 5          Postfix: 4 B * C + 5 -          Prefix: - + * 4 B C 5
2. Infix: A + B * C - D          Postfix: A B C * + D –          Prefix: - + A * B C D
3. Infix: A + B * C ^ D          Postfix: A B C D ^ * +          Prefix: + A * B ^ C D
4. Infix: A * B + C – (D – E)    Postfix: A B * C + D E - -      Prefix: - + * A B C – D E
5. Infix: 3 * 4 + 5 / 2 - 3      Postfix: 3 4 * 5 2 / + 3 -      Prefix: – + * 3 4 / 5 2 3
3. A letter means push and an asterisk means pop in the following sequence. Give the
sequence of values returned by the pop operations when this sequence of operations is
performed on an initially empty LIFO stack.
E A S * Y * Q U E * * * S T * * * I O * N * * *
4. A letter means push and an asterisk means pop in the following sequence. Give the
contents of s[0], ..., s[4] after the execution of this sequence of operations performed on
an initially empty LIFO stack.
L A * S T I * N * F I R * S T * * O U * T * * * * * *
5. Suppose that a client performs an intermixed sequence of (stack) push and pop
operations. The push operations push the integers 0 through 9 in order on to the stack;
the pop operations print out the return value. Circle the following sequence(s) that could
not occur.
a. 4 3 2 1 0 9 8 7 6 5
b. 2 1 4 3 6 5 8 7 9 0
c. 0 4 6 5 3 8 1 7 2 9
d. 4 6 8 7 5 3 2 9 0 1
Programming:
1. Write a program that will continuously accept a character and store in a stack until letter
‘Z’ is entered.
2. Write a program that will continuously accept a character and store in a stack until a vowel
is entered, then display the letters in reverse order of their arrival.
3. Write a program that will continuously accept a character and store in a stack until an
asterisk (*) is entered, and then display the third to the last character entered.
4. Write a program that takes from standard input an expression without left parentheses
and prints the equivalent infix expression with the parentheses inserted. For example,
given the input:
1 + 2 ) * 3 - 4 ) * 5 - 6 ) ) )
your program should print
( ( 1 + 2 ) * ( ( 3 - 4 ) * ( 5 - 6 ) ) )
5. Write a program that takes an infix expression from standard input, evaluates it, and prints the value. (Piping the output of your program from the previous exercise to this program gives equivalent behavior to Evaluate.)
6. Write a program InfixToPostfix that converts an arithmetic expression from infix to postfix.
7. Write a program EvaluatePostfix that takes a postfix expression from standard input,
evaluates it, and prints the value.
Unit 6 - Queues
Overview:
Queues are data structures that follow the First In First Out (FIFO) principle where in the first
element that is added to the queue is the first one to be removed. In this unit we will explore
queues and its applications and learn how to implement it.
Module Objectives:
After successful completion of this module, you should be able to:
Course Materials:
Queue is an abstract data structure, somewhat similar to Stacks. Unlike stacks, a queue
is open at both ends. One end (REAR; also called tail) is always used to insert data (enqueue)
and the other (FRONT; also called head) is used to remove data (dequeue). Queue follows First-
In-First-Out methodology, i.e., the data item stored first will be accessed first. A real-world
example of queue can be a single-lane one-way road, where the vehicle enters first, exits first.
More real-world examples can be seen as queues at the ticket windows and bus-stops.
In a queue only two operations are allowed: enqueue and dequeue. Enqueue means to insert an item at the back of the queue; dequeue means removing the front item. Together these operations realize the FIFO access described above.
The difference between stacks and queues is in how items are removed. In a stack, we remove the item most recently added; in a queue, we remove the item least recently added.
Application of Queues
Queue, as the name suggests is used whenever we need to manage any group of objects
in an order in which the first one coming in, also gets out first while the others wait for their turn,
like in the following scenarios:
1. Serving requests on a single shared resource, like a printer, CPU task scheduling etc.
2. In a real-life scenario, call center phone systems use queues to hold the people calling them in order, until a service representative is free.
3. Handling of interrupts in real-time systems. The interrupts are handled in the same order as they arrive, i.e., first come, first served.
Queue can be implemented using an Array, Stack or Linked List. The easiest way of
implementing a queue is by using an Array.
Initially the head (FRONT) and the tail (REAR) of the queue points at the first index of the
array (starting the index of array from 0). As we add elements to the queue, the tail keeps on
moving ahead, always pointing to the position where the next element will be inserted, while the
head remains at the first index.
When we remove an element from Queue, we can follow two possible approaches
(mentioned [A] and [B] in above diagram). In [A] approach, we remove the element at head
position, and then one by one shift all the other elements in forward position.
In approach [B] we remove the element from head position and then move head to the
next position.
In approach [A] there is an overhead of shifting the elements one position forward every
time we remove the first element.
In approach [B] there is no such overhead, but whenever we move head one position ahead after removing the first element, the usable size of the queue is reduced by one.
Circular Queue
In order to resolve the problem with [B] we can implement what we call a circular
queue. Given an array A of a default size (≥ 1) with two references back and front,
originally set to -1 and 0 respectively. Each time we insert (enqueue) a new item, we
increase the back index; when we remove (dequeue) an item - we increase the front index.
Here is a picture that illustrates the model after a few steps:
As you see from the picture, the queue logically moves in the array from left to
right. After several moves back reaches the end, leaving no space for adding new
elements
However, there is a free space before the front index. We shall use that space for
enqueueing new items, i.e. the next entry will be stored at index 0, then 1, until front. Such
a model is called a wrap-around queue or a circular queue.
Figure 35 Circular queue (wrap around)
Finally, when back reaches front, the queue is full. There are two choices to handle
a full queue: a) throw an overflow error ; b) double the array size (increase storage space).
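A sketch of the wrap-around behaviour using the modulo operator; it keeps an explicit count of stored items to distinguish a full queue from an empty one, which is an implementation choice rather than part of the description above, and the names are illustrative.

#define SIZE 8                       /* an arbitrary default capacity */

int cqueue[SIZE];
int front = 0;                       /* index of the next item to dequeue    */
int back  = -1;                      /* index of the most recently enqueued  */
int count = 0;                       /* number of items currently stored     */

int cq_enqueue(int data)
{
    if (count == SIZE)
        return 0;                    /* full: report overflow (or grow the array) */
    back = (back + 1) % SIZE;        /* wrap around past the end of the array */
    cqueue[back] = data;
    count++;
    return 1;
}

int cq_dequeue(int *data)
{
    if (count == 0)
        return 0;                    /* empty: report underflow */
    *data = cqueue[front];
    front = (front + 1) % SIZE;      /* wrap around past the end of the array */
    count--;
    return 1;
}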
Basic Operations
Queue operations may involve initializing or defining the queue, utilizing it, and then completely erasing it from memory. Here we shall try to understand the basic operations associated with queues −
• enqueue() − add (store) an item to the queue.
• dequeue() − remove (access) an item from the queue.
Few more functions are required to make the above-mentioned queue operation efficient.
These are −
• peek() − Gets the element at the front of the queue without removing it.
• isfull() − Checks if the queue is full.
• isempty() − Checks if the queue is empty.
In a queue, we always dequeue (or access) the data pointed to by the front pointer, and while enqueueing (or storing) data in the queue we make use of the rear pointer.
peek()
This function helps to see the data at the front of the queue. The algorithm of the peek() function is as follows −
procedure peek
   return queue[front]
end procedure
Implementation of peek() function in C programming language −
int peek() {
return queue[front];
}
isfull()
As we are using a single-dimension array to implement the queue, we just check whether the rear pointer has reached MAXSIZE - 1 to determine that the queue is full. In case we maintain the queue in a circular linked list, the algorithm will differ. Algorithm of the isfull() function −
bool isfull() {
if(rear == MAXSIZE - 1)
return true;
else
return false;
}
isempty()
procedure isempty
   if front is less than MIN OR front is greater than rear
      return true
   else
      return false
   endif
end procedure
If the value of front is less than MIN (0), the queue has not yet been initialized and is hence empty; if front has moved past rear, every element has been dequeued. Here's the C programming code −
bool isempty() {
if(front < 0 || front > rear)
return true;
else
return false;
}
Enqueue Operation
Queues maintain two data pointers, front and rear. Therefore, queue operations are comparatively more difficult to implement than those of stacks. The following steps should be taken to
enqueue (insert) data into a queue −
Sometimes, we also check to see if a queue is initialized or not, to handle any unforeseen
situations.
procedure enqueue(data)
if queue is full
return overflow
endif
rear ← rear + 1
queue[rear] ← data
return true
end procedure
Implementation of enqueue() in C programming language −
int enqueue(int data) {
   if(isfull())
      return 0;
   rear = rear + 1;
   queue[rear] = data;
   return 1;
}
Dequeue Operation
Accessing data from the queue is a process of two tasks − access the data where front is
pointing and remove the data after access. The following steps are taken to perform dequeue
operation −
procedure dequeue
if queue is empty
return underflow
end if
data = queue[front]
front ← front + 1
return true
end procedure
Implementation of dequeue() in C programming language −
int dequeue() {
   if(isempty())
      return 0;
   int data = queue[front];
   front = front + 1;
   return data;
}
Watch:
Read:
Review:
Activities/Assessments:
1. Suppose that Q is an initially empty array-based queue of size 5. Show the values of the
data members front and back after each statement has been executed. Indicate any errors
that might occur.
2. A letter means enqueue and an asterisk means dequeue in the following sequence. Give
the sequence of values returned by the dequeue operations when this sequence of
operations is performed on an initially empty FIFO queue.
E A S * Y * Q U E * * * S T * * * I O * N * * *
3. Suppose that a client performs an intermixed sequence of (queue) enqueue and dequeue
operations. The enqueue operations put the integers 0 through 9 in order onto the queue;
the dequeue operations print out the return value. Which of the following sequence(s)
could not occur?
a. 0 1 2 3 4 5 6 7 8 9
b. 4 6 8 7 5 3 2 9 0 1
c. 2 5 6 7 4 8 9 3 1 0
d. 4 3 2 1 0 5 6 7 8 9
Programming:
1. The Josephus problem is the following game: N people, numbered 1 to N, are sitting in a
circle. Starting at person 1, a hot potato is passed. After M passes, the person holding the
hot potato is eliminated, the circle closes ranks, and the game continues with the person
who was sitting after the eliminated person picking up the hot potato. The last remaining
person wins. Thus, if M = 0 and N = 5, players are eliminated in order, and player 5 wins.
If M = 1 and N = 5, the order of elimination is 2, 4, 1, 5.
a) Write a program to solve the Josephus problem for general values of M and N. Try
to make your program as efficient as possible. Make sure you dispose of cells.
c) If M = 1, what is the running time of your program? How is the actual speed affected
by the delete routine for large values of N (N > 100,000)?
Unit 7 - Trees
Overview:
We have already considered two data structures, the stack and the queue. Both are linear in
nature and are represented in the memory either by a sequentially allocated vector (array) or by
a linked linear list. In this unit we now focus on two important non-linear structures- the tree and
the binary tree.
Module Objectives:
After successful completion of this unit, you should be able to:
Course Materials:
A tree is a nonlinear data structure and is a special form of directed graph. It consists of a finite set of elements called nodes (or vertices) and a finite set of directed arcs called branches (or edges) that connect pairs of nodes.
Vertices of a tree which have successors are called ‘nonterminal vertices,’ while vertices that have no successors are called ‘terminal vertices’ or ‘leaves.’ Figure 38 illustrates a tree with root vertex a; two nonterminal vertices a, c; and four terminal vertices b, d, e, f.
Trees in which each nonterminal vertex has at most n successors are called ‘n-ary’ trees. Trees in which each nonterminal vertex has at most two successors are called ‘binary’ trees. The tree in Figure 38 is an example of an n-ary tree, or a general tree.
Figure 38 A general tree with root a; a’s children are b and c, and c’s children are d, e, and f
Tree Representation
A tree can also be written out textually. For example, a tree with root A, children B and C, and C’s children D, E, and F can be represented with parentheses as (A(B)(C(D)(E)(F))), or equivalently by indentation or by nesting of the subtrees.
Tree Vocabulary
Many algorithms pertaining to tree structures usually involve a process in which each node
of the tree is “visited”, or processed, exactly once. Such a process is called a traversal. A complete
traversal of a tree yields a linear arrangement of the nodes of the tree. Such a linear ordering can
be very useful in processing the information stored in the nodes of the tree.
There are various ways to visit (traverse or search) a general tree but the two most
common ones will be discussed here. The simplest two search techniques are known as Depth-
First Search (DFS) and Breadth-First Search (BFS). These two searches are described by looking
at how the search tree (representing all the possible paths from the start) will be traversed.
In depth-first search we explore one path as deep as possible before backtracking, keeping the choices still to be tried on a stack:
Create a stack
Create a new choice point
Push the choice point onto the stack
while (not found and stack is not empty)
Pop the stack
Find all possible choices after the last one tried
Push these choices onto the stack
Return
In breadth-first search we explore all the nearest possibilities by finding all possible
successors and enqueue them to a queue.
Create a queue
Create a new choice point
Enqueue the choice point onto the queue
while (not found and queue is not empty)
Dequeue the queue
Find all possible choices after the last one tried
Enqueue these choices onto the queue
Return
Binary Tree
The binary tree is by far the most important non-linear data structure in computer
algorithms. A binary tree is a finite set of nodes which is either empty or consists of a root node
and maximum of two disjoint binary trees called left and right subtrees of the root.
Note that while a tree must have at least one node, a binary tree may be empty. Likewise,
a tree may have any number of subtrees; a binary tree can have at most two subtrees. Following
are examples of binary trees.
Tree Vocabulary
Figure 41 Tree1, Tree2, Tree3 are all similar; Tree1, Tree2 are equivalent
One issue with an n-ary tree is that it is quite difficult to search or traverse, since each node can have a different number of subtrees. A way to resolve this is to convert the n-ary tree into a binary tree. With a binary tree, searching or traversing is more uniform because there are only two possible paths to take at each node. Following are the steps in converting an n-ary tree to a binary tree.
Step 1 − The left child of a node in the n-ary tree becomes the left child of the node in the
binary tree.
Step 2 − The next sibling of each node in the n-ary tree becomes the right child of the
node in the binary tree.
Figure 42 Converting an n-ary tree to a binary tree
Types of Binary Trees (based on Structure)
• Rooted binary tree: It has a root node and every node has at most two children.
• Full binary tree: It is a tree in which every node in the tree has either 0 or 2 children.
• Perfect binary tree: It is a binary tree in which all interior nodes have two children and
all leaves have the same depth or same level.
• Complete binary tree: It is a binary tree in which every level, except possibly the last,
is completely filled, and all nodes are as far left as possible.
• Balanced binary tree: A binary tree is height balanced if it satisfies the following
constraints:
a) The left and right subtrees' heights differ by at most one, AND
b) The left subtree is balanced, AND
c) The right subtree is balanced
The height of a balanced binary tree is O(Log n) where n is number of nodes. Further
discussion about balanced trees is presented in the next unit.
• Degenerate tree: It is a tree where each parent node has only one child node. It behaves like a linked list. This type of tree is considered a skewed tree. A skewed tree is one with more levels of nodes in one subtree of a node than in the other subtree.
Recall that tree traversal is a process to visit all the nodes of a tree, possibly printing their values as well. Because all nodes are connected via edges (links), we always start from the root (head) node; we cannot randomly access a node in a tree. Generally, we traverse a tree to search for or locate a given item or key in the tree, or to print all the values it contains.
In a binary tree, there are three different entities involved: a root, its left subtree, and its
right subtree. The sequence in which we process these entities defines a particular traversal
method.
We will consider three such traversal methods, namely: preorder traversal, inorder traversal, and
postorder traversal. These three methods are defined recursively. When a binary tree is empty, it
is traversed by doing nothing.
1. Preorder traversal
In this traversal method, the root node is visited first, then the left subtree and
finally the right subtree.
We start from A, and following pre-order traversal, we first visit A itself and then
move to its left subtree B. B is also traversed pre-order. The process goes on until all the
nodes are visited. The output of pre-order traversal of this tree will be −
A→B→D→E→C→F→G
Algorithm
Until all nodes are traversed −
Step 1 − Visit the root node.
Step 2 − Recursively traverse the left subtree.
Step 3 − Recursively traverse the right subtree.
2. Inorder traversal
In this traversal method, the left subtree is visited first, then the root, and later the right subtree. We should always remember that every node may represent a subtree itself.
If a binary tree is traversed in-order, the output will produce sorted key values in
an ascending order.
We start from A, and following in-order traversal, we move to its left subtree B. B
is also traversed in-order. The process goes on until all the nodes are visited. The output
of inorder traversal of this tree will be −
D→B→E→A→F→C→G
Algorithm
Until all nodes are traversed −
Step 1 − Recursively traverse the left subtree.
Step 2 − Visit the root node.
Step 3 − Recursively traverse the right subtree.
3. Postorder traversal
In this traversal method, the root node is visited last, hence the name. First, we
traverse the left subtree, then the right subtree and finally the root node.
We start from A, and following Post-order traversal, we first visit the left subtree B.
B is also traversed post-order. The process goes on until all the nodes are visited. The
output of post-order traversal of this tree will be −
D→E→B→F→G→C→A
Algorithm
Until all nodes are traversed −
Step 1 − Recursively traverse the left subtree.
Step 2 − Recursively traverse the right subtree.
Step 3 − Visit the root node.
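The three recursive traversals can be written almost directly from the algorithms above. The struct tnode type and the choice to print single characters are illustrative assumptions.

#include <stdio.h>

struct tnode {
    char data;
    struct tnode *left;
    struct tnode *right;
};

void preorder(struct tnode *root)      /* root, then left subtree, then right subtree */
{
    if (root == NULL) return;          /* an empty tree is traversed by doing nothing */
    printf("%c ", root->data);
    preorder(root->left);
    preorder(root->right);
}

void inorder(struct tnode *root)       /* left subtree, then root, then right subtree */
{
    if (root == NULL) return;
    inorder(root->left);
    printf("%c ", root->data);
    inorder(root->right);
}

void postorder(struct tnode *root)     /* left subtree, then right subtree, then root */
{
    if (root == NULL) return;
    postorder(root->left);
    postorder(root->right);
    printf("%c ", root->data);
}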
We have seen in previous discussions how a tree can be presented in various forms. We
also have learned of the different traversal methods that one can use in passing through the
different nodes of the tree.
By using the traversal information, we can re-construct a tree to its original form. To do
this, one will need two pieces of information: the in-order traversal, and the pre-order or post-
order traversal.
With the pre-order or post-order traversal information, one can easily identify the root. The root is the first value or node in a pre-order traversal, and the last value or node in a post-order traversal. On the other hand, with the in-order traversal, we can easily identify which nodes fall on the left side or on the right side of a tree, given of course the root node (from the pre-order or post-order traversal).
To present this clearly, let us consider the binary tree below with its corresponding pre-order, post-order, and in-order traversal information. The tree will be our basis for checking whether what we reconstruct is correct.
Figure 51 Traversal Information (the tree has root C; B is its left child with left child A; G is its right child with left child F and right child H; F has left child D, and D has right child E)
Pre-order: CBAGFDEH
Post-order: ABEDFHGC
In-order: ABCDEFGH
Sample 1: Using the pre-order and in-order traversal information in Figure 51, reconstruct the
binary tree.
Pre-order: CBAGFDEH
In-order: ABCDEFGH
First, we begin by analyzing the pre-order traversal information and identify the root among
the values. Since in a pre-order traversal, it is the root node that is first visited, we identify the
value C to be the root node. We then draw the tree with the information we have as seen below.
Using this information, we proceed with the second process, that is we look into the in-
order traversal information and identify the nodes on the left of node C and the nodes on the right
of node C. From here, we see that nodes A and B are found on the left of node C and thus form
the left-subtree, while nodes D, E, F, G, and H are found on the right of node C forming the right-
subtree.
Let’s take the left subtree of root C first. Here we see that nodes A and B are part of the
left subtree. We need to identify which node is the root of the left subtree. For this, we need to go
back to the pre-order traversal and identify among nodes A and B the root node. Again, in the
pre-order traversal, the first node evaluated is always the root. Thus, we can say that node B is
the root of the left subtree. From, the in-order traversal, we see that node A is at the left of node
B, so we draw the tree as seen below.
We check now the right subtree. From amongst nodes D, E, F, G, and H, we need to see
using the pre-order traversal information the node that will serve as the root. Among said nodes,
we find that G is at the front of the list, thus G serves as the root of the right subtree. We now
have a tree with information on the roots of both its left and right subtree as seen in Figure 54.
Figure 54 The partial tree so far: C as the root, with B as its left child and G as its right child
Going back to the in-order traversal, we check the remaining nodes with respect to node
G. Here we can ascertain that nodes D, E, and F are found on the left side of G, whereas node H
is found at the right of G. We then update our tree, adding node H to the picture.
(The updated tree: C as the root; B with left child A; G with right child H.)
We are now left with nodes D, E, and F. Going back to the pre-order traversal information,
we identified node F to be the root of the left subtree of node G. With that, we check back on the
in-order traversal and find nodes D and E at the left of node F and thus forming the nodes of the
left-subtree of node F. The updated tree with the node F added can be seen below.
(The updated tree: C as the root; B with left child A; G with left child F and right child H.)
Looking back at the pre-order traversal, we notice that node D comes first before node E
and thus mean that node D is the root of the subtree of node F. At the in-order traversal, we can
discern that node E is at the right of node D. The final tree is shown in Figure 57.
Figure 57 The final reconstructed tree: C as the root; B with left child A; G with left child F and right child H; F with left child D; D with right child E
Expression Trees
Expression tree represents an expression as a binary tree. Operators are parent nodes to
operands (leaf nodes) for example expression tree for 3 + ((5+9)*2) would be:
The order in which the tree is traversed gives a method for evaluating the expression. In-order traversal of an expression tree produces the infix version of the expression, pre-order traversal gives the prefix version, and post-order traversal gives the postfix version.
Evaluating the expression represented by an expression tree:
Let t be the expression tree
If t is not null then
   If t.value is an operand then
      Return t.value
   A = solve(t.left)
   B = solve(t.right)
   Return A (t.value) B, i.e., apply the operator stored in t.value to A and B
In constructing an expression tree (from a postfix expression) we use a stack. We loop through the input expression and do the following for every character: if the character is an operand, we create a one-node tree for it and push that tree onto the stack; if the character is an operator, we create a node for it, pop two trees from the stack to become its right and left children, and push the new tree back onto the stack.
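A small sketch of the recursive evaluation just described, for a tree whose leaves hold single-digit operands; struct enode and solve() mirror the pseudocode names but are otherwise illustrative.

#include <stddef.h>

struct enode {
    char value;                        /* an operator, or a digit character for a leaf */
    struct enode *left;
    struct enode *right;
};

int solve(struct enode *t)
{
    if (t->left == NULL && t->right == NULL)   /* a leaf holds an operand */
        return t->value - '0';
    int a = solve(t->left);                    /* evaluate the left subtree  */
    int b = solve(t->right);                   /* evaluate the right subtree */
    switch (t->value) {                        /* apply the operator in this node */
        case '+': return a + b;
        case '-': return a - b;
        case '*': return a * b;
        default:  return a / b;                /* '/' */
    }
}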
A "binary search tree" (BST) or "ordered binary tree" is a type of binary tree where the
nodes are arranged in order: for each node, all elements in its left subtree are less-or-equal to the
node (<=), and all the elements in its right subtree are greater than the node (>). Recursively,
each of the subtrees must also obey the binary search tree constraint: in the (1, 3, 4) subtree (see
Figure 59), the 3 is the root, the 1 <= 3 and 4 > 3.
The above properties of Binary Search Tree provide an ordering among keys so that the
operations like search, minimum and maximum can be done fast. If there is no ordering, then we
may have to compare every key to search a given key.
Tree Vocabulary
Immediate predecessor – the largest value in the left subtree. The immediate predecessor is located by following the left pointer of a node and then chasing right pointers until a node with a nil right pointer is reached.
Immediate successor – the smallest value in the right subtree. It is located by following the right pointer of a node and then chasing left pointers until a node with a nil left pointer is reached.
Searching in BST
The search operation in a binary search tree is very similar to binary search. Say we want to search for a number: we start at the root and compare the value being searched for with the value of the root. If they are equal, we are done with the search. If the value is smaller, we go to the left subtree, because in a binary search tree all the elements in the left subtree are smaller and all the elements in the right subtree are larger; if it is larger, we go to the right subtree.
Searching for an element in a binary search tree is therefore a traversal in which, at each step, we go either left or right, and hence at each step we discard one of the subtrees. Suppose the tree is balanced (we call a tree balanced if, for all nodes, the difference between the heights of the left and right subtrees is not greater than one). We then start with a search space of n nodes, and when we discard one of the subtrees we discard roughly n/2 nodes, so the search space is reduced to n/2, then to n/4, and so on, until we find the element or the search space is reduced to a single node. The search here is a binary search, and that is why the structure is called a binary search tree.
Algorithm
A new key is always inserted at a leaf. We start searching for the key from the root until we hit a leaf node. Once a leaf node is found, the new node is added as a child of that leaf node.
The worst-case time complexity of search and insert operations is O(h) where h is height
of Binary Search Tree. In worst case, we may have to travel from root to the deepest leaf node.
The height of a skewed tree may become n and the time complexity of search and insert operation
may become O(n).
Algorithm
1. Node to be deleted is a leaf: simply remove it from the tree.
2. Node to be deleted has only one child: copy the child to the node and delete the child.
3. Node to be deleted has two children: find the inorder successor of the node, copy the contents of the inorder successor to the node, and delete the inorder successor. Note that the inorder predecessor can also be used.
The important thing to note is that the inorder successor is needed only when the right child is not empty. In this particular case, the inorder successor can be obtained by finding the minimum value in the right child of the node.
Heap Tree
Heaps are based on the notion of a complete tree, for which we gave an informal definition
earlier. A heap is a binary tree with two special properties:
1. Ordering property
The value in a node is smaller than all the values in the node's subtrees (MinHeap
property). This is very different than a binary search tree! Note that this property makes
no distinction between ‘left’ and ‘right’. All the following trees have this property but
only one is a heap:
The smallest value must always be in the root, but the largest value could be in any
leaf. Similar values are not collected together in any convenient location. Incidentally,
heaps can also be defined with the opposite ordering property: the value in each node
is larger than all the values in its subtrees (MaxHeap property).
2. Structural property
All the levels are full, except perhaps the bottom level, and all the nodes in the bottom
level are as far to the ‘left’ as possible. A level is full if there are no ‘missing nodes’ at
the level - there are 2**L nodes in level L.
This structural property could be rephrased: the only nodes that can have an empty
subtree are ones in the bottom level (by definition these have 2 empty subtrees) and
the rightmost nodes in the next-to-bottom level.
A tree that has this property is called a complete tree. There are two motives for
restricting ourselves to complete trees:
a. In a complete tree Height = O(logN). As we'll see, our Heap operations are
O(Height), so by restricting ourselves to complete trees, we get O(logN)
operations.
b. complete trees have certain advantages over general binary trees in some
implementations.
Although all 3 of the Figure 63 trees have the ordering property, only one of them
satisfies the structural requirement – that is the middle one.
Operations on Heaps
• Heapify → rearranging the elements of the heap to restore the heap property.
• Insertion → adding a new element to the heap.
• Deletion → removing the element at the root of the heap.
• Find Min-Max → returning the minimum or maximum element, found at the root of the min-heap or max-heap respectively.
• Extract Min-Max → returning and deleting the maximum or minimum element in the max-heap and min-heap respectively.
Heapify
It is a process to rearrange the elements of the heap in order to maintain the heap property. It is done when a certain node causes an imbalance in the heap due to some operation on that node. Heapify can be done in two ways: bottom-up (up_heapify), which moves a node up the tree until the property is restored, and top-down (down_heapify), which moves a node down the tree.
void up_heapify(int heap[], int child)
{
   if (child == 0)                   /* reached the root, nothing to do */
      return;
   int parent = (child - 1) / 2;
   if (heap[parent] < heap[child]) { /* max-heap property violated */
      int tmp = heap[parent];        /* swap parent and child */
      heap[parent] = heap[child];
      heap[child] = tmp;
      up_heapify(heap, parent);      /* continue up the tree */
   }
}
Insertion
The first step is to insert the new element at the end of the heap (the next free position in the bottom level); this results in (b). The second step is to heapify the new element following the bottom-up approach. The final heap is (c).
Deletion
The standard deletion on Heap is to delete the element present at the root node of
the heap.
For example, consider the heap tree (a) in Figure 65, which follows the max-heap property. The element to be deleted is 12. Replacing the root with the last element ‘4’ and deleting that last element results in (b). Since the resulting tree does not follow the heap property, we perform down_heapify() on the new root, resulting in the final heap tree (c).
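A sketch of down_heapify() for a max-heap stored in an array, as used in the deletion just described; the parameter n is the current number of elements in the heap, and the indexing convention (children of i at 2i+1 and 2i+2) matches the one used by up_heapify above.

/* Restores the max-heap property at index i, assuming both subtrees of i
   are already valid max-heaps. */
void down_heapify(int heap[], int i, int n)
{
    int left = 2 * i + 1;                  /* children of i in the array layout */
    int right = 2 * i + 2;
    int largest = i;

    if (left < n && heap[left] > heap[largest])
        largest = left;
    if (right < n && heap[right] > heap[largest])
        largest = right;

    if (largest != i) {                    /* the max-heap property is violated */
        int tmp = heap[i];                 /* swap i with its larger child */
        heap[i] = heap[largest];
        heap[largest] = tmp;
        down_heapify(heap, largest, n);    /* continue down the tree */
    }
}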
Find-max (or Find-min)
The maximum element of a max-heap and the minimum element of a min-heap are found at the root node of the heap.
Extract Min-Max
This operation returns and deletes the maximum or minimum element in max-heap
and min-heap respectively. The maximum element is found at the root node.
Watch:
Read:
• Chapter 8 Trees
Goodrich, Michael T. (2014). Data Structures and Algorithms in Java 6th Edition
• Chapter 4 – Trees (Sec. 4.1 – 4.3, pp. 121-144)
Weiss, Mark Allen (2014). Data Structures And Algorithm Analysis In C++ 4th Edition
Review:
1. What is a tree?
2. What is a binary tree? How is it different from a tree?
3. What is an advantage of a binary search tree over a binary tree?
4. How to check if a given Binary Tree is BST or not?
5. What is a heap?
6. Describe the difference between a Max Heap over a Min Heap.
Activities/Assessments:
1. The following questions refer to the tree below.
a. Which node is the root?
b. What are the internal nodes?
c. How many descendants does node cs016/ have?
d. How many ancestors does node cs016/ have?
e. What are the siblings of node homeworks/?
f. Which nodes are in the subtree rooted at node projects/?
g. What is the depth of node papers/?
h. What is the height of the tree?
2. Draw a sketch of the logical representation for the following tree.
6. Given the pre-order traversal and in-order traversal of a tree, reconstruct the tree.
Pre: Y O C P H I R G T
In: COPYRIGHT
7. Given the post-order traversal and in-order traversal of a tree, reconstruct the tree.
Post: C O P Y R I G H T
In: COYPTRGIH
9. Draw the structure of a binary search tree
a. after these values have been inserted: 19, 34, 23, 16, 54, 89, 24, 29, 15, 61, 27.
b. after two delete operations (deleting the root)
10. Draw the structure of a binary heap (MinHeap):
a. after these priorities have been inserted: 19, 34, 23, 16, 54, 89, 24, 29, 15, 61, 27.
b. after two deleteMin operations (using the tree in (a)).
Programming:
1. Write a program that implements BINARY TREE TRAVERSALS using preorder traversal,
inorder traversal, and postorder traversal to produce a linear arrangement of the nodes.
Program input consists of the root nodes and its left and right subtrees.
Example
Input:
(A,B,E) (this means A is the root, B is the left subtree, and E is the right subtree)
(B,C,null) (this means B is the root, C is the left subtree, and there is no right subtree)
(C,null,D)
(E,F,H)
(F,null,G)
(H,I,J)
Output:
Node L-Subtree R-Subtree
A B E
B C null
C null D
E F H
F null G
H I J
Preorder Traversal: A B C D E F G H I J
Inorder Traversal: C D B A F G E I H J
Postorder Traversal: D C B G F I J H E A
Unit 8 – Balanced Trees
Overview:
The search time in a binary search tree depends on the form of the tree, that is on the order in
which its nodes were inserted. In practice, one can however not always guarantee that binary
search trees are built randomly. Binary search trees are thus only interesting if they are “relatively
complete.” So we must look for specializations of binary search trees whose worst-case
performance on the basic tree operations can be guaranteed to be good, that is O(log2n) time. In
this unit we will talk about balanced trees as a solution to the problem of search time. Particularly
we will look into the balanced properties of binary search trees and m-way search trees.
Module Objectives:
After successful completion of this unit, you should be able to:
Course Materials:
The Binary Search Tree has a serious deficiency for practical use as a search structure.
That is the fact that it can easily become unbalanced, so that some nodes are deep in the tree. In
fact, it is possible for a BST with n nodes to have a depth of n, making it no faster to search in the
worst case than a linked list. If we could keep the tree balanced in some way, then search cost
would only be O(log2n), a huge improvement.
Balanced Tree is a tree where no leaf is much farther away from the root than any other
leaf. Different balancing schemes allow different definitions of "much farther" and different
amounts of work to keep them balanced:
1. A binary tree is height-balanced if for each node the heights of its subtrees differ by
at most 1. (Remember, the height of an empty subtree is -1).
2. A binary tree is weight-balanced if for each node the numbers of inner nodes in its
left subtree and right subtree differ by at most 1.
The tree below is height-balanced. It is not weight-balanced because the left subtree of
the root has 3 inner nodes but the right subtree of the root has only 1. It has been shown that a
weight-balanced tree is also height-balanced.
Figure 66 Height-balanced Tree
The key term in both definitions is “for each node”. The property being described must
hold for all nodes, not just the root. To see the need for “for each node”, consider the tree whose
root has a copy of the tree to the right as its left subtree and another copy as its right subtree. It
is not weight-balanced, even though the left and right subtrees of the root have the same number
of inner nodes.
The important data structure called a Binary Search Tree, or BST, was invented in order
to make searching for a value more efficient, say, than searching in an unordered array. However,
if the BST tree is unbalanced then search is not efficient. It is difficult and inefficient to keep a BST
balanced as values are added and deleted.
Therefore, computer scientists looked for ways to enhance the BST data structure in order
to make balancing more efficient as values are added and deleted. This required modifying the
BST —by adding more information to each node or by allowing a node to have more than two
children. Invented were the AVL tree, the 2-3 tree, the B- tree, the red-black tree, and many more.
With each such new tree data structure, the inventor had to say what property they meant
by balanced. The goal was the same, of course: if the tree was balanced according to the
definition, then the inventors could prove that the height of the tree was O(log size-of-tree).
A height-balanced tree (or self-balancing tree) is a binary tree that automatically keeps its height small, rebalancing itself and its subtrees on each insertion and deletion of a node. AVL trees and red-black trees are examples of height-balanced trees. An AVL tree uses a balance factor to keep the height of the tree balanced; the balance factor of a node is simply the difference between the heights of its left and right subtrees.
AVL Tree
The AVL tree is named after its two inventors, G.M. Adelson-Velskii and E.M. Landis, who published it in their 1962 paper "An algorithm for the organization of information." As the name suggests, AVL trees are used for organizing information. They are not perfectly balanced, but they maintain O(log2n) search, insertion, and deletion times, where n is the number of nodes in the tree.
An AVL tree is a binary search tree where the sub-trees of every node differ in
height by at most 1.
A tree is perfectly height-balanced if the left and right subtrees of any node are the
same height. e.g.
It is clear that at every level there are twice as many nodes as at the previous level,
so we do indeed get H = O(logN). However, perfect height balance is very rare: it is only
possible if there are exactly 2^H-1 nodes.
As a practical alternative, we use trees that are ‘almost’ perfectly height balanced. We will say that a tree is height-balanced if the heights of the left and right subtrees of each node are within 1 of each other. The following tree fits this definition:
How can we tell if a tree is height-balanced? We have to check every node. The
fastest way is to start at the leaves and work your way up. When you reach a node, you
will know the heights of its two subtrees; from that you can tell whether it is height-balanced
and also compute the node's height (for use by its parent). For example:
a. C and D are leaves. Their subtrees are all height 0 so C and D are both
perfectly balanced. Having finished D we can compute the heights of B's
subtrees.
b. B is not perfectly balanced, but the heights of its subtrees differ only by 1, so B
is regarded as height balanced. Now we can compute the balance of A.
c. Like B, A's two subtrees differ by 1 in height. We have now looked at every
node; every node is height-balanced, so the tree as a whole is considered to
be height-balanced.
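The bottom-up check just described can be sketched as a single recursive function that returns a node's height and clears a flag when any node is found to be unbalanced; the node type is illustrative, and the convention that an empty subtree has height -1 follows the definition earlier in this unit.

#include <stddef.h>

struct btnode { int data; struct btnode *left, *right; };

/* Returns the height of the subtree rooted at 'node' and sets *balanced to 0
   if any node in that subtree violates the height-balance rule. */
int check_height(struct btnode *node, int *balanced)
{
    if (node == NULL)
        return -1;                                  /* empty subtree */
    int hl = check_height(node->left, balanced);    /* height of left subtree  */
    int hr = check_height(node->right, balanced);   /* height of right subtree */
    if (hl - hr > 1 || hr - hl > 1)
        *balanced = 0;                              /* this node is unbalanced */
    return (hl > hr ? hl : hr) + 1;                 /* height reported to the parent */
}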
Insertion in an AVL tree is the same as in a binary search tree which is always
done by expanding an external node. To make sure that the given tree remains AVL after
every insertion, we must augment the standard BST insert operation to perform some re-
balancing. Following are two basic operations that can be performed to re-balance a BST
without violating the BST property (keys(left) < key(root) < keys(right)).
1) Left Rotation
2) Right Rotation
Figure 71 Before a Left Rotation (Single): A is right-heavy, with right child B whose right child is C
To fix this, we must perform a left rotation rooted at A. In a single left rotation, A's right child B becomes the new root of the subtree, A becomes B's left child, and B's former left subtree (if any) becomes A's right subtree. The tree now looks like this:
Figure 72 After performing a Left Rotation (Single): B is the root, with A as its left child and C as its right child
A right rotation is the mirror image: it is used when a node is left-heavy, for example C with left child B whose left child is A. Here B becomes the new root of the subtree, C becomes B's right child, and B's former right subtree (if any) becomes C's left subtree. The resulting tree:
Figure 74 After performing a Right Rotation (Single): B is the root, with A as its left child and C as its right child
Figure 75 Unbalanced tree after a Left Rotation (Single): (a) the original balanced tree, (b) the tree after inserting B, (c) the tree still unbalanced after a single left rotation
Initially with (a) we have a balanced tree. From this tree we insert node ‘B’,
the resulting tree is shown in (b) which is an unbalanced tree. An initial reaction
here is to do a single left rotation but the resulting tree (c) is still unbalanced. If we
were to do a single right rotation in this situation, we would be right back where we
started. What's causing this?
The answer is that this is a result of the right subtree having a negative
balance. In other words, because the right subtree was left heavy, our rotation was
not sufficient. What can we do? The answer is to perform a right rotation on the
right subtree. Read that again. We will perform a right rotation on the right subtree.
We are not rotating on our current root. We are rotating on our right child. Think of
our right subtree, isolated from our main tree, and perform a right rotation on it:
(a) the unbalanced tree after adding ‘B’; (b) the tree after a right rotation on the right subtree; (c) the final tree after a left rotation at the root
Image (a) is the resulting unbalanced tree after adding ‘B’ in a balanced
tree. Performing a right rotation on the right subtree will result to (b). After
performing a rotation on our right subtree, we have prepared our root to be rotated
left. The final tree is represented by (c).
(a) the unbalanced tree; (b) the tree after a single right rotation, still unbalanced; (c) the tree after a left rotation on the left subtree; (d) the final tree after a right rotation rooted at C
In this situation, we have a tree (a) that is unbalanced. The left subtree has
a height of 2, and the right subtree has a height of 0. This makes the balance factor
of our root node, c, equal to -2. What do we do? Some kind of right rotation is
clearly necessary, but a single right rotation will not solve our problem. Performing
a right rotation resulted to a tree (b) that has a balance of 2. It would appear that
we did not accomplish much. Let’s go back to the original tree, before we did our
pointless right rotation.
The reason our right rotation did not work, is because the left subtree, or
'a', has a positive balance factor, and is thus right heavy. Performing a right rotation
on a tree that has a left subtree that is right heavy will result in the problem we just
witnessed. What do we do? The answer is to make our left subtree left-heavy. We do this by performing a left rotation on our left subtree. Doing so leaves us with (c). This is a tree which can now be balanced using a single right rotation. We can now perform our right rotation rooted at C, resulting in (d).
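The two single rotations, and the double rotation built from them, can be sketched as follows; the node type and function names are illustrative, and each function returns the new root of the rotated subtree so the caller can relink it.

#include <stddef.h>

struct anode { int key; struct anode *left, *right; };

struct anode *rotate_left(struct anode *a)      /* use when 'a' is right-heavy */
{
    struct anode *b = a->right;                 /* b becomes the new subtree root */
    a->right = b->left;                         /* b's left subtree moves under a */
    b->left = a;                                /* a becomes b's left child       */
    return b;
}

struct anode *rotate_right(struct anode *c)     /* use when 'c' is left-heavy */
{
    struct anode *b = c->left;                  /* b becomes the new subtree root  */
    c->left = b->right;                         /* b's right subtree moves under c */
    b->right = c;                               /* c becomes b's right child       */
    return b;
}

/* The double rotation discussed above (a left-heavy node whose left subtree
   is itself right-heavy): first rotate the left child left, then rotate the
   node right. */
struct anode *rotate_left_right(struct anode *c)
{
    c->left = rotate_left(c->left);
    return rotate_right(c);
}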
How to decide when you need a tree rotation is usually easy, but determining which
type of rotation you need requires a little thought.
A tree rotation is necessary when you have inserted or deleted a node which
leaves the tree in an unbalanced state. An unbalanced state is defined as a state in which
any subtree has a balance factor of greater than 1, or less than -1. That is, any tree with
a difference between the heights of its two subtrees greater than 1, is considered
unbalanced.
Deleting a node from an AVL tree is similar to that in a binary search tree. Deletion
may disturb the balance factor of an AVL tree and therefore the tree needs to be
rebalanced in order to maintain the AVL property. The same rotation rules are followed as when inserting a node into an AVL tree. Note, however, that unlike insertion, fixing one node may not fix the complete AVL tree; after fixing the unbalanced subtree, we may have to fix its ancestors as well.
B-Trees
A binary search tree has one value in each node and two subtrees. This notion
easily generalizes to an M-way search tree, which has (M-1) values per node and M
subtrees. M is called the degree of the tree. A binary search tree, therefore, has degree
2.
In an M-way subtree a node can have anywhere from 1 to (M-1) values, and the
number of (non-empty) subtrees can range from 0 (for a leaf) to 1+(the number of values).
M is thus a fixed upper limit on how much data can be stored in a node.
The values in a node are stored in ascending order, V1 < V2 < ... < Vk (k <= M-1),
and the subtrees are placed between adjacent values, with one additional subtree at each
end. We can thus associate with each value a `left' and `right' subtree, with the right
subtree of Vi being the same as the left subtree of V(i+1). All the values in V1's left subtree
are less than V1; all the values in Vk's right subtree are greater than Vk; and all the values in
the subtree between V(i) and V(i+1) are greater than V(i) and less than V(i+1).
The algorithm for searching for a value in an M-way search tree is the obvious
generalization of the algorithm for searching in a binary search tree. If we are searching
for value X and are currently at a node consisting of values V1...Vk, there are four possible
cases that can arise:
1. X < V1: continue the search in V1's left subtree.
2. X > Vk: continue the search in Vk's right subtree.
3. X = Vi for some i: the search is finished.
4. Vi < X < V(i+1) for some i: continue the search in the subtree between Vi and V(i+1).
For example, suppose we were searching for 68 in the tree in Figure 78. At the
root, case (2) would apply, so we would continue the search in V2's right subtree. At the
root of this subtree, case (4) applies, 68 is between V1=55 and V2=70, so we would
continue to search in the subtree between them. Now case (3) applies, 68=V2, so we are
done. If we had been searching for 69, exactly the same processing would have occurred
down to the last node. At that point, case (2) would apply, but the subtree we want to
search in is empty. Therefore we conclude that 69 is not in the tree.
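As an illustration of the four cases, here is a minimal C++ sketch of the search; the node layout (a sorted array of values plus an array of child pointers) and the name mwaySearch are assumptions made only for this example.

#include <vector>

struct MNode {
    std::vector<int>    values;    // V1 < V2 < ... < Vk, k <= M-1
    std::vector<MNode*> children;  // k+1 subtrees; empty for a leaf
};

// Returns true if x occurs somewhere in the M-way search tree rooted at node.
bool mwaySearch(MNode* node, int x) {
    if (node == nullptr) return false;            // fell off the tree: not found
    std::size_t i = 0;
    while (i < node->values.size() && x > node->values[i])
        ++i;                                      // skip values smaller than x
    if (i < node->values.size() && x == node->values[i])
        return true;                              // case (3): X equals some Vi
    // cases (1), (2) and (4): descend into the subtree between V(i) and V(i+1)
    if (node->children.empty()) return false;     // leaf: nowhere left to look
    return mwaySearch(node->children[i], x);
}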
The other algorithms for binary search trees - insertion and deletion - generalize in
a similar way. As with binary search trees, inserting values in ascending order will result
in a degenerate M-way search tree; i.e. a tree whose height is O(N) instead of O(logN).
This is a problem because all the important operations are O(height), and it is our aim to
make them O(logN). One solution to this problem is to force the tree to be height-balanced;
we sketched this idea earlier for binary search trees, and now we will examine it in detail
for M-way search trees.
B-Trees as a balanced M-way Search Tree
To insert a value X into a B-tree, there are three steps:
1. Using the SEARCH procedure for M-way trees find the leaf node to which X
should be added.
2. Add X to this node in the appropriate place among the values already there.
Being a leaf node there are no subtrees to worry about.
3. If there are M-1 or fewer values in the node after adding X, then we are finished.
If there are M values after adding X, we say the node has overflowed. To repair
this, we split the node into three parts: Left, the smallest (M-1)/2 values; Middle, the
single middle value; and Right, the largest (M-1)/2 values.
Notice that Left and Right have just enough values to be made into individual
nodes. That's what we do... they become the left and right children of Middle, which we
add in the appropriate place in this node's parent.
But what if there is no room in the parent? If it overflows we do the same thing
again: split it into Left-Middle-Right, make Left and Right into new nodes and add Middle
(with Left and Right as its children) to the node above. We continue doing this until no
overflow occurs, or until the root itself overflows. If the root overflows, we split it, as usual,
and create a new root node with Middle as its only value and Left and Right as its children
(as usual).
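The split itself only involves picking the median of the M values. A minimal C++ sketch of that bookkeeping (the function name splitValues is an assumption made for this example) is shown below; for the five values 2 3 5 6 7 it produces Left = [2 3], Middle = 5 and Right = [6 7], matching the first insertion that follows.

#include <vector>

// Given the M values of an overflowed node (in ascending order), compute the
// Left values, the Middle value, and the Right values described above.
void splitValues(const std::vector<int>& overflowed,
                 std::vector<int>& left, int& middle, std::vector<int>& right) {
    std::size_t mid = overflowed.size() / 2;               // index of the median
    left.assign(overflowed.begin(), overflowed.begin() + mid);
    middle = overflowed[mid];
    right.assign(overflowed.begin() + mid + 1, overflowed.end());
}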
For example, let's do a sequence of insertions into this B-tree (M=5, so each node other
than the root must contain between 2 and 4 values):
• Left = [ 2 3 ]
• Middle = 5
• Right = [ 6 7 ]
Left and Right become nodes; Middle is added to the node above with Left and
Right as its children.
The node above (the root in this small example) does not overflow, so we are done.
Insert 21: Add it to the middle leaf. That overflows, so we split it:
• left = [ 17 21 ]
• Middle = 22
• Right = [ 44 45 ]
Left and Right become nodes; Middle is added to the node above with Left and
Right as its children.
The node above (the root in this small example) does not overflow, so we are done.
Insert 67: Add it to the rightmost leaf. That overflows, so we split it:
• Left = [ 55 66 ]
• Middle = 67
• Right = [ 68 70 ]
Left and Right become nodes; Middle is added to the node above with Left and
Right as its children.
Figure 85 B-tree with M=5 after inserting 67
But now the node above does overflow. So it is split in exactly the same manner:
Left and Right become nodes, the children of Middle. If this were not the root,
Middle would be added to the node above and the process repeated. If there is no node
above, as in this example, a new root is created with Middle as its only value.
The tree-insertion algorithms we have previously seen add new nodes at the bottom
of the tree, and then have to worry about whether doing so creates an imbalance. The B-
tree insertion algorithm is just the opposite: the tree grows at the top (a new
level is added only when the root splits). B-trees grow at the root, not at the leaves.
Because of this, there is never any doubt that the tree is always perfectly height balanced:
when a new root is added, all existing nodes become one level deeper in the tree.
Deleting Value from a B-tree
Recall our deletion algorithm for binary search trees: if the value to be deleted is
in a node having two subtrees, we would replace the value with the largest value in its left
subtree and then delete the node in the left subtree that had contained the largest value
(we are guaranteed that this node will be easy to delete).
We will use a similar strategy to delete a value from a B-tree. If the value to be
deleted does not occur in a leaf, we replace it with the largest value in its left subtree and
then proceed to delete that value from the node that originally contained it. For example,
if we wished to delete 67 from the tree in Figure 86, we would find the largest value in
67's left subtree, 66, replace 67 with 66, and then delete the occurrence of 66 in the left
subtree. In a B-tree, the largest value in any value's left subtree is guaranteed to be in a
leaf. Therefore, wherever the value to be deleted initially resides, the following deletion
algorithm always begins at a leaf.
To delete value X from a B-tree, starting at a leaf node, there are 2 steps:
1. Remove X from the current node. Being a leaf node there are no subtrees to
worry about.
2. Removing X might cause the node containing it to have too few values.
Recall that we require the root to have at least 1 value in it and all other nodes to
have at least (M-1)/2 values in them. If the node has too few values, we say it has
underflowed.
If underflow does not occur, then we have finished the deletion process. If it does
occur, it must be fixed. The process for fixing a root is slightly different than the process
for fixing the other nodes.
How do we fix a non-root node that has underflowed? Let us take a specific
example from this B-tree (of degree 5):
In this example, we join node [7] with its more populous neighbor [17 22 44 45]
and put ‘10’ in between them, to create
[7 10 17 22 44 45]
Case 1: Suppose that the neighboring node contains more than (M-1)/2 values. In
this case, the total number of values in the combined node is strictly greater than 1 + ((M-
1)/2 - 1) + ((M-1)/2), i.e. it is strictly greater than (M-1). So it must contain M values or
more.
We split the combined node into three pieces: Left, Middle, and Right, where
Middle is a single value in the very middle of the combined node. Because the combined
node has M values or more, Left and Right are each guaranteed to have at least (M-1)/2
values, and are therefore legitimate nodes. We replace the value we borrowed from the parent
with Middle and we use Left and Right as its two children. In this case the parent's size
does not change, so we are completely finished.
This is what happens in our example of deleting 6 from the tree in Figure 87. The
combined node [7 10 17 22 44 45] contains more than 5 values, so we split it into:
• Left = [ 7 10 ]
• Middle = 17
• Right = [ 22 44 45 ]
Then put Middle into the parent node (in the position where the ‘10’ had been) with
Left and Right as its children.
Case 2: Suppose, on the other hand, that the neighboring node contains exactly
(M-1)/2 values. Then the total number of values in the combined node is 1 + ((M-1)/2 - 1)
+ ((M-1)/2) = (M-1)
In this case the combined node contains the right number of values to be treated
as a node. So we make it into a node and remove from the parent node the value that has
been incorporated into the new, combined node. As a concrete example of this case,
suppose that, in the tree Figure 87, we had deleted 3 instead of 6. The node [2 3]
underflows when 3 is removed. It would be combined with its more populous neighbor [6
7] and the intervening value from the parent (5) to create the combined node [2 5 6 7].
This contains 4 values, so it can be used without further processing. The result would be:
It is very important to note that the parent node now has one fewer value. This
might cause it to underflow - imagine that 5 had been the only value in the parent node. If
the parent node underflows, it would be treated in exactly the same way - combined with
its more populous neighbor etc. The underflow processing repeats at successive levels
until no underflow occurs or until the root underflows (the processing for root-underflow is
described next).
Now let us consider the root. For the root to underflow, it must have originally
contained just one value, which now has been removed. If the root was also a leaf, then
there is no problem: in this case the tree has become completely empty.
If the root is not a leaf, it must originally have had two subtrees (because it originally
contained one value). How could it possibly underflow?
The deletion process always starts at a leaf and therefore the only way the root
could have its value removed is through the Case 2 processing just described: the root's
two children have been combined, along with the root's only value to form a single node.
But if the root's two children are now a single node, then that node can be used as the
new root, and the current root (which has underflowed) can simply be deleted.
The node [3 7] would underflow, and the combined node [3 10 18 20] would be
created. This has 4 values, which is acceptable when M=5. So it would be kept as a node,
and ‘10’ would be removed from the parent node - the root. This is the only circumstance
in which underflow can occur in a root that is not a leaf. The situation is this:
Clearly, the current root node, now empty, can be deleted and its child used as the
new root.
Watch:
• AVL Trees & Rotations (Self-Balancing Binary Search Trees) By Back To Back SWE
(https://youtu.be/vRwi_UcZGjU)
• Introduction to B-Trees By David Taylog (https://youtu.be/I22wEC1tTGo)
• File Indexes with B-Trees By Jacob Schrum (https://youtu.be/2AD13EPQAms)
Read:
Review:
Activities/Assessments:
1. Show the result of inserting 3, 1, 4, 6, 9, 2, 5, 7 into an initially empty binary search tree.
2. Given the tree in (1), show the result of deleting the root.
3. Draw the AVL Tree of the following. NOTE: Show the tree before and after rotations.
COMPUTING
5. Draw the AVL Tree of the following. NOTE: Show the tree before and after rotations.
GOLDSTAR
7. What will be the resulting B-trees (2-3 trees) for the following operations? (Each starting tree is shown with its root on the first line and its children, from left to right, on the second line.)
a. Insert 112 into:
        47
   15        86 96
b. Insert 1 into:
        59
   23 27     86
c. Delete 73 from:
        62
   20        73
d. Delete 31 from:
        67
   31        79 95
Unit 9 - Graphs
Overview:
In computer science, a graph is an abstract data type that is meant to implement the undirected
graph and directed graph concepts from the field of graph theory within mathematics. In this unit
we will focus on the basic concepts of graphs, its representations and searching in graphs.
Module Objectives:
After successful completion of this module, you should be able to:
Course Materials:
A graph is a flow structure that represents the relationship between various objects. It can
be visualized by using the following two basic components:
• Nodes: These are the most important components in any graph. Nodes are entities
whose relationships are expressed using edges. If a graph comprises two nodes A and
B and an undirected edge between them, then the edge expresses a bi-directional relationship
between the two nodes.
• Edges: Edges are the components that are used to represent the relationships
between various nodes in a graph. An edge between two nodes expresses a one-way
or two-way relationship between the nodes.
Types of graphs
• Undirected: An undirected graph is a graph in which all the edges are bi-directional
i.e. the edges do not point in any specific direction. The graph in Figure 92 is an
example of an undirected graph.
• Directed: A directed graph is a graph in which all the edges are uni-directional i.e. the
edges point in a single direction.
• Cyclic: A graph is cyclic if the graph comprises a path that starts from a vertex and
ends at the same vertex. That path is called a cycle. An acyclic graph is a graph that
has no cycle.
A path from one vertex to another may be written in either of two ways:
o (v0, e1, v1, e2, v2, ..., vn-1, en, vn) -- alternating vertices and edges
o (v0, v1, v2, ..., vn-1, vn) -- vertices only
A graph is connected if, for any vertices v and w, there is a path from v to w.
A tree is an undirected graph in which any two vertices are connected by exactly one path.
A tree is an acyclic graph and has N - 1 edges, where N is the number of vertices. Each node in
a graph may have one or more parent nodes. However, in a tree, each node (except the root
node) has exactly one parent node (a root node has no parent). Also, a tree cannot contain
any cycles or self-loops, whereas a graph can.
Graph Representations
You can represent a graph in many ways. The two most common ways of representing a
graph are as follows:
• Adjacency Matrix
An adjacency matrix is a VxV binary matrix A, where V is the number of vertices. Element
A(i,j) is 1 if there is an edge from vertex i to vertex j else A(i,j) is 0. A binary matrix is a
matrix in which the cells can have only one of two possible values - either a 0 or 1.
An adjacency matrix provides constant time access (O(1)) to determine whether there is an edge
between two nodes. The space complexity of the adjacency matrix is O(V²).
• Adjacency List
The other way to represent a graph is by using an adjacency list. An adjacency list is an
array A of separate lists. Each element of the array A(i) is a list, which contains all the
vertices that are adjacent to vertex i.
For a weighted graph, the weight or cost of the edge is stored along with the vertex in the
list using pairs. In an undirected graph, if vertex j is in list A(i) then vertex i will be in list
A(j).
The space complexity of an adjacency list is O(V + E) because in an adjacency list information
is stored only for those edges that actually exist in the graph. In many cases, where the
matrix is sparse, using an adjacency matrix may not be very useful, because most of its
elements will be 0 anyway. In such cases, using an adjacency list is better; a short code
sketch of both representations is given after Figure 96.
Note: A sparse matrix is a matrix in which most of the elements are zero, whereas a dense
matrix is a matrix in which most of the elements are non-zero.
Figure 96 Adjacency List for unweighted (above) and weighted (below) trees
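Here is the short sketch referred to above: a minimal C++ program that builds both an adjacency matrix and an adjacency list for a small unweighted, undirected graph. The particular edge set and the variable names are made-up assumptions used only for this example.

#include <iostream>
#include <vector>

int main() {
    const int V = 4;
    // Undirected edges of a small example graph.
    std::vector<std::pair<int,int>> edges = {{0,1}, {0,3}, {1,2}, {1,3}};

    // Adjacency matrix: V x V, entry [i][j] is 1 when an edge i--j exists.
    std::vector<std::vector<int>> matrix(V, std::vector<int>(V, 0));
    // Adjacency list: one list of neighbours per vertex.
    std::vector<std::vector<int>> list(V);

    for (auto [u, v] : edges) {
        matrix[u][v] = matrix[v][u] = 1;   // undirected: mark both directions
        list[u].push_back(v);
        list[v].push_back(u);
    }

    std::cout << "Neighbours of 1:";
    for (int w : list[1]) std::cout << ' ' << w;   // prints 0 2 3
    std::cout << '\n';
}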
Graph Traversal
Graph traversal means visiting every vertex and edge exactly once in a well-defined order.
While using certain graph algorithms, you must ensure that each vertex of the graph is visited
exactly once. The order in which the vertices are visited is important and may depend upon the
algorithm or question that you are solving.
During a traversal, it is important that you track which vertices have been visited. The most
common way of tracking vertices is to mark them.
There are many ways to traverse graphs. Breadth First Search (BFS) is the most
commonly used approach. BFS is a traversing algorithm where you should start traversing
from a selected node (source or starting node) and traverse the graph layer by layer thus
exploring the neighbor nodes (nodes which are directly connected to source node). You
must then move towards the next-level neighbor nodes.
As the name BFS suggests, you are required to traverse the graph breadthwise as
follows:
• First move horizontally and visit all the nodes of the current layer
• Move to the next layer
Figure 97 Moving process in BFS
The distance from the starting node to the nodes in layer 1 is less than the
distance to the nodes in layer 2. Therefore, in BFS, you must traverse all the nodes
in layer 1 before you move to the nodes in layer 2.
A graph can contain cycles, which may bring you to the same node again while
traversing the graph. To avoid processing the same node again, use a boolean array that
marks each node after it is processed. While visiting the nodes of a layer of the graph, store
them in a manner such that you can traverse the corresponding child nodes in the same
order.
In the earlier diagram (Figure 97), start traversing from 0 and visit its child nodes
1, 2, and 3. Store them in the order in which they are visited. This will allow you to visit the
child nodes of 1 first (i.e. 4 and 5), then of 2 (i.e. 6 and 7), and then of 3 (i.e. 7) etc.
To make this process easy, use a queue to store the nodes, marking each node as 'visited'
as it is enqueued and keeping it there until all its neighbors (vertices that are directly
connected to it) have been examined. The queue follows the First In First Out (FIFO)
method, and therefore the neighbors of a node will be visited in the order in which they
were inserted into the queue, i.e. the node that was inserted first will be visited first, and so on.
BFS(G, s):
    mark s as visited and enqueue s into Q
    while ( Q is not empty )
        v = Q.dequeue( )    //removing that vertex from the queue; its neighbors will be visited now
        for all neighbors w of v in Graph G:
            if w is not visited:
                mark w as visited and enqueue w into Q
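A runnable version of the pseudocode above might look like the following C++ sketch. The adjacency-list representation is the one sketched earlier; the function name bfs and the printing of the visit order are assumptions made for this illustration, and the graph built in main mirrors Figure 97 (node 0 connected to 1, 2 and 3; node 1 to 4 and 5; node 2 to 6 and 7; node 3 to 7).

#include <iostream>
#include <queue>
#include <vector>

// Breadth First Search over an adjacency list, starting from vertex s.
void bfs(const std::vector<std::vector<int>>& adj, int s) {
    std::vector<bool> visited(adj.size(), false);
    std::queue<int> q;
    visited[s] = true;                 // mark s as visited and enqueue it
    q.push(s);
    while (!q.empty()) {
        int v = q.front();             // remove a vertex; visit its neighbours next
        q.pop();
        std::cout << v << ' ';
        for (int w : adj[v]) {
            if (!visited[w]) {         // enqueue each unvisited neighbour exactly once
                visited[w] = true;
                q.push(w);
            }
        }
    }
}

int main() {
    std::vector<std::vector<int>> adj =
        {{1, 2, 3}, {0, 4, 5}, {0, 6, 7}, {0, 7}, {1}, {1}, {2}, {2, 3}};
    bfs(adj, 0);                       // prints 0 1 2 3 4 5 6 7
}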
Depth First Search
The DFS algorithm is a recursive algorithm that uses the idea of backtracking. It
involves exhaustive searches of all the nodes by going ahead, if possible, else by
backtracking.
Here, the word backtrack means that when you are moving forward and there are
no more nodes along the current path, you move backwards on the same path to find
nodes yet to be traversed. All the nodes on the current path are visited until no unvisited
nodes remain along it, after which the next path is selected.
This recursive nature of DFS can be implemented using stacks. The basic idea is
as follows:
• Pick a starting node and push all its adjacent nodes into a stack.
• Pop a node from stack to select the next node to visit and push all its adjacent
nodes into a stack.
Repeat this process until the stack is empty. However, ensure that the nodes that
are visited are marked. This will prevent you from visiting the same node more than once.
If you do not mark the nodes that are visited and you visit the same node more than once,
you may end up in an infinite loop.
DFS-recursive(G, s):
mark s as visited
for all neighbors w of s in Graph G:
if w is not visited:
DFS-recursive(G, w)
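To complement the recursive pseudocode, here is a minimal C++ sketch of the stack-based idea described above; the adjacency-list representation and the name dfsIterative are assumptions made for this example.

#include <iostream>
#include <stack>
#include <vector>

// Depth First Search using an explicit stack instead of recursion.
void dfsIterative(const std::vector<std::vector<int>>& adj, int s) {
    std::vector<bool> visited(adj.size(), false);
    std::stack<int> st;
    st.push(s);
    while (!st.empty()) {
        int v = st.top();
        st.pop();
        if (visited[v]) continue;          // skip vertices already processed
        visited[v] = true;
        std::cout << v << ' ';
        for (int w : adj[v])
            if (!visited[w]) st.push(w);   // push unvisited neighbours
    }
}

Each vertex is pushed when it is discovered and marked only when it is popped, so every vertex is printed exactly once even though it may sit on the stack more than once.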
Watch:
• Graph Data Structure (Representations) By freeCodeCamp.org
(https://youtu.be/DBRW8nwZV-g)
• Graph Search (Depth First Search and Breadth First Search) By HackerRank
(https://youtu.be/zaBhtODEL0w)
Read:
Review:
1. What is a graph?
2. How are graphs similar to trees? How are they different?
3. If two graph nodes are connected by an edge, what are they called?
4. How can you tell if two graph nodes are adjacent?
Activities/Assessments:
1. Explain how an adjacency matrix is used to represent a graph.
2. Use the following description of an undirected graph and draw the graph:
V(Graph1) = { A, B, C, D}
E(Graph1) = { (A,B), (A,D), (B,C), (B,D) }
3. Describe the graph pictured below using formal graph notation.
V(CoAuthorship) =
E(CoAuthorship) =
4. Draw an adjacency matrix representation of the undirected graph shown in (3).
7. Let G be an undirected graph whose vertices are the integers 1 through 8, and let the
adjacent vertices of each vertex be given by the table below:
Assume that, in a traversal of G, the adjacent vertices of a given vertex are returned in the
same order as they are listed in the table above.
a. Draw G.
b. Give the sequence of vertices of G visited using a DFS traversal starting at vertex
1.
c. Give the sequence of vertices visited using a BFS traversal starting at vertex 1.
Unit 10 – Hashing and Collision
Overview:
While data structures deal with how data is represented in primary memory, file structures, on the
other hand, deal more with how data is represented on secondary storage, with emphasis on file
access. In this unit we will discuss the different types of file structures, the different techniques
used in file access, and how to resolve some of the issues related to file access.
Module Objectives:
After successful completion of this module, you should be able to:
Course Materials:
What is a File?
A file is a collection of records related to each other. The file size is limited by the size of
memory and the storage medium. There are two important features of a file:
1. File Activity - specifies the percentage of the records that are actually processed in a single run.
2. File Volatility - addresses the frequency with which records are added, deleted, or changed;
highly volatile files are handled more efficiently on disk than on tape.
File Organization
A file structure is the organization of data on a secondary storage device in such a way as to
minimize the access time and the storage space. A file structure is a combination of
representations for data in files and of operations for accessing the data. It allows applications
to read, write and modify data. It might also support finding the data that matches some search
criteria or reading through the data in some particular order.
File organization ensures that records are available for processing. It is used to determine
an efficient file organization for each base relation.
Storing and sorting records in contiguous blocks within files on tape or disk is called
sequential access file organization. In sequential access file organization, all records are
stored in sequential order. The records are arranged in ascending or descending
order of a key field.
A sequential file search starts from the beginning of the file, and records can be
added at the end of the file. In a sequential file, it is not possible to add a record in the middle
of the file without rewriting the file.
Direct access file organization is also known as random access or relative file organization. In
a direct access file, all records are stored on a direct access storage device (DASD), such as a
hard disk. The records are randomly placed throughout the file.
The records do not need to be in sequence because they are updated directly
and rewritten back in the same location. This file organization is useful for immediate
access to large amounts of information and is used in accessing large databases. It is also
called hashing.
Advantages of direct access file organization:
1. A direct access file helps in online transaction processing systems (OLTP), such as an online
railway reservation system.
2. In a direct access file, sorting of the records is not required.
3. It accesses the desired records immediately.
4. It updates several files quickly.
5. It has better control over record allocation.
Indexed sequential access file organization combines both sequential file and direct access file
organization. In an indexed sequential access file, records are stored randomly on a direct
access device, such as a magnetic disk, by a primary key.
Such a file may have multiple keys. The key, which can be alphanumeric, by which the records
are ordered is called the primary key. The data can be accessed either sequentially or randomly
using the index. The index is stored in a file and read into memory when the file is opened.
Advantages of indexed sequential access file organization:
1. In an indexed sequential access file, both sequential and random access are
possible.
2. It accesses the records very fast if the index table is properly organized.
3. Records can be inserted in the middle of the file.
4. It provides quick access for sequential and direct processing.
5. It reduces the degree of the sequential search.
Disadvantages of indexed sequential access file organization:
1. An indexed sequential access file requires unique keys and periodic reorganization.
2. It takes a longer time to search the index for the data
access or retrieval.
3. It requires more storage space.
4. It is expensive because it requires special software.
5. It is less efficient in the use of storage space as compared to other file
organizations.
One important development with regard to file access is relative file organization. Relative
file organization allows a user to locate a record in a file with as few accesses as possible (ideally
one). A relative file organization implies a predictable relationship between the key used to
identify an individual record and that individual record’s location in an external file. It is to be
noted, however, that the logical ordering of the records need not have any bearing on their
physical sequence. That is, the records do not necessarily appear physically in order of their
key values.
The question now is how would the Nth record be found? When a relative file is
established, the relationship (R) that will be used to translate between key values and physical
addresses is designated. This relationship is a mapping function from key values to addresses
in the file.
When a record is to be written into a relative file, the mapping function R is used to
translate the record’s key value to an address, which indicates where the record is to be stored.
When it is necessary to retrieve the record with a particular key value, the mapping function R is
applied to that key value, translating it to the address where the record can be found.
Addressing Techniques
Let us now delve into the problems of implementing relative files. There are three
fundamental techniques used by mapping functions R, where R(key value) → address:
1. Direct Mapping
a. absolute addressing
b. relative addressing
2. Directory Look up Techniques
3. Address Calculation Techniques
Direct mapping is the simplest technique for translating a record key to a storage
address, though it is not widely used. It is presented here because an
understanding of the disadvantages of the direct mapping techniques helps us to
appreciate the benefits of the other techniques.
Absolute Addressing
Note, however, that relocating the direct file to another part of the disk requires
changing the absolute addresses.
Figure 100 Key position relationship with direct organization on sector-addressable devices
There are two advantages to this absolute addressing scheme: the mapping function R is
very simple, and no processing time is required to translate a key value to a location. There
are, however, several disadvantages:
1. Logical and physical considerations are not independent with absolute addressing.
This means that the user must know how the records are stored physically.
2. Users generally do not supply the appropriate types of key values.
3. Absolute addresses are device dependent. As the device upon which the file
resides is upgraded or changed, it is likely that key values would also need to be
changed.
4. Absolute addresses are address-space-dependent. As the relative file is
reorganized for purposes of enlarging the address space, it is likely that key values
would need to be changed.
Relative Addressing
Once a key-position relationship is established, the position of the record in the file
is specified as a record number relative to the beginning of the file, where the first record
is record number 0.
A relative file with space for N records contains positions with relative record
numbers 0, 1, 2, …, N – 1 where the i-th record has the relative record number of i – 1.
One of the simplest mapping functions is one in which the key value for the record
is the same as the record's relative address or position in the external file. The primary
advantage to this approach is that there is no processing time needed to determine the
record's relative address in the file when the record is to be accessed.
Example.
For a file of N records, the key values are in the range of 1 to N. So for a relative
file containing 200 key values in the range 1 to 200, a part number 156 is stored in the
position in the file with a relative record number 156.
Those values of consecutive keys that differ by only one are called dense keys.
For a range of key values that do not start with 1 or 0, the address could be the part
number less a constant.
For data with a large range of key values that are not dense, such as social security
numbers as key values, the programmer may have to allocate a large relative file in order
to use the key as the relative record number. But with this kind of process, a high
percentage of the file space may be wasted, or empty.
Relative addressing presents two advantages: (1) the mapping function R is very
simple; and (2) given a record’s key value, no processing time is required to determine the
record’s location on secondary storage.
After direct mapping, the next simplest approach to implementing R(key value) →
address is directory lookup. This approach takes advantage of the benefits of direct
mapping while eliminating its disadvantages, though it introduces a couple of new costs.
The basic idea behind the concept is to keep a table or directory of key
value:address pairs (or key value:relative address pairs). In order to find a record in a
relative file, one locates its key value in the directory, and then the indicated address is
used to find the record on storage (a small code sketch of such a directory follows the list
of advantages below).
Thus the advantages of the directory lookup scheme to implement R(key value) →
address include:
1. A record’s location can be determined with essentially no processing, once the key
value is found in the directory.
2. Keys can be represented in intuitively understandable forms, since they are
translated “internally” to addresses.
3. Logical-physical independence is achieved, because the key values are address-
space-independent. If the relative file is reorganized, address entries in the
directory must be changed, but the key values will not be affected.
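A directory can be as simple as an in-memory map from key values to addresses. The following minimal C++ sketch illustrates the idea; the part-number keys, the alias RelativeAddress and the variable directory are all hypothetical names invented for this example.

#include <iostream>
#include <map>
#include <string>

using RelativeAddress = int;   // position of a record within the relative file

int main() {
    // Directory of key value : relative address pairs.
    std::map<std::string, RelativeAddress> directory = {
        {"PART-0156", 3},
        {"PART-0902", 7},
        {"PART-1044", 1},
    };

    // Locating a record: look the key up in the directory, then use the
    // address to read the record from secondary storage.
    auto it = directory.find("PART-0902");
    if (it != directory.end())
        std::cout << "record is at relative address " << it->second << '\n';
}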
Figure 104 A small phonebook as a hash table
Many times, a key value contains characters that make it difficult to manipulate the
key to an address. A nonnumeric key value can be converted to a numeric value by using
the numerically coded internal representation of each character in the key.
Example.
The numeric ASCII or EBCDIC representation of each letter can be used in place
of each character to provide a numeric key.
The primary problem with most hashing functions is that the function does not
always produce unique relative addresses. This situation, where R(K1) = R(K2) but
K1 ≠ K2, is called a collision: two different keys hash to the
same relative address.
Most hashing functions can be improved by allocating more file space than the
minimum number of positions needed to store the keys.
The relationship between file space and the number of keys needed is called the
load factor. The load factor, also called packing factor or packing density, is the ratio of
the number of key values to be stored versus the number of file positions.
All hash functions begin to work quite poorly when the target file becomes almost
full. As a general guideline, a load factor of 70 or 80 % is about as great as can be
tolerated with reasonable performance.
Example.
Twenty percent of a file with a load factor of 80 percent is empty. This kind of
arrangement reduces the number of collisions compared with a file with a load factor of 100 percent.
The number of file locations needed for a load factor of 80 percent is (number of key values) / 0.80,
or 1.25 x (number of key values); for instance, 800 keys would require 1000 file locations.
The smaller the load factor, the less dense the file is, and the smaller the chance of
collisions. But the disadvantage of having a low load factor (less than 50 percent) is that
a great amount of file space is wasted by being left empty.
Hashing Schemes
• Division Remainder
The basic idea of this approach is to divide a key by an appropriate number, then
to use the remainder as the relative address for the record. If the divisor (N) is the number
of relative positions in the file, this method can be used to map the keys into N record
locations.
If a large range of key values is being mapped into a much smaller range, collisions
may occur between keys that yield the same remainder. Researchers generally find that
choosing a prime number that is close to the number of record locations in the file results
in relative addresses that are less prone to collision. Also, choosing a prime number as
divisor is better than an even divisor.
Figure 106 Example using divisor 4001
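A minimal C++ sketch of the division-remainder scheme is shown below; the function name divisionHash is an assumption, and the small prime 97 used in main is borrowed from the exercise table at the end of this unit rather than from Figure 106.

#include <iostream>

// Division-remainder hashing: the relative address is the key modulo the
// number of record positions (ideally a prime close to the file size).
long divisionHash(long key, long divisor) {
    return key % divisor;              // result lies in 0 .. divisor-1
}

int main() {
    std::cout << divisionHash(12345, 97) << '\n';   // prints 26
}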
• Digit Extraction
The distribution of this hashing function is dependent on the distribution of the key
values. In digit extraction, the key values are analyzed to determine which digit positions
of the key are more evenly distributed. The more evenly distributed digit positions are
assigned from right to left.
The digit values in the chosen digit positions are then extracted, and the resulting
integer is used as the relative address. If none of the digit positions has a uniform
distribution, more collisions occur. Note however, that digit extraction requires that the
key values be known in order to determine the digit positions for extraction.
Figure 107 Example using 2nd, 4th, 7th and 8th positions
• Folding
Using the folding scheme, the key value is split into two or more parts and then
summed, or subjected to AND or XOR, as if each part were an integer. If the resulting
address contains more digits than the highest address in the file, the excess high order
digits are truncated.
Folding is useful for converting keys with a large number of digits to a smaller
number of digits so the address fits into a word of memory. Folding is easier to compute
than some of the other hashing schemes, but it can produce erratic results.
• Radix Conversion
The key value is interpreted as having a different base or radix and is converted to
a decimal number. Excess high order digits are truncated to yield the required address
size of the relative file.
Example.
Consider a five-digit key and a relative file with an address range from 0 to 9999.
• Mid-Square
The mid-square method involves treating the key as a single large number,
squaring the number, and extracting whatever number of digits is needed from the middle
of the result.
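The following minimal C++ sketch illustrates folding and the mid-square method for a five-digit key. The split into a three-digit part and a two-digit part follows the convention used in the exercise table at the end of this unit, and the function names foldHash and midSquareHash are assumptions made for this example.

#include <cstdint>
#include <iostream>
#include <string>

// Folding for a five-digit key: add the first three digits to the last two
// and keep only the two low-order digits of the sum (excess digits truncated).
int foldHash(int key) {
    int firstPart = key / 100;               // e.g. 12345 -> 123
    int secondPart = key % 100;              // e.g. 12345 -> 45
    return (firstPart + secondPart) % 100;   // 123 + 45 = 168 -> 68
}

// Mid-square: square the key and extract digits from the middle of the result.
int midSquareHash(int key, int digits) {
    std::string sq = std::to_string(static_cast<std::int64_t>(key) * key);
    std::size_t start = (sq.size() - digits) / 2;   // centre the extracted span
    return std::stoi(sq.substr(start, digits));
}

int main() {
    std::cout << foldHash(12345) << '\n';           // prints 68
    std::cout << midSquareHash(12345, 2) << '\n';   // 12345^2 = 152399025 -> prints 39
}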
Other Hashing Schemes
A hashing function that provides a one-to-one mapping from a key into a position is
termed a perfect hashing function, since no collisions occur. A perfect hashing function
requires knowledge of the set of key values in order to generate a perfectly uniform
distribution of addresses. Most perfect hashing functions require extensive manipulation
of the key set and are applied to static key sets only.
Examples of perfect hashing schemes include the following:
1. Quotient Reduction
2. Remainder Reduction
3. Associated Value Hashing
4. Reciprocal Hashing
Because hashing functions map a relatively large key-value space to a relatively small
address space, collisions are very likely to occur. Several techniques are
available for handling collisions. These techniques can be classified into two groups, namely open
addressing and chaining.
1. Open addressing
a) Linear probing
b) Two pass file creation with hashing
c) Separate overflow area
d) Double hashing
2. Chaining
a) Synonym chaining
b) Bucket addressing
For all of the techniques examined, n positions are allocated in the file (numbered 0, 1, ...,
n-1), where each position contains a flag, which is initialized to empty. The flag indicates that
nothing has yet been stored in that position. Once information has been stored in a particular
position, the flag is set to occupied.
Linear Probing
In linear probing, the file is searched as a circular file so that as many empty
positions as possible can be used. In a search of a noncircular file, a key that hashes to an
occupied position i has only positions i + 1, …, n – 1 searched for an empty position; the back
portion of the file tends to fill up, leaving more empty positions at the front portion of the
file.
Storing a synonym in the nearest available position is simple, but accessing the synonym
later may require a sequential search of the entire file.
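A minimal C++ sketch of linear probing over a circular file of n positions is given below; an in-memory vector stands in for the relative file, and the names EMPTY and probeInsert are assumptions made for this example.

#include <vector>

const int EMPTY = -1;                   // flag meaning "nothing stored here yet"

// Insert key into a circular table using linear probing.
// Returns the position used, or -1 if the table is full.
int probeInsert(std::vector<int>& table, int key) {
    int n = static_cast<int>(table.size());
    int home = key % n;                 // hashed (home) address
    for (int i = 0; i < n; ++i) {
        int pos = (home + i) % n;       // wrap around: the file is circular
        if (table[pos] == EMPTY) {
            table[pos] = key;
            return pos;
        }
    }
    return -1;                          // no empty position left
}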
The displacement problem of linear probing usually occurs when the relative file is
initially created with a one-pass algorithm. A one-pass algorithm hashes the key of each
record to be loaded to the relative file. It stores a record in the hashed address, if empty,
and stores synonyms in the first available position nearest the hashed address. A two-
pass algorithm reduces the displacement problem by loading hashed positions in the first
pass and loading synonyms in the available positions in the second pass.
The first pass of the two-pass algorithm hashes the key of each record to be loaded
to the relative file and stores the record in the hashed address, if empty. The synonyms
are stored in an output file that will be used during the second pass. The second pass
inputs the synonyms separated by the first pass, hashes the key of each record, and stores
the synonym in the first available position nearest the hashed address.
The two-pass algorithm allows more records to be stored in the positions to which
they were hashed, thereby reducing the amount of linear probing that must be done later
to access any record without synonyms. The two-pass algorithm works well if the set of
key values is known before the file is created. Any additions to the file after the initial
loading may produce displacement problems similar to those produced by the one-pass
algorithm.
Another way of loading the relative file initially is to store synonyms sequentially in
a separate overflow area. Using a separate overflow area eliminates the displacement
problem altogether and allows use of the one-pass loading algorithm. Accessing
synonyms entails sequentially searching all the records in the overflow area, which is
probably not as large as the hashed positions of the relative file.
Depending on the size of the overflow area, the searching time for locating
synonyms could be lengthy. Accessing a record that is not at its hashed address involves
sequentially searching the overflow area for the record. The average number of probes
for a file with synonyms stored in a separate overflow area depends on the number of
synonyms.
Double Hashing
When a separate overflow area is established for synonyms, the synonyms are
stored in the next available overflow position. Rather than searching sequentially for the
next available overflow position or for the synonym, the synonym is often hashed to an
overflow record position. This method of hashing synonyms to an overflow area using a
second hash function is called double hashing.
Under this method, the record is initially hashed to a relative file position. If the
hashed address is not available, a second hash function is applied and added to the first
hashed value, and the synonym is hashed to the overflow area. If the hashed overflow
area is not available, a secondary collision occurs; then linear probing is used. Access to
synonyms is more direct than searching the overflow area sequentially. The synonyms
involved in secondary collisions are the only records that have to be accessed
sequentially.
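The overflow-area variant described above differs in its details, but the probe-sequence form of double hashing used in the exercise at the end of this unit, h(k, i) = (h1(k) + i*h2(k)) mod m, can be sketched in C++ as follows; the names doubleHashInsert and EMPTY are assumptions made for this example.

#include <vector>

const int EMPTY = -1;

// Double hashing: the step between probes is itself derived from the key,
// using the h1 and h2 functions from the exercise at the end of this unit.
int doubleHashInsert(std::vector<int>& table, int key) {
    int m  = static_cast<int>(table.size());
    int h1 = key % m;                       // primary hash: home address
    int h2 = key % (m - 1) + 1;             // step function, never zero
    for (int i = 0; i < m; ++i) {
        int pos = (h1 + i * h2) % m;        // i-th probe position
        if (table[pos] == EMPTY) {
            table[pos] = key;
            return pos;
        }
    }
    return -1;                              // table full
}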
Synonym Chaining
Synonym chaining is a technique used with and without a separate overflow area
to reduce the number of records examined when searching for a synonym. Initial creation
of a relative file that uses synonym chaining involves hashing the key of the record to be
loaded and retrieving the record in the hashed address. If the hashed address is occupied,
then the synonym is stored using one of the techniques discussed previously.
The record in the initial hashed address contains a link to the position containing
the synonym. If several synonyms hash to one address, all the synonyms are linked
together even though the synonyms are not physically stored contiguously. Access to any
synonym involves searching only the linked synonyms rather than searching the whole file
(if linear probing is used) or the whole overflow area (if a separate overflow area is used).
Each record is enlarged to include a link field, but access to a synonym record is more
direct than in the techniques discussed before.
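As a minimal illustration of a chain held together by link fields, the following C++ sketch searches a table in which each position stores a key, an occupied flag and the index of the next synonym; the struct Record and the function chainSearch are assumptions made for this example.

#include <vector>

// Each position holds a key, an occupied flag, and a link to the next synonym.
struct Record {
    int  key      = 0;
    bool occupied = false;
    int  next     = -1;        // index of the next synonym, -1 if none
};

// Search for key: hash to the home address, then follow the synonym chain.
bool chainSearch(const std::vector<Record>& table, int key) {
    int pos = key % static_cast<int>(table.size());
    while (pos != -1 && table[pos].occupied) {
        if (table[pos].key == key) return true;
        pos = table[pos].next;             // only linked synonyms are examined
    }
    return false;
}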
Figure 112 Synonym chaining
Bucket Addressing
All synonyms for a particular hash address are stored sequentially in the bucket
for the particular hash address. Accessing a record in a relative file that uses bucket
addressing involves hashing the key, then sequentially searching the bucket of records at
the hashed address. The number of records that must be examined to find a record is
limited by the bucket size; the algorithm does not have to search the whole file or the
whole overflow area.
The major problem with bucket addressing is the amount of space wasted when
the number of synonyms for any hashed address varies greatly. Since the bucket size is
determined by the maximum number of synonyms for any hashed address (each hashed
address has the same size bucket), those hashed addresses having a lower number of
synonyms also have buckets with unused space – space that is wasted. A secondary
problem with bucket addressing is how to determine the appropriate bucket size. Suppose
data in a relative file were unavailable for analysis prior to the creation of the file, and the
bucket size is smaller than the maximum number of synonyms as a result. All the
synonyms that hash to a bucket address cannot be stored in the bucket, so hashing
collisions must be resolved by using one of the techniques discussed. One solution is to
set a very large bucket size, but this solution is at the expense of wasted space.
Watch:
Read:
• Chapter 5 – Hashing
Weiss, Mark Allen (2014). Data Structures And Algorithm Analysis In C++ 4th Edition
Review:
Activities/Assessments:
1. Complete the table below:

   Key Value | Prime Number Division Remainder Method (PN = 97) | Digit Extraction (3rd and 2nd digits) | Folding (123+45) | Radix Conversion (Base 9 to Base 10) | Mid Square (2nd and 3rd digits squared)
   12345     | 26 | 32 | 68 | 83 | 29
   14344     | 85 | 34 | 87 | 40 | 49
   24680     | 42 | 64 | 26 | 56 | 16
   42577     | 91 | 52 | 2  | 57 | 25
   54321     | 1  | 34 | 64 | 3  | 49
3. Given a hash table with m=11 entries and the following hash function h1 and step function
h2:
h1(key) = key mod m
h2(key) = {key mod (m-1)} + 1
Insert the keys {22, 1, 13, 11, 24, 33, 18, 42, 31} in the given order (from left to right) to
the hash table using each of the following hash methods:
a. Chaining with h1 ⟹ h(k) = h1(k)
b. Linear-Probing with h1 ⟹ h(k,i) = (h1(k)+i) mod m
c. Double-Hashing with h1 as the hash function and h2 as the step function
⟹ h(k,i) = (h1(k) + ih2(k)) mod m
Hash Table