Data Structures & Algorithms
Data Structures intro
- different ways of storing data on a computer
- running example: a small map. Each connection between points is either one-way or two-way (there-n-back); each point has a latitude and longitude (y and x axes)
so let's say we have 5 points:
latitude (y) = 49.3: "Store A", "Store B", "School"
latitude (y) = 49.2: "Home", "intersection"
and the ones that are connected are:
Home <--> Store A, Store A --> Store B
Home --> Store B, Store B <--> School
Home --> intersection, intersection <--> School
So you have:
Home (49.2 , -123.4)
Store A (49.3 , -123.4)
Store B (49.3, -123.3)
School (49.3 , -123.2)
Intersection (49.2 , -123.2)
Storing method 1 "array list"
So now you can make a list of possible paths:
(home, store A)
(store A, Home)
(home, store B)
(home, intersection)
(store A, Store B)
(store B, school)
(school, store B)
(school, intersection)
(intersection, school)
- notice: in that list, some locations have no path back home (e.g., once you reach School, there is no way back to Home)
the method above is one way of storing the locations.
method 2 "hash map"
now let's say we want another method
- the method will be to list where we can go from the current place:
listed location -> locations it can go to
Home (Store A, Store B, Intersection)
Store A (Home, Store B)
Store B (School)
School (Store B, Intersection)
Intersection (School)
If you notice (covered later in the course), the first method is an array/list data structure, and the second method corresponds to the hashtable/hashmap data structure.
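the two storing methods above, sketched in Python (same place names as the example; Python's dict plays the role of the hash map):

```python
# Method 1 "array list": a list of possible paths (edges)
paths = [
    ("Home", "Store A"), ("Store A", "Home"),
    ("Home", "Store B"),
    ("Home", "Intersection"),
    ("Store A", "Store B"),
    ("Store B", "School"), ("School", "Store B"),
    ("School", "Intersection"), ("Intersection", "School"),
]

# Method 2 "hash map": each place maps to the places you can go to from it
can_go_to = {
    "Home": ["Store A", "Store B", "Intersection"],
    "Store A": ["Home", "Store B"],
    "Store B": ["School"],
    "School": ["Store B", "Intersection"],
    "Intersection": ["School"],
}

# method 1 forces a scan of the whole list; method 2 is a single lookup
from_home_v1 = [b for (a, b) in paths if a == "Home"]
from_home_v2 = can_go_to["Home"]
```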
Algorithms intro
Algorithms = (operations on different data structures) + (sets of instructions for executing them)
let's say you want to find the shortest path from home to school.
there are 3 possible paths.
Home -> store a -> store b -> school
home -> store b -> school
home -> intersection -> school
now compute the distance using longitude and latitude and find the shortest distance in kilometers
we want to make it something computer can understand
Steps of sets of instructions.
1. find places you can go from home
2. from each of those places, find all paths
3. keep track of the distance traveled so far
4. repeat this process until you get to school
5. compare the distance you've traveled
6. find the shortest path among all found distances
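the 6 steps above can be sketched as a recursive search over the hash-map version of the map. Rough sketch only: distances here are straight-line distances in degrees (not real kilometers), and the function names are made up:

```python
from math import dist as euclid  # straight-line distance, not km

coords = {
    "Home": (49.2, -123.4), "Store A": (49.3, -123.4),
    "Store B": (49.3, -123.3), "School": (49.3, -123.2),
    "Intersection": (49.2, -123.2),
}
can_go_to = {
    "Home": ["Store A", "Store B", "Intersection"],
    "Store A": ["Home", "Store B"],
    "Store B": ["School"],
    "School": ["Store B", "Intersection"],
    "Intersection": ["School"],
}

def shortest_path(place, goal, visited=()):
    # step 4: stop once we get to the goal
    if place == goal:
        return 0.0, [place]
    best = None
    # steps 1-2: from the current place, try every place we can go to
    for nxt in can_go_to[place]:
        if nxt in visited:                       # never walk in circles
            continue
        sub = shortest_path(nxt, goal, visited + (place,))
        if sub is None:
            continue
        # step 3: keep track of the distance traveled so far
        d = euclid(coords[place], coords[nxt]) + sub[0]
        # steps 5-6: compare distances, keep the shortest
        if best is None or d < best[0]:
            best = (d, [place] + sub[1])
    return best

d, route = shortest_path("Home", "School")
```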
depending on which data structure you are using, you may end up with a different algorithm, in some cases a completely different one.
if we use the array list, we have to go through the entire list every time, just in case.
with the hash map we find the "Home" row directly and don't have to go through the entire table, meaning for our program the hash map is more efficient to use.
method 3 array
- when we store elements in a "single-axis" array: one box split into equal partitions (slots), side by side.
- Like let's say your friends come to your party and you wanna keep track of who came; you write only the first letter of each name, and store it like {D, K, A, L, F, G, S, R ... O, W, P, I}
method 4 linked list
now same as array, but each box (partition) is separated, and only connected with strings
which of method 3 or 4 should i use and when?
- let's say 100 ppl showed up to your party, and the 98th person's name is misspelled and you want to fix that. If you used an Array, it's pretty easy: just find the 98th partition, because you know, let's say, that each box (partition) = 10 cm, so you
multiply 10 × (98−1) = 970 cm, and you find the 98th person's token.
- in a linked list it would be hard, because the strings connecting the boxes are different lengths. The first box is in the living room, another maybe in the kitchen, and to find the 98th box, you will need to count them one-by-one.
For our task, if we care about fixing names, we would choose method 3
However, let's say 5 ppl showed up to the party that you were not expecting. In the linked list you can add more boxes and more link strings easily, but with the array it's tricky. You may get another box with 100 partitions and use the 2
boxes together. Or you destroy the first box and transfer everything to a new, bigger array box, adding the 5 extra tokens during the transfer.
- so let's say we have no idea how many people may come to party, it may be 5-1000, in this case linked list will be more convenient, because they're easier to resize than the arrays.
an Array
[5, -2, 300, 20] - array of integers
["yeahhh", "ouch", "haha"] - strings
["chhh", 3, 2, "ohh"] - very unusual to have both strings and numbers
Code snippets in C
int sampleArray[5] = {2, 4, 6, 8, 100};
so we will have:
sampleArray = 2 4 6 8 100
and let's say we type:
sampleArray[0] = 20;
sampleArray[1] = -5;
we will then get:
sampleArray = 20 -5 6 8 100
let's say you want to add 2 more numbers — you would need 2 more partitions, but you can't do that; we will describe why later. (we allocate space before using it)
2 mechanisms of storing data
storage - permanent
memory (RAM) - temporary
when you launch applications, they will run in RAM, and if you have too many applications running, you may run out of memory.
When you quit the program, it no longer takes up memory (you can see this in your task manager).
So how that helps us with arrays?
when you create a variable int a = 1; that data, when the program runs, is stored in memory, and freed when the program closes. The integer 1 in memory is expressed as 00000...01, 2 -> 0000...10, - meaning each integer can be expressed
as a combination of 32 ones & zeros, or bits, - 32 bits.
memory - a long tape of bytes
byte = a small unit of data = 8 bits {10010101, 00100010}
usually you will be given an address line with an index of where your bits are stored.
how many bytes do you need to store each integer?
each byte has 8 bits
4 x 8 = 32 bits
so it takes 4 bytes to represent an integer.
now the logic of creating variables:
When you create a variable:
int a = 1; first of all, 1 is converted into 32 bits
the 32 bits of "1" will be split into 4 bytes on the memory tape, so our 00...001 will be stored across four consecutive indexes, each containing one 8-bit piece of the value
if you have another variable int b = 2 (2 => 00...010), it will be stored the same way, right after the indexes of the previous variable, so it may look like {00, 00, 00, 01, 00, 00, 00, 10} (each byte abbreviated to 2 digits here)
it works the same way for integers, decimals, and characters (though different types may take a different number of bytes).
What if you want to store an array of integers?
let's say we have int a =1; then that integer will be stored in 4 bytes from 120 to 123 (4 total bytes) bytes indexes
and let's say we type int sampleArray[3] = {5, 3, 20}; then for each integer we need 4 bytes, 4 x 3 = 12 bytes, so our array will be stored in memory indexes 124-135 (12 total bytes): 5 in the first 4 bytes, then 3 in the next 4, and 20 in
the last 4 bytes.
So why can't we add more numbers by allocating more memory?
simplified argument: if another variable sits right after the array, the OS can't just append new data to the array — it would have to shift everything after it, which is expensive and easily corrupts other data.
So you may need to delete your array, and create a new one somewhere in the late bytes of memory.
there is a strategy for resizable arrays: you start with a small array of 10, and when you want an 11th element, you copy the old items into a new array of 20 elements, then +10, +10... That's roughly how resizable arrays are implemented in
the Python list and Java ArrayList source codes.
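a sketch of that grow-by-copy strategy in Python (simplified; real Python lists and Java ArrayList grow by a factor of the current size rather than a fixed +10):

```python
class FixedChunkList:
    """Grow-by-copy list: when the backing array is full,
    allocate a bigger one and copy everything over."""
    def __init__(self, capacity=10):
        self.items = [None] * capacity   # pre-allocated "partitions"
        self.size = 0                    # how many slots are actually used

    def append(self, value):
        if self.size == len(self.items):              # array is full
            bigger = [None] * (len(self.items) + 10)  # new box, +10 partitions
            for i in range(self.size):                # copy the old items over
                bigger[i] = self.items[i]
            self.items = bigger
        self.items[self.size] = value
        self.size += 1

lst = FixedChunkList()
for n in range(25):      # 25 appends force the array to grow twice
    lst.append(n)
```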
Classes and objects
- we have a set of information: name, color, weight, and also a method introduceSelf(); usually they are shown like:
- name: "Tom"
- color: "red"
- weight: 30
+introduceSelf()
there is a convenient way to organize that data for one item/person: an object, which we store in a variable.
- But each time we make a new object, we have to assign the same values, so we make constructors.
- to create a constructor, we use the same name as our class:
OurClass(String n, String c, int w){
this.name = n; //"this" refers to the object we created.
this.color = c;
this.weight = w;
}
- when you define your own constructor, the default one no longer works.
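the same class and constructor idea, sketched in Python (in Python the constructor is called __init__ and self plays the role of this):

```python
class OurClass:
    def __init__(self, name, color, weight):
        self.name = name      # "self" refers to the object being created
        self.color = color
        self.weight = weight

    def introduce_self(self):
        return f"Hi, I'm {self.name}, I'm {self.color} and weigh {self.weight}"

# each new object gets its values assigned through the constructor
tom = OurClass("Tom", "red", 30)
```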
let's say we want 5 bots to interact with each other
- just add a method "interact" to the robot class. Then for each object: object1.interact(object2). This can be used for, let's say, twitter, where one person follows the other
Introduction to Linked lists (basics)
understanding how linked lists work from zero
- a linked list stores values differently from an array.
- focus: how it can be implemented
each element in the linked list can be represented as a class Box (can change the name)
has 2 attributes:
- int data, the data of this element; the first box of the list is called the head
- Box next, an attribute pointing to the box connected to this one. head.next refers to the next box; head.next.data refers to the data of the next box.
- Implementing
class Box {
int data; //we choose int to make it easy
Box next;
Box(int givenData) {
this.data = givenData;
}
}
so when you create a new object with the Box constructor, it creates a box whose next attribute starts out with a null value.
Also, it's good practice to use the same name in the constructor argument as the field we want to set, so in the code above change givenData to data
also instead of calling the class "Box," it's usually called 'Node'
class Node {
int data; //we choose int to make it easy
Node next;
Node(int data) {
this.data = data;
}
}
Node nodeA = new Node(6);
Node nodeB = new Node(3);
...
Node nodeE = new Node(1);
- however, that code creates just the boxes, and they are not connected. So we need nodeA.next = nodeB, and do that for every other node — we're basically connecting the boxes. This method works fine but is awkward, because you have to
create the nodes manually. To avoid all that work, create a class LinkedList and have it contain all of those nodes inside. There you add the functionality that creates and connects the nodes.
- practice making a function that will count the amount of nodes in the linked list. (nodes)
- here is the efficient solution
int countNodes(Node head) {
//assuming that head != null
int count = 1;
Node current = head; //our starting point
while (current.next != null) { //current.next simply goes to the next node
current = current.next; //select the next node
count += 1;
}
return count;
}
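the same Node class and counting function, sketched in Python so it can be run end-to-end (None plays the role of null):

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.next = None   # the "next box"; stays None until we connect one

def count_nodes(head):
    # assuming head is not None
    count = 1
    current = head                   # our starting point
    while current.next is not None:  # follow the string to the next box
        current = current.next
        count += 1
    return count

node_a = Node(6)
node_b = Node(3)
node_c = Node(1)
node_a.next = node_b    # connect the boxes by hand
node_b.next = node_c
```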
check how nodes behave in different ways separately of the function.
- One thing to notice: our linked list so far goes in only one direction. To make it go in the opposite direction too, we simply add a new attribute Node prev; now when we take a node, say middle, and access middle.prev, we
get the node before middle.
- Vocab: static function - a function that belongs to the class itself rather than to an object; it can be called directly using the class name (test/proof)
Introduction to Recursion
- recursion - a way of solving a problem by having a function call itself.
- think of factorials first (it's a good way of learning recursion)
factorials
- n! = n(n-1)(n-2)...3(2)(1)
- 4! = 4(3)(2)(1) = 24
- 2! = 2(1) = 2
- 1! = 1
- 0! = 1
- ! you may notice that (n-1)(n-2)...3(2)(1) = (n-1)!, because it's n! without the factor n. That means we can rewrite n! as n! = n(n-1)!, yet that's
not a complete definition of factorials.
- if you plug in any positive number, it works just fine until the last factor reaches 1. However, plug in 0 and it breaks, because it would give 0! = 0(-1)!, so the definition n! = n(n-1)! is only valid if we restrict n to n > 0 (in integers). But we can
separate the factorial into 2 cases:
n! = n(n-1)!  if n >= 1
n! = 1        if n = 0
- so, every time we get 1(0!), we know that 0! = 1, so we get (1)(1)
write a function factorial(n), and try to translate the statement above to code before looking at solution.
int fact(int n){
//assuming that n is a positive integer or 0
if (n>=1) { return n * fact(n-1);}
else { return 1;}
}
// trace through the calls to see how this builds up the result of the factorial
- that func works like f(4) -> 4 * f(3), f(3) -> 3 * f(2), and so on until the last factor is ... * f(0) = ... * 1
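the same function in Python, for quick testing:

```python
def fact(n):
    # assuming n is a non-negative integer
    if n >= 1:
        return n * fact(n - 1)   # recursive case: n! = n * (n-1)!
    return 1                     # base case: 0! = 1
```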
fibonacci sequence
- 1 + 1 = 2, 2+1 = 3, 3+2 = 5, 5+ 3 = 8
- fib(n), which means, that if we pass in 4, we should find the 4th fibonacci number, which is 3. Try to solve the problem yourself using recursion.
- Formula + examples:
F_n = F_n-1 + F_n-2
F_5 = F_4 + F_3
F_2 = F_1 + F_0
F_1 = F_0 + F_-1, which is undefined
- so each fibonacci number is the sum of the previous 2 fibonacci numbers. But if we try to apply the formula to the first or second fibonacci number, we need F_2 = F_1 + F_0 or F_1 = F_0 + F_-1, which involve
values not in the fibonacci sequence, so we have to restrict it to:
F_n = F_n-1 + F_n-2, if n >= 3
F_n = 1, if n = 1 or n = 2
- now write a function:
int fib(int n){
//assuming that n is a positive number
if (n >= 3){ return fib(n-1) + fib(n-2);}
else {return 1;}
}
// this is not the efficient way, look at dynamic programming tutorial (gonna happen later in Data Structs tutorial)
we get fib(3) -> fib(2) + fib(1)
NOTE: The condition where function calls itself is called "recursive case", and where it doesn't and returns a value "base case", our else statement, for example, is a base case usually.
- some problems to solve with recursion
a frog looks over a river 11ft wide with 10 stones 1ft apart (counting the starting bank and the far bank, that's 12 positions 1ft apart); the frog can jump either 1ft or 2ft and can never go back. In how many ways can it cross? (try solving a
simpler version first: a 2ft river with 1 stone has only 2 ways — two small jumps, or one big jump)
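a sketch of the frog problem: with d feet left to cross, the frog jumps either 1ft or 2ft, so ways(d) = ways(d-1) + ways(d-2) — the fibonacci recurrence in disguise (the function name is made up):

```python
def ways(d):
    # number of ways to cross d remaining feet with 1ft or 2ft jumps
    if d == 0:
        return 1    # already across: exactly one way (do nothing)
    if d == 1:
        return 1    # only a single 1ft jump fits
    # recursive case: first jump is either 1ft or 2ft
    return ways(d - 1) + ways(d - 2)
```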
Big O Notation
- helps to find the time it takes to run your function
- or rather how time to run the function grows as size of the input grows.
- let's say we have an array of a thousand or 100 thousand elements, and we want to write a function that adds up all those elements and returns the sum.
our pseudocode would be:
def find_sum(given_array):
total = 0
for each i in given_array:
total += i
return total
how long will it take to run this function? That depends on a lot of factors, most of them in your computer, some in the programming language you are using. Instead of asking that question directly, you may ask a different question:
"How does the runtime of this function grow?" (runtime = time to execute a piece of code)
- to answer question like that, you get 2 tools:
Big O Notation
Time Complexity
we may have randomly generated arrays of different sizes, and let's say we want to find the average time to run the function for each n, where n = size of the given array, i.e. the number of elements. if we test some functions, and we
notice that as n (input size) increases, the t (time) also increases linearly, we call it linear time complexity. So:
Time Complexity = a way of showing how the runtime of a function increases
Linear time type is the one we just did.
constant time - time stays the same as we add elements
quadratic time - time grows proportionally to n^2
- is there a way to express time complexity more in the systematic way? that's when we use Big O Notation.
Big O Notation
Linear time --> O(n)
constant time --> O(1)
quadratic time --> O(n^2)
How to define it?
take the function we got earlier, compare the timing and amount of elements it processed.
express it as a mathematical expression based on the graph.
T = an + b (where a & b are constants)
steps to do that:
1. find the fastest growing term (based on the graph)
2. take out the coefficient
e.g. T = cn^2 + dn + e --> fastest growing term is cn^2 --> drop the coefficient c --> O(n^2)
usually quadratic complexity comes from 2 loops, one outer and one inner, both running n times, which gives the n^2 growth.
On older or newer computers, the expression O(n) stays the same no matter how fast or slow the machine is — only the actual time changes, not the time complexity.
O notations help you determine which functions are fast, even if they are in the same code.
if you ever have 2 loops in your function that aren't nested inside one another, their runtimes only add up: two sequential O(n^2) parts are still O(n^2), not O(2n^2) and not a higher power — only the fastest-growing term counts.
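a quick way to see the difference between nested and non-nested loops — counting steps instead of timing:

```python
def sequential_steps(n):
    steps = 0
    for _ in range(n):   # first loop: n steps
        steps += 1
    for _ in range(n):   # second loop, not nested: n more steps
        steps += 1
    return steps         # 2n total -> still O(n)

def nested_steps(n):
    steps = 0
    for _ in range(n):       # outer loop runs n times
        for _ in range(n):   # inner loop runs n times per outer pass
            steps += 1
    return steps             # n*n total -> O(n^2)
```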
Intro to Trees
- in a list one node can only connect to the other node
- in a tree, a node can connect to multiple nodes
Class Node { Node child1; Node child2; Node child3; }
- when a tree node has at most 2 children, it's a binary tree
- a linked list is also a tree (one where every node has at most one child)
- trees can't have 2 references linking to the same node — two parents can't point at the same child, no matter the direction of pointing
- a binary tree in code will look like: Class Node { int data; Node left; Node right; }
- create a function find_sum(root) -> 20, which sums all the data of the tree, and does that recursively. Find the base case: if root == null. sample code:
def find_sum(root):
    if root == null: return 0
    return root.data + find_sum(root.left) + find_sum(root.right)
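the find_sum sketch, made runnable in Python (None plays the role of null; the example tree is made up so the values sum to 20):

```python
class Node:
    def __init__(self, data, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right

def find_sum(root):
    if root is None:     # base case: an empty subtree adds nothing
        return 0
    return root.data + find_sum(root.left) + find_sum(root.right)

# a small binary tree whose values add up to 20
root = Node(10,
            Node(4, Node(1), Node(3)),
            Node(2))
```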
binary search
suppose you're given an array, and given a target number
find the number in the array using a function
- search(arr, target)
- one choice: linear search, ~n/2 checks on average, i.e. O(n)
- let's say we have an array from -50 to 50, and target = 20. The target must be between -50 and 50. express the idea by defining 2 pointers "left" and "right" as L & R, where L will be -50, R is 50.
- first step in binary search is to find the number in the middle.
- time complexity? 40->20->10->...->1, or n->n/2->n/4->...->1. basically n/(2^x) = 1, which gives x ≈ log2(n), --> O(log n).
log2(10,000,000) = 23.253
Example:
arr = [-2, 3, 4, 7, 8, 9, 11, 13]
target = 11
search(arr, target)
the insides of our function will look like: (Pseudocode)
left = 0
right = arr.length -1
while left <= right:
mid = (left+right)/2
if arr[mid] == target:
return mid
else if target < arr[mid]:
right = mid-1
else:
left = mid + 1
return -1 (if target doesn't exist in the array)
which will take O(log n) as expected.
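the pseudocode above translated to Python (// is Python's integer division):

```python
def search(arr, target):
    left = 0
    right = len(arr) - 1
    while left <= right:
        mid = (left + right) // 2     # index in the middle of the section
        if arr[mid] == target:
            return mid
        elif target < arr[mid]:
            right = mid - 1           # target must be in the left half
        else:
            left = mid + 1            # target must be in the right half
    return -1                         # target doesn't exist in the array

arr = [-2, 3, 4, 7, 8, 9, 11, 13]
```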
QuickSort - overview
- you're given an array, and you should sort it in ascending order, so from arr = [3, -2, -1, 0, 2, 4, 1] to [-2, -1, 0, 1, 2, 3, 4]. one approach is Quick Sort.
quick sort implements a recursive function that takes 3 arguments: the given array, l, & r, where l and r are 2 integers — indexes marking the section of the array to sort. To
implement that logic, first define the base case: if l >= r, return — if l == r, the section has only one element, meaning it's already sorted; if l > r, there is nothing to do. In the recursive case, we use a helper function
partition(arr, l, r). The first step of partitioning is to choose a pivot — we choose the last element — and divide the section into 2 groups: numbers less than the pivot go to its left, the rest to its right (this works fine even when
negative numbers are mixed with positive ones). That puts the pivot in its final position. We store the index partition returns in a variable p (partition implementation later), then we
call the QS function on the sections around the pivot: qs(arr, l, p - 1) and qs(arr, p + 1, r)
Partitioning:
we want to partition the array [2, 1, 4, 13, 14, 12, 3, 16, 5, 2, 10]
our pivot is the last element 10, we will use 2 indexes, j & i
j = the number currently being examined; a for-loop moves j from the beginning of the section up to the number just before the pivot
i = always points to the last number found that is less than the pivot (the end of the "smaller" group)
each time arr[j] < pivot, i moves forward and arr[i] & arr[j] swap, so the smaller number joins the group on the left.
let's try to partition [-2, 3, -1, 5, 4, -3, 0]
def partition (arr, l, r):
pivot = arr [r]
i = l - 1
for j from l upto r - 1:
if arr[j] < pivot:
i += 1
swap arr[i] and arr[j] //if i & j happen to be the same index, the swap does nothing, and the loop continues properly
swap arr[i + 1] and arr[r]
return i + 1
- and that's the partitioning pseudocode.
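putting qs and the partitioning together in Python (in-place, last element as the pivot, as described above):

```python
def partition(arr, l, r):
    pivot = arr[r]
    i = l - 1                        # end of the "less than pivot" group
    for j in range(l, r):            # j examines each number before the pivot
        if arr[j] < pivot:
            i += 1
            arr[i], arr[j] = arr[j], arr[i]
    arr[i + 1], arr[r] = arr[r], arr[i + 1]   # drop the pivot between the groups
    return i + 1                     # the pivot's final position

def qs(arr, l, r):
    if l >= r:                       # base case: 0 or 1 elements
        return
    p = partition(arr, l, r)
    qs(arr, l, p - 1)                # sort the left group
    qs(arr, p + 1, r)                # sort the right group

nums = [3, -2, -1, 0, 2, 4, 1]
qs(nums, 0, len(nums) - 1)
```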
Time complexity of QuickSort?
Worst Case: "array is already sorted" or "a lot of duplicates in the array"
example: [1, 2, 3, 4, 5, 6, 7] for already sorted; we would try qs(arr, 0, 6), where 0 & 6 are the indexes of the first & last elements respectively, which goes like
qs(arr, 0, 6) -> qs(arr, 0, 5) -> qs(arr, 0, 4)-> ... -> qs(arr, 0, 1)
n -> n-1 -> n-2 -> ... -> 1
n(n+1)/2 = O(n^2) in the worst case
Best Case:
example [-2, 3, -1, 5, 4, -3, 0], where the number of integers less than the pivot = the number greater than the pivot.
meaning we imagine a recursion tree: n at the top, then two sections of n/2, then four of n/4, and so on — we keep splitting the groups evenly until the array is fully sorted.
each level of that tree does about O(n) total work, and there are about log n levels, so O(n log n) (find a better proof)
Average case: also O(n log n) on average (research the proof)
implementation Details:
choosing random pivot
random element
median of three, - choose 3 elements randomly and pick median of 3, decreasing probability of picking a bad pivot.
Dealing with duplicates, where quicksort doesn't perform well.
3-way quicksort, - instead of dividing into 2 groups, divide into 3
group < pivot
group = pivot
group > pivot
Stacks & Queues
Stack
- stack - "bunch of pancakes on top of each other", you can put one data on top of another.
to get the data at the very bottom, you have to take off the pieces on top of it — whatever piece of data you put in last has to come out first
to implement it, use an array and an integer representing the index of the last element, which = -1 if the stack is empty. The pointer is incremented by 1 each time you add an element. -> basically, a stack is an array plus
a pointer that always equals the index of the last element of the array
pop() - removes the top element from the stack
adding an element is O(1); if the elements exceed the length of the array, either throw an error or create a new, bigger array.
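a sketch of that array-backed stack in Python (the capacity and names are just for this sketch):

```python
class Stack:
    def __init__(self, capacity=10):
        self.items = [None] * capacity
        self.top = -1                 # index of the last element; -1 = empty

    def push(self, value):            # O(1)
        if self.top + 1 == len(self.items):
            raise IndexError("stack is full")  # or grow a new array here
        self.top += 1
        self.items[self.top] = value

    def pop(self):                    # O(1): last in, first out
        if self.top == -1:
            raise IndexError("stack is empty")
        value = self.items[self.top]
        self.top -= 1
        return value

s = Stack()
s.push(1); s.push(2); s.push(3)
```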
Queue
- a bunch of ppl lining up in a line/queue.
first element gets out first
implement with an array and 2 pointers: the 1st points at the first element, the 2nd points at the space after the last element (at null probably; that's where a new element goes).
both pointers will be at 0 if array is empty.
let's say you want to add another person, but the pointer after the last element has moved out of bounds; to fix that, wrap the pointer around to the beginning of the array. So a queue is pretty much a circular array, if you visualize how it
works.
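a sketch of the circular-array queue in Python (an extra count field is one way to tell "empty" apart from "full"; % wraps both pointers around):

```python
class CircularQueue:
    def __init__(self, capacity=4):
        self.items = [None] * capacity
        self.front = 0     # index of the first element
        self.back = 0      # index of the slot after the last element
        self.count = 0     # needed to tell "empty" apart from "full"

    def enqueue(self, value):
        if self.count == len(self.items):
            raise IndexError("queue is full")
        self.items[self.back] = value
        self.back = (self.back + 1) % len(self.items)   # wrap around
        self.count += 1

    def dequeue(self):     # first element gets out first
        if self.count == 0:
            raise IndexError("queue is empty")
        value = self.items[self.front]
        self.front = (self.front + 1) % len(self.items)
        self.count -= 1
        return value

q = CircularQueue()
q.enqueue("Ann"); q.enqueue("Bob"); q.enqueue("Cam")
first = q.dequeue()                  # Ann lined up first, so Ann gets out first
q.enqueue("Dee"); q.enqueue("Eve")   # Eve wraps into Ann's old slot
```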
Deque
deque - double-ended queue
you can remove elements from either sides of the queue.
Hash Tables & Dictionaries
Dictionary
data structure consisting of key : value pairs
operations
search
insert
delete
when implementing a dictionary, always try to make all 3 of those operations run in O(1), using a hash table
hash table with dictionary logic
you have an array/hash table to represent a dictionary.
decide which key-value pair goes into which element of the array; there are several ways, but we choose:
look at the first letters of the keys; let's say we have:
"Paul" 29
"Jane" 35
"Chloe" 88
"Alex" 18
and we want to define index for each in the array of 8 elements (0-7).
- we can take the first letter of the key and measure its distance from 'a' => 'p' - 'a' = 15, but 15 is out of the range of the array, so we do 15 mod 8 = 7; now the "Paul" entry will be put at index 7 of the array.
- that entire process can be represented as a function h1("Paul") -> 7, but there is an issue: we may have collisions, when keys starting with the same letter get the same index.
- we want to avoid collisions; one way of dealing with that is to hash all of the letters (not just the first) with a function like djb2 (research it)
when choosing a function for hash table:
fast to compute
avoid collisions
uniformly distributed (not necessary for practical purposes)
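a sketch of djb2 reduced mod the table size (5381 and 33 are djb2's standard constants; "Peter" is just a made-up second key to show that same-first-letter names no longer collide automatically):

```python
def djb2(key):
    h = 5381
    for ch in key:
        h = h * 33 + ord(ch)   # mix every letter in, not just the first one
    return h

def index_for(key, table_size=8):
    return djb2(key) % table_size   # squeeze the hash into the array's range

i_paul = index_for("Paul")
i_peter = index_for("Peter")
```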
Collisions handling
chaining
- this method uses a linked list inside the hashmap (kinda like a 2D array but 2 different ones)
method "chaining"
instead of storing the key-value directly in array, store them in linked-list. From each element of the array, we will have a pointer to that linked list.
so new key-value pairs go into the linked list attached to that index of the array — each new entry is added at the first index (head) of that linked list.
time complexity of method:
insertion = O(1)
search:
to explain that we define a=n/m
n = number of elements we put so far
m = length of the ARRAY
a(alpha) = shows how full the hash table is
search takes O(1+alpha)
if you keep alpha below 1, search will only take a constant amount of time
linear probing
- without data structure inside hashmap
we take the key-value data, if it is supposed to go in one element of hash table that is busy, then we check if it can go to the right element from it.
basically right & right... until we find an empty
it's an okay approach, but it's inefficient when the table holds many elements, because it walks through element after element until it finds a free spot.
To solve that, use double-hashing
double-hashing
- same as linear probing kinda
- define a fixed gap that tells how far ahead to check for the next spot
- e.g. with a gap of 3, it checks every 3rd element ahead until it finds a free spot; if you choose a gap of 1, it's the same as the linear probing method.
summarizing (! I don't understand anything, research later)
i = h_1 (key) mod 8
(i+c) mod 8
(i +2c) mod 8
to pick the c use gcd(c,m) = 1
c <- {h_2 (key) mod (m-1)} + 1
approaches:
h_2 (key) = h_1 (key + 'd')
h_2 (key) = h_1 (key)
double-hashing implementation depends on your machine you are using.
self-notes:
research hash maps & hash tables separately
Hash Tables by BroCode
collection of Key-Value pairs, each pair is called "entry"
let's pretend we're a teacher creating student IDs.
take each key, insert into hashCode method: key.hashCode()
each hashCode method takes the key as input, plugs it into a formula, and spits out an integer called "a hash". If the keys are integers, the hash is the same as the key, but if the hashes are too large, they are divided by the
capacity of the hash table and we use the remainder (the modulus operator), which gives an index to store the value at. But we may get a collision
now let's say each key is a string; we calculate its hash code using the formula s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], then take each hash mod the capacity of the table, and store the entry at that index.
we will get 2 collisions again. We call it a bucket when 2 entries go into the same table element.
When we have a bucket, we make that hash table element (JUST ONE ELEMENT) into a linked list.
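the string formula above, sketched in Python (ignoring Java's 32-bit overflow, which doesn't matter for short strings; the capacity is just an example):

```python
def string_hash(s):
    # s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], computed iteratively
    h = 0
    for ch in s:
        h = h * 31 + ord(ch)
    return h

def bucket_index(key, capacity=10):
    # remainder after dividing by the table capacity = the index to store at
    return string_hash(key) % capacity
```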
Usually, if you want fewer collisions, you make the hash table bigger (so entries land next to each other instead of piling into buckets), but then the hash table takes more space, so you usually balance between fewer collisions and a
smaller hash table size.
EXAMPLE OF THE HASH TABLE.... watch the rest of the BroCode ___ timecode 6:00
collisions