School of Computer Science and Electronics Engineering, University of Essex
Lecture 3: Data Exploration: Summarising, presenting and compressing data
CE880: An Approachable Introduction to Data Science
Haider Raza
Tuesday, 31 Jan 2023
About Myself
- Name: Haider Raza
- Position: Senior Lecturer in Artificial Intelligence
- Research interests: AI, Machine Learning, Data Science
- Contact: [email protected]
- Academic Support Hours: 1-2 PM on Friday via Zoom; the Zoom link is available on Moodle
- Website: www.sagihaider.com
Common file formats in Data Science
Image source: https://www.weirdgeek.com/
Reading Zipped Files
Zip files are a gift from the coding gods. It is like they have fallen from heaven to save
our storage space and time. Old school programmers and computer users will certainly
relate to how we used to copy gigantic installation files in Zip format.
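As a sketch (assuming pandas is installed), a zipped CSV can be read without unpacking it to disk; here the archive is built in memory so the example is self-contained:

```python
import io
import zipfile

import pandas as pd

# Build a small zip archive in memory, then read a CSV straight out of it.
# (For a .zip on disk that holds a single CSV, pandas can also open it
# directly via pd.read_csv("archive.zip", compression="zip").)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data.csv", "a,b\n1,2\n3,4\n")

buf.seek(0)
with zipfile.ZipFile(buf) as zf:
    with zf.open("data.csv") as f:
        df = pd.read_csv(f)

print(df.shape)  # (2, 2)
```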
Source: analyticsvidhya
Reading Text Files
Text files are one of the most common file formats to store data. Python makes it
very easy to read data from text files. Python provides the ‘open()‘ function, which
takes the file path and the file access mode as its parameters. For reading a text file,
the file access mode is ‘r‘. The other access modes are listed below:
- ‘w‘ – write to a file
- ‘r+‘ or ‘w+‘ – read and write to a file
- ‘a‘ – append to an already existing file
- ‘a+‘ – append to a file after reading
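The access modes above can be sketched with a small round trip on a hypothetical file name:

```python
# 'w' – write mode creates the file (or overwrites an existing one)
with open("notes.txt", "w") as f:
    f.write("first line\n")

# 'a' – append mode adds to the end of the existing file
with open("notes.txt", "a") as f:
    f.write("second line\n")

# 'r' – read mode (the default) reads the contents back
with open("notes.txt", "r") as f:
    lines = f.readlines()

print(lines)  # ['first line\n', 'second line\n']
```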
Reading CSV Files
A CSV (Comma Separated Values) file is the most common type of file that a data
scientist will ever work with. These files use a comma ‘,‘ as the delimiter to separate
the values, and each row in a CSV file is a data record.
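A minimal sketch (assuming pandas), with an in-memory buffer standing in for a file path:

```python
import io

import pandas as pd

# Two data records under a header row; read_csv accepts a path or any
# file-like object, so StringIO keeps the example self-contained.
csv_text = "name,score\nalice,90\nbob,85\n"
df = pd.read_csv(io.StringIO(csv_text))

print(len(df))                 # 2 data records
print(df.columns.tolist())     # ['name', 'score']
```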
Reading CSV Files . . .
But CSV files can run into problems if the values themselves contain commas. This can
be overcome by using a different delimiter to separate the fields, such as a tab or ‘;‘.
These files can also be imported with the ‘read_csv()‘ function by specifying the
delimiter in the ‘sep‘ parameter, for example when reading a TSV (Tab Separated
Values) file:
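As a sketch, passing ‘sep="\t"‘ to ‘read_csv()‘ parses tab-separated data; an in-memory buffer again stands in for a file:

```python
import io

import pandas as pd

# Same data as before, but tab-separated: override the default delimiter.
tsv_text = "name\tscore\nalice\t90\nbob\t85\n"
df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(df.columns.tolist())   # ['name', 'score']
print(df.shape)              # (2, 2)
```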
Reading Excel Files
Pandas has a very handy function called ‘read_excel()‘ to read Excel files. We can
easily read data from any sheet we wish by providing its name in the ‘sheet_name‘
parameter of the ‘read_excel()‘ function.
Importing Data from a Database
Data in databases is stored in tables, and the systems that manage them are known as
relational database management systems (RDBMS). However, connecting to an
RDBMS and retrieving data from it can prove to be quite a challenging task. You will
need to import the ‘sqlite3‘ module to use SQLite.
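A minimal sketch using the standard-library ‘sqlite3‘ module with a hypothetical in-memory database:

```python
import sqlite3

# ":memory:" creates a throwaway database; a file path would persist it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, score REAL)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [("alice", 90.0), ("bob", 85.0)],
)

rows = conn.execute("SELECT name, score FROM students").fetchall()
conn.close()

print(rows)  # [('alice', 90.0), ('bob', 85.0)]
```

pandas can also pull a query result straight into a DataFrame with ‘pd.read_sql_query(sql, conn)‘.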
Reading JSON Files
JSON (JavaScript Object Notation) is a lightweight, human-readable format for storing
and exchanging data. It is easy for machines to parse and generate, and its syntax is
based on the JavaScript programming language. JSON files store data within { },
similar to how a dictionary stores it in Python.
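A minimal sketch with the standard-library ‘json‘ module, using a made-up record:

```python
import json

# JSON text maps naturally onto a Python dict.
text = '{"name": "alice", "scores": [90, 85]}'
record = json.loads(text)

print(record["name"])        # alice
print(record["scores"][1])   # 85

# json.load()/json.dump() do the same thing against file objects,
# and pandas offers read_json() for tabular JSON.
```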
Reading Data from Pickle
Pickle files are used to store the serialized form of Python objects. This means objects
such as list, set, tuple, dict, etc. are converted to a byte stream before being stored
on disk, which allows you to continue working with the same objects later on. They
are particularly useful when you have trained a machine learning model and want to
save it to make predictions later on.
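A round-trip sketch; the dict below stands in for a trained model, since any Python object is pickled the same way:

```python
import pickle

# Hypothetical "model": a plain dict of parameters.
model = {"weights": [0.1, 0.2], "bias": 0.5}

blob = pickle.dumps(model)      # serialise the object to a byte stream
restored = pickle.loads(blob)   # deserialise it back into an equal object

print(restored == model)  # True
# pickle.dump(model, f) / pickle.load(f) do the same against a file
# opened in binary mode ("wb" / "rb").
```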
Reading HTML using Python
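The code for this slide was shown as an image. As a standard-library sketch (in practice pandas' ‘read_html()‘, which needs lxml or beautifulsoup4 installed, is the usual tool for pulling tables out of web pages), ‘html.parser‘ can extract table cells from an HTML string:

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the text content of every <td>/<th> cell."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []
    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self.in_cell = True
    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False
    def handle_data(self, data):
        if self.in_cell:
            self.cells.append(data.strip())

html_doc = ("<table><tr><th>name</th><th>score</th></tr>"
            "<tr><td>alice</td><td>42</td></tr></table>")
parser = TableParser()
parser.feed(html_doc)

print(parser.cells)  # ['name', 'score', 'alice', '42']
```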
Uploading Data to Colab
There are three different ways of uploading data to Colab:
- Manually locate the file
- Mounting Google Drive
- Using Git or an API
Manually locate the file
Mounting Google Drive
Using Git or API
Types of Analytics in Data Science
- Descriptive Analytics tells us what happened in the past and helps a business
understand how it is performing by providing context that helps stakeholders
interpret information. Examples: year-over-year pricing changes,
month-over-month sales growth, or total revenue per subscriber.
- Diagnostic Analytics takes descriptive data a step further and helps you
understand why something happened in the past. Examples: examining market
demand, explaining customer behavior, identifying technology issues.
- Predictive Analytics predicts what is most likely to happen in the future and
provides companies with actionable insights based on that information.
Examples: forecasting future cash flow, early detection of disease.
- Prescriptive Analytics provides recommendations for actions that will take
advantage of the predictions and guides the possible actions toward a solution.
Examples: investment decisions, fraud detection, algorithmic recommendations
(Instagram, TikTok).
Central Tendency
- Mean: The average of the dataset.
- Median: The middle value of an ordered dataset.
- Mode: The most frequent value in the dataset. If multiple values occur equally
most often, the distribution is multimodal.
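All three measures are available in the standard-library ‘statistics‘ module; the data here are made up:

```python
import statistics

data = [2, 3, 3, 5, 7, 10]

print(statistics.mean(data))     # 5
print(statistics.median(data))   # 4.0 (average of the two middle values)
print(statistics.mode(data))     # 3  (appears twice)
```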
Central Tendency
- Skewness: A measure of the asymmetry of a distribution.
- Kurtosis: A measure of whether the data are heavy-tailed or light-tailed relative
to a normal distribution.
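pandas exposes both measures directly on a Series (‘skew()‘ and ‘kurt()‘, the latter reporting excess kurtosis); a sketch on made-up data:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])    # perfectly symmetric
print(s.skew())                    # 0.0 – no asymmetry
print(s.kurt())                    # -1.2 – light-tailed vs. a normal

t = pd.Series([1, 1, 1, 2, 10])    # long right tail
print(t.skew() > 0)                # True – right-skewed
```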
Variability
Range: The difference between the highest and lowest values in the dataset.
Percentiles, Quartiles and Interquartile Range (IQR)
- Percentiles: Values that indicate the value below which a given percentage of
observations in a group of observations falls.
- Quartiles: Values that divide the data points into four more or less equal parts,
or quarters.
- Interquartile Range (IQR): A measure of statistical dispersion and variability
based on dividing a dataset into quartiles: IQR = Q3 − Q1.
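These can be computed with NumPy's ‘percentile()‘ function; a sketch on made-up data:

```python
import numpy as np

data = np.array([1, 3, 5, 7, 9, 11, 13, 15])

q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
data_range = data.max() - data.min()

print(q1, q2, q3)    # 4.5 8.0 11.5
print(iqr)           # 7.0
print(data_range)    # 14
```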
Variability
Variance: The average squared difference of the values from the mean; it measures
how spread out a set of data is relative to the mean.
Standard Deviation: The square root of the variance; it measures the typical
difference between each data point and the mean.
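A sketch with the standard-library ‘statistics‘ module (the ‘p‘-prefixed functions use the population formulas; ‘variance()‘/‘stdev()‘ are the sample versions):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean is 5

var = statistics.pvariance(data)  # population variance
std = statistics.pstdev(data)     # population standard deviation

print(var)   # 4
print(std)   # 2.0 (the square root of the variance)
```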
Relationship Between Variables
Causality: A relationship between two events where one event is affected by the other.
Covariance: A quantitative measure of the joint variability between two or more
variables.
Correlation: Measures the relationship between two variables and ranges from -1 to 1;
it is the normalized version of covariance.
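A sketch with NumPy: for a perfect linear relationship the correlation is 1, while the covariance depends on the scale of the variables:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + 1                      # perfect linear relationship

cov = np.cov(x, y)[0, 1]           # sample covariance
corr = np.corrcoef(x, y)[0, 1]     # Pearson correlation (normalized)

print(cov)              # 5.0 – scale-dependent
print(round(corr, 6))   # 1.0 – perfect positive correlation
```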
Hypothesis Testing and Statistical Significance
Null Hypothesis: A general statement that there is no relationship between two
measured phenomena or no association among groups.
Alternative Hypothesis: The statement contrary to the null hypothesis.
In statistical hypothesis testing, a type I error is the rejection of a true null hypothesis,
while a type II error is the non-rejection of a false null hypothesis.
Example hypothesis: “Students who eat breakfast will perform better on a math exam
than students who do not eat breakfast.”
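As a pure-Python sketch of the breakfast example (the exam scores below are made up), we can compute Welch's t statistic for two independent samples; in practice ‘scipy.stats.ttest_ind(a, b, equal_var=False)‘ returns this statistic together with the p-value used to decide whether to reject the null hypothesis:

```python
import math
import statistics

# Hypothetical exam scores for the two groups (made-up data).
breakfast    = [78, 85, 90, 72, 88, 84]
no_breakfast = [70, 75, 80, 68, 74, 77]

def welch_t(a, b):
    """Welch's t statistic: difference in means over its standard error."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

t = welch_t(breakfast, no_breakfast)
print(t > 0)   # True – the breakfast group scored higher on average
```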
Clustering
We would like to partition our data into different groups.
Code: Generate the half-moon data
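The code on this slide was shown as an image; a minimal sketch using scikit-learn's ‘make_moons‘ (an assumption about what the original code used) is:

```python
from sklearn.datasets import make_moons

# 1500 points in two interleaving half-moons; `noise` jitters the points.
X, y = make_moons(n_samples=1500, noise=0.05, random_state=42)

print(X.shape)                   # (1500, 2) – 2-D coordinates
print(sorted(set(y.tolist())))   # [0, 1]    – one label per half-moon
```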
Half-Moon Data with 1500 points
Code: Generate the circle data
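Again the slide's code was an image; the analogous scikit-learn sketch uses ‘make_circles‘ (assumed to match the original):

```python
from sklearn.datasets import make_circles

# 1500 points in two concentric circles; `factor` sets the radius ratio
# between the inner and outer circle.
X, y = make_circles(n_samples=1500, noise=0.05, factor=0.5, random_state=42)

print(X.shape)                   # (1500, 2)
print(sorted(set(y.tolist())))   # [0, 1] – inner vs. outer circle
```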
Circle Data with 1500 points
K-means Algorithm
Possibly the most popular algorithm for clustering
k-Means clustering aims to partition n observations into k clusters in which each
observation belongs to the cluster with the nearest mean, serving as a prototype of the
cluster.
- Initialise with ‘n_clusters‘ random “centroids”
- Iterate over two steps:
  - Assign each point to the centroid it is closest to, using Euclidean
distance
  - Create new centroids by taking the average of the assigned points in
each dimension
- Repeat until the assignments stop changing
- The algorithm is unstable: different starting positions can result in different
clusters
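The loop above is what scikit-learn's ‘KMeans‘ implements; a sketch on two made-up, well-separated blobs:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs, so k-means with k=2 recovers them cleanly.
rng = np.random.default_rng(0)
blob_a = rng.normal((0, 0), 0.3, (100, 2))
blob_b = rng.normal((5, 5), 0.3, (100, 2))
X = np.vstack([blob_a, blob_b])

# n_init=10 restarts from 10 random initialisations and keeps the best,
# mitigating the instability noted above.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_

print(len(set(labels.tolist())))        # 2 clusters found
print(len(set(labels[:100].tolist())))  # 1 – the first blob shares a label
```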
Metrics for clustering
Completeness: a clustering satisfies completeness if all data points that are members
of a given class are assigned to the same cluster.
Metrics for clustering
Silhouette Coefficient:
- +1 indicates that the sample is far away from the neighboring clusters
- 0 indicates that the sample is on or very close to the decision boundary between
two neighboring clusters
- Negative values indicate that those samples might have been assigned to the
wrong cluster
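scikit-learn computes this as ‘silhouette_score‘; a sketch on made-up data where the labels match two tight, well-separated blobs, so the score should be close to +1:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two tight, well-separated blobs with the correct labels.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal((0, 0), 0.2, (50, 2)),
    rng.normal((6, 6), 0.2, (50, 2)),
])
labels = np.array([0] * 50 + [1] * 50)

score = silhouette_score(X, labels)
print(score > 0.9)   # True – near-perfect clustering
```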
Let's run it for clustering moon data
Two clusters on moon data
Let's run it for clustering circle data
Two clusters on Circle data
Disadvantage of k-means clustering
- It is difficult to choose the value of k in advance
- Use an elbow plot to select the best value of k
- Different initial partitions can result in different final clusters
Density-based spatial clustering of applications with noise (DBSCAN)
The DBSCAN algorithm is used to find associations and structures in data that are
hard to find manually but that can be relevant and useful for finding patterns and
predicting trends.
It depends on two parameters:
- eps: the maximum distance between two points for them to be considered
neighbors. If the distance between two points is lower than or equal to this
value (eps), the points are considered neighbors.
- minPoints: the minimum number of points required to form a dense region. For
example, if we set the minPoints parameter to 5, then we need at least 5 points
to form a dense region.
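In scikit-learn these parameters are ‘eps‘ and ‘min_samples‘; as a sketch on the half-moon data (the parameter values below are illustrative choices), DBSCAN recovers each moon as one cluster where k-means would split them with a straight boundary:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=1500, noise=0.05, random_state=42)

# eps and min_samples map onto the eps and minPoints parameters above.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# DBSCAN labels noise points -1, so exclude them when counting clusters.
n_clusters = len(set(db.labels_.tolist()) - {-1})
print(n_clusters)   # 2 – one cluster per half-moon
```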
DBSCAN: Advantages and disadvantages
Advantages:
- Can discover arbitrarily shaped clusters
- Can find a cluster completely surrounded by a different cluster
Disadvantages:
- Datasets with varying densities are tricky
- Sensitive to its two parameters
Let’s run it for clustering moon data with DBSCAN
Two clusters on moon data using DBSCAN
Two clusters on circle data using DBSCAN