School of Computer Science and Electronics Engineering, University of Essex
Lecture 3: Data Exploration: Summarising, presenting and compressing data
CE880: An Approachable Introduction to Data Science
Haider Raza
Tuesday, 31 Jan 2023
About Myself
- Name: Haider Raza
- Position: Senior Lecturer in Artificial Intelligence
- Research interests: AI, Machine Learning, Data Science
- Contact: [email protected]
- Academic Support Hours: 1-2 PM on Friday via Zoom; the Zoom link is available on Moodle
- Website: www.sagihaider.com
Common file formats in Data Science
Image source: https://www.weirdgeek.com/
Reading Zipped Files
Zip files are a gift from the coding gods. It is like they have fallen from heaven to save
our storage space and time. Old school programmers and computer users will certainly
relate to how we used to copy gigantic installation files in Zip format.
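As a sketch (assuming pandas is installed), a zipped CSV can be read without unpacking it to disk; here the archive is built in memory so the example is self-contained:

```python
import io
import zipfile

import pandas as pd

# Build a small zip archive in memory, then read a CSV straight out of it.
# (For a .zip on disk that holds a single CSV, pandas can also open it
# directly via pd.read_csv("archive.zip", compression="zip").)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data.csv", "a,b\n1,2\n3,4\n")

buf.seek(0)
with zipfile.ZipFile(buf) as zf:
    with zf.open("data.csv") as f:
        df = pd.read_csv(f)

print(df.shape)  # (2, 2)
```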
Source: analyticsvidhya
Reading Text Files
Text files are one of the most common file formats to store data. Python makes it
very easy to read data from text files. Python provides the ‘open()‘ function, which
takes the file path and the file access mode as its parameters. For reading a text file,
the file access mode is ‘r‘. The other access modes are listed below:
- ‘w‘ – write to a file
- ‘r+‘ or ‘w+‘ – read and write to a file
- ‘a‘ – append to an already existing file
- ‘a+‘ – append to a file after reading
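The access modes above can be sketched with a small round trip on a hypothetical file name:

```python
# 'w' – write mode creates the file (or overwrites an existing one)
with open("notes.txt", "w") as f:
    f.write("first line\n")

# 'a' – append mode adds to the end of the existing file
with open("notes.txt", "a") as f:
    f.write("second line\n")

# 'r' – read mode (the default) reads the contents back
with open("notes.txt", "r") as f:
    lines = f.readlines()

print(lines)  # ['first line\n', 'second line\n']
```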
Reading CSV Files
A CSV (Comma Separated Values) file is the most common type of file that a data
scientist will ever work with. These files use a comma ‘,‘ as the delimiter to separate
the values, and each row in a CSV file is a data record.
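A minimal sketch (assuming pandas), with an in-memory buffer standing in for a file path:

```python
import io

import pandas as pd

# Two data records under a header row; read_csv accepts a path or any
# file-like object, so StringIO keeps the example self-contained.
csv_text = "name,score\nalice,90\nbob,85\n"
df = pd.read_csv(io.StringIO(csv_text))

print(len(df))                 # 2 data records
print(df.columns.tolist())     # ['name', 'score']
```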
Reading CSV Files . . .
But CSV files can run into problems if the values themselves contain commas. This can
be overcome by using a different delimiter to separate the fields, such as a tab or ‘;‘.
These files can also be imported with the ‘read_csv()‘ function by specifying the
delimiter in the ‘sep‘ parameter, for example when reading a TSV (Tab Separated
Values) file:
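As a sketch, passing ‘sep="\t"‘ to ‘read_csv()‘ parses tab-separated data; an in-memory buffer again stands in for a file:

```python
import io

import pandas as pd

# Same data as before, but tab-separated: override the default delimiter.
tsv_text = "name\tscore\nalice\t90\nbob\t85\n"
df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(df.columns.tolist())   # ['name', 'score']
print(df.shape)              # (2, 2)
```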
Reading Excel Files
Pandas has a very handy function called ‘read_excel()‘ to read Excel files. We can
easily read data from any sheet we wish by providing its name in the ‘sheet_name‘
parameter of the ‘read_excel()‘ function.
Importing Data from a Database
Data in databases is stored in tables, and the systems that manage them are known as
relational database management systems (RDBMS). However, connecting to an
RDBMS and retrieving data from it can prove to be quite a challenging task. You will
need to import the ‘sqlite3‘ module to use SQLite.
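A minimal sketch using the standard-library ‘sqlite3‘ module with a hypothetical in-memory database:

```python
import sqlite3

# ":memory:" creates a throwaway database; a file path would persist it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, score REAL)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [("alice", 90.0), ("bob", 85.0)],
)

rows = conn.execute("SELECT name, score FROM students").fetchall()
conn.close()

print(rows)  # [('alice', 90.0), ('bob', 85.0)]
```

pandas can also pull a query result straight into a DataFrame with ‘pd.read_sql_query(sql, conn)‘.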
Reading JSON Files
JSON (JavaScript Object Notation) is a lightweight, human-readable format for storing
and exchanging data. It is easy for machines to parse and generate, and its syntax is
based on the JavaScript programming language. JSON files store data within { },
similar to how a dictionary stores it in Python.
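A minimal sketch with the standard-library ‘json‘ module, using a made-up record:

```python
import json

# JSON text maps naturally onto a Python dict.
text = '{"name": "alice", "scores": [90, 85]}'
record = json.loads(text)

print(record["name"])        # alice
print(record["scores"][1])   # 85

# json.load()/json.dump() do the same thing against file objects,
# and pandas offers read_json() for tabular JSON.
```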
Reading Data from Pickle
Pickle files are used to store the serialized form of Python objects. This means objects
such as list, set, tuple, dict, etc. are converted to a byte stream before being stored
on disk, which allows you to continue working with the same objects later on. They
are particularly useful when you have trained a machine learning model and want to
save it to make predictions later on.
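A round-trip sketch; the dict below stands in for a trained model, since any Python object is pickled the same way:

```python
import pickle

# Hypothetical "model": a plain dict of parameters.
model = {"weights": [0.1, 0.2], "bias": 0.5}

blob = pickle.dumps(model)      # serialise the object to a byte stream
restored = pickle.loads(blob)   # deserialise it back into an equal object

print(restored == model)  # True
# pickle.dump(model, f) / pickle.load(f) do the same against a file
# opened in binary mode ("wb" / "rb").
```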
Reading HTML using Python
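The code for this slide was shown as an image. As a standard-library sketch (in practice pandas' ‘read_html()‘, which needs lxml or beautifulsoup4 installed, is the usual tool for pulling tables out of web pages), ‘html.parser‘ can extract table cells from an HTML string:

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the text content of every <td>/<th> cell."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []
    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self.in_cell = True
    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False
    def handle_data(self, data):
        if self.in_cell:
            self.cells.append(data.strip())

html_doc = ("<table><tr><th>name</th><th>score</th></tr>"
            "<tr><td>alice</td><td>42</td></tr></table>")
parser = TableParser()
parser.feed(html_doc)

print(parser.cells)  # ['name', 'score', 'alice', '42']
```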
Uploading Data to Colab
There are three different ways of uploading data to Colab:
- Manually locate the file
- Mounting Google Drive
- Using Git or an API
Manually locate the file
Mounting Google Drive
Using Git or API
Types of Analytics in Data Science
- Descriptive Analytics tells us what happened in the past and helps a business
understand how it is performing by providing context that helps stakeholders
interpret information. Examples: year-over-year pricing changes,
month-over-month sales growth, or total revenue per subscriber.
- Diagnostic Analytics takes descriptive data a step further and helps you
understand why something happened in the past. Examples: examining market
demand, explaining customer behavior, identifying technology issues.
- Predictive Analytics predicts what is most likely to happen in the future and
provides companies with actionable insights based on that information.
Examples: forecasting future cash flow, early detection of disease.
- Prescriptive Analytics provides recommendations for actions that will take
advantage of the predictions and guides the possible actions toward a solution.
Examples: investment decisions, fraud detection, algorithmic recommendations
(Instagram, TikTok).
Central Tendency
- Mean: The average of the dataset.
- Median: The middle value of an ordered dataset.
- Mode: The most frequent value in the dataset. If multiple values occur equally
most often, the distribution is multimodal.
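All three measures are available in the standard-library ‘statistics‘ module; the data here are made up:

```python
import statistics

data = [2, 3, 3, 5, 7, 10]

print(statistics.mean(data))     # 5
print(statistics.median(data))   # 4.0 (average of the two middle values)
print(statistics.mode(data))     # 3  (appears twice)
```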
Central Tendency
- Skewness: A measure of the asymmetry of a distribution.
- Kurtosis: A measure of whether the data are heavy-tailed or light-tailed relative
to a normal distribution.
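pandas exposes both measures directly on a Series (‘skew()‘ and ‘kurt()‘, the latter reporting excess kurtosis); a sketch on made-up data:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])    # perfectly symmetric
print(s.skew())                    # 0.0 – no asymmetry
print(s.kurt())                    # -1.2 – light-tailed vs. a normal

t = pd.Series([1, 1, 1, 2, 10])    # long right tail
print(t.skew() > 0)                # True – right-skewed
```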
Variability
Range: The difference between the highest and lowest values in the dataset.
Percentiles, Quartiles and Interquartile Range (IQR)
- Percentiles: Values that indicate the value below which a given percentage of
observations in a group of observations falls.
- Quartiles: Values that divide the data points into four more or less equal parts,
or quarters.
- Interquartile Range (IQR): A measure of statistical dispersion and variability
based on dividing a dataset into quartiles: IQR = Q3 − Q1.
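These can be computed with NumPy's ‘percentile()‘ function; a sketch on made-up data:

```python
import numpy as np

data = np.array([1, 3, 5, 7, 9, 11, 13, 15])

q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
data_range = data.max() - data.min()

print(q1, q2, q3)    # 4.5 8.0 11.5
print(iqr)           # 7.0
print(data_range)    # 14
```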
Variability
Variance: The average squared difference of the values from the mean; it measures
how spread out a set of data is relative to the mean.
Standard Deviation: The square root of the variance; it measures the typical
difference between each data point and the mean.
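A sketch with the standard-library ‘statistics‘ module (the ‘p‘-prefixed functions use the population formulas; ‘variance()‘/‘stdev()‘ are the sample versions):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean is 5

var = statistics.pvariance(data)  # population variance
std = statistics.pstdev(data)     # population standard deviation

print(var)   # 4
print(std)   # 2.0 (the square root of the variance)
```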
Relationship Between Variables
Causality: A relationship between two events where one event is affected by the other.
Covariance: A quantitative measure of the joint variability between two or more
variables.
Correlation: Measures the relationship between two variables and ranges from -1 to 1;
it is the normalized version of covariance.
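A sketch with NumPy: for a perfect linear relationship the correlation is 1, while the covariance depends on the scale of the variables:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + 1                      # perfect linear relationship

cov = np.cov(x, y)[0, 1]           # sample covariance
corr = np.corrcoef(x, y)[0, 1]     # Pearson correlation (normalized)

print(cov)              # 5.0 – scale-dependent
print(round(corr, 6))   # 1.0 – perfect positive correlation
```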
Hypothesis Testing and Statistical Significance
Null Hypothesis: A general statement that there is no relationship between two
measured phenomena or no association among groups.
Alternative Hypothesis: The statement contrary to the null hypothesis.
In statistical hypothesis testing, a type I error is the rejection of a true null hypothesis,
while a type II error is the non-rejection of a false null hypothesis.
Example hypothesis: “Students who eat breakfast will perform better on a math exam
than students who do not eat breakfast.”
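As a pure-Python sketch of the breakfast example (the exam scores below are made up), we can compute Welch's t statistic for two independent samples; in practice ‘scipy.stats.ttest_ind(a, b, equal_var=False)‘ returns this statistic together with the p-value used to decide whether to reject the null hypothesis:

```python
import math
import statistics

# Hypothetical exam scores for the two groups (made-up data).
breakfast    = [78, 85, 90, 72, 88, 84]
no_breakfast = [70, 75, 80, 68, 74, 77]

def welch_t(a, b):
    """Welch's t statistic: difference in means over its standard error."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

t = welch_t(breakfast, no_breakfast)
print(t > 0)   # True – the breakfast group scored higher on average
```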
Clustering
We would like to partition our data into different groups.
Code: Generate the half-moon data
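The code on this slide was shown as an image; a minimal sketch using scikit-learn's ‘make_moons‘ (an assumption about what the original code used) is:

```python
from sklearn.datasets import make_moons

# 1500 points in two interleaving half-moons; `noise` jitters the points.
X, y = make_moons(n_samples=1500, noise=0.05, random_state=42)

print(X.shape)                   # (1500, 2) – 2-D coordinates
print(sorted(set(y.tolist())))   # [0, 1]    – one label per half-moon
```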
Half-Moon Data with 1500 points
Code: Generate the circle data
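Again the slide's code was an image; the analogous scikit-learn sketch uses ‘make_circles‘ (assumed to match the original):

```python
from sklearn.datasets import make_circles

# 1500 points in two concentric circles; `factor` sets the radius ratio
# between the inner and outer circle.
X, y = make_circles(n_samples=1500, noise=0.05, factor=0.5, random_state=42)

print(X.shape)                   # (1500, 2)
print(sorted(set(y.tolist())))   # [0, 1] – inner vs. outer circle
```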
Circle Data with 1500 points
K-means Algorithm
Possibly the most popular algorithm for clustering
k-Means clustering aims to partition n observations into k clusters in which each
observation belongs to the cluster with the nearest mean, serving as a prototype of the
cluster.
- Initialise with ‘n_clusters‘ random “centroids”
- Iterate over two steps:
  - Assign each point to the centroid it is closest to, using Euclidean
distance
  - Create new centroids by taking the average of the assigned points in
each dimension
- Repeat until the assignments stop changing
- The algorithm is unstable: different starting positions can result in different
clusters
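The loop above is what scikit-learn's ‘KMeans‘ implements; a sketch on two made-up, well-separated blobs:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs, so k-means with k=2 recovers them cleanly.
rng = np.random.default_rng(0)
blob_a = rng.normal((0, 0), 0.3, (100, 2))
blob_b = rng.normal((5, 5), 0.3, (100, 2))
X = np.vstack([blob_a, blob_b])

# n_init=10 restarts from 10 random initialisations and keeps the best,
# mitigating the instability noted above.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_

print(len(set(labels.tolist())))        # 2 clusters found
print(len(set(labels[:100].tolist())))  # 1 – the first blob shares a label
```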
Metrics for clustering
Completeness: a clustering satisfies completeness if all data points that are members
of a given class are assigned to the same cluster.
Metrics for clustering
Silhouette Coefficient:
- +1 indicates that the sample is far away from the neighboring clusters
- 0 indicates that the sample is on or very close to the decision boundary between
two neighboring clusters
- Negative values indicate that those samples might have been assigned to the
wrong cluster
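scikit-learn computes this as ‘silhouette_score‘; a sketch on made-up data where the labels match two tight, well-separated blobs, so the score should be close to +1:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two tight, well-separated blobs with the correct labels.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal((0, 0), 0.2, (50, 2)),
    rng.normal((6, 6), 0.2, (50, 2)),
])
labels = np.array([0] * 50 + [1] * 50)

score = silhouette_score(X, labels)
print(score > 0.9)   # True – near-perfect clustering
```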
Let's run it for clustering moon data
Two clusters on moon data
Let's run it for clustering circle data
Two clusters on Circle data
Disadvantage of k-means clustering
- It is difficult to choose the value of k in advance
- Use an elbow plot to select the best value of k
- Different initial partitions can result in different final clusters
Density-based spatial clustering of applications with noise (DBSCAN)
The DBSCAN algorithm is used to find associations and structures in data that are
hard to find manually but that can be relevant and useful for finding patterns and
predicting trends.
It depends on two parameters:
- eps: the maximum distance between two points for them to be considered
neighbors. If the distance between two points is lower than or equal to this
value (eps), the points are considered neighbors.
- minPoints: the minimum number of points required to form a dense region. For
example, if we set the minPoints parameter to 5, then we need at least 5 points
to form a dense region.
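In scikit-learn these parameters are ‘eps‘ and ‘min_samples‘; as a sketch on the half-moon data (the parameter values below are illustrative choices), DBSCAN recovers each moon as one cluster where k-means would split them with a straight boundary:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=1500, noise=0.05, random_state=42)

# eps and min_samples map onto the eps and minPoints parameters above.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# DBSCAN labels noise points -1, so exclude them when counting clusters.
n_clusters = len(set(db.labels_.tolist()) - {-1})
print(n_clusters)   # 2 – one cluster per half-moon
```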
DBSCAN: Advantages and disadvantages
Advantages:
- Can discover arbitrarily shaped clusters
- Can find a cluster completely surrounded by a different cluster
Disadvantages:
- Datasets with varying densities are tricky
- Sensitive to its two parameters
Let’s run it for clustering moon data with DBSCAN
Two clusters on moon data using DBSCAN
Two clusters on circle data using DBSCAN