Chapter 2: Data Manipulation and Analysis
Data Manipulation and Analysis
• Data collection
• Data preprocessing
    Why data preprocessing?
    Preprocessing:
        Reading data
        Selecting/filtering data
    Data cleaning:
        Filtering missing values
        Dropping/replacing missing values
    Data integration
    Data transformation:
        Manipulating data
    Data reduction
    Data discretization and concept hierarchy generation
• Exploratory Data Analysis (EDA)
• Introduction to Pandas and NumPy
Data collection:
Data collection is a systematic process of gathering observations or measurements. Whether you are conducting research for business, governmental, or academic purposes, data collection allows you to gain first-hand knowledge and original insights into your research problem.
Steps in Data collection:
1. Define the Aim of Your Research
2. Choose Your Data Collection Method
3. Plan Your Data Collection Procedures
4. Collect the Data
1. Define the Aim of Your Research
Clarify your research objectives.
Write a problem statement and formulate research questions.
Decide on the data type: quantitative (numeric), qualitative (expressed in words), or a mixed approach.
2. Choose Your Data Collection Method:
Based on the data you want to collect, decide on the most appropriate method:
Surveys and Questionnaires: Gather information through structured questions.
Observations: Observe and record behaviors, events, or phenomena.
Interviews: Conduct one-on-one or group interviews to gather in-depth insights.
Existing Data: Use data that already exists (e.g., historical records, databases).
Experiments: Manipulate variables to observe their effects.
Case Studies: Investigate a specific individual, group, or situation.
Sampling: Collect data from a subset of the population.
Sensor Data: Use sensors or devices to collect real-time data.
Social Media Data: Analyze content from social platforms.
Field Notes: Record observations during fieldwork.
Diaries or Journals: Collect self-reported data over time.
3. Plan Your Data Collection Procedures:
Develop a detailed plan for data collection:
Sampling Strategy: Decide how to select participants or cases.
Data Collection Tools: Prepare surveys, interview guides, or observation protocols.
Data Recording: Specify how you’ll record data (e.g., paper forms, digital tools).
Ethical Considerations: Ensure informed consent and protect participants’ privacy.
Pilot Testing: Test your data collection procedures before full implementation.
4. Collect the Data:
Execute your plan, following the established procedures.
Be consistent, accurate, and thorough in recording observations or measurements.
Address any unexpected challenges during data collection.
Why Data Preprocessing?
Quality decisions must be based on quality data.
Data in the real world is dirty:
Incomplete: lacking attribute values that are vital for decision making (these have to be added); lacking certain attributes of interest in certain dimensions (which should likewise be added with the required values); or containing only aggregate data (so the primary source of the aggregation should be included).
Noisy: containing errors or outliers that deviate from the expected values.
Inconsistent: containing discrepancies in the codes or names of the organization or domain.
Incomplete, noisy, and inconsistent data are commonplace properties of large real-world databases and data sources.
Data cleaning routines work to resolve such problems so that results can be accepted.
Before starting data preprocessing, it is advisable to get an overall picture of the data, giving a high-level summary such as:
The general properties of the data
Which data values should be considered noise or outliers
This can be done with the help of exploratory data analysis.
Exploratory Data Analysis (EDA) or Descriptive Data Summarization
A descriptive summary of the data can be generated with the help of:
measures of central tendency of the data,
measures of dispersion of the data, and
their graphical display.
Measures of central tendency include the mean, median, and mode.
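As a quick illustration, the three measures can be computed with Python's built-in `statistics` module (the price values below are made up for the example):

```python
import statistics

# Hypothetical unit-price sample (illustrative values only)
prices = [4, 8, 15, 21, 21, 24, 25, 28, 30, 34]

mean = statistics.mean(prices)      # arithmetic average
median = statistics.median(prices)  # middle value of the sorted data
mode = statistics.mode(prices)      # most frequent value

print(mean, median, mode)
```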
Measures of dispersion include:
Range, quartiles, and interquartile range (IQR)
The five-number summary (based on quartiles): minimum, Q1, median (Q2), Q3, and maximum
Variance and standard deviation
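A minimal sketch of these dispersion measures, again using the standard-library `statistics` module on made-up prices (note that `statistics.quantiles` uses the exclusive method by default, so other tools may report slightly different quartiles):

```python
import statistics

prices = [4, 8, 15, 21, 21, 24, 25, 28, 30, 34]  # hypothetical, sorted

data_range = max(prices) - min(prices)
q1, q2, q3 = statistics.quantiles(prices, n=4)  # quartile cut points
iqr = q3 - q1                                   # interquartile range

five_number_summary = (min(prices), q1, q2, q3, max(prices))

variance = statistics.variance(prices)  # sample variance
stdev = statistics.stdev(prices)        # sample standard deviation

print(five_number_summary, iqr)
```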
Graphical Methods
The Boxplot
Can be plotted based on the five-number summary
Is a useful tool for identifying outliers
Is also one of the most popular ways of visualizing a distribution
The ends of the box are the quartiles Q1 and Q3, so the length of the box is the IQR.
The median is marked by a line within the box.
Two lines (called whiskers) outside the box extend toward the smallest (minimum) and largest (maximum) observations.
The whiskers extend all the way to these extreme low and high values only if the values lie within 1.5 × IQR of the quartiles. Otherwise, the whiskers terminate at the most extreme observations occurring within 1.5 × IQR of the quartiles.
The remaining observations are plotted individually to show outliers.
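The whisker/outlier rule can be sketched directly in code. This example uses made-up data with one deliberately extreme value; it computes the 1.5 × IQR fences and separates the outliers from the points the whiskers would cover:

```python
import statistics

# Hypothetical prices with one deliberately extreme value (90)
prices = [4, 8, 15, 21, 21, 24, 25, 28, 30, 90]

q1, _, q3 = statistics.quantiles(sorted(prices), n=4)
iqr = q3 - q1

# Fences: 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers terminate at the most extreme observations within the fences
inside = [p for p in prices if lower_fence <= p <= upper_fence]
outliers = [p for p in prices if p < lower_fence or p > upper_fence]

print((min(inside), max(inside)), outliers)
```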
(Figure: Boxplot of the unit price data for items sold at four branches.)
Other graphical methods
Pie charts
Bar charts
Histograms
Quantile plots
q-q plots
Scatter plots
etc.
Major Tasks in Data Preprocessing
Data preprocessing in data analytics refers to the processing of the various data elements to prepare them for the analytics operation.
Any activity performed prior to mining the data to extract knowledge from it is called data preprocessing.
This involves:
Data cleaning
Data integration
Data transformation
Data reduction
Data Discretization and concept hierarchy generation
Data Cleaning: Refers to the process of
filling in missing values,
smoothing noisy data,
identifying or removing outliers, and resolving inconsistencies.
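For the missing-value part, pandas (introduced later in this chapter) provides `isna`, `dropna`, and `fillna`; the tiny DataFrame below is invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical records with two missing prices
df = pd.DataFrame({
    "branch": ["A", "B", "C", "D"],
    "price": [10.0, np.nan, 15.0, np.nan],
})

missing = df["price"].isna().sum()                  # count missing values
dropped = df.dropna()                               # option 1: drop incomplete rows
filled = df.fillna({"price": df["price"].mean()})   # option 2: fill with the mean

print(missing, len(dropped), filled["price"].tolist())
```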
Data integration:
Combines data from multiple sources (databases, data
cubes, or files) into a coherent store
There are a number of issues to consider during data
integration. Some of these are:
Schema integration
Entity identification
Data value conflicts
Redundancy avoidance
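A minimal sketch of combining two sources on a shared key with `pandas.merge`; both tables and the `cust_id` key are invented for the example, and joining on a shared key is one practical way the entity identification issue is handled:

```python
import pandas as pd

# Two hypothetical sources describing the same entities
customers = pd.DataFrame({"cust_id": [1, 2, 3],
                          "name": ["Ann", "Ben", "Cara"]})
orders = pd.DataFrame({"cust_id": [1, 1, 3],
                       "amount": [20, 35, 50]})

# Integrate: inner join on the shared key
merged = pd.merge(customers, orders, on="cust_id", how="inner")

print(merged)
```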
Data Transformation
• Data transformation is the process of transforming or consolidating data into a form appropriate for mining, one that is more suitable for measuring similarity and distance.
This involves:
Smoothing
Aggregation
Generalization
Normalization
Attribute/feature construction
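Normalization, one of the transformations listed above, can be sketched in a few lines; the values are made up, and two common schemes (min-max and z-score scaling) are shown:

```python
import pandas as pd

prices = pd.Series([4.0, 8.0, 15.0, 21.0, 34.0])  # hypothetical values

# Min-max normalization: rescale into [0, 1]
minmax = (prices - prices.min()) / (prices.max() - prices.min())

# Z-score normalization: zero mean, unit (sample) standard deviation
zscore = (prices - prices.mean()) / prices.std()

print(minmax.tolist())
```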
Data reduction
Data sources may store terabytes of data
Complex data analysis/mining may take a very long time to
run on the complete dataset
Data reduction tries to obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same, or even better) analytical results.
Data Discretization and concept hierarchy generation
Data discretization refers to transforming a data set, which is usually continuous, into discrete interval values.
Concept hierarchy generation refers to generating concept levels so that a data mining function can be applied at a specific concept level.
Discretization can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals.
Interval labels can then be used to replace actual data values.
This leads to a concise, easy-to-use, knowledge-level representation of mining results.
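Interval-based discretization like this can be sketched with `pandas.cut`; the bin edges and labels below are arbitrary choices for the example:

```python
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 24, 28, 34, 45])  # hypothetical values

# Replace continuous values with interval labels
labels = pd.cut(prices, bins=[0, 10, 30, 50],
                labels=["low", "medium", "high"])

print(labels.tolist())
```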
Introduction to Numpy and Pandas
NumPy
Is a fundamental package for scientific computing with Python.
It provides support for large, multi-dimensional arrays and
matrices, along with a collection of high-level mathematical
functions to operate on these arrays.
Some key features of NumPy include:
Multi-dimensional array objects (ndarray)
Mathematical functions for fast operations on arrays
Tools for reading and writing array data to disk
Linear algebra and random number generation capabilities
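A short sketch of these NumPy features (the array values are chosen arbitrarily):

```python
import numpy as np

# Multi-dimensional array object (ndarray)
a = np.array([[1, 2, 3],
              [4, 5, 6]])

# Fast, vectorized mathematical operations
doubled = a * 2
col_sums = a.sum(axis=0)

# Random number generation
rng = np.random.default_rng(seed=0)
sample = rng.normal(size=5)

print(a.shape, col_sums, len(sample))
```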
Pandas
Is a powerful and easy-to-use open-source data analysis and
manipulation tool built on top of the Python programming
language.
It offers data structures and data analysis tools that are ideal for
working with structured data.
Key features of pandas include:
DataFrame object for data manipulation with integrated indexing
Tools for reading and writing data between in-memory data
structures and different file formats
Data alignment and handling of missing data
Reshaping and pivoting of data sets
Time series functionality
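A small sketch of these pandas features (the branch/units table is invented for illustration):

```python
import pandas as pd

# DataFrame with integrated, labeled indexing
df = pd.DataFrame({"branch": ["A", "B", "C"],
                   "units": [10, 25, 7]},
                  index=["r1", "r2", "r3"])

row = df.loc["r2"]            # label-based selection
busy = df[df["units"] > 8]    # boolean filtering
total = df["units"].sum()     # aggregation

print(total, list(busy["branch"]))
```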