Chapter 2: Data Manipulation and Analysis
Data Manipulation and Analysis
• Data collection
• Data preprocessing
    Why data preprocessing?
    Preprocessing:
        Reading data
        Selecting/filtering data
    Data cleaning:
        Filtering missing values
        Dropping/replacing missing values
    Data integration
    Data transformation:
        Manipulating data
    Data reduction
    Data discretization and concept hierarchy generation
• Exploratory Data Analysis (EDA)
• Introduction to Pandas and NumPy
Data collection:
Data collection is a systematic process of gathering observations or measurements. Whether you are conducting research for business, governmental, or academic purposes, data collection allows you to gain first-hand knowledge and original insights into your research problem.
Steps in Data collection:
1. Define the Aim of Your Research
2. Choose Your Data Collection Method
3. Plan Your Data Collection Procedures
4. Collect the Data
1. Define the Aim of Your Research
Clarify your research objectives.
Write a problem statement and formulate research questions.
Decide on the data type: quantitative (numeric), qualitative (expressed in words), or a mixed approach.
2. Choose Your Data Collection Method:
Based on the data you want to collect, decide on the most appropriate method:
Surveys and Questionnaires: Gather information through structured questions.
Observations: Observe and record behaviors, events, or phenomena.
Interviews: Conduct one-on-one or group interviews to gather in-depth insights.
Existing Data: Use data that already exists (e.g., historical records, databases).
Experiments: Manipulate variables to observe their effects.
Case Studies: Investigate a specific individual, group, or situation.
Sampling: Collect data from a subset of the population.
Sensor Data: Use sensors or devices to collect real-time data.
Social Media Data: Analyze content from social platforms.
Field Notes: Record observations during fieldwork.
Diaries or Journals: Collect self-reported data over time.
3. Plan Your Data Collection Procedures:
Develop a detailed plan for data collection:
Sampling Strategy: Decide how to select participants or cases.
Data Collection Tools: Prepare surveys, interview guides, or observation protocols.
Data Recording: Specify how you’ll record data (e.g., paper forms, digital tools).
Ethical Considerations: Ensure informed consent and protect participants’ privacy.
Pilot Testing: Test your data collection procedures before full implementation.
4. Collect the Data:
Execute your plan, following the established procedures.
Be consistent, accurate, and thorough in recording observations or measurements.
Address any unexpected challenges during data collection.
Why Data Preprocessing?
Quality decisions must be based on quality data.
Data in the real world is dirty:
Incomplete: lacking attribute values that are vital for decision making (these have to be added); lacking certain attributes of interest in certain dimensions (which should likewise be added with the required values); or containing only aggregate data (so the primary source of the aggregation should be included).
Noisy: containing errors or outliers that deviate from the expected values.
Inconsistent: containing discrepancies in the codes or names of the organization or domain.
Incomplete, noisy, and inconsistent data are commonplace properties of large real-world databases and data sources.
Data cleaning routines work to resolve such problems so that results can be accepted.
Before starting data preprocessing, it is advisable to get an overall picture of the data, giving a high-level summary such as:
The general properties of the data
Which data values should be considered noise or outliers
This can be done with the help of exploratory data analysis.
Exploratory Data Analysis (EDA) or Descriptive Data Summarization
A descriptive summary of the data can be generated with the help of:
measures of central tendency of the data,
measures of dispersion of the data, and
their graphical display.
Measures of central tendency include the mean, median, and mode.
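As a quick illustration, the three measures can be computed with Python's built-in `statistics` module (the price values below are made up for the example):

```python
import statistics

# Hypothetical unit-price sample (illustrative values only)
prices = [4, 8, 15, 21, 21, 24, 25, 28, 30, 34]

mean = statistics.mean(prices)      # arithmetic average
median = statistics.median(prices)  # middle value of the sorted data
mode = statistics.mode(prices)      # most frequent value

print(mean, median, mode)
```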
Measures of dispersion include:
Range, quartiles, and interquartile range (IQR)
The five-number summary (based on quartiles): minimum, Q1, median (Q2), Q3, and maximum
Variance and standard deviation
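A minimal sketch of these dispersion measures, again using the standard-library `statistics` module on made-up prices (note that `statistics.quantiles` uses the exclusive method by default, so other tools may report slightly different quartiles):

```python
import statistics

prices = [4, 8, 15, 21, 21, 24, 25, 28, 30, 34]  # hypothetical, sorted

data_range = max(prices) - min(prices)
q1, q2, q3 = statistics.quantiles(prices, n=4)  # quartile cut points
iqr = q3 - q1                                   # interquartile range

five_number_summary = (min(prices), q1, q2, q3, max(prices))

variance = statistics.variance(prices)  # sample variance
stdev = statistics.stdev(prices)        # sample standard deviation

print(five_number_summary, iqr)
```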
Graphical Methods
The Boxplot
Can be plotted based on the five-number summary
Is a useful tool for identifying outliers
Is also one of the most popular ways of visualizing a distribution
The ends of the box are the quartiles Q1 and Q3, so the length of the box is the IQR.
The median is marked by a line within the box.
Two lines (called whiskers) outside the box extend toward the smallest (minimum) and largest (maximum) observations.
The whiskers extend all the way to these extreme low and high values only if the values lie within 1.5 × IQR of the quartiles. Otherwise, the whiskers terminate at the most extreme observations occurring within 1.5 × IQR of the quartiles.
The remaining observations are plotted individually to show outliers.
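The whisker/outlier rule can be sketched directly in code. This example uses made-up data with one deliberately extreme value; it computes the 1.5 × IQR fences and separates the outliers from the points the whiskers would cover:

```python
import statistics

# Hypothetical prices with one deliberately extreme value (90)
prices = [4, 8, 15, 21, 21, 24, 25, 28, 30, 90]

q1, _, q3 = statistics.quantiles(sorted(prices), n=4)
iqr = q3 - q1

# Fences: 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers terminate at the most extreme observations within the fences
inside = [p for p in prices if lower_fence <= p <= upper_fence]
outliers = [p for p in prices if p < lower_fence or p > upper_fence]

print((min(inside), max(inside)), outliers)
```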
(Figure: Boxplot of the unit price data for items sold at four branches.)
Other graphical methods
Pie charts
Bar charts
Histograms
Quantile plots
q-q plots
Scatter plots
etc.
Major Tasks in Data Preprocessing
Data preprocessing in data analytics refers to the processing of the various data elements to prepare them for the analytics operation.
Any activity performed prior to mining the data to extract knowledge from it is called data preprocessing.
This involves:
Data cleaning
Data integration
Data transformation
Data reduction
Data Discretization and concept hierarchy generation
Data Cleaning: Refers to the process of
filling in missing values,
smoothing noisy data,
identifying or removing outliers, and resolving inconsistencies.
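For the missing-value part, pandas (introduced later in this chapter) provides `isna`, `dropna`, and `fillna`; the tiny DataFrame below is invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical records with two missing prices
df = pd.DataFrame({
    "branch": ["A", "B", "C", "D"],
    "price": [10.0, np.nan, 15.0, np.nan],
})

missing = df["price"].isna().sum()                  # count missing values
dropped = df.dropna()                               # option 1: drop incomplete rows
filled = df.fillna({"price": df["price"].mean()})   # option 2: fill with the mean

print(missing, len(dropped), filled["price"].tolist())
```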
Data integration:
Combines data from multiple sources (databases, data
cubes, or files) into a coherent store
There are a number of issues to consider during data
integration. Some of these are:
Schema integration
Entity identification
Data value conflicts
Redundancy avoidance
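A minimal sketch of combining two sources on a shared key with `pandas.merge`; both tables and the `cust_id` key are invented for the example, and joining on a shared key is one practical way the entity identification issue is handled:

```python
import pandas as pd

# Two hypothetical sources describing the same entities
customers = pd.DataFrame({"cust_id": [1, 2, 3],
                          "name": ["Ann", "Ben", "Cara"]})
orders = pd.DataFrame({"cust_id": [1, 1, 3],
                       "amount": [20, 35, 50]})

# Integrate: inner join on the shared key
merged = pd.merge(customers, orders, on="cust_id", how="inner")

print(merged)
```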
Data Transformation
• Data transformation is the process of transforming or consolidating data into a form appropriate for mining, one that is more suitable for measuring similarity and distance.
This involves:
Smoothing
Aggregation
Generalization
Normalization
Attribute/feature construction
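Normalization, one of the transformations listed above, can be sketched in a few lines; the values are made up, and two common schemes (min-max and z-score scaling) are shown:

```python
import pandas as pd

prices = pd.Series([4.0, 8.0, 15.0, 21.0, 34.0])  # hypothetical values

# Min-max normalization: rescale into [0, 1]
minmax = (prices - prices.min()) / (prices.max() - prices.min())

# Z-score normalization: zero mean, unit (sample) standard deviation
zscore = (prices - prices.mean()) / prices.std()

print(minmax.tolist())
```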
Data reduction
Data sources may store terabytes of data
Complex data analysis/mining may take a very long time to
run on the complete dataset
Data reduction tries to obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same, or even better) analytical results.
Data Discretization and concept hierarchy generation
Data discretization refers to transforming a data set, which is usually continuous, into discrete interval values.
Concept hierarchy generation refers to generating concept levels so that a data mining function can be applied at a specific concept level.
Discretization can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals.
Interval labels can then be used to replace actual data values.
This leads to a concise, easy-to-use, knowledge-level representation of mining results.
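Interval-based discretization like this can be sketched with `pandas.cut`; the bin edges and labels below are arbitrary choices for the example:

```python
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 24, 28, 34, 45])  # hypothetical values

# Replace continuous values with interval labels
labels = pd.cut(prices, bins=[0, 10, 30, 50],
                labels=["low", "medium", "high"])

print(labels.tolist())
```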
Introduction to Numpy and Pandas
NumPy
Is a fundamental package for scientific computing with Python.
It provides support for large, multi-dimensional arrays and
matrices, along with a collection of high-level mathematical
functions to operate on these arrays.
Some key features of NumPy include:
Multi-dimensional array objects (ndarray)
Mathematical functions for fast operations on arrays
Tools for reading and writing array data to disk
Linear algebra and random number generation capabilities
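A short sketch of these NumPy features (the array values are chosen arbitrarily):

```python
import numpy as np

# Multi-dimensional array object (ndarray)
a = np.array([[1, 2, 3],
              [4, 5, 6]])

# Fast, vectorized mathematical operations
doubled = a * 2
col_sums = a.sum(axis=0)

# Random number generation
rng = np.random.default_rng(seed=0)
sample = rng.normal(size=5)

print(a.shape, col_sums, len(sample))
```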
Pandas
Is a powerful and easy-to-use open-source data analysis and
manipulation tool built on top of the Python programming
language.
It offers data structures and data analysis tools that are ideal for
working with structured data.
Key features of pandas include:
DataFrame object for data manipulation with integrated indexing
Tools for reading and writing data between in-memory data
structures and different file formats
Data alignment and handling of missing data
Reshaping and pivoting of data sets
Time series functionality
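A small sketch of these pandas features (the branch/units table is invented for illustration):

```python
import pandas as pd

# DataFrame with integrated, labeled indexing
df = pd.DataFrame({"branch": ["A", "B", "C"],
                   "units": [10, 25, 7]},
                  index=["r1", "r2", "r3"])

row = df.loc["r2"]            # label-based selection
busy = df[df["units"] > 8]    # boolean filtering
total = df["units"].sum()     # aggregation

print(total, list(busy["branch"]))
```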