Data Prep & Processing Guide

This document discusses the process of data preparation and processing. It involves validating, editing, and cleaning raw data collected through questionnaires. Key steps include checking questionnaires for completeness and logical responses, coding open-ended questions, entering data into a computer file using a codebook for reference, and conducting consistency checks to find errors or outliers. The goal is to translate raw data into a clean and organized format suitable for analysis.

Uploaded by Irfan Zubair


DATA PREPARATION AND PROCESSING

DATA PREPARATION

• Once data are collected, the process of analysis begins.
• First, however, the data have to be translated into an appropriate form.
• This process is known as data preparation.

STEPS IN DATA PREPARATION
• Validate data
• Questionnaire checking
• Edit acceptable questionnaires
• Code the questionnaires
• Keypunch the data
• Clean the data set
• Statistically adjust the data
• Store the data set for analysis
• Analyse data
VALIDATION
• Validity exists when the data actually measure what they are supposed to measure. If they fail to, they are misleading and should not be accepted.
• One of the most serious concerns is errors in survey data.
• When secondary data are involved, they may be outdated or irrelevant.
• With primary data, too, this review is important.
QUESTIONNAIRE CHECKING
• A questionnaire returned from the field may be unacceptable for several reasons.
– Parts of the questionnaire may be incomplete: answers are inadequate, or specific questions have no response.
– The pattern of responses may indicate that the respondent did not understand or follow the instructions.
– The responses show little variance.
– One or more pages are missing.
QUESTIONNAIRE CHECKING
– The questionnaire is answered by someone
who does not qualify for participation.
– Fictitious interviews
– Inconsistencies
– Illegible responses
– Yea- or nay-saying patterns
– Middle-of-the-road patterns
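
Several of these checks can be partly automated once the answers have been keyed in. The following is a minimal Python sketch, assuming 7-point rating items with `None` marking a blank answer. The function name `screen_questionnaire` and both thresholds are hypothetical choices for illustration, and a flat run of high or low ratings is only a rough proxy for yea- or nay-saying.

```python
from statistics import pvariance

def screen_questionnaire(answers, max_missing=0.2, min_variance=0.25):
    """Return a list of reasons a questionnaire looks unacceptable.

    answers: ratings on a 1-7 scale, with None for a blank item.
    The thresholds are illustrative assumptions, not fixed rules.
    """
    problems = []
    given = [a for a in answers if a is not None]

    # Incomplete: too large a share of items left blank.
    if len(answers) - len(given) > max_missing * len(answers):
        problems.append("incomplete")

    if len(given) >= 2 and pvariance(given) < min_variance:
        # Little variance: nearly the same answer to every item.
        problems.append("little variance")
        # Classify the flat pattern by where it sits on the scale.
        avg = sum(given) / len(given)
        if avg >= 6:
            problems.append("yea-saying")
        elif avg <= 2:
            problems.append("nay-saying")
        elif 3.5 <= avg <= 4.5:
            problems.append("middle-of-the-road")
    return problems
```

A flagged questionnaire would still need a human editor's judgement; the sketch only narrows down which ones to look at first.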
EDITING
• The next phase of data preparation involves editing the raw data.
• Three basic approaches:
- Go back to the respondents for clarification
- Infer from other responses
- Discard the response altogether

Treatment of Unsatisfactory Responses
• Return to the field
• Assign missing values
– Substitute a neutral value
– Casewise deletion
– Pairwise deletion
• Discard unsatisfactory respondents
Treatment of Unsatisfactory Responses:
- Returning to the Field – The questionnaires with unsatisfactory responses may be returned to the field, where the interviewers recontact the respondents.
- Assigning Missing Values – If returning the questionnaires to the field is not feasible, the editor may assign missing values to unsatisfactory responses.
- Discarding Unsatisfactory Respondents – In this approach, the respondents with unsatisfactory responses are simply discarded.
CODING
• Data entry refers to the creation of a
computer file that holds the raw data taken
from all of the questionnaires deemed suitable
for analysis
• Coding means assigning a code, usually a
number, to each possible response to each
question. The code includes an indication of
the column position (field) and data record it
will occupy.
CODING

• Fixed field codes, which mean that the number of records for each respondent is the same and the same data appear in the same column(s) for all respondents, are highly desirable.
– If possible, standard codes should be used for missing data. Coding of structured questions is relatively simple, since the response options are predetermined.
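
To illustrate why fixed fields are convenient, here is a short Python sketch that slices one record into named variables. The column layout (a two-digit ID followed by six one-digit answers) and the use of 9 as the missing-data code are assumptions made for this example; they are not taken from the slides.

```python
# Hypothetical layout: (variable name, start column, end column), 1-based inclusive.
LAYOUT = [
    ("id", 1, 2),
    ("preference", 3, 3),
    ("quality", 4, 4),
    ("quantity", 5, 5),
    ("value", 6, 6),
    ("service", 7, 7),
    ("income", 8, 8),
]
MISSING_CODE = 9  # assumed standard code for missing data

def parse_record(line):
    """Slice one fixed-field record into named integer fields."""
    record = {}
    for name, start, end in LAYOUT:
        field = int(line[start - 1:end])
        # Map the missing-data code to None for every variable except the ID.
        is_missing = name != "id" and field == MISSING_CODE
        record[name] = None if is_missing else field
    return record

row = parse_record("01223136")
```

Because every respondent's record has the same layout, one slicing rule serves the whole file. A single missing-data code works here only because 9 lies outside the valid range of every answer.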
CODING
– In questions that permit a large number of responses, each possible response option should be assigned a separate column.
– Guidelines for coding unstructured questions:
  - Category codes should be mutually exclusive and collectively exhaustive.
  - Only a few (10% or less) of the responses should fall into the “other” category.
  - Category codes should be assigned for critical issues even if no one has mentioned them.
– Data should be coded to retain as much detail as possible.
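
Two of these guidelines are easy to check or apply in code. The sketch below, in Python, assumes that 9 is the category code for "other" and uses made-up option names; both helper functions are hypothetical.

```python
from collections import Counter

OTHER = 9  # assumed category code for "other"

def other_share(coded_responses):
    """Fraction of coded responses that fell into the 'other' category."""
    counts = Counter(coded_responses)
    return counts[OTHER] / len(coded_responses)

def to_columns(mentions, options):
    """Code a multiple-response question as one 0/1 column per option."""
    return {option: int(option in mentions) for option in options}

# Ten coded answers to an open-ended question (made-up data).
coded = [1, 2, 2, 3, 1, 9, 2, 3, 1, 2]
share = other_share(coded)

# One respondent who mentioned two of three possible options.
columns = to_columns({"price", "taste"}, ["price", "taste", "service"])
```

If `share` came out well above 0.10, the category scheme would probably need additional codes.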
CODING
• Principles for establishing categories for
coding:
- Convenient number of categories
- Similar responses within categories
- Differences of responses between categories
- Mutually exclusive categories
- Exhaustive categories
- Avoid open-ended class intervals
- Class intervals of the same width
- Midpoints of class intervals
CODE BOOK
• A codebook contains coding instructions and
the necessary information about variables in
the data set. A codebook generally contains
the following information:
- column number
- record number
- variable number
- variable name
- question number
- instructions for coding
CODE BOOK
• Thus, a data codebook identifies all of the variable names and code numbers associated with each possible response to each question that makes up the data set.

Restaurant Preference
ID PREFER. QUALITY QUANTITY VALUE SERVICE INCOME
1 2 2 3 1 3 6
2 6 5 6 5 7 2
3 4 4 3 4 5 3
4 1 2 1 1 2 5
5 7 6 6 5 4 1
6 5 4 4 5 4 3
7 2 2 3 2 3 5
8 3 3 4 2 3 4
9 7 6 7 6 5 2
10 2 3 2 2 2 5
11 2 3 2 1 3 6
12 6 6 6 6 7 2
13 4 4 3 3 4 3
14 1 1 3 1 2 4
15 7 7 5 5 4 2
16 5 5 4 5 5 3
17 2 3 1 2 3 4
18 4 4 3 3 3 3
19 7 5 5 7 5 5
20 3 2 2 3 3 3
A Codebook Excerpt
Column   Variable   Variable     Question   Coding
Number   Number     Name         Number     Instructions
1        1          ID                      1 to 20 as coded
2        2          Preference   1          Input the number circled.
                                            1 = Weak preference
                                            7 = Strong preference
3        3          Quality      2          Input the number circled.
                                            1 = Poor
                                            7 = Excellent
4        4          Quantity     3          Input the number circled.
                                            1 = Poor
                                            7 = Excellent
5        5          Value        4          Input the number circled.
                                            1 = Poor
                                            7 = Excellent
6        6          Service      5          Input the number circled.
                                            1 = Poor
                                            7 = Excellent
7        7          Income       6          Input the number circled.
                                            1 = Less than $20,000
                                            2 = $20,000 to 34,999
                                            3 = $35,000 to 49,999
                                            4 = $50,000 to 74,999
                                            5 = $75,000 to 99,999
                                            6 = $100,000 or more
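
A codebook can also be kept in machine-readable form, so that the same document drives both data entry and reporting. The Python sketch below shows one possible representation of two entries from the excerpt; the dictionary layout and the `decode` helper are assumptions for illustration, not an SPSS feature.

```python
# Two codebook entries transcribed from the excerpt; the dict layout is an assumption.
CODEBOOK = {
    "preference": {
        "column": 2,
        "question": 1,
        "labels": {1: "Weak preference", 7: "Strong preference"},
    },
    "income": {
        "column": 7,
        "question": 6,
        "labels": {
            1: "Less than $20,000",
            2: "$20,000 to 34,999",
            3: "$35,000 to 49,999",
            4: "$50,000 to 74,999",
            5: "$75,000 to 99,999",
            6: "$100,000 or more",
        },
    },
}

def decode(variable, code):
    """Translate a stored code into its label; unlabeled codes are returned as text."""
    labels = CODEBOOK[variable]["labels"]
    return labels.get(code, str(code))

label = decode("income", 6)
```

Only the end points of the preference scale carry labels, so intermediate codes such as 4 are returned unchanged.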
[Figure: SPSS Variable View of the Data of Table]

Keypunch the data / Data transcription
• Transcribing data is the process of transferring the coded data from the questionnaires or coding sheets onto disks or magnetic tapes, or directly into computers, by keypunching.

Keypunch the data / Data transcription
• Raw data
• Transcription by one of:
– CATI / CAPI
– Keypunching via CRT terminal
– Mark sense forms
– Optical scanning
– Computerized sensory analysis
• Verification: correct keypunching errors
• Storage in one of:
– Computer memory
– Disks
– Magnetic tapes
• Transcribed data
Data Cleaning
• Consistency Checks
- Consistency checks identify data that are out of
range, logically inconsistent, or have extreme
values.
- Computer packages like SPSS, SAS, EXCEL and
MINITAB can be programmed to identify out-of-
range values for each variable and print out the
respondent code, variable code, variable name,
record number, column number, and out-of-range
value.
- Extreme values should be closely examined.
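
The out-of-range check described above can be sketched in a few lines of Python. The valid ranges follow the 7-point ratings and six income classes of the restaurant example; the record layout (one dictionary per respondent) and the function name are assumptions.

```python
# Valid code ranges per variable, taken from the restaurant codebook.
VALID_RANGES = {
    "preference": (1, 7),
    "quality": (1, 7),
    "quantity": (1, 7),
    "value": (1, 7),
    "service": (1, 7),
    "income": (1, 6),
}

def out_of_range(records):
    """Return (respondent id, variable, value) for every out-of-range value."""
    errors = []
    for rec in records:
        for variable, (low, high) in VALID_RANGES.items():
            value = rec[variable]
            if not low <= value <= high:
                errors.append((rec["id"], variable, value))
    return errors

records = [
    {"id": 1, "preference": 2, "quality": 2, "quantity": 3,
     "value": 1, "service": 3, "income": 6},
    # Made-up faulty record: quality 0 and income 8 are not valid codes.
    {"id": 2, "preference": 6, "quality": 0, "quantity": 6,
     "value": 5, "service": 7, "income": 8},
]
errors = out_of_range(records)
```

Each reported tuple points the editor back to a specific respondent and variable, which is the same information the statistical packages print.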
Data Cleaning
• Treatment of Missing Responses
• Substitute a Neutral Value – A neutral value, typically the mean response to the variable, is substituted for the missing responses.
• Substitute an Imputed Response – The respondent's pattern of responses to other questions is used to impute or calculate a suitable response to the missing questions.
• In casewise deletion, cases, or respondents, with any missing responses are discarded from the analysis.
• In pairwise deletion, instead of discarding all cases with any missing values, the researcher uses only the cases or respondents with complete responses for each calculation.
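
The difference between these treatments is easiest to see on a small data set. The Python sketch below uses made-up ratings with `None` marking a missing response; the function names are hypothetical, and imputation from other responses is not shown because it depends on the model chosen.

```python
from statistics import mean

# Toy data; None marks a missing response (made-up values).
data = [
    {"quality": 4, "value": 5, "service": 6},
    {"quality": None, "value": 3, "service": 4},
    {"quality": 6, "value": None, "service": 2},
    {"quality": 2, "value": 1, "service": 4},
]

def substitute_mean(rows, variable):
    """Neutral-value substitution: fill gaps with the variable's observed mean."""
    observed = [r[variable] for r in rows if r[variable] is not None]
    fill = mean(observed)
    return [r[variable] if r[variable] is not None else fill for r in rows]

def casewise(rows):
    """Casewise deletion: keep only respondents with no missing response."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise(rows, var_a, var_b):
    """Pairwise deletion: keep cases complete on the two variables in use."""
    return [(r[var_a], r[var_b]) for r in rows
            if r[var_a] is not None and r[var_b] is not None]

filled = substitute_mean(data, "quality")
complete_cases = casewise(data)
value_service = pairwise(data, "value", "service")
```

Casewise deletion leaves two of the four respondents, whereas the value/service calculation under pairwise deletion still uses three, which shows how pairwise deletion retains more data at the cost of a different sample size for each calculation.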
Statistically Adjusting the Data
• Weighting
• In weighting, each case or respondent in the
database is assigned a weight to reflect its
importance relative to other cases or respondents.
• Weighting is most widely used to make the sample
data more representative of a target population on
specific characteristics.
• Yet another use of weighting is to adjust the
sample so that greater importance is attached to
respondents with certain characteristics.
Statistically Adjusting the Data
Use of Weighting for Representativeness
Years of              Sample       Population
Education             Percentage   Percentage   Weight

Elementary School
  0 to 7 years          2.49         4.23        1.70
  8 years               1.26         2.19        1.74

High School
  1 to 3 years          6.39         8.65        1.35
  4 years              25.39        29.24        1.15

College
  1 to 3 years         22.33        29.42        1.32
  4 years              15.02        12.01        0.80
  5 to 6 years         14.94         7.36        0.49
  7 years or more      12.18         6.90        0.57

Totals                100.00       100.00
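
Each weight in the table is the population percentage divided by the sample percentage for that group. A short Python sketch that reproduces the column, with the group labels shortened for use as dictionary keys (the variable names are illustrative):

```python
# (sample %, population %) per education group, transcribed from the table.
education = {
    "Elementary, 0 to 7 years": (2.49, 4.23),
    "Elementary, 8 years": (1.26, 2.19),
    "High school, 1 to 3 years": (6.39, 8.65),
    "High school, 4 years": (25.39, 29.24),
    "College, 1 to 3 years": (22.33, 29.42),
    "College, 4 years": (15.02, 12.01),
    "College, 5 to 6 years": (14.94, 7.36),
    "College, 7 years or more": (12.18, 6.90),
}

# Weight = population share / sample share, rounded to two decimals.
weights = {
    group: round(population / sample, 2)
    for group, (sample, population) in education.items()
}
```

Groups that are under-represented in the sample receive weights above 1, and over-represented groups receive weights below 1.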

Statistically Adjusting the Data
• Variable Respecification
• Variable respecification involves the transformation of data to create new variables or modify existing variables.
• For example, the researcher may create new variables that are composites of several other variables.
• Dummy variables are used for respecifying categorical variables. The general rule is that to respecify a categorical variable with K categories, K-1 dummy variables are needed.

Statistically Adjusting the Data
Product Usage    Original Variable    Dummy Variable Code
Category         Code                 X1    X2    X3

Nonusers         1                    1     0     0
Light users      2                    0     1     0
Medium users     3                    0     0     1
Heavy users      4                    0     0     0

Note that X1 = 1 for nonusers and 0 for all others. Likewise, X2 = 1 for light users and 0 for all others, and X3 = 1 for medium users and 0 for all others. In analyzing the data, X1, X2, and X3 are used to represent all user/nonuser groups.
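
The K-1 rule can be expressed directly in code. In this Python sketch the last category in the list (heavy users) is treated as the all-zero reference group, matching the table; the function name is a hypothetical choice.

```python
CATEGORIES = ["Nonusers", "Light users", "Medium users", "Heavy users"]

def dummy_code(category, categories=CATEGORIES):
    """Return K-1 dummy variables; the last category is the all-zero reference."""
    if category not in categories:
        raise ValueError(f"unknown category: {category!r}")
    # One 0/1 indicator for each category except the last.
    return [int(category == c) for c in categories[:-1]]

x_light = dummy_code("Light users")
x_heavy = dummy_code("Heavy users")
```

A fourth dummy variable would add no information, because heavy users are already identified by X1 = X2 = X3 = 0.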
Statistically Adjusting the Data

• Scale Transformation and Standardization:

- Scale transformation involves a manipulation of scale values to ensure comparability with other scales or otherwise make the data suitable for analysis.

- A more common transformation procedure is standardization. Standardized scores, $Z_i$, may be obtained as:

$Z_i = (X_i - \bar{X}) / s_X$

where $\bar{X}$ is the mean of the variable and $s_X$ is its standard deviation.
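
As a worked example, the Python sketch below standardizes the QUALITY column of the restaurant data. It assumes that the standard deviation in the formula is the sample standard deviation (divisor n - 1), which is what `statistics.stdev` computes.

```python
from statistics import mean, stdev

def standardize(values):
    """Z_i = (X_i - mean) / s, using the sample standard deviation."""
    x_bar = mean(values)
    s_x = stdev(values)
    return [(x - x_bar) / s_x for x in values]

# QUALITY ratings of the 20 respondents in the restaurant preference table.
quality = [2, 5, 4, 2, 6, 4, 2, 3, 6, 3, 3, 6, 4, 1, 7, 5, 3, 4, 5, 2]
z_scores = standardize(quality)
```

After standardization the scores have mean 0 and standard deviation 1, so variables measured on different scales can be compared directly.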
A Classification of Univariate Techniques
Univariate Techniques
• Metric Data
– One Sample: t test; Z test
– Two or More Samples, Independent: two-group t test; Z test; one-way ANOVA
– Two or More Samples, Related: paired t test
• Non-numeric Data
– One Sample: frequency; chi-square; K-S; runs; binomial
– Two or More Samples, Independent: chi-square; Mann-Whitney; median; K-S; K-W ANOVA
– Two or More Samples, Related: sign; Wilcoxon; McNemar; chi-square
A Classification of Multivariate Techniques
Multivariate Techniques
• Dependence Technique
– One Dependent Variable: cross-tabulation; analysis of variance and covariance; multiple regression; conjoint analysis
– More Than One Dependent Variable: multivariate analysis of variance and covariance; canonical correlation; multiple discriminant analysis
• Interdependence Technique
– Variable Interdependence: factor analysis
– Interobject Similarity: cluster analysis; multidimensional scaling
