Exploratory Data Analysis

This document provides an overview of exploratory data analysis techniques for an engineering course. It covers topics like data sampling methods, descriptive statistics, plotting data, and the scientific method. Specifically, it discusses the differences between probability and non-probability sampling, describing techniques like simple random sampling, stratified random sampling, clustering sampling, and multistage sampling.
EXPLORATORY DATA ANALYSIS
Engineering Systems and Decision Analysis (CIVE3066)

Instructors:
Nguyen-Tuan-Thanh LE (thanhlnt@[Link])

Based on the CIVE203 - Engineering system and decision analysis - Colorado state University, USA
CONTENTS
1. Data sampling
2. Descriptive statistics
3. Data processing
4. Plotting data

THE SCIENTIFIC METHOD
▪ Define the problem
▪ Gather information
▪ Construct hypothesis
▪ Collect data and propose a model
▪ Analyze results / test the hypotheses
▪ If the hypothesis is false, return to an earlier step
▪ Draw conclusions
▪ Decisions / Recommendations

1. DATA SAMPLING (1/2)
▪Population vs Sample
▪ Example:
▪ You ask 100 randomly chosen people at a football match
which team they like
▪ Your sample is the 100 chosen people, while the population is all
the people at that match
▪ Sample: a selection taken from a larger/total group (the
“Population”) so that you can examine it to find out something
about the larger/total group [1]

[1] [Link]
DATA SAMPLING (2/2)
▪Data sampling is a statistical analysis technique
▪ used to select, manipulate and analyze a
representative subset of data points
▪ in order to identify patterns and trends in the larger
data set being examined [2]
▪How can we get a sample from the population?
▪ Sample design: which strategy?

[2] [Link]
DATA COLLECTION:
SAMPLE DESIGN – STRATEGY
▪ Non-probability Sampling: selecting samples based on the subjective
judgment of researchers rather than random selection
▪ Haphazard Sampling (Convenience Sampling)
▪ Judgment Sampling (Purposive/Expert Sampling)
▪ Probability Sampling: samples are chosen using a method based on the
theory of probability
▪ Simple Random Sampling
▪ Stratified Random Sampling
▪ Cluster Sampling
▪ Multistage Sampling
▪ Systematic Sampling

NON-PROBABILITY SAMPLING:
HAPHAZARD/CONVENIENCE
SAMPLING
▪ Items are chosen haphazardly in an attempt to imitate true randomness.
▪ Example: you stand on a busy corner during rush hour and interview
people who pass by.
▪ Based on the philosophy that "any sampling location will work"
▪ Taking samples at convenient locations or times.
▪ A very homogeneous population over time and space is essential to obtain
unbiased estimates
▪ It is very difficult to verify this assumption

NON-PROBABILITY SAMPLING:
JUDGMENT/PURPOSIVE/EXPERT SAMPLING
▪ Based on subjective selection of population units by an individual/expert
▪ Samples are selected based purely on the researcher's knowledge and
credibility.
▪ In other words, the researcher chooses only those units that they feel are the right fit
(with respect to attributes and representation of a population) to
participate in the research study.
▪ The target population should be clearly defined, homogeneous, and
completely accessible so that sample selection bias is not a problem.
▪ The degree of accuracy is hard to quantify

PROBABILITY SAMPLING:
SIMPLE RANDOM SAMPLING (1/3)
▪ Each of the population units has an equal chance of being
selected for measurement.
▪ The selection of one unit does not influence the selection of
other units.
▪ Note that random sampling is not equivalent to selecting
locations haphazardly.
▪ Appropriate for estimating means and totals when the
population does not contain major trends, cycles, or
patterns.
PROBABILITY SAMPLING:
SIMPLE RANDOM SAMPLING (2/3) -
STEPS
1. A list of all the members of the population is prepared
initially and then each member is marked with a specific
number (for example, there are N members then they will be
numbered from 1 to N)
2. From this population, random samples are chosen in one of two
ways:
▪ Method of lottery
▪ Use of random numbers (random number tables, random number
generator software)
▪ Random number generator software is preferred, as the sample
numbers can be generated randomly without human interference.
Method of lottery

Excel Functions:
• RANDBETWEEN(a,b)
• RAND()

PROBABILITY SAMPLING:
SIMPLE RANDOM SAMPLING (3/3) – EXAMPLE
▪ An organization has 500 employees. We want to extract a sample of
100 from them.
▪ Step 1: Make a list of all the employees working in the organization
▪ There are 500 employees in the organization, so the list must contain
500 names.
▪ Step 2: Assign a sequential number to each employee (1,2,3…500). This is
your sampling frame (the list from which you draw your simple random sample).
▪ Step 3: Figure out what your sample size is going to be.
▪ In this case, the sample size is 100
▪ Step 4: Use a random number generator to select the sample, using your
sampling frame (population size) from Step 2 and your sample size from Step 3.
▪ In this case, your sample size is 100 and your population is 500, so generate 100
distinct random numbers between 1 and 500 (a repeated number would select the same employee twice).
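The four steps above can be sketched in MATLAB/Octave as follows. This is a minimal illustration, not part of the original slides: the variable names are hypothetical, and randperm is used so that no employee number is drawn twice.

```octave
% Simple random sampling without replacement (illustrative sketch).
N = 500;                 % population size: employees numbered 1..N (sampling frame)
n = 100;                 % desired sample size
idx = randperm(N, n);    % n distinct integers drawn uniformly from 1..N
idx = sort(idx);         % sorting is only for readability of the list
```

Each run gives a different selection; idx holds the employee numbers to include in the sample.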
PROBABILITY SAMPLING:
STRATIFIED RANDOM SAMPLING (1/2)
▪ The target population is divided into non-overlapping,
homogeneous sub-regions/groups called strata (singular: stratum) to obtain
a better estimate of the mean of the population.
▪ Age, socioeconomic division, nationality, religion, educational
achievement, … are typical variables used to define strata.
▪ Samples within each stratum are selected by Simple Random
Sampling.
▪ Useful when a heterogeneous population can be broken down
into parts that are internally homogeneous.
PROBABILITY SAMPLING:
STRATIFIED RANDOM SAMPLING (2/2) - EXAMPLE

▪ Let’s consider a situation where a research team is seeking
opinions about religion amongst various age groups.
▪ Instead of collecting feedback from 326,044,985 U.S. citizens,
a random sample of around 10,000 can be selected for research.
▪ These 10,000 citizens can be divided into strata according to
age, i.e., groups of 18-29, 30-39, 40-49, 50-59, and 60 and above.
▪ Each stratum will have distinct members and its own number of
members.
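A stratified draw over the age groups above might look like the following sketch. The ages are randomly generated placeholder data, and the 10% sampling fraction per stratum (proportional allocation) is an assumption made for illustration; the slides do not specify an allocation rule.

```octave
% Stratified random sampling sketch (hypothetical data and sampling fraction).
ages  = randi([18 90], 1000, 1);   % placeholder ages of 1000 respondents
edges = [18 30 40 50 60 Inf];      % strata: 18-29, 30-39, 40-49, 50-59, 60+
frac  = 0.10;                      % assumed sampling fraction in every stratum
sample_idx = [];
for s = 1:numel(edges)-1
  members = find(ages >= edges(s) & ages < edges(s+1));  % units in stratum s
  n_s  = round(frac * numel(members));                   % stratum sample size
  pick = members(randperm(numel(members), n_s));         % simple random sample inside the stratum
  sample_idx = [sample_idx; pick(:)];                    % collect the selected row indices
end
```

Because the strata do not overlap, no respondent can be selected more than once.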

PROBABILITY SAMPLING:
CLUSTER SAMPLING
▪The target population is divided into clusters of
individual units
▪Some clusters are chosen at random, and all units
in the chosen clusters are measured.
▪Useful when population units cluster together and
each unit in the randomly selected cluster can be
measured.
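As a rough sketch of the idea, the code below picks whole clusters at random and keeps every unit inside them. The cluster labels are made-up placeholder data, and the counts (20 clusters, 5 chosen) are arbitrary assumptions.

```octave
% Cluster sampling sketch (hypothetical cluster labels).
n_total_clusters = 20;
cluster_id = randi(n_total_clusters, 400, 1);    % placeholder: cluster label of each of 400 units
n_chosen   = 5;                                  % assumed number of clusters to select
chosen     = randperm(n_total_clusters, n_chosen);   % clusters picked at random
sample_idx = find(ismember(cluster_id, chosen));     % all units in the chosen clusters
```

Unlike stratified sampling, nothing is drawn from the clusters that were not selected.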
CLUSTER SAMPLING VS
STRATIFIED SAMPLING
Cluster Sampling
▪ Elements of a population are randomly selected to be a part of groups (clusters).
▪ Members from randomly selected clusters are a part of this sample.
▪ Homogeneity is maintained between clusters.
▪ Heterogeneity is maintained within the clusters.
▪ The clusters are divided naturally.
▪ The key objective is to minimize the cost involved and enhance competence.
Stratified Random Sampling
▪ The entire population is divided into even segments (strata).
▪ Individual components of the strata are randomly considered to be a part of sampling units.
▪ Homogeneity is maintained within the strata.
▪ Heterogeneity is maintained between strata.
▪ The strata division is primarily decided by the researchers or statisticians.
▪ The key objective is to conduct accurate sampling along with a properly represented population.

PROBABILITY SAMPLING:
MULTISTAGE SAMPLING
▪The target population is divided into primary units
(clusters)
▪Then, a set of primary units is selected by using
Simple Random Sampling and each is randomly
sub-sampled
▪Needed when measurements are made on
sub-samples of the field sample.
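A two-stage version can be sketched as below: primary units are chosen by simple random sampling, then a random sub-sample is taken inside each chosen unit. The data layout and all sizes are assumptions for illustration only.

```octave
% Two-stage (multistage) sampling sketch with hypothetical sizes.
cluster_id = repmat((1:10)', 30, 1);   % placeholder: 300 units in 10 primary units of 30 each
n_primary  = 4;                        % assumed number of primary units to select
n_sub      = 5;                        % assumed sub-sample size inside each selected unit
chosen = randperm(10, n_primary);      % stage 1: simple random sample of primary units
sample_idx = [];
for c = chosen
  members = find(cluster_id == c);                      % units belonging to primary unit c
  pick = members(randperm(numel(members), n_sub));      % stage 2: random sub-sample
  sample_idx = [sample_idx; pick(:)];
end
```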

PROBABILITY SAMPLING:
SYSTEMATIC SAMPLING (1/2)
▪ The elements are chosen from a target population by selecting a
random starting point and selecting other members after a fixed
‘sampling interval’.
▪ Sampling interval is calculated by dividing the entire population
size by the desired sample size.
▪ Example:
▪ A local NGO is seeking to form a systematic sample of 500
volunteers from a population of 5000,
▪ They can select every 10th person in the population to
systematically form a sample.
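The NGO example can be written as a short sketch, assuming the population is already listed in a fixed order and numbered 1 to 5000. The variable names are illustrative.

```octave
% Linear systematic sampling sketch for N = 5000, n = 500.
N = 5000;                 % population size
n = 500;                  % desired sample size
k = floor(N / n);         % sampling interval: 10
start = randi(k);         % random starting point between 1 and k
idx = start : k : N;      % every k-th unit from the starting point
idx = idx(1:n);           % keep exactly n units
```

Only the starting point is random; once it is fixed, the rest of the sample is determined.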

PROBABILITY SAMPLING:
SYSTEMATIC SAMPLING (2/2) – TYPES
Linear Systematic Sampling
1. Arrange the entire population in a classified sequence.
2. Select the sample size (n).
3. Calculate the sampling interval (k) = N/n.
4. Select a random number between 1 and k (including k).
5. Add the sampling interval (k) to the chosen random number to add the next member to the sample, and repeat this procedure to add the remaining members of the sample.
6. In case k isn’t an integer, select the closest integer to N/n.

Circular Systematic Sampling
1. Calculate the sampling interval (k) = N/n. (If N = 11 and n = 2, then k is taken as 5 and not 6.)
2. Start randomly between 1 and N.
3. Create samples by skipping through k units every time until you select members of the entire population.
4. With this systematic sampling method there will be N possible samples, unlike the k possible samples of the linear systematic sampling method.
▪ Example: if N = 7, n = 2, k = 3, the samples will be: ad, be, ca, db and ec.
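Circular systematic sampling can be sketched with modular arithmetic, so that the count wraps around to the start of the list. This is an illustrative sketch using the N = 11, n = 2 case mentioned above; rounding k down with floor is an assumption that reproduces k = 5.

```octave
% Circular systematic sampling sketch for N = 11, n = 2.
N = 11;                          % population size
n = 2;                           % sample size
k = floor(N / n);                % sampling interval: 5 (N/n = 5.5 rounded down)
start = randi(N);                % random start anywhere between 1 and N
steps = (0:n-1) * k;             % offsets 0, k, 2k, ...
idx = mod(start - 1 + steps, N) + 1;   % wrap around past the end of the list
```

Since any of the N units can be the starting point, there are N possible samples.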
2. DESCRIPTIVE STATISTICS

▪ Descriptive statistics are statistical measures used to
describe a set of samples (or observations)
▪Three kinds of descriptive statistics:
1. Describing the Central tendency of observations
2. Describing the Dispersion/Spread of observations
3. Describing the Asymmetry of observations

DESCRIPTIVE STATISTICS:
CENTRAL TENDENCY - MODE
▪ The value that occurs most often among the observations
▪ MATLAB Syntax: M = mode(X); M = mode(X, dim);
▪ Description
▪ M = mode(X)
▪ If X is a vector, M is the sample mode (the most frequently occurring value) of X
▪ If X is a matrix, M is a row vector containing the mode of each column of that matrix
▪ When there are multiple values occurring equally frequently, mode returns the smallest
of those values.
▪ M = mode(X, dim) computes the mode along the dimension dim of X
▪ dim = 1 or 2
▪ 1: return a row vector with the mode of each column (default)
▪ 2: return a column vector with the mode of each row
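The behaviour described above can be checked on small inputs. The vectors below are made-up examples, not data from the course.

```octave
% mode on a vector and on a matrix (illustrative inputs).
X  = [1 2 2 3 3];
m  = mode(X);        % 2 and 3 both occur twice, so the smaller value is returned
A  = [1 2; 1 3; 4 3];
mc = mode(A);        % mode of each column, returned as a row vector
mr = mode(A, 2);     % mode of each row, returned as a column vector
```

In the first two rows of A every value occurs once, so the tie rule again returns the smallest value of the row.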

DESCRIPTIVE STATISTICS:
CENTRAL TENDENCY - MEDIAN
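The body of this slide did not survive extraction. As a hedged illustration only, the median is the middle value of the sorted observations (or the average of the two middle values when their number is even), and MATLAB/Octave provide it as median. The inputs below are invented.

```octave
% median for an odd and an even number of observations (illustrative inputs).
x = [7 1 3 9 5];
med_odd  = median(x);    % sorted: 1 3 5 7 9, middle value
y = [7 1 3 9];
med_even = median(y);    % sorted: 1 3 7 9, average of the two middle values
```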





DESCRIPTIVE STATISTICS:
CENTRAL TENDENCY – ARITHMETIC MEAN (AM)






DESCRIPTIVE STATISTICS:
CENTRAL TENDENCY – TRUNCATED (TRIMMED) MEAN

▪ Octave: pkg install -forge statistics
▪ k = n*(percent/100)/2

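To show what the formula for k does, the sketch below computes a trimmed mean by hand, assuming k is the number of observations discarded at each end of the sorted data. This matches how MATLAB documents trimmean(X, percent), which is also provided by the Octave statistics package, but the manual version avoids that dependency. The data and the rounding of k with floor are assumptions for illustration.

```octave
% Manual trimmed mean (illustrative data with one outlier).
x = [2 4 5 6 7 8 9 50];
percent = 25;                        % total percentage of observations to discard
n = numel(x);
k = floor(n * (percent/100) / 2);    % observations dropped at each end: 1 here
xs = sort(x);
tm = mean(xs(k+1 : n-k));            % mean of the remaining n - 2k values
```

Dropping the smallest and largest values removes the outlier 50, so tm is far less affected by it than mean(x).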
DESCRIPTIVE STATISTICS:
CENTRAL TENDENCY – GEOMETRIC MEAN (GM)





DESCRIPTIVE STATISTICS:
CENTRAL TENDENCY – HARMONIC MEAN (HM)






AM ≥ GM ≥ HM (for positive observations)

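The ordering of the three means can be checked numerically. The formulas on the preceding slides were lost in extraction, so this sketch uses the standard definitions and made-up positive data.

```octave
% Arithmetic, geometric and harmonic mean of positive data (illustrative).
x  = [1 2 4 8];
am = mean(x);                   % arithmetic mean: sum(x) / n
gm = exp(mean(log(x)));         % geometric mean: n-th root of the product
hm = numel(x) / sum(1 ./ x);    % harmonic mean: n / sum of reciprocals
```

The three values coincide only when all observations are equal.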
DESCRIPTIVE STATISTICS:
DISPERSION – RANGE





DESCRIPTIVE STATISTICS:
DISPERSION – STD & VARIANCE (1/2)

▪ Normalize with N-1: provides the square root of the best unbiased estimator of the variance
▪ Normalize with N: provides the square root of the second moment around the mean
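The two normalizations correspond to the second argument of std and var in MATLAB/Octave (0 for N-1, 1 for N). A small sketch with invented data:

```octave
% Standard deviation and variance with both normalizations (illustrative data).
x = [2 4 4 4 5 5 7 9];     % mean is 5, sum of squared deviations is 32
s_n1 = std(x);             % normalize by N-1 (default, same as std(x, 0))
s_n  = std(x, 1);          % normalize by N
v_n1 = var(x);             % variance with N-1
v_n  = var(x, 1);          % variance with N
```

The N-1 version is always slightly larger; the difference shrinks as the number of observations grows.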

DESCRIPTIVE STATISTICS:
DISPERSION – STD & VARIANCE (2/2)




DESCRIPTIVE STATISTICS:
DISPERSION – COEFFICIENT VARIATION & PERCENTILE
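The content of this slide was lost, so the following is only a hedged sketch using the standard definitions: the coefficient of variation is the standard deviation divided by the mean, and the p-th percentile is the value below which roughly p percent of the observations fall. prctile is part of core Octave; in MATLAB it belongs to the Statistics and Machine Learning Toolbox. The data are invented.

```octave
% Coefficient of variation and percentiles (illustrative data).
x   = [2 4 4 4 5 5 7 9];
cv  = std(x) / mean(x);       % coefficient of variation (dimensionless)
p25 = prctile(x, 25);         % 25th percentile
p50 = prctile(x, 50);         % 50th percentile
p75 = prctile(x, 75);         % 75th percentile
```

Percentile values near the ends of small samples depend on the interpolation rule used by the software.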


3. DATA PROCESSING
▪ Data processing involves verification, coding, classification
& tabulation of data
▪ Verification: verify to ensure that the data is accurate
▪ Coding: the verified data is converted into machine-readable form so
that it can be processed by computer
▪ Classification: data are classified on the basis of common
characteristics, which may be qualitative (descriptive) or
quantitative (numerical)
▪ Tabulation: a concise, logical & orderly arrangement of data in
columns & rows

MISSING DATA (1/2)
▪ Missing or unavailable data values in MATLAB/Octave are
usually represented by the special value NaN, which is
Not-a-Number.
▪ Function isnan(A): used to identify NaNs; A can be a vector X or a matrix M
▪ Function any(X): for a vector argument, return true (logical
1) if any element of the vector is nonzero

MISSING DATA (2/2)
▪ Example:
▪ X = [8 NaN 5 9];
▪ M = [NaN 6 10 2; 8 1 4 2; 8 6 7 2];
▪ isnan(X) = ? isnan(M) = ?
▪ ~isnan(X) = ? ~isnan(M) = ?
▪ i=find(~isnan(X)); X=X(i);: find the indices of elements in a
vector X that are not NaNs and keep only non-NaN
elements.
▪ X=X(~isnan(X)): remove NaNs from X
▪ X(isnan(X)) = []: remove NaNs from X
▪ M(any(isnan(M),2),:)=[]: remove any rows containing NaNs
from a matrix M
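Running the example on the X and M given above shows what each expression returns. The names mask, X_clean and M_clean are introduced here for illustration so that the original variables stay unchanged.

```octave
% Identifying and removing NaNs (values taken from the example above).
X = [8 NaN 5 9];
M = [NaN 6 10 2; 8 1 4 2; 8 6 7 2];
mask    = isnan(X);                    % logical vector, true where X is NaN
X_clean = X(~mask);                    % keep only the non-NaN elements
row_has_nan = any(isnan(M), 2);        % true for each row of M containing a NaN
M_clean = M;
M_clean(row_has_nan, :) = [];          % delete those rows
```

Only the first row of M contains a NaN, so M_clean keeps the last two rows.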
4. PLOTTING DATA
(STATISTICAL GRAPHICS)
▪Scatter plot
▪Time series plot
▪Bar chart
▪Stem plot
▪Box-plot

SCATTER PLOT
▪ scatter(X, Y): create a scatter plot with circles at the
locations specified by the vectors X and Y

TIME SERIES PLOT
▪ plot(X): plot the time series data X against its index (1, 2, …); use plot(T, X) to plot X against a time vector T.

BAR CHART
▪ bar(Y): create a bar graph with one bar for each element in Y
▪ bar(X, Y): draws the bars of Y at the locations specified in X

STEM PLOT
▪ stem(Y): plot the data sequence Y as stems that extend from the
baseline along the x-axis
▪ stem(X, Y): plot the data sequence Y at values specified by X

BOX-PLOT
▪ boxplot(X): create a box plot of the data in X
▪ boxplot([X1 X2 X3]): create one box for each column of data, for equal-length column vectors X1, X2, X3
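The plotting functions from this section can be tried together on one figure. This is a sketch with made-up data; boxplot is left out because it needs the Statistics and Machine Learning Toolbox in MATLAB or the statistics package in Octave.

```octave
% Four basic plot types on one figure (illustrative data).
x = 1:10;
y = x .^ 2;
figure('visible', 'off');        % draw off-screen; remove this option to see the window
subplot(2, 2, 1); scatter(x, y); title('scatter(X, Y)');
subplot(2, 2, 2); plot(y);       title('plot(Y)');
subplot(2, 2, 3); bar(x, y);     title('bar(X, Y)');
subplot(2, 2, 4); stem(x, y);    title('stem(X, Y)');
```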

