EXPLORATORY DATA ANALYSIS
Engineering Systems and Decision Analysis (CIVE3066)
Instructors:
Nguyen-Tuan-Thanh LE (thanhlnt@[Link])
Based on CIVE203 - Engineering Systems and Decision Analysis - Colorado State University, USA
CONTENTS
1. Data sampling
2. Descriptive statistics
3. Data processing
4. Plotting data
THE SCIENTIFIC METHOD
1. Define the problem
2. Gather information
3. Construct hypothesis
4. Collect data and propose a model
5. Analyze results / test the hypotheses (a false hypothesis sends the process back to an earlier step)
6. Draw conclusions
7. Decisions / Recommendations
1. DATA SAMPLING (1/2)
▪Population vs Sample
▪ Example:
▪ You ask 100 randomly chosen people at a football match
which team they like
▪ Your sample is the 100 chosen people, while the population is all
the people at that match
▪ Sample: a selection taken from a larger/total group (the
“Population”) so that you can examine it to find out something
about the larger/total group [1]
[1] [Link]
DATA SAMPLING (2/2)
▪Data sampling is a statistical analysis technique
▪ used to select, manipulate and analyze a
representative subset of data points
▪ in order to identify patterns and trends in the larger
data set being examined [2]
▪How can we get a sample from the population?
▪ Sample design: which strategy?
[2] [Link]
DATA COLLECTION:
SAMPLE DESIGN – STRATEGY
▪ Non-probability Sampling: selecting samples based on the subjective
judgment of researchers rather than random selection
▪ Haphazard Sampling (Convenience Sampling)
▪ Judgment Sampling (Purposive/Expert Sampling)
▪ Probability Sampling: samples are chosen using a method based on the
theory of probability
▪ Simple Random Sampling
▪ Stratified Random Sampling
▪ Cluster Sampling
▪ Multistage Sampling
▪ Systematic Sampling
NON-PROBABILITY SAMPLING:
HAPHAZARD/CONVENIENCE
SAMPLING
▪ Attempts to approximate a random sample by choosing items haphazardly.
▪ Example: you stand on a busy corner during rush hour and interview
people who pass by.
▪ Based on the philosophy that any sampling location will work
▪ Taking samples at convenient locations or times.
▪ A very homogeneous population over time and space is essential to obtain
unbiased estimates
▪ It is very difficult to verify this assumption
NON-PROBABILITY SAMPLING:
JUDGMENT/PURPOSIVE/EXPERT SAMPLING
▪ Based on subjective selection of population units by an individual/expert
▪ Samples are selected based purely on the researcher's knowledge and
credibility.
▪ In other words, the researcher chooses only those who they feel are the right fit
(with respect to attributes and representation of a population) to
participate in the research study.
▪ The target population should be clearly defined, homogeneous, and
completely accessible so that sample selection bias is not a problem.
▪ The degree of accuracy is hard to quantify
PROBABILITY SAMPLING:
SIMPLE RANDOM SAMPLING (1/3)
▪ Each of the population units has an equal chance of being
selected for measurement.
▪ The selection of one unit does not influence the selection of
other units.
▪ Note that random sampling is not equivalent to selecting
locations haphazardly.
▪ Appropriate for estimating means and totals when the
population does not contain major trends, cycles, or
patterns.
PROBABILITY SAMPLING:
SIMPLE RANDOM SAMPLING (2/3) -
STEPS
1. A list of all the members of the population is prepared
initially and then each member is marked with a specific
number (for example, there are N members then they will be
numbered from 1 to N)
2. From this population, random samples are chosen in one of two
ways:
▪ Method of lottery
▪ Use of random numbers (random number tables, random number
generator software)
▪ Random number generator software is preferred because the sample
numbers can be generated randomly without human interference.
Method of lottery
Excel Functions:
• RANDBETWEEN(a,b)
• RAND()
PROBABILITY SAMPLING:
SIMPLE RANDOM SAMPLING (3/3) – EXAMPLE
▪ An organization has 500 employees. We want to extract a sample of
100 from them.
▪ Step 1: Make a list of all the employees working in the organization
▪ As mentioned above, there are 500 employees in the organization, so the list must contain
500 names.
▪ Step 2: Assign a sequential number to each employee (1,2,3…500). This is
your sampling frame (the list from which you draw your simple random sample).
▪ Step 3: Figure out what your sample size is going to be.
▪ In this case, the sample size is 100
▪ Step 4: Use a random number generator to select the sample, using your
sampling frame (population size) from Step 2 and your sample size from Step 3.
▪ In this case, your sample size is 100 and your population is 500, so generate 100
random numbers between 1 and 500.
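The four steps above can be sketched in MATLAB/Octave as follows. This is a minimal illustration, not the course's reference solution: the variable names are made up, and `randperm` is used instead of 100 independent draws so that no employee number can be selected twice (independent draws between 1 and 500, for example with RANDBETWEEN, can repeat a number).

```matlab
% Simple random sample of 100 employees out of 500 (illustrative sketch).
N = 500;                 % population size (length of the sampling frame)
n = 100;                 % desired sample size
frame = 1:N;             % sequential employee numbers from Step 2

% randperm(N, n) returns n distinct integers from 1..N, i.e. sampling
% without replacement, so no employee can be drawn twice.
idx = randperm(N, n);
sample_ids = sort(frame(idx));

disp(sample_ids(1:10));  % first ten selected employee numbers
```

Each run produces a different sample, which is expected because the generator is not seeded here.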
PROBABILITY SAMPLING:
STRATIFIED RANDOM SAMPLING (1/2)
▪ The target population is divided into non-overlapping,
homogeneous sub-regions/groups called strata (singular: stratum) to obtain
a better estimate of the population mean.
▪ Age, socioeconomic division, nationality, religion and educational
achievement are common stratification variables.
▪ Samples within each stratum are selected by Simple Random
Sampling.
▪ Useful when a heterogeneous population can be broken down
into parts that are internally homogeneous.
PROBABILITY SAMPLING:
STRATIFIED RANDOM SAMPLING (2/2) - EXAMPLE
▪Let’s consider a situation where a research team is seeking
opinions about religion amongst various age groups.
▪Instead of collecting feedback from 326,044,985 U.S. citizens,
a random sample of around 10000 can be selected for research.
▪These 10000 citizens can be divided into strata according to
age, i.e., groups of 18-29, 30-39, 40-49, 50-59, and 60 and above.
▪ Each stratum will have distinct members and its own number of
members.
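A rough MATLAB/Octave sketch of stratifying by the age groups above and drawing a simple random sample inside each stratum. Everything numeric here is an assumption made for illustration: the ages are synthetic (uniform between 18 and 90), and the same 5% sampling fraction is applied in every stratum (proportional allocation), which the slides do not prescribe.

```matlab
% Stratified random sampling by age group (illustrative sketch).
ages  = randi([18 90], 10000, 1);   % hypothetical ages of 10000 respondents
edges = [18 30 40 50 60 Inf];       % strata: 18-29, 30-39, 40-49, 50-59, 60+
frac  = 0.05;                       % assumed sampling fraction per stratum

sample_idx = [];
for s = 1:numel(edges) - 1
    % units that belong to stratum s
    members = find(ages >= edges(s) & ages < edges(s + 1));
    % proportional allocation: sample size follows stratum size
    n_s = round(frac * numel(members));
    % simple random sampling (without replacement) inside the stratum
    pick = members(randperm(numel(members), n_s));
    sample_idx = [sample_idx; pick];
end

fprintf('Selected %d of %d respondents\n', numel(sample_idx), numel(ages));
```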
PROBABILITY SAMPLING:
CLUSTER SAMPLING
▪The target population is divided into clusters of
individual units
▪Some clusters are chosen at random, and all units
in the chosen clusters are measured.
▪Useful when population units cluster together and
each unit in the randomly selected cluster can be
measured.
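The idea can be illustrated with a small MATLAB/Octave sketch. The population layout (1000 units in 20 equal clusters) and the number of clusters drawn are hypothetical values chosen only for the example.

```matlab
% Cluster sampling (illustrative sketch).
cluster_id = repmat(1:20, 50, 1);   % 50x20 matrix, column j holds the value j
cluster_id = cluster_id(:);         % cluster label of each of the 1000 units
n_clusters = 4;                     % assumed number of clusters to draw

chosen = randperm(20, n_clusters);               % clusters picked at random
sample_idx = find(ismember(cluster_id, chosen)); % ALL units of those clusters

fprintf('Clusters %s chosen, %d units measured\n', ...
        mat2str(sort(chosen)), numel(sample_idx));
```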
CLUSTER SAMPLING VS
STRATIFIED SAMPLING
Cluster Sampling
▪ Elements of a population are randomly selected to be a part of groups (clusters).
▪ Members from randomly selected clusters are a part of this sample.
▪ Homogeneity is maintained between clusters.
▪ Heterogeneity is maintained within the clusters.
▪ The clusters are divided naturally.
▪ The key objective is to minimize the cost involved and enhance competence.
Stratified Random Sampling
▪ The entire population is divided into even segments (strata).
▪ Individual components of the strata are randomly considered to be a part of sampling units.
▪ Homogeneity is maintained within the strata.
▪ Heterogeneity is maintained between strata.
▪ The strata division is primarily decided by the researchers or statisticians.
▪ The key objective is to conduct accurate sampling along with a properly represented population.
PROBABILITY SAMPLING:
MULTISTAGE SAMPLING
▪The target population is divided into primary units
(clusters)
▪Then, a set of primary units is selected by using
Simple Random Sampling and each is randomly
sub-sampled
▪Needed when measurements are made on
sub-samples of the field sample.
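A two-stage version can be sketched in MATLAB/Octave as below. The cluster layout and both sample sizes are hypothetical; the only difference from measuring whole clusters is that each selected primary unit is itself sub-sampled at random.

```matlab
% Two-stage (multistage) sampling (illustrative sketch).
cluster_id = repmat(1:20, 50, 1);
cluster_id = cluster_id(:);          % 1000 units in 20 primary units
n_primary = 4;                       % assumed number of primary units drawn
n_sub     = 10;                      % assumed sub-sample size per primary unit

primary = randperm(20, n_primary);   % stage 1: SRS of primary units
sample_idx = [];
for c = primary
    members = find(cluster_id == c);                 % units in primary unit c
    pick = members(randperm(numel(members), n_sub)); % stage 2: SRS inside it
    sample_idx = [sample_idx; pick];
end

fprintf('%d units selected in total\n', numel(sample_idx));
```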
PROBABILITY SAMPLING:
SYSTEMATIC SAMPLING (1/2)
▪ The elements are chosen from a target population by selecting a
random starting point and selecting other members after a fixed
‘sampling interval’.
▪ Sampling interval is calculated by dividing the entire population
size by the desired sample size.
▪ Example:
▪ A local NGO is seeking to form a systematic sample of 500
volunteers from a population of 5000.
▪ The sampling interval is 5000/500 = 10, so they can select every
10th person in the population to systematically form a sample.
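The NGO example can be reproduced with a few lines of MATLAB/Octave. This is a sketch that assumes the population is already arranged as units 1..5000 and that N/n is an integer.

```matlab
% Systematic sampling: 500 volunteers out of 5000 (illustrative sketch).
N = 5000;                  % population size
n = 500;                   % desired sample size
k = N / n;                 % sampling interval (10 here)

start = randi(k);          % random starting point in 1..k
sample_idx = start:k:N;    % every k-th unit from the starting point

fprintf('Start at %d, then every %dth unit: %d units selected\n', ...
        start, k, numel(sample_idx));
```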
PROBABILITY SAMPLING:
SYSTEMATIC SAMPLING (2/2) – TYPES
Linear Systematic Sampling
1. Arrange the entire population in a classified sequence.
2. Select the sample size (n).
3. Calculate the sampling interval (k) = N/n.
4. Select a random number between 1 and k (including k).
5. Add the sampling interval (k) to the chosen random number to add the next member to the sample, and repeat this procedure to add the remaining members of the sample.
6. In case k isn’t an integer, select the closest integer to N/n.
Circular Systematic Sampling
1. Calculate the sampling interval (k) = N/n. (If N = 11 and n = 2, then k is taken as 5 and not 6.)
2. Start randomly between 1 and N.
3. Create samples by skipping through k units every time until you select members of the entire population.
4. With this method there will be N possible samples, unlike the k possible samples of the linear systematic sampling method.
▪ Example: if N = 7, n = 2 and k = 3, the samples will be: ad, be, ca, db and ec.
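The circular variant can be sketched in MATLAB/Octave with modular arithmetic, so that counting continues from unit 1 after unit N. Rounding N/n down with `floor` is an assumption made here; it agrees with the N = 11, n = 2 case, where k is taken as 5.

```matlab
% Circular systematic sampling (illustrative sketch).
N = 11;                            % population size
n = 2;                             % desired sample size
k = floor(N / n);                  % sampling interval, 5 for N = 11, n = 2

start = randi(N);                  % random start anywhere in 1..N
% step k units at a time and wrap around after unit N
sample_idx = mod(start - 1 + (0:n-1) * k, N) + 1;

disp(sample_idx);
```

Because the start can be any of the N units, there are N possible samples, compared with k possible starts in the linear method.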
2. DESCRIPTIVE STATISTICS
▪Descriptive statistics are statistical measures used to
describe a set of samples (or observations)
▪Three kinds of descriptive statistics:
1. Describing the Central tendency of observations
2. Describing the Dispersion/Spread of observations
3. Describing the Asymmetry of observations
DESCRIPTIVE STATISTICS:
CENTRAL TENDENCY - MODE
▪ The value with the largest number of observations (the most frequent value)
▪ MATLAB Syntax: M = mode(X); M = mode(X, dim);
▪ Description
▪ M = mode(X)
▪ If X is a vector, M is the sample mode (the most frequently occurring value) of X
▪ If X is a matrix, M is a row vector containing the mode of each column of that matrix
▪ When there are multiple values occurring equally frequently, mode returns the smallest
of those values.
▪ M = mode(X, dim) computes the mode along the dimension dim of X
▪ dim = 1 or 2 (for a matrix)
▪ 1: mode of each column, returned as a row vector (default)
▪ 2: mode of each row, returned as a column vector
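A short MATLAB/Octave check of the behaviour described above, using small made-up inputs: a vector with a tie, and a matrix evaluated along each dimension.

```matlab
% mode() on a vector and on a matrix (illustrative sketch).
X = [3 1 2 2 3];
m1 = mode(X);       % 2 and 3 both occur twice; the smaller value is returned

M = [1 2;
     1 3;
     2 3];
m2 = mode(M);       % mode of each column, returned as a row vector
m3 = mode(M, 2);    % mode of each row, returned as a column vector

disp(m1); disp(m2); disp(m3);
```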
DESCRIPTIVE STATISTICS:
CENTRAL TENDENCY - MEDIAN
DESCRIPTIVE STATISTICS:
CENTRAL TENDENCY – ARITHMETIC MEAN (AM)
DESCRIPTIVE STATISTICS:
CENTRAL TENDENCY – TRUNCATED (TRIMMED) MEAN
▪ Octave: pkg install -forge statistics
▪ Number of observations trimmed from each end: k = n*(percent/100)/2
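The relation k = n*(percent/100)/2 can be tried out by hand in MATLAB/Octave without the statistics package. In this sketch the data are made up, and rounding k down with `floor` is an assumption: library implementations of the trimmed mean differ in how they treat a non-integer k.

```matlab
% Trimmed mean computed by hand (illustrative sketch).
x = [2 4 4 5 5 6 7 8 9 50];         % hypothetical data with one outlier
percent = 20;                       % total percentage of observations to trim
n = numel(x);

k = floor(n * (percent / 100) / 2); % observations dropped from EACH end
xs = sort(x);
tm = mean(xs(k + 1 : n - k));       % mean of the remaining n - 2k values

fprintf('k = %d, mean = %.2f, trimmed mean = %.2f\n', k, mean(x), tm);
```

With these values one observation is dropped from each end, so the outlier 50 no longer pulls the result upwards.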
DESCRIPTIVE STATISTICS:
CENTRAL TENDENCY – GEOMETRIC MEAN (GM)
DESCRIPTIVE STATISTICS:
CENTRAL TENDENCY – HARMONIC MEAN (HM)
AM ≥ GM ≥ HM
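For positive data, the arithmetic, geometric and harmonic means satisfy AM ≥ GM ≥ HM. The MATLAB/Octave sketch below checks this on a small made-up vector using only base functions; the statistics package also offers `geomean` and `harmmean`, which are not used here so that the example runs without it.

```matlab
% Arithmetic, geometric and harmonic means of positive data (sketch).
x = [1 2 4 8];                  % hypothetical positive observations

am = mean(x);                   % arithmetic mean
gm = exp(mean(log(x)));         % geometric mean: n-th root of the product
hm = numel(x) / sum(1 ./ x);    % harmonic mean: n over the sum of reciprocals

fprintf('AM = %.4f, GM = %.4f, HM = %.4f\n', am, gm, hm);
```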
DESCRIPTIVE STATISTICS:
DISPERSION – RANGE
DESCRIPTIVE STATISTICS:
DISPERSION – STD & VARIANCE (1/2)
▪ Normalize with N-1: gives the square root of the best unbiased estimator of the variance
▪ Normalize with N: gives the square root of the second moment about the mean
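The two normalizations can be compared directly in MATLAB/Octave. In this sketch the data are made up; `std(x)` (equivalently `std(x, 0)`) divides by N-1, and `std(x, 1)` divides by N.

```matlab
% Standard deviation with N-1 and with N normalization (sketch).
x = [2 4 4 4 5 5 7 9];     % hypothetical observations, mean = 5

s_n1 = std(x);             % normalized by N-1 (default, same as std(x, 0))
s_n  = std(x, 1);          % normalized by N
v_n1 = var(x);             % variance, normalized by N-1
v_n  = var(x, 1);          % variance, normalized by N

fprintf('std: %.4f (N-1), %.4f (N)\n', s_n1, s_n);
fprintf('var: %.4f (N-1), %.4f (N)\n', v_n1, v_n);
```

The N-1 version is always the larger of the two, and the gap shrinks as the number of observations grows.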
DESCRIPTIVE STATISTICS:
DISPERSION – STD & VARIANCE (2/2)
DESCRIPTIVE STATISTICS:
DISPERSION – COEFFICIENT VARIATION & PERCENTILE
3. DATA PROCESSING
▪ Data processing involves verification, coding, classification
and tabulation of data
▪ Verification: check that the data are accurate
▪ Coding: the verified data are converted into machine-readable form so
that they can be processed by computer
▪ Classification: data are classified on the basis of common
characteristics, which may be qualitative (descriptive) or
quantitative (numerical)
▪ Tabulation: a concise, logical and orderly arrangement of data in
columns and rows
MISSING DATA (1/2)
▪ Missing or unavailable data values in MATLAB/Octave are
usually represented by the special value NaN, which is
Not-a-Number.
▪ Function isnan(X), with X a vector or a matrix: returns a logical array
that is true (1) wherever an element is NaN
▪ Function any(X): for a vector argument, returns true (logical
1) if any element of the vector is nonzero
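A minimal MATLAB/Octave sketch of how the two functions combine to detect and count missing values. The vector is a made-up example.

```matlab
% Detecting missing values with isnan() and any() (sketch).
X = [8 NaN 5 9];

mask = isnan(X);          % logical vector, true where X is NaN
has_missing = any(mask);  % true if at least one element is NaN
n_missing = sum(mask);    % number of NaN entries

disp(mask);
fprintf('Missing values present: %d, count: %d\n', has_missing, n_missing);
```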
MISSING DATA (2/2)
▪ Example:
▪ X = [8 NaN 5 9];
▪ M = [NaN 6 10 2; 8 1 4 2; 8 6 7 2];
▪ isnan(X) = ? isnan(M) = ?
▪ ~isnan(X) = ? ~isnan(M) = ?
▪ i=find(~isnan(X)); X=X(i);: find the indices of elements in a
vector X that are not NaNs and keep only non-NaN
elements.
▪ X=X(~isnan(X)): remove NaNs from X
▪ X(isnan(X)) = []: remove NaNs from X
▪ M(any(isnan(M),2),:)=[]: remove any rows containing NaNs
from a matrix M
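The removal idioms can be tried on the example vector X = [8 NaN 5 9] and the 3-by-4 matrix M whose first row contains a NaN. This MATLAB/Octave sketch keeps the originals and writes the cleaned data to new variables (the names `X_clean`, `bad_rows` and `M_clean` are illustrative).

```matlab
% Removing NaNs from a vector and NaN rows from a matrix (sketch).
X = [8 NaN 5 9];
M = [NaN 6 10 2;
       8 1  4 2;
       8 6  7 2];

X_clean = X(~isnan(X));        % keep only the non-NaN elements

bad_rows = any(isnan(M), 2);   % true for each row that contains a NaN
M_clean = M(~bad_rows, :);     % keep only the complete rows

disp(X_clean);
disp(M_clean);
```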
4. PLOTTING DATA
(STATISTICAL GRAPHICS)
▪Scatter plot
▪Time series plot
▪Bar chart
▪Stem plot
▪Box-plot
SCATTER PLOT
▪ scatter(X, Y): create a scatter plot with circles at the
locations specified by the vectors X and Y
TIME SERIES PLOT
▪ plot(X): plot the time series data X against its index; plot(T, X) plots X against a time vector T.
BAR CHART
▪ bar(Y): create a bar graph with one bar for each element in Y
▪ bar(X, Y): draws the bars of Y at the locations specified in X
STEM PLOT
▪ stem(Y): plot the data sequence Y as stems that extend from the
baseline along the x-axis
▪ stem(X, Y): plot the data sequence Y at values specified by X
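The scatter, plot, bar and stem calls can be compared side by side on the same data. In this MATLAB/Octave sketch the twelve readings are made up, and the figure is created hidden so that the script also runs without a display; set 'Visible' to 'on' to see it.

```matlab
% Four basic plot types on the same data (illustrative sketch).
t = 1:12;                                 % e.g. month index
y = [5 7 6 9 12 15 18 17 13 10 7 5];      % hypothetical monthly readings

fig = figure('Visible', 'off');           % use 'on' to display the window
subplot(2, 2, 1); scatter(t, y); title('scatter(t, y)');
subplot(2, 2, 2); plot(t, y);    title('plot(t, y)');
subplot(2, 2, 3); bar(t, y);     title('bar(t, y)');
subplot(2, 2, 4); stem(t, y);    title('stem(t, y)');
```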
BOX-PLOT
▪ boxplot(X): create a box plot of the data in X
▪ boxplot([X1, X2, X3]): create one box plot per column, for equal-length column vectors X1, X2, X3