• Variables, Units of Measurement and Frequency
• Measures of Central Tendency
• Measures of Dispersion
• Probability Theory
• Binomial Distribution
• Poisson Distribution
• Normal Distribution
Units of Measurement for Variables
Variables
• Categorical / Qualitative: Nominal, Ordinal
• Continuous / Quantitative: Interval, Ratio
Frequency
• Frequency: Number of times a value of the variable occurs in the dataset
• Frequency Distribution: Statistical table that shows each value of the variable against its frequency
Simple Frequency Distribution
Variable (x)   Frequency (f)
2              8
4              10
7              15
Grouped Frequency Distribution
Variable (x)   Frequency (f)
2-5            8
5-7            10
7-9            15
Related terms: Class, Class Frequency, Class Width, Class Limit, Class Boundary, Relative Frequency
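As an illustration of the three kinds of table, here is a minimal sketch in R. The raw values in `x` are hypothetical, chosen only so that the counts are easy to check by hand, and the class boundaries 2, 5, 7, 9 follow the grouped table above with the upper limit of each class excluded (an assumption).

```r
# Hypothetical raw data, for illustration only
x <- c(2, 2, 4, 4, 4, 7, 7, 7, 7, 7)

# Simple frequency distribution: count of each distinct value
freq <- table(x)

# Relative frequency: each frequency divided by the total count
rel_freq <- prop.table(freq)

# Grouped frequency distribution with classes [2,5), [5,7), [7,9)
classes <- cut(x, breaks = c(2, 5, 7, 9), right = FALSE)
grouped <- table(classes)

print(freq)
print(rel_freq)
print(grouped)
```

Here the relative frequencies sum to 1, which is a quick check that no observation has been dropped by the class boundaries.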
Measures of Central Tendency
• Arithmetic Mean: Sum of a collection of numbers divided by the count of numbers in the collection
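Simple A.M.: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Weighted A.M.: $\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$, where $f_i$ is the frequency (weight) of $x_i$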
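• Mode: The value of the variable that has the highest frequency
Simple frequency distribution: the value of x with the largest f
Grouped frequency distribution: Mode = $l_1 + c\left(\frac{f_0 - f_{-1}}{2f_0 - f_{-1} - f_{+1}}\right)$, where $l_1$ is the lower boundary of the modal class, $c$ its class width, $f_0$ its frequency, and $f_{-1}$, $f_{+1}$ the frequencies of the classes just before and just after it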
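• Median: The central value of the variable; it divides the ordered dataset into two equal halves
Simple series: the value at position (n + 1)/2 of the sorted data (the average of the two middle values when n is even)
Simple frequency distribution: the value of x at position (N + 1)/2, located using the cumulative frequencies, where N = ∑ f
Grouped frequency distribution: Median = $l_1 + \frac{c}{f_m}\left(\frac{N}{2} - \sum f_1\right)$, where $l_1$ is the lower boundary of the median class, $c$ its class width, $f_m$ its frequency, and $\sum f_1$ the cumulative frequency of the classes before it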
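The measures above can be sketched in R as follows. The first part reuses the simple frequency table from the earlier slide (x = 2, 4, 7 with f = 8, 10, 15). The second part uses a hypothetical grouped table with equal class widths, and assumes the usual textbook interpolation formulas for the grouped mode and median; the variable names are illustrative only.

```r
# Simple frequency distribution from the earlier slide
x <- c(2, 4, 7)
f <- c(8, 10, 15)

weighted_am <- weighted.mean(x, w = f)   # sum(f * x) / sum(f)
simple_mode <- x[which.max(f)]           # value with the highest frequency
simple_median <- median(rep(x, times = f))

# Hypothetical grouped table: classes 0-10, 10-20, 20-30, 30-40
lower <- c(0, 10, 20, 30)   # lower class boundaries
width <- 10                 # common class width
g <- c(5, 8, 12, 5)         # class frequencies
N <- sum(g)
cum_g <- cumsum(g)

# Median class: first class whose cumulative frequency reaches N/2
m <- which(cum_g >= N / 2)[1]
cf_before <- if (m > 1) cum_g[m - 1] else 0
grouped_median <- lower[m] + width / g[m] * (N / 2 - cf_before)

# Modal class: class with the highest frequency
k <- which.max(g)
g_prev <- if (k > 1) g[k - 1] else 0
g_next <- if (k < length(g)) g[k + 1] else 0
grouped_mode <- lower[k] + width * (g[k] - g_prev) / (2 * g[k] - g_prev - g_next)

print(c(weighted_am, simple_mode, simple_median, grouped_median, grouped_mode))
```

The grouped mode formula presumes equal class widths, which is why the hypothetical table is used here instead of the 2-5, 5-7, 7-9 classes.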
Measures of Dispersion
• Range: Difference between maximum value of the variable and minimum value of the variable from the dataset
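• Standard Deviation: "Root-Mean-Square Deviation from the Mean"
With the simple A.M.: $\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
With the weighted A.M.: $\sigma = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i}}$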
• Quartile Deviation: The quartiles Q1, Q2 and Q3 divide the ordered dataset into four equal parts. The quartile deviation, (Q3 - Q1) / 2, estimates the spread of the distribution about the central measure.
• Inter-Quartile Range: Q3 - Q1, the range between the quartiles Q1 and Q3; it is used to detect outliers (Box Plot)
• Coefficient of Variation: (Standard Deviation / Mean) × 100
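A short sketch of these measures in R, using the vector 2, 4, 7, 8, 10 that appears in the R codes. It assumes the "root-mean-square deviation" definition divides by n; R's built-in `sd()` divides by n - 1 instead, so the two values differ. The quartiles use R's default `quantile()` method, which is one of several conventions.

```r
x <- c(2, 4, 7, 8, 10)
x_bar <- mean(x)

# Root-mean-square deviation from the mean (divides by n)
pop_sd <- sqrt(mean((x - x_bar)^2))

# Built-in sd() is the sample version (divides by n - 1)
sample_sd <- sd(x)

# Range as a single number: maximum minus minimum
rng <- diff(range(x))

# Quartiles, inter-quartile range and quartile deviation
q <- quantile(x, probs = c(0.25, 0.75))
iqr <- unname(q[2] - q[1])
qd <- iqr / 2

# Coefficient of variation, in percent
cv <- pop_sd / x_bar * 100

print(c(pop_sd = pop_sd, sample_sd = sample_sd, range = rng,
        iqr = iqr, qd = qd, cv = cv))
```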
R codes
• Creating a Vector:
> x <- c(2, 4, 7, 8, 10) # Quantitative values #
> y <- c("Yes", "No", "Yes", "No", "Yes") # Qualitative values; same length as x so both fit in one dataframe #
• Creating a dataframe:
> df <- data.frame(x, y)
• Creating Frequency Table:
> t <- table(dataframe$variable) # with one variable #
> t1 <- table(dataframe$var1, dataframe$var2) # with more than one variable #
• Creating Groups or Cut points:
> cutvariable <- cut(variable, breaks = c(10, 20, 30, 40), labels = c("A", "B", "C")) # right-closed intervals by default, e.g. 20 will fall in 10-20 #
> cutvariable <- cut(variable, breaks = c(10, 20, 30, 40), labels = c("A", "B", "C"), right = FALSE) # left-closed intervals, so 20 will fall in 20-30 #
• Creating Charts:
> barplot(t, main = "title", xlab = "x", ylab = "y", legend.text = rownames(t), col = rainbow(length(t)))
> pie(t)
> hist(dataframe$variable) # hist() needs the raw numeric values, not the frequency table #
> boxplot(dataframe$variable)
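To make the effect of `right = FALSE` in `cut()` concrete, here is a small runnable sketch. The vector `marks` is hypothetical; the break points and labels are the ones used in the slide.

```r
# Hypothetical values, two of them sitting exactly on a break point
marks <- c(15, 20, 25, 30, 35)

# Default: intervals are (10,20], (20,30], (30,40]
right_closed <- cut(marks, breaks = c(10, 20, 30, 40),
                    labels = c("A", "B", "C"))

# right = FALSE: intervals are [10,20), [20,30), [30,40)
left_closed <- cut(marks, breaks = c(10, 20, 30, 40),
                   labels = c("A", "B", "C"), right = FALSE)

# Side-by-side comparison of the group assigned to each value
print(data.frame(marks, right_closed, left_closed))
```

With the default setting a value equal to the lowest break (10 here) falls outside every interval and becomes NA unless `include.lowest = TRUE` is added.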
R codes
• Measures of Central Tendency:
> mean(dataframe$variable)
> median(dataframe$variable)
> t <- table(dataframe$variable)
> t[t == max(t)] # Gives the modal value(s) with the frequency; the mode has to be found from the frequency table #
• Measures of Dispersion:
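> sd(dataframe$variable) # Sample standard deviation (divides by n - 1) #
> range(dataframe$variable) # Gives the minimum and the maximum; diff(range(...)) gives their difference #
> quantile(dataframe$variable) # Gives the minimum, Q1, median (Q2), Q3 and the maximum #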
Probability
• Important concepts:
Trial: An experiment which can be conducted repeatedly
Event: An outcome of an experiment
Mutually Exclusive: Events that cannot occur simultaneously
Exhaustive: At least one of the events has to occur in every trial
Equally Likely: Every event has the same chance of occurrence
Union (∪): Events A union B means A or B = A + B = A ∪ B
Intersection (∩): Events A intersection B means A and B = A * B = A ∩ B
Complement: A′ is the event that A does not occur
• Classical definition:
If there are N mutually exclusive, exhaustive and equally likely events; and if N(A) of them are favorable to event A, then:
$P(A) = \frac{N(A)}{N}$
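The classical definition can be checked by direct counting. The sketch below assumes a single roll of a fair die; the events `A` (even number) and `B` (number greater than 3) and the helper `p()` are chosen only for illustration.

```r
# Sample space for one roll of a fair die: N = 6 equally likely outcomes
S <- 1:6
A <- c(2, 4, 6)   # event A: an even number
B <- c(4, 5, 6)   # event B: a number greater than 3

# Classical probability: favourable outcomes / total outcomes
p <- function(E) length(E) / length(S)

p_A <- p(A)
p_union <- p(union(A, B))           # A or B
p_intersect <- p(intersect(A, B))   # A and B
p_complement <- p(setdiff(S, A))    # A does not occur

print(c(p_A, p_union, p_intersect, p_complement))
```

Because A and B share the outcomes 4 and 6, they are not mutually exclusive, so the probability of the union is less than p(A) + p(B).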
Probability
• Properties:
Values of probability lie between 0 and 1
The probabilities of all the mutually exclusive and exhaustive events in the sample space sum to 1
P(A′) = 1 - P(A), where A′ is the complement of A
Addition Rule: A or B = A + B = A ∪ B
a. Mutually Exclusive events: P(A ∪ B) = P(A) + P(B)
b. Not Mutually Exclusive events: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
Multiplication Rule: A and B = A * B = A ∩ B
a. Independent Events: P(A ∩ B) = P(A) * P(B)
b. Dependent (Conditional) Events: P(A ∩ B) = P(B) * P(A | B)
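Bayes' Theorem: If event A can occur only together with one of N mutually exclusive and exhaustive events E1, E2, ..., EN, and A is known to have occurred, then the probability that it occurred with Ei is
$P(E_i \mid A) = \frac{P(E_i)\,P(A \mid E_i)}{\sum_{j=1}^{N} P(E_j)\,P(A \mid E_j)}$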
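A numerical sketch of Bayes' theorem in R. The scenario is hypothetical: three machines E1, E2, E3 produce 50%, 30% and 20% of the output, with defect rates of 1%, 2% and 3%, and A is the event that a randomly chosen item is defective. None of these figures come from the slides.

```r
# Hypothetical prior probabilities P(Ei): share of output per machine
prior <- c(E1 = 0.5, E2 = 0.3, E3 = 0.2)

# Hypothetical conditional probabilities P(A | Ei): defect rate per machine
likelihood <- c(E1 = 0.01, E2 = 0.02, E3 = 0.03)

# Total probability of A: sum over all Ei of P(Ei) * P(A | Ei)
p_A <- sum(prior * likelihood)

# Bayes' theorem: P(Ei | A) for every machine at once
posterior <- prior * likelihood / p_A

print(p_A)
print(posterior)
```

The posterior probabilities sum to 1, since a defective item must have come from exactly one of the three machines.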
Binomial Distribution
• Properties: It is a discrete probability distribution (used when there are repeated trials)
Every trial results in a success or a failure
Every trial is independent of the others
Probability of success is the same for every trial
pmf: $f(x) = \binom{n}{x}\,\theta^{x}\,(1-\theta)^{n-x}$, x = 0, 1, ..., n
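The binomial pmf can be evaluated by hand and compared with R's `dbinom()`. The values n = 10, θ = 0.3 and x = 3 below are assumptions made only for the example.

```r
n <- 10        # number of trials (hypothetical)
theta <- 0.3   # probability of success in each trial (hypothetical)
x <- 3         # number of successes of interest

# pmf written out: nCx * theta^x * (1 - theta)^(n - x)
manual <- choose(n, x) * theta^x * (1 - theta)^(n - x)

# Same quantity from the built-in density function
builtin <- dbinom(x, size = n, prob = theta)

# The pmf summed over all possible x = 0, ..., n
total <- sum(dbinom(0:n, size = n, prob = theta))

print(c(manual = manual, builtin = builtin, total = total))
```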
Poisson Distribution
• Properties: It is a discrete probability distribution (used when the number of trials becomes very large, tending towards infinity)
Limiting form of the Binomial distribution
The average occurrence of the event (λ) is known
No. of trials is generally very large and so is unknown
pmf: $f(x) = \frac{e^{-\lambda}\,\lambda^{x}}{x!}$, x = 0, 1, 2, ...
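The "limiting form" property can be seen numerically: with a large number of trials and a small success probability such that n × θ = λ, the binomial probabilities are close to the Poisson ones. The values λ = 2 and n = 10000 below are hypothetical.

```r
lambda <- 2        # average number of occurrences (hypothetical)
n <- 10000         # a large number of trials (hypothetical)
theta <- lambda / n
x <- 0:5

# Poisson pmf written out: exp(-lambda) * lambda^x / x!
manual <- exp(-lambda) * lambda^x / factorial(x)

# Built-in Poisson and binomial probabilities for the same x
pois <- dpois(x, lambda = lambda)
binom <- dbinom(x, size = n, prob = theta)

# Largest gap between the binomial and its Poisson approximation
max_gap <- max(abs(binom - pois))

print(round(cbind(x, binom, pois), 5))
print(max_gap)
```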
Normal Distribution
• Properties of Normal Distribution: Continuous Probability Distribution
Symmetrical curve with Skewness = 0
Infinite limits, extending from -∞ to +∞
Mean = Median = Mode
• Standard Normal (Z) Distribution: Continuous Probability Distribution
Symmetrical curve with Skewness = 0
Limits extend from -∞ to +∞, but about 99.7% of the area lies between -3 and +3
Mean = Median = Mode at z = 0
Standard Normal Density Function: $f(z) = \frac{1}{\sqrt{2\pi}}\,e^{-z^{2}/2}$
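A brief sketch linking a normal variable to the standard normal in R, using the standardisation z = (x - μ) / σ. The mean, standard deviation and x value are hypothetical.

```r
mu <- 7500      # hypothetical mean
sigma <- 1500   # hypothetical standard deviation
x <- 10000

# Standardise x to a z-score
z <- (x - mu) / sigma

# P(X <= x) computed directly and through the standard normal
p_direct <- pnorm(x, mean = mu, sd = sigma)
p_standard <- pnorm(z)

# Area under the standard normal curve between -3 and +3
within_3 <- pnorm(3) - pnorm(-3)

# Height of the standard normal density at its peak, z = 0
peak <- dnorm(0)

print(c(z = z, p_direct = p_direct, p_standard = p_standard,
        within_3 = within_3, peak = peak))
```

The area between -3 and +3 is close to, but not exactly, 1, which is why the curve is often drawn over that range even though its limits are infinite.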
R codes
• Binomial Distribution
> dbinom(12:24, size = n, prob = theta) # probability of each x from 12 to 24 #
> sum(dbinom(12:24, size = n, prob = theta)) # probability that x lies between 12 and 24 #
• Poisson Distribution
> dpois(x = 112:115, lambda = value)
> sum(dpois(x = 112:115, lambda = value))
• Normal Distribution:
> X <- pnorm(5000, mean = value, sd = value)
> Y <- pnorm(10000, mean = value, sd = value)
> Y - X # Probability between 5000 and 10000 #
* The values of x have been assumed to make the examples understandable
* By default pnorm() gives the lower-tail probability P(X ≤ x); add lower.tail = FALSE to get the upper tail
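The tail options can be sketched as follows. All parameter values (n = 30, θ = 0.5, mean 7500, sd 1500) are hypothetical stand-ins for the `n`, `theta` and `value` placeholders used in the codes.

```r
# Binomial: P(12 <= X <= 24) as a sum of point probabilities ...
n <- 30
theta <- 0.5
p_sum <- sum(dbinom(12:24, size = n, prob = theta))

# ... and the same probability from the cumulative function
p_cdf <- pbinom(24, size = n, prob = theta) - pbinom(11, size = n, prob = theta)

# Normal: lower tail P(X <= 10000) is the default
lower <- pnorm(10000, mean = 7500, sd = 1500)

# Upper tail P(X > 10000) with lower.tail = FALSE
upper <- pnorm(10000, mean = 7500, sd = 1500, lower.tail = FALSE)

# Probability between 5000 and 10000
between <- lower - pnorm(5000, mean = 7500, sd = 1500)

print(c(p_sum = p_sum, p_cdf = p_cdf, lower = lower,
        upper = upper, between = between))
```

The lower and upper tails always add up to 1, so either one can be obtained from the other.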