Introduction To R
▶ The "R" name is derived from the first letter of the names of its two developers,
Ross Ihaka and Robert Gentleman, who were associated with the University of
Auckland at the time.
▶ The initial version of R was released in 1995 to allow academic statisticians and
others with sophisticated programming skills to perform complex statistical data
analysis and display the results in a multitude of visual graphics.
▶ Commands can be anything from simple mathematical operators, including +, -, *,
and /, to more complicated functions that perform linear regressions and other
advanced calculations.
▶ Languages such as C++ require that an entire section of code be written,
compiled, and run to see results, but in R results can be seen after each
command, one at a time.
Types of Data in RStudio:
▶ The four main types of data most likely to be used in RStudio are:
1. Numeric data,
2. Character data (strings such as "x"),
3. Date data (time-based),
4. Logical data (true/false).
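The four types listed above can be checked with the class() function. The following is a minimal sketch; the variable names and values are illustrative assumptions, not part of the original notes.

```r
# One value of each basic type (names and values are illustrative)
score  <- 42.5                     # numeric data
grade  <- "x"                      # character data
joined <- as.Date("2024-01-15")    # date data
passed <- TRUE                     # logical data

# class() reports the type of each object
print(class(score))
print(class(grade))
print(class(joined))
print(class(passed))
```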
R Operators:
▶ Arithmetic operators: The R arithmetic operators allow us to do math operations,
like sums, divisions, or multiplications, among others.
▶ Logical/boolean operators: Boolean or logical operators in R are used to combine
multiple conditions between objects, such as and (&) and or (|). These expressions
return TRUE and FALSE values.
▶ Relational/comparison operators: Comparison or relational operators are designed
to compare objects: Greater than (>), Less than (<), Greater than or equal to
(>=), Less than or equal to (<=), Equal to (==), Not equal to (!=).
▶ Assignment operators in R: The assignment operators in R allow you to assign data
to a named object in order to store the data, for example left assignment (<-) and
right assignment (->).
Command to clear R Console
CTRL + L
Variable
A variable can be thought of as a labelled container used to store information.
Variables allow us to recall saved information for later use in calculations. Variables
can store many different things in R, from single values to tables of
information, images and graphs.
Defining and Assigning values to variables
Storing a value in a variable, or “assigning” it, is done using one of the operators
<-, = or ->. The name given to a variable should describe the information or data
being stored. This helps when revisiting old code or when sharing it with others.
>num1=2
>name="Aastha"
>Feepaid=TRUE
Assignment Operator: =, <-, ->
>num1=2
>num1
[1] 2
>num2<-4
>num2
[1] 4
>7->num3
>num3
[1] 7
Arithmetic Operators
Addition
>2+1
[1] 3
>2+6
[1] 8
Subtraction
>2-4
[1] -2
Multiplication
>2*9
[1] 18
Division
>2/7
[1] 0.2857143
Exponent/Power
>2^5
[1] 32
Use of Arithmetic Operators
>num1=15
>num2=5
>num1+num2
[1] 20
>num1-num2
[1] 10
>num1*num2
[1] 75
>num1/num2
[1] 3
Use of Relational Operators
>num1=10
>num2=4
>num1>num2
[1] TRUE
>num1<num2
[1] FALSE
>num1==num2
[1] FALSE
>num1!=num2
[1] TRUE
>num1>=num2
[1] TRUE
>num1<=num2
[1] FALSE
Use of Logical Operator: & (and)
>num1=9
>num2=4
>num1>5 & num2<6
[1] TRUE
> num1>5 & num2<3
[1] FALSE
> num1>15 & num2<15
[1] FALSE
> num1>20 & num2<2
[1] FALSE
Use of Logical Operator: | (or)
>num1=20
>num2=10
>num1>5 | num2<6
[1] TRUE
> num1>5 | num2<4
[1] TRUE
> num1>20 | num2<20
[1] TRUE
> num1>30 | num2<2
[1] FALSE
Vectors in R:
In R, a sequence of elements that share the same data type is known as a vector. A
single value such as 2 is itself a vector of length one; several items are combined
into one vector with the c() function.
>vec1=c(6,5,4,3,2,1)
>vec1
[1] 6 5 4 3 2 1
>class(vec1)
[1] "numeric"
>vec2=c("a","b","c","d","e")
>vec2
[1] "a" "b" "c" "d" "e"
>class(vec2)
[1] "character"
>vec3=c(F,T,F,T)
>vec3
[1] FALSE TRUE FALSE TRUE
>class(vec3)
[1] "logical"
>vec4=c(1,"a",4,"b",3)
>vec4
[1] "1" "a" "4" "b" "3"
>class(vec4)
[1] "character"
>vec5=c(1,T,2,F,3,F)
>vec5
[1] 1 1 2 0 3 0
>class(vec5)
[1] "numeric"
>vec6=c("d",T,"b",F,"c")
>vec6
[1] "d" "TRUE" "b" "FALSE" "c"
>class(vec6)
[1] "character"
>vec7=c(1,"a",T,2,"e",F)
>vec7
[1] "1" "a" "TRUE" "2" "e" "FALSE"
>class(vec7)
[1] "character"
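The mixed-type vectors above show implicit coercion: when types are mixed, logical values are converted to numeric, and everything is converted to character if any element is a character. The sketch below shows the explicit counterparts, as.character(), as.logical() and as.numeric(); the variable names are illustrative assumptions.

```r
nums <- c(0, 1, 2)

# Explicit conversion to character type
chars <- as.character(nums)

# Explicit conversion to logical type: 0 becomes FALSE, other numbers TRUE
flags <- as.logical(nums)

# Converting logical back to numeric: TRUE becomes 1, FALSE becomes 0
back <- as.numeric(flags)

print(class(chars))
print(class(flags))
print(class(back))
```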
Vector Arithmetic:
>vec1=c(1,2,3,4,5,6)
>vec2=c(1,1,1,1,1,1)
>vec1+vec2
[1] 2 3 4 5 6 7
>vec1-vec2
[1] 0 1 2 3 4 5
>vec1*vec2
[1] 1 2 3 4 5 6
>vec1/vec2
[1] 1 2 3 4 5 6
Vector indexing:
Individual elements are extracted with square brackets; for example, vec1[4] returns
the 4th element of vec1 and vec2[3] the 3rd element of vec2. Also, try finding the
length and class of a vector.
1. Write vec1[4] to extract the 4th element of the vector vec1.
2. To know the number of items in a vector, write length(vec1); R will show
the total number of items.
3. To know the class of a vector, use the command class(vec1); it shows
whether the vector is numeric, character, logical, etc.
R is case-sensitive: the functions are length() and class(), not Length() and Class().
>vec1=c(5,16,33,23,24,40,35,17)
>vec1[2]
[1] 16
>length(vec1)
[1] 8
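Square brackets accept more than a single position. The following sketch, which reuses the vec1 values from the example above, shows a few common forms of indexing; the result names are illustrative assumptions.

```r
vec1 <- c(5, 16, 33, 23, 24, 40, 35, 17)

first_three   <- vec1[1:3]        # elements 1 to 3
picked        <- vec1[c(2, 5)]    # elements 2 and 5
without_first <- vec1[-1]         # every element except the first
large         <- vec1[vec1 > 30]  # elements greater than 30

print(first_three)
print(picked)
print(without_first)
print(large)
```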
Lists in R:
Lists are R objects that can contain elements of different types, such as numbers,
strings, vectors, and other lists. A list is created using the list() function.
>l1=list(1,"a",TRUE)
>l1
[[1]]
[1] 1
[[2]]
[1] "a"
[[3]]
[1] TRUE
>class(l1[[1]])
[1] "numeric"
>class(l1[[2]])
[1] "character"
>class(l1[[3]])
[1] "logical"
List of Vectors
>l2=list(c(1,2,3),c("a","b","c"),c(T,F,T))
>l2
[[1]]
[1] 1 2 3
[[2]]
[1] "a" "b" "c"
[[3]]
[1] TRUE FALSE TRUE
>l2[[2]][1]
[1] "a"
>l2[[1]][3]
[1] 3
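List elements can also be given names, which makes them easier to retrieve than by position. This is a small sketch under assumed, illustrative names (student, marks, fee_paid, grade); it is not taken from the original notes.

```r
# A list with named elements of different types
student <- list(name = "Aastha", marks = c(8, 10, 12), fee_paid = TRUE)

student_name <- student$name          # access by name with $
second_mark  <- student[["marks"]][2] # access by name, then by position

# Adding a new element to an existing list (illustrative value)
student$grade <- "A"

print(names(student))
print(student_name)
print(second_mark)
```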
Matrices in R:
In R, a two-dimensional rectangular data set is known as a matrix. A matrix is created
by passing a vector to the matrix() function. On R matrices, we can perform
addition, subtraction, multiplication, and division.
In an R matrix, elements are arranged in a fixed number of rows and columns, and all
elements share the same data type (most often numeric).
>m1=matrix(c(6,5,4,3,2,1))
>m1
[,1]
[1,] 6
[2,] 5
[3,] 4
[4,] 3
[5,] 2
[6,] 1
>m1=matrix(c(1,2,3,4,5,6),nrow=2,ncol=3)
>m1
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
>m1=matrix(c(1,2,3,4,5,6),nrow=2,ncol=3,byrow=T)
>m1
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
>m1[1,2]
[1] 2
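The arithmetic operations mentioned above work element by element on matrices of the same size, while true matrix multiplication uses the %*% operator. The matrices below are illustrative assumptions chosen to keep the arithmetic easy to follow.

```r
# Two 2 x 2 matrices, filled column by column
m1 <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
m2 <- matrix(c(2, 2, 2, 2), nrow = 2, ncol = 2)

total          <- m1 + m2    # element-wise addition
difference     <- m1 - m2    # element-wise subtraction
elementwise    <- m1 * m2    # element-wise multiplication
quotient       <- m1 / m2    # element-wise division
matrix_product <- m1 %*% m2  # matrix multiplication (rows times columns)

print(total)
print(elementwise)
print(matrix_product)
```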
Arrays in R:
An array is a data structure that can hold multi-dimensional data. In R, array objects
can hold data in two or more dimensions. Arrays are also called vector structures. A
vector is an array of numbers with a single index while a matrix is an array of numbers
with two indices.
▶ Uni-dimensional arrays are called vectors with the length being their only
dimension.
▶ Two-dimensional arrays are called matrices, consisting of fixed numbers of rows
and columns.
▶ Arrays consist of all elements of the same data type.
▶ An array in R can be created with the use of array() function.
>vec1=c(1,2,3,4,5,6)
>vec2=c(7,8,9,10,11,12)
>a1=array(c(vec1,vec2),dim=c(2,3,2))
>a1
, , 1
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
, , 2
[,1] [,2] [,3]
[1,] 7 9 11
[2,] 8 10 12
>a1[1,2,1]
[1] 3
>a1[1,1,2]
[1] 7
>a1[2,3,2]
[1] 12
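Leaving an index empty selects everything along that dimension, so a whole layer or row can be extracted at once. The sketch below builds the same 2 x 3 x 2 layout as above from 1:12 (an assumption made for brevity; these are integers rather than the doubles used earlier).

```r
# 2 rows, 3 columns, 2 layers
a1 <- array(1:12, dim = c(2, 3, 2))

dims             <- dim(a1)      # size of each dimension
second_layer     <- a1[, , 2]    # the whole second layer, a 2 x 3 matrix
first_row_layer1 <- a1[1, , 1]   # first row of the first layer

print(dims)
print(second_layer)
print(first_row_layer1)
```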
Simple programming constructs such as if… else, for, while, break.
When we’re programming in R (or any other language, for that matter), we often
want to control when and how particular parts of our code are executed. We can do
that using control structures like if-else statements, for loops, and while loops.
▶ The IF Conditional Statement: Let’s say we’re watching a sports match that decides
which team makes the playoffs. We could visualize the possible outcomes using
this tree chart:
IF STATEMENT:
▶ As we can see in the tree chart, there are only two possible outcomes. If Team A
wins, they go to the playoffs. If Team B wins, then they go.
>teamA=5
>teamB=3
>if(teamA>teamB)
{print("Team A Wins")}
[1] "Team A Wins"
R prints "Team A Wins" because the condition is TRUE: 5 is greater than 3.
ELSE STATEMENT:
What if Team A had 1 goal and Team B had 3 goals? Our teamA>teamB
conditional would evaluate to FALSE. As a result, nothing would be printed if we
ran our code. Because the if statement evaluates to FALSE, the code block inside
the if statement is not executed. In this case, an else block supplies the code
to run instead.
>teamA=1
>teamB=3
>if(teamA>teamB)
{print("Team A Wins")}else{print("Team B wins")}
[1] "Team B wins"
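When there are more than two possible outcomes, conditions can be chained with else if. The sketch below extends the match example with a draw; the draw case and the result variable are assumptions added for illustration.

```r
teamA <- 2
teamB <- 2

# Conditions are checked in order; the first TRUE branch runs
if (teamA > teamB) {
  result <- "Team A Wins"
} else if (teamA < teamB) {
  result <- "Team B wins"
} else {
  result <- "Draw"
}

print(result)
```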
FOR LOOP:
It is a type of control statement that makes it easy to run a statement or a set
of statements multiple times. A for loop is commonly used to iterate over the
items of a sequence. R is case-sensitive, so the keyword is for, not For.
for(value in sequence){statement}
>for(val in 1:4){print(val)}
[1] 1
[1] 2
[1] 3
[1] 4
Program to print first 10 natural numbers
>for(val in 1:10){print(val)}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
Program to print square of first 10 natural numbers
>for(val in 1:10){print(val*val)}
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
[1] 36
[1] 49
[1] 64
[1] 81
[1] 100
Program to print table of a number:
>for(val in 1:9){print(2*val)}
[1] 2
[1] 4
[1] 6
[1] 8
[1] 10
[1] 12
[1] 14
[1] 16
[1] 18
Print Days of Week
>week=c("sunday","monday","tuesday","wednesday","thursday","friday","saturday")
>for (days in week) {print(days)}
[1] "sunday"
[1] "monday"
[1] "tuesday"
[1] "wednesday"
[1] "thursday"
[1] "friday"
[1] "saturday"
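A for loop can also build up a result across iterations instead of only printing. This is a minimal sketch that adds up the elements of a vector by position; the marks values and variable names are illustrative assumptions.

```r
marks <- c(8, 10, 12, 15, 20)

# Accumulate a running total, visiting each position of the vector
total <- 0
for (i in seq_along(marks)) {
  total <- total + marks[i]
}

print(total)
```

The built-in sum(marks) gives the same result; the loop is shown only to illustrate how a value is carried from one iteration to the next.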
WHILE LOOP:
It is a type of control statement which runs a statement or a set of statements
repeatedly until the given condition becomes false. A while loop in R is a close
cousin of the for loop in R. However, a while loop checks a logical condition
and keeps running the loop as long as the condition is true.
while(condition){statement}
▶ If the condition in the while loop in R is always true, the while loop will be an
infinite loop, and our program will never stop running.
Program to print first 10 natural numbers
>i=1
>while(i<=10)
{print(i)
i=i+1
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
▶ Example: Let’s take a team that’s starting the season with zero wins. They’ll need
to win 10 matches to make the playoffs. We can write a while loop to tell us
whether the team wins:
>wins=0
>while(wins<10)
{print("will not win")
wins=wins+1}
The loop runs until the condition becomes FALSE, that is, until wins reaches 10;
the message is therefore printed 10 times.
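To make the stopping point visible, the loop can count how many times its body runs and print a different message once the condition fails. This is a sketch of the same example; the counter and the closing message are assumptions added for illustration.

```r
wins <- 0
messages <- 0   # counts how many times the loop body runs

while (wins < 10) {
  print("will not win")
  wins <- wins + 1
  messages <- messages + 1
}

# Reached only after the condition wins < 10 has become FALSE
print("makes the playoffs")
```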
Break statement in R:
The break statement is used to terminate a loop immediately.
>i=1
>while(i<=10)
{print(i)
if(i==4)
break
i=i+1
}
[1] 1
[1] 2
[1] 3
[1] 4
Introduction to Data Frame
>fruits=data.frame(fruit_name=c("apple","banana","mango"),fruit_cost=c(100,200,300))
>fruits
fruit_name fruit_cost
1 apple 100
2 banana 200
3 mango 300
>fruits$fruit_cost
[1] 100 200 300
>fruits$fruit_name
[1] "apple" "banana" "mango"
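Beyond extracting a column with $, a data frame can be measured, filtered by a condition, and extended with new columns. The sketch below reuses the fruits data from above; the cost threshold and the discounted column are illustrative assumptions.

```r
fruits <- data.frame(
  fruit_name = c("apple", "banana", "mango"),
  fruit_cost = c(100, 200, 300)
)

n_rows <- nrow(fruits)   # number of rows
n_cols <- ncol(fruits)   # number of columns

# Keep only the rows where the cost is above 150 (illustrative threshold)
expensive <- fruits[fruits$fruit_cost > 150, ]

# Add a new column computed from an existing one (hypothetical 10% discount)
fruits$discounted <- fruits$fruit_cost * 0.9

print(expensive)
print(fruits)
```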
Summary statistics
• R provides a wide range of functions for obtaining summary statistics. One method of
obtaining descriptive statistics is the summary(x) function, where x is a vector or a
data frame.
• summary() is part of base R and needs no extra package. For further descriptive
statistics, the fBasics package can be installed in RStudio.
• summary() calculates the descriptive statistics of the whole data series. The values
calculated are:
o Mean
o Median
o Minimum
o Maximum
o 1st and 3rd quartile
>marks=c(8,10,12,15,20,7,6,5,8,3,2,12,8,9,7,15,8)
>summary(marks)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 7.000 8.000 9.118 12.000 20.000
>summary(iris$Sepal.Length)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.300 5.100 5.800 5.843 6.400 7.900
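Each value reported by summary() can also be computed on its own with a dedicated base R function. The sketch below uses the marks vector from above; the result names are illustrative assumptions.

```r
marks <- c(8, 10, 12, 15, 20, 7, 6, 5, 8, 3, 2, 12, 8, 9, 7, 15, 8)

avg       <- mean(marks)                            # arithmetic mean
mid       <- median(marks)                          # middle value
lowest    <- min(marks)                             # minimum
highest   <- max(marks)                             # maximum
quartiles <- quantile(marks, probs = c(0.25, 0.75)) # 1st and 3rd quartile

print(avg)
print(mid)
print(quartiles)
```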
Quick Plots
• To present data as a simple plot, we just need the command plot(x), where x is a
vector or a column of a data set.
• This draws a basic plot of the selected data, which looks like
this:
>marks=c(8,10,12,15,20,7,6,5,8,3,2,12,8,9,7,15,8)
>plot(marks)
>plot(iris$Sepal.Length)
Coloured Quick plots:-
• We can also get a coloured version of a plot by adding the col argument to the
command: plot(x, col=1)
• Colour codes: 1- Black
2- Red
3- green
4- Blue
5- Aqua
6- Pink
>marks=c(8,10,12,15,20,7,6,5,8,3,2,12,8,9,7,15,8)
>plot(marks,col=2)
Histogram
• A histogram is a graph that shows the frequency of numerical data using
rectangles.
• The height of a rectangle (the vertical axis) represents the distribution frequency of
a variable (the amount, or how often that variable appears).
• The width of the rectangle (horizontal axis) represents the value of the variable (for
instance, minutes, years, or ages).
• The histogram displays the distribution frequency as a two-dimensional figure,
meaning the height and width of columns or rectangles have particular meanings
and can vary. A bar chart is a one-dimensional figure. The height of its bars
represents something specific.
• To draw a histogram, use the command hist(x), where x is a numeric vector.
• To draw a coloured histogram, use the command hist(x, col="red").
• To add a label to the horizontal axis, add the argument xlab= to the above
command.
• To add a heading to the histogram, add the argument main= to the above
command.
>marks=c(8,10,12,15,20,7,6,5,8,3,2,12,8,9,7,15,8)
>hist(iris$Sepal.Length)
>hist(marks)
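The colour, axis label and heading options described above can be combined in a single call. This is a minimal sketch; the label and title text are illustrative assumptions. hist() also returns the bin counts, which can be inspected afterwards.

```r
marks <- c(8, 10, 12, 15, 20, 7, 6, 5, 8, 3, 2, 12, 8, 9, 7, 15, 8)

# Coloured histogram with an x-axis label and a heading
h <- hist(marks, col = "red", xlab = "Marks", main = "Histogram of Marks")

# The returned object stores the number of observations in each bin
print(h$counts)
```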
PIE CHARTS
A pie chart (or a circle chart) is a circular statistical graphic, which is divided into slices to
illustrate numerical proportions.
In a pie chart, the arc length of each slice (and consequently its central angle and area) is
proportional to the quantity it represents.
Pie charts are created with the function pie(x, labels=), where x is a non-negative numeric
vector indicating the area of each slice and labels= gives a character vector of names for
the slices.
>slices <- c(10, 12,4, 16, 8)
>lbls<- c("US", "UK", "Australia", "Germany", "France")
>pie(slices, labels = lbls, main="Pie Chart of Countries")
Z test in R
Hypothesis testing, also known as significance testing, is a statistical procedure
used to draw conclusions about a population from sample data. Two hypotheses are
proposed: the null hypothesis and the alternate hypothesis. Different tests are
used for hypothesis testing. The tests are categorized in two
ways:
Parametric Test: These tests make assumptions about the population
parameters. Some of the tests are Z Test, F test etc.
Non-Parametric Tests: These tests do not make any assumptions about the
population parameters.
What is Z Test?
The Z test is a popular parametric test used for hypothesis testing. It is a statistical
method used to determine whether there is a significant difference between a sample
mean and a population mean, or between the means of two samples. It is used when the
sample size is large and the population standard deviation is known. Under the null
hypothesis, the Z statistic follows the standard normal distribution. The Z value is
compared with a critical threshold; based on this comparison, we decide whether to
reject or fail to reject the null hypothesis. This test is applicable
where the sample size is greater than 30.
There are two types of Z tests based on samples:
One Sample Z-test
Two Sample Z-test
One Sample Z test
Here Z Test is applicable on one sample that has been taken from the population.
The formula is as follows:
Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}
Here,
Z denotes the Z value
\bar{X} is the sample mean
\mu denotes mean of the population
\sigma denotes population standard deviation
n denotes sample size.
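The one-sample formula can be evaluated directly in base R. The sketch below assumes the same sample, population mean (24) and population standard deviation (10) that the z.test example later in this document uses; the variable names are illustrative.

```r
sample_data <- c(26, 25, 10, 34, 30, 23, 28, 29, 25, 27)
mu0   <- 24   # population mean under the null hypothesis (assumed)
sigma <- 10   # known population standard deviation (assumed)

n     <- length(sample_data)   # sample size
x_bar <- mean(sample_data)     # sample mean

# Z = (sample mean - population mean) / (sigma / sqrt(n))
z <- (x_bar - mu0) / (sigma / sqrt(n))

print(z)
```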
Two sample Z test
Here the Z test is applied to two samples that have been taken from the population.
The formula is as follows:
Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
Here,
\bar{X}_1 and \bar{X}_2 are the sample means.
s_1 and s_2 are the standard deviations of the two samples.
n_1 and n_2 are the sample sizes of the two samples.
Application of Z-test
Z-test is applied when:
1. Population Standard Deviation is Known:
We use the Z-test when we know the standard deviation of the population and
are comparing a sample mean to a population mean or comparing the means of
two independent samples.
For example, you know the average height of a population and want to test whether
a sample of individuals has a significantly different average height.
2. Large Sample Size:
The Z-test is most reliable when dealing with large sample sizes (typically, n
> 30 is considered "large").
As the sample size increases, the sampling distribution of the sample mean
becomes approximately normal, according to the Central Limit Theorem.
Therefore, the Z-test becomes more appropriate as the sample size increases.
Z test in R
R is a popular high-level programming language used for statistical analysis. It is
an open-source language with a huge community, and users can
contribute to its development. It has a vast number of packages that
allow data miners to perform statistical analysis and data visualization in an
interactive manner.
The z.test function is not part of base R; it is provided by the BSDA package. Its
syntax is:
z.test(x, y, alternative='two.sided', mu=0, sigma.x=NULL,
sigma.y=NULL,conf.level=.95)
Now we can conduct one sample test and two sample tests in R.
Here we provide the vector(s) and also provide the value of standard deviation and
population mean whose hypothesis is to be tested against. Then we use z.test to
calculate the z value. This method provides a complete summary of the output.
The one sample test is as follows:
Here,
mu is the population mean under the null hypothesis.
sigma.x is the known population standard deviation.
# Load the BSDA package, which provides z.test()
library(BSDA)
# Sample data
sample_data <- c(26, 25, 10, 34, 30, 23, 28, 29, 25, 27)
# One-sample Z-test
z_test <- z.test(sample_data, mu = 24, sigma.x = 10)
# Print the result
print(z_test)
Output:
One-sample z-Test
data: sample_data
z = 0.53759, p-value = 0.5909
alternative hypothesis: true mean is not equal to 24
95 percent confidence interval:
19.50205 31.89795
sample estimates:
mean of x
25.7
The z.test function returns a test result object that includes the test statistic,
p-value, and other relevant information.
The output of the z test is:
Test Statistics (z): 0.53759
P-value: 0.5909
Alternative Hypothesis: The true mean is not equal to 24.
95% Confidence Interval: The confidence interval for the true mean is given
as (19.50205, 31.89795).
Sample Estimate (mean of x): 25.7
The p-value is 0.5909, which is greater than the chosen significance level (0.05,
matching conf.level = 0.95); hence, we fail to reject the null hypothesis. There is
not enough evidence to suggest that the true mean is different from 24 based on the
sample data. The 95% confidence interval provides a range of plausible values for
the true mean. Failing to reject the null hypothesis does not prove it true; it
only means that the data do not contradict it.
Now we will perform a two-sample Z-test:
# Two vectors of sample data
data1 <- c(27, 24, 18, 29, 30,27)
data2 <- c(23, 28, 20, 19, 35,23)
# Two-sample Z-test (mu is the hypothesized difference in means)
z_test_result <- z.test(data1,data2,mu=26,sigma.x=10,sigma.y=15)
# Print the result
print(z_test_result)
Output:
Two-sample z-Test
data: data1 and data2
z = -3.3742, p-value = 0.0007403
alternative hypothesis: true difference in means is not equal to 26
95 percent confidence interval:
-13.25828 15.59161
sample estimates:
mean of x mean of y
25.83333 24.66667
The output of the two-sample z-test comparing two independent samples:
Test Statistic (z): -3.3742
P-value: 0.0007403
Alternative Hypothesis: The true difference in means is not equal to 26.
95% Confidence Interval: (-13.25828, 15.59161)
Sample Estimates:
o Mean of Group 1 (data1): 25.83333
o Mean of Group 2 (data2): 24.66667
From the above output we can see that the z-value is negative, and the p-value is
very small (0.0007403 < 0.05). Here mu=26 sets the hypothesized difference in means
to 26, while the observed difference is only about 1.17. So there is sufficient
evidence to reject the null hypothesis that the difference in means equals 26, and
we accept the alternate hypothesis.
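The z statistic and p-value reported above can be cross-checked in base R without any package. The sketch below assumes, as in the z.test call, known standard deviations of 10 and 15 and a hypothesized difference of 26; the variable names are illustrative.

```r
data1 <- c(27, 24, 18, 29, 30, 27)
data2 <- c(23, 28, 20, 19, 35, 23)
sigma_x <- 10   # assumed known standard deviation for data1
sigma_y <- 15   # assumed known standard deviation for data2
delta0  <- 26   # hypothesized difference in means

# Standard error of the difference between the two sample means
se <- sqrt(sigma_x^2 / length(data1) + sigma_y^2 / length(data2))

# Z statistic for the observed difference against the hypothesized one
z <- (mean(data1) - mean(data2) - delta0) / se

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z))

print(z)
print(p_value)
```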
Code-1
If function in R (condition function in R)
Code-2
Condition function in R
Code-3
Conversion of vector into logical and character type
- In R, a sequence of elements that share the same data type is known as a vector. A
single value such as 2 is itself a vector of length one; several items are combined
into one vector with the c() function.
Code-4
List of R function
- Lists are the R objects which contain elements of different types like − numbers,
strings, vectors, and other lists inside them.
Code-5
Loop in function in R
- Loop is a type of control statement that enables one to easily construct a loop that has to
run statements or a set of statements multiple times. For loop is commonly used to iterate
over items of a sequence.
- Loop with number and vector
Code-6
Array in R
- An array is a data structure that can hold multi-dimensional data. In R, array objects
can hold data in two or more dimensions. Arrays are also called vector structures.
Code-7
Matrix function in R
- In R, a two-dimensional rectangular data set is known as a matrix. A matrix is created
with the help of the vector input to the matrix function.
Code-8
Codes for factor in R
Code-9
Pie Chart
- A pie chart (or a circle chart) is a circular statistical graphic, which is divided into slices
to illustrate numerical proportions. In a pie chart, the arc length of each slice (and
consequently its central angle and area) is proportional to the quantity it represents.
Code-10
Histogram
- A histogram is a graph that shows the frequency of numerical data using
rectangles. The histogram displays the distribution frequency as a two-
dimensional figure, meaning the height and width of columns or rectangles have
particular meanings and can vary. A bar chart is a one-dimensional figure. The
height of its bars represents something specific