R Programming

The document outlines a series of exercises related to data manipulation and visualization using R, including importing data, performing matrix and dataframe operations, creating various types of graphs, and applying min-max normalization. Each exercise includes a clear aim, step-by-step procedures, and R code examples. The document concludes with successful execution results for each exercise.

Uploaded by

systemev206hql

CONTENTS

EX. NO   DATE   PROGRAM NAME                                  SIGNATURE
  01            Import Data from Other Formats
  02            Matrix Operations
  03            Dataframe Operations
  04            Different Types of Graphs
  05            Min-Max Normalization for a Dataset
  06            Mean, Median, Standard Deviation and t-test
  07            Handling Missing Values
  08            Statistical Correlation Test
  09            Data Transformation Operations
  10            Explore the Distribution of a Dataset
EX. NO: 01
DATE: IMPORT DATA FROM docx, xls, txt AND OTHER FORMATS

Aim:
To import data from docx, xls, txt and other formats.
Procedure:
Step 1 : Start the process
Step 2 : Determine the format of the data file that needs to be imported
(Excel, SPSS, text, or CSV).
Step 3 : Install and load the necessary R packages based on the format of the
data file (readxl, XLConnect, foreign, or readr).
Step 4 : Open the data file in its respective software and save it with
an appropriate file name and extension.
Step 5 : Use the respective R function (read_excel, read.spss, read_table2,
or read_csv) to import the data into R as a data frame.
Step 6 : Assign the imported data to a variable for further analysis.
Step 7 : Check the imported data using the print or head functions to ensure it
is correctly imported.
Step 8 : If required, clean and preprocess the imported data for further
analysis.
Step 9: Stop the process.
CODE:
Importing Excel File
library(readxl) # load the readxl package
help(read_excel) # documentation
mydata = read_excel("[Link]") # read from first sheet

Importing SPSS File


library(foreign) # load the foreign package
help(read.spss) # documentation
mydata = read.spss("myfile", to.data.frame=TRUE)

Importing Text File with Tables


library(readr)
mydata = read_table2("[Link]") # read text file
mydata # print data frame

Importing CSV File


mydata = read_csv("[Link]") # read csv file
mydata
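If the readr or readxl packages are unavailable, base R covers the CSV case with no extra dependencies. A minimal sketch (the temporary file and its columns are illustrative, not from the exercise):

```r
# Write a small CSV to a temporary file, then read it back with base R.
# read.csv() needs no packages and returns a data frame.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, name = c("a", "b", "c")), path, row.names = FALSE)

mydata <- read.csv(path)   # base-R counterpart of readr::read_csv
head(mydata)               # inspect the first rows, as in Step 7
```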

Result:
Thus the above program has been executed successfully.
EX. NO: 02
DATE: MATRIX OPERATIONS

Aim:
To perform matrix operations using R.
Procedure:
Step 1: Create three matrices P with different dimensions and values, each with
specified row and column names.
Step 2: Print each matrix P.
Step 3: Create matrices A and B with specific dimensions and values.
Step 4: Print matrices A and B.
Step 5: Perform arithmetic operations (addition, subtraction, multiplication,
division) between matrices A and B, and print the results.
Step 6: Create a matrix P with specific dimensions and values.
Step 7: Print matrix P.
Step 8: Generate random matrix data, find the indices of its maximum and
minimum values, and print them.
CODE:
2a.
> rownames = c("row1", "row2", "row3", "row4","row5")
> colnames = c("col1", "col2", "col3","col4")
> P= matrix(c(3:22), nrow = 5, byrow = TRUE, dimnames = list(rownames,
+ colnames))
> print(P)
col1 col2 col3 col4
row1 3 4 5 6
row2 7 8 9 10
row3 11 12 13 14
row4 15 16 17 18
row5 19 20 21 22
> rownames = c("row1", "row2", "row3")
> colnames = c("col1", "col2", "col3")
> P= matrix(c(3:11), nrow = 3, byrow = TRUE, dimnames = list(rownames,
+ colnames))
> print(P)
col1 col2 col3
row1 3 4 5
row2 6 7 8
row3 9 10 11
> rownames = c("row1", "row2")
> colnames = c("col1", "col2")
> P= matrix(c(3:6), nrow = 2, byrow = FALSE, dimnames = list(rownames,
+ colnames))
> print(P)
col1 col2
row1 3 5
row2 4 6
>

2b.
A= matrix(c(3:8), nrow = 2, byrow = TRUE)
> print(A)
[,1] [,2] [,3]
[1,] 3 4 5
[2,] 6 7 8
> B= matrix(c(3:8), nrow = 2, byrow = FALSE)
> print(B)
[,1] [,2] [,3]
[1,] 3 5 7
[2,] 4 6 8
> result=A+B
> result
[,1] [,2] [,3]
[1,] 6 9 12
[2,] 10 13 16
> result=A-B
> result
[,1] [,2] [,3]
[1,] 0 -1 -2
[2,] 2 1 0
> result=A*B
> result
[,1] [,2] [,3]
[1,] 9 20 35
[2,] 24 42 64
> result=A/B
> result
[,1] [,2] [,3]
[1,] 1.0 0.800000 0.7142857
[2,] 1.5 1.166667 1.0000000
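Note that `*` above multiplies the matrices element by element; true matrix multiplication in R is the `%*%` operator, which requires the inner dimensions to agree. A short sketch using the same A and B:

```r
A <- matrix(3:8, nrow = 2, byrow = TRUE)   # 2 x 3
B <- matrix(3:8, nrow = 2, byrow = FALSE)  # 2 x 3

A * B        # element-wise: both matrices must have the same shape
A %*% t(B)   # matrix product: (2 x 3) %*% (3 x 2) gives a 2 x 2 result
```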

2c.
P= matrix(c(3:18), nrow = 4, byrow = FALSE)
> print(P)
[,1] [,2] [,3] [,4]
[1,] 3 7 11 15
[2,] 4 8 12 16
[3,] 5 9 13 17
[4,] 6 10 14 18
> P[2,3]
[1] 12
> P[3,]
[1] 5 9 13 17
> P[,4]
[1] 15 16 17 18

2d.
x = matrix(1:6,nrow=2, ncol=3)
> y = matrix(13:21,nrow=3, ncol=3)
> print(x)
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
> y
[,1] [,2] [,3]
[1,] 13 16 19
[2,] 14 17 20
[3,] 15 18 21
> mat=rbind(x,y)
> mat
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
[3,] 13 16 19
[4,] 14 17 20
[5,] 15 18 21
2e.
set.seed(123)
> matrix_data= matrix(sample(1:100, 25,replace=TRUE), nrow = 5, ncol = 5)
> matrix_data
[,1] [,2] [,3] [,4] [,5]
[1,] 31 42 90 92 26
[2,] 79 50 91 9 7
[3,] 51 43 69 93 42
[4,] 14 14 91 99 9
[5,] 67 25 57 72 83
> max_index =which(matrix_data == max(matrix_data), arr.ind = TRUE)
> min_index =which(matrix_data == min(matrix_data), arr.ind = TRUE)
> max_index
row col
[1,] 4 4
> min_index
row col
[1,] 2 5
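For a single maximum, which.max() is a common alternative: it returns a linear (column-major) index, and arrayInd() converts that back to a row/column pair. A sketch on a matrix built the same way:

```r
set.seed(123)  # same seeding idea as above, for reproducibility
m <- matrix(sample(1:100, 25, replace = TRUE), nrow = 5, ncol = 5)

lin <- which.max(m)      # position counting down the columns
arrayInd(lin, dim(m))    # convert to a [row, col] pair
```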


Result:
Thus the above program has been executed successfully.
EX. NO: 03
DATE: DATAFRAME OPERATIONS

Aim:
To perform dataframe operations using R.

Procedure:
Step 1: Create a dataframe emp.data with employee details including ID, name,
salary, and start date.
Step 2: Print the dataframe emp.data and its structure using print() and str()
functions.
Step 3: Print summary statistics of the dataframe using summary() and check
the class of the dataframe using class().
Step 4: Extract the column 'emp_name' from the dataframe and store it in a new
dataframe result. Print result.
Step 5: Add a new column 'dept' to the dataframe emp.data with department
names. Print the updated dataframe.
Step 6: Create a new row with employee details and append it to the dataframe
emp.data. Print the updated dataframe with the new row added.
CODE:
3a.
emp.data = data.frame(
+ emp_id = c (1:5),
+ emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
+ salary = c(623.3,515.2,611.0,729.0,843.25),
+ start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
+ "2015-03-27")),
+ stringsAsFactors = FALSE)
> print(emp.data)
emp_id emp_name salary start_date
1 1 Rick 623.30 2012-01-01
2 2 Dan 515.20 2013-09-23
3 3 Michelle 611.00 2014-11-15
4 4 Ryan 729.00 2014-05-11
5 5 Gary 843.25 2015-03-27
> str(emp.data)
'data.frame': 5 obs. of 4 variables:
$ emp_id : int 1 2 3 4 5
$ emp_name : chr "Rick" "Dan" "Michelle" "Ryan" ...
$ salary : num 623 515 611 729 843
$ start_date: Date, format: "2012-01-01" "2013-09-23" ...

3b.
print("summary")
[1] "summary"
> summary(emp.data)
emp_id emp_name salary start_date
Min. :1 Length:5 Min. :515.2 Min. :2012-01-01
1st Qu.:2 Class :character 1st Qu.:611.0 1st Qu.:2013-09-23
Median :3 Mode :character Median :623.3 Median :2014-05-11
Mean :3 Mean :664.4 Mean :2014-01-14
3rd Qu.:4 3rd Qu.:729.0 3rd Qu.:2014-11-15
Max. :5 Max. :843.2 Max. :2015-03-27
dept
Length:5
Class :character
Mode :character

> print("nature of df")


[1] "nature of df"
> class(df)
[1] "function"
Note: class(df) returns "function" here because no object named df exists; the name df matches base R's F-distribution density function. class(emp.data) would return "data.frame".
3c.
result <- data.frame(emp.data$emp_name)
> print(result)
emp.data.emp_name
1 Rick
2 Dan
3 Michelle
4 Ryan
5 Gary

3d.
emp.data$dept <- c("IT","Operations","IT","HR","Finance")
> v <- emp.data
> print(v)
emp_id emp_name salary start_date dept
1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
> #####adding new row#########
> v = c(6,"Anand",999.00,"2024-01-05","Business")
> new_df=rbind(emp.data,v)
> print(new_df)
emp_id emp_name salary start_date dept
1 1 Rick 623.3 2012-01-01 IT
2 2 Dan 515.2 2013-09-23 Operations
3 3 Michelle 611 2014-11-15 IT
4 4 Ryan 729 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
6 6 Anand 999 2024-01-05 Business
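One caveat with the rbind() call above: v is a character vector, so every column of new_df, including emp_id and salary, is silently coerced to character. Appending a one-row data frame instead preserves the column types. A sketch with a reduced version of the same data:

```r
# Two-row stand-in for emp.data, numeric salary column
emp.data <- data.frame(
  emp_id = 1:2,
  emp_name = c("Rick", "Dan"),
  salary = c(623.30, 515.20),
  stringsAsFactors = FALSE)

# Append a one-row data frame rather than a character vector
new_row <- data.frame(emp_id = 3, emp_name = "Anand", salary = 999.00)
new_df <- rbind(emp.data, new_row)

str(new_df$salary)   # still numeric, not coerced to character
```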

Result:
Thus the above program has been executed successfully.
EX. NO: 04
DATE: DIFFERENT TYPES OF GRAPHS

Aim:
To demonstrate different types of graphs using ggplot for a dataset.
Procedure:
Scatter Plot:
Plot Sepal Width against Sepal Length, color points by Species, and use a
minimalistic theme.

Box Plot:
Create a box plot of Petal Length by Species, color boxes by Species, and use a
minimalistic theme (optionally remove legend).

Histogram:
Generate a histogram of Sepal Length, fill bars by Species, and use a
minimalistic theme.

Heatmap:
Construct a heatmap of Sepal Length against Petal Length, fill cells by Species,
and use a minimalistic theme.

Bar Chart:
Plot the mean Petal Length by Species as a bar chart, color bars by Species, and
use a minimalistic theme.
CODE:
Scatter Plot:
ggplot(data=iris,aes(x=Sepal.Length, y=Sepal.Width,color=Species)) +
geom_point() +
theme_minimal()

Box Plot:
ggplot(data=iris,aes(x=Species, y=Petal.Length,color=Species)) +
geom_boxplot() +
theme_minimal() +
theme(legend.position="none")

Histogram:
ggplot(data=iris,aes(x=Sepal.Length,fill=Species)) +
geom_histogram() +
theme_minimal()
Heat Map:
ggplot(data=iris,aes(x=Sepal.Length,y=Petal.Length,fill=Species)) +
geom_bin2d() +
theme_minimal()
Bar Chart:
ggplot(data=iris,aes(x=Species,y=Petal.Length,fill=Species)) +
geom_bar(stat="summary", fun="mean") +
theme_minimal()

Result:
Thus the above program has been executed successfully.
EX. NO: 05
DATE: MIN-MAX NORMALIZATION FOR A DATASET

Aim:
To perform min-max normalization on a dataset and show the result using
ggplot.
Procedure:
Step 1: Load the ggplot2 library and the iris dataset.
Step 2: Examine the structure of the iris dataset.
Step 3: Create a scatter plot using Petal Width and Petal Length, color points by
species.
Step 4: Normalize the numerical columns of the iris dataset to a range of [0,1]
and retain the species information.
Step 5: Examine the structure and summary statistics of the normalized iris
dataset.
Step 6: Create another scatter plot using the normalized Petal Width and Petal
Length, color points by species.
CODE:
> library(ggplot2)
> data("iris")
> str(iris)
'data.frame': 150 obs. of 5 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
> ggplot(data = iris, aes(x = Petal.Width, y = Petal.Length, color = Species)) +
+ geom_point()
> iris_norm <- as.data.frame(apply(iris[, 1:4], 2, function(x) (x - min(x))/(max(x) - min(x))))
> iris_norm$Species <- iris$Species
> str(iris_norm)
'data.frame': 150 obs. of 5 variables:
 $ Sepal.Length: num 0.2222 0.1667 0.1111 0.0833 0.1944 ...
 $ Sepal.Width : num 0.625 0.417 0.5 0.458 0.667 ...
 $ Petal.Length: num 0.0678 0.0678 0.0508 0.0847 0.0678 ...
 $ Petal.Width : num 0.0417 0.0417 0.0417 0.0417 0.0417 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
> summary(iris_norm)
  Sepal.Length     Sepal.Width     Petal.Length     Petal.Width
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
1st Qu.:0.2222 1st Qu.:0.3333 1st Qu.:0.1017 1st Qu.:0.08333
Median :0.4167 Median :0.4167 Median :0.5678 Median :0.50000
Mean :0.4287 Mean :0.4406 Mean :0.4675 Mean :0.45806
3rd Qu.:0.5833 3rd Qu.:0.5417 3rd Qu.:0.6949 3rd Qu.:0.70833
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
Species
setosa :50
versicolor:50
virginica :50

> ggplot(data = iris_norm, aes(x = Petal.Width, y = Petal.Length, color = Species)) +
+ geom_point()
Before Normalization:

After Normalization:
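As a quick sanity check on the normalization above, every numeric column of a min-max normalized dataset should span exactly [0, 1]. A sketch using the same formula:

```r
# Min-max normalization: maps the smallest value to 0 and the largest to 1
minmax <- function(x) (x - min(x)) / (max(x) - min(x))
iris_norm <- as.data.frame(apply(iris[, 1:4], 2, minmax))

sapply(iris_norm, range)   # each column should report 0 and 1
```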
Result:
Thus the above program has been executed successfully.
EX. NO: 06
DATE: MEAN, MEDIAN, STANDARD DEVIATION AND t-TEST

Aim:
To calculate mean, median, standard deviation and t-test in a dataset.
Procedure:
Step 1: Load the dplyr library and the iris dataset.
Step 2: Calculate summary statistics (mean, median, and standard deviation) for
Sepal Length across all species using dplyr's summarise() function with across()
and list() functions.
Step 3: Perform a t-test comparing the Sepal Length between the "setosa" and
"versicolor" species.
Step 4: Store Sepal Length data for "setosa" and "versicolor" species in separate
vectors, x and y.
Step 5: Conduct the t-test between vectors x and y using the t.test() function.
Step 6: Print the results of the t-test.
CODE:

library(dplyr)
data("iris")
> iris %>%
+ summarise(across(Sepal.Length, list(mean = mean, median = median, sd = sd)))
Sepal.Length_mean Sepal.Length_median Sepal.Length_sd
1 5.843333 5.8 0.8280661
> #t test
> x <- iris[iris$Species == "setosa", ]$Sepal.Length
> y <- iris[iris$Species == "versicolor", ]$Sepal.Length
> tt <- t.test(x, y)
> tt

Welch Two Sample t-test

data: x and y
t = -10.521, df = 86.538, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.1057074 -0.7542926
sample estimates:
mean of x mean of y
5.006 5.936
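The Welch t statistic reported above can be reproduced by hand from its formula, t = (mean(x) - mean(y)) / sqrt(var(x)/n_x + var(y)/n_y), which makes a useful cross-check on the t.test() output:

```r
x <- iris[iris$Species == "setosa", ]$Sepal.Length
y <- iris[iris$Species == "versicolor", ]$Sepal.Length

# Welch t statistic computed directly from its definition
t_manual <- (mean(x) - mean(y)) / sqrt(var(x)/length(x) + var(y)/length(y))
t_manual   # matches the t value reported by t.test(x, y)
```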

Result:
Thus the above program has been executed successfully.
EX. NO: 07
DATE: HANDLING MISSING VALUES

Aim:
To handle the missing values in a dataset.
Procedure:
Step 1: Load the "airquality" dataset.
Step 2: Convert the "airquality" dataset into a dataframe called "df".
Step 3: Print a summary of the dataframe "df".
Step 4: Calculate the number of missing values in the "Ozone" column of the
dataframe.
Step 5: Compute the mean of the "Ozone" column, excluding missing values.
Step 6: Replace missing values in the "Ozone" column with the computed mean.
Step 7: Recalculate the number of missing values in the "Ozone" column to
confirm replacement.
CODE:
> df=as.data.frame(airquality)
> summary(df)
     Ozone           Solar.R           Wind             Temp           Month
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   Min.   :5.000
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00   1st Qu.:6.000
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   Median :7.000
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88   Mean   :6.993
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00   3rd Qu.:8.000
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00   Max.   :9.000
 NA's   :37       NA's   :7
      Day
 Min.   : 1.0
 1st Qu.: 8.0
 Median :16.0
 Mean   :15.8
 3rd Qu.:23.0
 Max.   :31.0

> sum(is.na(df$Ozone))
[1] 37
> mean(df$Ozone, na.rm=TRUE)
[1] 42.12931
> df$Ozone[is.na(df$Ozone)] = mean(df$Ozone, na.rm=TRUE)
> sum(is.na(df$Ozone))
[1] 0
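Mean imputation is sensitive to outliers; replacing missing values with the median is a common, more robust alternative. The same replacement pattern, sketched with the median instead:

```r
df <- as.data.frame(airquality)         # fresh copy with the 37 missing Ozone values

med <- median(df$Ozone, na.rm = TRUE)   # median, ignoring NAs
df$Ozone[is.na(df$Ozone)] <- med        # fill the gaps

sum(is.na(df$Ozone))                    # no missing values remain
```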

Result:
Thus the above program has been executed successfully.
EX. NO: 08
DATE: STATISTICAL CORRELATION TEST

Aim:
To perform statistical correlation test for comparing two variables.

Procedure:
Step 1: Load the iris dataset using data(iris).
Step 2: Assign the "Sepal.Length" column of the iris dataset to the variable
variable1.
Step 3: Assign the "Petal.Length" column of the iris dataset to the variable
variable2.
Step 4: Perform a correlation test between variable1 and variable2 using the
cor.test() function.
Step 5: Store the results of the correlation test in the variable correlation_test.
Step 6: Print the results of the correlation test using the print() function.
CODE:
> data(iris)
> variable1 <- iris$Sepal.Length
> variable2 <- iris$Petal.Length
> correlation_test <- cor.test(variable1, variable2)
> print(correlation_test)

Pearson's product-moment correlation

data: variable1 and variable2


t = 21.646, df = 148, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8270363 0.9055080
sample estimates:
cor
0.8717538
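The sample estimate above is simply the Pearson correlation coefficient; cor() returns the same number without the test machinery, which makes for an easy cross-check:

```r
# Pearson correlation (the default method for cor)
r <- cor(iris$Sepal.Length, iris$Petal.Length)
r   # about 0.872, matching the cor.test estimate
```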

Result:
Thus the above program has been executed successfully.
EX. NO: 09
DATE: DATA TRANSFORMATION OPERATIONS

Aim:
To perform various data transformation operations using filter, arrange,
select, mutate, summarize functions.
Procedure:

Step 1: Load the dplyr library and the iris dataset.
Step 2: Filter the iris dataset to include only rows where the species is 'virginica'
and Sepal Length is greater than 7.
Step 3: Filter the 'versicolor' rows and arrange them in descending order of
Sepal Length.
Step 4: Select the "Species" column from the iris dataset and list its distinct
values.
Step 5: Add a new column "SepalRatio" to the iris dataset, calculated as the
ratio of Sepal Length to Sepal Width.
Step 6: Summarize the Sepal Length column in the iris dataset, calculating mean,
median, and standard deviation.
CODE:
library(dplyr)
> data("iris")
> iris %>% filter(Species == 'virginica' & Sepal.Length>7)
   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1 7.1 3.0 5.9 2.1 virginica
2 7.6 3.0 6.6 2.1 virginica
3 7.3 2.9 6.3 1.8 virginica
4 7.2 3.6 6.1 2.5 virginica
5 7.7 3.8 6.7 2.2 virginica
6 7.7 2.6 6.9 2.3 virginica
7 7.7 2.8 6.7 2.0 virginica
8 7.2 3.2 6.0 1.8 virginica
9 7.2 3.0 5.8 1.6 virginica
10 7.4 2.8 6.1 1.9 virginica
11 7.9 3.8 6.4 2.0 virginica
12 7.7 3.0 6.1 2.3 virginica
> iris %>% filter(Species == "versicolor") %>% arrange(desc(Sepal.Length))
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1 7.0 3.2 4.7 1.4 versicolor
2 6.9 3.1 4.9 1.5 versicolor
3 6.8 2.8 4.8 1.4 versicolor
4 6.7 3.1 4.4 1.4 versicolor
5 6.7 3.0 5.0 1.7 versicolor
6 6.7 3.1 4.7 1.5 versicolor
7 6.6 2.9 4.6 1.3 versicolor
8 6.6 3.0 4.4 1.4 versicolor
9 6.5 2.8 4.6 1.5 versicolor
10 6.4 3.2 4.5 1.5 versicolor
11 6.4 2.9 4.3 1.3 versicolor
12 6.3 3.3 4.7 1.6 versicolor
13 6.3 2.5 4.9 1.5 versicolor
14 6.3 2.3 4.4 1.3 versicolor
15 6.2 2.2 4.5 1.5 versicolor
16 6.2 2.9 4.3 1.3 versicolor
17 6.1 2.9 4.7 1.4 versicolor
18 6.1 2.8 4.0 1.3 versicolor
19 6.1 2.8 4.7 1.2 versicolor
20 6.1 3.0 4.6 1.4 versicolor
21 6.0 2.2 4.0 1.0 versicolor
22 6.0 2.9 4.5 1.5 versicolor
23 6.0 2.7 5.1 1.6 versicolor
24 6.0 3.4 4.5 1.6 versicolor
25 5.9 3.0 4.2 1.5 versicolor
26 5.9 3.2 4.8 1.8 versicolor
27 5.8 2.7 4.1 1.0 versicolor
28 5.8 2.7 3.9 1.2 versicolor
29 5.8 2.6 4.0 1.2 versicolor
30 5.7 2.8 4.5 1.3 versicolor
31 5.7 2.6 3.5 1.0 versicolor
32 5.7 3.0 4.2 1.2 versicolor
33 5.7 2.9 4.2 1.3 versicolor
34 5.7 2.8 4.1 1.3 versicolor
35 5.6 2.9 3.6 1.3 versicolor
36 5.6 3.0 4.5 1.5 versicolor
37 5.6 2.5 3.9 1.1 versicolor
38 5.6 3.0 4.1 1.3 versicolor
39 5.6 2.7 4.2 1.3 versicolor
40 5.5 2.3 4.0 1.3 versicolor
41 5.5 2.4 3.8 1.1 versicolor
42 5.5 2.4 3.7 1.0 versicolor
43 5.5 2.5 4.0 1.3 versicolor
44 5.5 2.6 4.4 1.2 versicolor
45 5.4 3.0 4.5 1.5 versicolor
46 5.2 2.7 3.9 1.4 versicolor
47 5.1 2.5 3.0 1.1 versicolor
48 5.0 2.0 3.5 1.0 versicolor
49 5.0 2.3 3.3 1.0 versicolor
50 4.9 2.4 3.3 1.0 versicolor
> iris %>% select(Species) %>% distinct()
Species
1 setosa
2 versicolor
3 virginica
> iris_with_new_column <- iris %>% mutate(SepalRatio = Sepal.Length / Sepal.Width)
> head(iris_with_new_column)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species SepalRatio
1 5.1 3.5 1.4 0.2 setosa 1.457143
2 4.9 3.0 1.4 0.2 setosa 1.633333
3 4.7 3.2 1.3 0.2 setosa 1.468750
4 4.6 3.1 1.5 0.2 setosa 1.483871
5 5.0 3.6 1.4 0.2 setosa 1.388889
6 5.4 3.9 1.7 0.4 setosa 1.384615
> iris %>% summarise(across(Sepal.Length, list(mean = mean, median = median, sd = sd)))
Sepal.Length_mean Sepal.Length_median Sepal.Length_sd
1 5.843333 5.8 0.8280661
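The summarise() call above collapses the whole dataset to one row; per-group statistics are also common. The same group-wise mean can be computed in base R with aggregate(), sketched here to avoid extra dependencies:

```r
# Mean Sepal.Length for each species, base R only
agg <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
agg   # one row per species
```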

Result:
Thus the above program has been executed successfully.
EX. NO: 10
DATE: EXPLORE THE DISTRIBUTION OF DATASET

Aim:
To explore the distribution of variables in a dataset.
Procedure:
Step 1: Load the ggplot2 library.
Step 2: Load the diamonds dataset.
Step 3: Plot histograms for x, y, and z variables:
- Use ggplot2 to create histograms for each variable.
- Set the binwidth to 0.5 for x, y, and z histograms.
- Customize colors for better visualization.
- Add titles to indicate the variables.
Step 4: Calculate summary statistics for x, y, and z variables:
- Use the summary() function to get mean, median, min, max, etc.
Step 5: Plot a histogram for the price variable:
- Use ggplot2 to create a histogram for the price variable.
- Set the binwidth to 500 for the price histogram.
- Customize colors for better visualization.
- Add a title to indicate the variable.
Step 6: Calculate summary statistics for the price variable:
- Use the summary() function to get mean, median, min, max, etc.
CODE:
library(ggplot2)
>
> # Load the diamonds dataset
> data("diamonds")
>
> # Explore the distribution of x, y, z variables
> # Plot histograms for each variable
> ggplot(diamonds, aes(x = x)) + geom_histogram(binwidth = 0.5, fill = "skyblue", color = "black") + ggtitle("Distribution of x Variable")
> ggplot(diamonds, aes(x = y)) + geom_histogram(binwidth = 0.5, fill = "lightgreen", color = "black") + ggtitle("Distribution of y Variable")
> ggplot(diamonds, aes(x = z)) + geom_histogram(binwidth = 0.5, fill = "salmon", color = "black") + ggtitle("Distribution of z Variable")
>
> # Summary statistics for x, y, z variables
> summary(diamonds[c("x", "y", "z")])
x y z
Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.: 4.710 1st Qu.: 4.720 1st Qu.: 2.910
Median : 5.700 Median : 5.710 Median : 3.530
Mean : 5.731 Mean : 5.735 Mean : 3.539
3rd Qu.: 6.540 3rd Qu.: 6.540 3rd Qu.: 4.040
Max. :10.740 Max. :58.900 Max. :31.800
>
> # Explore the distribution of the price variable
> # Plot a histogram for the price variable
> ggplot(diamonds, aes(x = price)) + geom_histogram(binwidth = 500, fill = "orange", color = "black") + ggtitle("Distribution of Price")
>
> # Summary statistics for price variable
> summary(diamonds$price)
Min. 1st Qu. Median Mean 3rd Qu. Max.
326 950 2401 3933 5324 18823
Explore the distribution of price. Do you discover anything unusual or
surprising? (Hint: carefully think about the binwidth and make sure you try a
wide range of values.)

The price data has many spikes, but it is hard to tell what each spike
corresponds to. The following plot does not show much difference in the
distributions in the last one or two digits. There are no diamonds with a price
of $1,500 (between $1,455 and $1,545, inclusive), and there is a bulge in the
distribution around $750.

library(dplyr)  # for filter()
ggplot(filter(diamonds, price < 2500), aes(x = price, fill = color)) +
geom_histogram(binwidth = 10, center = 0)
Result:
Thus the above program has been executed successfully.
