Libraries:
The Tidyverse includes several popular R packages, such as:
dplyr: for data manipulation and analysis
ggplot2: for data visualization
tidyr: for data transformation and reshaping
readr: for data import and export of delimited text files (writexl is a separate package, outside the tidyverse, for writing Excel files)
purrr: for functional programming and data manipulation
stringr: for string manipulation
forcats: for categorical data manipulation
The MDSR Library:
The mdsr package accompanies the book Modern Data Science with R.
Explore the MDSR datasets, using the data() function, e.g. data(package = "mdsr")
Check help(package = "mdsr") for the functions the package actually exports; mdsr_clean(), mdsr_visualize(), mdsr_import() and mdsr_export() do not appear to be part of it
Work through the MDSR course and book series, using the library to support your learning
Lubridate is an R package that provides a set of functions for working with dates and times.
Tidyverse: This is a collection of packages that provide a consistent and intuitive way of
working with data in R. The core packages in the tidyverse are:
dplyr: For data manipulation and filtering
tidyr: For data transformation and reshaping
ggplot2: For data visualization
readr: For reading and parsing data files
lubridate: For working with dates and times
Other Important Packages:
stringr: For string manipulation and text analysis
magrittr: For piping operations together
pacman: For package management and installation
Key Functions to Know:
dplyr:
filter(): For filtering data
select(): For selecting specific columns
mutate(): For creating new columns
group_by(): For grouping data
summarise(): For summarizing data
glimpse(): Shows a concise overview of a data frame, one line per column with its type and
first values, similar to str(). dplyr re-exports it from the pillar package.
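The dplyr verbs listed above are usually chained with the pipe. A minimal sketch, assuming a small made-up data frame (scores and its columns are hypothetical, not from any package):

```r
library(dplyr)

# Hypothetical data: five players on two teams
scores <- data.frame(
  team   = c("A", "A", "B", "B", "B"),
  player = c("p1", "p2", "p3", "p4", "p5"),
  points = c(10, 20, 5, 15, 25)
)

team_summary <- scores %>%
  filter(points >= 10) %>%                 # keep rows with at least 10 points
  select(team, points) %>%                 # keep only the columns used below
  mutate(double_points = points * 2) %>%   # add a derived column
  group_by(team) %>%                       # one group per team
  summarise(
    n_players   = n(),                     # rows per group
    mean_points = mean(points)             # average per group
  )

glimpse(team_summary)
```

Each verb takes a data frame and returns a data frame, which is why they can be chained in this way.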
tidyr:
pivot_longer(): For converting data from wide to long format
pivot_wider(): For converting data from long to wide format
drop_na(): For removing rows that contain missing values
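The three tidyr functions can be tried on a tiny table. This is only a sketch; the data frame wide and its column names are made up for illustration:

```r
library(tidyr)

# Hypothetical wide table: one column per year
wide <- data.frame(
  country = c("X", "Y"),
  y2020   = c(1.5, NA),
  y2021   = c(2.0, 3.0)
)

# Wide to long: one row per country-year pair
long <- pivot_longer(wide, cols = c(y2020, y2021),
                     names_to = "year", values_to = "value")

# Remove the row whose value is missing
long_complete <- drop_na(long)

# Long back to wide
wide_again <- pivot_wider(long, names_from = year, values_from = value)
```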
lubridate:
year(): For extracting the year from a date
month(): For extracting the month from a date
day(): For extracting the day from a date
ggplot2:
ggplot(): For creating visualizations
aes(): For mapping variables to visual properties
geom_point(): For creating scatter plots
geom_bar(): For creating bar charts
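The ggplot2 functions above combine with the + operator. A short sketch using mtcars, a dataset that ships with R (the choice of columns is only for illustration):

```r
library(ggplot2)

# Scatter plot: weight on the x axis, miles per gallon on the y axis
scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Bar chart: number of cars for each cylinder count
bars <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()
```

Typing scatter or bars at the console (or calling print() on them) draws the plot.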
nycflights13 Package:
flights: all flights that departed from NYC in 2013
weather: hourly meteorological data for each airport
planes: construction information about each plane
airports: airport names and locations
airlines: translation between two-letter carrier codes and names
Functions:
filter(): Keep the rows that meet one or more conditions.
arrange(): Sort rows in ascending order; wrap a column in desc() for descending order.
group_by(): Divide data into groups based on one or more variables.
summarise(): Calculate summary statistics for each group.
mutate(): Add new columns to the data, or change existing ones.
select(): Select specific columns from the data.
In short, select() picks columns and filter() picks rows.
Note that a chained comparison such as 10 < x < 20 does not work in R. Write each
comparison separately and combine them with the & operator: x > 10 & x < 20.
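A small sketch of combining two conditions inside filter(), and of sorting with arrange(). The data frame df_ages is made up for illustration:

```r
library(dplyr)

# Hypothetical data
df_ages <- data.frame(
  name = c("Ann", "Bob", "Cy", "Di"),
  age  = c(17, 25, 40, 70)
)

# Two separate comparisons joined with &
working_age <- df_ages %>% filter(age > 18 & age < 65)

# Separating conditions with a comma means the same thing as &
working_age2 <- df_ages %>% filter(age > 18, age < 65)

# desc() reverses the sort order
oldest_first <- df_ages %>% arrange(desc(age))
```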
Date functions (lubridate), often used inside mutate():
ymd(): parses a character string into a Date object (year-month-day format)
mdy(): parses a character string into a Date object (month-day-year format)
dmy(): parses a character string into a Date object (day-month-year format)
interval(): creates an interval object, a time span tied to a specific start and end
duration(): creates a duration object, an exact length of time measured in seconds
period(): creates a period object, a length of time in calendar units such as months or days
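The parsing and time-span functions above can be sketched as follows. The dates are arbitrary examples:

```r
library(lubridate)

# The same kind of Date object from three different text layouts
d1 <- ymd("2013-01-31")
d2 <- mdy("03/15/2013")
d3 <- dmy("15-03-2013")

# Extract components
y <- year(d1)
m <- month(d1)
d <- day(d1)

# Interval: tied to a specific start and end
span <- interval(d1, d2)
span_days <- time_length(span, "day")

# Period vs duration, starting from 1 January 2013
start <- ymd("2013-01-01")
plus_period   <- start + period(1, "month")            # calendar month later
plus_duration <- as_date(start + duration(30, "days")) # exactly 30 days later
```

A period follows the calendar (one month after 1 January is 1 February), while a duration is a fixed amount of time (30 days after 1 January is 31 January).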
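Joins (dplyr):
inner_join(): keeps only the rows that have a match in both tables, e.g. inner_join(table1, table2, by = "id")
left_join(): keeps every row of the first table and adds the matching columns from the second
full_join(): keeps all rows from both tables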
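The difference between the three joins shows up in which ids survive. A minimal sketch, with table1 and table2 as made-up tables sharing an id column:

```r
library(dplyr)

# Hypothetical tables
table1 <- data.frame(id = c(1, 2, 3), name  = c("a", "b", "c"))
table2 <- data.frame(id = c(2, 3, 4), score = c(20, 30, 40))

inner <- inner_join(table1, table2, by = "id")  # ids present in both: 2, 3
left  <- left_join(table1, table2, by = "id")   # all ids of table1: 1, 2, 3
full  <- full_join(table1, table2, by = "id")   # ids from either table: 1, 2, 3, 4
```

Rows without a match get NA in the columns coming from the other table, for example score for id 1 in left.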
nrow(flights): returns the number of rows in the flights table
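Other methods and functions:
class(A), str(A): find the type and structure of an object
head(A), summary(A), glimpse(A): preview or summarize the data
package_name::function_name: call a function from a package without loading the package
?function_name or help(function_name): open the help page for a function
as.integer() or ……: convert a value to another type (R function names are case-sensitive)
tribble(): create a tibble (table) by typing it row by row
paste(): concatenates strings. It works element by element on vectors; add the collapse
argument to join everything into a single string.
n_distinct(): counts the number of unique values (dplyr)
sum(!is.na(name)): counts the non-missing values in name
sort(): sorts a vector (R has no sorted() function)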
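Several of the helpers above can be tried on one short vector. This is a sketch with made-up values; it assumes the dplyr and tibble packages are installed, since they are called with the package_name::function_name form:

```r
x <- c("3", "7", "7", NA)
class(x)                        # type of the object

n <- as.integer(x)              # convert text to integers; NA stays NA

non_missing <- sum(!is.na(n))   # count of non-missing values
n_unique <- dplyr::n_distinct(n)  # NA counts as one distinct value here
sorted_n <- sort(n)             # sort() drops NA by default

# paste() works element by element; collapse joins into one string
labels <- paste(c("a", "b"), 1:2, sep = "_")
joined <- paste(c("a", "b"), collapse = "+")

# tribble(): type a small table row by row
people <- tibble::tribble(
  ~name,  ~age,
  "John", 25,
  "Mary", 31
)
```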
*** important difference between sort and arrange:
library(dplyr)
df <- data.frame(yearID = c(1992, 1990, 1991, 1992, 1990, 1991),
                 teamID = c(10, 10, 10, 10, 10, 10),
                 playerID = c(123, 123, 123, 123, 123, 123))

# sort() works on a vector, so pass it the column
sorted_vec <- sort(df$yearID)
sorted_vec
# arrange() works on a whole data frame
sorted_df <- df %>% arrange(yearID)
sorted_df
Output of sort:
[1] 1990 1990 1991 1991 1992 1992
The sort function returns a sorted vector, not a data frame. The output is a single vector
with the sorted values of the yearID column.
Output of arrange:
  yearID teamID playerID
1   1990     10      123
2   1990     10      123
3   1991     10      123
4   1991     10      123
5   1992     10      123
6   1992     10      123
The arrange function returns a new data frame with the rows sorted in ascending order by
yearID. The output has the same structure as the original data frame, but with the rows
rearranged according to the sorting criteria.
sum(): This function calculates the sum of a numeric vector. It's not suitable for counting
the number of characters in a string.
length(): This function returns the number of elements in a vector. For a character vector
it counts the strings, not the characters within each string.
nchar(): This function returns the number of characters in each string. It's what you need
to count the characters in a text column such as name.
nzchar(): returns TRUE for each non-empty string and FALSE for each empty string ("")
strsplit(): splits each string at a separator and returns a list of character vectors
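The difference between these functions is easiest to see side by side. A base R sketch with a made-up name vector:

```r
name <- c("John", "Mary", "", "Emily")

n_elements  <- length(name)      # number of strings in the vector
n_chars     <- nchar(name)       # characters in each string
total_chars <- sum(nchar(name))  # sum() is useful once nchar() gives numbers
non_empty   <- nzchar(name)      # FALSE for the empty string

# strsplit() returns a list; [[1]] takes the pieces of the first string
parts <- strsplit("a,b,c", ",")[[1]]
```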
%in%: The syntax is x %in% y, where x is the vector or column you want to check, and y is the
vector or column you want to check against. The result is a logical vector with one TRUE or
FALSE per element of x, so it is commonly used inside filter() or [ ] to keep matching rows.
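A base R sketch of %in% on a vector and on the rows of a data frame. The carrier codes and the small data frame are made up for illustration:

```r
carriers <- c("AA", "UA", "DL", "B6")
wanted   <- c("UA", "B6", "ZZ")

# One TRUE/FALSE per element of carriers
is_wanted <- carriers %in% wanted

# Use the logical vector to keep matching elements
kept <- carriers[is_wanted]

# The same idea on the rows of a data frame
flights_small <- data.frame(carrier = carriers, delay = c(5, 12, 0, 30))
kept_rows <- flights_small[flights_small$carrier %in% wanted, ]
```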
Details of making Table:
Method 1: Using the data.frame function
# Create a table with 3 columns and 4 rows
table <- data.frame(
Name = c("John", "Mary", "David", "Emily"),
Age = c(25, 31, 42, 28),
Country = c("USA", "Canada", "UK", "Australia")
)
# Print the table
table
Method 2: Using the tibble function
# Create a table with 3 columns and 4 rows
library(tibble)
table <- tibble(
Name = c("John", "Mary", "David", "Emily"),
Age = c(25, 31, 42, 28),
Country = c("USA", "Canada", "UK", "Australia")
)
# Print the table
table
Method 3: Using a matrix
# Create a matrix with 3 columns and 4 rows
matrix <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), nrow = 4, ncol = 3)
# Convert the matrix to a table
table <- as.data.frame(matrix)
# Print the table
table
Method 4: Reading in data from a file
If you have data in a file (e.g., CSV, Excel, or text file), you can read it into R using various functions
such as read.csv, read.table, or read_excel (read_excel comes from the readxl package). Here's an example:
# Read in a CSV file
table <- read.csv("data.csv")
# Print the table
table
Conditional operators :
Loops and if condition:
List :
Vector:
Matrix:
Dataframe:
Create data frame:
df <- data.frame(
column1 = c(values),
column2 = c(values),
...
)
Functions on dataframe:
str(df): prints the structure (column names, types, first values) in the console
print(df): prints the dataframe in the console as a tidy table
sample_n(): from the dplyr package. This function takes a random sample of rows from a
dataframe. (Newer dplyr versions recommend slice_sample(n = 3) for the same task.)
library(dplyr)
# Create a dataframe
df <- data.frame(name = c("Welcome", "to", "Geeks", "for", "Geeks"),
                 year = c(10, 51, 19, 126, 99),
                 length = c(40, NA, NA, 100, 95),
                 education = c("yes", "yes", "no", "no", "yes"))

# Take a random sample of 3 rows
Random_subset <- df %>% sample_n(3)

# Print the random subset
print(Random_subset)
Alternatively, you can use the base R sample() function to take a random sample of
rows. Here's an example:
# Take a random sample of 3 rows
Random_subset <- df[sample(nrow(df), 3), ]
# Print the random subset
print(Random_subset)
Column:
names(df) or str(df) to see the column names
Access specific columns in a dataframe using the $ operator or [[ ]]:
df$year or df[["year"]]
Removing a column: df <- df %>% select(-year) or df$year <- NULL
Converting a column to a string: df$year <- as.character(df$year)
Applying a function to a specific column: use the mutate function from the dplyr package.
df %>% mutate(n_sqrt = sqrt(n)) creates a new column n_sqrt
df %>% mutate(prop = prop * 10000000/1000000) reuses the existing name prop, so the result
overwrites that column and no new column is created
Or you can use the $ operator to access the column and apply the function
directly, like this: df$n <- sqrt(df$n)
Finding columns with NaN values: sapply(df, function(x) any(is.nan(x)))
(is.nan() does not detect NA; use is.na() to catch both NA and NaN)
Finding duplicate values in a column:
df %>%
  group_by(name) %>%
  filter(n() > 1)
ROW:
Access the first row of the df dataframe: df[1,]. Access the first column: df[,1]
Adding a row: rbind()
# Assumes df has the columns year, sex, name, n, prop.
# Build the new row as a one-row data frame so each column keeps its type
# (c() would convert every value to character).
new_row <- data.frame(year = 2022, sex = "M", name = "John", n = 50, prop = 0.0001)
df <- rbind(df, new_row)
Deleting a row: the slice function from dplyr takes a dataframe and a vector of row indices
as arguments. df <- df %>% slice(-1) keeps all rows except the first row.
Alternatively, you can use the [ operator with a negative index to remove the first row, like df <- df[-1,]
Changing a row:
To change a row in a dataframe, you can use the [ operator to access the row and
assign new values to it. For example, to change the first row of the df dataframe, you can
use a list, which keeps each value's type:
df[1,] <- list(2022, "M", "John", 50, 0.0001)
Applying a function to a specific row:
You can use the rowwise function from the dplyr package. The rowwise function allows
you to apply a function to each row of a dataframe.
df %>% rowwise() %>% mutate(sum = sum(n, prop))
rowwise():
This is a function from the dplyr package that groups the dataframe by rows.
When you use rowwise(), each row of the dataframe is treated as a separate group.
This is similar to using apply(df, 1, ...) in base R, and often reads more naturally inside
a pipe. For a plain sum, vectorized code such as n + prop is faster than either.
With the rowwise() and mutate() functions, the code applies the sum() function to each row of
the dataframe, and the result is added as a new column sum to the dataframe.
The base R equivalent is df$sum <- apply(df[,c("n", "prop")], 1, sum)
The apply function takes three arguments:
The first argument is the dataframe or matrix that we want to apply the function to. In
this case, it's df[,c("n", "prop")].
The second argument is the MARGIN argument, which specifies whether we want to
apply the function to rows (1) or columns (2). In this case, we're using 1, which means
we want to apply the function to each row.
The third argument is the function that we want to apply. In this case, it's
the sum function.
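The effect of the MARGIN argument can be checked on a tiny example. A base R sketch; df_np and its values are made up, with the same column names n and prop as above:

```r
# Hypothetical data
df_np <- data.frame(n = c(1, 2, 3), prop = c(0.5, 0.25, 0.125))

# MARGIN = 1: apply sum() to each row
row_sums <- apply(df_np[, c("n", "prop")], 1, sum)

# MARGIN = 2: apply sum() to each column
col_sums <- apply(df_np[, c("n", "prop")], 2, sum)

# Store the row result as a new column
df_np$sum <- row_sums

# For sums specifically, base R also offers rowSums()
row_sums_alt <- rowSums(df_np[, c("n", "prop")])
```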
Finding rows with NaN values: df[is.nan(df$prop), ] This will return all rows
where the prop column has NaN values.
Finding duplicate rows: to find duplicate rows, we can use the duplicated() function:
df[duplicated(df) | duplicated(df, fromLast = TRUE), ]
Alternatively, we can use the group_by() and filter() functions from
the dplyr package:
library(dplyr)
df %>%
  group_by(year, sex, name, n, prop) %>%
  filter(n() > 1)
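The NaN and duplicate checks can be tried in base R on a small made-up data frame (df_dup and its values are hypothetical):

```r
# Hypothetical data: rows 1 and 3 are identical, row 2 has NaN, row 4 has NA
df_dup <- data.frame(
  name = c("a", "b", "a", "c"),
  prop = c(0.1, NaN, 0.1, NA)
)

# is.nan() flags only NaN; is.na() flags both NA and NaN
nan_flags <- is.nan(df_dup$prop)
na_flags  <- is.na(df_dup$prop)

# Rows where prop is NaN
nan_rows <- df_dup[is.nan(df_dup$prop), ]

# duplicated() alone marks only the later copy; combining it with
# fromLast = TRUE returns every copy of a duplicated row
dup_rows <- df_dup[duplicated(df_dup) | duplicated(df_dup, fromLast = TRUE), ]
```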
operator.Class:
Seq:
Function Arithmetic:
Operators: