BUSINESS ANALYTICS – SEM 6TH
Chapter 4
Q1: Understand and work with vectors, matrices, arrays, lists, factors, and data frames in R.
In R, several data structures are available for storing and manipulating data. Understanding when to
use each data structure is crucial for efficient data analysis.
• Vectors are the most basic data structure in R and store elements of the same type (numeric,
character, logical). You create a vector using the c() function, for example:
    v <- c(1, 2, 3, 4)
• Matrices are two-dimensional arrays that store data in rows and columns. All elements must
be of the same type. You create a matrix using the matrix() function:
    m <- matrix(1:6, nrow = 2, ncol = 3)
• Arrays extend matrices to more than two dimensions, and all elements must still be of the
same type. They can be created using the array() function.
• Lists are similar to vectors, but they can store elements of different types. Lists are useful
when you need to store mixed data types. Example:
    l <- list(1, "a", TRUE)
• Factors are categorical data types that are useful for representing nominal or ordinal data.
Factors are created using the factor() function.
• Data frames are two-dimensional data structures like matrices but can store different types
of data in each column. They are widely used in R for data manipulation. You can create data
frames using data.frame():
    df <- data.frame(Name = c("John", "Jane"), Age = c(25, 30))
These data structures are essential tools in R for managing and analyzing data efficiently.
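Arrays and factors are described above without an example. The following is a minimal sketch of both; the variable names and values are illustrative assumptions, not taken from the course material.

```r
# A 2 x 3 x 2 array: two "layers", each a 2 x 3 matrix.
a <- array(1:12, dim = c(2, 3, 2))
dim(a)

# An unordered (nominal) factor; levels are sorted alphabetically by default.
colour <- factor(c("red", "blue", "red"))
levels(colour)

# An ordered (ordinal) factor with an explicit level order.
size <- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)
size[1] < size[2]  # comparisons are meaningful only for ordered factors
```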
Q2: Use conditionals and control flows to add logic to your programs.
In R, conditionals and control flows allow you to introduce decision-making and repetition into your
programs, making them more flexible and dynamic.
• If-Else Statements: These allow your program to make decisions based on conditions. The
basic syntax is:
    if (condition) {
      # code to execute if condition is TRUE
    } else {
      # code to execute if condition is FALSE
    }
Example:
x <- 10
if (x > 5) {
  print("x is greater than 5")
} else {
  print("x is less than or equal to 5")
}
• Else If: This is used when you need to check multiple conditions in sequence.
    if (x > 10) {
      print("x is greater than 10")
    } else if (x == 10) {
      print("x is equal to 10")
    } else {
      print("x is less than 10")
    }
• Switch: The switch() function selects one of several alternatives based on the value of a single
expression. It is often cleaner than a long chain of if-else statements.
• Control Flow: You can control the flow of execution using loops like for, while, and repeat.
These loops are used to repeat certain operations multiple times based on a condition.
Control structures like these allow for more dynamic and responsive R programs.
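The switch() function is mentioned above without an example. A minimal sketch follows; the helper describe_grade() and its grade labels are hypothetical and only serve as an illustration.

```r
# Hypothetical helper: map a grade letter to a description.
describe_grade <- function(grade) {
  switch(grade,
    "A" = "Excellent",
    "B" = "Good",
    "C" = "Average",
    "Unknown grade"  # an unnamed final argument acts as the default
  )
}

describe_grade("B")
describe_grade("Z")
```

When the first argument is a character string, switch() matches it against the argument names and falls back to the unnamed final value if nothing matches.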
Q3: Implement loops for repeated code and enhance efficiency by using the apply family of
functions.
In R, loops are commonly used to repeat code a specified number of times or until a condition is
met. Although loops are useful, they can be slow and verbose on large datasets, especially when they
grow an object one element at a time. R provides the apply family of functions to express repetitive
tasks more concisely and, in many cases, more efficiently.
• For Loops: The for loop is used to repeat a block of code a specified number of times.
    for (i in 1:5) {
      print(i)
    }
• While Loops: The while loop repeats as long as a specified condition remains true.
    x <- 1
    while (x <= 5) {
      print(x)
      x <- x + 1
    }
• Repeat Loops: The repeat loop is similar to while, but you need a break statement to exit the
loop.
    x <- 1
    repeat {
      print(x)
      x <- x + 1
      if (x > 5) break
    }
The apply family of functions provides a concise, and often more efficient, way to apply a function
over an object like a vector, matrix, or data frame. Some common apply functions are:
• apply(): Applies a function over the margins (rows or columns) of a matrix.
• lapply(): Applies a function to each element of a list or vector and returns a list.
• sapply(): Similar to lapply(), but tries to simplify the result to a vector or matrix.
• tapply(): Applies a function to subsets of a vector, where the subsets are defined by a factor.
For example, instead of using a loop to calculate the sum of each column in a matrix, use apply():
matrix_data <- matrix(1:6, nrow = 2)
apply(matrix_data, 2, sum) # apply sum over columns (margin 2)
Using the apply family of functions often improves readability and can improve efficiency, especially
when working with large datasets.
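Only apply() is demonstrated above. The sketch below illustrates lapply(), sapply(), and tapply() on small made-up data; the names scores, sales, and region are assumptions chosen for the example.

```r
scores <- list(math = c(90, 80), art = c(70, 75, 80))

# lapply() returns a list with one result per element.
means_list <- lapply(scores, mean)

# sapply() simplifies the same result to a named numeric vector.
means_vec <- sapply(scores, mean)

# tapply() applies a function to groups of a vector defined by a factor.
sales  <- c(10, 20, 30, 40)
region <- factor(c("north", "south", "north", "south"))
totals <- tapply(sales, region, sum)

print(means_vec)
print(totals)
```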
Q4: Identify when and how to select among various data structures and control mechanisms.
In R, selecting the appropriate data structure and control mechanism is key to writing efficient and
readable code. The choice depends on the problem at hand and the nature of the data you're
working with.
1. Data Structures:
o Use vectors when you need to store a sequence of data elements of the same type
(e.g., numerical values or characters).
o Use matrices and arrays when you need to work with multi-dimensional data.
Matrices are ideal for numerical data, while arrays can handle more than two
dimensions.
o Use data frames for heterogeneous data. When your data consists of multiple
columns with different types (numeric, character), data frames are the best option.
o Lists should be used when elements of your data have varying data types (like a mix
of characters, numbers, and logical values).
o Factors are essential for categorical data, especially in modeling and statistical
analysis.
2. Control Mechanisms:
o Use if-else statements for branching logic when you need to make decisions.
o For loops are ideal when you need to repeat operations over a fixed range or
collection of values.
o While loops are useful when you want to repeat an action until a condition changes
dynamically.
o Switch statements are helpful when you have multiple conditions based on a single
variable.
o When handling large datasets or applying functions over objects, consider using
apply functions for efficiency over loops.
In summary, selecting the right combination of data structure and control mechanisms optimizes
performance and ensures clarity in your R programs.
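One way to see why the choice of data structure matters is to look at what happens when mixed types are placed in a vector. The sketch below uses illustrative values only.

```r
# c() coerces mixed inputs to a single common type (here: character).
mixed_vector <- c(1, "a", TRUE)
class(mixed_vector)

# A list keeps each element's own type.
mixed_list <- list(1, "a", TRUE)
element_types <- sapply(mixed_list, class)
print(element_types)
```

Because the vector silently converts everything to character, a list or data frame is the safer choice whenever the types genuinely differ.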
Q5: Write cleaner, more efficient R code using the strength of functional programming.
R is a powerful language that supports functional programming, a paradigm that encourages the use
of functions to process data rather than relying on mutable state or loops. Writing cleaner and more
efficient R code involves utilizing the functional programming features that R offers.
• Vectorization: One of the core principles of functional programming in R is vectorization. This
means performing operations on entire vectors or matrices at once rather than iterating over
elements using loops. Vectorized operations are more efficient and concise.
Example of vectorization:
# Vectorized approach
x <- c(1, 2, 3, 4)
y <- x^2 # square all elements in x without a loop
• Apply Functions: As mentioned, the apply() family of functions (e.g., lapply(), sapply(),
apply(), tapply()) enables you to perform operations on data structures like lists and matrices
without using explicit loops. This leads to more efficient and readable code.
Example with lapply():
# Apply a function to each element in a list
result <- lapply(list(1, 2, 3), function(x) x^2)
• Purrr Package: The purrr package enhances functional programming in R by providing tools
to work with lists and vectors in a functional style. Functions like map() and reduce() are
commonly used.
Example using purrr:
library(purrr)
result <- map(list(1, 2, 3), ~ .x^2) # squaring elements of the list
By leveraging vectorization, the apply() family, and packages like purrr, you can write R code that is
not only clean and easy to read but also faster and more efficient.
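The text mentions reduce() but does not show it. As a dependency-free sketch, the example below uses base R's Reduce(), Filter(), and vapply() rather than purrr; these are swapped in for illustration and are not the purrr functions themselves.

```r
# Map: apply a function to each element; vapply() also checks the result type.
squares <- vapply(c(1, 2, 3), function(v) v^2, numeric(1))

# Filter: keep only the elements for which the predicate is TRUE.
evens <- Filter(function(v) v %% 2 == 0, 1:6)

# Reduce: combine all elements into one value with a binary function.
total <- Reduce(`+`, 1:4)

print(squares)
print(evens)
print(total)
```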
1. What is the primary difference between a matrix and an array in R?
In R, both matrices and arrays are used to store multi-dimensional data, but they differ in the
number of dimensions they can handle:
• Matrix: A matrix is a two-dimensional data structure, meaning it can store data in rows and
columns. All elements in a matrix must be of the same data type (e.g., all numeric or all
character). Matrices are created using the matrix() function.
Example:
m <- matrix(1:6, nrow = 2, ncol = 3)
• Array: An array is more general and can have more than two dimensions. An array can store
data in three, four, or even higher dimensions. Like matrices, arrays also require that all
elements are of the same type. Arrays are created using the array() function.
Example:
a <- array(1:12, dim = c(3, 2, 2))
In summary, the primary difference between a matrix and an array is the number of dimensions: a
matrix is always two-dimensional, while an array can have multiple dimensions.
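The relationship between the two structures can be checked directly. This is a small sketch using the same shapes as the examples above.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
a <- array(1:12, dim = c(3, 2, 2))

dim(m)  # 2 3
dim(a)  # 3 2 2

# A matrix is a two-dimensional array, so both checks are TRUE.
is.matrix(m)
is.array(m)

# A three-dimensional array is not a matrix.
is.matrix(a)

# Index with one subscript per dimension: row 1, column 2, layer 2.
a[1, 2, 2]
```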
2. Write an R code snippet to create a 3x3 matrix.
To create a 3x3 matrix in R, you can use the matrix() function and specify the number of rows and
columns. Here’s an example:
# Creating a 3x3 matrix
m <- matrix(1:9, nrow = 3, ncol = 3)
# Printing the matrix
print(m)
Output:
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
This code creates a matrix with 3 rows and 3 columns, populated with the numbers 1 to 9.
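As the output shows, matrix() fills column by column by default. If a row-wise fill is wanted, the byrow argument can be used, as in this short sketch:

```r
# Fill the matrix row by row instead of column by column.
m_by_row <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)

m_by_row[1, ]  # first row: 1 2 3
m_by_row[, 1]  # first column: 1 4 7
```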
3. How do lists differ from vectors in R?
In R, lists and vectors are both data structures used to store collections of elements, but they have
key differences:
• Vector: A vector is a one-dimensional structure that stores elements of the same type, such as all
numeric values or all characters. Vectors are created using the c() function and are
homogeneous (all elements must be of the same type).
Example:
v <- c(1, 2, 3, 4) # numeric vector
• List: A list is a more flexible data structure that can store elements of different types. A list
can contain numbers, characters, vectors, or even other lists. Lists are created using the list()
function.
Example:
l <- list(1, "a", TRUE, c(2, 3, 4))
In summary, vectors are homogeneous (only one data type), while lists can contain heterogeneous
data types (mix of numbers, characters, etc.).
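The difference also shows up when elements are accessed. The following sketch reuses the example values above to contrast single and double brackets.

```r
v <- c(1, 2, 3, 4)
l <- list(1, "a", TRUE, c(2, 3, 4))

v[2]         # second element of the vector

l[[4]]       # double brackets extract the element itself (a numeric vector)
class(l[4])  # single brackets return a sub-list, so this is "list"
l[[4]][2]    # second value of the vector stored in position 4
```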
4. What makes a data frame unique compared to a matrix?
A data frame and a matrix are both two-dimensional data structures in R, but they differ in several
ways:
• Data Frame: A data frame is a table-like structure where each column can contain elements
of different data types (e.g., numeric, character, factor). Data frames are used to store data in
the form of rows and columns, often used in data analysis and statistical modeling. Each
column in a data frame can be thought of as a vector, and the columns can have different
types.
Example:
df <- data.frame(Name = c("John", "Jane"), Age = c(28, 34), Score = c(88.5, 92.3))
• Matrix: A matrix is a two-dimensional structure in which all elements must be of the same data
type (e.g., all numeric or all character). Matrices are mainly used for numerical
computations.
Example:
m <- matrix(1:6, nrow = 2, ncol = 3)
The key difference is that a data frame allows columns to have different data types, while a matrix
requires all elements to be of the same type. Data frames are ideal for handling mixed-type data
commonly encountered in datasets.
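A short sketch of this difference: each data frame column keeps its own type, whereas converting the same data to a matrix forces a single type. The stringsAsFactors argument is set explicitly here so the result does not depend on the R version.

```r
df <- data.frame(Name = c("John", "Jane"),
                 Age = c(28, 34),
                 Score = c(88.5, 92.3),
                 stringsAsFactors = FALSE)

# Each column has its own class.
column_classes <- sapply(df, class)
print(column_classes)

# Converting to a matrix coerces every value to a common type.
m <- as.matrix(df)
typeof(m)  # "character"
```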
Here is an R code snippet using an if-else statement to check if a number is even or odd:
# Define the number
num <- 7
# Check if the number is even or odd
if (num %% 2 == 0) {
  print("The number is even.")
} else {
  print("The number is odd.")
}
Explanation:
• num %% 2 == 0: This checks if the remainder when num is divided by 2 is 0. If it is, the
number is even.
• If the condition is true, the message "The number is even." is printed.
• If the condition is false, the else part is executed, printing "The number is odd.".
In the case above, since num = 7, the output will be:
[1] "The number is odd."
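An if-else statement tests a single value. To classify several numbers at once, the vectorized ifelse() function can be used instead, as in this sketch with made-up inputs:

```r
nums <- c(7, 10, 3)

# ifelse() evaluates the condition for every element of the vector.
parity <- ifelse(nums %% 2 == 0, "even", "odd")
print(parity)
```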
-------------------------------------------------------------------------------------------------------------------------------------
1. What is the difference between a vector and a matrix in R? Provide an example of each.
In R, a vector and a matrix are both types of data structures that store elements in a specific
arrangement, but they differ in their dimensions and data handling.
• Vector: A vector is a one-dimensional structure, meaning it contains elements in a single row
or column. All elements in a vector must be of the same type (numeric, character, etc.).
Vectors are created using the c() function. For example, a numeric vector can be created as
follows:
    v <- c(1, 2, 3, 4, 5) # numeric vector
• Matrix: A matrix is a two-dimensional structure that stores data in rows and columns. Like
vectors, matrices must contain elements of the same type. A matrix is created using the
matrix() function, and you can define the number of rows and columns. Example:
    m <- matrix(1:6, nrow = 2, ncol = 3) # matrix with 2 rows and 3 columns
The primary difference between them is that a vector is one-dimensional, while a matrix is two-
dimensional. Matrices are particularly useful when dealing with mathematical operations, while
vectors are commonly used for simpler, one-dimensional data handling.
2. Explain how a list differs from a vector and give a practical example of when you would use a list
instead of a vector.
In R, both lists and vectors are used to store multiple elements, but they differ in terms of flexibility
and the types of data they can store.
• Vector: A vector is a one-dimensional structure that holds elements of the same data type
(numeric, character, etc.). For example, a numeric vector can be created as follows:
    v <- c(1, 2, 3, 4, 5) # numeric vector
• List: A list is a more flexible data structure that can hold elements of different data types.
You can store a mix of numeric values, characters, vectors, matrices, or even other lists. Lists
are created using the list() function. Example:
    l <- list(name = "John", age = 30, scores = c(90, 85, 88))
The key difference is that lists can hold heterogeneous types of data, whereas vectors are
homogeneous.
Practical Example:
Suppose you want to store information about a student, such as their name, age, and scores. Since
these data types differ (a string for the name, a number for the age, and a vector for the scores),
you would use a list:
student <- list(name = "Alice", age = 20, scores = c(95, 80, 78))
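The same idea extends to several students by nesting lists. This is a minimal sketch; the second student ("Bob") and all values for him are hypothetical.

```r
# A list of lists: one inner list per student.
students <- list(
  list(name = "Alice", age = 20, scores = c(95, 80, 78)),
  list(name = "Bob",   age = 22, scores = c(88, 92))  # hypothetical student
)

# Extract one field from every student.
student_names <- sapply(students, function(s) s$name)

# Score vectors may have different lengths, which a plain vector could not hold.
mean_scores <- sapply(students, function(s) mean(s$scores))

print(student_names)
print(mean_scores)
```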
3. Describe the structure of a data frame and explain why it is particularly useful for working with
tabular data.
A data frame is a two-dimensional data structure in R that is particularly useful for working with
tabular data. It is similar to a spreadsheet or a database table, where each row represents an
observation and each column represents a variable. The columns in a data frame can hold different
data types, such as numeric, character, and factor, making it highly flexible.
• Structure: A data frame consists of columns that are vectors (or factors), and all columns are
aligned to the same number of rows. This structure makes it ideal for storing datasets with
different types of information (e.g., age, name, gender, scores).
Example:
df <- data.frame(Name = c("John", "Jane"), Age = c(28, 34), Score = c(85, 90))
• Why useful: Data frames are highly suitable for storing and analyzing heterogeneous data.
They allow you to mix different types of variables (numeric, character, and factors) within the
same dataset. This is particularly useful when handling data from real-world sources like
surveys, experiments, or databases, where each column might represent a different type of
measurement or information.
Advantages:
• Easy to subset and manipulate using subset(), the dplyr package, and the apply family of
functions.
• Ideal for use in statistical analysis, data visualization, and modeling.
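A few common data frame operations, sketched in base R with the example data above. The Passed column and its threshold of 88 are assumptions added purely for illustration.

```r
df <- data.frame(Name = c("John", "Jane"), Age = c(28, 34), Score = c(85, 90))

# Inspect the structure: one line per column with its type.
str(df)

# Keep only the rows that satisfy a condition.
older <- subset(df, Age > 30)

# Columns are vectors, so vector functions work on them directly.
average_score <- mean(df$Score)

# Add a derived column (hypothetical pass mark of 88).
df$Passed <- df$Score >= 88
```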
4. Write an R code snippet using an if-else statement to determine whether a number is positive,
negative, or zero.
To determine whether a number is positive, negative, or zero, we can use an if-else statement in R.
Here's an example:
# Define the number
num <- -5
# Determine if the number is positive, negative, or zero
if (num > 0) {
  print("The number is positive.")
} else if (num < 0) {
  print("The number is negative.")
} else {
  print("The number is zero.")
}
Explanation:
• If statement: Checks if the number is greater than 0 (positive).
• Else-if statement: Checks if the number is less than 0 (negative).
• Else statement: If neither condition is true, the number must be zero.
For num = -5, the output would be:
[1] "The number is negative."
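The same logic can be wrapped in a function so it is reusable. The function name classify_number() is a hypothetical choice for this sketch.

```r
# Hypothetical helper: return a label instead of printing it.
classify_number <- function(num) {
  if (num > 0) {
    "positive"
  } else if (num < 0) {
    "negative"
  } else {
    "zero"
  }
}

classify_number(-5)

# Apply the function to several numbers at once.
labels <- sapply(c(-5, 0, 3), classify_number)
print(labels)
```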
5. What is the purpose of the apply family of functions, and how do they improve code efficiency
compared to traditional loops?
The apply family of functions in R consists of functions like apply(), lapply(), sapply(), and tapply(),
and is designed to perform operations on elements of data structures such as matrices, lists, and
vectors. These functions provide an alternative to traditional for loops, offering better readability
and, in some cases, better efficiency.
• Purpose: These functions allow you to apply a function to each element or subset of an
object without having to explicitly write a loop. They simplify the process of manipulating
data structures and can help optimize code.
Example:
# Using apply to calculate the sum of each row in a matrix
m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)
• How they improve efficiency:
o Compiled iteration: lapply() and its relatives run their iteration in compiled C code, which
removes some interpreter overhead. The supplied function is still called once per element, so
the gain over a well-written for loop is often modest; the largest speedups come from truly
vectorized functions such as rowSums().
o Less code: Apply functions eliminate the need for manual loop management
(initialization, condition checking, and updating).
o Parallelization: Some apply-style functions have parallel counterparts (e.g., mclapply() in
the parallel package), which can speed up operations on large datasets.
In conclusion, the apply family improves code by simplifying syntax and, in some cases, speeding up
the underlying calculations, making data manipulation tasks more concise than traditional
loops.
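To make the comparison concrete, the sketch below computes the same row sums three ways: an explicit loop with a preallocated result, apply(), and the vectorized rowSums(). It is an illustration only and makes no timing claims.

```r
m <- matrix(1:6, nrow = 2)

# 1. Explicit for loop with a preallocated result vector.
loop_sums <- numeric(nrow(m))
for (i in seq_len(nrow(m))) {
  loop_sums[i] <- sum(m[i, ])
}

# 2. apply() over rows (margin 1).
apply_sums <- apply(m, 1, sum)

# 3. Fully vectorized built-in function.
vector_sums <- rowSums(m)

print(loop_sums)
print(apply_sums)
print(vector_sums)
```

All three give the same result; the apply() and rowSums() versions simply need less bookkeeping.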
Here are the answers to each question in tabular format:
1. What is the difference between a vector and a matrix in R? Provide an example of each.
| Concept | Vector | Matrix |
|---|---|---|
| Definition | A one-dimensional data structure. | A two-dimensional data structure (rows and columns). |
| Dimensions | 1D (single row or column). | 2D (multiple rows and columns). |
| Data Types | Can only hold elements of the same data type. | Can hold only elements of the same data type. |
| Creation | v <- c(1, 2, 3) | m <- matrix(1:6, nrow = 2, ncol = 3) |
| Example | v <- c(1, 2, 3, 4, 5) | m <- matrix(1:6, nrow = 2, ncol = 3) |
2. Explain how a list differs from a vector and give a practical example of when you would use a list
instead of a vector.
| Concept | List | Vector |
|---|---|---|
| Definition | A heterogeneous data structure that can hold elements of different types. | A homogeneous data structure that holds elements of the same type. |
| Data Types | Can hold different types (numeric, character, lists, etc.). | Must contain elements of the same data type. |
| Creation | l <- list(name = "John", age = 30, scores = c(90, 85, 88)) | v <- c(1, 2, 3, 4, 5) |
| Example | list1 <- list("Alice", 25, c(90, 85, 88)) | vector1 <- c(1, 2, 3) |
| Use Case | Useful when dealing with data of mixed types (e.g., a combination of numbers and text). | Use a vector when dealing with homogeneous data, like a sequence of numbers. |
3. Describe the structure of a data frame and explain why it is particularly useful for working with
tabular data.
| Concept | Data Frame |
|---|---|
| Definition | A two-dimensional data structure with columns that can hold different types of data (numeric, character, etc.). |
| Structure | Each column in a data frame can be a vector, and all columns have the same length (number of rows). |
| Use Case | Especially useful for storing tabular data (e.g., CSV or database tables). |
| Creation | df <- data.frame(Name = c("John", "Jane"), Age = c(28, 34), Score = c(85, 90)) |
| Advantages | Ideal for handling heterogeneous data and for easy manipulation with functions like subset(), dplyr packages, and apply(). |
| Example | A dataset containing names, ages, and scores of individuals. |
4. Write an R code snippet using an if-else statement to determine whether a number is positive,
negative, or zero.
Task: Check whether a number is positive, negative, or zero.
Code:
```r
num <- -5
if (num > 0) {
  print("The number is positive.")
} else if (num < 0) {
  print("The number is negative.")
} else {
  print("The number is zero.")
}
```
Explanation: Uses if, else if, and else to check if the number is positive, negative, or zero. For
num = -5, the output will be [1] "The number is negative."
5. What is the purpose of the apply family of functions, and how do they improve code efficiency
compared to traditional loops?
| Concept | Apply Family Functions | Traditional Loops |
|---|---|---|
| Purpose | Used to apply a function over elements of data structures (e.g., vectors, matrices) without explicit loops. | Loops explicitly iterate over each element of a structure. |
| Functions | apply(), lapply(), sapply(), tapply(), etc. | for or while loops. |
| Code Efficiency | More concise and sometimes faster due to internal optimizations. | More verbose, and often slower when results are grown element by element. |
| Use Case | Apply a function to rows or columns of a matrix: apply(m, 1, sum) for row sums. | Use a for loop to manually sum rows or columns of a matrix. |

Example (apply family):
```r
m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)
```
Advantages (apply family): Cleaner, often faster, and more readable code. Reduces the need for
manual looping.