R Data Types and Objects - Detailed Notes
Overview of R Data Types:
o R data types are the fundamental building blocks for storing and manipulating
data in the R programming language.
o R supports several basic data types, each designed for specific kinds of data,
ensuring flexibility in statistical computing and data analysis.
o Understanding data types is crucial because R’s operations and functions often
behave differently depending on the type of data they process.
Numeric Type:
o Represents numbers with decimal points (floating-point numbers), also known
as doubles in R.
o Example: x <- 3.14; typeof(x) returns "double".
o Used for continuous data, such as measurements, e.g., heights (5.9) or
temperatures (23.5).
o Default type for numbers: a literal such as 5 is stored as a double unless it
is written with the L suffix (5L).
o Operations: Supports arithmetic like 3.14 + 2.86 (returns 6).
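A minimal sketch of the points above, assuming a fresh R session; the variable names n and total are illustrative only.

```r
x <- 3.14
typeof(x)         # "double"

n <- 5            # no L suffix, so this is a double, not an integer
typeof(n)         # "double"
is.numeric(n)     # TRUE (is.numeric is TRUE for doubles and integers)

total <- 3.14 + 2.86
# Compare floating-point results with a tolerance rather than ==
isTRUE(all.equal(total, 6))   # TRUE
```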
Integer Type:
o Represents whole numbers without decimal points.
o Defined explicitly using the L suffix, e.g., y <- 5L; typeof(y) returns
"integer".
o Example: z <- as.integer(5.7) converts 5.7 to 5, truncating the decimal.
o Memory-efficient compared to numeric types, useful for large datasets with
count data, e.g., number of items sold (10L).
o Use case: Count models such as Poisson regression expect non-negative
whole-number counts, e.g., glm(count ~ predictor, family=poisson); the counts
may be stored as integer or double.
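The integer behavior described above can be checked with a short sketch. The vectors ints and dbls are illustrative names, and the size comparison assumes R's usual 4-byte integers and 8-byte doubles.

```r
y <- 5L
typeof(y)                  # "integer"

z <- as.integer(5.7)       # truncates toward zero, does not round
z                          # 5
neg <- as.integer(-5.7)    # -5, not -6

# Same length, different storage type
ints <- integer(1e6)
dbls <- numeric(1e6)
as.numeric(object.size(ints)) < as.numeric(object.size(dbls))   # TRUE
```

Note that arithmetic such as 5L / 2L returns a double, so integer storage is only preserved by operations that yield whole numbers.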
Character Type:
o Represents text or strings, enclosed in single (') or double (") quotes.
o Example: name <- "Alice"; typeof(name) returns "character".
o Useful for categorical labels, e.g., city <- "New York", or text data like
comments.
o Operations: String manipulation with functions like paste("Hello",
"World") (returns "Hello World") or nchar("Alice") (returns 5).
o Conversion: as.character(123) converts a number to a string, returning
"123".
Logical Type:
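o Represents boolean values: TRUE and FALSE. The shorthands T and F also work,
but they are ordinary variables that can be reassigned, so the full names are safer.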
o Example: is_valid <- TRUE; typeof(is_valid) returns "logical".
o Generated by comparisons, e.g., 5 > 3 returns TRUE.
o Used in conditional statements: if (5 > 3) { print("Yes") } prints
"Yes".
o Operations: Logical operators like & (AND), | (OR), ! (NOT), e.g., TRUE &
FALSE returns FALSE.
o Use case: Subsetting data, e.g., vec <- c(10, 20, 30); vec[vec > 15]
returns 20, 30.
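The comparison, operator, and subsetting bullets above can be sketched together as follows; mask and in_range are illustrative names.

```r
vec <- c(10, 20, 30)

mask <- vec > 15          # FALSE TRUE TRUE
vec[mask]                 # 20 30
sum(mask)                 # 2, because TRUE counts as 1 in arithmetic

# & and | work element by element on whole vectors
in_range <- vec > 15 & vec < 30   # FALSE TRUE FALSE

# any() and all() reduce a logical vector to one value for use in if()
if (any(in_range)) {
  message("At least one value is in range")
}
```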
Complex Type:
o Represents numbers with real and imaginary parts, used in advanced
mathematical computations.
o Example: z <- 2 + 3i; typeof(z) returns "complex".
o Components: Re(z) returns 2 (real part); Im(z) returns 3 (imaginary part).
o Operations: z1 <- 1 + 2i; z2 <- 2 + 3i; z1 + z2 returns 3 + 5i.
o Use case: Signal processing or solving equations in physics, e.g., exp(1i *
pi) prints as -1+0i (Euler’s formula); the imaginary part is a tiny
floating-point rounding error rather than exactly 0.
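A short sketch of the complex-number operations above. The tolerance of 1e-12 is an arbitrary choice for illustration.

```r
z <- 2 + 3i
typeof(z)        # "complex"
Re(z)            # 2
Im(z)            # 3
Mod(z)           # modulus, sqrt(2^2 + 3^2)
Conj(z)          # 2-3i

z1 <- 1 + 2i
z2 <- 2 + 3i
z_sum <- z1 + z2   # 3+5i

# Euler's formula: the result equals -1 up to floating-point error
e <- exp(1i * pi)
Mod(e - (-1 + 0i)) < 1e-12   # TRUE
```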
Overview of R Objects:
o R objects are structures that hold data of various types, used to organize and
manipulate data efficiently.
o Objects determine how data is stored, accessed, and processed in R, making
them essential for programming tasks.
Vectors:
o One-dimensional arrays holding elements of the same type (homogeneous).
o Created with c(): vec <- c(1, 2, 3); typeof(vec) returns "double".
o Can hold any data type: char_vec <- c("a", "b", "c") for characters.
o Operations: Vectorized, e.g., vec * 2 returns 2, 4, 6.
o Use case: Store a sequence of numbers, e.g., ages (c(25, 30, 35)), for
statistical analysis like mean(vec).
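As a minimal sketch of the vector behavior described above (mixed is an illustrative name showing what happens when types are combined):

```r
vec <- c(1, 2, 3)
typeof(vec)        # "double"
doubled <- vec * 2 # 2 4 6, applied to every element without a loop

ages <- c(25, 30, 35)
mean(ages)         # 30
ages[2]            # 30 (indexing starts at 1)
length(ages)       # 3

# A vector holds one type only, so mixed input is coerced
mixed <- c(1, "a", TRUE)
typeof(mixed)      # "character"
```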
Lists:
o One-dimensional collections that can hold elements of different types
(heterogeneous).
o Created with list(): my_list <- list(1, "a", TRUE, c(10, 20)).
o Access: my_list[[1]] returns 1; my_list[1] returns a sublist.
o Named lists: my_list <- list(name="Alice", age=25); access with my_list$name
or my_list[["name"]].
o Use case: Store mixed data, e.g., metadata of a dataset (list(id=1,
data=c(10, 20), desc="test")).
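The difference between [[ ]] and [ ] is easiest to see side by side. The sketch below uses a second list named person purely for illustration.

```r
my_list <- list(1, "a", TRUE, c(10, 20))

my_list[[1]]          # 1, the element itself
sub <- my_list[1]     # a list of length 1 that contains the element
class(sub)            # "list"
my_list[[4]][2]       # 20, second value of the vector stored in slot 4

person <- list(name = "Alice", age = 25)
person$name           # "Alice"
person[["age"]]       # 25
names(person)         # "name" "age"
```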
Matrices:
o Two-dimensional arrays, homogeneous, storing elements of the same type.
o Created with matrix(): mat <- matrix(1:6, nrow=2, ncol=3) creates a
2x3 matrix.
o Structure: Values fill column by column, so print(mat) shows row [1,] as
1 3 5 and row [2,] as 2 4 6; pass byrow=TRUE to fill row by row instead.
o Operations: Matrix algebra, e.g., mat %*% t(mat) for matrix multiplication.
o Use case: Linear algebra tasks, e.g., solving systems of equations or image
processing.
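A small sketch of matrix creation, indexing, and multiplication as described above; gram and by_row are illustrative names.

```r
mat <- matrix(1:6, nrow = 2, ncol = 3)   # filled column by column
dim(mat)        # 2 3
mat[1, 2]       # 3 (row 1, column 2)
mat[2, ]        # 2 4 6 (entire second row)

# Fill row by row instead
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
by_row[1, ]     # 1 2 3

# (2x3) %*% (3x2) gives a 2x2 result
gram <- mat %*% t(mat)
dim(gram)       # 2 2
gram[1, 1]      # 35 = 1^2 + 3^2 + 5^2
```

Note that * multiplies element by element, while %*% performs matrix multiplication.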
Data Frames:
o Table-like structures, heterogeneous, where each column can have a different
type.
o Created with data.frame(): df <- data.frame(name=c("Alice",
"Bob"), age=c(25, 30)).
o Access: df$name or df[, "name"] for the name column; df[1, ] for the first
row.
o Properties: Combines features of lists (columns) and matrices (row-column
structure).
o Use case: Store datasets, e.g., survey data with columns for id, age, gender,
for analysis like summary(df).
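A minimal sketch of creating and accessing a data frame. stringsAsFactors = FALSE is set explicitly here as an assumption about the reader's setup, because its default differs between R versions before and after 4.0.

```r
df <- data.frame(
  name = c("Alice", "Bob"),
  age  = c(25, 30),
  stringsAsFactors = FALSE   # keep name as character on any R version
)

df$name                # "Alice" "Bob"
df[1, ]                # first row, still a data frame
older <- df[df$age > 25, ]   # rows where age exceeds 25
nrow(older)            # 1

sapply(df, class)      # column types: character and numeric
str(df)                # compact overview of structure
```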
Factors:
o Used for categorical data, storing unique levels as labels.
o Created with factor(): f <- factor(c("low", "high", "low")).
o Levels: levels(f) returns "high", "low" (alphabetical by default).
o Internal storage: Stored as integers, e.g., as.numeric(f) returns codes like 2,
1, 2.
o Use case: Statistical modeling, e.g., lm(y ~ factor(group), data=df)
treats group as categorical.
o Customization: Reorder levels with factor(f, levels=c("low",
"high")).
Arrays:
o Multi-dimensional extensions of matrices, homogeneous.
o Created with array(): arr <- array(1:12, dim=c(2, 3, 2)) creates a
2x3x2 array.
o Access: arr[1, 2, 1] for specific elements.
o Use case: Multi-dimensional data, e.g., 3D image data (height, width, color
channels).
o Operations: Similar to matrices, e.g., arr + 1 adds 1 to each element.
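A short sketch of array indexing, assuming the same 2x3x2 array as above; shifted and slice_sums are illustrative names.

```r
arr <- array(1:12, dim = c(2, 3, 2))   # filled column by column, slice by slice
dim(arr)           # 2 3 2

arr[1, 2, 1]       # 3 (row 1, column 2, first slice)
arr[1, 2, 2]       # 9 (same position, second slice)
arr[, , 2]         # the whole second slice, a 2x3 matrix holding 7..12

shifted <- arr + 1             # adds 1 to every element
slice_sums <- apply(arr, 3, sum)   # 21 57, one sum per slice
```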
Type and Object Checking:
o typeof(): Returns the data type, e.g., typeof(5L) returns "integer".
o class(): Returns the object class, e.g., class(df) returns "data.frame".
o str(): Displays the structure, e.g., str(df) shows column types and data.
o Use case: Debugging to ensure correct types before operations, e.g., if
(is.numeric(vec)).
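The sketch below contrasts typeof() and class() and shows a type guard. The function safe_mean is a hypothetical helper written for this example, not part of base R.

```r
df <- data.frame(age = c(25, 30))

typeof(5L)      # "integer"
typeof(df)      # "list" (a data frame is built on a list)
class(df)       # "data.frame"

m <- matrix(1:4, nrow = 2)
typeof(m)       # "integer" (type of the stored values)
is.matrix(m)    # TRUE

# Hypothetical helper: fail early with a clear message on the wrong type
safe_mean <- function(v) {
  if (!is.numeric(v)) {
    stop("safe_mean() expects a numeric vector")
  }
  mean(v)
}

safe_mean(c(1, 2, 3))   # 2
```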
Coercion and Conversion:
o R automatically coerces types in mixed operations: c(1, "2") becomes the
character vector "1" "2".
o Explicit conversion: as.numeric("123") returns 123; as.character(123)
returns "123".
o Factors: as.factor(c("a", "b")) for categorical data; as.numeric(f) on a
factor f returns its integer codes, not its labels.
o Use case: Prepare data for analysis, e.g., converting strings to numbers for
calculations.
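A minimal sketch of implicit and explicit coercion. When types are mixed in c(), R promotes to the most general type, in the order logical, integer, double, character. The names a, b, s, num, and bad are illustrative.

```r
# Implicit coercion inside c()
a <- c(TRUE, 2L)     # integer: 1 2
b <- c(2L, 3.5)      # double: 2.0 3.5
s <- c(1, "2")       # character: "1" "2"
typeof(a); typeof(b); typeof(s)

# Explicit conversion
num <- as.numeric("123")    # 123

# A failed conversion yields NA and raises a warning;
# the warning is suppressed here only to keep the output clean
bad <- suppressWarnings(as.numeric("text"))
is.na(bad)                  # TRUE
```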
Practical Applications:
o Numeric/Integer: Compute statistics, e.g., mean(c(1, 2, 3)).
o Character: Label data, e.g., df$city <- c("NY", "LA").
o Logical: Filter data, e.g., df[df$age > 25, ].
o Vectors: Store sequences for analysis, e.g., sales <- c(100, 200, 300).
o Lists: Organize mixed data, e.g., list(data=vec, params=list(mean=0,
sd=1)).
o Data Frames: Analyze tabular data, e.g., summary(df) for descriptive
statistics.
Best Practices:
o Choose the right type/object for the task: Use factors for categorical data, data
frames for datasets.
o Check types with typeof() or class() to avoid errors in operations.
o Check conversions for data loss: values that cannot be converted become NA,
e.g., as.numeric("text") returns NA with a warning.
o Use str() to understand complex objects before manipulation.
o Keep vectors and matrices to one intended type: mixing types triggers silent
coercion, e.g., a single string turns a whole numeric vector into character.
Efficiency Tips:
o Integers are more memory-efficient than doubles for whole numbers: an integer
vector uses 4 bytes per element, a double vector 8.
o Pre-allocate vectors with vector("numeric", length=1000) for large
datasets.
o Use data frames rather than plain lists for tabular data: they enforce equal
column lengths and support row and column subsetting such as df[df$age > 25, ].
o Factors can reduce memory usage compared to character vectors when a few
labels repeat many times, since each element is stored as an integer code.
o Arrays are efficient for multi-dimensional numerical data, avoiding nested
lists.
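The pre-allocation tip can be sketched as below. numeric(n) is equivalent to vector("numeric", length = n); the loop is shown only to illustrate filling a pre-allocated vector, and the names squares and squares_vec are illustrative.

```r
n <- 1000

# Pre-allocate once, then fill in place
squares <- numeric(n)
for (i in seq_len(n)) {
  squares[i] <- i^2
}

# Where possible, a vectorized expression avoids the loop entirely
squares_vec <- seq_len(n)^2

isTRUE(all.equal(squares, squares_vec))   # TRUE
```

Growing a vector inside a loop with c(result, value) copies the data on each step, which is what pre-allocation avoids.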