
LECTURE 1: OVERVIEW OF THE R PROGRAMMING LANGUAGE

Biostatistics 607: Module 1


OVERVIEW OF BIOSTAT 607: R MODULE 1



607 R - BASIC INFORMATION

Date and Time: Mon. and Wed. 4:30 - 6:00PM (Eastern Time)
R Module: August 28 - September 27
Instructors:
Primary Instructor: Nicholas Henderson ([email protected])
Graduate Student Instructor: Ye Yao
Office Hours:
Primary Instructor Office Hours: Wednesdays 2:00-3:00PM
GSI Office Hours: Thursdays 10:00-11:00AM



PLANNED LIST OF COURSE TOPICS
Basic R syntax
Logical expressions/if-else statements
Writing functions in R
Vectors, matrices, and lists
Data Frames
Reading files
Cleaning/organizing data/Tidyverse packages (dplyr, tidyr)
Data visualization
Statistical operations/Examples of statistical analyses
Creating reports in R markdown



WHO SHOULD TAKE THIS COURSE?
Those who want to learn how to start coding with R.
Those who have little or no background in R (or any other programming
language).
Those who want to quickly build up basic R programming skills.
Those who want to have a roughly “intermediate” R skillset after
completing the course.
We only have roughly 4 weeks for this module, so there is a limit to
how many topics we can cover.
Nevertheless, I think this course can give you a strong base in R from
which you can continue to learn more or to start performing data
analyses with R.



WHO SHOULD TAKE THIS COURSE?
Note: For some of the topics, there will be many similarities between R
and Python.



GRADING AND TENTATIVE SCHEDULE
2 online quizzes: 10% of final grade.
Pass/Fail: only 70% correct needed to get full credit on each quiz
2 homeworks: 30% of final grade.
tentative due dates: September 12th and September 22nd
1 in-class quiz: 30% of final grade
September 28th
final assignment: 30% of final grade.
tentative due date: September 29th
Attendance during the in-person lectures will not affect your grade.



OVERVIEW OF R



WHAT DOES A COMPUTER DO?
Storage of data
Data are collections of binary digits (i.e., 0s and 1s)
On temporary storage (CPU, RAM) or permanent storage (disks)
Performing operations on data
Basic operations, such as +, -, ×, /, …
Complex operations by combining basic operations.
Input and output
Numbers, text, picture, video, audio, …



WHAT IS A PROGRAMMING LANGUAGE?
Computer program: a series of coded instructions
that a computer (or a machine) can interpret
to control the operation of the computer/machine.
Programming language: a formal language that can be used to write programs
A language that a computer can interpret.
A language in which humans can express their ideas.
A language with which humans can communicate with other people.



THE R PROGRAMMING LANGUAGE
Can begin using quickly
R was a language designed for statistical analysis.
Powerful for data analysis and visualization
Computationally efficient when handling matrices and arrays
Seamless integration with C/C++ is possible.
Available across various platforms.
Free under GNU General Public License.



HISTORY OF R



GETTING STARTED WITH R



R AND RSTUDIO
R
Free software developed by R Core Team
Available at https://www.r-project.org/
Software and packages are managed by the nonprofit organization “R
Foundation”
RStudio
An integrated development environment (IDE) for programming in R.
Provides many add-ons to R available in a single interface.
Developed by RStudio, Inc.
Available in both free (AGPLv3) and commercial editions at
https://www.rstudio.com



INSTALLING R/RSTUDIO ON YOUR LOCAL COMPUTER
R and RStudio are separate things.
You will need to install R first
R
Download and install R at https://cloud.r-project.org/
RStudio
Download and install the open source version of RStudio Desktop at
https://rstudio.com/products/rstudio/download/#download
You do not have to use RStudio for this course, but I would definitely
recommend it.



UNDERSTANDING THE RSTUDIO INTERFACE
When you open RStudio, it should look something like:

[Screenshot of the RStudio interface, with the console panel annotated: "You can type in R commands here"]



UNDERSTANDING THE RSTUDIO INTERFACE
The left-hand panel is where you can type in R code directly.
For example, we can treat R as a calculator and add and multiply numbers
by typing them directly in the left-hand panel.
Typing and running R code line-by-line like this is referred to as using R in
interactive mode.



UNDERSTANDING THE RSTUDIO INTERFACE
When writing more complex code that you can reuse, it is usually better
to write it in a separate file such as an R script (this type of file ends in .R).
To create a new R script, go to File -> New File -> R Script in RStudio.



UNDERSTANDING THE RSTUDIO INTERFACE (WRITING AN R SCRIPT)
Let’s write an R script that simply will print out “Hello World” whenever
we run it.
To do this we just write the following R code in the empty R script:
"Hello World"
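
As a hedged variant of the script (not taken from the lecture), the sketch below calls print() explicitly. This can be useful because running a file with source("hello_world.R") does not, by default, display the value of a bare expression. The variable name message_text is only an illustrative assumption.

```r
# hello_world.R (explicit-print variant; illustrative only)
message_text <- "Hello World"   # store the text in a variable
print(message_text)             # print() displays the value in the console
```

With this version, the message is shown whether the script is run with the "Run" button or with source().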



SAVING THE R SCRIPT
Before running the script, you can save the file as “hello_world.R”.



PRINTING “HELLO WORLD” IN RSTUDIO
To run the script, just click the “Run” button located at the top right of
your R script.
The message “Hello World” should appear in the R console below:



KAGGLE NOTEBOOKS (CODING IN THE BROWSER)
Kaggle Notebooks is one option that allows you to directly write and
execute R code in your web browser.
This can be convenient as you can write and execute R code as long as
you have an internet connection (though I would still recommend
installing R and RStudio on your local computer).
This can also make it easier to work on and save R code across multiple
devices.
Kaggle Notebooks is just one option for coding in the browser. Other
options include Replit, Google Colab notebooks, RStudio Cloud, and others.



KAGGLE NOTEBOOKS (CODING IN THE BROWSER)
https://www.kaggle.com/notebooks
Click on “+ New Notebook” to start programming in R.
Make sure to select R in the “Select Language” drop-down box.
Kaggle notebooks allow you to write both R code and written text in a
single document.
You can create a Kaggle account if you want to save all your notebooks
and scripts within your Kaggle account.



TRYING OUT CODE IN A KAGGLE NOTEBOOK
As an example, let’s try to write code that multiplies two numbers and
prints the result.
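
A minimal sketch of what such a notebook cell might contain is shown here. The numbers 6 and 7 and the variable names a, b, and product are arbitrary choices for illustration, not the values used in the lecture.

```r
# Multiply two numbers and print the result
a <- 6              # first number (arbitrary example value)
b <- 7              # second number (arbitrary example value)
product <- a * b    # multiply the two numbers
print(product)      # display the result below the cell
```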



REPLIT (CODING IN THE BROWSER)
There are quite a few other good browser-based coding options that allow
you to write and execute R code in your web browser.
These include:
Jupyter Notebooks
RStudio Cloud
Google Colaboratory (easy to share code and save your work to
Google Drive directly)
Replit: https://replit.com/languages/rlang



VARIABLES AND OPERATIONS IN R



USING R AS A CALCULATOR
You can use R as a basic calculator.
For example:
42 + 17
[1] 59
sqrt(243)
[1] 15.58846
1.56*1233
[1] 1923.48
7.21*8^4
[1] 29532.16



USING R AS A CALCULATOR
For more complicated mathematical operations, it is useful to store
intermediate values in named variables.
For example:
x <- (42 + 17)*sqrt(43)
y <- 7.21*8^4 + log(2.34)
z <- x/y
z ## print out the value of z
[1] 0.01310022

Here, x, y, and z are examples of variables.


The pair of characters <- used together is known as the assignment
operator in R.
x <- 2 assigns the value 2 to the variable x.



VARIABLES IN R
What is a variable?
A named storage of a value (or an object) in memory.
Why do we need variables?
To reuse the same value later on.
To generalize an expression to use in many cases.
How to use variables?
To use (read) the value, simply use the variable name as if it were equal
to its stored value.
To set the value, use the assignment operator <-
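
The points above can be sketched as follows. The variable names (radius, area, area_new) and the values are illustrative assumptions, not examples from the lecture.

```r
radius <- 2.5                # set: store the value 2.5 under the name radius
area <- pi * radius^2        # read: radius is replaced by its stored value
print(area)

radius <- 4                  # re-assign: radius now stores a new value
area_new <- pi * radius^2    # the same expression works for the new value
print(area_new)
```

Note that area is not updated automatically when radius changes; the expression has to be evaluated again, which is why the second result is stored separately.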



RULES FOR CHOOSING VARIABLE NAMES IN R
In R, variable names can include the following:
letters : A-Z a-z
digits : 0-9
underscore and period : _ .
Additional rules:
Variable names must start with a letter or a period (not an underscore or
a digit).
If a name starts with a period, the period cannot be immediately followed by a digit.
Variable names are case sensitive.



EXAMPLES OF VALID AND INVALID VARIABLE NAMES IN R
Valid          Invalid
i              2things
my_variable    location@
answer42       _user.name
.name          .3rd
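
One hedged way to check names like these yourself is the base R function make.names(), which converts a string into a syntactically valid name. If the string comes back unchanged, it was already valid. The helper name is_valid_name is a hypothetical choice made for this sketch.

```r
# Hypothetical helper: a name is treated as valid if make.names()
# returns it unchanged
is_valid_name <- function(name) {
  make.names(name) == name
}

candidates <- c("i", "my_variable", "answer42", ".name",
                "2things", "location@", "_user.name", ".3rd")

result <- is_valid_name(candidates)   # make.names() works on whole vectors
names(result) <- candidates           # label each TRUE/FALSE with its name
print(result)
```

The first four names should come back TRUE and the last four FALSE, matching the table.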



CONVENTIONS FOR VARIABLE NAMES
Variables can be named however you want as long as you do not violate
any of the variable-naming rules.
However, making variable names descriptive is recommended.
Descriptive variable names make it easier to read code. This is very
helpful if:
You are sharing your code or
Looking back at code you wrote many weeks/months ago
Using a consistent convention for naming variables is recommended, too:

https://r4ds.had.co.nz/workflow-basics.html



QUESTION:
Which of the following is a valid variable name?
a. 10messages
b. important_message!
c. key.message
d. _important_message2



ASSIGNING VARIABLES
Variables can be assigned using either <- or =
x = 123 # Use = to assign a variable
y <- 123 # Or use <- to assign a variable
x # Retrieve the value of x
[1] 123
y # Retrieve the value of y
[1] 123

The pair of characters <- is the classic symbol used for variable
assignment in R.
The use of <- instead of = is often recommended in R style guides:
http://adv-r.had.co.nz/Style.html



<- VS. =
<- and = will work the same if they are both used in the “usual way”
(when assigning variables within or outside of a function).
One case when they are different is when used inside a function call.
For example, if we use = in the function sd(x):
sd(x = c(1,2,3,4,5)) # only sets the argument x in sd(x) to (1,2,3,4,5)
[1] 1.581139
# x ## would return an error here if x has not been assigned elsewhere
sd(x <- c(1,2,3,4,5)) # This actually assigns the vector (1,2,3,4,5) to x
[1] 1.581139
x
[1] 1 2 3 4 5



<- VS. =
However, assigning variables inside a function call, as in
sd(x <- c(1,2,3,4,5)), is uncommon (I never do it).
Whenever you use a function f with a named argument such as x, you will generally
want to call that function using f(x = ...)
So, in my opinion, there is not really a strong reason to prefer using <-
over = for assignment.
There are other justifications for using <-, such as the ability to
assign from left to right by using the reversed symbol ->
c(1, 2, 3, 4) -> a # Using c(1,2,3,4) = a will not work!
a
[1] 1 2 3 4



TYPES OF VARIABLES
Variables can be used to store different types of values.
Common types include numeric, text, and logical values.
x <- 3.2
x
[1] 3.2



TYPES OF VARIABLES
The elements of a vector have a type (or mode), and different vectors can
hold different types.
You can find the type of the elements in a vector by using the function
typeof
y <- sqrt(1743)
typeof(y) # double and integer are the two numeric types
[1] "double"
z <- 3 # R treats a number typed without the L suffix as double
z
[1] 3
typeof(z)
[1] "double"
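
To illustrate the two numeric types, here is a short sketch (the variable names a and b are arbitrary). In R, writing the suffix L after a whole number stores it as an integer rather than a double.

```r
a <- 3       # no suffix: stored as a double
b <- 3L      # L suffix: stored as an integer

typeof(a)        # type of a
typeof(b)        # type of b
is.numeric(a)    # both doubles and integers count as numeric
is.numeric(b)
a == b           # the two values still compare as equal
```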



TYPES OF VARIABLES
The other common types for the elements in a vector include:
logical: TRUE or FALSE values
character: basically text, e.g., “hello”, “car”, …
y <- TRUE
typeof(y)
[1] "logical"
z <- "dog" # to define a character variable, place it inside quotes
typeof(z)
[1] "character"

We will discuss these types in more detail later on when we discuss
vectors, matrices, and lists.



R OPERATIONS WITH NUMBERS
Operator   Meaning          Example   Result
+          addition         5 + 8     13
-          subtraction      90 - 10   80
*          multiplication   4 * 7     28
/          division         7 / 2     3.5
%%         remainder        7 %% 2    1
^          exponent         3^4       81
**         exponent         3 ** 4    81



OPERATIONS HAVE PRECEDENCE
Operator         Description                Precedence
+, -             addition and subtraction   low
*, /             multiplication, division   …
%%               remainder                  …
**, ^            exponentiation             …
(expressions…)   parentheses                high
Similar precedence rules to usual arithmetic operations (in R, %% binds
more tightly than * and /).



OPERATION PRECEDENCE EXAMPLES
1 + 2 * 3^4 # power > mult/div > add/sub
[1] 163
(1 + 2) * 3^4 # parenthesis > power
[1] 243
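
Two further precedence cases are worth trying, sketched here with illustrative variable names that are not from the lecture. In R, exponentiation is applied before a leading minus sign, and repeated exponents are grouped from right to left.

```r
neg_power   <- -2^2      # exponent first, then the minus sign: -(2^2)
paren_power <- (-2)^2    # parentheses force the minus sign first
right_assoc <- 2^3^2     # grouped right to left: 2^(3^2)
left_forced <- (2^3)^2   # parentheses force left-to-right grouping

print(c(neg_power, paren_power, right_assoc, left_forced))
```

When in doubt, adding parentheses makes the intended order explicit.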



EXERCISE
Compute the number

$$\sqrt{1.43 + 5^{1.2}}$$

directly in the R console.
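
One possible way to check your answer is sketched here, assuming the expression is the square root of 1.43 + 5^1.2. The variable name result is only illustrative.

```r
# Assumes the exercise asks for sqrt(1.43 + 5^1.2)
result <- sqrt(1.43 + 5^1.2)   # the exponent is evaluated before the addition
result                          # typing the name prints the value
```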



EXERCISE
Write an R script that assigns the value …

$$\ln\left(1 + \exp(-2^{1.4})\right) + \ln\left(1 + 2\exp(3^{1.7})\right)$$

… to a variable named x and prints the result in the Console when you run
the script.
x <- log(1 + exp(-2^1.4)) + log(1 + 2*exp(3^1.7))
x
[1] 7.235923

Note that in R, the function log computes the natural logarithm.



CONTROL FLOWS



CONTROL FLOWS
Control flow determines which statements are executed in R and the
order in which they are executed.
Statement
A unit of execution
Often represented as one line of code
Statements are executed one by one
Controlling which statements are executed and the order in which they
are executed largely determines what actions your program performs.



CONTROL FLOWS
Conditional statements (most common are if-else statements)
Execution of statements that depend on a condition
Functions
Execute the same collection of statements in different locations of
your program
Loops (for or while)
Repeat one or more statements many times
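
The three kinds of control flow listed above can be sketched together in a few lines. The function name describe_sign and the input values are hypothetical choices for illustration. Each construct is covered in more detail later in the course.

```r
# Function: a reusable collection of statements
describe_sign <- function(value) {
  # Conditional: which branch runs depends on the condition
  if (value > 0) {
    "positive"
  } else if (value < 0) {
    "negative"
  } else {
    "zero"
  }
}

# Loop: repeat the same statement once for each value
for (v in c(-1.5, 0, 3)) {
  print(describe_sign(v))
}
```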



CONDITIONAL (IF-ELSE) STATEMENTS AND LOGICAL EXPRESSIONS
To begin writing if-else statements, we will need to know how to
construct logical expressions.
If-else statements are useful when we want to perform an operation
only if certain “conditions” are satisfied.
For example, you may want to perform an operation only on
numbers which are greater than zero.
In programming languages, conditions are usually represented by logical
expressions (often called Boolean expressions).
Logical expression: an expression that evaluates to either TRUE or
FALSE.



CONDITIONAL (IF-ELSE) STATEMENTS AND LOGICAL EXPRESSIONS
The following are examples of logical expressions in R:
5 > 3
3 <= 5
16.0 + 1.3*1.3 > 17.0
"number" == "digit"
Each of the above expressions will evaluate to either TRUE or FALSE if
you run them in R.
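
As a quick check, the four expressions can be stored in variables and printed together (the names r1 to r4 are arbitrary):

```r
r1 <- 5 > 3                      # 5 is greater than 3
r2 <- 3 <= 5                     # 3 is less than or equal to 5
r3 <- 16.0 + 1.3*1.3 > 17.0      # left side is 17.69
r4 <- "number" == "digit"        # the two strings differ

print(c(r1, r2, r3, r4))
```

The first three should be TRUE and the last one FALSE.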



LOGICAL EXPRESSIONS



CONSTRUCTING LOGICAL EXPRESSIONS
Most logical expressions are constructed by using some combination of:
Comparison operators (<, >, <=, >=, ==, !=)
Logical operators (and, or, not) (in R: &&, ||, !)
Examples:
5 > 3
[1] TRUE
3 <= 5
[1] TRUE
x <- "number" == "digit" # assign to the variable x the value
                         # returned by this logical expression
x
[1] FALSE



COMPARISON OPERATORS
Operator   Meaning                    Example          Result
<          Less than                  5 < 3            FALSE
>          Greater than               5 > 3            TRUE
<=         Less than or equal to      3 <= 6           TRUE
>=         Greater than or equal to   4 >= 3           TRUE
==         Equal to                   2 == 2           TRUE
!=         Not equal to               'str' != 'stR'   TRUE



LOGICAL OPERATOR: &&
The logical operator && is used in R to represent the logical AND
&& is used to test whether or not two statements are both true.
For two logical expressions A and B, the logical expression A && B is true
only if both A and B evaluate to true.
4 > 2 && 5/2 == 1 ## only the first statement is TRUE
[1] FALSE
4 > 2 && "car" == "truck" ## only the first statement is TRUE
[1] FALSE
4 > 2 && 3 < 5 ## both statements are TRUE
[1] TRUE



LOGICAL OPERATOR: ||
The logical operator || is used in R to represent the logical OR.
For two Boolean expressions A and B, the Boolean expression A || B is
true if at least one of A and B evaluates to true.
Note that if A and B are both true, A || B will be true; or does not mean
only one of A and B is true.
4 > 2 || 5/2 == 1 ## only the first statement is TRUE
[1] TRUE
4 > 2 || "car" == "truck" ## only the first statement is TRUE
[1] TRUE
4 > 2 || 3 < 5 ## both statements are TRUE
[1] TRUE
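
As a small sketch of how &&, ||, and ! combine in practice, the code below tests whether a number lies strictly between 0 and 10. The value of x and the variable names are illustrative assumptions.

```r
x <- 4

in_range     <- x > 0 && x < 10      # TRUE only if both comparisons are TRUE
out_of_range <- x <= 0 || x >= 10    # TRUE if at least one comparison is TRUE
not_in_range <- !in_range            # ! flips TRUE to FALSE and vice versa

if (in_range) {
  print("x is between 0 and 10")
}
```

Here out_of_range and not_in_range always agree, because negating "A and B" gives "not A or not B".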


