0% found this document useful (0 votes)
13 views21 pages

R Workflow Optimization Guide

Uploaded by

lucashelal
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
13 views21 pages

R Workflow Optimization Guide

Uploaded by

lucashelal
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

ProjectTemplate and

 R  Workflow

Dr  Jeromy  Anglim
School  of  Psychology,  Deakin  University
Context
• Use  case
– You  have  one  or  more  datasets
– You  need  to  prepare  the  data
– You  need  to  explore  the  data  and  generate  insights  probably  in  the  form  of  
tables,  figures,  and  text
• I.e.,  not  package  development  where  the  aim  is  to  develop  a  set  of  reusable  
functions
Symptoms  of  a  bad  R  workflow
• Loading  a  saved  rdata file  to  return  to  analyses
• library  statements  spread  throughout  the  code;  
– i.e.,  no  easy  way  to  see  at  a  glance  which  packages  are  required
– unclear  whether  a  package  will  be  loaded  in  time  for  when  it  is  needed
• Diagnosing  an  obscure  error  when  you  try  to  run  your  code  on  a  system  with  a  
different  stringAsFactors setting
• It's  unclear  how  to  get  back  up  and  running  with  analyses
• When  starting  a  new  project,  you  need  to  create  a  pile  of  setup  folders;  loose  or  
approximate  standards  force  you  to  think  about  where  to  put  things,  and  make  it  
difficult  to  return  to  projects  in  the  future
• Data  manipulations  are  interspersed  with  analyses.  Thus,  derived  variables  or  
datasets  may  not  exist  or  may  be  in  the  wrong  form  when  a  given  analysis  is  
performed.
Aims  of  this  talk
• Aims
– Introduce  ProjectTemplate
– Introduce  the  idea  of  customising ProjectTemplate
– Show  how  to  use  this  approach
– Explain  why  this  overcomes  many  issues  related  to  data  analysis  workflow  in  
R

• Benefits
– All  the  benefits  of  ProjectTemplate
– And  you  are  up  and  running  in  a  new  project  in  a  few  clicks
Initial  Demo:  
Example  project  for  journal  article
• Files  are  available  at:  https://osf.io/wkc5u/
• Video  review  of  materials:  https://www.youtube.com/watch?v=GKtjr-­‐lxHYM
ProjectTemplate
• http://projecttemplate.net/
ProjectTemplate
• Why  use  ProjectTemplate?
– Encourages  good  workflow  
• (1)  Configure,  (2)  load  packages,  (3)  load  support  files,  (4)  load  data,  (5)  
manipulate  data,  (6)  analyse  data
• More
– Systematic  place  to  store  files  and  settings
– Standardises  configuration  and  package  loading  settings  
– Automatically  load  r-­‐script  files
– Automatically  load  data  files  stored  in  data  directory
– Automate  running  initial  data  manipulation  code
• Installation
– install.project("ProjectTemplate", dep = TRUE)
Summary  of  Creating  a  ProjectTemplate Project
• Create  the  folder  structure
– library('ProjectTemplate')
create.project('myproject')
• Review  config/global.dcf
– Choose  settings
– Specify  packages  to  load
• Add  data  for  auto-­‐loading  to  data directory
• Add  any  additional  R  support  functions  to  the  lib
directory
• Load  the  project
– library('ProjectTemplate');
load.project()
• Write  any  initial  data  manipulation  code  and  place  in  
the  munge directory
• Create  data  analysis  files  (e.g.,  r-­‐scripts,  RMarkdown,  
Sweave Files)  in  home  or  reports  directory
– Include  the  load  project  commands  above  at  the  
top  of  each  such  file
config directory
• Configuration  settings  are  stored  in  config/global.dcf
• data_loading, munging, load_libraries: indicate  which  aspects  
of  ProjectTemplate should  run
– munging:  FALSE  is  useful  when  debugging  the  data  manipulation  process
• libraries: Specify  which  packages  you  want  to  load
• as_factors: Specifies  whether  by  default  strings  should  be  imported  as  
factors
lib  directory
• Code  is  automatically  run  
• Thus,  you  don't  have  to  add  and  run:    source("lib/myfile.r")each  time  
you  add  a  new  script
• Typical  use  case:  
– you  have  little  functions  that  you  have  from  another  project  or  you  have  
developed  some  custom  function  for  the  project  and  you  want  them  to  be  
accessible  (i.e.,  a  full  R  package  would  be  overkill).
data  directory
• Place  files  in  the  data  folder
• Data  is  automatically  loaded  based  on  the  file  
extension.
– Benefit:  no  need  to  remember  function  names  
for  data  import
• Name  of  file  is  generally  derived  from  file  name
– e.g.,  "mydata.csv"  is  imported  as  object  
"mydata"
– Excel  files  with  multiple  worksheets  take  the  
format  "filename.sheetname"
– Data  in  Rdata format  keeps  original  names
– You  can  override  defaults  by  putting  a  function  
in  the  lib  directory
• Full  list  of  supported  file  formats:  
http://projecttemplate.net/file_formats.html
munge  directory
• Munge  files  are  automatically  run  after  data  import
– Typical  names  start  with  numbers  to  have  control  over  sequencing  "01-­‐
munge.r",  "02-­‐munge.r"
• Common  tasks
– create  derived  variables:  
• composites  of  other  variables
• collapse  categories
• convert  from  character  to  numeric
• create  factors
• etc.
– created  aggregated  datasets
– merge  datasets
– remove  cases
– reshape  data
– etc.
• Important  point:  If  you  find  you  are  creating  derived  variables  in  your  analysis  
script,  move  this  code  to  the  munge  files
Running  ProjectTemplate
• You  should  place  the  following  as  the  first  command  in  
an  analysis  script
– library("ProjectTemplate");
load.project()
• What  happens  when  you  run  it:
– Configuration  file  is  loaded:  options  are  set;  
packages  are  loaded  (and  installed  from  CRAN  if  
needed)
– Scripts  in  the  lib file  are  sourced
– Data  in  the  data folder  is  loaded  into  R
– Data  manipulations  specified  in  the  munge folder  
are  run
• The  benefits
– Thus,  after  running  a  single  command  you  are  
now  ready  to  analyse  your  data,  or  perform  new  
analyses.
– All  data  import  and  manipulation  steps  are  
reproducible
Customise  your  own  version  of  ProjectTemplate
• Once  you  start  using  ProjectTemplate,  you  may  find  many  customisations  that  
you  always  need  to  make  to  a  new  project
– Packages  that  you  always  use
– Settings  that  you  prefer  over  the  defaults  (e.g.,  as_factors: FALSE)
– Particular  ways  that  you  generate  analysis  scripts
– Integration  with  RStudio project  structure
– Only  a  few  folders  in  ProjectTemplate are  truly  necessary  (i.e.,  lib,  config,  
data,  munge);  can  be  cleaner  to  remove  unnecessary  folders
– Create  new  folders  you  routinely  use
• Thus,  make  a  customised  version  of  ProjectTemplate that  matches  your  specific  
needs
• Save  this  customised  version  to  a  special  folder  on  your  computer  or  put  it  on  
Dropbox,  github,  etc.
• To  create  a  new  project
– Make  a  copy  of  your  customised  folder  structure
– Rename  the  project
– You  only  need  to  complete  the  project  specific  customisations
My  Customised  ProjectTemplate
• Basic  description
– http://jeromyanglim.blogspot.com.au/2014/05/customising-­‐projecttemplate-­‐in-­‐r.html

• Overview  of  files


– https://github.com/jeromyanglim/AnglimModifiedProjectTemplate

• Zip  file  of  Template


– https://github.com/jeromyanglim/AnglimModifiedProjectTemplate/archive/master.zip
My  customisations
• Modified  config to  include  preferred  packages  (e.g.,  ggplot2,  psych,  etc.)
• added  to  config:  as_factors:  FALSE
• added  rstudio project  file  so  that  project  can  be  opened  with  one  click  in  rstudio
• added  initial  rmarkdown file  for  performing  analyses
– includes  code  to  load  ProjectTemplate
• added  output  folder  as  a  default  space  to  output  any  derived  files  (tables,  graphs,  
derived  data)
• added  the  following  to  munge  file  to  make  it  easy  to  debug  munge  code
– #  library(ProjectTemplate);  load.project(list(munging=FALSE))  
• added  raw-­‐data  folder;  standardised place  to  do  very  low  level  data  
transformations
• readme.md is  modified  to  explain  to  others  how  to  run  code

• My  conventions  evolve  as  project  needs  evolve  or  new  tricks  arise
– considering  an  export  folder  and  script  designed  to  export  for  open  science
– makefile to  run  rmarkdown files
Customised  ProjectTemplate Workflow
• Setup  ProjectTemplate Folder  Structure
– Copy  the  zip  file  
• I  have  mine  stored  on  github and  bookmarked
• https://github.com/jeromyanglim/AnglimModifiedProjectTemplate/archive/
master.zip
– Rename  the  folder  and  the  RStudio  Project  file
• Add  script  files
– Functions  that  get  created  during  the  project  or  functions  that  need  to  be  
imported  get  put  in  .r  script  files  in  the  lib folder  (e.g.,  "myfuntions.r")
• Data
– Ensure  that  raw  data  is  roughly  in  the  right  format
– Place  data  files  in  data folder  with  the  names  you  want  the  data.frames to  have  
in  R  (e.g.,  mydata.csv becomes  mydata in  R)
• Data  manipulation
– Before  analysing  data,  it  is  usually  necessary  to  clean  the  data,  create  new  
variables,  merge  data,  and  so  on.
– This  all  goes  in  scripts  in  the  munge folder.
– Run  library("ProjectTemplate"); load.project() to  load  the  
data  and  then  write  any  data  manipulation  code.
Customised  ProjectTemplate Workflow
• Analyses
– Store  analyses  (i.e.,  code  to  generate  summary  statistics,  models,  tables,  
figures,  etc.)  in  Rmarkdown files
– You  need  a  code  chunk  before  any  analysis  that  loads  the  project
with  the  following  code
• library("ProjectTemplate"); load.project()
– It  can  be  useful  to  have  multiple  RMarkdown files:  e.g.,  for  exploratory  
analyses,  final  analyses  and  so  on.
– Alternatively,  just  put  the  rmarkdown file  in  the  working  directory
Example  of  Creating  a  Project
• If  you  want  to  do  workshop,  go  to:  https://github.com/jeromyanglim/leuven2016rworkshop
• https://github.com/jeromyanglim/leuven2016rworkshop/tree/master/project-­‐
examples/exercise-­‐project-­‐template

– Go  to  "exercise-­‐project-­‐template/raw-­‐materials"  unzip  the  Customised  version  of  


ProjectTemplate
– Give  the  folder  and  the  rstudio project  file  an  appropriate  name
– Put  cas.sav into  the  data folder  (California  Schools  Data)
– Open  the  Rstudio project  file  in  RStudio
– Open  "reports/explore.rmd"  and  run
library(ProjectTemplate); load.project()
– Add  a  few  basic  analyses  of  cas to  the  next  R  code  chunk
– Go  to  "munge/01-­‐munge.R"  and  add  a  new  variable  to  cas (e.g.,  create  a  variable  called  
performance  which  is  the  sum  of  cas$math and  cas$english
– Return  to  "reports/explore.rmd"  and  add  another  code  chunk.  Create  a  histogram  of  
cas$performance.
– Now  imagine  that  you  are  exiting  RStudio  and  then  returning  again.  i.e.,  Quit  RStudio  and  
then  reload  the  Rstudio Project  file
– Open  "reports/explore.rmd"  and  run
library(ProjectTemplate); load.project()
– You  should  see  that  your  histogram  code  for  cas$performance still  runs
Conclusion
• A  few  draw  backs  pertain  mostly  to  collaboration  and  sharing
– It  does  introduce  some  alternative  conventions  
• (e.g.,  option  specification,  package  loading,  data  loading)
• it  helps  to  have  a  readme  that  explains  how  it  all  works
• A  little  bit  of  a  startup  cost  if  you  typically  just  have  a  five  line  script  and  a  
data  file.
– Sometimes  you  don't  want  to  rename  data  files
– It  creates  one  more  dependency
• But  many  benefits  
– ProjectTemplate is  a  great  tool  if  you  regularly  perform  data  analysis  projects
– Standardisation is  very  helpful  to  your  future  self.
• One  click  and  you're  back  up  and  running  with  your  analysis.
– It's  a  great  framework  for  reproducible  research.
– The  true  power  comes  from  customising ProjectTemplate to  your  specific  
workflow.  It  is  very  flexible.
Thank  You  

Questions

You might also like