July 1, 2015

[Photo: a creek near 'home']

Poll

Which best describes your work/role?

  • statistical consultant
  • researcher using statistics
  • programmer
  • statistical researcher
  • other?

Outline

  • Background
  • Data analysis project workflow
  • Make
  • Git
  • DRY: dryworkflow package
  • Conclusions

My Background

  • Many years as a statistical consultant
    • NSW Agriculture, CSIRO, UQ Public Health
    • agricultural, genetics, medical researchers
  • Statistical software
    • GENSTAT, Minitab, SAS, SPSS, …
    • R (almost) exclusively since 2001

Real world consulting

Are these scenarios familiar?

  • I have a very simple question that will only take 5 minutes. I won't need to see you again
  • We have several data points that need deleting. Can you rerun the analysis, and insert the new tables and plot into our report by 4pm today?
  • The journal got back to us: can you rerun the analysis to take account of criticisms of our method? It's not the project we did last year but the one from 2009

Real world consulting

No matter what clients/funders/bosses say, what happens is often very different

All these situations need to be well organised and well documented

Standardised systems help too

Good computing tools (R and non-R) can help this process too

Statistical consulting cycle

  1. Plan
  2. Document
  3. Organise
  4. Carry out analysis
  5. Communicate results
  6. Iterate through steps 1 to 5 and refine process

Reference:

See Long (2009), The Workflow of Data Analysis Using Stata.

Workflow of data analysis and reporting

Need to consider

  • Efficiency
  • Simplicity
  • Standardisation
  • Automation
  • Usability
  • Scalability
  • Collaboration
  • Reproducibility

R, make and git can help with many of these

Examples: Automation

  • write functions (packages) to automate routine work
  • standard directory structure
    • many projects can use same directory structure
    • can create directories using R or shell script
  • also
    • create Makefiles automatically
    • create git repos automatically
    • create R syntax or reuse R syntax

Directory Structure (Base)

Standard data analysis project directory setup

admin/
backups/
configFile.rds
data/
doc/
extra/
lib/
Makefile
posted/
readMergeData/
reports/
src/
test/
work/
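
The Automation slide noted that these directories can be created with R or a shell script. A minimal R sketch, using the directory names listed above (run once from an empty project directory):

## create the standard base directories for a new project
dirs <- c("admin", "backups", "data", "doc", "extra", "lib",
          "posted", "readMergeData", "reports", "src", "test", "work")
for (d in dirs) dir.create(d, showWarnings = FALSE)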

Data Directory Structure (Initial)

Data directories initially (raw data and codebook files)

data
├── codebook
│   ├── data1_codebook.csv
│   └── small2_codebook.csv
├── derived
└── original
    ├── data1-birth.csv
    ├── data1-yr21.csv
    └── small2.csv

3 directories, 5 files
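
As an illustration of the read/merge step (readMergeData/), a minimal sketch using the file names from the tree above; the merge key "id" and the derived file name are assumptions:

## read the original files, merge, and store a derived copy
birth <- read.csv(file.path("data", "original", "data1-birth.csv"))
yr21  <- read.csv(file.path("data", "original", "data1-yr21.csv"))
data1 <- merge(birth, yr21, by = "id")    # "id" is a hypothetical key
saveRDS(data1, file.path("data", "derived", "data1.rds"))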

Examples: Rerunning analysis

  • manually
    • need to document steps heavily
    • still may forget something
  • make (Mecklenburg 2004)
    • automates process
    • only rerun steps needed
    • keeps track of the process
      • but need to read/tweak the Makefile
  • make can also batch-run SAS, SPSS, Stata, perl, python, sed, awk, …

simple Makefile

target: dependencies
<TAB> command to run

.PHONY: all
all: read.Rout

read.Rout: read.R bmi2009.dta
<TAB>   R CMD BATCH read.R

  • type make at the shell or set up Build in RStudio
  • only runs a command if its target is older than its dependencies (or missing)
  • read the file bottom-up to see the process

less simple Makefile

.PHONY: all
all: report.pdf

report.pdf: report.Rmd analysis.Rout
<TAB> Rscript -e "library(rmarkdown);render('report.Rmd')"
analysis.Rout: analysis.R read.Rout
<TAB> R CMD BATCH --vanilla analysis.R
read.Rout: read.R bmi2009.dta
<TAB> R CMD BATCH --vanilla read.R

Each target depends on

  • its syntax/report file (and data)
  • the previous step in the process

Problem: make has no built-in rules for R

No Problem

.PHONY: all
all: analysis.Rout

report.pdf: ${@:.pdf=.Rmd} analysis.Rout
analysis.Rout: ${@:.Rout=.R} read.Rout
read.Rout: ${@:.Rout=.R} bmi2009.dta

include ~/lib/common.mk
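
The included common.mk supplies generic pattern rules, so each project Makefile stays this short. A minimal sketch of the kind of rules such a file might contain (illustrative only; the real file covers many more targets):

# pattern rules: run .R files in batch mode; render .Rmd reports
%.Rout: %.R
<TAB> R CMD BATCH --vanilla $<
%.pdf: %.Rmd
<TAB> Rscript -e "library(rmarkdown); render('$<')"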

Use git!

Version control helps all projects

  • even a solo statistician
  • teams of statisticians
  • clients too (though they rarely agree to use it)

Easy to learn?

  • start simply (today)
  • built in to RStudio
  • bells and whistles later

Good information is available online, or see Loeliger and McCullough (2012)

dryworkflow package

Don't Repeat Yourself workflow

  • creates (standardised) directory structure
  • moves data, doc, codebook files to appropriate directories
  • creates R, Rmd, Makefiles, log files
    • read.R, clean.R, summary.R, analyse.R, report.Rmd
  • initialises a git repo and makes the first commit

https://github.com/petebaker/dryworkflow
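
A minimal usage sketch; the GitHub installation route follows the URL above, and the setup function name createProjects() is an assumption to check against the package documentation while the package is still developing:

## install the development version and set up a new project:
## moves data/doc/codebook files into the standard directories,
## writes R/Rmd syntax files and Makefiles, and makes the first
## git commit (function name and defaults are assumptions)
## devtools::install_github("petebaker/dryworkflow")
library(dryworkflow)
createProjects()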

dryworkflow package

At an early stage of development, but currently you can customise:

  • directory structure and file naming (more on the way)
  • R, Rmarkdown, Sweave options
  • templates (syntax/report generation)
  • Demo?

Conclusions

  • approach and attitude are important
  • planning and documentation are a must
  • consistency and common sense
  • R, make and git can help with automation

See blog at http://www.petebaker.id.au

References

Loeliger, Jon, and Matthew McCullough. 2012. Version Control with Git: Powerful Tools and Techniques for Collaborative Software Development. 2nd ed. O’Reilly Media, Inc.

Long, J. Scott. 2009. The Workflow of Data Analysis Using Stata. StataCorp LP.

Mecklenburg, Robert. 2004. Managing Projects with GNU Make. 3rd ed. O’Reilly Media, Inc.