Statistical consulting using R: a DRY approach from the Australian outback

July 1, 2015

A creek near 'home'

Poll

Which best describes your work/role?

statistical consultant
researcher using statistics
programmer
statistical researcher
other?

Outline

Background
Data analysis project workflow
Make
Git
DRY: dryworkflow package
Conclusions

My Background

Many years as a statistical consultant
- NSW Agriculture, CSIRO, UQ Public Health
- agricultural, genetics, medical researchers
Statistical software
- GENSTAT, Minitab, SAS, SPSS, …
- R (almost) exclusively since 2001

Real world consulting

Are these scenarios familiar?

I have a very simple question that will only take 5 minutes. I won't need to see you again.

We have several data points that need deleting. Can you rerun the analysis, and insert the new tables and plot into our report by 4pm today?

The journal got back to us: Can you rerun the analysis to take account of criticisms of our method? It's not the project we did last year but the one from 2009

Real world consulting

No matter what clients/funders/bosses say, what happens is often very different

All these situations need to be well organised and well documented

Standardised systems help too

Good computing tools (R and non-R) help too

Statistical consulting cycle

Plan
Document
Organise
Carry out analysis
Communicate results
Iterate through steps 1 to 5 and refine process

Reference:

See Long (2009), The Workflow of Data Analysis Using Stata. StataCorp LP.

Workflow of data analysis and reporting

Need to consider

Efficiency
Simplicity
Standardisation
Automation
Usability
Scalability
Collaboration
Reproducibility

R, make and git can help with many of these

Examples: Automation

write functions (packages) to automate routine work
standard directory structure
- many projects can use same directory structure
- can create directories using R or shell script
also
- create Makefiles automatically
- create git repos automatically
- create R syntax or reuse R syntax

Directory Structure (Base)

Standard data analysis project directory set up

admin/
backups/
configFile.rds
data/
doc/
extra/
lib/
Makefile
posted/
readMergeData/
reports/
src/
test/
work/

Data Directory Structure (Initial)

Data directories initially (raw data and codebook files)

data
├── codebook
│   ├── data1_codebook.csv
│   └── small2_codebook.csv
├── derived
└── original
    ├── data1-birth.csv
    ├── data1-yr21.csv
    └── small2.csv

3 directories, 5 files
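
A minimal shell sketch of how the directory skeleton above could be created by hand. This is not the dryworkflow package's own code: the project location (a temporary directory named myproject) is a placeholder, and files such as configFile.rds and Makefile are deliberately left out.

```shell
#!/bin/sh
# Sketch only: create the standard directories listed on the slides.
set -eu

# Hypothetical project location; replace with a real project path.
proj="$(mktemp -d)/myproject"

# Top-level directories plus the three data subdirectories.
for d in admin backups data/codebook data/derived data/original \
         doc extra lib posted readMergeData reports src test work; do
  mkdir -p "$proj/$d"
done

echo "Created project skeleton in $proj"
```

Because mkdir -p is harmless when a directory already exists, the same script can be rerun on an existing project.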

Examples: Rerunning analysis

manually
- need to document steps heavily
- still may forget something
make (Mecklenburg 2004)
- automates process
- only rerun steps needed
- keeps track of the process
  - but need to read/tweak make

make can also run SAS, SPSS, Stata, Perl, Python, sed, awk, … in batch mode

simple Makefile

target: dependencies
<TAB> command to run

.PHONY: all
all: read.Rout

read.Rout: read.R bmi2009.dta
<TAB>   R CMD BATCH read.R

type make at shell or set up Build in RStudio
only runs command if target is missing or older than any of its dependencies
read file bottom up to see process
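
The timestamp rule can be illustrated without make itself. The sketch below is a simplification, not how make is implemented: needs_rebuild is a hypothetical helper, the file dates are set artificially with touch -t, and the -nt file test is assumed to be available (it is in bash, dash and ksh).

```shell
#!/bin/sh
# Sketch only: mimic make's "is the target out of date?" check.
set -eu
cd "$(mktemp -d)"

# needs_rebuild TARGET DEP...
# Succeeds if TARGET is missing or any DEP is newer than TARGET.
needs_rebuild() {
  target=$1
  shift
  if [ ! -e "$target" ]; then return 0; fi
  for dep in "$@"; do
    if [ "$dep" -nt "$target" ]; then return 0; fi
  done
  return 1
}

touch -t 202001010000 read.R bmi2009.dta   # dependencies dated 1 Jan 2020
touch -t 202001020000 read.Rout            # target dated 2 Jan 2020

if needs_rebuild read.Rout read.R bmi2009.dta; then first=rerun; else first=up-to-date; fi

touch -t 202001030000 read.R               # script edited on 3 Jan 2020

if needs_rebuild read.Rout read.R bmi2009.dta; then second=rerun; else second=up-to-date; fi

echo "before edit: $first; after edit: $second"
# prints "before edit: up-to-date; after edit: rerun"
```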

less simple Makefile

.PHONY: all
all: report.pdf

report.pdf: report.Rmd analysis.Rout
<TAB> Rscript -e "library(rmarkdown);render('report.Rmd')"
analysis.Rout: analysis.R read.Rout
<TAB> R CMD BATCH --vanilla analysis.R
read.Rout: read.R bmi2009.dta
<TAB> R CMD BATCH --vanilla read.R

Each target depends on

syntax/report file (and data)
previous step in process

Problem: make has no built-in rules for R

No Problem

write your own rules, or
include file common.mk from
https://github.com/petebaker/r-makefile-definitions

.PHONY: all
all: report.pdf

# no commands needed: the pattern rules in common.mk supply them
report.pdf: report.Rmd analysis.Rout
analysis.Rout: analysis.R read.Rout
read.Rout: read.R bmi2009.dta

include ~/lib/common.mk

Use git!

Version control helps all projects

even solo statistician
several statisticians
clients too (rarely agree to use it though)

Easy to learn?

start simply (today)
built into RStudio
bells and whistles later

Good info online or see (Loeliger and McCullough 2012)
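
Starting simply might look like the sketch below, run from a shell with git installed. The file read.R, its contents, and the user name and email are placeholders for illustration only. In real use the identity would normally be set once with git config rather than passed with -c.

```shell
#!/bin/sh
# Sketch only: start tracking a small analysis project with git.
set -eu
cd "$(mktemp -d)"                  # stand-in for a real project directory

git init -q                        # create an empty repository here
echo 'x <- 1' > read.R             # hypothetical R script to track
git add read.R                     # stage the file

# Placeholder identity supplied on the command line for this demo.
git -c user.name="A Statistician" -c user.email="stats@example.com" \
    commit -q -m "First commit: add read script"

git log --oneline                  # show the one commit made so far
```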

dryworkflow package

Don't Repeat Yourself workflow

creates (standardised) directory structure
moves data, doc, codebook files to appropriate directories
creates R, Rmd, Makefiles, log files
- read.R, clean.R, summary.R, analyse.R, report.Rmd
initialises git repo and makes first commit

https://github.com/petebaker/dryworkflow

dryworkflow package

Early stage of development but currently can customise:

directory structure and file naming (more on way)
R, Rmarkdown, Sweave options
templates (syntax/report generation)

Demo?

Conclusions

approach & attitude are important
planning and documentation a must
consistency and common sense
R, make, git can help automation

See blog at http://www.petebaker.id.au

References

Loeliger, Jon, and Matthew McCullough. 2012. Version Control with Git: Powerful Tools and Techniques for Collaborative Software Development. 2nd ed. O’Reilly Media, Inc.

Long, J. Scott. 2009. The Workflow of Data Analysis Using Stata. StataCorp LP.

Mecklenburg, Robert. 2004. Managing Projects with GNU Make. 3rd ed. O’Reilly Media, Inc.