R/Medicine
August 28, 2020

Reproducible Research

Increasingly important with:

  1. Data science in everything
  2. Journal expectations for availability of data and code
  3. Cloud computing for PHI

Unique to R

It is very easy for R users to accidentally break an analysis by updating R or a package.

Goal

To maximize reproducibility without sacrificing the RStudio experience we love!

RStudio

Hurdles to reproducibility

Nothing is perfect; use multiple layers of control to improve reproducibility.

Components to control:

  • Operating system (and libraries)
  • R version
  • Package versions

If all of these are controlled, your analysis is far more likely to stay reproducible.

My Recommendations

Proactive > Reactive!


For two years I've used:

  1. Docker, with
  2. package snapshots, and
  3. packrat.

Specifying OS with Docker

For complete control over the OS, we can use Docker.

Docker Layers

Specifying an R version with Docker

Test it out (R 3.4.1)

$ docker run -d rocker/rstudio:3.4.1
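A quick way to confirm that the tag really pins the R version is to ask the container itself. This is a minimal sketch, assuming Docker is installed and the image can be pulled; the `IMAGE` variable and the fallback message are illustrative and not part of the talk.

```shell
# Tag taken from the slide; swap in the R version your project needs.
IMAGE="rocker/rstudio:3.4.1"

if command -v docker >/dev/null 2>&1; then
  # Override the default command so the container prints its R version and exits.
  docker run --rm "$IMAGE" R --version | head -n 1
else
  # No Docker on this machine: show the command instead of running it.
  echo "docker not found; would run: docker run --rm $IMAGE R --version"
fi
```

The `--rm` flag removes the throwaway container afterwards, so this check leaves nothing behind.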

Opening RStudio

Using package repo to match (R 3.4.1)

The versioned Rocker images change the default CRAN repo to a dated Microsoft snapshot (MRAN), so install.packages() pulls package versions from that date rather than the latest releases.
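To see which repository a given image will install from, the `repos` option can be printed from inside the container. A small sketch under the same assumptions as before (Docker available, image pullable); the exact URL it prints depends on the image and is not shown here.

```shell
# Same pinned tag as on the slides.
IMAGE="rocker/rstudio:3.4.1"

if command -v docker >/dev/null 2>&1; then
  # getOption("repos") is the base R option that install.packages() consults.
  docker run --rm "$IMAGE" Rscript -e 'getOption("repos")'
else
  echo "docker not found; would run: docker run --rm $IMAGE Rscript -e 'getOption(\"repos\")'"
fi
```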

install with MRAN

Additions

To simplify everything I recommend:

  1. mounting a folder of source code,
  2. adding your UID to ease permission issues, and
  3. initializing a packrat project within the mounted source directory.

$ docker run -d -e USERID=$UID -e PASSWORD=fake -v "$(pwd)":/work \
    -p 7009:8787 rocker/rstudio:3.4.1

> install.packages("packrat")
> packrat::init("/work")
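The launch command above can be wrapped in a small function so the R version and host port are the only things that change between projects. This is a sketch, not the author's script: `start_rstudio` and `DRY_RUN` are hypothetical names, and `$(id -u)` is swapped in for `$UID` because `$UID` is only set by some shells.

```shell
# Hypothetical helper: build the docker command from the slide for any R version.
start_rstudio() {
  r_version="$1"            # image tag, e.g. 3.4.1
  host_port="$2"            # host port mapped to RStudio's 8787, e.g. 7009
  src_dir="${3:-$(pwd)}"    # folder of source code to mount at /work

  # Reuse the positional parameters to hold the command, one word per argument.
  set -- docker run -d \
    -e USERID="$(id -u)" -e PASSWORD=fake \
    -v "$src_dir":/work \
    -p "$host_port":8787 \
    "rocker/rstudio:$r_version"

  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$@"               # default: only print the command
  else
    "$@"                    # DRY_RUN=0: actually start the container
  fi
}

start_rstudio 3.4.1 7009
```

With the default dry run, the function only prints the command; setting `DRY_RUN=0` starts the container, after which RStudio should be reachable on the chosen host port.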

Code here:
github.com/vincentmajor/reproducible_RStudio_projects

Example

  • Acting as user rstudio,
  • change working directory into the mount.

Example

  • Init packrat; note the update to library paths.
  • All packages exist outside the container.

Example

  • Can use packrat to snapshot:

Example

  • And to restore from snapshot:
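The snapshot and restore steps can also be run without the RStudio console, for example from a terminal inside the container. This is a minimal sketch, assuming the project was initialized at /work as in the earlier slides; the `PROJECT` variable and the skip message are illustrative only.

```shell
# Mounted project directory used in the earlier slides.
PROJECT="/work"

if command -v Rscript >/dev/null 2>&1 && [ -d "$PROJECT/packrat" ]; then
  # Record the package versions currently used by the project.
  Rscript -e "packrat::snapshot(project = '$PROJECT')"
  # Reinstall the packages recorded in the most recent snapshot.
  Rscript -e "packrat::restore(project = '$PROJECT')"
else
  echo "skipping: Rscript or $PROJECT/packrat not found"
fi
```

Because the packrat directory lives in the mounted folder, the snapshot stays with the source code after the container is removed.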

Demo

Other resources