Example Workflows

Mark Edmondson

2016-11-19

See all documentation on the googleComputeEngineR website

The following are some R scripts for common workflows. They assume you have previously signed up and set up a Google project, authentication and SSH.

Custom Team RStudio Server

This gives you a private RStudio Server with your custom packages and users.

In summary it:

  1. Launches an RStudio Server template with Hadley’s tidyverse
  2. Lets you add users and passwords
  3. Lets you log into RStudio and install packages as you would on RStudio Desktop
  4. Lets you save the state of the RStudio container to a private Docker repo on the Google Container Registry
  5. Lets you start other instances of RStudio Server with your custom settings

Launch the RStudio server template

Here we set up a 13GB RAM instance; the available options for predefined_type can be found via gce_list_machinetype().

library(googleComputeEngineR)

## setting up a 13GB RAM instance 
## see gce_list_machinetype() for options of predefined_type
vm <- gce_vm(template = "rstudio-hadleyverse",
             name = "rstudio-team",
             username = "mark", password = "mark1234",
             predefined_type = "n1-highmem-2")

## wait a bit, login at the IP it gives you

Add users and set up packages

You can add users via:

gce_rstudio_adduser(vm, username = "bill", password = "flowerpot")

You can then log in at the IP address shown by printing vm or by calling gce_get_external_ip(vm), and install packages as you would on RStudio Desktop.
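For example, from your local R session (the package installed in the browser session is just an illustration):

## find the instance IP to log in at
gce_get_external_ip(vm)

## then, within the RStudio Server session in your browser, install packages as usual, e.g.
# install.packages("forecast")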

Saving the Docker container to the Google Container Registry

Every Google project has its own private Docker repository called the Container Registry.

The command below takes the running container with your changes and saves it there.

By default, the RStudio container runs under the name “rstudio”, which you can see via containers(vm).
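A quick check of the running containers (the exact output will vary):

## list the Docker containers running on the VM
containers(vm)
## the default RStudio container should appear under the name "rstudio"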

gce_push_registry(vm, 
                  save_name = "my_rstudio",
                  container_name = "rstudio")

This can take a while the first time, so go make a cup of tea. If successful, you should see your container saved at https://console.cloud.google.com/kubernetes/images/list

Start up another VM with your configuration

Now say you want a larger, more powerful instance, or want to launch another VM with your settings. You can now pull from the Container Registry and start up a VM with your settings already enabled.

We use template = "rstudio" to make sure the right ports and settings are configured for RStudio, and dynamic_image set to your custom tag to instruct the template to pull your own image instead of the default. You need to make sure the dynamic image is based on an RStudio one for this to work correctly.

The function gce_tag_container constructs the name of the custom image on your Container Registry for you.

## construct the correct tag name for your custom image
tag <- gce_tag_container("my_rstudio")
# gcr.io/mark-edmondson-gde/my_rstudio

## start a 50GB RAM instance
vm2 <- gce_vm(name = "rstudio-big",
              predefined_type = "n1-highmem-8",
              template = "rstudio",
              dynamic_image = tag)

## wait for it to launch

Clean up

You don't get charged for compute on stopped instances, and the next time you start them they will be ready within about 20 seconds.

gce_vm_stop(vm2)
gce_vm_stop(vm)
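When you need the servers again, a sketch of restarting and fetching the (possibly new) external IP:

## restart the instance later
gce_vm_start(vm)

## refresh the instance object and get its external IP
vm <- gce_vm("rstudio-team")
gce_get_external_ip(vm)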

Remote R cluster

This workflow takes advantage of the future integration to run your local R functions on a cluster of GCE machines.
You can offload expensive computations by spinning up a cluster and tearing it down again once you are done.

In summary, this workflow:

  1. Creates a GCE cluster
  2. Lets you perform computations
  3. Stops the VMs

Create the cluster

The example below uses the default r-base template, but you can use the steps above to create a custom dynamic_image pulled from the Container Registry if required.

Instead of the more interactive gce_vm(), we create the instances directly using gce_vm_container(), which does not wait for each job to complete before starting the next (waiting is impractical if you have a lot of VMs). You can then use gce_get_zone_op() to check the job status.

library(future)
library(googleComputeEngineR)

## names for your cluster
vm_names <- c("vm1","vm2","vm3")

## create the cluster using default template for r-base
## creates jobs that are creating VMs in background
jobs <- lapply(vm_names, function(x) {
    gce_vm_container(file = get_template_file("r-base"),
                     predefined_type = "n1-highmem-2",
                     name = x)
                     })
jobs
# [[1]]
# ==Operation insert :  PENDING
# Started:  2016-11-16 06:52:58
# [[2]]
# ==Operation insert :  PENDING
# Started:  2016-11-16 06:53:04
# [[3]]
# ==Operation insert :  PENDING
# Started:  2016-11-16 06:53:09

## check status of jobs
lapply(jobs, function(x) gce_get_zone_op(x))
# [[1]]
# ==Operation insert :  DONE
# Started:  2016-11-16 06:52:58
# Ended: 2016-11-16 06:53:14 
# Operation complete in 16 secs 

# [[2]]
# ==Operation insert :  DONE
# Started:  2016-11-16 06:53:04
# Ended: 2016-11-16 06:53:20 
# Operation complete in 16 secs 

# [[3]]
# ==Operation insert :  DONE
# Started:  2016-11-16 06:53:09
# Ended: 2016-11-16 06:53:30 
# Operation complete in 21 secs

## get the VM objects
vms <- lapply(vm_names, gce_vm)

It is safest to set up the SSH keys separately for multiple instances, using gce_ssh_setup() - this is normally called for you when you first connect to a VM.

## set up SSH for the VMs
vms <- lapply(vms, gce_ssh_setup)

We now make the VM cluster, as per the details given in the future README:

## make a future cluster
plan(cluster, workers = as.cluster(vms))

Using the cluster

The cluster is now ready to receive jobs. You can send them by simply using %<-% instead of <-.

## use %<-% to send functions to work on cluster
## See future README for details: https://github.com/HenrikBengtsson/future
a %<-% Sys.getpid()

## make a big function to run asynchronously
f <- function(my_data, args){
   ## ....expensive...computations
   
   result
}

## send to cluster
result %<-% f(my_data, args)

For long-running jobs you can use future::resolved() to check on their progress.

## check if resolved
resolved(result)
# [1] TRUE

Cleanup

Remember to shut down your cluster. You are charged per minute, per instance of uptime.

## shutdown instances when finished
lapply(vms, gce_vm_stop)
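If you are completely finished with the cluster and no longer need the instances, a sketch of deleting them instead:

## or remove the VMs entirely once the results are safe
lapply(vms, gce_vm_delete)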

RStudio server + scheduler

This workflow demonstrates how you can take advantage of Dockerfiles to customise the VM templates.

Using Dockerfiles is recommended if you are making a lot of changes to a template, as it is much easier to keep track of what is happening.

In summary:

  1. Launch a template VM with the container you want to base yours upon
  2. Construct a Dockerfile in a folder with any other files or dependencies, such as cron
  3. Use docker_build to upload and build your custom Docker image on the VM
  4. Save your custom image to the Container Registry
  5. Launch another VM to run your custom image
  6. Schedule a script to download from Google Analytics, send an email and upload to BigQuery

Launch a template VM

Build VMs should be more powerful than the default machine type (f1-micro), otherwise there is a danger of them hanging on big, expensive builds.

library(googleComputeEngineR)

## installs rocker/hadleyverse docker image
vm <- gce_vm(name = "build-schedule-r", 
             template = "rstudio-hadleyverse", 
             predefined_type = "n1-standard-2")

Construct a Dockerfile

The Dockerfile here is available via get_dockerfile("hadleyverse-crontab").

It is shown below, and you could base your own upon it. This one installs cron for scheduling and nano, a simple text editor. It then installs the packages needed for the scheduled script:

From CRAN: googleAuthR, shinyFiles, googleCloudStorageR, bigQueryR, gmailr and googleAnalyticsR

From GitHub: cronR

FROM rocker/hadleyverse
MAINTAINER Mark Edmondson (r@sunholo.com)

# install cron and R package dependencies
RUN apt-get update && apt-get install -y \
    cron \
    nano \
    ## clean up
    && apt-get clean \ 
    && rm -rf /var/lib/apt/lists/ \ 
    && rm -rf /tmp/downloaded_packages/ /tmp/*.rds
    
## Install packages from CRAN
RUN install2.r --error \ 
    -r 'http://cran.rstudio.com' \
    googleAuthR shinyFiles googleCloudStorageR bigQueryR gmailr googleAnalyticsR \
    ## install Github packages
    && Rscript -e "devtools::install_github(c('bnosac/cronR'))" \
    ## clean up
    && rm -rf /tmp/downloaded_packages/ /tmp/*.rds

Using docker_build

We now upload the Dockerfile to the instance and build a new Docker image called my-cron-verse.

This can take some time the first time, so it's time for another cup of tea.

## build a new image based on rocker/hadleyverse with this Dockerfile
docker_build(vm, 
             dockerfile = get_dockerfile("hadleyverse-crontab"), 
             new_image = "my-cron-verse")

## wait for it to build (5mins +)
# ...
# [0m ---> 059fe3ed926a
# Removing intermediate container 7695f3dc071f
# Successfully built 059fe3ed926a
# Using existing public key in /Users/mark/.ssh/google_compute_engine.pub
# REPOSITORY           TAG                 IMAGE ID            CREATED             SIZE
# my-cron-verse        latest              059fe3ed926a        2 seconds ago       1.933 GB
# rocker/hadleyverse   latest              05e3b636e90b        24 hours ago        1.921 GB

Save your custom image to the Container Registry

This image can now be saved to the Container Registry:

gce_push_registry(vm, save_name = "my-rstudio", image_name = "my-cron-verse")

Once that is done, we don’t need this instance anymore.

gce_vm_delete(vm)

Launch another VM to run your custom image

You can now launch instances using your constructed image.

You can also use your custom image to create further Dockerfiles that use it as a dependency, using gce_tag_container() to get its correct name.

## now start an instance using our rstudio image in cloud-config
## this takes care of rstudio friendly settings, restart behaviour etc.
tag <- gce_tag_container("my-rstudio")

## rstudio template, but with your private rstudio build
vm2 <- gce_vm(name = "myrstudio2", 
              template = "rstudio", 
              dynamic_image = tag, 
              username = "mark", 
              password = "mark1234")

You can check when the image has downloaded by using gce_check_container():

## check on progress on the container pull
gce_check_container(vm2, "rstudio")
# -- Logs begin at Thu 2016-11-17 14:54:38 UTC, end at Thu 2016-11-17 14:57:38 UTC. --
# Nov 17 14:54:43 myrstudio2 docker[1045]: Unable to find image 'gcr.io/mark-edmondson-gde/my-rstudio:latest' locally
# Nov 17 14:54:47 myrstudio2 docker[1045]: latest: Pulling from mark-edmondson-gde/my-rstudio
# Nov 17 14:54:47 myrstudio2 docker[1045]: a84f66826a7f: Pulling fs layer
# ...
# ...
# Nov 17 14:58:36 myrstudio2 docker[1045]: [cont-init.d] conf: exited 0.
# Nov 17 14:58:36 myrstudio2 docker[1045]: [cont-init.d] done.
# Nov 17 14:58:36 myrstudio2 docker[1045]: [services.d] starting services
# Nov 17 14:58:36 myrstudio2 docker[1045]: [services.d] done.

## your custom rstudio instance is now ready
> vm2
# ==Google Compute Engine Instance==
#
# Name:                myrstudio2
# Created:             2016-11-17 06:54:18
# Machine Type:        f1-micro
# Status:              RUNNING
# Zone:                europe-west1-b
# External IP:         104.199.67.250
# Disks: 
#             deviceName       type       mode boot autoDelete
# 1 myrstudio2-boot-disk PERSISTENT READ_WRITE TRUE       TRUE

You can delete your instances, knowing that the custom image is safe in the Container Registry, or just stop them using gce_vm_stop() and start them again with gce_vm_start().

## delete the instance (the container is safe)
gce_vm_delete(vm2)
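Alternatively, a sketch of stopping and later restarting the instance rather than deleting it:

## stop the instance instead of deleting it
gce_vm_stop(vm2)

## ...and start it up again when needed
gce_vm_start(vm2)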

A demo script

A demo script for scheduling is below.

It is not recommended to rely on data held inside a Docker container, so we do not include data in the image. Instead, call out to dedicated data stores such as BigQuery or Cloud Storage, which you have access to under the same project if you are using Google Compute Engine.

In summary the script below:

  1. Downloads data from Google Analytics
  2. Uploads the data to BigQuery
  3. Uploads the data to Google Cloud Storage
  4. Sends an email giving the daily total

Log into your RStudio Server instance and create the following script:

library(googleCloudStorageR)
library(bigQueryR)
library(gmailr)
library(googleAnalyticsR)

## set options for authentication
options(googleAuthR.client_id = XXXXX)
options(googleAuthR.client_secret = XXXX)
options(googleAuthR.scopes.selected = c("https://www.googleapis.com/auth/cloud-platform",
                                        "https://www.googleapis.com/auth/analytics.readonly"))

## authenticate
## using service account, ensure service account email added to GA account, BigQuery user permissions set, etc.
googleAuthR::gar_auth_service("auth.json")

## get Google Analytics data
gadata <- google_analytics_4(123456, 
                             date_range = c(Sys.Date() - 2, Sys.Date() - 1),
                             metrics = "sessions",
                             dimensions = "medium",
                             anti_sample = TRUE)

## upload to Google BigQuery
bqr_upload_data(projectId = "myprojectId", 
                datasetId = "mydataset",
                tableId = paste0("gadata_",format(Sys.Date(),"%Y%m%d")),
                upload_data = gadata,
                create = TRUE)

## upload to Google Cloud Storage
gcs_upload(gadata, name = paste0("gadata_",Sys.Date(),".csv"))


## get top medium referrer
top_ref <- paste(gadata[order(gadata$sessions, decreasing = TRUE),][1, ], collapse = ",")
# 3456, organic

## send email with today's figures
daily_email <- mime(
  To = "bob@myclient.com",
  From = "bill@cool-agency.com",
  Subject = "Today's winner is....",
  body = paste0("Top referrer was: ", top_ref))
send_message(daily_email)

Save the script within RStudio as daily-report.R

You can then use cronR to schedule the script for a daily extract.

Use its RStudio addin, or in the console issue:

library(cronR)
cron_add(paste0("Rscript ", normalizePath("daily-report")), frequency = "daily")
# Adding cronjob:
# ---------------
#
# ## cronR job
# ## id:   fe9168c7543cc83c1c2489de82216c0f
# ## tags: 
# ## desc: 
# 0 0 * * * Rscript /home/mark/demo-schedule.R

The script will then run every day.
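To confirm the job has been registered, you could list the current cron jobs (a quick check, assuming the default crontab):

library(cronR)
## list the scheduled jobs for the current user
cron_ls()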

Make sure to run the script locally and in a test cron job first. Once satisfied, you can run gce_push_registry() again to save the RStudio image with your scheduled script embedded within it.
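A sketch of that save, assuming the running RStudio container on vm2 is still named "rstudio" and reusing the earlier save name:

## save the running RStudio container, now containing the scheduled script
gce_push_registry(vm2, 
                  save_name = "my-rstudio",
                  container_name = "rstudio")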

If you want to use the data in a Shiny app, you can call bqr_query from bigQueryR or gcs_get_object from googleCloudStorageR within your Shiny code to pull the data into your app.
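A minimal sketch of a Shiny app reading the scheduled output, assuming the bucket, project and dataset names are the placeholders used in the script above (authentication via gcs_auth()/bqr_auth() is omitted for brevity):

library(shiny)
library(googleCloudStorageR)
library(bigQueryR)

ui <- fluidPage(tableOutput("ga_table"))

server <- function(input, output, session) {

  ga_data <- reactive({
    ## read the CSV written by the scheduled script from Cloud Storage
    gcs_get_object(paste0("gadata_", Sys.Date(), ".csv"),
                   bucket = "my-bucket")
    ## or query BigQuery instead:
    # bqr_query("myprojectId", "mydataset",
    #           "SELECT medium, sessions FROM gadata_20161120")
  })

  output$ga_table <- renderTable(ga_data())
}

shinyApp(ui, server)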

Self-contained Shiny app

It is useful to have a dedicated Docker container with all the libraries, files and scripts necessary to run your app, so it can run on any machine with Docker without worrying about versions.

This example uses a local Dockerfile to install the libraries you need for your Shiny app, but also copies your Shiny app into the image so it is all self-contained.

The Shiny app can then be deployed on new instances via gce_vm() with the shiny template.

In summary:

  1. Create a build VM
  2. Build your custom image with your Shiny app directory included
  3. Push to the Container registry
  4. Deploy to a production VM

Create a build VM

We start with a Shiny template:

library(googleComputeEngineR)

## make sure the instance is big enough to install, 
## the default "f1-micro" does not compile packages easily
vm <- gce_vm(name = "build-app", template = "shiny", predefined_type = "n1-standard-2")

Build your custom image

This is as before, but the Dockerfile also includes a COPY command to copy necessary Shiny ui.R and server.R files into the Docker image.

The Shiny app used is the googleAuthR demo app, and the build directory can be found via: get_dockerfolder("shiny-googleAuthRdemo")

FROM rocker/shiny
MAINTAINER Mark Edmondson (r@sunholo.com)

# install R package dependencies
RUN apt-get update && apt-get install -y \
    libssl-dev \
    ## clean up
    && apt-get clean \ 
    && rm -rf /var/lib/apt/lists/ \ 
    && rm -rf /tmp/downloaded_packages/ /tmp/*.rds
    
## Install packages from CRAN
RUN install2.r --error \ 
    -r 'http://cran.rstudio.com' \
    googleAuthR \
    ## clean up
    && rm -rf /tmp/downloaded_packages/ /tmp/*.rds

## assume shiny app is in build folder /shiny
COPY ./shiny/ /srv/shiny-server/myapp/

Note the COPY command at the end - this copies from a folder in the same location as the Dockerfile and places it in the /srv/shiny-server/ folder, the default location for Shiny apps. This means the Shiny app will be available at your.ip.addr.ess/myapp/

We also install googleAuthR from CRAN, and libssl-dev, a Debian dependency that googleAuthR needs, via apt-get.

The file structure for the build is then:

list.files(get_dockerfolder("shiny-googleAuthRdemo"), recursive = TRUE)
# "Dockerfile"        "shiny/DESCRIPTION" "shiny/readme.md"   "shiny/server.R"    "shiny/ui.R"

We now build the custom image:

docker_build(vm, 
             dockerfolder = get_dockerfolder("shiny-googleAuthRdemo"),
             new_image = "shiny_gar")

Push to the Container registry

On a successful build, you can now upload to the Container Registry.

## push up to your private Google Container registry
gce_push_registry(vm, 
                  save_name = "shiny_gar", 
                  image_name = "shiny_gar")
# ...
# ...
# b363013633c9: Pushed
# 4809649dffb9: Pushed
# latest: digest: sha256:cb233f547d84dd94e7616fa7615522e15213d65cc2abb423dd4f3305d19309ce size: 19731

Deploy

You can now deploy your Shiny app on any instance by calling it from your Container Registry:

## make new Shiny template VM with your self-contained Shiny app
vm2 <- gce_vm(name = "deployedapp", 
              template = "shiny", 
              dynamic_image = gce_tag_container("shiny_gar"),
              predefined_type = "n1-standard-2")
              
## check for when image has finished downloading              
gce_check_container(vm2, "shinyserver")
# ...
# ...
# Nov 17 22:11:15 myshinyapp2 docker[1039]: [cont-init.d] done.
# Nov 17 22:11:15 myshinyapp2 docker[1039]: [services.d] starting services
# Nov 17 22:11:15 myshinyapp2 docker[1039]: [services.d] done.

Your app should now be running at your IP plus the folder set in the Dockerfile, such as http://123.456.XXX.XXX/myapp/
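For example, a quick way to construct that URL from your local session:

## get the external IP of the deployed VM
ip <- gce_get_external_ip(vm2)

## the app should be available here
paste0("http://", ip, "/myapp/")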

Clean up the VMs to avoid unnecessary costs:

# delete build VM
gce_vm_delete(vm)

# stop and start production shiny app as needed
gce_vm_stop(vm2)

Setting up a custom OpenCPU server

The workflow below installs your GitHub package into the Docker container using a small Dockerfile.

In summary:

  1. Launch an OpenCPU instance
  2. Build a new OpenCPU image with your Github custom package
  3. Push the image to the Container Registry for safe-keeping
  4. Stop the default OpenCPU docker container and launch your own

Launch an OpenCPU Instance

library(googleComputeEngineR)

## start an opencpu template
vm <- gce_vm(name = "opencpu", template = "opencpu", predefined_type = "n1-standard-2")

## wait for opencpu image to load
gce_check_container(vm, "opencpu")

Build a new OpenCPU image

This Dockerfile is available via get_dockerfolder("opencpu-installgithub") and is shown below. It installs a custom package from GitHub - in this case an application that predicts which URLs a user will visit, for prefetching.

FROM opencpu/base
MAINTAINER Mark Edmondson (r@sunholo.com)

# install any package dependencies
RUN apt-get update && apt-get install -y \
    nano \
    ## clean up
    && apt-get clean \ 
    && rm -rf /var/lib/apt/lists/ \ 
    && rm -rf /tmp/downloaded_packages/ /tmp/*.rds
    
## Install your custom package from Github
RUN Rscript -e "devtools::install_github(c('MarkEdmondson1234/predictClickOpenCPU'))"

The Dockerfile is used to build the custom image below:

## build a docker image with your package installed
docker_build(vm, 
             dockerfolder = get_dockerfolder("opencpu-installgithub"),
             new_image = "opencpu-predictclick")

Push the image to the Container Registry

## push up to your private Google Container registry
gce_push_registry(vm, 
                  save_name = "opencpu-predictclick", 
                  image_name = "opencpu-predictclick")

Deploy

In this case we don't start a new instance; instead, we stop the running OpenCPU container and start our own. We need to stop the default container to free up ports 80 and 8004, which OpenCPU needs to work.

## stop default opencpu container
docker_cmd(vm, "stop opencpu-server")

## run custom opencpu server
docker_run(vm, 
           image = "opencpu-predictclick", 
           name = "predictclick", 
           detach = TRUE, 
           docker_opts = "-p 80:80 -p 8004:8004")
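To check the deployment, you could call the OpenCPU HTTP API from your local session. The function name and argument below are hypothetical placeholders - substitute whatever your package actually exports:

library(httr)

## get the instance IP
ip <- gce_get_external_ip(vm)

## call a function from the installed package via the OpenCPU API
## ("predict_url" and its argument are placeholders, not real exports)
res <- POST(paste0("http://", ip,
                   "/ocpu/library/predictClickOpenCPU/R/predict_url/json"),
            body = list(visited = "/home"),
            encode = "json")
content(res)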

Clean up when you are done to avoid charges.

gce_vm_stop(vm)