Creating a digital analytics workflow with R in Google Cloud

By Jan Roekens | 18-07-2018

Mark Edmondson, Data Insight Developer at IIH Nordic, is a speaker at the Digital Analytics Congres. In this blog post he takes a closer look at using the googleAuthR framework in R.


By Mark Edmondson

If you’re new to R, and would like to know how it helps with your digital analytics, Tim Wilson and I ran a workshop last month aimed at getting a digital analyst up and running. The course material is online at www.dartistics.com.

A top level overview is shown below:

[Figure: R infrastructure overview]

The diagram shows googleAuthR packages, other packages, servers and APIs, all of which interact to turn data into actionable analytics.

The most recent addition is googleComputeEngineR which has helped make great savings in the time it takes to set up servers. I have in the past blogged about setting up R servers in the Google Cloud, but it was still more dev-ops than I really wanted to be doing. Now, I can do setups similar to those I have written about in a couple of lines of R.

Data workflow

A suggested workflow for working with data is:

1. Infrastructure – Creating a computer to run your analysis. This can be either your own laptop, or a server in the cloud.
2. Collection – Downloading your raw data from one or multiple sources, such as APIs or your CRM database.
3. Transformation – Turning your raw data into useful information via ETL, modelling or statistics.
4. Storage – Storing the information you have created so that it can be automatically updated and used.
5. Output – Displaying or using the information in an end-user application.

The components of the diagram can be combined into the workflow above. You can swap out various bits for your own needs, but it's possible to use R for all of these steps.

You could do all of this with an Excel workbook on your laptop. However, as data analysis becomes more complicated, it makes sense to start breaking out the components into more specialised tools, since Excel will start to strain when you increase the data volume or want reproducibility.

Infrastructure

googleComputeEngineR uses Google Cloud's virtual machine instances to create servers, with specific support for R.

It uses Docker containers to make launching the servers and applications you need quick and simple, without you needing to know too much dev-ops to get started.

Due to Docker, the applications created can be more easily transferred into other IT environments, such as within internal client intranets or AWS.

To help with a digital analytics workflow, googleComputeEngineR includes templates for the below:

RStudio Server – an R IDE you can work with from your browser. The server edition means you can access it from anywhere, and can always ensure the correct packages are installed.
Shiny – A server to run your Shiny apps that display your data in dashboards or applications end users can work with.
OpenCPU – A server that turns your R code into a JSON API. Used for turning R output into a format web development teams can use straight within a website.

For instance, launching an RStudio server is as simple as:

[Code listing: launching an RStudio Server instance (an image in the original post, not reproduced here)]

The instance will launch within a few minutes and give you a URL where you can then log in.
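As a minimal sketch of what such a launch could look like, the snippet below calls `gce_vm()` with the package's "rstudio" template. It assumes googleComputeEngineR has already been configured through the `GCE_AUTH_FILE`, `GCE_DEFAULT_PROJECT_ID` and `GCE_DEFAULT_ZONE` environment variables. The instance name, machine type and login details are placeholders and are not taken from the original listing.

```r
library(googleComputeEngineR)

# Assumption: authentication, project and zone are picked up from the
# GCE_AUTH_FILE, GCE_DEFAULT_PROJECT_ID and GCE_DEFAULT_ZONE environment
# variables when the package is attached.

# Launch a VM from the "rstudio" template. All values below are placeholders.
vm <- gce_vm(
  name            = "rstudio-server",  # hypothetical instance name
  template        = "rstudio",         # RStudio Server template
  predefined_type = "n1-standard-1",   # hypothetical machine type
  username        = "analyst",         # placeholder RStudio login
  password        = "change-me"        # placeholder, choose your own
)

# Print the instance details returned by the API
vm
```

A running instance is billed until it is stopped, so it is worth stopping test servers when you are done with them.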

Collection

Once you are logged in to your RStudio Server, you can use all of R’s power to download and work with your data.

The googleAuthR packages can all be authenticated under the same OAuth2 token, to simplify access.

Other packages useful to digital analytics include API clients such as RSiteCatalyst and twitteR. A full list of digital analytics R packages can be found in the web technologies section of CRAN Task Views.

Another option is the R package googlesheets by Jenny Bryan, which could be used either as a data source or as a data store for reports, to be processed further later.

The below example R script fetches data from Google Analytics, SEO data from Google Search Console and CRM data from BigQuery.

[Code listing: fetching Google Analytics, Search Console and BigQuery data (an image in the original post, not reproduced here)]
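A rough sketch of such a collection script follows. It is not the original listing. It assumes a service account key file (here called `auth.json`, a hypothetical name) with access to all three services, and every site URL, view ID, project, dataset and table name is a placeholder. Function names can differ between package versions. In particular, `google_analytics()` is the v4 reporting call in recent googleAnalyticsR releases.

```r
library(googleAuthR)
library(googleAnalyticsR)
library(searchConsoleR)
library(bigQueryR)

# Request every scope up front so that one token covers all three services
options(googleAuthR.scopes.selected = c(
  "https://www.googleapis.com/auth/analytics.readonly",
  "https://www.googleapis.com/auth/webmasters",
  "https://www.googleapis.com/auth/bigquery"
))

# One authentication for all googleAuthR-based packages.
# "auth.json" is a hypothetical service account key file.
gar_auth_service("auth.json")

# SEO data from Google Search Console (placeholder site and dates)
seo_data <- search_analytics(
  siteURL    = "http://example.com",
  startDate  = "2018-06-01",
  endDate    = "2018-06-30",
  dimensions = c("date", "query", "page")
)

# Web analytics data from Google Analytics (placeholder view ID)
ga_data <- google_analytics(
  viewId     = 123456789,
  date_range = c("2018-06-01", "2018-06-30"),
  metrics    = "users",
  dimensions = c("date", "landingPagePath", "campaign")
)

# CRM data from BigQuery (placeholder project, dataset and table;
# the query uses legacy SQL table syntax)
crm_data <- bqr_query(
  projectId = "my-project",
  datasetId = "crm",
  query     = "SELECT userID, lifetimeValue FROM [crm.customers]"
)
```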

Transformation

This ground is well covered by existing R packages. My suggestion here is to embrace the tidyverse and use that to create and generate your information.

Applications include anomaly detection, measurement of causal effect, clustering and forecasting. Hadley Wickham and Garrett Grolemund's book “R for Data Science” is a recent compendium of knowledge on this topic, which includes this suggested workflow:

[Figure: data science workflow from “R for Data Science”]
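To make the transformation step concrete, here is a small illustrative sketch using dplyr from the tidyverse. The two data frames are toy stand-ins for analytics and CRM data, and their column names are assumptions made for the example rather than anything returned by the APIs above.

```r
library(dplyr)

# Toy stand-ins for collected data (illustrative columns only)
ga_data <- tibble(
  userID   = c("a", "a", "b", "c"),
  campaign = c("email", "search", "search", "social")
)
crm_data <- tibble(
  userID        = c("a", "b", "c"),
  lifetimeValue = c(100, 250, 40)
)

# Join web behaviour to CRM value, then summarise per campaign
campaign_value <- ga_data %>%
  inner_join(crm_data, by = "userID") %>%
  group_by(campaign) %>%
  summarise(
    users              = n_distinct(userID),
    avg_lifetime_value = mean(lifetimeValue)
  ) %>%
  arrange(desc(avg_lifetime_value))

campaign_value
```

The result is one row per campaign, ordered by average customer lifetime value, which is the kind of tidy table that is convenient to store and report on in the later steps.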

Storage

Once you have your data in the format you want, you often need to keep it somewhere it is easily accessible for other systems.

Google Cloud Storage is a cheap, reliable method of storing any type of data object so that it's always available for further use, and is heavily used within Google Cloud applications as a central data store. I use it for storing RData files or storing CSV files with a public link that is emailed to users when available. It is accessible in R via the googleCloudStorageR package.

For database style access, BigQuery can be queried from many data services, including data visualisation platforms such as Google’s Data Studio or Tableau. BigQuery offers incredibly fast analytical queries for TBs of data, accessible via the bigQueryR package.

An example of uploading data is below – again only one authentication is needed.

[Code listing: uploading data to BigQuery and Google Cloud Storage (an image in the original post, not reproduced here)]
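An upload along these lines might look like the sketch below. It is not the original listing. The key file, project, dataset, table and bucket names are all hypothetical, and the single `cloud-platform` scope is an assumption chosen because it covers both services.

```r
library(googleAuthR)
library(googleCloudStorageR)
library(bigQueryR)

# One scope that covers both BigQuery and Cloud Storage
options(googleAuthR.scopes.selected =
          "https://www.googleapis.com/auth/cloud-platform")

# "auth.json" is a hypothetical service account key file
gar_auth_service("auth.json")

# A small data frame standing in for your transformed output
campaign_value <- data.frame(
  campaign           = c("search", "email"),
  avg_lifetime_value = c(175, 100)
)

# Upload the rows to a BigQuery table (placeholder identifiers)
bqr_upload_data(
  projectId   = "my-project",
  datasetId   = "reports",
  tableId     = "campaign_value",
  upload_data = campaign_value
)

# Save the object as an RData file and upload it to a bucket
save(campaign_value, file = "campaign_value.RData")
gcs_upload(
  file   = "campaign_value.RData",
  bucket = "my-bucket",              # hypothetical bucket name
  name   = "campaign_value.RData"    # object name in the bucket
)
```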

Output

Finally, outputs include Shiny apps, RMarkdown, a scheduled email or an R API call using OpenCPU.

All googleAuthR functions are Shiny and RMarkdown compatible for user authentication – this means a user can log in themselves and access their own data whilst using the logic of your app to gain insight, without you needing access to their data at all. An example of an RMarkdown app taking advantage of this is the demo of the GentelellaShiny GA dashboard.

[Figure: GentelellaShiny GA dashboard]

You can launch OpenCPU and Shiny servers just as easily as RStudio Server via:

[Code listing: launching Shiny and OpenCPU servers (an image in the original post, not reproduced here)]
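A sketch of those launches, under the same assumptions as any other googleComputeEngineR call (authentication, project and zone supplied through the `GCE_AUTH_FILE`, `GCE_DEFAULT_PROJECT_ID` and `GCE_DEFAULT_ZONE` environment variables), could look like this. Instance names and machine types are placeholders.

```r
library(googleComputeEngineR)

# Launch a Shiny server from the "shiny" template (placeholder values)
shiny_vm <- gce_vm(
  name            = "shiny-server",
  template        = "shiny",
  predefined_type = "n1-standard-1"
)

# Launch an OpenCPU server from the "opencpu" template (placeholder values)
opencpu_vm <- gce_vm(
  name            = "opencpu-server",
  template        = "opencpu",
  predefined_type = "n1-standard-1"
)

# Stop instances you no longer need so they do not keep running up costs:
# gce_vm_stop(shiny_vm)
# gce_vm_stop(opencpu_vm)
```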

Shiny apps or RMarkdown HTML files can then be uploaded for display to the end user. If the server needs more power, simply save the container and relaunch it with more RAM or CPU.

OpenCPU is the technology demonstrated in my recent EARL London talk on using R to forecast HTML prefetching and deploying through Google Tag Manager.

Your Shiny, RMarkdown or OpenCPU functions can download data via:

[Code listing: downloading data from Google Cloud Storage and BigQuery (an image in the original post, not reproduced here)]
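For illustration only, a download step inside a Shiny app, RMarkdown document or OpenCPU function might resemble the sketch below. The key file, bucket, object, project, dataset and table names are hypothetical, and the `cloud-platform` scope is again an assumption.

```r
library(googleAuthR)
library(googleCloudStorageR)
library(bigQueryR)

# One scope that covers both BigQuery and Cloud Storage
options(googleAuthR.scopes.selected =
          "https://www.googleapis.com/auth/cloud-platform")

# "auth.json" is a hypothetical service account key file
gar_auth_service("auth.json")

# Download a stored RData file to disk, then load it into the session
gcs_get_object(
  object_name = "campaign_value.RData",  # hypothetical object name
  bucket      = "my-bucket",             # hypothetical bucket name
  saveToDisk  = "campaign_value.RData",
  overwrite   = TRUE
)
load("campaign_value.RData")

# Query a BigQuery table (placeholder identifiers, legacy SQL syntax)
top_campaigns <- bqr_query(
  projectId = "my-project",
  datasetId = "reports",
  query     = "SELECT campaign, avg_lifetime_value
               FROM [reports.campaign_value]
               ORDER BY avg_lifetime_value DESC
               LIMIT 10"
)
```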

Summary

Hopefully this has shown where some efficiencies could be made in your own digital analysis. For me, the reduction of computer servers to atoms of work has expanded the horizons of what is possible: applications such as sending big calculations to the cloud if they take too long locally; being able to send clients entire clusters of computers with a data application ready and working; and having customised R environments for every occasion, such as R workshops.

For the future, I hope to introduce Spark clusters via Google Dataproc, giving the ability to use machine learning directly on a dataset without needing to download locally; scheduled scripts that launch servers as needed; and working with Google’s newly launched machine learning APIs that dovetail into the Google Cloud. 

Mark Edmondson

Mark is a British data engineer living in Denmark, with a background in physics, SEO and music. Nowadays he works at IIH Nordic looking to bring data to life via machine learning APIs and services, whilst contributing to the open source communities as a Google Developer Expert for Google Analytics and the Google Cloud Platform. His main niche has been exposing Google APIs within the R digital marketing community, by creating several Google-focused R packages and Shiny apps, and by blogging at code.markedmondson.me.

This article was previously published at https://code.markedmondson.me/digital-analytics-workflow-through-google-cloud/

Author: Jan Roekens, Editor-in-Chief
