This vignette explores R package download trends using the cranlogs
package.
Write the code files to your workspace.
drake_example("packages")
The new packages folder now includes the file structure of a serious drake project, plus an interactive-tutorial.R to narrate the example. The code is also online here.
This small data analysis project explores some trends in R package downloads over time. The datasets are downloaded using the cranlogs package.
library(cranlogs)
cran_downloads(packages = "dplyr", when = "last-week")
## date count package
## 1 2018-04-02 12577 dplyr
## 2 2018-04-03 15520 dplyr
## 3 2018-04-04 16830 dplyr
## 4 2018-04-05 15695 dplyr
## 5 2018-04-06 12978 dplyr
## 7 2018-04-07 0 dplyr
## 6 2018-04-08 8140 dplyr
Above, each count is the number of times dplyr was downloaded from the RStudio CRAN mirror on the given day. To stay up to date with the latest download statistics, we need to refresh the data frequently. With drake, we can bring all our work up to date without restarting everything from scratch.
First, we load the required packages. Drake knows about the packages you install and load.
library(drake)
library(cranlogs)
library(ggplot2)
library(knitr)
library(plyr)
We want to explore the daily downloads from these packages.
package_list <- c(
  "knitr",
  "Rcpp",
  "ggplot2"
)
We plan to use the cranlogs package. The data frames older and recent will contain the number of daily downloads for each package from the RStudio CRAN mirror.
data_plan <- drake_plan(
  older = cran_downloads(
    packages = package_list,
    from = "2016-11-01",
    to = "2016-12-01"
  ),
  recent = target(
    command = cran_downloads(
      packages = package_list,
      when = "last-month"
    ),
    trigger = "always"
  ),
  strings_in_dots = "literals"
)
data_plan
## # A tibble: 2 x 3
## target command trigger
## <chr> <chr> <chr>
## 1 older "cran_downloads(packages = package_list, from = \"2016-1… any
## 2 recent "cran_downloads(packages = package_list, when = \"last-m… always
Our data_plan data frame has a "trigger" column because the latest download data needs to be refreshed every day. We use triggers to force recent to always build. For more on triggers, see the vignette on debugging and testing. Instead of triggers, we could have just made recent a global variable like package_list instead of a formal target in data_plan.
We want to summarize each set of download statistics in a couple of different ways.
output_types <- drake_plan(
  averages = make_my_table(dataset__),
  plot = make_my_plot(dataset__)
)
output_types
## # A tibble: 2 x 2
## target command
## <chr> <chr>
## 1 averages make_my_table(dataset__)
## 2 plot make_my_plot(dataset__)
We need to define functions to summarize and plot the data.
# Mean daily download count for each package.
make_my_table <- function(downloads){
  ddply(downloads, "package", function(package_downloads){
    data.frame(mean_downloads = mean(package_downloads$count))
  })
}
# Line plot of daily downloads over time, one colored line per package.
make_my_plot <- function(downloads){
  ggplot(downloads) +
    geom_line(aes(x = date, y = count, group = package, color = package))
}
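To see what the summary table computes without downloading anything, here is a minimal sketch that uses base R's aggregate() as a stand-in for the ddply() call. The helper name mean_by_package and the toy data are invented for illustration; they are not part of the example project.

```r
# Toy data with the same columns that cran_downloads() returns.
# The counts are invented for illustration.
downloads <- data.frame(
  date = as.Date(c("2018-04-02", "2018-04-03", "2018-04-02", "2018-04-03")),
  count = c(10, 20, 100, 300),
  package = c("knitr", "knitr", "Rcpp", "Rcpp")
)

# Hypothetical base R stand-in for make_my_table(): one mean per package.
mean_by_package <- function(downloads) {
  out <- aggregate(count ~ package, data = downloads, FUN = mean)
  names(out)[names(out) == "count"] <- "mean_downloads"
  out
}

toy_means <- mean_by_package(downloads)
print(toy_means)
```

Row order in the result depends on how the package names sort in your locale, so look rows up by package name rather than by position.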
Below, the targets recent and older each take turns substituting the dataset__ wildcard. Thus, output_plan has four rows.
output_plan <- plan_analyses(
  plan = output_types,
  datasets = data_plan
)
output_plan
## # A tibble: 4 x 2
## target command
## <chr> <chr>
## 1 averages_older make_my_table(older)
## 2 averages_recent make_my_table(recent)
## 3 plot_older make_my_plot(older)
## 4 plot_recent make_my_plot(recent)
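The wildcard substitution above can be imitated in a few lines of base R. This is only a sketch of the idea, not how plan_analyses() is implemented; expand_wildcard is a hypothetical helper written for this illustration.

```r
# Hypothetical helper, not part of drake: cross every command template
# with every dataset name and substitute the wildcard.
expand_wildcard <- function(targets, commands, datasets,
                            wildcard = "dataset__") {
  # In expand.grid(), the first column varies fastest, so datasets cycle
  # within each template.
  grid <- expand.grid(
    dataset = datasets,
    index = seq_along(targets),
    stringsAsFactors = FALSE
  )
  data.frame(
    target = paste(targets[grid$index], grid$dataset, sep = "_"),
    command = mapply(
      function(command, dataset) {
        gsub(wildcard, dataset, command, fixed = TRUE)
      },
      commands[grid$index],
      grid$dataset,
      USE.NAMES = FALSE
    ),
    stringsAsFactors = FALSE
  )
}

expanded <- expand_wildcard(
  targets = c("averages", "plot"),
  commands = c("make_my_table(dataset__)", "make_my_plot(dataset__)"),
  datasets = c("older", "recent")
)
print(expanded)
```

Two templates crossed with two datasets give four rows, matching the shape of output_plan.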
We plan to weave the results together in a dynamic knitr report.
report_plan <- drake_plan(
  knit(knitr_in("report.Rmd"), file_out("report.md"), quiet = TRUE)
)
report_plan
## # A tibble: 1 x 2
## target command
## <chr> <chr>
## 1 "\"report.md\"" "knit(knitr_in(\"report.Rmd\"), file_out(\"report.md\")…
Because of the mention of knitr_in() above, make() will look for dependencies inside report.Rmd (targets mentioned with loadd() or readd() in active code chunks). That way, whenever a dependency changes, drake will rebuild report.md when you call make(). For that to happen, we need report.Rmd to exist before the call to make(). For this example, you can find report.Rmd here.
Now, we complete the workflow plan data frame by concatenating the results together. Drake analyzes the plan to figure out the dependency network, so row order does not matter.
whole_plan <- bind_plans(
  data_plan,
  output_plan,
  report_plan
)
whole_plan
## # A tibble: 7 x 3
## target command trigger
## <chr> <chr> <chr>
## 1 older "cran_downloads(packages = package_list, from =… any
## 2 recent "cran_downloads(packages = package_list, when =… always
## 3 averages_older make_my_table(older) any
## 4 averages_recent make_my_table(recent) any
## 5 plot_older make_my_plot(older) any
## 6 plot_recent make_my_plot(recent) any
## 7 "\"report.md\"" "knit(knitr_in(\"report.Rmd\"), file_out(\"repo… any
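To illustrate why row order does not matter, here is a rough base R sketch of dependency detection: parse each command, keep the symbols that are also target names, and build targets only once their dependencies are ready. drake's real code analysis is far more thorough; find_dependencies and build_order are hypothetical helpers, and the plan below is a deliberately shuffled subset.

```r
# Simplified plan with the rows deliberately out of build order.
plan <- data.frame(
  target = c("plot_recent", "averages_recent", "recent"),
  command = c(
    "make_my_plot(recent)",
    "make_my_table(recent)",
    "cran_downloads(packages = package_list, when = \"last-month\")"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each command, keep the variables that are
# also target names. all.vars() ignores function names by default.
find_dependencies <- function(plan) {
  deps <- lapply(plan$command, function(command) {
    intersect(all.vars(parse(text = command)), plan$target)
  })
  names(deps) <- plan$target
  deps
}

# Hypothetical helper: repeatedly emit the targets whose dependencies
# have all been built already.
build_order <- function(deps) {
  ordered <- character(0)
  while (length(ordered) < length(deps)) {
    is_ready <- vapply(deps, function(d) all(d %in% ordered), logical(1))
    ready <- setdiff(names(deps)[is_ready], ordered)
    if (length(ready) == 0) stop("circular dependency")
    ordered <- c(ordered, ready)
  }
  ordered
}

deps <- find_dependencies(plan)
order_found <- build_order(deps)
print(order_found)
```

Even though recent is the last row of the toy plan, it comes first in the build order because the other two commands mention it.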
Now, we run the project to download the data and analyze it. The results will be summarized in the knitted report, report.md, but you can also read the results directly from the cache.
make(whole_plan)
## target older
## target recent: trigger "always"
## target averages_older
## target averages_recent
## target plot_older
## target plot_recent
## target file "report.md"
## Used non-default triggers. Some targets may not be up to date.
readd(averages_recent)
## package mean_downloads
## 1 Rcpp 21870.967
## 2 ggplot2 15225.633
## 3 knitr 9980.433
readd(averages_older)
## package mean_downloads
## 1 Rcpp 14408.06
## 2 ggplot2 14641.29
## 3 knitr 9068.71
readd(plot_recent)
readd(plot_older)
Because we used triggers, each make() rebuilds the recent target to get the latest download numbers for today. If the newly downloaded data are the same as last time and nothing else changes, drake skips all the other targets.
make(whole_plan)
## Unloading targets from environment:
## averages_recent
## averages_older
## plot_older
## plot_recent
## target recent: trigger "always"
## Used non-default triggers. Some targets may not be up to date.
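The skipping behavior can be pictured with a toy cache: always refetch the data, but recompute the downstream summary only when the fetched value differs from the stored one. This sketch uses identical() on in-memory values as a stand-in; drake's own bookkeeping is more elaborate, and the refresh helper and cache environment are hypothetical.

```r
# Toy cache holding the last downloaded data and a downstream summary.
cache <- new.env()
cache$builds <- 0

# Hypothetical helper: always accept the new data (like an "always"
# trigger), but rebuild the downstream summary only if the data changed.
refresh <- function(new_data, cache) {
  changed <- !identical(cache$recent, new_data)
  cache$recent <- new_data
  if (changed) {
    cache$averages_recent <- mean(new_data)
    cache$builds <- cache$builds + 1
  }
  changed
}

first <- refresh(c(10, 20, 30), cache)   # nothing cached yet: rebuild
second <- refresh(c(10, 20, 30), cache)  # same data: downstream skipped
third <- refresh(c(20, 30, 40), cache)   # data shifted: rebuild
print(c(first, second, third))
```

The second call corresponds to the make() output above, where only recent is rebuilt and the other targets are left alone.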
To visualize the build behavior, plot the dependency network. Target recent and everything depending on it are always out of date because of the "always" trigger. If you rerun the project tomorrow, the recent dataset will have shifted one day forward, so make() will refresh averages_recent, plot_recent, and report.md. Targets averages_older and plot_older should be unaffected, so drake will skip them.
config <- drake_config(whole_plan)
vis_drake_graph(config)
When you rely on data from the internet, you should trigger a new download when the data change remotely. This section of the best practices guide explains how to automatically refresh the data when the online timestamp changes.
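As a rough sketch of the timestamp idea, the helper below pulls the Last-Modified value out of raw HTTP header lines and decides whether a fresh download is needed. It does not reproduce the approach in the best practices guide. The helper names and the header lines are invented for illustration; real header lines could come from base R's curlGetHeaders(), which needs network access.

```r
# Hypothetical helper: extract the Last-Modified value from a character
# vector of raw HTTP header lines. Returns NA if the header is absent.
extract_last_modified <- function(headers) {
  pattern <- "^last-modified:[[:space:]]*"
  hit <- grep(pattern, headers, ignore.case = TRUE, value = TRUE)
  if (length(hit) == 0) {
    return(NA_character_)
  }
  trimws(sub(pattern, "", hit[1], ignore.case = TRUE))
}

# Hypothetical helper: download again if either timestamp is unknown
# or the two timestamps differ.
needs_download <- function(new_stamp, old_stamp) {
  is.na(old_stamp) || is.na(new_stamp) || !identical(new_stamp, old_stamp)
}

# Invented header lines in the style of an HTTP response.
headers <- c(
  "HTTP/1.1 200 OK\r\n",
  "Content-Type: text/csv\r\n",
  "Last-Modified: Mon, 02 Apr 2018 08:00:00 GMT\r\n"
)

timestamp <- extract_last_modified(headers)
print(timestamp)
print(needs_download(timestamp, NA_character_))
print(needs_download(timestamp, timestamp))
```

Storing the timestamp alongside the data lets a later run compare the two and skip the download when nothing has changed remotely.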