Introduction

This is a getting-started guide to the Repo R package, which implements a repository manager for R objects. Repo is a data-centered data flow manager.

The Repo package builds one or more centralized local repositories where R objects are stored together with the corresponding annotations, tags, dependency notes, provenance traces and source code. Once a repository has been populated, stored objects can be easily searched, navigated, edited, imported and exported. Repo can store and manage both R data objects and generic files through dedicated functions. Repo also provides a Shiny interface for visual interaction with repositories.

What follows is a walk-through that quickly introduces the main features of Repo.

The latest version of Repo can be found at: https://github.com/franapoli/repo

Repo is also on CRAN at: https://cran.r-project.org/package=repo

Preparation

First of all, the following will enable Repo:

library(repo)

The following command creates a new repository in a temporary path, which is ok for this demo. By default the repository is instead created under “~/.R_repo”. The same function opens existing repositories. The variable rp will be used as the main interface to the repository throughout this guide.

rp <- repo_open(tempdir(), force=T)
Repo created.

This document is produced by a script named index.Rmd. The script itself can be added to the repository and newly created resources will be annotated as being produced by it. The following code (which will be clear later) stores the script.

rp$attach("index.Rmd", "Source code for Repo vignette", c("source","Rmd"))

Populating the repository

Here is a normalized version of the Iris dataset to be stored in the repository:

myiris <- scale(as.matrix(iris[,1:4]))
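As a quick check of what "normalized" means here, the following minimal sketch uses only base R to verify that scale, with its default parameters, centers each column to mean 0 and rescales it to standard deviation 1. The names col_means and col_sds are introduced only for this illustration.

```r
# Rebuild the normalized matrix exactly as in the guide.
myiris <- scale(as.matrix(iris[, 1:4]))

# With default parameters, scale() subtracts the column mean
# and divides by the column standard deviation.
col_means <- colMeans(myiris)
col_sds <- apply(myiris, 2, sd)

print(round(col_means, 10))
print(col_sds)
```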

The following call to the put method stores the contents of myiris in the repository. The data will actually be saved under the repo root in RDS format.

rp$put(
    obj = myiris,
    name = "myiris", 
    description = paste(
        "A normalized version of the iris dataset coming with R.",
        "Normalization is made with the scale function",
        "with default parameters."
    ),
    tags = c("dataset", "iris", "repodemo"), 
    src = "index.Rmd"
)

The call provides the data to be stored (obj), an identifier (name), a longer description, a list of tags, and the stored item containing the source code that generates the new item (src).
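Since items are saved in RDS format, it may help to see what an RDS round trip looks like outside of Repo. The sketch below uses only base R (saveRDS and readRDS) and a temporary file. It illustrates the storage format, not how Repo names or lays out its files; path and restored are hypothetical names.

```r
# Same normalized matrix as in the guide.
myiris <- scale(as.matrix(iris[, 1:4]))

# Serialize a single R object to a file, then read it back.
path <- tempfile(fileext = ".rds")
saveRDS(myiris, path)
restored <- readRDS(path)

# The restored object keeps values, dimnames and the attributes set by scale().
print(identical(myiris, restored))
```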

In this example, the Iris class annotation will be stored separately. Here’s a more compact call to put:

rp$put(iris$Species, "irisLabels", "The Iris class labels.",
         c("labels", "iris", "repodemo"), "index.Rmd")

Attaching visualizations

The following code produces a 2D visualization of the Iris data and shows it:

irispca <- princomp(myiris)
iris2d <- irispca$scores[,c(1,2)]
plot(iris2d, main="2D visualization of the Iris dataset",
     col=rp$get("irisLabels"))

Note that irisLabels is loaded on the fly from the repository.

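It would be nice to store the figure itself in the repo together with the Iris data. This is done using the attach method, which stores any file in the repo as is, plus annotations. Internally, attach calls put, so it accepts most of its parameters. Two differences are worth noting in the call below: the first argument is a file path instead of an object and a name, and the optional to parameter names the item the file is attached to.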

fpath <- file.path(rp$root(), "iris2D.pdf")
pdf(fpath)
plot(iris2d, main="2D visualization of the Iris dataset",
     col=rp$get("irisLabels"))
invisible(dev.off())
rp$attach(fpath, "Iris 2D visualization obtained with PCA.",
            c("visualization", "iris", "repodemo"),
              "index.Rmd", to="myiris")

Note that the PDF temporarily created at fpath can be safely removed (a copy has been made). The attached PDF can be accessed using an external PDF viewer directly from within Repo through the sys command. On a Linux system, this command runs the Evince document viewer and shows iris2D.pdf:

rp$sys("iris2D.pdf", "evince")

As another example, the source code previously attached can be visualized as follows:

rp$sys("index.Rmd", "evince")

Back to data analysis, the PCA eigenvalues shown below can give hints on the reliability of the 2D plot:

plot(irispca)

Thus attaching this plot to the 2D Iris plot could be useful. This means attaching a file to another attachment, which is allowed.

fpath <- file.path(rp$root(), "irisPCA.pdf")
pdf(fpath)
plot(irispca)
invisible(dev.off())
rp$attach(fpath, "Variance explained by the PCs of the Iris dataset",
            c("visualization", "iris", "repodemo"),
              "index.Rmd", to="iris2D.pdf")
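The eigenvalues plotted above can also be inspected numerically. The following sketch, which uses base R only, computes the share of total variance captured by each principal component and by the first two together, which is what the 2D plot relies on. The names explained and top2 are introduced for this illustration.

```r
# Recompute the PCA as in the guide.
myiris <- scale(as.matrix(iris[, 1:4]))
irispca <- princomp(myiris)

# princomp() returns component standard deviations in $sdev;
# their squares are the eigenvalues (component variances).
explained <- irispca$sdev^2 / sum(irispca$sdev^2)

# Share of variance retained by the 2D projection.
top2 <- sum(explained[1:2])

print(round(explained, 3))
print(round(top2, 3))
```

A value of top2 close to 1 suggests that little structure is lost when the data are projected onto the first two components.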

Storing some results

The following code clusters the Iris data and stores the result in the repository. One parameter is worth noting: depends, which lists the items the new item was computed from (here, myiris).

This dependency annotation is not mandatory; however, it helps keep things organized as the repository grows.

kiris <- kmeans(myiris, 5)$cluster
rp$put(kiris, "iris_5clu", "Kmeans clustering of the Iris data, k=5.",
         c("metadata", "iris", "kmeans", "clustering", "repodemo"),
           "index.Rmd", depends="myiris", T)

The following shows what the clustering looks like. The figure will be attached to the repository as well.

plot(iris2d, main="Iris dataset kmeans clustering", col=kiris)

fpath <- file.path(rp$root(), "iris2Dclu.pdf")
pdf(fpath)
plot(iris2d, main="Iris dataset kmeans clustering", col=kiris)
invisible(dev.off())
rp$attach(fpath, "Iris K-means clustering.",
    c("visualization", "iris", "clustering", "kmeans", "repodemo"),
    "index.Rmd", to="iris_5clu")

Finally, a contingency table of the Iris classes versus clusters is computed below. Let’s assume this is just a qualitative sanity check that will rarely be accessed, so it could unnecessarily clutter the repository. The special tag hide prevents an item from being shown unless explicitly requested. Attachments are hidden by default.

res <- table(rp$get("irisLabels"), kiris)
rp$put(res, "iris_cluVsSpecies",
         paste("Contingency table of the kmeans clustering versus the",
               "original labels of the Iris dataset."),
         c("result", "iris","validation", "clustering", "repodemo", "hide"),
         "index.Rmd", c("myiris", "irisLabels", "iris_5clu"), T)

Looking at the repository

The info command summarizes some information about a repository:

rp$info()
Root:            /tmp/RtmphzD2m2 
Number of items: 8 
Total size:      33.29 kB 

The Repo library supports an S3 print method that shows the contents of the repository. All non-hidden items will be shown, together with some details, which by default are: name, dimensions, size.

rp ## resolves to print(rp)
         ID  Dims    Size
     myiris 150x4 1.82 kB
 irisLabels   150   123 B
  iris_5clu   150   105 B

Hidden items are… hidden. The following makes all the items appear:

print(rp, all=T)
                ID  Dims     Size
        @index.Rmd     - 14.87 kB
            myiris 150x4  1.82 kB
        irisLabels   150    123 B
       @iris2D.pdf     -  5.84 kB
      @irisPCA.pdf     -  4.38 kB
         iris_5clu   150    105 B
    @iris2Dclu.pdf     -  5.98 kB
 iris_cluVsSpecies   3x5    177 B

Items can also be filtered. With the following call, only items tagged with “clustering” will be shown:

print(rp, tags="clustering", all=T)
                ID Dims    Size
         iris_5clu  150   105 B
    @iris2Dclu.pdf    - 5.98 kB
 iris_cluVsSpecies  3x5   177 B

All attachments have the attachment tag, so they can be selectively visualized this way:

print(rp, tags="attachment", all=T)
             ID Dims     Size
     @index.Rmd    - 14.87 kB
    @iris2D.pdf    -  5.84 kB
   @irisPCA.pdf    -  4.38 kB
 @iris2Dclu.pdf    -  5.98 kB

Hidden state also depends on a special tag:

print(rp, tags="hide", all=T)
                ID Dims     Size
        @index.Rmd    - 14.87 kB
       @iris2D.pdf    -  5.84 kB
      @irisPCA.pdf    -  4.38 kB
    @iris2Dclu.pdf    -  5.98 kB
 iris_cluVsSpecies  3x5    177 B

print can show information selectively. This command shows tags and size on disk:

rp$print(show="st")
         ID                                         Tags    Size
     myiris                      dataset, iris, repodemo 1.82 kB
 irisLabels                       labels, iris, repodemo   123 B
  iris_5clu metadata, iris, kmeans, clustering, repodemo   105 B

The find command will try to match a search string against all item fields in the repository:

rp$find("clu", all=T)
                ID Dims    Size
         iris_5clu  150   105 B
    @iris2Dclu.pdf    - 5.98 kB
 iris_cluVsSpecies  3x5   177 B

It is also possible to obtain a visual synthetic summary of the repository by using the pies command:

rp$pies()

If the Shiny library is installed, the following command will show a Shiny interface to the repository:

rp$cpanel()

Finally, the check command runs an integrity check verifying that the stored data has not been modified or corrupted. The command will also check for the presence of extraneous (not indexed) files. Since the rp repository was created in a temporary directory, a few extraneous files will pop up.

rp$check()
Checking index.Rmd... ok.
Checking myiris... ok.
Checking irisLabels... ok.
Checking iris2D.pdf... ok.
Checking irisPCA.pdf... ok.
Checking iris_5clu... ok.
Checking iris2Dclu.pdf... ok.
Checking iris_cluVsSpecies... ok.

Checking for extraneous files in repo root... found some:
/tmp/RtmphzD2m2/file5b853f3a75aa
/tmp/RtmphzD2m2/file5b85575c8c08
/tmp/RtmphzD2m2/file5b855d323833
/tmp/RtmphzD2m2/file5b85607f8559
/tmp/RtmphzD2m2/iris2D.pdf
/tmp/RtmphzD2m2/iris2Dclu.pdf
/tmp/RtmphzD2m2/irisPCA.pdf
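Repo records an MD5 checksum for each item (it appears in the output of info). The sketch below shows, with base R only, how a checksum comparison can detect that a stored file has changed. It is an illustration of the general idea under that assumption, not Repo's actual implementation of check; path, stored, ok_before and ok_after are hypothetical names.

```r
# Store an object in a temporary file and record its checksum.
path <- tempfile(fileext = ".rds")
saveRDS(iris, path)
stored <- unname(tools::md5sum(path))

# An untouched file still matches the recorded checksum.
ok_before <- identical(unname(tools::md5sum(path)), stored)

# Simulate an unexpected modification of the stored file.
saveRDS(head(iris), path)
ok_after <- identical(unname(tools::md5sum(path)), stored)

print(c(before = ok_before, after = ok_after))
```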

Showing dependencies

In Repo, the relations “generated by”, “attached to” and “dependent on” are summarized in a dependency graph. The formal representation of the graph is a matrix, in which the entry (i,j) represents a relation from i to j of type 1, 2 or 3 (dependency, attachment or generation). Here’s what it looks like:

depgraph <- rp$dependencies(plot=F)
rownames(depgraph) <- colnames(depgraph) <- basename(rownames(depgraph))
library(knitr)
kable(depgraph)
                   index.Rmd  myiris  irisLabels  iris2D.pdf  irisPCA.pdf  iris_5clu  iris2Dclu.pdf  iris_cluVsSpecies
index.Rmd                  0       0           0           0            0          0              0                  0
myiris                     3       0           0           0            0          0              0                  0
irisLabels                 3       0           0           0            0          0              0                  0
iris2D.pdf                 3       2           0           0            0          0              0                  0
irisPCA.pdf                3       0           0           2            0          0              0                  0
iris_5clu                  3       1           0           0            0          0              0                  0
iris2Dclu.pdf              3       0           0           0            0          2              0                  0
iris_cluVsSpecies          3       1           1           0            0          1              0                  0
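The matrix above can be turned into a more readable list of edges. The following base R sketch rebuilds a three-item subset of the matrix by hand (so that it runs without a repository) and decodes each non-zero entry using the 1/2/3 coding described in the text. The names items, relation, idx and edges are introduced only for this illustration.

```r
# Hand-built subset of the dependency matrix shown in this guide.
items <- c("index.Rmd", "myiris", "iris_5clu")
depgraph <- matrix(0, 3, 3, dimnames = list(items, items))
depgraph["myiris", "index.Rmd"] <- 3     # generated by
depgraph["iris_5clu", "index.Rmd"] <- 3  # generated by
depgraph["iris_5clu", "myiris"] <- 1     # depends on

# Relation codes: 1 = dependency, 2 = attachment, 3 = generation.
relation <- c("depends on", "attached to", "generated by")

# Row/column positions of the non-zero entries.
idx <- which(depgraph != 0, arr.ind = TRUE)

# One row per edge: item i, relation type, item j.
edges <- data.frame(
    from = rownames(depgraph)[idx[, "row"]],
    relation = relation[depgraph[idx]],
    to = colnames(depgraph)[idx[, "col"]],
    stringsAsFactors = FALSE
)
print(edges)
```

The same decoding should apply to the full matrix returned by rp$dependencies(plot=F), assuming it uses the coding stated above.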

Omitting the plot=F parameter, the dependencies method will plot the dependency graph. This plot requires the igraph library.

rp$dependencies()

This is a small repository and all resources were created by the same script, so the “generated” edges are not interesting. The three types of edges can be shown selectively, so here’s what the graph looks like without the “generated” edges:

rp$dependencies(generated=F)

Accessing items in the repo

The get command is used to retrieve items from a repository. In the following, the item myiris is loaded into the variable x in the current environment.

x <- rp$get("myiris")

To get additional information about the entry, the info command can be used this way:

rp$info("myiris")
ID:           myiris
Description:  A normalized version of the iris dataset coming with R. Normalization is made with the scale function with default parameters.
Tags:         dataset, iris, repodemo
Dimensions:   150x4
Timestamp:    2016-05-02 17:39:27
Size on disk: 1.82 kB
Provenance:   index.Rmd
Attached to:  -
Stored in:    1r/1j/de/1r1jdestr718324esnomsy42ui4mfyup
MD5 checksum: d23a5831dfd459be089e51f5bdda8799
URL:          -

Item versions, temporary items, remote contents

There are actually four different ways of adding an object to a repository: a plain put (used so far), versioning, stashing and pulling.

This section covers versioning, stashing and pulling.

Versioning

The K-means algorithm will likely provide different solutions over multiple runs. One may want to store an alternative clustering solution as an additional version of the iris_5clu item. This can be done as follows:

kiris2 <- kmeans(myiris, 5)$cluster
rp$put(kiris2, "iris_5clu",
         "Kmeans clustering of the Iris data, k=5. Today's version!",
         c("metadata", "iris", "kmeans", "clustering", "repodemo"),
           "index.Rmd", depends="myiris", replace="addversion")

The new repository looks like the old one:

rp
         ID  Dims    Size
     myiris 150x4 1.82 kB
 irisLabels   150   123 B
  iris_5clu   150   105 B

Except that iris_5clu is actually the one just put (look at the description):

rp$info("iris_5clu")
ID:           iris_5clu
Description:  Kmeans clustering of the Iris data, k=5. Today's version!
Tags:         metadata, iris, kmeans, clustering, repodemo
Dimensions:   150
Timestamp:    2016-05-02 17:39:28
Size on disk: 105 B
Provenance:   index.Rmd
Attached to:  -
Stored in:    2y/38/8s/2y388scs359lbevjl1o3dmc4jqdpjf8h
MD5 checksum: cb57837a5baa71e0331c64022c279f37
URL:          -

while the old one has been renamed and hidden:

rp$print(all=T)
                ID  Dims     Size
        @index.Rmd     - 14.87 kB
            myiris 150x4  1.82 kB
        irisLabels   150    123 B
       @iris2D.pdf     -  5.84 kB
      @irisPCA.pdf     -  4.38 kB
       iris_5clu#1   150    105 B
    @iris2Dclu.pdf     -  5.98 kB
 iris_cluVsSpecies   3x5    177 B
         iris_5clu   150    105 B

However, it can be referred to as any other item in the repository:

rp$info("iris_5clu#1")
ID:           iris_5clu#1
Description:  Kmeans clustering of the Iris data, k=5.
Tags:         metadata, iris, kmeans, clustering, repodemo, hide
Dimensions:   150
Timestamp:    2016-05-02 17:39:28
Size on disk: 105 B
Provenance:   index.Rmd
Attached to:  -
Stored in:    4p/z2/vc/4pz2vctlj0nig2c3dhqx5xyxbn9842ye
MD5 checksum: cb57837a5baa71e0331c64022c279f37
URL:          -
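On the run-to-run variability of K-means mentioned at the start of this section: kmeans picks its initial centers at random, so fixing the random seed makes a run repeatable. The base R sketch below shows this; the seed value and the names a and b are arbitrary choices made for the illustration. Storing several versions, as done above, remains useful when the different solutions themselves are of interest.

```r
myiris <- scale(as.matrix(iris[, 1:4]))

# Same seed, same random initial centers, same clustering.
set.seed(1)
a <- kmeans(myiris, 5)$cluster
set.seed(1)
b <- kmeans(myiris, 5)$cluster

print(identical(a, b))
```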

Stashing

Repo tries to force the user into building a structured and annotated repository. However, this implies a small overhead that in some cases may not be justified. This is when stashing comes in handy.

Consider the case of caching intermediate results. Intermediate results are not going to be used directly, however they will save time in case the final results have to be generated again. In such cases one can just store the intermediate results without specifying annotations: in Repo, this is called stashing.

Below is a fake computation taking 10 seconds. In the following, one may set the dorun variable to FALSE so that the script will get the precomputed variable from the repository.

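dorun <- TRUE  # set to FALSE to load the stashed result instead of recomputing it
if(dorun) {
    Sys.sleep(10)  # stands in for an expensive computation
    result <- "This took 10 seconds to compute"
    rp$stash(result)
} else result <- rp$get("result")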

The stash function has a rather rough behavior: it will search for the object by name in the caller environment, create a generic description and tags, put the object into the repo (overwriting any stashed item with the same name), and finally hide the newly created item.

rp$info("result")
ID:           result
Description:  Stashed object
Tags:         stash, hide
Dimensions:   1
Timestamp:    2016-05-02 17:39:28
Size on disk: 73 B
Provenance:   
Attached to:  -
Stored in:    no/m6/2y/nom62ygr9s48e0mv5kucf9cic636e894
MD5 checksum: 09e38f750ef253d8d843bfcb749b392b
URL:          -

It is also possible to automate the process of stashing data for caching purposes by using the lazydo command. The lazydo command will run an expression and stash the results. When the same expression is run again, the results will instead be loaded from the repository.

expr <- expression({
    Sys.sleep(3)
    result <- "This took 3 seconds to compute"
})
    
system.time(rp$lazydo(expr)) # first run
lazydo is building resource from code.
Cached item name is: ed37e506c7ed1b11a4d81c5d9aebb599.
   user  system elapsed 
  0.009   0.004   3.015 
system.time(rp$lazydo(expr)) # second run
lazydo found precomputed resource.
   user  system elapsed 
  0.002   0.000   0.002 
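The caching behavior shown above can be approximated in a few lines of base R. This is a minimal sketch of the general idea (key the result on the text of the expression, compute once, reuse afterwards), assuming an in-memory cache. It is not Repo's implementation of lazydo, which stores results in the repository; lazy_eval, cache and runs are hypothetical names.

```r
# In-memory cache: one entry per distinct expression.
cache <- new.env()

lazy_eval <- function(expr) {
    # Use the source text of the expression as the cache key.
    key <- paste(deparse(expr), collapse = "\n")
    if (!exists(key, envir = cache, inherits = FALSE)) {
        # First call: evaluate in a fresh environment and remember the value.
        assign(key, eval(expr, envir = new.env()), envir = cache)
    }
    get(key, envir = cache, inherits = FALSE)
}

# Counter used only to show how many times the expression is really evaluated.
runs <- 0
expr <- expression({
    runs <<- runs + 1
    "expensive result"
})

first <- lazy_eval(expr)   # evaluates the expression
second <- lazy_eval(expr)  # served from the cache
print(runs)
```

Keying on the expression text means that any change to the code, even a cosmetic one, leads to a new cache entry.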

Pulling

Existing items can feature a URL property. The pull function is meant to update item contents by downloading them from the Internet. This allows for the distribution of “stub” repositories containing all item information but not the actual data. The following code creates an item provided with a remote URL. A call to pull overwrites the stub local content with the remote content.

rp$put("Local content", "item1",
    "This points to big data you may want to download",
    "tag", URL="http://www.francesconapolitano.it/repo/remote")
print(rp$get("item1"))
[1] "Local content"
rp$pull("item1", replace=T)
print(rp$get("item1"))
[1] "Remote content"

Handlers

It’s a shame that the auto-completion feature of your favorite R working environment cannot be used on repo item names. Except it can. The handlers method returns a list of functions with the same names as the items in the repo. Each of these functions can call Repo methods (get by default) on the corresponding item.

h <- rp$handlers()
names(h)
 [1] "index.Rmd"                        "myiris"                          
 [3] "irisLabels"                       "iris2D.pdf"                      
 [5] "irisPCA.pdf"                      "iris_5clu#1"                     
 [7] "iris2Dclu.pdf"                    "iris_cluVsSpecies"               
 [9] "iris_5clu"                        "result"                          
[11] "ed37e506c7ed1b11a4d81c5d9aebb599" "item1"                           
[13] "repo"                            

Handlers call get by default:

print(h$iris_cluVsSpecies())
            kiris
              1  2  3  4  5
  setosa      0 16 34  0  0
  versicolor  0  0  0 13 37
  virginica   3  0  0 39  8

The tag command (not yet described) adds a tag to an item:

h$iris_cluVsSpecies("tag", "onenewtag")
h$iris_cluVsSpecies("info")
ID:           iris_cluVsSpecies
Description:  Contingency table of the kmeans clustering versus the original labels of the Iris dataset.
Tags:         result, iris, validation, clustering, repodemo, hide, onenewtag
Dimensions:   3x5
Timestamp:    2016-05-02 17:39:31
Size on disk: 177 B
Provenance:   index.Rmd
Attached to:  -
Stored in:    0u/x2/yi/0ux2yig17mjt621vfwpai3psfuuj7lqa
MD5 checksum: 5c25b5b2e5b8e9051a5daa971d6ce7ab
URL:          -

One may want to open a repo directly with:

h <- repo_open(rp$root())$handlers()
Found repo index in "/tmp/RtmphzD2m2/R_repo.RDS".

In that case, the handler to the repo itself will come in handy:

h$repo
         ID  Dims    Size
     myiris 150x4 1.82 kB
 irisLabels   150   123 B
  iris_5clu   150   105 B
      item1     1    58 B

If items are removed or added, handlers may need a refresh:

h <- h$repo$handlers()

Other features

The repo manual starts at:

help(repo)

All repo methods are also defined as functions in the global environment. Any call like rp$func(x) can be executed as repo_func(rp, x). In order to get help on the function “func”, try the following:

help(repo_func)

Based on Repo build 2.0.1