The dependency network of all CRAN packages

2020-08-10

In the introduction we have seen that a dependency network can be built using get_dep_df(). While it is theoretically possible to use get_dep_df() iteratively to obtain all dependencies of all packages available on CRAN, it is not practical to do so. This package provides two functions, get_dep_all_packages() and get_graph_all_packages(), for obtaining the dependencies of all CRAN packages directly, as well as an example dataset.

library(crandep)
library(dplyr)
library(ggplot2)
library(igraph)
library(visNetwork)

All types of dependencies, in a data frame

The example dataset cran_dependencies contains all dependencies as of 2020-05-09.

data(cran_dependencies)
cran_dependencies
#> # A tibble: 211,381 x 4
#>    from  to             type     reverse
#>    <chr> <chr>          <chr>    <lgl>  
#>  1 A3    xtable         depends  FALSE  
#>  2 A3    pbapply        depends  FALSE  
#>  3 A3    randomForest   suggests FALSE  
#>  4 A3    e1071          suggests FALSE  
#>  5 aaSEA DT             imports  FALSE  
#>  6 aaSEA networkD3      imports  FALSE  
#>  7 aaSEA shiny          imports  FALSE  
#>  8 aaSEA shinydashboard imports  FALSE  
#>  9 aaSEA magrittr       imports  FALSE  
#> 10 aaSEA Bios2cor       imports  FALSE  
#> # … with 211,371 more rows
dplyr::count(cran_dependencies, type, reverse)
#> # A tibble: 8 x 3
#>   type       reverse     n
#>   <chr>      <lgl>   <int>
#> 1 depends    FALSE   11123
#> 2 depends    TRUE     9672
#> 3 imports    FALSE   57617
#> 4 imports    TRUE    51913
#> 5 linking to FALSE    3433
#> 6 linking to TRUE     3721
#> 7 suggests   FALSE   35018
#> 8 suggests   TRUE    38884

This is essentially a snapshot of CRAN. We can obtain all the current dependencies using get_dep_all_packages(), which requires no arguments:

df0.cran <- get_dep_all_packages()
head(df0.cran)
#>    from             to    type reverse
#> 2 aaSEA             DT imports   FALSE
#> 3 aaSEA      networkD3 imports   FALSE
#> 4 aaSEA          shiny imports   FALSE
#> 5 aaSEA shinydashboard imports   FALSE
#> 6 aaSEA       magrittr imports   FALSE
#> 7 aaSEA       Bios2cor imports   FALSE
dplyr::count(df0.cran, type, reverse) # counts are mostly larger than in the snapshot above
#>         type reverse     n
#> 1    depends   FALSE 11054
#> 2    depends    TRUE  9621
#> 3    imports   FALSE 60646
#> 4    imports    TRUE 54367
#> 5 linking to   FALSE  3655
#> 6 linking to    TRUE  3945
#> 7   suggests   FALSE 37341
#> 8   suggests    TRUE 41148
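
To see which counts have changed between two snapshots, the two count tables can be joined on type and reverse. The following is a minimal sketch on hypothetical toy data (snapshot_old and snapshot_new are made-up stand-ins for cran_dependencies and df0.cran), so it runs without network access:

```r
library(dplyr)

# Hypothetical toy snapshots; in practice these would be
# cran_dependencies and df0.cran
snapshot_old <- tibble(
    from = c("a", "a", "b"),
    to = c("b", "c", "c"),
    type = c("imports", "depends", "imports"),
    reverse = FALSE
)
snapshot_new <- tibble(
    from = c("a", "a", "b", "d"),
    to = c("b", "c", "c", "a"),
    type = c("imports", "imports", "imports", "suggests"),
    reverse = FALSE
)

counts_old <- count(snapshot_old, type, reverse, name = "n_old")
counts_new <- count(snapshot_new, type, reverse, name = "n_new")

# full_join keeps dependency types that appear in only one snapshot;
# their missing counts are then treated as zero
count_change <- full_join(counts_old, counts_new, by = c("type", "reverse")) %>%
    mutate(
        n_old = coalesce(n_old, 0L),
        n_new = coalesce(n_new, 0L),
        change = n_new - n_old
    )
count_change
```

A positive value in the column change means more dependencies of that type in the newer snapshot.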

Network of one type of dependencies, as an igraph object

We can build a dependency network using get_graph_all_packages(). Furthermore, we can verify that the forward and reverse dependency networks are (almost) the same, by looking at their size (number of edges) and order (number of nodes).

g0.depends <- get_graph_all_packages(type = "depends")
g0.rev_depends <- get_graph_all_packages(type = "reverse depends")
g0.depends
#> IGRAPH 44785b1 DN-- 4805 8013 -- 
#> + attr: name (v/c), type (e/c), reverse (e/l)
#> + edges from 44785b1 (vertex names):
#>  [1] A3         ->xtable     A3         ->pbapply    abc        ->abc.data  
#>  [4] abc        ->nnet       abc        ->quantreg   abc        ->MASS      
#>  [7] abc        ->locfit     abcdeFBA   ->Rglpk      abcdeFBA   ->rgl       
#> [10] abcdeFBA   ->corrplot   abcdeFBA   ->lattice    ABCp2      ->MASS      
#> [13] abctools   ->abc        abctools   ->abind      abctools   ->plyr      
#> [16] abctools   ->Hmisc      abd        ->nlme       abd        ->lattice   
#> [19] abd        ->mosaic     abodOutlier->cluster    AbSim      ->ape       
#> [22] AbSim      ->poweRlaw   Ac3net     ->data.table acc        ->mhsmm     
#> + ... omitted several edges
g0.rev_depends
#> IGRAPH 3053c61 DN-- 4805 8013 -- 
#> + attr: name (v/c), type (e/c), reverse (e/l)
#> + edges from 3053c61 (vertex names):
#>  [1] abc     ->abctools   abc     ->EasyABC    abc.data->abc       
#>  [4] abd     ->tigerstats abind   ->abctools   abind   ->BCBCSF    
#>  [7] abind   ->CPMCGLM    abind   ->depth      abind   ->dgmb      
#> [10] abind   ->dynamo     abind   ->fractaldim abind   ->funLBM    
#> [13] abind   ->informR    abind   ->interplot  abind   ->magic     
#> [16] abind   ->mlma       abind   ->mlogitBMA  abind   ->multicon  
#> [19] abind   ->MultiPhen  abind   ->multipol   abind   ->mvmesh    
#> [22] abind   ->mvSLOUCH   abind   ->plfm      
#> + ... omitted several edges

The dependency words accepted by the argument type are the same as in get_dep() and get_dep_df(). The two networks’ size and order should be very close, if not identical, to each other. Because of the dependency direction, their edge lists should be the same but with the column names from and to swapped.
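
The swapped-columns relationship can be illustrated on a toy example. This is a sketch only: the package names pkgA to pkgD and the two small edge lists are hypothetical, and the check simply confirms that neither edge list has a row without a counterpart in the other once from and to are exchanged.

```r
library(dplyr)

# Hypothetical toy edge lists: "forward" says from depends on to,
# "reverse" says from is depended on by to
forward <- tibble(
    from = c("pkgA", "pkgA", "pkgB"),
    to = c("pkgC", "pkgD", "pkgC")
)
reverse <- tibble(
    from = c("pkgC", "pkgD", "pkgC"),
    to = c("pkgA", "pkgA", "pkgB")
)

# Exchange the two columns of the reverse edge list
reverse_swapped <- tibble(from = reverse$to, to = reverse$from)

# Rows present in one edge list but not in the other (both should be empty)
only_forward <- anti_join(forward, reverse_swapped, by = c("from", "to"))
only_reverse <- anti_join(reverse_swapped, forward, by = c("from", "to"))

edge_lists_match <- nrow(only_forward) == 0 && nrow(only_reverse) == 0
edge_lists_match
```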

For verification, the exact same graphs can be obtained by filtering the data frame for the required dependency and applying df_to_graph():

g1.depends <- df0.cran %>%
    dplyr::filter(type == "depends" & !reverse) %>%
    df_to_graph(nodelist = dplyr::rename(df0.cran, name = from))
g1.rev_depends <- df0.cran %>%
    dplyr::filter(type == "depends" & reverse) %>%
    df_to_graph(nodelist = dplyr::rename(df0.cran, name = from))
g1.depends # same as g0.depends
#> IGRAPH 73d7f2b DN-- 4805 8013 -- 
#> + attr: name (v/c), type (e/c), reverse (e/l)
#> + edges from 73d7f2b (vertex names):
#>  [1] A3         ->xtable     A3         ->pbapply    abc        ->abc.data  
#>  [4] abc        ->nnet       abc        ->quantreg   abc        ->MASS      
#>  [7] abc        ->locfit     abcdeFBA   ->Rglpk      abcdeFBA   ->rgl       
#> [10] abcdeFBA   ->corrplot   abcdeFBA   ->lattice    ABCp2      ->MASS      
#> [13] abctools   ->abc        abctools   ->abind      abctools   ->plyr      
#> [16] abctools   ->Hmisc      abd        ->nlme       abd        ->lattice   
#> [19] abd        ->mosaic     abodOutlier->cluster    AbSim      ->ape       
#> [22] AbSim      ->poweRlaw   Ac3net     ->data.table acc        ->mhsmm     
#> + ... omitted several edges
g1.rev_depends # same as g0.rev_depends
#> IGRAPH 1cd789e DN-- 4805 8013 -- 
#> + attr: name (v/c), type (e/c), reverse (e/l)
#> + edges from 1cd789e (vertex names):
#>  [1] abc     ->abctools   abc     ->EasyABC    abc.data->abc       
#>  [4] abd     ->tigerstats abind   ->abctools   abind   ->BCBCSF    
#>  [7] abind   ->CPMCGLM    abind   ->depth      abind   ->dgmb      
#> [10] abind   ->dynamo     abind   ->fractaldim abind   ->funLBM    
#> [13] abind   ->informR    abind   ->interplot  abind   ->magic     
#> [16] abind   ->mlma       abind   ->mlogitBMA  abind   ->multicon  
#> [19] abind   ->MultiPhen  abind   ->multipol   abind   ->mvmesh    
#> [22] abind   ->mvSLOUCH   abind   ->plfm      
#> + ... omitted several edges
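
Since the printed output only shows the first few edges, a programmatic comparison is more convincing than reading the print-outs. Below is a minimal sketch on a hypothetical toy graph; the helper edge_strings() is introduced here for illustration and is not part of crandep or igraph. The same comparison could in principle be applied to g0.depends and g1.depends.

```r
library(igraph)

# Hypothetical toy edge list, used twice with a different row order
edges <- data.frame(
    from = c("pkgA", "pkgA", "pkgB"),
    to = c("pkgC", "pkgD", "pkgC")
)
g_a <- graph_from_data_frame(edges, directed = TRUE)
g_b <- graph_from_data_frame(edges[c(3, 1, 2), ], directed = TRUE)

# Order is the number of nodes, size is the number of edges
same_order <- gorder(g_a) == gorder(g_b)
same_size <- gsize(g_a) == gsize(g_b)

# Illustrative helper: edges as sorted "from->to" strings, so that
# the internal vertex and edge ordering does not matter
edge_strings <- function(g) {
    el <- as_edgelist(g, names = TRUE)
    sort(paste(el[, 1], el[, 2], sep = "->"))
}
same_edges <- identical(edge_strings(g_a), edge_strings(g_b))

c(same_order = same_order, same_size = same_size, same_edges = same_edges)
```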

External reverse dependencies & defunct packages

One may notice that there are external reverse dependencies which won’t appear in the forward dependencies if the scraping is limited to CRAN packages. We can find these external reverse dependencies by setting nodelist = NULL in df_to_graph():

df1.rev_depends <- df0.cran %>%
    dplyr::filter(type == "depends" & reverse) %>%
    df_to_graph(nodelist = NULL, gc = FALSE) %>%
    igraph::as_data_frame() # to obtain the edge list
df1.depends <- df0.cran %>%
    dplyr::filter(type == "depends" & !reverse) %>%
    df_to_graph(nodelist = NULL, gc = FALSE) %>%
    igraph::as_data_frame()
dfa.diff.depends <- dplyr::anti_join(
    df1.rev_depends,
    df1.depends,
    c("from" = "to", "to" = "from")
)
head(dfa.diff.depends)
#>    from          to    type reverse
#> 1 abind      baySeq depends    TRUE
#> 2 abind      CNORdt depends    TRUE
#> 3 abind  FISHalyseR depends    TRUE
#> 4 abind     flowMap depends    TRUE
#> 5 abind    riboSeqR depends    TRUE
#> 6 abind RNAinteract depends    TRUE

This means we are extracting the reverse dependencies whose forward equivalents are not listed. The column to shows the packages external to CRAN. On the other hand, if we apply dplyr::anti_join() with the order of the two edge lists switched,

dfb.diff.depends <- dplyr::anti_join(
    df1.depends,
    df1.rev_depends,
    c("from" = "to", "to" = "from")
)
head(dfb.diff.depends)
#>                 from       to    type reverse
#> 1           abctools parallel depends   FALSE
#> 2                abd     grid depends   FALSE
#> 3 AcceptanceSampling  methods depends   FALSE
#> 4 AcceptanceSampling    stats depends   FALSE
#> 5            accrued     grid depends   FALSE
#> 6               acid    stats depends   FALSE

the column to lists packages that are not (or no longer) on the page of available packages. These are either defunct or core packages.
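
One way to tell the core packages apart from the defunct ones is to compare against the base packages of the local R installation. The sketch below assumes that "core" means the packages with priority "base" and uses a toy vector in place of dfb.diff.depends$to; "notARealPkg" is a made-up name standing in for a defunct package.

```r
# Names of the base packages shipped with the local R installation
base_pkgs <- rownames(utils::installed.packages(priority = "base"))

# Toy stand-in for dfb.diff.depends$to; "notARealPkg" is hypothetical
to_pkgs <- c("parallel", "grid", "methods", "stats", "notARealPkg")

# Flag which of the unmatched packages are base packages
core_check <- data.frame(
    to = to_pkgs,
    is_base = to_pkgs %in% base_pkgs
)
core_check
```

Packages flagged FALSE would then be candidates for being defunct, although this check alone cannot confirm it.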

Summary statistics

Using the dataset cran_dependencies as an example, we can also obtain the degree for each package and each type:

df0.summary <- dplyr::count(cran_dependencies, from, type, reverse)
df0.summary
#> # A tibble: 34,861 x 4
#>    from        type       reverse     n
#>    <chr>       <chr>      <lgl>   <int>
#>  1 A3          depends    FALSE       2
#>  2 A3          suggests   FALSE       2
#>  3 ABACUS      imports    FALSE       2
#>  4 ABACUS      suggests   FALSE       2
#>  5 ABC.RAP     imports    FALSE       3
#>  6 ABC.RAP     suggests   FALSE       2
#>  7 ABCanalysis imports    FALSE       1
#>  8 ABCanalysis suggests   TRUE        4
#>  9 ABCoptim    imports    FALSE       4
#> 10 ABCoptim    linking to FALSE       1
#> # … with 34,851 more rows

We can look at the “winner” in each of the reverse dependencies:

df0.summary %>%
    dplyr::filter(reverse) %>%
    dplyr::group_by(type) %>%
    dplyr::top_n(1, n)
#> # A tibble: 4 x 4
#> # Groups:   type [4]
#>   from    type       reverse     n
#>   <chr>   <chr>      <lgl>   <int>
#> 1 MASS    depends    TRUE      455
#> 2 Rcpp    linking to TRUE     2082
#> 3 ggplot2 imports    TRUE     2038
#> 4 knitr   suggests   TRUE     5806
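
Note that dplyr::top_n() is superseded in dplyr 1.0.0 and later in favour of dplyr::slice_max(). A minimal sketch of the equivalent call, on a hypothetical toy summary rather than df0.summary:

```r
library(dplyr)

# Hypothetical toy summary with the same columns as df0.summary
toy_summary <- tibble(
    from = c("p", "q", "r", "s"),
    type = c("depends", "depends", "imports", "imports"),
    reverse = TRUE,
    n = c(5L, 9L, 2L, 7L)
)

# Keep the row with the largest n within each type
winners <- toy_summary %>%
    group_by(type) %>%
    slice_max(n, n = 1) %>%
    ungroup()
winners
```

As with top_n(), ties are kept by default, so more than one row per type can be returned.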

This is not surprising given the nature of each package. To take the summarisation one step further, we can obtain the frequencies of the degrees, and visualise the empirical degree distribution neatly on the log-log scale:

df1.summary <- df0.summary %>%
    dplyr::count(type, reverse, n)
#> Storing counts in `nn`, as `n` already present in input
#> ℹ Use `name = "new_name"` to pick a new name.
gg0.summary <- df1.summary %>%
    dplyr::mutate(reverse = ifelse(reverse, "reverse", "forward")) %>%
    ggplot2::ggplot() +
    ggplot2::geom_point(ggplot2::aes(n, nn)) +
    ggplot2::facet_grid(type ~ reverse) +
    ggplot2::scale_x_log10() +
    ggplot2::scale_y_log10() +
    ggplot2::labs(x = "Degree", y = "Number of packages") +
    ggplot2::theme_bw(20)
gg0.summary

This suggests that the reverse dependencies, in particular the reverse depends and reverse imports, approximately follow a power law, which appears as a roughly straight line on the log-log scale and is empirically observed in various academic fields.
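
A quick, informal way to quantify the straight line on the log-log scale is to regress the log frequency on the log degree; the slope then estimates the power-law exponent. The sketch below uses hypothetical frequencies generated from an exact power law, so the slope is recovered exactly. On real data this is only a crude diagnostic and not a rigorous test of power-law behaviour.

```r
# Hypothetical degree frequencies from an exact power law nn = 1000 * n^(-1.5)
degree_freq <- data.frame(n = 1:50)
degree_freq$nn <- 1000 * degree_freq$n^(-1.5)

# On the log-log scale a power law is a straight line:
# log10(nn) = log10(1000) - 1.5 * log10(n)
fit <- lm(log10(nn) ~ log10(n), data = degree_freq)

intercept <- unname(coef(fit)[1]) # approximately log10(1000) = 3
slope <- unname(coef(fit)[2])     # approximately -1.5
c(intercept = intercept, slope = slope)
```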

Visualisation

We can now visualise (the giant component of) the CRAN network of Depends, using functions in the package visNetwork. To do this, we will need to convert the igraph object g0.depends to the node list and edge list as data frames.

prefix <- "https://CRAN.R-project.org/package=" # canonical form
degrees <- igraph::degree(g0.depends)
df0.nodes <- data.frame(id = names(degrees), value = degrees) %>%
    dplyr::mutate(title = paste0('<a href=\"', prefix, id, '\">', id, '</a>'))
df0.edges <- igraph::as_data_frame(g0.depends, what = "edges")

By adding the column title in df0.nodes, we enable clicking the nodes and being directed to their CRAN pages, in the interactive visualisation below:

set.seed(2345L)
vis0 <- visNetwork::visNetwork(df0.nodes, df0.edges, width = "100%", height = "720px") %>%
    visNetwork::visOptions(highlightNearest = TRUE) %>%
    visNetwork::visEdges(arrows = "to", color = list(opacity = 0.5)) %>%
    visNetwork::visNodes(fixed = TRUE) %>%
    visNetwork::visIgraphLayout(layout = "layout_with_drl")
vis0
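
The construction of the node table, including the clickable title column, can be checked on a small graph before it is applied to the whole network. This is a sketch on a hypothetical toy graph; the package names pkgA to pkgD are made up, so the generated links do not point to real CRAN pages.

```r
library(igraph)

# Hypothetical toy dependency graph
toy_edges <- data.frame(
    from = c("pkgA", "pkgA", "pkgB"),
    to = c("pkgC", "pkgD", "pkgC")
)
g_toy <- graph_from_data_frame(toy_edges, directed = TRUE)

prefix <- "https://CRAN.R-project.org/package="
deg <- degree(g_toy) # total (in + out) degree by default

# Node list: id for the label, value for the node size, title for the tooltip
toy_nodes <- data.frame(
    id = names(deg),
    value = unname(deg),
    stringsAsFactors = FALSE
)
toy_nodes$title <- sprintf('<a href="%s%s">%s</a>', prefix, toy_nodes$id, toy_nodes$id)

# Edge list with the columns from and to
toy_edge_list <- as_data_frame(g_toy, what = "edges")
toy_nodes
```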

Going forward

Methods in social network analysis, such as community detection algorithms and/or stochastic block models, can be applied to study the properties of the dependency network. Ideally, by analysing the dependencies of all CRAN packages, we can obtain a bird’s-eye view of the ecosystem. The number of reverse dependencies is modelled in this other vignette.
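
As a starting point, community detection is available in igraph. The following minimal sketch applies the walktrap algorithm to a hypothetical toy graph made of two triangles joined by a single edge; the choice of algorithm and the treatment of the graph as undirected are assumptions for illustration, not recommendations for the CRAN network.

```r
library(igraph)

# Two tightly knit groups joined by one edge (hypothetical package names)
toy_edges <- data.frame(
    from = c("a1", "a1", "a2", "b1", "b1", "b2", "a3"),
    to = c("a2", "a3", "a3", "b2", "b3", "b3", "b1")
)
g_toy <- graph_from_data_frame(toy_edges, directed = FALSE)

# Walktrap community detection based on short random walks
communities <- cluster_walktrap(g_toy)

# Community label for each package
membership(communities)
```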