Exploring The Structure and Dependencies of An R Package

pkgnet is an R library designed for the analysis of R libraries! The goal of the package is to build a graph representation of a package and its dependencies to inform a variety of activities, including:

Below is a brief tour of pkgnet and its features.


Dependencies As a Graph

pkgnet represents both package dependencies and function dependencies as directed graphs. Before we look at the output of pkgnet, here are few core concepts to keep in mind.

Example Dependency Graph

Dependency

Functions or Dependencies are represented as nodes, and their dependent relationships are represented as edges (a.k.a. arcs or arrows). The direction of the edge points towards the node that is dependent on the other node.

In the example dependency graph above:

Descendants

The descendants of a node are all subsequent nodes that depend on that node, either directly or via the transitive property.

In the example dependency graph above:


Running pkgnet

pkgnet can analyze any R package locally installed. Run installed.packages() to see the full list of packages installed on your system. For this example, let’s say we are analyzing a custom built package, baseballstats.

To analyze baseballstats, run the following two lines of code:

library(pkgnet)
report1 <- CreatePackageReport(pkg_name = "baseballstats")

That’s it! You have generated a lot of valuable information with that one call for an installed package.

However, if the full source repository for the package is available on your system, you can supplement this report with other information such as code coverage from covr. To do so, specify the path to the repository in CreatePackageReport.

library(pkgnet)
report2 <- CreatePackageReport(
  pkg_name = "baseballstats"
  , pkg_path = <path to the repo>
)

Examining The Results

An HTML report has been created with the pertinent information and a list object is available with the same information and more.

The Report

An HTML report has been created, and its location is specified in the messages in the terminal.

This report has three sections:

  1. Package Summary – general information about the package and package level statistics
  2. Dependency Network – information regarding the packages upon which the current package under analysis relies upon
  3. Function Network – information regarding the functions within the current package under analysis and their inter dependencies

Each section has helpful tables and visuals.

As a sample, here’s how the Function Network Visualization looks for baseballstats:

Default

All Functions and their dependencies are visible. For example, we can see that both batting_avg and slugging_avg functions depend upon the at_bats function.

We also see that nothing depends on the on_base_pct function. This might be valuable information to an R package developer.

With Coverage Information

Same as the default visualization except we can see coverage information as well (Red = 0%, Green = 100%).

It appears the function with the most dependencies, at_bats, is well covered. However, no other functions are covered by unit tests.

Check out the full HTML report for more results

The List Object

The CreatePackageReport() function returns a list with three items:

  1. SummaryReporter
  2. DependencyReporter
  3. FunctionReporter

Each items contains information visible in the report and more. We can use this information for a more detailed analysis of the results and/or more easily incorporate pkgnet results into other R processes.

Here are a few notable items available within the list object.

Node Information

Both the DependencyReporter and the FunctionReporter contain metrics about their package dependencies or functions (a.k.a network nodes) in a nodes table.

dim(report2$FunctionReporter$nodes)
#> [1]  5 13
names(report2$FunctionReporter$nodes)
#>  [1] "node"                "coveredLines"        "totalLines"         
#>  [4] "coverageRatio"       "meanCoveragePerLine" "filename"           
#>  [7] "outDegree"           "outBetweeness"       "outCloseness"       
#> [10] "numDescendants"      "hubScore"            "pageRank"           
#> [13] "inDegree"

Note, a few of these metrics provided by default are from the field of Network Theory. You can leverage the Network Object described below to derive many more.

Network Measures

Both the DependencyReporter and the FunctionReporter contain measures based on their network structure in a network_measures list.

report2$FunctionReporter$network_measures
#> $centralization.OutDegree
#> [1] 0.3
#> 
#> $centralization.betweenness
#> [1] 0.03125
#> 
#> $centralization.closeness
#> [1] 0.2743056

Network Object

Both the DependencyReporter and the FunctionReporter are available as igraph objects named pkg_graph

report2$FunctionReporter$pkg_graph
#> IGRAPH bcbdcb2 DN-- 5 4 -- 
#> + attr: name (v/c)
#> + edges from bcbdcb2 (vertex names):
#> [1] at_bats     ->batting_avg  at_bats     ->slugging_avg
#> [3] batting_avg ->OPS          slugging_avg->OPS

A Deeper Look

With the reports and objects produced by pkgnet by default, there is plenty to inform us on the inner workings of an R package. However, we may want to know MORE! Since the igraph objects are available, we can leverage those graphs for further analysis.

In this section, let’s examine a larger R package, such as lubridate.

If you would like to follow along with the examples in this section, run these commands in your terminal to download and install lubridate1.

# Create a temporary workspace
mkdir -p ~/pkgnet_example && cd ~/pkgnet_example

# Grab the lubridate source code
git clone https://github.com/tidyverse/lubridate
cd lubridate

# If you want the examples to match exactly
git reset --hard 9797d69abe1574dd89310c834e52d358137669b8

# Install it
Rscript -e "devtools::install()"

Coverage of Most Referenced Functions

Let’s examine the relationship between a function’s total number of descendants and the unit test coverage of that function’s code.

# Run pkgnet
library(pkgnet)
report2 <- CreatePackageReport(
    pkg_name = "lubridate"
    , pkg_path = "~/pkgnet_example/lubridate"
)

# Extract Nodes Table
funcNodes <- report2$FunctionReporter$nodes

# List Coverage For Most Referenced Functions
mostRef <- funcNodes[order(numDescendants, decreasing = TRUE)][1:10]
mostRef[,list(`Function` = node
              , `Descendant Count` = numDescendants
              , `Coverage Ratio` = coverageRatio
              , `Total Lines` = totalLines)]
Function Descendant Count Coverage Ratio Total Lines
divide_period_by_period 39 1 2
days 22 1 1
check_duration 15 0 1
as.POSIXt 13 0 1
eweeks 13 0 2
check_interval 12 0 11
date<- 12 NA NA
add_months 10 1 4
ceil_multi_unit 10 1 1
am 6 1 1

Inspecting results such as these can help an R package developer decide which function to cover with unit tests next.

In this case, check_duration, one of the most referenced functions (either directly or indirectly), is not covered by unit tests. However, it appears to be a simple one line function that may not be necessary to cover in unit testing. check_interval, on the other hand, might benefit from some unit test coverage as it is a larger, uncovered function with a similar number of dependencies.

Discovering Similar Functions

Looking at that same large package, let’s say we wanted to explore options for consolidating functions. One approach might be to explore consolidating functions that share the same dependencies. In that case, we could use the igraph object to highlight functions with the same dependencies via Jaccard similarity.

# Get igraph object
funcGraph <- report2$FunctionReporter$pkg_graph
funcNames <- igraph::vertex_attr(funcGraph, name = "name")

# Jaccard Similarity
sim <- igraph::similarity(graph = funcGraph
                          , mode = "in"
                          , method = "jaccard")
diag(sim) <- 0
sim[sim < 1] <- 0

simGraph <- igraph::graph_from_adjacency_matrix(adjmatrix = sim, mode = "undirected")

# Find groups with same dependencies (similarity == 1)
sameDeps <- igraph::max_cliques(graph = simGraph
                                , min = 2
                                )

# Write results
for (i in seq_along(sameDeps)) {
         cat(paste0("Group ", i, ": "))
         cat(paste(funcNames[as.numeric(sameDeps[[i]])], collapse = ", "))
         cat("\n")
}
#> Group 1: stamp_time, stamp_date
#> Group 2: ms, hm
#> Group 3: new_interval, %--%, int_diff
#> Group 4: floor_date, quarter, semester
#> Group 5: picoseconds, microseconds, nanoseconds, milliseconds
#> Group 6: weeks, days, years, seconds_to_period, seconds, new_period, minutes, hours
#> Group 7: yq, dmy, ymd_hms, ymd_hm, ymd_h, ymd, ydm_hms, ydm_hm, ydm_h, ydm, pretty_dates, parse_date_time2, parse_date_time, myd, mdy_hms, mdy_hm, mdy_h, mdy, local_time, fast_strptime, dym, dmy_hms, dmy_hm, dmy_h

Now, we have identified seven different groups of functions within lubridate that share the exact same dependencies. We could explore each group of functions for potential consolidation.


  1. Examples from version 1.7.3 of Lubridate