pkgnet is an R package designed for the analysis of R packages! The goal of the package is to build graph representations of a package’s various types of dependencies. This can inform a variety of activities, such as deciding which functions to cover with unit tests next or finding candidate functions for consolidation.
Below is a brief tour of pkgnet and its features.
pkgnet represents aspects of R packages as graphs. The two default reporters, which we will discuss in this vignette, model their respective aspects as directed graphs: a package’s dependencies on other packages, and the interdependencies of functions within a package. Before we look at the output of pkgnet, here are a few core concepts to keep in mind.
Example Dependency Graph
Units of the analysis are represented as nodes, and their dependency relationships are represented as edges (a.k.a. arcs or arrows). In pkgnet, the nodes could be functions in the package you are examining, or other packages that the package depends on. Edges point from the dependent node to the independent node.[1]
In a dependency graph like the example above, following the direction of the edges allows you to figure out the dependencies of a node, that is, the nodes that it depends on. On the flip side, tracing the edges backwards allows you to figure out the dependents of a node, that is, the nodes that depend on it.
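To make the two traversal directions concrete, here is a minimal sketch that uses igraph directly rather than pkgnet. The edge list is the baseballstats function network printed later in this vignette. The helper names `dependencies_of` and `dependents_of` are hypothetical and exist only for this illustration.

```r
library(igraph)

# Edges point from the dependent function to the function it depends on.
edges <- data.frame(
    from = c("OPS", "OPS", "slugging_avg", "batting_avg"),
    to   = c("slugging_avg", "batting_avg", "at_bats", "at_bats")
)
g <- graph_from_data_frame(edges, directed = TRUE)

# Hypothetical helper: nodes reached by following edges forward.
dependencies_of <- function(graph, node) {
    reached <- as_ids(subcomponent(graph, node, mode = "out"))
    setdiff(reached, node)
}

# Hypothetical helper: nodes reached by tracing edges backwards.
dependents_of <- function(graph, node) {
    reached <- as_ids(subcomponent(graph, node, mode = "in"))
    setdiff(reached, node)
}

deps_of_ops      <- dependencies_of(g, "OPS")
users_of_at_bats <- dependents_of(g, "at_bats")
```

Both helpers return direct and indirect relationships, because `subcomponent()` follows paths of any length in the chosen direction.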
pkgnet can analyze any R package locally installed. (Run installed.packages() to see the full list of packages installed on your system.) For this example, let’s say we are analyzing a custom built package, baseballstats.
To analyze baseballstats, run the following two lines of code:
library(pkgnet)
report1 <- CreatePackageReport(pkg_name = "baseballstats")
That’s it! You have generated a lot of valuable information with that one call for an installed package.
However, if the full source repository for the package is available on your system, you can supplement this report with other information such as code coverage from covr. To do so, specify the path to the repository in CreatePackageReport.
library(pkgnet)
report2 <- CreatePackageReport(
pkg_name = "baseballstats"
, pkg_path = <path to the repo>
)
CreatePackageReport has written an HTML report with the pertinent information, and it also returned a list object with the same information and more.
The location of the HTML report is specified in the messages in the terminal.
This report has three sections, each with helpful tables and visuals.
As a sample, here’s how the Function Network Visualization looks for baseballstats:
The batting_avg and slugging_avg functions depend upon the at_bats function.
We also see that nothing depends on the on_base_pct function. This might be valuable information to an R package developer.
It appears the function with the most dependents, at_bats, is well covered. However, no other functions are covered by unit tests.
Check out the full HTML report for more results.
The CreatePackageReport() function returns a list with three items. Each item contains the information visible in the report and more. We can use this information for a more detailed analysis of the results or to incorporate pkgnet results into other R processes more easily.
Here are a few notable items available within the list object:
Both the DependencyReporter and the FunctionReporter contain metrics about their package dependencies or functions (a.k.a. network nodes) in a nodes table.
dim(report2$FunctionReporter$nodes)
#> [1] 5 16
names(report2$FunctionReporter$nodes)
#> [1] "node" "type" "isExported"
#> [4] "coveredLines" "totalLines" "coverageRatio"
#> [7] "meanCoveragePerLine" "filename" "outDegree"
#> [10] "outBetweeness" "outCloseness" "outSubgraphSize"
#> [13] "inSubgraphSize" "hubScore" "pageRank"
#> [16] "inDegree"
Note, a few of these metrics provided by default are from the field of Network Theory. You can leverage the Network Object described below to derive many more.
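As a hedged illustration of deriving extra node metrics, the sketch below rebuilds the small baseballstats function network with igraph and computes two measures that are not in the column list above. The column names `totalDegree` and `outEccentricity` are made up for this example and are not pkgnet fields.

```r
library(igraph)

# The baseballstats function network, including the isolated on_base_pct node.
edges <- data.frame(
    from = c("OPS", "OPS", "slugging_avg", "batting_avg"),
    to   = c("slugging_avg", "batting_avg", "at_bats", "at_bats")
)
nodes <- data.frame(
    name = c("OPS", "slugging_avg", "batting_avg", "at_bats", "on_base_pct")
)
g <- graph_from_data_frame(edges, directed = TRUE, vertices = nodes)

# Two additional per-node metrics (illustrative column names).
extra <- data.frame(
    node            = as_ids(V(g)),
    totalDegree     = degree(g, mode = "total"),       # in-degree + out-degree
    outEccentricity = eccentricity(g, mode = "out"),   # longest shortest path out
    row.names       = NULL
)
```

With a real report, the same calls could be applied to the igraph object returned by pkgnet instead of the hand-built graph `g`.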
Both the DependencyReporter and the FunctionReporter contain graph-level measures based on their network structure in a network_measures list.
report2$FunctionReporter$network_measures
#> $centralization.OutDegree
#> [1] 0.3
#>
#> $centralization.betweenness
#> [1] 0.03125
#>
#> $centralization.closeness
#> [1] 0.2743056
#>
#> $packageTestCoverage.mean
#> [1] 0.1
#>
#> $packageTestCoverage.betweenessWeightedMean
#> [1] 0.5
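To give a rough sense of what a graph-level measure summarizes, the sketch below computes out-degree centralization with igraph on the baseballstats function network (the edges printed in this vignette, plus the isolated on_base_pct function). That pkgnet derives centralization.OutDegree with this exact call and these arguments is an assumption.

```r
library(igraph)

edges <- data.frame(
    from = c("OPS", "OPS", "slugging_avg", "batting_avg"),
    to   = c("slugging_avg", "batting_avg", "at_bats", "at_bats")
)
nodes <- data.frame(
    name = c("OPS", "slugging_avg", "batting_avg", "at_bats", "on_base_pct")
)
g <- graph_from_data_frame(edges, directed = TRUE, vertices = nodes)

# Centralization compares every node's out-degree with the largest
# out-degree in the graph, then scales by the theoretical maximum.
out_centr <- centr_degree(g, mode = "out", loops = TRUE, normalized = TRUE)
out_degree_centralization <- out_centr$centralization
```

A value near 0 means out-degrees are spread evenly across functions. A value near 1 means a single function holds most of the outgoing edges.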
The networks of both the DependencyReporter and the FunctionReporter are available as igraph objects named pkg_graph.
report2$FunctionReporter$pkg_graph
#> IGRAPH 6b731ef DN-- 5 4 --
#> + attr: name (v/c)
#> + edges from 6b731ef (vertex names):
#> [1] OPS ->slugging_avg OPS ->batting_avg
#> [3] slugging_avg->at_bats batting_avg ->at_bats
With the reports and objects produced by pkgnet by default, there is plenty to inform us on the inner workings of an R package. However, we may want to know MORE! Since the igraph objects are available, we can leverage those graphs for further analysis.
In this section, let’s examine a larger R package, such as lubridate.
If you would like to follow along with the examples in this section, run these commands in your terminal to download and install lubridate.[2]
# Create a temporary workspace
mkdir -p ~/pkgnet_example && cd ~/pkgnet_example
# Grab the lubridate source code
git clone https://github.com/tidyverse/lubridate
cd lubridate
# If you want the examples to match exactly
git reset --hard 9797d69abe1574dd89310c834e52d358137669b8
# Install it
Rscript -e "devtools::install()"
Let’s examine lubridate’s functions through the lens of each function’s total number of dependents (i.e., the other functions that depend on it) and its code’s unit test coverage. In our graph model for the FunctionReporter, the subgraph of paths leading into a given node is the set of functions that directly or indirectly depend on the function that node represents.
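The idea of an in-subgraph can be sketched on the small baseballstats network before applying it to lubridate. The code below counts, for every function, how many other functions reach it by following edges. Excluding the function itself from the count is a choice made for this sketch; whether pkgnet's inSubgraphSize column counts the node itself is not stated here.

```r
library(igraph)

edges <- data.frame(
    from = c("OPS", "OPS", "slugging_avg", "batting_avg"),
    to   = c("slugging_avg", "batting_avg", "at_bats", "at_bats")
)
g <- graph_from_data_frame(edges, directed = TRUE)

# For each function, count the direct and indirect dependents:
# all nodes with a directed path into it, minus the node itself.
in_subgraph_size <- vapply(
    as_ids(V(g)),
    function(v) length(subcomponent(g, v, mode = "in")) - 1L,
    integer(1)
)

# Most depended-on functions first.
ranked <- sort(in_subgraph_size, decreasing = TRUE)
```

Here `ranked` is a named integer vector, so `names(ranked)[1]` gives the most depended-on function.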
# Run pkgnet
library(pkgnet)
report2 <- CreatePackageReport(
pkg_name = "lubridate"
, pkg_path = "~/pkgnet_example/lubridate"
)
# Extract Nodes Table
funcNodes <- report2$FunctionReporter$nodes
# List Coverage For Most Depended-on Functions
mostRef <- funcNodes[order(inSubgraphSize, decreasing = TRUE)][1:10]
mostRef[,list(`Function` = node
, `In-Subgraph Size` = inSubgraphSize
, `Coverage Ratio` = coverageRatio
, `Total Lines` = totalLines)]
Function | In-Subgraph Size | Coverage Ratio | Total Lines |
---|---|---|---|
divide_period_by_period | 39 | 1 | 2 |
days | 22 | 1 | 1 |
check_duration | 15 | 0 | 1 |
as.POSIXt | 13 | 0 | 1 |
eweeks | 13 | 0 | 2 |
check_interval | 12 | 0 | 11 |
date<- | 12 | NA | NA |
add_months | 10 | 1 | 4 |
ceil_multi_unit | 10 | 1 | 1 |
am | 6 | 1 | 1 |
Inspecting results such as these can help an R package developer decide which function to cover with unit tests next.
In this case, check_duration, one of the most depended-on functions (either directly or indirectly), is not covered by unit tests. However, it appears to be a simple one-line function that may not be necessary to cover in unit testing. check_interval, on the other hand, might benefit from some unit test coverage, as it is a larger, uncovered function with a similar number of dependents.
Looking at that same large package, let’s say we want to explore options for consolidating functions. One approach might be to explore consolidating functions that share the same dependencies. In that case, we could use the igraph object to highlight functions with the same out-neighborhood via Jaccard similarity.
# Get igraph object
funcGraph <- report2$FunctionReporter$pkg_graph
funcNames <- igraph::vertex_attr(funcGraph, name = "name")
# Jaccard Similarity
sim <- igraph::similarity(graph = funcGraph
, mode = "out"
, method = "jaccard")
diag(sim) <- 0
sim[sim < 1] <- 0
simGraph <- igraph::graph_from_adjacency_matrix(adjmatrix = sim, mode = "undirected")
# Find groups with same out-neighbors (similarity == 1)
sameDeps <- igraph::max_cliques(graph = simGraph
, min = 2
)
# Write results
for (i in seq_along(sameDeps)) {
    cat(paste0("Group ", i, ": "))
    cat(paste(funcNames[as.numeric(sameDeps[[i]])], collapse = ", "))
    cat("\n")
}
#> Group 1: stamp_time, stamp_date
#> Group 2: ms, hm
#> Group 3: new_interval, %--%, int_diff
#> Group 4: floor_date, quarter, semester
#> Group 5: picoseconds, microseconds, nanoseconds, milliseconds
#> Group 6: weeks, days, years, seconds_to_period, seconds, new_period, minutes, hours
#> Group 7: yq, dmy, ymd_hms, ymd_hm, ymd_h, ymd, ydm_hms, ydm_hm, ydm_h, ydm, pretty_dates, parse_date_time2, parse_date_time, myd, mdy_hms, mdy_hm, mdy_h, mdy, local_time, fast_strptime, dym, dmy_hms, dmy_hm, dmy_h
Now, we have identified seven different groups of functions within lubridate that share the exact same dependencies. We could explore each group of functions for potential consolidation.
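For readers unfamiliar with the measure, here is a minimal base R sketch of Jaccard similarity on out-neighbor sets, using the direct dependencies from the baseballstats network. The `jaccard` helper is hypothetical, and scoring two empty sets as 0 is a choice made for this sketch, not necessarily what igraph does.

```r
# Out-neighbor sets (direct dependencies) for the baseballstats functions.
out_neighbors <- list(
    OPS          = c("slugging_avg", "batting_avg"),
    slugging_avg = "at_bats",
    batting_avg  = "at_bats",
    at_bats      = character(0)
)

# Jaccard similarity: size of the intersection over size of the union.
jaccard <- function(a, b) {
    union_size <- length(union(a, b))
    if (union_size == 0) {
        return(0)  # sketch convention for two empty sets
    }
    length(intersect(a, b)) / union_size
}

same_deps    <- jaccard(out_neighbors$slugging_avg, out_neighbors$batting_avg)
diff_deps    <- jaccard(out_neighbors$OPS, out_neighbors$slugging_avg)
partial_deps <- jaccard(c("a", "b"), c("b", "c"))
```

A similarity of exactly 1 means two functions call the same set of functions directly, which is the condition the grouping above filters on with `sim[sim < 1] <- 0`.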
[1] Edge direction was previously Independent -> Dependent. It was changed to Dependent -> Independent in version 0.3.0. The new convention follows the Unified Modeling Language (UML) framework, a widely used standard for software system modeling.
[2] Examples are from version 1.7.3 of lubridate.