Introduction to rtrek

The rtrek package provides datasets related to the Star Trek fictional universe and functions for working with those datasets. The package interfaces with Wikipedia, the Star Trek API (STAPI), Memory Alpha and Memory Beta to retrieve data, metadata and other information relating to Star Trek. It also contains local datasets covering a variety of topics such as Star Trek universe species data, geopolitical data, and datasets resulting from text mining analyses of Star Trek novels. This introduction provides a brief, example-driven overview of rtrek.

Local datasets

Package datasets in rtrek are somewhat eclectic and currently limited. They will expand with further package development. To list all available package datasets with a short description, call st_datasets.

library(rtrek)
st_datasets()
#> # A tibble: 6 x 2
#>   dataset       description                             
#>   <chr>         <chr>                                   
#> 1 stGeo         Map tile set locations of interest.     
#> 2 stSpecies     Basic intelligent species data.         
#> 3 stTiles       Available map tile sets.                
#> 4 stBooks       Star Trek novel metadata.               
#> 5 stBooksWP     Star Trek novel metadata from Wikipedia.
#> 6 stapiEntities Star Trek API (STAPI) categories

Star Trek novels

The stBooksWP dataset provides a moderately curated data frame of 715 Star Trek books published since the early days of the Original Series episode adaptations by James Blish up through the latest novels as of the most recent rtrek update. stBooksWP is not an exhaustive account, but it is sufficiently comprehensive, containing most published books listed on Wikipedia. The omissions are mostly titles that were more difficult to web scrape, such as small anthologies listed in footnotes rather than in tables online.

stBooksWP
#> # A tibble: 715 x 6
#>    Series                   Title    Author   Number Timeframe Released   
#>    <chr>                    <chr>    <chr>     <int> <chr>     <chr>      
#>  1 The_Original_Series - B~ Star Tr~ James B~     NA <NA>      1967-01-01~
#>  2 The_Original_Series - B~ Star Tr~ James B~     NA <NA>      1968-02-01~
#>  3 The_Original_Series - B~ Star Tr~ James B~     NA <NA>      1969-04-01~
#>  4 The_Original_Series - B~ Star Tr~ James B~     NA <NA>      1971-07-01~
#>  5 The_Original_Series - B~ Star Tr~ James B~     NA <NA>      1972-02-01~
#>  6 The_Original_Series - B~ Star Tr~ James B~     NA <NA>      1972-04-01~
#>  7 The_Original_Series - B~ Star Tr~ James B~     NA <NA>      1972-07-01~
#>  8 The_Original_Series - B~ Star Tr~ James B~     NA <NA>      1972-11-01~
#>  9 The_Original_Series - B~ Star Tr~ James B~     NA <NA>      1973-08-01~
#> 10 The_Original_Series - B~ Star Tr~ James B~     NA <NA>      1974-02-01~
#> # ... with 705 more rows

Some curation decisions were made in compiling this data frame, both in which metadata to include and in not attempting a completely exhaustive list. However, the master Wikipedia page for Star Trek literature can be browsed at any time. As if that were not easy enough, rtrek also offers a convenience function, st_book_series, to load the page in a browser tab auto-scrolled to a specific series of interest.

To use this function effectively, first call it with no arguments; it returns a table of the available series abbreviations.

st_book_series()
#> # A tibble: 17 x 2
#>    series                                abb       
#>    <chr>                                 <chr>     
#>  1 The Original Series                   TOS       
#>  2 The Next Generation                   TNG       
#>  3 Deep Space Nine                       DS9       
#>  4 Voyager                               VOY       
#>  5 Enterprise                            ENT       
#>  6 Discovery                             DSC       
#>  7 New Frontier                          NF        
#>  8 Stargazer                             SG        
#>  9 IKS Gorkon/Klingon Empire             IKE       
#> 10 Titan                                 TIT       
#> 11 Vanguard                              VAN       
#> 12 Seekers                               SKR       
#> 13 Mini-series                           miniseries
#> 14 Starfleet Corps of Engineers          SCE       
#> 15 Department of Temporal Investigations DTI       
#> 16 Mirror Universe                       MIR       
#> 17 Starfleet Academy                     SFA

Then call it with a specific acronym ID and the page will load at the desired table entry.

st_book_series("DS9")

This package data only scratches the surface of Star Trek novels. A later section provides a brief introduction to Star Trek novel data compiled from text mining analyses of the actual book content. Unlike stBooksWP, which is limited to a metadata overview, these other datasets contain quantifiable variables much more suitable for interesting statistical analysis.

Spatial maps

The stTiles data frame shows all available Star Trek-themed map tile sets along with metadata and attribution information. These map tiles can be used with the leaflet and shiny packages to make interactive maps situated in the Star Trek universe.

stTiles
#> # A tibble: 2 x 8
#>   id      url    description width height tile_creator map_creator map_url
#>   <chr>   <chr>  <chr>       <dbl>  <dbl> <chr>        <chr>       <chr>  
#> 1 galaxy1 https~ Geopolitic~  8000   6445 Matthew Leo~ Rob Archer  https:~
#> 2 galaxy2 https~ Geopolitic~  5000   4000 Matthew Leo~ <NA>        http:/~

The list is scant at the moment, but more will come. One thing to keep in mind is that these tile sets use a simple, non-geographic coordinate reference system (CRS). Clearly, they are not Earth-based, though they are spatial in more ways than one!

Similar to game maps, there is a sense of space, but it is a simple Cartesian coordinate system and does not use geographic projections like you may be used to working with when analyzing spatial data or making Leaflet maps. This system is much simpler, but simple does not necessarily mean easy!
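In practice, this means telling leaflet to use its simple, non-geographic CRS rather than a geographic projection. As a minimal sketch (the full, working map example appears later in this section):

library(leaflet)
# A non-geographic tile set uses Leaflet's simple Cartesian CRS
leaflet(options = leafletOptions(crs = leafletCRS("L.CRS.Simple")))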

Inspect stGeo:

stGeo
#> # A tibble: 18 x 4
#>    id      loc          col   row
#>    <chr>   <chr>      <dbl> <dbl>
#>  1 galaxy1 Earth       2196  2357
#>  2 galaxy1 Romulus     2615  1742
#>  3 galaxy1 Qo'noS      3310  3361
#>  4 galaxy1 Breen       1004   939
#>  5 galaxy1 Ferenginar  1431  1996
#>  6 galaxy1 Cardassia   1342  2841
#>  7 galaxy1 Tholia       407  3866
#>  8 galaxy1 Tzenketh    1553  2557
#>  9 galaxy1 Talar       1039  3489
#> 10 galaxy2 Earth       2201  1595
#> 11 galaxy2 Romulus     2514  1178
#> 12 galaxy2 Qo'noS      3197  2303
#> 13 galaxy2 Breen       1228  1181
#> 14 galaxy2 Ferenginar  2026   886
#> 15 galaxy2 Cardassia   1543  1903
#> 16 galaxy2 Tholia       713  2971
#> 17 galaxy2 Tzenketh    1734  1721
#> 18 galaxy2 Talar       1338  2368

This is another small dataset containing locations of key planets in the Star Trek universe. Notice that the coordinates are not geographic: there is no latitude or longitude. Instead, there are row and column entries defining cells in a matrix. The matrix dimensions are defined by the pixel dimensions of the source map that was used to create each tile set.

The coordinates are also not consistent. Source maps differ significantly. Even if they had identical pixel dimensions, which they do not, each artist’s visual rendering of the fictional universe will place locations differently in space. In this sense, every tile set has a unique coordinate reference system. For each new tile set produced, all locations of interest must be georeferenced again.
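A quick way to see this inconsistency is to compare the same location across tile sets; for example, Earth's grid cell differs between galaxy1 and galaxy2:

# Earth's (col, row) grid cell differs between the two tile sets
subset(stGeo, loc == "Earth")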

This is not ideal, but it gets worse. Once you have location coordinates defined for a particular tile set, the leaflet package does not work in these row and column grids. The (col, row) pairs need to be transformed, or projected, into Leaflet space. Fortunately, rtrek does this part for you with tile_coords. It takes a data frame like the one returned by st_tiles_data, with columns named col and row, as well as the name of an available Star Trek map tile set. It returns a data frame with new columns x and y that will map properly in a leaflet map built on that tile set.

id <- "galaxy1"
(d <- st_tiles_data(id))
#> # A tibble: 9 x 8
#>   id      loc          col   row body   category  zone            species 
#>   <chr>   <chr>      <dbl> <dbl> <chr>  <chr>     <chr>           <chr>   
#> 1 galaxy1 Earth       2196  2357 Planet Homeworld United Federat~ Human   
#> 2 galaxy1 Romulus     2615  1742 Planet Homeworld Romulan Star E~ Romulan 
#> 3 galaxy1 Qo'noS      3310  3361 Planet Homeworld Klingon Empire  Klingon 
#> 4 galaxy1 Breen       1004   939 Planet Homeworld Breen Confeder~ Breen   
#> 5 galaxy1 Ferenginar  1431  1996 Planet Homeworld Ferengi Allian~ Ferengi 
#> 6 galaxy1 Cardassia   1342  2841 Planet Homeworld Cardassian Uni~ Cardass~
#> 7 galaxy1 Tholia       407  3866 Planet Homeworld Tholian Assemb~ Tholian 
#> 8 galaxy1 Tzenketh    1553  2557 Planet Homeworld Tzenkethi Coal~ Tzenket~
#> 9 galaxy1 Talar       1039  3489 Planet Homeworld Talarian Repub~ Talarian
(d <- tile_coords(d, id))
#> # A tibble: 9 x 10
#>   id      loc          col   row body  category zone  species     x      y
#>   <chr>   <chr>      <dbl> <dbl> <chr> <chr>    <chr> <chr>   <dbl>  <dbl>
#> 1 galaxy1 Earth       2196  2357 Plan~ Homewor~ Unit~ Human    68.6  -73.7
#> 2 galaxy1 Romulus     2615  1742 Plan~ Homewor~ Romu~ Romulan  81.7  -54.4
#> 3 galaxy1 Qo'noS      3310  3361 Plan~ Homewor~ Klin~ Klingon 103.  -105. 
#> 4 galaxy1 Breen       1004   939 Plan~ Homewor~ Bree~ Breen    31.4  -29.3
#> 5 galaxy1 Ferenginar  1431  1996 Plan~ Homewor~ Fere~ Ferengi  44.7  -62.4
#> 6 galaxy1 Cardassia   1342  2841 Plan~ Homewor~ Card~ Cardas~  41.9  -88.8
#> 7 galaxy1 Tholia       407  3866 Plan~ Homewor~ Thol~ Tholian  12.7 -121. 
#> 8 galaxy1 Tzenketh    1553  2557 Plan~ Homewor~ Tzen~ Tzenke~  48.5  -79.9
#> 9 galaxy1 Talar       1039  3489 Plan~ Homewor~ Tala~ Talari~  32.5 -109.

Here is an example using the galaxy1 map with leaflet. The st_tiles function is used to link to the tile provider.

library(leaflet)
tiles <- st_tiles("galaxy1")
leaflet(d, options = leafletOptions(crs = leafletCRS("L.CRS.Simple"))) %>%
  addTiles(tiles) %>% setView(108, -75, 2) %>%
  addCircleMarkers(lng = ~x, lat = ~y, label = ~loc, color = "white", radius = 20)


The stSpecies dataset is just a small table that pairs species names with representative thumbnail avatars, mostly pulled from the Memory Alpha website. There is nothing map-related here, but these are used in this Stellar Cartography example. It is similar to the Leaflet example above, but a bit more interesting, with markers to click on and information displays.
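To see exactly which species and avatars it pairs, print the small dataset directly:

stSpecies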

In the course of the above map-related examples, a few functions have also been introduced. st_tiles takes an id argument that is mapped to the available tile sets in stTiles and returns the relevant URL. st_tiles_data takes the same id argument and returns a simple example data frame containing ancillary data related to the available locations from stGeo. The result is always the same except that the grid cells for locations change with respect to the chosen tile set. Finally, tile_coords can be applied to one of these data frames to add x and y columns for a CRS that Leaflet will understand.
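For example, the same calls shown above for galaxy1 work for the other available tile set:

# Tile provider URL and example location data for the galaxy2 tile set
st_tiles("galaxy2")
st_tiles_data("galaxy2")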

Star Trek API

To use the words of the developers, the STAPI is

the first public Star Trek API, accessible via REST and SOAP. It’s an open source project, that anyone can contribute to.

The API is highly functional. Please do not abuse it with constant requests. The project's pages suggest no more than one request per second, but I would suggest ten seconds between successive requests. The default anti-DDOS measures in rtrek limit requests to one per second. You can update this global rtrek setting with options, e.g. options(rtrek_antiddos = 10) for a minimum ten-second wait between API calls, to be an even better neighbor. rtrek will not permit faster requests: if the option is set below one second, it is ignored and a warning is thrown when making any API call.
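For example, to enforce the suggested ten-second wait for the rest of your session:

# Require at least ten seconds between successive STAPI requests
options(rtrek_antiddos = 10)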

STAPI entities

There are many fields, or entities, available in the API. The available IDs can be found in this table:

stapiEntities
#> # A tibble: 40 x 4
#>    id                 class   ncol colnames  
#>    <chr>              <chr>  <int> <list>    
#>  1 animal             tbl_df     7 <chr [7]> 
#>  2 astronomicalObject tbl_df     5 <chr [5]> 
#>  3 book               tbl_df    24 <chr [24]>
#>  4 bookCollection     tbl_df    10 <chr [10]>
#>  5 bookSeries         tbl_df    11 <chr [11]>
#>  6 character          tbl_df    24 <chr [24]>
#>  7 comicCollection    tbl_df    14 <chr [14]>
#>  8 comics             tbl_df    15 <chr [15]>
#>  9 comicSeries        tbl_df    15 <chr [15]>
#> 10 comicStrip         tbl_df    12 <chr [12]>
#> # ... with 30 more rows

These ID values are passed to stapi to perform a search using the API. The other columns provide some information about the object returned from a search. All entity searches return tibble data frames. You can inspect or unnest the column names of each table returned from every available entity search so you can see beforehand what variables are associated with each entity.
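For example, to see beforehand which variables a character search returns, without making any API call:

# Column names returned by a "character" entity search
stapiEntities$colnames[stapiEntities$id == "character"][[1]]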

Accessing the API

Using stapi should be thought of as a three-part process: a “safe mode” check of how many pages of results exist for a search, the search itself, which returns data, and finally extraction of additional data about a specific entity using its unique ID.

To determine how many pages of results exist for a given search, set page_count = TRUE. The impact on the API will be equivalent to searching only a single page of results. One page contains metadata including the total number of pages. Nothing is returned in this “safe mode”, but the total number of pages of results available is printed to the console.

Searching for movies returns only one page of results. However, there are a lot of characters in the Star Trek universe. Check the total pages available for a character search.

stapi("character", page_count = TRUE)
#> Total pages to retrieve all results: 62

And that is with 100 results per page!

The default page = 1 returns only the first page. page can be a vector, e.g. page = 1:62. Results from multi-page searches are automatically combined into a single, consistent data frame output. For this second call to stapi, return only page two, which contains the character Q (currently, pending future character database updates that may shift the indexing). In case that does change and Q is no longer near the top of page two of the search results, the example further below hard-codes his unique/universal ID.

stapi("character", page = 2)
#> # A tibble: 100 x 24
#>    uid     name    gender yearOfBirth monthOfBirth dayOfBirth placeOfBirth
#>    <chr>   <chr>   <chr>        <int>        <int>      <int> <lgl>       
#>  1 CHMA00~ Fuller  M               NA           NA         NA NA          
#>  2 CHMA00~ Burkus  M               NA           NA         NA NA          
#>  3 CHMA00~ Masaka~ <NA>            NA           NA         NA NA          
#>  4 CHMA00~ Thorne  M               NA           NA         NA NA          
#>  5 CHMA00~ Ah-Kel  M               NA           NA         NA NA          
#>  6 CHMA00~ Robert~ <NA>            NA           NA         NA NA          
#>  7 CHMA00~ Q       M               NA           NA         NA NA          
#>  8 CHMA00~ John D~ <NA>            NA           NA         NA NA          
#>  9 CHMA00~ Louis ~ <NA>            NA           NA         NA NA          
#> 10 CHMA00~ Marat ~ M               NA           NA         NA NA          
#> # ... with 90 more rows, and 17 more variables: yearOfDeath <int>,
#> #   monthOfDeath <lgl>, dayOfDeath <lgl>, placeOfDeath <lgl>,
#> #   height <int>, weight <int>, deceased <lgl>, bloodType <chr>,
#> #   maritalStatus <chr>, serialNumber <chr>, hologramActivationDate <chr>,
#> #   hologramStatus <chr>, hologramDateStatus <lgl>, hologram <lgl>,
#> #   fictionalCharacter <lgl>, mirror <lgl>, alternateReality <lgl>

Character tables can be sparse. There are a lot of variables, many of which will contain missing data for rare, esoteric characters. Even for more popular characters about whom much more universe lore has been uncovered, it still takes dedicated nerds to enter all the data in a database.

When a dataset contains a uid column, it can be used subsequently to extract a satellite dataset about that particular observation returned in the original search. First you used safe mode, then search mode; now switch from search mode to extraction mode to obtain data about Q specifically. All that is required is to pass Q’s uid to stapi and call the function one last time. When uid is no longer NULL, stapi knows not to bother with a search and makes a different type of API call requesting information about the uniquely identified entry.

Q <- "CHMA0000025118"
Q <- stapi("character", uid = Q)

library(dplyr)
Q$episodes %>% select(uid, title, stardateFrom, stardateTo)
#>              uid                 title stardateFrom stardateTo
#> 1 EPMA0000001458    All Good Things...      47988.0    47988.0
#> 2 EPMA0000001329                 Q Who      42761.3    42761.3
#> 3 EPMA0000000483 Encounter at Farpoint      41153.7    41153.7
#> 4 EPMA0000162588            Death Wish           NA         NA
#> 5 EPMA0000001510    The Q and the Grey      50384.2    50392.7
#> 6 EPMA0000000845                Q-Less      46531.2    46531.2
#> 7 EPMA0000000651              Tapestry           NA         NA
#> 8 EPMA0000001413                True Q      46192.3    46192.3
#> 9 EPMA0000001377                  Qpid      44741.9    44741.9

The data returned on Q is actually a large list, including multiple data frames. For simplicity only a piece of it is shown above.
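To explore what else is returned, inspect the top-level structure of the list:

# Q is a list of many elements, several of which are data frames
names(Q)
str(Q, max.level = 1)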

Star Trek novel text mining

This section will be continued in a future version of rtrek. For now, what is available is the stBooks dataset, which complements the stBooksWP dataset seen earlier. stBooks has a similar number of metadata entries for Star Trek books and there is considerable overlap between the two datasets, but there are also notable differences in both entries and formatting.

This dataset represents metadata parsed, imperfectly but painstakingly, from actual Star Trek books. Compared to stBooksWP, which represents a scraping of Wikipedia information on Star Trek books, stBooks contains several different fields, including fields more useful to analysts such as the number of words and chapters in each book.

stBooks
#> # A tibble: 743 x 11
#>    title  creator date  publisher identifier series subseries nchap  nword
#>    <chr>  <chr>   <chr> <chr>     <chr>      <chr>  <chr>     <int>  <int>
#>  1 The M~ Gene R~ 1979~ Simon an~ 978074341~ The O~ TOS 01 -~    30  54522
#>  2 Dread~ Diane ~ 1986~ Pocket B~ 978074341~ The O~ TOS 29 -~    15  58867
#>  3 Battl~ Diane ~ 1986~ Simon an~ 978074341~ The O~ TOS 31 -~    13  80159
#>  4 The T~ Barbar~ 1988~ Simon an~ 978074341~ The O~ TOS 41 -~    13  71465
#>  5 Home ~ Dana K~ 1990~ Simon an~ 978074342~ The O~ TOS 52 -~    52  68778
#>  6 Ghost~ Barbar~ 1991~ Simon an~ 0743420047 The O~ TOS 53 -~    21  80103
#>  7 First~ Diane ~ 1995~ Pocket B~ 978074342~ The O~ TOS 75 -~    45 110913
#>  8 The A~ Dayton~ 2002~ Pocket B~ 978074346~ The O~ TOS - Th~     1   7439
#>  9 World~ Judith~ 2003~ Pocket B~ 0743488148 The O~ TOS - Wo~    64 207246
#> 10 Duty,~ Noveli~ 2004~ Pocket B~ 978074349~ The O~ TOS - Du~    42 196779
#> # ... with 733 more rows, and 2 more variables: nchar <int>,
#> #   dedication <chr>
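As a brief sketch of the kind of summary these fields make possible (not something provided by the package itself), here is average book length by series:

library(dplyr)
# Average word count and number of books by series, longest books first
stBooks %>%
  group_by(series) %>%
  summarize(books = n(), avg_words = mean(nword, na.rm = TRUE)) %>%
  arrange(desc(avg_words))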

Obviously, licensed book content itself cannot be shared as data, so it is not possible for rtrek to enable analysts to perform their own unique text mining analyses on Star Trek novel corpora. However, future versions of rtrek will include more summary datasets that aim to represent more interesting variables. A couple of examples could be the relative frequency of popular characters’ names per book, a sentiment analysis, or any other set of interesting metrics that could be used to inform suggested reading lists of various titles or books by particular authors with a favored style or focus.