Getting started

Colin Fay

2019-03-20

Getting started with {tidystringdist}

About {tidystringdist}

{tidystringdist} is a package that extends the {stringdist} package with tidy data principles.

The idea is to perform string distance calculation and combine it with functions for data manipulation and visualisation from the tidyverse framework.

Installing tidystringdist

You can install the latest stable version from CRAN with:

install.packages("tidystringdist")

Or the dev version from GitHub:

# install.packages("remotes")
remotes::install_github("ColinFay/tidystringdist")

{tidystringdist} basic workflow

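`tidy_comb()` and `tidy_comb_all()`

The tidy_comb_all() function returns all possible pairs of the unique values in a vector or in a column of a data.frame. The tidy_comb() function pairs the elements of two vectors, as in the last example below, where each state name is paired with "Paris":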

library(tidystringdist)

tidy_comb_all(LETTERS[1:3])
#> # A tibble: 3 x 2
#>   V1    V2   
#> * <chr> <chr>
#> 1 A     B    
#> 2 A     C    
#> 3 B     C

tidy_comb_all(iris, Species)
#> # A tibble: 3 x 2
#>   V1         V2        
#> * <chr>      <chr>     
#> 1 setosa     versicolor
#> 2 setosa     virginica 
#> 3 versicolor virginica

tidy_comb("Paris", state.name)
#> # A tibble: 50 x 2
#>    V1          V2   
#>  * <chr>       <chr>
#>  1 Alabama     Paris
#>  2 Alaska      Paris
#>  3 Arizona     Paris
#>  4 Arkansas    Paris
#>  5 California  Paris
#>  6 Colorado    Paris
#>  7 Connecticut Paris
#>  8 Delaware    Paris
#>  9 Florida     Paris
#> 10 Georgia     Paris
#> # … with 40 more rows
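
For intuition, the pairs returned by tidy_comb_all() are the unordered pairs of distinct values, which base R's combn() can also enumerate. The following is a minimal base-R sketch of that idea, not the package's actual implementation; the helper name all_pairs() is hypothetical.

```r
# Hypothetical helper (not part of {tidystringdist}):
# all unordered pairs of the unique values of a vector.
all_pairs <- function(x) {
  x <- unique(as.character(x))
  pairs <- t(utils::combn(x, 2))  # one row per pair
  data.frame(V1 = pairs[, 1], V2 = pairs[, 2], stringsAsFactors = FALSE)
}

all_pairs(LETTERS[1:3])

# n distinct values give choose(n, 2) = n * (n - 1) / 2 pairs
nrow(all_pairs(state.name))
```

With 50 state names this gives 50 * 49 / 2 = 1,225 pairs, which matches the number of rows in the tibbles of the next section.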

Compute string distance

Once you have this data.frame, you can use tidy_stringdist() to compute string distances. This function takes a data.frame, the two columns containing the strings, and one or more {stringdist} methods.

comb <- tidy_comb_all(state.name) 
tidy_stringdist(comb)
#> # A tibble: 1,225 x 12
#>    V1    V2      osa    lv    dl hamming   lcs qgram cosine jaccard    jw
#>  * <chr> <chr> <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl>
#>  1 Alab… Alas…     3     3     3     Inf     5     5  0.216   0.571 0.254
#>  2 Alab… Ariz…     5     5     5       5    10    10  0.581   0.8   0.476
#>  3 Alab… Arka…     6     6     6     Inf     9     9  0.440   0.778 0.399
#>  4 Alab… Cali…     8     8     8     Inf    13    11  0.481   0.818 0.535
#>  5 Alab… Colo…     6     6     6     Inf    11    11  0.704   0.778 0.488
#>  6 Alab… Conn…    11    11    11     Inf    18    18  1       1     1    
#>  7 Alab… Dela…     5     5     5     Inf     9     9  0.440   0.778 0.399
#>  8 Alab… Flor…     5     5     5       5    10    10  0.581   0.8   0.476
#>  9 Alab… Geor…     6     6     6       6    12    12  0.686   0.909 0.571
#> 10 Alab… Hawa…     5     5     5     Inf     9     9  0.474   0.875 0.460
#> # … with 1,215 more rows, and 1 more variable: soundex <dbl>
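
The Inf values in the hamming column are expected: the Hamming distance is only defined for strings of equal length, and {stringdist} returns Inf otherwise. The sketch below calls stringdist::stringdist() directly on two of the pairs above. It assumes that each column of the tibble corresponds to the {stringdist} method of the same name.

```r
library(stringdist)

# "Alabama" (7 characters) vs "Alaska" (6 characters):
# different lengths, so the Hamming distance is Inf
stringdist("Alabama", "Alaska", method = "hamming")

# "Alabama" vs "Arizona": same length, so mismatching positions are counted
stringdist("Alabama", "Arizona", method = "hamming")

# OSA allows insertions and deletions, so it stays finite for both pairs
stringdist("Alabama", "Alaska", method = "osa")
```

If you only compare strings of varying lengths, an edit-based method such as "osa" or "lv" is usually more informative than "hamming".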

By default, all the methods are computed. You can select specific methods with the method argument:

comb <- tidy_comb_all(state.name)
tidy_stringdist(comb, method = c("osa","jw"))
#> # A tibble: 1,225 x 4
#>    V1      V2            osa    jw
#>  * <chr>   <chr>       <dbl> <dbl>
#>  1 Alabama Alaska          3 0.254
#>  2 Alabama Arizona         5 0.476
#>  3 Alabama Arkansas        6 0.399
#>  4 Alabama California      8 0.535
#>  5 Alabama Colorado        6 0.488
#>  6 Alabama Connecticut    11 1    
#>  7 Alabama Delaware        5 0.399
#>  8 Alabama Florida         5 0.476
#>  9 Alabama Georgia         6 0.571
#> 10 Alabama Hawaii          5 0.460
#> # … with 1,215 more rows
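
Because the result is a tibble, it can be passed straight to tidyverse verbs. The following sketch assumes {dplyr} is installed and keeps only the closest pairs; the threshold of 3 and the object name closest are arbitrary choices for illustration.

```r
library(tidystringdist)
library(dplyr)

comb <- tidy_comb_all(state.name)

closest <- comb %>%
  tidy_stringdist(method = c("osa", "jw")) %>%
  filter(osa <= 3) %>%  # keep pairs at most 3 edits apart
  arrange(jw)           # most similar pairs (by Jaro-Winkler) first

closest
```

The same pattern works with any other method column, for example filtering on jw instead of osa.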
