
levitate

library(levitate)

This article walks through an example of using levitate to compare text strings in the wild, and aims to give you a feel for the pros and cons of the different string similarity measures provided by the package.

levitate comes with the hotel_rooms dataset, which contains descriptions of the same hotel rooms from two different websites, Expedia and Booking.com. The list was compiled by Susan Li; all credit to her for the work.

head(hotel_rooms)
#>                                     expedia
#> 1     Standard Room, 1 King Bed, Accessible
#> 2        Grand Corner King Room, 1 King Bed
#> 3                Suite, 1 King Bed (Parlor)
#> 4       High-Floor Premium Room, 1 King Bed
#> 5              Room, 1 King Bed, Accessible
#> 6 Room, 2 Double Beds (19th to 25th Floors)
#>                                                 booking
#> 1               Standard King Roll-in Shower Accessible
#> 2                                Grand Corner King Room
#> 3                                     King Parlor Suite
#> 4                          High-Floor Premium King Room
#> 5                         King Room - Disability Access
#> 6 Two Double Beds - Location Room (19th to 25th Floors)

Let’s add columns to the dataset showing how each algorithm scores each pair of descriptions.

df <- hotel_rooms

df$lev_ratio <- lev_ratio(df$expedia, df$booking)
df$lev_partial_ratio <- lev_partial_ratio(df$expedia, df$booking)
df$lev_token_sort_ratio <- lev_token_sort_ratio(df$expedia, df$booking)
df$lev_token_set_ratio <- lev_token_set_ratio(df$expedia, df$booking)
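To get a feel for why the token-based scores can differ from the plain ratio, here is a rough sketch of the token-sort idea in base R. It is not levitate's implementation: `toy_ratio()` and `sort_tokens()` are hypothetical helpers, the similarity is assumed to be one minus the edit distance divided by the longer string's length, and the tokenisation (lowercase, split on non-alphanumeric characters) is also an assumption.

```r
# Stand-in similarity built on base R's utils::adist(); an assumption for
# illustration only, not levitate's implementation.
toy_ratio <- function(a, b) {
  1 - drop(utils::adist(a, b)) / pmax(nchar(a), nchar(b))
}

# Assumed normalisation: lowercase, split on non-alphanumerics, sort, rejoin.
sort_tokens <- function(x) {
  tokens <- strsplit(tolower(x), "[^[:alnum:]]+")
  vapply(
    tokens,
    function(tok) paste(sort(tok[nzchar(tok)]), collapse = " "),
    character(1)
  )
}

# Row 3 of hotel_rooms, copied by hand so the sketch runs without levitate.
a <- "Suite, 1 King Bed (Parlor)"
b <- "King Parlor Suite"

raw_score <- toy_ratio(a, b)
sorted_score <- toy_ratio(sort_tokens(a), sort_tokens(b))

# After sorting, the strings differ only by the extra tokens "1" and "bed",
# so sorted_score is 1 - 6/23 (about 0.74) and exceeds raw_score.
print(c(raw = raw_score, sorted = sorted_score))
```

Sorting removes the penalty for word order and punctuation, which is what hurts the plain ratio on pairs like this one.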

A simple matching model

We can write a function to return the best match from a list of candidates.

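best_match <- function(a, b, FUN) {
  # Score the single string `a` against every candidate in `b`
  scores <- FUN(a = a, b = b)
  # Index of the highest-scoring candidate
  best <- order(scores, decreasing = TRUE)[1L]
  b[best]
}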

best_match("cat", c("cot", "dog", "frog"), lev_ratio)
#> [1] "cot"
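When inspecting matches it can help to see the winning score alongside the winning candidate. The following is a minimal sketch of such a variant. `best_match_scored()` is a hypothetical helper, not part of levitate, and `toy_ratio()` is a base R stand-in scorer (an assumption) so that the example runs without the package; any function taking `a` and `b`, such as `lev_ratio`, could be passed instead.

```r
# Stand-in scorer built on utils::adist(); an assumption for illustration,
# not levitate's implementation.
toy_ratio <- function(a, b) {
  1 - drop(utils::adist(a, b)) / pmax(nchar(a), nchar(b))
}

# Hypothetical variant of best_match() that also returns the winning score.
best_match_scored <- function(a, b, FUN) {
  scores <- FUN(a = a, b = b)
  best <- which.max(scores)  # first maximum, so ties go to the earliest candidate
  list(match = b[best], score = scores[best])
}

result <- best_match_scored("cat", c("cot", "dog", "frog"), toy_ratio)
print(result$match)
print(result$score)
```

A low winning score is a hint that even the best candidate may not be a true match.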

We can then use this to find out which Booking.com entry each function chooses for each Expedia entry.

best_match_by_fun <- function(FUN) {
  best_matches <- character(nrow(hotel_rooms))
  for (i in seq_along(best_matches)) {
    best_matches[i] <- best_match(hotel_rooms$expedia[i], hotel_rooms$booking, FUN)
  }
  best_matches
}

df$lev_ratio_best_match <- best_match_by_fun(FUN = lev_ratio)
df$lev_partial_ratio_best_match <- best_match_by_fun(FUN = lev_partial_ratio)
df$lev_token_sort_ratio_best_match <- best_match_by_fun(FUN = lev_token_sort_ratio)
df$lev_token_set_ratio_best_match <- best_match_by_fun(FUN = lev_token_set_ratio)

We can now see what proportion of the Expedia entries each algorithm matched to the correct Booking.com entry.

message("`lev_ratio()`: ", sum(df$lev_ratio_best_match == df$booking) / nrow(df))
#> `lev_ratio()`: 0.329411764705882

message("`lev_partial_ratio()`: ", sum(df$lev_partial_ratio_best_match == df$booking) / nrow(df))
#> `lev_partial_ratio()`: 0.223529411764706

message("`lev_token_sort_ratio()`: ", sum(df$lev_token_sort_ratio_best_match == df$booking) / nrow(df))
#> `lev_token_sort_ratio()`: 0.564705882352941

message("`lev_token_set_ratio()`: ", sum(df$lev_token_set_ratio_best_match == df$booking) / nrow(df))
#> `lev_token_set_ratio()`: 0.376470588235294
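The matching and scoring steps above can also be folded into one helper that takes a named list of scoring functions. This is only a sketch under stated assumptions: `match_accuracy()` is a hypothetical helper, `rooms` is a three-row toy data frame standing in for hotel_rooms, and both scorers are stand-ins written in base R so that the example runs without levitate.

```r
# Stand-in scorer built on utils::adist(); an assumption for illustration.
toy_ratio <- function(a, b) {
  1 - drop(utils::adist(a, b)) / pmax(nchar(a), nchar(b))
}

# Deliberately poor scorer: always prefers the first candidate.
first_candidate <- function(a, b) rev(seq_along(b))

# Toy data with the same column names as hotel_rooms.
rooms <- data.frame(
  expedia = c("cat", "dig", "frag"),
  booking = c("cot", "dog", "frog"),
  stringsAsFactors = FALSE
)

# Proportion of rows where the top-scoring candidate is the row's own entry.
match_accuracy <- function(data, FUN) {
  picks <- vapply(
    data$expedia,
    function(a) data$booking[which.max(FUN(a = a, b = data$booking))],
    character(1),
    USE.NAMES = FALSE
  )
  mean(picks == data$booking)
}

scorers <- list(toy_ratio = toy_ratio, first_candidate = first_candidate)
accuracy <- vapply(scorers, function(f) match_accuracy(rooms, f), numeric(1))
print(accuracy)
```

With levitate loaded, the same helper could be called with hotel_rooms and a list such as `list(lev_ratio = lev_ratio, lev_token_sort_ratio = lev_token_sort_ratio)`.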
