This article walks through an example of using levitate
to compare text strings in the wild, and aims to give you a feel for the
pros and cons of the different string similarity measures provided by
the package.
levitate comes with a hotel_rooms dataset that contains descriptions of
the same hotel rooms from two different websites, Expedia and
Booking.com. The list was compiled by Susan Li; all credit to her for
the work.
head(hotel_rooms)
#> expedia
#> 1 Standard Room, 1 King Bed, Accessible
#> 2 Grand Corner King Room, 1 King Bed
#> 3 Suite, 1 King Bed (Parlor)
#> 4 High-Floor Premium Room, 1 King Bed
#> 5 Room, 1 King Bed, Accessible
#> 6 Room, 2 Double Beds (19th to 25th Floors)
#> booking
#> 1 Standard King Roll-in Shower Accessible
#> 2 Grand Corner King Room
#> 3 King Parlor Suite
#> 4 High-Floor Premium King Room
#> 5 King Room - Disability Access
#> 6 Two Double Beds - Location Room (19th to 25th Floors)
Let’s add columns to the dataset showing how the different algorithms score the two strings.
df <- hotel_rooms
df$lev_ratio <- lev_ratio(df$expedia, df$booking)
df$lev_partial_ratio <- lev_partial_ratio(df$expedia, df$booking)
df$lev_token_sort_ratio <- lev_token_sort_ratio(df$expedia, df$booking)
df$lev_token_set_ratio <- lev_token_set_ratio(df$expedia, df$booking)
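To get a feel for why a token-based score can differ from a plain ratio, here is a rough base R sketch of the token-sort idea: lower-case the strings, drop punctuation, sort the words, and only then compare. The helpers `simple_ratio()` and `sort_tokens()` are hypothetical names introduced for illustration, and the normalisation used (one minus the edit distance divided by the length of the longer string) is an assumption; this is not levitate's own implementation, so its scores may not match the package's exactly.

```r
# Stand-in similarity score built on utils::adist() (base R).
# Assumed normalisation: 1 - edit distance / length of the longer string.
simple_ratio <- function(a, b) {
  d <- utils::adist(a, b)[1, ]          # distances from `a` to each element of `b`
  1 - d / pmax(nchar(a), nchar(b))
}

# Hypothetical helper: lower-case, replace punctuation with spaces,
# split into words, sort the words and glue them back together.
sort_tokens <- function(x) {
  x <- tolower(gsub("[[:punct:]]+", " ", x))
  tokens <- strsplit(trimws(x), "[[:space:]]+")
  vapply(
    tokens,
    function(tk) paste(sort(tk, method = "radix"), collapse = " "),
    character(1)
  )
}

a <- "Room King"
b <- "King Room"

simple_ratio(a, b)                            # penalised for word order
simple_ratio(sort_tokens(a), sort_tokens(b))  # word order no longer matters
```

Sorting the words first means two descriptions that use the same words in a different order score as identical, which is often what you want for listings like these.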
We can write a function to return the best match from a list of candidates.
best_match <- function(a, b, FUN) {
  scores <- FUN(a = a, b = b)
  best <- order(scores, decreasing = TRUE)[1L]
  b[best]
}
best_match("cat", c("cot", "dog", "frog"), lev_ratio)
#> [1] "cot"
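When checking results by eye it can help to see the winning score alongside the match. The following is a minimal sketch of such a variant; `best_match_scored()` and the stand-in scorer `toy_ratio()` are hypothetical names used so the example runs without levitate installed, and `toy_ratio()` assumes a simple normalisation of `utils::adist()` rather than reproducing `lev_ratio()`.

```r
# Stand-in scorer (an assumption, not levitate's code):
# 1 - edit distance / length of the longer string.
toy_ratio <- function(a, b) {
  d <- utils::adist(a, b)[1, ]
  1 - d / pmax(nchar(a), nchar(b))
}

# Hypothetical variant of best_match() that also reports the score.
# which.max() returns the first maximum, so ties go to the earliest candidate.
best_match_scored <- function(a, b, FUN) {
  scores <- FUN(a, b)
  best <- which.max(scores)
  data.frame(
    input = a,
    match = b[best],
    score = scores[best],
    stringsAsFactors = FALSE
  )
}

res <- best_match_scored("cat", c("cot", "dog", "frog"), toy_ratio)
res
```

In practice you would pass one of the package's scoring functions as `FUN` instead of the stand-in. Note that ties are resolved in favour of the earliest candidate, so the order of the candidate list can matter.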
We can then use this to find out which of the Booking.com entries each of the functions chooses for each of the Expedia entries.
best_match_by_fun <- function(FUN) {
  best_matches <- character(nrow(hotel_rooms))
  for (i in seq_along(best_matches)) {
    best_matches[i] <- best_match(hotel_rooms$expedia[i], hotel_rooms$booking, FUN)
  }
  best_matches
}
df$lev_ratio_best_match <- best_match_by_fun(FUN = lev_ratio)
df$lev_partial_ratio_best_match <- best_match_by_fun(FUN = lev_partial_ratio)
df$lev_token_sort_ratio_best_match <- best_match_by_fun(FUN = lev_token_sort_ratio)
df$lev_token_set_ratio_best_match <- best_match_by_fun(FUN = lev_token_set_ratio)
We can now see what proportion of the rooms each algorithm matched correctly.
message("`lev_ratio()`: ", sum(df$lev_ratio_best_match == df$booking) / nrow(df))
#> `lev_ratio()`: 0.329411764705882
message("`lev_partial_ratio()`: ", sum(df$lev_partial_ratio_best_match == df$booking) / nrow(df))
#> `lev_partial_ratio()`: 0.223529411764706
message("`lev_token_sort_ratio()`: ", sum(df$lev_token_sort_ratio_best_match == df$booking) / nrow(df))
#> `lev_token_sort_ratio()`: 0.564705882352941
message("`lev_token_set_ratio()`: ", sum(df$lev_token_set_ratio_best_match == df$booking) / nrow(df))
#> `lev_token_set_ratio()`: 0.376470588235294
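The same comparison can be collected into a single named vector, which makes the scores easier to sort and compare. This is only a sketch on made-up data: the `toy` data frame and its `method_a`/`method_b` columns are invented for illustration and are not part of `hotel_rooms`. It relies on the fact that the mean of a logical vector is the proportion of `TRUE` values.

```r
# Made-up data: the correct answer plus the picks of two hypothetical methods.
toy <- data.frame(
  booking             = c("King Room", "Queen Room", "Twin Room", "Suite"),
  method_a_best_match = c("King Room", "Twin Room",  "Twin Room", "Suite"),
  method_b_best_match = c("King Room", "Queen Room", "Suite",     "King Room"),
  stringsAsFactors = FALSE
)

# Find every column holding best matches and score it against the truth.
match_cols <- grep("_best_match$", names(toy), value = TRUE)
accuracy <- vapply(
  match_cols,
  function(col) mean(toy[[col]] == toy$booking),
  numeric(1)
)

# Best-performing method first.
sort(accuracy, decreasing = TRUE)
```

Applied to the `df` built above, the same pattern would pick up the four `*_best_match` columns without writing one `message()` call per function.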