# Elegant URL handling with urltools

URLs are treated, by base R, as nothing more than components of a data retrieval process: they exist to create connections to retrieve datasets. This is an essential feature for the language to have, but it also means that URL handlers are designed for situations where URLs get you to the data - not situations where URLs are the data.
As web analytics and research around human-computer interaction become more prominent, there is an increasing need for packages that can handle things like server-side logs. urltools is intended to provide the URL-centric component of this, containing functions for encoding, decoding, parsing and component extraction.
Base R provides two functions - `URLdecode` and `URLencode` - for taking percent-encoded URLs and turning them into regular strings, or vice versa. As discussed, these are primarily designed to enable connections, and so they have several inherent limitations, including a lack of vectorisation, that make them unsuitable for large datasets.
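To illustrate the vectorisation gap: decoding a whole vector of URLs with base R has historically meant wrapping `URLdecode` in an apply-style loop. A minimal sketch (the example values here are made up):

```r
# Sketch: URLdecode historically handles one string at a time, so a
# vector of URLs has to be fed through an apply-style loop.
urls <- c("foo%20bar", "baz%2Fqux")  # hypothetical example values
vapply(urls, URLdecode, character(1), USE.NAMES = FALSE)
#[1] "foo bar" "baz/qux"
```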
Not only are they not vectorised, they also have several particularly idiosyncratic bugs and limitations: `URLdecode`, for example, breaks if the decoded value is out of range:

```r
URLdecode("test%gIL")
#Error in rawToChar(out) : embedded nul in string: '\0L'
#In addition: Warning message:
#In URLdecode("test%gIL") : out-of-range values treated as 0 in coercion to raw
```
`URLencode`, on the other hand, encodes slashes on its most strict setting - without paying attention to where those slashes are. If we attempt to `URLencode` an entire URL, we get:
URLencode("https://en.wikipedia.org/wiki/Article", reserved = TRUE)
#[1] "https%3a%2f%2fen.wikipedia.org%2fwiki%2fArticle"
That's a completely unusable URL (or ewRL, if you will).
urltools replaces both functions with `url_decode` and `url_encode` respectively:
```r
library(urltools)
url_decode("test%gIL")
#[1] "test"
url_encode("https://en.wikipedia.org/wiki/Article")
#[1] "https://en.wikipedia.org%2fwiki%2fArticle"
```
As you can see, `url_decode` simply excludes out-of-range characters from consideration, while `url_encode` detects characters that make up part of the URL's scheme, and leaves them unencoded.
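Both functions are also vectorised, so an entire column of URLs can be processed in a single call. A quick sketch (the example vector is, again, made up):

```r
# Sketch: url_encode and url_decode accept vectors, so a whole column of
# URLs from, say, a server log can be handled in one call.
library(urltools)
urls <- c("https://en.wikipedia.org/wiki/Main Page",     # hypothetical
          "https://en.wikipedia.org/wiki/Glottal stop")  # hypothetical
url_encode(urls)
```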
Once you've got your nicely decoded (or encoded) URLs, it's time to do something with them - and, most of the time, you won't actually care about most of the URL. You'll want to look at the scheme, or the domain, or the path, but not the entire thing as one string.
The solution is `url_parse`, which takes a URL and breaks it out into its RFC 3986 components: scheme, domain, port, path, query string and fragment identifier. This is, again, fully vectorised, and can happily and rapidly be run over hundreds of thousands of URLs.
url_parse("https://en.wikipedia.org/wiki/Article")
#[[1]]
#[1] "https" "en.wikipedia.org" "" "wiki/article" "" ""
One of the fields, as mentioned, is the query string - commonly used as a component of web form requests or API calls. Along with `url_parse`, we also have `url_param`, which extracts the value that a particular query parameter holds:
```r
url_param(urls = "https://en.wikipedia.org/w/api.php?action=sitematrix&format=xml&smstate=all",
          parameter = "smstate")
#[1] "all"
```
If you have ideas for other URL handlers that would make your data processing easier, the best approach is to either request them or add them yourself!