The censusxy package is designed to provide easy and efficient access to the U.S. Census Bureau Batch Geocoder in R.
The censusxy package has been developed specifically with large data sets in mind. Other R packages for accessing the Census Bureau’s geocoding API (e.g. the censusr package) require iteration to geocode multiple addresses at once. censusxy, on the other hand, is designed to operate on a column of addresses in a data frame or tibble object. Additionally, the Census Bureau caps the number of addresses that can be sent to the API in a single call at 1,000. If a data set exceeds 1,000 unique addresses, it is automatically split into appropriately sized API calls, geocoded, and then put back together so that a single object is returned. The package therefore provides an efficient solution to batch geocoding via the Census Bureau’s services.
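The split-and-recombine behavior described above can be illustrated with a minimal base R sketch. This is not the package's internal code: the helper geocode_in_batches() and the stand-in fake_geocoder() are hypothetical names introduced only for illustration, and the sketch returns one row per unique address rather than joining results back to the original rows.

```r
# Hypothetical helper (not part of censusxy): split unique addresses into
# batches of at most `batch_size`, process each batch, and recombine.
geocode_in_batches <- function(addresses, geocode_batch, batch_size = 1000) {
  unique_addresses <- unique(addresses)
  # Assign each unique address to a batch: 1, 1, ..., 2, 2, ...
  batch_id <- ceiling(seq_along(unique_addresses) / batch_size)
  batches <- split(unique_addresses, batch_id)
  # Process each batch separately, then bind the pieces into one data frame
  results <- lapply(unname(batches), geocode_batch)
  combined <- do.call(rbind, results)
  rownames(combined) <- NULL
  combined
}

# Stand-in for the API call (an assumption): returns placeholder coordinates
fake_geocoder <- function(batch) {
  data.frame(address = batch, lon = NA_real_, lat = NA_real_,
             stringsAsFactors = FALSE)
}

# 2,500 made-up addresses are processed as three batches (1,000 + 1,000 + 500)
addresses <- paste(seq_len(2500), "Main St")
out <- geocode_in_batches(addresses, fake_geocoder)
```

Deduplicating before batching means repeated addresses do not count against the 1,000-address cap more than once.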
The U.S. Census Bureau makes their geocoding API available without any API key, and this package allows for virtually unlimited batch geocoding. Please use this package responsibly, as others will need use of this API for their research.
We recommend that users install sf before proceeding with the installation of censusxy. Windows users should be able to install sf without significant issues, but macOS and Linux users will need to install several open source spatial libraries to get sf itself up and running. The easiest approach for macOS users is to install the GDAL 2.0 Complete framework from KyngChaos.
For Linux users, steps will vary based on the flavor being used. Our configuration file for Travis CI and its associated bash script should be useful in determining the necessary components to install.
Once sf is installed, the easiest way to get censusxy is to install it from CRAN.
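A minimal sketch of the CRAN installation, using only the standard install.packages() function. It assumes the system libraries that sf depends on are already in place, as described above.

```r
# Install sf first, as recommended, then censusxy (both from CRAN)
install.packages("sf")
install.packages("censusxy")
```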
The development version of censusxy can be installed from GitHub with the remotes package.
The key function, cxy_geocode(), supports non-standard evaluation, meaning you can use either quoted or unquoted inputs for arguments that refer to variable names.
This implementation assumes that your data are contained in a data.frame or tibble, and that address data are split into a number of component variables: street address, city, state, and five-digit zip code. If your data are not split into components, the authors recommend the postmastr package for street address parsing. Not all components are required. For example, the sample homicide data included in the package lack zip code data. However, the more components you have, the better your results will be. Both sample data objects in this package present data as they should be formatted for geocoding.
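As an illustration of the component layout described above, here is a small data frame with one column per address component. The rows and the object name example_addresses are made up for this sketch and are not the package's sample data; the column names follow those used in the cxy_geocode() call elsewhere in this README.

```r
# Illustrative rows only (not the package's sample data)
example_addresses <- data.frame(
  id = 1:2,
  street_address = c("1600 Pennsylvania Ave NW", "20 N Grand Blvd"),
  city = c("Washington", "St. Louis"),
  state = c("DC", "MO"),
  # Store zip codes as character so leading zeros are never dropped
  postal_code = c("20500", "63103"),
  stringsAsFactors = FALSE
)

# A numeric zip code can be padded back to five digits if needed
padded_zip <- sprintf("%05d", 2108L)
```

Keeping zip codes as character strings is a design choice made for this sketch; numeric storage would silently turn a value such as 02108 into 2108.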
This package contains a single exported function, cxy_geocode(). The only required arguments are .data for the data.frame or tibble containing address data, and address specifying the column name containing street addresses. The function supports non-standard evaluation, meaning you do not need to quote arguments for column names.
However, it is highly recommended that you include city, state, and zip code as well. Doing so will increase speed and accuracy significantly. The homicide data also contain city and state columns, so the preferred call for these data supplies them along with the street address.
Finally, two output types are supported. By default, a tibble is returned (output = "tibble") with a minimal set of variables that describe the accuracy of a given observation’s geocoding (style = "minimal"). A complete set of values returned by the API for each observation can be obtained by using style = "full". Alternatively, an sf object can be returned with the geocoded data projected using the WGS 1984 geographic coordinate system:
homicide_sf <- cxy_geocode(stl_homicides, id, street_address, city, state, postal_code, output = "sf")
Note, however, that output = "sf" returns only matched addresses, including those approximated by street length. If there are unmatched addresses, they will be dropped from the output. Use output = "tibble" to return all addresses, including those that are unmatched.
Output returned as an sf object can be previewed with a package like mapview.
The function has a timeout argument, which specifies how many minutes may elapse before an API query ends with an error. The timeout applies to each batch of 1,000 addresses, not to the data set as a whole. It defaults to 30 minutes, which should be appropriate for most internet speeds.
If a batch times out, the next 1000 addresses will be attempted.
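The idea of moving on to the next batch after a failure can be sketched in base R with tryCatch(). This is an illustration under stated assumptions, not the package's implementation: run_batches() and flaky_geocoder() are hypothetical names, the failure is simulated with stop() rather than a real timeout, and dropping the failed batch from the result is an assumption of this sketch rather than documented behavior.

```r
# Hypothetical sketch: attempt every batch, continuing past any that fail
run_batches <- function(batches, geocode_batch) {
  results <- lapply(seq_along(batches), function(i) {
    tryCatch(
      geocode_batch(batches[[i]]),
      error = function(e) {
        message("Batch ", i, " failed: ", conditionMessage(e))
        NULL  # assumption: a failed batch contributes no rows
      }
    )
  })
  # Drop failed batches (NULL) and combine the rest into one data frame
  do.call(rbind, Filter(Negate(is.null), results))
}

# Stand-in geocoder that simulates a timeout for one particular batch
flaky_geocoder <- function(batch) {
  if ("timeout" %in% batch) stop("request timed out")
  data.frame(address = batch, matched = TRUE, stringsAsFactors = FALSE)
}

batches <- list(c("a", "b"), c("timeout", "c"), c("d"))
out <- run_batches(batches, flaky_geocoder)
```

Handling errors per batch means one slow or failed request does not discard the batches that completed.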
Be aware that a long-running batch may give your computer time to go to sleep, which may cause the batch to never return. macOS users may find the Caffeine app useful.
If you are new to R itself, welcome! Hadley Wickham’s R for Data Science is an excellent way to get started with data manipulation in the tidyverse, which censusxy is designed to integrate seamlessly with.
If you are new to spatial analysis in R, we strongly encourage you to check out the excellent new Geocomputation in R by Robin Lovelace, Jakub Nowosad, and Jannes Muenchow.
If you have questions about using censusxy, you are encouraged to use the RStudio Community forums. Please create a reprex before posting. Feel free to tag Chris (@chris.prener) in any posts about censusxy.
If you think you have found a bug, please create a reprex and then open an issue on GitHub.