rnassqs is a package to access the QuickStats API from the National Agricultural Statistics Service (NASS) at the USDA. There are at least two good reasons to do this:
Reproducibility. Downloading the data via an R script creates a trail that you can revisit later to see exactly what you downloaded. It also makes it much easier for people seeking to replicate your results to ensure they have the same data that you do.
DRY. Don’t repeat yourself. Downloading data via API makes it easier to download new data as it is released, and to fetch multiple variables, geographies, or time frames without having to manually click through the QuickStats tool for each data request.
At first, using the API can be more confusing and potentially take more time than the QuickStats tool, but as you become familiar with the variables and calls of the rnassqs package and the QuickStats database, you’ll be able to quickly and easily download new data.
The QuickStats API requires authentication. You can get an API key here. Once you have a key, you can use it in any of the following ways:
You can add a file to your project directory called .secret that contains any necessary API keys, and ignore it via .gitignore if you’re using GitHub. The advantage of this method is that you don’t have to think about the API key for the rest of the project. Once the API key is in a file, you can use it like this:
# Load the api key
api_key <- readLines(".secret")
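A key read with readLines() can pick up blank lines or stray whitespace, which would make it invalid. The following is a minimal sketch of a more defensive loader, assuming the key sits on the first non-empty line of the file. read_api_key is a hypothetical helper, not part of rnassqs, and the key in the demonstration is a made-up placeholder.

```r
# Hypothetical helper (not part of rnassqs): read an API key from a file,
# ignoring blank lines and surrounding whitespace.
read_api_key <- function(path = ".secret") {
  if (!file.exists(path)) {
    stop("API key file not found: ", path)
  }
  lines <- trimws(readLines(path, warn = FALSE))
  lines <- lines[nzchar(lines)]
  if (length(lines) == 0) {
    stop("API key file is empty: ", path)
  }
  lines[[1]]  # use the first non-empty line as the key
}

# Demonstration with a temporary file and a placeholder key
demo_file <- tempfile()
writeLines(c("", "  NOT-A-REAL-KEY  "), demo_file)
demo_key <- read_api_key(demo_file)
```

In a project, read_api_key() with no arguments would read .secret from the working directory.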
In your home directory, create or edit the .Renviron file and add NASSQS_TOKEN=<your api key> to the file. R sessions will have the variable set automatically, and rnassqs will detect this when querying data.
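To confirm that the variable was picked up, you can inspect the environment from within R. This sketch uses only base R; has_nass_token is a hypothetical helper name, not an rnassqs function.

```r
# Hypothetical helper: TRUE when NASSQS_TOKEN is set to a non-empty value.
# Sys.getenv() returns "" for variables that are not set.
has_nass_token <- function() {
  nzchar(Sys.getenv("NASSQS_TOKEN"))
}

if (!has_nass_token()) {
  message("NASSQS_TOKEN is not set; add it to .Renviron and restart R.")
}
```

Changes to .Renviron only take effect in R sessions started after the edit.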
If you don’t want to add the API key to a file, you can enter it in the console during a session as follows:
library(rnassqs)
# Checks if the api key is set and prints it.
# If it is not set, asks the user to set the value in the console.
nassqs_token()
The QuickStats API offers a bewildering array of fields on which to query. rnassqs tries to help navigate query building with some functions that return field names and valid values for those fields. rnassqs::nassqs_fields() provides the field names, which at the time of this writing are:
# returns a list of fields that you can query
rnassqs::nassqs_fields()
#> [1] "agg_level_desc" "asd_code"
#> [3] "asd_desc" "begin_code"
#> [5] "class_desc" "commodity_desc"
#> [7] "congr_district_code" "country_code"
#> [9] "country_name" "county_ansi"
#> [11] "county_code" "county_name"
#> [13] "CV" "domaincat_desc"
#> [15] "domain_desc" "end_code"
#> [17] "freq_desc" "group_desc"
#> [19] "load_time" "location_desc"
#> [21] "prodn_practice_desc" "reference_period_desc"
#> [23] "region_desc" "sector_desc"
#> [25] "short_desc" "state_alpha"
#> [27] "state_ansi" "state_name"
#> [29] "state_fips_code" "statisticcat_desc"
#> [31] "source_desc" "unit_desc"
#> [33] "util_practice_desc" "Value"
#> [35] "watershed_code" "watershed_desc"
#> [37] "week_ending" "year"
#> [39] "zip_5"
A list of the valid values for a given field is available via rnassqs::nassqs_field_values(field = <field name>). For example, rnassqs::nassqs_field_values(field = 'unit_desc') returns the valid units in the unit_desc field. There are 327 valid values at the time of this writing, including “STEMS”, “TON / TON”, and “GALLONS / TANK”.
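With several hundred values, it can help to search the result rather than read it in full. The sketch below uses a three-element stand-in for the returned values (only the ones quoted above) and assumes the result can be flattened to a character vector with unlist().

```r
# Stand-in for the result of nassqs_field_values(field = 'unit_desc');
# only the three values quoted in the text are included here.
unit_values <- list("STEMS", "TON / TON", "GALLONS / TANK")

# Keep only the values that contain the literal text "TON"
ton_units <- grep("TON", unlist(unit_values), value = TRUE, fixed = TRUE)
```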
Building a query usually requires some trial and error. One way of developing the query is to use the QuickStats web interface. This is often the fastest method and provides quick feedback on the subset of values for a given query. Alternatively, you can query values for each field as above and iteratively build your query. In the end, the query takes the form of a list of parameters that looks like:
params <- list("commodity_desc"="CORN", "year__GE"=2012, "state_alpha"="VA")
It’s worth spending some time on the selection of values. Most queries will probably be for specific values, but you may also want to query ranges or similar values. For those queries, append one of the following to the field you’d like to modify:
__LE: less than or equal to (<=)
__LT: less than (<)
__GT: greater than (>)
__GE: greater than or equal to (>=)
__LIKE: like
__NOT_LIKE: not like
__NE: not equal
In the above parameter list, year__GE is the year field with the __GE modifier attached to it. The returned data includes all records with year greater than or equal to 2012.
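Modifiers can be combined on the same field to bound a range. As a sketch, the list below adds an upper bound to the earlier query using __LE, the QuickStats "less than or equal to" modifier; the end year 2016 is an arbitrary choice for illustration.

```r
# Start from the parameter list used above
params <- list("commodity_desc" = "CORN", "year__GE" = 2012, "state_alpha" = "VA")

# Add an upper bound on year so the query covers 2012 through 2016 inclusive
params_range <- c(params, list("year__LE" = 2016))
```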
The query above selects all the data available on corn since 2012 in the state of Virginia. The API only returns results for queries that match 50,000 or fewer records, so it’s a good idea to check the record count before running a query, perhaps as an assert:
# Check that the number of returned records will be at most 50000
records <- rnassqs::nassqs_record_count(params)
assertthat::assert_that(as.integer(records$count) <= 50000)
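If a query exceeds the limit, one option is to break it into smaller queries, for example one per year. The sketch below only builds the parameter lists; split_by_year is a hypothetical helper, not part of rnassqs, and it assumes that replacing any year filters with a single year value is acceptable for your query.

```r
# Hypothetical helper (not part of rnassqs): turn one query into a list of
# single-year queries so each can be counted and fetched separately.
split_by_year <- function(params, years) {
  # Drop any existing year filters (year, year__GE, year__LE, ...)
  base <- params[!grepl("^year", names(params))]
  lapply(years, function(y) c(base, list(year = y)))
}

params <- list("commodity_desc" = "CORN", "year__GE" = 2012, "state_alpha" = "VA")
yearly_params <- split_by_year(params, 2012:2014)
```

Each element of yearly_params is an ordinary parameter list that can be checked with nassqs_record_count() before it is run.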
Once you’ve built a query, running it is easy:
# Run a query given a set of parameters and an API key
rnassqs::nassqs(params = params, key = api_key)
nassqs is a wrapper around the nassqs_GET and nassqs_parse functions, which you can use independently if you want to see the raw data before parsing:
# Get the data but parse into a data.frame separately
raw <- rnassqs::nassqs_GET(params = params, key = api_key)
parsed <- rnassqs::nassqs_parse(raw, as = 'data.frame')
Putting all of the above together, we have a script that looks like:
library(rnassqs)
library(assertthat) # for checking the size of the query
# Load the api key
api_key <- readLines(".secret")
# Get a list of available fields
fields <- nassqs_fields()
# Get valid values for 'commodity_desc'
nassqs_field_values(field = 'commodity_desc')
# Set a list of parameters to query on
params <- list("commodity_desc"="CORN", "year__GE"=2012, "state_alpha"="VA")
# Check that the number of returned records will be at most 50000
records <- nassqs_record_count(params)
assert_that(as.integer(records$count) <= 50000)
# Run a query given a set of parameters and an API key
nassqs(params = params, key = api_key)
# Run the same query but parse into a data.frame separately
raw <- nassqs_GET(params = params, key = api_key)
parsed <- nassqs_parse(raw, as = 'data.frame')