GSODR

Adam H Sparks

2017-10-28

Introduction

The Global Surface Summary of the Day (GSOD) data provided by the US National Centers for Environmental Information (NCEI) are a valuable source of weather data with global coverage. However, the data files are cumbersome and difficult to work with. GSODR aims to make it easy to find, transfer and format the data you need for use in analysis. Its main functions, each described below, are get_GSOD(), nearest_stations(), reformat_GSOD(), update_station_list() and get_inventory().

When reformatting data with either get_GSOD() or reformat_GSOD(), all units are converted to the International System of Units (SI), e.g., inches to millimetres and Fahrenheit to Celsius. The output summarises each year by station and includes vapour pressure and relative humidity elements calculated from existing data in GSOD. It can be used in an R session as a tibble, saved as a Comma Separated Value (CSV) file or saved as a spatial GeoPackage (GPKG) file, a format implemented by most major GIS software.
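The conversions mentioned above can be sketched in a few lines of base R. The helper names below are hypothetical and are not GSODR functions. The vapour pressure formula is a Tetens-style approximation chosen for illustration; GSODR's own calculation may differ.

```r
# Illustrative helpers only; these are NOT part of GSODR.
fahrenheit_to_celsius <- function(f) (f - 32) * 5 / 9
inches_to_mm <- function(inches) inches * 25.4

# Saturation vapour pressure (kPa) at temperature `t` in degrees Celsius,
# using a Tetens-style approximation (an assumption for this sketch).
sat_vapour_pressure <- function(t) 0.61078 * exp((17.2694 * t) / (t + 237.3))

# Relative humidity (%) from air temperature and dew point, both in Celsius.
relative_humidity <- function(temp, dewp) {
  ea <- sat_vapour_pressure(dewp)  # actual vapour pressure
  es <- sat_vapour_pressure(temp)  # saturation vapour pressure
  ea / es * 100
}

fahrenheit_to_celsius(c(32, 212))
inches_to_mm(1)
relative_humidity(temp = 25, dewp = 15)
```

When the dew point equals the air temperature, the two vapour pressures are identical and the relative humidity is 100%.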

For more information see the description of the data provided by NCEI, http://www7.ncdc.noaa.gov/CDO/GSOD_DESC.txt.

Using get_GSOD()

Find Stations in Australia

GSODR provides lists of weather station locations and elevation values. Using dplyr, we can find all the stations in Australia.

library(GSODR)
## 
## GSOD is distributed free by the US NCEI with the
## following conditions.
## 'The following data and products may have conditions placed
## their international commercial use. They can be used within
## the U.S. or for non-commercial international activities
## without restriction. The non-U.S. data cannot be
## redistributed for commercial purposes. Re-distribution of
## these data by others must provide this same notification.
## WMO Resolution 40. NOAA Policy'
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
load(system.file("extdata", "country_list.rda", package = "GSODR"))
load(system.file("extdata", "isd_history.rda", package = "GSODR"))

station_locations <- left_join(isd_history, country_list,
                               by = c("CTRY" = "FIPS"))

# create data.frame for Australia only
Oz <- filter(station_locations, COUNTRY_NAME == "AUSTRALIA")

Oz
## # A tibble: 1,412 x 16
##      USAF  WBAN                      STN_NAME  CTRY STATE  CALL     LAT
##     <chr> <chr>                         <chr> <chr> <chr> <chr>   <dbl>
##  1 695023 99999           HORN ISLAND   (HID)    AS  <NA>  KQXC -10.583
##  2 749430 99999            AIDELAIDE RIVER SE    AS  <NA>  <NA> -13.300
##  3 749432 99999     BATCHELOR FIELD AUSTRALIA    AS  <NA>  <NA> -13.049
##  4 749438 99999          IRON RANGE AUSTRALIA    AS  <NA>  <NA> -12.700
##  5 749439 99999      MAREEBA AS/HOEVETT FIELD    AS  <NA>  <NA> -17.050
##  6 749440 99999                     REID EAST    AS  <NA>  <NA> -19.767
##  7 749441 99999  TOWNSVILLE AUSTRALIA/GARBUTT    AS  <NA>  ABTL -19.249
##  8 749442 99999                     WOODSTOCK    AS  <NA>  <NA> -19.600
##  9 749443 99999 JACKY JACKY AUSTRALIA/HIGGINS    AS  <NA>  <NA> -10.933
## 10 749455 99999            LAKE BUCHANAN WEST    AS  <NA>  <NA> -21.417
## # ... with 1,402 more rows, and 9 more variables: LON <dbl>, ELEV_M <dbl>,
## #   BEGIN <dbl>, END <dbl>, STNID <chr>, ELEV_M_SRTM_90m <dbl>,
## #   COUNTRY_NAME <chr>, iso2c <chr>, iso3c <chr>
Oz %>%
  filter(grepl("TOOWOOMBA", STN_NAME))
## # A tibble: 2 x 16
##     USAF  WBAN          STN_NAME  CTRY STATE  CALL     LAT     LON ELEV_M
##    <chr> <chr>             <chr> <chr> <chr> <chr>   <dbl>   <dbl>  <dbl>
## 1 945510 99999         TOOWOOMBA    AS  <NA>  <NA> -27.583 151.933    676
## 2 955510 99999 TOOWOOMBA AIRPORT    AS  <NA>  <NA> -27.550 151.917    642
## # ... with 7 more variables: BEGIN <dbl>, END <dbl>, STNID <chr>,
## #   ELEV_M_SRTM_90m <dbl>, COUNTRY_NAME <chr>, iso2c <chr>, iso3c <chr>
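As a rough illustration of what the join and filter above do, the following base R sketch rebuilds the same steps on a toy table. The toy data frames reproduce only a few columns and rows from the output above. Building STNID by pasting USAF and WBAN together is inferred from that output, not taken from GSODR's source.

```r
# Toy stand-ins for the packaged tables; only a few columns are reproduced.
isd_toy <- data.frame(
  USAF = c("945510", "955510", "695023"),
  WBAN = c("99999", "99999", "99999"),
  STN_NAME = c("TOOWOOMBA", "TOOWOOMBA AIRPORT", "HORN ISLAND   (HID)"),
  CTRY = c("AS", "AS", "AS"),
  stringsAsFactors = FALSE
)
country_toy <- data.frame(
  FIPS = "AS",
  COUNTRY_NAME = "AUSTRALIA",
  stringsAsFactors = FALSE
)

# Base R equivalent of left_join(..., by = c("CTRY" = "FIPS")).
joined <- merge(isd_toy, country_toy, by.x = "CTRY", by.y = "FIPS",
                all.x = TRUE)

# STNID as it appears in the output above: USAF and WBAN joined by a hyphen
# (an inference from the printed tibble).
joined$STNID <- paste(joined$USAF, joined$WBAN, sep = "-")

joined[grepl("TOOWOOMBA", joined$STN_NAME), c("STN_NAME", "STNID")]
```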

Download a Single Station and Year

Now that we’ve seen where the reporting stations are located, we can download weather data from the Toowoomba Airport station in Queensland, Australia, for 2010 by passing its STNID to the station argument of get_GSOD().

tbar <- get_GSOD(years = 2010, station = "955510-99999")
## 
## Checking requested station file for availability on server
## 
## Downloading individual station files.
tbar
## # A tibble: 365 x 48
##      USAF  WBAN        STNID          STN_NAME  CTRY STATE  CALL    LAT
##     <chr> <chr>        <chr>             <chr> <chr> <chr> <chr>  <dbl>
##  1 955510 99999 955510-99999 TOOWOOMBA AIRPORT    AS  <NA>  <NA> -27.55
##  2 955510 99999 955510-99999 TOOWOOMBA AIRPORT    AS  <NA>  <NA> -27.55
##  3 955510 99999 955510-99999 TOOWOOMBA AIRPORT    AS  <NA>  <NA> -27.55
##  4 955510 99999 955510-99999 TOOWOOMBA AIRPORT    AS  <NA>  <NA> -27.55
##  5 955510 99999 955510-99999 TOOWOOMBA AIRPORT    AS  <NA>  <NA> -27.55
##  6 955510 99999 955510-99999 TOOWOOMBA AIRPORT    AS  <NA>  <NA> -27.55
##  7 955510 99999 955510-99999 TOOWOOMBA AIRPORT    AS  <NA>  <NA> -27.55
##  8 955510 99999 955510-99999 TOOWOOMBA AIRPORT    AS  <NA>  <NA> -27.55
##  9 955510 99999 955510-99999 TOOWOOMBA AIRPORT    AS  <NA>  <NA> -27.55
## 10 955510 99999 955510-99999 TOOWOOMBA AIRPORT    AS  <NA>  <NA> -27.55
## # ... with 355 more rows, and 40 more variables: LON <dbl>, ELEV_M <dbl>,
## #   ELEV_M_SRTM_90m <dbl>, BEGIN <dbl>, END <dbl>, YEARMODA <date>,
## #   YEAR <chr>, MONTH <chr>, DAY <chr>, YDAY <dbl>, TEMP <dbl>,
## #   TEMP_CNT <int>, DEWP <dbl>, DEWP_CNT <int>, SLP <dbl>, SLP_CNT <int>,
## #   STP <dbl>, STP_CNT <int>, VISIB <dbl>, VISIB_CNT <int>, WDSP <dbl>,
## #   WDSP_CNT <int>, MXSPD <dbl>, GUST <dbl>, MAX <dbl>, MAX_FLAG <chr>,
## #   MIN <dbl>, MIN_FLAG <chr>, PRCP <dbl>, PRCP_FLAG <chr>, SNDP <dbl>,
## #   I_FOG <int>, I_RAIN_DRIZZLE <int>, I_SNOW_ICE <int>, I_HAIL <int>,
## #   I_THUNDER <int>, I_TORNADO_FUNNEL <int>, EA <dbl>, ES <dbl>, RH <dbl>

Using nearest_stations()

Using the nearest_stations() function, you can find stations closest to a given point specified by latitude and longitude in decimal degrees. This can be used to generate a vector to pass along to get_GSOD() and download the stations of interest.

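Some stations returned by this query are missing from the download. Not every station returned has data for the requested year, and not all stations that are listed actually have files on the server. get_GSOD() reports the stations for which it cannot provide data.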

tbar_stations <- nearest_stations(LAT = -27.5598,
                                  LON = 151.9507,
                                  distance = 50)

tbar <- get_GSOD(years = 2010, station = tbar_stations)
## 
## This station, 949999-00170, only provides data for years 1971 to 1984.
## 
## This station, 949999-00183, only provides data for years 1983 to 1984.
## 
## Checking requested station file for availability on server
## 
## Downloading individual station files.

To drop the stations 949999-00170 and 949999-00183 from the query, remove them from the vector of station IDs before calling get_GSOD() again.

# station IDs that have no data for 2010
no_data <- c("949999-00170", "949999-00183")

tbar_stations <- tbar_stations[!tbar_stations %in% no_data]

tbar <- get_GSOD(years = 2010,
                 station = tbar_stations,
                 dsn = "~/")
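To give a feel for what a nearest-station search involves, here is a minimal base R sketch that computes great-circle distances with the haversine formula and keeps the stations within 50 km. It assumes the distance argument is in kilometres. The first two coordinates come from the station table shown earlier; the third row is a hypothetical far-away point. GSODR's own distance calculation may differ.

```r
# Great-circle distance in km between points given in decimal degrees.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

stations <- data.frame(
  STNID = c("945510-99999", "955510-99999", "HYPOTHETICAL-FAR"),
  LAT = c(-27.583, -27.550, -10.583),
  LON = c(151.933, 151.917, 142.300),  # last longitude is made up
  stringsAsFactors = FALSE
)

# Distance from the query point used in the example above.
stations$dist_km <- haversine_km(-27.5598, 151.9507,
                                 stations$LAT, stations$LON)

# Keep the IDs of stations within 50 km.
nearby <- stations$STNID[stations$dist_km <= 50]
nearby
```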

Plot Maximum and Minimum Temperature Values

Using the data downloaded for the single station 955510-99999, plot the temperature values for 2010 using ggplot2, with tidyr and lubridate to prepare the data.

library(ggplot2)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(tidyr)

# Create a data frame of just the date and temperature values that we want to
# plot, keeping only station 955510-99999 because tbar now holds several
# stations
tbar_temps <- tbar[tbar$STNID == "955510-99999",
                   c("YEARMODA", "TEMP", "MAX", "MIN")]

# Gather the data from wide to long
tbar_temps <- gather(tbar_temps, key = "Measurement", value = "value",
                     TEMP:MIN)

ggplot(data = tbar_temps, aes(x = ymd(YEARMODA), y = value,
                              colour = Measurement)) +
  geom_line() +
  scale_color_brewer(type = "qual", na.value = "black") +
  scale_y_continuous(name = "Temperature") +
  scale_x_date(name = "Date") +
  theme_bw()
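For readers unfamiliar with the wide-to-long step used before plotting, the following base R sketch performs the same reshaping by hand on a toy table. The temperature values are made up for illustration; only the column names match the example above.

```r
# Toy wide table with the same columns as tbar_temps; values are made up.
wide <- data.frame(
  YEARMODA = as.Date(c("2010-01-01", "2010-01-02")),
  TEMP = c(21.5, 22.0),
  MAX = c(27.1, 28.3),
  MIN = c(16.4, 17.2)
)

measure_cols <- c("TEMP", "MAX", "MIN")

# One block of rows per measurement column, stacked on top of each other.
long <- data.frame(
  YEARMODA = rep(wide$YEARMODA, times = length(measure_cols)),
  Measurement = rep(measure_cols, each = nrow(wide)),
  value = unlist(wide[measure_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

long
```

Each date now appears once per measurement, which is the shape ggplot2 needs to draw one coloured line per measurement.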

Creating Spatial Files

Because the stations provide geospatial location information, it is possible to create a spatial file. GeoPackage is an open, standards-based, platform-independent, portable, self-describing and compact format for transferring geospatial information. GeoPackage files handle vector data much like shapefiles do, but eliminate many of the issues that shapefiles have with field names and the number of files. The get_GSOD() function can create a GeoPackage file, which can be used with a GIS for further analysis and mapping with other spatial objects.

After getting weather stations for Australia and creating a GeoPackage file, rgdal can import the data into R again in a spatial format.

get_GSOD(years = 2015, country = "Australia", dsn = "~/", filename = "AUS",
         CSV = FALSE, GPKG = TRUE)
#> trying URL 'ftp://ftp.ncdc.noaa.gov/pub/data/gsod/2015/gsod_2015.tar'
#> Content type 'unknown' length 106352640 bytes (101.4 MB)
#> ==================================================
#> downloaded 101.4 MB
#> Finished downloading file.
#> Starting data file processing.
#> Writing GeoPackage file to disk.

Importing the GeoPackage file can be a bit tricky. The dsn must be the full path to the file, including the file name. The layer to specify is always “GSOD”; get_GSOD() sets this name and it does not change, whereas the file name given in the dsn depends on the filename you chose.

library(rgdal)
#> Loading required package: sp
#> rgdal: version: 1.1-10, (SVN revision 622)
#>  Geospatial Data Abstraction Library extensions to R successfully loaded
#>  Loaded GDAL runtime: GDAL 1.11.5, released 2016/07/01
#>  Path to GDAL shared files: /usr/local/Cellar/gdal/1.11.5_1/share/gdal
#>  Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
#>  Path to PROJ.4 shared files: (autodetected)
#>  Linking to sp version: 1.2-3

AUS_stations <- readOGR(dsn = path.expand("~/AUS.gpkg"), layer = "GSOD")
#> OGR data source with driver: GPKG
#> Source: "/Users/asparks/AUS-2015.gpkg", layer: "GSOD"
#> with 186977 features
#> It has 46 fields

class(AUS_stations)
#> [1] "SpatialPointsDataFrame"
#> attr(,"package")
#> [1] "sp"

Since GeoPackage files are formatted as SQLite databases, you can use the existing R tools for SQLite files (Stachelek 2016). One easy way is to use dplyr, which we’ve already used to filter the stations.

This option is much faster to load since it does not load the geometry.

AUS_sqlite <- tbl(src_sqlite(path.expand("~/AUS.gpkg")), "GSOD")
class(AUS_sqlite)
#> [1] "tbl_dbi"  "tbl_sql"  "tbl_lazy" "tbl"

print(AUS_sqlite, n = 5)
#> Source:   table<GSOD> [?? x 48]
#> Database: sqlite 3.19.3 [/Users/U8004755/AUS.gpkg]
#>    fid         geom   USAF  WBAN        STNID  STN_NAME  CTRY STATE  CALL ELEV_M ELEV_M_SRTM_90m    BEGIN      END YEARMODA
#>  <int>       <blob>  <chr> <chr>        <chr>     <chr> <chr> <chr> <chr>  <dbl>           <dbl>    <dbl>    <dbl>    <chr>
#> 1     1 <blob[29 B]> 941000 99999 941000-99999 KALUMBURU    AS  <NA>  <NA>     24              17 20010912 20170916 20150101
#> 2     2 <blob[29 B]> 941000 99999 941000-99999 KALUMBURU    AS  <NA>  <NA>     24              17 20010912 20170916 20150102
#> 3     3 <blob[29 B]> 941000 99999 941000-99999 KALUMBURU    AS  <NA>  <NA>     24              17 20010912 20170916 20150103
#> 4     4 <blob[29 B]> 941000 99999 941000-99999 KALUMBURU    AS  <NA>  <NA>     24              17 20010912 20170916 20150104
#> 5     5 <blob[29 B]> 941000 99999 941000-99999 KALUMBURU    AS  <NA>  <NA>     24              17 20010912 20170916 20150105
#> ... with more rows, and 34 more variables: YEAR <chr>, MONTH <chr>, DAY <chr>, YDAY <dbl>, TEMP <dbl>, TEMP_CNT <int>,
#>   DEWP <dbl>, DEWP_CNT <int>, SLP <dbl>, SLP_CNT <int>, STP <dbl>, STP_CNT <int>, VISIB <dbl>, VISIB_CNT <int>, WDSP <dbl>,
#>   WDSP_CNT <int>, MXSPD <dbl>, GUST <dbl>, MAX <dbl>, MAX_FLAG <chr>, MIN <dbl>, MIN_FLAG <chr>, PRCP <dbl>, PRCP_FLAG <chr>,
#>   SNDP <dbl>, I_FOG <int>, I_RAIN_DRIZZLE <int>, I_SNOW_ICE <int>, I_HAIL <int>, I_THUNDER <int>, I_TORNADO_FUNNEL <int>,
#>   EA <dbl>, ES <dbl>, RH <dbl>
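A quick way to see that a file really is an SQLite database is to inspect its first 16 bytes, which hold the SQLite header string. The helper below is a hypothetical base R sketch. It is demonstrated on a temporary file that contains only the header bytes, not on a real GeoPackage.

```r
# TRUE if the file starts with the 16-byte SQLite header:
# the text "SQLite format 3" followed by a NUL byte.
is_sqlite_file <- function(path) {
  header <- readBin(path, what = "raw", n = 16)
  expected <- c(charToRaw("SQLite format 3"), as.raw(0))
  identical(header, expected)
}

# Demonstrate on a temporary file that carries only the header bytes.
tmp_gpkg <- tempfile(fileext = ".gpkg")
writeBin(c(charToRaw("SQLite format 3"), as.raw(0)), tmp_gpkg)
is_sqlite_file(tmp_gpkg)

# A plain text file does not pass the check.
tmp_txt <- tempfile(fileext = ".txt")
writeLines("not a database", tmp_txt)
is_sqlite_file(tmp_txt)
```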

Using reformat_GSOD()

You may have already downloaded GSOD data, or you may prefer to use an FTP client to download the files from the server to your local disk rather than use get_GSOD(). In that case the reformat_GSOD() function is useful.

There are two ways to use it: you can either provide reformat_GSOD() with a list of specified station files, or you can supply it with a directory containing all of the “USAF-WBAN-YYYY.op.gz” station files that you wish to reformat.

Reformat a list of local files

y <- c("~/GSOD/gsod_1960/200490-99999-1960.op.gz",
       "~/GSOD/gsod_1961/200490-99999-1961.op.gz")
x <- reformat_GSOD(file_list = y)

Reformat all local files found in directory

x <- reformat_GSOD(dsn = "~/GSOD/gsod_1960")
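Before reformatting, it can help to check which stations and years a set of local files covers. The base R sketch below parses the file names from the example above. The regular expression is an assumption about the naming pattern, based only on those two file names. In practice the vector could come from list.files() with full.names = TRUE.

```r
# File names taken from the example above.
files <- c("~/GSOD/gsod_1960/200490-99999-1960.op.gz",
           "~/GSOD/gsod_1961/200490-99999-1961.op.gz")

# Assumed pattern: six-character station number, five-digit WBAN, year.
pattern <- "^([0-9A-Z]{6}-[0-9]{5})-([0-9]{4})\\.op\\.gz$"

# Keep only names that look like station files, then split them into parts.
station_files <- files[grepl(pattern, basename(files))]

file_index <- data.frame(
  file = station_files,
  STNID = sub(pattern, "\\1", basename(station_files)),
  YEAR = as.integer(sub(pattern, "\\2", basename(station_files))),
  stringsAsFactors = FALSE
)
file_index
```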

Using update_station_list()

GSODR uses an internal database of station data from the NCEI to provide location and other metadata, e.g., elevation, station names and WMO codes, which makes querying for weather data faster. This database is created and packaged with GSODR for distribution and is updated with new releases. Users have the option of updating it after installing GSODR. While this option lets users keep the database up to date and gives GSODR’s authors flexibility in maintaining it, it also means that reproducibility may be affected, since the same version of GSODR may have different databases on different machines. If reproducibility is necessary, take care to ensure that the version of the database is the same across machines.

The database file isd_history.rda is stored in the extdata directory of the installed package. You can locate this directory with system.file("extdata", package = "GSODR"), or with paste0(.libPaths(), "/GSODR/extdata")[1] if GSODR is installed in the first library listed by .libPaths(). If you installed GSODR in another library location, the file is still in GSODR/extdata under that location.

To update GSODR’s internal database of station locations simply use update_station_list(), which will update the internal station database according to the latest data available from the NCEI.
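One way to check whether two machines hold the same station database is to compare file checksums. The sketch below is only a suggestion and is not a GSODR feature: db_fingerprint() is a hypothetical helper built on tools::md5sum() from base R's tools package.

```r
# Locate the packaged station database; system.file() returns "" when the
# package or file is not found.
db_path <- system.file("extdata", "isd_history.rda", package = "GSODR")

# MD5 checksum of a file, or NA when the file does not exist.
db_fingerprint <- function(path) {
  if (!nzchar(path) || !file.exists(path)) return(NA_character_)
  unname(tools::md5sum(path))
}

# Record this value alongside your analysis and compare it across machines.
db_fingerprint(db_path)
```

If the fingerprints differ between machines, the station metadata differ too, even if the installed GSODR version is the same.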

Using get_inventory()

GSODR provides a function, get_inventory(), to retrieve an inventory of the number of weather observations by station-year-month from the beginning of record through to the current month.

The following example shows how to retrieve the inventory and check the station in Toowoomba, Queensland, Australia, that was used in an earlier example.

inventory <- get_inventory()
## THIS INVENTORY SHOWS THE NUMBER OF WEATHER OBSERVATIONS BY STATION-YEAR-MONTH FOR BEGINNING OF RECORD THROUGH OCTOBER 2017.  THE DATABASE CONTINUES TO BE UPDATED AND ENHANCED, AND THIS INVENTORY WILL BE  UPDATED ON A REGULAR BASIS.
inventory
## # A tibble: 616,652 x 14
##           STNID  YEAR   JAN   FEB   MAR   APR   MAY   JUN   JUL   AUG
##           <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
##  1 007005-99999  2012    18     0     0     0     0     0     0     0
##  2 007011-99999  2012   771     0   183     0     0     0   142    13
##  3 007018-99999  2013     0     0     0     0     0     0   710     0
##  4 007025-99999  2012    21     0     0     0     0     0     0     0
##  5 007026-99999  2012     0     0     0     0     0     0   367     0
##  6 007026-99999  2014     0     0     0     0     0     0   180     0
##  7 007026-99999  2016     0     0     0     0     0   794     0     0
##  8 007026-99999  2017     0   914  2626   380   277   406  1230  1009
##  9 007034-99999  2012     0     0     0     0     0     0     0     0
## 10 007037-99999  2012     0     0     0     0     0     0   830    35
## # ... with 616,642 more rows, and 4 more variables: SEP <int>, OCT <int>,
## #   NOV <int>, DEC <int>
subset(inventory, STNID == "955510-99999")
## # A tibble: 20 x 14
##           STNID  YEAR   JAN   FEB   MAR   APR   MAY   JUN   JUL   AUG
##           <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
##  1 955510-99999  1998     0     0   222   223   221   211   226   217
##  2 955510-99999  1999   213   201   235   224   244   229   239   247
##  3 955510-99999  2000   241   227   247   238   246   237   245   240
##  4 955510-99999  2001   245   223   246   238   239   236   243   240
##  5 955510-99999  2002   245   219   246   236   243   229   243   246
##  6 955510-99999  2003   244   217   220   232   235   233   246   242
##  7 955510-99999  2004   240   227   241   229   233   224   235   244
##  8 955510-99999  2005   243   221   243   241   247   242   248   247
##  9 955510-99999  2006   245   223   246   232   241   238   247   247
## 10 955510-99999  2007   247   222   244   240   248   240   244   244
## 11 955510-99999  2008   247   228   248   239   248   239   248   247
## 12 955510-99999  2009   245   222   246   235   244   237   248   248
## 13 955510-99999  2010   248   223   248   240   244   240   242   247
## 14 955510-99999  2011   247   224   247   240   247   240   248   247
## 15 955510-99999  2012   248   232   248   240   248   240   248   247
## 16 955510-99999  2013   236   220   247   233   248   239   252   246
## 17 955510-99999  2014   243   224   247   240   246   239   241   243
## 18 955510-99999  2015   248   222   248   239   247   240   247   246
## 19 955510-99999  2016   246   228   245   240   246   240   248   248
## 20 955510-99999  2017   240   224   248   240   248   237   248   247
## # ... with 4 more variables: SEP <int>, OCT <int>, NOV <int>, DEC <int>
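The inventory can also be used to spot gaps in a station's record. The base R sketch below works on three rows copied from the output above. The months SEP to DEC are left out because the printed output does not show them, so this is a partial check only.

```r
# Three rows copied from the output above; SEP to DEC are omitted because
# the printed output does not show them.
inv <- data.frame(
  STNID = "955510-99999",
  YEAR = c(1998L, 1999L, 2010L),
  JAN = c(0L, 213L, 248L), FEB = c(0L, 201L, 223L),
  MAR = c(222L, 235L, 248L), APR = c(223L, 224L, 240L),
  MAY = c(221L, 244L, 244L), JUN = c(211L, 229L, 240L),
  JUL = c(226L, 239L, 242L), AUG = c(217L, 247L, 247L),
  stringsAsFactors = FALSE
)

month_cols <- c("JAN", "FEB", "MAR", "APR", "MAY", "JUN", "JUL", "AUG")

# Flag years in which at least one of the listed months has no observations.
inv$has_gap <- rowSums(inv[month_cols] == 0) > 0
inv[inv$has_gap, c("STNID", "YEAR")]
```

For this station only 1998 is flagged, because the record starts in March of that year.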

Additional Climate Data Availability

Additional climate data, formatted for use with the GSOD data provided by GSODR, are available in the GSODRdata R package. It is installable from GitHub because the package size, 5.1 MB, is too large for CRAN.

#install.packages("devtools")
devtools::install_github("adamhsparks/GSODRdata")
library("GSODRdata")

Notes

Elevation Values

Hole-filled SRTM digital elevation data at 90 metre (90 m) resolution (Jarvis et al. 2008) were used to identify and correct or remove elevation errors in data for station locations between -60˚ and 60˚ latitude. This also applies to stations for which elevation was missing in the reported values. Where a station reported an elevation and the DEM has no value, the station-reported elevation is used. For stations beyond -60˚ and 60˚ latitude, the values are station-reported values in every instance. See https://github.com/ropensci/GSODR/blob/master/data-raw/fetch_isd-history.md for more detail on the correction methods.
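The fallback rule described above can be read, in much simplified form, as the following base R sketch. choose_elevation() is a hypothetical helper and the input values are made up. It only illustrates which source is used when; it does not reproduce the actual correction procedure documented at the link above.

```r
# Hypothetical helper: pick an elevation following a simplified reading of
# the fallback rule. `reported` and `srtm` are elevations in metres, `lat`
# is latitude in decimal degrees.
choose_elevation <- function(reported, srtm, lat) {
  in_srtm_band <- lat >= -60 & lat <= 60
  use_srtm <- in_srtm_band & !is.na(srtm)
  ifelse(use_srtm, srtm, reported)
}

# Made-up examples: SRTM available, SRTM missing, station beyond 60 degrees.
choose_elevation(
  reported = c(24, 100, 35),
  srtm = c(17, NA, 50),
  lat = c(-14, 10, 70)
)
```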

WMO Resolution 40. NOAA Policy

Users of these data should take into account the following (from the NCEI website):

“The following data and products may have conditions placed on their international commercial use. They can be used within the U.S. or for non-commercial international activities without restriction. The non-U.S. data cannot be redistributed for commercial purposes. Re-distribution of these data by others must provide this same notification.” WMO Resolution 40. NOAA Policy

References

Stachelek, J. (2016) Using the Geopackage Format with R. URL: https://jsta.github.io/2016/07/14/geopackage-r.html

Appendices

Appendix 1: GSODR Final Data Format, Contents and Units

GSODR formatted data include the following fields and units:

Appendix 2: Map of GSOD Station Locations