ipumsr
Example - NHGIS
OBJECTIVE: Gain an understanding of how the NHGIS datasets are structured and how they can be leveraged to explore your research interests. This exercise will use an NHGIS dataset to explore slavery in the United States in 1830.
This vignette is adapted from the NHGIS Data Training Exercise available here: https://pop.umn.edu/sites/pop.umn.edu/files/nhgis_training_ex1_2017-01.pdf
What was the state‐level distribution of slavery in 1830?
Q1) How many tables are available from the 1830 Census?
Q2) Other than slave status, what are some other topics we could learn about for 1830?
# A: Population that is urban, particular ages, deaf and dumb, blind, and foreign born
# not naturalized.
Q3) Click the table name to see additional information. How many variables does this table contain?
Q4) For which geographic levels is the table available?
Q5) Close the table pop‐up window and inspect the Select Data table… What is the universe for this table?
Q6) What differentiates this table from the other available slavery tables from 1830?
Q7) Name a percentage or ratio this table would allow us to calculate that the other tables would not, based on the counts available in each table?
If you refresh your browser window (click the refresh icon at the top, or press F5), you will see the extract status change from ‘queued’ to ‘in progress’ to ‘complete’, at which time you will be able to click the ‘tables’ link to download the data.
You will need to change the filepaths noted below to the place where you have saved the extracts.
library(ipumsr)
library(sf)
library(dplyr) # Provides mutate(), select(), arrange(), slice(), filter() and %>% used below
# Change these filepaths to the filepaths of your downloaded extract
nhgis_csv_file <- "nhgis0001_csv.zip"
nhgis_shp_file <- "nhgis0001_shape.zip"
#> Could not find NHGIS data and so could not run vignette.
#>
#> If you tried to download the data following the instructions above, please makesure that the filenames are correct:
#> csv - nhgis0001_csv.zip
#> shape - nhgis0001_shape.zip
#> And that you are in the correct directory if you are using a relative path:
#> Current directory - C:/Users/umn-burkx031/AppData/Local/Temp/Rtmpm6X7mY/Rbuilde8418a62dc1/ipumsr/vignettes
#>
#> The data is also available on github. You can install it using the following commands:
#> if (!require(devtools)) install.packages('devtools')
#> devtools::install_github('mnpopcenter/ipumsr/ipumsexamples')
#> After installation, the data should be available for this vignette.
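The message above is what the vignette prints when the extract files cannot be found. A minimal base R sketch for checking the paths yourself before calling the read functions follows. It assumes the extract filenames used in this vignette, and the helper name check_extract_files is hypothetical, not part of ipumsr.

```r
# Hypothetical helper (not part of ipumsr): report which extract files are missing
check_extract_files <- function(paths) {
  found <- file.exists(paths)
  if (!all(found)) {
    message(
      "Missing file(s): ", paste(paths[!found], collapse = ", "),
      "\nCurrent directory: ", getwd()
    )
  }
  # Return TRUE only when every file was found
  invisible(all(found))
}

# Filenames assumed to match the extract downloaded above
ok <- check_extract_files(c("nhgis0001_csv.zip", "nhgis0001_shape.zip"))
```

If ok is FALSE, either move the zip files into the directory shown or use full paths in nhgis_csv_file and nhgis_shp_file.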
nhgis_ddi <- read_ipums_codebook(nhgis_csv_file) # Contains metadata, nice to have as separate object
nhgis <- read_nhgis_sf(
  data_file = nhgis_csv_file,
  shape_file = nhgis_shp_file
)
Note that read_nhgis_sf relies on package sf. You can also read NHGIS data into the format used by package sp with function read_nhgis_sp.
These exercises include example code written in the “tidyverse” style, meaning that they use the dplyr package. This package provides easy-to-use functions for data analysis, including mutate(), select(), arrange(), slice(), and the pipe (%>%). There are numerous other ways to answer these questions, including base R, the data.table package, and others.
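For readers who prefer base R, here is a rough sketch of base R counterparts to the dplyr verbs named above. The data frame is a toy with made-up values (not NHGIS counts), and the column names x, y and total are hypothetical.

```r
# Toy data with made-up values, for illustration only (not NHGIS counts)
toy <- data.frame(
  STATE = c("A", "B", "C"),
  x = c(10, 30, 20),
  y = c(1, 2, 3)
)

# mutate(): add a new column
toy$total <- toy$x + toy$y

# select(): keep only some columns
toy_small <- toy[, c("STATE", "total")]

# arrange(desc(total)): sort rows in descending order
toy_sorted <- toy_small[order(toy_small$total, decreasing = TRUE), ]

# slice(1:2): keep the first two rows
top2 <- toy_sorted[1:2, ]
top2
```

The dplyr versions in the exercises below do the same steps in a single piped chain, which tends to read more naturally from top to bottom.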
Q8) How many states/territories are included in this table?
Q9) Why do you think other states are missing?
table(nhgis$STATE)
# A: In 1830, there were not any other states yet! Every decennial census is a
# historical snapshot, and NHGIS provides census counts just as they were
# originally reported without "filling in" any information for newer areas.
Q10) Create a new variable called total_pop, with the total population for each state, by summing the counts in columns ABO001 to ABO006. Which state had the largest population?
nhgis <- nhgis %>%
  mutate(total_pop = ABO001 + ABO002 + ABO003 + ABO004 + ABO005 + ABO006)
nhgis %>%
  as.data.frame() %>%
  select(STATE, total_pop) %>%
  arrange(desc(total_pop)) %>%
  slice(1:5)
# A: New York
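Typing out six column names works, but a run of columns can also be summed with rowSums(). The sketch below uses a toy data frame with made-up counts; only the column names ABO001 to ABO006 come from the extract. With the sf object returned by read_nhgis_sf you would probably need to drop the geometry column first (for example with sf::st_drop_geometry()), which is an assumption not tested here.

```r
# Toy data with made-up counts, for illustration only
toy <- data.frame(
  STATE = c("A", "B"),
  ABO001 = c(1, 10), ABO002 = c(2, 20), ABO003 = c(3, 30),
  ABO004 = c(4, 40), ABO005 = c(5, 50), ABO006 = c(6, 60)
)

# Build the column names ABO001 ... ABO006 rather than typing each one
count_cols <- sprintf("ABO%03d", 1:6)

# Sum across the selected columns, row by row
toy$total_pop <- rowSums(toy[, count_cols])
toy[, c("STATE", "total_pop")]
```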
Q11) Create a variable called slave_pop, with the total slave population by summing the variables ABO003 and ABO004. Which state had the largest slave population?
nhgis <- nhgis %>%
  mutate(slave_pop = ABO003 + ABO004)
nhgis %>%
  as.data.frame() %>%
  select(STATE, slave_pop) %>%
  arrange(desc(slave_pop)) %>%
  slice(1:5)
# A: Virginia
Q12) Create a variable called pct_slave with the Slave Population divided by the Total Population. Which states had the highest and lowest Percent Slave Population?
nhgis <- nhgis %>%
  mutate(pct_slave = slave_pop / total_pop)
nhgis %>%
  as.data.frame() %>%
  select(STATE, pct_slave) %>%
  filter(pct_slave %in% c(min(pct_slave, na.rm = TRUE), max(pct_slave, na.rm = TRUE)))
# A: South Carolina (54.27%) and Vermont (0.00%)
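The filter above keeps every row whose value equals the minimum or the maximum, so any ties would all be returned. A base R sketch of the same idea, on made-up values with hypothetical state labels, shows how this differs from which.min() and which.max(), which return only the first match.

```r
# Toy data with made-up values, for illustration only
toy <- data.frame(
  STATE = c("A", "B", "C", "D"),
  slave_pop = c(0, 50, NA, 0),
  total_pop = c(100, 200, 300, 400)
)
toy$pct_slave <- toy$slave_pop / toy$total_pop

# Rows equal to the minimum or maximum, ignoring missing values
lo <- min(toy$pct_slave, na.rm = TRUE)
hi <- max(toy$pct_slave, na.rm = TRUE)
extremes <- toy[toy$pct_slave %in% c(lo, hi), c("STATE", "pct_slave")]
extremes

# which.max() and which.min() return only the first matching row, so ties are hidden
toy$STATE[which.max(toy$pct_slave)]
toy$STATE[which.min(toy$pct_slave)]
```

In this toy example, extremes contains A, B and D, while which.min() reports only A.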
Q13) Are there any surprises, or is it as you expected?
nhgis %>%
  as.data.frame() %>%
  filter(pct_slave > 0.5) %>%
  select(STATE, slave_pop, total_pop, pct_slave)
nhgis %>%
  as.data.frame() %>%
  filter(STATE %in% c("New York", "New Jersey")) %>%
  select(STATE, slave_pop, total_pop, pct_slave)
# A: Possibilities: Did you know some states had more slaves than free persons? Did
# you know that some “free states” were home to substantial numbers of slaves?
Open the .txt codebook file that is in the same folder as the comma delimited file you have already analyzed. The codebook file is a valuable reference containing information about the table or tables you’ve downloaded.
Some of the information provided in the codebook can be read into R, using the function read_ipums_codebook().
Q14) What is the proper citation to provide when using NHGIS data in publications or research reports?
cat(ipums_file_info(nhgis_ddi, "conditions"))
# A: Minnesota Population Center. National Historical Geographic Information
# System: Version 11.0 [Database]. Minneapolis: University of Minnesota. 2016.
# http://doi.org/10.18128/D050.V11.0.
Q15) What is the email address for NHGIS to share any research you have published? (You can also send questions you may have about the site. We’re happy to help!)
One of the reasons we are excited about bringing IPUMS data to R is the GIS capabilities available for free in R.
Q16) Make a map of the percent of the population that are slaves.
# Note: the function `geom_sf()` was added in ggplot2 3.0.0, so if the check
# below fails you may need to update ggplot2, for example with
# install.packages("ggplot2")
library(ggplot2)
if ("geom_sf" %in% getNamespaceExports("ggplot2")) {
  ggplot(data = nhgis, aes(fill = pct_slave)) +
    geom_sf() +
    scale_fill_continuous("", labels = scales::percent) +
    labs(
      title = "Percent of Population that was Enslaved by State",
      subtitle = "1830 Census",
      caption = paste0("Source: ", ipums_file_info(nhgis_ddi, "ipums_project"))
    )
}