
Example 5: Geocoding Addresses and Spatial Applications

Vinh Nguyen

2023-10-05

Introduction: Geocoding

Location data such as student addresses are a rich source of information. Institutions can use them to learn where students are concentrated in the surrounding areas, the distance and commute time to campus for each student, and the population-level attributes of geographical areas provided by the U.S. Census Bureau (e.g., the American Community Survey or ACS). Geocoding is the act of converting an address to its geographical coordinates (longitude and latitude). It is a critical first step in making addresses more useful for analysis by institutional researchers.

In this vignette, we cover how to geocode using the tidygeocoder package in R. Although many options exist for geocoding, we focus on this particular package in R for several reasons:

  1. Many analysts use R to process or analyze data, so it is convenient to stay in the same environment.
  2. tidygeocoder provides a unified interface for geocoding with many services on the backend. The user specifies the appropriate service depending on their needs as services differ on match rates, costs, limits on the free tier (if available), usage limitations (queries per second or total queries in a time period), and data privacy / retention policies.
  3. tidygeocoder supports the Nominatim service, a geocoding service based on OpenStreetMap (OSM) data, a crowdsourced open data platform. Moreover, the user could specify a custom API URL, which is useful for geocoding with a local Nominatim server instead of the public server. This is especially critical for institutional researchers as address data never leave the local network, maintaining student privacy and limiting potential data security issues.

Besides geocoding, this vignette also illustrates a few spatial applications of the geocoded data.

Installation and Setup

tidygeocoder and other relevant packages in R

The tidygeocoder package and the other relevant packages can be installed by running the following code in an R console:

# Installation
install.packages(c('tidygeocoder' # for geocoding
                   , 'maps' # map data for Visualizations
                   , 'geodist' # calculating distance
                   , 'sf' # interface for spatial data (e.g., shapefiles)
                   , 'tigris' # US Census shapefiles
                   , 'tidycensus' # query US Census data
                   ) 
                 , repos='https://cran.r-project.org')

Example: Geocoding Addresses

Let’s load some necessary packages and load some address data provided by the IRexamples package.

# Load packages
library(IRexamples)
library(tidygeocoder)
library(dplyr)

# Load data
data(ccc_list)
data(uc_list)

dim(ccc_list)
## [1] 115   5

dim(uc_list)
## [1] 10  2

head(ccc_list)
##                     College                                       District
## 1     ALLAN HANCOCK COLLEGE Allan Hancock Joint Community College District
## 2    AMERICAN RIVER COLLEGE            Los Rios Community College District
## 3   ANTELOPE VALLEY COLLEGE     Antelope Valley Community College District
## 4       BAKERSFIELD COLLEGE                Kern Community College District
## 5 BARSTOW COMMUNITY COLLEGE             Barstow Community College District
## 6     BERKELEY CITY COLLEGE             Peralta Community College District
##                                               Address        Phone
## 1 800 South College Drive, Santa Maria, CA 93454-6368 805.922.6966
## 2   4700 College Oak Drive, Sacramento, CA 95841-4286 916.484.8011
## 3        3041 West Avenue K, Lancaster, CA 93536-5426 661.722.6300
## 4     1801 Panorama Drive, Bakersfield, CA 93305-1299 661.395.4011
## 5           2700 Barstow Road, Barstow, CA 92311-6699 760.252.2411
## 6         2050 Center Street, Berkeley, CA 94704-1205 510.981.2800
##                       Website
## 1      www.hancockcollege.edu
## 2         www.arc.losrios.edu
## 3                 www.avc.edu
## 4  www.bakersfieldcollege.edu
## 5             www.barstow.edu
## 6 www.berkeleycitycollege.edu

head(uc_list)
## # A tibble: 6 x 2
##   University     Address                                         
##   <chr>          <chr>                                           
## 1 UC Berkeley    Berkeley, CA                                    
## 2 UC Davis       1 Shields Ave, Davis, CA 95616                  
## 3 UC Irvine      510 E Peltason Dr. Irvine, California 92697-5700
## 4 UC Los Angeles Los Angeles, CA 90095                           
## 5 UC Merced      5200 Lake Rd, Merced, CA 95343                  
## 6 UC Riverside   900 University Ave, Riverside, CA 92521        

The ccc_list and uc_list data sets contain lists of addresses for the California Community Colleges and the Universities of California, respectively.

Next, we geocode the data with the public Nominatim service. To use a local server instead, uncomment the api_url and min_time arguments.

ccc_list_geo <- ccc_list %>%
  geocode(address=Address
          , method='osm' # nominatim
          # , api_url='http://localhost:8080/' # Uncomment this if using a local Nominatim server on the same computer that's running R; default uses the public online service; can also specify the address to a local server running on the network
          # , min_time=0.1 # If using local server, allow 1 query per 0.1 seconds.  Otherwise, default is 1 query per second on the public server (see ?min_time_reference)
          )
dim(ccc_list_geo)
## [1] 115   7

names(ccc_list_geo)
## [1] "College"  "District" "Address"  "Phone"    "Website"  "lat"      "long"

ccc_list_geo %>%
  select(College, lat, long) %>%
  print(n=120)
## # A tibble: 115 x 3
##     College                          lat  long
##     <chr>                          <dbl> <dbl>
##   1 ALLAN HANCOCK COLLEGE           34.9 -120.
##   2 AMERICAN RIVER COLLEGE          38.6 -121.
##   3 ANTELOPE VALLEY COLLEGE         34.7 -118.
##   4 BAKERSFIELD COLLEGE             35.4 -119.
##   5 BARSTOW COMMUNITY COLLEGE       34.9 -117.
##   6 BERKELEY CITY COLLEGE           37.9 -122.
##   7 BUTTE COLLEGE                   39.6 -122.
##   8 CABRILLO COLLEGE                37.0 -122.
##   9 CAÑADA COLLEGE                  37.4 -122.
##  10 CERRITOS COLLEGE                33.9 -118.
##  11 CERRO COSO COMMUNITY COLLEGE    35.6 -118.
##  12 CHABOT COLLEGE                  37.6 -122.
##  13 CHAFFEY COLLEGE                 34.1 -118.
##  14 CITRUS COLLEGE                  34.1 -118.
##  15 CITY COLLEGE OF SAN FRANCISCO   NA     NA 
##  16 CLOVIS COMMUNITY COLLEGE        36.9 -120.
##  17 COASTLINE COMMUNITY COLLEGE     33.7 -118.
##  18 COLLEGE OF ALAMEDA              37.8 -122.
##  19 COLLEGE OF MARIN                38.0 -123.
##  20 COLLEGE OF SAN MATEO            37.5 -122.
##  21 COLLEGE OF THE CANYONS          NA     NA 
##  22 COLLEGE OF THE DESERT           33.7 -116.
##  23 COLLEGE OF THE REDWOODS         NA     NA 
##  24 COLLEGE OF THE SEQUOIAS         36.3 -119.
##  25 COLLEGE OF THE SISKIYOUS        41.4 -122.
##  26 COLUMBIA COLLEGE                NA     NA 
##  27 COMPTON COLLEGE                 33.9 -118.
##  28 CONTRA COSTA COLLEGE            38.0 -122.
##  29 COPPER MOUNTAIN COLLEGE         NA     NA 
##  30 COSUMNES RIVER COLLEGE          38.4 -121.
##  31 CRAFTON HILLS COLLEGE           34.0 -117.
##  32 CUESTA COLLEGE                  NA     NA 
##  33 CUYAMACA COLLEGE                32.7 -117.
##  34 CYPRESS COLLEGE                 33.8 -118.
##  35 DEANZA COLLEGE                  37.3 -122.
##  36 DIABLO VALLEY COLLEGE           38.0 -122.
##  37 EAST LOS ANGELES COLLEGE        NA     NA 
##  38 EL CAMINO COLLEGE               33.9 -118.
##  39 EVERGREEN VALLEY COLLEGE        37.3 -122.
##  40 FEATHER RIVER COLLEGE           40.0 -121.
##  41 FOLSOM LAKE COLLEGE             38.7 -121.
##  42 FOOTHILL COLLEGE                37.4 -122.
##  43 FRESNO CITY COLLEGE             36.8 -120.
##  44 FULLERTON COLLEGE               33.9 -118.
##  45 GAVILAN COLLEGE                 37.0 -122.
##  46 GLENDALE COMMUNITY COLLEGE      34.2 -118.
##  47 GOLDEN WEST COLLEGE             33.7 -118.
##  48 GROSSMONT COLLEGE               32.8 -117.
##  49 HARTNELL COLLEGE                36.7 -122.
##  50 IMPERIAL VALLEY COLLEGE         32.8 -116.
##  51 IRVINE VALLEY COLLEGE           33.7 -118.
##  52 LAKE TAHOE COMMUNITY COLLEGE    NA     NA 
##  53 LANEY COLLEGE                   37.8 -122.
##  54 LAS POSITAS COLLEGE             37.7 -122.
##  55 LASSEN COLLEGE                  NA     NA 
##  56 LONG BEACH CITY COLLEGE         33.8 -118.
##  57 LOS ANGELES CITY COLLEGE        34.1 -118.
##  58 LOS ANGELES HARBOR COLLEGE      NA     NA 
##  59 LOS ANGELES MISSION COLLEGE     34.3 -118.
##  60 LOS ANGELES PIERCE COLLEGE      34.2 -119.
##  61 LOS ANGELES SOUTHWEST COLLEGE   33.9 -118.
##  62 LOS ANGELES TRADE-TECH COLLEGE  34.0 -118.
##  63 LOS ANGELES VALLEY COLLEGE      34.2 -118.
##  64 LOS MEDANOS COLLEGE             38.0 -122.
##  65 MADERA COLLEGE                  36.9 -120.
##  66 MENDOCINO COLLEGE               39.2 -123.
##  67 MERCED COLLEGE                  37.3 -120.
##  68 MERRITT COLLEGE                 37.8 -122.
##  69 MIRACOSTA COLLEGE               33.2 -117.
##  70 MISSION COLLEGE                 37.4 -122.
##  71 MODESTO JUNIOR COLLEGE          37.7 -121.
##  72 MONTEREY PENINSULA COLLEGE      36.6 -122.
##  73 MOORPARK COLLEGE                34.3 -119.
##  74 MORENO VALLEY COLLEGE           33.9 -117.
##  75 MT. SAN ANTONIO COLLEGE         34.0 -118.
##  76 MT. SAN JACINTO COLLEGE         33.8 -117.
##  77 NAPA VALLEY COLLEGE             38.2 -122.
##  78 NORCO COLLEGE                   NA     NA 
##  79 OHLONE COLLEGE                  37.5 -122.
##  80 ORANGE COAST COLLEGE            NA     NA 
##  81 OXNARD COLLEGE                  34.2 -119.
##  82 PALO VERDE COLLEGE              NA     NA 
##  83 PALOMAR COLLEGE                 33.2 -117.
##  84 PASADENA CITY COLLEGE           34.1 -118.
##  85 PORTERVILLE COLLEGE             36.0 -119.
##  86 REEDLEY COLLEGE                 36.6 -119.
##  87 RIO HONDO COLLEGE               34.0 -118.
##  88 RIVERSIDE CITY COLLEGE          34.0 -117.
##  89 SACRAMENTO CITY COLLEGE         38.5 -121.
##  90 SADDLEBACK COLLEGE              33.6 -118.
##  91 SAN BERNARDINO VALLEY COLLEGE   34.1 -117.
##  92 SAN DIEGO CITY COLLEGE          32.7 -117.
##  93 SAN DIEGO MESA COLLEGE          32.8 -117.
##  94 SAN DIEGO MIRAMAR COLLEGE       32.9 -117.
##  95 SAN JOAQUIN DELTA COLLEGE       38.0 -121.
##  96 SAN JOSE CITY COLLEGE           37.3 -122.
##  97 SANTA ANA COLLEGE               33.8 -118.
##  98 SANTA BARBARA CITY COLLEGE      34.4 -120.
##  99 SANTA MONICA COLLEGE            NA     NA 
## 100 SANTA ROSA JUNIOR COLLEGE       38.5 -123.
## 101 SANTIAGO CANYON COLLEGE         33.8 -118.
## 102 SHASTA COLLEGE                  NA     NA 
## 103 SIERRA COLLEGE                  38.8 -121.
## 104 SKYLINE COLLEGE                 37.6 -122.
## 105 SOLANO COMMUNITY COLLEGE        38.2 -122.
## 106 SOUTHWESTERN COLLEGE            32.6 -117.
## 107 TAFT COLLEGE                    NA     NA 
## 108 VENTURA COLLEGE                 34.3 -119.
## 109 VICTOR VALLEY COLLEGE           34.5 -117.
## 110 WEST HILLS COLLEGE COALINGA     NA     NA 
## 111 WEST HILLS COLLEGE LEMOORE      36.3 -120.
## 112 WEST LOS ANGELES COLLEGE        34.0 -118.
## 113 WEST VALLEY COLLEGE             37.3 -122.
## 114 WOODLAND COMMUNITY COLLEGE      38.7 -122.
## 115 YUBA COLLEGE                    NA     NA 

# Match rate
ccc_list_geo %>%
  summarize(N_Missing=sum(is.na(lat)), Match_Rate=mean(!is.na(lat)))
## # A tibble: 1 x 2
##   N_Missing Match_Rate
##       <int>      <dbl>
## 1        18      0.843

# Non-matches
ccc_list_geo %>%
  filter(is.na(long)) %>%
  select(College, Address) %>% 
  as.data.frame
##                          College
## 1  CITY COLLEGE OF SAN FRANCISCO
## 2         COLLEGE OF THE CANYONS
## 3        COLLEGE OF THE REDWOODS
## 4               COLUMBIA COLLEGE
## 5        COPPER MOUNTAIN COLLEGE
## 6                 CUESTA COLLEGE
## 7       EAST LOS ANGELES COLLEGE
## 8   LAKE TAHOE COMMUNITY COLLEGE
## 9                 LASSEN COLLEGE
## 10    LOS ANGELES HARBOR COLLEGE
## 11                 NORCO COLLEGE
## 12          ORANGE COAST COLLEGE
## 13            PALO VERDE COLLEGE
## 14          SANTA MONICA COLLEGE
## 15                SHASTA COLLEGE
## 16                  TAFT COLLEGE
## 17   WEST HILLS COLLEGE COALINGA
## 18                  YUBA COLLEGE
##                                                        Address
## 1          50 Phelan Avenue E200, San Francisco, CA 94112-1898
## 2  26455 N. Rockwell Canyon Road, Santa Clarita, CA 91355-1899
## 3               7351 Tompkins Hill Road, Eureka, CA 95501-9301
## 4          11600 Columbia College Drive, Sonora, CA 95370-8518
## 5                  6162 Rotary Way, Joshua Tree, CA 92252-6100
## 6                  PO Box 8106, San Luis Obispo, CA 93403-8106
## 7      1301 Avenida Cesar Chavez, Monterey Park, CA 91754-6099
## 8               1 College Drive, So. Lake Tahoe, CA 96150-4524
## 9                       PO Box 3000, Susanville, CA 96130-3000
## 10              1111 Figueroa Place, Wilmington, CA 90744-2397
## 11                     2001 Third Street, Norco, CA 92860-2600
## 12                      PO Box 5005, Costa Mesa, CA 92628-5005
## 13                    One College Drive, Blythe, CA 92225-1118
## 14            1900 Pico Boulevard, Santa Monica, CA 90405-1628
## 15                       PO Box 496006, Redding, CA 96049-6006
## 16                          29 Cougar Ct., Taft, CA 93268-4217
## 17                    300 Cherry Lane, Coalinga, CA 93210-1301
## 18            2088 North Beale Road, Marysville, CA 95901-7699

There is an 84% match rate for the ccc_list data using OSM.

The geocode function provides a cascade method that one could use to leverage multiple geocoding services: when the first service does not return coordinates for an address, a second service is used as a second attempt, with additional services used subsequently should more be specified. The following code illustrates this feature with the census method as backup:

ccc_list_geo2 <- ccc_list %>%
  geocode(address=Address
          , method='cascade'
          , cascade_order=c('osm', 'census')
          )

ccc_list_geo2 %>%
  summarize(N_Missing=sum(is.na(lat)), Match_Rate=mean(!is.na(lat)))
## # A tibble: 1 x 2
##   N_Missing Match_Rate
##       <int>      <dbl>
## 1         7      0.939

ccc_list_geo2 %>%
  filter(is.na(long)) %>%
  select(College, Address, lat, long) %>% 
  as.data.frame
##                        College                                        Address
## 1               CUESTA COLLEGE    PO Box 8106, San Luis Obispo, CA 93403-8106
## 2 LAKE TAHOE COMMUNITY COLLEGE 1 College Drive, So. Lake Tahoe, CA 96150-4524
## 3               LASSEN COLLEGE         PO Box 3000, Susanville, CA 96130-3000
## 4         ORANGE COAST COLLEGE         PO Box 5005, Costa Mesa, CA 92628-5005
## 5           PALO VERDE COLLEGE       One College Drive, Blythe, CA 92225-1118
## 6               SHASTA COLLEGE          PO Box 496006, Redding, CA 96049-6006
## 7                 TAFT COLLEGE             29 Cougar Ct., Taft, CA 93268-4217
##   lat long
## 1  NA   NA
## 2  NA   NA
## 3  NA   NA
## 4  NA   NA
## 5  NA   NA
## 6  NA   NA
## 7  NA   NA

With the Census as the second service, the match rate increased to 94%. Note that the cascade method does not work with a local Nominatim server. To use a local Nominatim server, the user should geocode with the local server first, then identify the non-matched addresses and geocode them with a secondary service. This is illustrated as follows:

ccc_list_geo2_manual <- ccc_list_geo %>%
  filter(is.na(lat)) %>% # identify non-matches
  select(-lat, -long) %>% # remove these columns
  geocode(address=Address
          , method='census' # geocode non-matches with census method
          ) %>%
  rbind(ccc_list_geo %>% filter(!is.na(lat))) # Append matches from 1st run
dim(ccc_list_geo2_manual)
## [1] 115   7

ccc_list_geo2_manual %>%
  summarize(N_Missing=sum(is.na(lat)), Match_Rate=mean(!is.na(lat)))
## # A tibble: 1 x 2
##   N_Missing Match_Rate
##       <int>      <dbl>
## 1         7      0.939

The manual approach yields the same results as the cascade method, but is preferred when a local Nominatim server is used.
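
The fallback logic behind both approaches can be sketched in base R without calling any geocoding service. In this minimal sketch, the addresses, the coordinates, the two lookup tables, and the helper lookup_geocode are all hypothetical stand-ins for real services; only the pattern (geocode, retry the non-matches, recombine) mirrors the code above. The sketch also keeps a row id so the original row order can be restored after the two pieces are appended.

```r
# Toy address table; addresses and coordinates are made up for illustration
addresses <- data.frame(
  id = 1:4,
  Address = c('1 Main St', '2 Oak Ave', 'PO Box 3', '4 Pine Rd'),
  stringsAsFactors = FALSE
)

# Hypothetical stand-ins for two geocoding services (simple lookup tables)
primary_lookup <- data.frame(
  Address = c('1 Main St', '4 Pine Rd'),
  lat = c(34.1, 36.3), long = c(-118.2, -119.4),
  stringsAsFactors = FALSE
)
secondary_lookup <- data.frame(
  Address = '2 Oak Ave', lat = 37.8, long = -122.3,
  stringsAsFactors = FALSE
)

# Hypothetical helper: attach lat/long from a lookup table, NA when no match
lookup_geocode <- function(df, lookup, label) {
  idx <- match(df$Address, lookup$Address)
  df$lat <- lookup$lat[idx]
  df$long <- lookup$long[idx]
  df$geo_method <- ifelse(is.na(df$lat), NA_character_, label)
  df
}

# Pass 1: try the primary service on every address
pass1 <- lookup_geocode(addresses, primary_lookup, 'primary')

# Pass 2: retry only the non-matches with the secondary service
retry <- pass1[is.na(pass1$lat), c('id', 'Address')]
pass2 <- lookup_geocode(retry, secondary_lookup, 'secondary')

# Recombine and restore the original row order
combined <- rbind(pass1[!is.na(pass1$lat), ], pass2)
combined <- combined[order(combined$id), ]

# Match rate after both passes
mean(!is.na(combined$lat))
```

Carrying an id column is a design choice rather than a requirement: appending the two pieces with rbind places one group of rows before the other, so sorting on the id afterwards keeps the result aligned with the input.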

Example: Visualizing Locations

With the geocoded results, we can visualize the locations of the California community colleges on a map:

library(ggplot2)
library(maps)
library(ggrepel)

ggplot(ccc_list_geo2, aes(long, lat), color = "grey99") +
  borders(database='state', region='california') + # See ?map
  geom_point() +
  # geom_label_repel(aes(label = College)) +
  theme_void()
#dev.copy(png, filename='./images/ccc_map.png', width=3, height=5, units='in', res=150)
#dev.off()

# Concentration
# https://stackoverflow.com/questions/13316185/r-convert-zipcode-or-lat-long-to-county

The following shows the same map of California with county borders:

ggplot(ccc_list_geo2, aes(long, lat), color = "grey99") +
  borders(database='county', region='california') + # See ?map
  geom_point() +
  # geom_label_repel(aes(label = College)) +
  theme_void()
#dev.copy(png, filename='./images/ccc_map_us.png', width=9, height=5, units='in', res=150)
#dev.off()

With the geographical coordinates, the geocoded data set could also be exported and then imported in other platforms for visualization such as Tableau, ArcGIS, or QGIS.
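
One simple way to hand the coordinates to another platform is a CSV file. The following is a minimal sketch, assuming a small made-up table in place of the geocoded data; the file name and the choice to write missing coordinates as empty fields are illustrative assumptions, not requirements of any particular tool.

```r
# Toy geocoded table; values are made up for illustration
geo <- data.frame(
  College = c('COLLEGE A', 'COLLEGE B'),
  lat = c(34.9, NA),
  long = c(-120.4, NA),
  stringsAsFactors = FALSE
)

# Write to a temporary location (hypothetical file name)
out_file <- file.path(tempdir(), 'ccc_list_geo.csv')
write.csv(geo, out_file, row.names = FALSE, na = '')

# Read the file back to confirm the columns survived the round trip
check <- read.csv(out_file, stringsAsFactors = FALSE)
str(check)
```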

Example: Calculating Distance

With geocoded data, distance can be calculated using the geodist package in R. In the following example, we first geocode the locations of all the UC’s, then calculate the distance from each community college to each UC.

# Geocode UC's
uc_list_geo <- uc_list %>%
  geocode(address=Address
          , method='cascade'
          , cascade_order=c('osm', 'census')
          )
uc_list_geo %>%
  as.data.frame
##          University                                          Address      lat
## 1       UC Berkeley                                     Berkeley, CA 37.87535
## 2          UC Davis                   1 Shields Ave, Davis, CA 95616       NA
## 3         UC Irvine 510 E Peltason Dr. Irvine, California 92697-5700 33.64163
## 4    UC Los Angeles                            Los Angeles, CA 90095 34.07088
## 5         UC Merced                   5200 Lake Rd, Merced, CA 95343 37.34685
## 6      UC Riverside          900 University Ave, Riverside, CA 92521 33.96371
## 7      UC San Diego               9500 Gilman Dr, La Jolla, CA 92093 32.87615
## 8  UC San Francisco       505 Parnassus Ave, San Francisco, CA 94143 37.76307
## 9  UC Santa Barbara                          Santa Barbara, CA 93106 34.42213
## 10    UC Santa Cruz               1156 High St, Santa Cruz, CA 95064 36.97738
##         long geo_method
## 1  -122.2396        osm
## 2         NA     census
## 3  -117.8456        osm
## 4  -118.4468        osm
## 5  -120.4326        osm
## 6  -117.3398        osm
## 7  -117.2432        osm
## 8  -122.4574        osm
## 9  -119.7027        osm
## 10 -122.0549        osm

# Load packages
library(geodist)
library(stringr)

# Calculate distance between each CC to each UC
dist_mat <- geodist(x=ccc_list_geo, y=uc_list_geo, measure='haversine') / 1609.34 # results are in meters; convert to miles
dim(dist_mat)
## [1] 115  10

# Append results as new columns in original data frame
ccc_list_geo[paste0('Dist: ', uc_list$University)] <- dist_mat
dim(ccc_list_geo)
## [1] 115  17

# Print first few rows
head(ccc_list_geo) %>% as.data.frame
##                     College                                       District
## 1     ALLAN HANCOCK COLLEGE Allan Hancock Joint Community College District
## 2    AMERICAN RIVER COLLEGE            Los Rios Community College District
## 3   ANTELOPE VALLEY COLLEGE     Antelope Valley Community College District
## 4       BAKERSFIELD COLLEGE                Kern Community College District
## 5 BARSTOW COMMUNITY COLLEGE             Barstow Community College District
## 6     BERKELEY CITY COLLEGE             Peralta Community College District
##                                               Address        Phone
## 1 800 South College Drive, Santa Maria, CA 93454-6368 805.922.6966
## 2   4700 College Oak Drive, Sacramento, CA 95841-4286 916.484.8011
## 3        3041 West Avenue K, Lancaster, CA 93536-5426 661.722.6300
## 4     1801 Panorama Drive, Bakersfield, CA 93305-1299 661.395.4011
## 5           2700 Barstow Road, Barstow, CA 92311-6699 760.252.2411
## 6         2050 Center Street, Berkeley, CA 94704-1205 510.981.2800
##                       Website      lat      long Dist: UC Berkeley
## 1      www.hancockcollege.edu 34.94327 -120.4231        226.612872
## 2         www.arc.losrios.edu 38.64884 -121.3465         72.218347
## 3                 www.avc.edu 34.67492 -118.1874        316.272955
## 4  www.bakersfieldcollege.edu 35.40877 -118.9720        248.964004
## 5             www.barstow.edu 34.87101 -117.0258        356.981425
## 6 www.berkeleycitycollege.edu 37.86980 -122.2697          1.685213
##   Dist: UC Davis Dist: UC Irvine Dist: UC Los Angeles Dist: UC Merced
## 1             NA       172.62547            127.78918        166.2585
## 2             NA       397.65959            355.42338        102.9175
## 3             NA        74.10235             44.33047        223.4554
## 4             NA       138.06228             97.23803        156.8008
## 5             NA        97.09591             98.12937        256.0385
## 6             NA       383.57060            338.82270        106.9657
##   Dist: UC Riverside Dist: UC San Diego Dist: UC San Francisco
## 1          188.45515           231.8548              225.56507
## 2          393.48090           460.8291               86.02155
## 3           69.02799           135.7479              319.94485
## 4          136.41495           201.1988              252.93543
## 5           65.26556           138.5493              362.74946
## 6          386.23756           446.7000               12.63799
##   Dist: UC Santa Barbara Dist: UC Santa Cruz
## 1               54.57738           167.75483
## 2              306.29097           121.92436
## 3               88.08124           269.05504
## 4               79.84264           203.42795
## 5              155.45306           317.08630
## 6              278.23339            62.84623

# Distance to UCI from colleges with 'Orange' in their district's name
ccc_list_geo %>%
  filter(str_detect(District, 'Orange')) %>%
  select(College, `Dist: UC Irvine`)
## # A tibble: 4 x 2
##   College               `Dist: UC Irvine`
##   <chr>                             <dbl>
## 1 CYPRESS COLLEGE                   15.7 
## 2 FULLERTON COLLEGE                 16.7 
## 3 IRVINE VALLEY COLLEGE              4.71
## 4 SADDLEBACK COLLEGE                12.0 
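
The measure='haversine' option used above refers to the haversine great-circle formula. As a minimal sketch, it can be written in base R as follows. The helper haversine_miles is hypothetical and not part of geodist, and the Earth radius of 6,371,008.8 meters is an assumption; geodist may use a different radius, so results can differ slightly from the table above.

```r
# Hypothetical helper: great-circle distance in miles between two points
# given in decimal degrees, assuming a spherical Earth
haversine_miles <- function(lon1, lat1, lon2, lat2, radius_m = 6371008.8) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin guards against rounding pushing sqrt(a) slightly above 1
  meters <- 2 * radius_m * asin(pmin(1, sqrt(a)))
  meters / 1609.34 # convert meters to miles
}

# Berkeley City College to UC Berkeley, using the coordinates printed above
d_bcc_ucb <- haversine_miles(-122.2697, 37.86980, -122.2396, 37.87535)
d_bcc_ucb
```

The result should land near the distance reported above for Berkeley City College, with small differences due to the rounding of the printed coordinates and the radius assumption. Because the function is vectorized, it also accepts whole columns of coordinates.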

Example: Data Augmentation with Census Data

The US Census Bureau offers a wealth of area-level information based on data collected from the Decennial Census (entire population, every 10 years) and the American Community Survey (ACS; sample of population, every month). For example, for a given geographical unit, one could obtain the population size, age group sizes, income levels, and educational attainment of the area. Such information could be used to describe the service areas of a college or to impute information about students that is unavailable to the institution (e.g., income or socioeconomic level). The augmented data could be used to describe the student population or service areas in grant applications, or as explanatory variables in a statistical model.

Before obtaining the needed information from the US Census, we must first determine the geographical area that an address corresponds to, namely the census tract. Census tracts are granular geographical units determined by the US Census Bureau, and statistical information (e.g., population size) is summarized for each of them. Previously, we illustrated how to geocode an address. From these geographical coordinates, we can determine the census tracts that the addresses belong to. To do so, we first download the California census tract shapefiles via the tigris R package. We then perform a spatial join between the geographical coordinates (as spatial points) and the census tracts (areas) via the sf R package: if a point falls inside a particular tract, that tract’s attributes are joined to the point. Once an address has been linked to a census tract, information from the US Census can be queried using the tidycensus R package.

The following code illustrates how to link geographical coordinates to census tracts using the addresses found in the uc_list data as sample input:

# Load packages
# https://stackoverflow.com/a/52260988/199140
library(tigris) # Download shapefiles from Census
library(sf) # Handle shapefiles; load after tigris
library(tidycensus)

# Get CA shapefiles
tract_CA <- tracts(state='CA', class='sf')
## Using FIPS code '06' for state 'CA'
##   |======================================================================| 100%

# Print
head(tract_CA)
## Simple feature collection with 6 features and 12 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -118.5715 ymin: 34.14174 xmax: -118.4836 ymax: 34.18716
## Geodetic CRS:  NAD83
##   STATEFP COUNTYFP TRACTCE       GEOID    NAME             NAMELSAD MTFCC
## 1      06      037  139301 06037139301 1393.01 Census Tract 1393.01 G5020
## 2      06      037  139302 06037139302 1393.02 Census Tract 1393.02 G5020
## 3      06      037  139502 06037139502 1395.02 Census Tract 1395.02 G5020
## 4      06      037  139600 06037139600    1396    Census Tract 1396 G5020
## 5      06      037  139701 06037139701 1397.01 Census Tract 1397.01 G5020
## 6      06      037  139801 06037139801 1398.01 Census Tract 1398.01 G5020
##   FUNCSTAT   ALAND AWATER    INTPTLAT     INTPTLON
## 1        S 2865657      0 +34.1781538 -118.5581265
## 2        S  338289      0 +34.1767230 -118.5383655
## 3        S 1047548      0 +34.1628402 -118.5263110
## 4        S 2477482      0 +34.1640599 -118.5101001
## 5        S 3396396   2411 +34.1574290 -118.4954117
## 6        S 3665744      0 +34.1527206 -118.5455805
##                         geometry
## 1 MULTIPOLYGON (((-118.5715 3...
## 2 MULTIPOLYGON (((-118.5407 3...
## 3 MULTIPOLYGON (((-118.5322 3...
## 4 MULTIPOLYGON (((-118.5186 3...
## 5 MULTIPOLYGON (((-118.5098 3...
## 6 MULTIPOLYGON (((-118.5567 3...

# Convert coordinates into geospatial data type (sf)
uc_points <- uc_list_geo %>%
  filter(!is.na(lat)) %>%  # remove missing
  st_as_sf(coords=c('long', 'lat'), crs=st_crs(tract_CA))

# Determine which census tract each point is located in
uc_points_tract <- st_join(uc_points, tract_CA)
dim(uc_points_tract)
## [1]  9 16

names(uc_points_tract)
##  [1] "University" "Address"    "geo_method" "geometry"   "STATEFP"   
##  [6] "COUNTYFP"   "TRACTCE"    "GEOID"      "NAME"       "NAMELSAD"  
## [11] "MTFCC"      "FUNCSTAT"   "ALAND"      "AWATER"     "INTPTLAT"  
## [16] "INTPTLON"

GEOID is the unique identifier of a census tract, and as you might infer from reviewing some of the values, it is a concatenation of STATEFP (state identifier), COUNTYFP (county identifier), and TRACTCE (tract identifier). Once GEOID is known, we can use the get_decennial and get_acs functions from the tidycensus package to query the data elements of interest, as described in the package’s basic usage vignette and the spatial data vignette.
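
The concatenation can be checked with a few lines of base R. This is a minimal sketch that uses two tract codes from the output printed above; the object names are illustrative. The codes are kept as character strings so that leading zeros (such as the '06' for California) are preserved.

```r
# Two tracts from the output above, with codes stored as character strings
tract <- data.frame(
  STATEFP = c('06', '06'),
  COUNTYFP = c('037', '037'),
  TRACTCE = c('139301', '139600'),
  stringsAsFactors = FALSE
)

# Build GEOID: 2-digit state + 3-digit county + 6-digit tract
tract$GEOID <- paste0(tract$STATEFP, tract$COUNTYFP, tract$TRACTCE)
tract$GEOID

# Split a GEOID back into its components by position
geoid <- '06037139301'
parts <- c(
  state = substr(geoid, 1, 2),
  county = substr(geoid, 3, 5),
  tract = substr(geoid, 6, 11)
)
parts
```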

To query data from the Census API, one must first request a free API key from the US Census Bureau. This key can be set in R with the following code from the tidycensus package:

# US Census API Key
#census_api_key(Sys.getenv('census_api_key'))
census_api_key('YOUR API KEY')

As noted in tidycensus’s basic usage vignette, getting information from the Census or ACS requires knowing the variable IDs. As there are many, one suggested workflow is to download the variable descriptions via the load_variables function and search these descriptions for the needed variables (type ?load_variables in the R console for help).

We illustrate how to find the variable ID’s for the following information, how to query the data, and how to append them to our sample addresses in the previously created uc_points_tract data set:

  1. Median income
  2. Educational attainment
  3. Race and Ethnicity

1. Median Income

# Download variables
v18 <- load_variables(2018, "acs5", cache = TRUE)
# ACS time estimates: https://www.census.gov/programs-surveys/acs/guidance/estimates.html

dim(v18)
## [1] 26997     3

names(v18)
## [1] "name"    "label"   "concept"

# View and filter variables in RStudio
# Concept: median income
#View(v18)

# Manually view and filter
v18 %>%
  filter(str_detect(tolower(concept), 'median')
       , str_detect(tolower(concept), 'income')
         ) %>%
  as.data.frame
## Long, so will not print here in vignette

What we want is MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2018 INFLATION-ADJUSTED DOLLARS), which corresponds to variable B19013_001. We can query this information at the tract level, then join it to our list of addresses in uc_points_tract with GEOID as the join key:

lu_median_income <- get_acs(geography='tract'
                          , variables=c(median_income='B19013_001')
                          , state='CA'
                          , year=2018
                          , geometry=FALSE
  )
dim(lu_median_income)
## [1] 8057    5

names(lu_median_income)
## [1] "GEOID"    "NAME"     "variable" "estimate" "moe"

head(lu_median_income) %>% as.data.frame
##         GEOID                                          NAME      variable
## 1 06001400100 Census Tract 4001, Alameda County, California median_income
## 2 06001400200 Census Tract 4002, Alameda County, California median_income
## 3 06001400300 Census Tract 4003, Alameda County, California median_income
## 4 06001400400 Census Tract 4004, Alameda County, California median_income
## 5 06001400500 Census Tract 4005, Alameda County, California median_income
## 6 06001400600 Census Tract 4006, Alameda County, California median_income
##   estimate   moe
## 1   200893 49177
## 2   160536 29320
## 3    94732 23862
## 4   113036 16872
## 5   103846 18727
## 6   127232 25110

# Join
uc_income <- uc_points_tract %>%
  select(University, Address, GEOID) %>%
  left_join(lu_median_income %>%
            select(GEOID, estimate) %>%
            rename(median_income=estimate)
            )
uc_income %>%
  as.data.frame
##         University                                          Address       GEOID
## 1      UC Berkeley                                     Berkeley, CA 06001400100
## 2        UC Irvine 510 E Peltason Dr. Irvine, California 92697-5700 06059062614
## 3   UC Los Angeles                            Los Angeles, CA 90095 06037265301
## 4        UC Merced                   5200 Lake Rd, Merced, CA 95343 06047001801
## 5     UC Riverside          900 University Ave, Riverside, CA 92521 06065046500
## 6     UC San Diego               9500 Gilman Dr, La Jolla, CA 92093 06073008305
## 7 UC San Francisco       505 Parnassus Ave, San Francisco, CA 94143 06075030102
## 8 UC Santa Barbara                          Santa Barbara, CA 93106 06083000900
## 9    UC Santa Cruz               1156 High St, Santa Cruz, CA 95064 06087100400
##                     geometry median_income
## 1 POINT (-122.2396 37.87535)        200893
## 2 POINT (-117.8456 33.64163)         39135
## 3 POINT (-118.4468 34.07088)            NA
## 4 POINT (-120.4326 37.34685)         96875
## 5 POINT (-117.3398 33.96371)         22102
## 6 POINT (-117.2432 32.87615)         39306
## 7 POINT (-122.4574 37.76307)        147549
## 8 POINT (-119.7027 34.42213)         46275
## 9 POINT (-122.0549 36.97738)         46375
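
The join above keeps only `estimate` and drops the `moe` (margin of error) column. As a minimal sketch, assuming the margins of error are reported at the 90 percent confidence level (the ACS default), the margin of error can be converted to a standard error and a coefficient of variation to flag less reliable tract estimates. The values are the first three rows printed by `head(lu_median_income)`; the column names `se`, `cv`, and `high_cv` and the 0.15 cutoff are illustrative assumptions, not part of the original analysis.

```r
# First three tracts as printed by head(lu_median_income) above
income <- data.frame(
  GEOID = c('06001400100', '06001400200', '06001400300'),
  estimate = c(200893, 160536, 94732),
  moe = c(49177, 29320, 23862)
)

# ACS margins of error are published at the 90% confidence level,
# so dividing by 1.645 recovers the standard error
income$se <- income$moe / 1.645

# Coefficient of variation: standard error relative to the estimate
income$cv <- income$se / income$estimate

# Hypothetical reliability cutoff; choose a threshold suited to the analysis
income$high_cv <- income$cv > 0.15

income
```

Carrying `moe` through the join in this way makes it possible to report tract-level income alongside an indication of how much weight each estimate can bear.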

2. Educational Attainment

# View and search in RStudio
# Concept: educational attainment
#v18 %>%
#  filter(str_detect(tolower(concept), 'educational attainment')) %>%
#  as.data.frame

After searching for “educational attainment” and reviewing the results, we focus on the concept EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER, which is described by variables with the prefix B15003.

v18 %>% filter(str_detect(name, 'B15003')) %>% as.data.frame
##          name                                                     label
## 1  B15003_001                                           Estimate!!Total
## 2  B15003_002                   Estimate!!Total!!No schooling completed
## 3  B15003_003                           Estimate!!Total!!Nursery school
## 4  B15003_004                             Estimate!!Total!!Kindergarten
## 5  B15003_005                                Estimate!!Total!!1st grade
## 6  B15003_006                                Estimate!!Total!!2nd grade
## 7  B15003_007                                Estimate!!Total!!3rd grade
## 8  B15003_008                                Estimate!!Total!!4th grade
## 9  B15003_009                                Estimate!!Total!!5th grade
## 10 B15003_010                                Estimate!!Total!!6th grade
## 11 B15003_011                                Estimate!!Total!!7th grade
## 12 B15003_012                                Estimate!!Total!!8th grade
## 13 B15003_013                                Estimate!!Total!!9th grade
## 14 B15003_014                               Estimate!!Total!!10th grade
## 15 B15003_015                               Estimate!!Total!!11th grade
## 16 B15003_016                   Estimate!!Total!!12th grade, no diploma
## 17 B15003_017              Estimate!!Total!!Regular high school diploma
## 18 B15003_018            Estimate!!Total!!GED or alternative credential
## 19 B15003_019           Estimate!!Total!!Some college, less than 1 year
## 20 B15003_020 Estimate!!Total!!Some college, 1 or more years, no degree
## 21 B15003_021                       Estimate!!Total!!Associate's degree
## 22 B15003_022                        Estimate!!Total!!Bachelor's degree
## 23 B15003_023                          Estimate!!Total!!Master's degree
## 24 B15003_024               Estimate!!Total!!Professional school degree
## 25 B15003_025                         Estimate!!Total!!Doctorate degree
##                                                        concept
## 1  EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER
## 2  EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER
## 3  EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER
## 4  EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER
## 5  EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER
## 6  EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER
## 7  EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER
## 8  EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER
## 9  EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER
## 10 EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER
## 11 EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER
## 12 EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER
## 13 EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER
## 14 EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER
## 15 EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER
## 16 EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER
## 17 EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER
## 18 EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER
## 19 EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER
## 20 EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER
## 21 EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER
## 22 EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER
## 23 EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER
## 24 EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER
## 25 EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER

# Query all of these variables at the tract level
lu_edu_long <- get_acs(geography='tract'
                          , variables=v18 %>%
                              filter(str_detect(name, 'B15003')) %>%
                              pull(name)
                          , state='CA'
                          , year=2018
                          , geometry=FALSE
  )

dim(lu_edu_long)
## [1] 201425      5

names(lu_edu_long)
## [1] "GEOID"    "NAME"     "variable" "estimate" "moe"     

unique(lu_edu_long$variable)

# Calculate proportions at the tract level
lu_edu <- lu_edu_long %>%
  group_by(GEOID) %>%
  summarize(Edu_25_Total=sum(estimate[variable=='B15003_001'])
            # B15003_002 through B15003_016: no schooling through 12th grade, no diploma
          , Pct_Edu_25_Less_HS=sum(estimate[variable %in% sprintf('B15003_%03d', 2:16)]) / Edu_25_Total
          , Pct_Edu_25_HS_GED=sum(estimate[variable %in% c('B15003_017', 'B15003_018')]) / Edu_25_Total
          , Pct_Edu_25_Some_HE=sum(estimate[variable %in% c('B15003_019', 'B15003_020', 'B15003_021')]) / Edu_25_Total
          , Pct_Edu_25_BA=sum(estimate[variable %in% c('B15003_022')]) / Edu_25_Total
          , Pct_Edu_25_MA_plus=sum(estimate[variable %in% c('B15003_023', 'B15003_024', 'B15003_025')]) / Edu_25_Total
            ) %>%
  ungroup

dim(lu_edu)
## [1] 8057    7

# Join
uc_edu <- uc_points_tract %>% 
  select(University, Address, GEOID) %>%
  left_join(lu_edu, by='GEOID')

# Print
uc_edu %>%
  as.data.frame
##         University                                          Address       GEOID
## 1      UC Berkeley                                     Berkeley, CA 06001400100
## 2        UC Irvine 510 E Peltason Dr. Irvine, California 92697-5700 06059062614
## 3   UC Los Angeles                            Los Angeles, CA 90095 06037265301
## 4        UC Merced                   5200 Lake Rd, Merced, CA 95343 06047001801
## 5     UC Riverside          900 University Ave, Riverside, CA 92521 06065046500
## 6     UC San Diego               9500 Gilman Dr, La Jolla, CA 92093 06073008305
## 7 UC San Francisco       505 Parnassus Ave, San Francisco, CA 94143 06075030102
## 8 UC Santa Barbara                          Santa Barbara, CA 93106 06083000900
## 9    UC Santa Cruz               1156 High St, Santa Cruz, CA 95064 06087100400
##                     geometry Edu_25_Total Pct_Edu_25_Less_HS Pct_Edu_25_HS_GED
## 1 POINT (-122.2396 37.87535)         2455        0.024439919        0.03095723
## 2 POINT (-117.8456 33.64163)         5296        0.003587613        0.04040785
## 3 POINT (-118.4468 34.07088)           89        0.000000000        0.00000000
## 4 POINT (-120.4326 37.34685)         1839        0.038064165        0.19738989
## 5 POINT (-117.3398 33.96371)         1577        0.163601776        0.12999366
## 6 POINT (-117.2432 32.87615)         1256        0.027070064        0.06130573
## 7 POINT (-122.4574 37.76307)         4200        0.009047619        0.04833333
## 8 POINT (-119.7027 34.42213)         2633        0.200151918        0.12229396
## 9 POINT (-122.0549 36.97738)          906        0.006622517        0.04525386
##   Pct_Edu_25_Some_HE Pct_Edu_25_BA Pct_Edu_25_MA_plus
## 1         0.12545825     0.3213849          0.4977597
## 2         0.07817221     0.2471677          0.6306647
## 3         0.83146067     0.1685393          0.0000000
## 4         0.36052202     0.2300163          0.1740076
## 5         0.38173748     0.1521877          0.1724794
## 6         0.16401274     0.3542994          0.3933121
## 7         0.08238095     0.4159524          0.4442857
## 8         0.28142803     0.2054690          0.1906570
## 9         0.26158940     0.2207506          0.4657837
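
Because the five education groups partition the population 25 years and over, their shares should sum to 1 within each tract. The following is a small sanity check in base R using two rows copied from the printed `uc_edu` output (UC Berkeley and UC Los Angeles). The columns `Share_Sum`, `Sums_To_One`, and `Small_Base`, and the cutoff of 100 people, are illustrative assumptions. Note also that a tract with no residents aged 25 or over would yield `NaN` for every share, since R evaluates 0/0 as `NaN`.

```r
# Two rows copied from the printed uc_edu output above
edu <- data.frame(
  University = c('UC Berkeley', 'UC Los Angeles'),
  Edu_25_Total = c(2455, 89),
  Pct_Edu_25_Less_HS = c(0.024439919, 0),
  Pct_Edu_25_HS_GED = c(0.03095723, 0),
  Pct_Edu_25_Some_HE = c(0.12545825, 0.83146067),
  Pct_Edu_25_BA = c(0.3213849, 0.1685393),
  Pct_Edu_25_MA_plus = c(0.4977597, 0)
)

# Sum the five share columns for each tract
share_cols <- grep('^Pct_Edu_25_', names(edu), value = TRUE)
edu$Share_Sum <- rowSums(edu[, share_cols])

# Printed values are rounded, so compare against 1 with a tolerance
edu$Sums_To_One <- abs(edu$Share_Sum - 1) < 1e-6

# Flag tracts whose denominator is small; the cutoff is an arbitrary illustration
edu$Small_Base <- edu$Edu_25_Total < 100

edu[, c('University', 'Edu_25_Total', 'Share_Sum', 'Sums_To_One', 'Small_Base')]
```

The UC Los Angeles tract has only 89 residents aged 25 or over, so its shares rest on a small base and should be interpreted with care.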

3. Race and Ethnicity

Working with race and ethnicity data from the Census is not as straightforward as one might expect, so it is important to understand how the information is gathered. This sheet describes how the Census asks about race and ethnicity in two separate questions. Given how these questions are asked and how the options are presented, respondents could:

  1. Check more than one race category in the race question.
  2. Indicate whether or not they are of Hispanic origin in the ethnicity question.

Because of these points, respondents could fall under more than one racial and ethnic categorization based on what is checked in the two questions. It’s also important to note that Census data are summarized as counts of individuals for each geographical area; data are not reported at the individual level. As a result, one could summarize each racial or ethnic group by counts and proportions, but these counts and proportions will almost certainly not sum to the whole.
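
To see why group proportions need not sum to the whole, consider a toy example in base R with five hypothetical respondents, each of whom may check more than one race box. The respondents, the three race columns, and the object names are all made up for illustration and do not come from Census data.

```r
# Five hypothetical respondents; TRUE means the box was checked
resp <- data.frame(
  id = 1:5,
  white = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  asian = c(FALSE, TRUE, TRUE, FALSE, FALSE),
  black = c(FALSE, FALSE, FALSE, TRUE, TRUE)
)

n <- nrow(resp)
race_cols <- c('white', 'asian', 'black')

# "Alone or in combination": count everyone who checked the box
in_combination <- colSums(resp[, race_cols]) / n

# "Alone": count only respondents who checked exactly one box
one_box <- rowSums(resp[, race_cols]) == 1
alone <- colSums(resp[one_box, race_cols]) / n

# Exceeds 1 because multi-race respondents are counted in every group checked
sum(in_combination)

# Falls below 1 because multi-race respondents are excluded from every group
sum(alone)
```

Once the data are aggregated to counts per area, the overlap between groups can no longer be untangled, which is why the choice between the two summaries has to be made when selecting variables.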

We illustrate two ways of summarizing race and ethnicity data at the tract level. In the first case, each race group includes all respondents who checked that group, even if they also checked other groups. In the second case, each race group includes only respondents who checked that group and no other. In both cases, Hispanic ethnicity is summarized separately.

Here is sample code for the first case, which includes all respondents who identify with a racial group:

# Major racial groups
v18 %>%
  filter(str_detect(tolower(concept), 'in combination')) %>%
  as.data.frame
##         name           label
## 1 B02008_001 Estimate!!Total
## 2 B02009_001 Estimate!!Total
## 3 B02010_001 Estimate!!Total
## 4 B02011_001 Estimate!!Total
## 5 B02012_001 Estimate!!Total
## 6 B02013_001 Estimate!!Total
##                                                                                           concept
## 1                                      WHITE ALONE OR IN COMBINATION WITH ONE OR MORE OTHER RACES
## 2                  BLACK OR AFRICAN AMERICAN ALONE OR IN COMBINATION WITH ONE OR MORE OTHER RACES
## 3          AMERICAN INDIAN AND ALASKA NATIVE ALONE OR IN COMBINATION WITH ONE OR MORE OTHER RACES
## 4                                      ASIAN ALONE OR IN COMBINATION WITH ONE OR MORE OTHER RACES
## 5 NATIVE HAWAIIAN AND OTHER PACIFIC ISLANDER ALONE OR IN COMBINATION WITH ONE OR MORE OTHER RACES
## 6                            SOME OTHER RACE ALONE OR IN COMBINATION WITH ONE OR MORE OTHER RACES

# Ethnicity
v18 %>%
  filter(concept=='HISPANIC OR LATINO ORIGIN') %>%
  as.data.frame
##         name                                   label                   concept
## 1 B03003_001                         Estimate!!Total HISPANIC OR LATINO ORIGIN
## 2 B03003_002 Estimate!!Total!!Not Hispanic or Latino HISPANIC OR LATINO ORIGIN
## 3 B03003_003     Estimate!!Total!!Hispanic or Latino HISPANIC OR LATINO ORIGIN

# Total
v18 %>%
  filter(concept=='TOTAL POPULATION') %>%
  as.data.frame
##         name           label          concept
## 1 B01003_001 Estimate!!Total TOTAL POPULATION

# Query data elements
lu_race_long <- get_acs(geography='tract'
                          , variables=c('B01003_001', 'B03003_003', 'B02008_001', 'B02009_001', 'B02010_001', 'B02011_001', 'B02012_001', 'B02013_001')
                          , state='CA'
                          , year=2018
                          , geometry=FALSE
  )

dim(lu_race_long)
## [1] 64456     5

names(lu_race_long)
## [1] "GEOID"    "NAME"     "variable" "estimate" "moe"     

unique(lu_race_long$variable)
## [1] "B01003_001" "B02008_001" "B02009_001" "B02010_001" "B02011_001"
## [6] "B02012_001" "B02013_001" "B03003_003"

lu_race <- lu_race_long %>%
  group_by(GEOID) %>%
  summarize(Pop_Total=sum(estimate[variable=='B01003_001'])
          , Pct_Pop_White=sum(estimate[variable=='B02008_001']) / Pop_Total
          , Pct_Pop_Black=sum(estimate[variable=='B02009_001']) / Pop_Total
          , Pct_Pop_American_Indian=sum(estimate[variable=='B02010_001']) / Pop_Total            
          , Pct_Pop_Asian=sum(estimate[variable=='B02011_001']) / Pop_Total
          , Pct_Pop_Pacific_Islander=sum(estimate[variable=='B02012_001']) / Pop_Total
          , Pct_Pop_Other=sum(estimate[variable=='B02013_001']) / Pop_Total
          , Pct_Pop_Hispanic=sum(estimate[variable=='B03003_003']) / Pop_Total
            )

dim(lu_race)
## [1] 8057    9

# Join
uc_race <- uc_points_tract %>% 
  select(University, Address, GEOID) %>%
  left_join(lu_race, by='GEOID')

# Print
uc_race %>%
  as.data.frame
##         University                                          Address       GEOID
## 1      UC Berkeley                                     Berkeley, CA 06001400100
## 2        UC Irvine 510 E Peltason Dr. Irvine, California 92697-5700 06059062614
## 3   UC Los Angeles                            Los Angeles, CA 90095 06037265301
## 4        UC Merced                   5200 Lake Rd, Merced, CA 95343 06047001801
## 5     UC Riverside          900 University Ave, Riverside, CA 92521 06065046500
## 6     UC San Diego               9500 Gilman Dr, La Jolla, CA 92093 06073008305
## 7 UC San Francisco       505 Parnassus Ave, San Francisco, CA 94143 06075030102
## 8 UC Santa Barbara                          Santa Barbara, CA 93106 06083000900
## 9    UC Santa Cruz               1156 High St, Santa Cruz, CA 95064 06087100400
##                     geometry Pop_Total Pct_Pop_White Pct_Pop_Black
## 1 POINT (-122.2396 37.87535)      3115     0.7248796    0.05232745
## 2 POINT (-117.8456 33.64163)     17086     0.5653752    0.02680557
## 3 POINT (-118.4468 34.07088)     11235     0.5556742    0.04779706
## 4 POINT (-120.4326 37.34685)      5432     0.5427099    0.03589838
## 5 POINT (-117.3398 33.96371)      8754     0.4152387    0.14016450
## 6 POINT (-117.2432 32.87615)      1877     0.6105487    0.07032499
## 7 POINT (-122.4574 37.76307)      5255     0.7590866    0.03139867
## 8 POINT (-119.7027 34.42213)      3686     0.6706457    0.02414542
## 9 POINT (-122.0549 36.97738)     10299     0.6078260    0.05058744
##   Pct_Pop_American_Indian Pct_Pop_Asian Pct_Pop_Pacific_Islander Pct_Pop_Other
## 1             0.009951846    0.23210273              0.000000000   0.034349920
## 2             0.005677163    0.38622264              0.003921339   0.074154278
## 3             0.014508233    0.46995995              0.004361371   0.002225189
## 4             0.017304860    0.17746686              0.003129602   0.265095729
## 5             0.013022618    0.35435230              0.024560201   0.162668494
## 6             0.014917421    0.35109217              0.010122536   0.030367608
## 7             0.006089439    0.20761180              0.002664129   0.022264510
## 8             0.032013022    0.03635377              0.000000000   0.250135648
## 9             0.012525488    0.32119623              0.015438392   0.073502282
##   Pct_Pop_Hispanic
## 1       0.03338684
## 2       0.24007960
## 3       0.23230975
## 4       0.38494109
## 5       0.35857894
## 6       0.20618007
## 7       0.06698382
## 8       0.43271839
## 9       0.28546461

Here is sample code for the second case that includes respondents in a racial group only if they checked a single racial category:

v18 %>%
  filter(concept=='RACE') %>%
  as.data.frame
##          name
## 1  B02001_001
## 2  B02001_002
## 3  B02001_003
## 4  B02001_004
## 5  B02001_005
## 6  B02001_006
## 7  B02001_007
## 8  B02001_008
## 9  B02001_009
## 10 B02001_010
##                                                                                               label
## 1                                                                                   Estimate!!Total
## 2                                                                      Estimate!!Total!!White alone
## 3                                                  Estimate!!Total!!Black or African American alone
## 4                                          Estimate!!Total!!American Indian and Alaska Native alone
## 5                                                                      Estimate!!Total!!Asian alone
## 6                                 Estimate!!Total!!Native Hawaiian and Other Pacific Islander alone
## 7                                                            Estimate!!Total!!Some other race alone
## 8                                                                Estimate!!Total!!Two or more races
## 9                           Estimate!!Total!!Two or more races!!Two races including Some other race
## 10 Estimate!!Total!!Two or more races!!Two races excluding Some other race, and three or more races
##    concept
## 1     RACE
## 2     RACE
## 3     RACE
## 4     RACE
## 5     RACE
## 6     RACE
## 7     RACE
## 8     RACE
## 9     RACE
## 10    RACE

# Query data elements
lu_race_single_long <- get_acs(geography='tract'
                          , variables=c('B01003_001', 'B03003_003', 'B02001_001', 'B02001_002', 'B02001_003', 'B02001_004', 'B02001_005', 'B02001_006', 'B02001_007', 'B02001_008', 'B02001_009', 'B02001_010')
                          , state='CA'
                          , year=2018
                          , geometry=FALSE
  )

dim(lu_race_single_long)
## [1] 96684     5

names(lu_race_single_long)
## [1] "GEOID"    "NAME"     "variable" "estimate" "moe"     

unique(lu_race_single_long$variable)
##  [1] "B01003_001" "B02001_001" "B02001_002" "B02001_003" "B02001_004"
##  [6] "B02001_005" "B02001_006" "B02001_007" "B02001_008" "B02001_009"
## [11] "B02001_010" "B03003_003"

lu_race_single <- lu_race_single_long %>%
  group_by(GEOID) %>%
  summarize(Pop_Total=sum(estimate[variable=='B01003_001'])
            # , Pop_Total2=sum(estimate[variable=='B02001_001'])
          , Pct_Pop_White_Only=sum(estimate[variable=='B02001_002']) / Pop_Total
          , Pct_Pop_Black_Only=sum(estimate[variable=='B02001_003']) / Pop_Total
          , Pct_Pop_American_Indian_Only=sum(estimate[variable=='B02001_004']) / Pop_Total            
          , Pct_Pop_Asian_Only=sum(estimate[variable=='B02001_005']) / Pop_Total
          , Pct_Pop_Pacific_Islander_Only=sum(estimate[variable=='B02001_006']) / Pop_Total
          , Pct_Pop_Other=sum(estimate[variable=='B02001_007']) / Pop_Total
          , Pct_Pop_Two_Or_More=sum(estimate[variable=='B02001_008']) / Pop_Total
          , Pct_Pop_Hispanic=sum(estimate[variable=='B03003_003']) / Pop_Total
            )

dim(lu_race_single)
## [1] 8057   10

# Join
uc_race_single <- uc_points_tract %>% 
  select(University, Address, GEOID) %>%
  left_join(lu_race_single, by='GEOID')

# Print
uc_race_single %>%
  as.data.frame
##         University                                          Address       GEOID
## 1      UC Berkeley                                     Berkeley, CA 06001400100
## 2        UC Irvine 510 E Peltason Dr. Irvine, California 92697-5700 06059062614
## 3   UC Los Angeles                            Los Angeles, CA 90095 06037265301
## 4        UC Merced                   5200 Lake Rd, Merced, CA 95343 06047001801
## 5     UC Riverside          900 University Ave, Riverside, CA 92521 06065046500
## 6     UC San Diego               9500 Gilman Dr, La Jolla, CA 92093 06073008305
## 7 UC San Francisco       505 Parnassus Ave, San Francisco, CA 94143 06075030102
## 8 UC Santa Barbara                          Santa Barbara, CA 93106 06083000900
## 9    UC Santa Cruz               1156 High St, Santa Cruz, CA 95064 06087100400
##                     geometry Pop_Total Pct_Pop_White_Only Pct_Pop_Black_Only
## 1 POINT (-122.2396 37.87535)      3115          0.6812199         0.04109149
## 2 POINT (-117.8456 33.64163)     17086          0.5149245         0.01984081
## 3 POINT (-118.4468 34.07088)     11235          0.4753004         0.02999555
## 4 POINT (-120.4326 37.34685)      5432          0.5154639         0.02172312
## 5 POINT (-117.3398 33.96371)      8754          0.3516107         0.09355723
## 6 POINT (-117.2432 32.87615)      1877          0.5524774         0.03356420
## 7 POINT (-122.4574 37.76307)      5255          0.7326356         0.02188392
## 8 POINT (-119.7027 34.42213)      3686          0.6573521         0.02414542
## 9 POINT (-122.0549 36.97738)     10299          0.5383047         0.03349840
##   Pct_Pop_American_Indian_Only Pct_Pop_Asian_Only Pct_Pop_Pacific_Islander_Only
## 1                 0.0000000000         0.19229535                  0.0000000000
## 2                 0.0007608568         0.34226852                  0.0013461313
## 3                 0.0023141967         0.40338229                  0.0007120605
## 4                 0.0060751105         0.16678940                  0.0031296024
## 5                 0.0060543751         0.30877313                  0.0124514508
## 6                 0.0117208311         0.30687267                  0.0000000000
## 7                 0.0000000000         0.20057088                  0.0000000000
## 8                 0.0320130222         0.02306023                  0.0000000000
## 9                 0.0025245169         0.28255170                  0.0059229051
##   Pct_Pop_Other Pct_Pop_Two_Or_More Pct_Pop_Hispanic
## 1   0.031781701          0.05361156       0.03338684
## 2   0.063151118          0.05770807       0.24007960
## 3   0.002225189          0.08607032       0.23230975
## 4   0.251840943          0.03497791       0.38494109
## 5   0.130340416          0.09721270       0.35857894
## 6   0.014384656          0.08098029       0.20618007
## 7   0.018458611          0.02645100       0.06698382
## 8   0.250135648          0.01329354       0.43271839
## 9   0.059131955          0.07806583       0.28546461
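
The two summaries can be compared directly. The sketch below copies the values for the UC Berkeley tract (GEOID 06001400100) from the two printed tables and sums the race shares under each approach; the vector names are illustrative. Hispanic ethnicity is left out of both sums because it comes from a separate question.

```r
# UC Berkeley tract, first case: alone or in combination
in_combination <- c(White = 0.7248796, Black = 0.05232745,
                    American_Indian = 0.009951846, Asian = 0.23210273,
                    Pacific_Islander = 0, Other = 0.034349920)

# UC Berkeley tract, second case: single race only, plus two or more races
single <- c(White_Only = 0.6812199, Black_Only = 0.04109149,
            American_Indian_Only = 0, Asian_Only = 0.19229535,
            Pacific_Islander_Only = 0, Other = 0.031781701,
            Two_Or_More = 0.05361156)

# Single-race groups together with "two or more" cover each person once
sum(single)

# Multi-race respondents are counted once per group they checked
sum(in_combination)
```

In the second case the shares account for the whole tract population (up to rounding in the printed values), whereas in the first case the total exceeds 1 because respondents of two or more races appear in several groups.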

Example: CCC District Service Areas

The Foundation for California Community Colleges maintains a district service area map and makes the shapefiles available here. One could download the shapefiles and load them into R, Tableau, or other visualization or GIS platforms.

One use of these data is to quantify the number of students that a district serves from outside its service area.
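
As a rough sketch of that idea using the sf package, the code below counts geocoded points that fall outside a polygon. The unit-square "district" and the four student locations are hypothetical stand-ins. With the real data, one would presumably read the downloaded shapefile with `st_read()` and build the points from geocoded student addresses.

```r
library(sf)

# Hypothetical service area: a unit square standing in for a district polygon
district <- st_sf(
  district = 'District A',
  geometry = st_sfc(st_polygon(list(
    rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))
  )))
)

# Hypothetical geocoded students (planar coordinates for simplicity)
students <- st_as_sf(
  data.frame(id = 1:4, x = c(0.2, 0.8, 1.5, -0.3), y = c(0.5, 0.9, 0.5, 0.2)),
  coords = c('x', 'y')
)

# st_within() returns, for each point, the indices of polygons that contain it
inside <- lengths(st_within(students, district)) > 0

# Number of students residing outside the service area
sum(!inside)
```

With real data, both layers need to share the same coordinate reference system before the comparison, for example by applying `st_transform()` to one of them.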

Appendix: R and R Package Versions

This vignette was generated using an R session with the following packages. Readers replicating the code may see some discrepancies caused by version mismatches.

sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] tigris_1.4.1         sf_1.0-2             stringr_1.4.0       
##  [4] geodist_0.0.7        ggrepel_0.9.1        maps_3.3.0          
##  [7] ggplot2_3.3.3        dplyr_1.0.6          tidygeocoder_1.0.3  
## [10] IRexamples_0.0.2     RevoUtils_11.0.2     RevoUtilsMath_11.0.0
## 
## loaded via a namespace (and not attached):
##   [1] uuid_0.1-4           plyr_1.8.6           igraph_1.2.6        
##   [4] sp_1.4-5             splines_4.0.2        crosstalk_1.1.1     
##   [7] rstantools_2.1.1     inline_0.3.19        digest_0.6.27       
##  [10] htmltools_0.5.1.1    rsconnect_0.8.18     fansi_0.5.0         
##  [13] RSelenium_1.7.7      magrittr_2.0.1       readr_1.4.0         
##  [16] RcppParallel_5.1.4   matrixStats_0.59.0   xts_0.12.1          
##  [19] askpass_1.1          prettyunits_1.1.1    jpeg_0.1-8.1        
##  [22] colorspace_2.0-1     rvest_1.0.1          rappdirs_0.3.3      
##  [25] mitools_2.4          rgdal_1.5-23         callr_3.7.0         
##  [28] crayon_1.4.1         jsonlite_1.7.2       lme4_1.1-27         
##  [31] survival_3.2-11      zoo_1.8-9            glue_1.4.2          
##  [34] gtable_0.3.0         MatrixModels_0.5-0   V8_3.4.2            
##  [37] DisImpact_0.0.15     pkgbuild_1.2.0       rstan_2.21.2        
##  [40] semver_0.2.0         scales_1.1.1         DBI_1.1.1           
##  [43] ggthemes_4.2.4       miniUI_0.1.1.1       Rcpp_1.0.7          
##  [46] xtable_1.8-4         units_0.7-2          foreign_0.8-81      
##  [49] proxy_0.4-26         stats4_4.0.2         StanHeaders_2.21.0-7
##  [52] survey_4.0           DT_0.18              htmlwidgets_1.5.3   
##  [55] httr_1.4.2           threejs_0.3.3        RColorBrewer_1.1-2  
##  [58] ellipsis_0.3.2       farver_2.1.0         pkgconfig_2.0.3     
##  [61] loo_2.4.1            XML_3.99-0.6         utf8_1.2.1          
##  [64] labeling_0.4.2       tidyselect_1.1.1     rlang_0.4.11        
##  [67] reshape2_1.4.4       later_1.2.0          munsell_0.5.0       
##  [70] tools_4.0.2          xgboost_1.4.1.1      cli_2.5.0           
##  [73] generics_0.1.0       ggridges_0.5.3       fastmap_1.1.0       
##  [76] binman_0.1.2         processx_3.5.2       caTools_1.18.2      
##  [79] purrr_0.3.4          twang_2.3            nlme_3.1-152        
##  [82] mime_0.10            rstanarm_2.21.1      xml2_1.3.2          
##  [85] tidycensus_1.0       rstudioapi_0.13      compiler_4.0.2      
##  [88] bayesplot_1.8.1      shinythemes_1.2.0    curl_4.3.1          
##  [91] png_0.1-7            e1071_1.7-8          tibble_3.1.2        
##  [94] stringi_1.4.6        ps_1.6.0             forcats_0.5.1       
##  [97] lattice_0.20-44      Matrix_1.3-4         classInt_0.4-3      
## [100] nloptr_1.2.2.2       markdown_1.1         shinyjs_2.0.0       
## [103] gbm_2.1.8            vctrs_0.3.8          pillar_1.6.1        
## [106] lifecycle_1.0.0      data.table_1.14.0    bitops_1.0-7        
## [109] maptools_1.1-1       httpuv_1.6.1         wdman_0.2.5         
## [112] R6_2.5.0             latticeExtra_0.6-29  promises_1.2.0.1    
## [115] KernSmooth_2.23-20   gridExtra_2.3        codetools_0.2-18    
## [118] boot_1.3-28          colourpicker_1.1.0   MASS_7.3-54         
## [121] gtools_3.9.2         assertthat_0.2.1     openssl_1.4.4       
## [124] withr_2.4.2          shinystan_2.5.0      parallel_4.0.2      
## [127] hms_1.1.0            grid_4.0.2           tidyr_1.1.3         
## [130] class_7.3-19         minqa_1.2.4          shiny_1.6.0         
## [133] base64enc_0.1-3      dygraphs_1.1.1.6    
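
To check a local installation against the versions listed above, something along the following lines could be used. It is only a sketch: the four packages chosen and the object names `reference`, `installed`, and `status` are illustrative, and the reference versions are copied from the session output above.

```r
# A few versions taken from the sessionInfo() output above
reference <- c(dplyr = '1.0.6', sf = '1.0-2',
               tidygeocoder = '1.0.3', tidycensus = '1.0')

# Look up the installed version of each package, or NA if it is not installed
installed <- vapply(names(reference), function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    as.character(packageVersion(pkg))
  } else {
    NA_character_
  }
}, character(1))

# compareVersion() returns -1, 0, or 1: older than, equal to, or newer than
# the reference version
status <- mapply(function(have, want) {
  if (is.na(have)) NA_integer_ else utils::compareVersion(have, want)
}, installed, reference)

data.frame(package = names(reference), reference, installed, status,
           row.names = NULL)
```

A `status` of 1 indicates a newer local version than the one used here, which is the most likely source of any differences in output.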
