Guerry data: Multivariate Analysis

Michael Friendly

2021-07-29

André-Michel Guerry’s (1833) Essai sur la Statistique Morale de la France collected data on crimes, suicide, literacy and other “moral statistics” for various départements in France. He provided the first real social data analysis, using graphics and maps to summarize this multivariate dataset. One of his main goals in this ground-breaking study was to determine if the prevalence of crime in France could be explained by other social variables.

In 1833, the scatterplot had not yet been invented; the idea of a correlation or a regression was still 50 years in the future (Galton 1886). Guerry displayed his data in shaded choropleth maps and semi-graphic tables and argued how these could be seen as implying systematic, lawful relations among moral variables.

In this analysis, we ignore the spatial context of the départements and focus on multivariate analyses of the data set.

Load the data and packages

library(Guerry)
library(car)
#> Loading required package: carData
library(ggplot2)
library(ggrepel)
data(Guerry)

Guerry data set

Guerry’s (1833) data consisted of six main moral variables shown in the table below. He wanted all of these to be recorded on aligned scales so that larger numbers consistently reflected “morally better”. Thus, four of the variables are recorded in the inverse form, “Population per …”.

Name        Description
Crime_pers  Population per crime against persons
Crime_prop  Population per crime against property
Literacy    Percent of military conscripts who can read and write
Donations   Donations to the poor
Infants     Population per illegitimate birth
Suicides    Population per suicide
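
Because these variables are stored as "population per event", a larger value means a rarer event. The following is a minimal sketch of converting such an inverse measure into a more familiar rate per 100,000 population. The values are made up for illustration and are not taken from the Guerry data.

```r
# Hypothetical "population per crime" values for three départements
# (illustrative only; not taken from the Guerry data)
pop_per_crime <- c(A = 5000, B = 20000, C = 40000)

# Invert the scale: events per 100,000 population
rate_per_100k <- 1e5 / pop_per_crime
rate_per_100k   # A = 20, B = 5, C = 2.5

# The ordering reverses: the largest "population per crime"
# corresponds to the lowest rate
same_unit <- names(which.max(pop_per_crime)) == names(which.min(rate_per_100k))
same_unit
```

Read this way, Guerry's "morally better" direction (larger numbers) corresponds to lower rates of crime, illegitimate births and suicide.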

Beyond its first nine columns, the Guerry data set also contains these variables:

names(Guerry)[-(1:9)]
#>  [1] "MainCity"        "Wealth"          "Commerce"        "Clergy"         
#>  [5] "Crime_parents"   "Infanticide"     "Donation_clergy" "Lottery"        
#>  [9] "Desertion"       "Instruction"     "Prostitutes"     "Distance"       
#> [13] "Area"            "Pop1831"

Guerry’s questions

The main question that concerned Guerry was whether indicators of crime could be shown to be related to factors that might be considered to ameliorate crime. Among these, Guerry focused most on Literacy, defined as the percentage of military conscripts who could do more than mark an “X” on their enrollment form. Other potential explanatory variables are:

Donations (a measure of donations to the poor),
Clergy (the rank of the number of Catholic priests in active service, per population).

Bivariate relationships

Let’s start with plots of crime (Crime_pers and Crime_prop) in relation to Literacy. A simple scatterplot is not very informative.

ggplot(aes(x=Literacy, y=Crime_pers/1000), data=Guerry) +
  geom_point(size=2) 

More useful scatterplots are annotated with additional statistical summaries to aid interpretation: a linear regression line, a smoothed loess curve, data ellipses, and labels for unusual points.

I use ggplot2 here. It provides most of these features directly; to label unusual points, I calculate the squared Mahalanobis distance of each point from the grand means.

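# keep the two plotted variables plus the label column
gdf <- Guerry[, c("Literacy", "Crime_pers", "Department")]
# squared Mahalanobis distance of each point from the bivariate mean
gdf$dsq <- mahalanobis(gdf[,1:2], colMeans(gdf[,1:2]), cov(gdf[,1:2]))

ggplot(aes(x=Literacy, y=Crime_pers/1000, label=Department), data=gdf) +
  geom_point(size=2) +
  # 68% and 95% data ellipses
  stat_ellipse(level=0.68, color="blue", size=1.2) +  
  stat_ellipse(level=0.95, color="gray", size=1, linetype=2) + 
  # linear regression line with confidence band
  geom_smooth(method="lm", formula=y~x, fill="lightblue") +
  # loess smooth, without a confidence band
  geom_smooth(method="loess", formula=y~x, color="red", se=FALSE) +
  # label only the points with large distances
  geom_label_repel(data = gdf[gdf$dsq > 4.6,]) +
  theme_bw()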
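
The text does not say where the labeling cutoff of 4.6 comes from. One plausible reading, offered here as an assumption, is that it is the 0.90 quantile of a chi-square distribution with 2 degrees of freedom, the approximate reference distribution for squared Mahalanobis distances of bivariate normal data. The sketch below checks this on simulated data (not Guerry's), using only base R.

```r
# Assumption: 4.6 is read as the 0.90 quantile of chi-square with 2 df
cutoff <- qchisq(0.90, df = 2)
round(cutoff, 1)   # 4.6

# Simulated bivariate normal data, for illustration only
set.seed(123)
xy  <- cbind(x = rnorm(1000), y = rnorm(1000))
dsq <- mahalanobis(xy, colMeans(xy), cov(xy))

# Share of points beyond the cutoff; expected to be near 0.10
share_flagged <- mean(dsq > cutoff)
share_flagged
```

Under this reading, roughly one point in ten would be labeled if the data were bivariate normal.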

Doing the same for crimes against property:

gdf <- Guerry[, c("Literacy", "Crime_prop", "Department")]
gdf$dsq <- mahalanobis(gdf[,1:2], colMeans(gdf[,1:2]), cov(gdf[,1:2]))

ggplot(aes(x=Literacy, y=Crime_prop/1000, label=Department), data=gdf) +
  geom_point(size=2) +
  stat_ellipse(level=0.68, color="blue", size=1.2) +  
  stat_ellipse(level=0.95, color="gray", size=1, linetype=2) + 
  geom_smooth(method="lm", formula=y~x, fill="lightblue") +
  geom_smooth(method="loess", formula=y~x, color="red", se=FALSE) +
  geom_label_repel(data = gdf[gdf$dsq > 4.6,]) +
  theme_bw()
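
The two plotting blocks differ only in the response variable, so the step that selects points to label could be factored into a small helper. The function below, `unusual_points()`, is a hypothetical name introduced for illustration. It uses only base R and is shown on a made-up data frame rather than the Guerry data.

```r
# Hypothetical helper: return the rows whose squared Mahalanobis distance,
# computed from two numeric columns, exceeds a cutoff
unusual_points <- function(df, xvar, yvar, cutoff = 4.6) {
  xy  <- as.matrix(df[, c(xvar, yvar)])
  dsq <- mahalanobis(xy, colMeans(xy), cov(xy))
  df[dsq > cutoff, , drop = FALSE]
}

# Made-up data: 30 ordinary points plus one obvious outlier
set.seed(42)
toy <- data.frame(
  x    = c(rnorm(30), 8),
  y    = c(rnorm(30), -8),
  name = c(paste0("p", 1:30), "outlier")
)

flagged <- unusual_points(toy, "x", "y")
"outlier" %in% flagged$name
```

With the Guerry data, a call such as `unusual_points(Guerry, "Literacy", "Crime_prop")` could then be supplied as the `data` argument of `geom_label_repel()`.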

Galton, Francis. 1886. “Regression Towards Mediocrity in Hereditary Stature.” Journal of the Anthropological Institute 15: 246–63.