André-Michel Guerry’s (1833) Essai sur la Statistique Morale de la France collected data on crimes, suicide, literacy and other “moral statistics” for various départements in France. He provided the first real social data analysis, using graphics and maps to summarize this multivariate dataset. One of his main goals in this ground-breaking study was to determine if the prevalence of crime in France could be explained by other social variables.
In 1833, the scatterplot had not yet been invented; the idea of a correlation or a regression was still 50 years in the future (Galton 1886). Guerry displayed his data in shaded choropleth maps and semi-graphic tables and argued how these could be seen as implying systematic, lawful relations among moral variables.
In this analysis, we ignore the spatial context of the départements and focus on multivariate analyses of the the data set.
Guerry
data setGuerry’s (1833) data consisted of six main moral variables shown in the table below. He wanted all of these to be recorded on aligned scales so that larger numbers consistently reflected “morally better”. Thus, four of the variables are recorded in the inverse form, “Population per …”.
Name | Description |
---|---|
Crime_pers |
Population per crime against persons |
Crime_prop |
Population per crime against property |
Literacy |
Percent of military conscripts who can read and write |
Donations |
Donations to the poor |
Infants |
Population per illegitimate birth |
Suicides |
Population per suicide |
The Guerry
data set also contains:
dept
and Department
, the French ID numbers and names for the 86 départements of metropolitan France in 1830, including Corsica.Region
: a factor with main levels “N”, “S”, “E”, “W”. Corsica is coded as NA
.?Guerry
for their precise definitions.The main questions that concerned Guerry were whether indicators of crime could be shown to be related to factors which might be considered to ameliorate crime. Among these, Guerry focused most on Literacy
defined as the number of military conscripts who could do more than mark an “X” on their enrollment form. Other potential explanatory variables are:
Donations
(a measure of donations to the poor),Clergy
(the rank of number of Catholic priests in active service, per population)
Let’s start with plots of crime (Crime_pers
and Crime_prop
) in relation to Literacy
. A simple scatterplot is not very informative.
More useful scatterplots are annotated with additional statistical summaries to aid interpretation:
I use ggplot2
here. It provides most of these features, except that to label unusual points, I calculate the mahalanobis squared distance of all points from the grand means.
gdf <- Guerry[, c("Literacy", "Crime_pers", "Department")]
gdf$dsq <- mahalanobis(gdf[,1:2], colMeans(gdf[,1:2]), cov(gdf[,1:2]))
ggplot(aes(x=Literacy, y=Crime_pers/1000, label=Department), data=gdf) +
geom_point(size=2) +
stat_ellipse(level=0.68, color="blue", size=1.2) +
stat_ellipse(level=0.95, color="gray", size=1, linetype=2) +
geom_smooth(method="lm", formula=y~x, fill="lightblue") +
geom_smooth(method="loess", formula=y~x, color="red", se=FALSE) +
geom_label_repel(data = gdf[gdf$dsq > 4.6,]) +
theme_bw()
Doing the same for crimes against property:
gdf <- Guerry[, c("Literacy", "Crime_prop", "Department")]
gdf$dsq <- mahalanobis(gdf[,1:2], colMeans(gdf[,1:2]), cov(gdf[,1:2]))
ggplot(aes(x=Literacy, y=Crime_prop/1000, label=Department), data=gdf) +
geom_point(size=2) +
stat_ellipse(level=0.68, color="blue", size=1.2) +
stat_ellipse(level=0.95, color="gray", size=1, linetype=2) +
geom_smooth(method="lm", formula=y~x, fill="lightblue") +
geom_smooth(method="loess", formula=y~x, color="red", se=FALSE) +
geom_label_repel(data = gdf[gdf$dsq > 4.6,]) +
theme_bw()
Galton, Francis. 1886. “Regression Towards Mediocrity in Hereditary Stature.” Journal of the Anthropological Institute 15: 246–63.