To run the SOM algorithm, you first have to load the SOMbrero package. The function used to run it is called trainSOM() and is detailed below.
This documentation only considers the case of contingency tables.
The trainSOM function has several arguments, but only the first one is required. This argument is x.data, which is the dataset used to train the SOM. In this documentation, it is passed to the function as a matrix or a data frame. This set must be a contingency table, i.e., it must contain only 0 or positive integers. Column and row names must be supplied.
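The requirements above can be checked before training. The following is a minimal sketch in base R; `is_contingency_table` is a hypothetical helper written for this illustration and is not part of SOMbrero, and the toy table is made up.

```r
# Hypothetical helper (not part of SOMbrero): checks the requirements listed
# above, i.e. non-negative integer counts plus row and column names.
is_contingency_table <- function(x) {
  m <- as.matrix(x)
  is.numeric(m) &&
    !anyNA(m) &&
    all(m >= 0) &&
    all(m == round(m)) &&
    !is.null(rownames(m)) &&
    !is.null(colnames(m))
}

# Toy table: 3 districts (rows) by 2 candidates (columns)
votes <- matrix(c(120, 0, 45, 30, 80, 10), nrow = 3,
                dimnames = list(c("d1", "d2", "d3"), c("A", "B")))

is_contingency_table(votes)          # TRUE
is_contingency_table(votes / 7)      # FALSE: non-integer counts
is_contingency_table(unname(votes))  # FALSE: no row or column names
```

Such a check could be run on the data set before calling trainSOM, so that a malformed table is caught early.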
The other arguments are the same as the arguments passed to the initSOM function (they are parameters defining the algorithm, see help(initSOM) for further details).
The trainSOM function returns an object of class somRes (see help(trainSOM) for further details on this class).
presidentielles2002 data set
The presidentielles2002 data set provides the number of votes in the first round of the 2002 French presidential election for each of the 16 candidates in all of the 106 French administrative districts called “departements”. Further details about this data set and the 2002 French presidential election are given with help(presidentielles2002).
data(presidentielles2002)
apply(presidentielles2002, 2, sum)
## MEGRET LEPAGE GLUCKSTEIN BAYROU CHIRAC LE_PEN
## 667043 535875 132696 1949219 5666021 4804772
## TAUBIRA SAINT_JOSSE MAMERE JOSPIN BOUTIN HUE
## 660515 1204801 1495774 4610267 339157 960548
## CHEVENEMENT MADELIN LAGUILLER BESANCENOT
## 1518568 1113551 1630118 1210562
(The two candidates who reached the second round of the election were Jacques Chirac and the far-right candidate Jean-Marie Le Pen.)
set.seed(4031719)
korresp.som <- trainSOM(x.data=presidentielles2002, dimension=c(8,8),
type="korresp", scaling="chi2", nb.save=10)
korresp.som
## Self-Organizing Map object...
## online learning, type: korresp
## 8 x 8 grid with square topology
## neighbourhood type: gaussian
## distance type: euclidean
As the energy is recorded during the intermediate backups, we can have a look at its evolution:
plot(korresp.som, what="energy")
The energy is stabilized during the last 100 iterations.
The clustering component contains the final classification of the dataset. As both row and column variables are classified, the length of the resulting vector is equal to the sum of the number of rows and the number of columns.
NB: The clustering component shows first the column variables (here, the candidates) and then the row variables (here, the departements).
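As a minimal sketch of this ordering, the following base R code splits a clustering vector into its column part and its row part. The table `tab` and the vector `clustering` are toy stand-ins invented for the illustration; with real objects they would presumably be the training data and the clustering component.

```r
# Toy stand-ins: a 3 x 2 contingency table and a clustering vector ordered
# as described above (column variables first, then row variables).
tab <- matrix(c(5, 0, 2, 1, 7, 3), nrow = 3,
              dimnames = list(c("d1", "d2", "d3"), c("A", "B")))
clustering <- c(A = 4, B = 1, d1 = 4, d2 = 1, d3 = 2)

# The length is the number of rows plus the number of columns
length(clustering) == nrow(tab) + ncol(tab)

# First ncol(tab) entries: column variables; remaining entries: row variables
col_clusters <- clustering[seq_len(ncol(tab))]
row_clusters <- clustering[-seq_len(ncol(tab))]
```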
The following table indicates which graphics are available for a korresp SOM.
Type | Energy | Obs | Prototypes | Add | Super Cluster |
---|---|---|---|---|---|
no type | x | ||||
hitmap | x | x | |||
color | x2 | x2 | |||
lines | x2 | x2 | |||
barplot | x | ||||
radar | x | ||||
pie | |||||
boxplot | |||||
3d | x2 | ||||
poly.dist | x | x | |||
umatrix | x | ||||
smooth.dist | x | ||||
words | |||||
names | x | ||||
graph | |||||
mds | x | x | |||
grid.dist | x | ||||
grid | x | ||||
dendrogram | x | ||||
dendro3d | x |
In the column “Prototypes”, a plot marked “x2” means that this plot is available for both row and column variables. In the “Super Cluster” column, a “x2” cell means the plot is available for both data set variables and additional variables.
korresp.som$clustering
## MEGRET LEPAGE GLUCKSTEIN
## 8 8 8
## BAYROU CHIRAC LE_PEN
## 3 57 61
## TAUBIRA SAINT_JOSSE MAMERE
## 8 8 5
## JOSPIN BOUTIN HUE
## 33 8 8
## CHEVENEMENT MADELIN LAGUILLER
## 6 8 24
## BESANCENOT ain aisne
## 8 61 61
## allier alpes_de_haute_provence hautes_alpes
## 59 57 57
## alpes_maritimes ardeche ardennes
## 56 59 58
## ariege aube aude
## 57 58 59
## aveyron bouches_du_rhone calvados
## 58 32 62
## cantal charente charente_maritime
## 57 59 61
## cher correze corse_sud
## 58 57 57
## haute_corse cote_d'or cotes_d'armor
## 57 61 44
## creuse dordogne doubs
## 57 60 61
## drome eure eure_et_loir
## 61 61 59
## finistere gard haute_garonne
## 29 63 48
## gers gironde herault
## 57 40 64
## ille_et_vilaine indre indre_et_loire_
## 19 57 61
## isere jura landes
## 48 58 59
## loir_et_cher loire haute_loire
## 59 63 57
## loire_atlantique loiret lot
## 13 62 57
## lot_et_garonne_ lozere maine_et_loire_
## 59 57 27
## manche marne haute_marne
## 60 61 57
## mayenne meurthe_et_moselle meuse
## 57 63 57
## morbihan moselle nievre
## 36 56 57
## nord oise orne
## 24 63 58
## pas_de_calais puy_de_dome pyrenees_atlantiques
## 32 62 62
## hautes_pyrenees pyrenees_orientales bas_rhin
## 57 60 2
## haut_rhin rhone haute_saone
## 63 4 57
## saone_et_loire_ sarthe savoie
## 61 61 59
## haute_savoie paris seine_maritime_
## 62 4 40
## seine_et_marne_ yvelines deux_sevres
## 48 3 58
## somme tarn tarn_et_garonne
## 61 59 57
## var vaucluse vendee
## 64 61 44
## vienne haute_vienne vosges
## 59 58 59
## yonne territoire_de_belfort essonne
## 58 57 1
## hauts_de_seine_ seine_saint-denis val_de_marne
## 3 56 1
## val_d'oise guadeloupe martinique
## 64 57 57
## guyane la_reunion mayotte
## 57 33 57
## nouvelle_caledonie polynesie_francaise saint_pierre_et_miquelon
## 57 57 57
## wallis_et_futuna francais_de_l'etranger
## 57 49
The resulting distribution of the clustering on the map can also be visualized by a hitmap:
plot(korresp.som, what="obs", type="hitmap")
For a more precise view, the "names" plot is implemented: it prints, in each neuron, the names of the variables assigned to it; in the korresp SOM, both row and column variable names are printed.
plot(korresp.som, what="obs", type="names", scale=c(0.9,0.5))
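The same information can be obtained as text. The sketch below, in base R, groups variable names by neuron from a toy clustering vector (made-up values, not the output of the map trained above).

```r
# Toy clustering vector: names are variables, values are neuron numbers
clustering <- c(A = 4, B = 1, d1 = 4, d2 = 1, d3 = 2)

# One list element per non-empty neuron, holding the names assigned to it
names_by_neuron <- split(names(clustering), clustering)
names_by_neuron
```

Empty neurons do not appear in the result, since split() only creates groups for values that occur in the vector.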
The map is organized as follows: the bottom left side of the map is associated with the candidate “Taubira”, who obtained her best scores in the overseas departements “Guadeloupe”, “Martinique” and “Guyane”.
This area is opposed to the top left hand side of the map (cluster 8), which is associated with the far-right candidates “Le Pen” and “Megret”, who traditionally obtain higher scores in some departements of the South of France, such as “Vaucluse”, and in some north-eastern departements, such as “Haut Rhin”. The top right hand side of the map is composed of clusters characterized by far-left candidates (“HUE”, “LAGUILLER”, “BESANCENOT”); the map then progressively moves to the traditional left candidates in its right part (“JOSPIN”) and finally to the traditional right candidates in its bottom right corner (“CHIRAC”, “BAYROU”). Note that the vote for far-right candidates is more similar to the vote for far-left candidates than to the vote for traditional right candidates. The cluster containing the largest number of departements is cluster 8, at the top left corner of the map, which is also Le Pen's cluster: in this election, the far-right candidate reached the second round of the presidential election for the first time.
Some graphics from the numeric SOM algorithm are still available in the korresp case. They are detailed below. As the resulting clustering provides the classification for both rows and columns, a new argument, view, is used to specify which one should be considered. Its possible values are either "r" for row variables (the default value) or "c" for column variables.
Three representations are available:
view argument is used)
# plot the row prototypes (106 French departements)
plot(korresp.som, what="prototypes", type="lines", view="r", print.title=TRUE)
# plot the column prototypes (16 candidates)
plot(korresp.som, what="prototypes", type="lines", view="c", print.title=TRUE)
The peaks in neurons 1, 2 and 9 correspond, in the row view, to the overseas departements and, in the column view, to the candidate “Taubira”. In the column view, the two peaks clearly identified in the right side clusters correspond to the two “main” traditional candidates “Jospin” and “Chirac” (respectively, left and right candidates).
A more precise individual view is given by the graphics “color” and “3d”, drawn here, as an example, for the candidate “Le Pen” and for the departement “Martinique”.
variable) is represented on the map; "color".
par(mfrow=c(1,2))
plot(korresp.som, what="prototypes", type="color", variable="LE_PEN")
plot(korresp.som, what="prototypes", type="3d", variable="martinique")
The first graphic shows that “Le Pen” obtained his best scores in the departements located in the top left hand side of the map and his lowest scores in the departements located in the bottom left side of the map (overseas departements).
The second graphic shows that the candidates who obtained the highest scores in Martinique are located in the bottom right hand side of the map (mainly Taubira).
The graphics can also be drawn by giving the variable number and its type, either “r” or “c” (here, as an example, “Chirac” which is the 5th candidate):
par(mfrow=c(1,2))
plot(korresp.som, what="prototypes", type="color", variable=5, view="c")
plot(korresp.som, what="prototypes", type="3d", variable=5, view="c")
Hence “Chirac” is located at the bottom right corner of the map and more generally in the bottom of the map (he traditionally also has high scores in the overseas departements).
These graphics are exactly the same as in the numerical case:
"poly.dist" represents the distances between neighboring prototypes with polygons plotted for each cell of the grid. The smaller the distance between a polygon's vertex and a cell border, the closer the pair of prototypes. The color indicates the number of observations in the neuron (white=empty);
"umatrix" fills the neurons of the grid using colors that represent the average distance between the current prototype and its neighbors;
"smooth.dist" plots the mean distance between the current prototype and its neighbors with a color gradation;
"mds" plots the number of the neuron on a map according to a Multi Dimensional Scaling (MDS) projection;
"grid.dist" plots a point for each pair of prototypes, with x coordinates representing the distance between the prototypes in the input space, and y coordinates representing the distance between the corresponding neurons on the grid.
plot(korresp.som, what="prototypes", type="poly.dist", print.title=TRUE)
plot(korresp.som, what="prototypes", type="umatrix", print.title=TRUE)
plot(korresp.som, what="prototypes", type="smooth.dist", print.title=TRUE)
plot(korresp.som, what="prototypes", type="mds")
plot(korresp.som, what="prototypes", type="grid.dist")
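To make the quantities behind "grid.dist" and "umatrix" concrete, here is a minimal base R sketch on toy data. The prototypes are random numbers, not SOMbrero output, and treating "neighbors" as neurons at grid distance 1 is an assumption of this sketch; the package may define the neighborhood differently.

```r
set.seed(1)
# Toy 3 x 3 grid with random 4-dimensional prototypes (illustrative data)
grid_xy <- as.matrix(expand.grid(x = 1:3, y = 1:3))
prototypes <- matrix(runif(9 * 4), nrow = 9)

d_input <- dist(prototypes)  # distances between prototypes in the input space
d_grid  <- dist(grid_xy)     # distances between the matching neurons on the grid

# "grid.dist"-like scatter plot: one point per pair of prototypes
plot(as.vector(d_input), as.vector(d_grid),
     xlab = "distance between prototypes", ylab = "distance on the grid")

# "umatrix"-like summary: mean distance from each prototype to its grid
# neighbors (assumption: neighbors are the neurons at grid distance 1)
d_input_m <- as.matrix(d_input)
is_neighbor <- as.matrix(d_grid) == 1
mean_neighbor_dist <- sapply(seq_len(nrow(prototypes)), function(i) {
  mean(d_input_m[i, is_neighbor[i, ]])
})
```

On a well-organized map, the scatter plot tends to show prototype distance increasing with grid distance, and neurons with a large mean neighbor distance stand out as borders between groups.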
Three neurons (1, 9 and 2) have already been picked out in the section Clustering interpretation for having prototypes rather different from the rest of the map. The graphics just above confirm this hypothesis: there is a noticeable peak in prototype distances around these three neurons. The MDS visualization also shows that these three prototypes are clearly different.
quality(korresp.som)
## $topographic
## [1] 0.009433962
##
## $quantization
## [1] 73196.03
By default, the quality function calculates both quantization and topographic errors. It is also possible to specify which one you want to obtain, by using the argument quality.type.
The topographic error value varies between 0 (good projection quality) and 1 (poor projection quality). Here, the topographic quality of the mapping is quite good with a topographic error equal to 0.009.
The quantization error is an unbounded positive number. The closer it is to 0, the better the projection quality.
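The two measures can be sketched in base R as follows. This is an illustration using common textbook definitions on toy data, not the package's implementation: the quantization error is taken here as the mean squared distance between an observation and its closest prototype, and the topographic error as the share of observations whose two closest prototypes are not adjacent on the grid (adjacency meaning grid distance at most 1, an assumption of this sketch).

```r
set.seed(2)
grid_xy <- as.matrix(expand.grid(x = 1:3, y = 1:3))  # 3 x 3 toy grid
prototypes <- matrix(runif(9 * 2), nrow = 9)         # illustrative prototypes
obs <- matrix(runif(20 * 2), nrow = 20)              # illustrative observations

# Squared Euclidean distance from every observation (rows) to every
# prototype (columns)
d2 <- outer(rowSums(obs^2), rowSums(prototypes^2), "+") -
  2 * obs %*% t(prototypes)

# For each observation, prototype indices sorted by increasing distance
ranked <- t(apply(d2, 1, order))
bmu1 <- ranked[, 1]  # best matching unit
bmu2 <- ranked[, 2]  # second best matching unit

# Mean squared distance to the best matching unit
quantization_error <- mean(d2[cbind(seq_len(nrow(obs)), bmu1)])

# Share of observations whose two best units are not grid neighbors
grid_d <- as.matrix(dist(grid_xy))
topographic_error <- mean(grid_d[cbind(bmu1, bmu2)] > 1)
```

Under these definitions the topographic error is a proportion, which is why it stays between 0 and 1, while the quantization error depends on the scale of the data.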
In the SOM algorithm, the number of clusters is necessarily close to the number of neurons on the grid (not necessarily equal, as some neurons may have no observations assigned to them). This number, which is quite large, may not suit the original data for clustering purposes.
A usual way to address clustering with SOM is to perform a hierarchical clustering on the prototypes. This clustering is directly available in the package SOMbrero using the function superClass. To do so, you can first have a quick overview to decide on the number of super clusters which suits your data.
plot(superClass(korresp.som))
## Warning in plot.somSC(superClass(korresp.som)): Impossible to plot the rectangles: no super clusters.
By default, the function plots both a dendrogram and the evolution of the percentage of explained variance. Here, 3 super clusters seem to be a good choice. The output of superClass is a somSC class object.
Basic functions have been defined for this class:
my.sc <- superClass(korresp.som, k=3)
summary(my.sc)
##
## SOM Super Classes
## Initial number of clusters : 64
## Number of super clusters : 3
##
##
## Frequency table
## 1 2 3
## 28 17 19
##
## Clustering
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## 1 1 2 2 2 2 2 2 1 1 1 2 2 2 2 2 1 1 1 1 2 2 2 2 1
## 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## 1 1 1 1 1 2 2 3 3 3 1 1 1 1 1 3 3 3 3 3 1 1 1 3 3
## 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## 3 3 3 1 1 1 3 3 3 3 3 3 1 1
plot(my.sc, plot.var=FALSE)
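The idea of a hierarchical clustering on the prototypes can be sketched with base R alone. The prototypes below are random toy values rather than those of the trained map, and the Ward linkage is an assumption of this sketch; see help(superClass) for what the package actually uses.

```r
set.seed(3)
# Toy stand-in for the prototypes of a 4 x 4 map (16 neurons, 5 variables)
prototypes <- matrix(runif(16 * 5), nrow = 16)

# Hierarchical clustering of the neurons (assumption: Ward linkage)
tree <- hclust(dist(prototypes), method = "ward.D")
plot(tree)  # dendrogram, useful to choose the number of super clusters

# Cut the tree into 3 super clusters: one label per neuron
super_cluster <- cutree(tree, k = 3)
table(super_cluster)  # frequency table of neurons per super cluster
```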
Like plot.somRes, the function plot.somSC has an argument 'type' which offers many different plots and can thus be combined with most of the graphics produced by plot.somRes:
Case "grid" fills the grid with colors according to the super clustering (and can provide a legend).
Case "dendro3d" plots a 3d dendrogram.
plot(my.sc, type="grid", plot.legend=TRUE)
plot(my.sc, type="dendro3d")
The three super-clusters correspond to overseas votes (super-cluster 1), traditional votes (super-cluster 2) and far-left/right votes (super-cluster 3). The 3 different neurons mentioned earlier have been gathered together in super cluster 1.
A couple of plots from plot.somRes are also available for the super clustering. Some identify the super clusters with colors:
plot(my.sc, type="hitmap", plot.legend=TRUE)
plot(my.sc, type="lines", print.title=TRUE)
plot(my.sc, type="lines", print.title=TRUE, view="c")
plot(my.sc, type="mds", plot.legend=TRUE)
And some others identify the super clusters with titles:
plot(my.sc, type="color", view="r", variable="correze")
plot(my.sc, type="color", view="c", variable="JOSPIN")
plot(my.sc, type="poly.dist")
Let us consider the first super cluster. It contains 3 departements and 1 candidate:
## [1] "alpes_maritimes" "finistere" "gard"
## [4] "haute_garonne" "gironde" "herault"
## [7] "ille_et_vilaine" "isere" "loire"
## [10] "maine_et_loire_" "meurthe_et_moselle" "morbihan"
## [13] "moselle" "oise" "bas_rhin"
## [16] "haut_rhin" "seine_maritime_" "seine_et_marne_"
## [19] "var" "essonne" "seine_saint-denis"
## [22] "val_de_marne" "val_d'oise"
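A list like the one above can be obtained by mapping each variable to the super cluster of its neuron. The following base R sketch uses toy vectors invented for the illustration; with SOMbrero objects they would presumably be the cluster component of the superClass result and the clustering component of the trained map, which is an assumption to check against help(superClass).

```r
# Toy stand-ins (illustrative values, not SOMbrero output)
neuron_super_cluster <- c(1, 1, 2, 3)                   # super cluster of neurons 1 to 4
clustering <- c(A = 4, B = 1, d1 = 4, d2 = 1, d3 = 2)  # neuron of each variable

# Super cluster of each variable: look up the super cluster of its neuron
obs_super_cluster <- neuron_super_cluster[clustering]
names(obs_super_cluster) <- names(clustering)

# Names of the variables that fall in super cluster 1
names(obs_super_cluster)[obs_super_cluster == 1]
```

In the korresp case the result mixes row and column variables, so the candidates and the departements of a super cluster come out of the same lookup.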
The departements are the 3 biggest overseas departements. Because of their history and culture, these departements differ from metropolitan France, and they share a distinct electoral behaviour. In particular, during the 2002 French presidential election, they strongly supported Christiane Taubira, the candidate assigned to this super cluster, who comes from one of the overseas departements.