
1 Introduction

Pattern recognition and clustering algorithms are the methodological cornerstones of the “big data” paradigm. In biology, high-throughput genomics and detailed imaging techniques are widely applied to learn how cells work and how diseases develop, and the resulting datasets are expanding at an exponential rate. Consequently, biomedical data analysis relies increasingly on computational modeling and visualization in addition to traditional descriptive statistics. The taxonomic tradition in biomedicine of categorizing phenomena into distinct, easily identifiable boxes remains strong, which explains the popularity of classical algorithms such as principal component analysis and hierarchical clustering as the first and often only choices for visualizing and interpreting the multi-dimensional structure of a complex dataset. However, both methods struggle when the dataset resembles a continuum instead of distinct clusters of data.

In this vignette, we focus on biomedical applications of the Numero framework, but nothing in the R package is specifically aimed at biology. The choice of diabetic kidney disease as an example reflects our experience in the field; Numero itself can be applied to any analysis problem that involves complex multi-dimensional data.

The document is organized into sections that describe our motivation for developing the library, introduce the concept of the self-organizing map, describe the dataset we use as an example of a biomedical study, walk through a complete R script of an analysis pipeline, define metabolic subgroups, interpret the results, and discuss the role of map analyses in publications.

1.1 Limitations of conventional categorization

The conventional notion of qualitative data patterns (e.g. health vs. disease) fits well with clustering algorithms that aim to find discriminatory borders automatically within the data. However, we argue that many biomedical datasets do not have a qualitative structure of regularity, but instead reflect a multivariable spectrum of causes and consequences where the borderline between health and disease is blurred. For instance, chronic kidney disease is defined according to an internationally accepted threshold of glomerular filtration rate (GFR < 60 mL/min/1.73 m², (Levey & Coresh, 2012)), but there is no mathematically identifiable threshold effect in the population-based distribution of GFR or any other biomarker or physical characteristic, as demonstrated by the ongoing debate on diagnostic criteria (Delanaye et al., 2012). Therefore, in most cases it is impossible to say exactly when someone develops chronic kidney disease, only that the diagnostic threshold is reached after a gradual decline, after which treatments can be initiated according to consensus guidelines.

Typical clustering analyses rely on algorithms that are tweaked for different application domains to produce classifications that are mathematically optimal, reproduce an existing gold standard, or predict future outcomes. We maintain that excessive reliance on mathematical criteria is not useful for datasets without an intrinsic clustering structure, since the choice of criteria, rather than the data or the practical usefulness of the classification, will then determine the output. Furthermore, the process that leads to category assignments is often too complicated to understand on a practical level, so the human observer must rely on a “black box” to produce the classification results without access to the inner workings. We propose a halfway solution, where the aim is to simplify the data presentation, with statistical verification, so that a human observer can determine a suitable subgrouping for a specific purpose, yet retain sufficient access to the data patterns to understand the characteristics of the dataset in detail.

A traditional strict classification model will work well if measurable qualitative differences exist. For instance, type 1 diabetes is an autoimmune form of diabetes that develops in children and adolescents. The condition is severe, with a short life expectancy if untreated, so type 1 diabetes can be considered a qualitative example of health versus disease. Consequently, highly accurate diagnostic biomarkers such as glucose, insulin and C-peptide already exist. Even when treated, type 1 diabetes has a profound long-term impact on energy metabolism, and it represents a distinct data cluster that is separate from the non-diabetic population.

Unlike type 1 diabetes, common age-associated diseases such as chronic kidney disease, type 2 diabetes, and atherosclerosis are challenging from a clustering perspective: they take decades to develop, they are not immediately life-threatening if left untreated, and there is a wide variation in severity across individuals. Furthermore, the affected individuals often suffer from multiple interacting chronic conditions, making it difficult to isolate specific causes and symptoms. Therefore, the simplistic notion of a qualitative threshold between health and disease becomes problematic. We aim to address these challenges by creating subgroups that are of practical value beyond mathematical criteria, and guided by a human observer with access to understandable presentations of the multivariable data patterns.

Multiple co-occurring and inter-connected phenomena are hallmarks of complex systems and of the observable data that can be obtained from them. This presents a challenge to the traditional paradigms of biomedicine. For instance, differential diagnostics cannot cope well with multiple overlapping diseases or evolving degrees of severity. This motivated us to develop the Numero framework to enable visual comparisons of multiple overlapping diagnoses and their diagnostic criteria. We expect Numero to be highly valuable in situations where the most important outcome or set of outcomes is not obvious (e.g. competing risk scenarios). For instance, patients with type 1 diabetes may develop serious injuries to their vasculature over decades, but the affected organs, severity and rate of progression vary. Therefore, predictive models that focus only on a single outcome at a time may miss the big picture. The example of diabetic kidney disease we use in this vignette demonstrates how to use the Numero framework to gain insight into the overlaps and longitudinal associations between multiple morbidities.

1.2 Self-organizing map

Expressing multivariable data in visual form is a critical part of any knowledge discovery process, and numerous algorithms have been developed in recent decades. In many cases, the aim is to project a set of multivariable data points onto a two-dimensional presentation for human viewing (Figure 1). We built the Numero package around the self-organizing map (SOM) algorithm (Kohonen, Schroeder, & Huang, 2001), which is based on only a few simple mathematical rules, tolerates missing data and can handle a high number of variables. We also developed a method to estimate the statistical significance of the map patterns (V.-P. Mäkinen et al., 2008b). Of note, the modular structure of the library allows users to replace the SOM with any other suitable algorithm for customized analysis pipelines.


Figure 1: A conceptual example
The example shows how to organize objects with multiple features into a two-dimensional layout. The images were obtained from Cardoso, Queiroz, & Lima (2014).

Conceptually, the SOM algorithm mimics a human observer who wants to make sense of a set of objects. For instance, Figure 1A depicts schematic drawings of the flowering legume genus Luetzelburgia that grows in South America (Cardoso et al., 2014). Figure 1B shows how a human observer might organize the drawings based on their visual similarities (shape, size and other morphological details). By organization, we refer to the spatial layout of the drawings on the two-dimensional canvas: drawings that look similar are close to each other, whereas drawings that look different are, in most cases, far apart. This is how all people, from children to the elderly, sort and classify objects with multiple observable features (= multivariable data points) with the help of a two-dimensional surface (= data map). The same observer then decides how to split the dataset into subgroups based on his or her domain knowledge.

If there are thousands of drawings, manual organization becomes impractical. For this reason, we let the SOM algorithm do the first organization step and visualize the salient patterns of the dataset on a two-dimensional data map. The spatial principle still applies: multivariable data points that have similar values are close to each other, whereas data points that are different are on opposite sides of the map. The second step of defining subgroups remains the responsibility of the observer. We argue that this type of data-assisted subgrouping is particularly useful in situations where there is no qualitative threshold between health and disease, but a line must be drawn to initiate preventative measures or treatments.

Although there are only 18 drawings in Figure 1, the nature of the dataset resembles many epidemiological studies. Specifically, some of the drawings are very similar, but it is not obvious how they should be classified into subgroups (i.e. our version of the figure can be disputed; a single “correct” visual subgrouping may not exist). If the classification were based on the height of a drawing, the results would look different compared to using the width – some drawings are narrow but long, whereas others are wide but short. This is a naïve example of how the selection of mathematical criteria for classification has a substantial impact on the results, and it illustrates the motivation for our work. We developed the Numero library as an alternative tool that helps researchers define meaningful groupings when pure mathematics cannot provide a conclusive answer.

Previous versions of the software (written for Matlab) were successfully used in a range of metabolomics and other biomedical studies (Bernardi et al., 2010; Kumpula et al., 2010; Kuusisto et al., 2012; V.-P. Mäkinen et al., 2008a, 2008b, 2013; Tukiainen et al., 2008; Würtz et al., 2011). However, the old version used a rectangular SOM, which tends to guide observers into picking four subgroups in the corners even when this is not supported by the data. We created the Numero package with a circular implementation of the SOM to remove the artifacts caused by cornered border shapes. Additional technical details and supportive material are available as an online supplement to a previous publication (V.-P. Mäkinen et al., 2012).

2 Terminology

Data point – Here, we define the term data point as a single uniquely identifiable row in a spreadsheet of data (with variables as columns). For instance, in the diabetic kidney disease dataset described in the next section, a data point refers to a patient (and vice versa) as there is only one row per patient.

Map – A map is a general term to describe the two-dimensional canvas onto which the multivariable data points are projected. The concept is analogous to a geographic map that indicates where people live, except that the location is not based on geography (i.e. physical distances), but comes from the data (i.e. distances = data-based similarities).

Layout – We make a distinction between what is a map and what is the layout of data points on it. The layout is a table of data point locations as coordinates, whereas the map is a more integrated concept that also includes the information necessary to find the locations of new, previously unseen data points, and to draw and paint the map in visual form.

District – A district refers to a pre-defined division of the map into uniformly sized areas. The districts are created mainly for technical reasons: using districts speeds up calculations and enables the estimation of map-related statistics. This is analogous to a real city being divided into districts to estimate regional demographics, for instance.

Coloring – The Numero framework always creates a single map. However, the map districts can be painted with different colors. This enables the user to create multiple colorings of the map to visualize regional differences. These colorings can be made for each variable, which helps to identify which parts of the map are particularly important for a specific phenomenon. Again, this is similar to a real city map where the districts are colored according to the income level of the local residents, or according to the mean age, smoking rates, obesity etc.

Subgroup – We expect that most uses of Numero will result in the subgrouping of a complex dataset. Visually, we define a subgroup via a contiguous set of adjacent districts on the map. Consequently, all the data points that are located within the set of districts are the subgroup members.

District profile – The SOM algorithm works through the districts during the optimization of the data point layout on the map. The computational process eventually converges to a stable configuration that is stored as a set of district profiles. From a practical point of view, a district profile represents the typical average profile that captures the characteristics of the data points within the district. In technical terms, the district profile (also known as the prototype) contains the weighted mean of the data values across all the data points, where the weights are determined by the neighborhood function used in the SOM algorithm.

Best-matching district – The best-matching district (BMD, also known as the best-matching unit in the literature) is the district with a profile that is the most similar to a data point when considering all variables simultaneously. The BMD is closely related to the data point layout: the assigned location for a data point is the location of the BMD for that data point.

3 Example dataset of diabetic kidney disease

Diabetic kidney disease is the leading indication for dialysis and kidney transplantation in developed countries, and it carries a substantial risk of premature death due to cardiovascular disease. About one third of individuals with type 1 diabetes will develop diabetic kidney disease during their lifetime. As the onset of type 1 diabetes occurs in childhood or adolescence, these individuals develop complications at a relatively early age. Therefore, people with type 1 diabetes represent a particularly vulnerable group facing lower quality of life and a reduced life span due to kidney damage.

Albuminuria (elevated albumin concentration in urine) is the basis for the clinical classification of diabetic kidney disease. In this example, we applied a threshold of 300 mg/24h when 24h urine collections were available, and 0.2 mg/min when overnight urine data were available from the local medical centers that examined the patients. If the threshold was exceeded in at least two out of three consecutive measurements, we assigned the individual to the diabetic kidney disease group. In addition, the FinnDiane Study Group measured the urinary albumin excretion rate from a single 24h urine sample in their designated central laboratory. The logarithm of the albumin excretion rate was included in the example dataset.
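
As an illustration, the classification rule above can be expressed in a few lines of R. The values below are hypothetical, since the example dataset only contains the final kidney disease status:

aer <- c(250, 340, 410)                    # hypothetical consecutive AER values (mg/24h)
dkd <- sum(aer > 300, na.rm = TRUE) >= 2   # at least two of three above the threshold
print(dkd)
## [1] TRUE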

Our example dataset contains a subset of data from a previous publication (V.-P. Mäkinen et al., 2008b). We created the simplified dataset for educational purposes, but it contains enough information to replicate some of the findings from the original study. The dataset includes 613 individuals, of whom 225 had diabetic kidney disease at baseline. In addition, we included information on whether an individual had died during the eight-year follow-up to demonstrate how our chosen study design can be applied to longitudinal data. The available data are summarized in Table 1.

Table 1: Summary of the diabetic kidney disease dataset from the FinnDiane Study
The mean and standard deviation are reported for continuous variables. Abbreviations: urinary albumin excretion rate (AER), triglycerides (TG), high density lipoprotein subclass 2 (HDL2). P-values were estimated by the t-test for continuous variables and by Fisher’s test for binary traits.
Trait                                 No kidney disease   Diabetic kidney disease   P-value
Men / Women                           192 / 196           119 / 106                 0.45
Age (years)                           38.8 ± 12.2         41.7 ± 9.7                0.0012
Type 1 diabetes duration (years)      25.3 ± 10.3         28.6 ± 7.8                <0.001
Log10 of AER (mg/24h)                 1.20 ± 0.51         2.72 ± 0.59               <0.001
Log10 of TG (mmol/L)                  0.034 ± 0.201       0.159 ± 0.212             <0.001
Total cholesterol (mmol/L)            4.89 ± 0.77         5.35 ± 0.96               <0.001
HDL2 cholesterol (mmol/L)             0.54 ± 0.16         0.51 ± 0.18               0.027
Log10 of serum creatinine (µmol/L)    1.94 ± 0.09         2.14 ± 0.24               <0.001
Metabolic syndrome                    90 (23.2%)          114 (50.7%)               <0.001
Macrovascular disease                 16 (4.1%)           38 (16.9%)                <0.001
Diabetic retinopathy                  133 (34.4%)         178 (79.1%)               <0.001
Died during follow-up                 13 (3.4%)           43 (19.1%)                <0.001

4 Aims and study design

In the original study, we hypothesized that the metabolic profile of an individual with type 1 diabetes at baseline predicts adverse events in the future (V.-P. Mäkinen et al., 2008b). Here, we set two aims to test the same hypothesis in the example dataset:

  1. Define metabolic subgroups of type 1 diabetes based on biochemical data.
  2. Identify subgroups with high all-cause mortality.

We chose these aims to accommodate a high number of variables and to ensure statistical robustness. Please note that we included only a few variables in the example dataset for pedagogical reasons, but the SOM in the original study was created based on thousands of variables.

The strict separation of Aims 1 and 2 is an example of an unsupervised classification design where the metabolic subgroups are created without using the mortality data. Only after the subgroup modeling has been completed are the deaths during follow-up counted within the subgroups. An alternative would be to employ regression or other supervised methods that use all the available data simultaneously to create a predictive model of mortality. While supervised models can achieve high accuracy, they rarely work well outside the dataset they were created for, and they may fail if the outcome to be predicted is poorly defined or biased. For these reasons, we adopted the more robust unsupervised classification design.

We denote the study design as “split-by-variable” since it starts from a spreadsheet with one patient per row and the variables organized into columns, and then assigns one set of variables to the training set and the remaining variables (e.g. deceased or alive at follow-up) to the evaluation set (Figure 2). Since the evaluation set plays no part in the training of the SOM, we can estimate the statistical significance of the mortality pattern without over-estimating the model accuracy.


Figure 2: Application of the split-by-variable study design in the diabetic kidney disease example
Of note, the training set is adjusted for sex differences; hence the ‘MALE’ column is not formally included in the evaluation set.

5 Statistical analysis

The architecture of the analysis pipeline for the diabetic kidney disease example is detailed in Figure 3. First, we describe how to preprocess the data for analysis (Figure 3A-D). Next, we create the SOM based on the training set (Figure 3B,E). The third segment focuses on map statistics and how to color the maps according to regional variation (Figure 3F,G). Lastly, we discuss the interactive subgroup selection and interpretation of the results (Figure 3H,I).


Figure 3: Analysis steps in the diabetic kidney disease example

5.1 Preprocessing

We have included the example dataset in the installation package. To access it, type

fname <- system.file("extdata", "finndiane_dataset.txt", 
                     package = "Numero")
db <- read.delim(file = fname, sep = "\t")
summary(db)
##      INDEX          AGE          T1D_DURAT          MALE       
##  Min.   :  1   Min.   :15.00   Min.   : 2.59   Min.   :0.0000  
##  1st Qu.:154   1st Qu.:31.00   1st Qu.:19.00   1st Qu.:0.0000  
##  Median :307   Median :39.00   Median :26.00   Median :1.0000  
##  Mean   :307   Mean   :39.86   Mean   :26.53   Mean   :0.5073  
##  3rd Qu.:460   3rd Qu.:48.00   3rd Qu.:34.00   3rd Qu.:1.0000  
##  Max.   :613   Max.   :74.00   Max.   :53.00   Max.   :1.0000  
##                                                                
##     DECEASED         MACROVASC        METAB_SYNDR      DIAB_KIDNEY   
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.000  
##  Median :0.00000   Median :0.00000   Median :0.0000   Median :0.000  
##  Mean   :0.09135   Mean   :0.08809   Mean   :0.3328   Mean   :0.367  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:1.000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.0000   Max.   :1.000  
##                                                                      
##   DIAB_RETINO        uALB_log          TG_log              CHOL       
##  Min.   :0.0000   Min.   :0.3617   Min.   :-0.37160   Min.   : 2.920  
##  1st Qu.:0.0000   1st Qu.:0.9590   1st Qu.:-0.06550   1st Qu.: 4.470  
##  Median :1.0000   Median :1.5682   Median : 0.05690   Median : 4.980  
##  Mean   :0.5082   Mean   :1.7526   Mean   : 0.08023   Mean   : 5.061  
##  3rd Qu.:1.0000   3rd Qu.:2.4900   3rd Qu.: 0.21750   3rd Qu.: 5.600  
##  Max.   :1.0000   Max.   :3.8788   Max.   : 0.90850   Max.   :10.000  
##  NA's   :1        NA's   :28                                          
##      HDL2C          CREAT_log    
##  Min.   :0.0910   Min.   :1.415  
##  1st Qu.:0.4120   1st Qu.:1.909  
##  Median :0.5200   Median :1.978  
##  Mean   :0.5273   Mean   :2.013  
##  3rd Qu.:0.6400   3rd Qu.:2.061  
##  Max.   :1.1900   Max.   :3.035  
## 

We hypothesize that the metabolic phenotype of an individual predicts future adverse outcomes. To investigate the hypothesis, we select all blood and urine biomarkers at baseline as the training set (Aim 1), and then use the remaining columns that contain data on clinical end-points and mortality as the evaluation set (Aim 2).

trvars <- c("CREAT_log", "CHOL", "HDL2C", "TG_log", "uALB_log")

If our hypothesis is correct, we should see a statistically significant regional pattern for mortality on the SOM that we constructed based on the metabolic variables at baseline. This is the split-by-variable study design that was previously described in Figure 2.

In the data file, the biomarkers are expressed in their physical concentration units or as log-transformed versions. As a consequence, the standard deviations of the data columns vary, which can bias the SOM towards fitting the biomarkers with the widest numerical variation. In most cases, it is desirable to standardize the training set before the analyses, so that the information content rather than the measurement scale determines the modeling outcome.

Sex difference is another factor to consider when preparing the training set. Men and women display anatomical and metabolic differences, which usually complicate the interpretation of the SOM. For this reason, we recommend a sex-specific standardization procedure that eliminates these differences. If necessary, separate visualizations can be made afterwards for men and women using the same map; please see V.-P. Mäkinen et al. (2012) for an example.
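
To illustrate what sex-specific standardization means, here is a minimal sketch in base R; this is not the package implementation (the nroPreprocess call below handles the same task via its strata argument):

z <- db[, trvars]
for (v in trvars)   # center and scale each column within men and women separately
    z[[v]] <- ave(db[[v]], db$MALE, FUN = function(x) as.numeric(scale(x)))
tapply(z$CHOL, db$MALE, mean)   # both group means are (close to) zero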

The Numero package contains a pre-processing function that checks the data for unusable rows and columns, and we can use it to center and scale the data for men and women separately. Our dataset contains an explicit identity column, which therefore needs to be declared in the pre-processing function; the function assigns these identifiers as row names in the output. If no key column is provided, the existing row names are copied as such.

db <- nroPreprocess(data = db, training = trvars, strata = "MALE",
                    key = "INDEX")

The function returns a list with three members:

  • original contains a subset of the original dataset where unusable rows were removed,
  • values contains those columns that could be converted to numbers,
  • features contains the standardized training columns.

You can verify that the training data are zero-centered by typing the following commands, which will produce the output below:

men <- which(db$values[,'MALE'] == 1)
women <- which(db$values[,'MALE'] == 0)
print(summary(db$features[men,]))
##    CREAT_log            CHOL              HDL2C             TG_log       
##  Min.   :-1.7142   Min.   :-2.24260   Min.   :-2.4881   Min.   :-2.0232  
##  1st Qu.:-0.5691   1st Qu.:-0.69266   1st Qu.:-0.6940   1st Qu.:-0.7095  
##  Median :-0.2694   Median :-0.06336   Median :-0.1069   Median :-0.1533  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.1772   3rd Qu.: 0.62420   3rd Qu.: 0.6541   3rd Qu.: 0.5876  
##  Max.   : 4.8941   Max.   : 5.16912   Max.   : 3.3436   Max.   : 3.7445  
##                                                                          
##     uALB_log      
##  Min.   :-1.5656  
##  1st Qu.:-0.9274  
##  Median :-0.0868  
##  Mean   : 0.0000  
##  3rd Qu.: 0.8118  
##  Max.   : 2.1943  
##  NA's   :13
print(summary(db$features[women,]))
##    CREAT_log            CHOL             HDL2C              TG_log       
##  Min.   :-3.3538   Min.   :-2.5613   Min.   :-2.86000   Min.   :-1.9467  
##  1st Qu.:-0.5241   1st Qu.:-0.6290   1st Qu.:-0.65987   1st Qu.:-0.6907  
##  Median :-0.1371   Median :-0.1259   Median :-0.07021   Median :-0.1360  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.2969   3rd Qu.: 0.6173   3rd Qu.: 0.57969   3rd Qu.: 0.5740  
##  Max.   : 4.9883   Max.   : 5.5339   Max.   : 3.82916   Max.   : 2.9483  
##                                                                          
##     uALB_log      
##  Min.   :-1.4387  
##  1st Qu.:-0.8690  
##  Median :-0.3313  
##  Mean   : 0.0000  
##  3rd Qu.: 0.8239  
##  Max.   : 2.3544  
##  NA's   :15

5.2 Initializing a self-organizing map

Training a SOM requires two steps: i) initialization of the map and ii) iterative optimization of the district profiles. The initialization of district profiles influences the usability of the final map, and several methods, including principal component analysis, have been proposed (Attik, Bougrain, & Alexandre, 2005). Our experience suggests that the most useful results are usually achieved by creating a limited number of seed profiles that are placed on the edges of the map. The districts in the middle are set automatically in such a way that the transition from one seed to another via the middle districts is smooth. The Numero package includes the nroKmeans() function, which we use here to determine the seed profiles. It is based on the classical k-means algorithm, but our implementation is specifically designed for datasets with missing data and produces output that is compatible with other Numero functions.

The minimum number of seeds is three, since the triangle is the simplest polygon that can cover the map. To create the seed profiles, you can use the command

km <- nroKmeans(data = db$features, k = 3)

The output is a list that contains three named elements:

  • centroids contains the seed profiles,
  • layout contains the best-matching centroid for each row, and
  • history the history of training errors.

To show the seeds, type the following command. The seed profile output has the same columns as the training data.

print(km$centroids)
##       CREAT_log       CHOL      HDL2C      TG_log    uALB_log
## [1,]  1.0612976  0.4464189 -0.7508336  1.07224248  1.08944887
## [2,] -0.3926040 -0.6453715 -0.2515304 -0.47089775 -0.57310817
## [3,] -0.1886087  0.6884679  0.9797789 -0.07189808  0.09436714

Most SOM software is based on rectangular maps or on borderless maps with periodic boundaries. In our experience, the former suffer from an artificial tendency of observers to define four separate subgroups in the corners. The latter do not suffer from boundary artifacts, but they are complicated to interpret. For these reasons, we developed a circular map topology for Numero: it has no corners, yet its well-defined borders limit the visual complexity of the regional patterns.

The preferred size of the map depends on the number of data points (small maps for small datasets); however, we advise against using large maps due to their complexity. For most biomedical and epidemiological applications, map radii between two and five provide enough flexibility and expressive power based on our experience.

To initialize a circular map with a radius of three districts according to the seed profiles, use the command

sm <- nroKohonen(seeds = km$centroids, radius = 3)

This will create the initial matrix of district profiles, together with the additional topological information that is required for visualization. To show the district profiles, type the following command; the output shows that the profiles have the same format as the seeds:

print(head(sm$centroids)) 
##     CREAT_log       CHOL        HDL2C      TG_log    uALB_log
## 1  0.16002834  0.1631718 -0.007528393  0.17648222  0.20356928
## 2  0.46405652  0.3572670 -0.149647657  0.49388353  0.53864811
## 3  0.10507216  0.3815612  0.297551387  0.15828606  0.23621959
## 4 -0.04348941  0.1989276  0.270231115 -0.01037918  0.04019910
## 5 -0.13134873 -0.1548775 -0.016852841 -0.14811837 -0.16612276
## 6 -0.03213067 -0.1265242 -0.101798836 -0.04994397 -0.06963915

5.3 Training the self-organizing map

Kohonen’s self-organizing map algorithm was originally developed to mimic the plasticity of neural networks (Kohonen et al., 2001). It scales up well for datasets with a high number of variables, and it can handle missing data values, which is why we chose it as the default method in Numero. To apply the SOM algorithm to the standardized training set, use the command

sm <- nroTrain(som = sm, data = db$features)

The nroTrain() function updates the district profiles in such a way that, when the data points are assigned to their best-matching districts (BMDs), the resulting layout is distributed more evenly across the map. BMDs and the data point layout are discussed in the next section.

The nroTrain() function adds a record of the training process to the output list. To plot it on screen, type

plot(sm$history)

Figure 4: SOM training history in the diabetic kidney disease example

The results are shown in Figure 4. The curve shows the mean Euclidean distance between a data point and its best-matching district profile for each training cycle. In most cases, the first few cycles account for the largest reductions in the training error. The abrupt reduction that can be observed after the first few cycles is part of the training process: for those interested in the technical details, it is caused by switching from a wide neighborhood function to a narrower one. This is beneficial, since starting the training process with a wide neighborhood function forces the SOM to adapt to large (and presumably more important) patterns first before adapting to the minor details.

We have included additional tools in the Numero package to assess the internal properties of the SOM given a training dataset. However, the full use of these tools requires visualization functions that are not covered until later in the document. For this reason, we will return to this subject in a dedicated section after introducing map colorings.

5.4 Best-matching districts and data point layout

The application of the SOM algorithm mimics a human researcher who wants to investigate the cohort of type 1 diabetic patients. Suppose the researcher goes through the medical records of all individuals and then organizes the folders on a giant round table in such a way that patients with mutually similar clinical profiles are placed next to each other, whereas patients who are different are on opposite sides of the table. From this organized view, it is then possible to identify sections of the table (= patient subgroups) where there is a high risk of premature death.

To translate the story of the human researcher into a computer program, it is necessary to introduce new concepts, and it is also useful to revisit the terminology defined at the beginning of the document. First, the map (representing the giant round table) is divided into districts for technical reasons, since the type of manual positioning a human would apply is unfeasible for large datasets. The districts also serve as anchors that enable the assignment of the patients to specific map positions, again important for technical reasons.

The best-matching district (BMD) for a data point is the second important concept. Each data point is compared against a district profile by calculating their pair-wise Euclidean distance. This is repeated across the districts, and after the distances between the data point and all district profiles are calculated, the profile with the shortest distance is chosen as the best match. When the BMDs are located for all data points, the results are collected in a spreadsheet, which we denote as the data point layout. The layout is conceptually equivalent to the spatial configuration of folders on the human researcher’s table.
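
The matching step itself is conceptually simple. The sketch below is a simplification, not the internal nroMatch code; it finds the BMD for a single data point, and averages the squared differences over the usable elements so that missing values are tolerated:

bmd <- function(profiles, x) {
    d <- apply(profiles, 1, function(p) {
        ok <- is.finite(x) & is.finite(p)   # use only elements present in both
        sqrt(mean((x[ok] - p[ok])^2))       # distance per usable element
    })
    which.min(d)                            # index of the best-matching district
}
bmd(sm$centroids, unlist(db$features[1,]))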

To create a layout, you can use the command

matches <- nroMatch(som = sm, data = db$features)

The output is an integer vector with elements corresponding to the rows in the training data; each element contains the index of the best-matching district. The vector also has the attribute quality. To print a section of the quality output on screen, type the following command. Its output will be a data frame with three columns:

head(attr(matches,'quality'))
##   COVERAGE   RESIDUAL   QUALITY
## 1        1 0.29996141 0.8554628
## 2        1 0.16749380 1.1447729
## 3        1 0.21269967 1.0263238
## 4        1 0.09154408 1.4201384
## 5        1 0.23710154 0.9720335
## 6        1 0.12177353 1.2960540

The column RESIDUAL shows the Euclidean distance between a data point and its BMD (the shorter the better). QUALITY is calculated from the distance by dividing it by the average training error. This provides a scale-independent relative estimate of how well a data point was matched compared to a typical data point in the training set. Finally, COVERAGE shows if a multivariable data point contained missing elements (1 means all elements were usable; 0 means none of the elements contained a numerical value).

Ideally, the data points should be uniformly distributed across the districts. While uniformity is rarely observed in real datasets, uneven numbers are usually not a problem unless there are large contiguous groups of districts with very high or very low occupancy. We will revisit the spatial data point distribution in a later section that addresses map quality, but a tabulation of the BMD assignments is a quick way to see if the output is reasonable:

t <- table(matches)
counts <- data.frame(DISTRICT = names(t), N = as.integer(t))
print(counts, row.names = FALSE)
##  DISTRICT  N
##         1 14
##         2 12
##         3 22
##         4 14
##         5 24
##         6 13
##         7 11
##         8 19
##         9 14
##        10 11
##        11 13
##        12 12
##        13  5
##        14  9
##        15 16
##        16  9
##        17 12
##        18 15
##        19  8
##        20  9
##        21  4
##        22 18
##        23 28
##        24  8
##        25 21
##        26 17
##        27 19
##        28 23
##        29 21
##        30 24
##        31 20
##        32 19
##        33 14
##        34 30
##        35  7
##        36 19
##        37 18
##        38 17
##        39 15
##        40  9

If a large number of districts in the above output were devoid of data points, it would indicate that the map did not capture the diversity of the data set, and therefore would not be useful for subgrouping. However, the data points are scattered across all districts in this example, which suggests the layout will be useful.

5.5 Map statistics

Statistical evaluation of whether an observation is likely to occur purely by chance is a cornerstone of biomedical data analysis. In our example, we achieve well-defined statistical analyses via the split-by-variable design and the non-parametric permutation engine that is built into the package. The former ensures that our results are not over-optimistic (no over-fitting), and the latter enables us to avoid restrictive assumptions about the data-generating processes, assumptions that are often violated in real datasets.

In our example, we investigate if the metabolic profile at baseline indicates the risk of death during follow-up. To estimate statistical significance, it is necessary to find out how much the areas of the map can differ with respect to mortality just by virtue of random fluctuations. This concept is formally encapsulated by the null hypothesis. Here, the null hypothesis states that the data point layout is not associated with the number of deaths, that is, the location of a patient on the map does not provide any information on how likely the patient is to die in the next eight years. If the null hypothesis is true, then the observed layout and regional patterns of mortality should be within the variation we would expect for random layouts. We use permutation analysis to simulate a high number of random layouts, and then compare the observation with the simulated findings to see if it could have occurred by chance alone (V.-P. Mäkinen et al., 2008b). Within the split-by-variable design, P-values for statistical significance are only meaningful for those variables that are in the evaluation set since, by definition, the variables in the training set will always be strongly associated with the layout. However, it makes sense to evaluate the expected range of regional variation also for the training set, as we will demonstrate later in the vignette. Knowing the randomly expected amplitude of regional patterns (i.e. the basal amplitude) helps us to assess which of the training variables had the strongest influence on the layout. For these reasons, we will apply the permutation analysis to all variables, but only report P-values for the evaluation set.

The function nroPermute() repeats the following procedure: i) re-assign the best-matching districts randomly in accordance with the null hypothesis, ii) recalculate the average district values across the map and iii) summarize the regional variation with a single descriptive statistic. When a sufficient number of cycles has been completed, the null distribution of the descriptive statistic is analyzed to determine how far, in terms of standard deviations, the observed value is from the mean predicted by the null hypothesis. This distance is reported as the Z-score of regional variation. Furthermore, the function also estimates how frequently a permuted layout produced a regional variation that exceeded the observation (the frequency-based P-value).
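
To make the procedure concrete, the sketch below estimates a Z-score and a frequency-based P-value for a single variable. It is a simplification: the real nroPermute uses smoothed district averages and optimized code.

permtest <- function(x, districts, n.cycles = 1000) {
    spread <- function(d) sd(tapply(x, d, mean, na.rm = TRUE), na.rm = TRUE)
    observed <- spread(districts)                            # real layout
    null <- replicate(n.cycles, spread(sample(districts)))   # random layouts
    c(Z = (observed - mean(null))/sd(null),
      P.freq = mean(null >= observed))
}
permtest(db$values[,"DECEASED"], matches)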

The nroPermute() function goes through all the variables in a dataset. Please note that we calculate the statistics on the data frame values (it contains only numeric values, without a key column or unusable rows). Internally, the function also distinguishes between training variables (no P-value needed, so fewer cycles suffice) and evaluation variables that are tested with a larger number of cycles. The function call is shown below:

stats <- nroPermute(som = sm, data = db$values, 
                    districts = matches)

To see the results, type the following command:

print(stats)
##                  SCORE          Z    P.freq N.data N.cycles TRAINING
## AGE         2.64097289  4.9259340 0.0000000    613    10000       no
## T1D_DURAT   3.16434620  5.0704949 0.0000000    613    10000       no
## MALE        0.03183488 -0.9824115 0.8333333    613       24       no
## DECEASED    0.08340699  6.2458777 0.0000000    613    10000       no
## MACROVASC   0.06011753  4.4950270 0.0000000    613    10000       no
## METAB_SYNDR 0.20813671  8.4682837 0.0000000    613    10000       no
## DIAB_KIDNEY 0.30830856 10.8681282 0.0000000    613    10000       no
## DIAB_RETINO 0.16995995  7.1931211 0.0000000    612    10000       no
## uALB_log    0.69966738 15.7798023        NA    585     1000      yes
## TG_log      0.12724597 11.2545995        NA    613     1000      yes
## CHOL        0.48767044 12.9654093        NA    613     1000      yes
## HDL2C       0.10087360 13.1108202        NA    613     1000      yes
## CREAT_log   0.08877195 11.2528273        NA    613     1000      yes
##                      P.z AMPLITUDE
## AGE         4.197917e-07 0.3808101
## T1D_DURAT   1.983913e-07 0.3919857
## MALE        8.370514e-01 0.0000000
## DECEASED    2.107135e-10 0.4828512
## MACROVASC   3.478056e-06 0.3474979
## METAB_SYNDR 1.245176e-17 0.6546592
## DIAB_KIDNEY 8.176055e-28 0.8401844
## DIAB_RETINO 3.166337e-13 0.5560799
## uALB_log              NA 1.2198921
## TG_log                NA 0.8700614
## CHOL                  NA 1.0023193
## HDL2C                 NA 1.0135606
## CREAT_log             NA 0.8699244

The Z column contains Z-scores that indicate how far the observed regional variation is from the mean expected value if the null hypothesis is true. P.z is a parametric estimate of statistical significance based on the Z-scores and the cumulative Gaussian distribution, whereas P.freq is the frequency-based estimate of statistical significance; it is calculated from the frequency with which a simulated random layout produced regional variation exceeding the observed variation. N.data indicates how many data values were used, and N.cycles gives the number of completed permutations. The column TRAINING indicates whether a variable was used during the training process; please note how the P-values are missing for those variables. AMPLITUDE contains the dynamic range of colors that can be used in map visualizations. The amplitudes are required for the assignment of district colors and will be described in closer detail later.

6 Visualization

After estimating the map statistics, we have all the results that are required to color the map according to the data patterns: the topological information carried within the variable sm allows us to draw the districts correctly, the data point layout specifies the locations of the patients on the SOM, and the color amplitudes determine how strongly each variable is expressed in the colorings.

6.1 Color amplitudes

Assigning a color palette to a set of values is not much different from photography. When a photo is taken, the intensity of light is converted into numbers by the digital camera, and then converted back to light on the viewing screen. If there is too much light, the photo gets overexposed, which means that most pixels show up as “burned” since the intensity is beyond their dynamic range (i.e. light saturates the sensor). If the photo is underexposed, most pixels will show a zero signal (i.e. the light is below the detection limit). In principle, the SOM colors work the same way: we aim to set up an optimal color assignment so that the colorings with very high regional variation do not over-expose too much, while the colorings with less regional variation can still show differences between districts despite under-exposure.

In the Numero framework, a photo corresponds to a map coloring (please revisit Terminology if necessary), light intensity is analogous to statistical significance (captured by z-scores), and the dynamic range is the gap between the lowest and highest district averages. Importantly, the “camera settings” are kept constant to ensure all colorings remain visually comparable. Ideally, the camera would be set up so that the full dynamic range of every coloring could be expressed within the available color palette. However, this approach is usually impractical as interesting detail could be lost for variables that show statistically modest but biologically critical variation. The dynamic range of colors is stored in the AMPLITUDE column of the output of nroPermute.

In brief, Z-scores indicate the statistical support for the observed regional variation. But before the information can be visualized, the Z-scores have to be converted to color amplitudes so that the map coloring reflects the strength of the statistical evidence.
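
The conversion is performed internally by nroColorize, but the following hypothetical sketch illustrates the exposure idea (this is an illustration of the concept, not the Numero formula):

expose <- function(v, amplitude) {
    z <- (v - mean(v, na.rm = TRUE))/sd(v, na.rm = TRUE)   # standardized district values
    pmax(-1, pmin(1, amplitude*z/3))   # clamp; districts at the limits "overexpose"
}

A small amplitude compresses all district values into a narrow band around the middle of the palette (under-exposure), whereas a large amplitude pushes the extreme districts to fully saturated colors (over-exposure).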

6.2 Color and label assignment

The color of a district depends on the estimated mean value across its local resident data points. To calculate the district values, use the command

comps <- nroAggregate(topology = sm$topology, data = db$values,
                      districts = matches)

The output of the function is a data frame containing the average district values.

In the SOM literature, the set of district mean values for a variable is typically referred to as the component plane, hence the name of the output. We now have all the materials to assign colors to each district based on i) the amplitude for each variable, which tells how much “exposure” the camera provides, and ii) the component plane, which gives the dynamic range and district means:

colrs <- nroColorize(values = comps, amplitudes = stats$AMPLITUDE)

The output is a data frame of colors in a format that matches the values in the component plane and can be used in subsequent Numero functions.

Due to the standardization by z-scores, the colors are not directly relatable to the original measurement units, or to the original binary categories. For this reason, text labels that indicate the actual mean values for selected districts are a useful visual addition to the final map plot. To create a set of labels for the map coloring, use the command

labls <- nroLabel(topology = sm$topology, values = comps)

6.3 Graphics output

The Numero package contains functions to visualize map colorings on screen, to create interactive map colorings for defining subgroups, and to save those colorings to a file in the Scalable Vector Graphics (SVG) format.

To see all the map colorings on screen (Figure 5), use the following command

elem <- nroPlot(elements = sm$topology, colors = colrs,
                labels = labls, values = comps)

Figure 5: Statistically normalized colorings of all variables in the kidney disease dataset
The color intensity depends on how likely the observed regional variation would arise by chance; intense reds and intense blues indicate that these extremes would be very unlikely if the data point layout was random.

To save the plots to an SVG file, you can use the same command, providing a file name as a parameter. The following command is not executed when the vignette is created; it serves only as an example:

nroPlot(elements = sm$topology, colors = colrs,
                labels = labls, values = comps, file = 'test.svg')

It is possible to direct the figure to any of the R graphics devices, including the SVG device, but the native Numero SVG file will be cleaner and structured in a way that makes it easier to edit manually in graphics programs such as Inkscape.
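
For instance, the same figure can be captured with the standard SVG device from the grDevices package (the file name is arbitrary, and this command is not executed when the vignette is built):

svg("test_device.svg")
nroPlot(elements = sm$topology, colors = colrs, labels = labls, values = comps)
dev.off()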

7 Results and interpretation

In this final section, we summarize and discuss the results of the SOM analysis. As previously mentioned, the big open problems in biomedicine and public health are typically characterized by multiple synergistic risk factors that produce a gradual decline in biological functions over time. For this reason, the observed data patterns are not likely to be self-explanatory, but will require additional analyses and contextual assessment with respect to how the original data were collected and which findings are clinically impactful.

7.1 Map quality

Before delving into the characteristics of diabetic kidney disease, it is prudent to examine the SOM for potential problems with the data. The Numero package provides three different quality metrics: i) the data point histogram reveals problems of misrepresentation between the data points and the district profiles, ii) the coverage map shows systematic patterns of missing data that may influence the results, and iii) the matching quality indicates subgroups of data points that may have been modeled poorly.

To calculate all three, we can use the nroAggregate function with the quality measures from nroMatch as the input.

comps.qc <- nroAggregate(topology = sm$topology, 
                         data = attr(matches, "quality"),
                         districts = matches)

The output contains the district averages for the quality metrics. In addition, nroAggregate always estimates the spatial histogram that tells how many samples are within each district, and returns it as an attribute. To add the histogram information to the quality visualization, we copy the attribute into a new column in the data frame:

comps.qc$HISTOGRAM <- attr(comps.qc, "histogram")

The output format is equivalent to what was used for the map colorings, so the same code sequence is applied to create an SVG figure. To make a distinction between diagnostic and other colorings, we use a different color palette for the nroColorize function:

colrs.qc <- nroColorize(values = comps.qc, palette = "fire")
labls.qc <- nroLabel(topology = sm$topology, values = comps.qc)

Again, we can use the nroPlot function to visualize the results on screen.

elem.qc <- nroPlot(elements=sm$topology, colors=colrs.qc,
                labels=labls.qc)

Figure 6: Visualization of SOM quality metrics
Light (dark) colors indicate high (low) values. The color intensity was not normalized statistically. Coverage indicates the proportion of usable data values, residuals indicate model fit (smaller value is better), quality is a scale-independent measure based on the residuals (larger is better). Finally, the histogram shows smoothed estimates on how many samples were assigned to each district.

We observed coverage close to 1 across the map, which reflects the low frequency of missing elements in the original data matrix (Figure 6).

There are two ways to show matching quality: either by coloring the map according to the mean matching errors of the districts, or by examining the matching errors of individual data points (also referred to as quantization errors or model residuals). These are shown in the colorings for RESIDUAL and QUALITY in Figure 6. Again, some regional differences are expected, but there were no indications of serious problems. In particular, the relative quality even in the worst region was close to the average training quality (i.e. close to one).

Finally, the HISTOGRAM coloring in Figure 6 shows that there were noticeable differences between the districts. However, there was a sufficient data point count everywhere on the map and, based on our experience from previous studies, it is unlikely that the results were adversely affected due to sparse representation.

7.2 Summary of map colorings

Important note on reproducibility: The Numero framework uses optimized code that reduces the memory footprint and computational burden. For this reason, different computers, particularly 32-bit vs. 64-bit architectures, may produce map patterns that are flipped, mirrored, rotated or otherwise transformed when compared with the figures in this vignette. This is a technical limitation due to machine precision, not a mistake in the code.

As discussed earlier, we used the split-by-variable study design in this example. This meant that the SOM was trained using a subset of the available variables (biochemical data), which allowed us to investigate the associations with the clinical variables without a high risk of overfitting. To visualize the results, we follow the same logic. Below, we first investigate the training data to get insight into the metabolic profiles and diversity within the dataset. This will also allow us to define biochemical subgroups from a multivariable perspective. Later on, we will overlay the clinical variables onto the map to identify subgroups of clinical importance.

trvars <- colnames(db$features)
elem <- nroPlot(elements=sm$topology, colors=colrs[,trvars],
                    labels=labls[,trvars])

Figure 7: Statistically normalized colorings of the training variables in the kidney disease dataset
The color intensity depends on how likely the observed regional variation would arise by chance; intense reds and intense blues indicate that these extremes would be very unlikely if the data point layout was random.

Figure 7 shows the map colorings for the training set. Serum creatinine (log-transformed) was substantially higher for the individuals located in the top part of the map than for those in the lower part, and a similar pattern was found for the log-transformed measurements of urinary albumin excretion. As elevated serum creatinine and urinary albumin are hallmarks of kidney disease, it is likely that the individuals who were assigned to the top part of the map had kidney disease as the underlying explanation.

The patterns for the lipids were more complicated. Cholesterol showed high concentrations in the upper-right area and low concentrations in the bottom-left, whereas HDL2 cholesterol was highest in the bottom-right and lowest in the upper-left. Triglycerides (log-transformed) showed a general pattern of high concentrations in the upper part of the map.

We recommend using the SOM together with conventional approaches, such as linear correlations, for a broader understanding of the nature of the dataset. For instance, cholesterol and triglycerides were correlated (r = 0.43, P < 0.001); however, the SOM colorings suggest that the correlation may not apply to all individuals; particularly those in the upper-left area with high triglycerides did not seem to follow the linear trend. Other dimension reduction methods such as principal component analysis may work better in datasets with clear clusters (a typical SOM analysis may miss the clustering structure), and, again, we recommend using multiple conceptually different methods to achieve robust conclusions. We did not observe any obvious clustering structure in the kidney disease dataset (results not shown).
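
For example, the reported correlation between cholesterol and triglycerides can be checked with base R using the preprocessed values:

cor.test(db$values[,"CHOL"], db$values[,"TG_log"])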

7.3 Subgroup boundaries

The aims of the example study were i) to define and describe metabolic subgroups of type 1 diabetes, and ii) to investigate how the subgroups are associated with mortality. The ability to choose subgroup boundaries on the map while simultaneously observing multiple variables is the main strength of the Numero framework. Furthermore, the intensity of the colorings guides the process towards selecting criteria that have the strongest statistical support.

The Numero framework offers an interactive way to define subgroups. To start the interactive process to define subgroups based on the biochemical variables used in training, run the following command:

trvars <- colnames(db$features)
elem <- nroPlot(elements=sm$topology, colors=colrs[,trvars],
                    labels=labls[,trvars], interactive=TRUE)

Subgroups can be defined interactively by clicking on districts in the map colorings in the plot window. We will step through one example, defining the subgroup with high creatinine.

Since a vignette does not allow for an interactive session, we provide screenshots of the process below.

After running the above command, the training colorings are shown in the plot window. We click on the district with the highest creatinine value in the CREAT_log coloring, as shown in Figure 8.


Figure 8: Screenshot: Interactive definition of subgroups - step 1: Choosing the district with the highest creatinine value

Now, we choose the other districts that we want to add to this subgroup (Figure 9). Clicking on a district in one coloring also updates the corresponding district in all other colorings.


Figure 9: Screenshot: Interactive definition of subgroups - step 2: Choosing other districts with a higher creatinine value

Once all districts belonging to this first subgroup have been chosen, a click in the plot window outside any coloring exits the selection process. The console window then asks for a descriptive name for the subgroup and a confirmation of the selection. Here, we choose the name High creatinine (Figure 10) and confirm the subgroup. After that, the chosen districts are labeled with A, indicating the first subgroup (Figure 11).


Figure 10: Screenshot: Interactive definition of subgroups - step 3: Subgroup name and confirmation


Figure 11: Screenshot: Interactive definition of subgroups - step 4: Updated interactive plot after subgroup confirmation

Subgroups are automatically labeled from A to Z in the graphical output. A district selection can be changed by clicking on top of it.

Now, we continue defining subgroups in the same manner until all districts have been chosen. Then, we press the finish button at the top right of the plot window and confirm that the session should be terminated. Map visualizations that contain the subgroup choices can be saved using the following command:

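# Save the map colorings with the confirmed subgroup boundaries as an SVG file.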
nroPlot(elements=elem, colors=colrs, labels=labls, file="subgroups.svg")

In this case, we have decided to define five subgroups. The overall result of the subgrouping, as saved with the above command, is illustrated in Figure 12. Note that this figure was pre-created outside the vignette.


Figure 12: The five subgroups in the diabetic kidney disease example
The grouping is the result of the example interactive process. The SVG was pre-created by saving the results of the interactive process.

In this example, we have chosen the following descriptive names during the interactive procedure:

  • High creatinine for the subgroup labeled A,
  • High cholesterol for the subgroup labeled B,
  • High HDL2 cholesterol for the subgroup labeled C,
  • High triglycerides for the subgroup labeled D and
  • Low lipids for the subgroup labeled E.

While we admit that our choices for the subgroup boundaries were subjective, we also argue that any observer can dispute those choices and provide an alternative by examining the figures. Therefore, the transparency of the methodology allows collective objectivity that is superior to strict “black box” classifiers, especially when the data patterns overlap and involve multiple outcomes.

Please note that the boundaries may not fit exactly with any single variable, since we also required the subgroups to be mutually exclusive. This is the part where no perfect mathematical solution exists due to overlapping patterns and multi-morbidity.

The second aim of the study was to compare the subgroups with respect to mortality and clinical diagnoses. Graphical comparisons of the metabolic subgroup boundaries and selected map colorings are shown in Figure 12. Mortality was highest in the top section of the map (34% in eight years), as seen in the DECEASED coloring, and the same region was also characterized by over 90% prevalence of diabetic kidney disease, as seen in the DIAB_KIDNEY coloring. As expected, the High Creatinine Subgroup (A) captured this segment of the study population. In addition, a few districts with increased mortality and kidney disease prevalence were found within the High Cholesterol Subgroup (B), and a similar spill-over was observable in the High Triglyceride Subgroup (D). On the other hand, the Low Lipids Subgroup (E) showed the lowest rates of death or complications across all the plots in Figure 12.

The metabolic syndrome is a clinical entity that describes the co-occurrence of obesity, diabetes, high blood pressure and abnormal blood lipids, often observed in people at risk of cardiovascular death. Triglycerides and HDL cholesterol comprise the lipid component of the metabolic syndrome, which explains the similar yet distinct patterns with respect to cardiovascular disease (coloring METAB_SYNDR in Figure 12). In particular, over half of the individuals in the High Triglyceride Subgroup (D) have the metabolic syndrome.

7.4 Subgroup statistics

Please note that these results were generated on a 64-bit machine; they may differ from results obtained on 32-bit architectures due to lower machine precision. If you notice discrepancies, please redefine the subgroups yourself and update the R code accordingly.

The function nroSummary calculates the summary statistics for the interactively defined subgroups.

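# Summary statistics for each interactively defined subgroup.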
results <- nroSummary(data = db$values, districts = matches,
                      regions = elem$REGION)

It calculates the mean, standard deviation and median for each subgroup and variable, and also computes P-values using ANOVA, the t-test or the chi-squared test, depending on the type of data. Of note, P-values for the variables used in training are set to NA.

Here, we look at the differences in the prevalence of mortality and diabetic kidney disease in each subgroup.

results <- results[which(results$VARIABLE %in% c('DECEASED','DIAB_KIDNEY')),
                   c('VARIABLE','SUBGROUP','MEAN','P.chisq')]
Table 2: Comparison of metabolic subgroups in individuals with type 1 diabetes for mortality and diabetic kidney disease

VARIABLE     SUBGROUP               MEAN       P.chisq
DECEASED     High Cholesterol       0.0476190  0.4058467
DECEASED     High Creatinine        0.2426471  0.0000000
DECEASED     High HDL2 cholesterol  0.0638298  0.1254771
DECEASED     High triglycerides     0.0869565  0.0392702
DECEASED     Low lipids             0.0245098  1.0000000
DIAB_KIDNEY  High Cholesterol       0.3968254  0.0000002
DIAB_KIDNEY  High Creatinine        0.9044118  0.0000000
DIAB_KIDNEY  High HDL2 cholesterol  0.2198582  0.0047829
DIAB_KIDNEY  High triglycerides     0.3623188  0.0000016
DIAB_KIDNEY  Low lipids             0.1029412  1.0000000

Selected findings are listed in Table 2. As expected, the High Creatinine Subgroup had the highest mortality within the follow-up period and the highest prevalence of diabetic kidney disease. If we consider P-values below 0.05 significant, mortality was significantly increased in the High Creatinine and High Triglycerides Subgroups. Regarding the prevalence of diabetic kidney disease, only the Low Lipids Subgroup showed no significant association.
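
The rows that pass this threshold can also be extracted directly from the filtered summary table created above (a minimal sketch using base R):

# Keep only the comparisons with chi-squared P-values below 0.05.
subset(results, P.chisq < 0.05)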

8 Concluding remarks

Now that the SOM analyses have been completed, how should these findings be reported in a journal article, and what is the take-home message? Our first recommendation is not to abandon conventional statistics when using the Numero framework – the two are complementary. In the kidney disease example, we recommend starting with the description of the study cohort and age- and sex-adjusted comparisons between established clinical categories (e.g. Table 1 is a basic first step). This will give most readers in the field an understanding of the basic nature of the dataset.

Next, we recommend drawing Kaplan-Meier mortality curves for diabetic kidney disease, retinopathy and the metabolic syndrome, and applying Cox regression (or other well-established statistical methods) to investigate associations with mortality in a multivariable context. Again, the biomedical readership will appreciate methodology that is familiar to them. These analyses work best for datasets with only a few variables and a well-defined hypothesis, but they are not well suited to identifying non-linear subgroups, synergies across a high number of variables, or multi-morbidity that spans several correlated yet diverse clinical end-points. Hence, the machine learning audience will probably want more.
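
As an illustration, such analyses could look as follows. This is a minimal sketch using the survival package; the follow-up time 'FOLLOW_T' and the adjustment covariates 'AGE' and 'MALE' are hypothetical column names, whereas DECEASED and DIAB_KIDNEY match the variables used in this vignette.

library(survival)

# Kaplan-Meier curves of mortality, stratified by diabetic kidney disease.
# NOTE: 'FOLLOW_T', 'AGE' and 'MALE' are assumed column names.
km <- survfit(Surv(FOLLOW_T, DECEASED) ~ DIAB_KIDNEY, data=db$values)
plot(km, xlab='Follow-up (years)', ylab='Survival probability')

# Age- and sex-adjusted Cox regression for the same end-point.
cox <- coxph(Surv(FOLLOW_T, DECEASED) ~ AGE + MALE + DIAB_KIDNEY,
             data=db$values)
summary(cox)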

The third section of the article should involve the SOM to identify features that cannot be detected by the standard tools. Even if nothing new was discovered, we still recommend including the SOM as a supplement, since it gives a comprehensive window into the data; it is particularly useful for detecting non-random patterns of missing data, the effects of censoring in longitudinal studies, and outliers. For readers and reviewers, sophisticated visualizations that capture the nature of the cohort and give an accurate description of its structural strengths and weaknesses will be highly appreciated – it is better science.

As to the take-home message, we propose the following: if people with type 1 diabetes can achieve such metabolic control that their serum lipids are low, they are likely to be resilient against diabetic complications and mortality. We and others have made similar observations before, so this is not novel; however, it shows how the SOM led to the expected conclusions, which gives us confidence that the approach is robust for high-dimensional big data that is out of reach of conventional tools.

9 Citation

When you use the package, please cite our publication:

citation('Numero')
## 
##   Song Gao, Stefan Mutter, Aaron E. Casey, Ville-Petteri Mäkinen;
##   Numero: a statistical framework to define multivariable subgroups
##   in complex population-based datasets, International Journal of
##   Epidemiology, , dyy113, https://doi.org/10.1093/ije/dyy113
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     title = {Numero: a statistical framework to define multivariable subgroups in complex population-based datasets},
##     author = {Song Gao and Stefan Mutter and Aaron E. Casey and Ville-Petteri Mäkinen},
##     journal = {International Journal of Epidemiology},
##     year = {2018},
##     pages = {dyy113},
##     doi = {10.1093/ije/dyy113},
##   }
