In textbook examples, multivariable datasets are clustered into distinct subgroups that can be clearly identified by a set of optimal mathematical criteria. However, many real-world datasets arise from synergistic consequences of multiple effects and from noisy and partly redundant measurements, and may represent a continuous spectrum of the different phases of a phenomenon. In medicine, complex diseases associated with ageing are typical examples. An individual’s data are the result of a mix of genetic and environmental factors with cumulative effects over decades, plus incidental factors at the time of the measurements. Furthermore, each individual typically has a unique mix of multiple ailments and morbidities that depend on physiology and circumstances. We postulate that population-based biomedical datasets (and many other real-world examples) do not contain an intrinsic clustered structure that would give rise to mathematically well-defined subgroups. From a modeling point of view, the lack of intrinsic structure means that the data form a contiguous cloud in high-dimensional space without abrupt changes in density to indicate subgroup boundaries; hence, mathematical criteria alone cannot segment the cloud by its internal structure. Yet we need data-driven classification and subgrouping to aid decision-making and to facilitate the development of testable hypotheses. For this reason, we developed the Numero package: a more flexible and transparent framework that allows human observers to create usable multivariable subgroups even when conventional clustering frameworks struggle.
Numero 1.1.1
Pattern recognition and clustering algorithms are the methodological cornerstones of the “big data” paradigm. In biology, high-throughput genomics and detailed imaging techniques are avidly applied to learn how cells work and how diseases develop, and the resulting datasets are expanding at an exponential rate. Consequently, biomedical data analysis relies increasingly on computational modeling and visualization in addition to traditional descriptive statistics. The taxonomic tradition in biomedicine of categorizing phenomena into distinct, easily identifiable boxes remains strong, which explains the popularity of classical algorithms such as principal component analysis and hierarchical clustering as the first, and often only, choices for visualizing and interpreting the multi-dimensional structure of a complex dataset. However, both methods struggle when the dataset resembles a continuum instead of distinct clusters of data.
In the vignette, we focus on biomedical applications of the Numero framework, but there is nothing in the R package that is specifically aimed at biology. The choice of diabetic kidney disease as an example reflects our experience in the field, whereas Numero itself can be applied to any analysis problem that involves complex multi-dimensional data.
The document is organized into sections and paragraphs that describe our motivation for developing the library, introduce the concept of the self-organizing map, describe the dataset we use as an example of a biomedical study, go through a complete R-script of an analysis pipeline, define metabolic subgroups, interpret the results and discuss the role of the map analyses in publications.
The conventional notion of qualitative data patterns (e.g. health vs. disease) fits well with clustering algorithms that aim to find discriminatory borders automatically within the data. However, we argue that many biomedical datasets do not have a qualitative structure of regularity; instead, they reflect a multivariable spectrum of causes and consequences where the borderline between health and disease is blurred. For instance, chronic kidney disease is defined according to an internationally accepted threshold of glomerular filtration rate (GFR < 60 mL/min/1.73 m², (Levey & Coresh, 2012)), but there is no mathematically identifiable threshold effect in the population-based GFR distribution or in any other biomarker or physical characteristic, as demonstrated by the ongoing debate on diagnostic criteria (Delanaye et al., 2012). Therefore, in most cases it is impossible to say exactly when someone develops chronic kidney disease, only that the diagnostic threshold is reached after a gradual decline, after which treatments can be initiated according to consensus guidelines.
Typical clustering analyses rely on algorithms that are tweaked for different application domains to produce classifications that are mathematically optimal, reproduce an existing gold standard, or predict future outcomes. We maintain that excessive reliance on mathematical criteria is not useful for datasets without intrinsic clustering structure, since the choice of criteria, rather than the data or the practical usefulness of the classification, will determine the output. Furthermore, the process that leads to category assignments is often too complicated to understand on a practical level, so the human observer must rely on the “black box” to produce the classification results without access to the inner workings. We propose a half-way solution, where the aim is to simplify the data presentation with statistical verification so that a human observer can determine a suitable subgrouping for a specific purpose, yet with sufficient access to the data patterns to understand the characteristics of the dataset in detail.
A traditional strict classification model will work well if measurable qualitative differences exist. For instance, type 1 diabetes is an autoimmune form of diabetes that develops in children and adolescents. The condition is severe with a short life expectancy if untreated, so type 1 diabetes can be considered a qualitative example of health versus disease. Consequently, highly accurate diagnostic biomarkers such as glucose, insulin and C-peptide already exist. Even when treated, type 1 diabetes has a profound long-term impact on energy metabolism and it represents a distinct data cluster that is separate from the non-diabetic population.
Unlike type 1 diabetes, common age-associated diseases such as chronic kidney disease, type 2 diabetes, and atherosclerosis are challenging from a clustering perspective: they take decades to develop, they are not immediately life-threatening if left untreated, and there is a wide variation in severity across individuals. Furthermore, the affected individuals often suffer from multiple interacting chronic conditions, making it difficult to isolate specific causes and symptoms. Therefore, the simplistic notion of a qualitative threshold between health and disease becomes problematic. We aim to address these challenges by creating subgroups that are of practical value beyond mathematical criteria, and guided by a human observer with access to understandable presentations of the multivariable data patterns.
Multiple co-occurring and inter-connected phenomena are hallmarks of complex systems and the observable data that can be obtained from them. This presents a challenge to the traditional paradigms of biomedicine. For instance, differential diagnostics cannot cope well with multiple overlapping diseases, or evolving degrees of severity. This motivated us to develop the Numero framework in such a way as to enable visual comparisons of multiple overlapping diagnoses and their diagnostic criteria. We expect Numero to be highly valuable in situations where the most important outcome or a set of outcomes is not obvious (e.g. competing risk scenarios). For instance, patients with type 1 diabetes may develop serious injuries to their vasculature over decades, but the affected organs, severity and rate of progression vary. Therefore, predictive models that focus only on a single outcome at a time may miss the big picture. The example of diabetic kidney disease we use in this vignette demonstrates how to use the Numero framework to gain insight into the overlaps and longitudinal associations between multiple morbidities.
Expressing multivariable data in visual form is a critical part of any knowledge discovery process, and an extensive number of algorithms have been developed in recent decades. In many cases, the aim is to project a set of multivariable data points into a two-dimensional presentation for human viewing (Figure 1). We built the Numero package using the self-organizing map (SOM) algorithm (Kohonen, Schroeder, & Huang, 2001), which is based on only a few simple mathematical rules, does not break down from missing data and can handle a high number of variables. We also developed a method to estimate the statistical significance of the map patterns (V.-P. Mäkinen et al., 2008b). Of note, the modular structure of the library allows users to replace the SOM with any other suitable algorithm for customized analysis pipelines.
Figure 1: A conceptual example
The example shows how to organize objects with multiple features into a two-dimensional layout. The images were obtained from Cardoso, Queiroz, & Lima (2014).
Conceptually, the SOM algorithm mimics a human observer who wants to make sense of a set of objects. For instance, Figure 1A depicts schematic drawings of the flowering legume genus Luetzelburgia that grows in South America (Cardoso et al., 2014). Figure 1B shows how a human observer might organize the drawings based on their visual similarities (shape, size and other morphological details). By organization, we refer to the spatial layout of the drawings on the two-dimensional canvas: drawings that look similar are close to each other, whereas drawings that look different are far apart (in most cases). This is how all people, from children to the elderly, sort and classify objects with multiple observable features (= multivariable data points) with the help of a two-dimensional surface (= data map). The same observer then decides how to split the dataset into subgroups based on his or her domain knowledge.
If there are thousands of drawings, manual organization becomes impractical. For this reason, we let the SOM algorithm do the first organization step and visualize the salient patterns of the dataset on a two-dimensional data map. The spatial principle still applies: multivariable data points that have similar values are close to each other, whereas data points that are different are on the opposite sides of the map. The second step of defining subgroups remains the responsibility of the observer. We argue that this type of data-assisted subgrouping is particularly useful in situations where there is no qualitative threshold between health and disease, but a line must be drawn to initiate preventative measures or treatments.
Although there are only 18 drawings in Figure 1, the nature of the dataset resembles many epidemiological studies. Specifically, some of the drawings are very similar, but it is not obvious how they should be classified into subgroups (i.e. our version of the figure can be disputed; a single “correct” visual subgrouping may not exist). If the classification were based on the height of a drawing, the results would look different compared to using the width – some drawings are narrow while being long, whereas others are wide despite being short. This is a naïve example of how the selection of the mathematical criterion for classification has a substantial impact on the results, and it illustrates the motivation for our work. We developed the Numero library as an alternative tool that helps researchers to define meaningful groupings when pure mathematics cannot provide a conclusive answer.
Previous versions of the software (written for Matlab) were successfully used in a range of metabolomics and other biomedical studies (Bernardi et al., 2010; Kumpula et al., 2010; Kuusisto et al., 2012; V.-P. Mäkinen et al., 2008a, 2008b, 2013; Tukiainen et al., 2008; Würtz et al., 2011). However, the old version used a rectangular SOM, which tends to guide observers into picking four subgroups in the corners even when this is not supported by the data. We created the Numero package with a circular implementation of the SOM to remove the limitations of cornered border shapes. Additional technical details and supportive material are available as an online supplement to a previous publication (V.-P. Mäkinen et al., 2012).
Data point – Here, we define the term data point as a single uniquely identifiable row in a spreadsheet of data (with variables as columns). For instance, in the diabetic kidney disease dataset described in the next section, a data point refers to a patient (and vice versa) as there is only one row per patient.
Map – A map is a general term for the two-dimensional canvas onto which the multivariable data points are projected. The concept is analogous to a geographic map that indicates where people live, except that the locations are not based on geography (i.e. physical distances) but on the data (i.e. distances = data-based similarities).
Layout – We make a distinction between the map and the layout of data points on it. The layout is a table of data point locations as coordinates, whereas the map is a more integrated concept that also includes the information necessary to find the locations of new, previously unseen data points, and to draw and paint the map in visual form.
District – A district refers to a pre-defined division of the map into uniformly sized areas. The districts are created mainly for technical reasons: using districts speeds up calculations and enables the estimation of map-related statistics. This is analogous to a real city being divided into districts to estimate regional demographics, for instance.
Coloring – The Numero framework always creates a single map. However, the map districts can be painted with different colors. This enables the user to create multiple colorings of the map to visualize regional differences. A coloring can be made for each variable, which helps to identify which parts of the map are particularly important for a specific phenomenon. Again, this is similar to a real city map where the districts are colored according to the income level of the local residents, or according to the mean age, smoking rates, obesity etc.
Subgroup – We expect that most uses of Numero will result in the subgrouping of a complex dataset. Visually, we define a subgroup as a contiguous set of adjacent districts on the map. Consequently, all the data points located within that set of districts are the subgroup members.
District profile – The SOM algorithm works through the districts during the optimization of the data point layout on the map. The computational process eventually converges to a stable configuration that is stored as a set of district profiles. From a practical point of view, a district profile represents the typical average profile that captures the characteristics of the data points within the district. In technical terms, the district profile (also known as the prototype) contains the weighted mean data values across all the data points, where the weights are determined by the neighborhood function of the SOM algorithm.
Best-matching district – The best-matching district (BMD, also known as the best-matching unit in the literature) is the district whose profile is the most similar to a data point when considering all variables simultaneously. The BMD determines the data point layout: the assigned location of a data point is the location of its BMD.
Diabetic kidney disease is the leading indication for dialysis and kidney transplantation in developed countries, and it carries a substantial risk of premature death due to cardiovascular disease. About one third of individuals with type 1 diabetes will develop diabetic kidney disease during their lifetime. As the onset of type 1 diabetes occurs in childhood or adolescence, these individuals develop complications at a relatively early age. Therefore, people with type 1 diabetes represent a particularly vulnerable group facing lower quality of life and reduced life span due to kidney damage.
Albuminuria (elevated albumin concentration in urine) is the basis for the clinical classification of diabetic kidney disease. In this example, we applied a threshold of 300 mg/24 h when 24-h urine collections were done, and 0.2 mg/min when overnight urine data were available from the local medical centers that examined the patients. If the threshold was exceeded in at least two out of three consecutive measurements, we assigned the individual to the diabetic kidney disease group. In addition, the FinnDiane Study Group measured the urinary albumin excretion rate from a single 24-h urine sample in their designated central laboratory. The logarithm of the albumin excretion rate was included in the example dataset.
Our example dataset contains a subset of data from a previous publication (V.-P. Mäkinen et al., 2008b). We created the simplified dataset for educational purposes, but it contains enough information to replicate some of the findings from the original study. The dataset includes 613 individuals of whom 225 individuals had diabetic kidney disease at baseline. In addition, we included information on whether an individual had died after an eight-year follow-up to demonstrate how the study design we chose can be applied to longitudinal data. The available data are summarized in Table 1.
Table 1: Clinical characteristics of the example dataset by kidney disease status (mean ± SD or n (%)).

| Trait | No kidney disease | Diabetic kidney disease | P-value |
| --- | --- | --- | --- |
| Men / Women | 192 / 196 | 119 / 106 | 0.45 |
| Age (years) | 38.8 ± 12.2 | 41.7 ± 9.7 | 0.0012 |
| Type 1 diabetes duration (years) | 25.3 ± 10.3 | 28.6 ± 7.8 | <0.001 |
| Log10 of AER (mg/24h) | 1.20 ± 0.51 | 2.72 ± 0.59 | <0.001 |
| Log10 of TG (mmol/L) | 0.034 ± 0.201 | 0.159 ± 0.212 | <0.001 |
| Total cholesterol (mmol/L) | 4.89 ± 0.77 | 5.35 ± 0.96 | <0.001 |
| HDL2 cholesterol (mmol/L) | 0.54 ± 0.16 | 0.51 ± 0.18 | 0.027 |
| Log10 of serum creatinine (µmol/L) | 1.94 ± 0.09 | 2.14 ± 0.24 | <0.001 |
| Metabolic syndrome | 90 (23.2%) | 114 (50.7%) | <0.001 |
| Macrovascular disease | 16 (4.1%) | 38 (16.9%) | <0.001 |
| Diabetic retinopathy | 133 (34.4%) | 178 (79.1%) | <0.001 |
| Died during follow-up | 13 (3.4%) | 43 (19.1%) | <0.001 |
In the original study, we hypothesized that the metabolic profile of an individual with type 1 diabetes at baseline predicts adverse events in the future (V.-P. Mäkinen et al., 2008b). Here, we set two aims to test the same hypothesis in the example dataset:

Aim 1: to define and describe metabolic subgroups of type 1 diabetes based on the baseline biomarker data (the training set).

Aim 2: to investigate how the subgroups are associated with clinical end-points and mortality during follow-up (the evaluation set).
We chose these aims to accommodate a high number of variables and to ensure statistical robustness. Please note that we included only a few variables in the example dataset for pedagogical reasons, but the SOM in the original study was created based on thousands of variables.
The strict separation of Aims 1 and 2 is an example of an unsupervised classification design where the metabolic subgroups are created without using the mortality data. Only after the subgroup modeling has been completed are the deaths during follow-up counted within the subgroups. An alternative would be to employ regression or other supervised methods that use all the available data simultaneously to create a predictive model of mortality. While supervised models can achieve high accuracy, they rarely work well outside the dataset they were created for, and they may fail if the outcome to be predicted is poorly defined or biased. For these reasons, we adopted the more robust unsupervised classification design.
We denote the study design as “split-by-variable” since it starts from a spreadsheet with one patient per row and the variables organized into columns, and then assigns one set of variables to the training set and the remaining variables (e.g. deceased or alive at follow-up) to the evaluation set (Figure 2). Since the evaluation set plays no part in the training of the SOM, we can estimate the statistical significance of the mortality pattern without over-estimating the model accuracy. A minimal code sketch of the split is shown after Figure 2.
Figure 2: Application of the split-by-variable study design in the diabetic kidney disease example
Of note, the training set is adjusted for sex differences; hence the ‘MALE’ column is not formally included in the evaluation set.
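As a minimal sketch, the split amounts to partitioning the column names of the data frame db that is imported in the next section (the variable names evvars below are our own illustration, not part of the package):

# Split-by-variable design: baseline biomarkers form the training set,
# the clinical variables form the evaluation set, INDEX is the key column
# and MALE is used only for sex-specific adjustment.
trvars <- c("CREAT_log", "CHOL", "HDL2C", "TG_log", "uALB_log")  # training set
evvars <- setdiff(colnames(db), c("INDEX", "MALE", trvars))      # evaluation set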
The architecture of the analysis pipeline for the diabetic kidney disease example is detailed in Figure 3. First, we describe how to preprocess the data for analysis (Figure 3A-D). Next, we create the SOM based on the training set (Figure 3B,E). The third segment focuses on map statistics and how to color the maps according to regional variation (Figure 3F,G). Lastly, we discuss the interactive subgroup selection and interpretation of the results (Figure 3H,I).
Figure 3: Analysis steps in the diabetic kidney disease example
We have included the example dataset in the installation package. To access it, type
library(Numero)  # provides the nro* functions used throughout the vignette
fname <- system.file("extdata", "finndiane_dataset.txt",
                     package = "Numero")   # path to the bundled example file
db <- read.delim(file = fname, sep = "\t") # import as a data frame
summary(db)
## INDEX AGE T1D_DURAT MALE
## Min. : 1 Min. :15.00 Min. : 2.59 Min. :0.0000
## 1st Qu.:154 1st Qu.:31.00 1st Qu.:19.00 1st Qu.:0.0000
## Median :307 Median :39.00 Median :26.00 Median :1.0000
## Mean :307 Mean :39.86 Mean :26.53 Mean :0.5073
## 3rd Qu.:460 3rd Qu.:48.00 3rd Qu.:34.00 3rd Qu.:1.0000
## Max. :613 Max. :74.00 Max. :53.00 Max. :1.0000
##
## DECEASED MACROVASC METAB_SYNDR DIAB_KIDNEY
## Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.000
## Median :0.00000 Median :0.00000 Median :0.0000 Median :0.000
## Mean :0.09135 Mean :0.08809 Mean :0.3328 Mean :0.367
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.000
## Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.000
##
## DIAB_RETINO uALB_log TG_log CHOL
## Min. :0.0000 Min. :0.3617 Min. :-0.37160 Min. : 2.920
## 1st Qu.:0.0000 1st Qu.:0.9590 1st Qu.:-0.06550 1st Qu.: 4.470
## Median :1.0000 Median :1.5682 Median : 0.05690 Median : 4.980
## Mean :0.5082 Mean :1.7526 Mean : 0.08023 Mean : 5.061
## 3rd Qu.:1.0000 3rd Qu.:2.4900 3rd Qu.: 0.21750 3rd Qu.: 5.600
## Max. :1.0000 Max. :3.8788 Max. : 0.90850 Max. :10.000
## NA's :1 NA's :28
## HDL2C CREAT_log
## Min. :0.0910 Min. :1.415
## 1st Qu.:0.4120 1st Qu.:1.909
## Median :0.5200 Median :1.978
## Mean :0.5273 Mean :2.013
## 3rd Qu.:0.6400 3rd Qu.:2.061
## Max. :1.1900 Max. :3.035
##
We hypothesize that the metabolic phenotype of an individual predicts future adverse outcomes. To investigate the hypothesis, we select all blood and urine biomarkers at baseline as the training set (Aim 1), and then use the remaining columns that contain data on clinical end-points and mortality as the evaluation set (Aim 2).
trvars <- c("CREAT_log", "CHOL", "HDL2C", "TG_log", "uALB_log")
If our hypothesis is correct, we should see a statistically significant regional pattern for mortality on the SOM that we constructed based on the metabolic variables at baseline. This is the split-by-variable study design that was previously described in Figure 2.
In the data file, the biomarkers are expressed in their physical concentration units, or as log-transformed versions. As a consequence, the standard deviations of the data columns vary, which can bias the SOM towards the biomarkers with the widest numerical variation. In most cases, it is desirable to standardize the training set before the analyses, so that the information content rather than the measurement scale determines the modeling outcome.
Sex difference is another factor to consider when preparing the training set. Men and women display anatomical and metabolic differences, which usually complicate the interpretation of the SOM. For this reason, we recommend using a sex-specific standardization procedure that eliminates the differences. If necessary, separate visualizations can be made afterwards for men and women using the same map; see V.-P. Mäkinen et al. (2012) for an example.
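As a minimal base-R sketch of what sex-specific standardization entails (a conceptual illustration only, not the package implementation; the nroPreprocess() call below also screens for unusable rows and columns):

# Z-score one biomarker within men and women separately; 'db' is the raw
# data frame imported earlier, 'zscore' is our own helper for illustration.
zscore <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
chol.std <- ave(db$CHOL, db$MALE, FUN = zscore)  # center and scale per sex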
The Numero package contains a pre-processing function that checks the data for unusable rows and columns, and we can use it to center and scale the data for men and women separately. Our dataset contains an explicit identity column, so it needs to be declared in the pre-processing function; the function then assigns these identifiers as row names in the output. If no key column is provided, the existing row names are copied as such.
db <- nroPreprocess(data = db, training = trvars, strata = "MALE",
key = "INDEX")
The function returns a list with three members; the two used in this vignette are values, which contains the numeric data without the key column or unusable rows, and features, which contains the standardized training variables.
You can verify that the training data are zero-centered within each sex by typing the following commands, which produce the output below:
men <- which(db$values[,'MALE'] == 1)
women <- which(db$values[,'MALE'] == 0)
print(summary(db$features[men,]))
## CREAT_log CHOL HDL2C TG_log
## Min. :-1.7142 Min. :-2.24260 Min. :-2.4881 Min. :-2.0232
## 1st Qu.:-0.5691 1st Qu.:-0.69266 1st Qu.:-0.6940 1st Qu.:-0.7095
## Median :-0.2694 Median :-0.06336 Median :-0.1069 Median :-0.1533
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.1772 3rd Qu.: 0.62420 3rd Qu.: 0.6541 3rd Qu.: 0.5876
## Max. : 4.8941 Max. : 5.16912 Max. : 3.3436 Max. : 3.7445
##
## uALB_log
## Min. :-1.5656
## 1st Qu.:-0.9274
## Median :-0.0868
## Mean : 0.0000
## 3rd Qu.: 0.8118
## Max. : 2.1943
## NA's :13
print(summary(db$features[women,]))
## CREAT_log CHOL HDL2C TG_log
## Min. :-3.3538 Min. :-2.5613 Min. :-2.86000 Min. :-1.9467
## 1st Qu.:-0.5241 1st Qu.:-0.6290 1st Qu.:-0.65987 1st Qu.:-0.6907
## Median :-0.1371 Median :-0.1259 Median :-0.07021 Median :-0.1360
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.2969 3rd Qu.: 0.6173 3rd Qu.: 0.57969 3rd Qu.: 0.5740
## Max. : 4.9883 Max. : 5.5339 Max. : 3.82916 Max. : 2.9483
##
## uALB_log
## Min. :-1.4387
## 1st Qu.:-0.8690
## Median :-0.3313
## Mean : 0.0000
## 3rd Qu.: 0.8239
## Max. : 2.3544
## NA's :15
Training a SOM requires two steps: i) initialization of the map and ii) iterative optimization of the district profiles. The initialization of the district profiles influences the usability of the final map, and several methods, including principal component analysis, have been proposed (Attik, Bougrain, & Alexandre, 2005). Our experience suggests that the most useful results are usually achieved by creating a limited number of seed profiles that are placed on the edges of the map; the districts in the middle are then set automatically in such a way that the transition from one seed to another is smooth. The Numero package includes the nroKmeans() function, which we use here to determine the seed profiles. It is based on the classical k-means algorithm, but our implementation is specifically designed for datasets with missing data and produces output that is compatible with the other Numero functions.
The minimum number of seeds is three, as the triangle is the simplest polygon that can cover the map. To create the seed profiles, you can use the command
km <- nroKmeans(data = db$features, k = 3)
The output is a list that contains three named elements; the element used below, centroids, holds the seed profiles.
To show the seeds, type the following command. The seed profile output has the same columns as the training data.
print(km$centroids)
## CREAT_log CHOL HDL2C TG_log uALB_log
## [1,] 1.0612976 0.4464189 -0.7508336 1.07224248 1.08944887
## [2,] -0.3926040 -0.6453715 -0.2515304 -0.47089775 -0.57310817
## [3,] -0.1886087 0.6884679 0.9797789 -0.07189808 0.09436714
Most SOM software is based on rectangular maps, or on borderless maps with periodic boundaries. In our experience, the former suffer from an artificial tendency for observers to define four separate subgroups in the corners. The latter do not suffer from boundary artifacts, but they are complicated to interpret. For these reasons, we developed a circular map topology for Numero: it has no corners, yet its well-defined border limits the visual complexity of the regional patterns.
The preferred size of the map depends on the number of data points (small maps for small datasets); however, we advise against using large maps due to their complexity. For most biomedical and epidemiological applications, map radii between two and five provide enough flexibility and expressive power, based on our experience.
To initialize a circular map with a radius of three districts according to the seed profiles, use the command
sm <- nroKohonen(seeds = km$centroids, radius = 3)
This will create the initial matrix of district profiles, together with the additional topological information required for visualization. To show the district profiles, type the following command; note that the profiles have the same format as the seeds:
print(head(sm$centroids))
## CREAT_log CHOL HDL2C TG_log uALB_log
## 1 0.16002834 0.1631718 -0.007528393 0.17648222 0.20356928
## 2 0.46405652 0.3572670 -0.149647657 0.49388353 0.53864811
## 3 0.10507216 0.3815612 0.297551387 0.15828606 0.23621959
## 4 -0.04348941 0.1989276 0.270231115 -0.01037918 0.04019910
## 5 -0.13134873 -0.1548775 -0.016852841 -0.14811837 -0.16612276
## 6 -0.03213067 -0.1265242 -0.101798836 -0.04994397 -0.06963915
Kohonen’s self-organizing map algorithm was originally developed to mimic the plasticity of neural networks (Kohonen et al., 2001). It scales up well for datasets with a high number of variables, and it can handle missing data values, which is why we chose it as the default method in Numero. To apply the SOM algorithm to the standardized training set, use the command
sm <- nroTrain(som = sm, data = db$features)
The nroTrain() function updates the district profiles in such a way that, when the data points are assigned to their best-matching districts (BMDs), the resulting layout is distributed more evenly across the map. BMDs and the data point layout are discussed in the next section.
The nroTrain() function adds a record of the training process within the output list. To plot it on screen, type
plot(sm$history)
Figure 4: SOM training history in the diabetic kidney disease example
The results are shown in Figure 4. The curve shows the mean Euclidean distance between a data point and its best-matching district profile for each training cycle. In most cases, the first few cycles account for the largest reductions in the training error. The abrupt reduction observed after the first few cycles is part of the training process: for those interested in the technical details, it is caused by switching from a wide neighborhood function to a narrower one. This is beneficial, since starting the training with a wide neighborhood function forces the SOM to adapt to large (and presumably more important) patterns before adapting to the minor details.
We have included additional tools in the Numero package to assess the internal properties of the SOM given a training dataset. However, the full use of these tools requires visualization functions that are not covered until later in the document. For this reason, we will return to this subject in a dedicated section after introducing map colorings.
The application of the SOM algorithm mimics a human researcher who wants to investigate the cohort of type 1 diabetic patients. Suppose the researcher goes through the medical records of all individuals and then organizes the folders on a giant round table in such a way that patients with mutually similar clinical profiles are placed next to each other, whereas patients who are different are on opposite sides of the table. From this organized view, it is then possible to identify sections of the table (= patient subgroups) where there is a high risk of premature death.
To translate the story of the human researcher into a computer program, it is necessary to introduce new concepts, and it is also useful to revisit the pre-defined terminology from the beginning of the document. First, the map (representing the giant round table) is divided into districts for technical reasons, since the type of manual positioning a human would apply is infeasible for large datasets. The districts also act as anchors that enable the assignment of the patients onto specific map positions, again important for technical reasons.
The best-matching district (BMD) for a data point is the second important concept. A data point is compared against a district profile by calculating their pairwise Euclidean distance. This is repeated across the districts, and once the distances between the data point and all district profiles have been calculated, the profile with the shortest distance is chosen as the best match. When the BMDs have been located for all data points, the results are collected in a spreadsheet, which we denote as the data point layout. The layout is conceptually equivalent to the spatial configuration of folders on the human researcher’s table.
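The following base-R lines sketch the matching principle for a single data point (a conceptual illustration only; the actual nroMatch() implementation also tracks coverage and matching quality, and handles missing values more carefully than the simple na.rm shortcut below):

# Conceptual sketch of BMD selection for the first data point:
x <- as.numeric(db$features[1, ])           # one multivariable data point
d <- apply(sm$centroids, 1, function(p)     # Euclidean distance to every
  sqrt(sum((x - p)^2, na.rm = TRUE)))       #   district profile
bmd <- which.min(d)                         # index of the best-matching district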
To create a layout, you can use the command
matches <- nroMatch(som = sm, data = db$features)
The output is an integer vector whose elements correspond to the rows in the training data; each element contains the index of the best-matching district. The vector also has the attribute quality. To print a section of the quality output on screen, type the following command; its output is a data frame with three columns:
head(attr(matches,'quality'))
## COVERAGE RESIDUAL QUALITY
## 1 1 0.29996141 0.8554628
## 2 1 0.16749380 1.1447729
## 3 1 0.21269967 1.0263238
## 4 1 0.09154408 1.4201384
## 5 1 0.23710154 0.9720335
## 6 1 0.12177353 1.2960540
The column RESIDUAL shows the Euclidean distance between a data point and the BMD (the shorter the better). QUALITY is calculated from the distance by dividing it by the average training error. This provides a scale-independent relative estimate of how well a data point was matched compared to a typical data point in the training set. Finally, COVERAGE shows if a multivariable data point contained missing elements (1 means all elements were usable and 0 means none of the elements contained a numerical value).
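As a small usage sketch, the quality attribute can be used to flag poorly matched data points; the 0.5 cut-off below is an arbitrary choice for illustration, not a package recommendation:

q <- attr(matches, "quality")
poor <- which(q$QUALITY < 0.5)   # matched clearly worse than a typical point
head(db$values[poor, ])          # inspect the original values of those points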
Ideally, the data points should be uniformly distributed across the districts. While uniformity is rarely observed for real datasets, uneven numbers are usually not a problem unless there are large contiguous groups of districts with very high or very low occupancy. We will revisit the spatial data point distribution in a later section that addresses map quality, but a tabulation of the BMD assignments is a quick way to see if the output is reasonable:
t <- table(matches)
counts <- data.frame(DISTRICT = names(t), N = as.integer(t))
print(counts, row.names = FALSE)
## DISTRICT N
## 1 14
## 2 12
## 3 22
## 4 14
## 5 24
## 6 13
## 7 11
## 8 19
## 9 14
## 10 11
## 11 13
## 12 12
## 13 5
## 14 9
## 15 16
## 16 9
## 17 12
## 18 15
## 19 8
## 20 9
## 21 4
## 22 18
## 23 28
## 24 8
## 25 21
## 26 17
## 27 19
## 28 23
## 29 21
## 30 24
## 31 20
## 32 19
## 33 14
## 34 30
## 35 7
## 36 19
## 37 18
## 38 17
## 39 15
## 40 9
If a large number of districts in the above output were devoid of data points, it would indicate that the map did not capture the diversity of the dataset and, therefore, would not be useful for subgrouping. However, the data points are scattered across all districts in this example, which suggests the layout will be useful.
Statistical evaluation of whether an observation is likely to occur purely by chance is the cornerstone of biomedical data analysis. In our example, we achieve well-defined statistical analyses via the split-by-variable design and the non-parametric permutation engine that is built into the package. The former ensures that our results are not over-optimistic (no over-fitting) and the latter enables us to avoid restrictive assumptions on the nature of the data generating processes that are often violated in real datasets.
In our example, we investigate if the metabolic profile at baseline indicates the risk of death during follow-up. To estimate statistical significance, it is necessary to find out how much the areas of the map can differ with respect to mortality just by virtue of random fluctuations. This concept is formally encapsulated by the null hypothesis. Here, the null hypothesis states that the data point layout is not associated with the number of deaths, that is, the location of a patient on the map does not provide any information on how likely the patient is to die in the next eight years. If the null hypothesis is true, then the observed layout and regional patterns of mortality should be within the variation we would expect for random layouts. We use permutation analysis to simulate a high number of random layouts, and then compare the observation with the simulated findings to see if it could have occurred by chance alone (V.-P. Mäkinen et al., 2008b). Within the split-by-variable design, P-values for statistical significance are only meaningful for the variables in the evaluation set since, by definition, the variables in the training set will always be strongly associated with the layout. However, it makes sense to evaluate the expected range of regional variation also for the training set, as we will demonstrate later in the vignette. Knowing the randomly expected amplitude of regional patterns (i.e. the basal amplitude) helps us to assess which of the training variables had the strongest influence on the layout. For these reasons, we will apply the permutation analysis to all variables, but only report P-values for the evaluation set.
The function nroPermute() repeats the following procedure: i) randomly re-assign the best-matching districts in accordance with the null hypothesis, ii) recalculate the average district values across the map, and iii) summarize the regional variation with a single descriptive statistic. When a sufficient number of cycles has been completed, the null distribution of the descriptive statistic is analyzed to determine how far, in terms of standard deviations, the observed value is from the mean predicted by the null hypothesis. This distance is reported as the Z-score of regional variation. Furthermore, the function also estimates how frequently a permuted layout produced regional variation that exceeded the observation (the frequency-based P-value).
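As a conceptual sketch of this procedure for a single evaluation variable (a simplification, not the actual nroPermute() implementation; the package differs in details such as the spatial smoothing of district values):

# Permutation test of regional variation for DECEASED; 'regional.sd' is our
# own helper that summarizes the spread of the district means.
regional.sd <- function(x, d) sd(tapply(x, d, mean, na.rm = TRUE), na.rm = TRUE)
obs <- regional.sd(db$values$DECEASED, matches)   # observed regional variation
null <- replicate(1000,                           # random layouts under the null
  regional.sd(db$values$DECEASED, sample(matches)))
z <- (obs - mean(null)) / sd(null)                # Z-score of regional variation
p.freq <- mean(null >= obs)                       # frequency-based P-value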
The nroPermute() function goes through all variables in a dataset. Please note that we calculate the statistics on the data frame values (it contains only numeric values without a key column or unusable rows). Internally, the function also distinguishes between training variables (no P-value needed, so fewer cycles suffice) and evaluation variables that are tested with a larger number of cycles. The function call is shown below:
stats <- nroPermute(som = sm, data = db$values,
districts = matches)
To see the results, type the following command:
print(stats)
## SCORE Z P.freq N.data N.cycles TRAINING
## AGE 2.64097289 4.9259340 0.0000000 613 10000 no
## T1D_DURAT 3.16434620 5.0704949 0.0000000 613 10000 no
## MALE 0.03183488 -0.9824115 0.8333333 613 24 no
## DECEASED 0.08340699 6.2458777 0.0000000 613 10000 no
## MACROVASC 0.06011753 4.4950270 0.0000000 613 10000 no
## METAB_SYNDR 0.20813671 8.4682837 0.0000000 613 10000 no
## DIAB_KIDNEY 0.30830856 10.8681282 0.0000000 613 10000 no
## DIAB_RETINO 0.16995995 7.1931211 0.0000000 612 10000 no
## uALB_log 0.69966738 15.7798023 NA 585 1000 yes
## TG_log 0.12724597 11.2545995 NA 613 1000 yes
## CHOL 0.48767044 12.9654093 NA 613 1000 yes
## HDL2C 0.10087360 13.1108202 NA 613 1000 yes
## CREAT_log 0.08877195 11.2528273 NA 613 1000 yes
## P.z AMPLITUDE
## AGE 4.197917e-07 0.3808101
## T1D_DURAT 1.983913e-07 0.3919857
## MALE 8.370514e-01 0.0000000
## DECEASED 2.107135e-10 0.4828512
## MACROVASC 3.478056e-06 0.3474979
## METAB_SYNDR 1.245176e-17 0.6546592
## DIAB_KIDNEY 8.176055e-28 0.8401844
## DIAB_RETINO 3.166337e-13 0.5560799
## uALB_log NA 1.2198921
## TG_log NA 0.8700614
## CHOL NA 1.0023193
## HDL2C NA 1.0135606
## CREAT_log NA 0.8699244
The Z column contains Z-scores that indicate how far the observed regional variation is from the mean expected value if the null hypothesis is true. P.z is a parametric estimate of statistical significance based on the Z-scores and the cumulative Gaussian distribution, whereas P.freq is the frequency-based estimate of statistical significance; it is calculated as the frequency with which a simulated random layout produced regional variation exceeding the observed variation. N.data indicates how many data values were used, and N.cycles gives the number of completed permutations. The column TRAINING indicates whether a variable was used during the training process; please note that the P-values are missing for those variables. AMPLITUDE contains the dynamic range of colors that can be used in map visualizations. The amplitudes are required for the assignment of district colors and will be described in closer detail later.
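As a small usage example, the evaluation variables with statistically significant regional variation can be extracted directly from the output shown above:

subset(stats, TRAINING == "no" & P.z < 0.05)  # significant evaluation variables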
After estimating the map statistics, we now have all the results that are required to color the map according to the data patterns: the topological information carried within the variable sm allows us to draw the districts correctly, the data point layout specifies the locations of the patients on the SOM, and the amplitudes from nroPermute() set the dynamic range of the colors.
Assigning a color palette to a set of values is not much different from photography. When a photo is taken, the intensity of light is converted into numbers by the digital camera, and then converted back to light on the viewing screen. If there is too much light, the photo gets overexposed, which means that most pixels show up as “burned” since the intensity is beyond their dynamic range (i.e. light saturates the sensor). If the photo is underexposed, most pixels will show a zero signal (i.e. the light is below the detection limit). In principle, the SOM colors work the same way: we aim to set up an optimal color assignment so that the colorings with very high regional variation do not over-expose too much, while the colorings with less regional variation can still show differences between districts despite under-exposure.
In the Numero framework, a photo corresponds to a map coloring (please revisit the Terminology section if necessary), light intensity is analogous to statistical significance (captured by Z-scores), and the dynamic range is the gap between the lowest and highest district averages. Importantly, the “camera settings” are kept constant to ensure all colorings remain visually comparable. Ideally, the camera would be set up so that the full dynamic range of every coloring could be expressed within the available color palette. However, this approach is usually impractical, as interesting detail could be lost for variables that show statistically modest but biologically critical variation. The dynamic range of colors is stored in the AMPLITUDE column of the output of nroPermute().
In brief, Z-scores indicate the statistical support for the observed regional variation. But before the information can be visualized, the Z-scores have to be converted to color amplitudes so that the map coloring reflects the strength of the statistical evidence.
The color of a district depends on the estimated mean value across its local resident data points. To calculate the district values, use the command
comps <- nroAggregate(topology = sm$topology, data = db$values,
districts = matches)
The output of the function is a data frame containing the average district values.
In the SOM literature, the set of district mean values for a variable is typically referred to as the component plane, hence the name of the output. We now have all the materials to assign colors to each district based on i) the amplitude of each variable, which tells how much “exposure” the camera provides, and ii) the component plane, which gives the district means and their dynamic range:
colrs <- nroColorize(values = comps, amplitudes = stats$AMPLITUDE)
The output is a data frame of colors in a format that matches the values in the component plane and can be used in subsequent Numero functions.
Due to the standardization by z-scores, the colors are not directly relatable to the original measurement units, or to the original binary categories. For this reason, text labels that indicate the actual mean values for selected districts are a useful visual addition to the final map plot. To create a set of labels for the map coloring, use the command
labls <- nroLabel(topology = sm$topology, values = comps)
The Numero package contains functions to visualize map colorings on screen, to create interactive colorings for defining subgroups, and to save the colorings to a file in the Scalable Vector Graphics (SVG) format.
To see all the map colorings on screen (Figure 5), use the following command:
elem <- nroPlot(elements = sm$topology, colors = colrs,
labels = labls, values = comps)
Figure 5: Statistically normalized colorings of all variables in the kidney disease dataset
The color intensity depends on how likely the observed regional variation would arise by chance; intense reds and intense blues indicate that these extremes would be very unlikely if the data point layout was random.
To save the plots into an SVG file, you can use the same command and provide a file name as a parameter. The following command is not executed during the creation of the vignette; it serves as an example only:
nroPlot(elements = sm$topology, colors = colrs,
labels = labls, values = comps, file = 'test.svg')
It is possible to direct the figure to any of the R graphics devices, including the SVG device, but the Numero SVG file will be cleaner and structured in a way that makes it easier to edit manually in graphics programs such as Inkscape.
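For example, a minimal sketch using the svg() device from base R's grDevices (the file name and dimensions are arbitrary choices for illustration):

svg("map_device.svg", width = 10, height = 10)  # open an R graphics device
nroPlot(elements = sm$topology, colors = colrs,
        labels = labls, values = comps)
dev.off()                                       # close the device to write the file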
In the final section, we summarize and discuss the results of the SOM analysis. As previously mentioned, the big open problems in biomedicine and public health are typically characterized by multiple synergistic risk factors that produce a gradual decline in biological functions over time. For this reason, the observed data patterns are not likely to be self-explanatory, but will require additional analyses and contextual assessment with respect to how the original data were collected and which findings are clinically impactful.
Before delving into the characteristics of diabetic kidney disease, it is prudent to examine the SOM for potential problems with the data. The Numero package provides three different quality metrics: i) the data point histogram reveals problems of misrepresentation between the data points and the district profiles, ii) the coverage map shows systematic patterns of missing data that may influence the results, and iii) the matching quality indicates subgroups of data points that may have been modeled poorly.
To calculate all three, we can use the nroAggregate() function with the quality measures that were calculated by nroMatch().
comps.qc <- nroAggregate(topology = sm$topology,
data = attr(matches, "quality"),
districts = matches)
The output contains the district averages for the quality metrics. In addition, nroAggregate always estimates the spatial histogram that tells how many samples are within each district, and returns it as an attribute. To add the histogram information to the quality visualization, we copy the attribute into a new column in the data frame:
comps.qc$HISTOGRAM <- attr(comps.qc, "histogram")
The output format is equivalent to what was used for the map colorings, so the same code sequence is applied to create an SVG figure. To make a distinction between diagnostic and other colorings, we use a different color palette for the nroColorize function:
colrs.qc <- nroColorize(values = comps.qc, palette = "fire")
labls.qc <- nroLabel(topology = sm$topology, values = comps.qc)
Again, we can use the nroPlot function to visualize the results on screen.
elem.qc <- nroPlot(elements=sm$topology, colors=colrs.qc,
labels=labls.qc)
Figure 6: Visualization of SOM quality metrics
Light (dark) colors indicate high (low) values. The color intensity was not normalized statistically. Coverage indicates the proportion of usable data values; residuals indicate model fit (smaller values are better); quality is a scale-independent measure based on the residuals (larger is better). Finally, the histogram shows smoothed estimates of how many samples were assigned to each district.
We observed coverage close to 1 across the map, which reflects the low frequency of missing elements in the original data matrix (Figure 6).
There are two ways to show matching quality, either by coloring the map according to the mean matching errors for districts, or by examining the matching errors of individual data points (also referred to as quantization errors or model residuals). These are shown in the colorings for RESIDUAL and QUALITY in Figure 6. Again, some regional differences are expected, but there were no indications of serious problems. In particular, the relative quality even in the worst region was close to the average training quality (i.e. close to one).
Finally, the HISTOGRAM coloring in Figure 6 shows that there were noticeable differences between the districts. However, there was a sufficient data point count everywhere on the map and, based on our experience from previous studies, it is unlikely that the results were adversely affected due to sparse representation.
Important note on reproducibility: The Numero framework uses optimized code that reduces memory footprint and computational burden. For this reason, different computers, particularly 32-bit vs. 64-bit architectures, may produce map patterns that have been flipped, mirrored, rotated or otherwise transformed when compared with the figures in the vignette. This is a technical limitation due to machine precision, not an unintentional mistake in the code.
As discussed earlier, we used the split-by-variable study design in this example. This meant that the SOM was trained using a subset of the available variables (biochemical data), which allowed us to investigate the associations with the clinical variables without a high risk of overfitting. To visualize the results, we follow the same logic. Below, we first investigate the training data to get insight into the metabolic profiles and diversity within the dataset. This will also allow us to define biochemical subgroups from a multivariable perspective. Later on, we will overlay the clinical variables onto the map to identify subgroups of clinical importance.
trvars <- colnames(db$features)
elem <- nroPlot(elements=sm$topology, colors=colrs[,trvars],
labels=labls[,trvars])
Figure 7: Statistically normalized colorings of the training variables in the kidney disease dataset
The color intensity depends on how likely the observed regional variation would arise by chance; intense reds and intense blues indicate that these extremes would be very unlikely if the data point layout was random.
Figure 7 shows the map colorings for the training set. Serum creatinine (log-transformed) was substantially higher for a subset of individuals located on the top part of the map compared to the lower part, and a similar pattern was found for the log-transformed measurements for urinary albumin excretion. As elevated serum creatinine and urinary albumin are hallmarks of kidney disease, it is likely that the individuals who were assigned to the top part of the map had kidney disease as the underlying explanation.
The patterns for the lipids were more complicated. Cholesterol showed a pattern of high concentrations in the upper-right area and low concentrations in the bottom left, whereas HDL2 cholesterol was the highest in the bottom-right and the lowest in the upper-left. Triglycerides (log-transformed) showed a general pattern of high concentrations in the upper part of the map.
We recommend using the SOM together with conventional approaches, such as linear correlations, to gain a broader understanding of the nature of the dataset. For instance, cholesterol and triglycerides were correlated (r = 0.43, P < 0.001); however, the SOM colorings suggest that the correlation may not apply to all individuals; in particular, those in the upper-left area with high triglycerides did not seem to follow the linear trend. Other dimension reduction methods such as principal component analysis may work better in datasets with clear clusters (a typical SOM analysis may miss the clustering structure) and, again, we recommend using multiple conceptually different methods to achieve robust conclusions. We did not observe any obvious clustering structure in the kidney disease dataset (results not shown).
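The quoted correlation can be checked with base R, using the preprocessed numeric values (the exact estimate may differ marginally depending on how missing values are handled):

cor.test(db$values[, "CHOL"], db$values[, "TG_log"])  # Pearson correlation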
The aims of the example study were i) to define and describe metabolic subgroups of type 1 diabetes, and ii) to investigate how the subgroups are associated with mortality. The ability to choose subgroup boundaries on the map while simultaneously observing multiple variables is the main strength of the Numero framework. Furthermore, the intensity of the colorings guides the process towards selecting criteria that have the strongest statistical support.
The Numero framework offers an interactive way to define subgroups. To start the interactive process to define subgroups based on the biochemical variables used in training, run the following command:
trvars <- colnames(db$features)
elem <- nroPlot(elements=sm$topology, colors=colrs[,trvars],
labels=labls[,trvars], interactive=TRUE)
Subgroups can be defined interactively by clicking on districts in the map colorings in the plot window. We will step through one example, defining the subgroup with high creatinine.
As a vignette does not allow for an interactive process, we will provide screenshots of the process in the following section.
After running the above command, the training colorings are shown in the plot window. We click on the district with the highest creatinine value in the CREAT_log coloring, as shown in Figure 8.
Figure 8: Screenshot: Interactive definition of subgroups - step 1: Choosing the district with the highest creatinine value
Now, we choose other districts that we want to add to this subgroup (Figure 9). Clicking on a district in one coloring also updates the corresponding district in all other colorings.
Figure 9: Screenshot: Interactive definition of subgroups - step 2: Choosing other districts with a higher creatinine value
Once all districts belonging to this first subgroup have been chosen, a click in the plot window outside of a coloring exits the selection process. The console window then asks for a descriptive name for the subgroup and a confirmation of the selection. Here, we choose the name High creatinine (Figure 10) and confirm the subgroup. After that, the selected districts are labeled with A, indicating the first subgroup (Figure 11).
Figure 10: Screenshot: Interactive definition of subgroups - step 3: Subgroup name and confirmation
Figure 11: Screenshot: Interactive definition of subgroups - step 4: Updated interactive plot after subgroup confirmation
Subgroups are automatically labeled from A to Z in the graphical output. A district selection can be changed by clicking on top of it.
Now, we continue defining subgroups in the same manner until all districts have been assigned. Then, we press the finish button at the top right of the plot window and confirm that the session should be terminated. Map visualizations that contain the subgroup choices can be saved with the following command:
nroPlot(elements=elem, colors=colrs, labels=labls, file="subgroups.svg")
In this case, we decided to define five subgroups. The overall result of the subgrouping with the above command is illustrated in Figure 12. Note that this figure was pre-created outside the vignette.
Figure 12: The five subgroups in the diabetic kidney disease example
The grouping is the result of the example interactive process. The SVG was pre-created by saving the results of the interactive process.
In this example, we have chosen the following descriptive names during the interactive procedure: High creatinine (A), High cholesterol (B), High HDL2 cholesterol (C), High triglycerides (D), and Low lipids (E).
While we admit that our choices for the subgroup boundaries were subjective, we also argue that any observer can dispute those choices and provide an alternative by examining the figures. Therefore, the transparency of the methodology allows collective objectivity that is superior to strict “black box” classifiers, especially when the data patterns overlap and involve multiple outcomes.
Please note that the boundaries may not fit exactly with any specific variable, since we also required the subgroups to be mutually exclusive. This is the part where no perfect mathematical solution exists, due to overlaps and multi-morbidity.
The second aim of the study was to compare the subgroups with respect to mortality and clinical diagnoses. Graphical comparisons of the metabolic subgroup boundaries and selected map colorings are shown in Figure 12. Mortality was the highest in the top section of the map (34% in eight years), as seen in the DECEASED coloring, and the same region was also characterized by a greater than 90% prevalence of diabetic kidney disease, as seen in the DIAB_KIDNEY coloring. As expected, the High Creatinine Subgroup (A) captured this segment of the study population. In addition, a few districts with increased mortality and kidney disease prevalence were found within the High Cholesterol Subgroup (B), and similar spill-over was observable in the High Triglycerides Subgroup (D). On the other hand, the Low Lipids Subgroup (E) showed the lowest rates of death or complications across all the plots in Figure 12.
The metabolic syndrome is a clinical entity describing the co-occurrence of obesity, diabetes, high blood pressure and abnormal blood lipids that is often observed in people at risk of cardiovascular death. Triglycerides and HDL cholesterol comprise the lipid component of the metabolic syndrome, which explains the similar yet distinct patterns with respect to cardiovascular disease (coloring METAB_SYNDR in Figure 12). In particular, over half of the individuals in the High Triglyceride Subgroup (D) have the metabolic syndrome.
Please note that these results were generated on a 64-bit machine; they may differ from results obtained on 32-bit architectures due to the lower machine precision. If you notice discrepancies, please redefine the subgroups yourself and update the R code accordingly.
The function nroSummary calculates the summary statistics for the interactively defined subgroups.
results <- nroSummary(data = db$values, districts = matches,
                      regions = elem$REGION)
For each subgroup and variable, it calculates the mean, standard deviation and median, and it also computes P-values using ANOVA, the t statistic or the chi-square test, depending on the type of data. Of note, P-values for variables used in training are set to NA.
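Before subsetting, it may help to inspect the returned table; a minimal sketch (the exact set of columns may vary between Numero versions):
# Peek at the summary table; the columns used below are
# VARIABLE, SUBGROUP, MEAN and P.chisq.
head(results)
colnames(results)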
Here, we look at the differences in the prevalence of mortality and diabetic kidney disease in each subgroup.
results <- results[which(results$VARIABLE %in% c('DECEASED', 'DIAB_KIDNEY')),
                   c('VARIABLE', 'SUBGROUP', 'MEAN', 'P.chisq')]
| VARIABLE | SUBGROUP | MEAN | P.chisq |
|---|---|---|---|
| DECEASED | High Cholesterol | 0.0476190 | 0.4058467 |
| DECEASED | High Creatinine | 0.2426471 | 0.0000000 |
| DECEASED | High HDL2 cholesterol | 0.0638298 | 0.1254771 |
| DECEASED | High triglycerides | 0.0869565 | 0.0392702 |
| DECEASED | Low lipids | 0.0245098 | 1.0000000 |
| DIAB_KIDNEY | High Cholesterol | 0.3968254 | 0.0000002 |
| DIAB_KIDNEY | High Creatinine | 0.9044118 | 0.0000000 |
| DIAB_KIDNEY | High HDL2 cholesterol | 0.2198582 | 0.0047829 |
| DIAB_KIDNEY | High triglycerides | 0.3623188 | 0.0000016 |
| DIAB_KIDNEY | Low lipids | 0.1029412 | 1.0000000 |

Table 2: Prevalence of death (DECEASED) and diabetic kidney disease (DIAB_KIDNEY) within each subgroup, with chi-square P-values.
Selected findings are listed in Table 2. As expected, the High Creatinine Subgroup had the highest mortality within the follow-up period and the highest prevalence of diabetic kidney disease. When considering P-values below 0.05 significant, we observed increased mortality in the High Creatinine and High Triglycerides Subgroups. Regarding the prevalence of diabetic kidney disease, only the Low Lipids Subgroup showed no significant association.
Now that the SOM analyses have been completed, how should these findings be reported in a journal article, and what is the take-home message? Our first recommendation is not to abandon conventional statistics when using the Numero framework – the two are complementary. In the kidney disease example, we recommend starting with the description of the study cohort and age- and sex-adjusted comparisons between established clinical categories (e.g. Table 1 is a basic first step). This will give most readers in the field an understanding of the basic nature of the dataset.
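As an illustration of such an adjusted comparison, a logistic model could relate mortality to a clinical category while controlling for age and sex. A minimal sketch; AGE and MALE are hypothetical column names that are not shown in this vignette:
# Age- and sex-adjusted association between diabetic kidney disease
# and mortality; AGE and MALE are hypothetical covariate columns.
fit <- glm(DECEASED ~ DIAB_KIDNEY + AGE + MALE,
           family = binomial, data = db$values)
summary(fit)$coefficients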
Next, we recommend drawing Kaplan-Meier mortality curves for diabetic kidney disease, retinopathy and the metabolic syndrome, and applying Cox regression (or other well-established statistical methods) to investigate associations with mortality in a multivariable context. Again, the biomedical readership will appreciate methodology that is familiar to them. These analyses work best for datasets with only a few variables and a well-defined hypothesis, but they are not well suited to identifying non-linear subgroups, synergies across a high number of variables, or multi-morbidity from several correlated yet diverse clinical end-points. Hence, the machine learning audience will probably want more.
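A minimal sketch of such survival analyses with the survival package; the follow-up time column FOLLOWUP and the covariates AGE and MALE are hypothetical, whereas DECEASED, DIAB_KIDNEY and METAB_SYNDR appear in the dataset above:
library(survival)

# Survival object from follow-up time (years) and the event indicator.
surv <- Surv(time = db$values$FOLLOWUP, event = db$values$DECEASED)

# Kaplan-Meier curves stratified by diabetic kidney disease status.
km <- survfit(surv ~ DIAB_KIDNEY, data = db$values)
plot(km, xlab = "Follow-up (years)", ylab = "Survival probability")

# Multivariable Cox regression for associations with mortality.
cox <- coxph(surv ~ DIAB_KIDNEY + METAB_SYNDR + AGE + MALE,
             data = db$values)
summary(cox)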
The third section of the article should use the SOM to identify features that cannot be detected by the standard tools. Even if nothing new is discovered, we still recommend adding the SOM as a supplement, since it gives a comprehensive window into the data: it is particularly useful for detecting non-random patterns of missing data, the effects of censoring in longitudinal studies, and outliers. Readers and reviewers will appreciate sophisticated visualizations that capture the nature of the cohort and give an accurate description of its structural strengths and weaknesses – it is better science.
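For instance, non-random missingness can be screened by computing the proportion of missing values per district and comparing it with the other colorings. A minimal sketch in base R; CREATININE is a hypothetical column name:
# Proportion of missing values per map district for one variable;
# 'matches' assigns each individual to a district, as earlier in
# the pipeline, and CREATININE is a hypothetical column.
miss <- tapply(is.na(db$values$CREATININE), matches, mean)

# Districts with a high share of missing values may reveal
# non-random missingness worth plotting on the map.
sort(miss, decreasing = TRUE)[1:5]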
As to the take-home message, we propose the following: if people with type 1 diabetes can achieve such metabolic control that their serum lipids are low, they are likely to be resilient against diabetic complications and mortality. We and others have made similar observations before, so this is not a novel finding; however, it shows how the SOM led to the expected conclusions, which gives us confidence that the approach is robust for high-dimensional big data that is out of reach of conventional tools.
When you use the package, please cite our publication:
citation('Numero')
##
## Song Gao, Stefan Mutter, Aaron E. Casey, Ville-Petteri Mäkinen;
## Numero: a statistical framework to define multivariable subgroups
## in complex population-based datasets, International Journal of
## Epidemiology, , dyy113, https://doi.org/10.1093/ije/dyy113
##
## A BibTeX entry for LaTeX users is
##
## @Article{,
## title = {Numero: a statistical framework to define multivariable subgroups in complex population-based datasets},
## author = {Song Gao and Stefan Mutter and Aaron E. Casey and Ville-Petteri Mäkinen},
## journal = {International Journal of Epidemiology},
## year = {2018},
## pages = {dyy113},
## doi = {10.1093/ije/dyy113},
## }
Attik, M., Bougrain, L., & Alexandre, F. (2005). Self-organizing map initialization. In W. Duch, J. Kacprzyk, E. Oja, & S. Zadrożny (Eds.), Artificial Neural Networks: Biological Inspirations – ICANN 2005, 15th International Conference, Warsaw, Poland, September 11–15, 2005, Proceedings, Part I (pp. 357–362). doi:10.1007/11550822_56
Bernardi, L., De Barbieri, G., Rosengård-Bärlund, M., Mäkinen, V.-P., Porta, C., & Groop, P.-H. (2010). New method to measure and improve consistency of baroreflex sensitivity values. Clinical Autonomic Research, 20(6), 353–361. doi:10.1007/s10286-010-0079-1
Cardoso, D. B. O. S., Queiroz, L. P. de, & Lima, H. C. de. (2014). A taxonomic revision of the South American papilionoid genus Luetzelburgia (Fabaceae). Botanical Journal of the Linnean Society, 175(3), 328–375. doi:10.1111/boj.12153
Delanaye, P., Schaeffner, E., Ebert, N., Cavalier, E., Mariat, C., Krzesinski, J.-M., & Moranne, O. (2012). Normal reference values for glomerular filtration rate: What do we really know? Nephrology Dialysis Transplantation, 27(7), 2664–2672. doi:10.1093/ndt/gfs265
Kohonen, T., Schroeder, M. R., & Huang, T. S. (Eds.). (2001). Self-organizing maps (3rd ed.). Secaucus, NJ, USA: Springer-Verlag New York, Inc.
Kumpula, L. S., Makela, S. M., Mäkinen, V.-P., Karjalainen, A., Liinamaa, J. M., Kaski, K., … Ala-Korpela, M. (2010). Characterization of metabolic interrelationships and in silico phenotyping of lipoprotein particles using self-organizing maps. The Journal of Lipid Research, 51(2), 431–439. doi:10.1194/jlr.D000760
Kuusisto, S. M., Peltola, T., Laitinen, M., Kumpula, L. S., Mäkinen, V.-P., Salonurmi, T., … Ala-Korpela, M. (2012). The interplay between lipoprotein phenotypes, adiponectin, and alcohol consumption. Annals of Medicine, 44(5), 513–522. doi:10.3109/07853890.2011.611529
Levey, A. S., & Coresh, J. (2012). Chronic kidney disease. The Lancet, 379(9811), 165–180. doi:10.1016/S0140-6736(11)60178-5
Mäkinen, V.-P., Forsblom, C., Thorn, L. M., Waden, J., Gordin, D., Heikkila, O., … on behalf of the FinnDiane Study Group. (2008a). Metabolic Phenotypes, Vascular Complications, and Premature Deaths in a Population of 4,197 Patients With Type 1 Diabetes. Diabetes, 57(9), 2480–2487. doi:10.2337/db08-0332
Mäkinen, V.-P., Soininen, P., Forsblom, C., Parkkonen, M., Ingman, P., Kaski, K., … Ala-Korpela, M. (2008b). 1H NMR metabonomics approach to the disease continuum of diabetic complications and premature death. Molecular Systems Biology, 4. doi:10.1038/msb4100205
Mäkinen, V.-P., Soininen, P., Kangas, A. J., Forsblom, C., Tolonen, N., Thorn, L. M., … Finnish Diabetic Nephropathy Study Group. (2013). Triglyceride-cholesterol imbalance across lipoprotein subclasses predicts diabetic kidney disease and mortality in type 1 diabetes: The FinnDiane Study. Journal of Internal Medicine, 273(4), 383–395. doi:10.1111/joim.12026
Mäkinen, V.-P., Tynkkynen, T., Soininen, P., Peltola, T., Kangas, A. J., Forsblom, C., … Groop, P.-H. (2012). Metabolic Diversity of Progressive Kidney Disease in 325 Patients with Type 1 Diabetes (the FinnDiane Study). Journal of Proteome Research, 11(3), 1782–1790. doi:10.1021/pr201036j
Tukiainen, T., Tynkkynen, T., Mäkinen, V.-P., Jylänki, P., Kangas, A., Hokkanen, J., … Ala-Korpela, M. (2008). A multi-metabolite analysis of serum by 1H NMR spectroscopy: Early systemic signs of Alzheimer’s disease. Biochemical and Biophysical Research Communications, 375(3), 356–361. doi:10.1016/j.bbrc.2008.08.007
Würtz, P., Soininen, P., Kangas, A. J., Mäkinen, V.-P., Groop, P.-H., Savolainen, M. J., … Ala-Korpela, M. (2011). Characterization of systemic metabolic phenotypes associated with subclinical atherosclerosis. Mol. BioSyst., 7(2), 385–393. doi:10.1039/C0MB00066C