A combination of a SOM with SM was used to analyze the inoculation experiment's qPCR results. SM was introduced in 1969 by J.W. Sammon [62]. It aims to project n-dimensional data with n > 2 onto a 2D plane such that the distances between any two data points in the 2D plane are, as far as possible, proportional to the dissimilarities in the n-dimensional data space. This allows visualization in 2D graphs with a minimum loss of information. To that end, a projection onto a 2D plane is iteratively optimized, starting from a random configuration, by minimizing Sammon's stress E (1):

E = \frac{1}{\sum_{i<j} d^{*}_{ij}} \sum_{i<j} \frac{\left(d^{*}_{ij} - d_{ij}\right)^{2}}{d^{*}_{ij}} \qquad (1)

where d^{*}_{ij} is the distance between instances i and j in the original data space, and d_{ij} the distance between their projections. Several methods have been suggested for this optimization; however, they all share the drawback that convergence to the best possible solution is not guaranteed. In particular, the algorithms reach their limits for large, high-dimensional data sets. Substantial improvements can be achieved by using powerful nonlinear projection methods upstream. SOM in particular have turned out to be exceptionally well suited in that regard.
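As a minimal illustration of plain SM, the following R sketch projects 4-dimensional data onto a 2D plane with MASS::sammon(); the iris measurements serve purely as stand-in data and are not part of this study.

library(MASS)

# Toy data: exact duplicates must be removed, since sammon() cannot
# handle zero distances in the dissimilarity matrix.
X <- unique(iris[, 1:4])

d <- dist(X)                           # dissimilarities in the original 4-D space
proj <- sammon(d, k = 2, niter = 100)  # iterative minimization of Sammon's stress E

# The axes of the resulting plot carry no meaning; only the inter-point
# distances approximate the original dissimilarities.
plot(proj$points, asp = 1, xlab = "", ylab = "")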
SOM, also called Kohonen feature maps, were introduced in the early 1980s [63,64]. Their development was stimulated by findings in neural science; correspondingly, SOM, a type of artificial neural network, mimic certain aspects of visual information processing in the human brain. Like SM, their purpose is an efficient low-dimensional projection of high-dimensional data sets, but they follow a different, unsupervised learning approach.
A SOM consists of units called neurons or codebook vectors arranged in a regular low-dimensional (often 2D) lattice. Each codebook vector is an n-dimensional vector, where n is the dimension of the data set to be projected. Learning starts with a random initialization of the codebook vectors. Step by step, each instance of the original data set is compared with all codebook vectors. The most similar codebook vector (called the winner neuron) is then slightly modified to bring it a little closer to the respective instance's values; the so-called learning rate defines the degree of adjustment. The adjustment applies not only to the winner neuron but also to the codebook vectors nearby, though to a lesser degree the more distant they are from the winner neuron. A neighborhood function defines the adjustment rate as a function of that distance.
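The core of this learning step can be sketched in a few lines of R. The function below is a simplified illustration with a Gaussian neighborhood function, not the implementation used in this study; W is a codebook matrix (one neuron per row), grid holds the neurons' 2D lattice coordinates, and x is a single instance.

som_step <- function(W, grid, x, alpha, radius) {
  # winner neuron: the codebook vector most similar to the instance
  winner <- which.min(rowSums(sweep(W, 2, x)^2))
  # squared lattice distance of every neuron to the winner
  d2 <- rowSums(sweep(grid, 2, grid[winner, ])^2)
  # Gaussian neighborhood function: adjustment fades with lattice distance
  h <- exp(-d2 / (2 * radius^2))
  # move every codebook vector towards the instance, scaled by alpha * h
  W + alpha * h * sweep(W, 2, x, FUN = function(w, v) v - w)
}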
The same procedure is carried out for all instances of the data set and repeated until a certain threshold is reached. Learning is usually subdivided into two phases. During the first phase, a relatively high learning rate and a large neighborhood radius enable the rapid setup of a coarse structure. In the second phase, which fine-tunes the codebook vectors, both the learning rate and the neighborhood radius are reduced to prevent overshooting. In the end, each instance of the data set can be assigned to a codebook vector with almost identical values. Moreover, the SOM exhibits a smooth shape because adjacent codebook vectors are remarkably similar.
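Continuing the sketch, the two-phase schedule could be wired up as follows; the learning rates and radii are illustrative placeholders, and for simplicity they are held constant within each phase rather than decayed.

train_som <- function(W, grid, X,
                      phases = list(coarse = c(steps = 10,  alpha = 0.5,  radius = 4),
                                    fine   = c(steps = 100, alpha = 0.05, radius = 1))) {
  for (ph in phases) {
    for (s in seq_len(ph[["steps"]])) {
      for (i in sample(nrow(X))) {  # present the instances in random order
        W <- som_step(W, grid, X[i, ], ph[["alpha"]], ph[["radius"]])
      }
    }
  }
  W
}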
SOM are now increasingly used to classify large, high-dimensional data sets and have often proved superior to conventional clustering approaches, e.g., for the analysis of sequencing data [65,66,67]. Beyond that, Kohonen [64] suggested combining SOM and SM as a powerful approach to low-dimensional projection of large, high-dimensional data sets where SM alone fails. Here, the SOM output is used to initialize the SM algorithm, which then performs an additional fine-tuning of the structure pre-defined by the SOM.
In contrast to conventional 2D graphs, the data points' locations relative to the axes do not carry any information. Instead, the distances between any two symbols in the graph are roughly proportional to the dissimilarities in the original n-dimensional data space. A SOM-SM projection can thus be compared to a sketch map that conveys neighborhood relations rather than absolute locations. The same graph with the same spatial organization can illustrate different information through appropriate coloring or symbol types. A synopsis of several graphs sharing the same spatial structure therefore serves as a compelling interface between large, high-dimensional data sets and the human brain, exploiting humans' great capacity for visual pattern recognition.
It is the first time that this method, known from neural network analysis, has been used to visualize large microbiological data sets. The SOM-SM analysis was done in the R environment (R Core Team 2019) using the SOM [68] and MASS [69] packages. For the SOM, a rectangular grid of 11 × 8 neurons and a Gaussian neighborhood function were used. Learning comprised ten iteration steps in the first learning phase and 100 iteration steps in the second phase. For SM, 90 iteration steps were carried out.
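For orientation, the reported configuration corresponds roughly to the following end-to-end sketch, here built on the toy functions above rather than the SOM package actually used; only MASS::sammon() [69] is called directly, and the qPCR data are replaced by a placeholder matrix.

library(MASS)

set.seed(1)
X    <- as.matrix(unique(iris[, 1:4]))                    # placeholder for the qPCR data
grid <- as.matrix(expand.grid(x = 0:10, y = 0:7))         # 11 x 8 rectangular lattice
W    <- X[sample(nrow(X), nrow(grid), replace = TRUE), ]  # random initialization
W    <- W + rnorm(length(W), sd = 1e-3)                   # break ties between codebooks

W <- train_som(W, grid, X,
               phases = list(coarse = c(steps = 10,  alpha = 0.5,  radius = 4),
                             fine   = c(steps = 100, alpha = 0.05, radius = 1)))

# Kohonen's combination: the trained codebook vectors are handed to SM, with
# their lattice positions as the initial 2D configuration, which is then
# fine-tuned over 90 Sammon iterations.
proj <- sammon(dist(W), y = grid, k = 2, niter = 90)
plot(proj$points, asp = 1, xlab = "", ylab = "")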