Spectral Interpolation

Ethan King; Richard Overstreet; Julia Nguyen; Danielle Ciesielski

This method is extracted from research article: J Chem Inf Model, Jul 2022

Augmentation of MS/MS Libraries with Spectral Interpolation for Improved Identification

DOI: 10.1021/acs.jcim.2c00620

Because we start with fine-grained HRMS data, we bin the peaks of the spectra to increase our ability to identify key spectral features. This peak-binning process naturally creates a vector representation for spectra, so we can leverage linear algebra tools and construct a method to interpolate spectra across collision energies. We first lay out the notation for this process and then discuss details of the approach. A high-level overview is illustrated in Figure 1.

Figure 1. (a) Three experimental spectra are collected from a consistent mass spectrometry workflow at collision energies spanning the energies of interest. (b) The experimental spectra are binned to the desired level of detail, forming vector-formatted spectra that are suited for mathematical analysis. (c) The vector-formatted spectra are joined into a matrix, and the SVD of the matrix is computed. The resulting SVD provides a set of basis vectors b_k and the weight coefficients c_k at the known collision energies. The weight coefficients for the desired spectra are interpolated from the known weight coefficients. (d) Finally, for a desired collision energy e, the interpolated weights are applied as a linear combination with the basis vectors to determine the anticipated spectrum at the unknown collision energy.

Let a spectrum s with a set of peaks P be given by the set

$$ s = \{ (m_p, i_p) \}_{p \in P}, \tag{1} $$

where i_p is the measured intensity of a peak at m/z value m_p. To apply principal component analysis (PCA) to a set of spectra, they must be represented in the same vector space. To conform a set of spectra, we first choose a Q_max value such that all relevant m/z values are in the interval [0,Q_max] and partition the interval into N uniform bins {[q_minⁿ,q_maxⁿ)}_{n ∈ {1,···,N}}. For our purposes, Q_max is determined for each trial independently by the highest nonzero m/z value in the set of spectra being analyzed, and the value of N is set to bin the peaks to the nearest integer. This coarse binning is chosen to allow many trials to be run quickly to validate the process, though the mathematical details still apply for the much finer detail required for a real-world analysis. We then represent a spectrum s as a vector v ∈ ℝ^N where, at each index n, the vector value vⁿ is either the intensity of the highest peak in bin n or zero if the bin contains no peaks. This modified binning method is designed to ignore noise around prominent peaks. If a moderate-height peak is surrounded by many very small peaks, the common method of binning by summing the peaks may make the peak appear more prominent than it is. Taking the maximum instead preserves peak prominence and reduces noise. Mathematically, this is expressed as follows.

$$ v^n = \begin{cases} \max\{\, i_p : p \in P,\ q_{\min}^n \le m_p < q_{\max}^n \,\} & \text{if some } m_p \in [q_{\min}^n, q_{\max}^n), \\ 0 & \text{otherwise.} \end{cases} \tag{2} $$
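
The max-binning step can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `bin_spectrum`, the unit-width bins [n, n+1), and the toy peak list are all assumptions made for the example.

```python
import numpy as np

def bin_spectrum(peaks, q_max, bin_width=1.0):
    """Max-bin (m/z, intensity) pairs onto uniform bins covering [0, q_max].

    Assumption for this sketch: bin n covers [n*bin_width, (n+1)*bin_width).
    """
    n_bins = int(np.floor(q_max / bin_width)) + 1
    v = np.zeros(n_bins)
    for mz, intensity in peaks:
        n = int(mz // bin_width)
        if 0 <= n < n_bins:
            # Keep the tallest peak in the bin instead of summing intensities.
            v[n] = max(v[n], intensity)
    return v

# Hypothetical peaks: a moderate peak at m/z 150.1 with a small neighbour.
peaks = [(150.1, 40.0), (150.6, 5.0), (151.2, 10.0)]
v = bin_spectrum(peaks, q_max=151.2)
print(v[150], v[151])  # bin 150 keeps 40.0; summing would have given 45.0
```

Empty bins stay at zero, so every spectrum binned with the same `q_max` and `bin_width` lands in the same vector space.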

To capture how the spectra for a given molecule progress as collision energy changes, we seek an optimal representation of the set of spectra using singular value decomposition (SVD). Commonly in PCA, SVD is used to identify a set of component vectors that is smaller than the full data set but still represents all of the data well enough to make strong predictions. However, because spectral prediction is incredibly nuanced, we generate a full set of basis vectors to retain as much spectral detail as possible. In this analysis, a basis is a minimal set of vectors required to be able to recreate any spectrum in the data set through a linear combination of the basis vectors (in linear algebra terms, the basis spans the original data set). The basis vectors also have the properties of being linearly independent and orthogonal. These properties make PCA a powerful tool but also make the basis vectors purely statistical artifacts, no longer representative of actual spectra. To represent a given set of J known, vectorized spectra V = {v_j}_{j ∈{1,···,J}} taken at collision energies {e_j}_{j∈{1,···,J}}, we use SVD to construct an orthonormal basis {b_k}_{k∈{1,···,K}} of the span of V, where K is the dimension of the span. Because of the complexity of the HRMS data, K will be equal to J in most cases for this method.
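
As a rough sketch of this step, the orthonormal basis can be obtained from the left singular vectors returned by NumPy's SVD. The helper name `spectral_basis`, the relative tolerance used to decide the rank K, and the toy matrix (columns are binned spectra at three collision energies) are assumptions for illustration only.

```python
import numpy as np

def spectral_basis(V, tol=1e-10):
    """Return an orthonormal basis for the span of the columns of V.

    V has shape (N, J): one binned spectrum per column.
    The result has shape (N, K), where K is the numerical rank of V.
    """
    U, sing, _ = np.linalg.svd(V, full_matrices=False)
    # Keep singular vectors whose singular value is not negligible.
    K = int(np.sum(sing > tol * sing[0]))
    return U[:, :K]

# Hypothetical binned spectra (4 bins, 3 collision energies).
V = np.array([[100.0, 100.0,  60.0],
              [  0.0,  20.0, 100.0],
              [  5.0,   0.0,  30.0],
              [  0.0,   0.0,  10.0]])
B = spectral_basis(V)
print(B.shape)                                   # K columns, here K = J = 3
print(np.allclose(B.T @ B, np.eye(B.shape[1])))  # columns are orthonormal
```

The columns of `B` generally contain both positive and negative entries, which is why they should not be read as physical spectra.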

Now, because {b_k}_{k ∈{1,···,K}} is a basis, there exists a set of coefficients {c_k,j}_{k ∈{1,···,K},j ∈{1,···,J}} such that each vectorized spectrum v_j ∈ V can be written as a linear combination of basis vectors:

$$ v_j = \sum_{k=1}^{K} c_{k,j}\, b_k. \tag{3} $$

That is, each spectrum can be represented as a weighted sum of the basis vectors b_k, where the coefficient c_k,j can be understood as the contribution of the vector b_k to the spectrum v_j. In this view, the changes in spectra across collision energies can be described by the changes to the contributions (coefficients) of each basis vector. For example, a given basis vector may have a small contribution at low collision energy but a large contribution at higher collision energies. It is important here to note that the coefficients c_k,j can be positive or negative because the basis vectors do not necessarily correspond to any physical phenomenon (e.g., fragment structure/stability); they are statistical in nature.
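
Because the basis is orthonormal, the coefficients can be read off by projecting each spectrum onto each basis vector. The sketch below assumes a small hypothetical matrix `V` of binned spectra (one column per collision energy) and uses the left singular vectors from NumPy's SVD as the basis.

```python
import numpy as np

# Hypothetical binned spectra (4 bins, 3 collision energies).
V = np.array([[100.0, 100.0,  60.0],
              [  0.0,  20.0, 100.0],
              [  5.0,   0.0,  30.0],
              [  0.0,   0.0,  10.0]])

# Columns of B are the orthonormal basis vectors b_k.
B, _, _ = np.linalg.svd(V, full_matrices=False)

# C[k, j] is the contribution c_{k,j} of basis vector b_k to spectrum v_j.
C = B.T @ V

# The weighted sum of basis vectors recovers every original spectrum.
reconstructed = B @ C
print(np.allclose(reconstructed, V))
print(np.round(C, 2))  # entries may be negative; signs depend on the SVD
```

Each row of `C` traces how one basis vector's contribution changes from one known collision energy to the next.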

Finally, to generate the interpolations for all missing collision energies, we need to build functions that map how the contributions for each basis vector change as a function of the collision energy. These functions are represented by the dotted lines in Figure 2. Ideally we would use a function that takes in a scalar collision energy e and outputs the corresponding continuous HRMS spectrum g(e) for a given molecule. While we cannot determine the true function g, we can construct an approximation ĝ from ℝ to ℝ^N that outputs an N-dimensional vectorized spectrum in the span of the basis {b_k}_{k∈{1,...,K}} of the form

$$ \hat{g}(e) = \sum_{k=1}^{K} f_k(e)\, b_k, \tag{4} $$

where we initially define

$$ f_k(e_j) = c_{k,j} \tag{5} $$

for all j ∈ {1,...,J} and k ∈ {1,···,K}, so that, by (3), the approximation reproduces each of our J known vectorized spectra exactly: ĝ(e_j) = v_j. We then estimate the values of f_k at all other e ∈ [e_min,e_max] by linear interpolation. With the known energies ordered so that e_min = e_1 < ··· < e_J = e_max, we have the following.

$$ f_k(e) = c_{k,j} + \frac{e - e_j}{e_{j+1} - e_j}\,\bigl(c_{k,j+1} - c_{k,j}\bigr) \quad \text{for } e \in [e_j, e_{j+1}]. \tag{6} $$

By this definition, vectorized spectrum approximations of the form (4) may include negative intensity values. To make sensible spectrum estimates, all negative values in ĝ(e) are set to zero.
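
Putting the pieces together, the whole procedure can be sketched as a single function: compute the basis and coefficients, interpolate each coefficient linearly in collision energy, recombine, and clip negatives. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `interpolate_spectrum`, the toy spectra, and the choice to raise an error outside the known energy range are all illustrative.

```python
import numpy as np

def interpolate_spectrum(V, energies, e):
    """Estimate the binned spectrum at collision energy e.

    V has shape (N, J): one binned spectrum per column.
    energies holds the J collision energies matching the columns of V.
    """
    V = np.asarray(V, dtype=float)
    energies = np.asarray(energies, dtype=float)
    if not (energies.min() <= e <= energies.max()):
        raise ValueError("e lies outside the known collision energies")

    # np.interp expects increasing x values, so sort by energy first.
    order = np.argsort(energies)
    energies, V = energies[order], V[:, order]

    B, _, _ = np.linalg.svd(V, full_matrices=False)  # basis vectors b_k
    C = B.T @ V                                      # coefficients c_{k,j}

    # Linearly interpolate each coefficient f_k at the requested energy.
    f = np.array([np.interp(e, energies, C[k]) for k in range(C.shape[0])])

    g = B @ f                       # linear combination of basis vectors
    return np.clip(g, 0.0, None)    # negative intensities are set to zero

# Hypothetical binned spectra at 10, 20, and 40 eV (4 bins each).
V = np.array([[100.0, 100.0,  60.0],
              [  0.0,  20.0, 100.0],
              [  5.0,   0.0,  30.0],
              [  0.0,   0.0,  10.0]])
energies = [10.0, 20.0, 40.0]
print(interpolate_spectrum(V, energies, 15.0))
```

The sketch keeps every left singular vector instead of truncating to the numerical rank; any extra vectors carry zero coefficients and do not change the result.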

Figure 2. (top) The left-hand plots titled b_k show the basis vectors generated from the capsaicin spectra shown in the bottom row of the figure. The right-hand plots titled f_k(e) give the coefficients that reconstruct the spectrum at a given collision energy e (in eV), with a linear interpolation (dashed line) plotted across collision energies. For each basis vector, these interpolations are the functions f_k. Across the range of collision energies, the contribution of b₁ remains relatively constant, as it contains the prominent base peak present at all collision energies. At 10 eV, there are small influences from basis vector b₃, and the negative value of f₂ decreases the intensity of the peaks that are positive in basis vector b₂ while increasing the intensity of the peaks that are negative in b₂. At 20 eV, the contributions of basis vector b₃ switch sign, and the coefficient for basis vector b₂ begins to increase. Finally, at 40 eV, the influence of basis vector b₂ increases, corresponding with the appearance of strong peaks with lower m/z values than the base peak observed at 10 and 20 eV. This represents the fragmentation that occurs between 20 and 40 eV. (bottom) Normalized Agilent Q-TOF MS/MS spectra for capsaicin from the NIST20 database at collision energies of 10, 20, and 40 eV.

Figure 2 (top) shows the three basis vectors b_k for a set of three capsaicin spectra (Figure 2 (bottom)) along with the known contribution values c_k,j (the dots on the right-hand plots) and how they change as a function of collision energy, f_k. The basis vector b₁ represents a peak prominent across all collision energies, and the associated coefficients, c_1,j and f₁, remain close to constant. In contrast, b₂ represents peaks that are more prominent in the highest collision energy spectra, which the interpolation f₂ captures as a linear increase in the contribution of b₂. Head-to-tail comparisons of interpolated spectra against experimental spectra from NIST20 are shown in Figure 3. While this method shows strong results within the range [e_min, e_max], it is important to note that, because it is an interpolation, it cannot produce spectrum estimates at collision energies outside that range: the approximations are not meaningfully defined for such values.

Figure 3. Sample interpolation-predicted (ITP) Q-TOF capsaicin spectra compared to the known spectra available in NIST20 at collision energies of 15 and 30 eV. Note that the methods used to generate predictions preclude accurate predictions outside the range of provided collision energies. These spectra were generated with samples at 10, 20, and 40 eV, so ITP spectra can only be generated for collision energies between 10 and 40 eV.
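
As a worked instance of the linear interpolation of the coefficients, take the known capsaicin energies as e_1 = 10, e_2 = 20, and e_3 = 40 eV (the indexing is an assumption for this example). The requested energies of 15 and 30 eV each sit halfway through their bracketing interval, so each interpolated coefficient is the average of its two neighbours:

```latex
\begin{align*}
f_k(15) &= c_{k,1} + \frac{15 - 10}{20 - 10}\,\bigl(c_{k,2} - c_{k,1}\bigr)
         = \tfrac{1}{2}\bigl(c_{k,1} + c_{k,2}\bigr), \\
f_k(30) &= c_{k,2} + \frac{30 - 20}{40 - 20}\,\bigl(c_{k,3} - c_{k,2}\bigr)
         = \tfrac{1}{2}\bigl(c_{k,2} + c_{k,3}\bigr).
\end{align*}
```

A request at, say, 50 eV has no bracketing pair of known energies, which is why no estimate is produced there.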

Permits non-commercial access and re-use, provided that author attribution and integrity are maintained; but does not permit creation of adaptations or other derivative works (https://creativecommons.org/licenses/by-nc-nd/4.0/).
