Because we start with fine-grained HRMS data, we bin the peaks of the spectra to increase our ability to identify key spectral features. This peak-binning process naturally creates a vector representation for spectra, so we can leverage linear algebra tools and construct a method to interpolate spectra across collision energies. We first lay out the notation for this process and then discuss details of the approach. A high-level overview is illustrated in Figure 1.
Figure 1. (a) Three experimental spectra are collected from a consistent mass spectrometry workflow at collision energies spanning the energies of interest. (b) The experimental spectra are binned to the desired level of detail, forming vector-formatted spectra that are suited for mathematical analysis. (c) The vector-formatted spectra are joined into a matrix, and the SVD of the matrix is computed. The resulting SVD provides a set of basis vectors $b_k$ and the weight coefficients $c_k$ at the known collision energies. The weight coefficients for the desired spectra are interpolated from the known weight coefficients. (d) Finally, for a desired collision energy $e$, the interpolated weights are applied as a linear combination with the basis vectors to determine the anticipated spectrum at the unknown collision energy.
Let a spectrum $s$ with a set of peaks $P$ be given by the set
$$ s = \{(m_p, i_p)\}_{p \in P}, $$
where $i_p$ is the measured intensity of a peak at m/z value $m_p$.
To apply principal component analysis (PCA) to a set of spectra, they must be represented in the same vector space. To conform a set of spectra, we first choose a value $Q_{\max}$ such that all relevant m/z values are in the interval $[0, Q_{\max}]$ and partition the interval into $N$ uniform bins $\{[q_n^{\min}, q_n^{\max})\}_{n \in \{0,\dots,N\}}$. For our purposes, $Q_{\max}$ is determined for each trial independently by the highest nonzero m/z value in the set of spectra being analyzed, and the value of $N$ is set to bin the peaks to the nearest integer. This coarse binning is chosen to allow many trials to be run quickly to validate the process, though the mathematical details still apply for the much finer detail required for a real-world analysis. We then represent a spectrum $s$ as a vector
$$ v = (v_0, \dots, v_N), $$
where, at each index $n$, the vector value $v_n$ becomes either the intensity of the highest peak in that section of the partition or zero if there are no peaks in that bin. This modified binning method is designed to ignore noise around prominent peaks. If a moderate-height peak is surrounded by many very small peaks, the common method of binning by summing the peaks may allow the peak to appear more prominent than it is. This binning method preserves peak prominence and reduces noise. Mathematically, this is expressed as follows.
$$ v_n = \begin{cases} \max\{\, i_p : p \in P,\ m_p \in [q_n^{\min}, q_n^{\max}) \,\} & \text{if the bin contains at least one peak,} \\ 0 & \text{otherwise.} \end{cases} $$
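The max-per-bin rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `bin_spectrum`, the half-open unit-width bins, and the peak list are all assumptions made for the example.

```python
import numpy as np

def bin_spectrum(peaks, q_max, n_bins):
    """Bin (m/z, intensity) peaks into n_bins uniform bins on [0, q_max],
    keeping the highest intensity in each bin (zero for empty bins)."""
    v = np.zeros(n_bins)
    width = q_max / n_bins
    for mz, intensity in peaks:
        # Clamp the index so a peak exactly at q_max lands in the last bin.
        n = min(int(mz // width), n_bins - 1)
        # Keep the maximum rather than the sum, so small neighboring
        # peaks do not inflate a moderate peak.
        v[n] = max(v[n], intensity)
    return v

# Hypothetical peaks: a moderate peak near m/z 137 with two small neighbors,
# plus a base peak near m/z 306.
peaks = [(137.06, 40.0), (137.30, 2.0), (137.80, 1.5), (306.20, 100.0)]
v = bin_spectrum(peaks, q_max=307.0, n_bins=307)
```

With these hypothetical peaks, bin 137 holds 40.0, whereas binning by summation would report 43.5 for the same bin.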
To capture how the spectra for a given molecule progress as collision energy changes, we seek an optimal representation of the set of spectra using singular value decomposition (SVD). Commonly in PCA, SVD is used to identify a set of component vectors that is smaller than the full data set but still represents all of the data well enough to make strong predictions. However, because spectral prediction is incredibly nuanced, we generate a full set of basis vectors to retain as much spectral detail as possible. In this analysis, a basis is a minimal set of vectors required to be able to recreate any spectrum in the data set through a linear combination of the basis vectors (in linear algebra terms, the basis spans the original data set). The basis vectors also have the properties of being linearly independent and orthogonal. These properties make PCA a powerful tool but also make the basis vectors purely statistical artifacts, no longer representative of actual spectra. To represent a given set of $J$ known, vectorized spectra $V = \{v_j\}_{j \in \{1,\dots,J\}}$ taken at collision energies $\{e_j\}_{j \in \{1,\dots,J\}}$, we use SVD to construct an orthonormal basis $\{b_k\}_{k \in \{1,\dots,K\}}$ of the span of $V$, where $K$ is the dimension of the span. Because of the complexity of the HRMS data, $K$ will be equal to $J$ in most cases for this method.
Now because $\{b_k\}_{k \in \{1,\dots,K\}}$ is a basis, there exists a set of coefficients $\{c_{k,j}\}_{k \in \{1,\dots,K\},\, j \in \{1,\dots,J\}}$ such that each vectorized spectrum $v_j \in V$ can be written as a linear combination of basis vectors:
$$ v_j = \sum_{k=1}^{K} c_{k,j}\, b_k. \qquad (3) $$
That is, each spectrum can be represented as a weighted sum of the basis vectors $b_k$, where the coefficient $c_{k,j}$ can be understood as the contribution of the vector $b_k$ to the spectrum $v_j$. In this view, the changes in spectra across collision energies can be described by the changes to the contributions (coefficients) of each basis vector. For example, a given basis vector may have a small contribution at low collision energy but a large contribution at higher collision energies. It is important here to note that the coefficients $c_{k,j}$ can be positive or negative because the basis vectors do not necessarily correspond to any physical phenomenon (e.g., fragment structure/stability); they are statistical in nature.
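As a rough illustration of the basis and coefficient construction, the following sketch uses NumPy's thin SVD on a small, made-up matrix whose columns play the role of the binned spectra. The matrix values, the rank tolerance, and the names `B` and `C` are assumptions for the example and do not come from the paper.

```python
import numpy as np

# Hypothetical toy data: J = 3 binned spectra as columns, 5 bins each.
V = np.array([
    [0.0, 0.0, 0.3],
    [0.1, 0.2, 1.0],
    [0.0, 0.1, 0.6],
    [1.0, 1.0, 0.8],
    [0.2, 0.1, 0.0],
])

# Thin SVD: V = U @ diag(S) @ Wt, where the columns of U are orthonormal.
U, S, Wt = np.linalg.svd(V, full_matrices=False)

# Keep only directions with non-negligible singular values, so that
# K is the dimension of the span of the spectra (assumed tolerance).
K = int(np.sum(S > 1e-10 * S[0]))
B = U[:, :K]   # basis vectors b_k as columns
C = B.T @ V    # coefficients c_{k,j}: row k, column j

# Each spectrum is recovered as a linear combination of the basis vectors.
assert np.allclose(B @ C, V)
```

Because the basis is orthonormal, the coefficients are simply projections of each spectrum onto each basis vector, and individual entries of `C` and `B` may be negative even though every spectrum is non-negative.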
Finally, to generate the interpolations for all missing collision energies, we need to build functions that map how the contributions for each basis vector change as a function of the collision energy. These functions are represented by the dotted lines in Figure 2. Ideally we would use a function that takes in a scalar collision energy $e$ and outputs the corresponding continuous HRMS spectrum $g(e)$ for a given molecule. While we cannot determine the true function $g$, we can construct an approximation $\hat{g}$ from $[e_{\min}, e_{\max}]$ to $\mathbb{R}^N$, with $e_{\min}$ and $e_{\max}$ the lowest and highest known collision energies, that outputs an $N$-dimensional vectorized spectrum in the span of the basis $\{b_k\}_{k \in \{1,\dots,K\}}$ of the form
$$ \hat{g}(e) = \sum_{k=1}^{K} f_k(e)\, b_k, \qquad (4) $$
where we initially define
$$ f_k(e_j) = c_{k,j} \qquad (5) $$
for all $j \in \{1,\dots,J\}$ and $k \in \{1,\dots,K\}$ such that the approximation exactly satisfies (5) with the values satisfying (3) for our $J$ known vectorized spectral representations. We then estimate the values of $f_k$ at all other $e \in [e_{\min}, e_{\max}]$ by linear interpolation, where we have the following.
$$ f_k(e) = c_{k,j} + \frac{e - e_j}{e_{j+1} - e_j}\,\bigl(c_{k,j+1} - c_{k,j}\bigr) \quad \text{for } e_j \le e \le e_{j+1}. $$
By this definition, the vectorized spectrum approximations of the form (4) may include a negative intensity value. To make sensible spectrum estimates, all negative values in $\hat{g}(e)$ are set to zero.
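The interpolation and clipping steps can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the function name `interpolate_spectrum`, the toy matrix `V`, and the decision to raise an error outside the known energy range are all choices made for illustration.

```python
import numpy as np

def interpolate_spectrum(e, energies, B, C):
    """Estimate the binned spectrum at collision energy e.

    energies: known collision energies e_j, sorted ascending
    B: basis vectors b_k as columns
    C: coefficients c_{k,j}, one row per basis vector
    """
    energies = np.asarray(energies, dtype=float)
    if not energies[0] <= e <= energies[-1]:
        raise ValueError("e lies outside [e_min, e_max]; extrapolation is not supported")
    # f_k(e): piecewise-linear interpolation of each coefficient row.
    f = np.array([np.interp(e, energies, C[k]) for k in range(C.shape[0])])
    g_hat = B @ f
    # Negative intensities are not meaningful, so set them to zero.
    return np.clip(g_hat, 0.0, None)

# Hypothetical toy spectra (columns) at 10, 20, and 40 eV.
energies = [10.0, 20.0, 40.0]
V = np.array([[1.0, 0.9, 0.4],
              [0.0, 0.2, 1.0],
              [0.1, 0.0, 0.3]])
U, S, Wt = np.linalg.svd(V, full_matrices=False)
B, C = U, np.diag(S) @ Wt

g_15 = interpolate_spectrum(15.0, energies, B, C)
```

For this toy data the 15 eV estimate is the midpoint of the 10 and 20 eV columns, and evaluating at a known energy returns the corresponding input spectrum.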
Figure 2. (top) The left-hand plots titled $b_k$ show the basis vectors generated from the capsaicin spectra shown in the bottom row of the figure. The right-hand plots titled $f_k(e)$ give the coefficients that reconstruct the spectrum at a given collision energy $e$ (in eV), with a linear interpolation (dashed line) plotted across collision energies. For each basis vector, these interpolations are the functions $f_k$. Across the range of collision energies, the contribution of $b_1$ remains relatively constant, as it contains the prominent base peak across all collision energies. At 10 eV, there are small influences from basis vector $b_3$, and the negative value of $f_2$ decreases peak intensity at the positive values in basis vector $b_2$ while increasing the intensity of the peaks shown as negative. At 20 eV, the contributions of basis vector $b_3$ switch sign, and the coefficient for basis vector $b_2$ begins to increase. Finally, at 40 eV, the influence of basis vector $b_2$ increases, corresponding with the appearance of strong peaks with lower m/z values than the base peak observed at 10 and 20 eV. This represents the fragmentation that occurs between 20 and 40 eV. (bottom) Normalized MS/MS Agilent Q-TOF spectra for capsaicin from the NIST20 database at collision energies of 10, 20, and 40 eV.
Figure 2 (top) shows the three basis vectors $b_k$ for a set of three capsaicin spectra (Figure 2 (bottom)) along with the known contribution values $c_{k,j}$ (the dots on the right-hand plots) and how they change as a function of collision energy, $f_k$. The basis vector $b_1$ represents a peak prominent across all collision energies, and the associated coefficients, $c_{1,j}$ and $f_1$, remain close to constant. In contrast, $b_2$ represents peaks that are more prominent in the highest collision energy spectra, estimated as linear increases to the contribution of $b_2$ in the interpolation $f_2$. Head-to-tail comparisons of interpolated spectra against experimental spectra from NIST20 are shown in Figure 3. While this method shows strong results within the range $[e_{\min}, e_{\max}]$, it is an interpolation, so it cannot produce spectrum estimates at collision energies outside the range $[e_{\min}, e_{\max}]$: the approximations are not meaningfully defined for such values.
Figure 3. Sample interpolation-predicted (ITP) Q-TOF capsaicin spectra compared to the known spectra available in NIST20 at collision energies of 15 and 30 eV. Note that the methods used to generate predictions preclude accurate predictions outside the range of provided collision energies. These spectra were generated with samples at 10, 20, and 40 eV, so ITP spectra can only be generated for collision energies between 10 and 40 eV.