[0001] The present invention relates to an automated, computer-assisted method for identifying
compounds according to mass spectral and chromatographic data obtained from a sample.
In particular, the invention relates to methods for identifying compounds using two
dimensional gas chromatography-mass spectrometry (GCxGC-MS), and processes for automating
the interpretation of the mass spectral and chromatographic data obtained from such
a method.
[0002] Mass spectrometry is an analytical tool that can be used to determine the molecular
weights of chemical compounds and of their fragments by detecting the ionized compounds
and fragments according to their mass-to-charge ratio (m/z). The molecular ions are
generated by inducing either a loss or a gain of a charge by the chemical compounds,
such as via electron ejection, protonation, or deprotonation. The fragment ions are
generated by collision-induced or energy-induced dissociation. The resulting data
are usually presented as a spectrum, a plot with m/z ratio on the x-axis and abundance
of ions on the y-axis. Thus, this spectrum shows the distribution of m/z values in
the population of ions being analyzed. This distribution is characteristic for a given
compound. Therefore, if the sample is a pure compound or contains only a few compounds,
mass spectrometry can reveal the identity of the compound(s) in the sample.
[0003] A complex sample usually contains too many chemical compounds to be analyzed meaningfully
by mass spectrometry alone, because ionization of different chemical compounds may
result in ions with the same m/z value. The more chemical compounds a sample contains,
the more likely ions of the same m/z values will be generated from different compounds.
Therefore, a complex sample is typically resolved to some extent prior to mass spectrometry,
such as by liquid chromatography (LC), gas chromatography (GC), or capillary electrophoresis.
For analysis of volatile compounds, gas chromatography is advantageously coupled with
mass spectrometry (GC-MS). Several ionization methods are available in GC-MS, one of
the most common being electron impact (EI), in which molecules are ionized by bombardment
with electrons emitted by a filament.
[0004] During the sample separation step (chromatography), the chemical compounds in the
sample are separated based on how long they stay in the sample separation system (column).
Once a chemical compound exits the sample separation system, it enters a mass spectrometer
system, and the ionization/ion separation/detection process begins as described above.
For each compound, the time it remains in the sample separation system before it produces
signal(s) in the mass spectrum is a function of its structure and is referred to as
the retention time (RT). However, retention time is also specific to the instrument
being used, and especially the column specifications in a gas chromatograph.
[0005] Without exact replication of the instrumentation on which RT is first measured, RTs
of the same sample measured later may not match the RTs specified in the original
chromatographic method or the computerized method files (including calibration and
event tables) and can lead to misidentified peaks. One solution is the "relative retention"
approach which utilizes retention indices (RI) or Kovats indices (KI) that circumvent
problems associated with discrepancies in RT due to instrument-to-instrument or column-to-column
variation. Methods to predict Kovats indices (KI) based on molecular structure and
associated features are known in the art. Models which predict KI based on such factors
are known as Quantitative Structure-Property Relationship (QSPR) models. See, for
example, Mihaleva et al., (2009) Bioinformatics 6:787-794; Garjani-Nejad et al., (2004)
Journal of Chromatography A, 1028:287-295; and Seeley and Seeley, (2007) Journal of
Chromatography A, 1172:72-83. This type of procedure converts the actual retention times of detected peaks into
a number that is normalized to multiple reference compounds. This is especially useful
for comparing retention times to databases and libraries for identification of individual
components. Such libraries provide large numbers of known compounds, and a match between
the data obtained experimentally by GC-MS and a compound in a library can assist in
identification of the compound.
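By way of illustration only, the conversion of a retention time into a retention index can be sketched as follows. The sketch assumes the linear (temperature-programmed) form of the index, in which the analyte is interpolated between the two n-alkane standards that bracket it; the source does not state which form is used. The function name `retention_index` and the example retention times are hypothetical.

```python
from bisect import bisect_right


def retention_index(rt, alkane_rts):
    """Linear retention index of an analyte (assumed form, see lead-in).

    rt         -- retention time of the analyte
    alkane_rts -- dict mapping n-alkane carbon number -> retention time
    """
    carbons = sorted(alkane_rts)
    times = [alkane_rts[c] for c in carbons]
    # Reference retention times must increase with carbon number.
    if any(t2 <= t1 for t1, t2 in zip(times, times[1:])):
        raise ValueError("reference retention times must be strictly increasing")
    if not times[0] <= rt <= times[-1]:
        raise ValueError("analyte elutes outside the reference range")
    # Index of the reference standard eluting just before (or with) the analyte.
    i = min(bisect_right(times, rt) - 1, len(times) - 2)
    n, n_next = carbons[i], carbons[i + 1]
    t_n, t_next = times[i], times[i + 1]
    # Interpolate between the bracketing standards and scale by 100.
    return 100.0 * (n + (n_next - n) * (rt - t_n) / (t_next - t_n))


# Hypothetical reference standards: C10, C11 and C12 n-alkanes.
standards = {10: 5.0, 11: 7.0, 12: 9.5}
print(retention_index(6.0, standards))
```

Because the index depends only on the position of the analyte between the standards, it is less sensitive to instrument-to-instrument shifts than the raw retention time.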
[0006] In order to increase the resolution of the GC-MS, a "second dimension" of GC can
be added, for instance by coupling the GC column to a second GC column (often referred
to as 2DGC-MS or GCxGC-MS, and used interchangeably here with the terms GCxGC-TOF
or GCxGC-TOF-MS). See Venkatramani and Phillips, J. Microcolumn Sep. (1993) 5:511-516.
Peaks of interest are diverted from the first column into the second column for
further separation, which then feeds into the mass spectrometry system. However, even
GCxGC-MS relies on structural correlation with compound libraries to make identifications
of unknown compounds. The libraries of compounds most widely used for structural identification,
such as the NIST library, contain retention index information for only 9% of the compounds
having mass spectral data.
[0007] The use of RI or KI data allows structural assignments derived from comparison with
library data to be refined. However, in order to achieve an acceptable level of confidence
in the identification of an unknown compound, the assignment must be interpreted by
the user, and compared to a reference standard by mass spectrometry to confirm the
proposed structure. This approach has a number of disadvantages, including the need
to repeat the process manually, which is inefficient; the limited size of Kovats index
libraries; and the lack of standardization, due to the need for manual intervention. All
of these lead to reduced levels of confidence in the identification process.
[0008] In the traditional approach to identifying the structure of a compound, mass spectral
data generated by gas chromatography-electron impact ionization-mass spectrometry
(GC-EI-MS) are compared with commercially available mass spectral data libraries (Figure
1). Using this procedure, the identification has only a low confidence level. In order
to increase the level of confidence, a manual verification and interpretation of the
mass spectral library search is carried out and the experimental retention time, or
the Kovats index, is compared to database entries (e.g., NIST Retention Index library).
Finally, for compound identification, a confirmation with reference standards is required.
However, owing to the fact that this is very costly and time demanding, it is currently
carried out only for a limited number of compounds.
[0009] There is a great need, therefore, for an improved procedure for interpreting GC-MS
data which will allow greater levels of automation in structure identification and
greater levels of confidence in the result.
Summary of the Invention
[0010] In a first aspect, there is provided a method for analysing mass spectral data obtained
from a sample in two dimensional gas chromatography-mass spectrometry (GCxGC-MS),
comprising:
- (a) comparing mass spectral data obtained from a sample comprising an analyte with
mass spectral data of candidate compounds of known structure in a library;
- (b) identifying a plurality of candidate compounds from the library based on similarities
of mass spectral data;
- (c) predicting, for each candidate compound, a value of at least one analytical property
using a quantitative model based on a plurality of molecular descriptors; and
- (d) calculating a match score for each candidate compound based on the value predicted
in step (c) and a measured value of the analytical property for the analyte.
[0011] In various embodiments of the method, within step (c), an analytical property score
is derived from the predicted value of the analytical property of a candidate compound
and a measured value of the analyte. In step (d), the measured value of the analytical
property for the analyte can be the spectral similarity value as determined by algorithms
in the software provided by NIST. The predicted value of an analytical property of
a candidate compound is calculated according to a quantitative model based on a plurality
of molecular descriptors. Accordingly, in one embodiment, the quantitative model of
step (c) can be established by:
- (i) providing a set of training compounds of known structure and a set of test compounds
of known structure, and optionally a set of validation compounds of known structure;
- (ii) generating a measured value of an analytical property for each training compound,
each test compound, and each validation compound;
- (iii) for each training compound, computing a set of molecular descriptors based on
chemical structure and properties;
- (iv) selecting a set of molecular descriptors from the set of molecular descriptors
for use in a quantitative model of the analytical property, by using a genetic algorithm;
- (v) generating a plurality of proposed quantitative models using the selected set
of molecular descriptors;
- (vi) evaluating each proposed quantitative model by computing a predicted value of
the analytical property for each test compound;
- (vii) selecting the quantitative model according to the root mean square error (RMSE)
and/or the squared correlation (r²) on the measured value and the predicted value of
the analytical property for each test compound; and optionally
- (viii) selecting the quantitative model according to the squared correlation (r²) on
the measured value and the predicted value of the analytical property for each
validation compound.
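The selection criteria of steps (vii) and (viii) can be illustrated with a minimal sketch. The rule used here (lowest RMSE on the test compounds, with ties broken by the higher r²) is an assumption, since the text only requires selection "according to" RMSE and/or r². The names `rmse`, `squared_correlation` and `select_model`, and the example values, are hypothetical.

```python
import math


def rmse(measured, predicted):
    """Root mean square error between measured and predicted values."""
    n = len(measured)
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)


def squared_correlation(measured, predicted):
    """Squared Pearson correlation (r squared) of measured vs. predicted values."""
    n = len(measured)
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    s_mp = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    s_mm = sum((m - mean_m) ** 2 for m in measured)
    s_pp = sum((p - mean_p) ** 2 for p in predicted)
    return s_mp * s_mp / (s_mm * s_pp)


def select_model(predictions, measured):
    """Return the name of the model with the lowest RMSE (assumed rule).

    predictions -- dict mapping model name -> predicted values for the test set
    Ties on RMSE are broken in favour of the higher squared correlation.
    """
    return min(
        predictions,
        key=lambda name: (rmse(measured, predictions[name]),
                          -squared_correlation(measured, predictions[name])),
    )


# Hypothetical measured Kovats indices and predictions of two proposed models.
measured = [1000.0, 1100.0, 1200.0, 1300.0]
predictions = {
    "model_A": [1010.0, 1090.0, 1210.0, 1290.0],
    "model_B": [1050.0, 1150.0, 1250.0, 1350.0],
}
print(select_model(predictions, measured))
```

In this example model_B has a perfect correlation but a constant offset, which is why RMSE is used as the primary criterion in the sketch.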
[0012] In various embodiments, the genetic algorithm used in step (iv) preferably comprises
(p) generating a plurality of candidate solutions using a combination of two or more
molecular descriptors in a machine learning algorithm such as but not limited to multiple
linear regression, k-nearest neighbour method, or support vector regression;
(q) scoring each candidate solution according to a fitness function based on the cross
validation squared correlation (q²) of the training compounds;
(r) generating new candidate solutions by recombining and/or mutating the candidate
solutions that produce an improved cross validation squared correlation; and
(s) repeating steps (q) and (r) for a finite number of times, for example, from 10
to 50 generations.
[0013] Candidate solutions generated by different machine learning algorithms can be compared
to identify the best performing solutions.
[0014] The establishment of a quantitative model for one or more analytical properties is
performed at least once for each particular setup of the GCxGC-MS separation system (e.g.,
after a change of column specification, temperature profile, or mobile phase) or of the mass
spectrometry system. After the quantitative models have been established for an experimental
setup, it is not necessary to establish them again each time data of an analyte generated
with this particular setup are analyzed.
[0015] The function of each analytical property, an analytical property score, is preferably
calculated as a quadratic function, where for analytical property P,

[0016] Exp_p = measured value of the property obtained by experiments, pre_p = predicted
value of the property, and SEP = standard error of prediction. If the predicted and
experimentally obtained measured values are identical, the score equals 1. The SEP is
calculated according to the following formula, using the STEYX function of Microsoft Excel 2003:

$$\mathrm{SEP} = \sqrt{\frac{1}{n-2}\left[\sum (y-\bar{y})^{2} - \frac{\left[\sum (x-\bar{x})(y-\bar{y})\right]^{2}}{\sum (x-\bar{x})^{2}}\right]}$$

where x is a value of a sample, y is the predicted value of x for the sample, x̄ and ȳ
are the respective means, and n is the number of samples.
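For readers without access to the spreadsheet function, the standard error calculation can be sketched in a few lines. The sketch follows the documented definition of the Excel STEYX function (standard error of the predicted y-value for each x in a regression); the function name `steyx`, its argument order (mirroring the spreadsheet function) and the example values are illustrative assumptions.

```python
import math


def steyx(known_y, known_x):
    """Standard error of the predicted y for each x in a linear regression.

    Mirrors the definition of the spreadsheet STEYX function:
    sqrt( (1/(n-2)) * [ Syy - Sxy^2 / Sxx ] )
    """
    n = len(known_x)
    if n != len(known_y):
        raise ValueError("known_y and known_x must have the same length")
    if n < 3:
        raise ValueError("at least three samples are required")
    mean_x = sum(known_x) / n
    mean_y = sum(known_y) / n
    s_xx = sum((x - mean_x) ** 2 for x in known_x)
    s_yy = sum((y - mean_y) ** 2 for y in known_y)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(known_x, known_y))
    # Guard against tiny negative values caused by floating-point rounding.
    residual = max(0.0, s_yy - s_xy ** 2 / s_xx)
    return math.sqrt(residual / (n - 2))


# Hypothetical experimental values (x) and predicted values (y).
print(steyx([1.0, 2.0, 4.0], [1.0, 2.0, 3.0]))
```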
[0017] In step (d) of the method, a spectral similarity value obtained from mass spectral
database comparison can be used to generate a numerical value, wherein the spectral
similarity value and the analytical property score(s) are combined. This numerical
value is referred to herein as a match score, also referred to as the computer-assisted
structure identification (CASI) score in the figures. In a preferred embodiment, the
match score is calculated using a hyperbolic equation. The concept of the present
invention differs from those used in currently available methods, in which analytical
property values are used as a filter to select or deselect candidate compounds.
[0018] Optionally, for each query relating to a sample, the highest and second-highest match
scores can be compared by dividing the highest score by the second-highest to generate
a discrimination function, where a greater difference between the two scores generates
a higher discrimination function. The higher the discrimination function, the higher
the confidence score that can be assigned to each query. A confidence score can be
calculated by multiplying the highest match score by the discrimination function value.
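The discrimination function and the confidence score described in this paragraph can be sketched as follows. The function name `confidence` and the example match scores are hypothetical; the arithmetic (highest score divided by second-highest, then multiplied by the highest score) is taken from the paragraph above.

```python
def confidence(match_scores):
    """Return (discrimination, confidence) for the match scores of one query.

    discrimination = highest score / second-highest score
    confidence     = highest score * discrimination
    """
    ranked = sorted(match_scores, reverse=True)
    if len(ranked) < 2:
        raise ValueError("at least two candidate match scores are required")
    best, second = ranked[0], ranked[1]
    if second <= 0:
        raise ValueError("second-highest score must be positive")
    discrimination = best / second
    return discrimination, best * discrimination


# Hypothetical match scores of three candidate compounds for one query.
discrimination, score = confidence([0.6, 0.9, 0.3])
print(discrimination, score)
```

A query whose two best candidates score almost equally yields a discrimination value close to 1, so its confidence score stays close to the highest match score itself.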
[0019] In preferred embodiments of the method, step (c) comprises predicting values of multiple
analytical properties for each candidate compound. In one embodiment, a match score
is derived from the spectral similarity obtained from the mass spectral database comparison,
and a function of at least two analytical properties derived using a plurality of
molecular descriptors. In another embodiment, a match score is derived from the spectral
similarity value obtained from the mass spectral database comparison, and an analytical
property score wherein the analytical property is the relative second dimension retention
time derived by using a plurality of molecular descriptors.
[0020] Preferred analytical properties useful in the present invention include a Kovats
index, a boiling point and a relative second dimension retention time (2D rel RT)
index. If the predicted analytical properties used in the method of the invention
comprise a Kovats index and a 2D rel RT, the Kovats index and the 2D rel RT
are preferably calculated using different molecular descriptors. Preferably,
all three preferred analytical properties are used.
[0021] The Kovats indices of compounds are predicted using a linear equation comprising
a plurality of coefficients, each multiplied by the value of a molecular descriptor.
The equation is preferably obtained by using a test data set and a genetic algorithm
to select the molecular descriptors from a plurality of possible molecular descriptors,
and a linear regression or k nearest neighbors learning algorithm to correlate the
selected molecular descriptors with the value to predict.
[0022] The boiling points of compounds can be predicted based on experimentally determined
Kovats Indices. The boiling points of candidate compounds are calculated on the basis
of their individual chemical structures using software packages known in the art,
such as but not limited to ACD/PhysChem from ACD/Labs (Toronto, Canada).
[0023] In methods known in the art, the second dimension retention times are absolute second
dimension retention times and there is no known available method for calculating relative
2D retention times. The challenge for developing a relative model is to define a reference
system that is accessible for all second dimension peaks. This problem is solved by
referring to a reference system based on a function of hypothetical deuterated n-alkanes.
Deuterated or isotopically labelled compounds are used in a reference system for controlling
retention times or internal standard-based quantification. Although other substances
can be used as reference compounds, the n-alkanes are preferably used as a class of
substances for generating a hypothetical 2D RT reference system because this class of
compounds does not have any known complex interaction with the stationary phase in
the column of the second dimension separation system. This reference system therefore
adjusts for systemic shifts (such as different column length and gas flow), but not
for analyte-stationary phase shifts, as these shifts are specific to individual compounds.
Adjusting for systemic shifts is therefore the preferred method with regard to robustness
when adjusting across the complete compound space. In one embodiment of the invention, the
first dimension of the GCxGC-MS is separated in a non-polar environment and the second
dimension is separated in a polar environment.
[0024] In accordance with the present invention, a relative second dimension retention time
of a compound is advantageously calculated as a retention time relative to a hypothetical
n-alkane, whose first dimension retention time is derived from the regression function
based on a series of deuterated n-alkane reference standards. The relative second
dimension retention time of a compound is calculated as follows:

where 2D-rel RT_comp is the relative second dimension retention time of the compound;
abs 2D RT_comp is the measured absolute second dimension retention time of the compound;
and 2D RT_hypothetical n-alkane is calculated for each compound that elutes between
deuterated n-alkane standard compound 1 and compound 2:

where dA1 and dA2 are deuterated n-alkane 1 and deuterated n-alkane 2, and 1D RT
is the first dimension retention time of the respective molecules.
[0025] In the above-described method, neither the absolute nor the relative second dimension
retention times of candidate compounds are available. To use the relative second
dimension retention time as an analytical property, a quantitative model is established
using a set of training compounds, test compounds and optionally validation compounds.
[0026] The above-described methods are automated in Java and are available as a web service.
The descriptors for the prediction models were calculated using the Dragon software. RapidMiner
was used to apply the predictive retention models. Analytical scientists provide the
software with mass spectra files, KIs and 2D relative retention times. First, each mass
spectrum of the compound to identify is searched in various mass spectra databases
using NIST MS Search and the first 100 hits are returned. Structures are standardized
and structural duplicates are removed using Pipeline Pilot 8. For each hit, the KI, the relative
retention time for the second dimension and the boiling point (BP) are calculated using
predictive models. The final match score is calculated using a function that takes into account
the match factor of NIST MS Search and the difference between the predicted and experimental
values of each property for the compound to identify.
[0027] In a second aspect of the invention, there is provided a method for calculating a
relative second dimension retention time in GCxGC-MS (2-dimensional gas chromatography
coupled to mass spectrometry) for a compound comprising the steps of:
- (a) defining a reference system based on a function of deuterated n-alkanes that gives
the hypothetical retention time of the reference for a range of retention times;
- (b) transforming measured values of absolute second dimension retention times for
a plurality of training compounds of known molecular structure into the reference
system to calculate relative second dimension retention times for the training compounds;
- (c) using the relative second dimension retention times for the training compounds
to generate a quantitative structure-property relationship model of relative second
dimension retention time based on a plurality of molecular descriptors;
- (d) using the quantitative model to predict a relative second dimension retention
time of the compound.
[0028] The quantitative model of relative second dimension retention time is established
by:
(i) providing a set of training compounds of known structure and a set of test compounds
of known structure, and optionally a set of validation compounds of known structure;
(ia) generating the measured value of the absolute second dimension retention time
for each training compound, each test compound, and each validation compound in a
specific experimental setup, and transforming these into the reference system to
calculate relative second dimension retention times;
(ii) for each training compound, computing a set of molecular descriptors based on
chemical structure and properties;
(iii) selecting a set of molecular descriptors from the set of molecular descriptors
for use in a quantitative model of relative second dimension retention time, by
using a genetic algorithm;
(iv) generating a plurality of proposed quantitative models using the selected set
of molecular descriptors;
(v) evaluating each proposed quantitative model by computing a predicted value of
relative second dimension retention time for each test compound;
(vi) selecting the quantitative model according to the root mean square error (RMSE)
and/or the squared correlation (r²) on the calculated value from step (ia) and the
predicted value of the relative second dimension retention time for each test compound;
and optionally
(vii) selecting the quantitative model according to the squared correlation (r²) on the
calculated value and the predicted value of the relative second dimension retention
time for each validation compound.
[0029] Preferably, the genetic algorithm used in this aspect of the invention comprises:
(p) generating a plurality of candidate solutions using a combination of two or more
molecular descriptors in a machine learning algorithm such as but not limited to multiple
linear regression, k-nearest neighbour method, or support vector regression;
(q) scoring each candidate solution according to a fitness function based on the cross
validation squared correlation (q²) of the training compounds;
(r) generating new candidate solutions by recombining and/or mutating the candidate
solutions that produce an improved cross validation squared correlation; and
(s) repeating steps (q) and (r) for a finite number of times, for example, 10 to 50
generations.
[0030] Advantageously, the relative second dimension retention times used in the first aspect
of the invention are predicted by the method of the second aspect of the invention.
[0031] Optionally, the results obtained from the computer-assisted methods of the invention
based on chromatographic and mass spectral data generated by GCxGC-MS can be further
enhanced by using the accurate mass data obtained from gas chromatography-atmospheric
pressure chemical ionization-mass spectrometry (GC-APCI-MS). Data generated by the
two techniques can be matched by using a duplicate retention index system based on
an additional reference system of deuterated fatty acid methyl esters.
[0032] In a third aspect, the invention provides methods for confirming the match of a test
compound to a candidate compound identified in a database of two-dimensional gas chromatography
mass spectrometry. The methods comprise analysis of the same sample by gas chromatography
coupled with atmospheric pressure chemical ionization and time-of-flight mass spectrometry
(GC-APCI-TOF-MS, GC-APCI-TOF, or GC-APCI-MS) and comparing the theoretical monoisotopic mass with the
accurate mass measured by GC-APCI-TOF-MS. The prerequisite for the confirmatory method
is to match the retention indices of the two different chromatographic systems. To this end, the
Kovats index system from the GCxGC-TOF-MS analysis, based on deuterated n-alkanes, is bridged to another
retention index system based on deuterated fatty acid methyl esters (FAMEs). The system
based on deuterated FAMEs is used because deuterated n-alkanes are not ionizable by
the ion source of the GC-APCI-TOF-MS.
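The comparison of a theoretical monoisotopic mass with a measured accurate mass can be sketched as follows. This is a simplified illustration under stated assumptions: only the elements C, H, N and O are covered, formulas are written without brackets or charges, and no adduct or ion type is applied because the text does not specify which ion is observed. The names `monoisotopic_mass` and `ppm_error` are hypothetical.

```python
import re

# Monoisotopic masses (in u) of the most abundant isotopes.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.9949146196,
}


def monoisotopic_mass(formula):
    """Theoretical monoisotopic mass of a simple formula such as 'C20H34O'."""
    mass = 0.0
    # Each match is an element symbol followed by an optional count.
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONOISOTOPIC[element] * (int(count) if count else 1)
    return mass


def ppm_error(measured, theoretical):
    """Relative deviation of a measured mass from the theoretical mass, in ppm."""
    return (measured - theoretical) / theoretical * 1e6


# Geranylgeraniol has the molecular formula C20H34O.
theoretical = monoisotopic_mass("C20H34O")
print(theoretical, ppm_error(theoretical + 0.001, theoretical))
```

A candidate compound whose theoretical mass deviates from the measured accurate mass by more than the mass accuracy of the instrument would not be confirmed; the acceptable deviation is instrument-specific and is not given here.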
[0033] The Kovats index systems are established by: generation of a Kovats index system for
the GCxGC-TOF-MS system based on deuterated n-alkanes; analysis of deuterated FAMEs using
the GCxGC-TOF-MS system and determination of the Kovats indices of the FAMEs; analysis
of deuterated FAMEs using the GC-APCI-TOF-MS system and generation of a retention
index system for the GC-APCI-TOF-MS system based on deuterated FAMEs; and bridging of
the retention index system for the GC-APCI-TOF-MS system based on deuterated FAMEs with the
Kovats index system based on n-alkanes by using the Kovats indices of the deuterated FAMEs
for the GCxGC-TOF-MS system.
[0034] Accordingly, the invention provides methods comprising the steps of:
- (a) measuring Kovats indices of analytes relative to a first set of reference compounds
in GCxGC-TOF-MS;
- (b) measuring Kovats indices of a second set of reference compounds relative to the
first set of reference compounds in GCxGC-TOF-MS;
- (c) measuring absolute retention times of the second set of reference compounds in
a GC-APCI-TOF-MS; and
- (d) using the Kovats indices of the second set of reference compounds measured in
step (b) to derive by linear regression a function for converting the Kovats indices
of the analytes measured in step (a) into estimated absolute retention times of the
analytes in the GC-APCI-TOF-MS.
[0035] The function of step (d) is derived by linear regression for each retention time
range where an analyte is detected between two adjacent reference compounds of the
second set of reference compounds. The function is:

where a is a coefficient and b is a constant for a specific time range.
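A minimal sketch of the conversion of step (d) is given below. It assumes that the function has the linear form RT = a × KI + b within each range bounded by two adjacent reference compounds, which is inferred from the description of a as a coefficient and b as a constant; with only two reference points per range, the regression line reduces to the straight line through those points. The names `segment_coefficients` and `ki_to_rt`, and all numeric values, are hypothetical.

```python
def segment_coefficients(ki_1, rt_1, ki_2, rt_2):
    """Coefficient a and constant b of RT = a * KI + b for one range.

    (ki_1, rt_1) and (ki_2, rt_2) describe two adjacent reference compounds:
    their Kovats indices measured in GCxGC-TOF-MS and their absolute
    retention times measured in GC-APCI-TOF-MS.
    """
    a = (rt_2 - rt_1) / (ki_2 - ki_1)
    b = rt_1 - a * ki_1
    return a, b


def ki_to_rt(ki, references):
    """Estimate the GC-APCI-TOF-MS retention time of an analyte from its KI.

    references -- list of (kovats_index, retention_time) pairs for the
                  second set of reference compounds
    """
    refs = sorted(references)
    for (ki_1, rt_1), (ki_2, rt_2) in zip(refs, refs[1:]):
        if ki_1 <= ki <= ki_2:
            a, b = segment_coefficients(ki_1, rt_1, ki_2, rt_2)
            return a * ki + b
    raise ValueError("Kovats index lies outside the range of the references")


# Hypothetical deuterated FAME references: (KI in GCxGC-TOF-MS, RT in minutes).
fames = [(1000.0, 5.0), (1200.0, 7.0), (1400.0, 10.0)]
print(ki_to_rt(1100.0, fames), ki_to_rt(1300.0, fames))
```

Deriving a separate a and b for each range allows the conversion to follow a non-linear overall relationship between the two chromatographic systems.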
[0036] The method further comprises comparing the molecular masses of the analytes with
the molecular masses of the respective candidate compounds for each of the analytes.
[0037] In one embodiment, the method further comprises:
(e) measuring the absolute retention times of the analytes in the GC-APCI-TOF-MS;
(f) using the function calculated in step (d) to convert the absolute retention times
measured in step (e) into calculated Kovats indices in the GC-APCI-TOF-MS for the
analytes; and
(g) comparing the Kovats indices calculated in step (f) with the measured Kovats indices
from step (a).
[0038] Preferably, the first set of reference compounds comprises deuterated n-alkanes. Preferably,
the second set of reference compounds comprises deuterated fatty acid methyl esters.
Brief Description of the Drawings
[0039] Preferred embodiments of the present invention will now be described with reference
to the accompanying drawings, in which:
Figure 1 illustrates a traditional approach for compound structure identification
using GC-MS (NO: no compound identified with medium confidence; YES: compound identified
with medium confidence);
Figure 2 illustrates the CASI approach for compound structure identification using
GCxGC-MS system including use of GC-APCI-MS to confirm the results;
Figure 3 illustrates a process used to build the Kovats index and relative second
dimension retention time models;
Figure 4 shows a correlation of predicted and experimental values of Kovats
indices for a set of validation compounds;
Figure 5 shows a correlation between boiling point (BP) predicted from Kovats indices
and BP predicted from chemical structures by the ACD/PhysChem software from ACD/Labs for the
set of validation compounds (r² = 0.934);
Figure 6 shows a correlation between predicted retention times and experimental retention
times for the external test set of the GCxGC-MS system second column retention time
model;
Figure 7 shows a contribution equation of a theoretical scoring module (e.g. KIFIT...);
Figure 8 shows the result of CASI for Geranylgeraniol as presented by the computer
system of the present invention;
Figure 9 shows the position of the correct hit (i.e. structure candidate) for the
71 mass spectra to identify;
Figure 10 shows an embodiment of a computer system according to the present invention;
Figure 11 is a contingency table showing the true/false positives and true/false negatives
rate for CASI and NIST search;
Figure 12 shows a preferred embodiment of the CASI software architecture;
Figure 13 shows web interface output in which, for each structure to identify, the structure
candidate with the highest score is selected by default; and
Figure 14 shows web interface output wherein the user can change the selection.
Detailed Description of the Invention
[0040] Unless defined otherwise, all technical and scientific terms used herein have the
same meaning as commonly understood by one of ordinary skill in the art to which this
invention belongs. Although any methods, devices and materials similar or equivalent
to those described herein can be used in the practice or testing of the invention,
the preferred methods, devices and materials are now described.
[0041] All publications cited in this specification, including patent publications, are
indicative of the level of ordinary skill in the art to which this invention pertains
and are incorporated herein by reference in their entireties.
[0042] A high-throughput computer-assisted system for analyzing GCxGC-MS data, referred
to as Computer-Assisted Structure Identification (CASI) is provided in this invention.
The CASI system accelerates and standardizes the identification of compound structures,
whilst assuring the reproducibility, and enables higher confidence for correct assignment
of mass spectra to the right compounds. The concept of CASI is based on several steps
of spectral searches and their matches to the parameters that are predicted on-the-fly.
[0043] Firstly, mass spectra are searched for candidate compounds and their associated match
factors using an algorithm of the National Institute of Standards and Technology (NIST,
Gaithersburg, MD, USA) MS Search in the NIST 08 and WILEY 9th ed. Mass Spectra databases.
Secondly, we have developed Quantitative Structure-Property Relationship (QSPR)
models that predict analytic properties to enhance the confidence in compound identification.
Two analytic properties, Kovats indices for first dimension (1D) separation and relative
retention times for second dimension (2D) separation, are predicted by using these
models. Preferably, the Kovats indices and relative 2D RT are calculated using different
molecular descriptors. In addition, a third analytic property, the boiling point
of a compound, is derived from the measured 1D RT of an analyte and is matched to
the computationally predicted boiling points of the candidate compounds. The boiling points
are calculated by software known in the art, such as ACD/PhysChem software. Finally,
the CASI system combines the matching results of NIST MS search and all parameters
predicted in QSPR models to produce a match score, also referred to as a CASI score
(Figure 2). Optionally, the discriminatory power is calculated for each identified
compound to measure confidence of the assignment. Optionally, the proposed chemical
structure is confirmed by GC-APCI-TOF.
Models for prediction of analytical properties
[0044] All QSPR models for the development of CASI are built under the same principles.
Compounds of known structure are split randomly into a training set (in this example,
90 compounds) and a test set (in this example, 35 compounds). In addition, in this
example, 35 different compounds are used as a validation set. Without limitation,
50 to 500 compounds can be used for training. Different distribution of compounds
between the sets could be chosen for model establishment. Chemical structures represented
in computer-readable format are prepared using software known in the art, in this
case, Pipeline Pilot 8.0.1 (Accelrys, Inc. San Diego, California, USA). During the
preparation, salts are stripped from the compounds' structures using a predefined
list, largest fragments are kept, bases are deprotonated and acids are protonated,
charges of functional groups are standardized, hydrogens are added, canonical tautomers
are generated, and 2D coordinates are generated. Then the duplicate structures are
removed.
[0046] To construct a predictive model, a set of predictive descriptors is selected in RapidMiner
5 (Rapid-I GmbH, Dortmund, Germany). Other similar data mining software platforms known
in the art can also be used. Several molecular descriptor selection experiments using
forward selection and a genetic algorithm were tried. The performance of forward selection
is acceptable, but this method has the drawback of becoming trapped in local minima. Stochastic
methods like genetic algorithms generally perform better. For this reason, genetic
algorithms are used to select molecular descriptors.
[0047] The implementation of genetic algorithms in the systems of the invention uses roulette-wheel
selection and two-point crossover. Each string of molecular descriptors, referred to
as a "chromosome", contains a predefined number of "genes", and each gene codes for a
descriptor. Generally, between 2 and 15 descriptors are selected. The genes are not binary,
but contain the position of the corresponding descriptor in a list. This allows using
a minimum number of descriptors. The fitness function sets the subset of descriptors
in the "Select Attributes" nodes of the RapidMiner process, executes it, and gets
the root mean squared error of the training set as the fitness score. The mutation rate
is set to 0.1, the number of chromosomes per generation is set to 20 to 40, preferably
30, and the number of generations is set to 100 to 300, preferably 200. The two best
chromosomes survive at each generation.
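The selection scheme described above can be illustrated with a minimal Python sketch. It is not the RapidMiner-based implementation: the function names, the caller-supplied `fitness` callable (standing in for the training-set root mean squared error), the conversion of an error into a roulette-wheel weight via 1 / (1 + error), and the per-gene application of the mutation rate are all assumptions made for illustration. The defaults (30 chromosomes, 200 generations, mutation rate 0.1, two surviving chromosomes) follow the values given in the text.

```python
import random

def two_point_crossover(a, b, rng):
    """Exchange the genes between two random cut points of parents a and b."""
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def roulette_pick(chromosomes, weights, rng):
    """Roulette-wheel selection: pick probability is proportional to weight."""
    return rng.choices(chromosomes, weights=weights, k=1)[0]

def evolve(fitness, n_descriptors, n_genes, pop_size=30, generations=200,
           mutation_rate=0.1, n_elite=2, seed=0):
    """Search for the descriptor subset with the lowest fitness value.

    Each chromosome is a list of integer genes; each gene is the position
    of a descriptor in the descriptor list (non-binary encoding).
    """
    rng = random.Random(seed)
    pop = [[rng.randrange(n_descriptors) for _ in range(n_genes)]
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(((fitness(c), c) for c in pop), key=lambda t: t[0])
        ranked = [c for _, c in scored]
        # Assumption: a lower error gives a larger slice of the wheel.
        weights = [1.0 / (1.0 + s) for s, _ in scored]
        nxt = [list(c) for c in ranked[:n_elite]]  # the best chromosomes survive
        while len(nxt) < pop_size:
            p1 = roulette_pick(ranked, weights, rng)
            p2 = roulette_pick(ranked, weights, rng)
            for child in two_point_crossover(p1, p2, rng):
                # Assumption: the mutation rate is applied gene by gene.
                nxt.append([rng.randrange(n_descriptors)
                            if rng.random() < mutation_rate else g
                            for g in child])
        pop = nxt[:pop_size]
    return min(pop, key=fitness)
```

Because the best chromosomes are copied unchanged into each new generation, the best fitness value found can never get worse from one generation to the next.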
[0048] In an exemplary workflow using RapidMiner, data preparation consists of a node
which selects a subset of attributes, normalization with Z-transformation, and separation
of the data set into a training set (75%) and a test set (25%). Then a linear regression
is applied on the training set, and the learned model is applied on both the training set
and the test set. In addition, leave-one-out cross validation is carried out on the training
set. Various learning algorithms are used to build the models for prediction
of KI and relative second dimension retention time, such as but not limited to k-Nearest
Neighbors (k-NN), Multi Linear Regression (MLR) and Support Vector Regression (SVR).
For each learning algorithm, from 2 to 15 descriptors are used to generate the models.
At the end of the modeling run, the best model is kept for each value to predict. This
process is described in Figure 3.
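The data preparation and cross validation steps of this workflow can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the RapidMiner process: the function names are hypothetical, the Z-transformation uses the population standard deviation, and a small k-nearest-neighbors regressor stands in for whichever learning algorithm is being evaluated. The cross-validated q2 is computed here as 1 - PRESS / SS_total, which is a common definition but is not spelled out in the text.

```python
import random
from statistics import mean, pstdev

def z_transform(rows):
    """Column-wise Z-transformation: (x - column mean) / column std deviation."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(x - m) / s for x, m, s in zip(r, mus, sds)] for r in rows]

def split(rows, targets, train_fraction=0.75, seed=0):
    """Random split into a training set (75%) and a test set (25%)."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_fraction * len(idx)))
    tr, te = idx[:cut], idx[cut:]
    return ([rows[i] for i in tr], [targets[i] for i in tr],
            [rows[i] for i in te], [targets[i] for i in te])

def knn_predict(train_x, train_y, x, k=2):
    """Predict the mean target of the k training rows closest to x."""
    dists = sorted((sum((a - b) ** 2 for a, b in zip(row, x)), y)
                   for row, y in zip(train_x, train_y))
    return mean(y for _, y in dists[:k])

def q2_leave_one_out(xs, ys, k=2):
    """Leave-one-out q2: each compound is predicted from all the others."""
    preds = [knn_predict(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], xs[i], k)
             for i in range(len(xs))]
    y_bar = mean(ys)
    press = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - press / ss_tot
```

In such a setup, `q2_leave_one_out` would be called on the training portion returned by `split`, and the test portion would be kept aside for the r2 reported on the test set.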
Kovats indices model
[0049] In this example of prediction of KI, the genetic algorithm (GA) was combined with
three different learning algorithms. The results are presented in Table 1:
Table 1. Results of the best models for KI with multi linear regression, k-nearest neighbors
and support vector machine regression. Q2 values were obtained with leave-one-out
cross validation for MLR and 10-fold cross validation for kNN, and the RMSE value was
obtained by 5-fold cross validation for SVR. The result shown in bold is selected as
the best solution.
| | | GA - MLR | GA - kNN | GA-epsilon SVR (linear kernel) |
|---|---|---|---|---|
| KI | Q2 | 0.988 | 0.972 | 0.979 (C = 1.9 x 10^-3) |
| | R2 (test set) | 0.982 | 0.956 | 0.957 |
[0050] The best results were obtained with a genetic algorithm - linear model using 15 descriptors.
Exemplary descriptors are presented in Table 2; these or any other suitable descriptors
may be used. Results obtained with this linear model are very good, with r2 on the training
set = 0.991, q2 for leave-one-out on the training set = 0.988 and r2 on the test set = 0.982.
r2 on the external test set is also very good (r2 = 0.985, see Figure 4).
Table 2. Descriptors used in the selected KI model.
| Coefficient | Descriptor | Description |
|---|---|---|
| 236.746 | nSK | Number of non-H atoms. |
| - 140.487 | TI1 | First Mohar index TI1. |
| + 60.674 | Wap | All-path Wiener index. |
| - 57.063 | Jhetm | Balaban-type index from mass weighted distance matrix. |
| + 54.075 | PW4 | Path/walk 4-Randic shape index. |
| 117.349 | AAC | Mean information index on atomic composition. |
| + 67.819 | ATS6v | Broto-Moreau autocorrelation of a topological structure - lag 6 / weighted by atomic van der Waals volumes. |
| + 149.892 | EEig10x | Eigenvalue 10 from edge adj. matrix weighted by edge degrees. |
| - 101.933 | EEig10d | Eigenvalue 10 from edge adj. matrix weighted by dipole moments. |
| + 69.663 | BEHe3 | Highest eigenvalue n. 3 of Burden matrix / weighted by atomic Sanderson electronegativities. |
| - 58.337 | nCrq | Number of ring quaternary C(sp3). |
| - 7.834 | C-034 | Fragment R-CR..X |
| + 49.347 | Hy | Hydrophilic factor. |
| - 44.028 | Inflammat-80 | Ghose-Viswanadhan-Wendoloski anti-inflammatory-like index at 80 %. |
| + 283.204 | F02[C-C] | Frequency of C-C at topological distance 2. |
| 1609.956 | | |
[0051] In another example of prediction of KI, a genetic algorithm - linear model using 12
descriptors is used. Exemplary descriptors are presented in Table 3 below. This linear
model yielded r2 training set = 0.992, q2 leave one
out = 0.999 and r2 test set = 0.983. r2 on external test set.
Table 3
| Coefficient | Descriptor | Description |
|---|---|---|
| 2490.980 | nSK | Number of non-H atoms. |
| - 3470.745 | nC | Number of C atoms |
| -48.955 | nR06 | Number of 6 membered rings |
| -48.134 | Qindex | Quadratic index |
| -211.303 | DELS | Molecular electrotopological variation |
| -45.839 | SRW09 | Self-returning walk count of order 9 |
| -63.030 | CIC3 | Complementary information content (neighbourhood symmetry of 3-order) |
| +328.644 | ATS1p | Broto-Moreau autocorrelation of topological structure - lag 1 / weighted by atomic polarizabilities |
| +25.916 | EEig15x | Eigenvalue 15 from edge adj. matrix weighted by edge degrees. |
| -31.625 | JGI6 | Mean topological charge index of order 6 |
| -59.809 | B01[C-Si] | Presence/absence of C-Si at topological distance 1 |
| +1539.797 | F01[C-C] | Frequency of C-C at topological distance 1 |
| +1561.023 | | |
Boiling Point Model
[0052] In this example, the correlations between the boiling point (calculated with ACD/Labs
ACD/PhysChem) and the boiling point calculated from Kovats Indices values are: r2
training set = 0.955, r2 test set = 0.910 and r2 validation set = 0.934 (Figure 5).
The equation obtained is:

[0053] In another example, the correlations between the boiling point (calculated with ACD/Labs
ACD/PhysChem) and the boiling point calculated from Kovats Indices values are: r2
training set = 0.902, q2 leave one out = 0.899, r2 test set = 0.891 and r2 validation
set = 0.934 (Figure 3). The equation obtained is:

Relative second dimension retention time model
[0054] For the relative second dimension retention time of the GCxGC-MS, genetic algorithms
were used with three different learning algorithms. The results are presented in Table 4:
Table 4. Results of the best models for 2DRT with multi linear regression, k-nearest neighbors
and support vector machine regression. Q2 values were obtained with leave-one-out
cross validation for MLR and 10-fold cross validation for kNN, and the RMSE value was
obtained by 5-fold cross validation for SVR. The result shown in bold is selected as
the best solution.
| | | GA - MLR | GA - kNN | GA-epsilon SVR (linear kernel) |
|---|---|---|---|---|
| 2DRT | Q2 | 0.861 | 0.841 | 0.840 (C = 3.8) |
| | R2 (test set) | 0.750 | 0.673 | 0.827 |
[0055] One of the best models was obtained by using genetic algorithms and support vector
regression analysis. The results obtained are q2 leave one out = 0.840, r2 test set = 0.827
and r2 validation set = 0.849. The model is less accurate than the KI model. This can be
explained by the fact that the variance of the experimentally measured second dimension
retention times (respectively 2D relative RT) is higher than for the KI and, in addition, the
relation between the structures and the retention times is not linear. However, with
an r2 of 0.849 for the external test set, the model has good accuracy. In this example,
the model uses 8 descriptors as presented in Table 5.
Table 5. Descriptors used for the 2DRT model.
| Descriptor | Description |
|---|---|
| Wap | All-path Wiener index. |
| AMW | Average molecular weight. |
| X0Av | Average valence connectivity index chi-0. |
| nRCO | Number of ketones (aliphatic). |
| ZM2V | Second Zagreb index by valence vertex degrees. |
| JGI3 | Mean topological charge index of order 3. |
| X0A | Average connectivity index chi-0. |
| piPC10 | Molecular multiple path count of order 10. |
[0056] In another example, wherein the second dimension of the GCxGC-MS set up is polar,
one of the best models was obtained by using genetic algorithms and 2-nearest-neighbors
analysis. The results yielded q2 leave one out = 0.899, r2 test set = 0.816 and r2
validation set = 0.811. The model is less accurate than the KI model. This can be explained
by the fact that the reproducibility of the experimental measurements is lower, and that the
relation between the structures and the retention times is not linear. However, with a value
of r2 = 0.811 for the external test set, the model has good accuracy. In this particular
example, the model uses 14 descriptors as presented in Table 6.
Table 6. Descriptors used in the GCxGC-TOF second column retention time model.
| Descriptor | Description |
|---|---|
| AMW | Average molecular weight. |
| MSD | Mean square distance index (Balaban). |
| BLI | Kier benzene-likeness index. |
| PW5 | Path/walk 5 - Randic shape index. |
| ICR | Radial centric information index. |
| piPC04 | Molecular multiple path count of order 4. |
| X0Av | Average valence connectivity index chi-0. |
| AAC | Mean information index on atomic composition. |
| ATS5m | Broto-Moreau autocorrelation of a topological structure - lag 5 / weighted by atomic masses. |
| GATS2v | Geary autocorrelation - lag 2 / weighted by atomic van der Waals volumes. |
| BEHe1 | Highest eigenvalue n. 1 of Burden matrix / weighted by atomic Sanderson electronegativities. |
| F06[Si-Si] | Frequency of Si-Si at topological distance 6. |
| F09[C-O] | Frequency of C-O at topological distance 9. |
| F10[C-Si] | Frequency of C-Si at topological distance 10. |
Calculation of a match score
[0057] Scores are calculated from the spectral similarity value (in this example, the NIST
MS Search match factor), the predicted KI, the predicted second dimension relative retention
time of the GCxGC-TOF and the predicted boiling point, using a hyperbolic equation.
The general principle is based on the similarity of the experimental MS to the library MS, multiplied
by analytical property scores derived from each analytical property (KI, BP ...).
The analytical property scores (KIFIT, BPFIT...) are normalized from 0 (no similarity)
to 1 (perfect match). The scores are based on a quadratic equation obtained via polynomial
factorization, of the type:

[0059] The complete equation is:

[0060] With:
- HYPKI: hyperbolic equation which is used to correct the value of the NIST Match Factor in the CASI score.
- KIPre: predicted Kovats Index
- KIExp: measured Kovats Index
- nKI: factor (for curve fitting), e.g. nKI for Kovats Index
- SEPKI: standard error of prediction
[0061] Curve Analysis:
- Maximum: if KIPre = KIExp, y = 1
- Zero-crossing1: KIPre = KIExp - nKI x SEPKI
- Zero-crossing2: KIPre = KIExp + nKI x SEPKI
[0062] A graphical interpretation of the derived hyperbolic equation is shown in Figure
7.
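The equation itself is not reproduced in this text, so the following Python sketch only shows one quadratic form that satisfies the curve analysis above: it equals 1 when the predicted and measured values coincide and crosses zero at the measured value plus or minus n times the standard error of prediction. The function name `property_fit` and the clamping of negative values to 0 (so that the score stays within the stated range of 0 to 1) are assumptions made for illustration.

```python
def property_fit(predicted, measured, n, sep):
    """Quadratic fit score with the properties listed in the curve analysis.

    With x = (predicted - measured) / (n * sep), the score is
    (1 - x) * (1 + x) = 1 - x**2:
      - maximum of 1 when predicted == measured
      - zero crossings at predicted == measured - n * sep and measured + n * sep
    """
    x = (predicted - measured) / (n * sep)
    # Assumption: values beyond the zero crossings are clamped to 0.
    return max(0.0, 1.0 - x * x)
```

A larger value of n widens the parabola, so that a given prediction error is penalised less and the corresponding analytical property carries less weight in the final score.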
[0063] An exemplary formula for combining the three analytical property scores and the spectral
similarity value to calculate a match score, is as follows:

[0064] For each query of an analyte, the candidate compounds are ranked according to decreasing
CASI scores. CASI score is calculated according to the above-described equation. The
hit with the highest value is selected by default.
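The ranking step can be sketched as follows in Python. Since the exemplary formula is not reproduced in this text, the sketch assumes the general principle stated earlier, namely the spectral similarity value multiplied by the analytical property scores; the function names, the dictionary layout and the candidate names "A" and "B" are hypothetical.

```python
from math import prod

def casi_score(match_factor, fit_scores):
    """Assumed form: spectral match factor times all property fit scores."""
    return match_factor * prod(fit_scores)

def rank_candidates(candidates):
    """Sort candidates by decreasing score; the first entry is the default hit."""
    return sorted(candidates,
                  key=lambda c: casi_score(c["match_factor"], c["fits"]),
                  reverse=True)

# Hypothetical candidates: "A" has the better spectral match, but its
# predicted Kovats index fits the measurement poorly (fit score 0.5).
candidates = [
    {"name": "A", "match_factor": 950, "fits": (0.5, 1.0, 1.0)},
    {"name": "B", "match_factor": 900, "fits": (1.0, 1.0, 1.0)},
]
ranked = rank_candidates(candidates)
default_hit = ranked[0]["name"]
```

In this toy case candidate "B" is ranked first although its match factor is lower, which is the kind of re-ordering the analytical property scores are meant to produce.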
Score optimization
[0065] In calculating the CASI score, each of the three analytical property scores has four
parameters. However, only nx has to be established, which defines the value at which the
hyperbolic curve crosses the X axis. nx contributes to the shape of the hyperbolic curve,
and thus to the weight of each analytical property score in the final CASI score.
[0066] A grid search procedure is provided to establish optimal values for nKI, n2DrelRT and
nBP. A solution's score is generated by using every possible combination of integer values
between 1 and 50 for each of nKI, n2DrelRT and nBP. The solution's score is the number of
correct hits sorted first for the training set and the test set. The solution with the highest
number of correct hits is selected. The algorithm can be described as follows:
- for nKI in 1 .. 50
  - for n2DrelRT in 1 .. 50
    - for nBP in 1 .. 50
      - compute the CASI score for the compounds in the training sets and in the test sets using the values of nKI, n2DrelRT and nBP of this iteration.
      - count the number of correct hits for this iteration.
- select the values of the solution with the greatest number of correct hits.
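The grid search above can be written as a short Python sketch. The callable `count_correct_hits` is a hypothetical placeholder for the step that computes the scores for the training and test compounds and counts how many correct structures are ranked first; the tie-breaking rule (the first combination reaching the maximum is kept) is also an assumption, as the text does not specify one.

```python
from itertools import product

def grid_search(count_correct_hits, values=range(1, 51)):
    """Try every (nKI, n2DrelRT, nBP) combination and keep the best one."""
    best_params, best_hits = None, -1
    for n_ki, n_2d_rel_rt, n_bp in product(values, repeat=3):
        hits = count_correct_hits(n_ki, n_2d_rel_rt, n_bp)
        if hits > best_hits:  # strict comparison: first maximum wins
            best_params, best_hits = (n_ki, n_2d_rel_rt, n_bp), hits
    return best_params, best_hits
```

With integer values from 1 to 50 for each of the three parameters, the search evaluates 50 x 50 x 50 = 125,000 combinations.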
[0067] The selected nKI, n2DrelRT and nBP parameters will be used in the final validation
step of the configuration in CASI.
Validation
[0068] To validate the performance of the methods of the invention, a set of 71 molecules
whose identities are known is used. Results are shown in Figure 9. Some of these
molecules are present in the validation set used to validate the models, but none
of them is present in the training set or the test set. The results obtained by using
the CASI system are clearly better than those obtained using the NIST match factor alone:
with the CASI score, 51 correct hits are ranked first and 14 correct hits are ranked in
second position. Using the NIST Match Factor, 50 correct hits are ranked first but only
9 correct hits are ranked in second position.
The ranking of correct structures with the CASI score is compared to the ranking using
the NIST Match Factor in Table 7:
Table 7. Comparison of the position of correct hits by ranking based on CASI score
and ranking based on NIST Match Factor. CASI score performs better than NIST Match
Factor in terms of ranking of correct hits.
| Position of correct hits | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 10 | 20 |
|---|---|---|---|---|---|---|---|---|---|
| Frequency with CASI score | 51 | 14 | 3 | 2 | | 1 | | | |
| Frequency NIST Match Factor | 50 | 9 | 4 | 2 | 2 | 1 | 1 | 1 | 1 |
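The frequencies in Table 7 can be summarised as cumulative counts, as in the following Python sketch. The values are taken from the table, with empty cells read as 0 (an assumption); the helper name `top_k` is hypothetical.

```python
positions = [1, 2, 3, 4, 5, 6, 7, 10, 20]
casi = [51, 14, 3, 2, 0, 1, 0, 0, 0]  # empty table cells read as 0
nist = [50, 9, 4, 2, 2, 1, 1, 1, 1]

def top_k(frequencies, k):
    """Number of correct hits ranked at position k or better."""
    return sum(f for p, f in zip(positions, frequencies) if p <= k)

casi_top2 = top_k(casi, 2)
nist_top2 = top_k(nist, 2)
```

Both rows sum to the 71 validation molecules. Within the first two positions the CASI score places 65 correct hits and the NIST Match Factor 59.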
[0069] By analyzing the true/false positive and true/false negative rates shown in the contingency
table (Figure 11), the rate of false positive structural assignments is seen to be reduced
significantly for the CASI score compared to the NIST MS search. With the CASI score, every
9th structural assignment is a wrong assignment, whereas for the NIST MS search every
3rd structural assignment is a false one.
[0070] An illustrative example of the advantage of the CASI score is hentriacontane,
which is ranked in 20th position with the NIST MF but in 2nd position with the CASI
score, because of the accurate prediction of the KI. Another example, presented in
Figure 8, is geranylgeraniol: the CASI score as well as the NIST Match Factor rank the correct
hit in first position, but the CASI score gives a much higher discriminatory power.
[0071] These results clearly show that the CASI system improves the confidence and increases
the throughput in structure identification.
[0072] The results obtained from the CASI system can be confirmed by the use of GC-APCI-TOF-MS.
A sample comprising analytes is combined with deuterated n-alkanes and deuterated
fatty acid methyl esters (FAMEs) and divided into two aliquots. One aliquot is analyzed by
GCxGC-TOF-MS, wherein the Kovats indices of the FAMEs and analytes are determined using deuterated
n-alkanes as the reference system. The other aliquot is analyzed in a GC-APCI-MS, wherein
the absolute retention times of the FAMEs are determined. By applying the above-described
methods for bridging the retention index systems, the deviation of the Kovats Index was
found to be less than 1% between both systems and the mass deviation was found to
be less than 1 mDa for the GC-APCI-TOF-MS.
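The bridging between the two retention systems can be illustrated with a short Python sketch. It assumes a straight line between each pair of adjacent FAME reference compounds, of the form y = a * x + b with a and b specific to that range; the exact function, the direction of the conversion and all names and reference values below are assumptions made for illustration.

```python
from bisect import bisect_right

def segment(xs, ys, x):
    """Return (a, b) of the line y = a * x + b through the two adjacent
    reference points bracketing x. xs must be sorted in ascending order;
    outside the reference range the nearest end segment is used."""
    i = min(max(bisect_right(xs, x), 1), len(xs) - 1)
    a = (ys[i] - ys[i - 1]) / (xs[i] - xs[i - 1])
    b = ys[i - 1] - a * xs[i - 1]
    return a, b

def convert(xs, ys, x):
    """Convert x into the other system using the bracketing segment."""
    a, b = segment(xs, ys, x)
    return a * x + b

def percent_deviation(calculated, measured):
    return 100.0 * abs(calculated - measured) / measured

# Hypothetical reference data: Kovats indices of three FAMEs measured by
# GCxGC-TOF-MS and their absolute retention times (min) in the GC-APCI system.
fame_ki = [1000.0, 1200.0, 1400.0]
fame_rt = [10.0, 14.0, 20.0]

estimated_rt = convert(fame_ki, fame_rt, 1100.0)  # KI -> expected retention time
calculated_ki = convert(fame_rt, fame_ki, 17.0)   # measured RT -> calculated KI
```

The calculated Kovats index can then be compared with the index measured by GCxGC-TOF-MS, for example with `percent_deviation`, to check whether it lies within the accepted window.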
[0073] The ability to confirm proposed structures using accurate masses measured by GC-APCI-TOF-MS
was tested. The method is used to confirm the proposed structures of 155 compounds
present in cigarette smoke. 120 of the 155 compounds are ionizable in the GC-APCI-TOF-MS.
106 compounds are detected within the retention time index window and 85 compounds
are confirmed automatically.
The CASI system
[0074] Figure 10 is a block diagram of a computer system for analysing mass spectral data
in GCxGC mass spectrometry. The system includes a web interface 1000, a match score
generator engine 2100, a structural candidate search engine 2200 which accesses a
structural candidate database 2210, a descriptor selection and model generation engine
2300 and a descriptor computation engine 2400. The system further includes a chemical
structure generator 3100 which accesses a name-to-structure database 3200. The components
of the system may be software applications operating on a single server or may be
distributed over multiple computing systems communicating via network interfaces including
wireless communication systems. However, in the embodiment shown in Figure 10, the
match score generator engine 2100, structural candidate search engine 2200, descriptor
selection and model generation engine 2300 and descriptor computation engine 2400
are interconnected software applications operating on a match score server 2000, on
which structural candidate database 2210 is also stored. The chemical structure generator
3100 and name-to-structure database 3200 operate on a second server 3000, although
they may also operate on match score server 2000.
[0075] Input data 100 is input via web interface 1000. Input data may be in the form of a JDX
file, and comprises mass spectra data from a sample, and may further include experimental
values for analytical properties such as Kovats index data, boiling point data and
2D retention time data. The web interface 1000 may communicate with the match score
generator engine 2100 via SOAP (Simple Object Access Protocol).
[0076] The computer system operates in two modes, a training mode and an analysis mode.
The training mode may be run at any time, but it is necessary to run the computer
system in training mode every time the mass spectrometer experimental set up is changed.
In the training mode, the input data are mass spectrometer data and measured values
of an analytical property such as Kovats index, for a set of known compounds.
[0077] For each of the known compounds, the chemical structure in computer readable form
is generated by the chemical structure generator 3100 which accesses the name-to-structure
database 3200. The chemical structure generator 3100 may be Pipeline Pilot 7.5.1 software,
and the database 3200 may be an ACD database.
[0078] For all of the known compounds, molecular descriptors are calculated by descriptor
computation engine 2400, which may be the Dragon software package. The known compounds
are divided into a training set and a test set. For the training set, descriptor selection
and model generation engine 2300, which may be RapidMiner software, selects a set
of predictive descriptors using forward selection and a genetic algorithm as described
in detail above to construct a predictive model for predicting values of an analytical
property, such as Kovats indices or 2D retention time, for the training compound structures.
The predictive model is verified using the test set, as described in more detail above,
and a model is selected.
[0079] In the analysis mode, the input data 100 is mass spectrometry data from a sample.
The structural candidate search engine 2200 carries out a search in structural candidate
database 2210 by comparing the mass spectra data from the sample with mass spectra
data in the database 2210, to generate a number of structural candidate compounds
based on similarity of the mass spectra data with the data in the database 2210. The
selected candidate compounds may be, for example, the top 100 matches. The search
engine may be an NIST MS search algorithm, and the database 2210 may be the NIST 08
and WILEY 9th ed Mass Spectra databases. The list of structural candidates is made
available for the user to view via web interface 1000. Each candidate has a match
factor indicative of the similarity of the mass spectra data for the sample with the
data in the database 2210 for the candidate. The match factor is generated by the
structural candidate search engine 2200, and may also be displayed to the user via
the web interface 1000 for each structural candidate.
[0080] For each of the structural candidates, the chemical structure in computer readable
form is generated by the chemical structure generator 3100 which accesses the name-to-structure
database 3200. The chemical structure generator 3100 may be Pipeline Pilot 7.5.1 software,
and the database 3200 may be an ACD database.
[0081] For all of the structural candidates, molecular descriptors are calculated by descriptor
computation engine 2400, which may be the Dragon software package.
[0082] The model generated by the descriptor selection and model generation engine 2300
in the training mode is then used to predict the analytical property, such as Kovats
index or 2D retention time, for the candidate structures. The descriptor selection
and model generation engine 2300 supplies the model to the match score generator engine
2100 which calculates predicted values of one or more analytical properties based
on the model. The predicted values may be communicated to the user via web interface
1000.
[0083] The match score generator engine 2100 calculates a match score for each candidate
compound based on the match factors generated by the structural candidate search engine
2200, the predicted values of the analytical properties predicted by the model provided
by the descriptor selection and model generation engine 2300, and measured values
of the analytical properties of the sample which were included in input data 100.
The match score generator engine 2100 may calculate a CASI score in accordance with
the method described above. The match scores may also be communicated to a user via
web interface 1000.
[0084] The web interface 1000 may display the results to the user in the form of a table,
listing the structural candidates, the match factors generated by the structural candidate
search engine 2200, the predicted values of the analytical properties generated by
the model generation engine 2300, and the match score. The table may be sorted to
rank the structural candidates by their match scores.
[0085] Once a model for predicting an analytical property has been generated by descriptor
selection and model generation engine 2300 in the training mode, there is no need
to generate a model again for a new set of input data, i.e. a new sample for identification,
and a new set of structural candidates, provided the experimental set up has not changed.
If the experimental set up is changed, it is necessary to generate a new model by
running the system in the training mode. Therefore, the descriptor selection and model
generation engine 2300 supplies the selected model to the match score generator 2100,
which, in the analysis mode, applies the model to the structural candidates to generate
predicted values for the analytical property. In this way, in the analysis mode, access
to the descriptor selection and model generation engine 2300 is not required. Access
to the descriptor selection and model generation engine 2300 is only required in the
training mode for generation of a new model. The descriptor selection and model generation
engine 2300 may thus be provided on a separate computing device, e.g. a server, which is
only accessed in the training mode.
[0086] A preferred embodiment of the software architecture is illustrated in Figure 12.
[0087] Oracle Application Express is used for the development of the web interface 1000. A SOAP
interface allows Oracle Application Express to communicate with the match score generator
engine 2100, which is developed in Java and runs in Tomcat. RapidMiner is used as the
descriptor selection and model generation engine 2300 and is integrated by Java API. Java
is used to implement the match score generator engine 2100 mainly because RapidMiner can
be easily integrated in Java.
[0088] The structural candidate search engine 2200 comprises NIST MS Search and is integrated
by command line. The chemical structure generator 3100 is Pipeline Pilot and is integrated
with Java API. It is used to convert names of the hits to structures (using ACD/Labs
name-to-structure and an internet connection to ChEMBL), to standardize
the structures, to compute boiling point (ACD/Labs PhysChem Batch) and to move data
from CASI to a chemical registry database. The descriptor computation engine 2400
comprises Dragon and is integrated by command line. In addition to these software
modules, the Java libraries Log4J (for logging error messages), Hibernate (for mapping
the objects to the Oracle database) and JUnit (for the unit tests) are used.
[0089] Figures 13 and 14 illustrate outputs of the web interface 1000. For a given analysis,
all compounds to identify are presented with the structure candidate having the best
score (Figure 13). Structure candidates can be browsed and the selection can be changed
(Figure 14). All structure candidates (Hits) for the compound to identify (Query, in
this case 1-Pentene, 2,3-dimethyl) are listed with predicted properties. The one with
the best score is selected by default. The user can change the selection and add comments,
which will be inserted with the selected structure into a chemical registration system.
1. A method for analysing mass spectral data obtained from a sample in GCxGC (2-dimensional)
mass spectrometry, comprising:
(a) comparing mass spectral data of an analyte with mass spectral data of candidate
compounds of known structure in a library;
(b) identifying a plurality of candidate compounds from the library based on similarities
of mass spectral data;
(c) predicting, for each candidate compound, a value of at least one analytical property
using a quantitative model based on a plurality of molecular descriptors; and
(d) calculating a match score for each candidate compound based on the value predicted
in step (c) and a measured value of the analytical property for the analyte.
2. The method of claim 1, wherein step (c) comprises predicting, for each candidate compound,
values of a plurality of analytical properties, wherein the predicted analytical properties
include at least one of a Kovats index, a boiling point and a relative second dimension
retention time.
3. The method of claim 1 or 2, wherein the relative second dimension retention time of
the analyte is a function of the absolute second dimension retention time of the compound
and the second dimension retention time of a hypothetical deuterated n-alkane, wherein
the second dimension retention time of a hypothetical deuterated n-alkane is calculated
according to a linear regression on the absolute first dimension retention times and
absolute second dimension retention times of a series of deuterated n-alkanes.
4. The method of any one of the preceding claims, wherein the match score is additionally
based on the similarity of mass spectral data in step (b).
5. The method of claim 1, wherein the quantitative model of step (c) is obtained by using
a test data set and a genetic algorithm to select the molecular descriptors from a
plurality of possible molecular descriptors, and using a machine learning algorithm
selected from linear regression, support vector regression, or k nearest neighbours
method to correlate the selected molecular descriptors with the value to predict.
6. The method of claim 1, wherein said quantitative model of step (c) is the product
of a method for establishing a quantitative model, which comprises the following steps:
(i) providing a set of training compounds of known structure and a set of test compounds
of known structure, and optionally a set of validation compounds of known structure;
(ii) generating a measured value of an analytic property for each training compound,
each test compound, and each validation compound;
(iii) for each training compound, computing a set of molecular descriptors based on
chemical structure and properties;
(iv) selecting a set of molecular descriptors from the set of molecular descriptors
for use in a quantitative model of the analytical property, by using a genetic algorithm;
(v) generating a plurality of proposed quantitative models using the selected set
of molecular descriptors;
(vi) evaluating each proposed quantitative model by computing a predicted value of
the analytical property for each test compound;
(vii) selecting the quantitative model according to the root mean square error (RMSE)
and/or the squared correlation (r2) on the measured value and the predicted value of the analytical property for each
test compound; and optionally
(viii) selecting the quantitative model according to the root mean square error (RMSE)
and/or the squared correlation (r2) on the measured value and the predicted value of the analytical property for each
validation compound.
7. The method of claim 6, wherein using the genetic algorithm of (iv) comprises (p)
generating a plurality of candidate solutions using a combination of two or more molecular
descriptors in a machine learning algorithm selected from multiple linear regression,
k-nearest neighbour method, or support vector regression;
(r) scoring each candidate solution according to a fitness function based on the cross
validation squared correlation (q2) of the training compounds;
(s) generating new candidate solutions by recombining and/or mutating the candidate
solutions that produce an increased cross validation squared correlation; and
(t) repeating steps (r) and (s) for a finite number of times.
8. The method of any one of the preceding claims, further comprising verifying a candidate
structure by a method comprising the steps of:
(A) measuring Kovats indices of analytes relative to a first set of reference compounds
in GCxGC-TOF-MS;
(B) measuring Kovats indices of a second set of reference compounds relative to the
first set of reference compounds in GCxGC-TOF-MS;
(C) measuring absolute retention times of the second set of reference compounds in
a GC-APCI-TOF-MS; and
(D) using the Kovats indices of the second set of reference compounds measured in
step (B) to derive by linear regression a function for converting the Kovats indices
of the analytes measured in step (A) into estimated absolute retention times of the
analytes in the GC-APCI-TOF-MS.
9. The method of claim 8, further comprising:
(E) measuring the absolute retention times of the analytes in the GC-APCI-TOF-MS;
(F) using the function calculated in step (D) to convert the absolute retention times
measured in step (E) into calculated Kovats indices in the GC-APCI-TOF-MS for the
analytes; and
(G) comparing the Kovats indices calculated in step (F) with the measured Kovats indices
from step (A).
10. The method of claim 8 or 9, wherein the function of step (D) is derived by linear
regression for each retention time range where an analyte is detected between two
adjacent reference compounds of the second set of reference compounds, wherein the
function is:

where a is a coefficient and b is constant for a specific time range.
11. The method of any one of claims 8 to 10, further comprising comparing the molecular
masses of the analytes with the molecular masses of the respective candidate compounds
for each of the analytes.
12. The method of any one of claims 8 to 11, wherein the first set of reference compounds
are deuterated n-alkanes and the second set of reference compounds are deuterated fatty acid
methyl esters.
13. A method of calculating a predicted relative second dimension retention time in a
GCxGC-MS (2-dimensional gas chromatography coupled to mass spectrometry) for a molecular
structure comprising the steps of:
(a) defining a reference system based on a function of hypothetical deuterated n-alkanes;
(b) transforming measured values of absolute second dimension retention times for
a plurality of training compounds of known molecular structure into the reference
system to calculate relative second dimension retention times for the training compounds;
(c) using the relative second dimension retention times for the training compounds
to generate a quantitative model of relative second dimension retention time based
on a plurality of molecular descriptors;
(d) using the quantitative model to predict a relative second dimension retention
time of the molecular structure.
14. A computer system programmed to carry out the method of any one of claims 1 to 19,
operatively connected to a GCxGC (2-dimensional) mass spectrometer.