Annotating LC-MS data with PredRet in R

R
metabolomics
PredRet
Here I show how you can use the PredRet database to annotate your data in R
Author

Jan Stanstrup

Published

November 22, 2015

In this tutorial I will show you how you can use the PredRet database to annotate your LC-MS metabolomics data directly in R. The annotation function I will be using is general-purpose, though, so you can use it to annotate any data where you have a list of compounds with known m/z’s and retention times (RTs).

What is PredRet?

In short PredRet is a user-driven database of compound retention times. The purpose of PredRet is to be able to predict the RT of a compound in one (your!) chromatographic system if it has been experimentally determined in another chromatographic system by someone, somewhere in the world. You can download the paper here and visit the project’s home page at PredRet.org.

Pulling data from PredRet

So lets get going.

First we will download the PredRet database with the PredRetR package. We will get both the experimental and the predicted values for the chromatographic system “LIFE_old”.

library(PredRetR)

predret_db <- PredRet_get_db(exp_pred = c("exp","pred"),
                             systems = "LIFE_old"
                            )

Then lets take a look at the structure of the data.frame we have retrieved. If the recorded_rt column has data we have an experimentally determined RT, if the predicted_rt column has data we have a RT predicted with the PredRet systems. The ci_lower and ci_upper columns show the prediction interval for the predictions.

library(knitr)
library(dplyr)

dim(predret_db)
[1] 3367   12
head(predret_db) %>% kable(format="html", table.attr = 'class="nicetable"')
system name pubchem inchi date added username predicted suspect recorded_rt predicted_rt ci_lower ci_upper
LIFE_old (DL)-p-hydroxyphenyllactic acid NA InChI=1S/C9H10O4/c10-7-3-1-6(2-4-7)5-8(11)9(12)13/h1-4,8,10-11H,5H2,(H,12,13) 2019-09-06 17:29:46 jan FALSE FALSE 1.544350 NA NA NA
LIFE_old (R)-2-hydroxybutyric acid NA InChI=1S/C4H8O3/c1-2-3(5)4(6)7/h3,5H,2H2,1H3,(H,6,7) 2019-09-06 17:29:46 jan FALSE FALSE 1.399400 NA NA NA
LIFE_old 1-O-1'-(Z)-octadecenyl-2-hydroxy-sn-glycero-3-phosphocholine (LysoPC(P-18:0)) NA InChI=1S/C26H54NO6P/c1-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-22-31-24-26(28)25-33-34(29,30)32-23-21-27(2,3)4/h20,22,26,28H,5-19,21,23-25H2,1-4H3 2019-09-06 17:29:46 jan FALSE FALSE 4.947717 NA NA NA
LIFE_old 2-Hydroxy-2-methylbutyric acid NA InChI=1S/C5H10O3/c1-3-5(2,8)4(6)7/h8H,3H2,1-2H3,(H,6,7) 2019-09-20 02:26:09 jan FALSE TRUE 1.625783 NA NA NA
LIFE_old 2-Hydroxy-3-methylbutyric acid NA InChI=1S/C5H10O3/c1-3(2)4(6)5(7)8/h3-4,6H,1-2H3,(H,7,8) 2019-09-20 02:26:09 jan FALSE TRUE 1.685467 NA NA NA
LIFE_old 2-methylacetoacetic acid NA InChI=1S/C5H8O3/c1-3(4(2)6)5(7)8/h3H,1-2H3,(H,7,8) 2019-09-06 17:29:46 jan FALSE FALSE 1.612850 NA NA NA

We have the RT directly in the database above but we do not have the m/z. Since we have the InChI the easiest way to get the m/z is to extract the molecular formula and then use the Rdisop package to get the mass.

library(Rdisop)
library(stringr)

formulas   <- str_split_fixed(predret_db[,"inchi"],"/",3)[,2]
masses     <- sapply(formulas, function(x) getMass(getMolecule(x)))

predret_db <- cbind.data.frame(predret_db, mz_pos = masses + 1.0073)

We can then split the database in two. One for the experimental RTs and one for the predicted. Here I have used some pipe/dplyr style code and even a forward assign. If you are not familiar with this type of R code I urge you to look into it. It really makes code much more readable.

predret_db %>% filter(!is.na(recorded_rt)) -> predret_db_exp
predret_db %>% filter(is.na(recorded_rt))  -> predret_db_pred

Annotating a dataset

We now have the database ready for annotation.

So we can load a dataset/peaklist. This peaklist was previously created with XCMS and fragments/adducts annotated with CAMERA. But again any peaklist will do.

library(readxl)

data <- read_excel("../../static/material/data/peaklist_pos.xlsx") %>%
        replace(is.na(.), "") %>%
        as.data.frame

Lets take a look at the interesting columns we have in the dataset.

info_cols <- c("mz","rt","isotopes","adduct","pcgroup")
dim(data)
[1] 4232  470
head(data[,info_cols])
         mz       rt   isotopes              adduct pcgroup
1  62.06013 273.8827            [M+H-C5H8O]+ 145.11       1
2  95.08600 271.5567                                      1
3  98.09642 268.1695                                      1
4 104.10687 271.2400   [12][M]+ [M+H-COCH2]+ 145.11       1
5 105.11030 271.2415 [12][M+1]+                           1
6 114.09119 268.9771            [M+H-CH3OH]+ 145.11       1

Now we can use db.comp.assign from my chemhelper package to annotate the dataset.

We would probably want one column in our peaklist for the RTs we have determined experimentally and one for the predicted RTs. So we do the annotation twice.

The first two arguments to the function are the m/z and RT of the dataset. The next three are the name, m/z and RT of the database of known compounds. Lastly we give the tolerance for the m/z and RT for a database match.

library(chemhelper)

anno_exp <- db.comp.assign(mz           = data[,"mz"],
                           rt           = data[,"rt"],
                           comp_name_db = predret_db_exp[,"name"],
                           mz_db        = predret_db_exp[,"mz_pos"],
                           rt_db        = predret_db_exp[,"recorded_rt"] *60, # *60 since XCMS works in seconds but PredRet is in minutes.
                           mzabs        = 0.01,
                           ppm          = 15,
                           ret_tol      = 10
                           )
                                        [,1]
Unique hits                               89
Non-unique hits                           10
Compounds not found                      201
Markers had multiple compounds assigned   53
anno_pred <- db.comp.assign(mz          = data[,"mz"],
                           rt           = data[,"rt"],
                           comp_name_db = predret_db_pred[,"name"],
                           mz_db        = predret_db_pred[,"mz_pos"],
                           rt_db        = predret_db_pred[,"predicted_rt"] *60,
                           mzabs        = 0.01,
                           ppm          = 15,
                           ret_tol      = 10
                           )
                                        [,1]
Unique hits                              209
Non-unique hits                           15
Compounds not found                     2843
Markers had multiple compounds assigned   72

Now lets put the annotations together with our dataset.

data_annotated <- cbind.data.frame(data,anno_exp,anno_pred)

The result

Then lets take a look at one of the feature groups (from CAMERA) where we got an annotation. In this example we have a feature that was annotated as tryptophan. The “OR” is because tryptophan was in the database several times. If multiple compounds would fit the m/z and RT they would be written with “OR” between them.

There is also a fragment annotated as adipic acid and 2-methylglutaric acid using a hit from the predicted RTs. In this case we have almost perfect CAMERA annotation suggesting the pseudo-molecular ion is the m/z = 205.0979 feature and the compound is very likely tryptophan.

data_annotated %>%
  filter(pcgroup==16) %>%
  select(one_of(info_cols,"anno_exp","anno_pred")) %>%
  kable(format="html", table.attr = 'class="nicetable"')
mz rt isotopes adduct pcgroup anno_exp anno_pred
118.0657 91.31296 [M+H-NH3-CO-COCH2]+ 204.09 16
130.0655 91.40512 16 QUINOLINE
132.0808 91.27887 [44][M]+ [M+H-NH3-CO-CO]+ 204.09 16
133.0845 91.19346 [44][M+1]+ 16
142.0647 91.20588 [M+H-NH3-HCOOH]+ 204.09 16
144.0416 91.25665 16
144.0814 91.34230 [55][M]+ [M+H-NH3-CO2]+ 204.09 16 OSI1265 OR OSI1138 (non-unique hit)
145.0846 91.28341 [55][M+1]+ 16
146.0599 91.33676 [58][M]+ [M+H-NH3-COCH2]+ 204.09 16 4-Hydroxyquinoline
147.0647 91.36894 [58][M+1]+ 16 Adipic acid ADIPATE OR 2-METHYLGLUTARATE OR METHYGLUTARATE
159.0922 91.28882 [71][M]+ [M+H-HCOOH]+ 204.09 16 OSI6
160.0851 91.35138 [71][M+1]+ 16
170.0608 91.34747 [M+H-NH3-H2O]+ 204.09 16
188.0711 91.34805 [93][M]+ [M+H-NH3]+ 204.09 16
189.0743 91.28935 [93][M+1]+ 16
205.0979 91.32364 [102][M]+ [M+H]+ 204.09 16 tryptophan OR Tryptophan OR TRYPTOPHAN L-tryptopan (15N2, 98%)
206.1034 91.31068 [102][M+1]+ 16
245.1301 91.91434 [M+H+(CH3)2CO-H2O]+ (acetone cond.) 204.09 16
276.1823 91.41425 16 [3-Carboxy-2-(3-hydroxyhexanoyloxy)propyl]-trimethylazanium
409.1873 91.28252 [265][M]+ [2M+H]+ 204.09 16
410.1906 91.26287 [265][M+1]+ 16
447.1336 91.29100 [2M+K]+ 204.09 [4M+2K]2+ 204.09 16

Lets take a look at another feature. This time we have experimental RTs that say the feature is either 1,7-dimethylxanthine OR theobromine but a prediction from the PredRet database also suggest that the feature could be theophylline as well.

data_annotated %>%
  filter(pcgroup==400) %>%
  select(one_of(info_cols,"anno_exp","anno_pred")) %>%
  kable(format="html", table.attr = 'class="nicetable"')
mz rt isotopes adduct pcgroup anno_exp anno_pred
124.0508 89.67717 400
181.0725 89.56150 [86][M]+ [M+H-NH3-CO2-NH3-H2O]+ 276.119 [M+H-C4H6-COCH2]+ 276.119 400 1,7-dimethylxanthine OR theobromine Theobromine OR Theophylline OR Paraxanthine OR OSI101
182.0761 89.59507 [86][M+1]+ 400
315.0806 89.70856 [M+K]+ 276.119 400 HMDB0005033

That’s it!