Annotating LC-MS data with PredRet in R

metabolomics

PredRet

Here I show how you can use the PredRet database to annotate your data in R

Author

Jan Stanstrup

Published

November 22, 2015

In this tutorial I will show you how you can use the PredRet database to annotate your LC-MS metabolomics data directly in R. The annotation function I will be using is general-purpose, though, so you can use it to annotate any data where you have a list of compounds with known m/z’s and retention times (RTs).

What is PredRet?

In short PredRet is a user-driven database of compound retention times. The purpose of PredRet is to be able to predict the RT of a compound in one (your!) chromatographic system if it has been experimentally determined in another chromatographic system by someone, somewhere in the world. You can download the paper here and visit the project’s home page at PredRet.org.

Pulling data from PredRet

So lets get going.

First we will download the PredRet database with the PredRetR package. We will get both the experimental and the predicted values for the chromatographic system “LIFE_old”.

library(PredRetR)

predret_db <- PredRet_get_db(exp_pred = c("exp","pred"),
                             systems = "LIFE_old"
                            )

Then lets take a look at the structure of the data.frame we have retrieved. If the recorded_rt column has data we have an experimentally determined RT, if the predicted_rt column has data we have a RT predicted with the PredRet systems. The ci_lower and ci_upper columns show the prediction interval for the predictions.

library(knitr)
library(dplyr)

dim(predret_db)

[1] 3367   12

head(predret_db) %>% kable(format="html", table.attr = 'class="nicetable"')

system	name	pubchem	inchi	date added	username	predicted	suspect	recorded_rt	predicted_rt	ci_lower	ci_upper
LIFE_old	(DL)-p-hydroxyphenyllactic acid	NA	InChI=1S/C9H10O4/c10-7-3-1-6(2-4-7)5-8(11)9(12)13/h1-4,8,10-11H,5H2,(H,12,13)	2019-09-06 17:29:46	jan	FALSE	FALSE	1.544350	NA	NA	NA
LIFE_old	(R)-2-hydroxybutyric acid	NA	InChI=1S/C4H8O3/c1-2-3(5)4(6)7/h3,5H,2H2,1H3,(H,6,7)	2019-09-06 17:29:46	jan	FALSE	FALSE	1.399400	NA	NA	NA
LIFE_old	1-O-1'-(Z)-octadecenyl-2-hydroxy-sn-glycero-3-phosphocholine (LysoPC(P-18:0))	NA	InChI=1S/C26H54NO6P/c1-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-22-31-24-26(28)25-33-34(29,30)32-23-21-27(2,3)4/h20,22,26,28H,5-19,21,23-25H2,1-4H3	2019-09-06 17:29:46	jan	FALSE	FALSE	4.947717	NA	NA	NA
LIFE_old	2-Hydroxy-2-methylbutyric acid	NA	InChI=1S/C5H10O3/c1-3-5(2,8)4(6)7/h8H,3H2,1-2H3,(H,6,7)	2019-09-20 02:26:09	jan	FALSE	TRUE	1.625783	NA	NA	NA
LIFE_old	2-Hydroxy-3-methylbutyric acid	NA	InChI=1S/C5H10O3/c1-3(2)4(6)5(7)8/h3-4,6H,1-2H3,(H,7,8)	2019-09-20 02:26:09	jan	FALSE	TRUE	1.685467	NA	NA	NA
LIFE_old	2-methylacetoacetic acid	NA	InChI=1S/C5H8O3/c1-3(4(2)6)5(7)8/h3H,1-2H3,(H,7,8)	2019-09-06 17:29:46	jan	FALSE	FALSE	1.612850	NA	NA	NA

We have the RT directly in the database above but we do not have the m/z. Since we have the InChI the easiest way to get the m/z is to extract the molecular formula and then use the Rdisop package to get the mass.

library(Rdisop)
library(stringr)

formulas   <- str_split_fixed(predret_db[,"inchi"],"/",3)[,2]
masses     <- sapply(formulas, function(x) getMass(getMolecule(x)))

predret_db <- cbind.data.frame(predret_db, mz_pos = masses + 1.0073)

We can then split the database in two. One for the experimental RTs and one for the predicted. Here I have used some pipe/dplyr style code and even a forward assign. If you are not familiar with this type of R code I urge you to look into it. It really makes code much more readable.

predret_db %>% filter(!is.na(recorded_rt)) -> predret_db_exp
predret_db %>% filter(is.na(recorded_rt))  -> predret_db_pred

Annotating a dataset

We now have the database ready for annotation.

So we can load a dataset/peaklist. This peaklist was previously created with XCMS and fragments/adducts annotated with CAMERA. But again any peaklist will do.

library(readxl)

data <- read_excel("../../static/material/data/peaklist_pos.xlsx") %>%
        replace(is.na(.), "") %>%
        as.data.frame

Lets take a look at the interesting columns we have in the dataset.

info_cols <- c("mz","rt","isotopes","adduct","pcgroup")
dim(data)

[1] 4232  470

head(data[,info_cols])

         mz       rt   isotopes              adduct pcgroup
1  62.06013 273.8827            [M+H-C5H8O]+ 145.11       1
2  95.08600 271.5567                                      1
3  98.09642 268.1695                                      1
4 104.10687 271.2400   [12][M]+ [M+H-COCH2]+ 145.11       1
5 105.11030 271.2415 [12][M+1]+                           1
6 114.09119 268.9771            [M+H-CH3OH]+ 145.11       1

Now we can use db.comp.assign from my chemhelper package to annotate the dataset.

We would probably want one column in our peaklist for the RTs we have determined experimentally and one for the predicted RTs. So we do the annotation twice.

The first two arguments to the function are the m/z and RT of the dataset. The next three are the name, m/z and RT of the database of known compounds. Lastly we give the tolerance for the m/z and RT for a database match.

library(chemhelper)

anno_exp <- db.comp.assign(mz           = data[,"mz"],
                           rt           = data[,"rt"],
                           comp_name_db = predret_db_exp[,"name"],
                           mz_db        = predret_db_exp[,"mz_pos"],
                           rt_db        = predret_db_exp[,"recorded_rt"] *60, # *60 since XCMS works in seconds but PredRet is in minutes.
                           mzabs        = 0.01,
                           ppm          = 15,
                           ret_tol      = 10
                           )

                                        [,1]
Unique hits                               89
Non-unique hits                           10
Compounds not found                      201
Markers had multiple compounds assigned   53

anno_pred <- db.comp.assign(mz          = data[,"mz"],
                           rt           = data[,"rt"],
                           comp_name_db = predret_db_pred[,"name"],
                           mz_db        = predret_db_pred[,"mz_pos"],
                           rt_db        = predret_db_pred[,"predicted_rt"] *60,
                           mzabs        = 0.01,
                           ppm          = 15,
                           ret_tol      = 10
                           )

                                        [,1]
Unique hits                              209
Non-unique hits                           15
Compounds not found                     2843
Markers had multiple compounds assigned   72

Now lets put the annotations together with our dataset.

data_annotated <- cbind.data.frame(data,anno_exp,anno_pred)

The result

Then lets take a look at one of the feature groups (from CAMERA) where we got an annotation. In this example we have a feature that was annotated as tryptophan. The “OR” is because tryptophan was in the database several times. If multiple compounds would fit the m/z and RT they would be written with “OR” between them.

There is also a fragment annotated as adipic acid and 2-methylglutaric acid using a hit from the predicted RTs. In this case we have almost perfect CAMERA annotation suggesting the pseudo-molecular ion is the m/z = 205.0979 feature and the compound is very likely tryptophan.

data_annotated %>%
  filter(pcgroup==16) %>%
  select(one_of(info_cols,"anno_exp","anno_pred")) %>%
  kable(format="html", table.attr = 'class="nicetable"')

mz	rt	isotopes	adduct	pcgroup	anno_exp	anno_pred
118.0657	91.31296		[M+H-NH3-CO-COCH2]+ 204.09	16
130.0655	91.40512			16		QUINOLINE
132.0808	91.27887	[44][M]+	[M+H-NH3-CO-CO]+ 204.09	16
133.0845	91.19346	[44][M+1]+		16
142.0647	91.20588		[M+H-NH3-HCOOH]+ 204.09	16
144.0416	91.25665			16
144.0814	91.34230	[55][M]+	[M+H-NH3-CO2]+ 204.09	16		OSI1265 OR OSI1138 (non-unique hit)
145.0846	91.28341	[55][M+1]+		16
146.0599	91.33676	[58][M]+	[M+H-NH3-COCH2]+ 204.09	16		4-Hydroxyquinoline
147.0647	91.36894	[58][M+1]+		16	Adipic acid	ADIPATE OR 2-METHYLGLUTARATE OR METHYGLUTARATE
159.0922	91.28882	[71][M]+	[M+H-HCOOH]+ 204.09	16		OSI6
160.0851	91.35138	[71][M+1]+		16
170.0608	91.34747		[M+H-NH3-H2O]+ 204.09	16
188.0711	91.34805	[93][M]+	[M+H-NH3]+ 204.09	16
189.0743	91.28935	[93][M+1]+		16
205.0979	91.32364	[102][M]+	[M+H]+ 204.09	16	tryptophan OR Tryptophan OR TRYPTOPHAN	L-tryptopan (15N2, 98%)
206.1034	91.31068	[102][M+1]+		16
245.1301	91.91434		[M+H+(CH3)2CO-H2O]+ (acetone cond.) 204.09	16
276.1823	91.41425			16		[3-Carboxy-2-(3-hydroxyhexanoyloxy)propyl]-trimethylazanium
409.1873	91.28252	[265][M]+	[2M+H]+ 204.09	16
410.1906	91.26287	[265][M+1]+		16
447.1336	91.29100		[2M+K]+ 204.09 [4M+2K]2+ 204.09	16

Lets take a look at another feature. This time we have experimental RTs that say the feature is either 1,7-dimethylxanthine OR theobromine but a prediction from the PredRet database also suggest that the feature could be theophylline as well.

data_annotated %>%
  filter(pcgroup==400) %>%
  select(one_of(info_cols,"anno_exp","anno_pred")) %>%
  kable(format="html", table.attr = 'class="nicetable"')

mz	rt	isotopes	adduct	pcgroup	anno_exp	anno_pred
124.0508	89.67717			400
181.0725	89.56150	[86][M]+	[M+H-NH3-CO2-NH3-H2O]+ 276.119 [M+H-C4H6-COCH2]+ 276.119	400	1,7-dimethylxanthine OR theobromine	Theobromine OR Theophylline OR Paraxanthine OR OSI101
182.0761	89.59507	[86][M+1]+		400
315.0806	89.70856		[M+K]+ 276.119	400		HMDB0005033

That’s it!