library(RmzTabM)
The RmzTabM package provides the API and core functionality to read and write files in mzTab-M format. The functions can be re-used and integrated by other R packages to support import and export of their respective metabolomics/lipidomics result objects in this format.
For a general overview of the mzTab-M format see this figure.
The RmzTabM package supports mzTab-M version 2.1.
The mzTab-M format consists of four cross-referenced data tables: metadata (MTD), Small Molecule (SML), Small Molecule Feature (SMF) and Small Molecule Evidence (SME). The MTD section is intended to contain all information relevant to the experiment and the measurements. The SML section contains the final results of an analysis that should be reported, i.e., the (annotated) molecules and their respective abundances. The SMF section contains information on the measured (LC-MS) features and their abundance values. The SME section contains information on the annotation process (and reliability) of the molecules reported in the SML section. The SML is supposed to be a subset of the SMF table. The structure of the rows in these tables and the relationships between them are defined by the mzTab-M standard and follow strict rules. The functions from the RmzTabM package assist in creating and formatting these tables.
The RmzTabM package provides low-level core functions and higher-level functions to work with files in mzTab-M format. The high-level functions are more user-oriented, simplifying the import and export of data and information from and to files in mzTab-M format. The low-level core functions are developer-oriented, providing helper functions that can be re-used in other R packages to read from and write to mzTab-M files.
For a description of the mzTab-M format and the set of mandatory and optional fields refer to the official format definition.
library(RmzTabM)
In this section we export the data set used in the Metabonaut end-to-end metabolomics data workflow (Louail et al. 2026) in mzTab-M format. The raw MS data is available in MetaboLights (accession number MTBLS8735 with 2 separate MS runs, one with LC-MS and a second with LC-MS/MS data for selected samples). The original xcms preprocessing result object is available in Metabonaut and is also included in the RmzTabM package. After collecting all necessary experimental metadata, we export this result object as an mzTab-M file with only the metadata and the small molecule feature abundances (MTD+SMF).
The RmzTabM package defines a MzTabM() convenience function to create an mzTab-M object from a SummarizedExperiment object. Information provided in such objects is automatically converted and formatted into content for the appropriate mzTab-M section. The user simply needs to define the columns containing information for the various mzTab-M fields and an mzTab-M object is compiled. This object should then be completed by adding any data that is still missing.
Note
ℹ️ individual mzTab-M sections could also be compiled individually with helper functions such as mtdFromSampleData() to generate a metadata section from a sample data.frame.
#' required packages
library(SummarizedExperiment)
Data preprocessing, normalization, statistical data analysis and annotation are described in Metabonaut (version 1.5.0) (Louail et al. 2026).
The result from the xcms-based preprocessing, a SummarizedExperiment object, is included within the RmzTabM package as the se data set, which we load below.
#' Load the Metabonaut preprocessing result
data(se)
se
class: SummarizedExperiment
dim: 9068 10
metadata(0):
assays(2): raw raw_filled
rownames(9068): FT0001 FT0002 ... FT9067 FT9068
rowData names(11): mzmed mzmin ... QC ms_level
colnames(10): MS_QC_POOL_1_POS.mzML MS_A_POS.mzML ... MS_F_POS.mzML
MS_QC_POOL_4_POS.mzML
colData names(15): sample_name derived_spectra_data_file ... polarity
instrument
The object contains sample information in colData():
colData(se)
DataFrame with 10 rows and 15 columns
sample_name derived_spectra_data_file
<character> <character>
MS_QC_POOL_1_POS.mzML POOL FILES/MS_QC_POOL_1_P..
MS_A_POS.mzML A FILES/MS_A_POS.mzML
MS_B_POS.mzML B FILES/MS_B_POS.mzML
MS_QC_POOL_2_POS.mzML POOL FILES/MS_QC_POOL_2_P..
MS_C_POS.mzML C FILES/MS_C_POS.mzML
MS_D_POS.mzML D FILES/MS_D_POS.mzML
MS_QC_POOL_3_POS.mzML POOL FILES/MS_QC_POOL_3_P..
MS_E_POS.mzML E FILES/MS_E_POS.mzML
MS_F_POS.mzML F FILES/MS_F_POS.mzML
MS_QC_POOL_4_POS.mzML POOL FILES/MS_QC_POOL_4_P..
metabolite_asssignment_file source_name organism
<character> <character> <character>
MS_QC_POOL_1_POS.mzML m_MTBLS8735_LC-MS_po.. MS_QC_POOL_1_POS Homo sapiens
MS_A_POS.mzML m_MTBLS8735_LC-MS_po.. MS_A_POS Homo sapiens
MS_B_POS.mzML m_MTBLS8735_LC-MS_po.. MS_B_POS Homo sapiens
MS_QC_POOL_2_POS.mzML m_MTBLS8735_LC-MS_po.. MS_QC_POOL_1_POS Homo sapiens
MS_C_POS.mzML m_MTBLS8735_LC-MS_po.. MS_C_POS Homo sapiens
MS_D_POS.mzML m_MTBLS8735_LC-MS_po.. MS_D_POS Homo sapiens
MS_QC_POOL_3_POS.mzML m_MTBLS8735_LC-MS_po.. MS_QC_POOL_1_POS Homo sapiens
MS_E_POS.mzML m_MTBLS8735_LC-MS_po.. MS_E_POS Homo sapiens
MS_F_POS.mzML m_MTBLS8735_LC-MS_po.. MS_F_POS Homo sapiens
MS_QC_POOL_4_POS.mzML m_MTBLS8735_LC-MS_po.. MS_QC_POOL_1_POS Homo sapiens
blood_sample_type sample_type age
<character> <character> <integer>
MS_QC_POOL_1_POS.mzML blood serum pool NA
MS_A_POS.mzML blood plasma experimental sample 53
MS_B_POS.mzML blood plasma experimental sample 30
MS_QC_POOL_2_POS.mzML blood serum pool NA
MS_C_POS.mzML blood plasma experimental sample 66
MS_D_POS.mzML blood plasma experimental sample 36
MS_QC_POOL_3_POS.mzML blood serum pool NA
MS_E_POS.mzML blood plasma experimental sample 66
MS_F_POS.mzML blood plasma experimental sample 44
MS_QC_POOL_4_POS.mzML blood serum pool NA
unit phenotype injection_index
<character> <character> <integer>
MS_QC_POOL_1_POS.mzML year QC 1
MS_A_POS.mzML year CVD 2
MS_B_POS.mzML year CTR 3
MS_QC_POOL_2_POS.mzML year QC 4
MS_C_POS.mzML year CTR 5
MS_D_POS.mzML year CVD 6
MS_QC_POOL_3_POS.mzML year QC 7
MS_E_POS.mzML year CTR 8
MS_F_POS.mzML year CVD 9
MS_QC_POOL_4_POS.mzML year QC 10
species tissue polarity
<character> <character> <character>
MS_QC_POOL_1_POS.mzML [NCBITaxon, NCBITaxo.. [BTO, BTO:0000133, b.. positive
MS_A_POS.mzML [NCBITaxon, NCBITaxo.. [BTO, BTO:0000131, b.. positive
MS_B_POS.mzML [NCBITaxon, NCBITaxo.. [BTO, BTO:0000131, b.. positive
MS_QC_POOL_2_POS.mzML [NCBITaxon, NCBITaxo.. [BTO, BTO:0000133, b.. positive
MS_C_POS.mzML [NCBITaxon, NCBITaxo.. [BTO, BTO:0000131, b.. positive
MS_D_POS.mzML [NCBITaxon, NCBITaxo.. [BTO, BTO:0000131, b.. positive
MS_QC_POOL_3_POS.mzML [NCBITaxon, NCBITaxo.. [BTO, BTO:0000133, b.. positive
MS_E_POS.mzML [NCBITaxon, NCBITaxo.. [BTO, BTO:0000131, b.. positive
MS_F_POS.mzML [NCBITaxon, NCBITaxo.. [BTO, BTO:0000131, b.. positive
MS_QC_POOL_4_POS.mzML [NCBITaxon, NCBITaxo.. [BTO, BTO:0000133, b.. positive
instrument
<character>
MS_QC_POOL_1_POS.mzML 1
MS_A_POS.mzML 1
MS_B_POS.mzML 1
MS_QC_POOL_2_POS.mzML 1
MS_C_POS.mzML 1
MS_D_POS.mzML 1
MS_QC_POOL_3_POS.mzML 1
MS_E_POS.mzML 1
MS_F_POS.mzML 1
MS_QC_POOL_4_POS.mzML 1
LC-MS feature definitions and characteristics in its rowData():
rowData(se)
DataFrame with 9068 rows and 11 columns
mzmed mzmin mzmax rtmed rtmin rtmax npeaks
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
FT0001 50.9898 50.9893 50.9904 203.600 201.459 208.108 8
FT0002 51.0590 51.0581 51.0599 191.167 190.053 194.525 9
FT0003 51.9866 51.9863 51.9879 203.147 201.459 207.046 7
FT0004 53.0204 53.0161 53.0205 203.234 200.962 217.922 10
FT0005 53.5208 53.5184 53.5216 203.194 201.183 209.900 10
... ... ... ... ... ... ... ...
FT9064 998.697 998.691 998.705 25.352 23.6341 26.4839 4
FT9065 998.779 998.758 998.784 162.691 161.5110 164.8667 8
FT9066 999.204 999.191 999.218 146.163 143.0103 147.9139 8
FT9067 999.330 999.318 999.339 157.048 154.3261 159.1735 7
FT9068 999.781 999.775 999.794 162.763 161.5110 164.3995 7
CTR CVD QC ms_level
<numeric> <numeric> <numeric> <integer>
FT0001 1 3 4 1
FT0002 2 3 4 1
FT0003 0 3 4 1
FT0004 3 3 4 1
FT0005 3 3 4 1
... ... ... ... ...
FT9064 0 0 4 1
FT9065 2 2 4 1
FT9066 3 1 4 1
FT9067 3 1 3 1
FT9068 1 3 3 1
and has two assays with feature abundances, one with the original integrated peak areas of identified chromatographic peaks and one with additional gap-filled abundances.
assayNames(se)
[1] "raw" "raw_filled"
Comprehensive data set descriptions and metadata are important to enable re-use of the data and to follow FAIR principles. Collecting the experiment’s metadata consists mostly of manual work, e.g., looking up CV parameters for the instruments used or for the sample tissues. Metadata for samples and related measurements is ideally added to the SummarizedExperiment’s colData(). For the present data set, sample characteristics such as the species and tissue are defined in columns "species" and "tissue":
colData(se)$species
 [1] "[NCBITaxon, NCBITaxon:9606, Homo sapiens, ]"
[2] "[NCBITaxon, NCBITaxon:9606, Homo sapiens, ]"
[3] "[NCBITaxon, NCBITaxon:9606, Homo sapiens, ]"
[4] "[NCBITaxon, NCBITaxon:9606, Homo sapiens, ]"
[5] "[NCBITaxon, NCBITaxon:9606, Homo sapiens, ]"
[6] "[NCBITaxon, NCBITaxon:9606, Homo sapiens, ]"
[7] "[NCBITaxon, NCBITaxon:9606, Homo sapiens, ]"
[8] "[NCBITaxon, NCBITaxon:9606, Homo sapiens, ]"
[9] "[NCBITaxon, NCBITaxon:9606, Homo sapiens, ]"
[10] "[NCBITaxon, NCBITaxon:9606, Homo sapiens, ]"
colData(se)$tissue
 [1] "[BTO, BTO:0000133, blood serum, ]" "[BTO, BTO:0000131, blood plasma, ]"
[3] "[BTO, BTO:0000131, blood plasma, ]" "[BTO, BTO:0000133, blood serum, ]"
[5] "[BTO, BTO:0000131, blood plasma, ]" "[BTO, BTO:0000131, blood plasma, ]"
[7] "[BTO, BTO:0000133, blood serum, ]" "[BTO, BTO:0000131, blood plasma, ]"
[9] "[BTO, BTO:0000131, blood plasma, ]" "[BTO, BTO:0000133, blood serum, ]"
Note
ℹ️ ideally, CV parameters should be used as much as possible to ensure a standardized description of the data. The EMBL-EBI Ontology Lookup Service can be used to find ontology terms (CV parameters) for various controlled vocabularies.
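The strings shown above follow the bracketed parameter layout used throughout this document: CV label, accession, name and value, separated by commas. As a minimal sketch, such strings can be assembled with a small base R helper. The function name cv_param() is hypothetical, introduced here only for illustration, and is not part of RmzTabM.

```r
#' Hypothetical helper (not an RmzTabM function): format CV parameters as
#' "[label, accession, name, value]". Elements that are not provided stay empty.
cv_param <- function(label = "", accession = "", name = "", value = "") {
    paste0("[", paste(label, accession, name, value, sep = ", "), "]")
}

#' a single parameter
species <- cv_param("NCBITaxon", "NCBITaxon:9606", "Homo sapiens")
species

#' vectorized: one parameter per tissue, the label is recycled
tissue <- cv_param("BTO", c("BTO:0000133", "BTO:0000131"),
                   c("blood serum", "blood plasma"))
tissue
```

Because paste() is vectorized, a whole colData() column can be formatted in one call, as long as all inputs have length one or the same length.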
Measurement-related information is also defined in the colData(), including the polarity:
colData(se)$polarity
 [1] "positive" "positive" "positive" "positive" "positive" "positive"
[7] "positive" "positive" "positive" "positive"
We can use this information to compile the data set’s metadata. To this end we define the column names in the SummarizedExperiment’s colData() that contain information for sample, measurement run and assay mzTab-M fields. This mapping of mzTab-M fields to column names can be defined with the sampleCols(), msRunCols() and assayCols() helper functions. We use for example the content of the column "sample_name" for the mzTab-M sample name. Content from the colData() columns "species", "tissue" and "sample_type" is used for mzTab-M sample fields species, tissue and sample_type.
colData(se)$sample_name
 [1] "POOL" "A" "B" "POOL" "C" "D" "POOL" "E" "F" "POOL"
#' define mapping of `colData()` column names to mzTab-M sample fields
scols <- sampleCols(sample = "sample_name", species = "species",
                    tissue = "tissue", sample_type = "sample_type")
Similarly we specify columns from the same colData() with information on individual MS runs (and assays):
#' Define columns for MS run and assays
mscols <- msRunCols(location = "derived_spectra_data_file",
                    instrument_ref = "instrument", scan_polarity = "polarity")
acols <- assayCols(assay = "derived_spectra_data_file")
The column "derived_spectra_data_file" contains the MS data file name which we use both to define the MS runs and assays (thus assuming a 1:1 mapping between them).
colData(se)$derived_spectra_data_file
 [1] "FILES/MS_QC_POOL_1_POS.mzML" "FILES/MS_A_POS.mzML"
[3] "FILES/MS_B_POS.mzML" "FILES/MS_QC_POOL_2_POS.mzML"
[5] "FILES/MS_C_POS.mzML" "FILES/MS_D_POS.mzML"
[7] "FILES/MS_QC_POOL_3_POS.mzML" "FILES/MS_E_POS.mzML"
[9] "FILES/MS_F_POS.mzML" "FILES/MS_QC_POOL_4_POS.mzML"
Finally, we also define the study variables of the experiment. These can be technical characteristics or phenotype(s) of the samples. The present experiment consists of plasma samples of individuals with or without a cardiovascular disease (CVD) and repeated measurements of an external (serum) sample pool that was used as quality control sample. These are defined in colData() columns "blood_sample_type", "age" and "phenotype".
#' technical variable: the sample matrix
colData(se)$blood_sample_type
 [1] "blood serum" "blood plasma" "blood plasma" "blood serum" "blood plasma"
 [6] "blood plasma" "blood serum" "blood plasma" "blood plasma" "blood serum"
#' phenotype of study samples or QC for QC samples
colData(se)$phenotype
 [1] "QC" "CVD" "CTR" "QC" "CTR" "CVD" "QC" "CTR" "CVD" "QC"
#' age of study participants; NA for QC samples
colData(se)$age
 [1] NA 53 30 NA 66 36 NA 66 44 NA
With this information defined we can use the MzTabM() function to create a template mzTab-M object for the present experiment. Parameter groups defines the column names of the SummarizedExperiment’s colData() to be used as mzTab-M study variable groups.
mzt <- MzTabM(se, id = "MTBLS8735", sampleCols = scols,
              msRunCols = mscols, assayCols = acols,
              groups = c("age", "phenotype", "blood_sample_type"))
mzt
Object of class MzTabM
mzTab-M version 2.1.0-M
MTD section with 189 rows.
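To illustrate what a study variable group represents, the sketch below groups the sample indices by the values of the phenotype column using only base R. How MzTabM() builds and formats study variables internally is not shown in this document, so this illustrates the grouping idea and not the package's implementation.

```r
#' phenotype values as reported by colData(se)$phenotype above
phenotype <- c("QC", "CVD", "CTR", "QC", "CTR", "CVD", "QC", "CTR", "CVD", "QC")

#' indices of the samples (columns) that belong to each phenotype group
groups <- split(seq_along(phenotype), phenotype)
names(groups)
groups$QC
groups$CVD
groups$CTR
```

Each element of groups lists the positions of the samples that share one value, which is the information needed to link a group to its measurements.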
This MzTabM object contains only metadata information, but no abundances/feature data yet. Also, some general metadata are still missing and, if exported to an mzTab-M file, it might not yet validate. Next we add the small molecule feature (SMF) content and, in the subsequent section Completing the metadata content, any required metadata fields that are still missing.
We next add feature abundance information to the mzTab-M object. While the abundance values can be taken from one of the assay()s of the SummarizedExperiment, we also need to provide feature characteristics. These are usually available in the SummarizedExperiment’s rowData():
rowData(se)
DataFrame with 9068 rows and 11 columns
mzmed mzmin mzmax rtmed rtmin rtmax npeaks
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
FT0001 50.9898 50.9893 50.9904 203.600 201.459 208.108 8
FT0002 51.0590 51.0581 51.0599 191.167 190.053 194.525 9
FT0003 51.9866 51.9863 51.9879 203.147 201.459 207.046 7
FT0004 53.0204 53.0161 53.0205 203.234 200.962 217.922 10
FT0005 53.5208 53.5184 53.5216 203.194 201.183 209.900 10
... ... ... ... ... ... ... ...
FT9064 998.697 998.691 998.705 25.352 23.6341 26.4839 4
FT9065 998.779 998.758 998.784 162.691 161.5110 164.8667 8
FT9066 999.204 999.191 999.218 146.163 143.0103 147.9139 8
FT9067 999.330 999.318 999.339 157.048 154.3261 159.1735 7
FT9068 999.781 999.775 999.794 162.763 161.5110 164.3995 7
CTR CVD QC ms_level
<numeric> <numeric> <numeric> <integer>
FT0001 1 3 4 1
FT0002 2 3 4 1
FT0003 0 3 4 1
FT0004 3 3 4 1
FT0005 3 3 4 1
... ... ... ... ...
FT9064 0 0 4 1
FT9065 2 2 4 1
FT9066 3 1 4 1
FT9067 3 1 3 1
FT9068 1 3 3 1
For our example we use the column "mzmed", which contains the features’ m/z values and can be mapped to the SMF field exp_mass_to_charge, and the column "rtmed", which reports the median retention time of each feature and can be used for the SMF field retention_time_in_seconds. In addition, we add an optional field feature_id to report the IDs of the individual features from the SummarizedExperiment; for this we add a column "feature_id" to the rowData():
rowData(se)$feature_id <- rownames(se)
Similar to the metadata column mappings above, we define a mapping of SMF fields to rowData() columns using the smfCols() helper function:
smf_cols <- smfCols(exp_mass_to_charge = "mzmed",
                    retention_time_in_seconds = "rtmed",
                    feature_id = "feature_id")
Providing this additional SMF mapping in the MzTabM() call above will compile a MzTabM object with an MTD and SMF section from the SummarizedExperiment. Parameter assayName defines which of the SummarizedExperiment’s assays will be used for the SMF section.
#' Create a MTD+SMF mzTab-M object from the SummarizedExperiment
mzt <- MzTabM(se, id = "MTBLS8735", sampleCols = scols,
              msRunCols = mscols, assayCols = acols,
              groups = c("age", "phenotype", "blood_sample_type"),
              smfCols. = smf_cols, assayName = "raw_filled")
mzt
Object of class MzTabM
mzTab-M version 2.1.0-M
MTD section with 189 rows.
SMF section with 9068 rows and 22 columns.
Note
ℹ️ we could also use smf(se, assayName = "raw_filled", smfCols. = smf_cols) to extract the SMF table from the SummarizedExperiment and add it manually to the MzTabM object.
This mzt variable contains now the mzTab-M content that could be extracted from the SummarizedExperiment. The first rows of the metadata section are:
mtd(mzt) |> head()
[1,] "mzTab-version"
[2,] "mzTab-ID"
[3,] "software[1]"
[4,] "quantification_method"
[5,] "sample[1]"
[6,] "sample[1]-species[1]"
values
[1,] "2.1.0-M"
[2,] "MTBLS8735"
[3,] "[,,RmzTabM,RmzTabM version 0.97.17]"
[4,] "[MS, MS:1001834, LC-MS label-free quantitation analysis, ]"
[5,] "POOL"
[6,] "[NCBITaxon, NCBITaxon:9606, Homo sapiens, ]"
And the first lines of the SMF section:
smf(mzt) |> head()
 SFH SMF_ID SME_ID_REFS SME_ID_REF_ambiguity_code adduct_ion isotopomer
FT0001 SMF 1 null null null null
FT0002 SMF 2 null null null null
FT0003 SMF 3 null null null null
FT0004 SMF 4 null null null null
FT0005 SMF 5 null null null null
FT0006 SMF 6 null null null null
exp_mass_to_charge charge retention_time_in_seconds
FT0001 50.9897946401403 null 203.600077134134
FT0002 51.059035992328 null 191.167453757996
FT0003 51.9865730172271 null 203.14665178874
FT0004 53.0203569195002 null 203.234292327779
FT0005 53.5208004472819 null 203.193618564868
FT0006 54.0100702952703 null 159.281630787851
retention_time_in_seconds_start retention_time_in_seconds_end
FT0001 null null
FT0002 null null
FT0003 null null
FT0004 null null
FT0005 null null
FT0006 null null
abundance_assay[1] abundance_assay[2] abundance_assay[3]
FT0001 421.6162 689.2422 411.3295
FT0002 710.8078 875.9192 457.5920
FT0003 445.5711 613.4410 277.5022
FT0004 16994.5260 24605.7340 19766.7069
FT0005 3284.2664 4526.0531 3521.8221
FT0006 10681.7476 10009.6602 9599.9701
abundance_assay[4] abundance_assay[5] abundance_assay[6]
FT0001 481.7436 314.7567 635.2732
FT0002 693.6997 781.2416 648.4344
FT0003 497.8866 425.3774 634.9370
FT0004 17808.0933 22780.6683 22873.1061
FT0005 3379.8909 4396.0762 4317.7734
FT0006 10800.5449 4792.2390 7296.4262
abundance_assay[7] abundance_assay[8] abundance_assay[9]
FT0001 439.6086 570.5849 579.9360
FT0002 700.9716 1054.0207 534.4577
FT0003 449.0933 556.2544 461.0465
FT0004 16965.7762 23432.1252 22198.4607
FT0005 3270.5290 4533.8667 4161.0132
FT0006 2382.1788 9236.9799 6817.8785
abundance_assay[10] opt_global_feature_id
FT0001 437.0340 FT0001
FT0002 711.0361 FT0002
FT0003 232.1075 FT0003
FT0004 16796.4497 FT0004
FT0005 3142.2268 FT0005
FT0006 6911.5439 FT0006
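The layout of the SMF table can be mimicked for a toy example with base R alone: the feature characteristics provide the columns exp_mass_to_charge and retention_time_in_seconds, and each column of the abundance matrix becomes one abundance_assay[<index>] column. The values below are rounded from the output above, and the column names run_1 and run_2 are placeholders. The code is a sketch and not the implementation used by smf() or MzTabM().

```r
#' toy feature annotations and abundances (rounded from the output above)
feat <- data.frame(mzmed = c(50.9898, 51.0590), rtmed = c(203.600, 191.167),
                   row.names = c("FT0001", "FT0002"))
abund <- matrix(c(421.6162, 710.8078, 689.2422, 875.9192), nrow = 2,
                dimnames = list(rownames(feat), c("run_1", "run_2")))

#' one abundance column per assay, named after the index of the assay
colnames(abund) <- paste0("abundance_assay[", seq_len(ncol(abund)), "]")

#' check.names = FALSE keeps the square brackets in the column names
smf_toy <- data.frame(SFH = "SMF", SMF_ID = seq_len(nrow(feat)),
                      exp_mass_to_charge = feat$mzmed,
                      retention_time_in_seconds = feat$rtmed,
                      abund, check.names = FALSE,
                      row.names = rownames(feat))
smf_toy
```

The important detail is check.names = FALSE: without it, data.frame() would rewrite names such as abundance_assay[1] into syntactically valid R names.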
In the next section we will complete the data adding some metadata fields that could not be derived from the result object.
Some of the metadata information must be added manually, because it cannot be extracted from the SummarizedExperiment. This also depends on the information compiled into the mzTab-M file. We used for example the BTO and NCBITaxon ontologies to describe the samples, but these two are not added by default. The getMtdCv() function can be used to get the set of defined controlled vocabularies (ontologies) in the MzTabM object:
getMtdCv(mzt)
 cv[1]-label
"MS"
cv[1]-full_name
"PSI-MS controlled vocabulary"
cv[1]-version
"4.1.138"
cv[1]-uri
"https://raw.githubusercontent.com/HUPO-PSI/psi-ms-CV/master/psi-ms.obo"
cv[2]-label
"PRIDE"
cv[2]-full_name
"PRIDE PRoteomics IDEntifications (PRIDE) database controlled vocabulary"
cv[2]-version
"16:10:2023 11:38"
cv[2]-uri
"https://www.ebi.ac.uk/ols/ontologies/pride"
cv[3]-label
"STATO"
cv[3]-full_name
"General purpose STATistics Ontology"
cv[3]-version
"2026-04-20"
cv[3]-uri
"https://www.ebi.ac.uk/ols4/ontologies/stato"
We therefore need to add the two missing vocabularies:
mzt <- setMtdCv(mzt, label = c("BTO", "NCBITaxon"),
                full_name = c("The BRENDA Tissue Ontology (BTO)",
                              "NCBI organismal classification"),
                version = c("2021-10-26", "2025-12-03"),
                uri = c("https://www.ebi.ac.uk/ols4/ontologies/bto",
                        "https://www.ebi.ac.uk/ols4/ontologies/ncbitaxon"))
Also, we add xcms as software to the MzTabM:
mzt <- setMtdField(mzt, "software", "[MS, MS:1001582, xcms, 4.10.0]")
getMtdField(mzt, "software")
 software[1] software[2]
"[,,RmzTabM,RmzTabM version 0.97.17]" "[MS, MS:1001582, xcms, 4.10.0]"
Also, we need to add instrument information to the MzTabM object.
#' Adding MS instrument information.
mzt <- setMtdInstrument(
    mzt, name = "[MS, MS:1002584, AB Sciex TripleTOF 5600+, ]",
    source = "[MS, MS:1000073, ESI, ]",
    analyzer = c(`analyzer[1]` =
                     "[MS, MS:1003763, quadrupole time-of-flight instrument, ]"),
    detector = "[,,null,null]")
Finally, we add contact information:
mzt <- setMtdContact(
    mzt, name = c("Johannes Rainer", "Philippine Louail"),
    affiliation = c("Institute for Biomedicine, Eurac Research, Bolzano, Italy",
                    "Institute for Biomedicine, Eurac Research, Bolzano, Italy"),
    email = c("[email protected]", "[email protected]"),
    orcid = c("0000-0002-6977-7147", "0009-0007-5429-6846"))
We now have the complete mzTab-M content compiled and can proceed to export it.
The MzTabM object, containing the MTD and SMF sections, is ready for export to an mzTab-M file. Following generation, the file is verified using the mzTab-M validator. The validation report confirms that the file was generated successfully, returning only a single Info message. This message notes the absence of the SML section, which is expected given that we intentionally generated an MTD+SMF file.
writeMzTabM(mzt, path = file.path(tempdir(), "MTBLS8735_mtd_smf.mzTab"))
The low-level functions listed in this section provide the base functionality to convert or format information and data for/from the mzTab-M format. These functions are designed to be re-used by other R packages and take and return only basic, plain R data types.
All formatting and export functions require that their parameters, if specified, are fully named, i.e., neither positional nor partial matching of a function’s arguments is supported.
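One common R idiom that leads to this behavior is to place ... as the first formal argument: parameters defined after ... can only be matched by their full name. Whether RmzTabM implements the restriction this way is an assumption; the function named_demo() below is purely illustrative and not part of the package.

```r
#' Illustrative function: parameters after `...` must be given by full name
named_demo <- function(..., sample = character()) {
    list(sample = sample, dots = list(...))
}

#' full name: the value is matched to the `sample` parameter
full <- named_demo(sample = "S1")
full$sample

#' partial name: NOT matched to `sample`, the value ends up in `...`
res <- named_demo(sam = "S1")
res$sample
names(res$dots)

#' unnamed value: also collected by `...`, `sample` keeps its default
pos <- named_demo("S1")
pos$sample
```

With such a signature a misspelled or abbreviated parameter name is not reported as an error by R itself, so it is worth checking the result when a field is unexpectedly missing.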
The mzTab-M format defines various fields and parameters to describe the data and information of an experiment. The RmzTabM package provides a variety of utility functions that help defining and formatting this information.
See also the specification of the MTD section for more information and optional and mandatory metadata fields.
The general categories of the metadata in the mzTab-M MTD section are core information, sample information, MS run information, assay information and study variable information. For each of these categories a separate R function is available to create and format the respective fields. As an example, we define below a data.frame with sample information. In our example we assume 3 samples (e.g. cell lines), each measured at two different time points. An additional column genotype specifies the genotype of the individual samples and a column operator the initials of the researcher extracting the samples.
#' Define a simple data.frame of the measured samples of an experiment
exp <- data.frame(
sample_name = c("S1_T1", "S1_T2", "S2_T1", "S2_T2", "S3_T1", "S3_T2"),
sample_id = c("S1", "S1", "S2", "S2", "S3", "S3"),
timepoint = c("0h", "6h", "0h", "6h", "0h", "6h"),
genotype = c("WT", "WT", "KO", "KO", "KO", "KO"),
operator = c("BB", "BB", "BB", "BB", "FB", "FB"),
file_name = c("s1-t1.mzML", "s1-t2.mzML", "s2-t1.mzML", "s2-t2.mzML",
"s3-t1.mzML", "s3-t2.mzML")
)
exp
 sample_name sample_id timepoint genotype operator file_name
1 S1_T1 S1 0h WT BB s1-t1.mzML
2 S1_T2 S1 6h WT BB s1-t2.mzML
3 S2_T1 S2 0h KO BB s2-t1.mzML
4 S2_T2 S2 6h KO BB s2-t2.mzML
5 S3_T1 S3 0h KO FB s3-t1.mzML
6 S3_T2 S3 6h KO FB s3-t2.mzML
We will next compile the MTD information for the experiment using the individual helper functions, starting with the Core information: this comprises general information about the experiment. A minimal set of fields can be compiled using the mtdSkeleton() function. We have to provide an ID for the experiment and in addition we specify the software used to process the data:
mtd <- mtdSkeleton(
    id = "EXP_001",
    software = "[MS, MS:1001582, xcms, 4.1.0]"
)
library(pander)
pandoc.table(mtd, style = "rmarkdown", split.table = Inf, justify = "ll")
| | |
|:-------------------------------------------|:------------------------------------------------------------------------|
| mzTab-version | 2.1.0-M |
| mzTab-ID | EXP_001 |
| software[1] | [MS, MS:1001582, xcms, 4.1.0] |
| quantification_method | [MS, MS:1001834, LC-MS label-free quantitation analysis, ] |
| cv[1]-label | MS |
| cv[1]-full_name | PSI-MS controlled vocabulary |
| cv[1]-version | 4.1.138 |
| cv[1]-uri | https://raw.githubusercontent.com/HUPO-PSI/psi-ms-CV/master/psi-ms.obo |
| cv[2]-label | PRIDE |
| cv[2]-full_name | PRIDE PRoteomics IDEntifications (PRIDE) database controlled vocabulary |
| cv[2]-version | 16:10:2023 11:38 |
| cv[2]-uri | https://www.ebi.ac.uk/ols/ontologies/pride |
| cv[3]-label | STATO |
| cv[3]-full_name | General purpose STATistics Ontology |
| cv[3]-version | 2026-04-20 |
| cv[3]-uri | https://www.ebi.ac.uk/ols4/ontologies/stato |
| database[1] | [,, "no database", null ] |
| database[1]-prefix | null |
| database[1]-version | Unknown |
| database[1]-uri | null |
| small_molecule-quantification_unit | [PRIDE, PRIDE:0000330, Arbitrary quantification unit, ] |
| small_molecule_feature-quantification_unit | [PRIDE, PRIDE:0000330, Arbitrary quantification unit, ] |
| small_molecule-identification_reliability | [MS, MS:1002896, compound identification confidence level, ] |
This represents some minimal information. The data of the MTD section is formatted as a two-column character matrix. We could now either change the value of existing fields (i.e., the elements in the second column of this matrix), or manually add additional fields/information. As an example we add a title and description for the experiment. See also the mzTab-M format definition for other supported fields.
mtd <- rbind(
    mtd,
    c("title", "Experiment 1 preprocessed data"),
    c("description", "The preprocessed data of the experiment 1.")
)
To help with formatting we can also use the mtdFields() function. Below we use this function to add information about the MS instrumentation to the MTD section:
instr <- mtdFields(
    name = "[MS, MS:1000449, LTQ Orbitrap,]",
    source = "[MS, MS:1000073, ESI,]",
    `analyzer[1]` = "[MS, MS:1000291, linear ion trap,]",
    detector = "[MS, MS:1000253, electron multiplier,]",
    field_prefix = "instrument"
)
pandoc.table(instr, style = "rmarkdown", split.table = Inf, justify = "ll")
| | |
|:---|:---|
| instrument[1]-name | [MS, MS:1000449, LTQ Orbitrap,] |
| instrument[1]-source | [MS, MS:1000073, ESI,] |
| instrument[1]-analyzer[1] | [MS, MS:1000291, linear ion trap,] |
| instrument[1]-detector | [MS, MS:1000253, electron multiplier,] |
And we add that information to the mtd variable.
mtd <- rbind(mtd, instr)
The next category of metadata information is sample information. This comprises (optional) information on individual samples that were measured with the various assays/runs. We use the mtdSample() function to assist in compiling this information. Parameters sample, species, tissue, cell_type, disease and description allow providing pre-defined sample properties. Additional sample annotations and details can be provided through the function’s .... For the example below we define some of these properties and in addition provide a custom field for the extraction date. Be aware that mtdSample() does not support partial or positional matching of parameters; for each of the parameters the full parameter name has to be used (i.e., sample = ... instead of sam = ... or s = ...).
mtd_s <- mtdSample(
    sample = unique(exp$sample_id),
    species = "[NCBITaxon, NCBITaxon:9606, Homo sapiens, ]",
    tissue = "[BTO, BTO:0000759, liver, ]",
    cell_type = "[CL, CL:0000182, hepatocyte, ]",
    c("[,,Extraction date, 2011-12-21]",
      "[,,Extraction date, 2011-12-22]",
      "[,,Extraction date, 2011-12-23]")
)
pandoc.table(mtd_s, style = "rmarkdown", split.table = Inf, justify = "ll")
| | |
|:---|:---|
| sample[1] | S1 |
| sample[1]-species[1] | [NCBITaxon, NCBITaxon:9606, Homo sapiens, ] |
| sample[1]-tissue[1] | [BTO, BTO:0000759, liver, ] |
| sample[1]-cell_type[1] | [CL, CL:0000182, hepatocyte, ] |
| sample[1]-custom[1] | [,,, [,,Extraction date, 2011-12-21]] |
| sample[2] | S2 |
| sample[2]-species[1] | [NCBITaxon, NCBITaxon:9606, Homo sapiens, ] |
| sample[2]-tissue[1] | [BTO, BTO:0000759, liver, ] |
| sample[2]-cell_type[1] | [CL, CL:0000182, hepatocyte, ] |
| sample[2]-custom[1] | [,,, [,,Extraction date, 2011-12-22]] |
| sample[3] | S3 |
| sample[3]-species[1] | [NCBITaxon, NCBITaxon:9606, Homo sapiens, ] |
| sample[3]-tissue[1] | [BTO, BTO:0000759, liver, ] |
| sample[3]-cell_type[1] | [CL, CL:0000182, hepatocyte, ] |
| sample[3]-custom[1] | [,,, [,,Extraction date, 2011-12-23]] |
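The sample indices in the table follow the order of unique(exp$sample_id). When references of the form sample[<index>] are needed for each row of the sample table, the index of each row's sample can be derived with match() instead of being typed by hand. A base R sketch, using a reduced copy of the exp data frame (named exp_demo here so the original is not overwritten):

```r
#' reduced copy of the sample table defined above
exp_demo <- data.frame(
    sample_name = c("S1_T1", "S1_T2", "S2_T1", "S2_T2", "S3_T1", "S3_T2"),
    sample_id = c("S1", "S1", "S2", "S2", "S3", "S3")
)

#' position of each row's sample among the unique sample IDs
idx <- match(exp_demo$sample_id, unique(exp_demo$sample_id))
idx

#' references in the "sample[<index>]" format, one per row
sample_ref <- paste0("sample[", idx, "]")
sample_ref
```

Deriving the references this way keeps them consistent with the sample section even if rows are added to or reordered in the sample table.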
Note that the general information part should also contain the references to all controlled vocabulary (CV) ontologies used in the mzTab-M file. The default ontologies added by the mtdSkeleton() function are the PSI-MS, PRIDE and STATO ontologies. If other vocabularies are used, they should be either added manually (following the scheme of the others, i.e., the fields starting with "cv[") or provided with the cv_* function arguments of the mtdSkeleton() function. For our example we also use the BRENDA tissue ontology (BTO) and the NCBITaxon ontology to define the tissue of origin and species of the samples and hence need to add these ontologies to the general metadata section. We use the mtdFields() function for this. For a CV entry we need to provide a label, the full_name, the version and the uri:
add_cv <- mtdFields(
label = c("BTO", "NCBITaxon"),
full_name = c("The BRENDA Tissue Ontology (BTO)",
"NCBI organismal classification"),
version = c("2021-10-26", "2025-12-03"),
uri = c("https://www.ebi.ac.uk/ols4/ontologies/bto",
"https://www.ebi.ac.uk/ols4/ontologies/ncbitaxon"),
field_prefix = "cv")
add_cv
 [,1] [,2]
[1,] "cv[1]-label" "BTO"
[2,] "cv[1]-full_name" "The BRENDA Tissue Ontology (BTO)"
[3,] "cv[1]-version" "2021-10-26"
[4,] "cv[1]-uri" "https://www.ebi.ac.uk/ols4/ontologies/bto"
[5,] "cv[2]-label" "NCBITaxon"
[6,] "cv[2]-full_name" "NCBI organismal classification"
[7,] "cv[2]-version" "2025-12-03"
[8,] "cv[2]-uri" "https://www.ebi.ac.uk/ols4/ontologies/ncbitaxon"
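Fields created this way are numbered from 1, which can clash with entries of the same prefix that already exist in an MTD section. A generic way to renumber such fields is to shift the index with an anchored pattern, which also works for multi-digit indices. The helper shift_index() below is a hypothetical base R sketch and not an RmzTabM function.

```r
#' Hypothetical helper: add `offset` to the index of fields that start with
#' "<prefix>[<index>]"; fields with a different prefix are returned unchanged.
#' Assumes `prefix` contains no regular expression metacharacters.
shift_index <- function(fields, prefix, offset) {
    pattern <- paste0("^", prefix, "\\[([0-9]+)\\]")
    hits <- regmatches(fields, regexec(pattern, fields))
    vapply(seq_along(fields), function(i) {
        if (length(hits[[i]]) == 0L)
            return(fields[i])
        idx <- as.integer(hits[[i]][2L]) + offset
        sub(pattern, paste0(prefix, "[", idx, "]"), fields[i])
    }, character(1))
}

fields <- c("cv[1]-label", "cv[1]-uri", "cv[2]-label", "cv[10]-label",
            "sample[1]")
shifted <- shift_index(fields, prefix = "cv", offset = 3L)
shifted
```

Anchoring the pattern at the start of the field name ensures that only the leading index is changed and that digits elsewhere in a field name are left alone.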
We need to update the index of the cv, since there are already 3 CVs (MS, PRIDE and STATO) defined in the metadata part. We thus replace the "1" with "4" and the "2" with "5" and append these CV terms to the metadata section.
add_cv[, 1L] <- sub("1", "4", add_cv[, 1L])
add_cv[, 1L] <- sub("2", "5", add_cv[, 1L])
mtd <- rbind(mtd, add_cv)
We can then add the sample information to the mtd variable by simply rbind()ing it.
mtd <- rbind(mtd, mtd_s)
Next we compile MS run information of the experiment using the mtdMsRun() helper function. This should comprise all (MS-specific) information related to the measurement of each sample, including the MS data file names and locations. For our example we use the file names reported in the sample data frame and specify the polarity of the measurement runs.
mtd_msr <- mtdMsRun(
    location = exp$file_name,
    format = "[MS, MS:1000584, mzML file, ]",
    id_format = "[MS, MS:1000530, mzML unique identifier, ]",
    scan_polarity = "positive")
pandoc.table(mtd_msr, style = "rmarkdown", split.table = Inf, justify = "ll")
| values | |
|---|---|
| ms_run[1]-location | s1-t1.mzML |
| ms_run[1]-format | [MS, MS:1000584, mzML file, ] |
| ms_run[1]-id_format | [MS, MS:1000530, mzML unique identifier, ] |
| ms_run[1]-scan_polarity[1] | [MS, MS:1000130, positive scan, ] |
| ms_run[2]-location | s1-t2.mzML |
| ms_run[2]-format | [MS, MS:1000584, mzML file, ] |
| ms_run[2]-id_format | [MS, MS:1000530, mzML unique identifier, ] |
| ms_run[2]-scan_polarity[1] | [MS, MS:1000130, positive scan, ] |
| ms_run[3]-location | s2-t1.mzML |
| ms_run[3]-format | [MS, MS:1000584, mzML file, ] |
| ms_run[3]-id_format | [MS, MS:1000530, mzML unique identifier, ] |
| ms_run[3]-scan_polarity[1] | [MS, MS:1000130, positive scan, ] |
| ms_run[4]-location | s2-t2.mzML |
| ms_run[4]-format | [MS, MS:1000584, mzML file, ] |
| ms_run[4]-id_format | [MS, MS:1000530, mzML unique identifier, ] |
| ms_run[4]-scan_polarity[1] | [MS, MS:1000130, positive scan, ] |
| ms_run[5]-location | s3-t1.mzML |
| ms_run[5]-format | [MS, MS:1000584, mzML file, ] |
| ms_run[5]-id_format | [MS, MS:1000530, mzML unique identifier, ] |
| ms_run[5]-scan_polarity[1] | [MS, MS:1000130, positive scan, ] |
| ms_run[6]-location | s3-t2.mzML |
| ms_run[6]-format | [MS, MS:1000584, mzML file, ] |
| ms_run[6]-id_format | [MS, MS:1000530, mzML unique identifier, ] |
| ms_run[6]-scan_polarity[1] | [MS, MS:1000130, positive scan, ] |
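The values in the table above use the CV parameter notation `[label, accession, name, value]`. As a rough illustration of that layout, the base R sketch below splits a simple CV parameter string into its four fields. The helper `split_cv_param()` is hypothetical and not part of RmzTabM (the package lists its own parseCvParameter() function for this purpose); it assumes there are no nested parameters and no commas inside a field.

```r
# Hypothetical helper (not part of RmzTabM): split a simple CV parameter
# string of the form "[label, accession, name, value]" into its four fields.
# Assumes no nested parameters and no commas inside the individual fields.
split_cv_param <- function(x) {
    inner <- sub("^\\[(.*)\\]$", "\\1", trimws(x))
    parts <- trimws(strsplit(inner, ",", fixed = TRUE)[[1L]])
    # strsplit() drops a trailing empty field, so pad to four elements
    length(parts) <- 4L
    parts[is.na(parts)] <- ""
    names(parts) <- c("label", "accession", "name", "value")
    parts
}

cv_fmt <- split_cv_param("[MS, MS:1000584, mzML file, ]")
cv_grp <- split_cv_param("[,,timepoint,]")
cv_fmt[["accession"]]
cv_grp[["name"]]
```

Empty fields are returned as empty strings, so a parameter without CV label and accession (such as a user-defined one) still yields four named elements.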
Each row in the exp data frame was assigned to an "ms_run" with the location and format of the respective file as well as the polarity in which the data was acquired. We can combine this data with the mtd variable.
mtd <- rbind(mtd, mtd_msr)
Next we define the assay information. Generally, each measurement (MS run) is associated with one assay, but more complex configurations are also supported. See the help of the mtdAssay() function for details on multiplexed or pre-fractionated samples. The mtdAssay() function requires the name (ID) of each assay and the reference to the MS run in which the assay was measured. For the latter, the format "ms_run[<index of the MS run>]" is expected. For our example we additionally provide the (optional, but suggested) reference to the original sample. Note that each assay must correspond to one column in the feature abundance table (SMF) created later.
mtd_a <- mtdAssay(
    assay = exp$sample_name,
    sample_ref = c("sample[1]", "sample[1]", "sample[2]", "sample[2]",
                   "sample[3]", "sample[3]"),
    ms_run_ref = paste0("ms_run[", seq_len(nrow(exp)), "]")
)
The resulting formatted assay information is shown in the table below.
pandoc.table(mtd_a, style = "rmarkdown", split.table = Inf, justify = "ll")
| assay[1] | S1_T1 |
| assay[1]-sample_ref | sample[1] |
| assay[1]-ms_run_ref | ms_run[1] |
| assay[2] | S1_T2 |
| assay[2]-sample_ref | sample[1] |
| assay[2]-ms_run_ref | ms_run[2] |
| assay[3] | S2_T1 |
| assay[3]-sample_ref | sample[2] |
| assay[3]-ms_run_ref | ms_run[3] |
| assay[4] | S2_T2 |
| assay[4]-sample_ref | sample[2] |
| assay[4]-ms_run_ref | ms_run[4] |
| assay[5] | S3_T1 |
| assay[5]-sample_ref | sample[3] |
| assay[5]-ms_run_ref | ms_run[5] |
| assay[6] | S3_T2 |
| assay[6]-sample_ref | sample[3] |
| assay[6]-ms_run_ref | ms_run[6] |
We add this information to the mtd variable.
mtd <- rbind(mtd, mtd_a)
Finally, we compile the study variable information of our example experiment. This should capture all experiment-relevant study variables (phenotype or experimental conditions). In R, such information is generally encoded in a sample or phenotype data.frame, with rows being individual samples (or measurements thereof) and columns the sample characteristics. In mzTab-M terms, each column is a study variable group and each distinct value within a column is a study variable. The mtdStudyVariables() function formats a sample/experiment data.frame into the corresponding mzTab-M fields. The groups parameter selects the columns of the input data.frame that represent the study variable groups. Additional function arguments allow specifying the statistical type and the data type for each column/study variable group, but the defaults should work in most situations. By default, the R data types character and factor are mapped to the STATO type categorical, while the STATO type continuous is used for numeric and integer columns. If the data.frame contains ordinal variables, these should be specified manually with the group_type parameter. In our example we additionally define an optional unit for the study variable group timepoint. Units have to be provided in CV parameter format; for study variable groups without a unit, "" or NA has to be used.
mtd_svar <- mtdStudyVariables(
    exp, groups = c("timepoint", "genotype", "operator"),
    group_unit = c("[, , hours, ]", "", ""))
The formatted data is shown in the table below.
pandoc.table(mtd_svar, style = "rmarkdown", split.table = Inf, justify = "ll")
| study_variable_group[1] | [,,timepoint,] |
| study_variable_group[1]-description | Sample matrix column timepoint |
| study_variable_group[1]-type | [STATO, STATO:0000252, categorical variable, ] |
| study_variable_group[1]-datatype | xsd:string |
| study_variable_group[1]-unit | [, , hours, ] |
| study_variable_group[2] | [,,genotype,] |
| study_variable_group[2]-description | Sample matrix column genotype |
| study_variable_group[2]-type | [STATO, STATO:0000252, categorical variable, ] |
| study_variable_group[2]-datatype | xsd:string |
| study_variable_group[3] | [,,operator,] |
| study_variable_group[3]-description | Sample matrix column operator |
| study_variable_group[3]-type | [STATO, STATO:0000252, categorical variable, ] |
| study_variable_group[3]-datatype | xsd:string |
| study_variable[1] | 0h |
| study_variable[1]-assay_refs | assay[1]|assay[3]|assay[5] |
| study_variable[1]-average_function | [MS, MS:1002962, mean, ] |
| study_variable[1]-variation_function | [MS, MS:1002963, variation coefficient, ] |
| study_variable[1]-description | Variable timepoint, value 0h |
| study_variable[1]-group_ref | study_variable_group[1] |
| study_variable[2] | 6h |
| study_variable[2]-assay_refs | assay[2]|assay[4]|assay[6] |
| study_variable[2]-average_function | [MS, MS:1002962, mean, ] |
| study_variable[2]-variation_function | [MS, MS:1002963, variation coefficient, ] |
| study_variable[2]-description | Variable timepoint, value 6h |
| study_variable[2]-group_ref | study_variable_group[1] |
| study_variable[3] | WT |
| study_variable[3]-assay_refs | assay[1]|assay[2] |
| study_variable[3]-average_function | [MS, MS:1002962, mean, ] |
| study_variable[3]-variation_function | [MS, MS:1002963, variation coefficient, ] |
| study_variable[3]-description | Variable genotype, value WT |
| study_variable[3]-group_ref | study_variable_group[2] |
| study_variable[4] | KO |
| study_variable[4]-assay_refs | assay[3]|assay[4]|assay[5]|assay[6] |
| study_variable[4]-average_function | [MS, MS:1002962, mean, ] |
| study_variable[4]-variation_function | [MS, MS:1002963, variation coefficient, ] |
| study_variable[4]-description | Variable genotype, value KO |
| study_variable[4]-group_ref | study_variable_group[2] |
| study_variable[5] | BB |
| study_variable[5]-assay_refs | assay[1]|assay[2]|assay[3]|assay[4] |
| study_variable[5]-average_function | [MS, MS:1002962, mean, ] |
| study_variable[5]-variation_function | [MS, MS:1002963, variation coefficient, ] |
| study_variable[5]-description | Variable operator, value BB |
| study_variable[5]-group_ref | study_variable_group[3] |
| study_variable[6] | FB |
| study_variable[6]-assay_refs | assay[5]|assay[6] |
| study_variable[6]-average_function | [MS, MS:1002962, mean, ] |
| study_variable[6]-variation_function | [MS, MS:1002963, variation coefficient, ] |
| study_variable[6]-description | Variable operator, value FB |
| study_variable[6]-group_ref | study_variable_group[3] |
For each column a study variable group was defined, while each unique value in each of the specified columns was encoded as a "study_variable" (or rather as a study variable value), with its assay_refs attribute listing the rows/assays in which this value was measured. The variable's "description" (by default) indicates the name of the column and the value. The "average_function" and "variation_function" attributes define the functions that were used to calculate the average and the variation of the abundance values for that variable value.
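To make the mapping from table columns to assay references more concrete, the base R sketch below rebuilds the assay_refs values shown above. The data frame `exp_demo` is a hypothetical reconstruction of the experimental design from the table, and `assay_refs()` is an illustrative helper, not the implementation used by mtdStudyVariables().

```r
# Hypothetical sample table reproducing the design shown above
exp_demo <- data.frame(
    timepoint = c("0h", "6h", "0h", "6h", "0h", "6h"),
    genotype = c("WT", "WT", "KO", "KO", "KO", "KO"),
    operator = c("BB", "BB", "BB", "BB", "FB", "FB"),
    stringsAsFactors = FALSE)

# For one column: list, for each unique value, the assays (rows) carrying it
assay_refs <- function(x) {
    vals <- unique(x)
    refs <- vapply(vals, function(v) {
        paste0("assay[", which(x == v), "]", collapse = "|")
    }, character(1))
    names(refs) <- vals
    refs
}

refs <- lapply(exp_demo, assay_refs)
refs$genotype
```

Each element of `refs` corresponds to one study variable group, and each named entry within it to one study variable.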
We next add the study variable information to the mtd variable.
mtd <- rbind(mtd, mtd_svar)
Finally, we sort the elements according to the expected order in the MTD section using the mtdSort() function.
mtd <- mtdSort(mtd)
This two-column matrix could now be saved to a text file using a tabulator ("\t") as the field separator. The full metadata header is shown in the table below.
pandoc.table(mtd, style = "rmarkdown", split.table = Inf, justify = "ll")
| mzTab-version | 2.1.0-M |
| mzTab-ID | EXP_001 |
| title | Experiment 1 preprocessed data |
| description | The preprocessed data of the experiment 1. |
| instrument[1]-name | [MS, MS:1000449, LTQ Orbitrap,] |
| instrument[1]-source | [MS, MS:1000073, ESI,] |
| instrument[1]-analyzer[1] | [MS, MS:1000291, linear ion trap,] |
| instrument[1]-detector | [MS, MS:1000253, electron multiplier,] |
| software[1] | [MS, MS:1001582, xcms, 4.1.0] |
| quantification_method | [MS, MS:1001834, LC-MS label-free quantitation analysis, ] |
| sample[1] | S1 |
| sample[1]-species[1] | [NCBITaxon, NCBITaxon:9606, Homo sapiens, ] |
| sample[1]-tissue[1] | [BTO, BTO:0000759, liver, ] |
| sample[1]-cell_type[1] | [CL, CL:0000182, hepatocyte, ] |
| sample[1]-custom[1] | [,,, [,,Extraction date, 2011-12-21]] |
| sample[2] | S2 |
| sample[2]-species[1] | [NCBITaxon, NCBITaxon:9606, Homo sapiens, ] |
| sample[2]-tissue[1] | [BTO, BTO:0000759, liver, ] |
| sample[2]-cell_type[1] | [CL, CL:0000182, hepatocyte, ] |
| sample[2]-custom[1] | [,,, [,,Extraction date, 2011-12-22]] |
| sample[3] | S3 |
| sample[3]-species[1] | [NCBITaxon, NCBITaxon:9606, Homo sapiens, ] |
| sample[3]-tissue[1] | [BTO, BTO:0000759, liver, ] |
| sample[3]-cell_type[1] | [CL, CL:0000182, hepatocyte, ] |
| sample[3]-custom[1] | [,,, [,,Extraction date, 2011-12-23]] |
| ms_run[1]-location | s1-t1.mzML |
| ms_run[1]-format | [MS, MS:1000584, mzML file, ] |
| ms_run[1]-id_format | [MS, MS:1000530, mzML unique identifier, ] |
| ms_run[1]-scan_polarity[1] | [MS, MS:1000130, positive scan, ] |
| ms_run[2]-location | s1-t2.mzML |
| ms_run[2]-format | [MS, MS:1000584, mzML file, ] |
| ms_run[2]-id_format | [MS, MS:1000530, mzML unique identifier, ] |
| ms_run[2]-scan_polarity[1] | [MS, MS:1000130, positive scan, ] |
| ms_run[3]-location | s2-t1.mzML |
| ms_run[3]-format | [MS, MS:1000584, mzML file, ] |
| ms_run[3]-id_format | [MS, MS:1000530, mzML unique identifier, ] |
| ms_run[3]-scan_polarity[1] | [MS, MS:1000130, positive scan, ] |
| ms_run[4]-location | s2-t2.mzML |
| ms_run[4]-format | [MS, MS:1000584, mzML file, ] |
| ms_run[4]-id_format | [MS, MS:1000530, mzML unique identifier, ] |
| ms_run[4]-scan_polarity[1] | [MS, MS:1000130, positive scan, ] |
| ms_run[5]-location | s3-t1.mzML |
| ms_run[5]-format | [MS, MS:1000584, mzML file, ] |
| ms_run[5]-id_format | [MS, MS:1000530, mzML unique identifier, ] |
| ms_run[5]-scan_polarity[1] | [MS, MS:1000130, positive scan, ] |
| ms_run[6]-location | s3-t2.mzML |
| ms_run[6]-format | [MS, MS:1000584, mzML file, ] |
| ms_run[6]-id_format | [MS, MS:1000530, mzML unique identifier, ] |
| ms_run[6]-scan_polarity[1] | [MS, MS:1000130, positive scan, ] |
| assay[1] | S1_T1 |
| assay[1]-sample_ref | sample[1] |
| assay[1]-ms_run_ref | ms_run[1] |
| assay[2] | S1_T2 |
| assay[2]-sample_ref | sample[1] |
| assay[2]-ms_run_ref | ms_run[2] |
| assay[3] | S2_T1 |
| assay[3]-sample_ref | sample[2] |
| assay[3]-ms_run_ref | ms_run[3] |
| assay[4] | S2_T2 |
| assay[4]-sample_ref | sample[2] |
| assay[4]-ms_run_ref | ms_run[4] |
| assay[5] | S3_T1 |
| assay[5]-sample_ref | sample[3] |
| assay[5]-ms_run_ref | ms_run[5] |
| assay[6] | S3_T2 |
| assay[6]-sample_ref | sample[3] |
| assay[6]-ms_run_ref | ms_run[6] |
| study_variable_group[1] | [,,timepoint,] |
| study_variable_group[1]-description | Sample matrix column timepoint |
| study_variable_group[1]-type | [STATO, STATO:0000252, categorical variable, ] |
| study_variable_group[1]-datatype | xsd:string |
| study_variable_group[1]-unit | [, , hours, ] |
| study_variable_group[2] | [,,genotype,] |
| study_variable_group[2]-description | Sample matrix column genotype |
| study_variable_group[2]-type | [STATO, STATO:0000252, categorical variable, ] |
| study_variable_group[2]-datatype | xsd:string |
| study_variable_group[3] | [,,operator,] |
| study_variable_group[3]-description | Sample matrix column operator |
| study_variable_group[3]-type | [STATO, STATO:0000252, categorical variable, ] |
| study_variable_group[3]-datatype | xsd:string |
| study_variable[1] | 0h |
| study_variable[1]-assay_refs | assay[1]|assay[3]|assay[5] |
| study_variable[1]-average_function | [MS, MS:1002962, mean, ] |
| study_variable[1]-variation_function | [MS, MS:1002963, variation coefficient, ] |
| study_variable[1]-description | Variable timepoint, value 0h |
| study_variable[1]-group_ref | study_variable_group[1] |
| study_variable[2] | 6h |
| study_variable[2]-assay_refs | assay[2]|assay[4]|assay[6] |
| study_variable[2]-average_function | [MS, MS:1002962, mean, ] |
| study_variable[2]-variation_function | [MS, MS:1002963, variation coefficient, ] |
| study_variable[2]-description | Variable timepoint, value 6h |
| study_variable[2]-group_ref | study_variable_group[1] |
| study_variable[3] | WT |
| study_variable[3]-assay_refs | assay[1]|assay[2] |
| study_variable[3]-average_function | [MS, MS:1002962, mean, ] |
| study_variable[3]-variation_function | [MS, MS:1002963, variation coefficient, ] |
| study_variable[3]-description | Variable genotype, value WT |
| study_variable[3]-group_ref | study_variable_group[2] |
| study_variable[4] | KO |
| study_variable[4]-assay_refs | assay[3]|assay[4]|assay[5]|assay[6] |
| study_variable[4]-average_function | [MS, MS:1002962, mean, ] |
| study_variable[4]-variation_function | [MS, MS:1002963, variation coefficient, ] |
| study_variable[4]-description | Variable genotype, value KO |
| study_variable[4]-group_ref | study_variable_group[2] |
| study_variable[5] | BB |
| study_variable[5]-assay_refs | assay[1]|assay[2]|assay[3]|assay[4] |
| study_variable[5]-average_function | [MS, MS:1002962, mean, ] |
| study_variable[5]-variation_function | [MS, MS:1002963, variation coefficient, ] |
| study_variable[5]-description | Variable operator, value BB |
| study_variable[5]-group_ref | study_variable_group[3] |
| study_variable[6] | FB |
| study_variable[6]-assay_refs | assay[5]|assay[6] |
| study_variable[6]-average_function | [MS, MS:1002962, mean, ] |
| study_variable[6]-variation_function | [MS, MS:1002963, variation coefficient, ] |
| study_variable[6]-description | Variable operator, value FB |
| study_variable[6]-group_ref | study_variable_group[3] |
| cv[1]-label | MS |
| cv[1]-full_name | PSI-MS controlled vocabulary |
| cv[1]-version | 4.1.138 |
| cv[1]-uri | https://raw.githubusercontent.com/HUPO-PSI/psi-ms-CV/master/psi-ms.obo |
| cv[2]-label | PRIDE |
| cv[2]-full_name | PRIDE PRoteomics IDEntifications (PRIDE) database controlled vocabulary |
| cv[2]-version | 16:10:2023 11:38 |
| cv[2]-uri | https://www.ebi.ac.uk/ols/ontologies/pride |
| cv[3]-label | STATO |
| cv[3]-full_name | General purpose STATistics Ontology |
| cv[3]-version | 2026-04-20 |
| cv[3]-uri | https://www.ebi.ac.uk/ols4/ontologies/stato |
| cv[4]-label | BTO |
| cv[4]-full_name | The BRENDA Tissue Ontology (BTO) |
| cv[4]-version | 2021-10-26 |
| cv[4]-uri | https://www.ebi.ac.uk/ols4/ontologies/bto |
| cv[5]-label | NCBITaxon |
| cv[5]-full_name | NCBI organismal classification |
| cv[5]-version | 2025-12-03 |
| cv[5]-uri | https://www.ebi.ac.uk/ols4/ontologies/ncbitaxon |
| database[1] | [,, "no database", null ] |
| database[1]-prefix | null |
| database[1]-version | Unknown |
| database[1]-uri | null |
| small_molecule-quantification_unit | [PRIDE, PRIDE:0000330, Arbitrary quantification unit, ] |
| small_molecule_feature-quantification_unit | [PRIDE, PRIDE:0000330, Arbitrary quantification unit, ] |
| small_molecule-identification_reliability | [MS, MS:1002896, compound identification confidence level, ] |
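As noted above, the two-column metadata matrix can be written to a tab-delimited text file. A minimal base R sketch is shown below. It uses a small stand-in matrix `mtd_demo` instead of the full mtd variable and assumes that the matrix does not yet carry the "MTD" line prefix that metadata lines have in mzTab files. The package may provide a dedicated export function that should be preferred over this manual approach.

```r
# Small stand-in for the metadata matrix built above (columns: field, value)
mtd_demo <- rbind(
    c("mzTab-version", "2.1.0-M"),
    c("mzTab-ID", "EXP_001"),
    c("ms_run[1]-location", "s1-t1.mzML"))

# Prepend the "MTD" line prefix and write tab-separated, without quotes,
# row names or a header line
out_file <- tempfile(fileext = ".mztab")
write.table(cbind("MTD", mtd_demo), file = out_file, sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)

mtd_lines <- readLines(out_file)
mtd_lines
```

Disabling quoting matters here because many values contain brackets, commas or spaces that must be written verbatim.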
The small molecule feature (SMF) section captures information on the quantified entities (features) of an experiment. This includes the feature abundances across assays as well as each feature's m/z, retention time and possible additional annotations such as the ion or the exact mass. The smfCreate() function compiles and formats this section based on the provided abundance matrix and feature specifications.
Below we create an example abundance matrix and feature characteristics data matching the metadata from the previous section. Generally, such information can be extracted from the result objects of preprocessing software. We first define the abundance matrix: columns are assays, rows are features. Importantly, the number and order of the assays have to match the assay definition in the metadata (defined above with the mtdAssay() function). Our example data consists of the quantification of 7 features in 6 measurements (assays) of 3 samples.
abundances <- cbind(c(200.1, 1232.1, 54.3, 399.1, 599.8, 23.1, NA),
                    c(260.2, 39.5, 177.4, 599.5, 5344.1, 332.1, 43.0),
                    c(256.1, 904.2, 56.9, 533.1, 489.9, 3231.22, 23.4),
                    c(232.1, 43.3, 201.4, 434.2, 5154.1, 43.4, 324.3),
                    c(264.2, 1102.4, 43.5, 514.5, 583.1, 432.3, 43.3),
                    c(246.2, 52.1, 187.2, 508.3, 601.5, 432.2, 34.5))
colnames(abundances) <- exp$sample_name
rownames(abundances) <- c("FT01", "FT02", "FT03", "FT04", "FT05",
                          "FT06", "FT07")
We next also define a data.frame with the feature characteristics from the MS measurement run (one row per feature and columns with m/z, retention time and, where known, the adduct information and charge). Note that without any annotation (and hence without an SML and SME section), adduct and charge information will not be available for the SMF table.
feature_info <- data.frame(
    mzmed = c(195.088, 127.1, 299.2, 181.07, 218.077, 343.123, 148.06),
    rtmed = c(25.6, 128.4, 67.2, 127.3, 25.7, 167.2, 76.34),
    rtmin = c(23.1, 125.1, 65.1, 122.3, 23.3, 162.3, 71.3),
    rtmax = c(26.9, 130.3, 69.1, 134.2, 26.8, 172.1, 81.2),
    adduct = c("[M+H]+", NA, NA, "[M+Na]+", "[M+Na]+", "[M+H]+", "[M+H]+"),
    charge = c(1L, NA, NA, 1L, 1L, 1L, 1L)
)
rownames(feature_info) <- rownames(abundances)
We can now feed this information to the smfCreate() function. In addition to the predefined parameters, additional feature annotations/columns can be passed to the function through its ... parameter. We provide the IDs of the individual features with feature_id =. These are then stored in a column "opt_global_feature_id". Note that all parameters must be fully named, e.g., x = or charge =, since the function does not support positional matching of its arguments.
smf <- smfCreate(
    x = abundances,
    exp_mass_to_charge = feature_info$mzmed,
    retention_time_in_seconds = feature_info$rtmed,
    retention_time_in_seconds_start = feature_info$rtmin,
    retention_time_in_seconds_end = feature_info$rtmax,
    charge = feature_info$charge,
    adduct_ion = feature_info$adduct,
    feature_id = rownames(feature_info))
The SMF content is:
smf
SFH SMF_ID SME_ID_REFS SME_ID_REF_ambiguity_code adduct_ion isotopomer
FT01 SMF 1 null null [M+H]+ null
FT02 SMF 2 null null null null
FT03 SMF 3 null null null null
FT04 SMF 4 null null [M+Na]+ null
FT05 SMF 5 null null [M+Na]+ null
FT06 SMF 6 null null [M+H]+ null
FT07 SMF 7 null null [M+H]+ null
exp_mass_to_charge charge retention_time_in_seconds
FT01 195.088 1 25.6
FT02 127.1 null 128.4
FT03 299.2 null 67.2
FT04 181.07 1 127.3
FT05 218.077 1 25.7
FT06 343.123 1 167.2
FT07 148.06 1 76.34
retention_time_in_seconds_start retention_time_in_seconds_end
FT01 23.1 26.9
FT02 125.1 130.3
FT03 65.1 69.1
FT04 122.3 134.2
FT05 23.3 26.8
FT06 162.3 172.1
FT07 71.3 81.2
abundance_assay[1] abundance_assay[2] abundance_assay[3]
FT01 200.1 260.2 256.10
FT02 1232.1 39.5 904.20
FT03 54.3 177.4 56.90
FT04 399.1 599.5 533.10
FT05 599.8 5344.1 489.90
FT06 23.1 332.1 3231.22
FT07 NA 43.0 23.40
abundance_assay[4] abundance_assay[5] abundance_assay[6]
FT01 232.1 264.2 246.2
FT02 43.3 1102.4 52.1
FT03 201.4 43.5 187.2
FT04 434.2 514.5 508.3
FT05 5154.1 583.1 601.5
FT06 43.4 432.3 432.2
FT07 324.3 43.3 34.5
opt_global_feature_id
FT01 FT01
FT02 FT02
FT03 FT03
FT04 FT04
FT05 FT05
FT06 FT06
FT07 FT07
Importantly, smfCreate() added a column "SMF_ID" with an integer representing the unique identifier of each feature (row). These IDs can then be used for referencing between the SML and SME tables.
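To illustrate how such references could be assembled, the base R sketch below collapses the SMF_IDs of all features assigned to the same compound into one "|"-separated reference string. The `compound` vector is a hypothetical per-feature assignment made up for this illustration; in practice it would come from an annotation workflow.

```r
# SMF_IDs as assigned above and a hypothetical compound assignment per
# feature (NA: feature without annotation)
smf_id <- 1:7
compound <- c("caffeine", NA, NA, "glucose|mannose", "caffeine",
              "sucrose", "DL-glutamate")

# Keep annotated features and group their SMF_IDs by compound, preserving
# the order in which the compounds first appear
keep <- !is.na(compound)
groups <- split(smf_id[keep],
                factor(compound[keep], levels = unique(compound[keep])))

# Collapse the IDs of each group into a single reference string
smf_id_refs <- vapply(groups, paste, character(1), collapse = "|")
smf_id_refs
```

A compound detected as several ions thus receives a reference to all of its features, while unannotated features are left out.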
The Small Molecule (SML) table represents the final result of an experiment that is reported. It contains the abundances of molecules along with their annotations and abundance summaries for the experiment’s study variables. The content of the SML table is in general a subset of the SMF table, containing only the annotated features.
Below we define a data.frame with annotations for features from the previous section's SMF table. Such data should be compiled from the results of an annotation software or workflow that used the SMF information as input. In our example, FT01 and FT05 are the "[M+H]+" and "[M+Na]+" ions of caffeine, FT04 is the "[M+Na]+" ion of either glucose or mannose, FT06 is the "[M+H]+" ion of sucrose and FT07 is the "[M+H]+" ion of DL-glutamate. For FT02 and FT03 no annotation is known. For caffeine we report only one (the main) ion in the table but reference both features in the SMF table. For the ambiguous annotation of FT04 we report both annotations, separated by a "|". The two features without annotation are not reported.
anns <- data.frame(
    id = c("HMDB:HMDB0001847",
           "HMDB:HMDB0000122|HMDB:HMDB0000169",
           "HMDB:HMDB0000258",
           "HMDB:HMDB0060475"),
    formula = c("C8H10N4O2",
                "C6H12O6|C6H12O6",
                "C12H22O11",
                "C5H9NO4"),
    neutral_mass = c(194.0804,
                     "180.0634|180.0634",
                     342.1162,
                     147.0531),
    name = c("caffeine",
             "glucose|mannose",
             "sucrose",
             "DL-glutamate"),
    adduct = c("[M+H]1+",
               "[M+Na]1+",
               "[M+H]1+",
               "[M+H]1+"),
    uri = c("http://www.hmdb.ca/metabolites/HMDB0001847",
            "http://www.hmdb.ca/metabolites/HMDB0000122|http://www.hmdb.ca/metabolites/HMDB0000169",
            "http://www.hmdb.ca/metabolites/HMDB0000258",
            "http://www.hmdb.ca/metabolites/HMDB0060475"),
    note = c("manual curation")
)
We next subset the feature abundance matrix to the selected (and annotated) molecules we want to report.
abundances_sml <- abundances[c(1, 4, 6, 7), ]
With this information we can use the smlCreate() function to compile the SML table. Note that (again) we must fully name all function arguments to which we pass values. Any additional (named) parameters provided to the function (like note = anns$note below) will be added as optional columns (prefixed with "opt_global_", e.g., "opt_global_note").
sml <- smlCreate(x = abundances_sml,
                 database_identifier = anns$id,
                 chemical_formula = anns$formula,
                 theoretical_neutral_mass = anns$neutral_mass,
                 adduct_ions = anns$adduct,
                 uri = anns$uri,
                 note = anns$note)
sml
SMH SML_ID SMF_ID_REFS database_identifier chemical_formula
FT01 SML 1 null HMDB:HMDB0001847 C8H10N4O2
FT04 SML 2 null HMDB:HMDB0000122|HMDB:HMDB0000169 C6H12O6|C6H12O6
FT06 SML 3 null HMDB:HMDB0000258 C12H22O11
FT07 SML 4 null HMDB:HMDB0060475 C5H9NO4
smiles inchi chemical_name
FT01 null null null
FT04 null|null null|null null|null
FT06 null null null
FT07 null null null
uri
FT01 http://www.hmdb.ca/metabolites/HMDB0001847
FT04 http://www.hmdb.ca/metabolites/HMDB0000122|http://www.hmdb.ca/metabolites/HMDB0000169
FT06 http://www.hmdb.ca/metabolites/HMDB0000258
FT07 http://www.hmdb.ca/metabolites/HMDB0060475
theoretical_neutral_mass adduct_ions reliability
FT01 194.0804 [M+H]1+ null
FT04 180.0634|180.0634 [M+Na]1+ null
FT06 342.1162 [M+H]1+ null
FT07 147.0531 [M+H]1+ null
best_id_confidence_measure best_id_confidence_value abundance_assay[1]
FT01 null null 200.1
FT04 null null 399.1
FT06 null null 23.1
FT07 null null NA
abundance_assay[2] abundance_assay[3] abundance_assay[4]
FT01 260.2 256.10 232.1
FT04 599.5 533.10 434.2
FT06 332.1 3231.22 43.4
FT07 43.0 23.40 324.3
abundance_assay[5] abundance_assay[6] opt_global_note
FT01 264.2 246.2 manual curation
FT04 514.5 508.3 manual curation
FT06 432.3 432.2 manual curation
FT07 43.3 34.5 manual curation
This SML table is, however, not yet complete. We must update the relationship between rows in the SML and the SMF section in column "SMF_ID_REFS".
sml$SMF_ID_REFS <- c("1|5", "4", "6", "7")
Finally, we need to add columns with the abundance average and variation for the study variables defined in the MTD section. Here we can use the smlAddStudyVariableColumns() helper function, providing both the SML and the MTD data.
sml <- smlAddStudyVariableColumns(sml, mtd)
sml
SMH SML_ID SMF_ID_REFS database_identifier chemical_formula
FT01 SML 1 1|5 HMDB:HMDB0001847 C8H10N4O2
FT04 SML 2 4 HMDB:HMDB0000122|HMDB:HMDB0000169 C6H12O6|C6H12O6
FT06 SML 3 6 HMDB:HMDB0000258 C12H22O11
FT07 SML 4 7 HMDB:HMDB0060475 C5H9NO4
smiles inchi chemical_name
FT01 null null null
FT04 null|null null|null null|null
FT06 null null null
FT07 null null null
uri
FT01 http://www.hmdb.ca/metabolites/HMDB0001847
FT04 http://www.hmdb.ca/metabolites/HMDB0000122|http://www.hmdb.ca/metabolites/HMDB0000169
FT06 http://www.hmdb.ca/metabolites/HMDB0000258
FT07 http://www.hmdb.ca/metabolites/HMDB0060475
theoretical_neutral_mass adduct_ions reliability
FT01 194.0804 [M+H]1+ null
FT04 180.0634|180.0634 [M+Na]1+ null
FT06 342.1162 [M+H]1+ null
FT07 147.0531 [M+H]1+ null
best_id_confidence_measure best_id_confidence_value abundance_assay[1]
FT01 null null 200.1
FT04 null null 399.1
FT06 null null 23.1
FT07 null null NA
abundance_assay[2] abundance_assay[3] abundance_assay[4]
FT01 260.2 256.10 232.1
FT04 599.5 533.10 434.2
FT06 332.1 3231.22 43.4
FT07 43.0 23.40 324.3
abundance_assay[5] abundance_assay[6] abundance_study_variable[1]
FT01 264.2 246.2 240.1333
FT04 514.5 508.3 482.2333
FT06 432.3 432.2 1228.8733
FT07 43.3 34.5 NA
abundance_study_variable[2] abundance_study_variable[3]
FT01 246.1667 230.15
FT04 514.0000 499.30
FT06 269.2333 177.60
FT07 133.9333 NA
abundance_study_variable[4] abundance_study_variable[5]
FT01 249.650 237.125
FT04 497.525 491.475
FT06 1034.780 907.455
FT07 106.375 NA
abundance_study_variable[6] abundance_variation_study_variable[1]
FT01 255.20 0.1453594
FT04 511.40 0.1505366
FT06 432.25 1.4209044
FT07 38.90 0.4219318
abundance_variation_study_variable[2]
FT01 0.05707527
FT04 0.16108421
FT06 0.74983276
FT07 1.23133754
abundance_variation_study_variable[3]
FT01 0.1846497
FT04 0.2838057
FT06 1.2302702
FT07 NA
abundance_variation_study_variable[4]
FT01 0.05536875
FT04 0.08745695
FT06 1.42612166
FT07 1.36790894
abundance_variation_study_variable[5]
FT01 0.1164790
FT04 0.1865403
FT06 1.7142351
FT07 1.2926962
abundance_variation_study_variable[6] opt_global_note
FT01 0.0498743027 manual curation
FT04 0.0085726673 manual curation
FT06 0.0001635875 manual curation
FT07 0.1599624595 manual curation
For each study variable in the MTD section, an abundance_study_variable and an abundance_variation_study_variable column were added, aggregating the abundance values from the respective assays with the average and variation functions defined in the MTD section.
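The aggregation can be reproduced by hand for individual rows. The base R sketch below recomputes the values for study_variable[1] (timepoint 0h, measured in assay[1], assay[3] and assay[5]) for FT01 and FT04, assuming that the variation coefficient is the sample standard deviation divided by the mean. It is meant as a plausibility check, not as the code used by smlAddStudyVariableColumns(), and it does not handle missing values.

```r
# Abundances of FT01 and FT04 across the six assays (values from above)
ab <- rbind(FT01 = c(200.1, 260.2, 256.1, 232.1, 264.2, 246.2),
            FT04 = c(399.1, 599.5, 533.1, 434.2, 514.5, 508.3))

# study_variable[1] (timepoint 0h) references assay[1], assay[3], assay[5]
sv1 <- c(1, 3, 5)

# Average function: arithmetic mean per row
sv1_mean <- rowMeans(ab[, sv1])
# Variation function: standard deviation relative to the mean
sv1_cv <- apply(ab[, sv1], 1, sd) / sv1_mean

round(sv1_mean, 4)
round(sv1_cv, 7)
```

The results agree with the abundance_study_variable[1] and abundance_variation_study_variable[1] columns reported for FT01 and FT04 in the table above.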
TODO: implement these functions moving the respective code from the legacy repo
General utility functions include:
- mtdFields(): to format values in the mzTab-M-specific format.
- mtdSort(): to sort rows of the metadata matrix into the expected order.
- parseCvParameter(): extract elements and values from a CV parameter.
- isCvParameter(): checks whether a character is in the expected CV parameter format.
sessionInfo()
R version 4.6.0 (2026-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 26.04 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.32.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] pander_0.6.6 SummarizedExperiment_1.43.0
[3] Biobase_2.73.1 GenomicRanges_1.65.0
[5] Seqinfo_1.3.0 IRanges_2.47.2
[7] S4Vectors_0.51.3 BiocGenerics_0.59.7
[9] generics_0.1.4 MatrixGenerics_1.25.0
[11] matrixStats_1.5.0 RmzTabM_0.97.17
loaded via a namespace (and not attached):
[1] cli_3.6.6 knitr_1.51 rlang_1.2.0
[4] xfun_0.59 otel_0.2.0 data.table_1.18.4
[7] DelayedArray_0.39.3 jsonlite_2.0.0 buildtools_1.0.0
[10] htmltools_0.5.9 maketools_1.3.2 sys_3.4.3
[13] rmarkdown_2.31 grid_4.6.0 abind_1.4-8
[16] evaluate_1.0.5 fastmap_1.2.0 yaml_2.3.12
[19] compiler_4.6.0 Rcpp_1.1.1-1.1 XVector_0.53.0
[22] lattice_0.22-9 digest_0.6.39 SparseArray_1.13.2
[25] Matrix_1.7-5 tools_4.6.0 S4Arrays_1.13.0