---
title: "The R API for the mzTab-M File Format"
package: RmzTabM
format:
  html:
    minimal: true
    theme: flatly
vignette: >
  %\VignetteIndexEntry{The R API for the mzTab-M File Format}
  %\VignetteKeywords{Mass Spectrometry, MS, MSMS, Metabolomics, Infrastructure, Quantitative}
  %\VignettePackage{RmzTabM}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{quarto::html}
  %\VignetteDepends{RmzTabM,BiocStyle,pander}
---

# Introduction

The *RmzTabM* package provides the API and core functionality to read and write
files in mzTab-M format. The functions can be re-used and integrated by other R
packages to support import and export of their respective
metabolomics/lipidomics result objects in this format.

For a general overview of the mzTab-M format see [this
figure](https://github.com/HUPO-PSI/mzTab-M/blob/main/specification_documents/img/media/figure-mztab-sections.png).

![mzTab-M
format](https://github.com/HUPO-PSI/mzTab-M/blob/main/specification_documents/img/media/figure-mztab-sections.png)

The *RmzTabM* package supports mzTab-M version **2.1**.

# Installation

# General information on the mzTab-M format

The mzTab-M format consists of four cross-referenced data tables: metadata
(MTD), Small Molecule (SML), Small Molecule Feature (SMF) and the Small Molecule
Evidence (SME). The MTD section is supposed to contain all experiment and
measurement relevant information. The SML section contains the final results of
an analysis that should be reported, i.e., the (annotated) molecules and their
respective abundances. The SMF section contains information on the
measured (LC-MS) *features* and their abundance values. The SME section
contains information on the annotation process (and reliability) of the
molecules reported in the SML section. The SML is supposed to be a subset of the
SMF table. The structure and relationship between rows in these different tables
is defined by the mzTab-M standard and follows strict rules. The functions from
the *RmzTabM* package assist in creating and formatting these tables.

# R mzTab-M API

The *RmzTabM* package provides low level, core functions and higher-level
functions to work with files in mzTab-M format. The high-level functions are
more user-oriented, simplifying the import and export of data and information
from and to files in mzTab-M format. The low-level, core functions are
developer-oriented, providing helper functions that can be re-used in other R
packages to read and write from and to mzTab-M files.

For a description of the mzTab-M format and the set of mandatory and optional
fields refer to the official [format
definition](https://github.com/HUPO-PSI/mzTab-M/blob/main/specification_documents/mzTab_format_specification_2_1-M.adoc).

```{r}
library(RmzTabM)
```

## High-level, user faced functions

TODO: implement these functions. These functions should simplify import/export
taking more complex or multiple data parts (`data.frame`s, `matrix` etc) as
input and write the formatted data directly to a mzTab-M file, or should read a
mzTab-M file returning it's content as e.g. a `list` of elements.

## Low-level functions

The low-level functions listed in this section provide the base functionality to
convert or format information and data for/from the [mzTab-M
format](https://github.com/HUPO-PSI/mzTab-M/blob/main/specification_documents/mzTab_format_specification_2_1-M.adoc). These
functions are designed to be re-used by other R packages and take and return
only basic, plain R data types.

### Formatting and exporting

All formatting and export functions require that all their parameters, if
specified, **must** be fully named, i.e., no positional matching of a function's
arguments is supported.

#### Metadata

The mzTab-M format defines various fields and parameters to describe the data
and information of an experiment. The *RmzTabM* package provides a variety of
utility functions that help defining and formatting this information.

See also the [specification of the MTD
section](https://github.com/HUPO-PSI/mzTab-M/blob/main/specification_documents/mzTab_format_specification_2_1-M.adoc#62-metadata-section)
for more information and optional and mandatory metadata fields.

The general categories of the metadadata in the mzTab-M MTD section are *core
information*, *sample information*, *MS run information*, *assay information*
and *study variable information*. For each of these categories a separate R
function is available to create and format the respective fields. As an example,
we define below a `data.frame` with sample information. In our example we assume
3 samples (e.g. cell lines) each measured at two different time points. An
additional column *genotype* specifies the genotype of the individual samples
and a column *operator* the initials of the researcher extracting the samples.

```{r}
#' Define a simple data.frame of the measured samples of an experiment
exp <- data.frame(
    sample_name = c("S1_T1", "S1_T2", "S2_T1", "S2_T2", "S3_T1", "S3_T2"),
    sample_id = c("S1", "S1", "S2", "S2", "S3", "S3"),
    timepoint = c("0h", "6h", "0h", "6h", "0h", "6h"),
    genotype = c("WT", "WT", "KO", "KO", "KO", "KO"),
    operator = c("BB", "BB", "BB", "BB", "FB", "FB"),
    file_name = c("s1-t1.mzML", "s1-t2.mzML", "s2-t1.mzML", "s2-t2.mzML",
                  "s3-t1.mzML", "s3-t2.mzML")
)
exp
```

We will next compile the MTD information for the experiment using the individual
helper functions, starting with the **Core information**: this comprises general
information about the experiment. A minimal set of fields can be compiled using
the `mtdSkeleton()` function. We have to provide an ID for the experiment and
in addition we specify the software used to process the data:

```{r}
mtd <- mtdSkeleton(
    id = "EXP_001",
    software = "[MS, MS:1001582, xcms, 4.1.0]"
)
```

```{r}
library(pander)
pandoc.table(mtd, style = "rmarkdown", split.table = Inf, justify = "ll")
```

This represents some minimal information. The data of the MTD section is
formatted as a `character` 2-column `matrix`. We could now either change the
value (i.e., the elements in the second column of this matrix) of existing
fields, or also manually add additional fields/information. As an example we add
a title and description for the experiment. See also the
[mzTab-M](https://github.com/HUPO-PSI/mzTab-M/blob/main/specification_documents/mzTab_format_specification_2_1-M.adoc#62-metadata-section)
format definition for other supported fields.

```{r}
mtd <- rbind(
    mtd,
    c("title", "Experiment 1 preprocessed data"),
    c("description", "The preprocessed data of the experiment 1.")
)
```

To help with formatting we can also use the `mtdFields()` function. Below we
use this function to add information about the MS instrumentation to the MTD
section:

```{r}
instr <- mtdFields(
    name = "[MS, MS:1000449, LTQ Orbitrap,]",
    source = "[MS, MS:1000073, ESI,]",
    `analyzer[1]` = "[MS, MS:1000291, linear ion trap,]",
    detector = "[MS, MS:1000253, electron multiplier,]",
    field_prefix = "instrument"
)
```

```{r, results = "asis"}
pandoc.table(instr, style = "rmarkdown", split.table = Inf, justify = "ll")
```

And we add that information to the `mtd` variable.

```{r}
mtd <- rbind(mtd, instr)
```

The next category of metadata information is **sample information**. This
comprises (optional) information on individual samples that were measured with
the various assays/runs. We use the `mtdSample()` function to assist in
compiling this information. Parameters `sample`, `species`, `tissue` and
`cell_type`, `disease` and `description` allow to provide pre-defined sample
properties. Additional sample annotations and details can be provided through
the function's `...`. For the example below we define some of these properties
and in addition provide a custom field for the extraction data. Be aware that
`mtdSample()` does **not** support partial or positional matching of
parameters; for each of the parameters the **full parameter name** has to be
used (i.e., `sample = ...` instead of `sam = ...` or `s = ...`).

```{r}
mtd_s <- mtdSample(
    sample = unique(exp$sample_id),
    species = "[NCBITaxon, NCBITaxon:9606, Homo sapiens, ]",
    tissue = "[BTO, BTO:0000759, liver, ]",
    cell_type = "[CL, CL:0000182, hepatocyte, ]",
    c("[,,Extraction date, 2011-12-21]",
      "[,,Extraction date, 2011-12-22]",
      "[,,Extraction date, 2011-12-23]")
    )
```

```{r, results = "asis"}
pandoc.table(mtd_s, style = "rmarkdown", split.table = Inf, justify = "ll")
```

Note that the general information part should also contain the references to
**all** controlled vocabulary (CV) ontologies used in the mzTab-M file. The
default ontologies added by the `mtb_skeleton()` function are the *PSI-MS*,
*PRIDE* and *STATO* ontologies. If other vocabularies are used, they should be
either added manually (following the scheme of the others, i.e., the fields
starting with `"cv["`) or provided with the `cv_*` function arguments of the
`mtb_skeleton()` function. For our example we use also the *BRENDA tissue
ontology* (BTO) and the *NCBITaxon* ontology to define the tissue of origin and
species of the samples and hence need to add these ontologies to the general
metadata section. We use the `mtdFields()` function for this. For a CV entry we
need to provide a *label*, the *full_name*, the *version* and the *uri*:

```{r}
add_cv <- mtdFields(
    label = c("BTO", "NCBITaxon"),
    full_name = c("The BRENDA Tissue Ontology (BTO)",
                  "NCBI organismal classification"),
    version = c("2021-10-26", "2025-12-03"),
    uri = c("https://www.ebi.ac.uk/ols4/ontologies/bto",
            "https://www.ebi.ac.uk/ols4/ontologies/ncbitaxon"),
    field_prefix = "cv")
add_cv
```

We need to update the index of the cv, since there are already 3 CVs (MS, PRIDE
and STATO) defined by in the metadata part. We thus replace next the `"1"` with
`"4"` and `"2"` with `"5"` and append this CV term to the metadata section.

```{r}
add_cv[, 1L] <- sub("1", "4", add_cv[, 1L])
add_cv[, 1L] <- sub("2", "5", add_cv[, 1L])
mtd <- rbind(mtd, add_cv)
```

We can then add the sample information to the `mtd` variable by simply
`rbind()`ing it.

```{r}
mtd <- rbind(mtd, mtd_s)
```

Next we compile **MS run information** of the experiment using the
`mtdMsRun()` helper function. This should comprise all (MS-specific)
information related to the measurement of each sample - including also the MS
data file names and locations. For our example we use the file names
reported in the sample data frame and specify the polarity of the measurement
runs.

```{r}
mtd_msr <- mtdMsRun(
    location = exp$file_name,
    format = "[MS, MS:1000584, mzML file, ]",
    id_format = "[MS, MS:1000530, mzML unique identifier, ]",
    scan_polarity = "positive")
```

```{r, results = "asis"}
pandoc.table(mtd_msr, style = "rmarkdown", split.table = Inf, justify = "ll")
```

Each row in the `exp` data frame was assigned to a `"ms_run"` with the location
and format of the respective file as well as the polarity in which the data was
acquired. We can combine this data with the `mtd` variable.

```{r}
mtd <- rbind(mtd, mtd_msr)
```

Next we define the **assay information**. Generally, each measurement (MS run)
is associated to one assay, but also more complex configurations are
supported. See the help of the `mtdAssay()` function for details on multiplexed
or pre-fractionated samples. Mandatory information that has to be provided to
the `mtdAssay()` function are the name (ID) of the assay and the reference to
the MS run in which the assay was measured. For the latter, a format of
`"ms_run[<index of the MS run>]"` is expected. For our example we provide in
addition also the (optional, but suggested) reference to the original
sample. Note that each assay must represent one column in the following feature
abundance table (SMF).

```{r}
mtd_a <- mtdAssay(
    assay = exp$sample_name,
    sample_ref = c("sample[1]", "sample[1]", "sample[2]", "sample[2]",
                   "sample[3]", "sample[3]"),
    ms_run_ref = paste0("ms_run[", seq_len(nrow(exp)), "]")
)
```

The result formatted *assay information* is shown in the table below.

```{r, results = "asis"}
pandoc.table(mtd_a, style = "rmarkdown", split.table = Inf, justify = "ll")
```

We add this information to the `mtd` variable.

```{r}
mtd <- rbind(mtd, mtd_a)
```

At last we compile the **study variable information** of our example
experiment. This should capture all experiment-relevant study variables
(phenotype or experimental conditions). In R, such information is generally
encoded in a *sample* or *phenotype* `data.frame`, with rows being individual
samples (or measurements thereof) and columns the sample characteristics (i.e.,
the *study variable groups*, with the individual values of the columns being, in
the mzTab-M definition, the *study variables*). The `mtdStudyVariables()`
function formats a sample/experiment `data.frame` into the corresponding mzTab-M
fields. Parameter `groups` allows to select the columns of the input
`data.frame` which represent the study variable groups (phenotype or
experimental conditions). Additional function arguments allow to specify
the statistical type and the datatype for each column/study variable group, but
the defaults should work in most situations. By default, R data types
`character` and `factor` are mapped to the STATO type *categorical*, while the
STATO type *continuous* is used for `numeric` and `integer` columns. If the
`data.frame` contains *ordinal* variables it should be manually specified with
parameter `group_type`. In our example we define in addition an optional *unit*
for the study variable *timepoint*. Units have to be provided in CV parameter
format; for study variable groups without unit `""` or `NA` has to be used.

```{r}
mtd_svar <- mtdStudyVariables(
    exp, groups = c("timepoint", "genotype", "operator"),
    group_unit = c("[, , hours, ]", "", ""))
```

The formatted data is shown in the table below.

```{r, results = "asis"}
pandoc.table(mtd_svar, style = "rmarkdown", split.table = Inf, justify = "ll")
```

For each column a *study variable group* was defined while each unique value in
each of the specified columns was encoded as a `"study_variable"` (or rather as
a *study variable value*), with its *assay_refs* attribute containing the
rows/assays in which this value was measured. The variable's `"description"` (by
default) indicates the name of the column. The `"average_function"` and
`"variation_function"` attributes allow to define the function that was used to
calculate the average and variance of the abundance values for that variable
value.

We next add the study variable information to the `mtd` variable.

```{r}
mtd <- rbind(mtd, mtd_svar)
```

At last we sort the elements according to the expected order in the MTD
section using the `mtdSort()` function.

```{r}
mtd <- mtdSort(mtd)
```

This two-column `matrix` could now be saved to a text file using a tabulator
(`"\t"`) as a field separator. The full metadata header is shown in the table
below.

```{r, results = "asis"}
pandoc.table(mtd, style = "rmarkdown", split.table = Inf, justify = "ll")
```

#### Small Molecule Feature (SMF) Table

The *small molecule feature* (SMF) section captures information on the
quantified entities (features) of an experiment. This includes the feature
abundances across assays as well as the feature's *m/z*, retention times and
eventual additional annotations such as the ion or the exact mass. The
`smfCreate()` function compiles and formats this section based on the provided
abundance matrix and feature specifications.

Below we create an example abundance matrix and feature characteristics data
matching the metadata from the previous section. Generally, such information can
be extracted from the result objects of preprocessing software. We first define
the abundance matrix: columns are assays, rows features. Importantly, the number
and order of the assays has to match the *assay* definition in the metadata
(defined above with the `mtdAssay()` function). Our example data consists of
quantification of 7 features in 6 measurements (assays) of 3 samples.

```{r}
abundances <- cbind(c(200.1, 1232.1, 54.3, 399.1, 599.8, 23.1, NA),
                    c(260.2, 39.5, 177.4, 599.5, 5344.1, 332.1, 43.0),
                    c(256.1, 904.2, 56.9, 533.1, 489.9, 3231.22, 23.4),
                    c(232.1, 43.3, 201.4, 434.2, 5154.1, 43.4, 324.3),
                    c(264.2, 1102.4, 43.5, 514.5, 583.1, 432.3, 43.3),
                    c(246.2, 52.1, 187.2, 508.3, 601.5, 432.2, 34.5))
colnames(abundances) <- exp$sample_name
rownames(abundances) <- c("FT01", "FT02", "FT03", "FT04", "FT05",
                          "FT06", "FT07")
```

We next define also a `data.frame` with the feature characteristics from the MS
measurement run (one row per feature and columns with *m/z*, retention time and,
where known, also the adduct information and charge). Note that without any
annotation (and hence a SML and SME section) adduct and charge information will
not be available for the SMF table.

```{r}
feature_info <- data.frame(
    mzmed = c(195.088, 127.1, 299.2, 181.07, 218.077, 343.123, 148.06),
    rtmed = c(25.6, 128.4, 67.2, 127.3, 25.7, 167.2, 76.34),
    rtmin = c(23.1, 125.1, 65.1, 122.3, 23.3, 162.3, 71.3),
    rtmax = c(26.9, 130.3, 69.1, 134.2, 26.8, 172.1, 81.2),
    adduct = c("[M+H]+", NA, NA, "[M+Na]+", "[M+Na]+", "[M+H]+", "[M+H]+"),
    charge = c(1L, NA, NA, 1L, 1L, 1L, 1L)
)
rownames(feature_info) <- rownames(abundances)
```

We can now feed this information to the `smfCreate()` function. In addition to
the predefined, parameters, also additional feature annotations/columns can be
passed to the function through it's `...` parameter. We provide the IDs of the
individual features with `feature_id = `. These are then stored into a column
`"opt_feature_id"`. Note that all parameters **must** be fully named, i.e., `x =
` or `charge = ` since the function does not support positional matching of its
arguments.

```{r}
smf <- smfCreate(
    x = abundances,
    exp_mass_to_charge = feature_info$mzmed,
    retention_time_in_seconds = feature_info$rtmed,
    retention_time_in_seconds_start = feature_info$rtmin,
    retention_time_in_seconds_end = feature_info$rtmax,
    charge = feature_info$charge,
    adduct_ion = feature_info$adduct,
    feature_id = rownames(feature_info))
```

The SMF content is:

```{r}
smf
```

Importantly, `smfCreate()` added a column `"SMF_ID"` with an integer
representing the unique identifier of each feature (row). These IDs can then be
used for referencing between the SML and SME tables.

#### Small Molecule (SML) Table

The Small Molecule (SML) table represents the *final* result of an experiment
that is reported. It contains the abundances of molecules along with their
annotations and abundance summaries for the experiment's study variables. The
content of the SML table is in general a subset of the SMF table, containing
only the annotated features.

Below we define a `data.frame` with annotations for features from the previous
section's SMF table. Such data should be compiled based on the results of an
annotation software or workflow that used the SMF information as input. In our
example, *FT01* and *FT05* are the `"[M+H]+"` and `"[M+Na]+"` ions of caffeine,
*FT04* the `"[M+Na]+"` ion of either glucose or mannose, *FT06* the `"[M+H]+"`
ion of sucrose and *FT07* `"[M+H]+"` ion of DL-glutamate. For *FT02* and *FT03*
no annotation is known. For caffeine we report only one (the *main*) ion in the
table but reference the two features in the SMF table. For the ambiguous
annotation of *FT04* we report both annotations, separated by a `"|"`. The two
features without annotation are not reported.

```{r}
anns <- data.frame(
    id = c("HMDB:HMDB0001847",
           "HMDB:HMDB0000122|HMDB:HMDB0000169",
           "HMDB:HMDB0000258",
           "HMDB:HMDB0060475"),
    formula = c("C8H10N4O2",
                "C6H12O6|C6H12O6",
                "C12H22O11",
                "C5H9NO4"),
    neutral_mass = c(194.0804,
                     "180.0634|180.0634",
                     342.1162,
                     147.0531),
    name = c("caffeine",
             "glucose|mannose",
             "sucrose",
             "DL-glutamate"),
    adduct = c("[M+H]1+",
               "[M+Na]1+",
               "[M+H]1+",
               "[M+H]1+"),
    uri = c("http://www.hmdb.ca/metabolites/HMDB0001847",
            "http://www.hmdb.ca/metabolites/HMDB0000122|http://www.hmdb.ca/metabolites/HMDB0000169",
            "http://www.hmdb.ca/metabolites/HMDB0000258",
            "http://www.hmdb.ca/metabolites/HMDB0060475"),
    note = c("manual curation")
)
```

We next subset the feature abundance matrix for the selected (and annotated)
molecules we want to report.

```{r}
abundances_sml <- abundances[c(1, 4, 6, 7), ]
```

With this information we can use the `smlCreate()` function to compile the SML
table. Note that (again) we **must fully name** all function arguments to which
we pass values. Any additional (named) parameters provided to the function (like
`note = anns$note` below) will be added as *optional* columns (prefixed with
`"opt_"`)

```{r}
sml <- smlCreate(x = abundances_sml,
                 database_identifier = anns$id,
                 chemical_formula = anns$formula,
                 theoretical_neutral_mass = anns$neutral_mass,
                 adduct_ions = anns$adduct,
                 uri = anns$uri,
                 note = anns$note)
sml
```

This SML is however not yet complete. We must update the relationship between
rows in the SML and the SMF section in column `"SMF_ID_REFS"`.

```{r}
sml$SMF_ID_REFS = c("1|5", "4", "6", "7")
```

And finally we need to add columns with abundance average and variation for
study variables defined in the MTD section. Here we can use the
`smlAddStudyVariableColumns()` helper function providing both the SML and
the MTD data.

```{r}
sml <- smlAddStudyVariableColumns(sml, mtd)
sml
```

For each study variable in MTD a *abundance_study_variable* and
*abundance_variation_study_variable* column were added, aggregating the
abundance values from the respective assays with the aggregation and variation
function defined in the MTD section.

## SME

### Reading and importing

TODO: implement these functions moving the respective code from the legacy repo

### General utility functions

General utility functions include:

- `mtdFields()`: to format values in the mzTab-M-specific format.
- `mtdSort()`: to sort rows of the metadata `matrix` into the expected order.
- `parseCvParameter()`: extract elements and values from a CV parameter.
- `isCvParameter()`: checks whether a `character` is in the expected CV
  parameter format.

# Session information

```{r}
sessionInfo()
```