---
title: "Retrieve and Use Mass Spectrometry Data from Metabolomics Workbench"
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{Retrieve and Use Mass Spectrometry Data from Metabolomics Workbench}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\VignettePackage{MsBackendMetabolomicsWorkbench}
  %\VignetteDepends{Spectra,BiocStyle,jsonlite}
---
```{r style, echo = FALSE, results = 'asis', message=FALSE}
BiocStyle::markdown()
```
**Package**: `r Biocpkg("MsBackendMetabolomicsWorkbench")`
**Authors**: `r packageDescription("MsBackendMetabolomicsWorkbench")[["Author"]] `
**Last modified:** `r file.info("MsBackendMetabolomicsWorkbench.Rmd")$mtime`
**Compiled**: `r date()`
```{r, echo = FALSE, message = FALSE}
library(Spectra)
library(BiocStyle)
library(jsonlite)
```
# Introduction
Metabolomics experiments and results, including mass spectrometry (MS) data, can
be deposited in several public repositories, such as the
[Metabolomics Workbench](https://metabolomicsworkbench.org/) repository, a data
resource developed by the NIH Common Fund's Data Repository and Coordinating
Center (DRCC) at the San Diego Supercomputer Center, University of California
San Diego. While the data is publicly available, manual lookup and download are
cumbersome, which hampers the re-analysis of public data and the replication of
results. The *MsBackendMetabolomicsWorkbench* package closes this gap by
providing functionality to query, retrieve and cache MS data from Metabolomics
Workbench directly from R, thus enabling seamless integration of this data into
R-based analysis workflows.

*MsBackendMetabolomicsWorkbench* uses Bioconductor's
`r Biocpkg("BiocFileCache")` for caching remote data locally and provides an *MS
data backend* for the `r Biocpkg("Spectra")` package to enable loading and
integrating cached MS data directly into R.
# Installation
The package can be installed from within R with the commands below:
```{r, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("RforMassSpectrometry/MsBackendMetabolomicsWorkbench")
```
# Importing MS Data from Metabolomics Workbench
Each experiment in Metabolomics Workbench is identified by a unique accession
starting with *ST* followed by a number. The repository provides programmatic
access via the Metabolomics Workbench REST API and POST requests, so users can
query experiments and associated data files directly. With
`MsBackendMetabolomicsWorkbench`, you can resolve these accessions and download
supported MS files (mzML/CDF/mzXML) into a local cache, then load them into a
`Spectra` object for downstream processing.
Below we list all files from Metabolomics Workbench experiment *ST002115*.
```{r}
library(MsBackendMetabolomicsWorkbench)
#' List files of a Metabolomics Workbench data set
all_files <- mwb_list_files("ST002115")
head(all_files)
```
MS data files in supported formats (mzML, CDF, mzXML) can be directly loaded
using the `MsBackendMetabolomicsWorkbench` backend into R as a `Spectra` object
(`MsBackendMetabolomicsWorkbench` directly extends *Spectra*'s `MsBackendMzR`
backend and therefore supports import of MS data files in these formats). There
are two supported download modes:
1. POST request to fetch individual data files. (*default*)
2. FTP request to download the zip containing all files of the experiment.
Below we list the zip file of Metabolomics Workbench experiment *ST002115*.
```{r}
#' List zipped FTP files for a Metabolomics Workbench data set
mwb_ftp_list_files("ST002115")
```
The FTP archive contains all files of the experiment, which may include files
that are not needed. If only a subset of files is required, the default POST
option (`ftp_zip = FALSE`) is more efficient. By default, all MS data files of
the data set are retrieved; in the example below we restrict the import to a few
data files to reduce the amount of data that needs to be downloaded. To this
end, we use the `filePattern` parameter to define a pattern that matches the
names of only some data files.
```{r}
library(Spectra)
#' Load MS data files of one data set
s <- Spectra("ST002115", filePattern = "01_RP.mzXML$", ftp_zip = FALSE,
             source = MsBackendMetabolomicsWorkbench())
s
```
This call downloaded 4 files from the experiment into the local cache and loaded
them as a `Spectra` object. Downloading and caching of the data are handled by
Bioconductor's `r Biocpkg("BiocFileCache")`, so the local cache can also be
managed directly with functionality from that package. Any subsequent loading of
the same data files uses the locally cached versions, thus avoiding repeated
downloads of the same data.
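
The effect of a file name pattern such as the one passed to `filePattern` can be
previewed without downloading anything. The sketch below assumes that the
pattern is interpreted as a regular expression (as the trailing `$` suggests)
and applies it with base R's `grep()`. Apart from `HT1080_DMSO_01_RP.mzXML`, the
file names are hypothetical and only serve as illustration. Note that the dot is
escaped here so that it matches a literal `.` only.

```{r}
#' Example file names; all but the first are made up for illustration
files <- c("HT1080_DMSO_01_RP.mzXML",
           "HT1080_DMSO_02_RP.mzXML",
           "HT1080_DMSO_01_HILIC.mzXML",
           "README.txt")

#' Keep only names ending in "01_RP.mzXML"
pattern <- "01_RP\\.mzXML$"
selected <- grep(pattern, files, value = TRUE)
selected
```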
The `Spectra` object with the MS data files of the Metabolomics Workbench data
set can now be used for any subsequent analysis in R. In addition to the spectra
variables and mass peak data provided by the MS data files, information related
to the Metabolomics Workbench data set is available as specific *spectra
variables*. We list all available spectra variables of the data set below.
```{r}
spectraVariables(s)
```
The Metabolomics Workbench-specific variables are `"mwb_id"`, `"zip_file"` and
`"file_name"`. For each individual spectrum, they provide the Metabolomics
Workbench ID of the data set, the name of the zip file on the FTP server and the
original data file name in Metabolomics Workbench.
```{r}
spectraData(s, c("mwb_id", "zip_file", "file_name"))
basename(s$file_name) |> head()
```
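
Spectra variables like these can be combined with the standard `Spectra`
functions for downstream processing. The following is a minimal sketch on a
small in-memory object, `toy`, that stands in for `s`; all values in it
(retention times, peaks, file names) are made up for illustration. Only
`Spectra()`, `filterMsLevel()` and `rtime()` from `r Biocpkg("Spectra")` are
used.

```{r}
library(Spectra)

#' Toy data standing in for `s`: every value below is hypothetical
spd <- S4Vectors::DataFrame(
    msLevel = c(1L, 2L, 1L),
    rtime = c(1.2, 1.4, 2.1),
    mwb_id = rep("ST002115", 3),
    file_name = c("a_01_RP.mzXML", "a_01_RP.mzXML", "b_01_RP.mzXML"))
spd$mz <- list(c(100.1, 150.2), c(75.0, 100.1), c(100.1, 200.3))
spd$intensity <- list(c(200, 50), c(30, 10), c(180, 40))
toy <- Spectra(spd)

#' Number of spectra per original data file
n_per_file <- table(basename(toy$file_name))
n_per_file

#' Keep only MS1 spectra and report their retention time range
toy_ms1 <- filterMsLevel(toy, 1L)
rt_range <- range(rtime(toy_ms1))
rt_range
```

The same calls should work unchanged on `s`, since `file_name` is available
there as a spectra variable.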
The `mwb_sync()` function can be used to *synchronize* the local content of
a `MsBackendMetabolomicsWorkbench` backend and is useful if, for example,
locally cached files were deleted. The function checks whether all data files of
the backend are available locally and, if needed, downloads and caches the
missing files.
```{r, echo = FALSE}
Sys.sleep(4)
```
```{r}
mwb_sync(s@backend)
```
In addition, it is also possible to *manually* cache and download selected files
from Metabolomics Workbench using the `mwb_sync_data_files()` function. Before
downloading, this function first evaluates if the respective data files are
already cached and only downloads them if needed. As a result, the function
returns a `data.frame` with the storage location and other information of the
cached file(s). Below we use this function to retrieve the local storage
information on one of the data files of the Metabolomics Workbench data set
*ST002115*:
```{r}
res <- mwb_sync_data_files("ST002115",
                           fileName = "HT1080_DMSO_01_RP.mzXML")
res
```
The `mwb_cached_data_files()` function can be used to inspect and list all
locally cached Metabolomics Workbench data files. This function does not require
an active internet connection since only local content is queried. With the
default settings, a `data.frame` with all available data files is returned.
```{r}
mwb_cached_data_files()
```
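
Since caching is handled by `r Biocpkg("BiocFileCache")`, the generic functions
of that package can in principle also be used to inspect or clean a cache. The
sketch below illustrates them on a throw-away cache in a temporary directory, so
it does not touch any real cached data. The resource name used here is
hypothetical; how *MsBackendMetabolomicsWorkbench* names its cache entries, and
which cache location it uses, is not shown in this vignette.

```{r}
library(BiocFileCache)

#' Create a throw-away cache in a temporary directory
bfc <- BiocFileCache(tempfile(), ask = FALSE)

#' Add a placeholder file under a made-up resource name
src <- tempfile(fileext = ".mzXML")
writeLines("placeholder", src)
bfcadd(bfc, rname = "ST002115/demo.mzXML", fpath = src, action = "copy")

#' Find entries by name, then remove them from the cache
hits <- bfcquery(bfc, "ST002115", field = "rname")
n_before <- nrow(hits)
bfcremove(bfc, hits$rid)
n_after <- bfccount(bfc)
c(before = n_before, after = n_after)
```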
Locally cached files for a Metabolomics Workbench data set can be removed with
the `mwb_delete_cache()` function by providing the ID of the data set for which
local data files should be removed.
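
A call could look like the one below. This is only a sketch: it assumes that the
data set ID is passed as the first argument, which should be checked against
`?mwb_delete_cache`. The code is not evaluated here so that the files cached
above remain available.

```{r, eval = FALSE}
#' Remove all locally cached files of data set ST002115
#' (assumption: the ID is the first argument of the function)
mwb_delete_cache("ST002115")
```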
# General use and information retrieval from Metabolomics Workbench
In addition to the `MsBackendMetabolomicsWorkbench` backend for `Spectra`
objects, the *MsBackendMetabolomicsWorkbench* package also provides various
utility functions to query and retrieve information from Metabolomics Workbench.

The `mwb_rest_request()` function queries the Metabolomics Workbench REST API
for a given study/analysis ID and output item (e.g. `summary`, `factors`). It
returns the raw response as a `character` string in the format specified by
`outputFormat` (`json` or `txt`).
Below we query the REST API for the summary of the Metabolomics Workbench data
set *ST002115*:
```{r}
summary <- mwb_rest_request("ST002115", outputItem = "summary",
                            outputFormat = "json")
#' Parse the JSON string into an R list
jsonlite::fromJSON(summary)
```
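
To make the structure of such a request more concrete, the sketch below builds a
URL following the public REST layout of Metabolomics Workbench
(`<context>/<input item>/<input value>/<output item>/<output format>`) and
parses a JSON string with `fromJSON()`. The helper `build_rest_url()` is
hypothetical and not part of the package, and the JSON text is a made-up,
shortened stand-in for a real response. How `mwb_rest_request()` builds its
requests internally is not covered here.

```{r}
library(jsonlite)

#' Hypothetical helper (not part of the package): assemble a REST URL for
#' the "study" context with a study ID as input
build_rest_url <- function(id, outputItem = "summary", outputFormat = "json") {
    paste("https://www.metabolomicsworkbench.org/rest/study/study_id",
          id, outputItem, outputFormat, sep = "/")
}
url <- build_rest_url("ST002115")
url

#' Made-up response text to show the parsing step; real responses
#' contain more fields
json_txt <- '{"study_id": "ST002115", "study_title": "Example title"}'
info <- fromJSON(json_txt)
info$study_id
```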
The `mwb_ftp_download()` function downloads the zip archive of the experiment
directly, i.e., without caching. The example below would download the zip
archive to a temporary folder; the code is not evaluated here to reduce the
amount of data that needs to be downloaded.
```{r, eval = FALSE}
mwb_ftp_download("ST002115", path = tempdir())
```
The `mwb_metadata()` function retrieves the metadata of a given Metabolomics
Workbench data set as a `list` of two `data.frame`s:
- `MS_run`: contains the metadata of the MS runs of the data set, identified
  by the analysis ID(s),
- `sample_annotation`: contains the metadata of the samples of the data set.
  Not all experiments have a column with the associated sample file name; in
  that case the association can be retrieved with the `mwb_list_files()`
  function.
The function handles the case of multiple analysis IDs by combining the metadata
of all analysis IDs into a single `data.frame` for the experiment and a single
`data.frame` for the sample annotation.
Below we retrieve the metadata of the data set *ST002115*:
```{r}
meta <- mwb_metadata("ST002115")
meta$MS_run
meta$sample_annotation
```
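
If the sample annotation lacks a file name column, samples and files have to be
linked by other means. The following sketch shows one possible approach with
base R on two small stand-in tables. All column names (`sample_id`, `treatment`,
`file_name`) and values are hypothetical, and the rule used to derive a sample
ID from a file name will differ between experiments.

```{r}
#' Hypothetical stand-in for a sample annotation without file names
sample_annotation <- data.frame(
    sample_id = c("HT1080_DMSO_01", "HT1080_DMSO_02"),
    treatment = c("DMSO", "DMSO"))

#' Hypothetical stand-in for a file listing
files <- data.frame(
    file_name = c("HT1080_DMSO_01_RP.mzXML", "HT1080_DMSO_02_RP.mzXML"))

#' Derive the sample ID from the file name, then join both tables
files$sample_id <- sub("_RP\\.mzXML$", "", files$file_name)
linked <- merge(sample_annotation, files, by = "sample_id")
linked
```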
# Session information
```{r}
sessionInfo()
```