| Title: | Retrieve Mass Spectrometry Data from Metabolomics Workbench |
|---|---|
| Description: | Metabolomics Workbench is one of the main public repositories for storage of metabolomics experiments. The MsBackendMetabolomicsWorkbench package provides functionality to retrieve and represent mass spectrometry (MS) data from Metabolomics Workbench. Data files are downloaded and cached locally, avoiding repeated downloads. MS data from metabolomics experiments can thus be directly and seamlessly integrated into R-based analysis workflows with the Spectra and MsBackendMetabolomicsWorkbench packages. |
| Authors: | Gabriele Tomè [aut, cre] (ORCID: <https://orcid.org/0000-0002-3976-6068>, fnd: MetaRbolomics4Galaxy project (CUP: D53C25001030003) co-funded by the Autonomous Province of Bolzano under the Joint Projects South Tyrol–Germany 2025 program.), Philippine Louail [aut] (ORCID: <https://orcid.org/0009-0007-5429-6846>), Johannes Rainer [aut] (ORCID: <https://orcid.org/0000-0002-6977-7147>) |
| Maintainer: | Gabriele Tomè <[email protected]> |
| License: | Artistic-2.0 |
| Version: | 0.1.4 |
| Built: | 2026-06-09 13:47:58 UTC |
| Source: | https://github.com/rformassspectrometry/msbackendmetabolomicsworkbench |
Utility functions to interact with the Metabolomics Workbench (MWB) repository, including listing, downloading, caching, and querying data files and study metadata.
mwb_cached_data_files(): lists locally cached data files from
Metabolomics Workbench. Since this function evaluates only local content
it does not require an internet connection. With the default parameters all
available data files are listed. The parameters can be used to restrict the
lookup.
mwb_list_files(): returns the available files for the specified
Metabolomics Workbench data set by submitting a POST request to the
Metabolomics Workbench archive contents endpoint. The function returns a
data.frame with columns "zip_file" and "sample_file" containing the
archive name and the file name within that archive.
Parameter pattern allows filtering the results by matching against the
"sample_file" column. This function requires an active internet
connection.
mwb_rest_request(): queries the Metabolomics Workbench REST API for
a given study/analysis ID and output item (e.g. "summary", "factors").
Returns the raw response as a character string in the format specified
by outputFormat ("json" or "txt"). This function requires an active
internet connection.
mwb_ftp_list_files(): queries the Metabolomics Workbench FTP server for a
given experiment ID and returns the related files. Parameter pattern
allows filtering the results. In contrast to mwb_list_files(), this
function lists only the files on the FTP server (like the zip file of the
experiment), while mwb_list_files() lists the files contained within the
zip file. Other files may also be present on the FTP server. This function
requires an active internet connection.
mwb_ftp_download(): downloads files from the Metabolomics Workbench FTP
server for a given experiment ID. Use pattern to filter files by name
using a regular expression (by default all files are downloaded). Use
path to set the destination directory for downloaded files. Only files
listed by mwb_ftp_list_files() can be downloaded.
mwb_metadata(): retrieves the metadata of a given MWB data set as a
list of two data.frames: one with the metadata of the experiment and
one with the sample annotation. If the data set has multiple analysis IDs,
the function combines the metadata of all analysis IDs into a single
data.frame for the experiment and a single data.frame for the sample
annotation. This function requires an active internet connection.
mwb_sync_data_files(): synchronizes the data files of a specified
MWB data set, downloading and locally caching them if needed.
Parameter fileName allows specifying the names of selected data files to
sync.
mwb_delete_cache(): removes all local content for the MWB
data set with ID mwbId. This deletes any locally cached data files
for the specified data set. It does not
change any other data present in the local BiocFileCache.
    mwb_list_files(x = character(), pattern = NULL)

    mwb_rest_request(
      id = character(),
      idType = c("study_id", "analysis_id"),
      outputItem = character(),
      outputFormat = c("json", "txt")
    )

    mwb_ftp_list_files(mwbId = character(), pattern = "*")

    mwb_ftp_download(
      mwbId = character(),
      pattern = "*",
      path = "./",
      overwrite = FALSE
    )

    mwb_metadata(mwbId = character())

    mwb_sync_data_files(
      mwbId = character(),
      pattern = "mzML$|mzml$|CDF$|cdf$|mzXML$",
      fileName = character(),
      ftp_zip = FALSE
    )

    mwb_cached_data_files(
      mwbId = character(),
      pattern = "*",
      fileName = character()
    )

    mwb_delete_cache(mwbId = character())
Arguments: x, pattern, id, idType, outputItem, outputFormat, mwbId, path, overwrite, fileName, ftp_zip.
Metabolomics Workbench provides metadata through a REST API. MS data files can be obtained in two ways:
Download the full zip archive from the FTP server. A POST request to the MWB archive page returns the correct zip archive name for an MWB ID. The archive contains all files of the experiment, which may also include files that are not needed. If only a subset of files is needed, the second option is more efficient.
Download individual files using a two-step POST-based procedure: first query the MWB archive page to get the exact file names, then download each file via a POST request.
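The choice between the two routes can be sketched as a thin wrapper around mwb_sync_data_files(). This is only an illustration: the wrapper name sync_mwb_files() and its whole_archive argument are hypothetical, and it assumes that ftp_zip = TRUE selects the full-archive FTP route while ftp_zip = FALSE selects the per-file POST route, as described for backendInitialize(). Nothing is downloaded until the function is called.

```r
## Hypothetical helper (not part of the package): pick a download route.
## Calling it requires the package and an internet connection.
sync_mwb_files <- function(mwbId, pattern = "mzML$", whole_archive = FALSE) {
    if (!requireNamespace("MsBackendMetabolomicsWorkbench", quietly = TRUE))
        stop("package 'MsBackendMetabolomicsWorkbench' is not installed")
    ## Assumption: ftp_zip = TRUE fetches the complete zip archive via FTP,
    ## ftp_zip = FALSE fetches only the files matching `pattern` via POST.
    MsBackendMetabolomicsWorkbench::mwb_sync_data_files(
        mwbId, pattern = pattern, ftp_zip = whole_archive)
}
```

For a single file of a large study, a call such as sync_mwb_files("ST002115", pattern = "DMSO_01_RP.mzXML$") would avoid fetching the complete archive.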
For mwb_list_files(): data.frame with columns zip_file and
sample_file containing, respectively, the archive name and the name of
the file within that archive.
For mwb_rest_request(): character(1) with the raw REST API response
body, formatted as JSON or plain text depending on outputFormat.
For mwb_sync_data_files() and mwb_cached_data_files(): a
data.frame with the MWB ID, the name(s) and remote and
local file names of the synchronized data files.
For mwb_ftp_list_files(): character with the files on the FTP server for a
specific ID.
For mwb_metadata(): list of two data.frames: one with the metadata of
the experiment and one with the sample annotation.
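To illustrate how a pattern narrows a file listing, the following sketch applies regular expressions to a mock data.frame shaped like the return value of mwb_list_files(). The archive and file names are invented for illustration, and the use of grepl() is an assumption about how such a filter could work, not a statement about the package internals.

```r
## Mock listing with the two documented columns; all values are made up.
files <- data.frame(
    zip_file = "ST000000.zip",
    sample_file = c("raw/A1_pos.mzML", "raw/A2_pos.mzML",
                    "raw/notes.txt", "raw/A1_neg.mzXML"))

## Keep only the entries whose file name contains "A1".
a1_files <- files[grepl("A1", files$sample_file), ]

## Keep only the entries with a supported MS data file ending.
ms_files <- files[grepl("mzML$|mzXML$", files$sample_file), ]
```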
Gabriele Tomè, Johannes Rainer, Philippine Louail
    ## Retrieve available files for the data set ST002115
    mwb_list_files("ST002115")

    ## Retrieve the available files with names matching "A1"
    A1_mzMLfiles <- mwb_list_files("ST000016", pattern = "A1")
    A1_mzMLfiles

    ## Query the REST API for a study summary in JSON format
    mwb_rest_request("ST002115", outputItem = "summary")

    ## List zip file of the data set ST002115
    mwb_ftp_list_files("ST002115")

    ## Download the file with: `mwb_ftp_download("ST002115", path = tempdir())`
MsBackendMetabolomicsWorkbench retrieves and represents mass spectrometry
(MS) data from metabolomics studies stored in the
Metabolomics Workbench repository, a
data resource developed by the NIH Common Fund's Data Repository and
Coordinating Center (DRCC) at the San Diego Supercomputer Center, University
of California San Diego.
The repository provides access to study metadata, processed experimental
results, metabolite structures, and reference compound information through a
RESTful HTTP API, an FTP server, and POST requests. The backend directly extends
the Spectra::MsBackendMzR backend from the Spectra package and hence
supports MS data in mzML, CDF, and mzXML format. Data in other formats cannot
be loaded with MsBackendMetabolomicsWorkbench. Upon initialization with the
backendInitialize() method, the MsBackendMetabolomicsWorkbench backend
fetches and caches study data files locally using Bioconductor's
BiocFileCache package, avoiding repeated queries to the remote repository.
See the help and vignettes of that package for details on cached data
resources. Additional utility functions for management of cached files are
also provided by MsBackendMetabolomicsWorkbench. See help for
mwb_cached_data_files() for more information.
    MsBackendMetabolomicsWorkbench()

    ## S4 method for signature 'MsBackendMetabolomicsWorkbench'
    backendInitialize(
      object,
      mwbId = character(),
      filePattern = "mzML$|CDF$|cdf$|mzXML$",
      ftp_zip = FALSE,
      offline = FALSE,
      ...
    )

    ## S4 method for signature 'MsBackendMetabolomicsWorkbench'
    backendRequiredSpectraVariables(object, ...)

    mwb_sync(x, offline = FALSE)
Arguments: object, mwbId, filePattern, ftp_zip, offline, x.
...: additional parameters; currently ignored.
The backend uses the BiocFileCache package for caching of the data files. These are stored in the default local BiocFileCache cache along with additional metadata that includes the Metabolomics Workbench ID. Note that at present only MS data files in mzML, CDF and mzXML format are supported.
The MsBackendMetabolomicsWorkbench backend defines and provides the additional
spectra variables "mwb_id", "zip_file" and "file_name", which list
the Metabolomics Workbench ID, the original zip file name and the original
data file name on the Metabolomics Workbench FTP server for each individual
spectrum. The "file_name" variable can be used to map the
experiment's samples to the individual data files, and thus to their spectra.
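The mapping through "file_name" can be sketched with base R on mock data. Both data.frames below are invented for illustration: the first mimics per-spectrum variables as the backend provides them, the second is a hypothetical sample annotation whose column names (file_name, group) are assumptions and may differ from what mwb_metadata() returns.

```r
## Mock per-spectrum variables (three spectra from two data files).
spectra_vars <- data.frame(
    mwb_id = "ST000000",
    file_name = c("sample_a.mzML", "sample_a.mzML", "sample_b.mzML"))

## Hypothetical sample annotation, one row per data file.
samples <- data.frame(
    file_name = c("sample_a.mzML", "sample_b.mzML"),
    group = c("control", "treated"))

## Look up, for each spectrum, the row of its sample and copy the group.
idx <- match(spectra_vars$file_name, samples$file_name)
spectra_vars$group <- samples$group[idx]
```

If the annotation stores file names with a directory prefix or a different extension, they would need to be harmonized (for example with basename()) before matching.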
The MsBackendMetabolomicsWorkbench backend is considered read-only and
thus does not support changing m/z and intensity values directly.
For MsBackendMetabolomicsWorkbench(): an instance of
MsBackendMetabolomicsWorkbench.
For backendInitialize(): an instance of
MsBackendMetabolomicsWorkbench with the MS data of the specified
MetabolomicsWorkbench data set.
For backendRequiredSpectraVariables(): character with spectra
variables that are needed for the backend to provide the MS data.
For mwb_sync(): the input MsBackendMetabolomicsWorkbench with
the paths to the locally cached data files updated if
needed.
New instances of the class can be created with the
MsBackendMetabolomicsWorkbench() function. Data is loaded and initialized
using the backendInitialize() function, which accepts parameters mwbId,
filePattern and ftp_zip. mwbId must be the accession of a single
existing Metabolomics Workbench study (e.g. "ST000016"). Optional parameter
filePattern defines the pattern used to filter the file names of the MS
data files and defaults to data files with file endings of supported MS data
formats. Setting the optional parameter ftp_zip = TRUE downloads the complete zip
file of the experiment from the FTP server and extracts the data files
locally, which can be faster than downloading the files individually via POST
request. However, if only a subset of the data files is required, it is more
efficient to download the files separately via POST request with
ftp_zip = FALSE and filePattern set to the desired file name pattern.
backendInitialize() requires an active internet connection, as the function
queries the Metabolomics Workbench via POST request and compares remote file
content against locally cached files before synchronizing any changes or
updates. This behavior can be bypassed with offline = TRUE, in which case
only locally cached content is used.
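A minimal sketch of the online and offline modes follows. The helper name load_mwb_backend() is hypothetical, and the sketch assumes that the backendInitialize() generic is accessible through the Spectra namespace; after attaching MsBackendMetabolomicsWorkbench with library(), the unqualified call used in the package examples works as well. Nothing is downloaded until the function is called.

```r
## Hypothetical helper (not part of the package): initialize the backend,
## optionally using only locally cached files.
load_mwb_backend <- function(mwbId, filePattern, offline = FALSE) {
    if (!requireNamespace("MsBackendMetabolomicsWorkbench", quietly = TRUE))
        stop("package 'MsBackendMetabolomicsWorkbench' is not installed")
    be <- MsBackendMetabolomicsWorkbench::MsBackendMetabolomicsWorkbench()
    ## Assumption: the generic is exported by Spectra.
    Spectra::backendInitialize(be, mwbId = mwbId,
                               filePattern = filePattern,
                               offline = offline)
}
```

A first call with offline = FALSE would populate the cache; later calls with the same mwbId and filePattern and offline = TRUE would then rely on the cached files only.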
The backendRequiredSpectraVariables() function returns the names of the
spectra variables required for the backend to provide the MS data.
The mwb_sync() function can be used to synchronize the local data cache
and ensure that all study data files are locally available. The function
checks the local cache and downloads any missing data files from the
Metabolomics Workbench repository.
To account for transient network failures and high server load on the
Metabolomics Workbench endpoint, download functions automatically retry
failed requests. An error is raised after 5 consecutive failed attempts.
Between each attempt, the function waits for a progressively increasing time
period (5 seconds between the first and second attempt, 10 seconds between
the second and third, and so forth). The sleep time multiplier can be
configured via the "mwb.sleep_mult" option (defaults to 5). An active
internet connection is required for all non-cached operations; use
offline = TRUE in backendInitialize() to suppress remote requests and
rely exclusively on the local BiocFileCache cache.
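The retry scheme described above can be sketched in base R as follows. This is not the package's implementation: the function name retry_with_backoff() and its arguments are hypothetical, and only the waiting rule (attempt number times the multiplier, at most 5 attempts, multiplier read from the "mwb.sleep_mult" option) is taken from the description.

```r
## Hypothetical sketch of retrying with a linearly increasing wait time.
retry_with_backoff <- function(fun, max_attempts = 5L,
                               sleep_mult = getOption("mwb.sleep_mult", 5)) {
    for (attempt in seq_len(max_attempts)) {
        ## Capture an error instead of aborting, so we can retry.
        res <- tryCatch(fun(), error = function(e) e)
        if (!inherits(res, "error"))
            return(res)
        ## Wait 1 x mult after the first failure, 2 x mult after the
        ## second, and so on; no wait after the last attempt.
        if (attempt < max_attempts)
            Sys.sleep(attempt * sleep_mult)
    }
    stop("failed after ", max_attempts, " attempts: ", conditionMessage(res))
}
```

With the default multiplier of 5, a request that never succeeds would wait 5 + 10 + 15 + 20 = 50 seconds in total before the error is raised; setting options(mwb.sleep_mult = 1) would reduce this to 10 seconds.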
Gabriele Tomè, Philippine Louail, Johannes Rainer
    library(MsBackendMetabolomicsWorkbench)

    ## List files of a Metabolomics Workbench data set
    mwb_list_files("ST002115")

    ## Initialize a MsBackendMetabolomicsWorkbench representing the MS
    ## data files of the data set with the ID "ST002115" that match
    ## `filePattern`. This will download and cache the matching files and
    ## subsequently load and represent them in R.
    be <- backendInitialize(MsBackendMetabolomicsWorkbench(), "ST002115",
                            filePattern = "DMSO_01_RP.mzXML$")
    be

    ## The `mwb_sync()` function can be used to ensure that all data
    ## files are available locally. If needed, this function will download
    ## missing data files or update their paths.
    be <- mwb_sync(be)