---
title: "Retrieve and Use Mass Spectrometry Data from MassIVE"
output:
    BiocStyle::html_document:
        toc_float: true
vignette: >
    %\VignetteIndexEntry{Retrieve and Use Mass Spectrometry Data from MassIVE}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
    %\VignettePackage{MsBackendMassIVE}
    %\VignetteDepends{Spectra,BiocStyle}
---

```{r style, echo = FALSE, results = 'asis', message=FALSE}
BiocStyle::markdown()
```

**Package**: `r Biocpkg("MsBackendMassIVE")`
**Authors**: `r packageDescription("MsBackendMassIVE")[["Author"]] `
**Last modified:** `r file.info("MsBackendMassIVE.Rmd")$mtime`
**Compiled**: `r date()`

```{r, echo = FALSE, message = FALSE}
library(Spectra)
library(BiocStyle)
```

# Introduction

Metabolomics experiments and results, including mass spectrometry (MS) data, can be deposited in several public repositories, such as [MassIVE](https://massive.ucsd.edu/ProteoSAFe/static/massive.jsp) (Mass Spectrometry Interactive Virtual Environment). MassIVE is a community resource developed by the NIH-funded Center for Computational Mass Spectrometry at UC San Diego to promote the global, free exchange of mass spectrometry data. MassIVE supports deposition of both proteomics and metabolomics experiments and is a full member of the [ProteomeXchange](http://www.proteomexchange.org/) consortium. While the data is publicly available, manual lookup and download are cumbersome, which hampers the re-analysis of public data and the replication of results.

The *MsBackendMassIVE* package closes this gap by providing functionality to query, retrieve and cache MS data from MassIVE directly from R, enabling a seamless integration of MS data from MassIVE into R-based analysis workflows. *MsBackendMassIVE* leverages Bioconductor's `r Biocpkg("BiocFileCache")` for caching remote data locally and provides an *MS data backend* for the `r Biocpkg("Spectra")` package to enable loading and integrating cached MS data directly into R.

# Installation

The package can be installed from within R with the commands below:

```{r, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("RforMassSpectrometry/MsBackendMassIVE")
```

# Importing MS Data from MassIVE

Each experiment in MassIVE is identified by a unique accession starting with *MSV* followed by a number. While the [MassIVE web page](https://massive.ucsd.edu/ProteoSAFe/) allows only a manual, non-programmatic lookup of data and experiments, a separate, central registry of data files and experiments is hosted on GNPS2.
This [datasetcache](https://datasetcache.gnps2.org/datasette/database/filename) registry allows programmatic access and is used by *MsBackendMassIVE* to query information on MassIVE experiments. Below we list all files from the MassIVE data set with the ID *MSV000080547*.

```{r}
library(MsBackendMassIVE)

#' List files of a MassIVE data set
all_files <- massive_list_files("MSV000080547")
head(all_files)
```

These files are accessible through the FTP path associated with the MassIVE data set. Below we use the `massive_ftp_path()` function to return the FTP path for our test data set.

```{r}
massive_ftp_path("MSV000080547", mustWork = FALSE)
```

MS data files in supported formats (mzML, CDF, mzXML) can be loaded directly into R as a `Spectra` object using the `MsBackendMassIVE` backend (`MsBackendMassIVE` directly extends *Spectra*'s `MsBackendMzR` backend and therefore supports import of MS data files in these formats). By default, all MS data files of the data set would be retrieved, but in the example below we restrict the import to a few data files to reduce the amount of data that needs to be downloaded. To this end, we use the `filePattern` parameter to define a pattern that matches the names of only some data files.

```{r}
library(Spectra)

#' Load MS data files of one data set
s <- Spectra("MSV000080547", filePattern = "1.mzML$",
             source = MsBackendMassIVE())
s
```

This call downloaded 2 files from the experiment into the local cache and loaded them as a `Spectra` object. The downloading and caching of the data is handled by Bioconductor's `r Biocpkg("BiocFileCache")`. The local cache can thus also be managed directly using functionality from that package. Any subsequent loading of the same data files will use the locally cached versions, thus avoiding repeated downloads of the same data.

The `Spectra` object with the MS data files of the MassIVE data set now enables any subsequent analysis of the data in R.
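
As a brief illustration of such a subsequent analysis, the sketch below applies a few standard *Spectra* functions (`msLevel()`, `filterMsLevel()` and `rtime()`). To keep it independent of any download, it operates on a small in-memory `Spectra` object named `toy` that is filled with made-up values. Both the object name and the values are assumptions for illustration only and are not part of the MassIVE data set; the same calls should apply to the object `s` created above.

```{r}
library(Spectra)

#' Toy data (hypothetical values) standing in for a MassIVE-derived object
spd <- S4Vectors::DataFrame(msLevel = c(1L, 2L, 2L),
                            rtime = c(1.2, 1.4, 1.6))
spd$mz <- list(c(100.1, 200.2), c(50.5, 75.1), c(60.2, 80.3))
spd$intensity <- list(c(10, 20), c(5, 8), c(3, 9))
toy <- Spectra(spd)

#' Number of spectra per MS level
table(msLevel(toy))

#' Restrict to MS2 spectra and extract their retention times
toy_ms2 <- filterMsLevel(toy, 2L)
rtime(toy_ms2)
```

Filter functions such as `filterMsLevel()` return a new `Spectra` object, so the original object remains unchanged and several filtering steps can be chained.
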
In addition to the spectra variables and mass peak data provided by the MS data files, information related to the MassIVE data set is available as specific *spectra variables*. We list all available spectra variables of the data set below.

```{r}
spectraVariables(s)
```

The MassIVE-specific variables are `"massive_id"` and `"data_file"`, which provide, for each individual spectrum, the MassIVE ID of the data set and the original data file name on the MassIVE FTP server.

```{r}
spectraData(s, c("massive_id", "data_file"))
basename(s$data_file) |>
    head()
```

The `massive_sync()` function can be used to *synchronize* the local content of a `MsBackendMassIVE` and is useful if, for example, locally cached files were deleted. The function checks whether all data files of the backend are available locally and, if needed, downloads and caches any missing files.

```{r}
massive_sync(s@backend)
```

In addition, it is also possible to *manually* download and cache selected files from MassIVE using the `massive_sync_data_files()` function. Before downloading, this function first checks whether the respective data files are already cached and only downloads them if needed. The function returns a `data.frame` with the storage location and other information on the cached file(s). Below we use this function to retrieve the local storage information on one of the data files of the MassIVE data set *MSV000080547*:

```{r}
res <- massive_sync_data_files("MSV000080547",
                               fileName = "AG_spiked_sample11.mzML")
res
```

The `massive_cached_data_files()` function can be used to inspect and list all locally cached MassIVE data files. This function does not require an active internet connection since only local content is queried. With the default settings, a `data.frame` with all available data files is returned.

```{r}
massive_cached_data_files()
```

Locally cached files for a MassIVE data set can be removed with the `massive_delete_cache()` function by providing the ID of the MassIVE data set for which local data files should be removed.

# General use and information retrieval from MassIVE

In addition to the `MsBackendMassIVE` backend for `Spectra` objects, the *MsBackendMassIVE* package also provides various utility functions to query and retrieve information from MassIVE or GNPS2's *datasetcache*.

The `massive_param_file()` function reads the parameter file of a MassIVE data set, which provides general, experiment-specific information. This information is returned as a two-column `data.frame`, with the first column containing the names of the data set properties and the second their values.

```{r}
prm <- massive_param_file("MSV000080547")
head(prm)
```

The `massive_number_files()` function returns the number of files in a MassIVE data set. By default, only MS data files (mzML, CDF, mzXML) are counted, but this can be changed by providing a different pattern with the `pattern` parameter.

```{r}
massive_number_files("MSV000080547")
```

The `massive_download_file()` function allows downloading any file of an experiment directly, i.e., without caching. As an example, we download a docx file to a temporary folder below.

```{r}
massive_list_files("MSV000083058") |>
    head()

massive_download_file("MSV000083058",
                      fileName = "README_Histones_P108_VS3.docx",
                      path = tempdir())
```

*MsBackendMassIVE* also provides two utility functions to query the GNPS2 *datasetcache*: `gnps2_query()` and `gnps2_usi_download_link()`. Below we use `gnps2_query()` to retrieve all information for a MassIVE data set from the datasetcache.

```{r}
res <- gnps2_query("MSV000083058")
head(res)
```

The result of the `gnps2_query()` function can also be used to compute the total size of data files in a MassIVE data set. To this end, the `filepath_pattern` parameter can be used to restrict the query to specific files.

```{r}
files_info <- gnps2_query("MSV000083058", filepath_pattern = "mzML$")
size_gb <- round(sum(files_info$size)/(2^30), 2)
message("Total size of mzML files in data set MSV000083058: ", size_gb, " GB")
```

The `gnps2_usi_download_link()` function returns a fully qualified link to a data file (listed in the GNPS2 datasetcache), based on its USI.

```{r}
gnps2_usi_download_link(res$usi[4])
```

# Session information

```{r}
sessionInfo()
```