---
title: "Storage Modes of MS Data Objects"
output:
    BiocStyle::html_document:
        toc_float: true
vignette: >
    %\VignetteIndexEntry{Storage Modes of MS Data Objects}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
    %\VignettePackage{MsStash}
    %\VignetteDepends{MsStash,BiocStyle,alabaster.base,fs}
---

```{r, echo = FALSE, message = FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE)
library(BiocStyle)
```

# Introduction

Data objects in R can be serialized to disk in R's *RData* format using the
base R `save()` function and re-imported using the `load()` function (or, for a
single object, in *Rds* format using `saveRDS()` and `readRDS()`). These
R-specific binary formats can, however, not be read by other programming
languages, which prevents the exchange of R data objects between software or
programming languages. The *MsStash* package defines basic classes and generic
methods to export and import mass spectrometry (MS) data objects in various
storage formats, aiming to facilitate data exchange between software. This
includes, among other formats, storage of data objects using Bioconductor's
`r Biocpkg("alabaster.base")` package.

For export or import of MS data objects, the `saveMsObject()` and
`readMsObject()` functions can be used. For `saveMsObject()`, the first
parameter is the MS data object that should be stored; for `readMsObject()`, it
defines the type of MS object that should be restored (returned). The second
parameter `param` defines and configures the storage format of the MS data. The
currently supported formats and the respective parameter objects are:

- `PlainTextParam`: storage of data in (a custom) plain text file format.

- `AlabasterParam`: storage of MS data using Bioconductor's
  `r Biocpkg("alabaster.base")` framework, based on files in HDF5 and JSON
  format.

These storage formats are described in more detail in the following sections.
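
The dispatch on both the data object and the `param` object described above can
be sketched with plain S4 generics. The names used here (`toySave()`,
`toyRead()`, `ToyTextParam` and `ToyRdsParam`) are hypothetical stand-ins
invented for this illustration and are not part of *MsStash*. The sketch only
shows how the class of `param` can select the storage format, assuming a
numeric vector as the data object.

```r
library(methods)

## Hypothetical parameter classes, one per storage format.
setClass("ToyTextParam", slots = c(path = "character"))
setClass("ToyRdsParam", slots = c(path = "character"))

## Hypothetical generics dispatching on the data object and the parameter.
setGeneric("toySave", function(object, param, ...)
    standardGeneric("toySave"))
setGeneric("toyRead", function(object, param, ...)
    standardGeneric("toyRead"))

## Plain text storage: one value per line.
setMethod("toySave", signature("numeric", "ToyTextParam"),
          function(object, param, ...) {
              dir.create(param@path, recursive = TRUE, showWarnings = FALSE)
              writeLines(as.character(object),
                         file.path(param@path, "values.txt"))
              invisible(param@path)
          })
setMethod("toyRead", signature("numeric", "ToyTextParam"),
          function(object, param, ...) {
              as.numeric(readLines(file.path(param@path, "values.txt")))
          })

## R-specific binary storage of the same data.
setMethod("toySave", signature("numeric", "ToyRdsParam"),
          function(object, param, ...) {
              dir.create(param@path, recursive = TRUE, showWarnings = FALSE)
              saveRDS(object, file.path(param@path, "values.rds"))
              invisible(param@path)
          })
setMethod("toyRead", signature("numeric", "ToyRdsParam"),
          function(object, param, ...) {
              readRDS(file.path(param@path, "values.rds"))
          })

## The same calls with a different `param` select a different format.
x <- c(1.4, 1.6, 1.9)
tp <- new("ToyTextParam", path = tempfile("toy_text_"))
rp <- new("ToyRdsParam", path = tempfile("toy_rds_"))
toySave(x, tp)
toySave(x, rp)
from_text <- toyRead(numeric(), tp)
from_rds <- toyRead(numeric(), rp)
```

With this design, supporting a further storage format only requires a new
parameter class and the corresponding methods; existing calls stay unchanged.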

An example use of these functions and parameters:
`saveMsObject(x, param = PlainTextParam(storage_path))` stores an MS data
object assigned to a variable `x` to a directory `storage_path` using the plain
text file format. To restore the data (assuming `x` was an instance of the
`MsExperiment` class):
`readMsObject(MsExperiment(), param = PlainTextParam(storage_path))`.

# Installation

The package can be installed with the *BiocManager* package. To install
*BiocManager* use `install.packages("BiocManager")` and, after that,
`BiocManager::install("RforMassSpectrometry/MsStash")` to install this package.

# Example implementations

To illustrate how the save/read functionality can be implemented for a specific
data class, we first define a simple toy R S4 class to represent the data from
a single mass spectrum. This `MySpectrum` class contains slots to hold the
spectrum's *m/z* and intensity values as well as some (limited) metadata.

```{r}
#' Class definition
setClass("MySpectrum",
         slots = c(mz = "numeric",
                   intensity = "numeric",
                   rtime = "numeric",
                   msl = "integer"),
         prototype = prototype(
             mz = numeric(),
             intensity = numeric(),
             rtime = numeric(),
             msl = integer()))

#' Default constructor function
MySpectrum <- function(mz = numeric(), intensity = numeric(),
                       rtime = numeric(), msl = integer()) {
    stopifnot(length(mz) == length(intensity))
    if (length(mz) && !length(rtime))
        rtime <- NA_real_
    if (length(mz) && !length(msl))
        msl <- NA_integer_
    new("MySpectrum", mz = mz, intensity = intensity,
        rtime = rtime, msl = as.integer(msl))
}
```

We can now create an example `MySpectrum` object.

```{r}
s <- MySpectrum(c(1.4, 1.6, 1.9, 2.56), c(123.1, 1235.3, 12.45, 51.5))
s
```

## Suggested properties of implemented methods

To ensure consistency, a `saveMsObject()` method should:

- first create the directory to which the data should be exported (defined by
  the `path` of the `param` object).
- throw an error if the directory exists or already contains an exported object
  (thus avoiding accidental overwriting and possible data corruption or
  inconsistencies).

Both `saveMsObject()` and `readMsObject()` also support `...`; hence,
additional parameters can be added to an implementation of the generic method
if needed.

```{r}
library(MsStash)
```

## `PlainTextParam`

Storage of MS data objects in *plain* text format aims to support an easy
exchange of data, and in particular analysis results, with external software,
such as [MS-DIAL](https://systemsomicslab.github.io/compms/msdial/main.html) or
[mzmine3](http://mzmine.github.io/download.html). In most cases, the data is
stored as tab-delimited text files, which simplifies the use of the data and
results across multiple programming languages as well as their import into
spreadsheet applications. MS data objects stored in plain text format can also
be fully re-imported into R, thus providing an alternative, and more flexible,
object serialization approach than the R internal *Rds*/*RData* format.

We implement a `saveMsObject()` method for our `MySpectrum` class and the
`PlainTextParam`. This function first creates the required directory and throws
an error if a result file is already stored there. Then it exports the data:
for our example, we store the data of the object in a single text file in a
custom format that we define. The metadata is written to the file first, one
line per metadata item, followed by the *m/z* and intensity values, with each
*m/z*-intensity pair on one line separated by a tab.

```{r}
#' Write example class to a plain text file
setMethod("saveMsObject", signature(object = "MySpectrum",
                                    param = "PlainTextParam"),
          function(object, param) {
              dir.create(path = param@path, recursive = TRUE,
                         showWarnings = FALSE)
              fl <- file.path(param@path, "my_spectrum.txt")
              if (file.exists(fl))
                  stop("Overwriting an existing result object is not ",
                       "supported.")
              ## Write the type of object as a comment followed by the
              ## metadata.
              writeLines(c(paste0("# ", class(object)[1L]),
                           paste0("rtime:", object@rtime),
                           paste0("msl:", object@msl)), con = fl)
              ## Write the peak data, i.e. m/z and intensity values
              write.table(cbind(object@mz, object@intensity), file = fl,
                          sep = "\t", append = TRUE, col.names = FALSE,
                          row.names = FALSE)
          })
```

We next export our example object `s` with the `saveMsObject()` method to a
temporary folder.

```{r}
p <- PlainTextParam(path = file.path(tempdir(), "text_format"))
saveMsObject(s, p)
```

The data was thus exported to this text file. The individual lines are:

```{r}
readLines(file.path(p@path, "my_spectrum.txt"))
```

We next implement the `readMsObject()` method for this class. This function
reads the text file content and assigns the imported values to the different
slots of the `MySpectrum` class.

```{r}
#' Read example object from plain text file storage format
setMethod("readMsObject", signature(object = "MySpectrum",
                                    param = "PlainTextParam"),
          function(object, param) {
              fl <- file.path(param@path, "my_spectrum.txt")
              if (!file.exists(fl))
                  stop("my_spectrum.txt not found in the provided path")
              l <- readLines(fl, n = 3) # read the comment and the metadata
              p <- read.table(fl, sep = "\t", skip = 3)
              MySpectrum(
                  mz = p[, 1L],
                  intensity = p[, 2L],
                  rtime = suppressWarnings(
                      as.numeric(sub("rtime:", "", l[2], fixed = TRUE))),
                  msl = suppressWarnings(
                      as.integer(sub("msl:", "", l[3], fixed = TRUE))))
          })
```

We can now restore our `MySpectrum` object with the `readMsObject()` method
from the exported text file:

```{r}
p <- PlainTextParam(path = file.path(tempdir(), "text_format"))
b <- readMsObject(MySpectrum(), p)
b
```

## `AlabasterParam`

The [alabaster framework](https://github.com/ArtifactDB/alabaster.base) and the
related Bioconductor package `r Biocpkg("alabaster.base")` implement methods to
save a variety of R/Bioconductor objects to on-disk representations based on
standard file formats like HDF5 and JSON.
This ensures that Bioconductor objects can be easily read from other languages
like Python and JavaScript. With `AlabasterParam`, *MsStash* provides a
parameter class to configure saving MS data objects in the *alabaster* storage
format. To enable writing in this format, a `saveMsObject()` method should be
implemented for the MS data object and `AlabasterParam`. To enable full
*alabaster* support, it is also suggested to implement the
`alabaster.base::saveObject` method, a validation method and a function to read
from an alabaster format. For more details refer also to the package vignette
of the `r Biocpkg("alabaster.base")` package, in particular chapter 5
*Extending to new classes*.

Below we define a `saveObject()` method. The generic for this method is defined
in the *alabaster.base* package. While it would be possible to save the data as
simple text files as we did above, we use *alabaster*'s strategy to allow
storage of more complex objects (such as S4 objects in the individual slots).
This uses `altSaveObject()` and `altReadObject()` to save individual slots or
parent/child classes in sub-directories of `path`. For each of these classes, a
`saveObject()` method needs to be defined.

```{r}
library(alabaster.base)

setMethod("saveObject", "MySpectrum", function(x, path, ...) {
    ## Create the directory where to save the data
    dir.create(path = path, recursive = TRUE, showWarnings = FALSE)
    ## Create an "object" file; this defines the type of object stored in path
    saveObjectFile(path, "my_spectrum")
    ## Save each slot into its own directory
    altSaveObject(x@mz, path = file.path(path, "mz"))
    altSaveObject(x@intensity, path = file.path(path, "intensity"))
    altSaveObject(x@rtime, path = file.path(path, "retention_time"))
    altSaveObject(x@msl, path = file.path(path, "ms_level"))
})
```

We next need to implement a *validation function* for the stash (directory).
For our example we simply check that the `path` contains the expected
sub-directories with the object's content.
This function then needs to be registered for our class with the
`registerValidateObjectFunction()` function.

```{r}
#' Define a helper function to check that the folder contains all
#' expected sub-directories.
validateMySpectrum <- function(path, metadata) {
    if (!dir.exists(path))
        stop("Directory ", path, " does not exist")
    req_dir <- c("mz", "intensity", "retention_time", "ms_level")
    if (any(miss <- !dir.exists(file.path(path, req_dir))))
        stop("Required directories ",
             paste0("\"", req_dir[miss], "\"", collapse = ", "),
             " not found in ", path)
}

#' Register the validation function
registerValidateObjectFunction("my_spectrum", validateMySpectrum)
```

Finally we define the function to read the data back from the stash. We then
register this function with *alabaster*'s `registerReadObjectFunction()`
function.

```{r}
#' Define a function that can read from an alabaster-based serialization
#' of `MySpectrum` objects
readMySpectrum <- function(path, metadata, ...) {
    validateMySpectrum(path)
    ## Read the data from individual sub-directories
    mz <- altReadObject(file.path(path, "mz"))
    int <- altReadObject(file.path(path, "intensity"))
    rtime <- altReadObject(file.path(path, "retention_time"))
    msl <- altReadObject(file.path(path, "ms_level"))
    MySpectrum(mz = mz, intensity = int, rtime = rtime, msl = msl)
}

#' Register the read function
registerReadObjectFunction("my_spectrum", readMySpectrum)
```

Registration of the validation and read functions is generally done in the
extension package's `.onLoad()` function.

With these functions defined and registered, we can store an instance of
`MySpectrum` directly with *alabaster*'s `saveObject()` method:

```{r}
#' Define the path where we want to export our data
p <- file.path(tempdir(), "alabaster_export")

#' Save the object
saveObject(s, path = p)
```

This saved the object's content to the directory specified with `path`.
The content of this folder is:

```{r}
library(fs)
dir_tree(p)
```

We can read the serialized object again as a `MySpectrum` object:

```{r}
b <- readObject(p)
b
```

We next implement the `saveMsObject()` and `readMsObject()` methods for
`MySpectrum` and `AlabasterParam`. These can simply re-use the functions
implemented above.

```{r}
#' Write example class to the alabaster-based storage format
setMethod("saveMsObject", signature(object = "MySpectrum",
                                    param = "AlabasterParam"),
          function(object, param) {
              if (file.exists(file.path(param@path, "OBJECT")))
                  stop("'path' already contains an MS data stash. ",
                       "Overwriting is not supported. Please remove ",
                       "'path' first.")
              saveObject(object, param@path)
          })

#' Read example object from the alabaster-based storage format
setMethod("readMsObject", signature(object = "MySpectrum",
                                    param = "AlabasterParam"),
          function(object, param) {
              readMySpectrum(param@path)
          })
```

We can now stash our MS object in either the text file-based format
(`PlainTextParam`) or the alabaster-based format (`AlabasterParam`). Below we
write it using the alabaster approach.

```{r}
p <- file.path(tempdir(), "alabaster_format_2")
ap <- AlabasterParam(p)
saveMsObject(s, ap)
```

To read the data back we can then use `readMsObject()`, additionally specifying
the type of object we want to read.

```{r}
b <- readMsObject(MySpectrum(), ap)
b
```

# Session information

```{r}
sessionInfo()
```