Safely Store MS Data Objects in a Portable Stash

Introduction

Data objects in R can be serialized to disk in R’s rds or RData format using the base R saveRDS() or save() functions and re-imported using readRDS() or load(), respectively. These R-specific binary formats can, however, not easily be read by other programming languages, which prevents the exchange of R data objects between software or programming languages. The MsStash package defines basic classes and generic methods to export and import mass spectrometry (MS) data objects in various storage formats, aiming to facilitate data exchange between software. The SpectraStash package implements portable data storage formats (stashes) for data classes from the Spectra package, including the Spectra object and its various data backends.
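For reference, the two base R serialization routes mentioned above can be sketched as follows. This is a minimal illustration using only base R; the peaks list is made-up example data and not part of any MS package.

```r
# A small list standing in for an arbitrary R data object (made-up values)
peaks <- list(mz = c(110.07, 115.12), intensity = c(2040, 315))

# saveRDS()/readRDS() serialize a single object to the rds format
rds_file <- tempfile(fileext = ".rds")
saveRDS(peaks, rds_file)
peaks_rds <- readRDS(rds_file)

# save()/load() write named objects to the RData format; load() restores
# them under their original names, here into a separate environment
rdata_file <- tempfile(fileext = ".RData")
save(peaks, file = rdata_file)
env <- new.env()
load(rdata_file, envir = env)

# Both round trips return an object identical to the original
identical(peaks, peaks_rds)
identical(peaks, env$peaks)
```

Both files are binary and tied to R's internal serialization, which is the limitation the stash formats described below are meant to address.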

Installation

The package can be installed with the BiocManager package. To install BiocManager use install.packages("BiocManager") and, after that, BiocManager::install("RforMassSpectrometry/SpectraStash") to install this package.

A stash for Spectra objects

MS data objects can be saved and restored through the saveMsObject() and readMsObject() functions into (or from) MS data stashes. Supported stash formats and their respective parameter objects are:

  • PlainTextParam: storage of data in (a custom) plain text file format.
  • AlabasterParam: storage of MS data using Bioconductor’s alabaster.base framework, with files in HDF5 and JSON format. MS stashes in this format fully support the functions saveObject() and readObject() from alabaster.base.

See also the vignette of the MsStash package for details on the formats and implementation notes.
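To give a feeling for what an alabaster-style stash looks like on disk, the sketch below saves a small DataFrame with saveObject() from alabaster.base and inspects the result with generic tools. The example table and the directory name df_stash_example are made up for illustration, and jsonlite is assumed to be installed (it is a dependency of alabaster.base).

```r
library(alabaster.base)
library(S4Vectors)

# Made-up example data: a small table of spectra variables
df <- DataFrame(msLevel = c(1L, 2L), rtime = c(1.2, 2.4))

# Save to a fresh directory; remove leftovers from earlier runs first
stash_dir <- file.path(tempdir(), "df_stash_example")
unlink(stash_dir, recursive = TRUE)
saveObject(df, stash_dir)

# The stash is a plain directory of files
list.files(stash_dir, recursive = TRUE)

# The OBJECT file is JSON text describing the type of the stored object,
# so it can be parsed by any JSON reader, in R or elsewhere
obj <- jsonlite::fromJSON(file.path(stash_dir, "OBJECT"))
obj$type

# Restore the object in R
df2 <- readObject(stash_dir)
```

Because the metadata is JSON and the data is HDF5, software written in other languages can in principle read the same directory without going through R.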

As an example, we create a Spectra object from two example MS data files provided by the MsDataHub package.

library(Spectra)
library(SpectraStash)
library(MsDataHub)
fls <- c(X20171016_POOL_POS_1_105.134.mzML(),
         X20171016_POOL_POS_3_105.134.mzML())
sps <- Spectra(fls)
sps
## MSn data (Spectra) with 1862 spectra in a MsBackendMzR backend:
##        msLevel     rtime scanIndex
##      <integer> <numeric> <integer>
## 1            1     0.280         1
## 2            1     0.559         2
## 3            1     0.838         3
## 4            1     1.117         4
## 5            1     1.396         5
## ...        ...       ...       ...
## 1858         1   258.636       927
## 1859         1   258.915       928
## 1860         1   259.194       929
## 1861         1   259.473       930
## 1862         1   259.752       931
##  ... 34 more variables/columns.
## 
## file(s):
## 14224b95f897_7859
## 1422343aa99_7860

We next filter the data, restricting it to spectra with a retention time between 20 and 200 seconds and to mass peaks with an m/z between 110 and 120.

sps <- filterRt(sps, c(20, 200))
sps <- filterMzRange(sps, c(110, 120))
sps
## MSn data (Spectra) with 1290 spectra in a MsBackendMzR backend:
##        msLevel     rtime scanIndex
##      <integer> <numeric> <integer>
## 1            1    20.089        72
## 2            1    20.368        73
## 3            1    20.647        74
## 4            1    20.926        75
## 5            1    21.205        76
## ...        ...       ...       ...
## 1286         1   198.649       712
## 1287         1   198.928       713
## 1288         1   199.207       714
## 1289         1   199.486       715
## 1290         1   199.765       716
##  ... 34 more variables/columns.
## 
## file(s):
## 14224b95f897_7859
## 1422343aa99_7860
## Lazy evaluation queue: 1 processing step(s)
## Processing:
##  Filter: select retention time [20..200] on MS level(s)  [Fri Jun 26 13:16:32 2026]
##  Filter: select peaks with an m/z within [110, 120] [Fri Jun 26 13:16:32 2026]

We next store this Spectra object to a SpectraStash using the saveMsObject() function. We use the alabaster format and define the location of the stash with the path parameter of AlabasterParam. For the present example we save it to a temporary folder.

#' Define the location of the stash
d <- file.path(tempfile(), "spectra_stash")

#' Configure the format and location
ap <- AlabasterParam(d)

#' Save the `Spectra` object to the stash
saveMsObject(sps, ap)

The content of the stash folder is:

library(fs)
dir_tree(d)
## /tmp/RtmpLCi5TT/file142227f0070d/spectra_stash
## ├── OBJECT
## ├── _environment.json
## ├── backend
## │   ├── OBJECT
## │   └── spectra_data
## │       ├── OBJECT
## │       └── basic_columns.h5
## ├── metadata
## │   ├── OBJECT
## │   └── list_contents.json.gz
## ├── processing
## │   ├── OBJECT
## │   └── contents.h5
## ├── processing_chunk_size
## │   ├── OBJECT
## │   └── contents.h5
## ├── processing_queue_variables
## │   ├── OBJECT
## │   └── contents.h5
## └── spectra_processing_queue.json

In alabaster format, each slot of the Spectra object is stored in its own subdirectory. Spectra objects don’t handle the MS data themselves, but rely on a MsBackend to provide this data. The MsBackend used by the Spectra object is stored in its own stash located in the backend directory of the SpectraStash. The Spectra object can be restored again with readMsObject():

res <- readMsObject(Spectra(), ap)
res
## MSn data (Spectra) with 1290 spectra in a MsBackendMzR backend:
##        msLevel     rtime scanIndex
##      <integer> <numeric> <integer>
## 1            1    20.089        72
## 2            1    20.368        73
## 3            1    20.647        74
## 4            1    20.926        75
## 5            1    21.205        76
## ...        ...       ...       ...
## 1286         1   198.649       712
## 1287         1   198.928       713
## 1288         1   199.207       714
## 1289         1   199.486       715
## 1290         1   199.765       716
##  ... 25 more variables/columns.
## 
## file(s):
## 14224b95f897_7859
## 1422343aa99_7860
## Lazy evaluation queue: 1 processing step(s)
## Processing:
##  Filter: select retention time [20..200] on MS level(s)  [Fri Jun 26 13:16:32 2026]
##  Filter: select peaks with an m/z within [110, 120] [Fri Jun 26 13:16:32 2026]

We need to specify the type of the object to restore with the first parameter of the function, in our case Spectra(). The full Spectra object was restored, including the processing queue and history.

We can also read (restore) only the MsBackend from the SpectraStash. Since the present stash is in alabaster format, we can use either readMsObject() or readObject() from alabaster.base:

library(alabaster.base)
be <- readObject(file.path(d, "backend"))
be
## MsBackendMzR with 1290 spectra
##        msLevel     rtime scanIndex
##      <integer> <numeric> <integer>
## 1            1    20.089        72
## 2            1    20.368        73
## 3            1    20.647        74
## 4            1    20.926        75
## 5            1    21.205        76
## ...        ...       ...       ...
## 1286         1   198.649       712
## 1287         1   198.928       713
## 1288         1   199.207       714
## 1289         1   199.486       715
## 1290         1   199.765       716
##  ... 25 more variables/columns.
## 
## file(s):
## 14224b95f897_7859
## 1422343aa99_7860

Or using readMsObject():

be <- readMsObject(MsBackendMzR(), AlabasterParam(file.path(d, "backend")))
be
## MsBackendMzR with 1290 spectra
##        msLevel     rtime scanIndex
##      <integer> <numeric> <integer>
## 1            1    20.089        72
## 2            1    20.368        73
## 3            1    20.647        74
## 4            1    20.926        75
## 5            1    21.205        76
## ...        ...       ...       ...
## 1286         1   198.649       712
## 1287         1   198.928       713
## 1288         1   199.207       714
## 1289         1   199.486       715
## 1290         1   199.765       716
##  ... 25 more variables/columns.
## 
## file(s):
## 14224b95f897_7859
## 1422343aa99_7860

Creating self-contained stashes

Our example Spectra object uses an MsBackendMzR backend which keeps only limited information in memory and retrieves the peaks data (i.e., the m/z and intensity values) from the original MS data files on demand. The stash for MsBackendMzR objects therefore contains only the spectra metadata and a reference to the original MS data files, but no peaks data.

If the original MS data files were moved to a different location or if the SpectraStash folder was moved to another computer, the updated path to the raw MS data files would need to be provided with the spectraPath parameter of the readMsObject() function. As an alternative, it is also possible to create a self-contained stash by setting consolidate = TRUE in saveMsObject(). Below we save our Spectra object again, this time into a self-contained stash.

d2 <- file.path(tempdir(), "spectra_stash2")
saveMsObject(sps, AlabasterParam(d2), consolidate = TRUE)

The consolidate = TRUE parameter is passed to the saveMsObject() call of the MsBackend, which, for MsBackendMzR, copies the original MS data files into the stash folder:

dir_tree(d2)
## /tmp/RtmpLCi5TT/spectra_stash2
## ├── OBJECT
## ├── _environment.json
## ├── backend
## │   ├── 1422343aa99_7860
## │   ├── 14224b95f897_7859
## │   ├── OBJECT
## │   └── spectra_data
## │       ├── OBJECT
## │       └── basic_columns.h5
## ├── metadata
## │   ├── OBJECT
## │   └── list_contents.json.gz
## ├── processing
## │   ├── OBJECT
## │   └── contents.h5
## ├── processing_chunk_size
## │   ├── OBJECT
## │   └── contents.h5
## ├── processing_queue_variables
## │   ├── OBJECT
## │   └── contents.h5
## └── spectra_processing_queue.json

Note the two additional files in the backend folder: these are the original MS data files in mzML format. Such a self-contained stash folder allows restoring the full data even if the stash is moved to another file system. Of course, depending on the size of the data set and the respective raw MS data files, the stash folder can become very large.
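To judge how much disk space a stash occupies, the file sizes below its folder can be summed with base R. The helper dir_size() below is a hypothetical convenience function written for this illustration, not part of SpectraStash; it is demonstrated on a made-up folder so that the sketch runs on its own.

```r
# Total size in bytes of all files below a directory
dir_size <- function(path) {
    files <- list.files(path, recursive = TRUE, full.names = TRUE,
                        all.files = TRUE)
    sum(file.size(files))
}

# Demonstration on a made-up folder holding a single file of 1024 bytes
demo_dir <- file.path(tempdir(), "size_demo")
unlink(demo_dir, recursive = TRUE)
dir.create(demo_dir)
writeBin(raw(1024), file.path(demo_dir, "dummy.bin"))
dir_size(demo_dir)
```

Applied to the stashes created in this document, dir_size(d) and dir_size(d2) would allow comparing the reference-only stash with the self-contained one.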

Stashes for Spectra with in-memory backends

In addition to the on-disk backends MsBackendMzR and MsBackendHdf5Peaks, Spectra also defines the in-memory backends MsBackendMemory and MsBackendDataFrame, which keep the full MS data in memory. Below we change the backend of our sps object to MsBackendMemory:

sps <- setBackend(sps, MsBackendMemory())
sps
## MSn data (Spectra) with 1290 spectra in a MsBackendMemory backend:
##        msLevel     rtime scanIndex
##      <integer> <numeric> <integer>
## 1            1    20.089        72
## 2            1    20.368        73
## 3            1    20.647        74
## 4            1    20.926        75
## 5            1    21.205        76
## ...        ...       ...       ...
## 1286         1   198.649       712
## 1287         1   198.928       713
## 1288         1   199.207       714
## 1289         1   199.486       715
## 1290         1   199.765       716
##  ... 34 more variables/columns.
## Lazy evaluation queue: 1 processing step(s)
## Processing:
##  Filter: select retention time [20..200] on MS level(s)  [Fri Jun 26 13:16:32 2026]
##  Filter: select peaks with an m/z within [110, 120] [Fri Jun 26 13:16:32 2026]
##  Switch backend from MsBackendMzR to MsBackendMemory [Fri Jun 26 13:16:33 2026]

We next stash this updated Spectra object, first removing the stash directory of the previous SpectraStash (because overwriting stash directories is not allowed).

#' Remove the existing SpectraStash
unlink(d2, recursive = TRUE)

#' Store the `Spectra` object in alabaster format
saveMsObject(sps, AlabasterParam(d2))

Inspecting the content of the stash folder, we can see a different structure:

dir_tree(d2)
## /tmp/RtmpLCi5TT/spectra_stash2
## ├── OBJECT
## ├── _environment.json
## ├── backend
## │   ├── OBJECT
## │   └── backend
## │       ├── OBJECT
## │       ├── mod_count
## │       │   ├── OBJECT
## │       │   └── contents.h5
## │       ├── peaks.h5
## │       └── spectra_data
## │           ├── OBJECT
## │           └── basic_columns.h5
## ├── metadata
## │   ├── OBJECT
## │   └── list_contents.json.gz
## ├── processing
## │   ├── OBJECT
## │   └── contents.h5
## ├── processing_chunk_size
## │   ├── OBJECT
## │   └── contents.h5
## ├── processing_queue_variables
## │   ├── OBJECT
## │   └── contents.h5
## └── spectra_processing_queue.json

The MS peaks data is now stored within the file peaks.h5, a file in the HDF5 format used by the MsBackendHdf5Peaks backend: saving an in-memory backend first changes the data to an MsBackendHdf5Peaks backend, which is then stored in an additional backend sub-folder of the stash. We can restore the Spectra object with:

readMsObject(Spectra(), AlabasterParam(d2))
## MSn data (Spectra) with 1290 spectra in a MsBackendMemory backend:
##        msLevel     rtime scanIndex
##      <integer> <numeric> <integer>
## 1            1    20.089         1
## 2            1    20.368         2
## 3            1    20.647         3
## 4            1    20.926         4
## 5            1    21.205         5
## ...        ...       ...       ...
## 1286         1   198.649      1286
## 1287         1   198.928      1287
## 1288         1   199.207      1288
## 1289         1   199.486      1289
## 1290         1   199.765      1290
##  ... 25 more variables/columns.
## Lazy evaluation queue: 1 processing step(s)
## Processing:
##  Filter: select retention time [20..200] on MS level(s)  [Fri Jun 26 13:16:32 2026]
##  Filter: select peaks with an m/z within [110, 120] [Fri Jun 26 13:16:32 2026]
##  Switch backend from MsBackendMzR to MsBackendMemory [Fri Jun 26 13:16:33 2026]

In addition, we can restore the MsBackendMemory with:

readMsObject(MsBackendMemory(), AlabasterParam(file.path(d2, "backend")))
## MsBackendMemory with 1290 spectra
##        msLevel     rtime scanIndex
##      <integer> <numeric> <integer>
## 1            1    20.089         1
## 2            1    20.368         2
## 3            1    20.647         3
## 4            1    20.926         4
## 5            1    21.205         5
## ...        ...       ...       ...
## 1286         1   198.649      1286
## 1287         1   198.928      1287
## 1288         1   199.207      1288
## 1289         1   199.486      1289
## 1290         1   199.765      1290
##  ... 25 more variables/columns.

and also the MsBackendHdf5Peaks which is used as the actual data storage format for the in-memory MsBackendMemory (note the double backend sub-folder):

readMsObject(MsBackendHdf5Peaks(),
             AlabasterParam(file.path(d2, "backend", "backend")))
## MsBackendHdf5Peaks with 1290 spectra
##        msLevel     rtime scanIndex
##      <integer> <numeric> <integer>
## 1            1    20.089         1
## 2            1    20.368         2
## 3            1    20.647         3
## 4            1    20.926         4
## 5            1    21.205         5
## ...        ...       ...       ...
## 1286         1   198.649      1286
## 1287         1   198.928      1287
## 1288         1   199.207      1288
## 1289         1   199.486      1289
## 1290         1   199.765      1290
##  ... 25 more variables/columns.
## 
## file(s):
##  peaks.h5
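
As a quick sanity check of such a round trip, values of the original and the restored object can be compared directly. The sketch below does this for a tiny, made-up Spectra object so that it runs independently of the example files. The toy data and the variable names are assumptions for illustration, and the final step assumes the rhdf5 package is available.

```r
library(Spectra)
library(SpectraStash)

# Made-up toy data: two MS1 spectra with two peaks each
df <- S4Vectors::DataFrame(msLevel = c(1L, 1L), rtime = c(1.2, 2.4))
df$mz <- list(c(110.1, 115.2), c(111.3, 118.9))
df$intensity <- list(c(100, 200), c(50, 75))
sps_small <- Spectra(df)   # data is kept in an in-memory backend

# Save to a fresh location and restore again
stash <- file.path(tempfile(), "small_stash")
saveMsObject(sps_small, AlabasterParam(stash))
res_small <- readMsObject(Spectra(), AlabasterParam(stash))

# Compare retention times and peak values before and after the round trip
all.equal(rtime(res_small), rtime(sps_small))
all.equal(unlist(mz(res_small)), unlist(mz(sps_small)))
all.equal(unlist(intensity(res_small)), unlist(intensity(sps_small)))

# List the content of the HDF5 peaks file, if one was written
peaks_file <- list.files(stash, pattern = "^peaks\\.h5$", recursive = TRUE,
                         full.names = TRUE)
if (length(peaks_file))
    rhdf5::h5ls(peaks_file[1])
```

The comparison uses all.equal() rather than identical() to tolerate differences in attributes such as names.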

Session information

sessionInfo()
## R version 4.6.1 (2026-06-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 26.04 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.32.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] alabaster.base_1.13.0 fs_2.1.0              MsDataHub_1.11.5     
##  [4] SpectraStash_0.97.6   MsStash_0.99.0        Spectra_1.23.3       
##  [7] BiocParallel_1.47.0   S4Vectors_0.51.3      BiocGenerics_0.59.7  
## [10] generics_0.1.4        BiocStyle_2.41.0     
## 
## loaded via a namespace (and not attached):
##  [1] KEGGREST_1.53.4          xfun_0.59                bslib_0.11.0            
##  [4] httr2_1.2.3              Biobase_2.73.1           rhdf5_2.57.1            
##  [7] rhdf5filters_1.25.0      vctrs_0.7.3              tools_4.6.1             
## [10] curl_7.1.0               parallel_4.6.1           AnnotationDbi_1.75.0    
## [13] tibble_3.3.1             RSQLite_3.53.2           cluster_2.1.8.2         
## [16] blob_1.3.0               pkgconfig_2.0.3          data.table_1.18.4       
## [19] dbplyr_2.6.0             lifecycle_1.0.5          compiler_4.6.1          
## [22] Biostrings_2.81.3        Seqinfo_1.3.0            codetools_0.2-20        
## [25] ncdf4_1.24               clue_0.3-68              htmltools_0.5.9         
## [28] sys_3.4.3                buildtools_1.0.0         sass_0.4.10             
## [31] yaml_2.3.12              crayon_1.5.3             pillar_1.11.1           
## [34] jquerylib_0.1.4          MASS_7.3-65              cachem_1.1.0            
## [37] MetaboCoreUtils_1.21.1   ExperimentHub_3.3.1      AnnotationHub_4.3.1     
## [40] tidyselect_1.2.1         digest_0.6.39            purrr_1.2.2             
## [43] dplyr_1.2.1              BiocVersion_3.24.0       maketools_1.3.2         
## [46] fastmap_1.2.0            cli_3.6.6                magrittr_2.0.5          
## [49] withr_3.0.3              filelock_1.0.3           rappdirs_0.3.4          
## [52] bit64_4.8.2              XVector_0.53.0           httr_1.4.8              
## [55] rmarkdown_2.31           bit_4.6.0                otel_0.2.0              
## [58] png_0.1-9                memoise_2.0.1            evaluate_1.0.5          
## [61] knitr_1.51               IRanges_2.47.2           BiocFileCache_3.3.0     
## [64] rlang_1.2.0              Rcpp_1.1.1-1.1           glue_1.8.1              
## [67] DBI_1.3.0                mzR_2.47.0               BiocManager_1.30.27     
## [70] alabaster.schemas_1.13.0 jsonlite_2.0.0           R6_2.6.1                
## [73] Rhdf5lib_2.1.0           ProtGenerics_1.39.2      MsCoreUtils_1.25.4