Package 'MsDataHub' reference manual

Title:	Mass Spectrometry Data on ExperimentHub
Description:	The MsDataHub package uses the ExperimentHub infrastructure to distribute raw mass spectrometry data files, peptide spectrum matches or quantitative data from proteomics and metabolomics experiments.
Authors:	Laurent Gatto [aut, cre] (ORCID: <https://orcid.org/0000-0002-1520-2268>), Kristina Gomoryova [ctb] (ORCID: <https://orcid.org/0000-0003-4407-3917>), Johannes Rainer [aut] (ORCID: <https://orcid.org/0000-0002-6977-7147>), Guillaume Deflandre [ctb] (ORCID: <https://orcid.org/0009-0008-1257-2416>)
Maintainer:	Laurent Gatto <[email protected]>
License:	Artistic-2.0
Version:	1.11.5
Built:	2026-07-21 07:52:14 UTC
Source:	https://github.com/rformassspectrometry/msdatahub

Ai et al (2025) single-cell data

Description

Single-cell proteomics captures the Proteome Heterogeneity in Human iPSC-Derived Cardiomyocytes and Adult Cardiomyocytes.

Project description (from MassIVE): Human induced pluripotent stem cell (IPSC)-derived cardiomyocytes (iCMs) have become important tools to model cardiovascular diseases and drug toxicology. Despite suggested transcriptomic heterogeneity in both iPSC and iCMs, the cellular proteome heterogeneity is poorly understood. Using cutting-edge single cell proteomics, we quantify the maturation from IPSC to iCMs and observed two distinct populations of iCMs with different metabolism, which recapitulates the single adult cardiomyocyte proteome populations albeit less mature.

The two DIA-NN report files, downloaded from the MassIVE dataset MSV000094438 (doi:10.25345/C5T727S7Q) and redistributed here, are:

Adult cardiomyocyte (aCMs): 299 cells
iPSC-derived cardiomyocytes (iCMs): 2184 cells

Dataset license: CC0 1.0 Universal (CC0 1.0)

Author(s)

EuBIC 2025 developer meeting SCP hackathon members

References

Ai, Lizhuo, Aleksandra Binek, Vladimir Zhemkov, Jae Hyung Cho, Ali Haghani, Simion Kreimer, Edo Israely, et al. 2025. “Single Cell Proteomics Reveals Specific Cellular Subtypes in Cardiomyocytes Derived from Human iPSCs and Adult Hearts.” Mol. Cell. Proteomics, no. 100910 (January): 100910. https://doi.org/10.1016/j.mcpro.2025.100910.

DIA benchmarking data

Description

These data were generated from the publicly available DIA benchmarking dataset of Gotti et al. (2021). A subset of the raw data, namely the files containing "overlapped" in the File.Name, was searched using the DIA-NN software, and the resulting report.tsv (here labelled 'benchmarkingDIA.tsv') is provided.

The dataset contains 8 conditions, each a mix of E. coli and Universal Standard Protein-1 (UPS1) peptides. Per 1 µg of E. coli protein (equal in all samples), UPS1 proteins are diluted to a final concentration of 50, 25, 10, 5, 2.5, 1, 0.25 and 0.1 fmol.

Each sample was prepared in 3 replicates, so altogether there are 24 samples in the dataset.
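
The design described above (8 spike-in levels, 3 replicates each) can be laid out as a small table. This is a minimal base-R sketch; the column names `replicate` and `ups1_fmol` are illustrative assumptions and do not necessarily match the sample annotations found in the DIA-NN report.

```r
# UPS1 spike-in levels (fmol per 1 ug of E. coli protein), as listed above
ups1_fmol <- c(50, 25, 10, 5, 2.5, 1, 0.25, 0.1)

# Cross every spike-in level with the three replicates
# (column names are hypothetical, chosen for illustration only)
design <- expand.grid(replicate = 1:3, ups1_fmol = ups1_fmol)

nrow(design)              # total number of samples in the design
table(design$ups1_fmol)   # number of replicates per condition
```

A table of this shape is a convenient starting point for annotating samples before statistical modelling.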

Author(s)

Kristina Gomoryova and Laurent Gatto

References

Gotti C, Roux-Dalvai F, Joly-Beauparlant C, Mangnier L, Leclercq M, Droit A. Extensive and Accurate Benchmarking of DIA Acquisition Methods and Software Tools Using a Complex Proteomic Standard. J Proteome Res. 2021 Oct 1;20(10):4801-4814. doi: 10.1021/acs.jproteome.1c00490. Epub 2021 Sep 2. PMID: 34472865.

Boekweg et al (2022) SCP, bulk and identification data

Description

Features of Peptide Fragmentation Spectra in Single-Cell Proteomics

Project description (from MassIVE):

SCP: This dataset contains 3 data types: trace samples consisting of 2 ng and 0.2 ng aliquots of HeLa protein digest standard, and single HeLa cells. Pierce™ HeLa protein digest standard and formic acid were purchased from Thermo Fisher Scientific (Waltham, MA). Mobile phase A (0.1% formic acid in water) and mobile phase B (0.1% formic acid in acetonitrile) were respectively prepared from LC-MS grade water and acetonitrile purchased from Honeywell (Charlotte, NC). The digest standard was reconstituted to a final concentration of 200 ng/µL with 100 µL of mobile phase A to form a stock solution. For the experiments, the stock samples were further diluted to 0.2 and 2 ng/µL using the same mobile phase. HeLa cells were cultured from cells purchased from American Type Culture Collection (Manassas, VA). Single HeLa cells were prepared using the nanoPOTS workflow and analyzed by manual LC injection as described previously (29797682) except that cells were isolated into nanowells using the CellenONE platform (Lyon, France) instead of by fluorescence activated cell sorting. Columns: 30-µm-i.d. fused silica capillary columns from Polymicro (Phoenix, AZ) were packed with different materials: Jupiter C18 3.0 µm, 300 Å particles and Kinetex C18 core shell particles of 1.3 µm, 100 Å were purchased from Phenomenex (Torrance, CA); BEH C18, 1.7 µm, 130 Å was from Waters (Milford, MA). Column lengths were adjusted to keep the pressure and the linear velocity constant for all columns. The lengths were 50, 9 and 16 cm for Jupiter, Kinetex and BEH columns respectively. Solid-phase-extraction (SPE) columns were prepared by packing Jupiter C18 particles into 100-µm-i.d. × 5-cm-long fused silica capillaries. The file names contain the sample size and LC packing material. doi:10.25345/C5NV69

Bulk: This data was originally uploaded to pride project PXD011163. More details can be found there. Cells were lysed, reduced, and alkylated in lysis buffer (1% SDC, 10 mM TCEP, 40 mM CAA, and 100 mM TRIS, pH 8.0) supplemented with complete EDTA-free protease inhibitor mixture and phosSTOP phosphatase inhibitor mixture. Cells were heated for 5 min at 95 C, sonicated with a Bioruptor Plus, and diluted 1:10 with 50 mM ammonium bicarbonate, pH 8.0. Proteins were digested overnight at 37 C with trypsin and Lys-C (enzyme:substrate ratio of 1:50 and 1:75). SDC was precipitated by acidification to 5% of formic acid. Samples were desalted using Sep-Pak C18 cartridges and directly subjected to phosphopeptide enrichment. Samples for proteome analysis were instead dried down and stored at -80 C until nLC-MS analysis. Phosphopeptides enrichment was performed using Fe(III)-NTA in an automated fashion using the AssayMAP Bravo Platform. Reversed phase nLC-MS/MS analysis was performed with an Agilent 1290 Infinity UHPLC system coupled to an Orbitrap Q Exactive Plus mass spectrometer, or Orbitrap Fusion mass spectrometer for the phosphoproteome analysis. The UHPLC was equipped with a double frit trapping column (Reprosil C18, 3 um, 2 cm x 100 um) and a single frit analytical column (Poroshell EC-C18, 2.7 um, 50 cm x 75 um). Trapping was performed in solvent A (0.1% FA in water) at 5 uL/min, while for the elution the flow rate was passively split to 300 nL/min. The linear gradient was as follows: 13-40% solvent B (0.1% FA in 80% ACN) in 220 min, or 8-32% in 95 min for phosphopeptide analysis. Total analysis time was 235 min for the proteome samples and 110 min for the phosphoproteome samples. The mass spectrometers were operated in data-dependent mode. The Orbitrap Q Exactive Plus full-scan MS spectra from m/z 375-1600 were acquired at a resolution of 35000 (FWHM) after accumulation to a target value of 3e6. 
Up to 10 most intense precursor ions were selected for fragmentation, with the isolation window set to 1.5 m/z. HCD fragmentation was performed at normalized collision energy of 25% after the accumulation to a target value of 5e4. MS/MS was acquired at a resolution of 17500 (FWHM). The Orbitrap Fusion full-scan MS spectra from m/z 375-1500 were acquired at a resolution of 120000 (FWHM) after accumulation to a target value of 4e5. The most intense peptide ions fitting within a 3 s cycle were selected for HCD fragmentation, with the isolation window set to 1.6 m/z, and a normalized collision energy of 30%, after the accumulation to a target value of 5e4. MS/MS was acquired at a resolution of 30000 (FWHM). doi:10.25345/C5BN6F

The mzML files, downloaded from the MassIVE datasets MSV000087524 (doi:10.25345/C5NV69) (SCP) and MSV000087689 (doi:10.25345/C5BN6F) (bulk) and redistributed here, are:

D19_15um30cm_SC1.mzML
OR11_20160122_PG_HeLa_CVB3_CT_A.mzML

The identification files were created with the search engine Sage and uploaded to Zenodo (DOI:10.5281/zenodo.19370231), following the experiment's guidelines:

D19_15um30cm_SC1.sage.tsv
OR11_20160122_PG_HeLa_CVB3_CT_A.sage.tsv

Dataset license: CC0 1.0 Universal (CC0 1.0)

Author(s)

Guillaume Deflandre

References

Boekweg, Hannah, Daisha Van Der Watt, Thy Truong, S. Madisyn Johnston, Amanda J. Guise, Edward D. Plowey, Ryan T. Kelly, and Samuel H. Payne. 2022. “Features of Peptide Fragmentation Spectra in Single-Cell Proteomics.” Journal of Proteome Research 21 (1): 182–88.

MS data in CDF format

Description

This data set represents a single CDF file in (AIA/ANDI) NetCDF format from a larger experiment in which the metabolic consequences of knocking out the fatty acid amide hydrolase (FAAH) gene in mice were investigated. The file contains data in centroid mode acquired in positive ion mode from 200-600 m/z and 2500-4500 seconds.

Data file:

ko15.CDF file in NetCDF format.

References

Saghatelian, A et al. Assignment of endogenous substrates to enzymes by global metabolite profiling, Biochemistry, 2004. http://dx.doi.org/10.1021/bi0480335

CE-MS data files

Description

The CE-MS test files consist of two files, "CEMS_10ppm.mzML" and "CEMS_25ppm.mzML". The data contain CE-MS runs of a standard mixture that contains e.g. Lysine (at 10 ppm and 25 ppm, respectively) and the neutral EOF marker Paracetamol (50 ppm). The data were acquired on a 7100 capillary electrophoresis system from Agilent Technologies, coupled to an Agilent 6560 IM-QToF-MS. CE separation was performed using an 80 cm fused silica capillary with an internal diameter of 50 µm and external diameter of 365 µm. The background electrolyte was 10% acetic acid and separation was performed at +30 kV and a constant pressure of 50 mbar. MS detection was performed in positive ionization mode.

The raw data were then converted to the open-source .mzML format (Proteowizard). To reduce data size, the test data was subset to a retention time range from 400 to 900 seconds and an m/z range from 147.1 to 152.0.
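
The kind of subsetting described above (keeping only a time window and an m/z window) can be sketched in base R as follows. This is an illustration under stated assumptions, not the procedure used to produce the files: the peak table, its column names (`rtime`, `mz`, `intensity`) and the helper `subset_peaks()` are all hypothetical, and the values are randomly generated.

```r
# Hypothetical peak table: one row per peak (values are made up)
set.seed(1)
peaks <- data.frame(
    rtime     = runif(1000, min = 0, max = 1500),  # seconds
    mz        = runif(1000, min = 50, max = 500),
    intensity = rexp(1000)
)

# Keep only the peaks inside a time window and an m/z window
# (both windows are closed intervals)
subset_peaks <- function(x, rt_range, mz_range) {
    keep <- x$rtime >= rt_range[1] & x$rtime <= rt_range[2] &
        x$mz >= mz_range[1] & x$mz <= mz_range[2]
    x[keep, , drop = FALSE]
}

# Windows quoted in the description above
small <- subset_peaks(peaks, rt_range = c(400, 900),
                      mz_range = c(147.1, 152.0))
range(small$rtime)
range(small$mz)
```

Restricting both dimensions at once is what keeps test files small while preserving the signal of interest.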

Files:

"CEMS_10ppm.mzML"
"CEMS_25ppm.mzML"

Data location DOI:10.5281/zenodo.18481720

Author(s)

Liesa Salzer

CPTAC label-free data

Description

This case study is a subset of the data of the 6th study of the Clinical Proteomic Technology Assessment for Cancer (CPTAC) (Paulovich et al. 2010). In this experiment, the authors spiked the Sigma Universal Protein Standard mixture 1 (UPS1) containing 48 different human proteins in a protein background of 60 µg/µL Saccharomyces cerevisiae strain BY4741.

Five different spike-in concentrations were used:

6A: 0.25 fmol UPS1 proteins/µL
6B: 0.74 fmol UPS1 proteins/µL
6C: 2.22 fmol UPS1 proteins/µL
6D: 6.67 fmol UPS1 proteins/µL
6E: 20 fmol UPS1 proteins/µL

Three replicates are available for each concentration.
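
Because the yeast background is constant, only the UPS1 proteins are expected to change between groups, and the expected fold changes follow directly from the concentrations listed above. The sketch below computes them in base R; the variable names are illustrative and not part of the package.

```r
# Spike-in concentrations (fmol UPS1 proteins per microlitre), as listed above
conc <- c("6A" = 0.25, "6B" = 0.74, "6C" = 2.22, "6D" = 6.67, "6E" = 20)

# Fold change between consecutive groups (6B/6A, 6C/6B, ...)
step_fc <- conc[-1] / conc[-length(conc)]
names(step_fc) <- paste(names(conc)[-1], names(conc)[-length(conc)], sep = "/")
round(step_fc, 2)

# Expected log2 fold change of every group relative to 6A
log2_vs_A <- log2(conc / conc["6A"])
round(log2_vs_A, 2)
```

Each step is close to a 3-fold increase, so these values can serve as the ground truth when assessing differential abundance results for the spiked-in proteins.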

The data were searched with MaxQuant version 1.5.2.8 (Cox et al. 2008) including matching between runs. Detailed search settings were described in Goeminne et al. (2016).

Three files are readily available as tab-delimited spreadsheets:

cptac_a_b_peptides.txt: triplicates from lab 3 for groups 6A and 6B.
cptac_a_b_c_peptides.txt: triplicates from labs 1, 2 and 3 for groups 6A, 6B and 6C.
cptac_peptides.txt: triplicates from labs 1, 2, and 3 for all groups.

Author(s)

Laurent Gatto and Lieven Clement

References

Paulovich, Amanda G, Dean Billheimer, Amy-Joan L Ham, Lorenzo Vega-Montoto, Paul A Rudnick, David L Tabb, Pei Wang, et al. 2010. Interlaboratory Study Characterizing a Yeast Performance Standard for Benchmarking LC-MS Platform Performance. Mol. Cell. Proteomics 9 (2): 242–54.
Cox, J, and M Mann. 2008. MaxQuant Enables High Peptide Identification Rates, Individualized p.p.b.-Range Mass Accuracies and Proteome-Wide Protein Quantification. Nat Biotechnol 26 (12): 1367–72. https://doi.org/10.1038/nbt.1511.
Goeminne, LJ, Gevaert K and Clement, L. 2016. Peptide-level Robust Ridge Regression Improves Estimation, Sensitivity, and Specificity in Data-dependent Quantitative Label-free Shotgun Proteomics, Mol Cell Proteomics, 15:2 657-668.

Contaminant and cRAP databases

Description

These three fasta files are widely used proteomics contaminant databases. The files are:

crap_gpm.fasta: the common Repository of Adventitious Proteins (cRAP) from the Global Proteome Machine (GPM) organisation.
crap_ccp.fasta: Cambridge Centre for Proteomics' own cRAP fasta database.
crap_maxquant.fasta.gz: MaxQuant's contaminant database.

These files are extracted from the camprotR package and described in the cRAP databases vignette (see References).

These files are added to the MsDataHub package via the corresponding Zenodo repository to facilitate re-use with minimal dependencies and to avoid repeated downloads through caching.

All credit for compiling the fasta files goes to Charlotte Dawson, maintainer of the camprotR package.

Author(s)

Laurent Gatto

References

cRAP databases vignette: https://cambridgecentreforproteomics.github.io/camprotR/articles/crap.html
cRAP protein sequences (GPM): https://www.thegpm.org/crap/
camprotR package: https://cambridgecentreforproteomics.github.io/camprotR/index.html
Gatto, L. (2025). Proteomics contaminant databases (1.0). Zenodo. https://doi.org/10.5281/zenodo.15115102

FTICR-MS data files

Description

Direct injection Fourier-transform ion cyclotron resonance (FTICR) mass spectrometry (MS) data files.

Files:

HAM004_641fE_14-11-07--Exp1.extracted.mzML
HAM004_641fE_14-11-07--Exp2.extracted.mzML
HAM004_641fE_14-11-07--Exp3.extracted.mzML
HAM004_641fE_14-11-07--Exp4.extracted.mzML
HAM004_641fE_14-11-07--Exp5.extracted.mzML
HAM005_641fE_14-11-07--Exp1.extracted.mzML
HAM005_641fE_14-11-07--Exp2.extracted.mzML
HAM005_641fE_14-11-07--Exp3.extracted.mzML
HAM005_641fE_14-11-07--Exp4.extracted.mzML
HAM005_641fE_14-11-07--Exp5.extracted.mzML

Data location: DOI:10.5281/zenodo.18494294

Author(s)

Mark Haid

Misc DDA and DIA datasets

Description

Various DDA and DIA experiments that were processed with different software to illustrate the QFeatures::readQFeatures() function:

vanPuyvelde_2022_LFQ_*: DDA and DIA data from van Puyvelde et al. (2022). LFQ benchmarking experiment.
Christoforou_2016_TMT_DDA_*: data from Christoforou et al. (2016). hyperLOPIT spatial proteomics experiment.
Derks_2022_plex_DIA_DIANN_report_subset.tsv: subset from Derks et al. (2022) plexDIA data.

Source files are available on Zenodo: https://zenodo.org/records/19137577

Author(s)

Kristina Gomoryova and Laurent Gatto

References

Van Puyvelde, B., Daled, S., Willems, S. et al. A comprehensive LFQ benchmark dataset on modern day acquisition strategies in proteomics. Sci Data 9, 126 (2022). https://doi.org/10.1038/s41597-022-01216-6.
Christoforou, A., Mulvey, C., Breckels, L. et al. A draft map of the mouse pluripotent stem cell spatial proteome. Nat Commun 7, 9992 (2016). https://doi.org/10.1038/ncomms9992
Derks, J., Leduc, A., Wallmann, G. et al. Increasing the throughput of sensitive proteomics by plexDIA. Nat Biotechnol 41, 50–59 (2023). https://doi.org/10.1038/s41587-022-01389-w

Multiple Reaction Monitoring mode (MRM) test files

Description

MRM-standmix-5.mzML: sample from mouse brain acquired by HILIC ESI-QqQ/MS in dynamic multiple reaction monitoring (MRM) mode. The HPLC system was a 1290 Infinity (Agilent Technologies) coupled to an ion-funnel triple quadrupole 6490 mass spectrometer (Agilent Technologies). This file was contributed by Xavi Domingo-Almenara from The Scripps Research Institute, San Diego, CA.

Data location: DOI:10.5281/zenodo.18502866

Author(s)

Xavi Domingo-Almenara

MS3 SPS TMT data

Description

MS3TMT10_01022016_32917-33481.mzML.gz: A subset of 565 spectra from a currently unpublished TMT 10-plex experiment run on a Thermo Orbitrap Lumos with synchronous precursor selection (SPS) MS3. Only the MS2 spectra were centroided during conversion using msconvert (ProteoWizard release: 3.0.9283 (2016-1-11)) using vendor libraries.
MS3TMT11.mzML: A subset of 994 spectra from a currently unpublished MS3 SPS TMT 11-plex experiment converted to mzML using msconvert. The file contains 30, 482 and 482 MS1, MS2 and MS3 spectra, respectively. The MS1 spectra are in profile mode; other MS levels are centroided. See 'Sensitive and Accurate Quantitation of Phosphopeptides Using TMT Isobaric Labeling Technique' for details about the acquisition method.
Feature data containing identification data are available with fdms3tmt11, which can be used to directly update the feature data.

Author(s)

Laurent Gatto

All MsDataHub datasets

Description

The MsDataHub package provides example mass spectrometry data, peptide spectrum matches or quantitative data from proteomics and metabolomics experiments.

The MsDataHub() function returns a data.frame with all the annotated datasets provided in the package. For details on these individual datasets, refer to their respective manual pages.

See the vignette for more details about the package and the data themselves.

Usage

MsDataHub()

Value

A data.frame describing the data available in MsDataHub.
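
Because the return value is a plain data.frame, it can be searched with ordinary base-R tools. The following is a minimal sketch: the helper `find_datasets()` and the default column name `"Title"` are assumptions made for illustration and are not part of the package, so the actual column names should be checked with `names(MsDataHub())`. A small stand-in table is used so that the example runs without the package or a network connection.

```r
# Return the rows of a data.frame whose `column` matches `pattern`.
# `find_datasets` and the default column name "Title" are assumptions
# made for this sketch; inspect names(MsDataHub()) for the real columns.
find_datasets <- function(tbl, pattern, column = "Title") {
    stopifnot(is.data.frame(tbl), column %in% names(tbl))
    tbl[grepl(pattern, tbl[[column]], ignore.case = TRUE), , drop = FALSE]
}

# Stand-in table so the sketch runs without the package; the titles
# are file names mentioned in this manual.
mock <- data.frame(
    Title = c("ko15.CDF", "cptac_peptides.txt", "cptac_a_b_peptides.txt",
              "PestMix1_DDA.mzML", "PestMix1_SWATH.mzML")
)

find_datasets(mock, "cptac")      # rows whose title mentions CPTAC
find_datasets(mock, "\\.mzML$")   # rows whose title ends in .mzML

# With the package installed, the same call would be:
# find_datasets(MsDataHub(), "cptac")
```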

Author(s)

Laurent Gatto

Examples


MsDataHub()

PXD000001 Proteomics Data

Description

The PXD000001 files are part of the first ProteomeXchange submission (Vizcaíno J.A. et al, 2014) and comprise:

TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz: TMT 6-plex LC-MS/MS data containing 6 human spiked-in proteins in a constant Erwinia carotovora protein background. The data are described in more detail in Gatto and Christoforou (2013).
TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid: generated by searching the raw data against the Erwinia carotovora fasta database.

References

Vizcaíno J.A. et al. ProteomeXchange: globally co-ordinated proteomics data submission and dissemination, Nature Biotechnology 2014, 32, 223–226. http://www.ncbi.nlm.nih.gov/pubmed/24727771
Gatto L. and Christoforou A. Using R and Bioconductor for proteomics data analysis, Biochim Biophys Acta - Proteins and Proteomics, 2013. http://www.ncbi.nlm.nih.gov/pubmed/23692960

Derks 2022 plexDIA data

Description

Single cell proteomics data acquired by the Slavov Lab using the plexDIA protocol. It contains quantitative information from pancreatic ductal acinar cells (PDAC; HPAF-II), melanoma cells (WM989-A6-G3) and monocytes (U-937) at precursor and protein level. Each run acquired 3 samples thanks to mTRAQ multiplexing.

The data were downloaded from the Slavov Lab Google Drive:

https://drive.google.com/drive/folders/1pUC2zgXKtKYn22mlor0lmUDK0frgwL_-
DIANN_outputs
wJD1146_1193_1200_tsvLib
Report.tsv

For more details about the data: https://plexdia.slavovlab.net/

The file is reshared here to allow its dissemination via the MsDataHub package.

Author(s)

Laurent Gatto

References

Derks, J., Leduc, A., Wallmann, G. et al. Increasing the throughput of sensitive proteomics by plexDIA. Nat Biotechnol (2022). 10.1038/s41587-022-01389-w.

AB Sciex LC-MS data files

Description

The sciex mzML files represent profile-mode LC-MS data of pooled human serum samples (the same pool being measured). The samples were analyzed by ultra high-performance liquid chromatography (UHPLC; Agilent 1290) coupled to a Q-TOF mass spectrometer (TripleTOF 5600+ AB Sciex). The chromatographic separation was based on hydrophilic interaction liquid chromatography (HILIC) and performed using a Waters Acquity BEH Amide, 100 x 2.1 mm column.

The mass spectrometer was operated in full scan mode in the mass range from 50 to 1000 m/z and with an accumulation time of 250 ms. The files represent a subset of spectra/scans from m/z 105 to 134 and from retention time 0 to 260 seconds. The files were generated in the same LC-MS run, but from different injections. Details on the individual files are provided below.

Files:

20171016_POOL_POS_1_105-134.mzML: profile-mode LC-MS data of pooled human serum samples. Injection index: 1.
20171016_POOL_POS_3_105-134.mzML: profile-mode LC-MS data of pooled human serum samples. Injection index: 19.

Author(s)

Sigurdur Smarason, Giuseppe Paglia and Johannes Rainer

Triple TOF SWATH Data

Description

These files represent data from reversed-phase LC-MS/MS runs of the Agilent Pesticide mix obtained on a Sciex 6600 Triple ToF operated in either Sequential Window Acquisition of all THeoretical mass spectra (SWATH) or Data Dependent Acquisition (DDA) mode.

The data files are:

PestMix1_DDA.mzML: mzML file with MS1 and MS2 spectra from the Agilent Pesticide Mix acquired in DDA mode.
PestMix1_SWATH.mzML: mzML file with MS1 and MS2 spectra from the Agilent Pesticide Mix acquired in SWATH mode.

Author(s)

Michael Witting, Johannes Rainer
