| Title: | Mass Spectrometry Data on ExperimentHub |
|---|---|
| Description: | The MsDataHub package uses the ExperimentHub infrastructure to distribute raw mass spectrometry data files, peptide spectrum matches or quantitative data from proteomics and metabolomics experiments. |
| Authors: | Laurent Gatto [aut, cre] (ORCID: <https://orcid.org/0000-0002-1520-2268>), Kristina Gomoryova [ctb] (ORCID: <https://orcid.org/0000-0003-4407-3917>), Johannes Rainer [aut] (ORCID: <https://orcid.org/0000-0002-6977-7147>), Guillaume Deflandre [ctb] (ORCID: <https://orcid.org/0009-0008-1257-2416>) |
| Maintainer: | Laurent Gatto <[email protected]> |
| License: | Artistic-2.0 |
| Version: | 1.11.5 |
| Built: | 2026-05-22 09:29:14 UTC |
| Source: | https://github.com/rformassspectrometry/msdatahub |
Single-cell proteomics captures the Proteome Heterogeneity in Human iPSC-Derived Cardiomyocytes and Adult Cardiomyocytes.
Project description (from MassIVE): Human induced pluripotent stem cell (IPSC)-derived cardiomyocytes (iCMs) have become important tools to model cardiovascular diseases and drug toxicology. Despite suggested transcriptomic heterogeneity in both iPSC and iCMs, the cellular proteome heterogeneity is poorly understood. Using cutting-edge single cell proteomics, we quantify the maturation from IPSC to iCMs and observed two distinct populations of iCMs with different metabolism, which recapitulates the single adult cardiomyocyte proteome populations albeit less mature.
The two DIA-NN report files, downloaded from the MassIVE dataset MSV000094438 (doi:10.25345/C5T727S7Q) and redistributed here, are:
Adult cardiomyocytes (aCMs): 299 cells
iPSC-derived cardiomyocytes (iCMs): 2184 cells
Dataset license: CC0 1.0 Universal (CC0 1.0)
EuBIC 2025 developer meeting SCP hackathon members
Ai, Lizhuo, Aleksandra Binek, Vladimir Zhemkov, Jae Hyung Cho, Ali Haghani, Simion Kreimer, Edo Israely, et al. 2025. “Single Cell Proteomics Reveals Specific Cellular Subtypes in Cardiomyocytes Derived from Human iPSCs and Adult Hearts.” Mol. Cell. Proteomics, no. 100910 (January): 100910. https://doi.org/10.1016/j.mcpro.2025.100910.
These data were generated from the publicly available DIA benchmarking dataset of Gotti et al. (2021). A subset of the raw data, namely the files containing "overlapped" in File.Name, was searched using the DIA-NN software, and the resulting report.tsv (here labelled 'benchmarkingDIA.tsv') is provided.
The dataset comprises 8 conditions, each a mix of E. coli and Universal Standard Protein-1 (UPS1) peptides. Per 1 ug of E. coli protein (equal in all samples), UPS1 proteins are diluted to final concentrations of 50, 25, 10, 5, 2.5, 1, 0.25 and 0.1 fmol.
Each sample was prepared in 3 replicates, so altogether there are 24 samples in the dataset.
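The design described above (8 UPS1 amounts, 3 replicates each) can be laid out as a small table in base R. This is only a sketch: the sample labels are hypothetical and do not correspond to the run names in the report file, and the fold changes are the nominal values implied by the spike-in amounts.

```r
# UPS1 spike-in amounts (fmol per 1 ug of E. coli protein), as listed above.
ups1_fmol <- c(50, 25, 10, 5, 2.5, 1, 0.25, 0.1)
n_rep <- 3

# One row per sample: 8 conditions x 3 replicates.
design <- expand.grid(replicate = seq_len(n_rep), ups1_fmol = ups1_fmol)

# Hypothetical sample labels, not the names used in the report file.
design$sample <- sprintf("UPS1_%sfmol_rep%d", design$ups1_fmol, design$replicate)

# Number of samples in the design.
nrow(design)

# Nominal log2 fold changes of UPS1 proteins relative to the lowest amount.
expected_log2fc <- log2(ups1_fmol / min(ups1_fmol))
names(expected_log2fc) <- ups1_fmol
round(expected_log2fc, 2)
```

Since the E. coli background is constant across samples, its proteins would nominally have a log2 fold change of 0 in any comparison.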
Kristina Gomoryova and Laurent Gatto
Gotti C, Roux-Dalvai F, Joly-Beauparlant C, Mangnier L, Leclercq M, Droit A. Extensive and Accurate Benchmarking of DIA Acquisition Methods and Software Tools Using a Complex Proteomic Standard. J Proteome Res. 2021 Oct 1;20(10):4801-4814. doi: 10.1021/acs.jproteome.1c00490. Epub 2021 Sep 2. PMID: 34472865.
Features of Peptide Fragmentation Spectra in Single-Cell Proteomics
Project description (from MassIVE):
SCP: This dataset contains 3 data types: trace samples consisting of 2 ng and 0.2 ng aliquots of HeLa protein digest standard, and single HeLa cells. Pierce™ HeLa protein digest standard and formic acid were purchased from Thermo Fisher Scientific (Waltham, MA). Mobile phase A (0.1% formic acid in water) and mobile phase B (0.1% formic acid in acetonitrile) were respectively prepared from LC-MS grade water and acetonitrile purchased from Honeywell (Charlotte, NC). The digest standard was reconstituted to a final concentration of 200 ng/µL with 100 µL of mobile phase A to form a stock solution. For the experiments, the stock samples were further diluted to 0.2 and 2 ng/µL using the same mobile phase. HeLa cells were cultured from cells purchased from American Type Culture Collection (Manassas, VA). Single HeLa cells were prepared using the nanoPOTS workflow and analyzed by manual LC injection as described previously (29797682) except that cells were isolated into nanowells using the CellenONE platform (Lyon, France) instead of by fluorescence activated cell sorting. Columns: 30-µm-i.d. fused silica capillary columns from Polymicro (Phoenix, AZ) were packed with different materials: Jupiter C18 3.0 µm, 300 Å particles and Kinetex C18 core shell particles of 1.3 µm, 100 Å were purchased from Phenomenex (Torrance, CA); BEH C18, 1.7 µm, 130 Å was from Waters (Milford, MA). Column lengths were adjusted to keep the pressure and the linear velocity constant for all columns. The lengths were 50, 9 and 16 cm for Jupiter, Kinetex and BEH columns respectively. Solid-phase-extraction (SPE) columns were prepared by packing Jupiter C18 particles into 100-µm-i.d. × 5-cm-long fused silica capillaries. The file names contain the sample size and LC packing material. doi:10.25345/C5NV69
Bulk: This data was originally uploaded to pride project PXD011163. More details can be found there. Cells were lysed, reduced, and alkylated in lysis buffer (1% SDC, 10 mM TCEP, 40 mM CAA, and 100 mM TRIS, pH 8.0) supplemented with complete EDTA-free protease inhibitor mixture and phosSTOP phosphatase inhibitor mixture. Cells were heated for 5 min at 95 C, sonicated with a Bioruptor Plus, and diluted 1:10 with 50 mM ammonium bicarbonate, pH 8.0. Proteins were digested overnight at 37 C with trypsin and Lys-C (enzyme:substrate ratio of 1:50 and 1:75). SDC was precipitated by acidification to 5% of formic acid. Samples were desalted using Sep-Pak C18 cartridges and directly subjected to phosphopeptide enrichment. Samples for proteome analysis were instead dried down and stored at -80 C until nLC-MS analysis. Phosphopeptides enrichment was performed using Fe(III)-NTA in an automated fashion using the AssayMAP Bravo Platform. Reversed phase nLC-MS/MS analysis was performed with an Agilent 1290 Infinity UHPLC system coupled to an Orbitrap Q Exactive Plus mass spectrometer, or Orbitrap Fusion mass spectrometer for the phosphoproteome analysis. The UHPLC was equipped with a double frit trapping column (Reprosil C18, 3 um, 2 cm x 100 um) and a single frit analytical column (Poroshell EC-C18, 2.7 um, 50 cm x 75 um). Trapping was performed in solvent A (0.1% FA in water) at 5 uL/min, while for the elution the flow rate was passively split to 300 nL/min. The linear gradient was as follows: 13-40% solvent B (0.1% FA in 80% ACN) in 220 min, or 8-32% in 95 min for phosphopeptide analysis. Total analysis time was 235 min for the proteome samples and 110 min for the phosphoproteome samples. The mass spectrometers were operated in data-dependent mode. The Orbitrap Q Exactive Plus full-scan MS spectra from m/z 375-1600 were acquired at a resolution of 35000 (FWHM) after accumulation to a target value of 3e6. 
Up to 10 most intense precursor ions were selected for fragmentation, with the isolation window set to 1.5 m/z. HCD fragmentation was performed at normalized collision energy of 25% after the accumulation to a target value of 5e4. MS/MS was acquired at a resolution of 17500 (FWHM). The Orbitrap Fusion full-scan MS spectra from m/z 375-1500 were acquired at a resolution of 120000 (FWHM) after accumulation to a target value of 4e5. The most intense peptide ions fitting within a 3 s cycle were selected for HCD fragmentation, with the isolation window set to 1.6 m/z, and a normalized collision energy of 30%, after the accumulation to a target value of 5e4. MS/MS was acquired at a resolution of 30000 (FWHM). doi:10.25345/C5BN6F
The mzML files, downloaded from the MassIVE datasets MSV000087524 (doi:10.25345/C5NV69; SCP) and MSV000087689 (doi:10.25345/C5BN6F; bulk) and redistributed here, are:
D19_15um30cm_SC1.mzML
OR11_20160122_PG_HeLa_CVB3_CT_A.mzML
The identification files were created with the Sage search engine and uploaded to Zenodo (DOI:10.5281/zenodo.19370231), following the experiment's guidelines:
D19_15um30cm_SC1.sage.tsv
OR11_20160122_PG_HeLa_CVB3_CT_A.sage.tsv
Dataset license: CC0 1.0 Universal (CC0 1.0)
Guillaume Deflandre
Boekweg, Hannah, Daisha Van Der Watt, Thy Truong, S. Madisyn Johnston, Amanda J. Guise, Edward D. Plowey, Ryan T. Kelly, and Samuel H. Payne. 2022. “Features of Peptide Fragmentation Spectra in Single-Cell Proteomics.” Journal of Proteome Research 21 (1): 182–88.
This data set represents a single CDF file in (AIA/ANDI) NetCDF format from a larger experiment in which the metabolic consequences of knocking out the fatty acid amide hydrolase (FAAH) gene in mice were investigated. The file contains data in centroid mode acquired in positive ion mode from 200-600 m/z and 2500-4500 seconds.
Data file:
ko15.CDF file in NetCDF format.
Saghatelian, A et al. Assignment of endogenous substrates to enzymes by global metabolite profiling, Biochemistry, 2004. http://dx.doi.org/10.1021/bi0480335
The CE-MS test files consist of two files, "CEMS_10ppm.mzML" and
"CEMS_25ppm.mzML". The data contains CE-MS runs of a standard mixture
that contains e.g. Lysine (at 10 ppm and 25 ppm, respectively) and the
neutral EOF marker Paracetamol (50 ppm). The data was acquired on a
7100 capillary electrophoresis system from Agilent Technologies, coupled
to an Agilent 6560 IM-QToF-MS. CE separation was performed using an 80 cm
fused silica capillary with an internal diameter of 50 µm and external
diameter of 365 µm. The background electrolyte was 10 % acetic acid and
separation was performed at +30 kV and a constant pressure of 50 mbar.
MS detection was performed in positive ionization mode.
The raw data were then converted to the open .mzML format using ProteoWizard. To reduce data size, the test data were subset to a retention time range from 400 to 900 seconds and an m/z range from 147.1 to 152.0.
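The effect of restricting data to the retention time and m/z window given above can be sketched on a toy peak table in base R. The peak values and column names below are made up for illustration; this is not how the distributed files were actually produced, only a minimal example of the same kind of windowing.

```r
# Toy peak table standing in for real CE-MS data (all values are made up).
peaks <- data.frame(
    rtime = c(350, 450, 620, 880, 950),          # seconds
    mz = c(147.5, 146.9, 150.0, 151.9, 148.2),
    intensity = c(1200, 800, 15000, 9000, 300)
)

# Window taken from the description above.
rt_range <- c(400, 900)
mz_range <- c(147.1, 152.0)

# Keep only peaks that fall inside both ranges (bounds inclusive).
keep <- peaks$rtime >= rt_range[1] & peaks$rtime <= rt_range[2] &
    peaks$mz >= mz_range[1] & peaks$mz <= mz_range[2]
subset_peaks <- peaks[keep, ]
subset_peaks
```

Only peaks inside both ranges survive, which is why the resulting files are much smaller than the full acquisitions.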
Files:
"CEMS_10ppm.mzML"
"CEMS_25ppm.mzML"
Data location: DOI:10.5281/zenodo.18481720
Liesa Salzer
This case-study is a subset of the data of the 6th study of the Clinical Proteomic Technology Assessment for Cancer (CPTAC) (Paulovich et al. 2010). In this experiment, the authors spiked the Sigma Universal Protein Standard mixture 1 (UPS1) containing 48 different human proteins in a protein background of 60 µg/µL Saccharomyces cerevisiae strain BY4741.
Five different spike-in concentrations were used:
6A: 0.25 fmol UPS1 proteins/µL
6B: 0.74 fmol UPS1 proteins/µL
6C: 2.22 fmol UPS1 proteins/µL
6D: 6.67 fmol UPS1 proteins/µL
6E: 20 fmol UPS1 proteins/µL
Three replicates are available for each concentration.
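When benchmarking a quantification workflow on these data, the nominal fold changes between spike-in groups are a natural reference. The sketch below computes them in base R from the five concentrations listed above; the object names are arbitrary, and the yeast background is assumed to be unchanged between groups.

```r
# Nominal UPS1 concentrations (fmol/µL) for the five spike-in groups.
conc <- c("6A" = 0.25, "6B" = 0.74, "6C" = 2.22, "6D" = 6.67, "6E" = 20)

# Pairwise nominal log2 fold changes: rows are numerators, columns denominators.
log2fc <- outer(conc, conc, function(num, den) log2(num / den))
round(log2fc, 2)

# Ratios between successive groups are all close to 3.
step_ratio <- conc[-1] / conc[-length(conc)]
round(step_ratio, 2)
```

For example, the 6B versus 6A comparison has a nominal log2 fold change of log2(0.74 / 0.25), whereas yeast proteins would nominally stay at 0.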
The data were searched with MaxQuant version 1.5.2.8 (Cox et al. 2008) including matching between runs. Detailed search settings were described in Goeminne et al. (2016).
Three files are readily available as tab-delimited spreadsheets:
cptac_a_b_peptides.txt: triplicates from lab 3 for groups 6A and 6B.
cptac_a_b_c_peptides.txt: triplicates from labs 1, 2 and 3 for groups 6A, 6B and 6C.
cptac_peptides.txt: triplicates from labs 1, 2, and 3 for all groups.
Laurent Gatto and Lieven Clement
Paulovich, Amanda G, Dean Billheimer, Amy-Joan L Ham, Lorenzo Vega-Montoto, Paul A Rudnick, David L Tabb, Pei Wang, et al. 2010. Interlaboratory Study Characterizing a Yeast Performance Standard for Benchmarking LC-MS Platform Performance. Mol. Cell. Proteomics 9 (2): 242–54.
Cox, J, and M Mann. 2008. MaxQuant Enables High Peptide Identification Rates, Individualized p.p.b.-Range Mass Accuracies and Proteome-Wide Protein Quantification. Nat Biotechnol 26 (12): 1367–72. https://doi.org/10.1038/nbt.1511.
Goeminne, LJ, Gevaert K and Clement, L. 2016. Peptide-level Robust Ridge Regression Improves Estimation, Sensitivity, and Specificity in Data-dependent Quantitative Label-free Shotgun Proteomics, Mol Cell Proteomics, 15:2 657-668.
These three FASTA files are widely used proteomics contaminant databases. The files are:
crap_gpm.fasta: the common Repository of Adventitious Proteins (cRAP) from the Global Proteome Machine (GPM) organisation.
crap_ccp.fasta: Cambridge Centre for Proteomics' own cRAP fasta database.
crap_maxquant.fasta.gz: MaxQuant's contaminant database.
These files are extracted from the camprotR package and described in the
cRAP databases vignette (see References).
These files are added to the MsDataHub package via the corresponding
Zenodo repository to facilitate re-use with minimal dependencies and to
avoid repeated downloads through caching.
All credit for compiling the fasta files goes to Charlotte Dawson,
maintainer of the camprotR package.
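Contaminant databases such as these are plain FASTA text, so a quick sanity check (number of entries, sequence lengths) needs nothing beyond base R. The sketch below runs on a tiny made-up FASTA file written to a temporary location; the entry names and sequences are hypothetical and are not taken from the cRAP files.

```r
# Write a tiny FASTA file with made-up entries to a temporary location.
fasta_file <- tempfile(fileext = ".fasta")
writeLines(c(">contaminant_1 hypothetical entry",
             "MKWVTFISLLLLFSSAYS",
             ">contaminant_2 hypothetical entry",
             "IVGGYTCGANTVPYQVSL",
             "NSGYHFCGGSLINSQWVV"),
           fasta_file)

fasta_lines <- readLines(fasta_file)
is_header <- grepl("^>", fasta_lines)

# Each header line starts a new entry.
n_entries <- sum(is_header)

# Assign sequence lines to the header preceding them, then sum their lengths.
entry_id <- cumsum(is_header)
seq_lengths <- tapply(nchar(fasta_lines[!is_header]), entry_id[!is_header], sum)
names(seq_lengths) <- sub("^>([^ ]+).*", "\\1", fasta_lines[is_header])

n_entries
seq_lengths
```

The same lines could be pointed at a downloaded contaminant file instead of the temporary one; dedicated packages offer more robust FASTA parsing.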
Laurent Gatto
cRAP databases vignette: https://cambridgecentreforproteomics.github.io/camprotR/articles/crap.html
cRAP protein sequences (GPM): https://www.thegpm.org/crap/
camprotR package: https://cambridgecentreforproteomics.github.io/camprotR/index.html
Gatto, L. (2025). Proteomics contaminant databases (1.0). Zenodo. https://doi.org/10.5281/zenodo.15115102
Direct injection Fourier-transform ion cyclotron resonance (FTICR) mass spectrometry (MS) data files.
Files:
HAM004_641fE_14-11-07–Exp1.extracted.mzML
HAM004_641fE_14-11-07–Exp2.extracted.mzML
HAM004_641fE_14-11-07–Exp3.extracted.mzML
HAM004_641fE_14-11-07–Exp4.extracted.mzML
HAM004_641fE_14-11-07–Exp5.extracted.mzML
HAM005_641fE_14-11-07–Exp1.extracted.mzML
HAM005_641fE_14-11-07–Exp2.extracted.mzML
HAM005_641fE_14-11-07–Exp3.extracted.mzML
HAM005_641fE_14-11-07–Exp4.extracted.mzML
HAM005_641fE_14-11-07–Exp5.extracted.mzML
Data location: DOI:10.5281/zenodo.18494294
Mark Haid
Various DDA and DIA experiments that were processed with different software
to illustrate the QFeatures::readQFeatures() function:
vanPuyvelde_2022_LFQ_*: DDA and DIA data from van Puyvelde et
al. (2022). LFQ benchmarking experiment.
Christoforou_2016_TMT_DDA_*: data from Christoforou et
al. (2016). hyperLOPIT spatial proteomics experiment.
Derks_2022_plex_DIA_DIANN_report_subset.tsv: subset from Derks et
al. (2022) plexDIA data.
Source files are available on Zenodo: https://zenodo.org/records/19137577
Kristina Gomoryova and Laurent Gatto
Van Puyvelde, B., Daled, S., Willems, S. et al. A comprehensive LFQ benchmark dataset on modern day acquisition strategies in proteomics. Sci Data 9, 126 (2022). https://doi.org/10.1038/s41597-022-01216-6.
Christoforou, A., Mulvey, C., Breckels, L. et al. A draft map of the mouse pluripotent stem cell spatial proteome. Nat Commun 7, 9992 (2016). https://doi.org/10.1038/ncomms9992
Derks, J., Leduc, A., Wallmann, G. et al. Increasing the throughput of sensitive proteomics by plexDIA. Nat Biotechnol 41, 50–59 (2023). https://doi.org/10.1038/s41587-022-01389-w
MRM-standmix-5.mzML: sample from mouse brain acquired by HILIC ESI-QqQ/MS in dynamic multiple reaction monitoring (MRM) mode. The HPLC system was a 1290 Infinity (Agilent Technologies) coupled to an ion-Funnel Triple quadrupole 6490 mass spectrometer (Agilent Technologies). This file was contributed by Xavi Domingo-Almenara from The Scripps Research Institute, San Diego, CA.
Data location: DOI:10.5281/zenodo.18502866
Xavi Domingo-Almenara
MS3TMT10_01022016_32917-33481.mzML.gz: A subset of 565 spectra from a
currently unpublished TMT 10-plex experiment run on a Thermo Orbitrap
Lumos with synchronous precursor selection (SPS) MS3. Only the MS2 spectra
were centroided during conversion using msconvert (ProteoWizard release:
3.0.9283 (2016-1-11)) using vendor libraries.
MS3TMT11.mzML: A subset of 994 spectra from a currently unpublished
MS3 SPS TMT 11-plex experiment converted to mzML using
msconvert. The file contains 30, 482 and 482 MS1, MS2 and MS3
spectra, respectively. The MS1 spectra are in profile mode; other MS
levels are centroided. See 'Sensitive and Accurate Quantitation of
Phosphopeptides Using TMT Isobaric Labeling Technique' for details
about the acquisition method.
Feature data containing identification data are available with
fdms3tmt11, which can be used to directly update the feature data.
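The spectrum counts quoted for MS3TMT11.mzML (30 MS1, 482 MS2 and 482 MS3 spectra, 994 in total) can be checked with a simple tabulation. The sketch below uses a simulated vector of MS levels rather than the file itself, so it only illustrates the arithmetic and the kind of summary one would expect.

```r
# Simulated MS levels matching the counts stated above (not read from the file).
ms_level <- rep(1:3, times = c(30, 482, 482))

# Number of spectra per MS level.
level_counts <- table(ms_level)
level_counts

# Total number of spectra.
length(ms_level)

# In an SPS MS3 experiment each MS2 scan is followed by one MS3 scan,
# which is consistent with the equal MS2 and MS3 counts.
level_counts[["2"]] == level_counts[["3"]]
```

With the file loaded through a package such as Spectra, the same `table()` call could be applied to the MS levels returned by its `msLevel()` accessor.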
Laurent Gatto
The MsDataHub package provides example mass spectrometry data, peptide spectrum matches or quantitative data from proteomics and metabolomics experiments.
The MsDataHub() function returns a data.frame with all the
annotated datasets provided in the package. For details on these
individual datasets, refer to their respective manual pages.
See the vignette and the respective manual pages for more details about the package and the data themselves.
MsDataHub()
A data.frame describing the data available in
MsDataHub.
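Because the return value is an ordinary data.frame, it can be searched with base R. The sketch below defines a small hypothetical helper, `find_datasets()`, which is not part of MsDataHub, and demonstrates it on a toy table whose column names (`Title`, `Description`) are assumptions about what the real table might contain. The final, guarded call only runs if the package is installed.

```r
# Hypothetical helper (not part of MsDataHub): keep the rows of a data.frame
# in which at least one column matches `pattern`, ignoring case.
find_datasets <- function(hub, pattern) {
    hit <- apply(hub, 1, function(row) any(grepl(pattern, row, ignore.case = TRUE)))
    hub[hit, , drop = FALSE]
}

# Toy table with assumed column names, standing in for the real annotation.
toy_hub <- data.frame(
    Title = c("cptac_peptides.txt", "ko15.CDF", "PestMix1_DDA.mzML"),
    Description = c("CPTAC study 6 spike-in", "FAAH knockout", "Pesticide mix, DDA")
)
find_datasets(toy_hub, "cptac")

# With the package installed, the same helper can be applied to the real table.
if (requireNamespace("MsDataHub", quietly = TRUE)) {
    print(find_datasets(MsDataHub::MsDataHub(), "cptac"))
}
```

Searching all columns avoids relying on any particular column name, which is useful here since the exact layout of the table is not described above.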
Laurent Gatto
MsDataHub()
The PXD000001 files are part of the first ProteomeXchange submission (Vizcaíno J.A. et al, 2014), and comprise the following files:
TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz: a TMT 6-plex LC-MS/MS data file containing 6 human spiked-in proteins in a constant Erwinia carotovora protein background. The data are described in more detail in Gatto and Christoforou (2013).
TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid: generated by searching the raw data against the Erwinia carotovora fasta database.
Vizcaíno J.A. et al. ProteomeXchange: globally co-ordinated proteomics data submission and dissemination, Nature Biotechnology 2014, 32, 223–226. http://www.ncbi.nlm.nih.gov/pubmed/24727771
Gatto L. and Christoforou A. Using R and Bioconductor for proteomics data analysis, Biochim Biophys Acta - Proteins and Proteomics, 2013. http://www.ncbi.nlm.nih.gov/pubmed/23692960
The rpx package can be used to access and download any PRIDE/ProteomeXchange files.
Single cell proteomics data acquired by the Slavov Lab using the plexDIA protocol. It contains quantitative information from pancreatic ductal adenocarcinoma cells (PDAC; HPAF-II), melanoma cells (WM989-A6-G3) and monocytes (U-937) at precursor and protein level. Each run acquired 3 samples thanks to mTRAQ multiplexing.
The data were downloaded from the Slavov lab google drive:
https://drive.google.com/drive/folders/1pUC2zgXKtKYn22mlor0lmUDK0frgwL_-
DIANN_outputs
wJD1146_1193_1200_tsvLib
Report.tsv
For more details about the data: https://plexdia.slavovlab.net/
The file is reshared here to allow its dissemination via the MsDataHub package.
Laurent Gatto
Derks, J., Leduc, A., Wallmann, G. et al. Increasing the throughput of sensitive proteomics by plexDIA. Nat Biotechnol (2022). 10.1038/s41587-022-01389-w.
The sciex mzML files represent profile-mode LC-MS data of pooled human serum samples (the same pool being measured). The samples were analyzed by ultra high-performance liquid chromatography (UHPLC; Agilent 1290) coupled to a Q-TOF mass spectrometer (TripleTOF 5600+ AB Sciex). The chromatographic separation was based on hydrophilic interaction liquid chromatography (HILIC) and performed using a Waters Acquity BEH Amide, 100 x 2.1 mm column.
The mass spectrometer was operated in full scan mode in the mass range from 50 to 1000 m/z and with an accumulation time of 250 ms. The files represent a subset of spectra/scans from m/z 105 to 134 and from retention time 0 to 260 seconds. The files were generated in the same LC-MS run, but from different injections. Details on the individual files are provided below.
Files:
20171016_POOL_POS_1_105-134.mzML: profile-mode LC-MS data of pooled human serum samples. Injection index: 1.
20171016_POOL_POS_3_105-134.mzML: profile-mode LC-MS data of pooled human serum samples. Injection index: 19.
Sigurdur Smarason, Giuseppe Paglia and Johannes Rainer
These files represent data from reversed-phase LC-MS/MS runs on the Agilent Pesticide mix obtained from a Sciex 6600 Triple ToF operated either in Sequential Window Acquisition of all THeoretical mass spectra (SWATH) or Data Dependent Acquisition (DDA) mode.
The data files are:
PestMix1_DDA.mzML: mzML file with MS1 and MS2 spectra from the Agilent Pesticide Mix acquired in DDA mode.
PestMix1_SWATH.mzML: mzML file with MS1 and MS2 spectra from the Agilent Pesticide Mix acquired in SWATH mode.
Michael Witting, Johannes Rainer