Improving access to proteogenomic data for the National Cancer Institute
We developed a research data management solution that ensures speed and quality when handling large volumes of mass spectrometry data in the Clinical Proteomic Tumor Analysis Consortium (CPTAC)
Medicine in the 21st century involves integrating data from many sources as researchers and physicians work to address diseases such as cancer in a comprehensive fashion. The National Cancer Institute has initiated an extensive analysis of the proteins expressed in cancer cells. The Data Coordinating Center and Proteomic Data Commons serve as the central repository for the proteomics data and distribute it to physicians, clinicians, and scientists in the cancer research community. Together they form the largest cancer proteomic data warehouse in the world.
The National Cancer Institute receives large volumes of mass spectrometry data from research groups in the Clinical Proteomic Tumor Analysis Consortium (CPTAC). The agency needed a way to store this data in one central location to make the information accessible to all cancer researchers interested in the tumor proteome—and maintain the results for future research after the conclusion of each CPTAC cancer program.
In addition, the proteomic data needed to be moved securely, with no loss of content. The storage site the research community used previously suffered from slow data transfers and some file loss.
Our team created a secure data portal for researchers by combining a web server, a database, a file storage system, and an IBM Aspera high-speed data transfer server. We also developed daily transfer logs to track and troubleshoot errors.
The portal allows as many researchers as possible to access this important proteogenomic data. We built quality control and security into data receipt by encrypting data in transit and then verifying it with a checksum file. Due to this focus on data integrity, researchers can trust that files correctly map back to the right sample and accurately capture the information associated with tumor acquisition. Our team also employs harmonization to ensure clinical data from many different sources are usable and may be compared across cancer programs.
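The checksum verification step described above can be illustrated with a minimal sketch. The source does not say which hash algorithm or checksum file layout the portal uses, so SHA-256 and a manifest with one "digest, whitespace, file name" entry per line are assumptions here, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large data files are never fully loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_manifest(text: str) -> dict[str, str]:
    """Parse manifest lines of the assumed form '<hex digest>  <file name>'."""
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        digest, name = line.split(maxsplit=1)
        # Some checksum tools prefix the name with '*' to mark binary mode.
        expected[name.lstrip("*")] = digest.lower()
    return expected


def verify_transfer(directory: Path, manifest_text: str) -> dict[str, str]:
    """Return a status per listed file: 'ok', 'mismatch', or 'missing'."""
    results = {}
    for name, digest in parse_manifest(manifest_text).items():
        target = directory / name
        if not target.is_file():
            results[name] = "missing"
        elif sha256_of(target) == digest:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

In a workflow like this, any file reported as "mismatch" or "missing" would be flagged for retransfer before it is linked to a sample, which is one way a receiving site can guard against the file loss described earlier.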
Results
The CPTAC Data Coordinating Center and Proteomic Data Commons are providing information about the cancer proteome to researchers around the world so they can use these data in their work. The site provides private areas for each research team to exchange data, as well as a public Proteomic Data Commons portal for distribution of data from the CPTAC program and from collaborators in the International Cancer Proteogenome Consortium.
The portal regularly manages 29 terabytes of data, and 785 terabytes have been downloaded in 140 countries. The impact of CPTAC has been showcased in 18 scholarly publications, which highlight the breadth of researchers using this technology and data resource to advance our understanding of proteogenomics across many cancers, including ovarian, breast, colon, lung, pediatric and adult brain cancer, and others.