To solve the most pressing public health challenges, we need to make data accessible to a broader community of researchers—and introduce new ways of working that encourage collaboration and rapid discovery. Here’s a framework to help public health leaders spur meaningful change.
Why public health research needs an update
The COVID-19 pandemic has underscored our need for data that moves faster than disease.
Scientific advancements are predicated on good data. In some domains, including cancer biology, infectious disease, and addiction research, there can be so much data that it becomes difficult to manage in a way that maintains security and integrity over time. Many public health agencies struggle to manage the large volume of data generated for research or, conversely, are challenged to get enough of the right data into the right hands to fuel analysis and insights that lead to timely interventions.
Such was the predicament for the National Cancer Institute’s Clinical Proteomic Tumor Analysis Consortium (CPTAC), which receives large amounts of tumor proteomic data and mass spectrometry data from numerous sources across the cancer research community. The challenge is to store these data in a central repository that all cancer researchers can access while supporting the reliable, secure, and rapid movement of data.
How can public health leaders create the systems that support breakthrough discoveries—and inspire a culture of scientific collaboration within their organizations?
In order to conduct real-time infectious disease surveillance, advance cancer research, and develop life-saving public health interventions, we encourage agencies to embrace a three-pronged approach:
Evolve the workflow of research so that sharing data (while ensuring data standards are met) becomes as routine as checking email—this requires combined expertise in health IT, research data management, and scientific disciplines such as epidemiology, genomics and proteomics, and health research.
A critical piece of the solution for grappling with the high volume, variety, and velocity of data lies in the use of data commons. Data commons (platforms that store shareable data sets that researchers can manipulate and interrogate with the help of visualization tools, artificial intelligence, and machine learning) help improve the speed and quality of data management and ensure research data remains accessible beyond the life of a grant, research project, or individual career.
For CPTAC, ICF created the Data Coordinating Center where researchers exchange data and the Proteomic Data Commons (PDC) where anyone can explore the data. These tools work together so that clinical data from many different sources are usable and may be compared across cancer programs. We partnered with the NCI to ensure quality control and security for data transit through encryption and verification processes. We also harmonized the data so researchers could analyze it and draw comparisons with other data sets.
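To make the idea of harmonization concrete, here is a minimal sketch in Python. It is an illustration only, not the method used for CPTAC or the PDC: the source names (site_a, site_b), the field names, and the shared schema are all hypothetical. It shows two sources describing the same concepts with different field names and units, mapped into one common shape so their records can be compared.

```python
# Hypothetical field mappings: each source names the same concepts differently.
FIELD_MAP = {
    "site_a": {"patient_id": "case_id", "dx": "diagnosis", "age_yrs": "age_years"},
    "site_b": {"CaseID": "case_id", "Diagnosis": "diagnosis", "AgeMonths": "age_months"},
}


def harmonize(record: dict, source: str) -> dict:
    """Rename source-specific fields to the shared schema and normalize values."""
    mapping = FIELD_MAP[source]
    # Keep only the fields the shared schema knows about, under their shared names.
    out = {mapping[key]: value for key, value in record.items() if key in mapping}
    # Normalize age to whole years when a source reports months.
    if "age_months" in out:
        out["age_years"] = out.pop("age_months") // 12
    # Normalize free-text diagnosis labels to trimmed, lowercase strings.
    if "diagnosis" in out:
        out["diagnosis"] = str(out["diagnosis"]).strip().lower()
    # Record provenance so every harmonized row can be traced to its source.
    out["source"] = source
    return out
```

Once records from both hypothetical sites pass through harmonize, they carry the same keys and comparable values, which is what allows comparisons across programs. A production system would typically rely on a governed data dictionary rather than a hard-coded mapping.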
Data management best practices
1. Build quality control and security into data receipt. For the NCI CPTAC, we encrypted data in transit and then verified it with a checksum file. Due to this focus on data integrity, researchers can trust that files correctly map back to the right sample and accurately capture the information associated with tumor acquisition.
2. Harmonize your data and establish standards for data collection, sharing, and integration. This ensures clinical data from many different sources are usable and may be compared across cancer programs. When data is harmonized, public health researchers can explore, for example, how trends may vary between populations and in different geographical areas.
3. Factor data privacy into the solution design. While data sharing provides enormous benefits to the research community, you may require private spaces for individual research teams to exchange data. For the NCI, we created private areas as well as the public PDC portal for distribution of data from collaborators in the International Cancer Proteogenome Consortium.
4. Engage stakeholders in the standards development process. Communication and collaboration are key. Because standards will impact a wide range of users such as measure developers, tool developers, implementers, and those who manage receiving systems, it’s important to engage stakeholders early in the process to ensure the standards meet their needs. For example, ICF supported the Centers for Medicare & Medicaid Services in establishing standards for working with electronic health records, extracting patient information, and assessing clinical quality.
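The checksum verification described in the first practice above can be sketched as follows. This is a hedged illustration, not the CPTAC implementation: the choice of SHA-256, the manifest layout (one "digest, whitespace, file name" entry per line, as produced by common checksum utilities), and the function names are all assumptions. Encryption in transit is assumed to be handled by the transfer protocol and is not shown.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large data files never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_manifest(text: str) -> dict[str, str]:
    """Parse lines of the assumed form '<hex digest>  <file name>'."""
    expected = {}
    for line in text.splitlines():
        if line.strip():
            digest, name = line.split(maxsplit=1)
            expected[name.strip()] = digest.lower()
    return expected


def verify_transfer(data_dir: Path, manifest_text: str) -> dict[str, bool]:
    """Report, per listed file, whether the received bytes match the manifest."""
    results = {}
    for name, digest in parse_manifest(manifest_text).items():
        path = data_dir / name
        # A missing file counts as a failed verification, not an error.
        results[name] = path.is_file() and sha256_of(path) == digest
    return results
```

A receiving system could run verify_transfer on each delivery and accept only the files that pass, so that a corrupted or missing file is caught before it is linked to a sample.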
The importance of these data handling approaches is underscored by the requirement among federal funding agencies that researchers ensure the data they produce adhere to the FAIR (findability, accessibility, interoperability, and reusability) principles. The FAIR data principles were put forth in 2016 by an international group of scientists, funders, and publishers to encourage the reusability of data given its increasing volume and complexity. While many research institutions around the world have adopted these principles, some struggle to comply. Proper training, tools, and support for sharing and standardizing data can help researchers and program directors ensure compliance.
Build a community of scientists across fields, as well as nonscientists, who can curate, maintain, and evolve the data commons model—and entice them to the table with the promise of data that is open, standardized, credible, elastic, robust, reliable, and FAIR (findable, accessible, interoperable, and reusable).
Learning to embrace new ways of working is easier said than done; in fact, cultural resistance to change is the primary reason digital transformation efforts fail, according to an ICF survey of 500 federal employees.
The new generation of workers is multidisciplinary, mobile, and accustomed to using intuitive tools to streamline workflows. Gen Z and millennial workers are more open to trying out "bleeding edge" technology at work as well as incorporating virtual assistants, design tools, and augmented reality/virtual reality applications into their workflow. What they lack in peer-reviewed academic publications they may more than make up for in diverse perspectives, creative thinking, and collaborative mindsets. Public health leaders can cultivate enthusiasm for new ways of working by implementing interdisciplinary teams and empowering staff to become change advocates.
Interdisciplinary team building and change management best practices
1. Develop favorable interpersonal conditions. Ensure team members have a shared understanding and expectation of trust, respect, openness, communication, and learning.
2. Provide sufficient scaffolding. Teams require continuous process improvement, a culture of continual learning, and other support to help build the foundation required for robust and sustainable collaboration.
3. Grow the team's skills. Encourage team members to not only develop their expertise but also co-create new ideas and approaches with others.
4. Empower staff to become change advocates. Identify influential users who can explore new technologies like data commons and bring back information to their teams to generate excitement among colleagues.
Get comfortable with new tools that automate data collection and leverage artificial intelligence and machine learning to surface patterns and insights for researchers to explore in the service of their research objectives—and inspire researchers to embrace the art of the possible through dynamic training and change management approaches.
Specialized data commons, like the PDC built for CPTAC, provide critical functions to the research community. Data commons focus on meeting the needs of researchers within a certain public health domain and their discrete data challenges.
Because of the ethos of data commons, these forums are fertile ground for behavioral and cultural change. They will ease the transition that public health research must make away from the long-standing incentive structure of science, which revolves around publishing in peer-reviewed journals. As more experts and non-experts become familiar with these tools, the fundamental value of sharing data—and making it widely available as part of the scientific democracy—will become apparent.
Data analysis best practices
1. Invest in data collection abilities and domain experts to understand the relationships within the data. Successful health IT implementations require partners who know how to bring science and technology together. In collaboration, these experts can train new tools to provide a mathematical representation of the real world. Once that's done, anybody in the agency can use it to ask questions of the data.
2. Be smart about how you catalog and inventory all your data assets. Know where they are, who's using them, and why. A data fabric—a data management architecture that integrates and models data from diverse source systems into an integrated platform for on-demand access—is a cohesive way of connecting a collection of data tools.
3. Look for data connections that rise to the surface. By connecting disparate data sets together in a way that activates metadata, which is data about the data, you can see and understand more complicated relationships—and generate insights that are much more powerful than the data just on its own.
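As a small illustration of the second and third practices above, the Python sketch below treats a data catalog as a list of metadata entries and uses that metadata alone to surface which data sets can be connected. The catalog entries, owners, and variable names are hypothetical, and a real data fabric would do far more than this; the point is only that metadata can reveal connections before anyone touches the underlying records.

```python
from itertools import combinations

# Hypothetical catalog entries: metadata only, no underlying records.
CATALOG = [
    {"name": "tumor_proteomics", "owner": "lab_a",
     "variables": {"case_id", "protein", "abundance"}},
    {"name": "clinical_outcomes", "owner": "registry",
     "variables": {"case_id", "diagnosis", "county"}},
    {"name": "county_demographics", "owner": "census_team",
     "variables": {"county", "population"}},
]


def find_links(catalog):
    """Return (dataset_a, dataset_b, shared_variables) for each pair sharing a variable."""
    links = []
    for a, b in combinations(catalog, 2):
        shared = a["variables"] & b["variables"]
        if shared:
            links.append((a["name"], b["name"], sorted(shared)))
    return links
```

In this toy catalog, the proteomics and demographics data sets share no variable directly, but both connect to the clinical outcomes data set, which suggests a path for joining all three.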
Another powerful tool is the knowledge graph, which can rapidly explore large volumes of data and find relationships between data points and their attributes. For example, a knowledge graph of genome-wide association studies would depict gene mutations and cancer types as nodes (circles) in the graph, and the relationships between those entities as edges (lines). Knowledge graphs are one of the analytical and visualization tools that could be available within the cloud-based data science infrastructure of data commons.
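A knowledge graph of this kind can be sketched in a few lines of Python. The example below is a toy under stated assumptions: the gene and cancer-type labels are placeholders rather than real findings, the single relation name is invented, and production systems would normally use a dedicated graph database. It stores relationships as (subject, relation, object) triples and answers one question: which other genes share a cancer type with a given gene?

```python
from collections import defaultdict

# Hypothetical triples (subject, relation, object); placeholder labels, not real findings.
TRIPLES = [
    ("GENE_A", "associated_with", "cancer_type_1"),
    ("GENE_A", "associated_with", "cancer_type_2"),
    ("GENE_B", "associated_with", "cancer_type_2"),
    ("GENE_C", "associated_with", "cancer_type_3"),
]


def build_graph(triples):
    """Build an undirected adjacency map: every node points to its neighbors."""
    graph = defaultdict(set)
    for subject, _relation, obj in triples:
        graph[subject].add(obj)
        graph[obj].add(subject)
    return graph


def related_genes(graph, gene):
    """Genes that share at least one cancer-type neighbor with `gene`."""
    related = set()
    for cancer in graph.get(gene, ()):
        related |= graph.get(cancer, set())
    related.discard(gene)
    return sorted(related)
```

Here related_genes follows two hops (gene to cancer type to gene), which is the kind of indirect relationship that is tedious to find in flat tables and straightforward to find in a graph.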
Looking ahead: The next generation of data commons
The future of data commons is more expansive. Just as the data commons that have been developed over the last several years have encouraged public health researchers to push the boundaries of how they work within their domain, there is movement underway to expand the mission of data commons. While bespoke data commons solutions play an important role in advancing agency missions, success depends on enabling individuals to step outside of their comfort zones and unite around the common goal of finding solutions to the most pressing public health crises.
Data commons are coming that house any manner of data type and allow any number of discovery activities. These forums will answer the need of every individual who works with, and thus struggles with, data, whether they are a leader at a federal health agency, a member of a research team participating in a consortium, a healthcare professional, a director of a funding agency, or a policymaker. And by bringing diverse sets of data together, individuals can test hypotheses in minutes that may otherwise require year-long studies.
We partnered with a major cloud service provider to create a flexible data commons solution as a demonstration of a next-generation data commons. In partnership with the Amazon Web Services (AWS) Envision Engineering group, we created an automated, user-friendly environment to import, store, explore, and analyze data using AWS cloud technology. The goal for this tool is to allow researchers to collect, store, and share high-quality data and collaborate in the service of breakthrough discoveries, especially in realms where data poverty is a known problem: mental health, gun violence, and rare diseases.
If researchers can engage with a larger, harmonized, and accessible data environment, they can support the objectives of their own research, while simultaneously enabling the research activities of others.
Such high-quality, standardized data would also pave the way to grow the applications and benefits of artificial intelligence and machine learning in public health research. For example, recent advances in natural language processing, a subfield of artificial intelligence, are enabling rapid analysis and data extraction from scientific literature, health records, social media, and other documents. This could lead to faster identification of diseases and more efficient evaluation of the safety of disease prevention strategies. Similar tools would be able to automatically analyze data as they are entered into a data commons, allowing for unbiased data exploration and detection of unexpected patterns between data sets.
By bringing together diverse users and datasets, tools, and predictive modeling, we’re creating a dramatically different model of data sharing from what previously existed, providing democratic access to data in an interface that is portable, lightweight, scalable, and elastic. Cloud-based data environments create access to relevant data sets in a way that rewards “out of the box” interdisciplinary public health research and thinking.
Public health research requires a vast array of data to identify patterns, trends, and associations. By providing support around data management, workforce transformation, capacity building, and industry-leading digital solutions, we can help agencies get the data they need, draw connections within it, and make important public health decisions. Adopting this new approach for data handling, sharing, and standardization would represent major strides toward ensuring that the impressive quantity of public health data is matched in quality.