Machine learning could make document classification easier and more accurate — if we deploy it correctly.
From climate change to opioid addiction, we are facing serious public health crises that put our research and data management experts to the test. When it comes to scientific evidence, systematic literature reviews—painstaking assessments of all the literature ever produced on a given subject—are often regarded as the gold standard. Though no research method is foolproof, says Vox health correspondent Julia Belluz, “these studies represent the best available syntheses of global evidence about the likely effects of different decisions, therapies and policies.”
That comprehensiveness comes at a high price, though, in time and money. It involves sifting through enormous volumes of literature (sometimes hundreds of thousands of scientific abstracts) stored in academic databases. Researchers use broad keywords to query these databases and capture as many results as possible, but the task of wading through those documents falls to subject matter experts in what is a manual, time-intensive, and expensive process.
Is Automation the Answer?
Machine learning could make the document classification process easier and more accurate, but it faces barriers to adoption in certain contexts. Because systematic literature reviews have the power to influence clinical and public health practices, and by extension the health outcomes of patients and communities, the stakes are much higher than usual. If Netflix fails to recommend a film its users would like, the fallout is minimal. If a widely disseminated systematic literature review fails to account for key research, though, lives could be at stake.
Examples of machine learning methods include:
- Document classification technologies, like text analytics, which use computational algorithms to detect and exploit patterns in large volumes of text
- Natural language processing (NLP), in which machines use grammar and linguistic structure to analyze text in similar ways to how humans process language
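To make the first category concrete, here is a minimal sketch of pattern-based document classification: a bag-of-words Naive Bayes classifier in plain Python. The function names and the toy abstracts are hypothetical, and a production system would use richer features and an established library, but the sketch shows how word-frequency patterns in labeled text can drive relevance predictions.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train_nb(docs, labels):
    """Fit per-class word counts and class priors from labeled documents."""
    counts = {}              # class label -> Counter of word frequencies
    priors = Counter(labels)
    for doc, lab in zip(docs, labels):
        counts.setdefault(lab, Counter()).update(tokenize(doc))
    vocab = {w for c in counts.values() for w in c}
    return counts, priors, vocab

def predict_nb(model, doc):
    """Return the class with the highest smoothed log-likelihood."""
    counts, priors, vocab = model
    n = sum(priors.values())
    best, best_lp = None, -math.inf
    for lab, c in counts.items():
        total = sum(c.values())
        lp = math.log(priors[lab] / n)
        for w in tokenize(doc):
            # Add-one smoothing so unseen words do not zero out the score.
            lp += math.log((c[w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = lab, lp
    return best

# Toy example: two labeled abstracts, one unseen abstract to classify.
model = train_nb(
    ["liver toxicity observed in rats", "market trends in retail pricing"],
    ["relevant", "irrelevant"],
)
prediction = predict_nb(model, "toxicity in rat liver tissue")  # → "relevant"
```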
Machine learning practitioners prioritize scientific defensibility and the reliability of their predictions when assessing model performance. They need to ensure that exempting a proportion of results from manual review on the basis of text analytics does not omit more than an acceptable share of relevant documents. Typically, regulators require that no more than 5% of relevant articles be omitted from a systematic literature review.
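A sketch of how that constraint might be checked on a fully labeled validation set, assuming documents below a score threshold are exempted from manual review (the function names and the 5% default are illustrative, mirroring the regulatory figure above):

```python
def omission_rate(scores, is_relevant, threshold):
    """Fraction of relevant documents that would be skipped if only
    documents scoring at or above `threshold` go to human reviewers."""
    relevant_scores = [s for s, rel in zip(scores, is_relevant) if rel]
    omitted = sum(1 for s in relevant_scores if s < threshold)
    return omitted / len(relevant_scores)

def threshold_is_acceptable(scores, is_relevant, threshold, max_omission=0.05):
    """True if the threshold keeps omitted relevant documents within bounds."""
    return omission_rate(scores, is_relevant, threshold) <= max_omission

# Toy validation set: model scores and known relevance labels.
scores = [0.9, 0.8, 0.2, 0.1]
labels = [True, True, True, False]
```

With this toy data, a threshold of 0.5 would drop one of three relevant documents (a 33% omission rate, far above 5%), while a threshold of 0.15 omits none.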
Automated document classification technologies can be broadly divided into two categories: supervised and unsupervised machine learning. Supervised methods use a training dataset (a set of documents the researcher has already labeled as relevant or irrelevant, from which the computer builds a predictive model) both to classify documents whose relevance status is unknown and to produce metrics of the machine’s expected classification accuracy. Unsupervised methods skip the time-consuming creation of a training dataset, but they require users to devise classification rules and cannot explicitly predict model performance.
Testing Our Theories
To address the shortcomings of both traditional approaches, we developed a hybrid method that combines their benefits and limits their downsides. Supervised clustering, which is based on the theory of semi-supervised learning, can be used to classify documents and to generate accurate, unbiased estimates of model accuracy with a minimal training dataset.
In a recent research paper, we used two previously classified datasets related to chemical toxicity to demonstrate the accuracy and lack of bias in the model performance metrics generated by supervised clustering. Simulations showed that the accuracy and manual work saved using our supervised clustering method are comparable to the performance of more expensive supervised machine learning-based methods. Decision-makers are presented with a semi-continuous decision curve that captures the costs (the fraction of omitted relevant documents) and benefits (the fraction of documents eliminated from review) associated with each potential decision point. The method also includes an ensemble learning component that prioritizes documents in terms of relevance, ensuring that the most important results are reviewed first.
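The cost-benefit trade-off behind such a decision curve can be sketched as follows. This is an illustrative reconstruction, not the published method: it computes cost from known labels, whereas in practice the cost axis for unreviewed documents must be a model prediction.

```python
def decision_curve(scores, is_relevant):
    """For each candidate score threshold, pair the cost (fraction of
    relevant documents eliminated from review) with the benefit
    (fraction of all documents eliminated from review)."""
    n = len(scores)
    n_relevant = sum(is_relevant)
    curve = []
    for t in sorted(set(scores)):
        eliminated = [i for i in range(n) if scores[i] < t]
        cost = sum(1 for i in eliminated if is_relevant[i]) / n_relevant
        benefit = len(eliminated) / n
        curve.append((t, cost, benefit))
    return curve

# Toy data: four scored documents, two of them relevant.
curve = decision_curve([0.9, 0.7, 0.4, 0.1], [True, True, False, False])
```

Each tuple on the curve is one potential decision point; a decision-maker scans for the threshold that maximizes benefit while keeping cost under the acceptable omission level.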
Our recent research has also focused on the area of “active learning,” which can limit costs for research teams with the resources to create a sizeable training dataset by interactively training the model on only the most informative documents. We simulated the performance of this method using a fully classified set of approximately 7,000 abstracts from the chemical toxicity literature. We examined alternative sampling approaches for sequentially expanding the training dataset: uncertainty-based sampling (in which the machine asks the user to label the documents whose status it is least certain about) and probability-based sampling (in which the machine asks the user to label the documents most likely to be relevant). We found that while these methods can reduce the training dataset size relative to random sampling, they can also suffer from biased accuracy predictions that negate those savings. In other words, when an active learning algorithm tells a user that training can stop and then predicts the fraction of relevant documents contained in the discarded set, it may be omitting more relevant articles than it declares. This poses serious problems in the regulatory context.
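The two sampling strategies reduce to simple ranking rules over the unlabeled pool. A minimal sketch, with hypothetical function names and toy relevance probabilities (each pool entry is a document ID paired with the model’s current probability that it is relevant):

```python
def uncertainty_sample(pool, k):
    """Pick the k documents whose predicted probability of relevance is
    closest to 0.5 -- the ones the model is least certain about."""
    return sorted(pool, key=lambda item: abs(item[1] - 0.5))[:k]

def probability_sample(pool, k):
    """Pick the k documents the model considers most likely relevant."""
    return sorted(pool, key=lambda item: -item[1])[:k]

# Toy unlabeled pool: (document ID, predicted probability of relevance).
pool = [("a", 0.95), ("b", 0.52), ("c", 0.10), ("d", 0.48)]
```

Uncertainty-based sampling would ask the user about documents "b" and "d" first (probabilities nearest 0.5), while probability-based sampling would start with "a". In a full active-learning loop, the model is retrained after each batch of newly labeled documents and the pool is re-scored.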
We have developed and tested a bias correction algorithm that greatly reduces bias in model performance predictions when using active learning and are currently building an online tool that implements this technology. Our research has also compared the results from bias-corrected active learning to those based on our hybrid supervised clustering method and shows how the latter, with only a small training dataset, can perform comparably with respect to the accuracy of predictions and the fraction of documents eliminated from review.
What’s Next for Better Data Classification?
The cost savings and efficiencies offered by text analytics and natural language processing make it inevitable that these technologies will be adopted widely in the literature search context. In the regulatory environment, though, issues around legal and scientific defensibility present potential barriers to adoption.
If we’re going to surmount these reservations, we need to develop and promote sophisticated analytical methods that ensure transparency, replicability, and accuracy while maximizing cost savings. We look forward to continuing the conversation with colleagues and industry partners at the 2017 American Public Health Association (APHA) Annual Meeting in Atlanta, Georgia.