Testing Our Theories
To address the shortcomings of both traditional document classification approaches, we developed a hybrid method that combines their benefits and limits their downsides. Supervised clustering, which is based on the theory of semi-supervised learning, can be used to classify documents and generate accurate and unbiased estimates of model accuracy with a minimal training dataset.
In a recent research paper, we used two previously classified datasets related to chemical toxicity to demonstrate the accuracy and lack of bias in model performance metrics generated by supervised clustering. Simulations showed that the accuracy and the manual work saved using our supervised clustering method are comparable to those of more expensive supervised machine learning-based methods. Decision-makers are presented with a semi-continuous decision curve that captures the costs (the fraction of omitted relevant documents) and benefits (the fraction of documents eliminated from review) associated with each potential decision point. The method also includes an ensemble learning component that prioritizes documents in terms of relevance, ensuring that the most important results are reviewed first.
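To make the decision curve concrete, here is a minimal Python sketch of how such a curve could be tabulated in a simulation. It is not the implementation from the paper. It assumes every document already has a relevance score and a known true label (as in a previously classified dataset), and the function name `decision_curve` and the example values are hypothetical.

```python
from typing import List, Tuple


def decision_curve(
    scores: List[float], labels: List[int]
) -> List[Tuple[int, float, float]]:
    """Tabulate cost and benefit for every possible cutoff.

    Assumption: documents with the lowest scores are discarded first.
    Each point is (number_discarded, fraction_eliminated, fraction_relevant_omitted).
    """
    n = len(scores)
    total_relevant = sum(labels)
    # Indices ordered from least to most likely relevant.
    order = sorted(range(n), key=lambda i: scores[i])

    curve = [(0, 0.0, 0.0)]  # discard nothing: no benefit, no cost
    omitted = 0
    for k, i in enumerate(order, start=1):
        omitted += labels[i]  # count relevant documents lost so far
        benefit = k / n  # fraction of documents eliminated from review
        cost = omitted / total_relevant if total_relevant else 0.0
        curve.append((k, benefit, cost))
    return curve


# Hypothetical scores and true labels (1 = relevant) for five documents.
example_scores = [0.9, 0.1, 0.4, 0.8, 0.2]
example_labels = [1, 0, 0, 1, 1]

for discarded, benefit, cost in decision_curve(example_scores, example_labels):
    print(f"discard {discarded}: eliminated={benefit:.2f}, relevant omitted={cost:.2f}")
```

Each row is one potential decision point: a reviewer can read off how much review work a cutoff saves and what share of relevant documents it would omit. In real use the true labels of discarded documents are unknown, so these fractions have to be estimated, which is where unbiased accuracy estimates matter.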
Our recent research has also focused on “active learning,” which can limit costs for research teams with the resources to create a sizeable training dataset by interactively labeling only the most informative documents. We simulated the performance of this method using a fully classified set of approximately 7,000 abstracts from the chemical toxicity literature. We examined alternative sampling approaches to sequentially expanding the training dataset, specifically uncertainty-based sampling (in which the machine asks the user to label the documents whose status it is least certain about) and probability-based sampling (in which the machine asks the user to label the documents it considers most likely to be relevant). We found that while these methods can reduce training dataset size compared to random sampling, they can suffer from biased accuracy predictions that negate their benefits. In other words, when an active learning algorithm tells a user to stop training and predicts the fraction of relevant documents contained in the discarded set, it may be omitting more relevant articles than it declares. This poses serious problems in the regulatory context.
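The difference between the two sampling approaches can be illustrated with a short Python sketch. It assumes a binary relevant/not-relevant classifier that outputs a predicted probability of relevance for each unlabeled document, with 0.5 taken as the point of greatest uncertainty. The function `select_batch` and the example probabilities are hypothetical and are not taken from our simulations.

```python
from typing import List


def select_batch(probabilities: List[float], batch_size: int, strategy: str) -> List[int]:
    """Pick which unlabeled documents to send to the user next.

    "uncertainty": documents whose predicted probability is closest to 0.5.
    "probability": documents with the highest predicted probability of relevance.
    Returns the indices of the selected documents.
    """
    if strategy == "uncertainty":
        def rank_key(i: int) -> float:
            return abs(probabilities[i] - 0.5)  # smaller = less certain
    elif strategy == "probability":
        def rank_key(i: int) -> float:
            return -probabilities[i]  # larger probability ranks first
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    ranked = sorted(range(len(probabilities)), key=rank_key)
    return ranked[:batch_size]


# Hypothetical predicted probabilities of relevance for five unlabeled abstracts.
predicted = [0.95, 0.52, 0.10, 0.45, 0.80]

print(select_batch(predicted, 2, "uncertainty"))  # indices of the least certain documents
print(select_batch(predicted, 2, "probability"))  # indices of the most likely relevant documents
```

Neither strategy draws a random sample, so the labeled set is not representative of the full collection. That is one plausible reason accuracy estimates computed from it can be biased.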
We have developed and tested a bias correction algorithm that greatly reduces bias in model performance predictions when using active learning and are currently building an online tool that implements this technology. Our research has also compared the results from bias-corrected active learning to those based on our hybrid supervised clustering method and shows how the latter, with only a small training dataset, can perform comparably with respect to the accuracy of predictions and the fraction of documents eliminated from review.
What’s Next for Better Data Classification?
The cost savings and efficiencies offered by text analytics and natural language processing make it inevitable that these technologies will be adopted widely in the literature search context. In the regulatory environment, though, issues around legal and scientific defensibility present potential barriers to adoption.
If we’re going to surmount these reservations, we need to develop and promote sophisticated analytical methods to ensure transparency, replicability, and accuracy, while maximizing cost savings. We look forward to continuing the conversation with colleagues and industry partners at the 2017 American Public Health Association (APHA) Annual Meeting in Atlanta, Georgia.