AI holds the potential to help doctors find early markers of disease and policymakers avoid decisions that lead to war. However, a growing body of evidence has revealed deep flaws in how machine learning is used in science, a problem that has swept through dozens of fields and implicated thousands of erroneous papers.
Now, an interdisciplinary team of 19 researchers, led by Princeton University computer scientists Arvind Narayanan and Sayash Kapoor, has published guidelines for the responsible use of machine learning in science.
Review: REFORMS: Consensus-based Recommendations for Machine-learning-based Science.
"When we graduate from traditional statistical methods to machine learning methods, there are a vastly greater number of ways to shoot oneself in the foot," said Narayanan, director of Princeton's Center for Information Technology Policy and a professor of computer science. "If we don't have an intervention to improve our scientific standards and reporting standards when it comes to machine learning-based science, we risk not just one discipline but many different scientific disciplines rediscovering these crises one after another."
The authors say their work is an effort to stamp out this smoldering crisis of credibility that threatens to engulf nearly every corner of the research enterprise. A paper detailing their guidelines appeared May 1 in the journal Science Advances.
Because machine learning has been adopted across virtually every scientific discipline, with no universal standards safeguarding the integrity of those methods, Narayanan said the current crisis, which he calls the reproducibility crisis, could become far more serious than the replication crisis that emerged in social psychology more than a decade ago.
The good news is that a simple set of best practices can help resolve this newer crisis before it gets out of hand, according to the authors, who come from computer science, mathematics, social science, and health research.
"This is a systematic problem with systematic solutions," said Kapoor, a graduate student who works with Narayanan and who organized the effort to produce the new consensus-based checklist.
The checklist focuses on ensuring the integrity of research that uses machine learning. Science depends on the ability to reproduce results and validate claims independently. Otherwise, new work cannot be reliably built atop old work, and the entire enterprise collapses. While other researchers have developed checklists that apply to discipline-specific problems, notably in medicine, the new guidelines start with the underlying methods and apply them to any quantitative discipline.
One of the main takeaways is transparency. The checklist calls on researchers to provide detailed descriptions of each machine learning model, including the code, the data used to train and test the model, the hardware specifications used to produce the results, the experimental design, the project's goals, and any limitations of the study's findings. According to the authors, the standards are flexible enough to accommodate a wide range of nuance, including private datasets and complex hardware configurations.
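As a rough illustration only, the items listed above could be tracked in a small record that flags what a paper has not yet described. This is a minimal sketch, not the published checklist: the class name `ReportingRecord`, its field names, the helper `missing_items`, and the placeholder values are all hypothetical and chosen here to mirror the items named in this article.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ReportingRecord:
    """Hypothetical record of the items the article says a paper should describe.

    Field names are illustrative assumptions, not the wording of the checklist.
    """
    code_location: Optional[str] = None
    training_data: Optional[str] = None
    test_data: Optional[str] = None
    hardware: Optional[str] = None
    experimental_design: Optional[str] = None
    project_goals: Optional[str] = None
    limitations: Optional[str] = None


def missing_items(record: ReportingRecord) -> List[str]:
    """Return the names of fields that are unset or empty, in declaration order."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Placeholder values only; a private dataset can still be described without being released.
record = ReportingRecord(
    code_location="archived repository (placeholder)",
    training_data="private dataset; access procedure described (placeholder)",
    hardware="single workstation GPU (placeholder)",
)

print(missing_items(record))
```

An author or reviewer could run a check like this before submission to see which descriptions are still missing. Free-text fields are used rather than strict types to reflect the flexibility the authors describe.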
While the increased rigor of these new standards might slow the publication of any given study, the authors believe wide adoption of these standards would increase the overall rate of discovery and innovation, potentially by a lot.
"What we ultimately care about is the pace of scientific progress," said sociologist Emily Cantrell, one of the lead authors pursuing her Ph.D. at Princeton. By making sure the papers that get published are of high quality and that they're a solid base for future papers to build on, we potentially speed up the pace of scientific progress. Focusing on scientific progress and not just getting papers out the door is where our emphasis should be."
Kapoor concurred. The errors hurt. "At the collective level, it's just a major time sink," he said. That time costs money, and that money, once wasted, could have catastrophic downstream effects: limiting the kinds of science that attract funding and investment, tanking ventures inadvertently built on faulty science, and discouraging countless young researchers.
In working toward a consensus about what should be included in the guidelines, the authors said they aimed to strike a balance between being simple enough to be widely adopted and comprehensive enough to catch as many common mistakes as possible.
They say researchers could adopt the standards to improve their own work, peer reviewers could use the checklist to assess papers, and journals could adopt the standards as a requirement for publication.
"The scientific literature, especially in applied machine learning research, is full of avoidable errors," Narayanan said. "And we want to help people. We want to keep honest people honest."