According to an article published in the journal Nature, the preprint server bioRxiv is using large language models (LLMs) to generate short summaries of newly posted manuscript preprints. The goal is to improve accessibility and help readers quickly grasp critical aspects of the research. However, initial impressions highlight the significant challenges of accurately summarizing highly technical scientific content with artificial intelligence (AI).
Preprints are unpublished manuscripts posted online before formal peer review and publication. Preprint repositories like bioRxiv and medRxiv have become vital outlets for rapidly disseminating scientific findings across disciplines, especially during fast-moving events like the coronavirus disease 2019 (COVID-19) pandemic. They allow researchers to stake claims on discoveries and receive community feedback before peer review.
However, the explosive growth in preprints has also become overwhelming. Thousands of preprints are uploaded daily across servers like bioRxiv, medRxiv, and arXiv, and scientists struggle to keep up with the massive volume of new research in their fields.
To help readers digest new papers in this deluge, bioRxiv has partnered with ScienceCast, an AI startup utilizing LLMs - AI systems trained on massive text corpora - to generate multi-level summaries of manuscript preprints automatically. The pilot, announced on November 8, 2023, creates three distinct summaries for each newly uploaded preprint targeting different reading levels.
BioRxiv co-founder Dr. Richard Sever said a key motivation was enhancing accessibility, as scientific papers are often highly technical and esoteric. The AI-generated summaries aim to make new research more understandable to wider audiences. The summaries analyze the full manuscript text, not just the author-written abstracts.
While noticing clear factual errors in some summaries, Dr. Sever found that most were accurate and even superior to abstracts crafted by the paper authors themselves. This suggests the potential for AI tools to distill key scientific insights from complex papers. However, the pilot also highlights enduring challenges.
Implementation and Goals
The general audience summaries aim to provide simplified overviews of the research accessible to non-experts outside the field. The mid-level summaries incorporate more scientific jargon and technical details suited for students and researchers in adjacent disciplines. Finally, the expert summaries attempt to give in-depth outlines targeting specialists in that specific field.
The overarching goal is to assist readers in determining if they want to take the time to read the full preprint based on the topic, methods, results, and conclusions. This could help researchers cut through the noise and focus their limited time on papers most relevant to their work.
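The three-tier approach could be sketched as audience-specific prompt templates applied to the full manuscript text before it is passed to a general-purpose LLM. A minimal illustration in Python follows; the function name, tier labels, and prompt wording are hypothetical, not ScienceCast's actual pipeline, and only mirror the three audience levels described above:

```python
# Hypothetical sketch of multi-level summarization prompts.
# The tier labels and instructions below are illustrative only;
# they mirror the general / mid-level / expert summaries the
# bioRxiv pilot describes, not its actual implementation.

AUDIENCE_PROMPTS = {
    "general": "Summarize this preprint for a non-expert reader, avoiding jargon.",
    "intermediate": "Summarize this preprint for researchers in adjacent fields, "
                    "with moderate technical detail.",
    "expert": "Summarize this preprint in depth for specialists, "
              "preserving technical terminology.",
}

def build_summary_prompts(manuscript_text: str) -> dict:
    """Return one LLM prompt per audience level for the full manuscript text."""
    return {
        level: f"{instruction}\n\nManuscript:\n{manuscript_text}"
        for level, instruction in AUDIENCE_PROMPTS.items()
    }
```

Each resulting prompt would then be sent to the language model; note that the pilot summarizes the full manuscript text, not just the author-written abstract.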
However, scientists analyzing early summaries surfaced concerning inaccuracies and nonsensical statements. Dr. Erik van Nimwegen, a computational biologist, derided the expert summary of his preprint on gene expression patterns as "complete gibberish," taking issue with it on social media.
Dr. Robert Seder, a vaccine researcher, found multiple factual errors in the AI-generated summaries of his preprint describing clinical trials of an inhaled COVID-19 vaccine. With significant edits, Dr. Seder noted, the summaries could adequately reflect the research. This indicates the potential of AI summarization if systemic accuracy issues are addressed.
Fundamental Challenges
The mixed early results highlight AI's enduring challenges in summarizing highly technical content. While LLMs can generate human-like text given sufficient data, they can also stumble over the nuanced terminology, experimental methods, and domain knowledge that pervade scientific communication.
Moreover, the interpretive aspect of summarizing - conveying the essence, significance, and implications - is challenging for current AI tools. This limitation is compounded when translating novel discoveries on the cutting edge of human knowledge.
Nonetheless, the natural language capabilities of these models should improve as they ingest more scientific training data. Dr. Victor Galitski, Chief Technical Officer of ScienceCast, notes that specialized scientific LLMs are already in development. The company is fine-tuning models on biological and biomedical data to enhance accuracy on texts like bioRxiv preprints.
Future Outlook
As AI capabilities advance, automated summarization tools could become indispensable aids for navigating the scientific literature deluge. However, ensuring accuracy and reliability over technical terminology and concepts remains an ongoing challenge.
The summaries on bioRxiv explicitly state the AI source. Dr. Sever suggests involving authors to review or approve content if the pilot matures into a permanent fixture. For now, the consequences of errors are minimized by excluding medRxiv, whose content carries clinical implications.
Beyond summarization, AI advances are also enabling interactive literature analysis tools. An "Ask the Paper" feature allows users to converse with bioRxiv preprints to elicit critical details. Apps like NanCI provide referenced answers to reader questions across a corpus of papers.
The physics preprint server arXiv already utilizes AI for audio summarization. As language models continue improving through training on scientific text, automated summarization and literature assistants may find widespread adoption. But concerns around factual precision remain.
Striking the optimal balance between automation and human oversight will be critical for realizing the promise of AI tools in expanding access while maintaining accuracy across burgeoning research literature. The bioRxiv pilot represents an intriguing early experiment in assistive scientific summarization - highlighting possibilities and enduring challenges.