In an article published in the journal Scientific Data, researchers introduced a corpus of 58,568 machine-annotated incident reports of medication errors, addressing the challenge posed by the unstructured free text of such reports. The aim was to facilitate automated analysis through natural language processing (NLP) and thereby enhance patient safety.
The machine annotator behind the dataset achieved high F1-scores of 0.97 for named entity recognition (NER) and 0.76 for intention/factuality (I&F) analysis, making the corpus a valuable resource for developing information extraction models and for incident learning from medication errors.
Background
Medication errors pose a significant risk to patient safety, necessitating effective incident reporting and learning systems. While NLP holds promise for extracting insights from medication-related records, incident reports are typically written as unstructured narratives. Existing systems struggle to analyze the large volume of free-text reports efficiently, hindering progress toward the World Health Organization's (WHO) patient safety goals.
Previous studies lacked a comprehensive, publicly available annotated corpus of incident reports for training and validating NLP models. Addressing this gap, the current paper leveraged Japan's open-access dataset to create the world's largest annotated corpus of medication error-related named entities. Through systematic methodologies and a machine annotator, the study captured 478,175 named entities from 58,568 incident reports, distinguishing intended from actual occurrences and recognizing incident types.
The researchers built upon prior work in structured annotation methodologies and introduced a valuable resource for developing NLP models. The scalable machine annotator held the potential for broader applications, extending beyond incident reports to other document types like electronic health records. This initiative aimed to transform incident learning in healthcare, bridging gaps in information extraction and contributing to enhanced patient safety through advanced AI-driven learning systems.
Methods
The study detailed a comprehensive methodology for creating and validating a large corpus of machine-annotated incident reports of medication errors, addressing the challenge of unstructured narrative data that hinders automated analysis. Leveraging Japan's open-access dataset from the Japan Council for Quality Health Care (JQ) project, which contains 58,568 annotatable free-text incident reports, the paper outlined a robust annotation scheme encompassing NER, I&F, and the categorization of incident types.
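To make the annotation scheme concrete, the sketch below shows one way an annotated report and its entity-level spans could be represented in code. The field names, the entity labels ("Drug", "Dose"), the factuality values, and the example report are illustrative assumptions, not the paper's exact schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class NamedEntity:
    """One annotated span in a report (illustrative fields, not the paper's exact schema)."""
    text: str         # surface string, e.g. a drug name or a dose
    label: str        # entity type, e.g. "Drug", "Dose" (hypothetical tag set)
    start: int        # character offset where the span begins
    end: int          # character offset where the span ends (exclusive)
    factuality: str   # intention/factuality status, e.g. "intended" vs. "actual"

@dataclass
class IncidentReport:
    """A single free-text incident report plus its machine annotations."""
    report_id: str
    year: int
    text: str
    incident_type: str                                # e.g. "wrong dose" (hypothetical category)
    entities: List[NamedEntity] = field(default_factory=list)

# Toy report illustrating the intended-vs-actual distinction the scheme captures
report = IncidentReport(
    report_id="example-001",
    year=2020,
    text="Prescribed 5 mg but administered 50 mg.",
    incident_type="wrong dose",
    entities=[
        NamedEntity("5 mg", "Dose", 11, 15, factuality="intended"),
        NamedEntity("50 mg", "Dose", 33, 38, factuality="actual"),
    ],
)
```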
A machine-annotation pipeline, built around a three-layer multi-task Bidirectional Encoder Representations from Transformers (BERT) model, was developed through pre-training on the JQ corpus, fine-tuning with rule-based annotated data, and further refinement using gold-standard data. The resulting model achieved high accuracy, with F1-scores of 0.97 for NER and 0.76 for I&F in cross-validation.
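The paper's exact architecture and hyperparameters are not reproduced here, but a shared BERT encoder feeding separate heads for NER, I&F, and incident-type classification might look roughly like the sketch below; the checkpoint name and label counts are placeholders, not the study's values.

```python
import torch.nn as nn
from transformers import AutoModel

class MultiTaskTagger(nn.Module):
    """Minimal multi-task sketch: one shared encoder with task-specific heads for
    token-level NER, token-level I&F, and report-level incident-type prediction."""

    def __init__(self, model_name="cl-tohoku/bert-base-japanese",
                 n_ner_labels=9, n_if_labels=3, n_incident_types=5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)   # shared BERT encoder
        hidden = self.encoder.config.hidden_size
        self.ner_head = nn.Linear(hidden, n_ner_labels)        # per-token NER tags
        self.if_head = nn.Linear(hidden, n_if_labels)          # per-token I&F tags
        self.type_head = nn.Linear(hidden, n_incident_types)   # one incident type per report

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        tokens = out.last_hidden_state     # (batch, seq_len, hidden)
        pooled = tokens[:, 0]              # [CLS] representation for the report-level task
        return {
            "ner_logits": self.ner_head(tokens),
            "if_logits": self.if_head(tokens),
            "type_logits": self.type_head(pooled),
        }
```

During fine-tuning, the per-task losses would typically be summed so the shared encoder learns all three tasks jointly, mirroring the staged training described above (rule-based annotations first, gold-standard refinement afterwards).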
The annotated corpus, the world's largest of its kind, offered valuable insights into medication error-related named entities, I&F, and incident types, providing a crucial resource for advancing information extraction models and enhancing incident learning systems in healthcare. The workflow, encompassing data collection, annotation, and machine-annotation development, demonstrated a systematic approach to structuring unstructured free-text data for meaningful analysis and learning.
Data Records
The dataset included detailed information at the named entity level, encompassing identification, report details, named entity specifics, and error status, and was accompanied by an English data dictionary. Additionally, the NLP pipeline and a readme file for the machine annotator were made accessible, enabling the BERT model to be applied to annotate other incident reports. For technical validation, labeled datasets from 2010–2020 (40 reports), 2021 (20 reports), and error-free reports (10 reports) were provided. These datasets facilitate further research and development in information extraction from medication error incident reports.
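As a sketch of how the entity-level records could be explored once downloaded, the snippet below loads a hypothetical CSV export with pandas; the file name and column names are assumptions, and the English data dictionary shipped with the dataset should be consulted for the actual fields.

```python
import pandas as pd

# Hypothetical file and column names -- check the English data dictionary for the real ones.
df = pd.read_csv("medication_incident_entities.csv")

# Each row is one named entity; group rows to inspect all annotations of a single report.
report_entities = df.groupby("report_id")

# Distribution of entity labels across the corpus
print(df["entity_label"].value_counts().head(10))

# Keep only entities from reports flagged as containing an actual error
errors_only = df[df["error_status"] == "error"]
```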
Technical Validation and Usage Notes
The technical validation involved cross-validation, internal validation, external validation, and error analysis of the medication incident reports dataset. During five-fold cross-validation, the model achieved notable F1-scores of 97% for NER and 76% for I&F. Internal validation, using 40 randomly selected reports, demonstrated macro-average F1-scores of 83% for NER and 57% for I&F. External validation, based on 20 reports from 2021, yielded F1-scores of 83% and 50% for the same tasks.
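For readers who want to run the same kind of evaluation on their own annotations, entity-level F1 can be computed with the seqeval library as sketched below; the BIO tags are toy examples rather than the corpus's actual tag set, and the paper's exact evaluation protocol may differ.

```python
from seqeval.metrics import classification_report, f1_score

# Gold and predicted BIO tag sequences for two toy sentences (illustrative labels only)
y_true = [["O", "B-Drug", "I-Drug", "O", "B-Dose"],
          ["B-Drug", "O", "O", "B-Route", "O"]]
y_pred = [["O", "B-Drug", "I-Drug", "O", "O"],
          ["B-Drug", "O", "O", "B-Route", "O"]]

print(f1_score(y_true, y_pred))                    # micro-averaged entity-level F1
print(f1_score(y_true, y_pred, average="macro"))   # macro average over entity types
print(classification_report(y_true, y_pred))       # per-type precision, recall, F1
```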
Error analysis showed 90% accuracy, with a 10% false-positive rate, in identifying error-free reports. The dataset, structured for machine analysis, served as a repository of similar incidents, enabling efficient retrieval by named entities, intention/factuality, and incident types, and helping digital health system designers automate knowledge extraction from past cases. The open, annotated dataset, translated into English, became a valuable global resource for studying real-world medication errors and provided a benchmark for NLP challenges related to medication errors and adverse drug events.
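A minimal sketch of such retrieval, reusing the same hypothetical entity-level table as above, might filter past reports by drug mention and incident type:

```python
import pandas as pd

# Hypothetical file and column names, as in the earlier snippet
df = pd.read_csv("medication_incident_entities.csv")

query_drug = "insulin"       # drug mentioned in the new case under review
query_type = "wrong dose"    # incident type of the new case

similar = df[
    (df["entity_label"] == "Drug")
    & (df["entity_text"].str.contains(query_drug, case=False, na=False))
    & (df["incident_type"] == query_type)
]["report_id"].unique()

print(f"{len(similar)} past reports match the query")
```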
Conclusion
In conclusion, the researchers introduced a corpus of 58,568 machine-annotated incident reports of medication errors, addressing the challenge of unstructured free text. This initiative, leveraging Japan's open-access dataset, created the world's largest annotated corpus of its kind, fostering automated analysis through NLP and advancing patient safety.
The machine annotator achieved high F1-scores, making the corpus a crucial resource for developing NLP models. The comprehensive methodology, machine-annotation pipeline, and technical validation underscored the systematic approach, offering valuable insights into medication error-related named entities and incident types. The dataset's availability on Figshare, along with usage notes, ensured its utility for digital health system designers, researchers, and global efforts to study medication errors.
Journal reference:
- Scientific Data (2024). https://www.nature.com/articles/s41597-024-03036-2