In an article published in the journal Nature, researchers introduced a dictionary-based matching graph network (DBGN) to improve biomedical named entity recognition (BioNER), the computer's ability to recognize and understand biological terms. The approach combines a matching graph method with a bi-directional graph convolutional network (BiGCN). Instead of injecting dictionary information in a simple masked manner, the proposed approach leverages a dictionary-based matching graph directly.
Background
Biomedical text mining refers to the methods and study of how text mining may be applied to texts and literature of the biomedical domain. It draws on ideas from several fields, including natural language processing (NLP), bioinformatics, medical informatics, and computational linguistics, and it supports applications for identifying documents and concepts that match search queries. Search engines such as PubMed allow users to query literature databases with words or phrases present in documents, metadata, or indices.
Traditional biomedical text mining relied on feature-based methods. These were later improved upon by deep neural architectures such as Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), Convolutional Neural Networks (CNN), and Transformer-based models such as BERT.
A BERT model pre-trained on a biomedical corpus performed well, but unresolved issues remained. One of them was the lack of integration of human knowledge, specifically for entities that were not well represented in the corpus.
The researchers proposed incorporating biomedical dictionaries using position features and external dictionary information. The proposed solution, the Dictionary-Based matching Graph Network (DBGN), implements a matching graph method that represents each dictionary-matched entity as a connection from its start position to its end position. Both BiLSTM and BioBERT were used as basic encoders for text representation. By addressing these issues in entity recognition, the method significantly improved performance over masked approaches and enhanced BioNER through the effective use of dictionary information.
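To make the matching graph idea concrete, the snippet below sketches in Python how dictionary matches could be turned into start-to-end edges. The toy sentence, dictionary, and adjacency convention are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch: building a dictionary-based matching graph.
# The toy dictionary and sentence are illustrative only.

def build_matching_graph(tokens, dictionary):
    """Return an adjacency matrix with an edge from the start token
    of each dictionary-matched entity to its end token."""
    n = len(tokens)
    adj = [[0] * n for _ in range(n)]
    for start in range(n):
        for end in range(start, n):
            span = " ".join(tokens[start:end + 1])
            if span.lower() in dictionary:
                adj[start][end] = 1  # forward edge: entity start -> entity end
    return adj

tokens = ["Mutations", "in", "the", "BRCA1", "gene", "cause", "breast", "cancer"]
dictionary = {"brca1", "breast cancer"}  # toy biomedical dictionary
adj = build_matching_graph(tokens, dictionary)
# adj[3][3] == 1 ("BRCA1"); adj[6][7] == 1 ("breast cancer")
```

Reversing the edge directions of this matrix yields the backward version of the graph that the BiGCN consumes alongside the forward one.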
Approach
The model architecture can be broken down into five layers. The first is the input layer, which receives a sequence of biomedical text. The second layer breaks the text into tokens (tokenization), which eases further operations. These tokens are fed into encoders based on BiLSTM and BioBERT, after which the BiGCN encodes both the forward and reverse versions of the dictionary-based matching graph; this graph step can be repeated for T layers. An activation function layer follows, and finally the output layer delivers the label sequence corresponding to the input text.
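The following PyTorch sketch traces this five-layer flow end to end under simplified assumptions: an embedding stands in for the BioBERT/BiLSTM encoder, and the dimensions, self-loops, fuse step, and graph update are illustrative rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class DBGNSketch(nn.Module):
    """Simplified sketch of the five-layer flow: embed -> T bidirectional
    graph layers over the matching graph -> per-token label scores."""

    def __init__(self, vocab_size=1000, dim=64, num_labels=5, T=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)  # stand-in for the BioBERT/BiLSTM encoder
        self.fwd_gcn = nn.Linear(dim, dim)          # transform for the forward graph
        self.bwd_gcn = nn.Linear(dim, dim)          # transform for the reverse graph
        self.fuse = nn.Linear(2 * dim, dim)         # fuse both directions
        self.out = nn.Linear(dim, num_labels)       # output layer: label scores per token
        self.T = T

    def forward(self, token_ids, adj):
        h = self.embed(token_ids)                   # (n, dim) token representations
        eye = torch.eye(adj.size(0))                # self-loops keep each token's own state
        adj_fwd, adj_bwd = adj + eye, adj.t() + eye
        for _ in range(self.T):                     # the graph step is repeated T times
            f = torch.relu(adj_fwd @ self.fwd_gcn(h))
            b = torch.relu(adj_bwd @ self.bwd_gcn(h))
            h = torch.relu(self.fuse(torch.cat([f, b], dim=-1)))  # activation layer
        return self.out(h)                          # label logits for the sequence

token_ids = torch.tensor([1, 2, 3, 4])
adj = torch.zeros(4, 4)
adj[1, 3] = 1.0                                     # one dictionary-matched span (tokens 1..3)
logits = DBGNSketch()(token_ids, adj)               # shape: (4, num_labels)
```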
The BioBERT and BiLSTM encoders provided contextual representations, with the WordPiece tokenizer further splitting words into subwords where required. The BiGCN transformed the matched entities into directional graph connections and encoded the graph information in both forward and backward directions, using two Graph Convolutional Networks (GCNs) to handle the forward and reverse versions.
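As a hedged illustration of the subword step, the snippet below runs a publicly available BioBERT tokenizer from the Hugging Face hub; the checkpoint name is an assumption and not necessarily the one used in the paper.

```python
# Illustration of WordPiece subword tokenization as used by BERT-family models.
# The checkpoint name is an assumption (a commonly used public BioBERT release).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
pieces = tok.tokenize("acetaminophen induces hepatotoxicity")
print(pieces)  # long biomedical terms are split into multiple '##'-prefixed pieces
```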
Experiments
The experiments were conducted on five biomedical text datasets covering gene mention recognition, chemical entity mention recognition, disease mention recognition, and general biomedical entity recognition.
The datasets used were as follows:
- BC2GM: The BioCreative II gene mention recognition dataset, for labeling genes and proteins.
- BC4CHEMD: The BioCreative IV chemical entity mention recognition dataset, for labeling chemical entities.
- BC5CDR: The BioCreative V chemical and disease mention recognition dataset, a combination of the BC5CDR-chem and BC5CDR-disease datasets.
- NCBI-Disease: A dataset introduced for disease name recognition and normalization, with a wide range of applications.
- JNLPBA: A biomedical entity recognition dataset for labeling proteins/genes, RNA, DNA, cell lines, and cell types.
Dictionaries of biomedical entities were gathered for three types of entities (proteins/genes, diseases, and chemicals) from the Comparative Toxicogenomics Database and a biomedical data website. DBGN was compared with various methods, such as MTM, BERT, BioBERT, and CollaboNet, with all methods enhanced by a conditional random field (CRF) layer.
The neural network models were trained on a GeForce RTX 2080 Ti graphics processing unit. The pre-trained BioBERT contained 12 hidden layers with 768 hidden units each, and Adam was the optimizer used for both BioBERT and DBGN. For all experiments, the layer size of the BiGCN was set to two. The performance metrics used to evaluate the models were Precision, Recall, and Macro-averaged F1.
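The snippet below illustrates the three metrics with scikit-learn at the token-label level; this is a simplification, since BioNER results are typically reported at the entity level, but it shows how macro-averaging combines per-label scores.

```python
# Token-level illustration of Precision, Recall, and Macro-averaged F1.
# The example labels are made up; the paper evaluates at the entity level.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["B-Gene", "I-Gene", "O", "B-Chem", "O"]
y_pred = ["B-Gene", "O",      "O", "B-Chem", "O"]

p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"precision={p:.2f} recall={r:.2f} macro-F1={f1:.2f}")
```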
Results
The original BERT showed no notable improvement, whereas BioBERT improved performance across all datasets owing to its domain-specific representations. DBGN outperformed all of its competitors on every dataset with the help of the dictionary-based matching graph, achieving noteworthy performance gains in less training time than both BERT and BioBERT.
Regarding layer size, a value of two delivered the best performance on all datasets. The BiGCN improved performance by drawing on both forward and backward information, and the fuse layer added further gains by merging the two GCNs in each layer. Although residual connections did not improve results, they reduced the number of training epochs. The BiLSTM improved performance by capturing bidirectional long-range dependencies.
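As an illustrative sketch of the fuse layer and the residual connection discussed above (their exact placement in DBGN is an assumption), a single graph layer could look like this:

```python
import torch
import torch.nn as nn

class ResidualGraphLayer(nn.Module):
    """One bidirectional graph layer with a fuse step and a residual
    (skip) connection; dimensions are illustrative."""

    def __init__(self, dim=64):
        super().__init__()
        self.fwd = nn.Linear(dim, dim)
        self.bwd = nn.Linear(dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)  # fuse layer: merges the two GCN directions

    def forward(self, h, adj):
        f = torch.relu(adj @ self.fwd(h))      # forward-direction GCN pass
        b = torch.relu(adj.t() @ self.bwd(h))  # backward-direction GCN pass
        fused = self.fuse(torch.cat([f, b], dim=-1))
        return h + fused                       # residual connection: add the layer input back

h = torch.randn(4, 64)
adj = torch.zeros(4, 4)
adj[0, 2] = 1.0
out = ResidualGraphLayer()(h, adj)  # same shape as the input: (4, 64)
```

The skip connection `h + fused` gives gradients a direct path through each layer, which is consistent with the reported effect of reaching convergence in fewer epochs.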
Conclusion
In conclusion, DBGN significantly advanced biomedical entity recognition, outperforming state-of-the-art models. The BiGCN module contributed to the model's success with minimal training time increase. Future research could extend this approach to enhance various NLP applications and address entity boundary challenges, promising a coherent system for the efficient recognition of diverse biomedical entities.