In an article recently submitted to the arXiv* preprint server, researchers investigated the feasibility of leveraging reinforcement learning (RL) and external knowledge to improve the reliability of language models (LMs).
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
The natural language processing (NLP) community is developing new datasets and LMs for both general-purpose and specialized domains, often using crowd-sourcing techniques. To gain acceptance in the NLP community, new LMs must demonstrate their effectiveness and reliability in comprehending natural language on benchmarks such as the General Language Understanding Evaluation (GLUE) suite.
GLUE benchmarks have gained importance in NLP due to their high annotator agreement, which sets a high performance threshold for new LMs. However, concerns are growing about whether new LMs can actually match the performance obtained from manual annotation of these datasets.
The reliability of annotations is typically evaluated using inter-annotator agreement scores; for instance, GLUE tasks use Cohen's Kappa to measure agreement beyond chance. LM performance assessment, however, still depends primarily on conventional metrics such as accuracy, which reflect LM reliability only to a limited extent.
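To make the metric concrete: Cohen's Kappa corrects raw agreement between two raters for the agreement expected by chance from their label frequencies. A minimal pure-Python sketch (the raters and labels below are hypothetical, not from the paper):

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's Kappa for two equal-length label sequences."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items both raters label identically.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    p_e = sum((ca[k] / n) * (cb[k] / n) for k in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

# Two "annotators" (e.g. an LM and a human) agree on 6 of 8 items;
# with balanced labels, half that agreement is expected by chance.
m1 = [1, 1, 0, 1, 0, 0, 1, 0]
m2 = [1, 1, 0, 0, 0, 1, 1, 0]
print(cohen_kappa(m1, m2))  # 0.5
```

Here observed agreement is 0.75 and chance agreement is 0.5, so Kappa is (0.75 − 0.5)/(1 − 0.5) = 0.5, noticeably lower than the raw 75% agreement.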
Leveraging knowledge and RL
In this paper, researchers investigated the feasibility of a knowledge-guided LM ensembling approach that leverages RL to integrate knowledge from Wikipedia and ConceptNet into LMs as knowledge graph embeddings, with the goal of improving LM reliability.
Specifically, the proposed approach mimicked human annotators, who draw on external knowledge to compensate for information gaps in datasets. The Kappa-inspired ensembling of LMs represented a synergistic collaboration between simpler models, yielding a system more effective and resilient than any single model. Additionally, the collective strength of the ensemble improved its decision-making confidence and enabled it to address the deficiencies of individual models under specific conditions.
Researchers conceptualized, devised, and evaluated the LM ensembles around three research questions: whether Kappa can be used to determine the reliability of GLUE benchmark-trained LMs, whether Kappa can be improved by strategically ensembling LMs while treating them as annotators, and whether infusing external knowledge during ensembling can improve overall reliability.
Three ensembling techniques, namely deep ensemble (DE), semi ensemble (SE), and shallow ensemble (ShE), were proposed to address these questions. The ShE and SE techniques did not integrate knowledge, while the DE technique integrated knowledge from external sources, including ConceptNet and Wikipedia.
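The article does not spell out the internal mechanics of each technique. As a rough sketch of the general idea, a shallow-style ensemble might combine the models' finished label predictions by majority vote, while deeper variants might combine internal signals such as per-class probabilities before deciding; the function names and numbers below are illustrative assumptions, not the paper's method:

```python
from collections import Counter

def shallow_ensemble(label_votes):
    """Majority vote over each model's final label for one example."""
    return Counter(label_votes).most_common(1)[0][0]

def average_probs(prob_vectors):
    """Average per-class probabilities across models, then take the argmax."""
    n_classes = len(prob_vectors[0])
    avg = [sum(p[c] for p in prob_vectors) / len(prob_vectors)
           for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)

# Three hypothetical models on one binary-task example:
votes = [1, 0, 1]                              # final labels
probs = [[0.4, 0.6], [0.7, 0.3], [0.2, 0.8]]   # class probabilities
print(shallow_ensemble(votes))  # 1
print(average_probs(probs))     # 1
```

Either way, one model's mistake (the middle model above) is outvoted or outweighed by the others, which is the sense in which an ensemble can mask the deficiencies of individual members.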
Researchers assessed the reliability of the ensemble models across nine GLUE tasks using Kappa, since it accounts for chance agreement and therefore better captures the prediction uncertainty arising from the chance behavior of LMs. They also evaluated the accuracy of the ensemble techniques.
The nine benchmark datasets employed in this study from the GLUE suite include the Corpus of Linguistic Acceptability (CoLA), Stanford Sentiment Treebank (SST-2), Microsoft Research Paraphrase Corpus (MRPC), Semantic Textual Similarity Benchmark (STS-B), Quora Question Pairs (QQP), Multi-Genre Natural Language Inference (MNLI), Question NLI (QNLI), Recognizing Textual Entailment (RTE), and Winograd Schema Challenge (WNLI).
The bidirectional encoder representations from transformers (BERT) model was employed to present the findings, as this streamlined model contains relatively few parameters, making it efficient and simple. Two BERT variants, BERTlarge and BERTbase, were used as baselines.
Significance of the study
All ensemble models uniformly outperformed the individual BERT baselines across the nine GLUE tasks. ShE was the best-performing model on two tasks (QNLI and MNLI), while SE was the best-performing model on three (CoLA, MRPC, and STS-B).
DE was the highest-performing model on the remaining four GLUE tasks (QQP, RTE, SST-2, and WNLI). Additionally, DE's average accuracy across the GLUE tasks was 5.21% and 5.57% higher than that of BERTbase and BERTlarge, respectively.
The Kappa score increased for all ensemble models, with SE attaining the highest Kappa. Overall, the ensemble models showed an average 0.12 increase in Kappa compared to the baselines. Thus, combining models through DE, SE, and ShE effectively addresses the uncertainty individual models display in their predictions, concentrating on confident outcomes and raising Kappa values.
DE displayed the highest accuracy relative to the baselines and an average 0.11 increase in Kappa, indicating that integrating knowledge from external sources can significantly improve the overall reliability of LMs. However, SE's marginally superior Kappa score compared to DE indicates that increased accuracy does not always translate into enhanced reliability. Further research is required to evaluate these ensemble models on real-world datasets and assess their performance and reliability in domain-specific applications.
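The gap between accuracy and reliability is easy to reproduce. On an imbalanced task, a degenerate model that always predicts the majority class scores high accuracy but zero Kappa, because all of its agreement with the gold labels is exactly what chance predicts (the data below are hypothetical, chosen only to illustrate the effect):

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's Kappa for two equal-length label sequences."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_e = sum((ca[k] / n) * (cb[k] / n) for k in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

# Imbalanced task: 9 negatives, 1 positive. A model that always
# predicts the majority class gets 90% accuracy yet Kappa of 0.
gold = [0] * 9 + [1]
pred = [0] * 10
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
kappa = cohen_kappa(gold, pred)
print(accuracy, kappa)  # 0.9 0.0
```

This is the same phenomenon, in miniature, as SE beating DE on Kappa despite DE's higher accuracy: the two metrics reward different things, and only Kappa discounts agreement attributable to chance.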