In an article recently submitted to the arXiv* preprint server, researchers addressed the challenge of improving general-purpose pre-trained language models in commonsense reasoning. They highlighted that existing models still struggle on benchmarks such as the Com2Sense dataset and attributed this shortfall to a gap between current machine learning (ML) techniques and human-level commonsense reasoning.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
The study aimed to bridge this gap by introducing contemporary ML-based methods. The experiments combined knowledge transfer, model ensembling, and a pairwise contrastive objective. The best proposed model surpassed previous models, achieving an absolute improvement of approximately 15% in Pairwise Accuracy and around 8.7% in Standard Accuracy.
Literature Review
Endowing Natural Language Processing (NLP) models with human-like commonsense knowledge has been a longstanding challenge in the field. In 2021, researchers introduced the Com2Sense dataset, a comprehensive benchmark for commonsense reasoning. The dataset consists of natural language sentence pairs labeled True or False according to whether they accord with intuitive commonsense knowledge. Under the central evaluation criterion, Pairwise Accuracy, a model must classify both sentences of a pair correctly for the pair to count as a success.
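To make the distinction concrete, the following minimal Python sketch contrasts Standard Accuracy with Pairwise Accuracy; the interleaved data layout (each sentence immediately followed by its complement) and the function names are illustrative assumptions, not the benchmark's actual evaluation code.

```python
# Illustrative comparison of the two metrics used on Com2Sense.
# Assumption: predictions/labels are interleaved so that items i and i+1
# form one complementary pair (this layout is for the example only).

def standard_accuracy(preds, labels):
    """Fraction of individual sentences classified correctly."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

def pairwise_accuracy(preds, labels):
    """Fraction of complementary pairs where BOTH sentences are correct."""
    pred_pairs = zip(preds[0::2], preds[1::2])
    label_pairs = zip(labels[0::2], labels[1::2])
    correct = sum(p == l for p, l in zip(pred_pairs, label_pairs))
    return correct / (len(labels) // 2)

# Getting 3 of 4 sentences right yields 75% standard accuracy,
# but only 1 of the 2 pairs is fully correct, so pairwise accuracy is 50%.
preds, labels = [1, 0, 1, 1], [1, 0, 0, 1]
print(standard_accuracy(preds, labels), pairwise_accuracy(preds, labels))  # 0.75 0.5
```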
Initial studies on the Com2Sense dataset revealed that existing general-purpose language models and dedicated commonsense understanding models performed poorly on it. Notably, these models exhibited significant drops from Standard Accuracy to Pairwise Accuracy, indicating a considerable deviation from human-like behavior.
Study Methodology
Transfer Learning: The authors explored knowledge transfer to enhance language models' performance in commonsense reasoning. They leveraged the Semantic Evaluation (SemEval) 2020 Task 4 dataset, which contains sentence pairs in which one sentence is sensible and the other is not. Instead of fine-tuning a model directly on the Com2Sense dataset, they first trained the DeBERTaV3-large model on the SemEval dataset to produce an intermediate checkpoint. They then fine-tuned this checkpoint on Com2Sense using the same parameters as their best-performing model. The approach aims to harness the knowledge acquired from SemEval to improve performance on commonsense reasoning.
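A schematic of this two-stage recipe, assuming the Hugging Face Transformers library, might look as follows; the dataset variables (semeval_ds, com2sense_ds), output paths, and training settings are placeholders rather than the authors' actual configuration.

```python
# Two-stage transfer-learning sketch: SemEval-2020 Task 4 first, Com2Sense second.
# Assumptions: `semeval_ds` and `com2sense_ds` are pre-tokenized binary
# classification datasets prepared elsewhere; epochs and paths are placeholders.
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# Stage 1: fine-tune DeBERTaV3-large on SemEval to obtain an intermediate checkpoint.
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-large", num_labels=2)
Trainer(model=model,
        args=TrainingArguments(output_dir="semeval_ckpt", num_train_epochs=3),
        train_dataset=semeval_ds).train()
model.save_pretrained("semeval_ckpt")

# Stage 2: continue fine-tuning the SemEval checkpoint on Com2Sense,
# reusing the hyperparameters of the best Com2Sense-only run.
model = AutoModelForSequenceClassification.from_pretrained("semeval_ckpt", num_labels=2)
Trainer(model=model,
        args=TrainingArguments(output_dir="com2sense_ckpt", num_train_epochs=3),
        train_dataset=com2sense_ds).train()
```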
Contrastive Loss Function: The researchers recognized that the Com2Sense dataset consists of complementary statement pairs, where each statement has a counterpart constructed through minor word perturbations but carrying the opposite label. To exploit this distinctive structure, they propose a Pairwise Contrastive Loss (PCL) inspired by the Information Noise-Contrastive Estimation (InfoNCE) loss.
The researchers designed this loss function to help the model differentiate between syntactically similar yet semantically distinct inputs, namely each commonsensical statement and its counterpart. It pushes the representations of each complementary input pair apart in the embedding space, with the aim of sharpening the model's ability to capture the distinctions between such closely related sentences.
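The paper's exact loss is not reproduced here, but an InfoNCE-style pairwise contrastive term consistent with this description could be sketched as below; the temperature, the use of sentence embeddings, and the choice of in-batch positives are assumptions made for illustration, not the authors' precise formulation.

```python
import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(z_sense, z_nonsense, temperature=0.1):
    """Illustrative InfoNCE-style pairwise contrastive loss (an assumed
    formulation, not the paper's exact equation).

    z_sense / z_nonsense: (batch, hidden) embeddings of the commonsensical
    sentences and their perturbed, non-commonsensical complements. Each
    commonsensical sentence is pushed away from its own complement and
    pulled toward the other commonsensical sentences in the batch.
    """
    z_sense = F.normalize(z_sense, dim=-1)
    z_nonsense = F.normalize(z_nonsense, dim=-1)

    sim_pos = z_sense @ z_sense.T / temperature                            # (B, B)
    sim_neg = (z_sense * z_nonsense).sum(-1, keepdim=True) / temperature  # (B, 1)

    mask = torch.eye(len(z_sense), dtype=torch.bool, device=z_sense.device)
    logits = torch.cat([sim_pos.masked_fill(mask, -1e9), sim_neg], dim=1)

    # Probability mass should fall on the other commonsensical sentences,
    # not on the sentence's own complement.
    log_prob = F.log_softmax(logits, dim=1)
    loss = -(log_prob[:, :-1].masked_fill(mask, 0.0).sum(1) / (len(z_sense) - 1))
    return loss.mean()

# Usage with dummy embeddings (batch of 8 pairs, hidden size 768):
loss = pairwise_contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
```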
Ensemble Techniques and Data Perturbation: Given the complementary nature of the Com2Sense dataset, each input pair should consist of one positive sample and one negative sample. The authors proposed a model ensemble and rule-based perturbation method to reduce instances where the model incorrectly assigns the same labels to both sentences. They employ multiple fine-tuned models and rank them by their pairwise accuracy on the development set. The highest-performing model is used as a base predictor to generate predictions for the test set, which may include Same-Output Pairs. For these pairs, they evaluate the ability of other models to differentiate between the two samples. If a new model can distinguish them, its prediction is adopted. This ensemble strategy helps ensure that the models can effectively discriminate between syntactically similar sentence pairs that convey different ideas.
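The resolution logic for such Same-Output Pairs can be summarized with the short sketch below; the predict interface and variable names are illustrative assumptions, with the models assumed to have been sorted by development-set pairwise accuracy in advance.

```python
# Ensemble fallback for Same-Output Pairs, following the strategy described above.
# Assumptions: `models` is a list of fine-tuned classifiers sorted best-first by
# dev-set pairwise accuracy, and model.predict(pair) returns (label_a, label_b).

def ensemble_predict(models, test_pairs):
    final_predictions = []
    for pair in test_pairs:
        prediction = models[0].predict(pair)          # base predictor: best dev model
        if prediction[0] == prediction[1]:            # Same-Output Pair detected
            for backup_model in models[1:]:           # walk down the ranked list
                alternative = backup_model.predict(pair)
                if alternative[0] != alternative[1]:  # this model separates the pair
                    prediction = alternative
                    break
        final_predictions.append(prediction)
    return final_predictions
```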
Study Findings
The study begins by comparing different model backbones, namely BERT (Bidirectional Encoder Representations from Transformers)-base and its variants RoBERTa-base, DeBERTa-base, and DeBERTaV3-base, each trained with the best fine-tuning parameters reported by its respective authors. The results show DeBERTaV3 to be the strongest backbone at 48.74% pairwise accuracy, while DeBERTa and RoBERTa exhibit similar performance at ∼18%, and BERT is the lowest-performing model at ∼3%. Further experiments comparing DeBERTaV3-base and DeBERTaV3-large under the best fine-tuning parameters were run to determine the best model size.
The results show that DeBERTaV3-large reaches 68.34% pairwise accuracy, compared with 52.76% for DeBERTaV3-base, supporting the hypothesis that larger models have stronger commonsense reasoning ability. After selecting the best-performing model, DeBERTaV3-large, as the base model, the authors tuned its hyperparameters, including batch size (the effective batch size after gradient accumulation), learning rate, and warmup steps. In each case, only the investigated parameter was varied while all other parameters were held fixed, since the impact of varying values differed from parameter to parameter.
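A compact sketch of this one-factor-at-a-time tuning loop is given below; the candidate values and the train_and_eval helper are hypothetical stand-ins, not the grids or code reported in the paper.

```python
# One-factor-at-a-time hyperparameter search over DeBERTaV3-large settings.
# Assumptions: the candidate values are illustrative, and train_and_eval(config)
# is a hypothetical helper that fine-tunes a model with the given configuration
# and returns its development-set pairwise accuracy.

base_config = {"batch_size": 32, "learning_rate": 1e-5, "warmup_steps": 500}
search_space = {
    "batch_size": [16, 32, 64],          # effective size after gradient accumulation
    "learning_rate": [5e-6, 1e-5, 2e-5],
    "warmup_steps": [0, 500, 1000],
}

best_values = {}
for param, candidates in search_space.items():
    scores = {}
    for value in candidates:
        config = {**base_config, param: value}   # vary one parameter, hold the rest fixed
        scores[value] = train_and_eval(config)
    best_values[param] = max(scores, key=scores.get)
print(best_values)
```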
Limitations
The study also acknowledged certain limitations. Inconsistent Graphics Processing Unit (GPU) batch sizes during training, driven by hardware constraints, affected the results. Moreover, the limited exploration of hyperparameters due to time constraints means the chosen settings may not be globally optimal. Another significant limitation was the reliance on the PCL, which is specific to datasets structured as paired commonsensical/non-commonsensical inputs and is therefore less generalizable to datasets lacking such a structure.
Conclusion
In conclusion, this research project aimed to enhance general-purpose language models’ commonsense learning and reasoning, as evaluated through the Com2Sense benchmark. The study demonstrated that knowledge transfer, pairwise contrastive learning, and model ensembling substantially improved model performance, especially across diverse backbones. These methods outperformed the existing state-of-the-art approaches, representing a significant step forward in commonsense reasoning. The study offers valuable insights and techniques for future research in natural language understanding and commonsense reasoning.