In a paper published in the journal Information, researchers proposed utilizing large language models (LLMs) such as the generative pre-trained transformer GPT-3.5, Llama 2, and Mistral to automatically suggest properties beyond traditional keywords, aiming to enhance the findability of scientific research.
They compared manually curated properties in the Open Research Knowledge Graph (ORKG) with those generated by LLMs, assessing performance through semantic alignment, fine-grained property mapping accuracy, embedding-based cosine similarity using SciNCL (neighborhood contrastive learning for scientific document representations), and expert surveys within a multidisciplinary science context. While LLMs showed promise in structuring science, further refinement was recommended to better align with scientific tasks and human expertise.
Background
Past work has extensively investigated the utilization of LLMs in scientific literature analysis, encompassing tasks like summarization, insight extraction, and literature reviews. However, the specific application of LLMs for recommending research dimensions is relatively new. Recent advancements include the development of domain-specific language models such as scientific bidirectional encoder representations from transformers (SciBERT), scientific paper embeddings using citation-informed transformers (SPECTER), and SciNCL.
Evaluations comparing LLM-generated dimensions with manually curated properties have employed similarity measures such as cosine and Jaccard similarity. Additionally, LLMs have been utilized as evaluators, showcasing their potential in assessing generated content quality.
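As a concrete illustration of such measures, the following minimal Python sketch compares a list of curated properties with a list of generated dimensions using Jaccard similarity over the label sets and cosine similarity over a simple bag-of-words representation. The property lists and the TF-IDF representation are illustrative assumptions, not the exact setup used in the studies discussed.

```python
# Minimal sketch: comparing curated properties with generated dimensions
# via Jaccard and cosine similarity. Example labels are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

orkg_properties = ["method", "dataset", "evaluation metric", "research problem"]
llm_dimensions = ["approach", "dataset", "metrics", "research problem", "limitations"]

# Jaccard similarity over the two sets of lower-cased labels.
a = set(p.lower() for p in orkg_properties)
b = set(d.lower() for d in llm_dimensions)
jaccard = len(a & b) / len(a | b)

# Cosine similarity between the two label lists, treated as bags of words.
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([" ".join(orkg_properties), " ".join(llm_dimensions)])
cosine = cosine_similarity(vectors[0], vectors[1])[0][0]

print(f"Jaccard: {jaccard:.2f}, cosine: {cosine:.2f}")
```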
Evaluation and Analysis Framework
This section outlines the approach in three parts. First, it describes the creation of the gold-standard evaluation dataset from the ORKG, whose research comparison properties, annotated by human domain experts, serve as the reference against which the similarity of LLM-generated properties is assessed.
Second, it gives an overview of the three LLMs, namely GPT-3.5, Llama 2, and Mistral, applied to automatically generate the research comparison properties, highlighting their respective technical characteristics. Lastly, it discusses the various evaluation methods used in this study, which offer differing perspectives on how the ORKG properties of the gold-standard instances compare with those generated by the LLMs.
The analysts detailed the process of curating the evaluation dataset from ORKG comparisons. This dataset comprises structured papers from diverse research fields, each accompanied by human-annotated properties. These properties reflect nuanced aspects of research contributions across various domains, which is essential for comparative analysis with LLM-generated dimensions. The distinction between ORKG properties and research dimensions is elucidated, emphasizing the broader context provided by the latter in analyzing research problems.
The researchers discussed the selection and characterization of three LLMs—GPT-3.5, Llama 2, and Mistral—for generating research dimensions. Each model is assessed based on its parameters, accessibility, and performance, highlighting their suitability for the evaluation tasks. Furthermore, the methodology for designing prompts tailored to each LLM to ensure optimal performance in generating research dimensions is outlined.
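A hedged sketch of what such a prompt might look like is shown below; the prompt wording, the example inputs, and the OpenAI client call with gpt-3.5-turbo are assumptions for illustration and are not taken from the paper.

```python
# Illustrative sketch of prompting an LLM to suggest research dimensions
# for a paper. Prompt wording and model settings are hypothetical.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

title = "Example paper title"
abstract = "Example abstract describing the research problem, method, and results."

prompt = (
    "Given the title and abstract of a scientific paper, suggest a concise list "
    "of research dimensions (properties) that could be used to compare this paper "
    "with related work.\n\n"
    f"Title: {title}\nAbstract: {abstract}\n\nResearch dimensions:"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
)
print(response.choices[0].message.content)
```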
The investigators outlined the approach to evaluating the similarity between ORKG properties and LLM-generated research dimensions. The evaluation employs multiple techniques: semantic alignment and deviation assessments using GPT-3.5, property-to-dimension mappings, and embedding-based semantic distance evaluations. Additionally, the human assessment survey conducted to gauge the utility of LLM-generated dimensions compared to domain-expert-annotated ORKG properties is described.
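For the embedding-based evaluations, a rough sketch of computing semantic distance with SciNCL embeddings is given below; the Hugging Face model identifier malteos/scincl, the mean pooling, and the example strings are assumptions for illustration rather than the study's exact procedure.

```python
# Sketch: semantic similarity between an ORKG property and an LLM-generated
# dimension using SciNCL embeddings (model id and pooling are assumptions).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("malteos/scincl")
model = AutoModel.from_pretrained("malteos/scincl")

def embed(text: str) -> torch.Tensor:
    # Encode the text and mean-pool the last hidden states into one vector.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

prop, dim = "evaluation metric", "metrics used for evaluation"
similarity = torch.nn.functional.cosine_similarity(embed(prop), embed(dim), dim=0)
print(f"Cosine similarity: {similarity.item():.2f}")
```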
LLM Performance Evaluation
The evaluation section delves into the performance of LLMs at recommending research dimensions, comparing their output against ORKG properties. It employs various similarity assessments, including semantic alignment, deviation evaluations, property mappings, and embedding-based analyses.
Results indicate a moderate alignment between paper properties and research dimensions, with LLM-generated dimensions showing diversity but lower similarity to ORKG properties. These findings highlight the challenge of replicating expert annotation using LLMs and suggest avenues for improving alignment through domain-specific fine-tuning.
In-depth analysis reveals a discrepancy in mappings between paper properties and research dimensions, emphasizing the varied scopes of ORKG properties and research dimensions. While LLMs offer diversity in dimension generation, their alignment with expert-annotated ORKG properties remains a hurdle.
However, embedding-based evaluations demonstrate a high semantic similarity between LLM-generated dimensions and ORKG properties, particularly with GPT-3.5. This underscores the potential of LLMs in automating research metadata creation, albeit with room for further refinement.
Overall, the evaluation underscores the capability of LLMs to generate research dimensions aligned with expert-annotated ORKG properties, albeit with some challenges. Despite the need for improvement in specificity and alignment with research goals, LLMs offer valuable support in structuring research contributions and comparisons. The findings highlight the promising role of artificial intelligence (AI) tools like LLMs in enhancing knowledge organization within platforms such as the ORKG, paving the way for more efficient and effective research dissemination and discovery.
Conclusion
In summary, the study investigated the efficacy of LLMs in recommending research dimensions, focusing on their alignment with manually curated ORKG properties and their potential to automate research metadata creation. A moderate alignment was found between LLM-generated dimensions and expert-curated properties, alongside challenges in replicating the nuanced expertise of domain experts.
While LLMs offered diversity in dimension generation, their alignment with expert-curated properties remained a hurdle. Future research should explore fine-tuning LLMs on scientific domains to enhance their performance in recommending research dimensions, advancing their potential to automate research metadata creation and improve knowledge organization.