Citation Challenges in AI: GPT-4 Makes Strides but Faces Lingering Issues


By Dr Silpaja Chandrasekar, PhD
Reviewed by Susha Cheriyedath, M.Sc.
Sep 11 2023

In a paper published in the journal Scientific Reports, researchers examined fabricated bibliographic citations generated by two versions of the Chat Generative Pre-trained Transformer (ChatGPT) chatbot, GPT-3.5 and GPT-4. The results showed that GPT-4 produced fewer fabricated citations (18% vs. 55% for GPT-3.5) and fewer errors in citations to genuine works (24% vs. 43% for GPT-3.5). Despite GPT-4's clear improvement over GPT-3.5, problems persist.

*Study: Citation Challenges in AI: GPT-4 Makes Strides but Faces Lingering Issues. Image credit: MMD Creative/Shutterstock*

Background

OpenAI introduced the ChatGPT chatbot in November 2022. Initially built on the GPT-3.5 model, it uses artificial intelligence (AI) technologies such as natural language processing (NLP), machine learning, and deep learning to produce text that resembles human writing. It can perform various tasks, including answering questions, engaging in conversations, summarizing, translating, and creating content. However, it is primarily a language-processing tool and does not consistently produce accurate content. GPT-4 is reported to be less susceptible to factual errors, indicating an improvement in performance.

Previous studies consistently found that ChatGPT tends to generate citations to non-existent works, with the share of fabricated citations ranging from 47% to 69%. Accurate citations are crucial for supporting claims and providing context, and fabricated citations undermine both goals. Additionally, non-fabricated ChatGPT citations often contain errors, particularly in numerical components.

Proposed method

Data Collection and Study Overview

In this study, both GPT-3.5 and GPT-4 were utilized to generate short papers across 42 multidisciplinary topics. The objective was to compile data on the 636 bibliographic citations found within the 84 papers. Subsequently, extensive searches were conducted across various databases and websites. These searches aimed to assess several aspects, including the prevalence of fabricated citations, the frequency of errors within citations referencing genuine works, the adherence to the fundamental principles of APA citation format, and the characteristics of the hyperlinks contained in ChatGPT citations. Supplementary Appendix 1 details the 42 paper topics, while Supplementary Appendix 2 includes the 84 generated texts from GPT-3.5 and GPT-4. Supplementary Appendix 3 comprises the resulting data file.

Paper Generation and Prompts

GPT-3.5 and GPT-4 were each used to generate 42 short papers of the type typically expected in first-year composition courses at U.S. universities. The paper topics covered diverse subjects, including the health effects of e-cigarettes, the consequences of China's one-child policy, and the potential use of cloning to revive extinct species. Each paper was started in a new chat (conversation), using a prompt that followed recommended guidelines. The chatbot received a uniform introductory text instructing it to act as an academic researcher. Its task was to compose a 2000-word paper, including citations and a bibliography, focused on a specific research question related to the chosen topic.

Because of the character limit on ChatGPT's responses, the initial responses consistently fell short of 2000 words and never formed a complete paper. Follow-up prompts such as "Please continue" were often used to make ChatGPT resume the text from where it left off. Text following the initial bibliography was excluded from the analysis in this study. Supplementary Appendix 2 contains the complete texts generated by GPT-3.5 and GPT-4 in response to each of the 42 paper topics.

Data Compilation and Analysis

The length of each paper and the number of works listed in the bibliography were recorded. Additionally, any noteworthy irregularities, such as misinformation or fabricated empirical results, were carefully documented. Parenthetical citations without corresponding references were also noted during the assessment. Among the 84 papers, a total of 636 citations were identified. For each citation, complete bibliographic information, citation frequency in the text, the classification of the work as scholarly or popular, and the publication type (article, book, chapter, or website) were documented. The "website" category encompassed web content other than articles, books, and chapters.
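The per-citation record described above can be pictured as a small data structure. The following is a minimal sketch only: the class name, field names, and the `tally_by_type` helper are hypothetical and are not taken from the study's data file (Supplementary Appendix 3), and the sample rows are placeholders rather than real data.

```python
from collections import Counter
from dataclasses import dataclass

# Publication types named in the article.
VALID_TYPES = {"article", "book", "chapter", "website"}


@dataclass
class CitationRecord:
    """One generated citation. All field names are illustrative assumptions."""
    model: str           # "GPT-3.5" or "GPT-4"
    topic_id: int        # which of the 42 paper topics the citation came from
    reference_text: str  # bibliographic information as generated by the chatbot
    times_cited: int     # how often the work is cited in the paper text
    scholarly: bool      # True for scholarly works, False for popular works
    pub_type: str        # one of VALID_TYPES


def tally_by_type(records):
    """Count citations per (model, publication type) pair."""
    counts = Counter()
    for record in records:
        if record.pub_type not in VALID_TYPES:
            raise ValueError(f"unknown publication type: {record.pub_type!r}")
        counts[(record.model, record.pub_type)] += 1
    return counts


# Placeholder rows for illustration only; these are not data from the study.
sample = [
    CitationRecord("GPT-4", 1, "placeholder reference A", 2, True, "article"),
    CitationRecord("GPT-4", 1, "placeholder reference B", 1, False, "website"),
    CitationRecord("GPT-3.5", 1, "placeholder reference C", 1, True, "article"),
]
for key, count in sorted(tally_by_type(sample).items()):
    print(key, count)
```

Keeping the publication type as a checked string makes a mistyped category fail loudly instead of silently creating a fifth category in the tallies.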

Subsequently, comprehensive searches were conducted across various sources to determine whether each cited work was genuine or fabricated. The sources included Google, Google Scholar, Amazon, PubMed, Scopus, and publishers' websites. A work was considered genuine (non-fabricated) if a real work closely matching the cited title and author(s) could be found. Minor discrepancies were treated as citation errors rather than as evidence of fabrication. For works judged to be fabricated, their absence was additionally confirmed by checking the relevant journal volume/issue and searching the publisher's website.
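The "closely matched" criterion can be illustrated with a short sketch. The article describes manual searches and judgment, not an algorithm, so everything below is an assumption made for illustration: the function names, the use of Python's `difflib.SequenceMatcher`, the 0.85 threshold, and the rule that at least one author surname must overlap.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and replace punctuation with spaces so that small
    formatting differences do not affect the comparison."""
    kept = [ch.lower() if ch.isalnum() else " " for ch in text]
    return " ".join("".join(kept).split())


def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0 for two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def looks_genuine(cited_title, cited_authors, found_title, found_authors,
                  threshold=0.85):
    """Return True if a work found in a search closely matches a citation.

    Assumed rule (not the study's): the titles must be at least `threshold`
    similar, and at least one author surname must appear in both lists.
    """
    title_ok = similarity(cited_title, found_title) >= threshold
    cited = {normalize(name) for name in cited_authors}
    found = {normalize(name) for name in found_authors}
    return title_ok and bool(cited & found)


# Example using the paper summarized in this article as the "found" work.
found_title = ("Fabrication and errors in the bibliographic citations "
               "generated by ChatGPT")
print(looks_genuine(
    "Fabrication and Errors in the Bibliographic Citations Generated by ChatGPT.",
    ["Walters"],
    found_title,
    ["Walters", "Wilder"],
))
```

Under this sketch, a citation with a slightly wrong title or an incomplete author list can still count as pointing to a genuine work, and its discrepancies would then be recorded as citation errors. Only a human check of the actual sources can establish that a work does not exist.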

For non-fabricated works, the researchers identified substantive errors in the bibliographic information provided, such as incorrect authorship, title, journal, publisher, volume number, or pagination. Adherence to APA citation format was assessed for both genuine and fabricated works. This evaluation covered APA-specific elements as well as components common to most citation formats, such as publisher/organization names. Some deviations from APA format were not counted, namely variations in the place of publication, state abbreviations, and the inclusion of issue numbers, because these elements have changed across recent editions of the APA Publication Manual.

Experimental analysis

In this experiment, GPT-3.5 and GPT-4 were employed to generate short papers on various topics, and the 636 bibliographic citations in the resulting 84 papers were analyzed. Despite the substantial improvements seen in GPT-4 compared to GPT-3.5, challenges in citation generation remained. Notably, GPT-4 produced fewer fabricated citations and fewer substantive citation errors. However, a substantial portion of its citations were still fabricated. The study also highlighted persistent problems in citation formatting, including improper title capitalization. Additionally, hyperlinks in citations were infrequent and often inaccurate. The study's findings underscore the complexity of bibliographic citation generation in AI language models and raise questions about their ability to recognize and process bibliographic data accurately.
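To put the size of the improvement in perspective, the percentages reported at the top of this article (55% vs. 18% fabricated citations, 43% vs. 24% errors in genuine citations) can be expressed as relative reductions. The short calculation below is only arithmetic on those reported figures; the helper name `relative_reduction` is our own.

```python
def relative_reduction(old_pct, new_pct):
    """Percentage by which new_pct is lower than old_pct."""
    if old_pct <= 0:
        raise ValueError("old_pct must be positive")
    return (old_pct - new_pct) / old_pct * 100


# (GPT-3.5 %, GPT-4 %) as reported in this article.
reported = {
    "fabricated citations": (55, 18),
    "errors in genuine citations": (43, 24),
}

for label, (gpt35, gpt4) in reported.items():
    points = gpt35 - gpt4
    relative = relative_reduction(gpt35, gpt4)
    print(f"{label}: down {points} percentage points ({relative:.0f}% relative)")

# Output:
# fabricated citations: down 37 percentage points (67% relative)
# errors in genuine citations: down 19 percentage points (44% relative)
```

Even after a roughly two-thirds relative drop, about one in six GPT-4 citations in the study was fabricated, which is why the authors describe the remaining problem as substantial.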

Conclusion

To sum up, this paper examined the citation generation capabilities of GPT-3.5 and GPT-4 across various topics. While GPT-4 showed significant progress with fewer fabricated citations and substantive errors compared to its predecessor, many challenges in citation generation persisted. Issues such as formatting errors and unreliable hyperlinks underscore the intricacies of this process in AI language models. Continued efforts are needed to enhance the accuracy and reliability of AI-generated citations for broader practical applications.

Journal reference:

Walters, W. H., & Wilder, E. I. (2023). Fabrication and errors in the bibliographic citations generated by ChatGPT. Scientific Reports, 13(1), 14045. https://doi.org/10.1038/s41598-023-41032-5
