Generative AI offers unprecedented efficiency in metadata creation, yet the human touch still leads in quality—will this innovation reshape digital preservation?
Research: Web Archives Metadata Generation with GPT-4o: Challenges and Insights. Image Credit: Shutterstock AI
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
A research paper recently posted on the arXiv preprint* server explored the use of generative artificial intelligence (AI), specifically OpenAI's GPT-4o model, to automate metadata generation for web archive (WARC) files. The researchers, based in Singapore, focused on the Web Archive Singapore (WAS) initiative, aiming to meet the growing demand for efficient, cost-effective metadata creation across an expanding digital landscape. They highlighted the potential for substantial cost savings and efficiency gains, as well as critical limitations of AI-generated metadata, including accuracy issues relative to human-curated content.
Large Language Models (LLMs) for Digital Preservation
The rapid evolution of the digital landscape necessitates effective methods for preserving online heritage. The Resource Discovery department of the National Library Board Singapore (NLB), responsible for cataloging collections including WAS, faces the challenge of managing a rapidly expanding web archive. Traditional manual metadata creation is labor-intensive and resource-heavy, making it unsustainable at this volume of data. This challenge has prompted the exploration of advanced technologies, such as LLMs like GPT-4o, to automate the costly and time-consuming process. With their advanced natural language processing capabilities, these models offer new potential for tasks such as summarization and content generation, but they face obstacles in reliability and precision.
Automated Metadata Generation Using GPT-4o
In this paper, the authors developed and evaluated an automated system for generating titles and abstracts as metadata for WAS. They aimed to improve metadata creation efficiency while ensuring the accuracy of the automated outputs. Using 112 WARC files from WAS, the researchers adopted a systematic approach covering data collection, preparation, and token reduction through three key heuristics. To minimize GPT-4o processing costs, they employed data reduction techniques: prioritizing content from "About" pages, selecting content from the shortest URLs, and applying regex filtering to limit token count. HTML content was extracted from the WARC files using Python libraries such as warcio and BeautifulSoup, capturing the relevant text while excluding unnecessary elements.
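The extraction step described above can be sketched with a small HTML-to-text pass. The paper names warcio and BeautifulSoup for this; the stand-in below uses only Python's standard-library `html.parser` so it stays self-contained, and it is a minimal sketch rather than the authors' actual pipeline. In practice, each HTML payload would come from a WARC `response` record read via warcio's `ArchiveIterator`.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping <script> and <style> blocks.

    A stdlib stand-in for the BeautifulSoup step described in the paper;
    the goal is the same: keep human-readable text, drop page machinery.
    """
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # >0 while inside a skipped element
        self._chunks = []      # accumulated text fragments

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html: str) -> str:
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    return parser.text()
```

With BeautifulSoup the same effect is typically achieved by decomposing `script`/`style` tags and calling `get_text()`; this sketch simply makes the filtering explicit.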
The three heuristics, prioritizing "About" pages, falling back to the page with the shortest URL, and applying regex filtering, were inspired by professional cataloging practices. In addition, prompt engineering techniques were applied: prompts were crafted both with and without specialized rules for different types of websites (e.g., corporate sites, personal blogs, and property listings), allowing the researchers to compare results across varied web content.
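Two of these heuristics are simple enough to sketch directly. The function names, URLs, and regex patterns below are illustrative assumptions, not the paper's actual rules; they only show the shape of "prefer About pages, else shortest URL" and "filter text down to a token budget".

```python
import re


def pick_seed_url(urls):
    """Heuristic page selection: prefer a URL that looks like an
    'About' page; otherwise fall back to the shortest URL, which
    tends to be the site's landing page."""
    about = [u for u in urls if "about" in u.lower()]
    pool = about or urls
    return min(pool, key=len)


def filter_text(text, max_chars=2000):
    """Regex-filtering sketch: collapse whitespace and strip
    non-textual noise before truncating to a character budget.
    The paper's exact patterns and token limits are not reproduced."""
    text = re.sub(r"\s+", " ", text)                  # collapse whitespace
    text = re.sub(r"[^\w\s.,;:!?'\"()-]", "", text)   # drop symbol noise
    return text[:max_chars].strip()
```

Cutting the input this way is what drives the token (and therefore API cost) reduction reported in the study: only a small, high-signal slice of each site ever reaches GPT-4o.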
Evaluation and Analysis of Automated Metadata
Both automated and manual evaluation methods were employed to assess the quality of the generated metadata. The automated evaluation used metrics such as Levenshtein Distance and BERTScore to measure similarity and quality, while the manual evaluation involved eight trained catalogers who compared AI-generated metadata with human-created metadata, with McNemar's test used to compare the accuracy rates of the two.
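Of the two automated metrics, Levenshtein distance is simple enough to show in full: it counts the minimum number of single-character edits needed to turn one string into another, so a low distance between a generated title and the cataloger's title suggests close agreement. The implementation below is a standard dynamic-programming version, not the paper's code; BERTScore, by contrast, requires a pretrained model (typically via the `bert-score` package) and is not reproduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to transform `a` into `b`."""
    # prev[j] holds the distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]
```

A distance of 0 means the generated and human titles match exactly; normalizing by the longer string's length gives a comparable score across titles of different lengths.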
The automated approach achieved a remarkable 99.9% reduction in token count and associated costs compared to processing entire WARC files. However, manual evaluation using Cochran's Q and McNemar's tests revealed statistically significant differences (p = 0.02) between LLM-generated and human-generated metadata, indicating that human-created metadata exhibited higher accuracy and relevance. The analysis also highlighted several challenges with LLM-generated content, including frequent accuracy issues, hallucinations, and translation errors. Approximately 19.6% of AI-generated titles and abstracts contained inaccuracies, compared with only 6.3% of human-generated metadata. This discrepancy underscores the need for ongoing refinement of LLMs to mitigate errors and enhance content reliability. Despite these challenges, the authors emphasized the potential of LLMs in archiving, stating that the technology can help streamline workflows and free human catalogers to focus on complex tasks requiring expertise.
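McNemar's test, used in the manual evaluation, works on paired judgments: for each archived site, was the human metadata accurate, and was the LLM metadata accurate? Only the discordant pairs (where exactly one of the two was judged accurate) carry information. The exact binomial form of the test can be computed with the standard library alone, as sketched below; the counts in the test are hypothetical, not the study's data, and in practice one would use `statsmodels.stats.contingency_tables.mcnemar`.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact (binomial) McNemar test.

    b: pairs where only the human metadata was judged accurate.
    c: pairs where only the LLM metadata was judged accurate.
    Under the null hypothesis the two methods err equally often,
    so each discordant pair is a fair coin flip.
    """
    n = b + c
    k = min(b, c)
    # P(at most k successes in n fair flips), doubled for two sides
    p = 2 * sum(comb(n, i) for i in range(k + 1)) * 0.5 ** n
    return min(p, 1.0)
```

A small p-value (e.g., below 0.05, matching the study's reported p = 0.02) indicates the imbalance between the two error directions is unlikely to be chance, i.e., one source of metadata really is more accurate than the other.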
Key Applications and Implications
The findings of this research have significant implications that extend beyond web archiving. The presented techniques and insights could benefit various fields requiring large-scale data management and metadata generation, such as digital libraries, museums, and educational institutions. By integrating generative AI into their workflows, these organizations could streamline operations, reduce costs, and enhance access to digital content. This automation, when complemented by human oversight, offers a promising pathway to achieving both efficiency and reliability in large-scale metadata management.
Conclusion and Future Directions
In summary, this study represents a significant advancement in applying AI-driven solutions to WARCs, showcasing both the potential and limitations of GPT-4o for automated metadata generation. While the approach offers considerable efficiency and cost savings, it highlights the essential role of human oversight in maintaining metadata quality and accuracy. Future directions include refining prompt engineering methods, improving data reduction heuristics, and considering smaller, specialized models to address privacy concerns. A collaborative approach that combines the strengths of LLMs and human expertise will also be crucial for achieving accurate digital preservation and reliable metadata generation.
Journal reference (preliminary scientific report):
Huang, A. Y., Nair, A., Goh, Z. R., & Liu, T. (2024). Web Archives Metadata Generation with GPT-4o: Challenges and Insights. arXiv. https://arxiv.org/abs/2411.05409