Jina AI Unveils Reader-LM to Transform HTML-to-Markdown Conversion

Reader-LM, the latest breakthrough from Jina AI, leverages small language models to deliver top-tier performance in converting noisy HTML into clean, structured markdown, challenging the dominance of larger models in the field.

Illustration of reader-lm, replacing the pipeline of readability+turndown+regex heuristics using a small language model. Image Credit: Jina AI

In a recent article published on the Jina AI website, engineers introduced an application programming interface (API) that converted uniform resource locators (URLs) to large language model (LLM)-friendly markdown using headless Chrome, Mozilla's Readability package, and the Turndown library.

After addressing user feedback on content quality and conversion issues, they explored using small language models (SLMs) for this task. They released two versions of Reader-LM, reader-lm-0.5b and reader-lm-1.5b, multilingual SLMs that support a context length of up to 256K tokens and achieve state-of-the-art results while being significantly smaller than traditional models.

Background

Initially, the team used Jina Reader, an API that converted URLs into LLM-friendly markdown, employing a combination of headless Chrome, Mozilla's Readability package, and the Turndown library. Following early challenges with content quality and conversion fidelity, the researchers explored SLMs to streamline the process. They released reader-lm-0.5b and reader-lm-1.5b, compact but powerful multilingual models specifically optimized for hypertext markup language (HTML)-to-markdown conversion. These models support up to 256K tokens of context and demonstrated superior performance compared to larger models.
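
For readers unfamiliar with that pipeline, a rough Python approximation is sketched below. It uses requests, readability-lxml, and markdownify as stand-ins for headless Chrome, Mozilla's Readability, and Turndown, so it illustrates the general approach rather than Jina Reader's actual implementation.

```python
# Hedged sketch: an approximation of the readability + turndown style pipeline
# that Jina Reader originally relied on. readability-lxml and markdownify stand
# in for Mozilla's Readability and the Turndown library; this is NOT Jina's code.
import requests
from readability import Document            # pip install readability-lxml
from markdownify import markdownify as md   # pip install markdownify

def url_to_markdown(url: str) -> str:
    """Fetch a page, extract the main content, and convert it to markdown."""
    html = requests.get(url, timeout=30).text
    main_content = Document(html).summary()  # strip navigation, ads, boilerplate
    return md(main_content, heading_style="ATX")

if __name__ == "__main__":
    print(url_to_markdown("https://example.com")[:500])
```

In practice, this kind of heuristic pipeline is exactly what Reader-LM is meant to replace with a single end-to-end model.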

Introduction to Reader-LM

To get started with Reader-LM on Google Colab, the team provided a Colab notebook that demonstrates converting web content into markdown with the model. The notebook is optimized for Colab's free Tesla T4 graphics processing unit (GPU), which, while cost-effective, may face performance limitations on larger inputs because it lacks support for advanced optimizations such as bfloat16 and flash attention. For more demanding use cases, a higher-end GPU is recommended.
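
As a rough illustration, the model can be loaded and prompted with the Hugging Face transformers library as sketched below. The repository name jinaai/reader-lm-1.5b and the chat-style prompt follow the public model cards but should be treated as assumptions rather than the exact notebook code.

```python
# Hedged sketch of loading Reader-LM with Hugging Face transformers.
# The repo id and prompt format are assumptions based on the public model
# cards, not the exact code from the Colab notebook.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "jinaai/reader-lm-1.5b"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.bfloat16,  # fall back to float16/float32 on older GPUs such as the T4
    device_map="auto",
    trust_remote_code=True,
)

html = "<html><body><h1>Hello</h1><p>Some <b>noisy</b> HTML.</p></body></html>"
messages = [{"role": "user", "content": html}]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```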

Benchmark evaluations of Reader-LM demonstrated its effectiveness compared to other large language models. In particular, Reader-LM-1.5b outperformed many models on metrics such as ROUGE-L (recall-oriented understudy for gisting evaluation - longest common subsequence), Token Error Rate (TER), and Word Error Rate (WER), where a higher ROUGE-L and lower TER and WER indicate better performance in generating accurate and consistent markdown from HTML.

Specifically, Reader-LM-1.5b achieved a remarkable ROUGE-L score of 0.72, indicating strong performance in content overlap, and had a lower TER of 0.19 compared to competitors, highlighting its efficiency in reducing token errors and hallucinations. These metrics suggest that Reader-LM-1.5b is a robust option for converting HTML to markdown, especially compared to other large models and baseline systems.
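
For readers who want to reproduce such measurements, the sketch below computes ROUGE-L and WER between a generated markdown string and a reference using the rouge-score and jiwer packages. These libraries are assumed stand-ins for convenience and are not necessarily the tooling Jina AI used.

```python
# Hedged sketch: scoring an HTML-to-markdown output against a reference.
# rouge-score and jiwer are assumed stand-ins, not Jina AI's evaluation code.
from rouge_score import rouge_scorer  # pip install rouge-score
import jiwer                          # pip install jiwer

reference = "# Title\n\nClean paragraph extracted from the page."
generated = "# Title\n\nClean paragraph extracted from page."

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, generated)["rougeL"].fmeasure  # higher is better
wer = jiwer.wer(reference, generated)                            # lower is better

print(f"ROUGE-L: {rouge_l:.2f}, WER: {wer:.2f}")
```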

Regarding qualitative performance, Reader-LM-1.5b excelled in several key areas, including header extraction, content conversion, structure preservation, and markdown syntax usage. While it did not always surpass the Jina Reader API, it remained competitive, particularly in maintaining document structure and correct markdown syntax. Even the smaller Reader-LM-0.5b performed notably well, especially in preserving content structure, proving it a viable alternative to larger models for many applications.

The training process for Reader-LM involved two distinct stages, focusing on handling both short and long HTML inputs. The model's development included the creation of high-quality training data pairs and tackling challenges such as degeneration and dull loops.

The team employed advanced techniques such as contrastive search and chunk-wise model forwarding to enhance training efficiency and mitigate issues like repetitive generation. Although an encoder-only model was initially considered, it was ultimately deemed less effective due to difficulties in creating accurate token-level training data, so the developers retained decoder-only architectures, which better handle longer and more complex inputs.
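
As an illustration of one such mitigation, repetitive generation can be discouraged with contrastive search, which the transformers library exposes through the penalty_alpha and top_k generation parameters. The sketch below is a generic example of the technique rather than Jina AI's actual training or inference setup, and the checkpoint name is an assumption.

```python
# Hedged sketch: contrastive search decoding in Hugging Face transformers,
# one generic way to suppress repetitive or degenerate generation. This
# illustrates the technique; it is not Jina AI's code.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "jinaai/reader-lm-0.5b"  # assumed repo id; any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)

inputs = tokenizer("<ul><li>item</li><li>item</li></ul>", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    penalty_alpha=0.6,  # degeneration penalty that enables contrastive search
    top_k=4,            # candidate pool size for contrastive search
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```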

Results Overview

Benchmark evaluations of Reader-LM have demonstrated its superior performance compared to several large language models, including GPT-4o and Gemini-1.5. Specifically, Reader-LM-1.5b achieved notable metrics: a ROUGE-L score of 0.72, indicating strong content overlap, and a TER of 0.19, reflecting reduced token errors and hallucinations. These results underscore Reader-LM's effectiveness in generating accurate markdown from HTML and highlight its competitive edge in the field.

In qualitative assessments, Reader-LM-1.5b excelled in header extraction, markdown syntax usage, and structure preservation. While it did not always outperform the Jina Reader API, it remained highly competitive, especially in maintaining document structure and formatting. The model's performance in these areas shows it to be a reliable alternative to larger models and baseline systems.

Even the smaller Reader-LM-0.5b demonstrated solid performance, particularly in preserving content structure. This balance of efficiency and performance makes it a viable option for a wide range of applications. The results reflect Reader-LM's strong capabilities in converting HTML to markdown, with both the 1.5b and 0.5b models proving effective in their respective contexts.

Conclusion

To sum up, Reader-LM is a pioneering SLM that was developed to convert HTML to markdown efficiently. It demonstrated strong performance in handling context-based reasoning, revealing that the task is more complex than simple "selective-copy." The use of a pre-trained model significantly improved training efficiency. Future enhancements could include extending context length, accelerating decoding, and introducing support for specific extraction instructions.

Written by

Silpaja Chandrasekar

Dr. Silpaja Chandrasekar has a Ph.D. in Computer Science from Anna University, Chennai. Her research expertise lies in analyzing traffic parameters under challenging environmental conditions. Additionally, she has gained valuable exposure to diverse research areas, such as detection, tracking, classification, medical image analysis, cancer cell detection, chemistry, and Hamiltonian walks.
