In an article recently submitted to the arXiv* preprint server, researchers highlighted the importance of text retrieval in language processing applications. They pointed out that current state-of-the-art text retrieval models, which rely on pre-trained large language models, often require complex training heuristics because their contrastive losses are only a heuristic proxy for downstream ranking quality.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
They introduced "Neural PG-RANK," a novel training algorithm that leverages policy gradients (PG) to train retrieval models end-to-end, addressing this challenge. This approach reduces reliance on heuristics and aligns the training objective with downstream decision-making quality. Experimental results showed significant in-domain performance improvements and robust out-of-domain generalization, particularly in question-answering tasks.
Background
In modern language processing pipelines, retrieving relevant factual information is essential for tasks such as web search, question answering, and open-ended generation. The quality of the retrieval system significantly impacts downstream decision-making. Large language models (LLMs) have been incorporated into retrieval policies to improve performance, but training these models poses challenges due to the combinatorial complexity of rankings. Existing training methods often rely on pairwise preferences and complex heuristics, making it difficult to train competitive retrieval systems.
Past research in information retrieval, specifically text retrieval, has transitioned from traditional count-based methods to contemporary neural models. These neural models, known as dense models, encode queries and documents as dense vector representations and score documents by the similarity of those vectors, improving retrieval quality. LLMs have been pivotal in this evolution. Training such models typically requires sampling hard negative documents and employing techniques like knowledge distillation, which can introduce complexity.
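To make the dense retrieval setup concrete, the following minimal sketch scores documents against a query using fixed-dimensional vectors and an inner-product similarity. The `encode` function here is a hypothetical placeholder standing in for a real LLM-based encoder (e.g., an SBERT-style bi-encoder), not the authors' implementation.

```python
import torch

def encode(texts: list[str], dim: int = 768) -> torch.Tensor:
    """Placeholder encoder: returns one dense vector per input text.
    In practice this would be a pre-trained LLM-based bi-encoder."""
    return torch.randn(len(texts), dim)

query_vec = encode(["what is policy gradient?"])                  # shape (1, dim)
doc_vecs = encode(["doc about REINFORCE",
                   "doc about BM25",
                   "doc about cooking"])                          # shape (3, dim)

scores = (doc_vecs @ query_vec.T).squeeze(-1)                     # inner-product similarity
ranking = torch.argsort(scores, descending=True)                  # documents ranked by score
```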
Technique Description
The Neural PG-RANK method revolves around a Plackett-Luce ranking policy, which can accommodate any score-based retrieval architecture. The technique defines representation functions that encode queries and documents into fixed-dimensional vector representations and a comparison function that computes relevance scores from these representations. Under the Plackett-Luce model, the ranking policy is formulated as a product of softmax distributions, providing a differentiable alternative to deterministically sorting documents by their scores. This formulation makes it possible to optimize the retrieval policy directly with policy gradient methods.
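As a rough illustration of this formulation (a toy sketch, not the authors' implementation), the snippet below samples a ranking from a Plackett-Luce policy defined by a vector of query-document scores: each position is drawn from a softmax over the scores of the documents not yet placed, so the probability of the full ranking is a product of softmax distributions.

```python
import torch

def plackett_luce_sample(scores: torch.Tensor):
    """Sample a ranking under the Plackett-Luce model defined by per-document scores.

    Position by position, a document is drawn from a softmax over the scores of
    the documents not yet placed; the log-probability of the full ranking is the
    sum of the per-position log-softmax terms."""
    remaining = list(range(scores.numel()))
    ranking, log_prob = [], scores.new_zeros(())
    while remaining:
        log_probs = torch.log_softmax(scores[remaining], dim=0)
        idx = torch.multinomial(log_probs.exp(), 1).item()
        ranking.append(remaining.pop(idx))
        log_prob = log_prob + log_probs[idx]
    return ranking, log_prob

# Toy usage: scores from an inner-product comparison of query/document vectors.
query_vec = torch.randn(768)          # stand-in for an encoded query
doc_vecs = torch.randn(5, 768)        # stand-in for encoded documents
scores = doc_vecs @ query_vec         # comparison function: inner product
ranking, log_prob = plackett_luce_sample(scores)
```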
Leveraging insights from the learning-to-rank (LTR) literature, the method frames training as an optimization problem over the ranking policy. It uses the REINFORCE algorithm and the log-derivative trick to compute the policy gradient. The gradient is expressed as an expectation over rankings, enabling estimation via Monte Carlo sampling. To stabilize updates and reduce the variance of this estimate, a baseline is incorporated into the objective; this variance reduction is crucial to the method's effectiveness.
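The sketch below illustrates REINFORCE with a mean-utility baseline on toy data (random scores and made-up relevance labels are assumptions for illustration; the paper's exact baseline and training loop may differ). The gradient of the expected utility is estimated by averaging (utility minus baseline) times the gradient of the log-probability of each sampled ranking.

```python
import math
import torch

def sample_ranking_with_log_prob(scores: torch.Tensor):
    """Sample a Plackett-Luce ranking, keeping its log-probability differentiable
    with respect to the scores (needed for the log-derivative trick)."""
    remaining = list(range(scores.numel()))
    ranking, log_prob = [], scores.new_zeros(())
    while remaining:
        log_probs = torch.log_softmax(scores[remaining], dim=0)
        idx = torch.multinomial(log_probs.exp(), 1).item()
        ranking.append(remaining.pop(idx))
        log_prob = log_prob + log_probs[idx]
    return ranking, log_prob

def ndcg_at_k(ranking, relevance, k=10):
    """Utility of a ranking: normalized discounted cumulative gain at cutoff k."""
    dcg = sum(relevance[d] / math.log2(i + 2) for i, d in enumerate(ranking[:k]))
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(sorted(relevance, reverse=True)[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# REINFORCE with a baseline: average (utility - baseline) * grad log pi(ranking)
# over Monte Carlo samples; subtracting the baseline reduces estimator variance.
scores = torch.randn(8, requires_grad=True)                  # stand-in for model scores for one query
relevance = [3.0, 0.0, 1.0, 0.0, 2.0, 0.0, 0.0, 1.0]         # toy relevance labels per document
optimizer = torch.optim.SGD([scores], lr=0.1)

num_samples = 16
samples = [sample_ranking_with_log_prob(scores) for _ in range(num_samples)]
utilities = [ndcg_at_k(ranking, relevance) for ranking, _ in samples]
baseline = sum(utilities) / num_samples                      # simple mean-utility baseline

loss = -sum((u - baseline) * log_p for (_, log_p), u in zip(samples, utilities)) / num_samples
optimizer.zero_grad()
loss.backward()
optimizer.step()
```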
While Neural PG-RANK can be applied to any utility function, the experiments focus on Normalized Discounted Cumulative Gain at rank 10 (nDCG@10) for comparability. This choice is justified by its theoretical consistency and practical suitability for a wide range of relevance annotations. Moreover, the utility at a given rank interacts only with the probability of the partial ranking up to that point, simplifying policy gradient estimation. Neural PG-RANK thus offers a principled and practical approach to learning retrieval policies, aligning the training objective with evaluation metrics and improving overall pipeline performance.
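One way to see the practical benefit of this property: because nDCG@k depends only on which documents land in the top k, only a partial ranking needs to be sampled and scored. The sketch below, under the same toy assumptions as the earlier snippets (hypothetical scores, no real encoder), samples just the first k positions of a Plackett-Luce ranking and accumulates the log-probability of that prefix.

```python
import math
import torch

def sample_top_k_prefix(scores: torch.Tensor, k: int = 10):
    """Sample only the first k positions of a Plackett-Luce ranking.

    Since nDCG@k depends only on the top-k documents, the policy gradient can be
    estimated from the log-probability of this partial ranking, without sampling
    a full permutation of the candidate set."""
    remaining = list(range(scores.numel()))
    prefix, log_prob = [], scores.new_zeros(())
    for _ in range(min(k, len(remaining))):
        log_probs = torch.log_softmax(scores[remaining], dim=0)
        idx = torch.multinomial(log_probs.exp(), 1).item()
        prefix.append(remaining.pop(idx))
        log_prob = log_prob + log_probs[idx]
    return prefix, log_prob

def dcg_at_k(prefix, relevance):
    """Discounted cumulative gain of the sampled prefix; divide by the ideal DCG
    to obtain nDCG@k."""
    return sum(relevance[d] / math.log2(i + 2) for i, d in enumerate(prefix))
```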
Experimental Results
An extensive set of experiments evaluated Neural PG-RANK's effectiveness in both first-stage retrieval and second-stage reranking scenarios across various text retrieval benchmarks, focusing on the widely used Microsoft Machine Reading Comprehension (MS MARCO) dataset. The essential findings of these experiments are summarized here.
In-Domain Performance (Second-Stage Reranking): When evaluated as a second-stage re-ranker over a candidate set of 1,000 documents per query, Neural PG-RANK achieved remarkable in-domain improvements. The model exhibited notable gains over warm-start policies based on Sentence-BERT (SBERT) and Balanced Topic-Aware Sampling (TAS-B) on the MS MARCO development set. Specifically, Neural PG-RANK achieved significant increases in nDCG@10, with improvements of +0.095 and +0.089 over SBERT and TAS-B, respectively. The model approached near-perfect scores across various nDCG@k measures, including nDCG@1, nDCG@3, nDCG@5, and nDCG@10, demonstrating its ability to substantially enhance second-stage reranking performance.
Out-of-Domain Generalization (Second-Stage Reranking): Regarding out-of-domain generalization, Neural PG-RANK performed comparably to the baseline models on various evaluation datasets from the BEIR benchmark. Notably, it outperformed the baselines on challenging datasets such as Natural Questions (NQ) and HotpotQA, which are prominent in question-answering tasks. The model's performance gains were particularly pronounced for smaller values of k in nDCG@k measures, emphasizing its potential in tasks requiring accurate top-k document retrieval.
First-Stage Retrieval: While Neural PG-RANK excelled in second-stage reranking, first-stage retrieval, which requires searching the entire document collection for each query, proved more challenging. Neural PG-RANK policies trained for second-stage reranking did not match the performance of baseline systems when used as first-stage retrievers. This discrepancy suggests that the training methodology may need to be adapted to the demands of first-stage retrieval. Future investigations will explore alternative training approaches, such as cutting-plane methods, to improve effectiveness in this setting.
These experimental results collectively highlight the promise of Neural PG-RANK for improving text retrieval systems, particularly in second-stage reranking tasks. The model's solid in-domain performance, impressive out-of-domain generalization, and strong nDCG scores underscore its value for enhancing the efficiency and effectiveness of information retrieval systems. Further refinements and alternative training strategies may unlock its full potential in addressing the challenges of first-stage retrieval.
Conclusion
To sum up, the introduction of Neural PG-RANK represents a significant advance in training LLM-based retrieval models. By reducing reliance on intricate heuristics and aligning the training objective with downstream utility, the method demonstrated improved in-domain performance and substantial out-of-domain generalization. This approach paves the way for developing highly effective retrieval-based LLM pipelines tailored to real-world applications.