In an article recently posted to the Meta Research website, researchers presented a dense retriever trained with diverse augmentation (DRAGON) to address the tradeoff between supervised and zero-shot retrieval effectiveness.
By using diverse queries and sources of supervision, DRAGON achieved state-of-the-art effectiveness with a bidirectional encoder representations from transformers (BERT)-base-sized model. It competed with more complex models such as ColBERTv2 and the sparse lexical and expansion model (SPLADE)++ in both supervised and zero-shot evaluations, demonstrating that high accuracy could be achieved without increasing model size.
Background
Bi-encoder-based neural retrievers enable efficient end-to-end retrieval from large corpora by pre-computing document embeddings. Despite advances, they still underperform traditional methods like best match 25 (BM25) in real-world scenarios with scarce training data. Previous approaches to enhance dense retrieval (DR) include pre-training, query augmentation, and distillation, but they often face trade-offs between supervised and zero-shot effectiveness or add complexity.
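To make the bi-encoder setup concrete, the following minimal sketch pre-computes document vectors offline and answers queries with a dot-product search. The bag-of-words "encoder" and the helper names (`build_vocab`, `encode`, `retrieve`) are illustrative stand-ins, not the paper's implementation; a real bi-encoder would use a trained transformer such as BERT-base in place of `encode`.

```python
import numpy as np

def build_vocab(texts):
    """Assign an index to every word seen in the corpus (toy vocabulary)."""
    vocab = {}
    for t in texts:
        for w in t.lower().split():
            vocab.setdefault(w, len(vocab))
    return vocab

def encode(text, vocab):
    """Toy bag-of-words 'embedding'; a real bi-encoder uses a transformer here."""
    vec = np.zeros(len(vocab))
    for w in text.lower().split():
        if w in vocab:
            vec[vocab[w]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

docs = [
    "dense retrievers precompute document embeddings for fast search",
    "bm25 is a strong lexical baseline for retrieval",
    "knowledge distillation transfers signals from a cross encoder",
]
vocab = build_vocab(docs)
# Offline indexing: document embeddings are computed once, before any query.
doc_matrix = np.stack([encode(d, vocab) for d in docs])

def retrieve(query, k=2):
    """Online step: score all documents with a single dot product."""
    scores = doc_matrix @ encode(query, vocab)
    return [docs[i] for i in np.argsort(-scores)[:k]]
```

Because scoring reduces to one matrix-vector product over a pre-built index, retrieval stays efficient even as the corpus grows, which is the property the paragraph above refers to.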
For instance, SPLADE++ and ColBERTv2 capture fine-grained information but increase retrieval latency. Existing methods improve one setting at the expense of the other or rely on larger models, limiting their practicality. This paper proposed DRAGON, a dense retriever trained with diverse data augmentation (DA), to address these limitations. By creating diverse relevance labels from multiple retrievers and using cheap, large-scale augmented queries, DRAGON achieved state-of-the-art effectiveness in both supervised and zero-shot evaluations without increasing model size. This method offered a scalable solution to improve DR training, breaking the effectiveness tradeoff and maintaining simplicity and efficiency.
Pilot Studies on Advanced DA for DR Training
The authors explored DA strategies to enhance DR training, focusing on their proposed DRAGON model. They began by discussing two common methods of query augmentation: sentence cropping from the Microsoft Machine Reading Comprehension (MS MARCO) corpus and synthetic query generation using docT5query. They argued that while cross-encoders provided strong supervision, a single teacher could not capture the diverse matching signals between queries and documents.
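Sentence cropping, the cheaper of the two augmentation methods, can be sketched as sampling random contiguous word spans from a passage to serve as pseudo-queries. The function name `crop_spans` and the span-length bounds are illustrative assumptions, not the paper's exact procedure:

```python
import random

def crop_spans(passage, n_queries=3, min_len=6, max_len=12, seed=0):
    """Sample random contiguous word spans from a passage as pseudo-queries.
    Unlike neural query generation (e.g. docT5query), this needs no model."""
    words = passage.split()
    rng = random.Random(seed)
    queries = []
    for _ in range(n_queries):
        length = min(rng.randint(min_len, max_len), len(words))
        start = rng.randint(0, len(words) - length)
        queries.append(" ".join(words[start:start + length]))
    return queries

passage = ("Dense retrieval trains a bi-encoder so that queries and relevant "
           "passages map to nearby vectors, enabling efficient search over "
           "large corpora with precomputed document embeddings.")
pseudo_queries = crop_spans(passage)
```

Because cropping is a string operation rather than a model inference, it scales to corpus-sized query augmentation at negligible cost.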
To address this, they proposed using multiple sources of supervision from sparse, dense, and multi-vector retrievers. Empirical studies then compared different strategies for utilizing these diverse supervisions. They found that training DRAGON with uniform supervision from multiple retrievers improved zero-shot retrieval effectiveness compared to relying on a single strong teacher. Progressive label augmentation further improved generalization by sequentially introducing more complex supervision during training.
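The multi-teacher idea above can be sketched as pooling the top-k results of several teacher retrievers into one labeled candidate set, from which training samples uniformly. The function `pool_teacher_labels` and the stand-in teachers are hypothetical illustrations of the scheme, not the authors' code:

```python
def pool_teacher_labels(query, teachers, k=3):
    """Union the top-k documents from several teacher retrievers into one
    candidate pool, recording which teachers retrieved each document.
    Training then samples uniformly across teachers instead of trusting
    a single strong teacher."""
    pool = {}
    for name, rank in teachers.items():
        for doc_id in rank(query)[:k]:
            pool.setdefault(doc_id, []).append(name)
    return pool

# Hypothetical stand-in teachers, each returning a ranked list of doc ids.
teachers = {
    "sparse": lambda q: ["d1", "d2", "d3", "d4"],
    "dense": lambda q: ["d2", "d5", "d1"],
    "multi-vector": lambda q: ["d5", "d1", "d6"],
}
pool = pool_teacher_labels("example query", teachers)
```

Documents retrieved by several teachers (here "d1") carry agreement signal, while documents unique to one teacher contribute the diverse matching signals the paragraph describes.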
They demonstrated that the trajectory of progressive supervision significantly impacted DRAGON's performance, with a specific sequence (uniCOIL → Contriever → ColBERTv2 → SPLADE++) proving most effective. They proposed a training recipe for DRAGON, involving 20 epochs per iteration with progressive supervision and a mixture of cropped sentences and synthetic queries for query augmentation. This approach avoided fine-tuning on MS MARCO training queries and achieved state-of-the-art effectiveness in both supervised and zero-shot evaluations, surpassing existing DR models without increasing complexity or model size.
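The progressive recipe above amounts to a fixed teacher schedule: each iteration runs 20 epochs and adds the next teacher's labels on top of those already in use. A minimal sketch of that schedule (the function name and list layout are illustrative):

```python
# Teacher order reported as most effective in the paper.
TEACHER_SEQUENCE = ["uniCOIL", "Contriever", "ColBERTv2", "SPLADE++"]

def progressive_schedule(epochs_per_iter=20):
    """Return, for every epoch, the tuple of teachers whose labels are
    active: iteration i trains on labels from the first i teachers."""
    schedule = []
    for i in range(1, len(TEACHER_SEQUENCE) + 1):
        active = tuple(TEACHER_SEQUENCE[:i])
        schedule.extend([active] * epochs_per_iter)
    return schedule

schedule = progressive_schedule()
```

Early epochs see only the simplest supervision (uniCOIL), and the strongest teachers (ColBERTv2, SPLADE++) are introduced only after the model has absorbed the easier signals, which is the "trajectory" the authors found to matter.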
Comparison with the State of the Art
Supervised evaluations used MS MARCO and TREC Deep Learning (DL) track queries, with nDCG@10 as the primary metric. Zero-shot evaluations spanned 18 BEIR datasets and LoTTE, assessed via Success@5. DRAGON variants were compared against other BERT-base-uncased retrievers trained with advanced techniques such as knowledge distillation, contrastive pre-training, and domain adaptation. In supervised evaluations, DRAGON variants consistently outperformed other dense retrievers on MS MARCO and TREC DL queries, owing to the augmented relevance labels.
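For reference, the two metrics used above are straightforward to compute from a ranked list of graded relevance gains; a minimal implementation of the standard definitions:

```python
import math

def ndcg_at_k(gains, k=10):
    """nDCG@k: discounted cumulative gain over the top k results,
    normalized by the gain of an ideally ordered ranking."""
    def dcg(g):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(g[:k]))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0

def success_at_k(gains, k=5):
    """Success@k: 1 if any relevant result appears in the top k, else 0."""
    return 1.0 if any(rel > 0 for rel in gains[:k]) else 0.0
```

nDCG@10 rewards placing highly relevant documents near the top, while Success@5 only asks whether at least one relevant document appears in the first five results, which suits LoTTE's binary-style evaluation.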
In zero-shot evaluations, models like Contriever and RetroMAE excelled, benefiting from pre-training beyond MS MARCO. However, DRAGON variants trained solely on augmented MS MARCO data transferred effectively to BEIR datasets and competed with state-of-the-art sparse retrievers like SPLADE++. DRAGON+ achieved the top retrieval effectiveness on BEIR. Ablation studies showed DRAGON's robustness across various initialization checkpoints, demonstrating its capability in both supervised and zero-shot settings.
Discussion
Augmenting relevance labels with a cross-encoder did not improve DRAGON-S's retrieval effectiveness and could even degrade it. Diverse supervision was more effective than relying on a single strong teacher. DRAGON benefited from masked auto-encoding pre-training rather than contrastive pre-training. Using soft labels from multiple teachers proved challenging and decreased effectiveness. Sentence cropping created more diverse and informative queries than neural generation, leading to better generalization. Cropped sentences provided varied topics and unique augmented passages, allowing DRAGON-S to capture diverse supervised signals and outperform DRAGON-Q in generalization.
Conclusion
In conclusion, DRAGON, a dense retriever trained with diverse data augmentation, achieved state-of-the-art effectiveness in both supervised and zero-shot retrieval tasks using a BERT-base-sized model. By employing diverse queries and multiple sources of supervision, it competed with more complex models like ColBERTv2 and SPLADE++. DRAGON's success demonstrated the importance of diverse augmentation strategies and progressive training. However, the training process was resource-intensive, requiring extensive computational power and large-scale queries. Future work aims to optimize training efficiency while maintaining DRAGON's high effectiveness, making it a robust foundation for domain adaptation and retrieval-augmented language models.