Query Performance Prediction using Relevance Judgments Generated by Large Language Models
arXiv (2024)
Abstract
Query performance prediction (QPP) aims to estimate the retrieval quality of
a search system for a query without human relevance judgments. Previous QPP
methods typically return a single scalar value and do not require the predicted
values to approximate a specific information retrieval (IR) evaluation measure,
leading to two drawbacks: (i) a single scalar cannot accurately represent
different IR evaluation measures, especially when those measures are not
highly correlated, and (ii) a single scalar limits the interpretability of QPP
methods, since it offers no explanation of how a prediction was reached.
To address these issues, we propose a QPP framework using automatically
generated relevance judgments (QPP-GenRE), which decomposes QPP into
independent subtasks of judging the relevance of each item in a ranked list to
a given query. This allows us to predict any IR evaluation measure using the
generated relevance judgments as pseudo-labels; it also allows us to
interpret the predicted IR evaluation measures and to identify, track, and
rectify errors in the generated relevance judgments, improving QPP quality. We judge
relevance by leveraging a leading open-source large language model (LLM),
LLaMA, to ensure scientific reproducibility. In doing so, we address two main
challenges: (i) the excessive computational cost of judging the entire corpus
to predict a recall-based metric, and (ii) the poor performance of LLaMA when
prompted in a zero-/few-shot manner. We devise an approximation strategy to predict a
recall-oriented IR measure and propose to fine-tune LLaMA using human-labeled
relevance judgments. Experiments on the TREC 2019-2022 deep learning tracks
show that QPP-GenRE achieves state-of-the-art QPP accuracy for both lexical and
neural rankers on both precision- and recall-oriented metrics.
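
Below is a minimal sketch of the framework's core idea, not the authors' implementation: judge the relevance of each item in a ranked list with an LLM, then compute an IR evaluation measure from the generated judgments as pseudo-labels. The names are illustrative assumptions; `judge_relevance` is a hypothetical stand-in for a call to a fine-tuned LLaMA judge, and the `pool_depth` cutoff only illustrates the kind of shallow-pool approximation that avoids judging the entire corpus for a recall-oriented measure (the paper's exact strategy may differ).

from typing import Callable, List

# (query, document text) -> 1 if judged relevant, else 0; in QPP-GenRE this
# would be a fine-tuned LLaMA call, stubbed out here as a plain callable.
Judge = Callable[[str, str], int]

def predict_precision_at_k(query: str, ranked_docs: List[str],
                           judge_relevance: Judge, k: int = 10) -> float:
    """Predict P@k directly from LLM-generated binary relevance judgments."""
    judgments = [judge_relevance(query, doc) for doc in ranked_docs[:k]]
    return sum(judgments) / k

def predict_recall_at_k(query: str, ranked_docs: List[str],
                        judge_relevance: Judge, k: int,
                        pool_depth: int = 100) -> float:
    """Approximate recall@k by judging only the top `pool_depth` items,
    treating the relevant items found there as the full relevant set."""
    judgments = [judge_relevance(query, doc) for doc in ranked_docs[:pool_depth]]
    total_relevant = sum(judgments)  # pseudo count of relevant documents
    return sum(judgments[:k]) / total_relevant if total_relevant else 0.0

Because the judgments are per-item pseudo-labels rather than a single scalar, any measure computed this way can be traced back to the individual judgments that produced it, which is what makes the predictions interpretable and correctable.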