Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs
arXiv (2024)
Abstract
In ad-hoc retrieval, evaluation relies heavily on user actions, including
implicit feedback. In a conversational setting such signals are usually
unavailable due to the nature of the interactions, and, instead, the evaluation
often relies on crowdsourced evaluation labels. The role of user feedback in
annotators' assessments of turns in a conversational setting has been little
studied. We focus on how the evaluation of task-oriented dialogue systems
(TDSs) is affected by considering user feedback, explicit or implicit, as
provided through the follow-up utterance of the turn being evaluated. We explore
and compare two methodologies for assessing TDSs: one that includes the user's
follow-up utterance and one that does not. We use both crowdworkers and large
language models (LLMs) as annotators to assess system responses across four
aspects: relevance, usefulness, interestingness, and explanation quality. Our
findings indicate a distinct difference in the ratings assigned by
both annotator groups under the two setups, indicating that user feedback does
influence system evaluation. Crowdworkers are more susceptible to user feedback on
usefulness and interestingness, whereas LLMs are more affected on interestingness and
relevance. User feedback leads to a more personalized assessment of usefulness
by workers, aligning closely with the user's explicit feedback. Additionally,
in cases of ambiguous or complex user requests, user feedback improves
agreement among crowdworkers. These findings emphasize the significance of user
feedback in refining system evaluations and suggest the potential for automated
feedback integration in future research. We publicly release the annotated data
to foster research in this area.