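CIKM 2025 recently announced its paper acceptance results. A study on generative retrieval by Tian Yicen (田一岑), a master's student in the lab, was accepted as a full paper. CIKM is a top conference in the field of data mining and is recommended by the CCF as a Class B international academic conference. This year the conference received 1,627 full-paper submissions, of which 443 were accepted, for an acceptance rate of about 27%.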
Title: Reinforcement Learning-Driven Generative Retrieval with Semantic-aligned Multi-Layer Identifiers
Abstract: Generative retrieval enhances retrieval effectiveness by generating natural language represented document identifiers. However, current methods often struggle with two major challenges: limited identifier quality and insufficient query-document interaction, leading to limited retrieval performance. To tackle these challenges, we propose a novel generative retrieval framework integrated with semantic-aligned multi-layer identifiers and reinforcement learning. To improve identifier quality, we design a prompt-driven multi-task learning strategy to generate three types of hierarchical identifiers: summary, keyword, and pseudo-query, to capture multi-granularity document semantics. Furthermore, we adopt supervised fine-tuning to integrate these identifiers. To improve query-document interaction, we devise a multi-view ranking fusion mechanism that combines retrieval results across multi-layer identifiers. We further employ a GRPO-based reinforcement learning based on dense similarity rewards and a difficulty-aware negative sampling strategy to optimize the generated identifiers. Experiments on multiple benchmark datasets show that our framework significantly outperforms existing generative retrieval methods, offering a promising solution for building more effective and semantically aligned retrieval systems.
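To make the idea of combining retrieval results across the three identifier types more concrete, here is a minimal sketch. The abstract does not specify how the paper's multi-view ranking fusion mechanism works, so this example substitutes a well-known stand-in, reciprocal rank fusion (RRF). The function name, the constant `k=60`, the view names, and the document ids are all illustrative assumptions, not details from the paper.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    NOTE: RRF is a stand-in chosen for illustration; it is not
    claimed to be the fusion mechanism used in the paper.
    `rankings` maps a view name to a best-first list of doc ids.
    """
    scores = defaultdict(float)
    for ranked_docs in rankings.values():
        for rank, doc_id in enumerate(ranked_docs, start=1):
            # Documents ranked highly in any view receive a larger share.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical retrieval results, one list per identifier type.
views = {
    "summary": ["d1", "d2", "d3"],
    "keyword": ["d2", "d1", "d4"],
    "pseudo_query": ["d2", "d3", "d1"],
}
fused = reciprocal_rank_fusion(views)
print(fused)
```

In this toy input, a document that appears near the top of several views ends up ahead of one that appears in only a single view, which is the general intuition behind fusing rankings from multiple identifier layers.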