PhD Student Zhang Bo's Research on Multimodal Dialogue Generation Accepted at ACM MM 2023 - Information Retrieval Research Lab



2023-07-26 15:18

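ACM MM 2023, a top-tier conference in the multimedia field, recently announced its list of accepted papers. Research by Zhang Bo, a PhD student in the lab, on image-grounded dialogue generation was accepted as a long paper. ACM MM is recommended by the China Computer Federation (CCF) as a Class A international academic conference.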

Title: ZRIGF: An Innovative Multimodal Framework for Zero-Resource Image-Grounded Dialogue Generation

Abstract: Image-grounded dialogue systems benefit greatly from integrating visual information, resulting in high-quality response generation. However, current models struggle to effectively utilize such information in zero-resource scenarios, mainly due to the disparity between image and text modalities. To overcome this challenge, we propose an innovative multimodal framework, called ZRIGF, which assimilates image-grounded information for dialogue generation in zero-resource situations. ZRIGF implements a two-stage learning strategy, comprising contrastive pre-training and generative pre-training. Contrastive pre-training includes a text-image matching module that maps images and texts into a unified encoded vector space, along with a text-assisted masked image modeling module that preserves pre-training visual features and fosters further multimodal feature alignment. Generative pre-training employs a multimodal fusion module and an information transfer module to produce insightful responses based on harmonized multimodal representations. Comprehensive experiments conducted on both text-based and image-grounded dialogue datasets demonstrate ZRIGF's efficacy in generating contextually pertinent and informative responses. Furthermore, we adopt a fully zero-resource scenario in the image-grounded dialogue dataset to demonstrate our framework's robust generalization capabilities in novel domains.
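To make the idea of mapping images and texts into a unified vector space more concrete, here is a minimal sketch of a symmetric contrastive matching loss. The abstract does not state which objective ZRIGF's text-image matching module uses, so the CLIP-style InfoNCE formulation, the function names, and the temperature value below are all assumptions chosen for illustration, not the authors' implementation. The embeddings are plain Python lists standing in for encoder outputs and are assumed to be non-zero.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_matching_loss(text_embs, image_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired text and image embeddings.

    Hypothetical helper: text_embs[i] and image_embs[i] are assumed to be a
    matching pair; every other combination in the batch acts as a negative.
    """
    texts = [l2_normalize(t) for t in text_embs]
    images = [l2_normalize(v) for v in image_embs]

    # Similarity matrix: rows index texts, columns index images.
    logits = [
        [sum(a * b for a, b in zip(t, v)) / temperature for v in images]
        for t in texts
    ]
    # Transpose so that rows index images and columns index texts.
    logits_t = [list(col) for col in zip(*logits)]

    # Average the text-to-image and image-to-text directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


if __name__ == "__main__":
    aligned = contrastive_matching_loss([[1.0, 0.0], [0.0, 1.0]],
                                        [[1.0, 0.0], [0.0, 1.0]])
    shuffled = contrastive_matching_loss([[1.0, 0.0], [0.0, 1.0]],
                                         [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < shuffled)
```

In this toy batch, the loss is lower when each text embedding points in the same direction as its paired image embedding than when the pairs are swapped, which is the behavior a shared embedding space is meant to encourage.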

