Recently, ACM MM 2023, a top-tier conference in the multimedia field, announced its list of accepted papers. The work of Zhang Bo, a Ph.D. student in our lab, on Image-Grounded Dialogue Generation was accepted as a full paper. ACM MM is a premier conference in multimedia and is recommended by CCF as a Class A international academic conference.
Title: ZRIGF: An Innovative Multimodal Framework for Zero-Resource Image-Grounded Dialogue Generation
Abstract: Image-grounded dialogue systems benefit greatly from integrating visual information, resulting in high-quality response generation. However, current models struggle to effectively utilize such information in zero-resource scenarios, mainly due to the disparity between image and text modalities. To overcome this challenge, we propose an innovative multimodal framework, called ZRIGF, which assimilates image-grounded information for dialogue generation in zero-resource situations. ZRIGF implements a two-stage learning strategy, comprising contrastive pre-training and generative pre-training. Contrastive pre-training includes a text-image matching module that maps images and texts into a unified encoded vector space, along with a text-assisted masked image modeling module that preserves pre-training visual features and fosters further multimodal feature alignment. Generative pre-training employs a multimodal fusion module and an information transfer module to produce insightful responses based on harmonized multimodal representations. Comprehensive experiments conducted on both text-based and image-grounded dialogue datasets demonstrate ZRIGF's efficacy in generating contextually pertinent and informative responses. Furthermore, we adopt a fully zero-resource scenario in the image-grounded dialogue dataset to demonstrate our framework's robust generalization capabilities in novel domains.
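To give a concrete sense of the text-image matching module in the contrastive pre-training stage, the sketch below shows a minimal CLIP-style symmetric InfoNCE objective that projects pooled text and image features into a shared embedding space. The module name, feature dimensions, and loss form are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Minimal sketch of a text-image matching module (assumed CLIP-style contrastive
# objective); names and dimensions are illustrative, not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextImageMatcher(nn.Module):
    def __init__(self, text_dim=768, image_dim=1024, embed_dim=512):
        super().__init__()
        # Project both modalities into a unified encoded vector space.
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.image_proj = nn.Linear(image_dim, embed_dim)
        # Learnable temperature, initialized near log(1/0.07) as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, text_feats, image_feats):
        # text_feats: (B, text_dim) pooled text-encoder outputs
        # image_feats: (B, image_dim) pooled image-encoder outputs
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        v = F.normalize(self.image_proj(image_feats), dim=-1)
        logits = self.logit_scale.exp() * t @ v.t()  # (B, B) similarity matrix
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric InfoNCE: each text matches its paired image and vice versa.
        loss = (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2
        return loss
```

Aligning the two modalities in one vector space this way is what lets the later generative pre-training stage fuse retrieved image features with dialogue context even when no paired image-dialogue data is available.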