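Recently, research on multimodal NER by Dr. Jinzhong Ning (宁金忠) of our lab was accepted by the journal IEEE Transactions on Audio, Speech, and Language Processing (TASLP). TASLP is a top journal in audio, acoustics, and language signal processing. It is rated Class B in the CCF recommended list of academic venues and Class A in the latest edition of Tsinghua University's recommended list for computer science.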
Title: GenEn-MNER: Enhancing Nested Chinese NER with Multimodal Fusion and Alignment via Speech-to-Text Generation
Abstract: In recent years, the academic community has increasingly focused on multimodal Chinese Named Entity Recognition (NER) that utilizes speech cues. Existing methods typically rely solely on the NER objective function to guide the alignment and fusion of speech and text, overlooking the inherent alignment within speech-text pairs. Furthermore, these approaches generally employ sequence labeling techniques, which are inadequate for handling nested entities. To address these limitations, we introduce GenEn-MNER, a novel multimodal nested Chinese NER approach that enhances fusion and alignment through speech-to-text generation. This method leverages natural alignment information obtained from the speech-to-text task, using a cross-modal Transformer to integrate and align modalities. Additionally, the table-filling module redefines nested NER by conceptualizing it as the prediction of token pair relationships. Experimental results, as indicated by F1 scores, on CNERTA flat version (80.83%), CNERTA nest version (80.66%), and AISHELL-NER (94.52%) not only confirm the effectiveness of our approach but also demonstrate its superiority to existing state-of-the-art methods.
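Summary (translated from the Chinese version): In recent years, the academic community has paid growing attention to multimodal Chinese named entity recognition (NER) that uses speech cues. However, existing methods usually rely only on the NER objective function to guide the alignment and fusion of speech and text, and fail to fully exploit the alignment information inherent in speech-text pairs. In addition, most of these methods adopt sequence labeling, which struggles to handle nested entities. To address these challenges, this work proposes GenEn-MNER, a new multimodal nested Chinese NER method that strengthens fusion and alignment between modalities through a speech-to-text generation task. Specifically, GenEn-MNER uses the natural alignment information obtained from the speech-to-text task and applies a cross-modal Transformer to integrate and align the modalities. We also introduce a table-filling strategy that recasts nested NER as the prediction of relations between token pairs. Experiments show that GenEn-MNER's F1 scores on the CNERTA flat-entity version (80.83%), the CNERTA nested-entity version (80.66%), and AISHELL-NER (94.52%) significantly exceed those of existing methods, which both confirms the effectiveness of the method and highlights its strong performance on multimodal nested Chinese NER.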
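
To give readers an intuition for why a table over token pairs can represent nested entities where sequence labeling cannot, here is a minimal sketch. It is not the paper's table-filling module: the function name `decode_table`, the label set, and the convention that cell `table[i][j]` holds the type of the span from token `i` to token `j` are all assumptions made for illustration, and the table is filled by hand rather than predicted by a model.

```python
from typing import List, Tuple


def decode_table(tokens: List[str], table: List[List[str]]) -> List[Tuple[str, str]]:
    """Read entities off an upper-triangular table of span labels.

    Assumed convention (hypothetical, for illustration only):
    table[i][j] with i <= j holds the entity type of the span
    tokens[i..j], or "O" when that span is not an entity.
    """
    entities = []
    for i in range(len(tokens)):
        for j in range(i, len(tokens)):
            label = table[i][j]
            if label != "O":
                # Chinese text is character-level, so join without spaces.
                entities.append(("".join(tokens[i:j + 1]), label))
    return entities


# Toy input: the organization "北京大学" contains the nested location "北京".
tokens = ["北", "京", "大", "学"]
n = len(tokens)
table = [["O"] * n for _ in range(n)]
table[0][1] = "LOC"  # span covering tokens 0..1
table[0][3] = "ORG"  # span covering tokens 0..3, overlapping the span above

entities = decode_table(tokens, table)
print(entities)
```

Because each candidate span has its own cell, the two overlapping entities do not compete for a single per-token tag, which is the limitation of sequence labeling that the abstract points to.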