Lab PhD Student Parhat (帕尔哈提) Has Research on Named Entity Recognition Accepted by TALLIP - Information Retrieval Laboratory

2025-06-05 23:03 Lu Junyu (卢俊宇)

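Research by lab PhD student Parhat (帕尔哈提) on named entity recognition for low-resource languages was recently accepted by the international journal ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP). TALLIP is a top journal in low-resource natural language processing and is listed as a Class C journal in the CCF recommended list of academic publications.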

Title: A Bilingual Legal NER Dataset and Semantics-Aware Cross-Lingual Label Transfer Method for Low-Resource Languages

Abstract: Named Entity Recognition (NER) for low-resource languages in the legal domain faces dual challenges of data scarcity and complex terminology. Existing cross-lingual methods based on model transfer and data transfer strategies often suffer from semantic drift and lack effective domain adaptation. To address these issues, we introduce BiLegalNERD, the first bilingual Chinese–Uyghur legal NER dataset, constructed via a semantics-aware annotation transfer framework. We further propose CUTLM, a novel cross-lingual labeling approach that integrates dual-directional translation with Levenshtein-based alignment to ensure accurate entity boundary preservation. On top of this dataset, we develop BiLegalNER, a domain-adaptive multilingual NER model incorporating vocabulary expansion and bilingual fine-tuning. Experimental results show that BiLegalNER achieves F1-scores of 86.65% and 89.11% on automatically translated and human-annotated corpora, respectively, significantly outperforming the strongest multilingual baseline. CUTLM also surpasses prior transfer methods by up to 9.89%, highlighting its effectiveness in entity-level consistency across languages. This work establishes a new benchmark for Uyghur legal NER and provides a scalable and transferable solution for cross-lingual information extraction in low-resource settings.

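Chinese abstract (translated): In the judicial domain, named entity recognition (NER) for low-resource languages faces the dual challenges of data scarcity and complex terminology. Current mainstream cross-lingual methods, such as model transfer and data transfer, commonly suffer from semantic drift and adapt poorly to domain characteristics. To address this, the paper constructs BiLegalNERD, the first bilingual Chinese-Uyghur legal NER dataset, using a semantics-aware label transfer strategy that keeps labels semantically consistent during cross-lingual mapping. The paper also proposes CUTLM, which combines bidirectional translation with Levenshtein-based alignment to transfer entity boundaries with high fidelity. Building on this, it designs BiLegalNER, a domain-adaptive NER model that incorporates domain vocabulary expansion and bilingual fine-tuning. Experiments show that the model achieves F1 scores of 86.65% and 89.11% on automatically generated and human-annotated data, respectively, outperforming the strongest multilingual baseline on both. CUTLM improves on existing cross-lingual transfer methods by up to 9.89%, which indicates that it strengthens entity-level consistency across languages. The work provides a new reference benchmark for Uyghur legal NER and contributes a generalizable solution for cross-lingual information extraction in low-resource languages.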
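To give a rough sense of what "Levenshtein-based alignment" can mean for label transfer, the sketch below projects one entity label onto a target-language sentence by picking the token span with the smallest normalized edit distance to the translated entity string. This is not the CUTLM implementation, which the announcement does not describe in detail. The function names (`levenshtein`, `best_span`, `transfer_label`), the `max_len` and `threshold` parameters, and the English example sentence are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def best_span(tokens, entity, max_len=5):
    """Return (start, end, normalized_distance) of the span closest to `entity`.

    `max_len` (hypothetical) caps the span length in tokens.
    """
    best = None
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            candidate = " ".join(tokens[start:end])
            dist = levenshtein(candidate, entity) / max(len(candidate), len(entity))
            if best is None or dist < best[2]:
                best = (start, end, dist)
    return best


def transfer_label(tokens, entity, label, threshold=0.3):
    """Write BIO tags for `label` onto the best-matching span, if close enough.

    `threshold` (hypothetical) rejects spans that are too dissimilar.
    """
    tags = ["O"] * len(tokens)
    match = best_span(tokens, entity)
    if match is not None and match[2] <= threshold:
        start, end, _ = match
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# Illustrative input: the translated entity differs slightly from the sentence text.
tokens = "The Supreme Peoples Court issued a ruling".split()
print(transfer_label(tokens, "Supreme People's Court", "ORG"))
# → ['O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O']
```

Fuzzy matching of this kind tolerates small spelling differences introduced by translation, and the threshold keeps a label from being attached to an unrelated span. A real system would also need to handle multiple entities per sentence and overlapping candidate spans.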
