Yijia Zhang's paper accepted by Scientific Data, a Nature-family journal
Source: IR Lab       Published: 2019/3/28 22:41:27

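  The lab recently received an email from the editorial office of Scientific Data, a Nature-family journal: the study "Improving biomedical word embeddings with subword information and MeSH", carried out by lab faculty member Yijia Zhang in collaboration with the research team of Zhiyong Lu, a researcher at the US NIH, has been accepted.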

  Summary:

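  Distributed representation learning can effectively map words to low-dimensional word vectors rich in semantic information, and it plays an important role across natural language processing research. The biomedical domain has relatively complete domain resources, and how to incorporate that domain knowledge into the learning of distributed word vectors is one of the field's active research topics. This work builds on the MeSH ontology and uses a random sampling algorithm over the network to generate sequences of MeSH main headings. A subword model is then used to learn biomedical distributed word vectors from these MeSH heading sequences and the PubMed corpus. The method uses the MeSH resource to further improve the quality of the word vectors, and the subword model effectively addresses the out-of-vocabulary word problem. Experimental results show that the BioWordVec word vectors trained with this method achieve excellent performance on both intrinsic and extrinsic tasks.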
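
  The two ideas in the summary above, sampling heading sequences from a graph and breaking words into subword units, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy graph and its headings are hypothetical and not taken from real MeSH data, the uniform random walk stands in for whatever sampling strategy the paper actually uses, and the character n-gram split is just one common way subword models represent words. The names `TOY_GRAPH`, `random_walk` and `char_ngrams` are invented here.

```python
import random

# Hypothetical toy graph standing in for a MeSH-like network.
# The headings and links are illustrative, not real MeSH data.
TOY_GRAPH = {
    "Neoplasms": ["Breast Neoplasms", "Lung Neoplasms"],
    "Breast Neoplasms": ["Neoplasms"],
    "Lung Neoplasms": ["Neoplasms", "Lung Diseases"],
    "Lung Diseases": ["Lung Neoplasms"],
}


def random_walk(graph, start, length, rng):
    """Return a heading sequence from a uniform random walk (a simplification)."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph[walk[-1]]
        if not neighbors:  # dead end: stop the walk early
            break
        walk.append(rng.choice(neighbors))
    return walk


def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# One walk per heading; each walk plays the role of a "sentence" of headings.
sequences = [random_walk(TOY_GRAPH, node, 5, rng) for node in TOY_GRAPH]

# A word absent from the training text still decomposes into n-grams,
# some of which it shares with known words.
unseen = char_ngrams("neoplasia")
known = char_ngrams("neoplasms")
shared = set(unseen) & set(known)
```

  In a sketch like this, the sampled sequences would be added to the text corpus as extra training sentences, and a word's vector would be built from the vectors of its n-grams, which is why an out-of-vocabulary word such as the hypothetical "neoplasia" can still receive a representation.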

  Abstract: 

  Distributed word representations have become an essential foundation for biomedical natural language processing (BioNLP), text mining and information retrieval. Word embeddings are traditionally computed at the word level from a large corpus of unlabeled text, ignoring the information present in the internal structure of words or any information available in domain specific structured resources such as ontologies. However, such information holds potentials for greatly improving the quality of the word representation, as suggested in some recent studies in the general domain. Here we present BioWordVec: an open set of biomedical word vectors/embeddings that combines subword information from unlabeled biomedical text with a widely-used biomedical controlled vocabulary called Medical Subject Headings (MeSH). We assess both the validity and utility of our generated word embeddings over multiple NLP tasks in the biomedical domain. Our benchmarking results demonstrate that our word embeddings can result in significantly improved performance over the previous state of the art in those challenging tasks.