Paper by PhD Student Luo Ling Accepted by Journal of Cheminformatics
Source: IR Lab       Published: 2018/12/11 16:27:42

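  The lab recently received an email from the editorial office of the bioinformatics journal Journal of Cheminformatics: the paper "A Neural Network Approach to Chemical and Gene/Protein Entity Recognition in Patents" by Luo Ling, a PhD student in the lab, has been accepted. The journal's impact factor is 3.893.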

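  Summary:

  In biomedical research, patents contain a wealth of biomedical information, and patent text mining has attracted wide attention from researchers in recent years. To promote the development of patent-based biomedical text mining, the BioCreative V.5 international challenge organized three tasks: chemical entity recognition (CEMP), gene and protein entity recognition (GPRO), and online entity annotation service systems (TIPS). This paper describes the neural network approach we proposed for the CEMP and GPRO tasks. In this approach, we apply a BiLSTM-CRF model to biomedical entity recognition. To improve system performance, we also explore how additional features such as part of speech and chunking affect the model. In the officially released results for the two tasks, our system achieved the best results among all participating teams, with precision (P), recall (R), and F-score (F) of 88.32%, 92.62%, and 90.42% on CEMP, and 76.65%, 81.91%, and 79.19% on GPRO.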

  Abstract: 

  In biomedical research, patents contain a significant amount of information, and biomedical text mining in patents has recently received much attention. To accelerate the development of biomedical text mining for patents, the BioCreative V.5 challenge organized three tracks, i.e., chemical entity mention recognition (CEMP), gene and protein related object recognition (GPRO) and technical interoperability and performance of annotation servers (TIPS), to focus on biomedical entity recognition in patents. This paper describes our neural network approach for the CEMP and GPRO tracks. In the approach, a bidirectional Long Short-Term Memory network with a conditional random field layer (BiLSTM-CRF) is employed to recognize biomedical entities in patents. To improve the performance, we explored the effect of additional features (i.e., part of speech (POS), chunking and named entity recognition (NER) features generated by the GENIA tagger) on the neural network model. In the official results, our best runs achieve the highest performance (a precision of 88.32%, a recall of 92.62%, and an F-score of 90.42% in the CEMP track; a precision of 76.65%, a recall of 81.91%, and an F-score of 79.19% in the GPRO track) among all participating teams in both tracks.
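  The reported F-scores can be cross-checked against the reported precision and recall. The sketch below assumes that "F-score" here means the balanced F1 measure (the harmonic mean of precision and recall), which the abstract does not state explicitly. The function name `f_score` and the `reported` dictionary are illustrative and do not come from the paper.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (harmonic mean of precision and recall).

    Assumption: the abstract's "F-score" is F1; the source does not say so.
    Inputs and output share the same unit (here, percent).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-score) in percent, copied from the abstract.
reported = {
    "CEMP": (88.32, 92.62, 90.42),
    "GPRO": (76.65, 81.91, 79.19),
}

for track, (p, r, f_reported) in reported.items():
    f = f_score(p, r)
    print(f"{track}: computed F = {f:.2f}%, reported F = {f_reported:.2f}%")
```

  Under the F1 assumption, the computed values agree with the reported ones to two decimal places for both tracks.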