Recently, our laboratory's work on Taiyi, a bilingual (Chinese-English) biomedical large language model, was accepted and published online by the Journal of the American Medical Informatics Association (JAMIA). JAMIA is a top-tier journal in medical informatics: it is ranked in JCR Q1, listed as a CCF-recommended Class B journal, and has an impact factor of 6.4.
Paper: https://doi.org/10.1093/jamia/ocae037
Project: https://github.com/DUTIR-BioNLP/Taiyi-LLM
Title: Taiyi: A Bilingual Fine-Tuned Large Language Model for Diverse Biomedical Tasks
Abstract: Objective: Most existing fine-tuned biomedical large language models (LLMs) focus on enhancing performance in monolingual biomedical question answering and conversation tasks. To investigate the effectiveness of fine-tuned LLMs on diverse biomedical natural language processing (NLP) tasks in different languages, we present Taiyi, a bilingual fine-tuned LLM for diverse biomedical NLP tasks.
Materials and Methods: We first curated a comprehensive collection of 140 existing biomedical text-mining datasets (102 English and 38 Chinese) spanning more than 10 task types. These corpora were then converted into instruction data used to fine-tune a general-purpose LLM. For the supervised fine-tuning phase, we propose a two-stage strategy to optimize model performance across the various tasks.
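To make the conversion step concrete, here is a minimal Python sketch that turns one labeled named entity recognition example into an instruction/response pair of the kind used for supervised fine-tuning. The prompt wording, JSON output schema, entity types, and function name are illustrative assumptions, not the actual templates used by the Taiyi authors.

```python
# A minimal sketch of dataset-to-instruction conversion. The prompt wording,
# JSON schema, and entity types are illustrative assumptions, NOT the actual
# templates used in the Taiyi paper.
import json

def ner_example_to_instruction(text: str, entities: list) -> dict:
    """Convert one labeled NER sample into an instruction/response record.

    entities: list of {"span": str, "type": str} dicts.
    """
    instruction = (
        "Extract all biomedical entities from the following text and "
        'return them as a JSON list of {"span", "type"} objects.\n\n'
        f"Text: {text}"
    )
    # The gold annotation becomes the target response for supervised fine-tuning.
    response = json.dumps(entities, ensure_ascii=False)
    return {"instruction": instruction, "output": response}

record = ner_example_to_instruction(
    "Aspirin reduces the risk of myocardial infarction.",
    [{"span": "Aspirin", "type": "Drug"},
     {"span": "myocardial infarction", "type": "Disease"}],
)
print(json.dumps(record, indent=2, ensure_ascii=False))
```

Other task types (relation extraction, text classification, question answering) can be serialized in the same spirit by swapping the instruction text and the target output format.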
Results: Experimental results on 13 test sets, covering named entity recognition, relation extraction, text classification, and question answering, demonstrate that Taiyi achieves superior performance compared with general LLMs. A case study involving additional biomedical NLP tasks further shows Taiyi's considerable potential for bilingual biomedical multitasking.
Conclusion: Leveraging rich, high-quality biomedical corpora and developing effective fine-tuning strategies can significantly improve the performance of LLMs in the biomedical domain. Taiyi demonstrates bilingual multitasking capability through supervised fine-tuning. However, tasks that are not generative in nature, such as information extraction, remain challenging for LLM-based generative approaches, which still underperform conventional discriminative approaches built on smaller language models.
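For readers who want to try the released model, below is a minimal inference sketch using the Hugging Face transformers API. The checkpoint id is an assumption inferred from the project's GitHub organization (see the project link above); consult the repository for the officially released weights and recommended generation settings.

```python
# Minimal inference sketch using Hugging Face transformers (requires the
# `transformers`, `torch`, and `accelerate` packages). The checkpoint id is
# an ASSUMPTION based on the project's GitHub organization; see
# https://github.com/DUTIR-BioNLP/Taiyi-LLM for the official release.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "DUTIR-BioNLP/Taiyi-LLM"  # assumed Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # half precision to fit a single GPU
    device_map="auto",
    trust_remote_code=True,
)

prompt = ("Extract all disease mentions from the following text:\n"
          "The patient was diagnosed with type 2 diabetes and hypertension.")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, skipping the echoed prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```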