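Recently, the International Conference on Bioinformatics and Biomedicine (BIBM 2024) announced its list of accepted papers. The lab had nine papers accepted: six long papers and three short papers. BIBM is ranked as a Class B international conference by the China Computer Federation (CCF) and is dedicated to interdisciplinary research across bioinformatics and biomedicine.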
Paper 1: An Improved Method for Phenotype Concept Recognition Using Rich HPO Information (long paper)
Authors: master's student 祁杰蔚 et al.
Abstract: Automatically identifying human phenotype ontology (HPO) concepts from text is important for disease analysis. Existing ontology-driven methods for phenotype concept recognition mainly rely on concept names and synonym information from the ontology, without fully exploiting the rich ontology information. In this paper, we present an improved phenotype concept recognition method by incorporating rich HPO information. We first design prompts with HPO information and use a cutting-edge large language model, GPT-4, to generate synonym augmentation for expanding distantly supervised training data. We then propose an ontology vector-enhanced phenotype concept classification model to efficiently integrate the taxonomic hierarchical structure of HPO. Additionally, we employ noisy data augmentation to improve the model's recognition ability in noisy texts and implement a negation detection function. Experimental results on three standard corpora and two typo corpora show our method compares favorably to previous methods and achieves a significant improvement in noisy texts.
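As a rough illustration of the noisy data augmentation mentioned in the abstract, the sketch below perturbs a phrase with random character-level typos. The function name `add_typos`, the three edit operations, and the `rate` parameter are assumptions made for this example; the abstract does not describe how the authors generate noise.

```python
import random
import string
from typing import Optional


def add_typos(text: str, rate: float = 0.1, seed: Optional[int] = None) -> str:
    """Return a copy of `text` with random character-level typos (hypothetical helper).

    Each alphabetic character is, with probability `rate`, deleted,
    replaced by a random lowercase letter, or swapped with its right neighbour.
    """
    rng = random.Random(seed)
    out = []
    i = 0
    while i < len(text):
        c = text[i]
        if c.isalpha() and rng.random() < rate:
            op = rng.choice(["delete", "substitute", "swap"])
            if op == "delete":
                i += 1
                continue
            if op == "substitute":
                out.append(rng.choice(string.ascii_lowercase))
                i += 1
                continue
            if i + 1 < len(text):
                # Swap with the next character and skip past both.
                out.extend([text[i + 1], c])
                i += 2
                continue
            # A swap on the last character has no neighbour: keep it unchanged.
        out.append(c)
        i += 1
    return "".join(out)
```

Calling `add_typos("short stature", rate=0.2, seed=0)` returns a perturbed copy of the phrase that could be paired with the original concept label as an extra training example.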
Paper 2: Dual Sentiment-aware Networks for Adverse Drug Reactions Detection (long paper)
Authors: PhD student 邱云志 et al.
Abstract: Adverse drug reactions (ADRs) are harmful reactions that occur with qualified drugs under normal dosage. Due to their significant impact on patients' health, timely detection of ADRs is of great significance. Current approaches typically introduce sentiment to enhance the detection of ADRs because sentiment information can reflect the subjective feelings of the patient about the drug. However, none of the existing works has considered both explicit and implicit sentiment knowledge. To this end, we propose a novel approach named DSN to adaptively fuse explicit and implicit sentiment knowledge for adverse drug reaction detection. Specifically, we first propose a dimension extension approach to avoid the information loss problem faced when introducing explicit sentiment knowledge, and then utilize a sentiment-aware network based on an attention mechanism to learn explicit and implicit sentiment knowledge. Finally, we design a sentiment fusion network to adaptively learn text and sentiment knowledge in order to help and guide the deeper fusion of text and the two kinds of sentiment knowledge. Extensive experiments conducted on several datasets demonstrate the superiority of our proposed DSN over existing methods.
Paper 3: Document Embeddings Enhance Biomedical Retrieval-Augmented Generation (long paper)
Authors: master's student 孔永乐 et al.
Abstract: Large language models (LLMs) perform well in many NLP tasks but frequently generate inaccurate information in the biomedical domain, due to hallucination issues. Retrieval-Augmented Generation (RAG) has been introduced to address this issue by integrating external knowledge, enhancing the factual accuracy of outputs. However, naive RAG encounters challenges in effectively utilizing retrieved content, particularly in specialized domains like biomedicine. LLMs often struggle to integrate retrieved content as irrelevant information can interfere with the model’s judgment. Even if relevant documents are retrieved, the model may be unable to accurately comprehend and utilize the domain-specific features due to its inherent knowledge limitations. To overcome these limitations, we propose Document Embeddings Enhanced Biomedical RAG (DEEB-RAG), a framework that incorporates document embeddings along with the original retrieved text. DEEB-RAG uses MedCPT to generate document embeddings, and these embeddings are then aligned with the LLM’s semantic space using a two-stage training process on a simple projector. Experimental results on biomedical QA datasets show that DEEB-RAG improves accuracy, with an average performance increase of 2.3% over naive RAG. This demonstrates DEEB-RAG’s ability to mitigate the challenges of utilizing complex biomedical information, thereby enhancing the reliability and effectiveness of LLMs in the biomedical domain.
Paper 4: Biomedical Document-level Relation Extraction with Coreference and Anaphor Graphs (long paper)
Authors: PhD student 李记如 et al.
Abstract: Biomedical document-level relation extraction is a crucial technology for mining the biomedical relationships necessary for clinical diagnosis, treatment, and medical discovery. Although existing intrasentential relation extraction methods have achieved significant results, the complexity and scattered nature of information in biomedical literature require relation extraction techniques to effectively handle cross-sentence information. For example, existing methods have not been able to explicitly model the phenomena of coreference and anaphor in documents, thus affecting the model’s understanding of complex semantics within the document. To address this issue, we propose a new document-level relation extraction model with coreference and anaphor graphs. By abstracting the document into an undirected graph that includes coreference and anaphor information, the framework effectively models the interactions between entities and leverages a graph convolutional network in conjunction with a pre-trained language model to dynamically understand graph structures. Additionally, the shift from fine-grained entity-pair level to coarse-grained document-level training and inference significantly enhances the model’s efficiency while maintaining high extraction performance. Extensive experiments demonstrate that our model achieves a 5.3% increase in F1-score over baseline models on the BioRED dataset with higher efficiency, confirming its effectiveness in handling relation extraction tasks in complex biomedical literature.
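To make the idea of an undirected graph carrying coreference and anaphor information more concrete, here is a minimal sketch of such a data structure. The function `build_mention_graph`, the mention identifiers, and the input formats are hypothetical; the abstract does not specify how the authors define nodes or edges, and the graph convolutional network is not shown.

```python
def build_mention_graph(coref_clusters, anaphor_links):
    """Build an undirected graph over mention ids (hypothetical helper).

    coref_clusters: list of lists; mentions in one list refer to the same entity.
    anaphor_links: list of (anaphor, antecedent) pairs, e.g. a pronoun and its referent.
    Returns a dict mapping each mention to the set of its neighbours.
    """
    graph = {}
    for cluster in coref_clusters:
        for i, a in enumerate(cluster):
            # Register every mention as a node, even if it has no neighbours.
            graph.setdefault(a, set())
            for b in cluster[i + 1:]:
                graph[a].add(b)
                graph.setdefault(b, set()).add(a)
    for anaphor, antecedent in anaphor_links:
        # Undirected edge: add it in both directions.
        graph.setdefault(anaphor, set()).add(antecedent)
        graph.setdefault(antecedent, set()).add(anaphor)
    return graph


# Example with made-up mention ids of the form "<text>@<sentence>".
example = build_mention_graph(
    coref_clusters=[["aspirin@s1", "aspirin@s3"], ["headache@s2"]],
    anaphor_links=[("it@s2", "aspirin@s1")],
)
```

An adjacency structure of this kind is the usual input to a graph convolution layer, which would then propagate information between linked mentions across sentences.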
Paper 5: Modeling Implicit Emotion and User-specific Context for Malevolence Detection in Mental Health Counseling Dialogues (long paper)
Authors: faculty member 徐博 et al.
Abstract: Generative conversational agents, driven by large language models, have gained widespread popularity. However, a significant drawback lies in their tendency to produce uncontrollable and unpredictable content, thereby increasing the risk of generating malevolent responses that potentially exacerbate users’ mental health issues. Although existing research on malevolence detection in dialogues has addressed the modeling of interaction patterns in dialogues, the implicitly expressed emotion and user-specific context are often neglected. Addressing this gap, we propose a hypergraph-enhanced context modeling approach for detecting malevolence in mental health counseling dialogues. Our approach harnesses the emotion reasoning capabilities of large language models to generate implicit emotional prompts. Employing hypergraph neural networks, our approach effectively integrates emotional context, user-specific context, and interactive context, fusing them into high-order semantic representations using hypergraph convolution. Experimental results on two benchmark datasets, MDRDC and Dialogue Safety, demonstrate the superiority of our model over state-of-the-art baseline models, particularly in complex contextual scenarios.
Paper 6: MAT: Medical AI-generated Text Detection Dataset from Multi-models and Multi-Methods (long paper)
Authors: faculty member 徐博 et al.
Abstract: Large language models (LLMs) have been widely used in society due to their amazing emergent capabilities, but they also bring security issues. A large amount of content on the Internet may be generated by AI. Whether in social forums or in more professional academic fields, the abuse of AI has become a problem. Especially in some professional fields, blindly trusting what the Internet says is dangerous. For this reason, for some fields that are risky and need to limit the use of AI, such as medicine, a more comprehensive benchmark is needed to test the ability of AI-generated text detection tasks. Considering the popularity of LLMs, the data distributions used in the training process of different LLMs may lead to differences in generated data distributions, especially for some LLMs for non-native English speakers. To address this issue, this article introduces an AI-generated text detection dataset in the field of medical question answering. This dataset is generated by various models and prompting methods, and cross validation is performed on multiple types of data between texts generated by different methods to verify the effectiveness of AI-generated text detection and model classification tasks, and to study the generalization of the dataset in different tasks. We have published the dataset for future research at https://github.com/Hellpoop/MAT
Paper 7: CFAH: A Chinese Dataset for Detecting False Advertising in Healthcare (short paper)
Authors: PhD student 付伟茹 et al.
Abstract: False advertising in healthcare can lead to severe harm to public health and disrupt market order. Traditional regulation of healthcare advertisements relies on manual review, leading to inefficiency. Furthermore, existing detection systems often rely on keyword matching to filter content, which lacks deep semantic analysis of advertising texts and may result in significant under-detection. To tackle this issue, we propose an automatic detection task for false advertising techniques in healthcare, including extracting false advertising spans and determining the techniques used. However, there is currently a lack of datasets available for research in this task. In response, we construct and release a Chinese dataset for detecting False Advertising in Healthcare (CFAH), which includes 12 types of false advertising techniques in healthcare advertisements, annotated at the span level. Viewing our task as a span identification challenge, we evaluated mainstream span identification models on the CFAH dataset. The experimental results demonstrate that both span-based traditional methods and fine-tuned generative Large Language Models (LLMs) perform well on this task, with specific models in each category showing the best performance. These results provide important model baselines and references for subsequent research.
Paper 8: Document-level Biomedical Relation Extraction Based on Relation-guided Entity-level Graphs (short paper)
Authors: master's student 高梁育 et al.
Abstract: The task of document-level biomedical relation extraction involves identifying relational facts between entities across sentences, given specific entities. However, most current methods overlook the associations between entity pairs and generate fixed entity representations merely through mentions, leading to irrelevant mentions interfering with the determination of relational facts. Additionally, these methods fail to consider the global information and dependencies between relational entities. To address these issues, we propose a document-level relation extraction model based on relation-guided entity-level graphs. Our model aggregates all mentions of the same entity through a relation-guided attention mechanism to obtain flexible entity representations. Furthermore, by using U-Net to generate entity-level feature graphs, it facilitates global interactions and dependency capture between entity pairs. Experimental results on two benchmark datasets demonstrate the advantages of our approach in document-level biomedical relation extraction.
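The aggregation of mentions through attention can be illustrated with a small softmax-weighted pooling example. This is only a generic sketch of attention pooling under assumed inputs: the function `attend_mentions`, the dot-product scoring, and the `relation_query` vector are illustrative choices, not the scoring function or architecture used in the paper.

```python
import math


def attend_mentions(mention_vecs, relation_query):
    """Pool several mention vectors into one entity vector (hypothetical helper).

    Each mention is scored by its dot product with `relation_query`;
    the scores are normalised with a softmax and used as pooling weights.
    Returns (entity_vector, weights).
    """
    scores = [sum(m * q for m, q in zip(vec, relation_query)) for vec in mention_vecs]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(relation_query)
    entity = [sum(w * vec[d] for w, vec in zip(weights, mention_vecs)) for d in range(dim)]
    return entity, weights


# Two toy mentions of the same entity; the query favours the first one.
entity_vec, mention_weights = attend_mentions(
    mention_vecs=[[1.0, 0.0], [0.0, 1.0]],
    relation_query=[2.0, 0.0],
)
```

Because the weights depend on the query, the same entity receives a different representation for each candidate relation, which is one way to read the "flexible entity representations" described in the abstract.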
Paper 9: Biomedical Event Extraction as Semantic Segmentation (short paper)
Authors: master's student 高梁育 et al.
Abstract: In the biomedical field, information is widely distributed across numerous pieces of literature. Extracting events between entities from biomedical texts has garnered significant attention in recent years. However, previous research primarily focuses on extracting flat biomedical events, with less attention given to nested biomedical events. Moreover, existing methods for extracting nested events often overlook the long-distance dependencies and global information between trigger words and arguments within events, and they lack sufficient interaction with event type information. To address these issues, we propose a semantic segmentation-based method for extracting nested biomedical events. We introduce U-Net to capture global information and interdependencies between event entities. Additionally, we map event types to natural language text and combine them with sentences for encoding to enhance interaction. We also employ two auxiliary tasks to improve the identification of trigger words and arguments. Finally, events are extracted by identifying the four vertices of the segmented region. Experimental results on two benchmark datasets show that our method excels in recognizing nested biomedical events and outperforms current state-of-the-art methods.
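The final decoding step, reading an event off the four vertices of a segmented region, can be pictured with a toy grid. The sketch assumes the segmentation output is a 2D grid of integer labels and that a region is summarised by its bounding rectangle; both the grid layout and the function `region_vertices` are assumptions for illustration, since the abstract does not describe the actual decoding procedure.

```python
def region_vertices(grid, label):
    """Return the four corner cells of the bounding rectangle of `label` (hypothetical helper).

    grid: list of rows, each a list of integer labels.
    Returns [(top, left), (top, right), (bottom, left), (bottom, right)],
    or None if the label does not occur in the grid.
    """
    cells = [
        (r, c)
        for r, row in enumerate(grid)
        for c, value in enumerate(row)
        if value == label
    ]
    if not cells:
        return None
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    top, bottom = min(rows), max(rows)
    left, right = min(cols), max(cols)
    return [(top, left), (top, right), (bottom, left), (bottom, right)]


# Toy segmentation map: label 1 marks a 2x2 region in the middle.
toy_grid = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
corners = region_vertices(toy_grid, 1)
```

In a token-by-token grid, the row and column ranges of such a rectangle could then be mapped back to text spans, for example a trigger span and an argument span.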