Paper Reading: Medical NLP Model EMBERT
2022-07-05 17:41:00, by xieyan0811
English title: EMBERT: A Pre-trained Language Model for Chinese Medical Text Mining
Chinese title (translated): A pre-trained language model for Chinese medical text mining
Paper link: https://chywang.github.io/papers/apweb2021.pdf
Fields: natural language processing, knowledge graphs, biomedicine
Year of publication: 2021
Authors: Zerui Cai et al., East China Normal University
Venue: APWeb-WAIM
Citations: 1
Date read: 2022-06-22
Reading notes
For the medical domain, the paper uses synonyms from a knowledge graph (used only as a dictionary; no graph computation is involved) to train a BERT-like natural language representation model. Its advantage lies in knowledge substitution: three purpose-built self-supervised learning tasks capture fine-grained relationships between entities. Experimentally, it is slightly better than existing models. No corresponding code was found, and the paper does not describe the concrete procedure in much detail, so the aim here is mainly to grasp the spirit of the approach.
Worth borrowing: the Chinese medical knowledge graph it uses, in particular its synonym relations; AutoPhrase for automatic phrase recognition; and the method of segmenting entities at high-frequency word boundaries.
Introduction
The method in this paper aims to make better use of large amounts of unlabeled data and pre-trained models, to enhance them with entity-level knowledge, and to capture fine-grained semantic relationships. Compared with MC-BERT, this model focuses more on exploring the relationships between entities.
The authors target three main problems:
- Synonym variation: for example, "tuberculosis" and "consumption" refer to the same disease, but their textual forms differ.
- Entity nesting: for example, "novel coronavirus pneumonia" is itself an entity, but it also contains the entities "pneumonia" and "novel coronavirus"; previous methods attended only to the whole entity.
- Misreading of long entities: for example, "diabetic ketoacidosis"; parsing it correctly requires attending to the relationship between the primary entity and its other components.
The contributions of the paper are as follows:
- It proposes EMBERT (Entity-rich Medical BERT), a Chinese medical pre-trained model that can learn the characteristics of medical terms.
- It proposes three self-supervised tasks that capture semantic relatedness at the entity level.
- It evaluates on six Chinese medical datasets; experiments show the model outperforms previous methods.
Method

Entity context consistency prediction
A synonym dictionary is built from the SameAs relations of the knowledge graph at http://www.openkg.cn/. Entities in the dataset are replaced with their synonyms to construct additional training data, and the model is trained to predict the consistency between the replaced entity and its context, which improves model quality. In principle, a replaced entity should remain consistent with the original entity's context.
Suppose a sentence contains tokens x1…xn and the i-th entity xsi,…,xei is replaced, where s and e denote the start and end positions of the replacement; its context, denoted ci, consists of the tokens before position si and after position ei.
First, the replaced entity is encoded as a vector yi:

Then yi is used to predict the context ci, and the loss function is computed:

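The synonym-replacement step that constructs the training pairs can be sketched as follows. This is a minimal illustration, not the paper's (unreleased) code: the dictionary entries, sentence, and function name are all assumptions, and only data construction is shown, not the BERT encoder or the consistency classifier.

```python
# Sketch of synonym replacement for entity-context consistency prediction.
# Dictionary and helper names are illustrative, not from the paper.

import random

# Toy synonym dictionary, standing in for one built from SameAs relations.
SYNONYMS = {
    "肺结核": ["痨病"],  # tuberculosis -> consumption
}

def replace_entity(tokens, start, end, synonyms=SYNONYMS, rng=random):
    """Replace the entity tokens[start:end] with a synonym, if one exists.

    Returns (new_tokens, new_start, new_end, replaced_flag). The context
    (tokens before `start` and after `end`) is left untouched, so the new
    entity should still be consistent with the original context.
    """
    entity = "".join(tokens[start:end])
    if entity not in synonyms:
        return tokens, start, end, False
    substitute = rng.choice(synonyms[entity])
    new_tokens = tokens[:start] + list(substitute) + tokens[end:]
    return new_tokens, start, start + len(substitute), True
```

A replaced sentence keeps its original context, so it can be paired with a positive consistency label during pre-training.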
Entity segmentation
A rule-based system cuts long entities into several semantically meaningful parts and labels them; the model is then trained on the labeled data.
Concretely, an entity lexicon is built: a set of high-quality medical entities is collected from the training corpus and combined with the entities in the knowledge graph. AutoPhrase first produces an initial segmentation; the frequency with which each segment occurs at the start or end position is counted, and the top-100 high-frequency segments are checked manually and kept as the boundary lexicon.
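Collecting the high-frequency boundary segments can be sketched as below. AutoPhrase itself is an external tool; `segmented_entities` stands in for its output, and the example entities are illustrative:

```python
# Sketch of gathering boundary-segment candidates from AutoPhrase output.
# The top candidates are then checked manually, per the paper.

from collections import Counter

def boundary_candidates(segmented_entities, top_k=100):
    """Count how often each segment appears at the start or end of an
    entity's segmentation; return the top_k most frequent candidates."""
    counts = Counter()
    for segments in segmented_entities:
        counts[segments[0]] += 1   # segment at the start position
        counts[segments[-1]] += 1  # segment at the end position
    return [seg for seg, _ in counts.most_common(top_k)]
```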
Let the long entity be xsi,…,xei, further cut into segments; the last position of each segment is labeled 1 as a segmentation point, and all other positions are labeled 0. The model is trained to predict this label, defined as a binary classification problem. In the formula, y is the vector representation of the token at that position.

The loss function is calculated as follows :

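The 0/1 labeling scheme described above can be sketched as follows; the segment list is illustrative, and whether the entity's final character counts as a cut point is my assumption (no cut is emitted after the last segment):

```python
# Sketch of turning a segmented long entity into binary boundary labels:
# the last character of each internal segment is labeled 1, the rest 0.

def boundary_labels(segments):
    """segments: list of segment strings covering a long entity in order.
    Returns one label per character; 1 marks a segmentation point."""
    labels = []
    for i, seg in enumerate(segments):
        seg_labels = [0] * len(seg)
        if i < len(segments) - 1:  # no cut point after the final segment
            seg_labels[-1] = 1
        labels.extend(seg_labels)
    return labels
```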
Bidirectional entity masking
Using the segmentation from the previous step, a long entity can be split into an attribute (modifier) part and a meta entity (the primary entity). The attribute is masked and predicted from the meta entity; conversely, the meta entity is masked and predicted from the attribute.
Taking masking the meta entity as an example, the attribute and the relative position p are used to compute the representation of the meta entity:

Then yj is used to predict xj, with cross-entropy as the loss function:

Predicting the attribute from the meta entity works the same way; the final loss function Lben is the sum of the two losses.
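The two masked views can be sketched as below. The `[MASK]` convention follows BERT; the attribute/meta-entity split and the example strings are illustrative:

```python
# Sketch of bidirectional entity masking: mask the attribute and keep the
# meta entity, and vice versa. Character-level tokens, as in Chinese BERT.

MASK = "[MASK]"

def mask_pair(attribute, meta_entity):
    """Return the two masked views of a long entity = attribute + meta entity."""
    masked_attribute = [MASK] * len(attribute) + list(meta_entity)
    masked_meta = list(attribute) + [MASK] * len(meta_entity)
    return masked_attribute, masked_meta
```

During pre-training, each view is fed to the encoder and the masked part is predicted from the visible part, giving the two loss terms that sum to Lben.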
Loss function
The final loss function combines the BERT loss Lex with the losses of the three tasks above; λ is a hyperparameter.
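The formula itself is not reproduced in this note; a plausible reconstruction consistent with the description (only Lex and Lben are named in the text, so the subscripts ecc and es for the consistency and segmentation losses are my labels, and the placement of λ is an assumption) would be:

```latex
\mathcal{L} \;=\; \mathcal{L}_{ex} \;+\; \lambda \left( \mathcal{L}_{ecc} + \mathcal{L}_{es} + \mathcal{L}_{ben} \right)
```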

Experiments
The model is trained on Q&A and forum (BBS) data from the DXY (dingxiangyuan) medical community, about 5 GB in total. This is clearly less training data than MC-BERT used, yet the results are comparable.
The main experimental results are as follows :
