Raki's paper reading notes: Soft Gazetteers for Low-Resource Named Entity Recognition
2022-07-05 04:25:00 【Sleeping Raki】
Abstract & Introduction & Related Work
- Research task
Low-resource named entity recognition
- Existing methods and related work
- Integrating linguistic features based on part-of-speech tags, word shapes, and manually created entity lists (called gazetteers) into neural models leads to better NER on English data
- Challenges
- However, it is difficult to integrate gazetteer features directly into these models for low-resource languages, because gazetteers in these languages either have limited coverage or do not exist at all. Because few annotators are available for low-resource languages, expanding the gazetteers costs time and money.
- Innovative ideas
- Introduces soft gazetteers, a method of creating continuous-valued gazetteer features from readily available data in high-resource languages and large English knowledge bases
- Entity linking methods are used
- Experimental conclusions
Our experiments demonstrate the effectiveness of the proposed soft gazetteer features, with an average improvement of 4 F1 points over the baseline across four low-resource languages: Kinyarwanda, Oromo, Sinhala, and Tigrinya (what languages are these?)
Binary Gazetteer Features
Binary gazetteer features indicate whether the corresponding n-gram appears in the gazetteer
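As a rough illustration of the idea above, here is a minimal sketch of binary gazetteer features. The function name, the toy gazetteer, and the choice of one flag per entity type are my own assumptions for illustration, not the paper's exact feature layout.

```python
from typing import Dict, List, Set, Tuple


def binary_gazetteer_features(
    tokens: List[str],
    gazetteer: Dict[str, Set[Tuple[str, ...]]],
    max_n: int = 3,
) -> List[List[int]]:
    """Return one 0/1 flag per entity type for every token.

    A flag is 1 if the token is part of any n-gram (n <= max_n)
    that appears in the gazetteer for that type.
    """
    types = sorted(gazetteer)  # fixed type order for the feature vector
    feats = [[0] * len(types) for _ in tokens]
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngram = tuple(tokens[start:start + n])
            for j, etype in enumerate(types):
                if ngram in gazetteer[etype]:
                    for i in range(start, start + n):
                        feats[i][j] = 1
    return feats


# Hypothetical toy gazetteer and sentence.
gaz = {"LOC": {("New", "York"), ("Kigali",)}, "PER": {("Paul", "Kagame")}}
tokens = ["Paul", "Kagame", "visited", "New", "York"]
features = binary_gazetteer_features(tokens, gaz)
```

With the types sorted as LOC, PER, each row of `features` is `[in_LOC_gazetteer, in_PER_gazetteer]` for one token. Every value is either 0 or 1, which is exactly what the soft version relaxes.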
Entity Linking
Entity linking (EL) is the task of associating a named entity mention with the corresponding entry in a structured knowledge base (KB) (Hachey et al., 2013). For example, linking the entity mention "Mars" to its Wikipedia entry. In most entity linking systems (Hachey et al., 2013; Sil et al., 2018), the first step is to shortlist candidate KB entries, which are then further processed by an entity disambiguation algorithm. Candidate retrieval methods, in general, also score each candidate with respect to the input mention
Soft Gazetteer Features
Creating gazetteers for low-resource languages is difficult, so we propose soft gazetteers. In contrast to binary gazetteer features, whose values are 0 or 1, soft gazetteer features are continuous, with values between 0 and 1
For each span, we assume an entity linking candidate retrieval method that returns a list of candidate KB entries, ranked by score
We try different ways of turning the candidate list into feature vectors:
- Use the score of the top-1 candidate only
- Use the scores of the top-3 candidates, giving three feature vectors
- For the top-3 candidates, check whether each candidate's type matches type t
- Count the candidate types among the top-30 candidates
- Compute the difference (margin) between two consecutive scores
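The feature designs listed above can be sketched roughly as follows. This is my own reading, not the paper's code: the type inventory `TYPES`, dividing counts by k to stay in [0, 1], and taking margins over the top four candidates are all assumptions, and candidate scores are assumed to already lie between 0 and 1.

```python
from typing import Dict, List, Tuple

# Hypothetical entity type inventory.
TYPES = ["PER", "LOC", "ORG", "GPE"]


def soft_gazetteer_features(
    candidates: List[Tuple[float, str]],
) -> Dict[str, List[float]]:
    """Build soft gazetteer features for one span.

    `candidates` holds (score, entity_type) pairs from a candidate
    retrieval method; scores are assumed to lie in [0, 1].
    """
    cands = sorted(candidates, key=lambda c: c[0], reverse=True)

    def score_vec(rank: int) -> List[float]:
        # Score of the candidate at `rank`, placed in the slot of its type.
        vec = [0.0] * len(TYPES)
        if rank < len(cands):
            score, etype = cands[rank]
            vec[TYPES.index(etype)] = score
        return vec

    def count_vec(k: int) -> List[float]:
        # Fraction of the top-k candidates that have each type
        # (normalising by k is an assumption).
        top = cands[:k]
        return [sum(1 for _, t in top if t == ty) / k for ty in TYPES]

    # Differences between consecutive scores among the top four candidates.
    margin = [
        cands[i][0] - cands[i + 1][0] if i + 1 < len(cands) else 0.0
        for i in range(3)
    ]
    return {
        "top1_score": score_vec(0),
        "top3_score": score_vec(0) + score_vec(1) + score_vec(2),
        "top3_count": count_vec(3),
        "top30_count": count_vec(30),
        "margin": margin,
    }


# Hypothetical candidate list for one span.
feats = soft_gazetteer_features([(0.9, "LOC"), (0.6, "GPE"), (0.3, "LOC")])
```

Unlike the binary features, a span that only weakly matches a KB entry still contributes a small non-zero value instead of being dropped to 0.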
We try different combinations of these features by concatenating their respective vectors. The concatenated vector is passed through a fully connected neural network layer with a tanh nonlinearity and then used in the NER model
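A minimal sketch of the concatenate-then-project step, in plain Python so it runs without a deep learning library. The feature values, the hidden size, and the random weights are placeholders I made up; in the real model the weights would be learned together with the NER model.

```python
import math
import random
from typing import List


def dense_tanh(
    x: List[float], weights: List[List[float]], bias: List[float]
) -> List[float]:
    """One fully connected layer followed by tanh: tanh(W x + b)."""
    return [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


random.seed(0)

# Hypothetical feature vectors for one span (values are illustrative).
top1_score = [0.0, 0.9, 0.0, 0.0]
margin = [0.3, 0.3, 0.0]

# Combine feature designs by concatenating their vectors.
concatenated = top1_score + margin

# Randomly initialised weights stand in for learned parameters.
hidden_size = 5
weights = [
    [random.uniform(-0.5, 0.5) for _ in concatenated]
    for _ in range(hidden_size)
]
bias = [0.0] * hidden_size

projected = dense_tanh(concatenated, weights, bias)
```

`projected` is the vector that would be handed to the NER model alongside the usual word representations.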
Named Entity Recognition Model
Adding an autoencoder that reconstructs the hand-crafted features improves NER performance. The autoencoder feeds the BiLSTM output into a fully connected layer with a sigmoid activation function and reconstructs the features. This forces the BiLSTM to retain information from the features. The cross-entropy loss of reconstructing the soft gazetteer features is the autoencoder objective, $L_{AE}$
The training loss is the combination of the CRF loss and the autoencoder loss
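To make the loss concrete, here is a small sketch under my own assumptions: the reconstruction loss is written as binary cross-entropy between sigmoid outputs and the soft feature values, and the two losses are combined as an unweighted sum. The notes only say the losses are combined, so the exact form is an assumption, and `crf_loss` is a placeholder number rather than a real CRF computation.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def reconstruction_loss(
    logits: List[float], target_features: List[float], eps: float = 1e-12
) -> float:
    """Cross-entropy between sigmoid(logits) and feature values in [0, 1]."""
    total = 0.0
    for z, y in zip(logits, target_features):
        p = sigmoid(z)
        total -= y * math.log(p + eps) + (1.0 - y) * math.log(1.0 - p + eps)
    return total


def joint_loss(crf_loss: float, ae_loss: float) -> float:
    # Assumption: plain unweighted sum of the two losses.
    return crf_loss + ae_loss


# Hypothetical values: outputs of the reconstruction layer before the
# sigmoid, and the soft gazetteer features they should reproduce.
logits = [2.0, -1.0, 0.0]
targets = [0.9, 0.0, 0.5]
l_ae = reconstruction_loss(logits, targets)
loss = joint_loss(crf_loss=3.2, ae_loss=l_ae)
```

Because the targets are continuous rather than 0/1, the cross-entropy does not reach zero even for a perfect reconstruction, but it is still minimised when the sigmoid output equals the feature value.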
Experiments
Methods
Soft gazetteer methods
We tested different candidate retrieval methods designed for low-resource languages. These methods are trained only on small bilingual lexicons from Wikipedia, which are similar in size to the gazetteers
- WIKIMEN: the WikiMention method is used in several state-of-the-art EL systems, where bilingual Wikipedia links are used to retrieve the appropriate English KB candidates
- Pivot-based entity linking: this method encodes entity mentions at the character level using n-gram neural embeddings (Wieting et al., 2016) and computes their similarity with KB entries. We experiment with two variants and follow Zhou et al. (2020) for hyperparameter selection.
1) PBELSUPERVISED: trained on the small number of bilingual Wikipedia links available in the target low-resource language.
2) PBELZERO: trained on some high-resource language (the "pivot") and transferred to the target language in a zero-shot manner. The transfer languages we use are Swahili for Kinyarwanda, Indonesian for Oromo, Hindi for Sinhala, and Amharic for Tigrinya
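To give a feel for character-level candidate retrieval, here is a toy stand-in. It is not PBEL: the learned neural n-gram encoder is swapped for raw character trigram counts with cosine similarity, and the mention and KB titles are made up for illustration.

```python
import math
from collections import Counter
from typing import List, Tuple


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Bag of character n-grams, with boundary markers around the string."""
    padded = f"<{text.lower()}>"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_candidates(
    mention: str, kb_titles: List[str], top_k: int = 3
) -> List[Tuple[float, str]]:
    """Score every KB title against the mention and keep the best top_k."""
    m = char_ngrams(mention)
    scored = [(cosine(m, char_ngrams(title)), title) for title in kb_titles]
    return sorted(scored, reverse=True)[:top_k]


# Hypothetical KB titles and a misspelled mention.
kb = ["Kigali", "Kenya", "Rwanda"]
ranked = retrieve_candidates("Kigari", kb)
```

The ranked `(score, title)` list plays the role of the candidate list that the soft gazetteer features are computed from; scores from cosine similarity over counts already fall between 0 and 1.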
Oracles
As an upper bound on accuracy, we compare against two artificially strong systems.
- ORACLEEL: for soft gazetteers, we assume perfect candidate retrieval that always returns the correct KB entry as the top candidate if the mention is non-NIL.
- ORACLEGAZ: we artificially inflate BINARYGAZ by adding all the named entities in our dataset to the gazetteer.
Conclusion
We propose a method for creating features for low-resource NER and show its effectiveness on four low-resource languages. Possible future directions include more sophisticated feature designs and combinations of candidate retrieval methods
Remark
The model is very simple, but this soft gazetteer method feels a bit complicated
Anyway, it's OK