Raki's notes on reading the paper: Code and Named Entity Recognition in StackOverflow
2022-07-05 04:25:00 【Sleeping Raki】
Abstract & Introduction & Related Work
Research tasks
Identifying code tokens and software-related entities poses challenges:
- These named entities are often ambiguous and have implicit reliance on the code snippets they accompany
Innovative ideas
- We propose a named entity recognizer (NER) that uses a multi-level attention network to combine the textual context with knowledge from code snippets
The experimental conclusion
SoftNER achieves an F1 score 9.73 points higher than BiLSTM-CRF
Annotated StackOverflow Corpus
For each question, four answers are annotated: the accepted answer, the most up-voted answer, and two randomly selected answers (if they exist)
Annotation Schema
Defines 20 fine-grained entity types
StackOverflow/GitHub Tokenization
Proposes a new tokenizer called SOTOKENIZER, tailored specifically to the computer programming community. The authors found that tokenization is not trivial here, because many code-related tokens are incorrectly split by existing web-text tokenizers.
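To make the tokenization problem concrete, below is a minimal, hypothetical code-aware tokenizer in Python. It is not the actual SOTOKENIZER; the regex patterns and the example sentence are purely illustrative. The idea is to keep code-like tokens such as pd.read_csv() intact instead of splitting them on punctuation, as a generic web-text tokenizer would.

```python
import re

# Code-like tokens we want to keep whole: calls like "pd.read_csv()" and
# dotted identifiers like "numpy.ndarray" (illustrative patterns only).
CODE_TOKEN = r"[A-Za-z_][\w.]*\(\)|[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)+"
WORD_OR_PUNCT = r"\w+|[^\w\s]"

def tokenize(sentence):
    """Greedy left-to-right tokenization that tries code patterns first."""
    return re.findall(CODE_TOKEN + "|" + WORD_OR_PUNCT, sentence)

print(tokenize("Call pd.read_csv() before df.head(), not after."))
# ['Call', 'pd.read_csv()', 'before', 'df.head()', ',', 'not', 'after', '.']
```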
Named Entity Recognition
Model overview
- Embedding extraction layer
- Multi-level attention layer
- BiLSTM-CRF
Input Embeddings
Extract ELMo representations and two domain-specific embeddings:
- Code Recognizer: indicates whether a word can be part of a code entity, regardless of context
- Entity Segmenter: predicts whether a word is part of any named entity in the given sentence
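As a rough illustration of how the contextual ELMo vectors for each token could be extracted with AllenNLP (the option/weight file paths below are hypothetical placeholders; the paper uses its own in-domain ELMo trained on StackOverflow):

```python
from allennlp.modules.elmo import Elmo, batch_to_ids

# Hypothetical in-domain ELMo files; the paper trains its own on StackOverflow.
OPTIONS_FILE = "so_elmo_options.json"
WEIGHTS_FILE = "so_elmo_weights.hdf5"

elmo = Elmo(OPTIONS_FILE, WEIGHTS_FILE, num_output_representations=1, dropout=0.0)

sentences = [["Convert", "the", "pandas", "dataframe", "to", "a", "numpy", "array", "."]]
character_ids = batch_to_ids(sentences)                # (batch, num_tokens, 50) char ids
output = elmo(character_ids)
token_vectors = output["elmo_representations"][0]      # (batch, num_tokens, 1024)
```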
In-domain Word Embeddings
Text in the software engineering domain contains programming-language tokens, such as variable names or code segments, interspersed with natural-language words. This makes input representations pre-trained on general news or web text ill-suited for the software domain. Therefore, we pre-train in-domain word embeddings, including ELMo, BERT, and GloVe vectors, on the 10-year StackOverflow archive of 2.3 billion tokens.
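The paper pre-trains ELMo, BERT, and GloVe on the full archive, which is beyond a short snippet; as a small stand-in, the sketch below trains word2vec-style in-domain embeddings with gensim on a hypothetical file of pre-tokenized StackOverflow sentences (file name and hyperparameters are assumptions):

```python
from gensim.models import Word2Vec  # gensim >= 4.0

# Hypothetical corpus: one pre-tokenized StackOverflow sentence per line.
with open("stackoverflow_sentences.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

# 300-dim in-domain embeddings; a rough stand-in for the paper's GloVe/ELMo/BERT pre-training.
model = Word2Vec(sentences, vector_size=300, window=5, min_count=5, workers=8)
model.wv.save("so_word_vectors.kv")
print(model.wv.most_similar("dataframe", topn=5))
```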
Context-independent Code Recognition
The input features include unigram word and 6-gram character probabilities from two language models (LMs), trained separately on the Gigaword corpus and on all the code snippets in the 10-year StackOverflow archive.
We also pre-train FastText (Joulin et al., 2016) word embeddings on these code snippets, where a word vector is represented as the sum of its character n-grams. We first vectorize each n-gram probability into a k-dimensional vector using Gaussian binning (Maddela and Xu, 2018), which has been shown to improve the performance of neural models that use numeric features (Sil et al., 2017; Liu et al., 2016; Maddela and Xu, 2018). We then feed the vectorized features into a linear layer, concatenate its output with the FastText character-level embedding, and pass the result through another hidden layer with sigmoid activation. If the output probability is greater than 0.5, we predict that the token is a code entity.
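Below is a minimal PyTorch sketch of this context-independent code recognizer. The Gaussian binning, layer sizes, bin range, and FastText dimension are assumptions for illustration, not the paper's exact hyperparameters:

```python
import torch
import torch.nn as nn

def gaussian_binning(x, k=10, low=0.0, high=1.0, sigma=0.1):
    """Turn a scalar feature into k Gaussian bin activations (rough sketch of
    Maddela and Xu, 2018); bin range and width here are assumptions."""
    centers = torch.linspace(low, high, k)                       # (k,)
    return torch.exp(-((x.unsqueeze(-1) - centers) ** 2) / (2 * sigma ** 2))

class CodeRecognizer(nn.Module):
    """Binned LM probabilities -> linear layer -> concat FastText vector ->
    hidden layer with sigmoid -> P(token is code). Sizes are illustrative."""

    def __init__(self, k=10, fasttext_dim=100, hidden=50):
        super().__init__()
        self.proj = nn.Linear(2 * k, hidden)                     # word-LM and char-LM probabilities
        self.out = nn.Linear(hidden + fasttext_dim, 1)

    def forward(self, word_lm_prob, char_lm_prob, fasttext_vec):
        binned = torch.cat([gaussian_binning(word_lm_prob),
                            gaussian_binning(char_lm_prob)], dim=-1)
        h = torch.cat([self.proj(binned), fasttext_vec], dim=-1)
        return torch.sigmoid(self.out(h)).squeeze(-1)            # > 0.5 => predict "code"
```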
Entity Segmentation
Concatenate the ELMo embedding with two hand-crafted features, word frequency and code markdown, and feed the result into a BiLSTM-CRF model to decide whether each token is part of an entity mention.
Word Frequency
Represents the word's count in the training set. Because many code tokens are defined by individual users, they occur far less frequently than ordinary English words. In fact, in our corpus the average frequencies of code and non-code tokens are 1.47 and 7.41 respectively, while ambiguous tokens that can be either code or non-code entities have a much higher average frequency of 92.57. To exploit this observation, we use word frequency as a feature, converting the scalar value into a k-dimensional vector.
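A small, self-contained sketch of the word-frequency feature (the toy training sentences are illustrative; in the model, the raw count is further vectorized into k dimensions, e.g. with the Gaussian binning shown in the code-recognizer sketch above):

```python
from collections import Counter

# Toy stand-in for the annotated training sentences.
train_sentences = [["How", "do", "I", "use", "numpy", "?"],
                   ["Use", "np.array", "to", "build", "an", "array", "."]]
freq = Counter(tok for sent in train_sentences for tok in sent)

def word_frequency(token):
    """Raw training-set count; code tokens like np.array tend to be rare."""
    return freq[token]

print(word_frequency("numpy"), word_frequency("np.array"), word_frequency("the"))
# 1 1 0
```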
Code Markdown
Indicates whether a given token appears inside a <code> markdown tag on Stack Overflow. It is worth noting that the <code> tags are noisy: users do not always wrap inline code in <code> tags, and sometimes use the tag to highlight non-code text. Nevertheless, we find that including the markdown information as a feature is helpful, as it improves the performance of our segmentation model.
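Combining the inputs described above (ELMo vectors, the binned word-frequency feature, and the <code> markdown flag), here is a hedged PyTorch sketch of the binary entity segmenter, using the third-party pytorch-crf package for the CRF layer; all dimensions and the two-tag scheme are assumptions:

```python
import torch
import torch.nn as nn
from torchcrf import CRF   # third-party "pytorch-crf" package

class EntitySegmenter(nn.Module):
    """Per token: concat the ELMo vector, a binned word-frequency vector, and a
    <code> markdown indicator, then run a BiLSTM-CRF over the sentence."""

    def __init__(self, elmo_dim=1024, freq_dim=10, hidden=200):
        super().__init__()
        input_dim = elmo_dim + freq_dim + 1                # +1 for the markdown flag
        self.lstm = nn.LSTM(input_dim, hidden, batch_first=True, bidirectional=True)
        self.emission = nn.Linear(2 * hidden, 2)           # tags: 0 = outside, 1 = inside entity
        self.crf = CRF(num_tags=2, batch_first=True)

    def forward(self, elmo, freq_vec, markdown_flag, tags=None):
        x = torch.cat([elmo, freq_vec, markdown_flag.unsqueeze(-1).float()], dim=-1)
        h, _ = self.lstm(x)                                # (batch, seq, 2*hidden)
        emissions = self.emission(h)
        if tags is not None:
            return -self.crf(emissions, tags)              # training: negative log-likelihood
        return self.crf.decode(emissions)                  # inference: best tag sequence
```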
Multi-Level Attention
For each word, the ELMo embedding, Code Recognizer vector, and Entity Segmenter vector are taken as the raw inputs and fed into a BiGRU to obtain hidden representations $h_{it}$, which are then passed through a linear layer with a tanh activation. An embedding-level context vector $u_e$, learned during training, is introduced, and a softmax over the resulting scores gives the attention weights $a_{it}$.
Finally, each word's embedding is the attention-weighted sum $\sum_t a_{it} h_{it}$.
The word-level attention follows the same scheme as the embedding-level attention, introducing another trainable context vector $u_w$.
We finally obtain $word_i = a_i h_i$, which is fed into the BiLSTM-CRF layer for prediction.
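A hedged PyTorch sketch of one attention level as described above; the same module can play the role of the embedding-level attention (over a word's input embeddings) and the word-level attention (over the words of a sentence). Hidden sizes are assumptions, and the trainable context vector corresponds to $u_e$ or $u_w$:

```python
import torch
import torch.nn as nn

class AttentionLevel(nn.Module):
    """BiGRU -> linear + tanh -> dot product with a trainable context vector ->
    softmax weights -> attention-weighted representations."""

    def __init__(self, input_dim, hidden=100):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, 2 * hidden)
        self.context = nn.Parameter(torch.randn(2 * hidden))   # plays the role of u_e / u_w

    def forward(self, x):                                  # x: (batch, seq, input_dim)
        h, _ = self.gru(x)                                 # (batch, seq, 2*hidden)
        u = torch.tanh(self.proj(h))
        a = torch.softmax(u @ self.context, dim=-1)        # attention weights a_it / a_i
        return a.unsqueeze(-1) * h                         # weighted representations

# Embedding-level use: sum the weighted vectors over a word's input embeddings
# to get one vector per word. Word-level use: keep word_i = a_i * h_i for each
# word and feed the whole sequence into the BiLSTM-CRF layer.
```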
Evaluation
Conclusion
We developed a new NER corpus consisting of 15,372 sentences from StackOverflow and 6,510 sentences from GitHub, annotated with 20 fine-grained named entity types.
We show that this new corpus is an ideal benchmark dataset for contextual word representations, as it contains many challenging ambiguities that often require long-distance context to resolve. We propose a new attention-based model, named SoftNER, which outperforms state-of-the-art NER models on this dataset.
In addition, we studied the important subtask of code recognition. Our new code recognition model captures additional spelling information beyond what character-based ELMo provides, and consistently improves the performance of the NER model. We believe that our corpus and StackOverflow-based named entity tagger will be useful for various language-and-code tasks, such as code retrieval, software knowledge base extraction, and automatic question answering.
Remark
I think the main contribution is the construction of a new dataset; the model itself is fairly conventional. Doing solid dataset work can also get you into a top conference!