
Raki's paper-reading notes: Code and Named Entity Recognition in StackOverflow

2022-07-05 04:25:00 Sleeping Raki

Abstract & Introduction & Related Work

  • Research task
    Identifying code tokens and software-related entities

  • Challenges

    1. These named entities are often ambiguous and have implicit dependencies on the code snippets they accompany
  • Key ideas

    1. A software-related named entity recognizer (SoftNER) that uses a multi-level attention network to combine the textual context with knowledge from code snippets
  • Experimental conclusion
    SoftNER outperforms BiLSTM-CRF by 9.73 F1 points

Annotated StackOverflow Corpus

For every question, four answers are annotated: the accepted answer, the most upvoted answer, and two randomly selected answers (if they exist).

Annotation Schema

Twenty fine-grained entity types are defined.

StackOverflow/GitHub Tokenization

A new tokenizer called SOTOKENIZER is proposed, tailored specifically to the computer-programming community. Tokenization here turns out to be non-trivial, because many code-related tokens are incorrectly segmented by existing web-text tokenizers, as the sketch below illustrates.
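
As a rough illustration of the failure mode (a minimal regex sketch of my own, not the actual SOTOKENIZER rules from the paper), a tokenizer for this domain has to keep tokens such as numpy.ndarray or i++ intact, where a generic word tokenizer would split on the . and +:

```python
import re

# Toy pattern: keep dotted identifiers (numpy.ndarray) and operator-glued
# tokens (i++) whole, where a generic web-text tokenizer would split them.
# Purely illustrative -- NOT the actual SOTOKENIZER rules.
TOKEN_PATTERN = re.compile(r"""
      [A-Za-z_][\w.]*\+\+      # identifier followed by ++ (e.g. i++)
    | [A-Za-z_][\w.]*          # identifiers, incl. dotted (numpy.ndarray)
    | \d+(?:\.\d+)?            # integer or decimal numbers
    | \S                       # any other single non-space character
""", re.VERBOSE)

def so_tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(so_tokenize("Cast a numpy.ndarray to list, then i++ inside the loop"))
# ['Cast', 'a', 'numpy.ndarray', 'to', 'list', ',', 'then', 'i++', ...]
```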

Named Entity Recognition

Model overview

  1. Embedding extraction layer
  2. Multi-level attention layer
  3. BiLSTM-CRF layer

Input Embeddings

ELMo representations are extracted, together with two domain-specific embeddings:

Code Recognizer: indicates whether a word can be part of a code entity, regardless of context.

Entity Segmenter: predicts whether a word is part of any named entity in the given sentence.

In-domain Word Embeddings

Text in the software-engineering domain mixes programming-language tokens, such as variable names or code fragments, with natural-language words. This makes input representations pre-trained on general news and web text a poor fit for the software domain. The authors therefore pre-trained in-domain word embeddings, including ELMo, BERT, and GloVe vectors, on 2.3 billion tokens from the 10-year StackOverflow archive.
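
As a hedged, much smaller-scale sketch of the same idea, one could train in-domain vectors with gensim's Word2Vec (a stand-in here for the GloVe/ELMo/BERT training in the paper; the corpus file name is a placeholder):

```python
from gensim.models import Word2Vec

# Hypothetical input: one pre-tokenized StackOverflow sentence per line.
# The paper trains ELMo/BERT/GloVe on ~2.3B tokens; this toy run is only
# meant to show what "in-domain embeddings" means in practice.
with open("stackoverflow_sentences.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

model = Word2Vec(
    sentences,
    vector_size=300,  # dimensionality of the in-domain vectors
    window=5,
    min_count=5,      # drop the rarest user-defined tokens
    workers=4,
)
model.save("so_vectors.model")

# In-domain neighbors should be code-flavored, unlike news-trained vectors.
print(model.wv.most_similar("ndarray", topn=5))
```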

Context-independent Code Recognition

The input features include unigram word probabilities and 6-gram character probabilities from two language models (LMs), trained separately on the Gigaword corpus and on all code snippets in the 10-year StackOverflow archive.

FastText (Joulin et al., 2016) word embeddings were also pre-trained on these code snippets, where each word vector is the sum of its character n-gram vectors. Each n-gram probability is first converted into a k-dimensional vector using Gaussian binning (Maddela and Xu, 2018), which has been shown to improve the performance of neural models that use numeric features (Sil et al., 2017; Liu et al., 2016; Maddela and Xu, 2018). The vectorized features are then fed into a linear layer, whose output is concatenated with the FastText character-level embeddings and passed through another hidden layer with a sigmoid activation. If the output probability is greater than 0.5, the token is predicted to be a code entity.
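
A minimal PyTorch sketch of this classifier, under assumed dimensions and names (the Gaussian-binned LM features and the FastText embedding are taken as precomputed inputs):

```python
import torch
import torch.nn as nn

class CodeRecognizer(nn.Module):
    """Context-independent classifier: can this token be a code entity?

    A sketch of the pipeline described above, with assumed dimensions:
    Gaussian-binned LM probabilities -> linear layer, concatenated with a
    FastText character-level embedding -> sigmoid hidden layer -> P(code).
    """
    def __init__(self, binned_dim=20, fasttext_dim=100, hidden_dim=50):
        super().__init__()
        self.feature_proj = nn.Linear(binned_dim, hidden_dim)
        self.hidden = nn.Linear(hidden_dim + fasttext_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, binned_lm_feats, fasttext_emb):
        # binned_lm_feats: (batch, binned_dim) binned word/char LM probabilities
        # fasttext_emb:    (batch, fasttext_dim) sum of char n-gram vectors
        projected = self.feature_proj(binned_lm_feats)
        combined = torch.cat([projected, fasttext_emb], dim=-1)
        h = torch.sigmoid(self.hidden(combined))
        return torch.sigmoid(self.out(h)).squeeze(-1)

model = CodeRecognizer()
probs = model(torch.randn(4, 20), torch.randn(4, 100))
is_code = probs > 0.5  # threshold from the paper
```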

Entity Segmentation

The ELMo embeddings are concatenated with two hand-crafted features, word frequency and code markdown, and fed into a BiLSTM-CRF model that decides whether each token is part of an entity mention; a sketch follows, and the two features are described below.
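
A hedged sketch of the segmenter's feature concatenation and BiLSTM (dimensions are assumptions; the CRF on top is only noted in a comment):

```python
import torch
import torch.nn as nn

class EntitySegmenter(nn.Module):
    """BiLSTM over [ELMo ; binned word frequency ; code-markdown flag].

    Sketch only: in the paper a CRF layer sits on top of the BiLSTM
    (e.g., via the pytorch-crf package) to tag in-entity vs. out-of-entity.
    """
    def __init__(self, elmo_dim=1024, freq_dim=10, hidden=200, num_tags=2):
        super().__init__()
        self.bilstm = nn.LSTM(elmo_dim + freq_dim + 1, hidden,
                              batch_first=True, bidirectional=True)
        self.emissions = nn.Linear(2 * hidden, num_tags)

    def forward(self, elmo, freq_vec, markdown_flag):
        # elmo:          (batch, seq, elmo_dim)
        # freq_vec:      (batch, seq, freq_dim) Gaussian-binned frequency
        # markdown_flag: (batch, seq, 1), 1 if the token appeared in <code>
        x = torch.cat([elmo, freq_vec, markdown_flag], dim=-1)
        h, _ = self.bilstm(x)
        return self.emissions(h)  # per-token scores for the CRF to decode
```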

Word Frequency

This feature represents a word's frequency in the training set. Because many code tokens are defined by individual users, they occur far less often than common English words: in the corpus, the average frequencies of code and non-code tokens are 1.47 and 7.41, respectively. Moreover, ambiguous tokens that can be either code or non-code entities have a much higher average frequency of 92.57. To exploit this observation, word frequency is used as a feature, with the scalar value converted into a k-dimensional vector (by the same Gaussian binning as above; a sketch follows).
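
A minimal numpy sketch of Gaussian binning as I read Maddela and Xu (2018): the scalar is scored against k evenly spaced bin centers with a Gaussian membership function (the bin range and width below are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def gaussian_binning(value, low=0.0, high=100.0, k=10, sigma=None):
    """Map a scalar (e.g., training-set word frequency) to a k-dim vector.

    Each dimension is a Gaussian membership score against one bin center,
    so nearby scalar values get smooth, similar representations. The bin
    range and width here are illustrative, not the paper's exact settings.
    """
    centers = np.linspace(low, high, k)
    if sigma is None:
        sigma = (high - low) / k  # roughly one bin width of smoothing
    scores = np.exp(-((value - centers) ** 2) / (2 * sigma ** 2))
    return scores / scores.sum()  # normalize into a distribution over bins

# Average frequencies reported above: code 1.47, non-code 7.41, ambiguous 92.57
for freq in (1.47, 7.41, 92.57):
    print(freq, gaussian_binning(freq).round(3))
```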

Code Markdown

This feature represents whether a given token appears inside a <code> markdown tag on Stack Overflow. Note that the <code> tags are noisy: users do not always wrap inline code in <code>, and sometimes use the tag to highlight non-code text. Nevertheless, including the markdown information as a feature proved helpful, as it improves the performance of the segmentation model.
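
A hedged sketch of how such a markdown flag could be computed from a post body (the <code> tag is real StackOverflow markup, but the helper function and whitespace tokenization are my simplifications):

```python
import re

def code_markdown_vocab(post_html):
    """Collect tokens appearing inside <code>...</code> spans of one post.

    As noted above the tags are noisy (users misuse them for emphasis and
    skip them for real code), so membership is a feature, not a label.
    Whitespace tokenization is a simplification for this sketch.
    """
    spans = re.findall(r"<code>(.*?)</code>", post_html, flags=re.DOTALL)
    return {tok for span in spans for tok in span.split()}

post = "Use <code>numpy.ndarray</code> rather than a plain <code>list</code>."
code_tokens = code_markdown_vocab(post)
sentence = "Use numpy.ndarray rather than list".split()
markdown_flag = [int(tok in code_tokens) for tok in sentence]
print(markdown_flag)  # [0, 1, 0, 0, 1]
```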

Multi-Level Attention

For each word, the ELMo embedding and the Code Recognizer and Entity Segmenter outputs are used as raw inputs and fed into a BiGRU to obtain corresponding hidden representations. Each representation then passes through a linear layer with a tanh activation, and an embedding-level context vector $u_e$, learned during training, scores the results; a softmax over the scores gives the attention weights $a_{it}$:

$$u_{it} = \tanh(W_e h_{it} + b_e), \qquad a_{it} = \frac{\exp(u_{it}^{\top} u_e)}{\sum_{t'} \exp(u_{it'}^{\top} u_e)}$$
The final embedding of each word is then the attention-weighted sum
$$emb_i = \sum_t a_{it} h_{it}$$
Word-level attention works the same way as embedding-level attention, introducing another trainable context vector $u_w$.

This finally yields $word_i = a_i h_i$ (each word's representation is scaled by its attention weight, preserving the sequence), which is fed into the BiLSTM-CRF layer for prediction; a sketch of one attention level follows.
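
A PyTorch sketch of one attention level as described above (shown at the embedding level; the word level is analogous with its own context vector $u_w$; dimensions are assumptions):

```python
import torch
import torch.nn as nn

class AttentionLevel(nn.Module):
    """One attention level of the model, as sketched from the description.

    Inputs go through a BiGRU; each position is scored by a trainable
    context vector (u_e at the embedding level, u_w at the word level)
    after a linear layer with tanh, and scores are softmax-normalized.
    """
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.bigru = nn.GRU(input_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, 2 * hidden_dim)
        self.context = nn.Parameter(torch.randn(2 * hidden_dim))  # u_e or u_w

    def forward(self, x):
        # x: (batch, n, input_dim), e.g. the n embeddings of one word
        h, _ = self.bigru(x)                              # (batch, n, 2h)
        u = torch.tanh(self.proj(h))                      # linear + tanh
        a = torch.softmax(u @ self.context, dim=1)        # a_it, (batch, n)
        return (a.unsqueeze(-1) * h).sum(dim=1)           # sum_t a_it * h_it

# Embedding level: fuse ELMo + code-recognizer + segmenter views of a word.
# At the word level the weighted states are kept per word (word_i = a_i h_i)
# instead of being summed, so the sequence survives for the BiLSTM-CRF.
```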

Evaluation

(Results tables from the paper: SoftNER outperforms BiLSTM-CRF and the other baselines on the annotated StackOverflow corpus.)

Conclusion

We developed a new NER corpus, comprising 15,372 sentences from StackOverflow and 6,510 sentences from GitHub, annotated with 20 fine-grained named entity types.

We show that this new corpus is an ideal benchmark dataset for contextual word representations, because it contains many challenging ambiguities that often require long-distance context to resolve. We propose a new attention-based model, named SoftNER, which outperforms state-of-the-art NER models on this dataset.

In addition, we studied the important subtask of code recognition. Our new code recognition model captures additional spelling information beyond what character-based ELMo provides, and consistently improves the performance of the NER model. We believe our corpus of StackOverflow posts annotated with named entities will be useful for a variety of language-and-code tasks, such as code retrieval, software knowledge base extraction, and automated question answering.

Remark

I think the main contribution is building a new dataset; the model itself is fairly conventional. Doing solid work on datasets can also get you a top-conference paper!
