A method for BERT long-text matching
2022-07-03 23:47:00 【Necther】
Introduction
BERT opened the door to transfer learning in NLP: a general language model is first pre-trained on unsupervised corpora, then fine-tuned on your own data to fit different business needs. However, BERT supports a maximum input length of 512 tokens. What can we do when the text is longer than that? The following paper offers a simple and effective solution.
Simple Applications of BERT for Ad Hoc Document Retrieval
Published March 2019
1. Abstract
BERT works remarkably well, but two shortcomings challenge online deployment: its 512-token input limit and its inference cost. In document-level retrieval, the text is far longer than BERT can handle. This paper presents a simple and effective workaround: split the long document into short sentences, run BERT inference on each sentence independently, then aggregate the sentence scores into a document score.
2. Paper details and experimental results
2.1 Long text matching solution
The authors first run recall experiments on a short-text matching task: retrieving social media posts for a given query. Posts are generally short enough to fit within BERT's input limit. The evaluation metrics are average precision (AP) and precision at rank 30 (P30). The following table shows the results of recent deep models on this dataset.
I think the experimental data above mainly make one point: BERT works well on short-text matching tasks and reaches SOTA performance.
Common solutions for matching a long document:
- Direct truncation: keep only the first N tokens, losing everything that follows;
- Segment-level recurrence: models such as Transformer-XL carry state across segments to handle long-range dependency (up to the recurrence length), but the model is more complex;
- Extraction-based: extract a summary of the long document, then train the matching model on that summary; only the summary is considered and the other sentences are ignored, which is one-sided;
- Split the long document into short sentences and match only the single best-matching sentence; again, the other sentences are ignored.
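Two of the baselines above can be sketched in a few lines. This is a minimal illustration, not the paper's code: `split_sentences` is a crude stand-in for NLTK's `sent_tokenize`, whitespace tokens stand in for WordPiece tokens, and `score_fn` stands in for a BERT matcher.

```python
import re

def split_sentences(text):
    """Naive sentence splitter (stand-in for NLTK's sent_tokenize)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def truncate_tokens(text, max_len=512):
    """Baseline: keep only the first max_len tokens, dropping the rest
    (whitespace tokens here stand in for WordPiece tokens)."""
    return " ".join(text.split()[:max_len])

def best_sentence_score(query, doc, score_fn):
    """Baseline: score each sentence against the query, keep only the max."""
    return max(score_fn(query, s) for s in split_sentences(doc))
```

Both baselines discard information: truncation ignores the tail of the document, and the best-sentence score ignores every other sentence, which motivates the top-n aggregation below.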
This paper's method
For long-text retrieval on a news corpus, the paper first splits each long document into sentences with the NLTK toolkit. Unlike approaches that consider only the single best-matching sentence, it considers the top n sentences. The final matching score of a document is computed as:

S_f = a · S_doc + (1 − a) · Σᵢ wᵢ · Sᵢ

where S_doc is the score of the original full document (the lexical score, e.g. BM25) and S_i is the BERT matching score of the i-th top sentence (the semantic score). The parameter a lies in [0, 1]; w_1 is fixed to 1 and the remaining w_i lie in [0, 1], tuned by grid search for the best performance.
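The aggregation above can be sketched directly; `aggregate_score` and its argument names are illustrative, not from the paper's code:

```python
def aggregate_score(doc_score, sent_scores, a, weights):
    """Combine the document-level lexical score (e.g. BM25) with the top-n
    BERT sentence scores: S_f = a * S_doc + (1 - a) * sum_i w_i * S_i.

    weights[0] is fixed to 1 per the paper; the rest lie in [0, 1] and
    would be tuned by grid search. len(weights) determines n.
    """
    top = sorted(sent_scores, reverse=True)[:len(weights)]
    return a * doc_score + (1 - a) * sum(w * s for w, s in zip(weights, top))
```

With a = 1 the score falls back to pure BM25, and with a = 0 it depends only on the weighted BERT sentence scores, so grid-searching a interpolates between the lexical and semantic signals.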
2.2 Experimental results
Fine-tuning data
The original supervision is the relevance between a query and a long document. After splitting a document into n short sentences, not every sentence is strongly relevant to the query (i.e., a positive sample), so the document-level labels cannot simply be reused for sentences. The paper's solution is to fine-tune on external corpora, QA or microblog data, instead: since BERT already learns word and sentence representations from large unsupervised corpora, fine-tuning on a small amount of related external data still works well. The specific effects are shown in the table below; the paper's method achieves better results on long-text matching.
3. Summary and reflections
Summary
- This paper proposes a weighted sentence-scoring method to solve long-text matching;
- The method is simple and effective, and reaches SOTA on the paper's experimental datasets.
Reflections
- The fine-tuning data come from external corpora, so the fine-tuned model may not fit the current data well. Could we instead sample positive and negative examples from the split sentences themselves, so that the fine-tuning data also come from the long documents?
- If the chosen top n is large, tuning all the weights w_i gets complicated; one option is to grid-search only the top-3 weights and average the rest.
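The first reflection could be prototyped roughly as follows. This is purely hypothetical, not from the paper: `sample_pairs`, `score_fn`, and `pos_threshold` are invented names, and the heuristic labeling would need validation before real fine-tuning.

```python
import random

def sample_pairs(query, doc_sentences, score_fn, pos_threshold, seed=0):
    """Hypothetical sketch: turn sentences split from a relevant long
    document into (query, sentence, label) fine-tuning pairs, labeling
    each sentence by a cheap relevance heuristic score_fn."""
    rng = random.Random(seed)
    pos = [s for s in doc_sentences if score_fn(query, s) >= pos_threshold]
    neg = [s for s in doc_sentences if score_fn(query, s) < pos_threshold]
    rng.shuffle(neg)
    # keep the classes balanced: as many negatives as positives
    return [(query, s, 1) for s in pos] + [(query, s, 0) for s in neg[:len(pos)]]
```

The point of the sketch is only that sentence-level labels can be derived from document-level relevance plus a heuristic, keeping the fine-tuning data in-domain.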