A method for long-text matching with BERT
2022-07-03 23:47:00 【Necther】
Introduction
BERT opened the door to transfer learning: a general language model is first pre-trained on unlabeled text, then fine-tuned on task-specific data to meet different business needs. BERT accepts at most 512 tokens per input, so how should longer inputs be handled? The following paper offers a simple and effective solution.
Simple Applications of BERT for Ad Hoc Document Retrieval
Published March 2019
1. Abstract
BERT works remarkably well, but its 512-token input limit and its inference cost both make online deployment difficult. In document-level retrieval, documents are far longer than BERT can handle. This paper proposes a simple and effective solution: split each long document into sentences, run BERT inference on each sentence independently, and aggregate the sentence scores into a document score.
2. Paper details and experimental results
2.1 Long text matching solution
The authors first run retrieval experiments on a short-text matching task, social media posts: given a query, retrieve the relevant posts. Posts are generally short and fit within BERT's input limit. The evaluation metrics are average precision (AP) and precision at rank 30 (P30). The following table shows the results of recent deep models on this dataset.
The main takeaway from these results:
BERT works very well on short-text matching and reaches state-of-the-art (SOTA) performance.
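For reference, the two metrics named above can be computed as in the following minimal sketch. It assumes binary relevance labels listed in ranked order; the function names are illustrative and not taken from the paper.

```python
def precision_at_k(ranked_relevance, k):
    """Fraction of the top-k ranked items that are relevant (P30 uses k=30)."""
    top = ranked_relevance[:k]
    return sum(top) / k


def average_precision(ranked_relevance):
    """Mean of precision@rank, taken over the ranks that hold a relevant item."""
    num_relevant = sum(ranked_relevance)
    if num_relevant == 0:
        return 0.0
    hits = 0
    total = 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / num_relevant


# 1 = relevant, 0 = not relevant, in the order the system ranked the posts
ranking = [1, 0, 1, 0, 0]
ap = average_precision(ranking)
p3 = precision_at_k(ranking, 3)
```

This version normalizes AP by the number of relevant items that appear in the ranked list; evaluation tools usually normalize by all relevant items known for the query, which can be lower if some were never retrieved.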
Common approaches to long-document matching:
- Direct truncation: keep only the first max-length tokens and discard everything after them;
- Segment-level recurrence, as in Transformer-XL: this addresses long-range dependencies to some extent (depending on the recurrence length), but the model is more complex;
- Extraction-based models: extract a summary of the long document and train the matching model on that summary. Only the summary is considered and the other sentences are ignored, which is rather one-sided;
- Split the long text into sentences and match on the single best-matching sentence. Again, the other sentences are ignored.
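The first and last approaches in the list above are easy to contrast in code. The sketch below is only illustrative: `toy_score` is a hypothetical word-overlap stand-in for a real relevance model such as fine-tuned BERT, and `split_sentences` is a naive regex splitter rather than a proper sentence tokenizer.

```python
import re


def split_sentences(text):
    """Naive regex splitter; a stand-in for a real sentence tokenizer."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def toy_score(query, passage):
    """Placeholder relevance: fraction of query words found in the passage."""
    q = set(query.lower().split())
    p = set(re.findall(r"\w+", passage.lower()))
    return len(q & p) / len(q) if q else 0.0


def score_truncated(query, doc, max_tokens):
    """Truncation baseline: score only the first max_tokens whitespace tokens."""
    head = " ".join(doc.split()[:max_tokens])
    return toy_score(query, head)


def score_best_sentence(query, doc):
    """Best-sentence baseline: score every sentence, keep the maximum."""
    return max((toy_score(query, s) for s in split_sentences(doc)), default=0.0)


doc = ("The weather is nice today. Stocks fell sharply. "
       "BERT handles long text matching poorly without splitting.")
query = "bert long text"
truncated = score_truncated(query, doc, max_tokens=5)  # sees only the first sentence
best = score_best_sentence(query, doc)                 # finds the third sentence
```

In this toy document the relevant sentence comes last, so truncation misses it entirely while the best-sentence baseline finds it.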
The method of this paper
For long-text retrieval on news corpora, the paper first uses NLTK to split each document into sentences. Instead of using only the best-matching sentence, it considers the top n sentences. The final matching score of a long document is computed as follows:
S_f = a * S_doc + (1 - a) * sum_{i=1..n} (w_i * S_i)
where S_doc is the score of the original long document (the text score, for example the BM25 score) and S_i is the BERT-based matching score of the i-th highest-scoring sentence (the semantic score). The parameter a lies in [0, 1], w_1 is fixed to 1, and each remaining w_i lies in [0, 1]. The parameters are tuned by grid search to get the best performance.
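The scoring and tuning described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the score is a linear interpolation between the document score and a weighted sum of the top-n sentence scores, and it uses pairwise ranking accuracy as a hypothetical tuning objective in place of a real retrieval metric.

```python
from itertools import product


def document_score(doc_score, sentence_scores, a, weights):
    """Assumed form: a * S_doc + (1 - a) * sum(w_i * S_i) over the top-n sentences."""
    top = sorted(sentence_scores, reverse=True)[:len(weights)]
    semantic = sum(w * s for w, s in zip(weights, top))
    return a * doc_score + (1 - a) * semantic


def pairwise_accuracy(scores, labels):
    """Share of (relevant, non-relevant) pairs where the relevant doc scores higher."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    pairs = [(p, q) for p in pos for q in neg]
    if not pairs:
        return 0.0
    return sum(p > q for p, q in pairs) / len(pairs)


def grid_search(examples, n=3, steps=5):
    """Try every (a, w_2..w_n) on an even grid over [0, 1], with w_1 fixed to 1.

    examples: list of (doc_score, sentence_scores, label), label in {0, 1}.
    steps must be at least 2. Returns (best_accuracy, a, weights).
    """
    grid = [i / (steps - 1) for i in range(steps)]
    labels = [y for _, _, y in examples]
    best = (-1.0, None, None)
    for a in grid:
        for rest in product(grid, repeat=n - 1):
            weights = (1.0,) + rest
            scores = [document_score(d, s, a, weights) for d, s, _ in examples]
            acc = pairwise_accuracy(scores, labels)
            if acc > best[0]:
                best = (acc, a, weights)
    return best


# Toy data: the text score alone would rank the non-relevant document first.
examples = [
    (0.2, [0.9, 0.8, 0.1], 1),
    (0.8, [0.2, 0.1, 0.0], 0),
]
best_acc, best_a, best_weights = grid_search(examples)
```

The number of grid points grows as steps to the power n, which is one reason to keep n small.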
2.2 Experimental results
Fine-tuning data
The original fine-tuning data labels the relationship between a query and a whole long document. Once a document is split into n sentences, not every sentence is strongly related to the query (a positive example), so the document-level labels cannot simply be reused for sentences. The paper's solution is to fine-tune on external corpora, namely QA or microblog data. Because BERT has already learned word and sentence representations from a general unlabeled corpus, fine-tuning on a small amount of data can still work well, which is why the paper chooses related external corpora. The results are shown in the table below; the paper's method achieves better results on long-text matching.
3. Summary and questions
Summary
- The paper proposes a weighted sentence-scoring method to compute matching scores for long text;
- The method reaches SOTA results on the paper's experimental datasets and is simple and effective.
Reflections
- The paper fine-tunes on external data, so the fine-tuned model does not fit the target data well. Could positive and negative examples be sampled from the split sentences instead, so that the fine-tuning data also comes from the long documents?
- With top n, parameter tuning becomes cumbersome when n is large. In that case, one option is to tune weights only for the top 3 sentences and average the scores of the remaining ones.
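One possible reading of the last suggestion is sketched below: the top sentences get individually tuned weights, and all remaining sentences contribute through their mean score with a single shared weight. The function name and the `tail_weight` parameter are assumptions made for illustration; the post does not specify how the averaged part is weighted.

```python
def score_with_tail_average(sentence_scores, top_weights, tail_weight):
    """Weight the top-k sentence scores one by one; weight the mean of the rest once."""
    ranked = sorted(sentence_scores, reverse=True)
    k = len(top_weights)
    head, tail = ranked[:k], ranked[k:]
    head_part = sum(w * s for w, s in zip(top_weights, head))
    # The remaining sentences share one weight, so n can grow without adding parameters.
    tail_part = tail_weight * (sum(tail) / len(tail)) if tail else 0.0
    return head_part + tail_part


scores = [0.9, 0.5, 0.7, 0.2, 0.4]
combined = score_with_tail_average(scores, top_weights=(1.0, 0.5, 0.5), tail_weight=0.2)
```

With this scheme the search space stays at three or four parameters regardless of how many sentences a document has.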