当前位置:网站首页>CSDN question and answer module Title Recommendation task (II) -- effect optimization
CSDN question and answer module Title Recommendation task (II) -- effect optimization
2022-07-06 10:42:00 【Alexxinlu】
Catalog
- Series articles
- 1. The problem background
- 2. Effect optimization methodology
- 3. Summary and next step plan
- P.S.
Series articles
- CSDN Q & a module Title recommended tasks ( One ) —— Construction of basic framework
- CSDN Q & a module Title recommended tasks ( Two ) —— Effect optimization
Team blog : CSDN AI team
1. The problem background
This article continues from the previous article 《CSDN Q & a module Title recommended tasks ( One ) —— Construction of basic framework 》. In short , It's right CSDN The question and answer module optimizes the effect of the title asked by users , Recommend more reasonable and informative titles to users . For specific tasks, Beijing can refer to Last article , This article mainly introduces the effect optimization strategy of this task .
2. Effect optimization methodology
The basis of effect optimization is mainly based on Last article 2.3 Section analyzes the problem of wrong data , In addition, it also combines the opinions and suggestions of others , Made more reasonable improvements . The specific optimization methodology is as follows .
2.1 Detection of invalid titles
stay Last article in , All user questions are recommended , But in many cases, it is a valid Title ( About 91%), And a small number of recommended titles may be worse than the original valid titles . Therefore, in order to improve the efficiency of Title Recommendation , And avoid the worse effect after the original title recommendation , You need to detect all invalid titles first , Further Title Recommendation .
Invalid Title Detection mainly includes the following two strategies :
2.1.1 Keyword matching strategy
Most invalid titles contain some keywords , For example, the following title , Contains bosses 、 Rescue 、 emergency !、 thank , The title composed of these words cannot express the meaning of the problem itself .
This paper adopts the method of manual collection , Sort out 64 Key words , And determine whether the title is invalid by string matching . All the keywords sorted out are listed below :
emergency !
Urgent demand
Rescue
uncle
emergency
Help
Please
Ball
For help
Ask for advice
Beg
A great god
bosses
Instruction
Give directions
answer
The small white
Adorable new
children
Help
Help
brother
sister
Younger sister
Youngest sister
The younger brother
eldest brother
Novice
New people
Vegetable dog
rookie
Vegetable chicken
brother
master
Beginners
self-taught
thank you
thank
Brother
Homework
Crazy
teacher
Be really something
This question
subject
Source code
Problem.
Ah ah
Online, etc.
It's too hard
Family
brothers
one's junior of equal standing
This question
This question
Examination questions
A question
Source code
Programming questions
How to write the question
How to solve the problem
Source code
haha
curriculum design
Random sampling 2000 The current coverage of invalid Title Recognition in the data test is 98.32%.
2.1.1 Stop phrase strategy
Some invalid titles have no obvious keywords , But the whole title has no amount of information , For such titles , This article is based on deactivating this table , Remove the stop words from the title , Finally, judge whether the title is empty , If it is empty, the title is invalid .
For example, the following title , It's all question marks :
Another example is the following title , The titles are words without any useful information :
2.2 OCR modular : Ensure the integrity of information
In some user questions , Key information can be included in the uploaded pictures , Or there are only a few pictures in the user's body description , Without any written description , for example :
This article USES the paddle_ocr Picture to text module , Recognize the text information contained in the picture into text , Enter the rules module and Text_Rank Module identification .
Besides , because OCR More time-consuming , Therefore, it is necessary to ensure that as few calls as possible OCR modular , If using the existing information is enough to recommend a reasonable Title , Then don't call OCR Identify all pictures .
2.3 Rules module : promote Precision( Accuracy rate )
For some questions with distinctive characteristics , for example : Application error 、 Ask questions about exercises 、 And the inquiry of knowledge points , In this paper, a rule-based method is used for direct recall . The details are as follows :
2.2.1 Error information extraction module
Some users' questions , The text or picture contains obvious error information , Error reporting information is the key information of this kind of problem , Therefore, the error information can be directly extracted as the user's question Title .
This article uses the regular expression method , Extract the error information . It includes forward rules and reverse rules , Forward rules are used to extract error information , Reverse rules are used to avoid accidental inhalation .
# Positive rule
error_word_pattern = re.compile(r'(exception|error\srequestid|errormessage|errorcode|errmsg|errcode|error|no .*?(detected|found)|unavailable|undefined| System not found | To scald | It doesn't contain [a-z_\.].*? The definition of | perform [a-z_\.].*? An error occurred when | Report errors )')
# Reverse rule
error_word_reverse_pattern = re.compile(r'((except|import|catch|throw)[a-z_ \.\(\)\:]{0,20}exception|([1-9]|logger|self)[\. ]error|error[\((]s[\))][0-9 \,]{1,5}warning[\((]s[\))]|error:function|exception.{1,3}details|onerror|conda config|(catch|throw)[a-z_ \.\(\)\:]{0,20}error)')
2.2.2 Exercise identification module
Some users' questions , It's about how to answer some of the exercises in the course , These questions also contain obvious information , For example, include keywords : Programming questions 、 Exercises etc. .
Here, the regular expression method is also used for matching , Including title based regularization and text based regularization , The regular expression is as follows :
# Title Based regularity
title_exercise_words_pattern = re.compile(r'( subject | Homework | Programming questions | Exercises | Write .*? Program )')
# Text based regularity
body_exercise_words_pattern = re.compile(r'(^(1|A)(、| |\.).{3,}?$|^(①|②|③|④|⑤)| Homework | problem [0-9]|[ one two three four five ] yes :| Title Description )')
2.2.3 Ask the knowledge point module
Some users' questions , Is to ask some very specific knowledge points , So we can extract knowledge points directly , As the title of the user's question .
Here, the regular expression method is also used for matching , Mainly match the information in the title , The details are as follows :
ask_words_pattern = re.compile(r'((?: For help | About )[\u4e00-\u9fa5a-zA-Z0-9_]*?(?: problem | stick )|[^,.?!;]+?(?: The understanding of the | The difference between | Characteristics of | doubts ))')
2.2.4 Add header
In order to further clarify the field of the problem , for example :python、c++、 Artificial intelligence, etc , Make the information contained in the title richer , This article uses the question itself tag Information , Add some templates , Generate a header . The rules are different 3 Rules , Also divided into 3 Template , Plus the title of rule extraction , The final title generated based on the rule is as follows :
- Error information extraction module : About #tag# The question of
Original title : Xiaobai's question bosses Do me a favor !
Recommended title : About # Deep learning # The question of :NameError: name ‘capitalize’ is not defined
- Exercise identification module : About #tag# The subject of
Original title : The great God who can script help ~~~
Recommended title : About #c Language # The subject of
- Ask the knowledge point module : inquiry #tag# Knowledge points of
Original title : Novices ask for advice , The problem of circulation .
Recommended title : inquiry #java# Knowledge points of : The problem of circulation
2.4 Text_Rank modular : promote Recall( Recall rate )
The rule module focuses on ensuring the high accuracy of Title Recommendation , Therefore, there will be a problem of low recall rate , The current recall rate is about 33.5%, So the rest 66.5% Need to use Text_Rank To recall .
Text_Rank It is an abstract text summarization model , Use strategies in Last article There is a brief introduction in , Give priority to questions . Besides , The extraction method has high requirements for the original input text , Therefore, we need to focus on text preprocessing , Remove some useless interference information for text extraction , It mainly includes Code segment 、 Conversion of some escape characters 、URL Information 、 Picture label link information 、 Useless HTML Tag information etc. . Besides , Words that do not contain any valid information will also be removed , for example : Boss, help me see 、 Xiaobai asks for guidance ……
After the above treatment , adopt Text_Rank The generated title will be more accurate and readable .
3. Summary and next step plan
Through the above rules + The strategy of the model , Current title recommendation effect Last article in baseline Of 47.92% Promoted to 86.0%, It has initially reached a usable state .
However, due to the highly colloquial and complex characteristics of user questions , Some details of the Title Recommendation task , Still need further optimization . Next steps include :
- Improve the coverage of rules and resources , Improve the recall rate of rules ;
- Further optimization Text_Rank The effect of the algorithm , Improve the quality of Title Extraction ;
- Text_Rank After all, it is a sentence extracted from the text , As a direct question title, there are some blunt , Next, consider using templates or generated methods , Improve the readability of the title and more in line with the style of questions .
P.S.
This series of articles will be continuously updated . hope NLP Colleagues in other fields 、 Teachers and experts can provide valuable advice , thank you !
边栏推荐
- MySQL 29 other database tuning strategies
- [C language] deeply analyze the underlying principle of data storage
- Baidu Encyclopedia data crawling and content classification and recognition
- Mysql34 other database logs
- Software test engineer development planning route
- [after reading the series] how to realize app automation without programming (automatically start Kwai APP)
- [Julia] exit notes - Serial
- C language string function summary
- Good blog good material record link
- 基于Pytorch的LSTM实战160万条评论情感分类
猜你喜欢
该不会还有人不懂用C语言写扫雷游戏吧
Ueeditor internationalization configuration, supporting Chinese and English switching
windows无法启动MYSQL服务(位于本地计算机)错误1067进程意外终止
Not registered via @EnableConfigurationProperties, marked(@ConfigurationProperties的使用)
What is the current situation of the game industry in the Internet world?
Emotional classification of 1.6 million comments on LSTM based on pytoch
MySQL21-用户与权限管理
MySQL22-逻辑架构
[unity] simulate jelly effect (with collision) -- tutorial on using jellysprites plug-in
MySQL transaction log
随机推荐
MySQL29-数据库其它调优策略
Mysql21 user and permission management
Valentine's Day is coming, are you still worried about eating dog food? Teach you to make a confession wall hand in hand. Express your love to the person you want
How to build an interface automation testing framework?
@controller,@service,@repository,@component区别
用于实时端到端文本识别的自适应Bezier曲线网络
Moteur de stockage mysql23
① BOKE
Download and installation of QT Creator
高并发系统的限流方案研究,其实限流实现也不复杂
Case identification based on pytoch pulmonary infection (using RESNET network structure)
C语言标准的发展
MNIST implementation using pytoch in jupyter notebook
Yum prompt another app is currently holding the yum lock; waiting for it to exit...
MySQL combat optimization expert 02 in order to execute SQL statements, do you know what kind of architectural design MySQL uses?
MySQL36-数据库备份与恢复
MySQL28-数据库的设计规范
The underlying logical architecture of MySQL
What is the difference between TCP and UDP?
February 13, 2022-2-climbing stairs