CSDN Blog Summarization (I): A Simple First Version
2022-07-06 10:42:00 【Alexxinlu】
Series articles
Team blog: CSDN AI team
1. Background
2. Blog summarization
2.1 Structuring the blog
A blog post contains many kinds of elements, and summarizing it directly as plain text would seriously hurt summary quality. We therefore structure the post first. After structuring, the elements in the body are clearly separated, for example head (headings), code (code), table (tables), text (paragraphs), img (images), and link (links). This makes it easier to retrieve each part accurately, gives the preprocessing and rule logic in the later summarization steps clearer structured information, and provides better input for the model. The following figure shows an example of a structured blog post:
[Figure: example of a structured blog post]
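As a rough illustration of what such structuring could look like, the sketch below walks a post's HTML with Python's standard `html.parser` and emits typed blocks. The `Block` class, the `TAG_TO_TYPE` mapping, and `structure_blog` are hypothetical names chosen for this example, not the team's actual implementation, and nested block tags are not handled.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Assumed mapping from HTML tags to the block types named above.
TAG_TO_TYPE = {"h1": "head", "h2": "head", "h3": "head",
               "pre": "code", "table": "table", "p": "text"}


@dataclass
class Block:
    kind: str       # head, code, table, text, img, or link
    content: str    # text content, or the URL for img and link blocks
    level: int = 0  # heading level for head blocks, 0 otherwise


class BlogStructurer(HTMLParser):
    """Collects typed blocks from a post's HTML body (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._open_tag = None  # block-level tag currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            self.blocks.append(Block("img", attrs.get("src") or ""))
        elif tag == "a":
            # The link target becomes its own block; the anchor text
            # stays in the enclosing paragraph.
            self.blocks.append(Block("link", attrs.get("href") or ""))
        elif self._open_tag is None and tag in TAG_TO_TYPE:
            self._open_tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._open_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            level = int(tag[1]) if TAG_TO_TYPE[tag] == "head" else 0
            content = "".join(self._buffer).strip()
            self.blocks.append(Block(TAG_TO_TYPE[tag], content, level))
            self._open_tag = None


def structure_blog(html):
    """Return the list of blocks found in an HTML string."""
    parser = BlogStructurer()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

With blocks in this form, later steps can select only the `text` blocks or look at headings without re-parsing the HTML.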
2.2 Rule part
- Rule 1: Check whether the post has a "Preface", "Written up front", or similar section that introduces the article. If so, extract the content of that section directly and truncate it to the specified length (default: 256).
- Rule 2: Check whether there is content before the first level-1 heading. If so, extract that content directly and truncate it to the specified length (default: 256).
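The two rules can be sketched as follows. This is a minimal illustration under several assumptions: blocks are plain dicts with `type`, `content`, and `level` keys, the keyword list in `PREFACE_KEYWORDS` is a guess at what counts as an introductory section, and rule 2 only fires when a level-1 heading exists. None of these details come from the original implementation.

```python
# Assumed keywords that mark an introductory section (hypothetical list).
PREFACE_KEYWORDS = ("preface", "foreword", "前言", "写在前面")


def _text_until_next_heading(blocks, start):
    """Join the text blocks from `start` up to the next heading."""
    parts = []
    for block in blocks[start:]:
        if block["type"] == "head":
            break
        if block["type"] == "text":
            parts.append(block["content"])
    return " ".join(parts).strip()


def rule_summary(blocks, max_len=256):
    """Return a rule-based summary, or None if no rule applies."""
    # Rule 1: a preface-like section exists; use its content.
    for i, block in enumerate(blocks):
        if block["type"] == "head":
            title = block["content"].lower()
            if any(keyword in title for keyword in PREFACE_KEYWORDS):
                text = _text_until_next_heading(blocks, i + 1)
                if text:
                    return text[:max_len]

    # Rule 2: there is text before the first level-1 heading.
    for i, block in enumerate(blocks):
        if block["type"] == "head" and block.get("level") == 1:
            text = " ".join(
                b["content"] for b in blocks[:i] if b["type"] == "text"
            ).strip()
            return text[:max_len] if text else None

    return None
```

A return value of `None` signals that the post should fall through to the model-based path.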
2.3 Model part
If the rules cannot extract a summary, the TextRank model is used to summarize the post. The model's input is the text that remains after excluding elements such as head (headings), code (code), table (tables), img (images), and link (links). The procedure is as follows:
- a) For posts that do not match any rule, extract all text except images, code, headings, the table of contents, and similar elements;
- b) Split the text into sentences and feed them to the TextRank model to produce the summary;
- c) TextRank scores each sentence by its importance (the scores of all sentences sum to 1);
- d) Sort the sentences by score from high to low and concatenate them in that order until the length is close to, but does not exceed, the specified length (default: 256).
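Steps b) to d) can be sketched in plain Python as below. This is an illustrative TextRank, not the team's code: the regex sentence splitter, the `\w+` tokenizer (a real system for Chinese text would likely use a word segmenter), the word-overlap similarity from the original TextRank formulation, the damping factor, and the choice to stop at the first sentence that would overflow are all assumptions.

```python
import math
import re


def split_sentences(text):
    """Naive splitter on Chinese and Western sentence-ending punctuation."""
    parts = re.findall(r"[^。！？.!?]+[。！？.!?]?", text)
    return [p.strip() for p in parts if p.strip()]


def similarity(words_a, words_b):
    """Word-overlap similarity, normalized by the log sentence lengths."""
    if not words_a or not words_b:
        return 0.0
    denom = math.log(len(words_a)) + math.log(len(words_b))
    if denom <= 0:
        return 0.0
    return len(words_a & words_b) / denom


def textrank_scores(sentences, damping=0.85, iterations=50):
    """Return one score per sentence; the scores sum to 1."""
    n = len(sentences)
    if n == 0:
        return []
    words = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            weights[i][j] = weights[j][i] = similarity(words[i], words[j])
    out_sum = [sum(row) for row in weights]

    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    total = sum(scores)
    return [s / total for s in scores]


def build_summary(sentences, scores, max_len=256):
    """Concatenate sentences from highest to lowest score within max_len.

    Returns (summary, sum of the scores of the sentences used).
    """
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, length, score = [], 0, 0.0
    for i in ranked:
        added = len(sentences[i]) + (1 if chosen else 0)  # +1 for the space
        if length + added > max_len:
            break  # assumed reading of "until close to the specified length"
        chosen.append(sentences[i])
        length += added
        score += scores[i]
    return " ".join(chosen), score
```

A typical call would be `sentences = split_sentences(text)` followed by `build_summary(sentences, textrank_scores(sentences))`.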
2.4 Score setting
- Score range: [0, 1]
- Rule score (default): 0.5
- Model score: the sum of the scores of all concatenated sentences
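One way to combine the rule path and the model path with these scores is sketched below. The function name `summarize_with_score` and the two callables passed to it are hypothetical: `rule_fn` is assumed to return a summary string or `None`, and `model_fn` is assumed to return a `(summary, score)` pair.

```python
RULE_SCORE = 0.5  # default score for rule-based summaries


def summarize_with_score(blocks, rule_fn, model_fn):
    """Try the rules first, then fall back to the model.

    Returns (summary, score) with the score kept inside [0, 1].
    """
    summary = rule_fn(blocks)
    if summary:
        return summary, RULE_SCORE
    summary, score = model_fn(blocks)
    # The model score is a sum of normalized sentence scores, so it should
    # already lie in [0, 1]; the clamp only guards against rounding error.
    return summary, min(max(score, 0.0), 1.0)
```

Because TextRank scores sum to 1 over all sentences, a model score near 1 means the summary covers most of the post's ranked content.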
3. Next steps
The current version is preliminary and needs further optimization. Next steps include:
- Build a test set and evaluate the results quantitatively. Evaluation metrics: BLEU and ROUGE;
- Improve sentence concatenation: rank all sentences from high to low, then concatenate the selected sentences in their original order in the post, until the length is close to the specified length;
- When TextRank builds the sentence graph, take word weights into account. For example, compute each word's weight with a TF-IDF-like algorithm over all blogs under the same tag.
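The proposed concatenation change could look roughly like the sketch below: sentences are still selected by score, but the selected ones are emitted in their original order. The function name and the stop-at-first-overflow behavior are assumptions made for illustration.

```python
def build_summary_in_order(sentences, scores, max_len=256):
    """Select sentences by score, then join them in original order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, length = [], 0
    for i in ranked:
        added = len(sentences[i]) + (1 if chosen else 0)  # +1 for the space
        if length + added > max_len:
            break
        chosen.append(i)
        length += added
    chosen.sort()  # restore the order in which the sentences appeared
    return " ".join(sentences[i] for i in chosen)
```

Keeping the original order tends to preserve the logical flow between selected sentences, which ranking by score alone can break.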
P.S.
This series will be updated continuously. We hope colleagues, teachers, and experts in NLP and other fields can offer valuable advice. Thank you!