CSDN Blog Summarization (I): A Simple First-Version Implementation
2022-07-06 10:42:00 【Alexxinlu】
Series articles
Team blog: CSDN AI team
1. Background
2. Blog summarization
2.1 Blog structuring
A blog post contains many kinds of elements, and summarizing it directly as plain text seriously degrades the quality of the summary. We therefore structure the post first. After structuring, the content of the body is clearly separated into typed elements, for example head (heading), code (code block), table (table), text (paragraph), img (image), and link (hyperlink), so the content of each part can be retrieved conveniently and accurately. This gives the preprocessing and rule logic of the later summarization steps clearer structured information, and gives the model better input.
(Figure: an example of a structured blog post.)
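As a rough illustration of the structuring step, the sketch below parses a post's HTML into a flat list of typed elements using Python's standard `html.parser`. The `Block` class, the `TAG_TO_TYPE` mapping, and the `structure_blog` function are hypothetical names chosen for this example; the post does not describe how structuring is actually implemented, so treat this as one possible approach rather than the team's method.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Assumed mapping from HTML tags to the element types named in the text.
TAG_TO_TYPE = {
    "h1": "head", "h2": "head", "h3": "head",
    "pre": "code", "table": "table", "p": "text",
}


@dataclass
class Block:
    type: str       # head / code / table / text / img / link
    content: str    # text content, or the URL for img and link
    level: int = 0  # heading level (1 for h1, ...); 0 for non-headings


class BlogStructurer(HTMLParser):
    """Collects top-level elements of a post as a flat list of Blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._open = None  # tag of the block currently being collected
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            self.blocks.append(Block("img", attrs.get("src") or ""))
        elif tag == "a":
            # The URL is recorded as its own block; the link text still
            # flows into the enclosing paragraph, if there is one.
            self.blocks.append(Block("link", attrs.get("href") or ""))
        elif tag in TAG_TO_TYPE and self._open is None:
            self._open = tag
            self._buf = []

    def handle_data(self, data):
        if self._open is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._open:
            content = "".join(self._buf).strip()
            if content:
                kind = TAG_TO_TYPE[tag]
                level = int(tag[1]) if kind == "head" else 0
                self.blocks.append(Block(kind, content, level))
            self._open = None


def structure_blog(html):
    parser = BlogStructurer()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

With a list like this, later steps can select elements by `type` instead of working on the raw text. Note that in this sketch a link block is emitted before the paragraph that contains it, because the paragraph is only appended when its closing tag is seen.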
2.2 Rule-based extraction
- Rule 1: Check whether the post has an introductory module such as a "Preface" or "A few words up front". If so, extract the content of that module directly and truncate it to the specified length (default: 256).
- Rule 2: Check whether there is content before the first level-1 heading. If so, extract that part directly and truncate it to the specified length (default: 256).
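The two rules can be sketched roughly as follows. This is a minimal illustration under several assumptions that the post does not confirm: the structured post is a list of dicts with `type`, `content`, and (for headings) `level` keys; the keyword list `PREFACE_KEYWORDS` is made up for the example; the length limit is counted in characters; and rule 2 is treated as not applicable when the post has no level-1 heading at all.

```python
# Assumed keyword list for detecting an introductory module.
PREFACE_KEYWORDS = ("preface", "foreword", "前言", "写在前面")


def rule_summary(blocks, max_len=256):
    """Return a rule-based summary, or None if neither rule applies."""
    # Rule 1: a preface-like heading followed by paragraph text.
    for i, block in enumerate(blocks):
        if block["type"] != "head":
            continue
        title = block["content"].lower()
        if not any(keyword in title for keyword in PREFACE_KEYWORDS):
            continue
        parts = []
        for nxt in blocks[i + 1:]:
            if nxt["type"] == "head":  # the preface ends at the next heading
                break
            if nxt["type"] == "text":
                parts.append(nxt["content"])
        if parts:
            return " ".join(parts)[:max_len]

    # Rule 2: paragraph text that appears before the first level-1 heading.
    parts = []
    for block in blocks:
        if block["type"] == "head" and block.get("level") == 1:
            break
        if block["type"] == "text":
            parts.append(block["content"])
    else:
        parts = []  # no level-1 heading found: assume rule 2 does not apply
    if parts:
        return " ".join(parts)[:max_len]
    return None
```

A return value of `None` signals that the post should fall through to the model-based step.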
2.3 Model-based extraction
If the rules cannot extract a summary, the TextRank model is used to summarize the post. The model's input is the text that remains after excluding elements such as head (heading), code (code block), table (table), img (image), and link (hyperlink). The process is as follows:
- a) For posts that match none of the rules, extract all text except images, code, headings, tables of contents, and similar elements;
- b) Split the text into sentences and feed them into the TextRank model for summarization;
- c) TextRank scores each sentence by its importance (the scores of all sentences sum to 1);
- d) Rank the sentences from highest to lowest score and concatenate them in that order until the length is as close as possible to the specified length without exceeding it (default: 256).
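Steps b) to d) can be sketched in plain Python as below. This is an illustrative implementation, not the team's code: it uses the word-overlap sentence similarity from the original TextRank formulation and a simple power iteration, normalizes the scores so they sum to 1, and stops concatenating at the first sentence that would exceed the limit. The function names, the regex-based sentence splitter and tokenizer, and the character-based length limit are all assumptions; Chinese text would need a proper word segmenter in place of `tokenize`.

```python
import math
import re


def split_sentences(text):
    # Naive splitter: break after ASCII or CJK sentence-ending punctuation.
    parts = re.split(r"(?<=[.!?。！？])\s*", text)
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence):
    return set(re.findall(r"\w+", sentence.lower()))


def similarity(a, b):
    # Word overlap normalized by sentence lengths.
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    return len(a & b) / denom if denom > 0 else 0.0


def textrank(sentences, damping=0.85, iterations=50):
    """Return one score per sentence; the scores sum to 1."""
    n = len(sentences)
    if n == 0:
        return []
    tokens = [tokenize(s) for s in sentences]
    weights = [
        [similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out = [sum(row) for row in weights]  # total edge weight per sentence
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                weights[j][i] / out[j] * scores[j]
                for j in range(n)
                if out[j] > 0
            )
            for i in range(n)
        ]
    total = sum(scores)
    return [s / total for s in scores]


def splice(sentences, scores, max_len=256):
    """Pick sentences in descending score order without exceeding max_len."""
    ranked = sorted(zip(sentences, scores), key=lambda p: p[1], reverse=True)
    picked, length = [], 0
    for sentence, score in ranked:
        extra = len(sentence) + (1 if picked else 0)  # +1 for the joining space
        if length + extra > max_len:
            break
        picked.append((sentence, score))
        length += extra
    return picked
```

The summary text is then `" ".join(s for s, _ in picked)`. Whether to stop at the first sentence that does not fit, as done here, or to skip it and try shorter ones is a design choice the post leaves open.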
2.4 Score setting
- Score range: [0, 1]
- Default score for a rule-based summary: 0.5
- Score for a model-based summary: the sum of the scores of all concatenated sentences
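The scoring scheme above amounts to the small function below. The function name `summary_score` and the `"rule"` / `"model"` labels are hypothetical; only the 0.5 default and the sum of selected sentence scores come from the text.

```python
RULE_SCORE = 0.5  # default score for a rule-based summary


def summary_score(source, sentence_scores=()):
    """Score a summary given how it was produced.

    source: "rule" or "model" (labels assumed for this sketch).
    sentence_scores: TextRank scores of the sentences in the summary.
    """
    if source == "rule":
        return RULE_SCORE
    # The clamp only guards against floating-point drift.
    return min(1.0, max(0.0, sum(sentence_scores)))
```

Because the TextRank scores of all sentences sum to 1, the model score always falls in [0, 1] and can be read as the share of total sentence importance that the summary covers.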
3. Next steps
The current version is a preliminary one and needs further optimization. Next steps include:
- Build a test set and evaluate quality quantitatively, using BLEU and ROUGE as metrics;
- Improve sentence concatenation: rank the sentences from highest to lowest score, then concatenate the selected sentences in the order in which they appear in the original post, until the length is close to the specified length;
- Take word weights into account when TextRank builds the sentence graph. For example, compute the weight of each word with a TF-IDF-like algorithm over all blog posts under the same tag.
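The proposed change to sentence concatenation could look roughly like this. It is a sketch of one way to read that item, with a hypothetical function name and a character-based length limit assumed: sentences are still selected by descending score, but the selected ones are joined in their original order.

```python
def splice_in_original_order(sentences, scores, max_len=256):
    """Select by score, then emit the selected sentences in source order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, length = [], 0
    for i in ranked:
        extra = len(sentences[i]) + (1 if chosen else 0)  # +1 for the space
        if length + extra > max_len:
            break
        chosen.append(i)
        length += extra
    return " ".join(sentences[i] for i in sorted(chosen))


print(splice_in_original_order(
    ["First.", "Second.", "Third."], [0.2, 0.3, 0.5], max_len=14
))  # prints "Second. Third."
```

Keeping the original order tends to make the summary read more naturally, since the selected sentences follow the flow of the post rather than their ranking.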
P.S.
This series will be updated continuously. Suggestions from colleagues, teachers, and experts in NLP and related fields are very welcome. Thank you!