CSDN Blog Summarization (I): A Simple First-Version Implementation

2022-07-06 10:42:00 【Alexxinlu】

Contents

Series articles
1. Background
2. Blog summary
3. Next steps
P.S.

Series articles

CSDN Blog Summarization (I): A Simple First-Version Implementation

Team blog: CSDN AI team

1. Background

2. Blog summary

2.1 Blog structuring

A blog post contains many kinds of elements, and summarizing it directly as plain text seriously degrades summary quality. We therefore structure the blog first. After structuring, the elements of the body are clearly separated, for example: head (heading), code (code), table (table), text (paragraph), img (image), link (link), and so on. This makes it easier to retrieve each part accurately, gives the preprocessing and rule logic in later summarization steps clear structured information, and provides better input for the model. The following figure is an example of a structured blog:
[Figure: example of a structured blog]
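
To make the idea concrete, here is a minimal sketch of structuring, assuming the blog body is available as HTML. The tag-to-type mapping, the `BlogStructurer` class, and the `structure_blog` helper are hypothetical names chosen for illustration, not the team's actual implementation. Links and images nested inside paragraphs are left out to keep the sketch short.

```python
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to the block types named above.
TAG_TO_TYPE = {
    "h1": "head", "h2": "head", "h3": "head",
    "pre": "code", "table": "table", "p": "text",
}


class BlogStructurer(HTMLParser):
    """Collects top-level blocks as {"type": ..., "content": ...} dicts."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._stack = []   # currently open block-level tags
        self._buffer = []  # text collected for the outermost open block

    def handle_starttag(self, tag, attrs):
        if tag == "img" and not self._stack:
            # A top-level image becomes its own block, keyed by its src.
            self.blocks.append({"type": "img", "content": dict(attrs).get("src", "")})
        elif tag in TAG_TO_TYPE:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()
            if not self._stack:
                # The outermost block just closed: emit it with its type.
                content = "".join(self._buffer).strip()
                self._buffer = []
                if content:
                    self.blocks.append({"type": TAG_TO_TYPE[tag], "content": content})

    def handle_data(self, data):
        if self._stack:
            self._buffer.append(data)


def structure_blog(html):
    """Return the list of typed blocks found in an HTML blog body."""
    parser = BlogStructurer()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Each block keeps its type, so later steps can select only `text` blocks or locate the first `head` block without re-parsing the page.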

2.2 Rule-based part

Rule 1: Check whether the post has an introductory module such as "Preface" or "A few words up front". If it does, extract the content of that module directly and truncate it to the specified length (default length: 256).
Rule 2: Check whether there is content before the first level-1 heading. If there is, extract that part directly and truncate it to the specified length (default length: 256).
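
The two rules can be sketched roughly as follows. This is only an illustration under stated assumptions: blocks are dicts with `type`, `content`, and (for headings) a `level` key, the length is counted in characters, and the keyword list `PREFACE_TITLES` and the function name `rule_summary` are hypothetical.

```python
# Hypothetical keywords that mark an introductory section.
PREFACE_TITLES = ("preface", "foreword", "前言", "写在前面")
MAX_LEN = 256  # default length from the rules, assumed to be in characters


def rule_summary(blocks, max_len=MAX_LEN):
    """Return a rule-based summary, or None if neither rule applies."""
    # Rule 1: text under a heading that looks like a preface.
    in_preface = False
    preface = []
    for block in blocks:
        if block["type"] == "head":
            if in_preface:
                break  # the preface section has ended
            title = block["content"].lower()
            in_preface = any(key in title for key in PREFACE_TITLES)
        elif in_preface and block["type"] == "text":
            preface.append(block["content"])
    if preface:
        return " ".join(preface)[:max_len]

    # Rule 2: text that appears before the first level-1 heading.
    leading = []
    for block in blocks:
        if block["type"] == "head" and block.get("level", 1) == 1:
            break
        if block["type"] == "text":
            leading.append(block["content"])
    if leading:
        return " ".join(leading)[:max_len]

    return None
```

A `None` result signals that the post should fall through to the model-based part.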

2.3 Model-based part

If the rules cannot extract a summary, the TextRank model is used to summarize the blog post. The model input is the plain text that remains after non-prose elements such as head (heading), code (code), table (table), img (image), and link (link) are removed. The process is as follows:

a) For samples that do not match the rules, extract all text except images, code, headings, the table of contents, and similar elements;
b) Split the text into sentences and feed them to the TextRank model for summarization;
c) The TextRank model scores each sentence according to its importance (the scores of all sentences sum to 1);
d) Rank the sentences from highest to lowest score and concatenate them in that order until the result is close to, but no longer than, the specified length (default length: 256).
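
Steps b) to d) can be sketched in plain Python as below. This is a simplified illustration, not the team's code: it uses the word-overlap similarity from the original TextRank paper, a fixed number of power iterations, and a regex tokenizer (Chinese text would need a proper word segmenter). The function names and the choice to stop at the first sentence that no longer fits are assumptions.

```python
import math
import re


def split_sentences(text):
    """Split after Chinese terminators, or after English ones followed by spaces."""
    parts = re.split(r"(?<=[。！？])|(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def similarity(a, b):
    """Word overlap normalized by sentence lengths (TextRank paper style)."""
    denom = math.log(len(a)) + math.log(len(b)) if a and b else 0.0
    if denom <= 0:
        return 0.0
    return len(set(a) & set(b)) / denom


def textrank(sentences, damping=0.85, iterations=50):
    """Return one score per sentence; the scores sum to 1."""
    n = len(sentences)
    if n == 0:
        return []
    tokens = [re.findall(r"\w+", s.lower()) for s in sentences]
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            weights[i][j] = weights[j][i] = similarity(tokens[i], tokens[j])
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    total = sum(scores)
    return [s / total for s in scores]


def summarize(text, max_len=256):
    """Concatenate top-ranked sentences without exceeding max_len characters."""
    sentences = split_sentences(text)
    scores = textrank(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, length = [], 0
    for i in ranked:
        extra = len(sentences[i]) + (1 if chosen else 0)  # +1 for the joining space
        if length + extra > max_len:
            break
        chosen.append(sentences[i])
        length += extra
    return " ".join(chosen)
```

Normalizing the scores at the end is what makes the "total score of all sentences is 1" property in step c) hold in this sketch.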

2.4 Score setting

Score range: [0, 1]
Default rule score: 0.5
Model score: the sum of the scores of all concatenated sentences
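
As a small worked example of this scoring scheme, the sketch below returns the fixed rule score when a rule produced the summary and otherwise sums the TextRank scores of the selected sentences. The function name `summary_score` and the clamp to 1.0 (a guard against floating-point drift) are assumptions added for illustration.

```python
RULE_SCORE = 0.5  # default score for rule-based summaries


def summary_score(selected_scores=None):
    """Score a summary in [0, 1].

    Pass None for a rule-based summary, or the TextRank scores of the
    sentences that were concatenated for a model-based summary.
    """
    if selected_scores is None:
        return RULE_SCORE
    return min(1.0, sum(selected_scores))
```

For instance, if the selected sentences have scores 0.3 and 0.2, the model score is their sum; a summary that happened to include every sentence would score 1.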

3. Next steps

The current version is preliminary and needs further optimization. The next steps include:

Build a test set and evaluate the results quantitatively. Evaluation metrics: BLEU, ROUGE;
Optimize sentence concatenation: rank the sentences from highest to lowest score, then concatenate the selected sentences in the order in which they appear in the original text, until the result is close to the specified length;
When TextRank builds the sentence graph, take word weights into account. For example, compute a weight for each word with a TF-IDF-like algorithm over all blogs that share the same tag.
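
The planned concatenation change could look roughly like the following sketch: sentences are still selected by score, but the selected ones are emitted in their original order. The function name `splice_in_original_order` and the character-based length budget are assumptions, and this only illustrates the idea, since the optimization is described here as future work.

```python
def splice_in_original_order(sentences, scores, max_len=256):
    """Select sentences by score, then join them in source order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, length = [], 0
    for i in ranked:
        extra = len(sentences[i]) + (1 if chosen else 0)  # +1 for the joining space
        if length + extra > max_len:
            break
        chosen.append(i)
        length += extra
    # Restore the order in which the sentences appear in the original text.
    return " ".join(sentences[i] for i in sorted(chosen))
```

Keeping the source order tends to make the summary read more naturally, because the selected sentences still follow the author's own flow.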

P.S.

This series of articles will be updated continuously. We hope that colleagues, teachers, and experts in NLP and related fields can offer their valuable advice. Thank you!

Copyright notice
This article was written by [Alexxinlu]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/187/202207060911378194.html