Article main content extraction software [based on NLP technology]

2022-07-27 04:03:00 【Yangyang 2013 haha】

Text summarization is the process of using a computer to condense a large body of text into concise, refined content. By reading the summary, people can grasp the main points of the text, which saves a lot of time and improves reading efficiency.

One: TextRank (keyword and summary extraction)
TextRank is a graph-based ranking algorithm for text, used to extract keywords and summaries. Its basic idea comes from Google's PageRank algorithm: the text is divided into constituent units (words or sentences), a graph model is built over those units, and a voting mechanism ranks the important ones. Because of this, keyword and summary extraction needs only the information contained in a single document.
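The idea above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not a reference implementation: the sentence splitter is deliberately naive, the word-overlap similarity and the damping factor of 0.85 are common choices assumed here rather than taken from the article, and the function names (`split_sentences`, `similarity`, `textrank_summary`) are hypothetical.

```python
import math
import re


def split_sentences(text):
    """Split on sentence-ending punctuation (a deliberately naive rule)."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]


def similarity(words_a, words_b):
    """Shared-word count, normalised by the log of both sentence lengths."""
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # avoids log(1) + log(1) == 0 in the denominator
    overlap = len(set(words_a) & set(words_b))
    return overlap / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank_summary(text, top_n=2, damping=0.85, iterations=50):
    """Return the `top_n` highest-ranked sentences in document order."""
    sentences = split_sentences(text)
    tokens = [re.findall(r'\w+', s.lower()) for s in sentences]
    n = len(sentences)
    # Edge weights of the sentence graph (no self-loops).
    weights = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    # PageRank-style voting: each sentence passes its score to its
    # neighbours in proportion to the edge weights.
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(ranked)]
```

Calling `textrank_summary(text, top_n=2)` returns the two sentences that share the most vocabulary with the rest of the document; a sentence with no words in common with any other sentence receives only the baseline score and is ranked last.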

How can a Word document automatically generate a table of contents and extract a summary?

The steps to create a summary automatically are as follows:

(1) Click the 【AutoSummarize】 item on the 【Tools】 menu. Word starts writing the summary automatically; to cancel a summary in progress, press Esc. When the command finishes, the 【AutoSummarize】 dialog box shown in Figure 6-41 appears.

(2) Under 【Type of summary】, select how the summary should be displayed.

(3) In the 【Percent of original】 box, type or select the level of detail for the summary.

(4) To update the document statistics, select the 【Update document statistics】 check box.

Steps to generate a table of contents automatically:

Click Format > Styles and Formatting. A style pane appears on the right side of the page. At the top of that pane, click New Style, then in the dialog box enter "First level title" as the name, choose Paragraph as the style type, base the style on Heading 1, and choose body text as the style for the following paragraph. Then set the heading's font, font size, and spacing before and after (in points) to suit your requirements. At the bottom of the dialog there is also an option to add the style to the template: select it if you plan to use this style in future documents, and leave it cleared if you expect to change it later. Click OK, and the format of your first-level heading is set. A style named "First level title" now appears in the style pane.

Back in your article, select each chapter heading in turn and click "First level title" in the style pane on the right. The heading takes on the format you just set.

Set up your second- and third-level headings the same way, basing the second-level style on Heading 2 and the third-level style on Heading 3, then go back to the text and apply them one by one.

After formatting all your headings, move the cursor to the start of your article, click Insert > Reference > Table of Contents, and set the number of levels to show to 3. The table of contents is generated, and it includes every heading down to the third level.

Before generating the table of contents, open the Document Map so that it is displayed on the left. There you can see the structure of your article clearly. This structure is the basis for the automatically generated table of contents: if the Document Map is messy, the generated table of contents will be messy too.
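To make the "show levels" setting concrete, here is a small Python sketch of the same rule: headings at levels 1 to 3 are listed with indentation and deeper headings are left out. It does not read a Word file; the `(level, title)` list and the function name `build_toc` are hypothetical, and the snippet only illustrates the rule rather than describing how Word builds its table of contents.

```python
def build_toc(headings, max_level=3):
    """Return indented TOC lines for headings at levels 1..max_level.

    `headings` is a list of (level, title) pairs in document order.
    """
    lines = []
    for level, title in headings:
        if 1 <= level <= max_level:
            # Indent four spaces per level below the top level.
            lines.append('    ' * (level - 1) + title)
    return lines


if __name__ == '__main__':
    outline = [(1, 'Introduction'), (2, 'Scope'), (4, 'Footnote detail'),
               (3, 'Terminology')]
    print('\n'.join(build_toc(outline)))
```

The level-4 entry is dropped, which mirrors what happens when the table of contents is limited to three levels.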

What Python libraries are there for extracting text summaries?

An article's content may be plain text, but in today's Internet era it is more often HTML. Whatever the format, the summary is usually the beginning of the article, and it can be extracted by a specified number of characters.

Two: Plain-text summary

A plain-text document is just one long string, so extracting its summary is easy:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""Get a summary of the TEXT-format document"""


def get_summary(text, count):
    """Get the first `count` characters from `text`

    >>> text = 'Welcome! This is an article about Python.'
    >>> get_summary(text, 12) == 'Welcome! Thi'
    True
    """
    # Python 3: text strings are `str` (Python 2 used `unicode`).
    assert isinstance(text, str)
    return text[0:count]


if __name__ == '__main__':
    import doctest
    doctest.testmod()

Three: HTML summary

An HTML document contains a large number of tags (such as <h1> and <a>). These are markup instructions and usually come in pairs, so naive text truncation breaks the HTML document structure, and the summary then fails to display properly in the browser.

To truncate the content while respecting the HTML document structure, the HTML has to be parsed. In Python, this can be done with the standard library's HTMLParser class (module HTMLParser in Python 2, html.parser in Python 3).

The simplest summary extraction function ignores the HTML tags and extracts only the raw text inside them. The following is a Python implementation of such a function:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""Get a raw summary of the HTML-format document"""

# Python 3 module path; in Python 2 this was `from HTMLParser import HTMLParser`.
from html.parser import HTMLParser


class SummaryHTMLParser(HTMLParser):
    """Parse HTML text to get a summary

    >>> text = '<p>Hi guys:</p><p>This is an example using SummaryHTMLParser.</p>'
    >>> parser = SummaryHTMLParser(10)
    >>> parser.feed(text)
    >>> parser.get_summary('...')
    '<p>Higuys:Thi...</p>'
    """

    def __init__(self, count):
        HTMLParser.__init__(self)
        self.count = count      # maximum number of text characters to keep
        self.summary = ''

    def feed(self, data):
        """Only accept text (`str`) `data`"""
        assert isinstance(data, str)
        HTMLParser.feed(self, data)

    def handle_data(self, data):
        # Called for each run of text between tags.
        more = self.count - len(self.summary)
        if more > 0:
            # Remove possible whitespaces in `data`
            data_without_whitespace = ''.join(data.split())
            self.summary += data_without_whitespace[0:more]

    def get_summary(self, suffix='', wrapper='p'):
        # Wrap the collected text (plus suffix) in a single element.
        return '<{0}>{1}{2}</{0}>'.format(wrapper, self.summary, suffix)


if __name__ == '__main__':
    import doctest
    doctest.testmod()

HTMLParser (or BeautifulSoup and similar libraries) is better suited to complex HTML summary extraction. For the simple extraction above, there is actually a simpler implementation than SummaryHTMLParser:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""Get a raw summary of the HTML-format document"""

import re


def get_summary(text, count, suffix='', wrapper='p'):
    """A simpler implementation (vs `SummaryHTMLParser`).

    >>> text = '<p>Hi guys:</p><p>This is an example using SummaryHTMLParser.</p>'
    >>> get_summary(text, 10, '...')
    '<p>Higuys:Thi...</p>'
    """
    assert isinstance(text, str)
    summary = re.sub(r'<.*?>', '', text)  # key difference: use regex
    # Drop all whitespace, then keep the first `count` characters.
    summary = ''.join(summary.split())[0:count]
    return '<{0}>{1}{2}</{0}>'.format(wrapper, summary, suffix)


if __name__ == '__main__':
    import doctest
    doctest.testmod()
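Both functions above throw the tags away. For the more complex case, where the summary should keep its markup and still be well-formed, a parser-based approach can copy tags through and close whatever is still open at the cut-off point. The following is a minimal sketch under stated assumptions, not the article's own method: the class `TruncatingHTMLParser`, the function `truncate_html`, and the short `VOID_TAGS` list are hypothetical names introduced here, and comments, scripts, and malformed markup are not handled.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag (a deliberately short list).
VOID_TAGS = {'br', 'hr', 'img', 'input', 'meta', 'link'}


class TruncatingHTMLParser(HTMLParser):
    """Copy HTML until `count` text characters are emitted."""

    def __init__(self, count):
        super().__init__(convert_charrefs=True)
        self.remaining = count   # text characters still allowed
        self.parts = []          # output fragments
        self.open_tags = []      # stack of currently open elements

    def handle_starttag(self, tag, attrs):
        if self.remaining <= 0:
            return
        rendered = ''.join(
            ' {}'.format(name) if value is None
            else ' {}="{}"'.format(name, escape(value, quote=True))
            for name, value in attrs)
        self.parts.append('<{}{}>'.format(tag, rendered))
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.remaining > 0 and tag in self.open_tags:
            # Close everything opened after `tag`, then `tag` itself.
            while self.open_tags:
                opened = self.open_tags.pop()
                self.parts.append('</{}>'.format(opened))
                if opened == tag:
                    break

    def handle_data(self, data):
        if self.remaining <= 0:
            return
        piece = data[:self.remaining]
        self.remaining -= len(piece)
        self.parts.append(escape(piece, quote=False))

    def get_summary(self, suffix=''):
        # Close the elements that were still open when the limit was hit.
        closing = ''.join('</{}>'.format(t) for t in reversed(self.open_tags))
        return ''.join(self.parts) + suffix + closing


def truncate_html(text, count, suffix=''):
    parser = TruncatingHTMLParser(count)
    parser.feed(text)
    parser.close()
    return parser.get_summary(suffix)
```

Unlike the versions above, whitespace is preserved and counted, and the suffix is placed inside the innermost element that was still open, so the result can be dropped straight into a page.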

How to extract names from irregular summaries in Excel

First, copy the range to be extracted.

Then shrink that column's width to the size of a single character.

Click Fill > Justify; the result is as shown in the figure.

Click Data > Text to Columns > Next > Next > Finish.

Click Find > Go To Special > Constants, and check only 【Text】.

A long column of cells is now selected. Right-click it, then click Delete > Shift cells up; the result is as shown in the figure.

Finally, widen the one-character-wide column again, and the numbers are displayed.

How to extract a content summary

If you need to write this yourself, you need to understand the basic structure of a PDF document (you can refer to PDF Reference 8.8). All the information you need is in the catalog\info objects.

How to extract department items from the sub-ledger summary exported from UFIDA to Excel?

Your data appears to be regular: every field is separated by a "-" symbol, so it can be split in one pass.

Toolbar > Data > Text to Columns > Delimited > Next > Other, then enter the delimiter "-" (without the quotation marks; they are only there for emphasis) and click Next to finish.
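The same delimiter-based split can be sketched in Python for data that has already been exported as text. The sample summary string and the function name `split_summary` are made up for illustration; real exported summaries may have a different number of fields.

```python
def split_summary(summary, delimiter='-'):
    """Split one summary string into fields, trimming stray spaces."""
    return [field.strip() for field in summary.split(delimiter)]


if __name__ == '__main__':
    # Hypothetical summary line: department - item - voucher text.
    print(split_summary('Sales Dept - Travel - Invoice 1024'))
```

Each field then corresponds to one of the columns that Text to Columns would produce.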

Original site

Copyright notice
This article was written by [Yangyang 2013 haha]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/208/202207262313526504.html