One click extraction of tables in PDF

2022-07-06 11:07:00 zkkkkkkkkkkkkk

Preface:

        For work, I needed to extract the tables in PDF files and save them as CSV files or database tables, which pushed me into researching the topic. From what I found, the Python libraries that currently support reading tables from editable PDFs include pdfminer3k, tabula, and pdfplumber. All three have flaws, but I lean toward pdfplumber: in my experience it is simple to use and its features are easy to get working. The rest of this article introduces pdfplumber. If you are interested in the other two libraries, you can look up the relevant material yourself. For non-editable PDFs (that is, recognizing tables inside images), this library probably cannot help you.

Contents

1. Introduction to pdfplumber

        1.1 Introduction

        1.2 Open-source repository (Git address)

        1.3 Official documentation

        1.4 Installation

2. Basic usage

        2.1 About the data

        2.2 Code implementation

        2.3 Output


1. Introduction to pdfplumber

        1.1 Introduction

                 Let's start with the official description: pdfplumber lets you plumb a PDF for detailed information about each text character, rectangle, and line. Additional features: table extraction and visual debugging. It works best on machine-generated PDFs rather than scanned ones. Overall, pdfplumber is a multi-purpose PDF processing tool.

        1.2 Open-source repository (Git address)

            GitHub - jsvine/pdfplumber: Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

        1.3 Official documentation

pdfplumber · PyPI

        1.4 Installation

pip install pdfplumber

2. Basic usage

        2.1 About the data

                The data is a transaction statement, and the tables in the PDF are editable (not scanned images). The goal is to extract the data in those tables.

        2.2 Code implementation

import pdfplumber

# path = 'D:\\202104147187110045_1.pdf'
path = '../recognize_img/demo_img/ A framed table can be edited .pdf'

# Open the PDF in a with-block so the file is closed even if parsing fails.
with pdfplumber.open(path) as pdf:
    # List of page objects in the PDF.
    print(pdf.pages)    # [<Page:1>]

    for page_number, page in enumerate(pdf.pages, start=1):
        # page.extract_text() returns all the text on the current page.
        # It is commented out here because the output is long.
        # print(page.extract_text())

        # extract_tables() returns every table found on the page;
        # each table is a list of rows, and each row is a list of cells.
        for table in page.extract_tables():
            for row in table:
                print(row)

        # Printed once per page, after all of its tables have been handled.
        print(f'============ Finished parsing page {page_number} ============')

# Converting the rows to a DataFrame is not implemented here.
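
The rows printed by the code above are plain lists, and cells can be None or contain line breaks. As a minimal sketch of one way to tidy them before further processing, the helper below assumes that the first row of each table is the header row, which may not hold for every PDF. The names clean_cell, table_to_records, and sample_table are hypothetical and are not part of pdfplumber; the sample data is made up and only mimics the shape of one table returned by extract_tables().

```python
def clean_cell(cell):
    """Turn None into '' and collapse line breaks and extra spaces in a cell."""
    if cell is None:
        return ""
    return " ".join(str(cell).split())


def table_to_records(table):
    """Treat the first row as the header (an assumption) and return one dict per data row."""
    if not table:
        return []
    header = [clean_cell(cell) for cell in table[0]]
    records = []
    for row in table[1:]:
        values = [clean_cell(cell) for cell in row]
        # Skip rows that are completely empty after cleaning.
        if not any(values):
            continue
        records.append(dict(zip(header, values)))
    return records


# Hypothetical sample with the same shape as one table from extract_tables().
sample_table = [
    ["Date", "Summary", "Amount"],
    ["2021-04-14", "Transfer\nin", "100.00"],
    [None, None, None],
    ["2021-04-15", None, "-20.50"],
]

records = table_to_records(sample_table)
for record in records:
    print(record)
```

Note that dict(zip(...)) keeps only one value when two header cells share the same name, so a table with duplicate column names would need different handling.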

        2.3 Output

                Each row is output as a list. If you need CSV or database output, you can first convert the data into a DataFrame and then write it to the target.
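
As a rough sketch of that last step, the example below writes extracted rows to CSV and to a database table. It deliberately uses only the standard library (csv and sqlite3) instead of a DataFrame so that it runs without extra dependencies; with pandas, DataFrame.to_csv and DataFrame.to_sql would be the corresponding route. The sample rows, the table name transactions, and the assumption that the first row is the header are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical rows in the shape produced by page.extract_tables(),
# already cleaned so that no cell is None.
rows = [
    ["Date", "Summary", "Amount"],
    ["2021-04-14", "Transfer in", "100.00"],
    ["2021-04-15", "", "-20.50"],
]
header, data = rows[0], rows[1:]  # assumes the first row is the header

# CSV output. An in-memory buffer is used here; to write a real file,
# use open("out.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerows(data)
csv_text = buffer.getvalue()
print(csv_text)

# Database output. An in-memory SQLite database stands in for the real target.
conn = sqlite3.connect(":memory:")
# Every column is stored as TEXT; column names are quoted so that
# headers containing spaces remain valid identifiers.
columns = ", ".join(f'"{name}" TEXT' for name in header)
conn.execute(f"CREATE TABLE transactions ({columns})")
placeholders = ", ".join("?" for _ in header)
conn.executemany(f"INSERT INTO transactions VALUES ({placeholders})", data)
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]
print(row_count)
conn.close()
```

Storing everything as TEXT keeps the sketch simple; converting amounts and dates to proper types would be a separate step that depends on the format of the statement.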

Copyright notice
This article was written by [zkkkkkkkkkkkkk]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/187/202207060912331803.html