
One-click extraction of tables from PDFs

2022-07-06 11:07:00 【zkkkkkkkkkkkkk】

Preface:
For work, I needed to extract the tables in PDFs and save them intact as CSV files or database tables, which pushed me into researching the problem. From what I found, the Python libraries that can currently read tables out of editable PDFs include pdfminer3k, tabula, and pdfplumber. All three have shortcomings, but I lean towards pdfplumber, which I find simple to use and easy to get results with. The rest of this article introduces pdfplumber; if you are interested in the other two libraries, you can look up the relevant material yourself. For non-editable PDFs (recognizing tables inside images), this library probably can't help you.

Contents

1. Introduction to pdfplumber

1.1 Overview

1.2 Source code repository

1.3 Official documentation

1.4 Installation

2. Basic usage

2.1 Dataset

2.2 Code

2.3 Output
1. Introduction to pdfplumber

1.1 Overview

First, the official description: pdfplumber lets you plumb a PDF for detailed information about each text character, rectangle, and line. It also offers table extraction and visual debugging. It works best on machine-generated PDFs rather than scanned ones. Overall, pdfplumber is a versatile PDF processing tool.
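To make the "characters, rectangles, and lines" part concrete, here is a minimal sketch. It assumes pdfplumber's documented page properties `chars`, `rects`, and `lines`, each of which is a list of dicts. The helper names `summarize_page` and `summarize_pdf` are made up for illustration, and treating a page with no characters as "probably scanned" is only a rough heuristic, not a pdfplumber feature.

```python
def summarize_page(page):
    """Count the low-level objects exposed on a page.

    `page` is expected to be a pdfplumber Page; only its `chars`,
    `rects`, and `lines` list properties are read.
    """
    summary = {
        "chars": len(page.chars),
        "rects": len(page.rects),
        "lines": len(page.lines),
    }
    # Heuristic (an assumption, not a library feature): a page without any
    # character objects is probably a scanned image and would need OCR.
    summary["probably_scanned"] = summary["chars"] == 0
    return summary


def summarize_pdf(path):
    """Open a PDF and summarize every page. Requires pdfplumber."""
    import pdfplumber  # imported here so summarize_page has no hard dependency

    with pdfplumber.open(path) as pdf:
        return [summarize_page(page) for page in pdf.pages]
```

Calling `summarize_pdf` on your own file gives a quick check of whether a PDF is text-based before you attempt table extraction. For the visual debugging mentioned above, the pdfplumber README documents `page.to_image()` and the `debug_tablefinder()` method on the resulting image.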

1.2 Source code repository

GitHub - jsvine/pdfplumber: Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

1.3 Official documentation

pdfplumber · PyPI

1.4 Installation

pip install pdfplumber

2. Basic usage

2.1 Dataset

The data is a transaction statement stored as an editable (text-based) PDF table. The goal is to extract the data from that table.

2.2 Code

import pdfplumber

# path = 'D:\\202104147187110045_1.pdf'
path = '../recognize_img/demo_img/ A framed table can be edited .pdf'

# Open the PDF; the with-block closes the file automatically.
with pdfplumber.open(path) as pdf:
    # List the page objects in the PDF
    print(pdf.pages)    # [<Page:1>]

    for page_number, page in enumerate(pdf.pages, start=1):
        # page.extract_text() returns all the text on the current page.
        # It is commented out here because the output is long.
        # print(page.extract_text())

        # Each table is a list of rows; each row is a list of cell values.
        for table in page.extract_tables():
            for row in table:
                print(row)
        # Printed once per page, after all of its tables have been handled.
        print(f'============ Finished parsing page {page_number} ============')

# Converting the rows to a DataFrame for output is left as a placeholder here.

2.3 Output

Each row is output as a list. If you need CSV files or database tables, first convert the extracted rows into a DataFrame, then write it to the target.
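The conversion step could look roughly like the sketch below. It assumes pandas is installed and that the first row of each extracted table is the header row, which will not hold for every PDF. The helper name `table_to_dataframe` and the sample rows are made up for illustration; they only mimic the list-of-lists shape that `page.extract_tables()` returns.

```python
import io

import pandas as pd


def table_to_dataframe(table):
    """Turn one table (a list of rows) into a DataFrame.

    Assumes the first row holds the column names. An empty table
    raises ValueError from the unpacking below.
    """
    header, *rows = table
    return pd.DataFrame(rows, columns=header)


# Made-up rows in the shape extract_tables() produces: a list of lists
# whose cells are strings or None. This is not real data.
sample_table = [
    ["date", "description", "amount"],
    ["2021-04-14", "transfer in", "100.00"],
    ["2021-04-15", None, "-20.50"],
]

df = table_to_dataframe(sample_table)

# Write the CSV to an in-memory buffer so this sketch touches no files;
# pass a file path to to_csv() instead to save it to disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

For a database target, pandas offers `DataFrame.to_sql`, which takes a table name and a database connection. Note that every cell arrives as a string, so numeric and date columns may need converting before they are stored.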