One click extraction of tables in PDF
2022-07-06 11:07:00 【zkkkkkkkkkkkkk】
Preface
For work, I needed to export tables from PDF files to CSV or to database tables, which meant some forced research. From what I found, the Python libraries that can currently read tables out of editable PDFs include pdfminer3k, tabula, and pdfplumber. All three have shortcomings, but I lean toward pdfplumber, which I find simple to use and easy to get working. The rest of this article introduces pdfplumber; if you are interested in the other two libraries, you can look them up yourself. For non-editable PDFs (tables that exist only as images), this library probably cannot help.
1. Introduction to pdfplumber
1.1 Introduction
Let's start with the official description: pdfplumber lets you plumb a PDF for detailed information about each text character, rectangle, and line. It also offers table extraction and visual debugging. It works best on machine-generated PDFs rather than scanned ones. Overall, pdfplumber is a general-purpose PDF processing tool.
1.2 Source code (Git repository)
1.3 Official documentation
1.4 Installation
pip install pdfplumber
2. Basic usage
2.1 About the data
The data is a transaction statement in a PDF whose table is editable (text-based). The goal is to extract the data from the table.
2.2 Code implementation
import pdfplumber

# path = 'D:\\202104147187110045_1.pdf'
path = '../recognize_img/demo_img/ A framed table can be edited .pdf'

# The context manager closes the file when the block ends
with pdfplumber.open(path) as pdf:
    # List of page objects
    print(pdf.pages)  # [<Page:1>]
    for count, page in enumerate(pdf.pages, start=1):
        # page.extract_text() returns all text on the current page;
        # it is commented out here because the output is long.
        # print(page.extract_text())
        for table in page.extract_tables():
            for row in table:
                print(row)
        print(f'============ Finished parsing page {count} ============')
        # Convert to a DataFrame and output
        # pass
2.3 Results output
Each row is output as a list. If you need CSV or database output, first convert the rows into a DataFrame and then write it to the target.
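As a rough sketch of that last step, the snippet below takes rows in the shape that page.extract_tables() yields for one table (a list of rows, each a list of strings or None) and writes them to CSV and to a database table. It uses only the standard library (csv and sqlite3) instead of a DataFrame, so it is a swapped-in alternative, not the author's method; with pandas installed, pandas.DataFrame(rows[1:], columns=rows[0]) would be the DataFrame equivalent. The helper names, the sample rows, and the table name "transactions" are all made up for illustration, and the sketch assumes the first row is a header with unique, non-empty column names.

```python
import csv
import io
import sqlite3


def normalize_table(table):
    """Replace None cells with '' and collapse line breaks inside cells."""
    return [
        [" ".join(cell.split()) if cell is not None else "" for cell in row]
        for row in table
    ]


def table_to_csv(table, fileobj):
    """Write the rows to an open text file object as CSV."""
    csv.writer(fileobj).writerows(table)


def table_to_sqlite(table, conn, name="transactions"):
    """Create a table from the header row and insert the remaining rows.

    Assumes the header cells are unique, non-empty, and contain no
    double quotes; every column is stored as TEXT.
    """
    header, *body = table
    columns = ", ".join(f'"{col}" TEXT' for col in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{name}" ({columns})')
    conn.executemany(f'INSERT INTO "{name}" VALUES ({placeholders})', body)
    conn.commit()


# Made-up rows standing in for one table returned by page.extract_tables().
sample = [
    ["Date", "Description", "Amount"],
    ["2021-04-14", "Transfer\nin", "100.00"],
    ["2021-04-15", None, "-20.50"],
]
rows = normalize_table(sample)

# Write to an in-memory buffer; for a real file, open it with newline=''.
buffer = io.StringIO()
table_to_csv(rows, buffer)

# An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
table_to_sqlite(rows, conn)
```

In a real run, each table from page.extract_tables() would be passed through normalize_table before being written out. Real statements often have merged or empty header cells, so the header handling will probably need adjusting.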