One-click extraction of tables from PDFs
2022-07-06 11:07:00 【zkkkkkkkkkkkkk】
Preface:
For work, I needed to extract the tables in PDF files, unchanged, into CSV files or database tables, which forced me to research the topic. From what I found, the Python libraries that currently support reading tables from editable PDFs include pdfminer3k, tabula, and pdfplumber. All three have shortcomings, but I lean toward pdfplumber, which I find simple and easy to use. This article introduces pdfplumber; if you are interested in the other two libraries, you can look up the relevant material yourself. For non-editable PDFs (recognizing tables inside images), this library probably cannot help you.
One、 Introduction to pdfplumber
1.1、 Introduction
Let's start with the official description: pdfplumber lets you plumb a PDF for detailed information about each text character, rectangle, and line. It also offers table extraction and visual debugging. It works best on machine-generated PDFs rather than scanned ones. Overall, pdfplumber is a versatile PDF processing tool.
1.2、 Source code (git repository)
1.3、 Official documentation
1.4、 Installation
pip install pdfplumber
Two、 Basic usage
2.1、 Dataset
The data is a transaction statement stored in an editable PDF. The goal is to extract the data from its tables.
2.2、 Code implementation
import pdfplumber

# path = 'D:\\202104147187110045_1.pdf'
path = '../recognize_img/demo_img/ A framed table can be edited .pdf'
pdf = pdfplumber.open(path)
# Get the list of page objects in the PDF
print(pdf.pages)  # [<Page:1>]
count = 0
for page in pdf.pages:
    count += 1
    # page.extract_text() returns all the text on the current page.
    # It is commented out here because the output is long.
    # print(page.extract_text())
    for table in page.extract_tables():
        # Each table is a list of rows; each row is a list of cell values
        for row in table:
            print(row)
    print(f'============ Finished parsing page {count} ============')
    # Convert to a DataFrame for output
    # pass
pdf.close()
2.3、 Results output
Each row is output as a list. If you need CSV or database output, you can first convert the extracted rows into a DataFrame and then write it to the target.
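The export step described above can be sketched roughly as follows. This is only a minimal illustration, not the code from the original post: it swaps the DataFrame route for Python's standard csv and sqlite3 modules so that it runs without extra dependencies. The helper names (normalize_rows, write_csv, write_sqlite), the table name pdf_table, and the sample rows are all hypothetical. It also assumes that the first row of each table is a header with unique, non-empty column names, that every row has the same number of cells, and that empty cells may come back as None.

```python
import csv
import sqlite3


def normalize_rows(table):
    """Replace None cells with '' and collapse line breaks inside cells."""
    return [
        ["" if cell is None else " ".join(str(cell).split()) for cell in row]
        for row in table
    ]


def write_csv(table, csv_path):
    """Write one table (header row first) to a CSV file."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(normalize_rows(table))


def write_sqlite(table, db_path, table_name="pdf_table"):
    """Create a text-only table from the header row and insert the data rows."""
    header, *body = normalize_rows(table)
    # Assumption: header names are unique, non-empty and contain no double quotes.
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(f'CREATE TABLE IF NOT EXISTS "{table_name}" ({columns})')
            conn.executemany(
                f'INSERT INTO "{table_name}" VALUES ({placeholders})', body
            )
    finally:
        conn.close()


# Hypothetical sample in the same list-of-rows shape as the printed output;
# real rows would come from page.extract_tables().
sample_table = [
    ["date", "amount", "remark"],
    ["2022-07-01", "100.00", None],
    ["2022-07-02", "-35.50", "transfer\nout"],
]
```

In the loop from section 2.2, each table returned by page.extract_tables() could be passed to write_csv or write_sqlite in place of the print(row) calls. If you prefer the DataFrame route, the same normalized rows can be handed to pandas instead.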