Extracting tables from a PDF in one step
2022-07-06 09:13:00 【zkkkkkkkkkkkkk】
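Preface:
For work, I needed to export the tables in a PDF, unchanged, to CSV or to database tables, which kicked off a tedious round of research. The Python libraries I found that can read tables from editable (text-based) PDFs include pdfminer3k, tabula, and pdfplumber. All three have flaws, but in terms of ease of use I lean toward pdfplumber, which I find simple and straightforward for getting the job done. The rest of this article introduces pdfplumber; if you are interested in the other two libraries, you can look them up yourself. For non-editable PDFs (tables embedded in images, which require image-based table recognition), this library probably won't be of much help.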
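1. Introduction to pdfplumber
1.1 Overview
First, the official description: pdfplumber lets you plumb a PDF for detailed information about each text character, rectangle, and line. It also offers table extraction and visual debugging. It works best on machine-generated PDFs rather than scanned ones. Overall, pdfplumber is a PDF processing tool that combines several functions in one package.
1.2 Source code repository
1.3 Official documentation
1.4 Installation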
pip install pdfplumber
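2. Basic usage
2.1 Dataset
The data is a transaction statement whose PDF tables are editable (text-based). The goal is to extract the data from those tables.
2.2 Implementation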
import pdfplumber

# path = 'D:\\202104147187110045_1.pdf'
path = '../recognize_img/demo_img/有框表格可编辑.pdf'
pdf = pdfplumber.open(path)
# pdf.pages is the list of page objects in the PDF
print(pdf.pages)  # [<Page:1>]
count = 0
for page in pdf.pages:
    count += 1
    # page.extract_text() returns all text on the current page;
    # it is commented out here because the output is long.
    # print(page.extract_text())
    for table in page.extract_tables():
        # Each table is a list of rows; each row is a list of cell values
        for row in table:
            print(row)
    print(f'============ Finished parsing page {count} ============')
    # Convert to a DataFrame for output
    # pass
pdf.close()
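Each table returned by extract_tables() is a list of rows, and each row is a list of cell values; empty cells come back as None, and text that wraps inside a cell may contain line breaks. Before exporting, it usually helps to normalize the rows. The following is a minimal sketch of such a cleanup step, not part of pdfplumber itself: clean_table is a hypothetical helper name, and the sample rows are made up to mimic the shape of the output.

```python
def clean_table(table):
    """Normalize one table shaped like the output of page.extract_tables().

    `table` is a list of rows; each row is a list of str or None.
    (clean_table is a hypothetical helper, not a pdfplumber function.)
    """
    cleaned = []
    for row in table:
        # Replace None with '', flatten line breaks inside a cell, trim spaces.
        cells = [(cell or '').replace('\n', ' ').strip() for cell in row]
        # Skip rows that are completely empty after cleaning.
        if any(cells):
            cleaned.append(cells)
    return cleaned


# Made-up sample shaped like pdfplumber output (not real data).
sample = [
    ['Date', 'Amount', 'Remark'],
    ['2021-04-14', '100.00', None],
    [None, None, None],
    ['2021-04-15', '25.50', 'transfer\nfee'],
]
print(clean_table(sample))
```

In the loop above, this would be applied as clean_table(table) before printing or storing the rows.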
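2.3 Output
The result is printed one row at a time, each row as a list. If you need CSV or database output, first convert the rows to a DataFrame and then write it to the target.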
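As a rough illustration of that export step, the sketch below writes extracted rows to a CSV file and to a database table. Note the substitution: instead of a pandas DataFrame it uses only the standard-library csv and sqlite3 modules, so it runs without extra dependencies. The function names rows_to_csv and rows_to_sqlite, the table name statement_rows, and the sample rows are all assumptions made for this example, and the first row is assumed to be the header.

```python
import csv
import os
import sqlite3
import tempfile


def rows_to_csv(rows, csv_path):
    """Write a header row plus data rows to a CSV file."""
    # utf-8-sig adds a BOM, which helps Excel detect UTF-8 text.
    with open(csv_path, 'w', newline='', encoding='utf-8-sig') as f:
        csv.writer(f).writerows(rows)


def rows_to_sqlite(rows, conn, table_name):
    """Create a TEXT-column table from the header row and insert the data rows.

    Assumes the header and table names are trusted (they are only quoted,
    not validated) and that every row has as many cells as the header.
    """
    header, data = rows[0], rows[1:]
    columns = ', '.join(f'"{name}" TEXT' for name in header)
    placeholders = ', '.join('?' for _ in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table_name}" ({columns})')
    conn.executemany(
        f'INSERT INTO "{table_name}" VALUES ({placeholders})', data
    )
    conn.commit()


# Made-up rows shaped like the printed output (not real data).
rows = [
    ['trade_date', 'amount', 'remark'],
    ['2021-04-14', '100.00', 'deposit'],
    ['2021-04-15', '25.50', 'fee'],
]

csv_path = os.path.join(tempfile.mkdtemp(), 'statement.csv')
rows_to_csv(rows, csv_path)

conn = sqlite3.connect(':memory:')  # swap in a file path for a persistent database
rows_to_sqlite(rows, conn, 'statement_rows')
print(conn.execute('SELECT COUNT(*) FROM "statement_rows"').fetchone()[0])
```

All columns are stored as TEXT here because the extracted cells are strings; converting amounts or dates to proper types would be a separate step.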