Extracting tables from a PDF in one step
2022-07-06 09:13:00 【zkkkkkkkkkkkkk】
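Preface:
For work, I needed to export the tables in a PDF, unchanged, to CSV or to a database table, which set me off on a painful round of research. From what I found, the Python libraries that can currently read tables from editable PDFs include pdfminer3k, tabula, and pdfplumber. All three have flaws, but for ease of use I lean towards pdfplumber: in my experience it is simple and quick to get working. The rest of this article introduces pdfplumber. If you are interested in the other two libraries, you can look them up yourself. For non-editable PDFs (recognizing tables inside images), this library probably won't be of much help.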
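1. Introduction to pdfplumber
1.1 Overview
First, the official description: plumb a PDF for detailed information about each text character, rectangle, and line. Plus: table extraction and visual debugging. Works best on machine-generated, rather than scanned, PDFs. Overall, pdfplumber is a PDF processing tool that combines many functions in one.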
1.2 Open-source Git repository
1.3 Official documentation
1.4 Installation
pip install pdfplumber
2. Basic usage
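2.1 About the data
The data is a transaction statement, and the tables in the PDF are editable. The goal is to extract the data from those tables.
2.2 Implementation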
import pdfplumber

# path = 'D:\\202104147187110045_1.pdf'
path = '../recognize_img/demo_img/有框表格可编辑.pdf'
pdf = pdfplumber.open(path)
# pdf.pages is the list of page objects in the PDF
print(pdf.pages)  # [<Page:1>]
count = 0
for page in pdf.pages:
    count += 1
    # page.extract_text() returns all the text on the current page;
    # it is commented out here because the output is long.
    # print(page.extract_text())
    # extract_tables() returns every table on the page as a list of rows
    for table in page.extract_tables():
        for row in table:
            print(row)
    print(f'============ Finished parsing page {count} ============')
    # Convert to a DataFrame for output
    # pass
pdf.close()
2.3 Output
Each table row is printed as a list. If you need CSV or a database table, you can first convert the rows into a DataFrame and then write it to the target.
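As a rough illustration of that last step, here is a minimal sketch that takes one table in the list-of-rows shape printed above and writes it to CSV and to a database table. It makes several assumptions that are not from the original post: it uses the standard library's csv and sqlite3 modules instead of a DataFrame, it treats the first row as the header, the sample data is made up, and the names clean_table, write_csv, write_sqlite, and the table name transactions are hypothetical.

```python
import csv
import io
import sqlite3


def clean_table(table):
    """Split a table into (header, rows) and normalise every cell to a string.

    Assumes the first row is the header. A cell can be None (for example a
    merged cell) and wrapped text can contain line breaks; both are flattened.
    """
    cleaned = [
        ["" if cell is None else str(cell).replace("\n", " ").strip() for cell in row]
        for row in table
    ]
    return cleaned[0], cleaned[1:]


def write_csv(header, rows, out):
    """Write the header and the data rows to a file-like object as CSV."""
    writer = csv.writer(out)
    writer.writerow(header)
    writer.writerows(rows)


def write_sqlite(header, rows, conn, table_name="transactions"):
    """Create a table with one TEXT column per header cell and insert the rows."""
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table_name}" ({columns})')
    conn.executemany(f'INSERT INTO "{table_name}" VALUES ({placeholders})', rows)
    conn.commit()


# Made-up sample in the same shape as one table from page.extract_tables().
sample_table = [
    ["date", "amount", "note"],
    ["2021-04-14", "100.00", None],
    ["2021-04-15", "-20.50", "transfer\nout"],
]

header, rows = clean_table(sample_table)

# An in-memory buffer stands in for a real file opened with newline="".
buffer = io.StringIO()
write_csv(header, rows, buffer)
csv_text = buffer.getvalue()

# An in-memory database stands in for a real database file.
conn = sqlite3.connect(":memory:")
write_sqlite(header, rows, conn)
stored = conn.execute('SELECT * FROM "transactions"').fetchall()
```

If you prefer the DataFrame route mentioned above, the same header and rows could probably be passed to pandas as pd.DataFrame(rows, columns=header) and written out with to_csv; the sketch avoids pandas only to stay dependency-free.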