One-click extraction of tables from a PDF
2022-07-06 09:13:00 【zkkkkkkkkkkkkk】
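Preface:
For work, I needed to export the tables in a PDF, unchanged, to CSV or to a database table, and so began a tedious round of research. The Python libraries I found that can read tables from an editable (text-based) PDF include pdfminer3k, tabula, and pdfplumber. All three have flaws, but for ease of use I lean toward pdfplumber: in my experience it is simple and gets the job done with little code. The rest of this post introduces pdfplumber; if you are interested in the other two libraries, please consult their own documentation. For non-editable PDFs, where the table is embedded as an image and has to be recognized, this library probably cannot help you.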
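1. Introduction to pdfplumber
1.1 Overview
First, the official description: pdfplumber lets you plumb a PDF for detailed information about each text character, rectangle, and line. It also offers table extraction and visual debugging. It works best on machine-generated PDFs rather than scanned ones. Overall, pdfplumber is a PDF processing tool that combines several functions in one package.
1.2 Source code repository (git)
1.3 Official documentation
1.4 Installation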
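pip install pdfplumber
2. Basic usage
2.1 Dataset
The data is a set of transaction records stored as tables in an editable PDF. The goal is to extract the data from those tables.
2.2 Implementation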
import pdfplumber

# path = 'D:\\202104147187110045_1.pdf'
path = '../recognize_img/demo_img/有框表格可编辑.pdf'

# Open the PDF in a with-block so the file is closed automatically
with pdfplumber.open(path) as pdf:
    # pdf.pages is the list of page objects
    print(pdf.pages)  # [<Page:1>]
    for count, page in enumerate(pdf.pages, start=1):
        # page.extract_text() returns all the text on the current page;
        # it is commented out here because the output is long.
        # print(page.extract_text())
        for table in page.extract_tables():
            for row in table:
                print(row)
        print(f'============ Finished parsing page {count} ============')
        # Convert to a DataFrame for output
        # pass

2.3 Output
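Each table row is printed as a Python list. If you need CSV output or a database table, first convert the extracted rows to a DataFrame and then write it to the target.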
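As a lighter alternative to the DataFrame route, the extracted rows can also be written to CSV with the standard library alone. The following is a minimal sketch, not the author's code: the helper names `clean_row` and `table_to_csv` are hypothetical, the sample rows are made up to mimic the shape of one table from `page.extract_tables()`, and it assumes that empty cells arrive as `None` and that line breaks inside a cell should be collapsed to spaces.

```python
import csv
import io


def clean_row(row):
    """Replace None cells with '' and collapse whitespace (including line breaks) inside cells."""
    return ["" if cell is None else " ".join(str(cell).split()) for cell in row]


def table_to_csv(table, out):
    """Write one extracted table (a list of rows) to a file-like object as CSV."""
    writer = csv.writer(out)
    for row in table:
        writer.writerow(clean_row(row))


# Made-up rows shaped like one table returned by page.extract_tables()
sample_table = [
    ["Date", "Amount", "Note"],
    ["2021-04-14", "100.00", "transfer\nin"],
    ["2021-04-15", None, "fee"],
]

# Write to an in-memory buffer here; a real file works the same way
buffer = io.StringIO()
table_to_csv(sample_table, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, pass a handle opened with `open("out.csv", "w", newline="", encoding="utf-8")` instead of the buffer. If you prefer the DataFrame route and the first row is the header, `pandas.DataFrame(table[1:], columns=table[0])` builds the frame from the same rows.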