One click extraction of tables in PDF
2022-07-06 11:07:00 【zkkkkkkkkkkkkk】
Preface
For work, I needed to extract the tables in PDF files as-is and save them as CSV files or database tables, which pushed me into some research. From what I found, the Python libraries that currently support reading tables from editable PDFs include pdfminer3k, tabula, and pdfplumber. All three have shortcomings, but I lean toward pdfplumber, which I find simple to use and easy to get working. The rest of this article introduces pdfplumber; if you are interested in the other two libraries, you can look them up yourself. For non-editable PDFs (recognizing tables inside images), this library probably cannot help you.
1. Introduction to pdfplumber
1.1 Overview
Let's start with the official description: pdfplumber lets you dig into a PDF for detailed information about each text character, rectangle, and line. It also offers table extraction and visual debugging. It works best on machine-generated PDFs rather than scanned ones. Overall, pdfplumber is a multi-purpose PDF processing tool.
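To make the description above more concrete, here is a minimal sketch of how those page-level objects and the visual debugging feature might be used. The helper names `summarize_page` and `save_table_debug_image` and the stand-in `fake_page` are made up for illustration; `page.chars`, `page.rects`, `page.lines`, `page.to_image()`, `debug_tablefinder()`, and `save()` are the pdfplumber calls it relies on.

```python
from types import SimpleNamespace


def summarize_page(page):
    """Count the low-level objects pdfplumber exposes on a page."""
    return {
        "chars": len(page.chars),  # one dict per text character
        "rects": len(page.rects),  # one dict per rectangle
        "lines": len(page.lines),  # one dict per straight line
    }


def save_table_debug_image(pdf_path, out_path, page_index=0):
    """Render one page with the detected table structure drawn on top."""
    import pdfplumber  # imported lazily so summarize_page works without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        print(summarize_page(page))
        image = page.to_image(resolution=150)
        image.debug_tablefinder()  # overlay the lines and cells the table finder sees
        image.save(out_path)


# Hypothetical stand-in for a real page, only to show the shape of the summary.
fake_page = SimpleNamespace(chars=[{}, {}], rects=[{}], lines=[])
page_summary = summarize_page(fake_page)
```

With a real file, something like `save_table_debug_image('statement.pdf', 'page1_debug.png')` (both paths are placeholders) should write an image that shows which ruling lines the table finder picked up, which helps when a table comes out with the wrong columns.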
1.2 Source code repository (Git)
1.3 Official documentation
1.4 Installation
pip install pdfplumber
2. Basic usage
2.1 About the data
The data is a transaction statement, and the tables in the PDF are editable. The goal is to extract the data from those tables.
2.2 Code
import pdfplumber

# path = 'D:\\202104147187110045_1.pdf'
path = '../recognize_img/demo_img/ A framed table can be edited .pdf'

# The with block closes the PDF automatically when done
with pdfplumber.open(path) as pdf:
    # pdf.pages is the list of page objects
    print(pdf.pages)  # [<Page:1>]
    for count, page in enumerate(pdf.pages, start=1):
        # page.extract_text() returns all the text on the current page;
        # it is commented out here because the output is long.
        # print(page.extract_text())
        for table in page.extract_tables():
            for row in table:
                print(row)
            # Convert the table to a DataFrame here if needed
        print(f'============ Finished parsing page {count} ============')
2.3 Output
Each row is printed as a list. If you need CSV output or a database table, you can first convert this data into a DataFrame and then write it to the target.
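The conversion step described above could look roughly like the sketch below. It uses the standard-library `csv` module instead of a DataFrame so that it has no extra dependencies, and it assumes the first row of each table is the header, which may not hold for every PDF. The function names and the `sample_table` rows are hypothetical; the only thing taken from pdfplumber is the shape of the data returned by `page.extract_tables()`, a list of rows whose cells are strings or `None`.

```python
import csv


def clean_cell(cell):
    """Turn None into '' and collapse line breaks inside a cell."""
    if cell is None:
        return ""
    return " ".join(str(cell).split())


def table_to_records(table):
    """Treat the first row as the header and return one dict per data row."""
    header = [clean_cell(c) for c in table[0]]
    return [
        dict(zip(header, (clean_cell(c) for c in row)))
        for row in table[1:]
    ]


def write_table_csv(table, csv_path):
    """Write one extracted table to a CSV file, cleaning each cell first."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for row in table:
            writer.writerow([clean_cell(c) for c in row])


# Hypothetical rows in the shape returned by page.extract_tables()
sample_table = [
    ["Date", "Amount"],
    ["2021-04-14", "100.00"],
    [None, "line1\nline2"],
]
records = table_to_records(sample_table)
```

If pandas is available, `pandas.DataFrame(table[1:], columns=table[0])` followed by `to_csv(...)` or `to_sql(...)` would be the DataFrame route mentioned in the text, under the same first-row-is-header assumption.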