One click extraction of tables in PDF
2022-07-06 11:07:00 【zkkkkkkkkkkkkk】
Preface:
For work, I needed to export the tables in PDF files to CSV files or database tables, which pushed me into some research. From what I found, the Python libraries that currently support reading tables from editable (text-based) PDFs include pdfminer3k, tabula, and pdfplumber. All three have shortcomings, but I lean towards pdfplumber because I find it simple and its features easy to use. The rest of this post introduces pdfplumber; if you are interested in the other two libraries, you can look up the relevant material yourself. For non-editable PDFs (that is, tables that exist only as images), this library probably cannot help you.
1. Introduction to pdfplumber
1.1 Overview
Let's start with the official description: pdfplumber lets you plumb a PDF for detailed information about each text character, rectangle, and line. It also provides table extraction and visual debugging, and it works best on machine-generated rather than scanned PDFs. Overall, pdfplumber is a multi-purpose PDF processing tool.
1.2 Source code (Git repository)
1.3 Official documentation
1.4 Installation
pip install pdfplumber
2. Basic usage
2.1 About the data
The data is a transaction statement in a PDF whose table is editable (text-based). The goal is to extract the data in that table.
2.2 Code
import pdfplumber

# path = 'D:\\202104147187110045_1.pdf'
path = '../recognize_img/demo_img/ A framed table can be edited .pdf'

# Open the PDF; the with-block closes the file when it exits
with pdfplumber.open(path) as pdf:
    # pdf.pages is the list of page objects
    print(pdf.pages)  # [<Page:1>]

    for count, page in enumerate(pdf.pages, start=1):
        # page.extract_text() returns all the text on the current page.
        # It is commented out here because the output is long.
        # print(page.extract_text())

        # Each table is a list of rows; each row is a list of cell values
        for table in page.extract_tables():
            for row in table:
                print(row)
        print(f'============ Finished parsing page {count} ============')
        # Convert to a DataFrame and write it out here if needed
2.3 Output
Each row of the table is printed as a list. If you need a CSV file or a database table, first convert this data into a DataFrame and then write it to the target.
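As a rough illustration of that last step, here is a minimal sketch using pandas (an assumption; the post does not say which DataFrame library is meant). The helper name table_to_dataframe and the sample rows are hypothetical and only stand in for one table returned by page.extract_tables(). The sketch assumes the first row of the table is the header and that empty cells may come back as None.

```python
import pandas as pd


def table_to_dataframe(table, header=True):
    """Turn one table (a list of rows, each a list of cells) into a DataFrame.

    Hypothetical helper for illustration; not part of pdfplumber.
    """
    # Drop rows in which every cell is None or blank.
    rows = [
        row for row in table
        if any(cell is not None and str(cell).strip() for cell in row)
    ]
    # Replace None with "" and collapse line breaks inside a cell to spaces.
    rows = [
        ["" if cell is None else " ".join(str(cell).split()) for cell in row]
        for row in rows
    ]
    if not rows:
        return pd.DataFrame()
    if header:
        # Assumption: the first remaining row holds the column names.
        return pd.DataFrame(rows[1:], columns=rows[0])
    return pd.DataFrame(rows)


# Made-up sample rows standing in for one table from page.extract_tables()
sample_table = [
    ["date", "amount", "remark"],
    ["2021-04-14", "100.00", "transfer\nin"],
    [None, None, None],
    ["2021-04-15", "-20.50", None],
]

df = table_to_dataframe(sample_table)
csv_text = df.to_csv(index=False)  # pass a file path instead to write a file
print(csv_text)
```

To write a file, pass a path such as df.to_csv('out.csv', index=False); for a database, pandas also provides DataFrame.to_sql, which needs a database connection that you supply.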