One click extraction of tables in PDF
2022-07-06 11:07:00 【zkkkkkkkkkkkkk】
Preface:
For work, I need to convert the tables in PDF files into CSV files or database tables, which set me on a forced research path. After some investigation, the Python libraries that currently support reading tables from editable PDFs include pdfminer3k, tabula, and pdfplumber. All three have flaws, but I lean toward pdfplumber, which I find simple and easy to use. The rest of this article introduces pdfplumber; if you are interested in the other two libraries, you can look up the relevant material yourself. For non-editable PDFs (recognizing tables inside images), this library probably can't help you.
One、Introduction to pdfplumber
1.1、Introduction
Let's start with the official description: pdfplumber lets you plumb a PDF for detailed information about each text character, rectangle, and line. Additional features: table extraction and visual debugging. It works best on machine-generated, rather than scanned, PDFs. Overall, pdfplumber is a multi-purpose PDF processing tool.
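Because pdfplumber works best on machine-generated PDFs, it can help to check whether a page carries any text characters before trying to extract tables. The following is a rough sketch of that idea, not an official recipe: `has_text_layer` and `pages_without_text` are hypothetical helper names, and the check relies only on the `chars` list that each pdfplumber page exposes. A scanned PDF that has been run through OCR would also pass this check, so treat it as a heuristic.

```python
from types import SimpleNamespace


def has_text_layer(page) -> bool:
    """Return True if the page exposes at least one text character.

    `page` is assumed to behave like a pdfplumber page, whose `chars`
    attribute is a list with one entry per text character.
    """
    return len(page.chars) > 0


def pages_without_text(path: str) -> list[int]:
    """Hypothetical helper: 1-based numbers of pages with no text characters."""
    import pdfplumber  # imported here so has_text_layer can be used without it

    with pdfplumber.open(path) as pdf:
        return [
            number
            for number, page in enumerate(pdf.pages, start=1)
            if not has_text_layer(page)
        ]


# Stand-in objects that mimic the `chars` attribute, for a quick check without a PDF.
scanned_like = SimpleNamespace(chars=[])
generated_like = SimpleNamespace(chars=[{"text": "A"}])
```

If `pages_without_text` returns page numbers for a file, those pages are likely images, and table extraction on them will probably come back empty.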
1.2、Open-source Git repository
1.3、Official documentation
1.4、Installation
pip install pdfplumber
Two、Basic usage
2.1、Dataset introduction
The data is a transaction statement stored as a PDF with an editable table. The goal is to extract the data in that table.
2.2、Code implementation
import pdfplumber

# path = 'D:\\202104147187110045_1.pdf'
path = '../recognize_img/demo_img/ A framed table can be edited .pdf'

# The with block closes the file automatically once parsing is done.
with pdfplumber.open(path) as pdf:
    # pdf.pages is the list of page objects.
    print(pdf.pages)  # [<Page:1>]
    for count, page in enumerate(pdf.pages, start=1):
        # page.extract_text() returns all the text on the current page;
        # it is commented out here because the output is long.
        # print(page.extract_text())
        for table in page.extract_tables():
            # Each table is a list of rows; each row is a list of cell values.
            for row in table:
                print(row)
        print(f'============ Finished parsing page {count} ============')
        # Convert to a DataFrame for output here if needed.
2.3、Output
Each table row is output as a list. If you need CSV or database output, you can first convert the extracted rows into a DataFrame and then write it to the target.
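As a minimal sketch of that last step, the rows can also be turned into records and written as CSV with only the standard library. Everything named here is illustrative: `table_to_records` and `write_csv` are hypothetical helpers, the sample rows are made up rather than taken from the real statement, and the code assumes the first row of each table holds the column names. Empty cells, which pdfplumber returns as None, are replaced with empty strings.

```python
import csv
import io


def table_to_records(table):
    """Turn a list of rows into a list of dicts keyed by the header row.

    Assumption: the first row holds the column names.
    """
    header, *body = table
    return [
        {name: "" if cell is None else cell for name, cell in zip(header, row)}
        for row in body
    ]


def write_csv(table, out_file):
    """Write the table to an open text file object as CSV."""
    records = table_to_records(table)
    writer = csv.DictWriter(out_file, fieldnames=table[0])
    writer.writeheader()
    writer.writerows(records)


# Made-up sample in the same shape as the printed rows; not real output.
sample_table = [
    ["date", "amount", "note"],
    ["2021-04-14", "100.00", None],
]

records = table_to_records(sample_table)
buffer = io.StringIO()
write_csv(sample_table, buffer)
csv_text = buffer.getvalue()
```

To write a real file, pass a file opened with `open(out_path, "w", newline="")` instead of the in-memory buffer. If pandas is installed, the same records can be passed to `pandas.DataFrame(records)` and written out from there, which matches the DataFrame route described above.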