One-click extraction of tables from PDFs
2022-07-06 11:07:00 【zkkkkkkkkkkkkk】
Preface:
For work, I needed to convert the tables in PDFs, unchanged, into CSV files or database tables, which forced me to do some research. From what I found, the Python libraries that currently support reading tables from editable PDFs include pdfminer3k, tabula, and pdfplumber. All three have flaws, but I lean toward pdfplumber, which I find simple and easy to use. The rest of this article introduces pdfplumber. If you are interested in the other two libraries, you can look up the relevant material yourself. For non-editable PDFs (recognizing tables inside images), this library probably cannot help you.
1. Introduction to pdfplumber
1.1 Overview
Let's start with the official description: pdfplumber lets you plumb a PDF for detailed information about each text character, rectangle, and line. Additional features: table extraction and visual debugging. It works best on machine-generated PDFs rather than scanned ones. Overall, pdfplumber is a multi-purpose PDF processing tool.
1.2 Source code (Git repository)
1.3 Official documentation
1.4 Installation
pip install pdfplumber
2. Basic usage
2.1 Dataset
The data is a transaction statement, and the tables in the PDF are editable (machine-generated, not images). The goal is to extract the data in the tables.
2.2 Code
import pdfplumber

# path = 'D:\\202104147187110045_1.pdf'
path = '../recognize_img/demo_img/ A framed table can be edited .pdf'
pdf = pdfplumber.open(path)
# Get the list of page objects in the PDF
print(pdf.pages)  # [<Page:1>]
count = 0
for page in pdf.pages:
    count += 1
    # page.extract_text() grabs all the text on the current page;
    # it is commented out here because the output is long.
    # print(page.extract_text())
    # extract_tables() returns every table on the page as a list of rows
    for table in page.extract_tables():
        for row in table:
            print(row)
    print(f'============ Finished parsing page {count} ============')
    # Convert to a DataFrame for output
    # pass
pdf.close()
2.3 Output
Each row is output as a list. If you need CSV or database output, you can first convert the data into a DataFrame and then write it to the target.
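As a rough illustration of that last step, here is a minimal sketch that writes extracted rows to CSV and to a database table. It deliberately swaps the DataFrame route for the standard library (csv and sqlite3) so that it runs without extra dependencies. The helper names (clean_table, table_to_csv, table_to_sqlite), the sample rows, and the table name are hypothetical and not from the original article. It also assumes the first row of each table is the header, which may not hold for every PDF.

```python
import csv
import io
import sqlite3


def clean_table(table):
    """Normalize one table shaped like the output of page.extract_tables():
    a list of rows, where each row is a list of cell strings or None."""
    cleaned = []
    for row in table:
        # Empty cells come back as None; cells may also contain line breaks.
        cleaned.append([
            '' if cell is None else ' '.join(cell.split())
            for cell in row
        ])
    return cleaned


def table_to_csv(table, fileobj):
    """Write the cleaned rows to an open text file object as CSV."""
    writer = csv.writer(fileobj)
    writer.writerows(clean_table(table))


def table_to_sqlite(table, conn, table_name):
    """Store the rows in a SQLite table. Assumption: the first row is the header."""
    rows = clean_table(table)
    header, body = rows[0], rows[1:]
    columns = ', '.join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table_name}" ({columns})')
    placeholders = ', '.join('?' for _ in header)
    conn.executemany(
        f'INSERT INTO "{table_name}" VALUES ({placeholders})', body
    )
    conn.commit()


# Hypothetical sample rows shaped like pdfplumber output (not the article's data).
sample_table = [
    ['Date', 'Description', 'Amount'],
    ['2021-04-14', 'Transfer\nin', '100.00'],
    ['2021-04-15', None, '-20.50'],
]

buffer = io.StringIO()
table_to_csv(sample_table, buffer)
csv_text = buffer.getvalue()
print(csv_text)

conn = sqlite3.connect(':memory:')
table_to_sqlite(sample_table, conn, 'transactions')
row_count = conn.execute('SELECT COUNT(*) FROM "transactions"').fetchone()[0]
print(row_count)
```

In the extraction loop, each table from page.extract_tables() could be passed to these helpers in place of sample_table. If pandas is available, building a DataFrame from the cleaned rows, as the text suggests, would serve the same purpose.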