Table image extraction based on traditional intersection method and Tesseract OCR
2022-07-28 04:57:00 【Chen Zing】
This article uses the traditional intersection method to extract the structure of bordered tables, mainly tables that contain merged cells, and uses tesseract-ocr for character recognition. The main difficulty lies in the structure extraction, so here I share some of the methods I used.
I suggest first reading other bloggers' posts on extracting simple tables and then coming back to this one, because I omit a lot of detail and mainly discuss the method. It is not recommended for readers with no prior exposure to the topic. One recommended article:
Because I need to make a living and earn the money for my graduation trip, only part of the content is shared here; I hope you understand. This is also my first post, so if anything falls short, please point it out.
I. Difficulties in table extraction
1. Merged cells
I have read many blog posts on CSDN. The traditional approach finds the intersections of the frame lines and uses them to extract cells, but most posts only handle simple tables. In real office work we often meet complex tables with merged cells, as in the figure below, and these need a more elaborate extraction method.

2. Image preprocessing
Image preprocessing is a crucial step. We must reduce interference in the image as far as possible so that the intersections can be extracted accurately, because the whole extraction process works from the intersection coordinates. For preprocessing details you can refer to other bloggers; it generally involves grayscale conversion, binarization, denoising, sharpening, and so on.
For example, in some images the frame lines do not quite meet where they should join, so no intersection can be extracted there. In that case you can use methods from the OpenCV library to lengthen the detected frame lines so that the intersection appears, as in the figures below.


Because image preprocessing affects both the later table structure extraction and the character recognition, it has to be tuned repeatedly until the result is good enough. I think this is also the limitation of traditional methods, which is why deep learning methods are currently more popular.
II. Table extraction process
After image preprocessing comes the main content of this article: extracting the table structure.
The whole extraction process is divided into two steps:
1. Intersection classification
2. Extract the table according to the intersection category
1. Intersection classification
This is the original image, which contains merged cells. Because this article focuses on the method rather than on image processing, a high-resolution image is used, but the importance of image preprocessing must still be stressed.

(1) After grayscale conversion and binarization

(2) Extract the horizontal and vertical lines. After erosion and dilation the lines are longer than in the original


(3) Superimpose the horizontal and vertical lines in order to obtain the intersections

(4) After superposition the intersections are obtained and marked as black dots

(5) Classify the intersections. First identify the marker point that represents each cell. For example, in the cell below, p1 is the marker point and represents the current cell.
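The article does not publish its classification rule, so the following is not the author's method. It is only a generic sketch of one way intersections are often described in tables with merged cells: by checking which of the four directions (left, right, up, down) actually has a line leaving the point. The names `junction_arms` and `probe` are hypothetical, and treating a point with both a right arm and a down arm as a possible cell corner is an assumption.

```python
import numpy as np


def junction_arms(point, horizontal: np.ndarray, vertical: np.ndarray, probe: int = 8):
    """Report which directions have a line leaving the intersection `point`.

    `point` is (x, y). `probe` is how far from the point to sample and should be
    larger than any artificial lengthening applied to the line masks.
    """
    x, y = point
    height, width = horizontal.shape

    def is_set(mask, px, py):
        # Points outside the image count as "no line".
        inside = 0 <= px < width and 0 <= py < height
        return bool(inside and mask[py, px] > 0)

    return {
        "left": is_set(horizontal, x - probe, y),
        "right": is_set(horizontal, x + probe, y),
        "up": is_set(vertical, x, y - probe),
        "down": is_set(vertical, x, y + probe),
    }


def could_be_top_left(arms: dict) -> bool:
    """Assumed rule: a cell's top-left corner needs a line going right and one going down."""
    return arms["right"] and arms["down"]
```

Where a merged cell spans several rows or columns, some grid crossings lose one or more arms, which is the kind of information a classification scheme can use to tell merged cells from ordinary ones.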

With this information the intersections in the image can be classified a first time, with the marker points shown in red:

This classification is far from enough to extract the table cells well, so the points are classified again: keep the red points and the colors of the points along the bottom and top right of the table, and set the remaining points to green, as shown:

(6) Based on the intersections of different color types, design a set of rules that yields the region of each cell.

(7) The next step is to crop each cell and save it to a folder, recording the cell's position information in preparation for the later table structure restoration and character recognition.
2. Extract the table according to the intersection category
Because this poor student needs to eat, the core methods are not given in detail. If you are interested, message me privately: I can provide the project code or walk through the whole process with a PPT presentation. Criticism is welcome. In short, poverty is the original sin.
3. Actual demonstration




