Author: Huang Xiao
1 Background
ICDAR 2021 (International Conference on Document Analysis and Recognition) was held in Switzerland from September 5 to 10, 2021. ICDAR is the top international conference in the field of document analysis and recognition. It is held every two years, covers the latest academic results and cutting-edge application trends in the field, and attracts the world's leading R&D teams, experts, and scholars. The algorithm competitions held at the conference are top-level events in the field of text recognition (OCR). In the Competition on Time-Quality Document Image Binarization (DIB), the Autohome Dealer Technology Department won second place in two subtasks.
Figure 1 Competition results and certificates
2 Competition Task
The challenge of the ICDAR 2021 DIB competition is to binarize historical document images, that is, to separate the text from the background. The evaluation metric is a weighted combination of PSNR, DRDM, F-Measure (FM), pseudo-FMeasure (Fps), and Cohen's Kappa. The difficulty is that the backgrounds of historical document images are very complex and show many kinds of degradation, which makes it hard for existing algorithms to achieve good results. Examples include stains that cover the handwriting; faded characters that become too similar to the background; ink bleed-through, where writing on the back of the page shows through on the front but must be labeled as background in the ground truth; and dark fold marks that are easily confused with text.
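Two of the metrics named above, PSNR and F-Measure, are simple to compute for binary images. The following is a minimal sketch using their standard definitions, not the competition's official scoring code. It assumes NumPy arrays with 1 marking text pixels, and the function names are illustrative. DRDM, Fps, Cohen's Kappa, and the competition's weighting are not shown because the article does not specify them.

```python
import numpy as np

def psnr(pred, gt):
    """PSNR between two binary images whose values are in {0, 1}."""
    diff = pred.astype(np.float64) - gt.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return 10.0 * np.log10(1.0 / mse)  # peak value is 1 for 0/1 images

def f_measure(pred, gt):
    """F-Measure with text pixels (value 1) as the positive class."""
    pred_fg = pred.astype(bool)
    gt_fg = gt.astype(bool)
    tp = np.logical_and(pred_fg, gt_fg).sum()
    fp = np.logical_and(pred_fg, ~gt_fg).sum()
    fn = np.logical_and(~pred_fg, gt_fg).sum()
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return float(2 * precision * recall / (precision + recall))
```

Both functions expect a prediction and a ground-truth image of the same shape.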
Figure 2 Examples of degradation in the historical document image dataset
3 Technical Approach
Traditional image binarization methods fall into three groups: global thresholding, local thresholding, and hybrids of the two. Global thresholding uses a single fixed threshold to split the document image into text foreground and background; the classic Otsu algorithm is an example. Local thresholding computes a dynamic threshold for each pixel from its local neighborhood window and classifies the pixel as foreground text or background accordingly. Traditional methods reach reasonable accuracy on document images whose backgrounds are not too complex, but they perform poorly when the image is heavily degraded (for example, page stains, bleed-through from the back of the page, or uneven lighting).
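To make the global thresholding idea concrete, here is a minimal NumPy sketch of the Otsu algorithm, which picks the threshold that maximizes the between-class variance of the gray-level histogram. It assumes an 8-bit grayscale image with dark text on a lighter background. The function name is illustrative, and this is not code from the competition entry.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the Otsu threshold of an 8-bit grayscale image (uint8 array)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                  # probability of gray level <= t
    mu = np.cumsum(prob * np.arange(256))    # cumulative mean up to t
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / denom  # between-class variance
    sigma_b[denom == 0] = 0.0                # thresholds that leave a class empty
    return int(np.argmax(sigma_b))
```

Pixels with `gray <= otsu_threshold(gray)` would then be treated as text. Because one threshold is applied to the whole page, stains or uneven lighting shift the histogram and the result degrades, which is the weakness described above.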
Deep-learning-based methods are more robust and perform well even on complex backgrounds. They treat document image binarization as an image segmentation task: a convolutional neural network classifies each pixel into one of two classes, producing a segmentation map of the whole document image divided into foreground text and background, which yields the binarization [1]. In this competition, however, each historical document image has a high resolution (the width or height is often around 3000 pixels). Because of GPU memory limits, neural network methods usually take as input an image patch cropped from the full image (for example, 128×128) rather than the full image. This cropping strategy loses the global spatial information of the document. In particular, when writing on the back bleeds through, it is hard to tell apart from real foreground text within a small patch and can be mistaken for foreground, which lowers binarization accuracy.
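The patch-based inference described above can be sketched as a crop, predict, and stitch loop. This is a minimal illustration, not the authors' implementation. `predict_patch` is a hypothetical stand-in for a trained network that returns a per-pixel foreground probability for one patch. The overlapping stride and the averaging of overlaps are assumptions, since the article does not say how patch borders are handled.

```python
import numpy as np

def window_starts(length, patch, stride):
    """Start offsets so that windows of size `patch` cover [0, length)."""
    if length <= patch:
        return [0]
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] != length - patch:
        starts.append(length - patch)  # extra window flush with the border
    return starts

def predict_full_image(image, predict_patch, patch=128, stride=64):
    """Tile a 2-D image, run `predict_patch` on each tile, average overlaps."""
    h, w = image.shape
    # Pad images smaller than one patch by repeating the edge pixels.
    pad_h, pad_w = max(patch - h, 0), max(patch - w, 0)
    padded = np.pad(image, ((0, pad_h), (0, pad_w)), mode="edge")
    ph, pw = padded.shape
    prob_sum = np.zeros((ph, pw), dtype=np.float64)
    count = np.zeros((ph, pw), dtype=np.float64)
    for y in window_starts(ph, patch, stride):
        for x in window_starts(pw, patch, stride):
            tile = padded[y:y + patch, x:x + patch]
            prob_sum[y:y + patch, x:x + patch] += predict_patch(tile)
            count[y:y + patch, x:x + patch] += 1.0
    return (prob_sum / count)[:h, :w]  # crop the padding away again
```

Each call to `predict_patch` sees only a `patch` × `patch` region, which is why bleed-through strokes can look the same as real text at this scale.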
We therefore designed a document image binarization method that combines global and local information, and it achieved good results in the competition. The architecture is shown below:
Figure 3 Binarization method combining global and local information
Our proposed architecture consists of three U-Net branches: two local U-Nets whose input sizes are 128×128 and 256×256, and a global U-Net whose input size is 512×512. The binarized outputs of the two local U-Nets are fused first, and the result is then intersected with the binarized image from the global U-Net to obtain the final binarized image.
Local U-Net
A 128×128 sliding window crops the original image into local patches. A U-Net convolutional neural network [2] produces a classification probability map for each patch, and the patch-level maps are stitched back together to cover the full image. U-Net is a deep-learning image segmentation model. We use the classic U-Net structure, which consists of an encoder and a decoder. The encoder is made of 4 repeated modules, each with two 3×3 convolutional layers and one 2×2 pooling layer. Each convolutional layer is followed by a batch normalization layer and a rectified linear unit (ReLU) activation. Along the encoder's downsampling path, the height and width of the feature map are halved while the number of channels is doubled. The decoder mirrors the encoder: the height and width of the feature map are doubled and the number of channels is halved. Skip connections between the encoder and the decoder improve segmentation accuracy. Because binarization maps every pixel of the input image to 0 or 1, the last layer of the U-Net uses a Softmax activation, so each image patch is converted into a classification probability map of the same size. Every value in the probability map lies in [0, 1], so an activation threshold is normally used to convert the map directly into a binary image of 0s and 1s. With a threshold of 0.5, for example, values greater than or equal to 0.5 become 1 and values below 0.5 become 0. To improve accuracy, we fuse models at multiple scales when extracting local information, combining the 128×128 and 256×256 patch sizes.
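The halving and doubling pattern of the encoder and decoder can be traced with a few lines of arithmetic. The sketch below is an illustration under stated assumptions. The base width of 64 channels and the bottleneck stage come from the original U-Net paper [2], not from this article, and padded convolutions are assumed so that the output matches the input size as the text states.

```python
def unet_shapes(size=128, base_channels=64, depth=4):
    """Trace (channels, height, width) through a classic U-Net."""
    encoder = []
    ch, hw = base_channels, size
    for _ in range(depth):
        encoder.append((ch, hw, hw))  # after two 3x3 conv + BN + ReLU
        ch, hw = ch * 2, hw // 2      # 2x2 pooling halves H and W; channels double
    bottleneck = (ch, hw, hw)
    decoder = []
    for skip in reversed(encoder):
        ch, hw = ch // 2, hw * 2      # upsampling doubles H and W; channels halve
        # The skip connection needs matching shapes on both sides.
        assert (ch, hw, hw) == skip
        decoder.append((ch, hw, hw))
    return encoder, bottleneck, decoder

encoder, bottleneck, decoder = unet_shapes(128)
for stage in encoder + [bottleneck] + decoder:
    print(stage)
```

Under these assumptions, the last decoder stage has the same height and width as the input patch, which is what lets the final Softmax layer emit a probability for every input pixel.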
Global U-Net
Because a local patch is much smaller than the full original image, the local branches yield a classification probability map based only on local information. To bring in global spatial context while respecting the limits of model capacity, a straightforward approach is to downsample the original image (for example, 3000×3000) to a fixed lower resolution (for example, 512×512). This approach has two drawbacks. First, document images have different aspect ratios, so uniformly resizing them to 512×512 distorts the proportions and introduces error. Second, compared with cropping patches, it reduces the number of samples available for training. We therefore downsample the original document image and crop the downsampled image with a fixed 512×512 sliding window. Each resulting patch contains enough background and foreground text to carry global spatial context.
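The global branch's input preparation can be sketched as an aspect-preserving downsample followed by 512×512 window cropping. This is only an illustration. The article does not state the downsampling factor or the window stride, so `factor=2` and `stride=256` are hypothetical values, and block averaging stands in for whatever resampling method was actually used.

```python
import numpy as np

def downsample(image, factor):
    """Average-pool a 2-D image by an integer factor (aspect ratio kept)."""
    h, w = image.shape
    h2, w2 = h // factor, w // factor
    cropped = image[:h2 * factor, :w2 * factor].astype(np.float64)
    return cropped.reshape(h2, factor, w2, factor).mean(axis=(1, 3))

def global_patches(image, factor=2, patch=512, stride=256):
    """Downsample, then cut fixed-size windows from the smaller image."""
    small = downsample(image, factor)
    h, w = small.shape
    patches = []
    # Windows that would run past the border are skipped in this sketch.
    for y in range(0, max(h - patch, 0) + 1, stride):
        for x in range(0, max(w - patch, 0) + 1, stride):
            patches.append(small[y:y + patch, x:x + patch])
    return small, patches
```

With these assumed values, a 3000×2000 page becomes 1500×1000 and yields eight 512×512 patches, each spanning a 1024×1024 region of the original. A single direct resize to 512×512 would give one distorted sample instead.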
Fusion
The results of the two local U-Nets are fused first. The 128×128 and 256×256 classification probability maps come from U-Net segmentation models with different receptive-field sizes. Averaging the two gives a classification probability map of the same size as the original document image, and applying the activation threshold of 0.5 converts it into a binary image. This binarized image comes from segmentation models that use only local information. It is then intersected with the result of the global U-Net to obtain the final binarized image.
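The fusion rule described above (average the two local maps, threshold at 0.5, then intersect with the global result) can be written in a few lines. This sketch makes three assumptions: all three probability maps have already been brought to the original image size, 1 (True) marks foreground text, and the global map uses the same 0.5 threshold. The article does not state the last two points explicitly.

```python
import numpy as np

def fuse(prob_128, prob_256, prob_global, threshold=0.5):
    """Combine two local probability maps with a global one.

    All inputs are float arrays of the same shape with values in [0, 1],
    where higher values mean "more likely foreground text".
    """
    local_prob = (prob_128 + prob_256) / 2.0   # average the two local scales
    local_mask = local_prob >= threshold       # local binarization
    global_mask = prob_global >= threshold     # global binarization
    # Intersection: a pixel is text only if both branches agree.
    return np.logical_and(local_mask, global_mask)
```

Taking the intersection means the global branch can veto pixels that the local branches wrongly mark as text, such as bleed-through strokes, but it cannot add text that the local branches missed.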
Figure 4 Binarization results on a sample image
Figure 4 shows an example of the model's binarization results on a printed document image from the dataset. When only local information is considered, that is, when the binarized image is obtained from local patches alone, content in the background region of the historical document image is easily mispredicted as foreground text. Combining global and local information distinguishes background regions from foreground text more reliably and gives better results.
4 Summary
In this competition, the Dealer Technology Department proposed an image binarization method that combines global and local features. It builds multi-scale convolutional neural networks to extract image features, uses the local branches to trace text contours accurately, and uses the global branch to better separate complex backgrounds from the text foreground, which markedly improves the binarization of text images. Image binarization is a crucial preprocessing step in image processing, and its quality strongly affects the accuracy of downstream OCR (character recognition). The results of this work improved binarization quality and provide useful experience for business scenarios such as image OCR and automatic image review.
The Dealer Technology Department has extensive experience in image OCR and automatic image review. It recognizes more than 10 million bills and receipts of various types each year, which saves the company the cost of purchasing external OCR services and better protects the personal information and data security of the company's customers and users. In addition, the telephone robots, IM chatbots, and intelligent quality-inspection technology that the department has developed with natural language processing are widely used in the department's smart products, marketing campaigns, and related products. They save substantial labor costs in lead cleaning, event invitations, and lead conversion, and, sold as commercial products, they also contribute to company revenue.
References:
[1] Jorge Calvo-Zaragoza and Antonio-Javier Gallego. A selectional auto-encoder approach for document image binarization. Pattern Recognition, 86:37-47, 2019.
[2] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234-241. Springer, 2015.