AAAI2020: Real-time Scene Text Detection with Differentiable Binarization

2022-07-01 19:38:00 【Highlight_ Jin】

Probability map: the shrunk version of the original text mask.
Threshold map: the region between the inward-shrunk and outward-expanded text boundaries (their set difference), which describes the text border more precisely.

1 Introduction

In recent years, reading text in scene images has become an active research area, driven by its wide range of practical applications such as image/video understanding, visual search, autonomous driving, and assistance for the blind. As a key component of scene text reading, scene text detection, which aims to locate the bounding box or region of each text instance, remains a challenging task because scene text often varies in scale and shape, including horizontal, multi-oriented, and curved text. Segmentation-based scene text detection has recently attracted a lot of attention because its pixel-level predictions can describe text of various shapes. However, most segmentation-based methods require complex post-processing to group the pixel-level predictions into detected text instances, which makes inference considerably slower. Take two recent state-of-the-art scene text detection methods as examples. PSENet (Wang et al., 2019a) proposes progressive scale expansion as a post-processing step to improve detection accuracy; Pixel Embedding (Tian et al., 2019) clusters pixels based on the segmentation results, which requires computing feature distances between pixels.

Most existing detection methods use a similar post-processing pipeline, shown in Figure 2 (following the blue arrows). First, they set a fixed threshold to convert the probability map produced by the segmentation network into a binary image; then, heuristic techniques such as pixel clustering group the pixels into text instances. In contrast, our pipeline (following the red arrows in Figure 2) inserts the binarization operation into the segmentation network for joint optimization. In this way, the threshold at every location of the image can be predicted adaptively, which can fully distinguish foreground pixels from background pixels. However, the standard binarization function is not differentiable, so we propose an approximate binarization function called Differentiable Binarization (DB), which is fully differentiable when trained together with the segmentation network.
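The conventional pipeline (blue arrows) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: 4-connected component labeling stands in for the heuristic grouping step and is much simpler than the post-processing used by PSENet or Pixel Embedding, and the names `binarize_fixed` and `group_pixels` as well as the threshold 0.5 are hypothetical choices, not taken from the paper.

```python
from collections import deque


def binarize_fixed(prob_map, t=0.5):
    """Step 1: convert a probability map to a binary map with one fixed threshold."""
    return [[1 if p >= t else 0 for p in row] for row in prob_map]


def group_pixels(binary):
    """Step 2 (stand-in): group foreground pixels into instances.

    Uses 4-connected component labeling via breadth-first search.
    Returns (label_map, instance_count); label 0 means background.
    """
    h = len(binary)
    w = len(binary[0]) if h else 0
    labels = [[0] * w for _ in range(h)]
    count = 0
    for i in range(h):
        for j in range(w):
            if binary[i][j] == 1 and labels[i][j] == 0:
                count += 1
                labels[i][j] = count
                queue = deque([(i, j)])
                while queue:
                    y, x = queue.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] == 1 and labels[ny][nx] == 0):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count


# Toy 2x4 probability map with two separate high-probability regions.
prob = [[0.9, 0.8, 0.1, 0.7],
        [0.2, 0.1, 0.1, 0.9]]
binary = binarize_fixed(prob)
labels, count = group_pixels(binary)
```

On this toy input the two regions are separated only because a low-probability column lies between them; with a single global threshold, touching instances would merge, which is the limitation the adaptive threshold is meant to address.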

The main contribution of this paper is the proposed differentiable DB module, which makes the binarization process end-to-end trainable within a CNN. By combining a simple semantic segmentation network with the proposed DB module, we obtain a robust and fast scene text detector. From the performance evaluation of the DB module, we find that our detector has several prominent advantages over previous state-of-the-art segmentation-based methods:

Our method achieves consistently better performance on five scene text benchmark datasets, covering horizontal, multi-oriented, and curved text.
Our method runs faster than previous leading methods, because DB provides a highly robust binarization map, which greatly simplifies post-processing.
DB works well with a lightweight backbone and significantly improves detection performance with a ResNet-18 backbone.
Because DB can be removed at inference time without sacrificing performance, it adds no extra memory or time cost at test time.

2 Related work

3 Methodology

The architecture of our proposed method is shown in Figure 3. First, the input image is fed into a feature-pyramid backbone. Second, the pyramid features are upsampled to the same scale and concatenated to produce the feature F. Then, F is used to predict both the probability map (P) and the threshold map (T). After that, the approximate binary map ($\hat{B}$) is computed from P and T. During training, supervision is applied to the probability map, the threshold map, and the approximate binary map, where the probability map and the approximate binary map share the same supervision. During inference, bounding boxes can be obtained easily from either the approximate binary map or the probability map by a box-formation module.

3.1 Binarization

Standard binarization. Given a probability map $P \in \mathbb{R}^{H \times W}$ produced by a segmentation network, where $H$ and $W$ denote the height and width of the map, it must be converted into a binary map $B \in \mathbb{R}^{H \times W}$, in which pixels with value 1 are considered valid text areas. Usually, this binarization process can be described as follows:

$$B_{i,j} = \begin{cases} 1 & \text{if } P_{i,j} \ge t, \\ 0 & \text{otherwise,} \end{cases} \qquad (1)$$

where $t$ is the predefined threshold and $(i,j)$ denotes a coordinate point in the map.

Differentiable binarization. The standard binarization described in Equation 1 is not differentiable, so it cannot be optimized together with the segmentation network during training. To solve this problem, we propose performing binarization with an approximate step function:

$$\hat{B}_{i,j} = \frac{1}{1 + e^{-k(P_{i,j} - T_{i,j})}} \qquad (2)$$

where $\hat{B}$ is the approximate binary map, $T$ is the adaptive threshold map learned by the network, and $k$ is the amplifying factor. This approximate binarization function behaves similarly to the standard binarization function (see Figure 4) but is differentiable, so it can be optimized together with the segmentation network during training. Differentiable binarization with adaptive thresholds not only helps distinguish text regions from the background but also separates closely adjoining text instances. Some examples are illustrated in Figure 7.
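To make the difference between the two functions concrete, the following minimal Python sketch evaluates the hard threshold of Equation 1 and the approximate step function of Equation 2 on single pixel values. The function names are illustrative and do not come from the paper; k = 50 is the value the DB paper reports setting empirically, and the fixed threshold t = 0.5 is an arbitrary choice for the demonstration. The last function relies on the standard identity that a sigmoid s has derivative s(1 - s), so the gradient of Equation 2 with respect to P is k · B̂ · (1 - B̂), which is defined everywhere, unlike the gradient of the hard threshold.

```python
import math


def standard_binarize(p, t=0.5):
    """Equation 1: hard threshold, 1 if p >= t else 0 (not differentiable at t)."""
    return 1.0 if p >= t else 0.0


def differentiable_binarize(p, t, k=50.0):
    """Equation 2: approximate step function 1 / (1 + exp(-k * (p - t)))."""
    return 1.0 / (1.0 + math.exp(-k * (p - t)))


def db_gradient_wrt_p(p, t, k=50.0):
    """Derivative of Equation 2 with respect to p: k * B * (1 - B)."""
    b = differentiable_binarize(p, t, k)
    return k * b * (1.0 - b)


def binarize_map(prob_map, thresh_map, k=50.0):
    """Apply Equation 2 pixel-wise to same-shaped nested lists P and T."""
    return [
        [differentiable_binarize(p, t, k) for p, t in zip(p_row, t_row)]
        for p_row, t_row in zip(prob_map, thresh_map)
    ]


# A pixel well above its local threshold saturates towards 1,
# a pixel well below it saturates towards 0.
above = differentiable_binarize(0.9, 0.3)
below = differentiable_binarize(0.1, 0.3)
```

At P = T the output is exactly 0.5 and the gradient is k/4 (12.5 for k = 50), so the amplifying factor controls how sharply the function approximates a step and how strongly gradients are amplified near the decision boundary.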

3.2 Adaptive threshold

3.3 Deformable convolution

3.4 Label generation

The label generation for the probability map is inspired by PSENet (Wang et al., 2019a). Given a text image, each polygon of its text regions is described by a set of segments:

$$G = \{S_k\}_{k=1}^{n} \qquad (5)$$

where $n$ is the number of vertices, which may differ between datasets, for example 4 for the ICDAR 2015 dataset (Karatzas et al., 2015) and 16 for the CTW1500 dataset (Liu et al., 2019a). The positive area is then generated by shrinking the polygon $G$ to $G_s$ using the Vatti clipping algorithm (Vatti 1992). The shrinking offset $D$ is computed from the perimeter $L$ and area $A$ of the original polygon:

$$D = \frac{A(1 - r^2)}{L} \qquad (6)$$

where $r$ is the shrink ratio, set to 0.4 empirically.
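As a worked example of Equation 6, the sketch below computes the offset D for a polygon given as a list of (x, y) vertices. It is a minimal illustration, not the authors' code: the function names are hypothetical, the area uses the shoelace formula, and the actual shrinking of G to G_s (the Vatti clipping step) is left to a polygon clipping library and is not implemented here.

```python
import math


def polygon_area(points):
    """Area A of a simple polygon via the shoelace formula."""
    n = len(points)
    total = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def polygon_perimeter(points):
    """Perimeter L: sum of the lengths of all edges, including the closing edge."""
    n = len(points)
    total = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += math.hypot(x2 - x1, y2 - y1)
    return total


def shrink_offset(points, r=0.4):
    """Equation 6: D = A * (1 - r^2) / L, with shrink ratio r."""
    return polygon_area(points) * (1.0 - r ** 2) / polygon_perimeter(points)


# A 100 x 20 axis-aligned text box: A = 2000, L = 240.
box = [(0, 0), (100, 0), (100, 20), (0, 20)]
offset = shrink_offset(box)
```

For this box, D = 2000 × 0.84 / 240, which is approximately 7 pixels, so the shrunk polygon G_s is roughly a 86 × 6 strip inside the original box. Because D scales with A / L, thin text lines are shrunk by a smaller absolute amount than large, blocky regions.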

In a similar way, we can generate labels for the threshold map. First, the text polygon $G$ is dilated to $G_d$ with the same offset $D$. We regard the gap between $G_s$ and $G_d$ as the border of the text region, where the label of the threshold map is generated by computing the distance to the closest segment in $G$.
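The distance computation can be sketched as follows. The point-to-segment distance is standard geometry; the final normalization in `threshold_label` (1 on the original boundary of G, falling linearly to 0 at distance D) is an assumption made for illustration, since the text above only states that the label is derived from the distance to the closest segment. All function names are hypothetical.

```python
import math


def point_segment_distance(px, py, ax, ay, bx, by):
    """Euclidean distance from point (px, py) to the segment from (ax, ay) to (bx, by)."""
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: both endpoints coincide.
        return math.hypot(px - ax, py - ay)
    # Projection parameter of the point onto the segment, clamped to [0, 1].
    u = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    u = max(0.0, min(1.0, u))
    return math.hypot(px - (ax + u * dx), py - (ay + u * dy))


def distance_to_polygon(px, py, polygon):
    """Distance from a point to the closest edge (segment S_k) of polygon G."""
    n = len(polygon)
    best = float("inf")
    for i in range(n):
        ax, ay = polygon[i]
        bx, by = polygon[(i + 1) % n]
        best = min(best, point_segment_distance(px, py, ax, ay, bx, by))
    return best


def threshold_label(px, py, polygon, d):
    """Assumed normalization: 1 on the boundary of G, 0 at distance >= d from it."""
    return max(0.0, 1.0 - distance_to_polygon(px, py, polygon) / d)


# Same 100 x 20 box as a text polygon, with offset D = 7.
box = [(0, 0), (100, 0), (100, 20), (0, 20)]
on_border = threshold_label(50, 0, box, 7.0)
halfway_in = threshold_label(50, 3.5, box, 7.0)
center = threshold_label(50, 10, box, 7.0)
```

With this normalization, pixels on the boundary of G get the largest label, pixels halfway across the band get 0.5, and pixels farther than D from the boundary (such as the box center here) get 0, on both the inside and the outside of the polygon.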