
Classic paper in the field of text recognition: ASTER

2022-06-25 07:13:00 Python's path to becoming a God

Method overview

This paper mainly tackles the recognition of irregularly arranged text. It is an improved version of the authors' earlier CVPR 2016 paper (Robust Scene Text Recognition with Automatic Rectification, abbreviated as RARE).

1. Main idea

  • For irregular text, first rectify the text into a normal, linear arrangement, then recognize it;
  • Integrate the rectification network and the recognition network into a single network and train it end to end;
  • The rectification network uses an STN (spatial transformer network); the recognition network uses a classic sequence-to-sequence model with attention.

2. Method framework and process

The full name of ASTER is Attentional Scene TExt Recognizer with Flexible Rectification. It consists of two modules, one for rectification (the rectification network) and one for recognition (the recognition network), as shown in the figure below.

Overview of model structure

ASTER is a paper published in 2018; its full title is "ASTER: An Attentional Scene Text Recognizer with Flexible Rectification". ASTER follows an encoder-decoder approach, and the overall model architecture consists of the following three parts:

  1. TPS (Thin-Plate Spline): divided into a localization network and a grid sampler; the former regresses the control points, and the latter performs grid sampling on the original image;

  2. encoder: the convolutional network is a ResNet, followed by a BiLSTM as the sequence model. Note that the later DTRB paper separates this sequence-modeling stage out explicitly; here we stay consistent with the original paper;

  3. decoder: a decoder based on Bahdanau attention. Two LSTM decoders are used, one decoding from left to right and one from right to left, i.e., bidirectional decoding (a minimal composition sketch of these three parts follows this list).
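To make this three-part structure concrete, here is a minimal composition sketch in PyTorch. The module names (tps, encoder, decoder) are placeholders for the components described above, not the original ASTER implementation:

```python
import torch.nn as nn

class ASTERSketch(nn.Module):
    """Hypothetical skeleton of the three-stage ASTER pipeline."""
    def __init__(self, tps, encoder, decoder):
        super().__init__()
        self.tps = tps          # rectification: localization network + grid sampler
        self.encoder = encoder  # ResNet + BiLSTM -> feature sequence of shape (B, W, C)
        self.decoder = decoder  # attention decoder (bidirectional in the paper)

    def forward(self, images, targets=None):
        rectified = self.tps(images)            # rectified images (N, C, H_out, W_out)
        features = self.encoder(rectified)      # sequence features (B, W, C)
        return self.decoder(features, targets)  # character logits / predictions
```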

2.2 Rectification module

From the overview of the model structure, ASTER actually has a lot in common with FAN; the biggest difference is the TPS module. So let's focus on how this module implements text rectification. First, the overall TPS pipeline: given an input image I of shape (N, C, H_in, W_in), we downsample it to obtain I_d, then pass I_d through the localization network to obtain the control points C'. With C' we can derive a TPS matrix transformation T. Next, the grid generator produces a grid P of shape (N, H_out, W_out, 2), where the last dimension of size 2 holds the (x, y) coordinates. We then use the matrix transformation T to map the grid P onto the original image, obtaining P', whose shape is still (N, H_out, W_out, 2). Finally, sampling the original image at the grid P' yields the rectified image I_r. Let's explain each step in turn.
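Before going through the individual steps, here is a minimal sketch of this shape flow. The callables localization_net, tps_map, and make_grid are hypothetical stand-ins for the pieces described in the following subsections, and the downsampling size is an assumption:

```python
import torch.nn.functional as F

def tps_forward(image, localization_net, tps_map, make_grid, out_size=(32, 100)):
    """Hypothetical walk-through of the TPS stages described above.

    image:            I, shape (N, C, H_in, W_in)
    localization_net: downsampled image -> control points C' of shape (N, K, 2)
    tps_map:          (C', P) -> P', i.e. solves for T from C and C' and applies it to P
    make_grid:        out_size -> base grid P of shape (H_out, W_out, 2), values in [0, 1]
    """
    # 1. Downsample I to I_d before the localization network (exact size is an assumption).
    image_d = F.interpolate(image, size=(32, 64), mode='bilinear', align_corners=False)
    # 2. Predict the control points C' on the original image.
    ctrl_src = localization_net(image_d)          # C': (N, K, 2)
    # 3. Grid generator: base grid P on the rectified image.
    grid = make_grid(out_size).to(image)          # P: (H_out, W_out, 2)
    # 4. Map P through the TPS transform to sampling positions P' on I.
    grid_src = tps_map(ctrl_src, grid)            # P': (N, H_out, W_out, 2), in [0, 1]
    # 5. Sample I at P'; grid_sample expects coordinates in [-1, 1].
    return F.grid_sample(image, grid_src * 2 - 1, align_corners=False)  # I_r
```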

2.2.1 Localization Network

The localization network is a convolutional neural network built entirely from 3x3 conv blocks, with fully connected layers at the end that output the control points C', of shape (20, 2). The 20 corresponds to 10 points each on the upper and lower edges, and the second dimension holds the (x, y) coordinates. One detail worth attention here is the numerical initialization of the last fully connected layer. The author shows that when the bias of this layer is initialized to [(0.01, 0.01), (0.02, 0.01), ..., (0.01, 0.99), ..., (0.99, 0.99)], i.e., when the initial points sample the upper and lower edges of the image equidistantly, the model converges faster.

  • Input of the localization network (at test time, after training) is the uncorrected image to be recognized; the output is the locations of the K control points.
  • Training: the localization network is not trained with K annotated control points. Instead, it is connected to the subsequent Grid Generator + Sampler and trained end to end using the final recognition result.
  • Network structure: a simple, self-designed convolutional network (6 convolution layers + 5 max-pooling layers + 2 fully connected layers) predicts the locations of the K control points (K = 20). The point correspondence is shown in the figure below; a minimal implementation sketch follows this list.
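A minimal sketch of such a localization network, assuming K = 20 control points in [0, 1] coordinates. The layer sizes and the equidistant initialization (a linspace with a 0.01 margin) are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np
import torch
import torch.nn as nn

class LocalizationNetwork(nn.Module):
    """Predicts K = 20 control points (x, y) in [0, 1] on the input image."""
    def __init__(self, in_channels=3, k=20):
        super().__init__()
        def block(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                                 nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
                                 nn.MaxPool2d(2))
        self.conv = nn.Sequential(                      # 6 conv layers, 5 max-poolings
            block(in_channels, 32), block(32, 64), block(64, 128),
            block(128, 256), block(256, 256),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1))                    # simplification before the FC layers
        self.fc1 = nn.Sequential(nn.Linear(256, 512), nn.ReLU(inplace=True))
        self.fc2 = nn.Linear(512, k * 2)
        self.k = k
        # Bias initialization: equidistant points along the top and bottom edges
        # (10 per edge, with a small margin). The forward pass ends with a sigmoid,
        # so we store the logits of those points; with zero weights the initial
        # prediction is exactly this point set.
        xs = np.linspace(0.01, 0.99, k // 2)
        pts = np.concatenate([np.stack([xs, np.full_like(xs, 0.01)], 1),
                              np.stack([xs, np.full_like(xs, 0.99)], 1)]).astype(np.float32)
        self.fc2.weight.data.zero_()
        self.fc2.bias.data = torch.from_numpy(np.log(pts / (1 - pts))).view(-1)

    def forward(self, x):                                # x: downsampled image (N, C, H, W)
        feat = self.conv(x).flatten(1)                   # (N, 256)
        points = torch.sigmoid(self.fc2(self.fc1(feat))) # keep coordinates in [0, 1]
        return points.view(-1, self.k, 2)                # C': (N, 20, 2)
```

Note that the text above lists the initial bias values directly; when a sigmoid is applied to the output, implementations typically store the inverse sigmoid (logit) of those values instead, which is what this sketch does.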

2.2.2 Thin Plate Transformation

From the localization network we obtain C'. We also construct C by equidistant sampling; C has the same shape as C', but the spacing between adjacent points is 0.05 rather than 0.01. Next, the transformation matrix T is obtained from C and C' by solving a small linear system (the exact formula is given in the paper; a numerical sketch appears at the end of this subsection).

The input of the grid generator is the set of known control points plus the coordinates of a point on the rectified image (the rectified image has not been generated yet, but since its size is fixed we can already enumerate its points); the output is the coordinates of that point on the image before rectification (the original image).

The grid generator can be regarded as a matrix transformation (the transformation parameters a0-a2, b0-b2 can be obtained by solving an optimization problem from the control points, because the positions of the control points on the images before and after rectification are both known, so their correspondence can be computed). At prediction time the same idea is applied: compute the positional relationship between the point to be mapped and the known control points, and obtain its position on the original image through this chain of correspondences. The figure below illustrates the correspondence: p is a point position after rectification, C are the control point positions after rectification, p' is the point position before rectification, and C' are the control point positions before rectification:
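A minimal numerical sketch of this step, using the standard thin-plate-spline kernel U(r) = r^2 log(r^2) and control points expressed in [0, 1] coordinates. This follows the general TPS formulation rather than the paper's exact notation:

```python
import torch

def tps_kernel(d2, eps=1e-6):
    """Radial basis U(r) = r^2 * log(r^2), with U(0) = 0; d2 holds squared distances."""
    return d2 * torch.log(d2 + eps)

def pairwise_sqdist(a, b):
    """(M, 2) x (K, 2) -> (M, K) squared Euclidean distances."""
    diff = a.unsqueeze(1) - b.unsqueeze(0)
    return (diff ** 2).sum(-1)

def solve_tps(ctrl_dst, ctrl_src):
    """Fit TPS parameters T that map rectified-image points onto the input image.

    ctrl_dst: (K, 2) fixed base control points C on the rectified image.
    ctrl_src: (K, 2) predicted control points C' on the input image.
    Returns T of shape (K + 3, 2).
    """
    k = ctrl_dst.size(0)
    kmat = tps_kernel(pairwise_sqdist(ctrl_dst, ctrl_dst))       # (K, K)
    pmat = torch.cat([torch.ones(k, 1), ctrl_dst], dim=1)        # (K, 3): [1, x, y]
    lhs = torch.cat([torch.cat([kmat, pmat], dim=1),
                     torch.cat([pmat.t(), torch.zeros(3, 3)], dim=1)], dim=0)
    rhs = torch.cat([ctrl_src, torch.zeros(3, 2)], dim=0)        # (K + 3, 2)
    return torch.linalg.solve(lhs, rhs)

def apply_tps(T, ctrl_dst, points):
    """Map points P on the rectified image to P' on the input image; points: (M, 2)."""
    feats = torch.cat([tps_kernel(pairwise_sqdist(points, ctrl_dst)),
                       torch.ones(points.size(0), 1), points], dim=1)  # (M, K + 3)
    return feats @ T                                             # P': (M, 2)
```

In the full pipeline, ctrl_dst plays the role of the fixed equidistant C, ctrl_src is the predicted C', and points is the flattened output grid P; the resulting P' is then rescaled to [-1, 1] and handed to the sampler described next.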

2.2.3 Sampler

The sampler takes the mapping between the fixed grid points and the original image and generates the new, rectified image. Simple interpolation is used, and positions that fall outside the image are simply clipped. In addition, the sampler uses a differentiable sampling method, which makes gradient back-propagation convenient.

The input is the original image plus the positions on the original image that correspond to the points of the rectified image; the output is the rectified image.

We first use the grid generator to obtain the grid P, then map P onto the original image to get P'. Note that the values of both P and P' lie between 0 and 1, but in the final interpolation step P' is rescaled to the range -1 to 1; this can be seen in the code below.
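A minimal sketch of this sampling step using PyTorch's differentiable grid_sample, assuming P' is given in [0, 1] as above; the padding_mode='border' choice is one way to realize the "clip outside the image" behaviour mentioned earlier:

```python
import torch
import torch.nn.functional as F

def sample_rectified(image, grid_src):
    """Bilinearly sample the original image I at the mapped grid P'.

    image:    (N, C, H_in, W_in)
    grid_src: (N, H_out, W_out, 2), (x, y) coordinates of P' in [0, 1]
    Returns the rectified image I_r of shape (N, C, H_out, W_out).
    """
    # grid_sample expects coordinates in [-1, 1]; out-of-range positions are
    # clamped to the border (i.e. "clipped").
    grid = grid_src * 2.0 - 1.0
    return F.grid_sample(image, grid, mode='bilinear',
                         padding_mode='border', align_corners=False)

# Shape example: image (2, 3, 64, 256), grid_src (2, 32, 100, 2) -> (2, 3, 32, 100)
```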

Summary: as can be seen from the figure below, TPS ultimately amounts to obtaining a transformation matrix, where C' is a parameter that is learned and C is fixed, i.e., a manually specified parameter. From C and C' we can obtain T, and the final rectified image is then produced by sampling the original image.

2.3 Feature extraction layer

The feature extraction layer of this paper is essentially the same as FAN's: a ResNet followed by a bidirectional LSTM, producing a three-dimensional feature tensor of shape (B, W, C), where B is the batch size, W is the number of time steps, and C is the number of channels. For example, according to the original paper, when the input size is (32, 100), the output is (B, 25, 512).
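A minimal sketch of this shape flow, assuming a backbone (passed in as cnn) that collapses a 32 x 100 input to a 1 x 25 feature map with 512 channels; the two-layer BiLSTM is an illustrative assumption:

```python
import torch.nn as nn

class EncoderSketch(nn.Module):
    """CNN feature map -> sequence of shape (B, W, C) -> BiLSTM features."""
    def __init__(self, cnn, cnn_channels=512, hidden=256):
        super().__init__()
        self.cnn = cnn                        # e.g. a ResNet producing (B, 512, 1, 25)
        self.rnn = nn.LSTM(cnn_channels, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)

    def forward(self, images):                   # images: (B, 3, 32, 100)
        fmap = self.cnn(images)                  # (B, 512, 1, 25), height collapsed to 1
        seq = fmap.squeeze(2).permute(0, 2, 1)   # (B, 25, 512): time steps x channels
        out, _ = self.rnn(seq)                   # (B, 25, 2 * hidden) = (B, 25, 512)
        return out
```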

2.4 Decoding layer

The decoding layer is similar to FAN's, but with two improvements. The first is changing FAN's one-way attention decoding into bidirectional attention decoding. The motivation is intuitive: when decoding a particular character, that character is related not only to the semantic information on its left but also to the information on its right. Concretely, the model decodes once from left to right and once from right to left, and the output whose log-softmax score is higher is taken as the final result. The attention itself is the same as in FAN, namely Bahdanau attention, so the formulas are not repeated here.
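A minimal illustration of merging the two directions at inference time, assuming each decoder has already produced its candidate sequence together with per-step log-softmax scores, and that the right-to-left decoder emits its sequence in reversed order; summing the per-step scores is one natural way to compare the two hypotheses:

```python
def pick_direction(l2r_seq, l2r_logprobs, r2l_seq, r2l_logprobs):
    """Choose between the left-to-right and right-to-left decoding results.

    *_seq:      list of predicted character ids
    *_logprobs: 1-D tensor of per-step log-softmax scores for that sequence
    """
    if l2r_logprobs.sum() >= r2l_logprobs.sum():
        return l2r_seq
    # The right-to-left hypothesis is stored reversed, so flip it back.
    return list(reversed(r2l_seq))
```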

The second improvement is in the final prediction. Instead of simply outputting the most probable character at each time step, this paper uses beam search, with the beam width typically set to 5.
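A compact beam-search sketch over a step-wise decoder. The callable decode_step(prev_token, state) -> (log_probs, new_state) is a hypothetical stand-in for one step of the attention decoder, not part of the original code:

```python
def beam_search(decode_step, init_state, sos_id, eos_id, beam_width=5, max_len=30):
    """Keep the `beam_width` best partial sequences at every step."""
    beams = [([sos_id], 0.0, init_state, False)]          # (tokens, score, state, finished)
    for _ in range(max_len):
        candidates = []
        for tokens, score, state, finished in beams:
            if finished:
                candidates.append((tokens, score, state, True))
                continue
            log_probs, new_state = decode_step(tokens[-1], state)   # log_probs: (vocab,)
            top_lp, top_ids = log_probs.topk(beam_width)
            for lp, idx in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((tokens + [idx], score + lp,
                                   new_state, idx == eos_id))
        # Prune to the best `beam_width` hypotheses by accumulated log-probability.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(b[3] for b in beams):
            break
    return beams[0][0]   # token ids of the best hypothesis (including <sos>/<eos>)
```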

3. Code reading

Let's focus on the TPS module and the attention decoder. The attention decoder here is still one-way; to make it bidirectional, simply reverse the order along the L dimension of (B, L, C), i.e., decode from right to left.

3.1 TPS

First, let's look at how C' is regressed; pay particular attention to how the last fully connected layer is initialized.
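The referenced code is not reproduced here; below is a minimal, self-contained sketch of just that initialization, mirroring what was described in section 2.2.1 (the sigmoid/logit detail is an implementation assumption):

```python
import numpy as np
import torch
import torch.nn as nn

k = 20                                # number of control points
fc2 = nn.Linear(512, k * 2)           # last fully connected layer of the localization network

# Desired initial C': equidistant samples of the top and bottom edges (0.01 margin).
xs = np.linspace(0.01, 0.99, k // 2)
pts = np.concatenate([np.stack([xs, np.full_like(xs, 0.01)], 1),
                      np.stack([xs, np.full_like(xs, 0.99)], 1)]).astype(np.float32)

# The output goes through a sigmoid, so store the logits of those points in the
# bias and zero the weights: the very first prediction is then exactly `pts`.
fc2.weight.data.zero_()
fc2.bias.data = torch.from_numpy(np.log(pts / (1 - pts))).view(-1)

print(torch.sigmoid(fc2(torch.zeros(1, 512))).view(k, 2)[:2])  # first two initial control points
```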

3.2 attention decoder

This implementation uses a GRU for decoding, whereas FAN uses an LSTM. In addition, this implementation reshapes the input (B, L, W) so that L becomes 1, which makes it possible to call GRU directly instead of decoding step by step with GRUCell. Personally, I find decoding with GRUCell more intuitive.
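A minimal sketch of one decoding step built on nn.GRUCell with Bahdanau-style (additive) attention; the dimensions are illustrative and the code follows the general recipe described above, not the referenced repository's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoderCell(nn.Module):
    """One step of a Bahdanau-attention decoder built on nn.GRUCell."""
    def __init__(self, num_classes, enc_dim=512, hidden=256, emb_dim=128):
        super().__init__()
        self.embed = nn.Embedding(num_classes, emb_dim)
        self.w_enc = nn.Linear(enc_dim, hidden)       # projects encoder features
        self.w_hid = nn.Linear(hidden, hidden)        # projects previous hidden state
        self.v = nn.Linear(hidden, 1, bias=False)     # scoring vector
        self.rnn = nn.GRUCell(emb_dim + enc_dim, hidden)
        self.out = nn.Linear(hidden, num_classes)

    def forward(self, prev_token, prev_hidden, enc_feats):
        """prev_token: (B,) ids; prev_hidden: (B, hidden); enc_feats: (B, W, enc_dim)."""
        # Additive (Bahdanau) attention scores over the W time steps.
        scores = self.v(torch.tanh(self.w_enc(enc_feats)
                                   + self.w_hid(prev_hidden).unsqueeze(1)))  # (B, W, 1)
        alpha = F.softmax(scores, dim=1)                                     # attention weights
        context = (alpha * enc_feats).sum(dim=1)                             # (B, enc_dim) glimpse
        # Feed [embedding of previous symbol; context] into the GRU cell.
        x = torch.cat([self.embed(prev_token), context], dim=1)
        hidden = self.rnn(x, prev_hidden)                                    # (B, hidden)
        logits = self.out(hidden)                                            # (B, num_classes)
        return logits, hidden
```

At inference time this cell is called step by step, feeding back the previous prediction (or, with the beam search of section 2.4, each hypothesis's last token) as prev_token.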

4. Summary

Overall, ASTER builds on an attention-based encoder-decoder and adds a TPS rectification module, which partially alleviates recognition errors caused by curved text. Many subsequent papers improve along this direction, for example MORAN, ESIR, and so on. In the next article I will continue with the curved-text direction and introduce SAR, a text recognition paper that uses 2D attention.

Recommended references and learning materials for this article:

Paper reading (XiangBai - [PAMI 2018] ASTER: An Attentional Scene Text Recognizer with Flexible Rectification) - lilicao - cnblogs

Review of classic papers in the field of text recognition, issue 4: ASTER - chibohe123's blog - CSDN
