Classic paper in the field of character recognition: aster
2022-06-25 07:13:00 【Python's path to becoming a God】
Method overview
This paper mainly addresses the recognition of irregularly arranged text. It is an improved version of the authors' earlier CVPR 2016 paper (Robust Scene Text Recognition with Automatic Rectification, abbreviated RARE).
1. Main idea
- For irregular text, first rectify it into a normal linear arrangement, then recognize it;
- Integrate the rectification network and the recognition network into a single end-to-end trainable network;
- The rectification network uses an STN; the recognition network uses a classic sequence-to-sequence model with attention.
2. Method framework and process
The method, ASTER, is short for Attentional Scene TExt Recognizer with Flexible Rectification. It consists of two modules, one for rectification (the rectification network) and one for recognition (the recognition network), as shown in the figure below.

Overview of model structure
ASTER was presented in a 2018 paper, whose full title is 《ASTER: An Attentional Scene Text Recognizer with Flexible Rectification》. ASTER follows the encoder-decoder paradigm, and the overall architecture consists of three parts:
TPS (Thin-Plate Spline): split into a localization network and a grid sampler; the former regresses the control points, the latter samples a grid on the original image;
Encoder: the convolutional network is a ResNet, and the sequence model is a BiLSTM. Note that the later DTRB paper separates the sequence model out as its own stage; here we stay consistent with the original paper;
Decoder: a decoder based on Bahdanau attention. Two LSTM decoders are used, one decoding from left to right and one from right to left, i.e. bidirectional decoding.

2.2 Rectification module
From the model overview, ASTER is actually quite similar to FAN; the biggest difference is the TPS module, so let's focus on how this module rectifies text. First, the overall TPS pipeline: given an input image I of shape (N, C, H_in, W_in), we downsample it to obtain I_d, then pass I_d through the localization network to get the control points C'. From C' we can derive a TPS transformation matrix T. Next, the grid generator produces a grid P of shape (N, H_out, W_out, 2), where the last dimension of size 2 holds the (x, y) coordinates. Applying the transformation T maps the grid P onto the original image, giving P', whose shape is still (N, H_out, W_out, 2). Finally, sampling the original image according to the grid P' yields the rectified image I_r. Let's go through these steps one by one.

2.2.1 Localization Network
The localization network is a convolutional neural network built entirely from 3x3 conv blocks, with a final fully connected layer that outputs the control points C' of shape (20, 2): 20 points, 10 along the top edge and 10 along the bottom, with the second dimension holding the (x, y) coordinates. One detail that deserves attention is the numerical initialization of the fully connected layer. The authors show that when its bias term is initialized to [(0.01, 0.01), (0.02, 0.01), ..., (0.01, 0.99), ..., (0.99, 0.99)], i.e. points sampled equidistantly along the top and bottom edges of the image, the model converges faster.
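As a toy illustration, the equidistant control-point initialization can be sketched like this (the 0.01/0.99 margins follow the values quoted above; the exact spacing in the released code may differ):

```python
import numpy as np

def init_control_points(k=20):
    """Return k control points in normalized (x, y) coordinates:
    k/2 equidistant points along the top edge (y = 0.01) followed by
    k/2 along the bottom edge (y = 0.99)."""
    n = k // 2
    xs = np.linspace(0.01, 0.99, n)                      # shared x positions
    top = np.stack([xs, np.full(n, 0.01)], axis=1)       # top-edge points
    bottom = np.stack([xs, np.full(n, 0.99)], axis=1)    # bottom-edge points
    return np.concatenate([top, bottom], axis=0)         # shape (k, 2)
```

In ASTER this array would become the bias of the last fully connected layer, so that before any training the predicted C' already forms two straight, equidistant rows.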
- The localization network's input (at test time, after training) is the unrectified image to be recognized; its output is the locations of the K control points.
- During training the localization network does not use annotated control points as supervision; it is connected to the grid generator and sampler behind it and trained end-to-end using the final recognition result.
- The network structure is a plain, self-designed convolutional network (6 conv layers + 5 max-pooling layers + 2 fully connected layers) that predicts the locations of the K control points (K = 20); the point correspondence is shown in the figure below:

2.2.2 Thin-Plate Spline Transformation
From the localization network we obtain C'. We also construct C by equidistant sampling; C has the same shape as C', but the distance involved is 0.05 rather than 0.01. Next, the transformation matrix T is obtained through the following matrix operation:
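The underlying matrix operation is the standard thin-plate-spline solve. Here is a numpy sketch under the usual TPS formulation (radial basis U(r) = r^2 log r^2 plus an affine term); this is a generic implementation, not the paper's exact code, and the variable names are mine:

```python
import numpy as np

def tps_U(r2):
    # Radial basis U(r) = r^2 * log(r^2); the epsilon avoids log(0).
    return r2 * np.log(r2 + 1e-12)

def build_tps(C, C_prime):
    """Solve for the TPS transform T mapping rectified-image control
    points C (K, 2) onto input-image control points C_prime (K, 2)."""
    K = C.shape[0]
    d2 = ((C[:, None, :] - C[None, :, :]) ** 2).sum(-1)   # pairwise squared dists
    Phi = tps_U(d2)                                       # (K, K) kernel block
    P = np.hstack([np.ones((K, 1)), C])                   # (K, 3) affine block
    Delta = np.zeros((K + 3, K + 3))
    Delta[:K, :K] = Phi
    Delta[:K, K:] = P
    Delta[K:, :K] = P.T
    rhs = np.vstack([C_prime, np.zeros((3, 2))])
    return np.linalg.solve(Delta, rhs)                    # T, shape (K+3, 2)

def tps_apply(T, C, pts):
    """Map points pts (M, 2) from the rectified image into the input image."""
    d2 = ((pts[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    Phi = tps_U(d2)                                       # (M, K)
    P = np.hstack([np.ones((len(pts), 1)), pts])          # (M, 3)
    return np.hstack([Phi, P]) @ T                        # (M, 2)
```

Because the system interpolates the control points exactly, mapping C itself with the solved T recovers C', and all other grid points are bent smoothly in between.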

The grid generator's input is the already-obtained control point set together with the coordinates of a point on the rectified image (that image has not been generated yet, but once its size is fixed we can enumerate its points); its output is that point's coordinate on the image before rectification (the original image).
The grid generator can be regarded as a matrix transformation (the transformation parameters a0-a2, b0-b2 can be obtained by solving an optimization problem over the control points: since the control points' positions on the images before and after rectification are both known, the correspondence can be computed). At prediction time, the positional relationship between each query point and the known control points is computed, and this chain of correspondences gives its position in the original image. The figure below illustrates the correspondence: p is a point's position after rectification, C are the control point positions after rectification, p' is the point's position before rectification, and C' are the control point positions before rectification:

2.2.3 Sampler
The sampler takes the point-to-point mapping and the original image and generates the new, rectified image. It uses simple bilinear interpolation, and coordinates that fall outside the image are simply clipped. In addition, the sampler is differentiable, so gradients can backpropagate through it.
Its input is the original image plus the positions on the original image corresponding to each point of the rectified image; its output is the rectified image.
We first use the grid generator to obtain the grid P, then map P onto the original image to get P'. Note that the values of P and P' lie between 0 and 1, but during the final interpolation step P' is mapped into the range -1 to 1, as we will see in the code-reading section.
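A minimal single-channel sketch of the sampler, written in plain numpy just to show the coordinate convention (the [-1, 1] normalization mimics torch.nn.functional.grid_sample; a real implementation would be batched and differentiable):

```python
import numpy as np

def grid_sample(img, grid):
    """Bilinear sampling of img (H, W) at grid (H_out, W_out, 2), whose
    last dim holds (x, y) in [-1, 1]. Out-of-range coordinates are clipped
    to the image border."""
    H, W = img.shape
    x = (grid[..., 0] + 1) * (W - 1) / 2          # [-1, 1] -> [0, W-1]
    y = (grid[..., 1] + 1) * (H - 1) / 2
    x0 = np.clip(np.floor(x).astype(int), 0, W - 1)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 1)
    x1 = np.clip(x0 + 1, 0, W - 1)                # right/bottom neighbors
    y1 = np.clip(y0 + 1, 0, H - 1)
    wx = np.clip(x, 0, W - 1) - x0                # fractional offsets
    wy = np.clip(y, 0, H - 1) - y0
    top = img[y0, x0] * (1 - wx) + img[y0, x1] * wx
    bot = img[y1, x0] * (1 - wx) + img[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

With an identity grid (evenly spaced coordinates covering [-1, 1]) this reproduces the input image exactly, which is a handy sanity check when wiring up the TPS grid.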

Summary: as the figure below shows, TPS essentially amounts to obtaining a transformation matrix, where C' is a parameter that needs to be learned while C is fixed, i.e. a manually set parameter. From C and C' we obtain T, and the final rectified image is obtained by sampling the original image.

2.3 Feature extraction layer
The feature extraction layer of this paper is basically the same as FAN's: a ResNet followed by a bidirectional LSTM, producing a three-dimensional feature tensor of shape (B, W, C), where B is the batch size, W is the number of time steps, and C is the number of channels. For example, following the original paper, when the input size is (32, 100), the output is (B, 25, 512).
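The shape bookkeeping can be sanity-checked with a couple of lines (the width-downsampling factor of 4 is an assumption chosen to reproduce the (32, 100) -> (B, 25, 512) example above; check the paper's ResNet table for the exact strides):

```python
def encoder_output_shape(h, w, width_downsample=4, channels=512):
    """Illustrative per-image output shape of the ConvNet+BiLSTM encoder:
    the feature-map height is collapsed to 1, so the downsampled width
    becomes the number of time steps W and the channel count becomes C."""
    time_steps = w // width_downsample
    return (time_steps, channels)
```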

2.4 Decoding layer
The decoding layer is similar to FAN's, with two improvements. The first replaces FAN's one-way attentional decoding with bidirectional attentional decoding. The motivation is intuitive: when decoding a particular character, that character is related not only to the semantic information on its left but also to that on its right. Concretely, bidirectional decoding runs one decoder from left to right and another from right to left, and the output with the higher log-softmax score is taken as the final result. The attention itself is the same as in FAN, namely Bahdanau attention, so the formulas are not repeated here.
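One additive (Bahdanau) attention step can be sketched in numpy as follows; the parameter names W, U, v are the usual textbook ones, not taken from the ASTER code:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bahdanau_step(h_enc, s_prev, W, U, v):
    """One additive-attention step.
    h_enc: (T, C) encoder features; s_prev: (D,) previous decoder state.
    W (A, C), U (A, D), v (A,) are the attention parameters.
    Returns the attention weights (T,) and the context vector (C,)."""
    e = np.tanh(h_enc @ W.T + s_prev @ U.T) @ v   # (T,) alignment scores
    alpha = softmax(e)                            # attention weights over time
    context = alpha @ h_enc                       # weighted sum of features
    return alpha, context
```

The context vector is then fed, together with the previous output embedding, into the recurrent decoder cell at each step.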

The second improvement is in the final prediction output. Instead of the usual greedy choice of the most probable character at each time step, this paper uses beam search, with the beam width typically set to 5.
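The pruning logic of beam search can be shown with a toy version that scores over a precomputed (T, V) table of per-step log-probabilities (a real decoder would recompute the scores from the decoder state at every step, so this only illustrates the hypothesis bookkeeping):

```python
def beam_search(step_log_probs, beam_width=5):
    """Toy beam search: step_log_probs is a list of per-step lists of
    log-probabilities over the vocabulary. Returns the best token sequence."""
    beams = [([], 0.0)]                           # (token sequence, total log-prob)
    for log_probs in step_log_probs:
        candidates = [
            (seq + [tok], score + lp)
            for seq, score in beams
            for tok, lp in enumerate(log_probs)
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]           # keep the best hypotheses
    return beams[0][0]
```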
3. Code reading
Let's focus on the TPS module and the attention decoder. The attention decoder here is still one-way; if you want to make it bidirectional, simply reverse the order along L in the (B, L, C) target tensor so that decoding runs from right to left.
3.1 TPS
First, let's look at how C' is regressed; note how the last fully connected layer is initialized.
3.2 attention decoder
This implementation uses a GRU for decoding, whereas FAN uses an LSTM. In addition, this implementation sets L in the (B, L, W) input to 1, so a full GRU layer can be used directly instead of decoding step by step with a GRUCell. Personally, though, I find decoding with a GRUCell more intuitive.
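A GRUCell-style step is easy to write out explicitly, which is part of why it reads more intuitively. This is the standard GRU update in numpy, with bias terms omitted for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def gru_cell(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step: x (Dx,) is the input, h (Dh,) the previous hidden
    state; W* are (Dx, Dh) input weights and U* are (Dh, Dh) recurrent
    weights. Returns the new hidden state (Dh,)."""
    z = sigmoid(x @ Wz + h @ Uz)              # update gate
    r = sigmoid(x @ Wr + h @ Ur)              # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)  # candidate state
    return (1 - z) * h + z * h_tilde
```

In a decoder loop, x at each step would be the concatenation of the previous output embedding and the attention context, and h feeds the next attention and classification step.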
4. Summary
Overall, ASTER adds the TPS rectification module on top of the standard attention-based encoder-decoder, which partially alleviates the recognition errors caused by curved text. Many subsequent papers improve along this direction, for example MORAN, ESIR, etc. In the next article I will continue with curved-text recognition and introduce SAR, a character recognition paper that uses 2D attention.