Text recognition: an interpretation of the SVTR paper

2022-06-30 21:05:00 【‘Atlas’】

Contents

Problem addressed
Algorithm
Experiments
Conclusion

Paper: "SVTR: Scene Text Recognition with a Single Visual Model"
GitHub: https://github.com/PaddlePaddle/PaddleOCR

Problem addressed

Conventional text recognition models consist of two parts: a visual model for feature extraction and a sequence model for text transcription.
Problem:
Although this hybrid design achieves high accuracy, it is complex and inefficient.
Solution:
The authors propose SVTR, which uses only a visual model and eliminates the sequence model:
1. The text image is decomposed into small patches;
2. Hierarchical stages repeatedly apply mixing, merging, and combining; global and local mixing blocks perceive patterns between characters and within characters, respectively.
SVTR-L is fast and achieves high accuracy on both English and Chinese recognition.

Algorithm

The overall structure of SVTR is shown in Figure 2.
The process is as follows:
1. A text image of size $H \times W \times 3$ passes through the patch embedding module and is converted into $\frac H4 \times \frac W4$ patches of dimension $D_0$;
2. Three stages perform feature extraction; each stage consists of a series of mixing blocks followed by a merging or combining operation;
Local and global mixing blocks extract local stroke features and the dependencies between components;
With this backbone, character component features and dependencies at different distances and scales can be represented; the output feature has size $1 \times \frac W4 \times D_3$ and is denoted C;
3. Finally, an FC layer produces the character sequence.
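The shape changes in the steps above can be traced with a small sketch. The input size 32 x 100 and the channel widths (64, 128, 256, 192) are illustrative assumptions, not values given in this article, and `svtr_shape_trace` is a hypothetical helper name. The height is halved after each of the first two stages because merging downsamples the height only (see the Merging section).

```python
def svtr_shape_trace(h, w, dims):
    """Trace (height, width, channels) through the SVTR pipeline.

    dims = (D0, D1, D2, D3) are the channel widths; the values used in the
    example below are illustrative assumptions.
    """
    d0, d1, d2, d3 = dims
    trace = [("input image", (h, w, 3))]
    # Patch embedding reduces both height and width by a factor of 4.
    h, w = h // 4, w // 4
    trace.append(("after patch embedding", (h, w, d0)))
    # Stage 1: mixing blocks keep the shape, merging halves the height.
    h //= 2
    trace.append(("after stage 1 (mixing + merging)", (h, w, d1)))
    # Stage 2: same pattern.
    h //= 2
    trace.append(("after stage 2 (mixing + merging)", (h, w, d2)))
    # Stage 3: combining pools the height to 1, giving 1 x W/4 x D3.
    trace.append(("after stage 3 (mixing + combining)", (1, w, d3)))
    return trace


if __name__ == "__main__":
    for name, shape in svtr_shape_trace(32, 100, (64, 128, 256, 192)):
        print(f"{name}: {shape}")
```

With these assumed sizes the final feature C has shape (1, 25, 192), i.e. one feature vector per horizontal position.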

Progressive overlapping patch embedding

The authors do not use the ViT-style convolution with kernel=4, stride=4; instead they use two consecutive convolutions with kernel=3, stride=2, as shown in Figure 3. This adds some computation but benefits feature fusion; see Section 3.3 for the ablation experiment.
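A short sketch of the arithmetic behind the two embedding schemes, using the standard convolution output-size formula. The padding of 1 for the 3 x 3 convolutions and the 32 x 100 input are assumptions made for illustration; the helper names are hypothetical. Both schemes produce the same $\frac H4 \times \frac W4$ grid, but the two stacked 3 x 3 convolutions give each patch a 7 x 7 receptive field at an effective stride of 4, so neighbouring patches overlap, whereas the single 4 x 4 convolution sees disjoint 4 x 4 windows.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def vit_style_embed(h, w):
    """Single convolution with kernel=4, stride=4, no padding."""
    return conv_out(h, 4, 4, 0), conv_out(w, 4, 4, 0)


def progressive_embed(h, w, padding=1):
    """Two consecutive convolutions with kernel=3, stride=2 (padding assumed)."""
    for _ in range(2):
        h = conv_out(h, 3, 2, padding)
        w = conv_out(w, 3, 2, padding)
    return h, w


def receptive_field(layers):
    """Return (receptive field, effective stride) for a stack of (kernel, stride) layers."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf, jump


if __name__ == "__main__":
    print(vit_style_embed(32, 100))            # (8, 25)
    print(progressive_embed(32, 100))          # (8, 25)
    print(receptive_field([(4, 4)]))           # (4, 4): no overlap
    print(receptive_field([(3, 2), (3, 2)]))   # (7, 4): overlapping patches
```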

Mixing block

The mixing block is shown in Figure 4.
Local features: encode the morphological features of a character and the correlation between its different parts;
Global features: capture the relationships between different characters, and between text and non-text patches.
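One way to picture the difference between the two kinds of mixing is as an attention mask over the patch grid: local mixing lets each patch interact only with patches inside a small window around it, while global mixing lets every patch interact with every other patch. The sketch below is a minimal illustration under that reading; the window size is a hypothetical parameter and the function names are not taken from the paper or from PaddleOCR.

```python
def local_mixing_mask(h, w, win_h, win_w):
    """Boolean mask of shape (h*w, h*w).

    mask[i][j] is True when patch j lies inside the win_h x win_w window
    centred on patch i (patches are numbered row by row).
    """
    rh, rw = win_h // 2, win_w // 2
    mask = []
    for r in range(h):
        for c in range(w):
            row = [
                abs(r - r2) <= rh and abs(c - c2) <= rw
                for r2 in range(h)
                for c2 in range(w)
            ]
            mask.append(row)
    return mask


def global_mixing_mask(h, w):
    """Every patch may interact with every other patch."""
    n = h * w
    return [[True] * n for _ in range(n)]


if __name__ == "__main__":
    # 2 x 3 patch grid with a 1 x 3 window: the middle patch of the first
    # row sees only the three patches of its own row.
    print(local_mixing_mask(2, 3, 1, 3)[1])
```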

Merging

To reduce computation and remove redundant representations, the authors propose a merging operation: a convolution with kernel=3 and stride=(2,1) downsamples the height by a factor of 2 while keeping the width unchanged, because most text is horizontal. The channel dimension is increased at the same time to compensate for the information loss.
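A rough sketch of why merging saves computation. It assumes a padding of 1 for the 3 x 3 convolution and a starting grid of 8 x 25 patches (both assumptions, not stated in this article), and it uses the number of patch pairs as a crude proxy for the cost of global mixing; the helper names are hypothetical.

```python
def merge_shape(h, w, padding=1):
    """Grid size after a 3x3 convolution with stride (2, 1); padding is assumed."""
    h_out = (h + 2 * padding - 3) // 2 + 1
    w_out = (w + 2 * padding - 3) // 1 + 1
    return h_out, w_out


def pairwise_cost(h, w):
    """Number of patch pairs, a crude proxy for the cost of global mixing."""
    n = h * w
    return n * n


if __name__ == "__main__":
    shape = (8, 25)
    for stage in range(3):
        print(f"stage {stage + 1}: grid {shape}, pairs {pairwise_cost(*shape)}")
        shape = merge_shape(*shape)
```

Under these assumptions each merging step halves the number of patches (200, 100, 50) and cuts the number of patch pairs by a factor of four (40000, 10000, 2500), while the width, and therefore the horizontal resolution of the text, is preserved.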

Combining & Prediction

Combining
The height dimension is first pooled to 1, followed by a fully connected layer, a nonlinear activation layer, and a dropout layer;

Prediction
The linear classifier has N nodes and generates a sequence of length $\frac W4$. Ideally, patches belonging to the same character are transcribed as repeated characters, and patches without text are transcribed as a blank symbol. N is set to 37 for English and 6625 for Chinese;
The maximum prediction length is 25 for the English model and 40 for the Chinese model.
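The repeated characters and blanks described above have to be condensed into the final string. A minimal sketch of one way to do this is CTC-style greedy collapsing: merge consecutive duplicates, then drop blanks. Treating the prediction as CTC-style is an assumption here, as are the blank at index 0 and the breakdown of the 37 English classes into one blank, 10 digits, and 26 lowercase letters; `collapse_prediction` is a hypothetical name.

```python
import string

# Assumed English class table: index 0 is the blank symbol, followed by
# digits and lowercase letters (1 + 10 + 26 = 37 classes).
CHARSET = [""] + list(string.digits + string.ascii_lowercase)


def collapse_prediction(class_ids, charset=CHARSET, blank_id=0):
    """Merge consecutive repeated classes, then remove blanks."""
    chars = []
    prev = None
    for idx in class_ids:
        # Emit a character only when it differs from the previous position
        # and is not the blank symbol.
        if idx != prev and idx != blank_id:
            chars.append(charset[idx])
        prev = idx
    return "".join(chars)


if __name__ == "__main__":
    c, a, t = CHARSET.index("c"), CHARSET.index("a"), CHARSET.index("t")
    # Per-position classes: blank, c, c, blank, a, t, t, blank
    print(collapse_prediction([0, c, c, 0, a, t, t, 0]))  # cat
```

Note that under this scheme a genuinely doubled letter (as in "see") is only recoverable when a blank separates the two occurrences.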

Structural variants

SVTR has several hyperparameters: the channel depth and number of heads in each stage, the number of mixing blocks, and how many of them are local or global mixing blocks. Varying these gives SVTR-T (Tiny), SVTR-S (Small), SVTR-B (Base), and SVTR-L (Large), as shown in Table 1.

Experiments

IC13: the ICDAR 2013 dataset, regular text.
IC15: the ICDAR 2015 dataset, irregular text.

Patch embedding ablation

As shown on the left of Table 2, the progressive embedding scheme outperforms the default embedding by 0.75% and 2.8%; the gain is most pronounced on irregular text.

Merging ablation

As shown on the right of Table 2, compared with the fixed-resolution network, the network with progressive resolution reduction not only reduces the amount of computation but also improves performance.

Mixing block permutation ablation

As shown in Table 3:
1. Every strategy brings some improvement, owing to more comprehensive perception of character features;
2. L6G6 performs best, improving IC13 by 1.9% and IC15 by 6.6%;
3. Switching the order of the combination appears to make the global mixing block ineffective, possibly because the model then repeatedly focuses on local features.
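The notation can be made concrete with a small sketch. It reads L6G6 as six local mixing blocks followed by six global mixing blocks; that reading, the per-stage depths (3, 6, 3), and the helper name `mixing_schedule` are assumptions for illustration and are not stated in this article.

```python
def mixing_schedule(n_local, n_global, depths, local_first=True):
    """Split an ordered list of local/global mixing blocks across stages.

    depths gives the number of mixing blocks per stage; the (3, 6, 3)
    split used below is an illustrative assumption.
    """
    if local_first:
        order = ["local"] * n_local + ["global"] * n_global
    else:
        order = ["global"] * n_global + ["local"] * n_local
    if sum(depths) != len(order):
        raise ValueError("depths must sum to the total number of blocks")
    stages, start = [], 0
    for depth in depths:
        stages.append(order[start:start + depth])
        start += depth
    return stages


if __name__ == "__main__":
    for i, blocks in enumerate(mixing_schedule(6, 6, (3, 6, 3)), start=1):
        print(f"stage {i}: {blocks}")
```

Under these assumptions the early blocks are all local and the late blocks are all global; passing `local_first=False` gives the switched order discussed in point 3.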

Comparison with SOTA

Figure 5 shows the relationship between accuracy, parameter count, and speed for each model.
Table 4 compares the performance of the various methods.

Overall, SVTR performs well in both inference time and accuracy.

Conclusion

This paper presents SVTR, a visual model for image text recognition. It extracts multi-grained character features that represent both local strokes and the dependencies between characters at multiple scales, which is why SVTR achieves good results.