Text recognition: SVTR paper interpretation
2022-06-30 21:05:00 【‘Atlas’】
Paper: "SVTR: Scene Text Recognition with a Single Visual Model"
GitHub: https://github.com/PaddlePaddle/PaddleOCR
Problem addressed
A conventional text recognition model consists of two parts: a visual model for feature extraction and a sequence model for text transcription.
Problem:
Such a hybrid model is accurate, but it is complex and inefficient.
Solution:
The authors propose SVTR, which keeps only the visual model and removes the sequence model:
1. The text image is decomposed into small patches (character components).
2. Hierarchical stages repeatedly apply mixing, merging and/or combining; global and local mixing blocks perceive patterns between characters and within characters, respectively.
SVTR-L achieves high accuracy on both English and Chinese recognition and runs fast.
Algorithm
The overall structure of SVTR is shown in Figure 2.
The pipeline is as follows:
1. A text image of size $H \times W \times 3$ passes through the patch embedding module and is converted into $\frac{H}{4} \times \frac{W}{4}$ patches of dimension $D_0$.
2. Three stages perform feature extraction; each stage consists of a series of mixing blocks followed by a merging or combining operation.
Local and global mixing blocks extract local stroke features and the dependencies between components, respectively.
With this backbone, component features and dependencies at different distances and scales can be represented. The output feature has size $1 \times \frac{W}{4} \times D_3$ and is denoted C.
3. Finally, an FC layer produces the character sequence.
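The steps above can be followed as a simple shape trace. This is only a sketch: the 32 x 100 input size, the channel widths passed as `dims` and `d_out`, and the assumption that stages 1 and 2 end with merging while stage 3 ends with combining are illustrative choices, not values taken from the post. Mixing blocks are omitted because they leave the shape unchanged.

```python
def svtr_shapes(h, w, dims=(64, 128, 256), d_out=192):
    """Trace (height, width, channels) through the pipeline.

    dims stands in for (D0, D1, D2) and d_out for D3; both are
    illustrative placeholders, not values from the paper's tables.
    """
    d0, d1, d2 = dims
    shapes = [("input", (h, w, 3))]

    # Patch embedding: H x W x 3 -> H/4 x W/4 x D0
    h, w = h // 4, w // 4
    shapes.append(("patch_embedding", (h, w, d0)))

    # Stage 1 (assumed to end with merging): height halved, width kept
    h //= 2
    shapes.append(("stage1_merging", (h, w, d1)))

    # Stage 2 (assumed to end with merging): height halved again
    h //= 2
    shapes.append(("stage2_merging", (h, w, d2)))

    # Stage 3 (assumed to end with combining): height pooled to 1
    shapes.append(("stage3_combining", (1, w, d_out)))
    return shapes


for name, shape in svtr_shapes(32, 100):
    print(name, shape)
```

The final width is W/4, so the classifier sees one feature vector per horizontal position of the patch grid.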
Progressive overlapping patch embedding
The authors do not use the single kernel=4, stride=4 convolution used in ViT; instead they use two consecutive kernel=3, stride=2 convolutions, as shown in Figure 3. This adds some computation but benefits feature fusion; see the ablation experiment in Section 3.3.
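A quick calculation shows why the two schemes produce the same grid but differ in overlap. The sketch below uses the standard convolution output-size formula; the padding values (0 for the 4x4 convolution, 1 for each 3x3 convolution) and the 32 x 100 input are assumptions made for illustration.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def receptive_field(layers):
    """Receptive field and effective stride of stacked (kernel, stride) layers."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf, jump


def grid_size(h, w, layers, padding):
    """Apply each (kernel, stride) layer to both axes."""
    for kernel, stride in layers:
        h = conv_out(h, kernel, stride, padding)
        w = conv_out(w, kernel, stride, padding)
    return h, w


single = [(4, 4)]               # one kernel=4, stride=4 convolution
progressive = [(3, 2), (3, 2)]  # two kernel=3, stride=2 convolutions

print(grid_size(32, 100, single, padding=0), receptive_field(single))
print(grid_size(32, 100, progressive, padding=1), receptive_field(progressive))
```

Both variants map a 32 x 100 image to an 8 x 25 grid with an effective stride of 4. The single convolution sees 4 input pixels per axis, so neighbouring patches do not overlap, while the two stacked convolutions see 7 pixels per axis, so neighbouring patches share input pixels.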
Mixing block
The mixing blocks are shown in Figure 4.
Local mixing: encodes the morphological features of a character and the correlations between different parts of the character.
Global mixing: captures the dependencies among different characters and between text and non-text patches.
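One way to picture the difference between the two kinds of mixing, assuming both are attention-style operations over the patch grid, is through the set of patches each patch is allowed to look at. The helper `mixing_mask` and the window size used below are hypothetical and chosen only for illustration.

```python
def mixing_mask(h, w, window=None):
    """Return an (h*w) x (h*w) boolean matrix.

    Entry [i][j] is True if patch i may attend to patch j.
    window=None models global mixing (every patch sees every patch).
    window=(wh, ww) models local mixing: a patch only sees patches
    inside a wh x ww neighbourhood centred on itself (odd sizes assumed).
    """
    n = h * w
    if window is None:
        return [[True] * n for _ in range(n)]

    rh, rw = window[0] // 2, window[1] // 2
    mask = []
    for i in range(n):
        yi, xi = divmod(i, w)  # row and column of patch i
        row = []
        for j in range(n):
            yj, xj = divmod(j, w)
            row.append(abs(yi - yj) <= rh and abs(xi - xj) <= rw)
        mask.append(row)
    return mask


global_mask = mixing_mask(2, 4)
local_mask = mixing_mask(2, 4, window=(1, 3))
print(sum(global_mask[1]), sum(local_mask[1]))
```

With a 2 x 4 grid, a patch in the global mask sees all 8 patches, while the same patch under a 1 x 3 window only sees itself and its horizontal neighbours, which is the kind of stroke-level context local mixing is meant to capture.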
Merging
Merging is introduced to reduce computation and remove redundant representations. A convolution with kernel=3 and stride=(2,1) halves the height while keeping the width unchanged, since most text is horizontal. The channel dimension is increased at the same time to compensate for the information loss.
Combining & Prediction
Combining
The height dimension is first pooled to 1, followed by a fully connected layer, a nonlinear activation layer and a dropout layer.
Prediction
A linear classifier with N nodes generates a sequence of length $\frac{W}{4}$. Ideally, patches belonging to the same character are transcribed into repeated characters, and non-text patches are transcribed into the blank symbol. N is set to 37 for English and 6625 for Chinese.
The maximum prediction length is 25 for the English model and 40 for the Chinese model.
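The repeated-character and blank behaviour described above matches standard CTC-style greedy decoding: collapse consecutive repeats, then drop blanks. The sketch below assumes a blank at index 0 and an English alphabet of 10 digits plus 26 lowercase letters (36 characters plus the blank gives 37 classes); the index layout is an assumption, not something specified in the post.

```python
import string

BLANK = 0  # assumed index of the blank symbol
CHARSET = string.digits + string.ascii_lowercase  # 36 characters, indices 1..36


def greedy_decode(indices, blank=BLANK, charset=CHARSET):
    """Collapse repeated predictions, then remove blanks."""
    chars = []
    prev = None
    for idx in indices:
        # Emit a character only when the prediction changes and is not blank.
        if idx != prev and idx != blank:
            chars.append(charset[idx - 1])
        prev = idx
    return "".join(chars)


def to_index(ch):
    """Class index of a character under the assumed layout."""
    return CHARSET.index(ch) + 1


c, a, t = to_index("c"), to_index("a"), to_index("t")
# Per-position argmax of the classifier output, one entry per width position.
print(greedy_decode([0, c, c, 0, a, 0, t, t, 0]))  # prints "cat"
```

A blank between two identical predictions keeps them as separate characters, which is how doubled letters survive the collapse step.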
Architecture variants
SVTR has several hyperparameters: the channel depth and number of heads in each stage, the number of mixing blocks, and the numbers of local and global mixing blocks. Varying them gives SVTR-T (Tiny), SVTR-S (Small), SVTR-B (Base) and SVTR-L (Large), as listed in Table 1.
Experiments
IC13: the ICDAR 2013 dataset, regular text.
IC15: the ICDAR 2015 dataset, irregular text.
Patch embedding ablation experiment
As shown in the left half of Table 2, the progressive embedding scheme outperforms the default one by 0.75% and 2.8%; the gain is most noticeable on irregular text.
Merging ablation experiment
As shown in the right half of Table 2, compared with a fixed-resolution network, the network with progressively reduced resolution not only lowers the computational cost but also improves accuracy.
Mixing block permutation ablation experiment
As shown in Table 3:
1. Every strategy brings some improvement, owing to more comprehensive perception of character features.
2. L6G6 performs best, improving accuracy by 1.9% on IC13 and 6.6% on IC15.
3. Swapping the order of the combination may leave the global mixing blocks ineffective, since the network may then repeatedly attend to local features.
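For readers unfamiliar with the notation, L6G6 is read here as six local mixing blocks followed by six global mixing blocks. The small helper below, a hypothetical illustration rather than code from PaddleOCR, expands such a string into an ordered block list; it handles only simple letter-count runs.

```python
import re


def expand_permutation(spec):
    """Expand a string such as "L6G6" into an ordered list of block types."""
    blocks = []
    for kind, count in re.findall(r"([LG])(\d+)", spec):
        name = "local" if kind == "L" else "global"
        blocks.extend([name] * int(count))
    return blocks


print(expand_permutation("L6G6"))
print(expand_permutation("G6L6"))
```

The two calls show the best-performing order and its swapped counterpart side by side: the same twelve blocks, with local mixing applied either before or after global mixing.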
Comparison with SOTA
Figure 5 shows the relationship between accuracy, parameter count and speed for each model.
Table 4 compares the performance of the various methods.
Overall, SVTR performs well in both speed and accuracy.
Conclusion
The paper presents SVTR, a single visual model for scene text recognition. It extracts multi-grained character features that describe both local strokes and inter-character dependencies at multiple scales, which explains its good results.