PyTorch implementation of an end-to-end Handwritten Text Recognition (HTR) system based on attention encoder-decoder networks

Overview

AttentionHTR

PyTorch implementation of an end-to-end Handwritten Text Recognition (HTR) system based on attention encoder-decoder networks. Scene Text Recognition (STR) benchmark model [1], trained on synthetic scene text images, is used to perform transfer learning from the STR domain to HTR. Different fine-tuning approaches are investigated using the multi-writer datasets: Imgur5K [2] and IAM [3].

For more details, refer to our paper at arXiv: https://arxiv.org/abs/2201.09390

Dependencies

This work was tested with Python 3.6.8, PyTorch 1.9.0, CUDA 11.5 and CentOS Linux release 7.9.2009 (Core). Create a new virtual environment and install all the necessary Python packages:

python3 -m venv attentionhtr-env
source attentionhtr-env/bin/activate
pip install --upgrade pip
python3 -m pip install -r AttentionHTR/requirements.txt

Content

Our pre-trained models

Download our pre-trained models from here. The names of the .pth files are explained in the table below. There are 6 models in total, 3 for each character set, corresponding to the dataset they perform best on.

Character set Imgur5K IAM Both datasets
Case-insensitive AttentionHTR-Imgur5K.pth AttentionHTR-IAM.pth AttentionHTR-General.pth
Case-sensitive AttentionHTR-Imgur5K-sensitive.pth AttentionHTR-IAM-sensitive.pth AttentionHTR-General-sensitive.pth

Print the character sets using the Python string module: string.printable[:36] for the case-insensitive and string.printable[:-6] for the case-sensitive character set.

Pre-trained STR benchmark models can be downloaded from here.

Demo

  • Download the AttentionHTR-General-sensitive.pth model and place it into /model/saved_models.

  • Directory /dataset-demo contains demo images. Go to /model and create an LMDB dataset from them with python3 create_lmdb_dataset.py --inputPath ../dataset-demo/ --gtFile ../dataset-demo/gt.txt --outputPath result/dataset-demo/. Note that under Windows you may need to tune the map_size parameter manually for the lmdb.open() function.

  • Obtain predictions with python3 test.py --eval_data result/dataset-demo --Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn --saved_model saved_models/AttentionHTR-General-sensitive.pth --sensitive. The last two rows in the terminal should be

    Accuracy: 90.00000000
    Norm ED: 0.04000000
    
  • Inspect predictions in /model/result/AttentionHTR-General-sensitive.pth/log_predictions_dataset-demo.txt. Columns: batch number, ground truth string, predicted string, match (0/1), running accuracy.

Use the models for fine-tuning or predictions

Partitions

Prepare the train, validation (for fine-tuning) and test (for testing and for predicting on unseen data) partitions with word-level images. For the Imgur5K and the IAM datasets you may use our scripts in /process-datasets.

LMDB datasets

When using the PyTorch implementation of the STR benchmark model [1], images need to be converted into an LMDB dataset. See this section for details. An LMDB dataset offers extremely cheap read transactions [4]. Alternatively, see this demo that uses raw images.

Predictions and fine-tuning

The pre-trained models can be used for predictions or fine-tuning on additional datasets using an implementation in /model, which is a modified version of the official PyTorch implementation of the STR benchmark [1]. Use test.py for predictions and train.py for fine-tuning. In both cases use the following arguments:

  • --Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn to define architecture.
  • --saved_model to provide a path to a pre-trained model. In case of train.py it will be used as a starting point in fine-tuning and in the case of test.py it will be used for predictions.
  • --sensitive for the case-sensitive character set. No such argument for the case-insensitive character set.

Specifically for fine-tuning use:

  • --FT to signal that model parameters must be initialized from a pre-trained model in --saved_model and not randomly.
  • --train_data and --valid_data to provide paths to training and validation data, respectively.
  • --select_data "/" and --batch_ratio 1 to use all data. Can be used to define stratified batches.
  • --manualSeed to assign an integer identifyer for the resulting model. The original purpose of this argument is to set a random seed.
  • --patience to set the number of epochs to wait for the validation loss to decrease below the last minimum.

Specifically for predicting use:

  • --eval_data to provide a path to evaluation data.

Note that test.py outputs its logs and a copy of the evaluated model into /result.

All other arguments are described inside the scripts. Original instructions for using the scripts in /model are available here.

For example, to fine-tune one of our case-sensitive models on an additional dataset:

CUDA_VISIBLE_DEVICES=3 python3 train.py \
--train_data my_train_data \
--valid_data my_val_data \
--select_data "/" \
--batch_ratio 1 \
--FT \
--manualSeed 1
--Transformation TPS \
--FeatureExtraction ResNet \
--SequenceModeling BiLSTM \
--Prediction Attn \
--saved_model saved_models/AttentionHTR-General-sensitive.pth \
--sensitive

To use the same model for predictions:

CUDA_VISIBLE_DEVICES=0 python3 test.py \
--eval_data my_unseen_data \
--Transformation TPS \
--FeatureExtraction ResNet \
--SequenceModeling BiLSTM \
--Prediction Attn \
--saved_model saved_models/AttentionHTR-General.pth \
--sensitive

Acknowledgements

  • Our implementation is based on Clova AI's deep text recognition benchmark.
  • The authors would like to thank Facebook Research for the Imgur5K dataset.
  • The computations were performed through resources provided by the Swedish National Infrastructure for Computing (SNIC) at Chalmers Centre for Computational Science and Engineering (C3SE).

References

[1]: Baek, J. et al. (2019). What is wrong with scene text recognition model comparisons? dataset and model analysis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 4715-4723). https://arxiv.org/abs/1904.01906

[2]: Krishnan, P. et al. (2021). TextStyleBrush: Transfer of Text Aesthetics from a Single Example. arXiv preprint arXiv:2106.08385. https://arxiv.org/abs/2106.08385

[3]: Marti, U. V., & Bunke, H. (2002). The IAM-database: an English sentence database for offline handwriting recognition. International Journal on Document Analysis and Recognition, 5(1), 39-46. https://doi.org/10.1007/s100320200071

[4]: Lightning Memory-Mapped Database. Homepage: https://www.symas.com/lmdb

Citation

@article{kass2022attentionhtr,
  title={AttentionHTR: Handwritten Text Recognition Based on Attention Encoder-Decoder Networks},
  author={Kass, D. and Vats, E.},
  journal={arXiv preprint arXiv:2201.09390},
  year={2022}
}

Contact

Dmitrijs Kass ([email protected])

Ekta Vats ([email protected])

Owner
Dmitrijs Kass
Data Science student at Uppsala University
Dmitrijs Kass
CoINN: Correlated-informed neural networks: a new machine learning framework to predict pressure drop in micro-channels

CoINN: Correlated-informed neural networks: a new machine learning framework to predict pressure drop in micro-channels Accurate pressure drop estimat

Alejandro Montanez 0 Jan 21, 2022
Classify the disease status of a plant given an image of a passion fruit

Passion Fruit Disease Detection I tried to create an accurate machine learning models capable of localizing and identifying multiple Passion Fruits in

3 Nov 09, 2021
Large scale embeddings on a single machine.

Marius Marius is a system under active development for training embeddings for large-scale graphs on a single machine. Training on large scale graphs

Marius 107 Jan 03, 2023
Vision-Language Pre-training for Image Captioning and Question Answering

VLP This repo hosts the source code for our AAAI2020 work Vision-Language Pre-training (VLP). We have released the pre-trained model on Conceptual Cap

Luowei Zhou 373 Jan 03, 2023
python library for invisible image watermark (blind image watermark)

invisible-watermark invisible-watermark is a python library and command line tool for creating invisible watermark over image.(aka. blink image waterm

Shield Mountain 572 Jan 07, 2023
HW3 ― GAN, ACGAN and UDA

HW3 ― GAN, ACGAN and UDA In this assignment, you are given datasets of human face and digit images. You will need to implement the models of both GAN

grassking100 1 Dec 13, 2021
MADE (Masked Autoencoder Density Estimation) implementation in PyTorch

pytorch-made This code is an implementation of "Masked AutoEncoder for Density Estimation" by Germain et al., 2015. The core idea is that you can turn

Andrej 498 Dec 30, 2022
Yolov3 pytorch implementation

YOLOV3 Pytorch实现 在bubbliiing大佬代码的基础上进行了修改,添加了部分注释。 预训练模型 预训练模型来源于bubbliiing。 链接:https://pan.baidu.com/s/1ncREw6Na9ycZptdxiVMApw 提取码:appk 训练自己的数据集 按照VO

4 Aug 27, 2022
PaddleRobotics is an open-source algorithm library for robots based on Paddle, including open-source parts such as human-robot interaction, complex motion control, environment perception, SLAM positioning, and navigation.

简体中文 | English PaddleRobotics paddleRobotics是基于paddle的机器人开源算法库集,包括人机交互、复杂运动控制、环境感知、slam定位导航等开源算法部分。 人机交互 主动多模交互技术TFVT-HRI 主动多模交互技术是通过视觉、语音、触摸传感器等输入机器人

185 Dec 26, 2022
Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data

SEDE SEDE (Stack Exchange Data Explorer) is new dataset for Text-to-SQL tasks with more than 12,000 SQL queries and their natural language description

Rupert. 83 Nov 11, 2022
Subpopulation detection in high-dimensional single-cell data

PhenoGraph for Python3 PhenoGraph is a clustering method designed for high-dimensional single-cell data. It works by creating a graph ("network") repr

Dana Pe'er Lab 42 Sep 05, 2022
Official implementation of Sparse Transformer-based Action Recognition

STAR Official implementation of S parse T ransformer-based A ction R ecognition Dataset download NTU RGB+D 60 action recognition of 2D/3D skeleton fro

Chonghan_Lee 15 Nov 02, 2022
CN24 is a complete semantic segmentation framework using fully convolutional networks

Build status: master (production branch): develop (development branch): Welcome to the CN24 GitHub repository! CN24 is a complete semantic segmentatio

Computer Vision Group Jena 123 Jul 14, 2022
A PyTorch implementation of the architecture of Mask RCNN

EDIT (AS OF 4th NOVEMBER 2019): This implementation has multiple errors and as of the date 4th, November 2019 is insufficient to be utilized as a reso

Sai Himal Allu 975 Dec 30, 2022
Python script that takes an Impulse response .wav and a input .wav to demonstrate audio convolution.

convolver Python script that takes an Impulse response .wav and a input .wav to demonstrate audio convolution. Created by Sean Higley

Sean Higley 1 Feb 23, 2022
Reinforcement Learning via Supervised Learning

Reinforcement Learning via Supervised Learning Installation Run pip install -e . in an environment with Python = 3.7.0, 3.9. The code depends on MuJ

Scott Emmons 49 Nov 28, 2022
A state-of-the-art semi-supervised method for image recognition

Mean teachers are better role models Paper ---- NIPS 2017 poster ---- NIPS 2017 spotlight slides ---- Blog post By Antti Tarvainen, Harri Valpola (The

Curious AI 1.4k Jan 06, 2023
FNet Implementation with TensorFlow & PyTorch

FNet Implementation with TensorFlow & PyTorch. TensorFlow & PyTorch implementation of the paper "FNet: Mixing Tokens with Fourier Transforms". Overvie

Abdelghani Belgaid 1 Feb 12, 2022
Config files for my GitHub profile.

Canalyst Candas Data Science Library Name Canalyst Candas Description Built by a former PM / analyst to give anyone with a little bit of Python knowle

Canalyst Candas 13 Jun 24, 2022
Exporter for Storage Area Network (SAN)

SAN Exporter Prometheus exporter for Storage Area Network (SAN). We all know that each SAN Storage vendor has their own glossary of terms, health/perf

vCloud 32 Dec 16, 2022