OCR Post Correction for Endangered Language Texts

Overview

📌 Coming soon: an update to the software including features from our paper on semi-supervised OCR post-correction, to be published in the Transactions of the Association for Computational Linguistics (TACL)!

Check out the paper here.

This repository contains code for models and experiments from the paper "OCR Post Correction for Endangered Language Texts".

Textual data in endangered languages is often found in formats that are not machine-readable, including scanned images of paper books. Extracting the text is challenging because there is typically no annotated data to train an OCR system for each endangered language. Instead, we focus on post-correcting the OCR output from a general-purpose OCR system.

📌 In the paper, we present a dataset containing annotations for documents in three critically endangered languages: Ainu, Griko, and Yakkha.

📌 Our model reduces the recognition error rate by 34% on average, compared with the output of a state-of-the-art OCR system.

Learn more about the paper here!

OCR Post-Correction

The goal of OCR post-correction is to automatically correct errors in the text output from an existing OCR system.

The existing OCR system is used to obtain a first pass transcription of the input image (example below in the endangered language Griko):

First pass OCR transcription

The incorrectly recognized characters in the first pass are then corrected by the post-correction model.

Corrected transcription
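
To make the idea of "reducing recognition errors" concrete, here is a minimal sketch of how a character error rate (CER) could be computed between a transcription and its manually corrected reference. It assumes CER is defined as the Levenshtein edit distance divided by the reference length. The function names and the toy strings are illustrative and are not taken from this repository.

```python
def edit_distance(hypothesis: str, reference: str) -> int:
    """Levenshtein distance: minimum number of character insertions,
    deletions, and substitutions needed to turn hypothesis into reference."""
    # prev[j] holds the distance between the hypothesis prefix processed
    # so far and the first j characters of the reference.
    prev = list(range(len(reference) + 1))
    for i, h_char in enumerate(hypothesis, start=1):
        curr = [i]
        for j, r_char in enumerate(reference, start=1):
            substitution_cost = 0 if h_char == r_char else 1
            curr.append(min(
                prev[j] + 1,                      # delete h_char
                curr[j - 1] + 1,                  # insert r_char
                prev[j - 1] + substitution_cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_error_rate(hypothesis: str, reference: str) -> float:
    """Edit distance normalized by the reference length (assumed definition)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(hypothesis, reference) / len(reference)


if __name__ == "__main__":
    # Toy strings, not real OCR output: one wrong character out of four.
    print(character_error_rate("abxd", "abcd"))
```

Comparing the CER of the first pass output with the CER of the post-corrected output against the same reference gives a simple before/after measure of how much the correction step helped.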

Model

As seen in the example above, OCR post-correction is a text-based sequence-to-sequence task.

📌 We use a character-level encoder-decoder architecture with attention and add several adaptations for the low-resource setting. The paper has all the details!

📌 The model is trained in a supervised manner. The training data consists of first pass OCR outputs as the source, paired with the corresponding manually corrected transcriptions as the target.

📌 Some books that contain texts in endangered languages also contain translations of the text in another (usually high-resource) language. When such translations are available, we use them by adding a second encoder to the model in a multisource framework.
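
As a rough illustration of the attention step in an encoder-decoder model like the one described above, the sketch below computes attention weights and a context vector over a sequence of encoder states. It uses plain dot-product scoring for simplicity, which is an assumption for illustration and not necessarily the scoring function used in the paper or in this repository. All names here are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(
    decoder_state: Sequence[float],
    encoder_states: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Return (attention weights, context vector).

    Each encoder state stands for one source character. The weights say how
    much the decoder looks at each source character when predicting the next
    output character.
    """
    # Dot-product score between the decoder state and every encoder state.
    scores = [
        sum(d * h for d, h in zip(decoder_state, state))
        for state in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context vector: attention-weighted average of the encoder states.
    context = [
        sum(w * state[k] for w, state in zip(weights, encoder_states))
        for k in range(dim)
    ]
    return weights, context


if __name__ == "__main__":
    weights, context = attend([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
    print(weights, context)
```

In a trained model the states would come from learned recurrent or similar layers; the small hand-written vectors here only show the arithmetic.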

We provide instructions for both single-source and multisource models:

  • The single-source model can be used for almost any document and is significantly easier to set up.

  • The multisource model can only be used if translations are available.

Dataset

This repository contains a sample from our dataset in sample_dataset, which you can use to train the post-correction model. Get the full dataset here!

However, this repository can be used to train OCR post-correction models for documents in any language!

🚀 If you want to use our model with a new set of documents, construct a dataset by following the steps here.

🚀 We'd love to hear about the new datasets and models you build: send us an email at [email protected]!

Running Experiments

Once you have a suitable dataset (e.g., sample_dataset or your own dataset), you can train a model and run experiments on OCR post-correction.

If you have your own dataset, you can use the utils/prepare_data.py script to create train, development, and test splits (see the last step here).
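
For intuition about what creating train, development, and test splits involves, here is a generic sketch of a seeded random split. It is not the logic of utils/prepare_data.py, whose options are not described here. The function name, the default fractions, and the seed are all assumptions chosen for illustration.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    examples: Sequence[T],
    dev_fraction: float = 0.1,   # hypothetical default
    test_fraction: float = 0.1,  # hypothetical default
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle examples reproducibly and return (train, dev, test) lists."""
    if dev_fraction < 0 or test_fraction < 0 or dev_fraction + test_fraction >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    items = list(examples)
    # A dedicated Random instance keeps the split reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(items)
    n_dev = int(len(items) * dev_fraction)
    n_test = int(len(items) * test_fraction)
    dev = items[:n_dev]
    test = items[n_dev:n_dev + n_test]
    train = items[n_dev + n_test:]
    return train, dev, test


if __name__ == "__main__":
    train, dev, test = split_dataset([f"line {i}" for i in range(10)])
    print(len(train), len(dev), len(test))
```

Fixing the seed means that rerunning the split on the same data gives the same partition, which keeps experiments comparable.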

The steps are described below, illustrated with sample_dataset/postcorrection. If using another dataset, simply change the experiment settings to point to your dataset and run the same scripts.

Requirements

Python 3+ is required. Pip can be used to install the packages:

pip install -r postcorr_requirements.txt

Training

The process of training the post-correction model has two main steps:

  • Pretraining with first pass OCR outputs.
  • Training with manually corrected transcriptions in a supervised manner.
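
The supervised step relies on aligned source and target lines. The sketch below shows one way such character-level training pairs could be assembled, assuming the first pass OCR output and the corrected transcription are line-aligned. The function name and the toy strings are hypothetical, and the repository's own data loading may differ.

```python
from typing import Iterable, List, Tuple

CharPair = Tuple[List[str], List[str]]


def make_training_pairs(
    ocr_lines: Iterable[str],
    corrected_lines: Iterable[str],
) -> List[CharPair]:
    """Pair each first pass OCR line (source) with its manually corrected
    line (target), and split both into characters for a character-level model.
    """
    sources = [line.rstrip("\n") for line in ocr_lines]
    targets = [line.rstrip("\n") for line in corrected_lines]
    if len(sources) != len(targets):
        raise ValueError(
            f"expected line-aligned data, got {len(sources)} source lines "
            f"and {len(targets)} target lines"
        )
    return [(list(src), list(tgt)) for src, tgt in zip(sources, targets)]


if __name__ == "__main__":
    # Toy strings, not real data: the OCR output has "0" where "o" is correct.
    pairs = make_training_pairs(["hell0 world\n"], ["hello world\n"])
    source_chars, target_chars = pairs[0]
    print(source_chars)
    print(target_chars)
```

Raising an error on mismatched line counts catches alignment problems early, before they silently pair the wrong source and target lines.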

For a single-source model, modify the experimental settings in train_single-source.sh to point to the appropriate dataset and desired output folder. It is currently set up to use sample_dataset.

Then run

bash train_single-source.sh

For multisource, use train_multi-source.sh.

Log files and saved models are written to the user-specified experiment folder for both the pretraining and training steps. For a list of all available hyperparameters and options, look at postcorrection/constants.py and postcorrection/opts.py.

Testing

For testing with a single-source model, modify the experimental settings in test_single-source.sh. It is currently set up to use sample_dataset.

Then run

bash test_single-source.sh

For multisource, use test_multi-source.sh.

Citation

Please cite our paper if this repository was useful.

@inproceedings{rijhwani-etal-2020-ocr,
    title = "{OCR} {P}ost {C}orrection for {E}ndangered {L}anguage {T}exts",
    author = "Rijhwani, Shruti  and
      Anastasopoulos, Antonios  and
      Neubig, Graham",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.478",
    doi = "10.18653/v1/2020.emnlp-main.478",
    pages = "5931--5942",
}

License
