A tool for extracting text from scanned documents (via OCR), with user-defined post-processing.

Overview

The project is based on older versions of tesseract and other tools, and is now superseded by another project which allows for more granular control over the text recognition process.

go-ocr

A tool for extracting plain text from scanned documents (pdf or djvu), with user-defined postprocessing.

Motivation

I once had to OCR a number of scanned documents in pdf format. I quickly built a pipeline of tools to extract images from the input files and convert them to plain text, but then realised that modern OCR software is still far from ideal at recognising text, so a good deal of postprocessing was needed to remove at least some of the OCR artefacts and irregularities. I ended up with a long pipeline of sed/grep filters, which also had to be adjusted for each document and each document language. What I wanted was a tool that could combine the invocation of the OCR tools with the application of filters, while also giving an easy way of modifying and combining the filter definitions.

The tool

Given an input file in either pdf or djvu format, the tool performs the following steps:

  1. Images are extracted from the input file using the pdfimages or ddjvu tool;
  2. The extracted images are converted to plain text using the tesseract tool, in parallel;
  3. The specified filters are applied to the text.
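
The parallel step above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the names ocrPages and recognise are hypothetical, and the recogniser is passed in as a function so that the example runs without tesseract installed. In a real pipeline that function would run tesseract on the image file.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// ocrPages runs recognise on every image concurrently and returns the
// results in page order. Both names are hypothetical; in a real pipeline
// recognise would invoke tesseract on the image file.
func ocrPages(images []string, recognise func(image string) (string, error)) ([]string, error) {
	texts := make([]string, len(images))
	errs := make([]error, len(images))

	var wg sync.WaitGroup
	for i, image := range images {
		wg.Add(1)
		go func(i int, image string) {
			defer wg.Done()
			// Each goroutine writes only to its own slot, so no lock is needed.
			texts[i], errs[i] = recognise(image)
		}(i, image)
	}
	wg.Wait()

	// Report the first failure, if any, together with its page number.
	for i, err := range errs {
		if err != nil {
			return nil, fmt.Errorf("page %d (%s): %v", i+1, images[i], err)
		}
	}
	return texts, nil
}

func main() {
	// Stand-in recogniser used only for demonstration.
	fake := func(image string) (string, error) {
		return "text of " + image, nil
	}
	pages, err := ocrPages([]string{"page-001.tif", "page-002.tif"}, fake)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(strings.Join(pages, "\n"))
}
```

The sketch starts one goroutine per page for brevity. For large documents a bounded worker pool would probably be a better fit, since each tesseract process is expensive.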

Invocation

go-ocr [OPTION]... FILE

Command line options:

-f,--first N        first page number (optional, default: 1)
-l,--last  N        last page number (optional, default: last page of the document)
-F,--filter FILE    filter specification file name (optional, may be given multiple times)
-L,--language LANG  document language (optional, default: 'eng')
-o,--output FILE    output file name (optional, default: stdout)
-h,--help           display this help and exit
-v,--version        output version information and exit

Example

The following command processes a document some.pdf in Russian, from page 12 to page 26 (inclusive), without any postprocessing, storing the result in the file document.txt:

./go-ocr --first 12 --last 26 --language rus --output document.txt some.pdf

Filter definitions

A filter definition file is a plain text file containing rewriting rules and C-style comments. Each rewriting rule has the following format:

scope type "match" "substitution"

where

  • scope is either line or text;
  • type is either word or regex;
  • match and substitution are Go strings.

Each rule must be on one line.
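
As an illustration of the rule format, here is one way such a line could be parsed in Go. This is a sketch under assumptions rather than the tool's own parser: the rule type and the functions cutField, nextString and parseRule are hypothetical names, and comment handling is left out, so a trailing comment would be reported as an error. It relies on strconv.QuotedPrefix (available since Go 1.17) and strconv.Unquote to read both interpreted and raw string literals.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// rule is a hypothetical in-memory form of one rewriting rule.
type rule struct {
	scope, kind         string
	match, substitution string
}

// cutField returns the first whitespace-delimited field of s and the rest.
func cutField(s string) (field, rest string) {
	s = strings.TrimLeft(s, " \t")
	if i := strings.IndexAny(s, " \t"); i >= 0 {
		return s[:i], s[i:]
	}
	return s, ""
}

// nextString reads one Go string literal (quoted or raw) from the start of s
// and returns its unquoted value together with the remaining input.
func nextString(s string) (value, rest string, err error) {
	s = strings.TrimLeft(s, " \t")
	quoted, err := strconv.QuotedPrefix(s)
	if err != nil {
		return "", "", fmt.Errorf("expected a Go string literal: %q", s)
	}
	value, err = strconv.Unquote(quoted)
	if err != nil {
		return "", "", err
	}
	return value, s[len(quoted):], nil
}

// parseRule parses a single line of the form: scope type "match" "substitution".
func parseRule(line string) (rule, error) {
	var r rule
	var err error
	r.scope, line = cutField(line)
	r.kind, line = cutField(line)
	if r.scope != "line" && r.scope != "text" {
		return rule{}, fmt.Errorf("unknown scope %q", r.scope)
	}
	if r.kind != "word" && r.kind != "regex" {
		return rule{}, fmt.Errorf("unknown type %q", r.kind)
	}
	if r.match, line, err = nextString(line); err != nil {
		return rule{}, err
	}
	if r.substitution, line, err = nextString(line); err != nil {
		return rule{}, err
	}
	if strings.TrimSpace(line) != "" {
		return rule{}, fmt.Errorf("unexpected trailing input %q", line)
	}
	return r, nil
}

func main() {
	r, err := parseRule("line regex\t`\\s+`\t\" \"")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", r)
}
```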

Each rule of the scope line is applied to each line of the text. There is no processing done to the line by the tool itself other than trimming the trailing whitespace, which means that a line does not have a trailing newline symbol when the rule is applied. After that all the lines get combined into text with newline symbols inserted between them.

Each rule of the scope text is applied to the whole text after all the line rules. All newline symbols are visible to the rule which allows for combining multiple lines into one.

The reason for having two different scopes for the rules is that applying a rule to a line is computationally cheaper than applying it to the whole text. Also, this makes the line regular expressions a bit simpler as, for example, \s regex cannot match a newline.
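
The order of application for the two scopes can be sketched as below. This is an assumption-laden illustration, not the tool's implementation: the filter type and the functions applyFilters and exampleFilters are hypothetical, and the sketch works on strings for readability.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// filter is a hypothetical type for one compiled rewriting rule.
type filter func(string) string

// applyFilters follows the order described above: every line is trimmed of
// trailing whitespace and passed through the line filters, the lines are
// joined with newlines, and the result is passed through the text filters.
func applyFilters(lines []string, lineFilters, textFilters []filter) string {
	out := make([]string, len(lines))
	for i, line := range lines {
		line = strings.TrimRight(line, " \t\r\n")
		for _, f := range lineFilters {
			line = f(line)
		}
		out[i] = line
	}
	text := strings.Join(out, "\n")
	for _, f := range textFilters {
		text = f(text)
	}
	return text
}

// exampleFilters builds two line filters and one text filter.
func exampleFilters() (lineFilters, textFilters []filter) {
	spaces := regexp.MustCompile(`\s+`)
	midSentence := regexp.MustCompile(`([a-z\(\),])\n+([a-z\(\)])`)

	lineFilters = []filter{
		func(s string) string { return strings.Replace(s, "...", "…", -1) },
		func(s string) string { return spaces.ReplaceAllString(s, " ") },
	}
	textFilters = []filter{
		// Only a text filter can see the newlines between lines.
		func(s string) string { return midSentence.ReplaceAllString(s, "${1} ${2}") },
	}
	return lineFilters, textFilters
}

func main() {
	lineFilters, textFilters := exampleFilters()
	lines := []string{"The  quick brown   ", "fox jumps...", "", "Next paragraph."}
	fmt.Println(applyFilters(lines, lineFilters, textFilters))
}
```

In this sketch the first two lines are joined into one sentence by the text filter, while the empty line before "Next paragraph." survives because the preceding character is not a lowercase letter, bracket or comma.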

Rules of type word do a simple substitution replacing any match string with its corresponding substitution string.

Rules of type regex search the input for any match of the match regular expression and replace it with the substitution string. The syntax of the regular expression is that of the Go regexp engine. The substitution string may contain references to the content of capturing groups from the corresponding match regular expression. From the Go documentation, each reference

is denoted by a substring of the form $name or ${name}, where name is a non-empty sequence of letters, digits, and underscores. A purely numeric name like $1 refers to the submatch with the corresponding index; other names refer to capturing parentheses named with the (?P<name>...) syntax. A reference to an out of range or unmatched index or a name that is not present in the regular expression is replaced with an empty slice.

In the $name form, name is taken to be as long as possible: $1x is equivalent to ${1x}, not ${1}x, and, $10 is equivalent to ${10}, not ${1}0.

To insert a literal $ in the output, use $$ in the template.
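
The reference syntax is easiest to see on small inputs. The sketch below uses Regexp.ReplaceAllString from the Go standard library, which expands the template in the same way as Regexp.ReplaceAll does for byte slices. The helper name rewrite is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// rewrite applies one regex rule to the input, expanding $name references
// in the substitution. The function name is hypothetical.
func rewrite(pattern, substitution, input string) string {
	return regexp.MustCompile(pattern).ReplaceAllString(input, substitution)
}

func main() {
	// Numbered groups: swaps the two numbers, giving "26-12".
	fmt.Println(rewrite(`(\d+)-(\d+)`, "${2}-${1}", "12-26"))

	// Named group: gives "stop.".
	fmt.Println(rewrite(`(?P<word>\w+)!`, "${word}.", "stop!"))

	// $1x is read as ${1x}; no such group exists, so the match is
	// replaced with nothing and an empty line is printed.
	fmt.Println(rewrite(`(\w+)`, "$1x", "abc"))

	// Braces make the intent explicit: gives "abcx".
	fmt.Println(rewrite(`(\w+)`, "${1}x", "abc"))

	// $$ yields a literal dollar sign: gives "$15".
	fmt.Println(rewrite(`(\d+)`, "$$${1}", "15"))
}
```

Because of the "as long as possible" rule, writing references with braces in filter files is probably the safer habit.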

All filter definition files are always processed in the order in which they are specified on the command line. Within each file, the rules are grouped by the scope, and applied in the order of specification. This allows for each rule to rely on the outcome of all the rules before it.

Rewriting rules examples

Rule to replace ellipsis with a single utf-8 symbol:

line word	"..."  "…"

Rule to replace all whitespace sequences with a single space character:

line regex	`\s+`	" "

Rule to remove all newline characters from the middle of a sentence:

text regex	`([a-z\(\),])\n+([a-z\(\)])` "${1} ${2}"

More examples can be found in the files filter-eng and filter-rus.

In practice, it is often useful to maintain one filter definition file with rules to remove common OCR artefacts, and another file with rules specific to a particular document. In general, it is probably impossible to avoid all manual editing altogether by using this tool, but from my experience, a few hours spent on setting up the appropriate filters for a 700 pages document can dramatically reduce the amount of manual work needed afterwards.

Other tools

Internally the program relies on the pdfimages and ddjvu tools for extracting images from the input file, and on the tesseract program for the actual OCR'ing. The tool pdfimages is usually a part of the poppler-utils package, the tool ddjvu comes from the djvulibre-bin package, and tesseract is included in the tesseract-ocr package. By default, tesseract comes with English language support only; other languages have to be installed separately. For example, run sudo apt install tesseract-ocr-rus to install Russian language support. To find out which languages are currently installed, type tesseract --list-langs.

Compilation

Invoke make (or make debug) from the directory of the project to compile the code with debug information included, or make release to compile without debug symbols. This creates the executable file go-ocr.

Technical details

The tool first runs the pdfimages or ddjvu program to extract images to a temporary directory, and then invokes tesseract on each image in parallel to produce lines of plain text. Those lines are then passed through the line filters, if any, then assembled into one text string and passed through the text filters, if any. Filters of type regex are implemented using the Regexp.ReplaceAll() function, and filters of type word are invocations of the bytes.Replace() function.
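
The two filter kinds could be wrapped as closures over those standard library calls, for example as follows. This is only a sketch: the filter type and the names newWordFilter, newRegexFilter and run are hypothetical and are not taken from the tool's source.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
)

// filter is a hypothetical type for one compiled rewriting rule.
type filter func([]byte) []byte

// newWordFilter replaces every occurrence of match with substitution
// using bytes.Replace (n = -1 means no limit on replacements).
func newWordFilter(match, substitution string) filter {
	m, s := []byte(match), []byte(substitution)
	return func(input []byte) []byte {
		return bytes.Replace(input, m, s, -1)
	}
}

// newRegexFilter compiles match once and applies Regexp.ReplaceAll,
// which expands $name references found in substitution.
func newRegexFilter(match, substitution string) (filter, error) {
	re, err := regexp.Compile(match)
	if err != nil {
		return nil, err
	}
	s := []byte(substitution)
	return func(input []byte) []byte {
		return re.ReplaceAll(input, s)
	}, nil
}

// run is a small convenience wrapper for string input.
func run(f filter, input string) string {
	return string(f([]byte(input)))
}

func main() {
	ellipsis := newWordFilter("...", "…")
	spaces, err := newRegexFilter(`\s+`, " ")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(run(spaces, run(ellipsis, "wait...   and \t see")))
}
```

Compiling the regular expression once, when the rule is loaded, avoids recompiling it for every line that the rule is applied to.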

Known issues

Older versions of the pdfimages tool do not have the -tiff option, resulting in an error.

Platform

Linux (tested on Linux Mint 18 64bit, based on Ubuntu 16.04), will probably work on MacOS as well.

Tools:

$ go version
go version go1.6.2 linux/amd64
$ tesseract --version
tesseract 3.04.01
...
$ pdfimages --version
pdfimages version 0.41.0
...
$ ddjvu --help
DDJVU --- DjVuLibre-3.5.27
...
License: BSD