Awesome OCR
This list contains links to great software tools, libraries, and literature related to Optical Character Recognition (OCR).
Contributions are welcome, as is feedback.
Software
OCR engines
- tesseract - The definitive Open Source OCR engine
Apache 2.0
- EasyOCR - OCR engine built on PyTorch by JaidedAI,
Apache 2.0
- ocropus - OCR engine based on LSTM,
Apache 2.0
- ocropus 0.4 - Older v0.4 state of Ocropus, with tesseract 2.04 and iulib, C++
- kraken - Ocropus fork with sane defaults
- gocr - OCR engine under the GNU Public License led by Joerg Schulenburg.
- Ocrad - The GNU OCR.
GPL
- ocular - Machine-learning OCR for historic documents
- SwiftOCR - fast and simple OCR library written in Swift
- attention-ocr - OCR engine using visual attention mechanisms
- RWTH-OCR - The RWTH Aachen University Optical Character Recognition System
- simple-ocr-opencv and its fork - A simple pythonic OCR engine using opencv and numpy
- Calamari - OCR Engine based on OCRopy and Kraken
Older and possibly abandoned OCR engines
- Clara OCR - Open source OCR in C
GPL
- Cuneiform - CuneiForm OCR was developed by Cognitive Technologies
- Eye - an experimental Java OCR (image-to-text) application
- kognition - An omnifont OCR software for KDE
- OCRchie - Modular Optical Character Recognition Software
- ocre - o.c.r. easy
- xplab - A GTK 2 tool for pattern matching
- hebOCR - Hebrew character recognition library (previously named hocr, see Wikipedia article)
GPL
OCR file formats
hOCR
- hocr-tools - Tools for doing various useful things with hOCR files,
Apache 2.0
- hocr-spec - hOCR 1.2 specification
- ocr-transform - CLI tool to convert between hOCR and ALTO,
MIT
- hocr-parser - hOCR Specification Python Parser
- hOCRTools - hOCR to ALTO conversion XSLT
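For orientation: hOCR is HTML in which layout data is stored in the `title` attribute, for example `bbox x0 y0 x1 y1` on elements with class `ocrx_word`. The following is a minimal sketch using only the Python standard library. The sample markup and the names `WordBoxParser` and `extract_words` are illustrative assumptions and do not come from any tool listed above. It also assumes that word spans contain no nested spans.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" property inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordBoxParser(HTMLParser):
    """Collect (text, bbox) pairs from elements with class ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            self._bbox = tuple(int(v) for v in match.groups()) if match else None
            self._text = []
            self._in_word = True

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Simplification: the first closing span ends the current word.
        if self._in_word and tag == "span":
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._in_word = False


def extract_words(hocr):
    """Return a list of (word, (x0, y0, x1, y1)) tuples from an hOCR string."""
    parser = WordBoxParser()
    parser.feed(hocr)
    return parser.words


# Hypothetical hOCR fragment for illustration only.
SAMPLE = """<span class='ocr_line' title='bbox 10 20 200 50'>
<span class='ocrx_word' title='bbox 10 20 90 50; x_wconf 93'>Hello</span>
<span class='ocrx_word' title='bbox 100 20 200 50; x_wconf 88'>world</span>
</span>"""

if __name__ == "__main__":
    for word, bbox in extract_words(SAMPLE):
        print(word, bbox)
```

For real-world files, the dedicated tools listed above are likely a better choice, since they handle the full specification.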
ALTO XML
- ALTO XML Schema - XML Schema and development of the ALTO XML format
- ALTO XML Documentation - Documentation and use cases for ALTO
- alto-tools - Various tools to work with ALTO files, Python
- AbbyyToAlto - PHP script converting from Abbyy 6 to ALTO XML
TEI
- TEI-OCR - TEI customization for OCR generated layout and content information
- TEI SIG on Libraries - Best Practices for TEI in Libraries
- GDZ - METS/TEI-based GDZ document format
PAGE XML
- PAGE-XML Schema - XML schema of the PAGE XML format along with documentation and examples
- omni:us Pages Format (OPF) - XML schema very similar to PAGE XML that has some additional features.
- py-pagexml - Python library for handling PAGE XML and OPF files.
OCR CLI
- OCRmyPDF - OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
- Pdf2PdfOCR - A tool to OCR a PDF (or supported images) and add a text "layer" (a "pdf sandwich") in the original file making it a searchable PDF. GUI included. Tesseract and cuneiform supported.
- Ocrocis - Project manager interface for Ocropy, see also external project homepage
- tesseract-recognize - Tesseract-based tool that outputs result in Page XML format (docker image).
OCR GUI
- moz-hocr-editor - Firefox Addon for editing hOCR files (discontinued)
- qt-box-editor - QT4 editor of tesseract-ocr box files.
- ocr-gt-tools - Client-Server application for editing OCR ground truth.
- Paperwork - Using scanners and OCR to grep paper documents the easy way.
- Paperless - Scan, index, and archive all of your paper documents.
- gImageReader - gImageReader is a simple Gtk/Qt front-end to tesseract-ocr.
- VietOCR - A Java/.NET GUI frontend for the Tesseract OCR engine, including jTessBoxEditor, a graphical Tesseract box data editor
- PoCoTo - Fast interactive batch corrections of complete OCR error series in OCR'ed historical documents.
- OCRFeeder - GTK graphical user interface that allows the users to correct characters or bounding boxes, ODT export and more.
- PRImA PAGE Viewer - Java based viewer for PAGE XML files (layout + text content). Also supports ALTO XML, FineReader XML, and HOCR.
- LAREX - A semi-automatic open-source tool for Layout Analysis and Region EXtraction on early printed books.
- archiscribe - Web application for transcribing OCR ground truth from Archive.org. Deployed instance available at https://archiscribe.jbaiter.de/, results are available in @jbaiter/archiscribe-corpus.
- nw-page-editor - Simple app for visual editing of Page XML files. Provides desktop and server docker-based versions.
OCR Preprocessing
- NoiseRemove.java in MathOCR - Java implementation of "Adaptive degraded document image binarization" by B. Gatos, I. Pratikakis, S.J. Perantonis
- binarize.c in ZBar - C implementations of two binarization algorithms, based on Sauvola
- typeface-corpus - A repository for typefaces to train Tesseract and OCRopus for natural history collections and digital humanities.
- binarizewolfjolion - Comparison of binarization algorithms. Blog post
- crop_morphology.py in oldnyc - Cropping a page to just the text block
- Whiteboard Picture Cleaner - Shell one-liner/script to clean up and beautify photos of whiteboards
- Fred's ImageMagick script textcleaner - Processes a scanned document of text to clean the text background
- localcontrast - Fast O(1) local contrast optimization
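Several entries above refer to Sauvola-style binarization. As a rough illustration, the textbook Sauvola threshold for a pixel is `T = m * (1 + k * (s / R - 1))`, where `m` and `s` are the mean and standard deviation of the grey values in a window around the pixel. The sketch below is a naive pure-Python version. The function names and the defaults `k=0.5`, `r=128` and `radius=1` are assumptions chosen for illustration, and this is not the implementation used by any of the listed projects.

```python
from statistics import mean, pstdev


def sauvola_threshold(window, k=0.5, r=128.0):
    """Textbook Sauvola threshold for a flat list of grey values (0-255)."""
    m = mean(window)
    s = pstdev(window)  # population standard deviation of the window
    return m * (1.0 + k * (s / r - 1.0))


def binarize(image, radius=1, k=0.5, r=128.0):
    """Binarize a 2D list of grey values: 1 = background, 0 = ink.

    Naive version: the window is re-read for every pixel, so this is only
    suitable for small examples. Windows are clipped at the image border.
    """
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            threshold = sauvola_threshold(window, k, r)
            row.append(1 if image[y][x] > threshold else 0)
        result.append(row)
    return result


if __name__ == "__main__":
    # Hypothetical 3x3 image: one dark pixel on a light background.
    page = [[200, 200, 200], [200, 20, 200], [200, 200, 200]]
    for line in binarize(page):
        print(line)
```

Production implementations typically use integral images so that the window statistics cost constant time per pixel.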
OCR as a Service
- Open OCR - Run Tesseract in Docker containers
- tesseract-web-service - An implementation of RESTful web service for tesseract-OCR using tornado.
- docker-ocropy - A Docker container for running the ocropy OCR system.
- ABBYY Cloud OCR SDK Code samples - Code samples for using the proprietary commercial ABBYY OCR API.
- nidaba - An expandable and scalable OCR pipeline
- gamera - A meta-framework for building document processing applications, e.g. OCR
- ocr-tools - Project to provide CLI and web service interfaces to common OCR engines
- ocrad-docker - Run the ocrad OCR engine in a docker container
- kraken-docker - Run the kraken OCR engine in a docker container
- Konfuzio - Free Online OCR for up to 2,000 pages per month and OCR API by @atraining, see https://youtu.be/NZKUrKyFVA8 (code is not open)
- ocr.space - Free Online OCR and OCR API by @a9t9 based on Tesseract (code is not open)
- OCR4all - Provides OCR services through web applications. Included Projects: LAREX, OCRopus, calamari and nashi.
OCR evaluation
- ISRI OCR Evaluation Tools with a User Guide from 1996
- isri-ocr-evaluation-tools - further development by @eddieantonio (2015, 2016)
- ancientgreekocr-evaluation-tools - further development by @nickjwhite (2013, 2014)
- ocrevalUAtion - Cross-format evaluation, CLI and GUI
- ngram-ocr-eval - Brute and simple OCR evaluation using ngrams
- quack - Quality-Assurance-tool for scans with corresponding ALTO-files
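A common metric in OCR evaluation is the character error rate (CER): the edit distance between OCR output and ground truth, divided by the length of the ground truth. The following is a minimal sketch under that definition. The function names are illustrative, and the tools listed above may normalize text or count errors differently.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion from a
                curr[j - 1] + 1,     # insertion into a
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ground_truth, ocr_output):
    """CER = edit distance / number of ground-truth characters."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)


if __name__ == "__main__":
    # Hypothetical example: one substituted character in a seven-letter word.
    print(character_error_rate("Fraktur", "Frahtur"))
```

Note that the CER can exceed 1.0 when the OCR output is much longer than the ground truth.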
OCR libraries by programming language
Go
- gosseract - Golang OCR library, wrapping Tesseract-ocr.
Java
- Tess4J - Java Native Access bindings to Tesseract.
- tess-two - Tools for compiling Tesseract on Android and Java API.
.Net
- tesseract for .net - A .Net wrapper for tesseract-ocr.
Object Pascal
- TTesseractOCR4 - Object Pascal binding for tesseract-ocr 4.x.
PHP
- Tesseract OCR for PHP - Tesseract PHP bindings.
Python
- pytesseract - A Python wrapper for Google Tesseract.
- pyocr - A Python wrapper for Tesseract and Cuneiform.
- ocrodjvu - A library and standalone tool for doing OCR on DjVu documents, wrapping Cuneiform, gocr, ocrad, ocropus and tesseract
- tesserocr - A Python wrapper for the tesseract-ocr API
Javascript
- ocracy - pure javascript lstm rnn implementation based on ocropus
- gocr.js - Javascript port (emscripten) of gocr
- ocrad.js - Javascript port (emscripten) of ocrad
- tesseract.js - Javascript port (emscripten) of Tesseract
- node-tesseract-ocr - A simple wrapper for the Tesseract OCR package.
- node-tesseract-native - C++ module for node providing OCR with tesseract and leptonica.
Ruby
- rtesseract - Ruby library wrapping the tesseract and imagemagick executables.
- ruby-tesseract - Native Tesseract bindings for Ruby MRI and JRuby
- ocr_space - API wrapper for free ocr service ocr.space. Includes CLI
Rust
- tesseract.rs - Rust bindings for tesseract OCR.
- leptess - Productive and safe Rust bindings/wrappers for tesseract and leptonica.
R
- tesseract - R bindings for tesseract OCR.
Swift
- Tesseract OCR iOS - Swift and Objective-C wrapper for Tesseract OCR.
- SwiftOCR - Fast and simple OCR library written in Swift. Optimized for recognizing short, one line long alphanumeric codes.
OCR training tools
- glyph-miner - A system for extracting glyphs from early typeset prints
- ocrodeg - Document image degradation for OCR data augmentation
Datasets
Ground Truth
- archiscribe-corpus - >4,200 lines transcribed from 19th Century German prints via archiscribe
CC-BY 4.0
- CIS OCR Test Set - 2 example documents each in German/Latin/Greek with ground truth for PoCoTo
- Rescribe - Transcriptions of Caroline Minuscule Manuscripts
PDM 1.0
- CLTK - Corpora from Classical Language Toolkit
PDM 1.0
- DIVA-HisDB - 150 pages (PAGE-XML) of three medieval manuscripts
CC-BY-NC 3.0
- EarlyPrintedBooks - ~8,800 lines from several early printed books
CC-BY-NC-SA 4.0
- EEBO-TCP - 25,363 EEBO documents transcribed by TCP
PDM 1.0
- ECCO-TCP - 2,188 ECCO documents transcribed by TCP
PDM 1.0
- eMOP-TCP - 2,188 ECCO-TCP documents, cleaned up by eMOP
PDM 1.0
- Evans-TCP - 4,977 Evans documents transcribed by TCP
- FDHN - Finnish Digitised Historical Newspapers, Paper, (free) registration required, Terms of Use
- FROC-MSS - 4 Old French Medieval Manuscripts
CC-BY 4.0
- GERMANA - 764 Spanish manuscript pages, (free) registration required
non-commercial use only
- GT4HistOCR - Ground Truth for German Fraktur and Early Modern Latin
CC-BY 4.0
- imagessan - Sanskrit images & ground truth (Devanagari script)
- IMPACT-BHL - 2,418 pages (PAGE-XML) from the Biodiversity Heritage Library
CC-BY 3.0
- IMPACT-BL - 294 pages (PAGE-XML) from the British Library, (free) registration required
PDM 1.0
- IMPACT-BNE - 215 pages (PAGE-XML) from the National Library of Spain, (free) registration required
CC-BY-NC-SA 4.0
- IMPACT-BNF - 151 pages (PAGE-XML) from the National Library of France, (free) registration required
CC-BY-NC-SA 4.0
- IMPACT-KB - 142 pages (PAGE-XML) from the National Library of the Netherlands
CC-BY 4.0
- IMPACT-NKC - 187 pages (PAGE-XML) from the Czech National Library, (free) registration required
CC-BY-NC-SA 4.0
- IMPACT-NLB - 19 pages (PAGE-XML) from the National Library of Bulgaria, (free) registration required
CC-BY-NC-ND 4.0
- IMPACT-NUK - 209 pages (PAGE-XML) from the National Library of Slovenia, (free) registration required
CC-BY-NC-SA 4.0
- IMPACT-PSNC - 478 pages (PAGE-XML) from four Polish digital libraries
CC-BY 3.0
- LascivaRoma/lexical - Transcription of 19th century lexical resources for Latin learning
- MJSynth - 9 million synthetic images covering 90k English words
- OCR19thSAC - 19,000 pages Swiss Alpine Club yearbooks transcribed via Text+Berg digital
CC-BY 4.0
- OCR-D - 180 pages (PAGE-XML) of German historical prints from OCR-D
CC-BY-SA 4.0
- OCR_GS_Data - Double-checked Arabic Gold Standard from OpenITI
- old-books - 322 old books from Project Gutenberg
GPL 3.0
- PRImA-ENP - 528 pages (PAGE-XML) of historic newspapers from Europeana Newspapers, (free) registration required
PDM 1.0
- RODRIGO - 853 Spanish manuscript pages, (free) registration required
non-commercial use only
- Toebler-OCR - (Kraken) Ground Truth transcription of a few pages of the Tobler-Lommatzsch: Altfranzösisches Wörterbuch
Literature
OCR-related publication and link lists
- IMPACT: Tools for text digitisation - List of software tools and projects, some related to OCR
- OCR-D - List of OCR-related academic articles in the context of the OCR-D project. 🇩🇪
- Mendeley Group "OCR - Optical Character Recognition" - Collection of 34 papers on OCR
- eadh.org projects - List of Digital Humanities-related projects in Europe, some related to OCR
- Wikipedia: Comparison of optical character recognition software
- OCR [and Deep Learning] by @handong1587
- Ocropus Wiki: Publications
Blog Posts and Tutorials
- Tesseract Blends Old and New OCR Technology (2016) @theraysmith
- Updated "What You Always Wanted to Know" slides
- What You Always Wanted To Know About Tesseract (2014) @theraysmith
- Includes demos
- Extracting text from an image using Ocropus (2015)
- Training an Ocropus OCR model (2015) @danvk
- Ocropus Wiki: Compute errors and confusions (2016) @zuphilip
- Ocropus Wiki: Working with Ground Truth (2016) @zuphilip
- OCRopus (2016) @jze
- mostly on column separation in ocropus
- 10 Tips for making your OCR project succeed (2013) @cneud
- general things to consider for OCR projects
- Overview of LEADTOOLS Image Cleanup and Pre-processing SDK Technology
- feature list for a commercial image pre-processing library; has nice before-after samples for pre-processing steps related to OCR
- Extracting Text from PDFs; Doing OCR; all within R @shawngraham
- How to work with OCR from PDFs in the R programming environment
- Tutorial: Command-line OCR on a Mac @bmschmidt
- Tutorial on how to run Tesseract on Mac OS X
- Practical Experience with OCRopus Model Training (2016) @jze
- Homemade Manuscript OCR (1): OCRopy (2017) @Jean-Baptiste-Camps
- Tutorial on applying OCR to medieval manuscripts with OCRopy
- Optimizing Binarization for OCRopus (2017) @jze
- Prototype demo for OCR postfix in Danish Newspapers (2016) @thomasegense
- How Can I OCR My Dictionary? (2016) @JessedeDoes
- "Needlessly complex" blog (2016) @mzucker. Several image processing how-tos (Python based), particularly:
- (Open-Source-)OCR-Workflows (2017) @wrznr
🇩🇪 Overview of the state of the art in open source OCR and related technologies (binarisation, deskewing, layout recognition, etc.), with lots of example images and information on the @OCR-D project.
- A gentle introduction to OCR (2018) @shgidi
- Worauf kann ich mich verlassen? Arbeiten mit digitalisierten Quellen, Teil 1: OCR (2019) @eliaskreyenbuehl
🇩🇪 A critical reflection on OCR quality and OCR pitfalls with Fraktur fonts.
OCR Showcases
- abbyy-finereader-ocr-senate - Using OCR to parse scanned Senate Financial Disclosure forms.
- cvOCR - An OCR system for recognizing resume or cv text, implemented in Python and C and based on tesseract
- MathOCR - A printed scientific document recognition system, pre-alpha
Academic articles
2011 and before
- High performance document layout analysis (2003) Breuel
- Adaptive degraded document image binarization (2006) Gatos, Pratikakis, Perantonis
- [Internship Report] (2007) Gupta
- OCRopus Addons (Internship Report) (2007) Dantrey
2012
- Local Logistic Classifiers for Large Scale Learning (2012) Yousefi, Breuel
2013
- High Performance OCR for Printed English and Fraktur using LSTM Networks (2013) Breuel, Ul-Hasan, Mayce Al Azawi, Shafait
- Can we build language-independent OCR using LSTM networks? (2013) Ul-Hasan, Breuel
- Offline Printed Urdu Nastaleeq Script Recognition with Bidirectional LSTM Networks (2013) Ul-Hasan, Ahmed, Rashid, Shafait, Breuel
2014
- OCR of historical printings of Latin texts: Problems, Prospects, Progress. (2014) Springmann, Najock, Morgenroth, Schmid, Gotscharek, Fink
- Correcting Noisy OCR: Context beats Confusion (2014) Evershed, Fitch
2015
- TypeWright: An Experiment in Participatory Curation (2015) Bilansky
- On crowd-sourcing OCR postcorrection
- Benchmarking of LSTM Networks (2015) Breuel
- Recognition of Historical Greek Polytonic Scripts Using LSTM (2015) Simistira, Ul-Hasan, Papavassiliou, Basilis Gatos, Katsouros, Liwicki
- A Segmentation-Free Approach for Printed Devanagari Script Recognition (2015) Karayil, Ul-Hasan, Breuel
- A Sequence Learning Approach for Multiple Script Identification (2015) Ul-Hasan, Afzal, Shafait, Liwicki, Breuel
2016
- Important New Developments in Arabographic Optical Character Recognition (OCR) (2016) Romanov, Miller, Savant, Kiessling
- on kraken
- using OpenArabic/OCR_GS_Data for ground truth data
- OCR of historical printings with an application to building diachronic corpora: A case study using the RIDGES herbal corpus (2016) Springmann, Lüdeling
- Automatic quality evaluation and (semi-) automatic improvement of mixed models for OCR on historical documents (2016) Springmann, Fink, Schulz
- Generic Text Recognition using Long Short-Term Memory Networks (2016) Ul-Hasan -- Ph.D Thesis
- OCRoRACT: A Sequence Learning OCR System Trained on Isolated Characters (2016) Dengel, Ul-Hasan, Bukhari
- Recursive Recurrent Nets with Attention Modeling for OCR in the Wild (2016) Lee, Osindero
2017
- Telugu OCR Framework using Deep Learning (2015/2017) Achanta, Hastie
- see also TeluguOCR, banti_telugu_ocr, chamanti_ocr, #49
2018
- A Two-Stage Method for Text Line Detection in Historical Documents (2018) Grüning, Leifert, Strauß, Labahn. Code available at https://github.com/TobiasGruening/ARU-Net