Run Tesseract via the tesserocr bindings with @OCR-D's interfaces

ocrd_tesserocr

Crop, deskew, segment into regions / tables / lines / words, or recognize with tesserocr

Introduction

This package offers OCR-D compliant workspace processors for (much of) the functionality of Tesseract via its Python API wrapper tesserocr. (Each processor is a parameterizable step in a configurable workflow of the OCR-D functional model. There are usually various alternative processor implementations for each step. Data is represented with METS and PAGE.)

It includes image preprocessing (cropping, binarization, deskewing), layout analysis (region, table, line, word segmentation), script identification, font style recognition and text recognition.

Most processors can operate on different levels of the PAGE hierarchy, depending on the workflow configuration. In PAGE, image results are referenced (read and written) via AlternativeImage, text results via TextEquiv, font attributes via TextStyle, script via @primaryScript, deskewing via @orientation, cropping via Border and segmentation via Region / TextLine / Word elements with Coords/@points.
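
To make the mapping above more concrete, here is a minimal sketch that reads a few of these annotations with the Python standard library. The PAGE document is a hypothetical, heavily abbreviated example, and the 2019-07-15 namespace is an assumption; real workflows would normally go through the OCR-D core API instead.

```python
import xml.etree.ElementTree as ET

# Assumption: the 2019-07-15 PAGE namespace; adjust to the version in your files.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hypothetical, heavily abbreviated PAGE document (not output of any processor).
PAGE_XML = """<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <pc:Page imageFilename="page.tif" orientation="0.5">
    <pc:Border><pc:Coords points="10,10 990,10 990,1490 10,1490"/></pc:Border>
    <pc:TextRegion id="r1" type="paragraph">
      <pc:Coords points="100,100 900,100 900,300 100,300"/>
      <pc:TextLine id="r1_l1">
        <pc:Coords points="100,100 900,100 900,150 100,150"/>
        <pc:TextEquiv><pc:Unicode>Hello world</pc:Unicode></pc:TextEquiv>
      </pc:TextLine>
    </pc:TextRegion>
  </pc:Page>
</pc:PcGts>"""


def parse_points(points):
    """Turn a Coords/@points string into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def summarize(xml_text):
    """Collect the border, page orientation and line texts from a PAGE document."""
    page = ET.fromstring(xml_text).find("pc:Page", NS)
    border = page.find("pc:Border/pc:Coords", NS)
    return {
        "orientation": page.get("orientation"),
        "border": parse_points(border.get("points")) if border is not None else None,
        "lines": {
            line.get("id"): line.findtext("pc:TextEquiv/pc:Unicode", namespaces=NS)
            for line in page.iterfind(".//pc:TextLine", NS)
        },
    }


summary = summarize(PAGE_XML)
```
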

Installation

Required Ubuntu packages:

  • Tesseract headers (libtesseract-dev)
  • Some Tesseract language models (tesseract-ocr-{eng,deu,frk,...}) or script models (tesseract-ocr-script-{latn,frak,...}), or better yet, custom-trained models
  • Leptonica headers (libleptonica-dev)

From PyPI

This is the best option if you want to use the stable, released version.


NOTE

ocrd_tesserocr requires Tesseract >= 4.1.0. The Tesseract packages bundled with Ubuntu < 19.10 are too old. If you are on Ubuntu 18.04 LTS, please use Alexander Pozdnyakov's PPA repository, which has up-to-date builds of Tesseract and its dependencies:

sudo add-apt-repository ppa:alex-p/tesseract-ocr
sudo apt-get update

sudo apt-get install git python3 python3-pip libtesseract-dev libleptonica-dev tesseract-ocr-eng tesseract-ocr wget
pip install ocrd_tesserocr
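
The version requirement in the note can be checked before installing. The following is a minimal sketch, assuming you pass it the first line printed by tesseract --version; the helper names are hypothetical, and pre-release suffixes such as "-beta.4" are simply ignored.

```python
import re

MINIMUM = (4, 1, 0)  # minimum Tesseract version stated in the note above


def parse_version(text):
    """Extract (major, minor, patch) from a string such as 'tesseract 4.1.1'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no version number found in %r" % text)
    return tuple(int(part) for part in match.groups())


def is_supported(text):
    """True if the reported version is at least MINIMUM (suffixes are ignored)."""
    return parse_version(text) >= MINIMUM
```

For example, is_supported("tesseract 4.0.0-beta.4-26-gfd49") is false, so that build would be too old for the tesserocr wheel to compile.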

With docker

This is the best option if you want to run the software in a container.

You need to have Docker installed.

docker pull ocrd/tesserocr

To run with docker:

docker run -v path/to/workspaces:/data ocrd/tesserocr ocrd-tesserocr-crop ...

From git

This is the best option if you want to change the source code or install the latest, unpublished changes.

We strongly recommend using venv.

git clone https://github.com/OCR-D/ocrd_tesserocr
cd ocrd_tesserocr
sudo make deps-ubuntu # or manually with apt-get
make deps        # or pip install -r requirements.txt
make install     # or pip install .

Usage

For details, see docstrings in the individual processors and ocrd-tool.json descriptions, or simply --help.

Available OCR-D processors are:

  • ocrd-tesserocr-crop (simplistic)
    • sets Border of pages and adds AlternativeImage files to the output fileGrp
  • ocrd-tesserocr-deskew (for skew and orientation; mind operation_level)
    • sets @orientation of regions or pages and adds AlternativeImage files to the output fileGrp
  • ocrd-tesserocr-binarize (Otsu – not recommended)
    • adds AlternativeImage files to the output fileGrp
  • ocrd-tesserocr-recognize (optionally including segmentation; mind segmentation_level and textequiv_level)
    • adds TextRegions, TableRegions, ImageRegions, MathsRegions, SeparatorRegions, NoiseRegions and ReadingOrder to Page and sets their @orientation (optionally)
    • adds TextRegions to TableRegions and sets their @orientation (optionally)
    • adds TextLines to TextRegions (optionally)
    • adds Words to TextLines (optionally)
    • adds Glyphs to Words (optionally)
    • adds TextEquiv
  • ocrd-tesserocr-segment (all-in-one segmentation – recommended; delegates to recognize)
    • adds TextRegions, TableRegions, ImageRegions, MathsRegions, SeparatorRegions, NoiseRegions and ReadingOrder to Page and sets their @orientation
    • adds TextRegions to TableRegions and sets their @orientation
    • adds TextLines to TextRegions
    • adds Words to TextLines
    • adds Glyphs to Words
  • ocrd-tesserocr-segment-region (only regions – with overlapping bboxes; delegates to recognize)
    • adds TextRegions, TableRegions, ImageRegions, MathsRegions, SeparatorRegions, NoiseRegions and ReadingOrder to Page and sets their @orientation
  • ocrd-tesserocr-segment-table (only table cells; delegates to recognize)
    • adds TextRegions to TableRegions
  • ocrd-tesserocr-segment-line (only lines – from overlapping regions; delegates to recognize)
    • adds TextLines to TextRegions
  • ocrd-tesserocr-segment-word (only words; delegates to recognize)
    • adds Words to TextLines
  • ocrd-tesserocr-fontshape (only text style – via Tesseract 3 models)
    • adds TextStyle to Words

The text region @types detected are (from Tesseract's PolyBlockType):

  • paragraph: normal block (aligned with others in the column)
  • floating: unaligned block (is in a cross-column pull-out region)
  • heading: block that spans more than one column
  • caption: block for text that belongs to an image

If you are unhappy with these choices, consider post-processing with a dedicated custom processor in Python, or by modifying the PAGE files directly (e.g. xmlstarlet ed --inplace -u '//pc:TextRegion/@type[.="floating"]' -v paragraph filegrp/*.xml).
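
As an alternative to the xmlstarlet one-liner, the same relabelling can be sketched in Python with the standard library. This is only an illustration on a hypothetical, abbreviated PAGE string; the namespace version and the function name are assumptions, and a real custom processor would use the OCR-D core API.

```python
import xml.etree.ElementTree as ET

# Assumption: the 2019-07-15 PAGE namespace; adjust to the version in your files.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
ET.register_namespace("pc", PAGE_NS)  # keep the pc: prefix when serializing

# Hypothetical, abbreviated PAGE document for illustration.
SAMPLE = """<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <pc:Page imageFilename="page.tif">
    <pc:TextRegion id="r1" type="paragraph"/>
    <pc:TextRegion id="r2" type="floating"/>
    <pc:TextRegion id="r3" type="heading"/>
  </pc:Page>
</pc:PcGts>"""


def retype_regions(xml_text, old="floating", new="paragraph"):
    """Replace TextRegion/@type values equal to `old` with `new`."""
    root = ET.fromstring(xml_text)
    changed = 0
    for region in root.iter("{%s}TextRegion" % PAGE_NS):
        if region.get("type") == old:
            region.set("type", new)
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


updated, changed = retype_regions(SAMPLE)
```
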

All segmentation is currently done as bounding boxes only by default, i.e. without precise polygonal outlines. For dense page layouts this means that neighbouring regions and neighbouring text lines may overlap a lot. If this is a problem for your workflow, try post-processing like so:

  • after line segmentation: use ocrd-cis-ocropy-resegment for polygonalization, or ocrd-cis-ocropy-clip on the line level
  • after region segmentation: use ocrd-segment-repair with plausibilize (and sanitize after line segmentation)

It also means that Tesseract should be allowed to segment across multiple hierarchy levels at once, to avoid introducing inconsistent/duplicate text line assignments in text regions, or word assignments in text lines. Hence,

  • prefer ocrd-tesserocr-recognize with segmentation_level=region over ocrd-tesserocr-segment followed by ocrd-tesserocr-recognize, if you want to do all in one with Tesseract,
  • prefer ocrd-tesserocr-recognize with segmentation_level=line over ocrd-tesserocr-segment-line followed by ocrd-tesserocr-recognize, if you want to do everything but region segmentation with Tesseract,
  • prefer ocrd-tesserocr-segment over ocrd-tesserocr-segment-region followed by (ocrd-tesserocr-segment-table and) ocrd-tesserocr-segment-line, if you want to do everything but recognition with Tesseract.
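
The first recommendation can be sketched as a single command line assembled in Python. The fileGrp names and the model are placeholders, the helper name is hypothetical, and shlex.join requires Python 3.8 or later; the sketch only builds the command and does not run it.

```python
import shlex


def recognize_command(input_grp, output_grp, segmentation_level, model):
    """Build one ocrd-tesserocr-recognize call that segments and recognizes at once."""
    return [
        "ocrd-tesserocr-recognize",
        "-I", input_grp,
        "-O", output_grp,
        "-P", "segmentation_level", segmentation_level,
        "-P", "model", model,
    ]


# Placeholder fileGrp names and model; replace them with those of your workspace.
command = recognize_command("OCR-D-BIN", "OCR-D-OCR", "region", "deu")
command_line = shlex.join(command)
```

The resulting list could be passed to subprocess.run inside a workspace directory.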

However, you can also run ocrd-tesserocr-segment* and ocrd-tesserocr-recognize with shrink_polygons=True to get polygons by post-processing each segment, shrinking to the convex hull of all its symbol outlines.
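
To illustrate what shrinking to the convex hull means geometrically, here is a self-contained sketch using Andrew's monotone chain algorithm. It is not the package's implementation (which may rely on a geometry library); the function names are hypothetical and the input stands for the outline points of all symbols in a segment.

```python
def cross(o, a, b):
    """Z component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the convex hull of 2D points (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop points that would make a clockwise or collinear turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


def to_page_points(polygon):
    """Serialize a polygon in the Coords/@points notation."""
    return " ".join("%d,%d" % point for point in polygon)


# Corner points of a square plus one interior point that the hull discards.
hull = convex_hull([(0, 0), (4, 0), (4, 4), (0, 4), (2, 2)])
```
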

Testing

make test

This downloads some test data from https://github.com/OCR-D/assets under repo/assets, and runs some basic tests of the Python API as well as the CLIs.

Set PYTEST_ARGS="-s --verbose" to see log output (-s) and individual test results (--verbose).

Comments
  • Add fontshape processor and all-in-one segmentation

    We can probably remove both the old segment-region/line/word and new (all-in-one) segment altogether now that we can configure them via overwrite_* and textequiv_level in recognize. Or we keep the CLI names, but delegate to recognize @kba?

    opened by bertsky 58
  • Memory leaks

    The memory usage of ocrd-tesserocr-segment-region increases for each page, resulting in a total of about 7 GB for 200 pages, 8 GB for 248 pages, 10 GB for 282 pages, 11 GB for 313 pages (observed for http://nbn-resolving.de/urn:nbn:de:bsz:180-digad-22977).

    ocrd-tesserocr-segment-line shows a similar effect.

    For that book, a machine with 8 GB RAM would have started swapping, thus slowing down the process extremely. Even a large server would get memory problems when processing large books with more than 1000 pages in parallel.
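
The figures above allow a rough estimate of the growth per page. The sketch below assumes the growth is linear between the first and last observation, which is an assumption and not something the measurements prove.

```python
# (pages processed, memory in GB) as reported above
observations = [(200, 7), (248, 8), (282, 10), (313, 11)]


def growth_per_page(obs):
    """Average memory growth in GB per page between first and last observation."""
    (pages_first, mem_first), (pages_last, mem_last) = obs[0], obs[-1]
    return (mem_last - mem_first) / (pages_last - pages_first)


def extrapolate(obs, pages):
    """Linear extrapolation of memory use (GB) to a given page count."""
    pages_first, mem_first = obs[0]
    return mem_first + growth_per_page(obs) * (pages - pages_first)


rate = growth_per_page(observations)          # 4 GB over 113 pages
estimate_1000 = extrapolate(observations, 1000)
```

Under that assumption, a single 1000-page book would need several times the memory of an 8 GB machine.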

    opened by stweil 31
  • improve segmentation

    This fixes #101 (using raw_lines by default for textline images, but there are still some corner cases that need to be fixed in Tesseract) and brings a number of segmentation-related improvements:

    • interpret overwrite_regions more consistently
    • annotate @orientation (independent of dedicated deskewing processor) for vertical and @type for all other text blocks
    • no separators and noise regions in reading order
    • segment tables into cells and lines so they can be OCRed, too
    opened by bertsky 28
  • Processor tesserocr-segment-line terminates with exception (TopologyException: Input geom 1 is invalid)

    21:19:10.443 INFO processor.TesserocrSegmentLine - INPUT FILE 65 / phys396119
    21:19:10.577 INFO processor.TesserocrSegmentLine - Page 'phys396119' images will use DPI estimated from segmentation
    21:19:10.850 ERROR shapely.geos - TopologyException: Input geom 1 is invalid: Self-intersection at or near point 0 107 at 0 107
    Traceback (most recent call last):
      File "/home/stweil/src/github/OCR-D/venv-20200904/bin/ocrd-tesserocr-segment-line", line 8, in <module>
        sys.exit(ocrd_tesserocr_segment_line())
      File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/click/core.py", line 829, in __call__
        return self.main(*args, **kwargs)
      File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/click/core.py", line 782, in main
        rv = self.invoke(ctx)
      File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/click/core.py", line 610, in invoke
        return callback(*args, **kwargs)
      File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/ocrd_tesserocr/cli.py", line 26, in ocrd_tesserocr_segment_line
        return ocrd_cli_wrap_processor(TesserocrSegmentLine, *args, **kwargs)
      File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/ocrd/decorators.py", line 102, in ocrd_cli_wrap_processor
        run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
      File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 69, in run_processor
        processor.process()
      File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/ocrd_tesserocr/segment_line.py", line 119, in process
        interline = line_poly.intersection(region_poly)
      File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/shapely/geometry/base.py", line 676, in intersection
        return geom_factory(self.impl['intersection'](self, other))
      File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/shapely/topology.py", line 70, in __call__
        self._check_topology(err, this, other)
      File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/shapely/topology.py", line 38, in _check_topology
        self.fn.__name__, repr(geom)))
    shapely.errors.TopologicalError: The operation 'GEOSIntersection_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.polygon.Polygon object at 0x7f89253f7c88>
    
    opened by stweil 23
  • Integration with OCR-D/spec#169 (resource manager)

    This is just a proof-of-concept that it is possible to load tesseract models installed with ocrd resmgr into the cache directory.

    The tricky part here is that there is only one TESSDATA_PREFIX but potentially multiple directories with models. So while it is no problem to look up models in various folders, only one of them can be used as the TESSDATA_PREFIX. Suggestions for a reasonable resolution to this dilemma are welcome.
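
One conceivable resolution, sketched here purely for illustration, is to pick the directory per requested model: search the candidate directories in order and use the first one that contains the model file. The function name and the directory layout are hypothetical; this is not necessarily what the package ended up doing.

```python
import os
import tempfile


def find_tessdata_prefix(model, directories):
    """Return the first directory containing <model>.traineddata, or None."""
    for directory in directories:
        if os.path.isfile(os.path.join(directory, model + ".traineddata")):
            return directory
    return None


# Demo with two temporary directories standing in for a system path and a cache.
with tempfile.TemporaryDirectory() as system_dir, tempfile.TemporaryDirectory() as cache_dir:
    open(os.path.join(cache_dir, "deu.traineddata"), "wb").close()
    found_in_cache = find_tessdata_prefix("deu", [system_dir, cache_dir]) == cache_dir
    missing = find_tessdata_prefix("frk", [system_dir, cache_dir])
```

The limitation described above remains: models combined in one call would still have to live in the same directory.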

    opened by kba 22
  • support more textequiv levels

    This is an attempt to implement the other annotation levels. In my opinion, the behaviour for the different levels cannot be made completely analogous with Tesseract: simply pointing it to rectangles for words and glyphs (from an external layout segmentation) would produce results of far worse quality than always recognizing one complete line and allowing its own segmentation below it (accessible by iterators). In contrast, from the line level upwards we can reliably use its respective page segmentation mode (SINGLE_LINE / SINGLE_BLOCK / AUTO). Perhaps warnings and exceptions should be dealt with in a different, more systematic way though.
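
The strategy described above can be summarized as a small lookup. The mode names follow the members of tesserocr.PSM but are kept as plain strings so the sketch runs without tesserocr; the assignment of levels to modes is my reading of the description, and the function name is hypothetical.

```python
# Page segmentation mode to use when recognizing at a given level
# (names as in tesserocr.PSM, kept as strings here).
PSM_FOR_LEVEL = {
    "line": "SINGLE_LINE",
    "region": "SINGLE_BLOCK",
    "page": "AUTO",
}

# Below the line level, recognize the complete line and read the
# words/glyphs from Tesseract's own result iterators instead.
LEVELS_BELOW_LINE = ("word", "glyph")


def recognition_plan(textequiv_level):
    """Return (level to feed into Tesseract, page segmentation mode)."""
    if textequiv_level in PSM_FOR_LEVEL:
        return textequiv_level, PSM_FOR_LEVEL[textequiv_level]
    if textequiv_level in LEVELS_BELOW_LINE:
        return "line", PSM_FOR_LEVEL["line"]
    raise ValueError("unknown level: %r" % textequiv_level)
```
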

    opened by bertsky 18
  • move to AlternativeImage feature selectors in OCR-D/core#294:

    • all: use second output position as fileGrp USE to produce AlternativeImage
    • all: rid of MetadataItem/Labels-related FIXME: with the updated PAGE model, we can now use @externalModel and @externalId
    • all: use OcrdExif.resolution instead of xResolution
    • all: create images with monotonically growing @comments (features)
    • crop: use ocrd_utils.crop_image instead of PIL.Image.crop
    • crop: fix bug when trying to access page_image if there are already region coordinates that we are ignoring
    • crop: filter images already deskewed and cropped! (we must crop ourselves, and deskewing can not happen until afterwards)
    • deskew: fix bugs in configuration-dependent corner cases related to whether deskewing has already been applied (on the page or region level):
      • for the page image, never use images already rotated (both for page level and region level processing, but for the region level, do rotate images ad hoc if @orientation is present on the page level)
      • for the region image, never use images already rotated (except for our own page-level rotation)
    • segment-region: forgot to add feature "cropped" when producing cropped images
    opened by bertsky 16
  • pip install ocrd_tesserocr fails with tesseract version 4.0.0-beta-26-gfd49

    I use pip install ocrd_tesserocr to install ocrd_tesserocr into my virtualenv environment. The installation fails with:

    ...
      Running setup.py bdist_wheel for tesserocr ... error
      Complete output from command /run/media/flo/a57ed1c0-7fc5-41b1-a6e5-0d43b3ae6a40/data/devel/work/cis-ocrd-py/env/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-k_dgo547/tesserocr/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/pip-wheel-q7mozwr8 --python-tag cp37:
      Supporting tesseract v4.0.0
      Configs from pkg-config: {'include_dirs': ['/usr/include'], 'libraries': ['tesseract', 'lept'], 'cython_compile_time_env': {'TESSERACT_VERSION': 67108864}}
      running bdist_wheel
      running build
      running build_ext
      building 'tesserocr' extension
      creating build
      creating build/temp.linux-x86_64-3.7
      gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -march=x86-64 -mtune=generic -O3 -pipe -fstack-protector-strong -fno-plt -march=x86-64 -mtune=generic -O3 -pipe -fstack-protector-strong -fno-plt -march=x86-64 -mtune=generic -O3 -pipe -fstack-protector-strong -fno-plt -fPIC -I/usr/include -I/usr/include/python3.7m -c tesserocr.cpp -o build/temp.linux-x86_64-3.7/tesserocr.o -std=c++11 -DUSE_STD_NAMESPACE
      tesserocr.cpp: In function 'PyObject* __pyx_pf_9tesserocr_16PyResultIterator_8GetBestLSTMSymbolChoices(__pyx_obj_9tesserocr_PyResultIterator*)':
      tesserocr.cpp:12196:43: error: 'class tesseract::ResultIterator' has no member named 'GetBestLSTMSymbolChoices'
         __pyx_v_output = (__pyx_v_self->_riter->GetBestLSTMSymbolChoices()[0]);
                                                 ^~~~~~~~~~~~~~~~~~~~~~~~
      error: command 'gcc' failed with exit status 1
    
      ----------------------------------------
      Failed building wheel for tesserocr
      Running setup.py clean for tesserocr
    Failed to build tesserocr
    Installing collected packages: tesserocr, ocrd-tesserocr
      Running setup.py install for tesserocr ... error
        Complete output from command /run/media/flo/a57ed1c0-7fc5-41b1-a6e5-0d43b3ae6a40/data/devel/work/cis-ocrd-py/env/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-k_dgo547/tesserocr/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-fc87h61b/install-record.txt --single-version-externally-managed --compile --install-headers /run/media/flo/a57ed1c0-7fc5-41b1-a6e5-0d43b3ae6a40/data/devel/work/cis-ocrd-py/env/include/site/python3.7/tesserocr:
        Supporting tesseract v4.0.0
        Configs from pkg-config: {'include_dirs': ['/usr/include'], 'libraries': ['lept', 'tesseract'], 'cython_compile_time_env': {'TESSERACT_VERSION': 67108864}}
        running install
        running build
        running build_ext
        building 'tesserocr' extension
        creating build
        creating build/temp.linux-x86_64-3.7
        gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -march=x86-64 -mtune=generic -O3 -pipe -fstack-protector-strong -fno-plt -march=x86-64 -mtune=generic -O3 -pipe -fstack-protector-strong -fno-plt -march=x86-64 -mtune=generic -O3 -pipe -fstack-protector-strong -fno-plt -fPIC -I/usr/include -I/usr/include/python3.7m -c tesserocr.cpp -o build/temp.linux-x86_64-3.7/tesserocr.o -std=c++11 -DUSE_STD_NAMESPACE
        tesserocr.cpp: In function 'PyObject* __pyx_pf_9tesserocr_16PyResultIterator_8GetBestLSTMSymbolChoices(__pyx_obj_9tesserocr_PyResultIterator*)':
        tesserocr.cpp:12196:43: error: 'class tesseract::ResultIterator' has no member named 'GetBestLSTMSymbolChoices'
           __pyx_v_output = (__pyx_v_self->_riter->GetBestLSTMSymbolChoices()[0]);
                                                   ^~~~~~~~~~~~~~~~~~~~~~~~
        error: command 'gcc' failed with exit status 1
    ...
    

    tesseract is installed on the system:

    tesseract 4.0.0-beta.4-26-gfd49
     leptonica-1.77.0
      libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.1) : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.1
     Found AVX
     Found SSE
    
    opened by finkf 16
  • Superfluous newlines

    At the moment, superfluous newlines are appended to the TextEquiv/Unicode entries:

                        <pc:TextEquiv>
                            <pc:Unicode>Groſzmaͤchtigſter</pc:Unicode>
                        </pc:TextEquiv>
                        <pc:TextEquiv>
                            <pc:Unicode>stzmächtigstcr
    </pc:Unicode>
    
    opened by finkf 16
  • Make it clearer which Tesseract engine is being used

    Since Tesseract 4, two OCR engines are available: rule-based (i.e. --oem 0) and LSTM (--oem 1). The command line also exposes an ensemble of the two OCR engines (--oem 2). The documentation for ocrd-tesserocr-recognize does not make it clear which engine is used, and using any of the following parameters seems to have no effect on the recognition results:

    • -P tesseract_parameters '{ "tessedit_ocr_engine_mode" : "0" }'
    • -P tesseract_parameters '{ "tessedit_ocr_engine_mode" : "1" }'
    • -P tesseract_parameters '{ "tessedit_ocr_engine_mode" : "2" }'

    Which one of the OCR engines are we currently using?

    opened by Witiko 12
  • ocrd-tesserocr-segment: segmentation fault

    And with this image:

    https://digi.ub.uni-heidelberg.de/diglitData/v/justinian1627bd2_-_1281.tif

    and ocrd.sif (singularity container) created from docker ocrd_all at Nov 9 10:13 2021 & at Jan 17 15:11 2022 [UPDATE]

    and this workflow:

    /usr/bin/time singularity exec $HOME/ocrd.sif ocrd workspace init >>ocrd.log 2>&1 || exit
    /usr/bin/time singularity exec $HOME/ocrd.sif ocrd workspace add -g P_00001 -G OCR-D-IMG -i OCR-D-IMG_00001 -m image/tiff OCR-D-IMG/00001.tif >>ocrd.log 2>&1 || exit
    
    /usr/bin/time singularity exec $HOME/ocrd.sif ocrd-olena-binarize -P k 0.10 -I OCR-D-IMG -O OCR-D-001 >>ocrd.log 2>&1 || exit
    /usr/bin/time singularity exec $HOME/ocrd.sif ocrd-anybaseocr-crop -I OCR-D-001 -O OCR-D-002 >>ocrd.log 2>&1 || exit
    /usr/bin/time singularity exec $HOME/ocrd.sif ocrd-olena-binarize -I OCR-D-002 -O OCR-D-003 >>ocrd.log 2>&1 || exit
    /usr/bin/time singularity exec $HOME/ocrd.sif ocrd-cis-ocropy-deskew -P level-of-operation page -I OCR-D-003 -O OCR-D-004 >>ocrd.log 2>&1 || exit
    /usr/bin/time singularity exec $HOME/ocrd.sif ocrd-tesserocr-segment -P find_tables false -P shrink_polygons true -I OCR-D-004 -O OCR-D-005 >>ocrd.log 2>&1 || exit
    /usr/bin/time singularity exec $HOME/ocrd.sif ocrd-calamari-recognize -I OCR-D-005 -O OCR-D-OCR -P checkpoint "$HOME/ocrd_models/calamari/calamari_models_experimental/historical_french_2020-10-14/*.ckpt.json" >>ocrd.log 2>&1 || exit
    

    I get a segmentation fault:

    Core was generated by `/usr/bin/python3 /usr/bin/ocrd-tesserocr-segment -P find_tables false -P shrink'.
    Program terminated with signal 11, Segmentation fault.
    
    opened by jbarth-ubhd 11
  • reverse order of glyphs inside words in PAGE-File for RTL languages

    When using, for example, an Arabic model, recognition works fine, but the words in the generated PAGE-XML contain their letters in reversed order. The sequence of the words themselves is correct. Here is an example of a generated word with the wrong letter sequence:

                   <pc:Word id="region0001_line0001_word0000">
                        <pc:Coords points="1620,372 1620,402 1703,402 1703,375 1647,376"/>
                        <pc:TextEquiv conf="0.877831573486328">
                            <pc:Unicode>رصم</pc:Unicode>
                        </pc:TextEquiv>
                    </pc:Word>
    

    but the line containing the recognized word should look like this:

                            <pc:Unicode>مصر</pc:Unicode>
    

    (I know it is not easy to see that it is reversed, because letters in Arabic change appearance depending on their position inside the word, but this is handled by the font.)

    Here is the equivalent portion of the image: the word Msr

    REMARK: when using tesseract as standalone and generating alto, the sequence is correct!
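
The symptom can be reproduced in isolation. The sketch below assumes the glyphs were collected in visual left-to-right order and concatenated as they came, which is only a guess at the cause; the function name is hypothetical, and the characters are written as escapes to avoid display issues.

```python
def glyphs_in_logical_order(glyph_texts, right_to_left):
    """Reorder glyphs collected in visual left-to-right order into reading order."""
    return list(reversed(glyph_texts)) if right_to_left else list(glyph_texts)


# Glyphs of the reported word in the (wrong) order found in the PAGE file:
# REH (U+0631), SAD (U+0635), MEEM (U+0645)
visual = ["\u0631", "\u0635", "\u0645"]

# Joining them as-is reproduces the reversed word from the report ...
reported = "".join(visual)
# ... while reversing for a right-to-left script gives the expected word.
expected = "".join(glyphs_in_logical_order(visual, right_to_left=True))
```
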

    opened by MihoMahi 3
  • montfaucon1719bd2_1, page 210, ocrd-tesserocr-segment -P find_tables false -P shrink_polygons true

    this image

    https://digi.ub.uni-heidelberg.de/diglitData/v/montfaucon1719bd2_1.210.tif

    UPDATE same for https://digi.ub.uni-heidelberg.de/diglitData/v/montfaucon1719bd2_1.168a_Planche_72.tif

    with this workflow (latest ocrd_all as of 2021-12-01)

    ocrd workspace init 
    ocrd workspace add -g P_00001 -G OCR-D-IMG -i OCR-D-IMG_00001 -m image/tiff OCR-D-IMG/00001.tif 
    
    ocrd-olena-binarize -P k 0.10 -I OCR-D-IMG -O OCR-D-001 
    ocrd-anybaseocr-crop -I OCR-D-001 -O OCR-D-002 
    ocrd-olena-binarize -I OCR-D-002 -O OCR-D-003 
    ocrd-cis-ocropy-deskew -P level-of-operation page -I OCR-D-003 -O OCR-D-004 
    ocrd-tesserocr-segment -P find_tables false -P shrink_polygons true -I OCR-D-004 -O OCR-D-005 
    ocrd-calamari-recognize -I OCR-D-005 -O OCR-D-OCR -P checkpoint "$HOME/ocrd/_models/ocrd-calamari-recognize/c1_latin-script-hist-3/*.ckpt.json" 
    

    leads to these error messages:

    10:06:58.121 INFO processor.TesserocrSegment - INPUT FILE 0 / P_00001
    10:06:59.193 INFO processor.TesserocrSegment - Page 'P_00001' images will use 333 DPI from image meta-data
    10:06:59.193 INFO processor.TesserocrSegment - Processing page 'P_00001'
    10:07:00.229 INFO ocrd.workspace.save_image_file - created file ID: OCR-D-005_00001.IMG-BIN, file_grp: OCR-D-005, path: OCR-D-005/OCR-D-005_00001.IMG-BIN.png
    /build/ocrd_tesserocr/ocrd_tesserocr/recognize.py:510: ShapelyDeprecationWarning: The proxy geometries (through the 'asShape()', 'asPolygon()' or 'PolygonAdapter()' constructors) are deprecated and will be removed in Shapely 2.0. Use the 'shape()' function or the standard 'Polygon()' constructor instead.
      for symbol in iterate_level(it, RIL.SYMBOL, parent=RIL.BLOCK)])
    Exception ignored in: <bound method BaseGeometry.__del__ of <shapely.geometry.polygon.PolygonAdapter object at 0x7fc431060358>>
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/shapely/geometry/base.py", line 209, in __del__
        self._empty(val=None)
      File "/usr/local/lib/python3.6/dist-packages/shapely/geometry/base.py", line 199, in _empty
        self._is_empty = True
      File "/usr/local/lib/python3.6/dist-packages/shapely/geometry/proxy.py", line 44, in __setattr__
        object.__setattr__(self, name, value)
    AttributeError: can't set attribute
    10:07:00.930 INFO processor.TesserocrSegment - Detected region 'region0000': 2867,801 2418,798 
    1883,799 1527,803 1527,803 1184,824 1184,824 1183,824 1183,824 1183,824 1183,824 1183,824 1183,825 
    1181,827 1180,827 1180,827 1180,827 1180,827 1180,827 1180,828 1180,828 1180,828 1180,838 1172,2362 
    1171,3063 1175,3451 1175,3451 1175,3451 1175,3452 1175,3452 1175,3452 1175,3452 1175,3452 1176,3452 
    1176,3453 1176,3453 1176,3453 1176,3453 1176,3453 1177,3453 1260,3474 1260,3474 1260,3474 1304,3474 
    1945,3458 1945,3458 3324,3389 3324,3389 3325,3389 3348,3382 3348,3382 3348,3382 3348,3382 3348,3382 
    3348,3381 3349,3381 3349,3381 3349,3381 3349,3381 3349,3381 3349,3380 3349,3380 3349,3380 3387,1134 
    3388,1069 3388,1069 3377,954 3377,954 3377,953 3377,953 3377,953 3377,953 3354,913 3354,913 
    3353,913 3353,912 3353,912 3353,912 3353,912 3130,804 3130,804 3129,804 3129,804 3129,804 
    (FLOWING_TEXT)
    ...
    ...
    ...
    Exception ignored in: <bound method BaseGeometry.__del__ of <shapely.geometry.polygon.PolygonAdapter object at 0x7fc40f820710>>
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/shapely/geometry/base.py", line 209, in __del__
        self._empty(val=None)
      File "/usr/local/lib/python3.6/dist-packages/shapely/geometry/base.py", line 199, in _empty
        self._is_empty = True
      File "/usr/local/lib/python3.6/dist-packages/shapely/geometry/proxy.py", line 44, in __setattr__
        object.__setattr__(self, name, value)
    AttributeError: can't set attribute
    10:07:16.823 INFO processor.TesserocrSegment - Detected line 'region0005_line0010': 2366,4729 
    2366,4729 2366,4729 2291,4740 2290,4740 2290,4740 2290,4740 2290,4740 2290,4740 2290,4741 2289,4741 
    2289,4741 2289,4741 2289,4741 2289,4741 2289,4742 2289,4742 2289,4742 2289,4780 2289,4780 2289,4780 
    2289,4781 2289,4781 2289,4781 2289,4781 2289,4781 2290,4781 2290,4782 2290,4782 2290,4782 2290,4782 
    2290,4782 2291,4782 2291,4782 2291,4782 2650,4795 2895,4801 2905,4801 2905,4801 3188,4781 3188,4781 
    3189,4781 3189,4781 3189,4781 3189,4781 3189,4781 3189,4780 3190,4780 3190,4780 3190,4780 3190,4780 
    3190,4780 3190,4779 3190,4779 3190,4779 3190,4768 3190,4768 3190,4768 3190,4767 3190,4767 3190,4767 
    3190,4767 3190,4767 3189,4767 3189,4766 3189,4766 3189,4766 3189,4766 3189,4766 3188,4766 3188,4766 
    2705,4736 2705,4736 2638,4732
    Traceback (most recent call last):
      File "/usr/local/sub-venv/headless-tf2/bin/ocrd-calamari-recognize", line 33, in <module>
        sys.exit(load_entry_point('ocrd-calamari', 'console_scripts', 'ocrd-calamari-recognize')())
      File "/usr/local/sub-venv/headless-tf2/lib/python3.6/site-packages/click/core.py", line 1128, in __call__
        return self.main(*args, **kwargs)
      File "/usr/local/sub-venv/headless-tf2/lib/python3.6/site-packages/click/core.py", line 1053, in main
        rv = self.invoke(ctx)
      File "/usr/local/sub-venv/headless-tf2/lib/python3.6/site-packages/click/core.py", line 1395, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/usr/local/sub-venv/headless-tf2/lib/python3.6/site-packages/click/core.py", line 754, in invoke
        return __callback(*args, **kwargs)
      File "/build/ocrd_calamari/ocrd_calamari/cli.py", line 13, in ocrd_calamari_recognize
        return ocrd_cli_wrap_processor(CalamariRecognize, *args, **kwargs)
      File "/build/core/ocrd/ocrd/decorators/__init__.py", line 90, in ocrd_cli_wrap_processor
        raise Exception("Invalid input/output file grps:\n\t%s" % '\n\t'.join(report.errors))
    Exception: Invalid input/output file grps:
            Input fileGrp[@USE='OCR-D-005'] not in METS!
    opened by jbarth-ubhd 0
  • ocrd_tesserocr processors waste CPU performance because of numpy blas threads

    The current code imports numpy although it only uses a single function from that library. Including numpy creates a number of threads for the BLAS algorithms by default. Those threads use a lot of CPU time without doing anything useful.

    Setting the environment variable OMP_THREAD_LIMIT=1 avoids those additional threads.

    Maybe there exists a better solution which does not require an environment variable, for example removing the numpy requirement.
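
Until such a solution exists, the workaround can be applied per call from a wrapper script. This is a minimal sketch, assuming the processor is started as a subprocess; the helper name is hypothetical, and the variable must be in the environment before the processor imports numpy.

```python
import os


def limited_env(base=None):
    """Return a copy of the environment with OMP_THREAD_LIMIT set to 1."""
    env = dict(os.environ if base is None else base)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


# The copy can be passed on without touching the current process, e.g.
# subprocess.run(["ocrd-tesserocr-recognize", ...], env=limited_env())
env = limited_env({"PATH": "/usr/bin"})
```
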

    opened by stweil 6
  • Problem with table recognition

    With tables where there are no horizontal lines, the workflow results in a wrong reading order by only recognizing the columns and no rows.
    See the following image as an example: catalog46muse_0564

    The result is as follows: OCR-D-TXT_catalog46muse_0564.txt

    This is the workflow used:

    ocrd-olena-binarize -I OCR-D-OPT -O OCR-D-BIN -p '{"impl": "sauvola-ms-split"}'
    ocrd-cis-ocropy-denoise -I OCR-D-BIN -O OCR-D-DENOISE -p '{"level-of-operation":"page"}'
    ocrd-cis-ocropy-deskew -I OCR-D-DENOISE -O OCR-D-DESKEW-PAGE -p '{"level-of-operation":"page"}'
    ocrd-tesserocr-segment-region -I OCR-D-DESKEW-PAGE -O OCR-D-SEG-REG
    ocrd-segment-repair -I OCR-D-SEG-REG -O OCR-D-SEG-REPAIR -p '{"plausibilize":true}'
    ocrd-cis-ocropy-binarize -I OCR-D-SEG-REPAIR -O OCR-D-BIN2 -p '{"level-of-operation":"region"}'
    ocrd-tesserocr-deskew -I OCR-D-BIN2 -O OCR-D-DESKEW-TEXT
    ocrd-tesserocr-segment-line -I OCR-D-DESKEW-TEXT -O OCR-D-SEG-LINE
    ocrd-cis-ocropy-resegment -I OCR-D-SEG-LINE -O OCR-D-RESEG
    ocrd-cis-ocropy-dewarp -I OCR-D-RESEG -O OCR-D-DEWARP-LINE
    ocrd-tesserocr-recognize -I OCR-D-DEWARP-LINE -O OCR-D-OCR -p '{"model": "deu"}'
    
    opened by Shanksum 9
Releases(v0.16.0)
  • v0.16.0(Oct 25, 2022)

  • v0.15.0(Oct 23, 2022)

    Added:

    • binarize: dpi numerical parameter to specify pixel density, #186
    • binarize: tiseg boolean parameter to specify whether to call tessapi.AnalyseLayout for text-image separation, #186

    Changed:

    • recognize: improved polygon handling, #186
    • resources: proper support for moduledir, companion to OCR-D/core#904, #187
  • v0.14.0(Aug 14, 2022)

  • v0.13.6(Sep 28, 2021)

    Fixed:

    • segment/recognize: no find_tables when already looking for cells

    Changed:

    • segment/recognize: add param find_staves (for pageseg_apply_music_mask)
    • segment/recognize: :fire: set find_staves=false by default
  • v0.13.5(Sep 28, 2021)

  • v0.13.4(Jul 20, 2021)

    Fixed:

    • recognize: only reset API when xpath_model or auto_model is active
    • recognize: for glyph level output, reduce choice confidence threshold
    • recognize: for glyph level output, skip choices with same text
    • recognize: avoid projecting empty text results from lower levels

    Changed:

    • recognize: allow setting init-time (model-related) parameters
  • v0.13.3(Jul 20, 2021)

  • v0.13.2(Jul 20, 2021)

  • v0.13.1(Jul 20, 2021)

    Fixed:

    • deps-ubuntu/Docker: adapt to resmgr location mechanism, link to PPA models
    • recognize: :bug: skip detected segments if polygon cannot be made valid

    Changed:

    • deskew: add line-level operation for script detection
    • recognize: query more choices for textequiv_level=glyph if available
    • recognize: :fire: reset Tesseract API when applying model/param settings per segment
    • recognize: :eyes: allow configuring Tesseract parameters per segment via XPath queries
    • recognize: :eyes: allow selecting recognition model per segment via XPath queries
    • recognize: :eyes: allow selecting recognition model automatically via confidence
  • v0.13.0(Jun 30, 2021)

  • v0.12.0(Mar 5, 2021)

    Changed:

    • resource lookup in a function to avoid module-level instantiation, #172
    • skip recognition of elements if they have pc:TextEquiv and overwrite_text is false-y, #170

    Added:

    • New parameter oem to explicitly set the engine backend to use, #168, #170
  • v0.11.0(Jan 29, 2021)

  • v0.10.1(Dec 10, 2020)

    Fixed:

    • segment*/recognize: reduce minimal region height to sane value
    • segment*/recognize: also disable text recognition if model is empty
    • segment-{region,line,word}: apply only single-level segmentation again
    • segment*/recognize: skip empty non-text blocks and all-reject words

    Changed:

    • segment*/recognize: add option shrink_polygons, default to false
    • segment*/recognize: add Tesseract version to meta-data
    • recognize: add option tesseract_parameters to expose all variables
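Parameters such as those named in these release notes are passed to a processor as a JSON string via `-p`. The sketch below builds such a string in Python. The helper `parameter_json` is hypothetical, and the chosen values, as well as the assumption that `tesseract_parameters` takes a mapping of variable names to string values, should be checked against the processor's own `--help` output.

```python
import json


def parameter_json(**params):
    """Serialize processor parameters into a JSON string for the -p option."""
    return json.dumps(params, sort_keys=True)


# Illustrative values only: parameter names come from the release notes,
# the values are assumptions.
params = parameter_json(
    model="deu",
    textequiv_level="glyph",
    tesseract_parameters={"lstm_choice_mode": "2"},
)
print(params)
```

Generating the string with `json.dumps` avoids quoting mistakes that are easy to make when writing nested JSON by hand on the command line.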
  • v0.10.0(Dec 1, 2020)

    Fixed:

    • when padding images, add the offset to coords of new segments
    • when segmenting regions, skip empty output coords more robustly
    • deskew/segment/recognize: skip empty input images more robustly
    • crop: fix pageId of new derived image
    • recognize: fix missing RIL for terminal GetUTF8Text()
    • recognize: fix Confidence() vs MeanTextConf()

    Changed:

    • recognize: add all-in-one segmentation with flexible entry point
    • recognize: re-parameterize to segmentation_level+textequiv_level
    • recognize: :fire: rename overwrite_words to overwrite_segments
    • segment*: delegate to recognize
    • recognize: also annotate orientation and skew when segmenting regions
    • fontshape: new processor for TextStyle detection via pre-LSTM models
    • crop: also use existing text regions, if any
    • deskew: delegate to core for reflection and rotation
    • deskew: always get new image and set feature deskewed (even for 0°)
  • v0.9.5(Oct 1, 2020)

  • v0.9.4(Sep 24, 2020)

  • v0.9.3(Sep 15, 2020)

  • v0.9.2(Sep 4, 2020)

  • v0.9.1(Aug 16, 2020)

  • v0.9.0(Aug 6, 2020)

  • v0.8.5(Jun 5, 2020)

  • v0.8.4(Jun 5, 2020)

    Changed:

    • segment-region: in sparse_text mode, also add text lines

    Fixed:

    • Always set path to TESSDATA_PREFIX for tesserocr.get_languages, #129

  • v0.8.3(May 12, 2020)

  • v0.8.2(Apr 8, 2020)

    Fixed:

    • segment-region: no empty (invalid) ReadingOrder when no regions
    • segment-region: add sparse_text mode choice
    • segment-line: make intersection with parent more robust
    • segment-table: use SPARSE_TEXT mode for cells

    Changed:

    • Depend on OCR-D/core v2.4.4
    • Depend on sirfz/tesserocr v2.5.1
  • v0.8.1(Feb 17, 2020)

  • v0.8.0(Feb 17, 2020)

    Changed:

    • recognize: use lstm_choice_mode=2 for textequiv_level=glyph, #110
    • recognize: add char white/un/blacklisting parameters, #109

    Added:

    • all: add dpi parameter as manual override to image metadata, #108
  • v0.7.0(Feb 17, 2020)

    Added:

    • segment-table: new processor that adds table cells as text regions, #104
    • raw_lines option, #104
    • interpret overwrite_regions more consistently, #104
    • annotate @orientation (independent of dedicated deskewing processor) for vertical and @type for all other text blocks, #104
    • no separators and noise regions in reading order, #104

    Changed:

    • docker image built on Ubuntu 18.04, #94, #97
    • Consistent setup of docker, #97
  • v0.6.0(Nov 5, 2019)

  • v0.5.1(Nov 5, 2019)

  • v0.4.1(Oct 31, 2019)

    • Adapt to feature selection/filtering mechanism for derived images in core
    • Fixes for image-feature-related corner cases in crop and deskew
    • Use explicit (second) output fileGrp when producing derived images
    • Upgrade to upstream tesserocr 2.4.1
    • Use OCR core >= stable 1.0.0
Owner
OCR-D
DFG coordination project for the further development of Optical Character Recognition methods