Python library to extract tabular data from images and scanned PDFs

Overview

ExtractTable - API to extract tabular data from images and scanned PDFs

The motivation is to make it easy for developers to extract tabular data from images or scanned PDF files without worrying about the table area, column coordinates, rotation, and so on.

Prerequisite

API Key: All requests to ExtractTable are authorized by an API Key. FREE credits here. The same API Key can also be used for conversions on the browser at Web Pro.

Installation

pip install -U ExtractTable

Basic Usage

OK, enough selling. Let the ease of coding do the talking and let the output convince you to buy credits; start a timer and count the lines of code.

from ExtractTable import ExtractTable

# Replace the placeholders below with your VALID API Key and your input file paths
YOUR_API_KEY = "YOUR_API_KEY"
Location_of_Image_with_Tables = "path/to/image_with_tables.png"
Location_of_PDF_with_Tables = "path/to/pdf_with_tables.pdf"

et_sess = ExtractTable(api_key=YOUR_API_KEY)
print(et_sess.check_usage())        # Checks the API Key validity and shows the associated plan usage
table_data = et_sess.process_file(filepath=Location_of_Image_with_Tables, output_format="df")

# To process a PDF, pass the pages ("1", "1,3-4", "all") param to the process_file function
table_data = et_sess.process_file(filepath=Location_of_PDF_with_Tables, output_format="df", pages="all")
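
The pages parameter accepts strings such as "1", "1,3-4", or "all". As a rough illustration of what such a spec selects, here is a small sketch that assumes the usual comma-and-range convention. The helper expand_pages is hypothetical and not part of ExtractTable; the service does its own parsing.

```python
from typing import List


def expand_pages(spec: str, total_pages: int) -> List[int]:
    """Expand a page spec such as "1", "1,3-4" or "all" into sorted page numbers.

    Hypothetical helper for illustration only; not part of ExtractTable.
    """
    spec = spec.strip().lower()
    if spec == "all":
        return list(range(1, total_pages + 1))

    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # A range such as "3-4" includes both ends
            start, end = (int(p) for p in part.split("-", 1))
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    # Keep only pages that exist in the document
    return sorted(p for p in pages if 1 <= p <= total_pages)


print(expand_pages("1,3-4", total_pages=5))
```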

Detailed Library Usage

The tutorial available at Open In Colab walks you through:

1. Installation
2. Import and check version
3. Create Session & Validate API Key
    3.1 Create Session with your API Key
    3.2 Validate the Key and check the plan usage
    3.3 Check Usage Details
4. Trigger the extraction process
    4.1 Accepted Input Types
    4.2 Process an IMAGE Input
    4.3 Process a PDF Input
    4.4 Output options
    4.5 Explore session objects
5. Explore the Output
    5.1 Output Structure
    5.2 Output Details
6. Make Corrections
    6.1 Split Merged Rows
    6.2 Split Merged Columns
    6.3 Fix Decimal Format
    6.4 Fix Date Format
7. Helpful Code Snippets
    7.1 Get text data
    7.2 Table output to Excel

Whoa, as simple as that?!

Certainly. Current ExtractTable users use it to extract tables from:

  • Bank Statement
  • Medical Records
  • Invoice Details
  • Tax forms
  • Tender Notices

It's up to you now to explore other uses.

Explore

Check the complete server response of the latest job with et_sess.ServerResponse.json()

{
    "JobStatus": <string>,                              # Status of the triggered Process  @ JOB-LEVEL
    "Pages": <integer>,                                 # Number of pages processed in this request @ PAGE-LEVEL
    "Tables": [<list of key-value objects of table>     # List of all tables found @ TABLE-LEVEL
        {
            "Page": <integer>,                              ## Page number in which this table is found
            "CharacterConfidence": <float>,                 ## Accuracy of Characters recognized from the input-page
            "LayoutConfidence": <float>,                    ## Accuracy of table layout's design decision
            "TableJson": <dict>,                            ## Table Cell Text in key-value format with index orientation - {row#: {col#: <str>}}
            "TableCoordinates": <dict>,                     ## Top-left & Bottom-right Cell Coordinates - {row#: {col#: <list(x1,y1,x2,y2)>}}
            "TableConfidence": <dict>                       ## Cell level accuracy of detected characters - {row#: {col#: <float>}}
        },
    {...}                                               ## ... more "Tables" objects
    ],
    "Lines": [<list of key-value objects>               # Pagewise Line details @ PAGE-LEVEL
        {
            "Page": <integer>,                          # Page number in which the lines are found
            "CharacterConfidence": <float>,             # Average Accuracy of all Characters recognized from the input-page
            "LinesArray": [
                <list of key-value objects of line>     # Ordered list of lines in this page @ LINE-LEVEL
                {
                    "Line": <str>,                          ## Detected text of the complete line
                    "WordsArray": [
                        <list of key-value objects>         ## Word level details in this line @ WORD-LEVEL
                        {
                            "Conf": <float>,                    ### Accuracy of recognized characters of the word
                            "Word": <str>,                      ### Detected text of the word
                            "Loc": [x1, y1, x2, y2]             ### Top-left & Bottom-right coordinates, w.r.t the input-page width-height dimensions
                        },
                    {...}                                   ### More "WordsArray" objects
                    ]
                },
            {...}                                       ## More "LinesArray" objects
            ]
        },
    {...}                                               # More Pagewise "Lines" details
    ]
}
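
To make the structure above concrete, the following sketch walks a hand-made sample response. It assumes the row and column keys of TableJson are numeric strings and that confidences are on a 0 to 1 scale; the sample values, the "Success" status, and the helpers table_rows and low_confidence_words are illustrative assumptions, not part of the library.

```python
# Hand-made sample mimicking the documented structure; values are illustrative only
sample_response = {
    "JobStatus": "Success",
    "Pages": 1,
    "Tables": [
        {
            "Page": 1,
            "CharacterConfidence": 0.98,
            "LayoutConfidence": 0.95,
            "TableJson": {
                "0": {"0": "Date", "1": "Amount"},
                "1": {"0": "2020-01-01", "1": "31.20"},
            },
        }
    ],
    "Lines": [
        {
            "Page": 1,
            "CharacterConfidence": 0.97,
            "LinesArray": [
                {
                    "Line": "Date Amount",
                    "WordsArray": [
                        {"Conf": 0.99, "Word": "Date", "Loc": [10, 10, 50, 25]},
                        {"Conf": 0.62, "Word": "Amount", "Loc": [60, 10, 120, 25]},
                    ],
                }
            ],
        }
    ],
}


def table_rows(table_json):
    """Turn {row#: {col#: text}} into a list of rows, ordered by numeric index."""
    rows = []
    for row_key in sorted(table_json, key=int):
        row = table_json[row_key]
        rows.append([row[col_key] for col_key in sorted(row, key=int)])
    return rows


def low_confidence_words(response, threshold=0.8):
    """Collect (page, word, confidence) for words recognized below the threshold."""
    flagged = []
    for page in response["Lines"]:
        for line in page["LinesArray"]:
            for word in line["WordsArray"]:
                if word["Conf"] < threshold:
                    flagged.append((page["Page"], word["Word"], word["Conf"]))
    return flagged


for table in sample_response["Tables"]:
    print(table["Page"], table_rows(table["TableJson"]))
print(low_confidence_words(sample_response))
```

With a real session, the same helpers could be applied to the dictionary returned by et_sess.ServerResponse.json().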

Bug Reports

Bug reports and fixes are most welcome and are rewarded with API credits. For support, reach us at [email protected]

License

This project is licensed under the Apache License 2.0, see the LICENSE file for details.

Social Media

Follow us on Social media for library updates and free credits.

Comments
  • bug: holding when the program running after some samples

    Describe the bug: The program keeps hanging after processing some samples. My API key prefix is o6No6aqYRhrQ2MWxtDDyTeHiiUg****

    bug 
    opened by franztao 5
  • bug: function "et_sess.save_output(output_folder, output_format="csv")" output file, the file name lack some alpha of the origin full name

    Describe the bug: The output file name is missing some characters of the original full name. All my picture names end with the png suffix, such as "[email protected]_14-1-4.png".

    bug 
    opened by franztao 3
  • found some bugs and list the bugs out

    Describe the bug:

    1. Text that spans multiple columns is not recognized; when converted into a table, it is illogically split into two sides.
    2. The plus-minus sign is not recognized, e.g. 31.2 + 4.98.
    3. Subscripts and superscripts are not recognized.
    4. The OCR loses some recognized tokens (characters are dropped).
    5. For long tables, the bottom part is not recognized.
    6. When a cell contains a chemical formula, the table is not recognized.

    Expected behavior: I can help solve these problems together with you.

    bug 
    opened by franztao 2
  • question: what meaning is LayoutConfidence?

    "CharacterConfidence": <float>,             # Average Accuracy of all Characters recognized from the input-page
    "LayoutConfidence": <float>,                ## Accuracy of table layout's design decision

    Please give a detailed description of CharacterConfidence and LayoutConfidence, or the code used to calculate them.

    good first issue 
    opened by franztao 2
  • Invalid cross-device link

    Describe the bug: On some operating systems, the output file cannot be saved to a temporary directory (say /tmp) and then moved to a new location. It throws the following error:

    os.replace(each_tbl_path, os.path.join(output_folder, input_fname+os.path.basename(each_tbl_path)))
    OSError: [Errno 18] Invalid cross-device link: '/tmp/tmp7hqcm0fh/_table_1.csv' -> '/var/www/python/app/tmp/details_table_1.csv'

    After checking the source code, it appears that ExtractTable uses os.replace to move the file. This method does not support moving a file from one partition to another: https://stackoverflow.com/questions/42392600/oserror-errno-18-invalid-cross-device-link

    To Reproduce: I use Python 3.6 in a venv. You need two different filesystems, and you must invoke save_output from the ExtractTable-py library so that the file is saved from one filesystem to the other. I have not tried it, but I think you can reproduce this bug simply by invoking os.replace without calling ExtractTable-py.

    Expected behavior: The file is moved from one filesystem to the other. I think shutil.move would be preferable to os.replace for moving files.

    bug 
    opened by Elegye 2
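
    The suggestion in this report can be sketched as follows: try the fast rename first and fall back to shutil.move only when the operating system reports a cross-device link (errno.EXDEV on POSIX systems). This is a minimal sketch, not the library's actual code; move_file and the demo file names are hypothetical.

```python
import errno
import os
import shutil
import tempfile


def move_file(src: str, dst: str) -> str:
    """Move src to dst, falling back to copy-and-delete across filesystems."""
    try:
        os.replace(src, dst)        # Rename; only works within one filesystem
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise
        shutil.move(src, dst)       # Copies the file, then removes the source
    return dst


# Demo with temporary directories. These may live on the same filesystem,
# in which case os.replace succeeds and the fallback is not exercised.
with tempfile.TemporaryDirectory() as src_dir, tempfile.TemporaryDirectory() as dst_dir:
    src_path = os.path.join(src_dir, "_table_1.csv")
    with open(src_path, "w") as handle:
        handle.write("a,b\n1,2\n")
    moved = move_file(src_path, os.path.join(dst_dir, "details_table_1.csv"))
    print(os.path.exists(moved), os.path.exists(src_path))
```

    Calling shutil.move directly would also work, since it attempts a rename first and copies only when that fails.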
  • MakeCorrections API - How do you chain corrections

    Hi there, I'm trying to use multiple correction commands but it isn't working as the object becomes a list after the first correction. Is there something I'm missing here? Thanks!

    good first issue 
    opened by kylebutts 1
  • character ocr can support latex format?

    opened by franztao 1
  • please, do you have tools of transform ExtracTable output file type to CoCo file type(other open source Detection file type)?

    opened by franztao 1
  • Custom output path when the output_format is csv

    Is your feature request related to a problem? When output_format is set to csv, the CSV file is written to a random path under /tmp.

    Describe the solution you'd like: Add a parameter to process_file, such as output_file, that takes the absolute path, including the file name, where the file should be written.

    opened by padmano 1
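
    One possible workaround is to take the table data in memory and write it to a path of your choosing. The sketch below assumes the table is available in the TableJson layout described earlier ({row#: {col#: text}}); write_table_csv, the sample data, and the target path are hypothetical and not part of the library.

```python
import csv
import os
import tempfile


def write_table_csv(table_json, output_file):
    """Write a {row#: {col#: text}} table to the given CSV path and return the path."""
    os.makedirs(os.path.dirname(os.path.abspath(output_file)), exist_ok=True)
    with open(output_file, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        # Order rows and columns by their numeric index
        for row_key in sorted(table_json, key=int):
            row = table_json[row_key]
            writer.writerow([row[col_key] for col_key in sorted(row, key=int)])
    return output_file


# Illustrative data only
sample_table = {"0": {"0": "Item", "1": "Price"}, "1": {"0": "Pen", "1": "1.50"}}
target = os.path.join(tempfile.gettempdir(), "extracttable_demo", "table_1.csv")
print(write_table_csv(sample_table, target))
```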
  • Is it possible to  get the data in excel by maintaining table structure?

    opened by jcthink 1
  • Character and Layout Confidence

    Hi, I need a definition of Character and Layout Confidence, for example how the values printed by the code below are calculated mathematically. Thanks.

    for idx, each_table in enumerate(et_sess.ServerResponse.json()['Tables']):
        print("CharacterConfidence = ", each_table['CharacterConfidence'])
        print("LayoutConfidence = ", each_table['LayoutConfidence'])
    
    good first issue 
    opened by muhdzubair 1
  • Consider user hints on the table structure information

    Is your feature request related to a problem? "While you do whatever you want, why not consider our hints?" is feedback developers have given on many occasions.

    Describe alternatives you've considered: Developers are handling this with their own custom post-processing.

    Describe the solution you'd like: Pros: It may be worth a look, since most post-processing uses similar approaches that resolve the majority of issues. Cons: computing cost.

    feature/idea 
    opened by akshowhini 0
  • Capture Vertically center aligned columns

    Refer: https://stackoverflow.com/questions/58238981/extracting-table-from-a-pdf-table-without-vertical-lines

    Do not miss: Joelgeraci's comment to the question

    feature/idea 
    opened by akshowhini 0
Releases(v2.4.0)