Camelot: PDF Table Extraction for Humans
Camelot is a Python library that can help you extract tables from PDFs!
Note: You can also check out Excalibur, the web interface to Camelot!
Here's how you can extract tables from PDFs. You can check out the PDF used in this example here.
>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, markdown, sqlite
>>> tables[0]
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_markdown, to_sqlite
>>> tables[0].df # get a pandas DataFrame!
| Cycle Name | KI (1/km) | Distance (mi) | Percent Fuel Savings | | | |
|---|---|---|---|---|---|---|
| | | | Improved Speed | Decreased Accel | Eliminate Stops | Decreased Idle |
| 2012_2 | 3.30 | 1.3 | 5.9% | 9.5% | 29.2% | 17.4% |
| 2145_1 | 0.68 | 11.2 | 2.4% | 0.1% | 9.5% | 2.7% |
| 4234_1 | 0.59 | 58.7 | 8.5% | 1.3% | 8.5% | 3.3% |
| 2032_2 | 0.17 | 57.8 | 21.7% | 0.3% | 2.7% | 1.2% |
| 4171_1 | 0.07 | 173.9 | 58.1% | 1.6% | 2.1% | 0.5% |

Camelot also comes packaged with a command-line interface!
Note: Camelot only works with text-based PDFs and not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)
You can check out some frequently asked questions here.
Why Camelot?
- Configurability: Camelot gives you control over the table extraction process with tweakable settings.
- Metrics: You can discard bad tables based on metrics like accuracy and whitespace, without having to manually look at each table.
- Output: Each table is extracted into a pandas DataFrame, which integrates seamlessly into ETL and data analysis workflows. You can also export tables to multiple formats, including CSV, JSON, Excel, HTML, Markdown, and SQLite.
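The metrics point can be sketched in a few lines. The example below is a minimal illustration, not part of Camelot: the helper name `keep_good_tables` and both threshold values are assumptions chosen for the demo, and the second report is made-up data. It relies only on each table exposing a `parsing_report` dictionary with `accuracy` and `whitespace` keys, as in the session shown earlier, and uses stand-in objects so it runs without a PDF.

```python
from types import SimpleNamespace


def keep_good_tables(tables, min_accuracy=90.0, max_whitespace=50.0):
    """Return the tables whose parsing_report passes both thresholds.

    Hypothetical helper: the thresholds are illustrative, not Camelot defaults.
    """
    good = []
    for table in tables:
        report = table.parsing_report
        if report["accuracy"] >= min_accuracy and report["whitespace"] <= max_whitespace:
            good.append(table)
    return good


# Stand-in objects so the sketch runs without Camelot or a PDF.
# With Camelot installed, `tables` would come from camelot.read_pdf('foo.pdf').
tables = [
    # Report copied from the example session above.
    SimpleNamespace(parsing_report={"accuracy": 99.02, "whitespace": 12.24, "order": 1, "page": 1}),
    # Made-up report representing a poorly parsed table.
    SimpleNamespace(parsing_report={"accuracy": 41.5, "whitespace": 78.0, "order": 2, "page": 1}),
]

good_tables = keep_good_tables(tables)
print(len(good_tables))
```

Suitable thresholds depend on the documents being processed, so it is worth inspecting a few reports by hand before settling on values.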
See comparison with similar libraries and tools.
Support the development
If Camelot has helped you, please consider supporting its development with a one-time or monthly donation on OpenCollective.
Installation
Using conda
The easiest way to install Camelot is with conda, which is a package manager and environment management system for the Anaconda distribution.
$ conda install -c conda-forge camelot-py
Using pip
After installing the dependencies (tk and ghostscript), you can also just use pip to install Camelot:
$ pip install "camelot-py[base]"
From the source code
After installing the dependencies, clone the repo using:
$ git clone https://www.github.com/camelot-dev/camelot
and install Camelot using pip:
$ cd camelot
$ pip install ".[base]"
Documentation
The documentation is available at http://camelot-py.readthedocs.io/.
Wrappers
- camelot-php provides a PHP wrapper on Camelot.
Contributing
The Contributor's Guide has detailed information about contributing issues, documentation, code, and tests.
Versioning
Camelot uses Semantic Versioning. For the available versions, see the tags on this repository. For the changelog, you can check out HISTORY.md.
License
This project is licensed under the MIT License; see the LICENSE file for details.