pdf

A modern pure-Python library for reading PDF files.

The goal is to provide a modern interface for handling PDF files that is internally consistent and follows typical Python idioms.

The library should be pure Python (that is, without C extensions) while allowing the backend to be swapped, similar in concept to matplotlib backends and Keras backends.

The default backend could be PyPDF2.

Other possible backends are PyMuPDF (using MuPDF) and PikePDF (using QPDF).
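
The backend idea can be illustrated with a small sketch. Nothing below is taken from this library: the names Backend, DummyBackend, register_backend and Document are hypothetical, and the dummy backend serves canned strings instead of parsing a file. It only shows how a front end could delegate to an interchangeable backend.

```python
from typing import Dict, List, Protocol


class Backend(Protocol):
    """Hypothetical interface that every backend would implement."""

    def page_count(self, path: str) -> int: ...

    def page_text(self, path: str, index: int) -> str: ...


class DummyBackend:
    """Stand-in backend that serves canned pages instead of parsing a file."""

    def __init__(self, pages: List[str]) -> None:
        self._pages = pages

    def page_count(self, path: str) -> int:
        return len(self._pages)

    def page_text(self, path: str, index: int) -> str:
        return self._pages[index]


# Registry mapping a backend name to an instance (names are illustrative).
_BACKENDS: Dict[str, Backend] = {}


def register_backend(name: str, backend: Backend) -> None:
    _BACKENDS[name] = backend


class Document:
    """Front end that delegates all work to the selected backend."""

    def __init__(self, path: str, backend: str = "dummy") -> None:
        self._path = path
        self._backend = _BACKENDS[backend]

    def __len__(self) -> int:
        return self._backend.page_count(self._path)

    def __getitem__(self, index: int) -> str:
        return self._backend.page_text(self._path, index)


register_backend("dummy", DummyBackend(["first page", "second page"]))
doc = Document("example.pdf")
print(len(doc), doc[0])
```

With this shape, a real backend would only need to implement the two methods of the hypothetical Backend protocol and be registered under its own name.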

WARNING: This library is UNSTABLE at the moment! Expect many changes!

Installation

pip install pdffile

Usage

Retrieve Metadata

>>> import pdf

>>> doc = pdf.PdfFile("001-trivial/minimal-document.pdf")
>>> len(doc)
1

>>> doc.metadata
Metadata(
    title=None,
    producer='pdfTeX-1.40.23',
    creator='TeX',
    creation_date=datetime.datetime(2022, 4, 3, 18, 5, 42),
    modification_date=datetime.datetime(2022, 4, 3, 18, 5, 42),
    other={
         '/CreationDate': "D:20220403180542+02'00'",
         '/ModDate': "D:20220403180542+02'00'",
         '/Trapped': '/False',
         '/PTEX.Fullbanner': 'This is pdfTeX, V...'})
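
The raw '/CreationDate' value above uses the PDF date string format (D:YYYYMMDDHHmmSS followed by an optional UTC offset). As a rough sketch of how such a string relates to the parsed creation_date, the helper below converts it with the standard library only. The function name parse_pdf_date is hypothetical and is not claimed to be how this library does the conversion; it also ignores less common variants of the format.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS, where everything after the year is optional,
# followed by an optional "Z" or an offset such as +02'00'.
_PDF_DATE = re.compile(
    r"D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?:(?P<utc>Z)|(?P<sign>[+-])(?P<tzh>\d{2})'?(?P<tzm>\d{2})?'?)?$"
)


def parse_pdf_date(value: str) -> datetime:
    """Parse a PDF date string into a datetime (hypothetical helper)."""
    match = _PDF_DATE.match(value)
    if match is None:
        raise ValueError(f"not a PDF date string: {value!r}")
    parts = match.groupdict()
    parsed = datetime(
        int(parts["year"]),
        int(parts["month"] or 1),
        int(parts["day"] or 1),
        int(parts["hour"] or 0),
        int(parts["minute"] or 0),
        int(parts["second"] or 0),
    )
    if parts["utc"]:
        return parsed.replace(tzinfo=timezone.utc)
    if parts["sign"]:
        offset = timedelta(
            hours=int(parts["tzh"]), minutes=int(parts["tzm"] or 0)
        )
        if parts["sign"] == "-":
            offset = -offset
        return parsed.replace(tzinfo=timezone(offset))
    return parsed  # no offset given: return a naive datetime


created = parse_pdf_date("D:20220403180542+02'00'")
print(created.isoformat())
```

Dropping the offset with created.replace(tzinfo=None) gives the same naive value that the metadata example shows for creation_date.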

Encrypted PDFs

If you have an encrypted PDF, pass its password when opening the file:

doc = pdf.PdfFile(pdf_path, password=password)

All of the operations below then work as described.

Get Outline

>>> import pdf
>>> doc = pdf.PdfFile(pdf_path, password=password)
>>> doc.outline
[
    Links(page=5, text='1 Header'),
    Links(page=5, text='1.1 A section'),
    Links(page=9, text='2 Foobar'),
    Links(page=108, text='References')
]
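
A list like this can be turned into a printable table of contents. The sketch below is self-contained and does not import the library: Links here is a stand-in NamedTuple with the two fields shown above, and the nesting depth is guessed from the section number at the start of each title, which is an assumption about how the titles are written.

```python
from typing import List, NamedTuple


class Links(NamedTuple):
    """Stand-in for the outline entries shown above (page, text)."""

    page: int
    text: str


def depth_of(entry: Links) -> int:
    """Guess the nesting depth from a leading section number like '1.1'."""
    number = entry.text.split(" ", 1)[0]
    if number.replace(".", "").isdigit():
        return number.count(".")
    return 0  # unnumbered entries are treated as top level


def format_outline(outline: List[Links]) -> List[str]:
    """Return one indented line per outline entry."""
    return [
        "  " * depth_of(entry) + f"{entry.text} (page {entry.page})"
        for entry in outline
    ]


outline = [
    Links(page=5, text="1 Header"),
    Links(page=5, text="1.1 A section"),
    Links(page=9, text="2 Foobar"),
    Links(page=108, text="References"),
]
lines = format_outline(outline)
print("\n".join(lines))
```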

Extract Text

>>> import pdf
>>> doc = pdf.PdfFile("001-trivial/minimal-document.pdf")
>>> doc[0]
<pdf.PdfPage object at 0x7f72d2b04100>
>>> doc[0].text
'Loremipsumdolorsitamet,consetetursadipscingelitr,seddiamnonumyeirmod\ntemporinviduntutlaboreetdoloremagnaaliquyamerat,seddiamvoluptua.Atvero\neosetaccusametjustoduodoloresetearebum.Stetclitakasdgubergren,noseataki-\nmatasanctusestLoremipsumdolorsitamet.Loremipsumdolorsitamet,consetetur\nsadipscingelitr,seddiamnonumyeirmodtemporinviduntutlaboreetdoloremagna\naliquyamerat,seddiamvoluptua.Atveroeosetaccusametjustoduodoloresetea\nrebum.Stetclitakasdgubergren,noseatakimatasanctusestLoremipsumdolorsit\namet.\n1\n'

Alternatively, you can use doc.text to get the text of all pages.
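
The extracted text keeps the line breaks of the page, including words that were hyphenated at a line end (for example "ataki-" followed by "mata" above). A minimal post-processing sketch, independent of this library, could rejoin such words; the function name join_hyphenated is hypothetical, and the approach will also merge genuine hyphenated compounds that happen to break at a line end.

```python
import re


def join_hyphenated(text: str) -> str:
    """Rejoin words that were split with a hyphen at a line break."""
    # A word character, a hyphen, a newline, then another word character.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


sample = "noseataki-\nmatasanctusest\n1\n"
cleaned = join_hyphenated(sample)
print(repr(cleaned))
```

All other line breaks are left untouched, so page layout information such as the trailing page number line is preserved.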

Owner
The py-pdf organization maintains Python packages that deal with the PDF file format