Overview

PyPDF2

PyPDF2 is a pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files, retrieve text and metadata from PDFs, and merge entire files together.

Homepage
http://mstamy2.github.io/PyPDF2/

Examples

Please see the Sample_Code folder.

Documentation

Documentation is available at
https://pythonhosted.org/PyPDF2/

FAQ

Please see
http://mstamy2.github.io/PyPDF2/FAQ.html

Tests

PyPDF2 includes a test suite built on the unittest framework. All tests are located in the "Tests" folder and can be run from the command line with:

python -m unittest Tests.tests

Comments
  • PyCryptodome is required for some PDFs, but is not installed automatically as a dependency

    When pycryptodome is not installed, pypdf fails to read some PDFs, and gives this error:

    pypdf.errors.DependencyError: PyCryptodome is required for AES algorithm
    

    Because I wasn't familiar with pycryptodome, I wasn't sure what I needed to do to get it working. Eventually I figured out that pycryptodome was a Python library, and all I had to do was run pip3 install pycryptodome to fix the error.

    If possible, it would be nice if pypdf could 1) install pycryptodome as a dependency as part of the installation process for pypdf, OR 2) provide more information in the error, letting the user know that pycryptodome is a Python library that can be installed via pip.
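As a stopgap on the user side, a script can check up front whether PyCryptodome is importable and print an actionable hint. This is a minimal sketch using only the standard library; the helper name `aes_backend_hint` is hypothetical, and it assumes PyCryptodome is exposed as the top-level `Crypto` package (the older PyCrypto project uses the same package name, so a positive result is only a hint).

```python
import importlib.util


def aes_backend_hint() -> str:
    """Return a short message saying whether the `Crypto` package is importable."""
    # find_spec returns None when no importable top-level package has this name.
    if importlib.util.find_spec("Crypto") is not None:
        return "PyCryptodome appears to be installed."
    return "PyCryptodome is missing; try: pip3 install pycryptodome"


print(aes_backend_hint())
```

Running this before opening an AES-encrypted PDF gives a clearer message than the bare DependencyError shown above.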

    Environment

    Which environment were you using when you encountered the problem?

    $ python3 -m platform
    macOS-13.1-arm64-arm-64bit
    
    $ python3 -c "import pypdf;print(pypdf.__version__)"
    3.1.0
    

    Code + PDF

    This is a minimal, complete example that shows the issue:

    1. Install pypdf (pip3 install pypdf).
    2. Make sure pycryptodome is not installed (pip3 uninstall pycryptodome).
    3. Run the following Python script:
    from pypdf import PdfReader
    from urllib.request import urlopen
    from io import BytesIO
    
    # Get the PDF and convert it into a byte stream
    pdf_url = 'https://web.archive.org/web/30000101000000if_/http://www.latterdaytruth.org/pdf/100846.pdf'
    pdf_file = urlopen(pdf_url).read()
    pdf_bytes_stream = BytesIO(pdf_file)
    
    # Load the file with pypdf
    pdf_reader = PdfReader(pdf_bytes_stream)
    
    # Print the number of pages
    pages_count = len(pdf_reader.pages)
    print('Number of pages: {0}'.format(pages_count))
    

    This is the PDF I'm attempting to read: https://web.archive.org/web/30000101000000if_/http://www.latterdaytruth.org/pdf/100846.pdf

    Traceback

    Traceback (most recent call last):
      File "/Users/sbradshaw/Desktop/test-pypdf-pages.py", line 14, in <module>
        pages_count = len(pdf_reader.pages)
      File "/Users/sbradshaw/.pyenv/versions/3.10.2/lib/python3.10/site-packages/pypdf/_page.py", line 2063, in __len__
        return self.length_function()
      File "/Users/sbradshaw/.pyenv/versions/3.10.2/lib/python3.10/site-packages/pypdf/_reader.py", line 445, in _get_num_pages
        return self.trailer[TK.ROOT]["/Pages"]["/Count"]  # type: ignore
      File "/Users/sbradshaw/.pyenv/versions/3.10.2/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 266, in __getitem__
        return dict.__getitem__(self, key).get_object()
      File "/Users/sbradshaw/.pyenv/versions/3.10.2/lib/python3.10/site-packages/pypdf/generic/_base.py", line 259, in get_object
        obj = self.pdf.get_object(self)
      File "/Users/sbradshaw/.pyenv/versions/3.10.2/lib/python3.10/site-packages/pypdf/_reader.py", line 1205, in get_object
        retval = self._get_object_from_stream(indirect_reference)  # type: ignore
      File "/Users/sbradshaw/.pyenv/versions/3.10.2/lib/python3.10/site-packages/pypdf/_reader.py", line 1136, in _get_object_from_stream
        obj_stm: EncodedStreamObject = IndirectObject(stmnum, 0, self).get_object()  # type: ignore
      File "/Users/sbradshaw/.pyenv/versions/3.10.2/lib/python3.10/site-packages/pypdf/generic/_base.py", line 259, in get_object
        obj = self.pdf.get_object(self)
      File "/Users/sbradshaw/.pyenv/versions/3.10.2/lib/python3.10/site-packages/pypdf/_reader.py", line 1269, in get_object
        retval = self._encryption.decrypt_object(
      File "/Users/sbradshaw/.pyenv/versions/3.10.2/lib/python3.10/site-packages/pypdf/_encryption.py", line 761, in decrypt_object
        return cf.decrypt_object(obj)
      File "/Users/sbradshaw/.pyenv/versions/3.10.2/lib/python3.10/site-packages/pypdf/_encryption.py", line 185, in decrypt_object
        obj._data = self.stmCrypt.decrypt(obj._data)
      File "/Users/sbradshaw/.pyenv/versions/3.10.2/lib/python3.10/site-packages/pypdf/_encryption.py", line 147, in decrypt
        raise DependencyError("PyCryptodome is required for AES algorithm")
    pypdf.errors.DependencyError: PyCryptodome is required for AES algorithm
    
    opened by samuelbradshaw 3
  • PERF: Use __slots__

    opened by MartinThoma 1
  • ROB: ignore_eof everywhere for read_until_regex

    This was initially motivated by NumberObject.read_from_stream, which was calling read_until_regex with the default value of ignore_eof=False and thus raising exceptions like:

    PyPDF2.errors.PdfStreamError: Stream has ended unexpectedly
    

    https://github.com/py-pdf/PyPDF2/commit/431ba7092037af7d1c296f8f280aca167859ce61 demonstrates a similar fix for NameObject.read_from_stream.

    From discussion in https://github.com/py-pdf/pypdf/pull/1505, it was realized that the change to NumberObject.read_from_stream had now made ALL callers of read_until_regex pass ignore_eof=True. It's cleaner to remove the parameter entirely and change the default behaviour.
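To illustrate what the flag changes, here is a simplified stand-in for such a reader. It is not the library's implementation: the function name `read_until_regex_sketch`, the `StreamEndedError` class, and the 16-byte chunk size are assumptions for this sketch, and it only handles delimiters that do not span a chunk boundary.

```python
import re
from io import BytesIO


class StreamEndedError(Exception):
    """Hypothetical stand-in for the library's stream error."""


def read_until_regex_sketch(stream, regex, ignore_eof=False):
    """Read bytes until `regex` matches, leaving the stream at the match start."""
    collected = b""
    while True:
        chunk = stream.read(16)
        if not chunk:
            # End of stream reached before any delimiter was found.
            if ignore_eof:
                return collected
            raise StreamEndedError("Stream has ended unexpectedly")
        match = regex.search(chunk)
        if match is not None:
            collected += chunk[: match.start()]
            # Rewind so the delimiter itself is still unread.
            stream.seek(match.start() - len(chunk), 1)
            return collected
        collected += chunk


delimiter = re.compile(rb"\s")
# A number that is the very last token in the stream, with no trailing delimiter.
print(read_until_regex_sketch(BytesIO(b"123"), delimiter, ignore_eof=True))
```

With `ignore_eof=False` the same input raises instead of returning the digits, which is the behaviour the issue describes for numbers at the end of a stream.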

    opened by rraval 1
  • Extracting text doesn't work for cropped boxes

    I am trying to read a few boxes from a nested PDF. My approach is to crop only these three boxes, save them as a temp file, and read them. But when I try to read these boxes, I either get nothing or I get the whole PDF (as if I hadn't cropped it). I can see from the temp file that the cropping worked.

    Environment

    python (miniconda) 3.9.15 pypdf2 2.11.1

    Code + PDF

    file_path = "example.pdf"
    temp_path = "temp.pdf"
    
    from PyPDF2 import PdfReader, PdfWriter
    import os
    from copy import copy
    
    def crop(PAGE, LEFT, TOP, RIGHT, BOTTOM):
        # PyPDF2 coordinates start from the bottom left
        page_x, page_y = PAGE.cropbox.upper_left
    
        # convert pyPDF2.FloatObjects into floats
        upper_left = [page_x.as_numeric(), page_y.as_numeric()]
    
        # find new margins
        new_upper_left  = (upper_left[0] + LEFT, upper_left[1] - TOP)
        new_lower_right = (upper_left[0] + RIGHT, upper_left[1] - BOTTOM)
    
        #crop
        PAGE.cropbox.upper_left = new_upper_left
        PAGE.cropbox.lower_right = new_lower_right
    
    def read_pdf(FILE):
        input = PdfReader(FILE)
        output = []
    
        month_page_num = 0
        employee_page_num = 1
        salary_page_num = 2
    
        tot_pages = len(input.pages)
        #last_page = int(tot_pages/2) # do not mind about this
        last_page = 1
        print('Pages in PDF: ' + str(tot_pages))
    
        for page in range(last_page):
            temp = PdfWriter()
    
            # Remove temp file
            if os.path.exists(temp_path):
                os.remove(temp_path)
    
            print('Working on page: ' + str(page+1))
    
            # Month
            month_page = copy(input.pages[page])
            crop(month_page, 360, 60, 450, 75)
            temp.add_page(month_page)
    
            # Employee
            employee_page = copy(input.pages[page])
            crop(employee_page, 365, 108, 490, 124)
            temp.add_page(employee_page)
    
            # Salary
            salary_page = copy(input.pages[page])
            crop(salary_page, 378, 698, 448, 707)
            temp.add_page(salary_page)
    
            with open(temp_path, "wb") as pdf:
                temp.write(pdf)
    
            # extracting text from pages
            raw = PdfReader(temp_path)
            output.append({
                'month': raw.pages[month_page_num].extract_text(),
                'employee': raw.pages[employee_page_num].extract_text(),
                'salary': raw.pages[salary_page_num].extract_text()
            })
    
        return output
    
    data = read_pdf(file_path)
    print(data)
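A likely explanation is that setting the crop box only changes the visible region recorded for the page and leaves the page's content stream untouched, so text extraction still walks all of the text. One possible workaround is to filter text by position with the optional visitor functions of extract_text() (listed under release 2.11.0). The sketch below is an assumption-laden illustration, not a confirmed fix: `make_box_visitor` is a hypothetical helper, the box is given in PDF user-space units with the origin at the bottom left, and it reads the position from the text matrix only (`tm[4]`, `tm[5]`), ignoring the current transformation matrix.

```python
def make_box_visitor(left, bottom, right, top):
    """Build a visitor that keeps only text whose origin lies inside the box."""
    parts = []

    def visitor(text, cm, tm, font_dict, font_size):
        # tm[4] and tm[5] hold the x and y position of the text matrix.
        x, y = tm[4], tm[5]
        if left <= x <= right and bottom <= y <= top:
            parts.append(text)

    return parts, visitor


parts, visitor = make_box_visitor(360, 700, 450, 760)

# With a real page this would be: page.extract_text(visitor_text=visitor)
# Here the visitor is called by hand with made-up matrices to show the filtering.
identity = [1, 0, 0, 1, 0, 0]
visitor("March 2022", identity, [1, 0, 0, 1, 365, 720], None, 10)
visitor("outside the box", identity, [1, 0, 0, 1, 50, 50], None, 10)
print("".join(parts))
```

The box coordinates above are placeholders; the offsets in the report are measured from the top of the page and would need converting to bottom-left coordinates first.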
    
    

    example.pdf temp.pdf

    workflow-text-extraction 
    opened by aster94 2
  • Random whitespaces are inserted when using page.extract_text()

    I am trying to extract text from various PDF documents to use in an NLP project. When using page.extract_text(), random whitespace appears inside the extracted words where there are no spaces in the PDF document.

    Environment

    Using VS code and running via command prompt.

    $ python -m platform
    Windows-10-10.0.22621-SP0
    
    $ python -c "import PyPDF2;print(PyPDF2.__version__)"
    2.12.1
    

    Code + PDF

    This is a minimal, complete example that shows the issue:

    test_doc.pdf (the PDF was generated using the default settings in Microsoft Word). It looks like this:

    [image]

    The code is:

    import os
    
    from PyPDF2 import PdfReader, __version__
    
    pdf = PdfReader(os.path.join(os.getcwd(), "test_doc.pdf"))
    
    print(f"PyPDF2=={__version__}")
    
    text = ""
    for page in pdf.pages:
        page_content = page.extract_text()
        text = text + page_content
    print(text)
    
    

    Output

    PyPDF2==2.12.1
    This is a test document by Ethan Nelson.  
     
    Tuesday was a good time to call ( 000) 000-0000 . This is his ph one mu mber . This is a random address for 
    testing purposes : 341 Maple st Paytonville Maine 45681.  
    Anyway, there are random whitespaces here . 
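Until the extraction itself improves, some of the stray spaces around punctuation can be cleaned up after the fact. The sketch below is a hypothetical post-processing helper (`tidy_spaces` is not part of PyPDF2) and it cannot repair words that were split in the middle, such as "ph one" above; it only tightens spacing next to punctuation and brackets.

```python
import re


def tidy_spaces(text: str) -> str:
    """Remove spaces before closing punctuation and after opening brackets."""
    # " ." -> "." and " :" -> ":" (also covers , ; ! ? and closing parenthesis)
    text = re.sub(r"[ \t]+([.,:;!?)])", r"\1", text)
    # "( 000" -> "(000"
    text = re.sub(r"\([ \t]+", "(", text)
    # Collapse runs of spaces left over from the steps above.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text


print(tidy_spaces("Tuesday was a good time to call ( 000) 000-0000 . Testing purposes : 341 Maple st"))
```

Because the rules are purely textual, they may also alter spacing that was intentional in the source document, so the result is best treated as an approximation.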
    
    workflow-text-extraction 
    opened by einelson 11
  • ROB: Ignore EOF in NumberObject.read_from_stream

    Use ignore_eof=True just like NameObject does, which is the only other caller to read_until_regex in this module.

    This helps prevent arcane exceptions when trying to parse a number:

    PyPDF2.errors.PdfStreamError: Stream has ended unexpectedly
    

    The motivation is essentially identical to the change that introduced ignore_eof=True on NameObjects: https://github.com/py-pdf/PyPDF2/commit/431ba7092037af7d1c296f8f280aca167859ce61

    opened by rraval 4
Releases(3.2.0)
  • 3.2.0(Dec 31, 2022)

    What's Changed

    Performance Improvement (PI)

    • Help the specializing adaptive interpreter (#1522)

    New Features (ENH)

    • Add support for page labels (#1519)

    Bug Fixes (BUG)

    • upgrade clone_document_root (#1520) by @pubpub-zz

    Miscellaneous

    • DOC: Fix migration guide link by @abyesilyurt in https://github.com/py-pdf/pypdf/pull/1516
    • MAINT: Minor Improvements by @robbiebusinessacc in https://github.com/py-pdf/pypdf/pull/1523

    New Contributors

    • @abyesilyurt made their first contribution in https://github.com/py-pdf/pypdf/pull/1516
    • @robbiebusinessacc made their first contribution in https://github.com/py-pdf/pypdf/pull/1523

    Full Changelog: https://github.com/py-pdf/pypdf/compare/3.1.0...3.2.0

    Source code(tar.gz)
    Source code(zip)
  • 3.1.0(Dec 23, 2022)

    What's Changed

    Move PyPDF2 to pypdf (#1513). The name is now all lowercase with no number in it, for both installation and import. PyPDF2 will no longer receive updates. The community should move back to its roots (pypdf).
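For code that has to run in environments where either name may be installed, the package can be detected before importing. This is a minimal sketch using only the standard library; the helper name `pick_pdf_package` is hypothetical, and it assumes the two import names are exactly `pypdf` and `PyPDF2` as stated above.

```python
import importlib.util


def pick_pdf_package():
    """Return the first installed package name, preferring the new one, or None."""
    for name in ("pypdf", "PyPDF2"):
        if importlib.util.find_spec(name) is not None:
            return name
    return None


package = pick_pdf_package()
if package is None:
    print("Neither pypdf nor PyPDF2 is installed; try: pip install pypdf")
else:
    print(f"Using {package}")
```

Once the old name is no longer needed, a plain `from pypdf import PdfReader` is simpler than any detection logic.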

    Full Changelog: https://github.com/py-pdf/pypdf/compare/3.0.0...3.1.0

  • 3.0.0(Dec 22, 2022)

    What's Changed

    BREAKING CHANGES

    • Deprecate features with PyPDF2==3.0.0 (#1489)
    • Refactor Fit / Zoom parameters (#1437)

    New Features (ENH)

    • Add Cloning (#1371) by @pubpub-zz
    • Allow int for indirect_reference in PdfWriter.get_object (#1490)

    Documentation (DOC)

    • How to read PDFs from S3 (#1509)
    • Make MyST parse all links as simple hyperlinks (#1506) by @mbromet
    • Changed 'latest' for 'stable' generated docs (#1495) by @olsonperrensen
    • Adjust deprecation procedure (#1487)

    Maintenance (MAINT)

    • Use typing.IO for file streams (#1498) by @thehale

    New Contributors

    • @olsonperrensen made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1495
    • @thehale made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1498
    • @mbromet made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1506

    Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.12.1...3.0.0

  • 2.12.1(Dec 10, 2022)

    What's Changed

    Documentation (DOC)

    • Deduplicate extract_text docstring (#1485)
    • How to cite PyPDF2 (#1476)

    Maintenance (MAINT)

    Consistency changes:

    • indirect_ref/ido ➔ indirect_reference, dest ➔ page_destination (#1467) by @kygoben
    • owner_pwd/user_pwd ➔ owner_password/user_password (#1483)
    • position ➔ page_number in Merger.merge (#1482) by @Infus3d
    • indirect_ref ➔ indirect_reference (#1484)

    New Contributors

    • @kygoben made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1467
    • @Infus3d made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1482

    Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.12.0...2.12.1

  • 2.12.0(Dec 10, 2022)

    What's Changed

    Version 2.12.0, 2022-12-10

    New Features (ENH)

    • Add support to extract gray scale images (#1460) by @joeywang4
    • Make PdfReader.get_object accept integer arguments (#1459) by @pubpub-zz
    • Add 'threads' property to PdfWriter (#1458) by @pubpub-zz
    • Add 'open_destination' property to PdfWriter (#1431) by @pubpub-zz

    Bug Fixes (BUG)

    • Scale PDF annotations (#1479) by @joshhendo

    Robustness (ROB)

    • Padding issue with AES encryption (#1469)
    • Accept empty object as null objects (#1477) by @pubpub-zz

    Documentation (DOC)

    • Add module documentation the PaperSize class (#1447) by @MagnumBarrage

    Maintenance (MAINT)

    • Use 'page_number' instead of 'pagenum' (#1365)
    • Add List of pages to PageRangeSpec (#1456) by @pubpub-zz

    Testing (TST)

    • Cleanup temporary files (#1454) by @pubpub-zz
    • Mark test_tounicode_is_identity as external (#1449) by @heirecka
    • Use Ubuntu 20.04 for running CI test suite (#1452) by @MasterOdin

    New Contributors

    • @heirecka made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1449
    • @MagnumBarrage made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1447
    • @joeywang4 made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1460
    • @joshhendo made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1479

    Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.11.2...2.12.0

  • 2.11.2(Nov 20, 2022)

    What's Changed

    New Features (ENH)

    • Add remove_from_tree (#1432) by @pubpub-zz
    • Add AnnotationBuilder.rectangle (#1388)

    Bug Fixes (BUG)

    • JavaScript executed twice (#1439) by @pubpub-zz
    • ToUnicode stores /Identity-H instead of stream (#1433) by @pubpub-zz
    • Declare Pillow as optional dependency (#1392)

    Developer Experience (DEV)

    • Link 'Full Changelog' automatically
    • Modify read_string_from_stream to a benchmark (#1415)
    • Improve error reporting of read_object (#1412) by @pubpub-zz
    • Test Python 3.11 (#1404)
    • Extend Flake8 ignore list (#1410)
    • Use correct pytest markers (#1407)
    • Move project configuration to pyproject.toml (#1382) by @singingwolfboy

    Documentation (DOC)

    • Fix typos in installation.md by @amyreyespdx in https://github.com/py-pdf/PyPDF2/pull/1419
    • Typos in PDF format documentation by @pavlidvg in https://github.com/py-pdf/PyPDF2/pull/1438

    New Contributors

    • @singingwolfboy made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1391
    • @amyreyespdx made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1419
    • @pavlidvg made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1438

    Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.11.1...2.11.2

  • 2.11.1(Oct 9, 2022)

    What's Changed

    Bug Fixes (BUG)

    • td matrix (#1373) by @srogmann
    • Cope with cmap from #1322 (#1372) by @pubpub-zz

    Robustness (ROB)

    • Cope with str returned from get_data in cmap (#1380) by @pubpub-zz

    Documentation (DOC)

    • Remove watermark PageObject declaration as it is already present inside for-loop (#1384) by @cs2sandeep

    New Contributors

    • @cs2sandeep made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1384

    Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.11.0...2.11.1

  • 2.11.0(Sep 25, 2022)

    What's Changed

    New Features (ENH):

    • Addition of optional visitor-functions in extract_text() (#1252) by @srogmann
    • Add PageObject.images attribute (#1330) by @MartinThoma
    • Add metadata.creation_date and modification_date (#1364) by @MartinThoma

    Bug Fixes (BUG):

    • Lookup index in _xobj_to_image can be ByteStringObject (#1366)
    • 'IndexError: index out of range' when using extract_text (#1361)
    • Errors in transfer_rotation_to_content() (#1356) by @pubpub-zz

    Robustness (ROB):

    • Ensure update_page_form_field_values does not fail if no fields (#1346) by @pubpub-zz

    Testing (TST):

    • read_string_from_stream performance (#1355) by @mergezalot

    New Contributors

    • @srogmann made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1252

    Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.10.9...2.11.0

  • 2.10.9(Sep 18, 2022)

    What's Changed

    New Features (ENH)

    • Add rotation property and transfer_rotate_to_content (#1348) by @pubpub-zz

    Performance Improvements (PI)

    • Avoid string concatenation with large embedded base64-encoded images (#1350) by @mergezalot

    Bug Fixes (BUG)

    • Format floats using their intrinsic decimal precision (#1267) by @programmarchy

    Robustness (ROB)

    • Fix merge_page for pages without resources (#1349) by @pubpub-zz

    New Contributors

    • @mergezalot made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1350
    • @programmarchy made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1267

    Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.10.8...2.10.9

  • 2.10.8(Sep 14, 2022)

    What's Changed

    • ROB: Improve NameObject reading/writing by @pubpub-zz in https://github.com/py-pdf/PyPDF2/pull/1345
    • ENH: Add PageObject.user_unit property by @MartinThoma in https://github.com/py-pdf/PyPDF2/pull/1336

    Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.10.7...2.10.8

  • 2.10.7(Sep 11, 2022)

    What's Changed

    Bug Fixes (BUG)

    • Fix Error in transformations (#1341) by @pubpub-zz
    • Decode #23 in NameObject (#1342) by @pubpub-zz

    Testing (TST)

    • Use pytest.warns() for warnings, and .raises() for exceptions (#1325) by @mgorny

    New Contributors

    • @mgorny made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1325

    Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.10.6...2.10.7

  • 2.10.6(Sep 9, 2022)

    What's Changed

    Two robustness issues were fixed by @pubpub-zz - thank you 🙏 The infinite loop issue might also be a security concern, depending on how you use PyPDF2.

    Robustness (ROB):

    • Fix infinite loop due to Invalid object (#1331)
    • Fix image extraction issue with superfluous whitespaces (#1327)

    Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.10.5...2.10.6

  • 1.28.6(Sep 8, 2022)

    This is a bugfix for the old 1.x branch of PyPDF2 that still supports Python 2. Please try to update to the latest PyPDF2 > 2.0.0 version to get way better text extraction, support for modern encryption, and much more.

    What's Changed

    • BUG: Adjust 'super' calls for Python 2 by @omit66 in https://github.com/py-pdf/PyPDF2/pull/1335

    New Contributors

    • @omit66 made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1335

    Full Changelog: https://github.com/py-pdf/PyPDF2/compare/1.28.5...1.28.6

  • 2.10.5(Sep 4, 2022)

    What's Changed

    New Features (ENH)

    • Process XRefStm (#1297)
    • Auto-detect RTL for text extraction (#1309) by @pubpub-zz

    Bug Fixes (BUG)

    • Avoid scaling cropbox twice (#1314) by @yegorLitvinov

    Robustness (ROB)

    • Fix offset correction in revised PDF (#1318) by @pubpub-zz
    • Crop data of /U and /O in encryption dictionary to 48 bytes (#1317) by @exiledkingcc
    • MultiLine bfrange in cmap (#1299) by @pubpub-zz
    • Cope with 2 digit codes in bfchar (#1310) by @pubpub-zz
    • Accept '/annn' charset as ASCII code (#1316) by @pubpub-zz
    • Log errors during Float / NumberObject initialization (#1315) by @pubpub-zz
    • Cope with corrupted entries in xref table (#1300) by @pubpub-zz

    Documentation (DOC)

    • Migration guide (PyPDF2 1.x ➔ 2.x) (#1324)
    • Creating a coverage report (#1319)
    • Fix AnnotationBuilder.free_text example (#1311)
    • Fix usage of page.scale by replacing it with page.scale_by (#1313) by @yegorLitvinov

    Developer Experience (DEV)

    • Only run coverage for PyPDF2

    Maintenance (MAINT)

    • PdfReaderProtocol (#1303)
    • Throw PdfReadError if Trailer can't be read (#1298) by @ediamondscience
    • Remove catching OverflowException (#1302)

    Testing (TST)

    • Catch Exception for sample-files repo (#1307)

    New Contributors

    • @ediamondscience made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1298
    • @yegorLitvinov made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1313
    • @markdlevy made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1311

    Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.10.4...2.10.5

  • 2.10.4(Aug 28, 2022)

    What's Changed

    Robustness (ROB)

    • Fix errors/warnings on no /Resources within extract_text (#1276) by @pubpub-zz
    • Add required line separators in ContentStream ArrayObjects (#1281) by @pubpub-zz

    Maintenance (MAINT)

    • Use NameObject idempotency (#1290)

    Testing (TST)

    • Rectangle deletion (#1289)
    • Add workflow tests (#1287)
    • Remove files after tests ran (#1286)

    Packaging (PKG)

    • Add minimum version for typing_extensions requirement (#1277) by @Shortfinga

    New Contributors

    • @Shortfinga made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1277

    Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.10.3...2.10.4

  • 2.10.3(Aug 21, 2022)

    What's Changed

    Robustness (ROB)

    • Decrypt returns empty bytestring (#1258) by @pubpub-zz

    Developer Experience (DEV)

    • Modify CI to better verify built package contents (#1244) by @MasterOdin

    Maintenance (MAINT)

    • Let PdfMerger._create_stream raise NotImplemented (#1251) and remove 'mine' as PdfMerger always creates the stream (#1261)
    • password param of _security._alg32(...) is only a string, not bytes (#1259)
    • Remove unreachable code in read_block_backwards (#1250) and _extract_text (#1262)

    Testing (TST)

    • Delete annotations (#1263)
    • Close PdfMerger in tests (#1260)
    • PdfReader.xmp_metadata workflow (#1257)
    • Various PdfWriter (Layout, Bookmark deprecation) (#1249)

    Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.10.2...2.10.3

  • 2.10.2(Aug 15, 2022)

  • 2.10.1(Aug 15, 2022)

    What's Changed

    Bug Fixes (BUG)

    • TreeObject.remove_child had a non-PdfObject assignment for Count (#1233, #1234)
    • Fix stream truncated prematurely (#1223) by @pubpub-zz

    Documentation (DOC)

    • Fix docstring formatting (#1228)

    Maintenance (MAINT)

    • Split generic.py (#1229)

    Testing (TST)

    • Decrypt AlgV4 with owner password (#1239)
    • AlgV5.generate_values (#1238)
    • TreeObject.remove_child / empty_tree (#1235, #1236)
    • create_string_object (#1232)
    • Free-Text annotations (#1231)
    • generic._base (#1230)
    • Strict get fonts (#1226)
    • Increase PdfReader coverage (#1219, #1225)
    • Increase PdfWriter coverage (#1237)
    • 100% coverage for utils.py (#1217)
    • PdfWriter exception non-binary stream (#1218)
    • Don't check coverage for deprecated code (#1216)

    Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.10.0...2.10.1

  • 2.10.0(Aug 7, 2022)

    What's Changed

    New Features (ENH):

    • "with" support for PdfMerger and PdfWriter (#1193) by @JianzhengLuo
    • Add AnnotationBuilder.text(...) to build text annotations (#1202)

    Bug Fixes (BUG):

    • Allow IndirectObjects as stream filters (#1211)

    Documentation (DOC):

    • Font scrambling
    • Page vs Content scaling (#1208)
    • Example for orientation parameter of extract_text (#1206) by @pubpub-zz
    • Fix AnnotationBuilder parameter formatting (#1204)

    Developer Experience (DEV):

    • Add flake8-print (#1203)

    Maintenance (MAINT):

    • Introduce WrongPasswordError / FileNotDecryptedError / EmptyFileError (#1201) by @chilledgeek

    New Contributors 🎉

    • @JianzhengLuo made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1193
    • @chilledgeek made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1201

    Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.9.0...2.10.0

  • 2.9.0(Jul 31, 2022)

    What's Changed

    New Features (ENH)

    • Add ability to add hex encoded colors to outline items (#1186) by @mtd91429
    • Add support for pathlib.Path in PdfMerger.merge (#1190) by @MartinThoma
    • Add link annotation (#1189) by @MartinThoma
    • Add capability to filter text extraction by orientation (#1175) by @pubpub-zz

    Bug Fixes (BUG)

    • Named Dest in PDF1.1 (#1174) by @pubpub-zz
    • Incomplete Graphic State save/restore (#1172) by @pubpub-zz

    Documentation (DOC)

    • Update changelog url in package metadata (#1180) by @mkniewallner
    • Mention camelot for table extraction (#1179) by @MartinThoma
    • Mention pyHanko for signing PDF documents (#1178) by @MartinThoma
    • We have CMAP support since a while (#1177) by @MartinThoma

    Maintenance (MAINT)

    • Consistent usage of warnings / log messages (#1164) by @MartinThoma
    • Consistent terminology for outline items (#1156) by @mtd91429

    New Contributors

    • @mkniewallner made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1180 🎉

    Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.8.1...2.9.0

  • 2.8.1(Jul 25, 2022)

    What's Changed

    Bug Fixes (BUG)

    • u_hash in AlgV4.compute_key (#1170) by @exiledkingcc

    Robustness (ROB)

    • Fix loading of file from #134 (#1167)
    • Cope with empty DecodeParams (#1165) by @pubpub-zz

    Documentation (DOC)

    • Typo in merger deprecation warning message (#1166) by @pubpub-zz

    Maintenance (MAINT)

    • Package updates; solve mypy strict remarks (#1163)

    Testing (TST)

    • Add test from #325 (#1169) by @pubpub-zz

    Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.8.0...2.8.1

  • 2.8.0(Jul 24, 2022)

    What's Changed

    Thank you @pubpub-zz and @exiledkingcc for your contributions ❤️

    New Features (ENH)

    • Add writer.add_annotation, page.annotations, and generic.AnnotationBuilder (#1120)

    Bug Fixes (BUG)

    • Set /AS for /Btn form fields in writer (#1161)
    • Ignore if /Perms verify failed (#1157)

    Robustness (ROB)

    • Cope with utf16 character for space calculation (#1155)
    • Cope with null params for FitH / FitV destination (#1152)
    • Handle outlines without valid destination (#1076)

    Developer Experience (DEV)

    • Introduce _utils.logger_warning (#1148)

    Maintenance (MAINT)

    • Break up parse_to_unicode (#1162)
    • Add diagnostic output to exception in read_from_stream (#1159)
    • Reduce PdfReader.read complexity (#1151)

    Testing (TST)

    • Add workflow tests found by arc testing (#1154)
    • Decrypt file which is not encrypted (#1149)
    • Test CryptRC4 encryption class; test image extraction filters (#1147)

    Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.7.0...2.8.0

  • 2.7.0(Jul 21, 2022)

    What's Changed

    New Features (ENH)

    • Add outline_count property (#1129)

    Bug Fixes (BUG)

    • Make reader.get_fields also return dropdowns with options (#1114)
    • Add deprecated EncodedStreamObject functions back until PyPDF2==3.0.0 (#1139)

    Robustness (ROB)

    • Cope with missing /W entry (#1136)
    • Cope with invalid parent xref (#1133)

    Documentation (DOC)

    • Contributors file (#1132)
    • Fix type in signature of PdfWriter.add_uri (#1131)

    Developer Experience (DEV)

    • Add .git-blame-ignore-revs (#1141)

    Code Style (STY)

    • Fixing typos (#1137)
    • Re-use code via get_outlines_property in tests (#1130)

    New Contributors

    • @KourFrost made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1114

    Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.6.0...2.7.0

  • 1.28.5(Jul 21, 2022)

    What's Changed

    • BUG: Add missing deprecated EncodedStreamObject functions by @MasterOdin in https://github.com/py-pdf/PyPDF2/pull/1140

    Full Changelog: https://github.com/py-pdf/PyPDF2/compare/1.28.4...1.28.5

  • 2.6.0(Jul 17, 2022)

    What's Changed

    New Features (ENH)

    • Add color and font_format to PdfReader.outlines[i] (#1104)
    • Extract Text Enhancement (whitespaces) (#1084)

    Bug Fixes (BUG)

    • Use build_destination for named destination outlines (#1128)
    • Avoid a crash when a ToUnicode CMap has an empty dstString in beginbfchar (#1118)
    • Prevent deduplication of PageObject (#1105)
    • None-check in DictionaryObject.read_from_stream (#1113)
    • Avoid IndexError in _cmap.parse_to_unicode (#1110)

    Documentation (DOC)

    • Explanation for git submodule
    • Watermark and stamp (#1095)

    Maintenance (MAINT)

    • Text extraction improvements (#1126)
    • Destination.color returns ArrayObject instead of tuple as fallback (#1119)
    • Use add_bookmark_destination in add_bookmark (#1100)
    • Use add_bookmark_destination in add_bookmark_dict (#1099)

    Testing (TST)

    • Add test for Arabic text (#1127)
    • Add xfail for decryption fail (#1125)
    • Add xfail test for IndexError when extracting text (#1124)
    • Add MCVE showing outline title issue (#1123)

    Code Style (STY)

    • Use IntFlag for permissions_flag / update_page_form_field_values (#1094)
    • Simplify code (#1101)

    New Contributors

    • @mtd91429 made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1104
    • @dkg made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1110
    • @jlshin made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1113

    Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.5.0...2.6.0

  • 2.5.0(Jul 10, 2022)

    What's Changed

    New Features (ENH)

    • Add support for indexed color spaces / BitsPerComponent for decoding PNGs (#1067)
    • Add PageObject._get_fonts (#1083)

    Performance Improvements (PI)

    • Use iterative DFS in PdfWriter._sweep_indirect_references (#1072)

    Bug Fixes (BUG)

    • Let Page.scale also scale the crop-/trim-/bleed-/artbox (#1066)
    • Column default for CCITTFaxDecode (#1079)

    Robustness (ROB)

    • Guard against None-value in _get_outlines (#1060)

    Documentation (DOC)

    • Stamps and watermarks (#1082)
    • OCR vs PDF text extraction (#1081)
    • Python Version support
    • Formatting of CHANGELOG

    Developer Experience (DEV)

    • Cache downloaded files (#1070)
    • Speed-up for CI (#1069)

    Maintenance (MAINT)

    • Set page.rotate(angle: int) (#1092)
    • Issue #416 was fixed by #1015 (#1078)

    Testing (TST)

    • Image extraction (#1080)
    • Image extraction (#1077)

    Code Style (STY)

    • Apply black
    • Typo in Changelog

    Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.4.2...2.5.0

  • 2.4.2(Jul 5, 2022)

    What's Changed

    New Features (ENH)

    • Add PdfReader.xfa attribute (#1026)

    Bug Fixes (BUG)

    • Wrong page inserted when PdfMerger.merge is done (#1063)
    • Resolve IndirectObject when it refers to a free entry (#1054)

    Developer Experience (DEV)

    • Added {posargs} to tox.ini (#1055)

    Maintenance (MAINT)

    • Remove PyPDF2._utils.bytes_type (#1053)

    Testing (TST)

    • Scale page (indirect rect object) (#1057)
    • Simplify pathlib PdfReader test (#1056)
    • IndexError of VirtualList (#1052)
    • Invalid XML in xmp information (#1051)
    • No pycryptodome (#1050)
    • Increase test coverage (#1045)

    Code Style (STY)

    • DOC of compress_content_streams (#1061)
    • Minimize diff for #879 (#1049)

    Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.4.1...2.4.2

  • 2.4.1(Jun 30, 2022)

    What's Changed

    New Features (ENH)

    • Add writer.pdf_header property (getter and setter) (#1038)

    Performance Improvements (PI)

    • Remove b_ call in FloatObject.write_to_stream (#1044)
    • Check duplicate objects in writer._sweep_indirect_references (#207)

    Documentation (DOC)

    • How to suppress exceptions/warnings/log messages (#1037)
    • Remove hyphen from lossless (#1041)
    • Compression of content streams (#1040)
    • Fix inconsistent variable names in add-watermark.md (#1039)
    • File size reduction
    • Add CHANGELOG to the rendered docs (#1023)

    Maintenance (MAINT)

    • Handle XML error when reading XmpInformation (#1030)
    • Deduplicate Code / add mutmut config (#1022)

    Code Style (STY)

    • Use unnecessary one-line function / class attribute (#1043)
    • Docstring formatting (#1033)

    New Contributors

    • @Hatell made their first contribution in https://github.com/py-pdf/PyPDF2/pull/207
    • @behzadfhm made their first contribution in https://github.com/py-pdf/PyPDF2/pull/1039

    Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.4.0...2.4.1

  • 2.4.0(Jun 26, 2022)

    What's Changed

    Thanks to @exiledkingcc, PyPDF2 now also supports R6 decryption 🎉 Thank you 🤗

    New Features (ENH)

    • Support R6 decrypting (#1015)
    • Add PdfReader.pdf_header (#1013)

    Performance Improvements (PI)

    • Remove ord_ calls (#1014)

    Bug Fixes (BUG)

    • Fix missing page for bookmark (#1016)

    Robustness (ROB)

    • Deal with invalid Destinations (#1028)

    Documentation (DOC)

    • get_form_text_fields does not extract dropdown data (#1029)
    • Adjust PdfWriter.add_uri docstring
    • Mention crypto extra_requires for installation (#1017)

    Developer Experience (DEV)

    • Use \n line endings everywhere (#1027)
    • Adjust string formatting to be able to use mutmut (#1020)
    • Update Bug report template

    Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.3.1...2.4.0

  • 2.3.1(Jun 19, 2022)

    What's Changed

    Bug Fixes (BUG)

    • Forgot to add the internal _codecs subpackage.

    Full Changelog: https://github.com/py-pdf/PyPDF2/compare/2.3.0...2.3.1

Simple pdf editor while preserving structure and format.

SIMPdf Simple pdf editor while preserving structure and format.

Shashwat Singh 242 Jan 04, 2023
A python library for extracting text from PDFs without losing the formatting of the PDF content.

Multilingual PDF to Text Install Package from Pypi Install it using pip. pip install multilingual-pdf2text The library uses Tesseract which can be ins

Shahrukh Khan 49 Nov 07, 2022
Telegram bot that can do a lot of things related to PDF files.

Telegram PDF Bot A Telegram bot that can: Compress, crop, decrypt, encrypt, merge, preview, rename, rotate, scale and split PDF files Compare text dif

130 Dec 26, 2022
Table automatically extraction from PDF Document

PDF Table Extractor Table automatically extraction from PDF Document Our Icon 📌 Name : PDF Table Extractor 📌 Authors : Minku Koo Jiyong Park 📌 Deve

1 Jan 10, 2022
PDFSanitizer - Renders possibly unsafe PDF files and outputs harmless PDF files

PDFSanitizer Renders possibly malicious PDF files and outputs harmless PDF files

9 Jan 30, 2022
pdf_sprinkles: sprinkles text in your PDFs

pdf_sprinkles: sprinkles text in your PDFs pdf_sprinkles remotely OCRs a PDF with Google Cloud Document AI, and returns the result as a PDF with searc

Will Angley 2 Dec 17, 2021
Generate a preview image for a PDF.

PDF ➡️ Preview A simple tool to save me time on Illustrator. Generates a preview image for a PDF file. Useful for sneak peeks to academic publications

David Chuan-En Lin 51 Sep 22, 2022
Python lib for Simple PDF text extraction

Jason Alan Palmer 651 Jan 01, 2023
Python script that splits PDF files.

Automatic PDF Splitter This script can create new single-page PDFs files from multipaged PDFs. Requirements Python 3.0+ # Debian distros sudo apt-get

Leandro Padula 5 Apr 02, 2022
Simple HTML and PDF document generator for Python - with built-in support for popular data analysis and plotting libraries.

Esparto is a simple HTML and PDF document generator for Python. Its primary use is for generating shareable single page reports with content from popular analytics and data science libraries.

Dom 76 Dec 12, 2022
Convert Lecture Videos to PDF

Convert Lecture Videos to PDF Description Want to go through lecture videos faster without missing any information? Wish you can read the lecture vide

Emilio Kartono 20 Nov 25, 2022
Converting Html files to pdf using python script, pdfkit module and wkhtmltopdf.

Html-to-pdf-pdfkit-wkhtml- This repository has code for converting local html files and online html resources into pdf. It is an python script which u

Hemachandran P 1 Nov 09, 2021
Produce pdf in python backend from simple bootstrap vue frontend and download to browser

vollmacht produce pdf in python backend from simple bootstrap vue frontend and download to browser Frontend in one file with bootstrap-vue (allthough

Otto 1 Nov 08, 2020
Convert PDF to AudioBook and Audio Speech to PDF

In this Python project, we will build a GUI-based PDF to Audio and Audio to PDF converter using the Tkinter, OS, path, pyttsx3, SpeechRecognition, PyPDF4, and Pydub libraries and the messagebox modul

RISHABH MISHRA 1 Feb 13, 2022
pikepdf is a Python library for reading and writing PDF files.

A Python library for reading and writing PDF, powered by qpdf

1.6k Jan 03, 2023
borb is a library for reading, creating and manipulating PDF files in python.

Joris Schellekens 2.9k Jan 01, 2023
A bulk pdf generator. This application can generate PDFs in bulk by using just one click.

A bulk html pdf generator. This application can generate PDFs in bulk by using just one click. Screenshots Requirements 🧱 Your system must have the f

Aman Nirala 3 Apr 23, 2022
Split given PDF document into 4 page groups and convert them to booklet format

PUTO: PDF to Booklet converter Split given PDF document into 4 page groups and convert them to booklet format. It creates a PDF like shown below: Fir

3 Mar 12, 2022
Excalibur: A web interface to extract tabular data from PDFs

Excalibur: A web interface to extract tabular data from PDFs Excalibur is a web interface to extract tabular data from PDFs, written in Python 3! It i

1.2k Jan 04, 2023