Scans PDFs for links written in plaintext and checks whether each one is active or returns an error code.

Overview

Introduction

linkrot scans PDFs for links written in plaintext and checks whether each one is active or returns an error code, then generates a report of its findings. It also extracts references (PDF, URL, DOI, arXiv) and metadata from a PDF.

Features

  • Extracts references and metadata from a given PDF.
  • Detects PDF, URL, arXiv and DOI references.
  • Checks for a valid SSL certificate.
  • Finds broken hyperlinks (using the -c flag).
  • Outputs text or JSON (using the -j flag).
  • Extracts the PDF text (using the --text flag).
  • Works as a command-line tool or as a Python package.
  • Works with local and online PDFs.

Installation

Install linkrot with pip:

pip install linkrot

Usage

linkrot can be used to extract info from a PDF in two ways:

  • As a command-line/terminal tool: linkrot
  • As a Python library: import linkrot

1. Command Line/Terminal tool

linkrot [pdf-file-or-url]

Run linkrot -h to see the help output:

linkrot -h

usage:

linkrot [-h] [-d OUTPUT_DIRECTORY] [-c] [-j] [-v] [-t] [-o OUTPUT_FILE] [--version] pdf

Extract metadata and references from a PDF, and optionally download all referenced PDFs.

Arguments

positional arguments:

pdf (Filename or URL of a PDF file)

optional arguments:

-h, --help                                              (Show this help message and exit)
-d OUTPUT_DIRECTORY, --download-pdfs OUTPUT_DIRECTORY   (Download all referenced PDFs into specified directory)
-c, --check-links                                       (Check for broken links)
-j, --json                                              (Output infos as JSON (instead of plain text))
-v, --verbose                                           (Print all references (instead of only PDFs))
-t, --text                                              (Only extract text (no metadata or references))
-o OUTPUT_FILE, --output-file OUTPUT_FILE               (Output to specified file instead of console)
--version                                               (Show program's version number and exit)
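
The option set above can be mirrored with Python's standard argparse module. The following is only an illustrative sketch of how such a parser might be declared, not the project's actual cli.py; the function name build_parser and the version string are assumptions, and the real tool may treat some flags differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild the documented options (hypothetical helper, not linkrot's cli.py)."""
    parser = argparse.ArgumentParser(
        prog="linkrot",
        description="Extract metadata and references from a PDF, "
        "and optionally download all referenced PDFs.",
    )
    parser.add_argument("pdf", help="Filename or URL of a PDF file")
    parser.add_argument("-d", "--download-pdfs", metavar="OUTPUT_DIRECTORY",
                        help="Download all referenced PDFs into specified directory")
    parser.add_argument("-c", "--check-links", action="store_true",
                        help="Check for broken links")
    parser.add_argument("-j", "--json", action="store_true",
                        help="Output infos as JSON (instead of plain text)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Print all references (instead of only PDFs)")
    parser.add_argument("-t", "--text", action="store_true",
                        help="Only extract text (no metadata or references)")
    parser.add_argument("-o", "--output-file", metavar="OUTPUT_FILE",
                        help="Output to specified file instead of console")
    # The version text is a placeholder, not the real release number.
    parser.add_argument("--version", action="version",
                        version="%(prog)s (illustrative sketch)")
    return parser


# Parse the same arguments as the "Extract text to file" example.
args = build_parser().parse_args(
    ["https://example.com/example.pdf", "-t", "-o", "pdf-text.txt"]
)
print(args.pdf, args.text, args.output_file)
```

Flags that take no value are modelled as booleans here, which matches how they are used in the examples in this document.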

Examples

Extract text to console

linkrot https://example.com/example.pdf -t

Extract text to file

linkrot https://example.com/example.pdf -t -o pdf-text.txt

Check Links

linkrot https://example.com/example.pdf -c

2. Main Python Library

Import the library:

import linkrot

Create an instance of the linkrot class like so:

pdf = linkrot.linkrot("filename-or-url.pdf") #pdf is the instance of the linkrot class

The following methods can then be used to extract specific data from the PDF:

get_metadata()

Arguments: None

Usage:

metadata = pdf.get_metadata() #pdf is the instance of the linkrot class

Return type: Dictionary

Information Provided: All metadata associated with the PDF, including hidden metadata, such as creation date, creator and title.
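
Because get_metadata() returns a plain dictionary, it can be handled with ordinary dictionary code. The sketch below formats a few fields for display; the helper name format_metadata and the key names "Title", "Creator" and "CreationDate" are assumptions, since the keys actually present depend on the PDF.

```python
def format_metadata(metadata: dict,
                    keys=("Title", "Creator", "CreationDate")) -> str:
    """Format selected metadata fields, one per line (hypothetical helper)."""
    lines = []
    for key in keys:
        # Not every PDF carries every field, so fall back to a placeholder.
        value = metadata.get(key, "(not present)")
        lines.append(f"{key}: {value}")
    return "\n".join(lines)


# Stand-in for the dictionary returned by pdf.get_metadata().
sample_metadata = {"Title": "Example Paper", "Creator": "LaTeX"}
print(format_metadata(sample_metadata))
```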

get_text()

Arguments: None

Usage:

text = pdf.get_text() #pdf is the instance of the linkrot class

Return type: String

Information Provided: The entire content of the PDF in string form.

get_references(reftype=None, sort=False)

Arguments:

reftype: The type of reference that is needed 
	 values: 'pdf', 'url', 'doi', 'arxiv'. 
	 default: Provides all reference types.

sort: Whether reference should be sorted or not
      values: True or False. 
      default: Is not sorted.

Usage:

references_list = pdf.get_references() #pdf is the instance of the linkrot class

Return type: Set of linkrot.backends.Reference objects

Each linkrot.backends.Reference object has 3 member variables:
- ref: the actual URL/PDF/DOI/arXiv reference
- reftype: type of reference
- page: page on which it was referenced

Information Provided: All references with their corresponding type and page number.
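
Since each reference exposes ref, reftype and page, the returned set can be regrouped freely. The sketch below groups references by page; FakeReference is a stand-in defined only so the example runs without a PDF, and group_by_page is a hypothetical helper rather than part of linkrot.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeReference:
    """Stand-in with the three documented attributes; not linkrot.backends.Reference."""
    ref: str
    reftype: str
    page: int


def group_by_page(references) -> dict:
    """Map each page number to a sorted list of the references found on it."""
    pages = defaultdict(list)
    for reference in references:
        pages[reference.page].append(reference.ref)
    return {page: sorted(refs) for page, refs in sorted(pages.items())}


# Stand-in for the set returned by pdf.get_references().
sample_refs = {
    FakeReference("https://b.example", "url", 2),
    FakeReference("https://a.example", "url", 2),
    FakeReference("10.1000/xyz", "doi", 1),
}
print(group_by_page(sample_refs))
```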

get_references_as_dict(reftype=None, sort=False)

Arguments:

reftype: The type of reference that is needed 
	 values: 'pdf', 'url', 'doi', 'arxiv'. 
	 default: Provides all reference types.

sort: Whether reference should be sorted or not
      values: True or False. 
      default: Is not sorted.

Usage:

references_dict = pdf.get_references_as_dict() #pdf is the instance of the linkrot class

Return type: Dictionary with keys 'pdf', 'url', 'doi', 'arxiv' that each have a list of refs of that type.

Information Provided: All references in their corresponding type list.
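
A dictionary keyed by reference type lends itself to quick summaries. As a small sketch, the hypothetical helper count_by_type below counts the references of each type; the sample dictionary only imitates the documented shape and is not real output.

```python
import json


def count_by_type(references_dict: dict) -> dict:
    """Count how many references of each type the dictionary holds."""
    return {reftype: len(refs) for reftype, refs in references_dict.items()}


# Stand-in for the dictionary returned by pdf.get_references_as_dict().
sample_dict = {
    "pdf": ["https://example.com/a.pdf"],
    "url": ["https://example.com", "https://example.org"],
    "doi": [],
    "arxiv": [],
}
print(json.dumps(count_by_type(sample_dict), sort_keys=True))
```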

download_pdfs(target_dir)

Arguments:

target_dir: The path of the directory to which the reference pdfs should be downloaded 

Usage:

pdf.download_pdfs("target-directory") #pdf is the instance of the linkrot class

Return type: None

Information Provided: Downloads all the reference pdfs to specified directory.

3. Linkrot downloader functions

Import:

from linkrot.downloader import sanitize_url, get_status_code, check_refs

sanitize_url(url)

Arguments:

url: The url to be sanitized.

Usage:

new_url = sanitize_url(old_url) 

Return type: String

Information Provided: The URL, prefixed with 'http://' if it did not already have that prefix and ensured to be in UTF-8 format.
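
The behaviour described above can be approximated in a few lines. This is a minimal sketch under stated assumptions, not the library's implementation: sketch_sanitize_url is a hypothetical name, and testing for "://" to decide whether a scheme is present is a simplification chosen for illustration.

```python
def sketch_sanitize_url(url) -> str:
    """Approximate the documented behaviour (not linkrot's own code)."""
    if isinstance(url, bytes):
        # Decode byte strings so the result is always UTF-8 decodable text.
        url = url.decode("utf-8", errors="replace")
    if "://" not in url:
        # No scheme present, so assume plain HTTP as the description states.
        url = "http://" + url
    return url


print(sketch_sanitize_url("example.com/paper.pdf"))
```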

get_status_code(url)

Arguments:

url: The url to be checked for its status. 

Usage:

status_code = get_status_code(url) 

Return type: String

Information Provided: Whether the URL is active or broken.
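
To show what such a check involves, here is a sketch built on the standard urllib module. It is an assumption about how a status check could be written, not linkrot's code: the name sketch_status_code, the injectable opener parameter, the 10-second timeout and the "error: ..." string format are all hypothetical.

```python
import urllib.error
import urllib.request


def sketch_status_code(url: str, opener=urllib.request.urlopen,
                       timeout: float = 10.0) -> str:
    """Return the HTTP status as a string, or a short error description."""
    try:
        with opener(url, timeout=timeout) as response:
            return str(response.status)
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 404.
        return str(error.code)
    except urllib.error.URLError as error:
        # No usable answer at all: DNS failure, refused connection, bad certificate.
        return f"error: {error.reason}"
```

HTTPError is caught before URLError because it is a subclass of it. The opener parameter exists only so that the function can be exercised without network access.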

check_refs(refs, verbose=True, max_threads=MAX_THREADS_DEFAULT)

Arguments:

refs: set of linkrot.backends.Reference objects
verbose: whether it should print every reference with its code or just the summary of the link checker
max_threads: number of threads for multithreading

Usage:

check_refs(pdf.get_references()) #pdf is the instance of the linkrot class

Return type: None

Information Provided: Prints references with their status codes and a summary of all the broken/active links in the terminal.
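
The max_threads argument suggests that links are checked concurrently. A rough sketch of that pattern with the standard library follows; it is not check_refs itself. The helper names, the default of 5 threads and the rule that only "200" counts as active are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def sketch_check_all(urls, check, max_threads: int = 5) -> dict:
    """Run check(url) for every URL on a thread pool and map URL to result."""
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # pool.map preserves input order, so zip pairs each URL with its own result.
        codes = list(pool.map(check, url_list))
    return dict(zip(url_list, codes))


def summarize(results: dict) -> dict:
    """Split results into active ("200") and broken (anything else)."""
    active = sorted(url for url, code in results.items() if code == "200")
    broken = sorted(url for url, code in results.items() if code != "200")
    return {"active": active, "broken": broken}


# A dictionary lookup stands in for a real status check.
fake_codes = {"http://a.example": "200", "http://b.example": "404"}
print(summarize(sketch_check_all(fake_codes, fake_codes.get, max_threads=2)))
```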

4. Linkrot extractor functions

Import:

from linkrot.extractor import extract_urls, extract_doi, extract_arxiv

Get pdf text:

text = pdf.get_text() #pdf is the instance of the linkrot class

extract_urls(text)

Arguments:

text: String of text to extract urls from

Usage:

urls = extract_urls(text)

Return type: Set of URLs

Information Provided: All URLs in the text
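
For intuition, URL extraction from plain text is usually a regular-expression search. The pattern below is a deliberately simplified assumption that only matches http and https URLs; it is not the expression linkrot uses, and sketch_extract_urls is a hypothetical name.

```python
import re

# Simplified pattern: a scheme followed by anything that is not
# whitespace, a quote or an angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def sketch_extract_urls(text: str) -> set:
    """Return the set of http(s) URLs found in text."""
    # Strip punctuation that commonly trails a URL at the end of a clause.
    return {match.rstrip(".,;:)") for match in URL_PATTERN.findall(text)}


sample_text = "See https://example.com/a.pdf, and http://example.org."
print(sorted(sketch_extract_urls(sample_text)))
```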

extract_arxiv(text)

Arguments:

text: String of text to extract arXiv references from

Usage:

arxiv = extract_arxiv(text)

Return type: Set of arXiv references

Information Provided: All arXiv references in the text
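
As an illustration of what an arXiv reference looks like in text, the sketch below matches identifiers written in the common "arXiv:YYMM.NNNNN" form. The pattern is an assumption that covers only this modern identifier style and is not taken from linkrot; sketch_extract_arxiv is a hypothetical name.

```python
import re

# Matches e.g. "arXiv:1706.03762" or "arxiv: 2001.08361v2" (new-style IDs only).
ARXIV_PATTERN = re.compile(r"arXiv:\s*(\d{4}\.\d{4,5}(?:v\d+)?)", re.IGNORECASE)


def sketch_extract_arxiv(text: str) -> set:
    """Return the set of arXiv identifiers found in text."""
    return set(ARXIV_PATTERN.findall(text))


sample_arxiv_text = "As in arXiv:1706.03762v5 and arxiv: 2001.08361."
print(sorted(sketch_extract_arxiv(sample_arxiv_text)))
```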

extract_doi(text)

Arguments:

text: String of text to extract DOIs from

Usage:

doi = extract_doi(text)

Return type: Set of DOIs

Information Provided: All DOIs in the text
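
DOIs follow a recognisable shape: the prefix "10.", a registrant code, a slash and a suffix. The sketch below relies on that shape; the pattern is a simplified assumption rather than linkrot's own expression, and sketch_extract_doi is a hypothetical name.

```python
import re

# "10." + 4 to 9 digits + "/" + a suffix without whitespace, quotes or brackets.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def sketch_extract_doi(text: str) -> set:
    """Return the set of DOI strings found in text."""
    # Drop punctuation that trails the DOI at the end of a sentence or clause.
    return {match.rstrip(".,;") for match in DOI_PATTERN.findall(text)}


sample_doi_text = "doi:10.1000/xyz123. Also https://doi.org/10.1145/3292500.3330701,"
print(sorted(sketch_extract_doi(sample_doi_text)))
```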

Code of Conduct

To view our code of conduct please visit our Code of Conduct page.

License

This program is licensed with an MIT License.

Comments
  • xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 55, column 10

    Receive this error when I run the file. Traceback below. File Attached.

    > Traceback (most recent call last):
    >   File "c:\python38\lib\runpy.py", line 193, in _run_module_as_main
    >     return _run_code(code, main_globals, None,
    >   File "c:\python38\lib\runpy.py", line 86, in _run_code
    >     exec(code, run_globals)
    >   File "C:\Python38\Scripts\linkrot.exe\__main__.py", line 7, in <module>
    >   File "c:\python38\lib\site-packages\linkrot\cli.py", line 182, in main
    >     pdf = linkrot.linkrot(args.pdf)
    >   File "c:\python38\lib\site-packages\linkrot\__init__.py", line 131, in __init__
    >     self.reader = PDFMinerBackend(self.stream)
    >   File "c:\python38\lib\site-packages\linkrot\backends.py", line 213, in __init__
    >     self.metadata.update(xmp_to_dict(metadata))
    >   File "c:\python38\lib\site-packages\linkrot\libs\xmp.py", line 92, in xmp_to_dict
    >     return XmpParser(xmp).meta
    >   File "c:\python38\lib\site-packages\linkrot\libs\xmp.py", line 41, in __init__
    >     self.tree = ET.XML(xmp)
    >   File "c:\python38\lib\xml\etree\ElementTree.py", line 1320, in XML
    >     parser.feed(text)
    > xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 55, column 10

    ah-5.pdf

    bug help wanted good first issue hacktoberfest python 
    opened by marshalmiller 11
  • Remove Python 2 checks and functionality.

    Keeping support for Python 2 might be slowing down some of the process. Of more concern is that, in order to patch vulnerabilities in some libraries that Python 2 depends on, we have had to cut support for some versions of Python 3, specifically 3.6 and 3.7. Python 3.7 is still fairly widely used, and I think I'd prefer to remove Python 2 support and bring back 3.7, even though it's clearly a bigger task.

    enhancement help wanted good first issue dependencies python 
    opened by marshalmiller 10
  • Move from `requirements.txt`, `requirements_dev.txt`, `setup.cfg`, and `setup.py` to `pyproject.toml`.

    Is your feature request related to a problem? Please describe. Hey @marshalmiller. As you may already know, the use of setup.cfg, setup.py, and requirements.txt files is quite outdated. Because of PEP 517, PEP 660, and PEP 631, the packaging is now being standardized on the usage of the pyproject.toml file.

    Describe the solution you'd like Given the above info, the project packaging should add support for pyproject.toml.

    Describe alternatives you've considered Not available.

    Additional context That's pretty much it. What do you think? Also, I would like to work on this issue.

    enhancement hacktoberfest python 
    opened by wiseaidev 7
  • (Bug) AttributeError: 'NoneType' object has no attribute 'findall'

    Describe the bug Certain PDFs give Attribute Error

    To Reproduce Steps to reproduce the behavior:

    1. Download Research_Ethics.pdf
    2. Open terminal and run:
    linkrot <path_to_above_file>
    

    Expected behavior It should generate the expected linkrot report.

    Screenshots Screenshot from 2021-10-12 23-37-47

    bug help wanted hacktoberfest 
    opened by aditirao7 7
  • Add Link Archiving

    I'd like to add a feature that takes all links verified to be active and adds them to the Internet Archive Wayback Machine to preserve them in time. There is a draft Python script in lib called archive.py. The idea is that when you navigate to https://web.archive.org/save/{url}, the service automatically archives that page. So after verifying that a link returns a valid code, we would just connect to each of those URLs and it would create a snapshot. I'd love for this to be an optional argument like -a or something, so that we don't take more resources than we need. Anyone able to complete this task, please take a stab at it.

    enhancement help wanted good first issue hacktoberfest python 
    opened by marshalmiller 6
  • UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 0: character maps to <undefined>.

    Receiving this error when running the file. Traceback Below. File Attached.

    > Traceback (most recent call last):
    >   File "c:\python38\lib\runpy.py", line 193, in _run_module_as_main
    >     return _run_code(code, main_globals, None,
    >   File "c:\python38\lib\runpy.py", line 86, in _run_code
    >     exec(code, run_globals)
    >   File "C:\Python38\Scripts\linkrot.exe\__main__.py", line 7, in <module>
    >   File "c:\python38\lib\site-packages\linkrot\cli.py", line 182, in main
    >     pdf = linkrot.linkrot(args.pdf)
    >   File "c:\python38\lib\site-packages\linkrot\__init__.py", line 131, in __init__
    >     self.reader = PDFMinerBackend(self.stream)
    >   File "c:\python38\lib\site-packages\linkrot\backends.py", line 204, in __init__
    >     self.metadata[k] = make_compat_str(v)
    >   File "c:\python38\lib\site-packages\linkrot\backends.py", line 67, in make_compat_str
    >     out_str = in_str.decode(enc["encoding"])
    >   File "c:\python38\lib\encodings\cp1254.py", line 15, in decode
    >     return codecs.charmap_decode(input,errors,decoding_table)
    > UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 0: character maps to <undefined>
    

    ah-1.pdf

    bug help wanted hacktoberfest python 
    opened by marshalmiller 5
  • (Update) documentation for python library usage

    The main documentation needs to be updated to include the usage of linkrot as a python library as well. Some of it can be found in the docstrings of this file.

    enhancement 
    opened by aditirao7 5
  • Separate code from data

    Is your feature request related to a problem? Please describe.

    The current size of the repo is too big because of pdf data samples:

    ➜  du -sh * | sort -h
    4.0K	CONTRIBUTING.md
    4.0K	LICENSE
    4.0K	Makefile
    4.0K	pyproject.toml
    4.0K	SECURITY.md
    8.0K	code_of_conduct.md
    8.0K	README.md
    44K	branding
    68K	linkrot
    1.7M	tests
    919M	Random PDF Samples
    

    Describe the solution you'd like I suggest either storing the pdf files in a separate repo or on a cloud provider's bucket.

    Describe alternatives you've considered Not available.

    Additional context That's pretty much it. I am currently working on this issue.

    documentation enhancement hacktoberfest 
    opened by wiseaidev 4
  • Add Link Check Results to CLI Output

    Right now, if you use the -o argument to export the results to a text file, the document metadata and the list of links are the only components listed. I would like to add the results of the link check to this output as well.

    enhancement help wanted good first issue hacktoberfest python hacktoberfest-accepted 
    opened by marshalmiller 4
  • Displays Page Number Wrong in Results

    When it returns the results of links that it tests, it gives a list of the links, along with a page number. The page number would appear to be the page the link was found on but it is actually just the total number of pages in the PDF. It would be extremely helpful if we could get it to display the correct page number.

    bug enhancement help wanted hacktoberfest python hacktoberfest-accepted 
    opened by marshalmiller 4
  • Update Tests

    The tests written for this repo were developed during the very early stages of this project. I don't think they are a great representation of where the project is now. I'd love to have them updated to be more rigorous and keep the quality of the project high.

    enhancement help wanted good first issue hacktoberfest python 
    opened by marshalmiller 2
  • Update ReadMe to Include Changes from Hacktoberfest.

    We have had a lot of great improvements already during Hacktoberfest. I will update the ReadMe with all the changes once the event is over, if not before.

    documentation enhancement hacktoberfest 
    opened by marshalmiller 3
  • Consider Replacing Threadpool with Redis

    Given the performance and timeout issues with the flask app, I am wondering if I should be replacing the current thread pool with a Redis model, as suggested by other forums and Heroku.

    https://python-rq.org/

    enhancement help wanted dependencies hacktoberfest python 
    opened by marshalmiller 2
Releases(3.9.5)
  • 3.9.5(Oct 3, 2022)

    What's Changed

    • Add test cases for detecting embedded URLs by @marwansalem in https://github.com/marshalmiller/linkrot/pull/161
    • rm Random PDF Samples by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/163
    • updated .gitignore, added mega.py, rm pdfs, cleanups by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/164
    • cleanup python 2 syntax by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/165

    Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.9.4...3.9.5

  • 3.9.4(Oct 2, 2022)

    What's Changed

    • Migrating from setup.py to pyproject.toml by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/149
    • Upgrade to PyProject by @marshalmiller in https://github.com/marshalmiller/linkrot/pull/156
    • add missing dependencies by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/158
    • add missing cli entry point by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/157
    • handle UnicodeDecode exception by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/159

    Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.9.3...3.9.4

  • 3.9.3(Oct 2, 2022)

    What's Changed

    • Resolved Add Link Archiving #102 by @mailtodanish in https://github.com/marshalmiller/linkrot/pull/150
    • add etree xml_parser to ignore invalid tags by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/155

    Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.9.2...3.9.3

  • 3.9.2(Oct 1, 2022)

    What's Changed

    • Fix the page number error, in the link checker by @ajratnam in https://github.com/marshalmiller/linkrot/pull/147
    • Add Link Check Results to CLI Output #120 by @mailtodanish in https://github.com/marshalmiller/linkrot/pull/145

    Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.9.1...3.9.2

  • 3.9.1(Oct 1, 2022)

    What's Changed

    • Bump mypy from 0.971 to 0.981 by @dependabot in https://github.com/marshalmiller/linkrot/pull/142
    • Bump coverage from 6.4.4 to 6.5.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/143
    • Resolved Add DOIs to References Summary #128 by @mailtodanish in https://github.com/marshalmiller/linkrot/pull/144
    • Remove numpy import by @ajratnam in https://github.com/marshalmiller/linkrot/pull/146

    New Contributors

    • @mailtodanish made their first contribution in https://github.com/marshalmiller/linkrot/pull/144
    • @ajratnam made their first contribution in https://github.com/marshalmiller/linkrot/pull/146

    Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.9...3.9.1

  • 3.9(Sep 25, 2022)

    What's Changed

    • Bump flake8 from 5.0.3 to 5.0.4 by @dependabot in https://github.com/marshalmiller/linkrot/pull/131
    • Bump coverage from 6.4.2 to 6.4.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/132
    • Bump numpy from 1.23.1 to 1.23.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/133
    • Bump coverage from 6.4.3 to 6.4.4 by @dependabot in https://github.com/marshalmiller/linkrot/pull/134
    • Bump pylint from 2.14.5 to 2.15.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/135
    • Bump black from 22.6.0 to 22.8.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/136
    • Bump pytest from 7.1.2 to 7.1.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/137
    • Bump pylint from 2.15.0 to 2.15.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/138
    • Bump numpy from 1.23.2 to 1.23.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/139
    • Bump pylint from 2.15.2 to 2.15.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/141
    • Resolve issue130 by @westofwest in https://github.com/marshalmiller/linkrot/pull/140

    New Contributors

    • @westofwest made their first contribution in https://github.com/marshalmiller/linkrot/pull/140

    Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.8.8...3.9

  • 3.8.8(Aug 2, 2022)

  • 3.8.5(Aug 2, 2022)

    What's Changed

    • Bump flake8 from 5.0.1 to 5.0.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/129

    Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.8.4...3.8.5

  • 3.5(Jun 1, 2022)

    What's Changed

    • Bump mypy from 0.910 to 0.920 by @dependabot in https://github.com/marshalmiller/linkrot/pull/71
    • Bump mypy from 0.920 to 0.930 by @dependabot in https://github.com/marshalmiller/linkrot/pull/73
    • Bump mypy from 0.930 to 0.931 by @dependabot in https://github.com/marshalmiller/linkrot/pull/75
    • Bump mccabe from 0.6.1 to 0.7.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/76
    • Bump coverage from 6.2 to 6.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/77
    • Bump black from 21.12b0 to 22.1.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/78
    • Bump coverage from 6.3 to 6.3.1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/79
    • Bump pytest from 6.2.5 to 7.0.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/80
    • Bump pytest from 7.0.0 to 7.0.1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/81
    • Bump coverage from 6.3.1 to 6.3.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/82
    • Bump pytest from 7.0.1 to 7.1.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/84
    • Bump mypy from 0.931 to 0.940 by @dependabot in https://github.com/marshalmiller/linkrot/pull/83
    • Bump mypy from 0.940 to 0.941 by @dependabot in https://github.com/marshalmiller/linkrot/pull/85
    • Bump pytest from 7.1.0 to 7.1.1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/86
    • Bump pdfminer-six from 20211012 to 20220319 by @dependabot in https://github.com/marshalmiller/linkrot/pull/87
    • Bump mypy from 0.941 to 0.942 by @dependabot in https://github.com/marshalmiller/linkrot/pull/88
    • Bump pylint from 2.12.2 to 2.13.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/89
    • Bump pylint from 2.13.0 to 2.13.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/90
    • Bump black from 22.1.0 to 22.3.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/91
    • Bump pylint from 2.13.2 to 2.13.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/92
    • Bump pylint from 2.13.3 to 2.13.4 by @dependabot in https://github.com/marshalmiller/linkrot/pull/93
    • Bump pylint from 2.13.4 to 2.13.5 by @dependabot in https://github.com/marshalmiller/linkrot/pull/94
    • Bump pylint from 2.13.5 to 2.13.7 by @dependabot in https://github.com/marshalmiller/linkrot/pull/95
    • Bump pytest from 7.1.1 to 7.1.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/96
    • Bump mypy from 0.942 to 0.950 by @dependabot in https://github.com/marshalmiller/linkrot/pull/97
    • Bump pylint from 2.13.7 to 2.13.8 by @dependabot in https://github.com/marshalmiller/linkrot/pull/98
    • Bump pdfminer-six from 20220319 to 20220506 by @dependabot in https://github.com/marshalmiller/linkrot/pull/99
    • Bump coverage from 6.3.2 to 6.3.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/100
    • Bump pylint from 2.13.8 to 2.13.9 by @dependabot in https://github.com/marshalmiller/linkrot/pull/101
    • Bump coverage from 6.3.3 to 6.4 by @dependabot in https://github.com/marshalmiller/linkrot/pull/103
    • Bump pdfminer-six from 20220506 to 20220524 by @dependabot in https://github.com/marshalmiller/linkrot/pull/104
    • Bump mypy from 0.950 to 0.960 by @dependabot in https://github.com/marshalmiller/linkrot/pull/105
    • A fix for: Exclude Email Addresses #106 by @marwansalem in https://github.com/marshalmiller/linkrot/pull/107

    New Contributors

    • @marwansalem made their first contribution in https://github.com/marshalmiller/linkrot/pull/107

    Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.4...3.5

  • 3.4(Dec 11, 2021)

    What's Changed

    • Added documentation for library by @aditirao7 in https://github.com/marshalmiller/linkrot/pull/41
    • fix(downloader.py): change string comparison to use regex by @sousatg in https://github.com/marshalmiller/linkrot/pull/42
    • Bump flake8 from 4.0.0 to 4.0.1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/43
    • Bump coverage from 6.0.1 to 6.0.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/44
    • Bump pdfminer-six from 20201018 to 20211012 by @dependabot in https://github.com/marshalmiller/linkrot/pull/46
    • Bring up to date by @marshalmiller in https://github.com/marshalmiller/linkrot/pull/47
    • Replace pagenos with a safe default value by @alanyee in https://github.com/marshalmiller/linkrot/pull/48
    • Staging to Main 10-17-2021 by @marshalmiller in https://github.com/marshalmiller/linkrot/pull/49
    • Start testing for Python 3.10 by @alanyee in https://github.com/marshalmiller/linkrot/pull/50
    • Checking the rdftree before parsing the metadata #45 by @rosdyana in https://github.com/marshalmiller/linkrot/pull/51
    • Staging by @marshalmiller in https://github.com/marshalmiller/linkrot/pull/52
    • Bump black from 21.9b0 to 21.10b0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/55
    • Bump coverage from 6.0.2 to 6.1.1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/54
    • Add comments to colorprint.py by @vacom13 in https://github.com/marshalmiller/linkrot/pull/56
    • Bump coverage from 6.1.1 to 6.1.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/57
    • Bump black from 21.10b0 to 21.11b0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/58
    • Add Comments to cli.py by @vacom13 in https://github.com/marshalmiller/linkrot/pull/60
    • Bump black from 21.11b0 to 21.11b1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/59
    • Bump pylint from 2.11.1 to 2.12.1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/61
    • Bump coverage from 6.1.2 to 6.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/63
    • Bump black from 21.11b1 to 21.12b0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/67
    • Bump pylint from 2.12.1 to 2.12.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/66

    New Contributors

    • @sousatg made their first contribution in https://github.com/marshalmiller/linkrot/pull/42
    • @alanyee made their first contribution in https://github.com/marshalmiller/linkrot/pull/48
    • @rosdyana made their first contribution in https://github.com/marshalmiller/linkrot/pull/51
    • @vacom13 made their first contribution in https://github.com/marshalmiller/linkrot/pull/56

    Full Changelog: https://github.com/marshalmiller/linkrot/compare/2.1.1...3.4

  • 2.3(Oct 24, 2021)

    What's Changed

    • Added documentation for library by @aditirao7 in https://github.com/marshalmiller/linkrot/pull/41
    • fix(downloader.py): change string comparison to use regex by @sousatg in https://github.com/marshalmiller/linkrot/pull/42
    • Bump flake8 from 4.0.0 to 4.0.1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/43
    • Bump coverage from 6.0.1 to 6.0.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/44
    • Bump pdfminer-six from 20201018 to 20211012 by @dependabot in https://github.com/marshalmiller/linkrot/pull/46
    • Bring up to date by @marshalmiller in https://github.com/marshalmiller/linkrot/pull/47
    • Replace pagenos with a safe default value by @alanyee in https://github.com/marshalmiller/linkrot/pull/48
    • Staging to Main 10-17-2021 by @marshalmiller in https://github.com/marshalmiller/linkrot/pull/49
    • Start testing for Python 3.10 by @alanyee in https://github.com/marshalmiller/linkrot/pull/50
    • Checking the rdftree before parsing the metadata #45 by @rosdyana in https://github.com/marshalmiller/linkrot/pull/51
    • Staging by @marshalmiller in https://github.com/marshalmiller/linkrot/pull/52

    New Contributors

    • @sousatg made their first contribution in https://github.com/marshalmiller/linkrot/pull/42
    • @alanyee made their first contribution in https://github.com/marshalmiller/linkrot/pull/48
    • @rosdyana made their first contribution in https://github.com/marshalmiller/linkrot/pull/51

    Full Changelog: https://github.com/marshalmiller/linkrot/compare/2.1.1...2.3

Owner
Marshal Miller