This project (version 0.1.0a3) is a Python port of ebook-tools which is written in Shell by na--. The Python script ebooktools.py is a collection of tools for automated organization and management of large ebook collections.
See also my other project, search-ebooks, which is based on pyebooktools and searches through the content and metadata of ebooks.
- For the moment, the ebooktools.py script is only tested on macOS. It will also be tested on Linux.
- More to come! Check the Roadmap to know what is coming soon.
About
The ebooktools.py script is a Python port of the shell scripts from ebook-tools and makes use of the following modules:
edit_config.py
edits a configuration file, which can be either the main config file that contains all the options defined below or the logging config file whose default values are defined in default_logging.py. The edit subcommand from the ebooktools.py script uses this module.
convert_to_txt.py
converts the supplied file to a text file. It can optionally also use OCR for .pdf, .djvu and image files. The convert subcommand from the ebooktools.py script uses this module.
find_isbns.py
tries to find valid ISBNs inside a file or in a string if no file was specified. Searching for ISBNs in files uses progressively more resource-intensive methods until some ISBNs are found. For more details, see
- the documentation for ebook-tools (shell scripts) or
- search_file_for_isbns() from lib.py (Python function where the ISBN search in files is implemented).
The find subcommand from the ebooktools.py script uses this module.
organize_ebooks.py
is used to automatically organize folders with potentially huge amounts of unorganized ebooks. This is done by renaming the files with proper names and moving them to other folders:
- By default it searches the supplied ebook files for ISBNs, downloads the book metadata (author, title, series, publication date, etc.) from online sources like Goodreads, Amazon and Google Books and renames the files according to a specified template.
- If no ISBN is found, the script can optionally search for the ebooks online by their title and author, which are extracted from the filename or file metadata.
- Optionally an additional file that contains all the gathered ebook metadata can be saved together with the renamed book so it can later be used for additional verification, indexing or processing.
- Most ebook types are supported: .epub, .mobi, .azw, .pdf, .djvu, .chm, .cbr, .cbz, .txt, .lit, .rtf, .doc, .docx, .pdb, .html, .fb2, .lrf, .odt, .prc and potentially others. Even compressed ebooks in arbitrary archive files are supported. For example, a .zip, .rar or other archive file that contains the .pdf or .html chapters of an ebook can be organized without a problem.
- Optical character recognition (OCR) can be automatically used for .pdf, .djvu and image files when no ISBNs were found in them by the fast and straightforward conversion to .txt. This is very useful for scanned ebooks that only contain images or were badly OCR-ed in the first place.
- Files are checked for corruption (zero-filled files, broken pdfs, corrupt archives, etc.) and corrupt files can optionally be moved to another folder.
- Non-ebook documents, pamphlets and pamphlet-like documents like saved webpages, short pdfs, etc. can also be detected and optionally moved to another folder.
Ref.: [ORG]
The organize subcommand from the ebooktools.py script uses this module.
rename_calibre_library.py
traverses a calibre library folder and renames all the book files in it by reading their metadata from calibre's metadata.opf files. Then the book files are either moved or symlinked to an output folder along with their corresponding metadata files. The rename subcommand from the ebooktools.py script uses this module.
split_into_folders.py
splits the supplied ebook files (and the accompanying metadata files if present) into folders with consecutive names that each contain the specified number of files. The split subcommand from the ebooktools.py script uses this module.
Thus, you have access to various subcommands from within the ebooktools.py script.
- ebook-tools is the original Shell project I ported to Python. I used the same names for the script options (short and long versions) so that if you have used the shell scripts, you will easily know how to run the corresponding subcommand with the given options.
- ebooktools.py is the name of the Python script, which will always be referred to that way in this document (i.e. no hyphen and ending with .py) to distinguish it from the original Shell project ebook-tools.
- pyebooktools is the name of the Python package that you need to install to have access to the ebooktools.py script.
Installation and dependencies
To install the script ebooktools.py, follow these steps:
Python dependencies
- Platforms: macOS [soon Linux]
- Python: >= 3.6
- lxml >= 4.4 for parsing calibre's metadata.opf files.
When installing the pyebooktools package below, the lxml library is automatically installed if it is not found, or upgraded to the correct supported version.
Other dependencies
As explained in the documentation for ebook-tools, you need recent versions of:
calibre for fetching metadata from online sources, conversion to txt (for ISBN searching) and ebook metadata extraction. Versions 2.84 and above are preferred because of their ability to manually specify from which specific online source we want to fetch metadata. For earlier versions you have to set isbn_metadata_fetch_order and organize_without_isbn_sources to empty strings.
p7zip for ISBN searching in ebooks that are in archives.
Tesseract for running OCR on books - version 4 gives better results even though it's still in alpha. OCR is disabled by default and another engine can be configured if preferred.
Optionally poppler, catdoc and DjVuLibre can be installed for faster conversion than calibre's of .pdf, .doc and .djvu files, respectively, to .txt.
⚠️ On macOS, you don't need catdoc since it has the built-in textutil command-line tool that converts any txt, html, rtf, rtfd, doc, docx, wordml, odt, or webarchive file.
Optionally the Goodreads and WorldCat xISBN calibre plugins can be installed for better metadata fetching.
If you only install calibre among these dependencies, you can still have a functioning program that will organize and manage your ebook collections:
- fetching metadata from online sources will work: by default calibre comes with Amazon and Google sources among others
- conversion to txt will work: calibre's own ebook-convert tool will be used
All subcommands should work but accuracy and performance will be affected as explained in the list of dependencies above.
Install pyebooktools
To install the pyebooktools package:
It is highly recommended to install the pyebooktools package in a virtual environment using, for example, venv or conda.
Make sure to update pip:
$ pip install --upgrade pip
Install the pyebooktools package (bleeding-edge version) with pip:
$ pip install git+https://github.com/raul23/pyebooktools#egg=pyebooktools
Make sure that pip is working with the correct Python version. It might be the case that pip is using Python 2.x. You can find what Python version pip uses with the following:
$ pip -V
If pip is working with the wrong Python version, then try to use pip3, which works with Python 3.x.
Test installation
Test your installation by importing pyebooktools and printing its version:
$ python -c "import pyebooktools; print(pyebooktools.__version__)"
You can also test that you have access to the ebooktools.py script by showing the program's version:
$ ebooktools --version
Usage, options and configuration
All of the options documented below can be passed to the ebooktools.py script either via command-line arguments or via the configuration file config.py, which is created along with the logging config file logging.py when the ebooktools.py script is run for the first time with any of the subcommands defined below. The default values for these config files are taken from default_config.py and default_logging.py, respectively.
In order to use the parameters found in the configuration file config.py, use the --use-config flag. With this flag, you don't need to type a long command line in the terminal. See the edit subcommand to know how to edit this configuration file.
Most arguments are not required and if nothing is specified, the default values defined in the default config file default_config.py will be used.
The ebooktools.py script consists of various subcommands for the organization and management of ebook collections. The usage pattern for running one of the subcommands is as follows:
ebooktools {edit,convert,find,organize,rename,split} [OPTIONS]
where [OPTIONS] includes general options (as defined in the General options section) and options specific to the subcommand (as defined in the Script usage, subcommands and options section).
In order to avoid data loss, use the --dry-run or --symlink-only option when running some of the subcommands (e.g. rename and split) to make sure that they would do what you expect them to do, as explained in the Security and safety section.
General options
Most of these options are part of the common library lib.py and may affect some or all of the subcommands.
General control flags
-h, --help; no config variable; default value False
Show the help message and exit.
-v, --version; no config variable; default value False
Show program's version number and exit.
-q, --quiet; config variable quiet; default value False
Enable quiet mode, i.e. nothing will be printed.
--verbose; config variable verbose; default value False
Print various debugging information, e.g. print traceback when there is an exception.
-u, --use-config; no config variable; default value False
If this is enabled, the parameters found in the main config file config.py will be used instead of the command-line arguments.
ℹ️ Note that any other command-line argument that you use in the terminal with the --use-config flag is ignored, i.e. only the parameters defined in the main config file config.py will be used.
-d, --dry-run; config variable dry_run; default value False
If this is enabled, no file rename/move/symlink/etc. operations will actually be executed.
--sl, --symlink-only; config variable symlink_only; default value False
Instead of moving the ebook files, create symbolic links to them.
--km, --keep-metadata; config variable keep_metadata; default value False
Do not delete the gathered metadata for the organized ebooks; instead, save it in an accompanying file together with each renamed book. It is very useful for additional verification, indexing or processing at a later date. [KM]
Options related to extracting ISBNs from files and finding metadata by ISBN
-i, --isbn-regex; config variable isbn_regex; see default value
This is the regular expression used to match ISBN-like numbers in the supplied books.
--isbn-blacklist-regex; config variable isbn_blacklist_regex; default value ^(0123456789|([0-9xX])\2{9})$
Any ISBNs that were matched by the isbn_regex above and pass the ISBN validation algorithm are normalized and passed through this regular expression. Any ISBNs that successfully match against it are discarded. The idea is to ignore technically valid but probably wrong numbers like 0123456789, 0000000000, 1111111111, etc. [IBR]
--isbn-direct-grep-files; config variable isbn_direct_grep_files; default value ^text/(plain|xml|html)$
This is a regular expression that is matched against the MIME type of the searched files. Matching files are searched directly for ISBNs, without converting or OCR-ing them to .txt first. [IDGF]
--isbn-ignored-files; config variable isbn_ignored_files; see default value
This is a regular expression that is matched against the MIME type of the searched files. Matching files are not searched for ISBNs beyond their filename. The default value is a bit long because it tries to make the scripts ignore .gif and .svg images, audio, video and executable files and fonts. [IIF]
--reorder-files-for-grep; config variables isbn_grep_reorder_files, isbn_grep_rf_scan_first, isbn_grep_rf_reverse_last; default value 400, 50
These options specify if and how we should reorder the ebook text before searching for ISBNs in it. By default, the first 400 lines of the text are searched as they are, then the last 50 are searched in reverse and finally the remainder in the middle. This reordering is done to improve the odds that the first found ISBNs in a book text actually belong to that book (e.g. from the copyright section or the back cover), instead of being random ISBNs mentioned in the middle of the book. No part of the text is searched twice, even if these regions overlap. If you use the command-line option, the format is False to disable the functionality or first_lines last_lines to enable it with the specified values. [RFFG]
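The reordering described for --reorder-files-for-grep can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the code from lib.py; the function name reorder_lines_for_isbn_search and its parameters are hypothetical.

```python
def reorder_lines_for_isbn_search(lines, scan_first=400, reverse_last=50):
    """Return lines in search order: first N as-is, last M reversed, then the middle.

    Hypothetical helper: it only mirrors the behavior described for
    isbn_grep_rf_scan_first and isbn_grep_rf_reverse_last.
    """
    head = lines[:scan_first]
    rest = lines[scan_first:]  # lines not already covered by the head
    if reverse_last > 0:
        tail = rest[-reverse_last:]
        middle = rest[:-reverse_last]
    else:
        tail, middle = [], rest
    # Each line appears exactly once, even when the two regions would overlap.
    return head + tail[::-1] + middle


if __name__ == "__main__":
    # 10 "lines", scan the first 3 and the last 2 in reverse.
    print(reorder_lines_for_isbn_search(list(range(1, 11)), 3, 2))
    # [1, 2, 3, 10, 9, 4, 5, 6, 7, 8]
```

Taking the tail from the lines that remain after the head is what guarantees that a short text is never searched twice.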
--mfo, --metadata-fetch-order; config variable isbn_metadata_fetch_order; default value Goodreads,Amazon.com,Google,ISBNDB,WorldCat xISBN,OZON.ru
This option allows you to specify the online metadata sources and order in which the scripts will try searching in them for books by their ISBN. The actual search is done by calibre's fetch-ebook-metadata command-line application, so any custom calibre metadata plugins can also be used. To see the currently available options, run fetch-ebook-metadata --help and check the description for the --allowed-plugin option. [MFO]
If you use calibre versions that are older than 2.84, it's required to manually set this option to an empty string.
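To see what the default isbn_blacklist_regex shown earlier in this section filters out, the following short sketch applies it with Python's re module. The regular expression is the documented default; the helper name is_blacklisted and the sample list are assumptions made for the illustration.

```python
import re

# Default value of isbn_blacklist_regex as documented above.
ISBN_BLACKLIST_REGEX = re.compile(r"^(0123456789|([0-9xX])\2{9})$")


def is_blacklisted(isbn):
    """Return True if a normalized ISBN matches the blacklist (hypothetical helper)."""
    return ISBN_BLACKLIST_REGEX.match(isbn) is not None


if __name__ == "__main__":
    candidates = ["0123456789", "0000000000", "1111111111",
                  "1000100111", "9789580158448"]
    kept = [isbn for isbn in candidates if not is_blacklisted(isbn)]
    print(kept)  # ['1000100111', '9789580158448']
```

The backreference \2 repeats whatever single character the second group captured, so the pattern only catches ten identical characters or the literal sequence 0123456789.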
Options for OCR
--ocr, --ocr-enabled; config variable ocr_enabled; default value False
Whether to enable OCR for .pdf, .djvu and image files. It is disabled by default and can be used differently in two scripts [OCR]:
- organize_ebooks.py can use OCR for finding ISBNs in scanned books. Setting the value to True will cause it to use OCR for books that failed to be converted to .txt or were converted to empty files by the simple conversion tools (ebook-convert, pdftotext, djvutxt). Setting the value to always will cause it to use OCR even when the simple tools produced a non-empty result, if there were no ISBNs in it.
- convert_to_txt.py can use OCR for the conversion to .txt. Setting the value to True will cause it to use OCR for books that failed to be converted to .txt or were converted to empty files by the simple conversion tools. Setting it to always will cause it to first try OCR-ing the books before trying the simple conversion tools.
--ocrop, --ocr-only-first-last-pages; config variable ocr_only_first_last_pages; default value (7,3) (except for convert_to_txt.py where it's False)
Value n,m instructs the scripts to convert only the first n and last m pages when OCR-ing ebooks. This is done because OCR is a slow, resource-intensive process and ISBNs are usually at the beginning or at the end of books. Setting the value to False disables this optimization and is the default for convert_to_txt.py, where we probably want the whole book to be converted. [OCROP]
--ocrc, --ocr-command; config variable ocr_command; default value tesseract_wrapper
This allows us to define a hook for using custom OCR settings or software. The default value is just a wrapper that allows us to use both tesseract 3 and 4 with some predefined settings. You can use a custom bash function or shell script - the first argument is the input image (books are OCR-ed page by page) and the second argument is the file you have to write the output text to. [OCRC]
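As an illustration of the ocr_only_first_last_pages option described in this section, the sketch below computes which pages would be OCR-ed for a given page count. It is an assumption-based example, not the package's implementation; pages_to_ocr is a hypothetical name and pages are numbered from 1.

```python
def pages_to_ocr(page_count, first_last=(7, 3)):
    """Return the 1-based page numbers to OCR (hypothetical helper).

    first_last mirrors ocr_only_first_last_pages: a (n, m) pair, or False
    to OCR the whole document.
    """
    if not first_last:
        return list(range(1, page_count + 1))
    first_n, last_m = first_last
    first = range(1, min(first_n, page_count) + 1)
    last = range(max(page_count - last_m + 1, 1), page_count + 1)
    # The set union avoids OCR-ing a page twice in short documents.
    return sorted(set(first) | set(last))


if __name__ == "__main__":
    print(pages_to_ocr(100))      # [1, 2, 3, 4, 5, 6, 7, 98, 99, 100]
    print(pages_to_ocr(5, False)) # [1, 2, 3, 4, 5]
```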
Options related to extracting and searching for non-ISBN metadata
--token-min-length; config variable token_min_length; default value 3
When files and file metadata are parsed, they are split into words (or more precisely, either alpha or numeric tokens) and ones shorter than this value are ignored. By default, single- and two-character numbers and words are ignored. [TML]
--tokens-to-ignore; config variable tokens_to_ignore; see default value
A regular expression that is matched against the filename/author/title tokens and matching tokens are ignored. The default regular expression includes common words that probably hinder online metadata searching like book, novel, series, volume and others, as well as probable publication years (so 1999 is ignored while 2033 is not). [TI]
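The token filtering controlled by token_min_length and tokens_to_ignore can be sketched as follows. The real default for tokens_to_ignore is longer and is not reproduced here: the TOKENS_TO_IGNORE pattern below is a simplified stand-in invented for the example, and tokenize is a hypothetical function name.

```python
import re

TOKEN_MIN_LENGTH = 3  # documented default of token_min_length

# Simplified, hypothetical stand-in for tokens_to_ignore: a few common
# words plus years from 1900 to 2029 (the real default differs).
TOKENS_TO_IGNORE = re.compile(r"^(book|novel|series|volume|19[0-9]{2}|20[0-2][0-9])$")


def tokenize(text):
    """Split text into alpha or numeric tokens and drop short or ignored ones."""
    tokens = re.findall(r"[^\W\d_]+|\d+", text.lower())
    return [
        token for token in tokens
        if len(token) >= TOKEN_MIN_LENGTH and not TOKENS_TO_IGNORE.match(token)
    ]


if __name__ == "__main__":
    print(tokenize("A Hobbit Tale, Book 2 (1999) 2033"))
    # ['hobbit', 'tale', '2033']
```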
--owis, --organize-without-isbn-sources; config variable organize_without_isbn_sources; default value Goodreads,Amazon.com,Google
This option allows you to specify the online metadata sources in which the scripts will try searching for books by non-ISBN metadata (i.e. author and title). The actual search is done by calibre's fetch-ebook-metadata command-line application, so any custom calibre metadata plugins can also be used. To see the currently available options, run fetch-ebook-metadata --help and check the description for the --allowed-plugin option. Because calibre versions older than 2.84 don't support the --allowed-plugin option, if you want to use such an old calibre version you should manually set organize_without_isbn_sources to an empty string.
In contrast to searching by ISBNs, searching by author and title is done concurrently in all of the allowed online metadata sources. The number of sources is smaller because some metadata sources can be searched only by ISBN or return many false positives when searching by title and author. [OWIS]
Options related to the input and output files
--oft, --output-filename-template; config variable output_filename_template; default value:
"${d[AUTHORS]// & /, } - ${d[SERIES]:+[${d[SERIES]}] - }${d[TITLE]/:/ -}${d[PUBLISHED]:+ (${d[PUBLISHED]%%-*})}${d[ISBN]:+ [${d[ISBN]}]}.${d[EXT]}"
By default the organized files start with the comma-separated author name(s), followed by the book series name and number in square brackets (if present), followed by the book title, the year of publication (if present), the ISBN(s) (if present) and the original extension. [OFT]
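The default template above uses shell parameter expansion. To make its effect concrete, here is a rough Python equivalent of what those expansions produce. How the Python port actually evaluates the template is not described in this document, so build_filename and the sample metadata are purely illustrative.

```python
def build_filename(d):
    """Emulate the default output_filename_template for a metadata dict.

    Hypothetical helper; d uses the same keys as the template
    (AUTHORS, SERIES, TITLE, PUBLISHED, ISBN, EXT).
    """
    authors = d["AUTHORS"].replace(" & ", ", ")          # ${d[AUTHORS]// & /, }
    series = f"[{d['SERIES']}] - " if d.get("SERIES") else ""
    title = d["TITLE"].replace(":", " -", 1)             # first ':' only
    published = ""
    if d.get("PUBLISHED"):
        # ${d[PUBLISHED]%%-*} keeps what precedes the first '-', i.e. the year.
        published = f" ({d['PUBLISHED'].split('-', 1)[0]})"
    isbn = f" [{d['ISBN']}]" if d.get("ISBN") else ""
    return f"{authors} - {series}{title}{published}{isbn}.{d['EXT']}"


if __name__ == "__main__":
    metadata = {  # made-up sample values
        "AUTHORS": "Jane Doe & John Roe",
        "SERIES": "Example Series #2",
        "TITLE": "A Title: A Subtitle",
        "PUBLISHED": "2015-06-01",
        "ISBN": "9789580158448",
        "EXT": "pdf",
    }
    print(build_filename(metadata))
    # Jane Doe, John Roe - [Example Series #2] - A Title - A Subtitle (2015) [9789580158448].pdf
```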
--ome, --output-metadata-extension; config variable output_metadata_extension; default value meta
If keep_metadata is enabled, this is the extension of the additional metadata file that is saved next to each newly renamed file. [OME]
Miscellaneous options
--log-level; config variable logging_level; default value info
Set logging level for all loggers. Choices are {debug,info,warning,error}.
--log-format; config variable logging_formatter; default value simple
Set logging formatter for all loggers. Choices are {console,simple,only_msg}.
-r, --reverse; config variable reverse; default value False
If this is enabled, the files will be sorted in reverse (i.e. descending) order. By default, they are sorted in ascending order.
NOTE: more sort options will eventually be implemented, such as random sort.
Script usage, subcommands and options
The usage pattern for running a given subcommand is the following:
ebooktools {edit,convert,find,organize,rename,split} [OPTIONS]
where [OPTIONS] includes general options and options specific to the subcommand as shown below.
Don't forget the name of the Python script ebooktools before the subcommand.
All subcommands are affected by the following global options:
The -h, --help option can be applied specifically to each subcommand or to the ebooktools.py script (when called without the subcommand). Thus when you want the help message for a specific subcommand, you do:
ebooktools {edit,convert,find,organize,rename,split} -h
which will show you the options that affect the chosen subcommand.
And if you want the help message for the whole ebooktools.py script:
ebooktools -h
which will show you the project description and description of each subcommand without showing the subcommand options.
In the subsections below, you will find a definition for each of the supported subcommands for automated organization and management of large ebook collections.
edit [OPTIONS] {main,log}
usage: ebooktools edit [OPTIONS] {main,log}
where [OPTIONS] includes specific options and an input option, as described below.
Very few general options affect this subcommand, such as -q, --quiet and --verbose.
Description
Edits a configuration file, which can either be
- the main configuration file (main) where all the options associated with the ebooktools.py script can be found and whose default values are defined in default_config.py
- the logging configuration file (log) to set up the different loggers used in the ebooktools.py script and whose default values are defined in default_logging.py.
The configuration file can be opened by a user-specified application (app) or a default program associated with this type of file (when app is None).
By default, command-line arguments supersede parameters defined in the default configuration file default_config.py. However, if you enable the --use-config flag, the parameters defined in the main config file config.py will be used instead.
Specific options for editing config files
-a, --app; config variable app; default value None
Name of the application to use for editing the config file. If no name is given, then the default application for opening this type of file will be used.
-r, --reset; config variable reset; default value False
Reset a configuration file (main or log) with factory default values.
Input option for editing config files
{main,log}; config variable cfg_type; required
The config file to edit, which can be either the main configuration file (main) or the logging configuration file (log).
convert [OPTIONS] input_file
usage: ebooktools convert [OPTIONS] input_file
where [OPTIONS] includes general options and input/output options, as described below.
Description
Converts the supplied file to a text file. It can optionally also use OCR for .pdf, .djvu and image files.
General options for converting files
Some of the global options affect the convert subcommand's behavior a lot, especially the OCR ones.
Input and output options for converting files
input_file; config variable input_file; required
The input file to be converted to a text file.
-o, --output-file; config variable output_file; default value is output.txt
The output text file. By default, it is saved in the current working directory.
find [OPTIONS] input_data
usage: ebooktools find [OPTIONS] input_data
where [OPTIONS] includes general options, specific options and an input option, as described below.
Description
Tries to find valid ISBNs inside a file or in a string if no file was specified. Searching for ISBNs in files uses progressively more resource-intensive methods until some ISBNs are found. For more details, see
- the documentation for ebook-tools (shell scripts) or
- search_file_for_isbns() from lib.py (Python function where the ISBN search in files is implemented).
General options for finding ISBNs
The global options that especially affect the find subcommand are the ones related to extracting ISBNs from files and the OCR ones.
Specific options for finding ISBNs
The only subcommand-specific option is:
--irs, --isbn-return-separator; config variable isbn_ret_separator; default value \n (a new line)
This specifies the separator that will be used when returning any found ISBNs.
Input option for finding ISBNs
input_data; config variable input_data; required
Can either be the path to a file or a string. The input will be searched for ISBNs.
organize [OPTIONS] folder_to_organize
usage: ebooktools organize [OPTIONS] folder_to_organize
where [OPTIONS] includes general options, specific options, and input/output options, as described below.
Description
This is probably the most versatile subcommand. It can automatically organize folders with huge quantities of unorganized ebook files. This is done by extracting ISBNs and/or metadata from the ebook files, downloading their full and hopefully correct metadata from online sources and auto-renaming the unorganized files with full and correct names and moving them to specified folders. It supports virtually all ebook types, including ebooks in arbitrary or even nested archives (like the other subcommands, it assumes that one file is one ebook, even if it's a huge archive). OCR can be used for scanned ebooks and corrupt ebooks and non-ebook documents (pamphlets) can be separated in specified folders. All of the general options and flags above affect how this subcommand operates, but there are also specific options for it. [ORGD]
General options for organizing files
All general options affect the organize subcommand. However, these are the general options that you will probably use the most:
- -d, --dry-run
- --sl, --symlink-only
- --km, --keep-metadata
- --mfo, --metadata-fetch-order
- --owis, --organize-without-isbn-sources
- --oft, --output-filename-template
- all the ocr-related options
Specific options for organizing files
--cco, --corruption-check-only; config variable corruption_check_only; default value False
Do not organize or rename files, just check them for corruption (e.g. zero-filled files, corrupt archives or broken .pdf files). Useful with the output_folder_corrupt option.
--tested-archive-extensions; config variable tested_archive_extensions; default value ^(7z|bz2|chm|arj|cab|gz|tgz|gzip|zip|rar|xz|tar|epub|docx|odt|ods|cbr|cbz|maff|iso)$
A regular expression that specifies which file extensions will be tested with 7z t for corruption.
--owi, --organize-without-isbn; config variable organize_without_isbn; default value False
Specify whether the script will try to organize ebooks if there were no ISBN found in the book or if no metadata was found online with the retrieved ISBNs. If enabled, the script will first try to use calibre's ebook-meta command-line tool to extract the author and title metadata from the ebook file. The script will try searching the online metadata sources (organize_without_isbn_sources) by the extracted author & title and just by title. If there is no useful metadata or nothing is found online, the script will try to use the filename for searching. [OWI]
--wii, --without-isbn-ignore; config variable without_isbn_ignore; complex default value
This is a regular expression that is matched against lowercase filenames. All files that do not contain ISBNs are matched against it and matching files are ignored by the script, even if organize_without_isbn is True. The default value is calibrated to match most periodicals (magazines, newspapers, etc.) so the script can ignore them. [WII]
--pamphlet-included-files; config variable pamphlet_included_files; default value \.(png|jpg|jpeg|gif|bmp|svg|csv|pptx?)$
This is a regular expression that is matched against lowercase filenames. All files that do not contain ISBNs and do not match without_isbn_ignore are matched against it and matching files are considered pamphlets by default. They are moved to output_folder_pamphlets if set, otherwise they are ignored. [PIF]
--pamphlet-excluded-files; config variable pamphlet_excluded_files; default value \.(chm|epub|cbr|cbz|mobi|lit|pdb)$
This is a regular expression that is matched against lowercase filenames. If files do not contain ISBNs and match against it, they are NOT considered as pamphlets, even if they have a small size or number of pages.
--pamphlet-max-pdf-pages; config variable pamphlet_max_pdf_pages; default value 50
.pdf files that do not contain valid ISBNs and have a lower number of pages than this are considered pamphlets/non-ebook documents.
--pamphlet-max-filesize-kib; config variable pamphlet_max_filesize_kib; default value 250
Other files that do not contain valid ISBNs and are below this size in KiBs are considered pamphlets/non-ebook documents.
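The four pamphlet options above combine into a small decision procedure for files without ISBNs. The sketch below is one plausible reading of that procedure using the documented default values. The function is_pamphlet, its parameters and the exact order of the checks are assumptions, not the code from organize_ebooks.py.

```python
import re

# Documented default values of the pamphlet options.
PAMPHLET_INCLUDED_FILES = re.compile(r"\.(png|jpg|jpeg|gif|bmp|svg|csv|pptx?)$")
PAMPHLET_EXCLUDED_FILES = re.compile(r"\.(chm|epub|cbr|cbz|mobi|lit|pdb)$")
PAMPHLET_MAX_PDF_PAGES = 50
PAMPHLET_MAX_FILESIZE_KIB = 250


def is_pamphlet(filename, size_kib, pdf_pages=None):
    """Guess whether a file without ISBNs is a pamphlet (hypothetical helper)."""
    name = filename.lower()  # the regexes are matched against lowercase names
    if PAMPHLET_INCLUDED_FILES.search(name):
        return True
    if PAMPHLET_EXCLUDED_FILES.search(name):
        return False
    if name.endswith(".pdf") and pdf_pages is not None:
        return pdf_pages < PAMPHLET_MAX_PDF_PAGES
    return size_kib < PAMPHLET_MAX_FILESIZE_KIB


if __name__ == "__main__":
    print(is_pamphlet("Slides.PPTX", size_kib=9000))             # True
    print(is_pamphlet("tiny.epub", size_kib=10))                 # False
    print(is_pamphlet("flyer.pdf", size_kib=5000, pdf_pages=4))  # True
```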
Input and output options for organizing files
folder_to_organize; config variable folder_to_organize; required
Folder containing the ebook files that need to be organized.
-o, --output-folder; config variable output_folder; default value is the current working directory (check with pwd)
The folder to which ebooks that were renamed based on the ISBN metadata will be moved.
--ofu, --output-folder-uncertain; config variable output_folder_uncertain; default value is None
If organize_without_isbn is enabled, this is the folder to which all ebooks that were renamed based on non-ISBN metadata will be moved.
--ofc, --output-folder-corrupt; config variable output_folder_corrupt; default value is None
If specified, corrupt files will be moved to this folder.
--ofp, --output-folder-pamphlets; config variable output_folder_pamphlets; default value is None
If specified, pamphlets will be moved to this folder.
rename [OPTIONS] calibre_folder
usage: ebooktools rename [OPTIONS] calibre_folder
where [OPTIONS] includes general options, specific options, and input/output options, as described below.
Description
This subcommand traverses a calibre library folder and renames all the book files in it by reading their metadata from calibre's metadata.opf files. Then the book files are either moved or symlinked (if the --symlink-only flag is enabled) to the output folder along with their corresponding metadata files. [RCL]
Activate the --dry-run flag for testing purposes since no file rename/move/symlink/etc. operations will actually be executed.
General options for renaming files
In particular, the following global options are especially important for the rename subcommand:
- -d, --dry-run
- --sl, --symlink-only
- -i, --isbn-regex
- --isbn-blacklist-regex
- --oft, --output-filename-template
- --ome, --output-metadata-extension
Specific options for renaming files
--sm, --save-metadata; config variable save_metadata; default value recreate
This specifies whether metadata files will be saved together with the renamed ebooks. Value opfcopy just copies calibre's metadata.opf next to each renamed file with an output_metadata_extension extension, while recreate saves a metadata file that is similar to the one organize_ebooks.py creates. disable disables this function. [SM]
Input and output options for renaming files
calibre_folder; config variable calibre_folder; required
Calibre library folder which will be traversed and all the book files in it will be renamed. The renamed files will be either moved or symlinked (if the --symlink-only flag is enabled) to the output folder along with their corresponding metadata.
-o, --output-folder; config variable output_folder; default value is the current working directory (check with pwd)
This is the output folder the renamed books will be moved to along with their metadata files. The default value is the current working directory.
split [OPTIONS] folder_with_books
usage: ebooktools split [OPTIONS] folder_with_books
where [OPTIONS] includes general options, specific options, and input/output options, as described below.
Description
Splits the supplied ebook files (and the accompanying metadata files if present) into folders with consecutive names that each contain the specified number of files.
General options for splitting files
In particular, the following global options are especially important for the split subcommand:
Specific options for splitting files
-s, --start-number; config variable start_number; default value 0
The number of the first folder.
-f, --folder-pattern; config variable folder_pattern; default value %05d000
The print format string that specifies the pattern with which new folders will be created. By default it creates folders like 00000000, 00001000, 00002000, ...
--fpf, --files-per-folder; config variable files_per_folder; default value 1000
How many files should be moved to each folder.
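To show how start_number, folder_pattern and files_per_folder interact, here is a small sketch that plans which files would go into which folder. It only computes names and touches no files. The function plan_split is hypothetical, and whether the real script also moves a final, partially filled chunk is not stated here, although this sketch does include it.

```python
def plan_split(files, start_number=0, folder_pattern="%05d000", files_per_folder=1000):
    """Map consecutive folder names to chunks of files (hypothetical helper)."""
    plan = {}
    for offset in range(0, len(files), files_per_folder):
        folder_number = start_number + offset // files_per_folder
        # printf-style formatting, as the "print format string" wording suggests.
        plan[folder_pattern % folder_number] = files[offset:offset + files_per_folder]
    return plan


if __name__ == "__main__":
    books = [f"book_{i}.pdf" for i in range(2500)]  # made-up file names
    for folder, chunk in plan_split(books).items():
        print(folder, len(chunk))
    # 00000000 1000
    # 00001000 1000
    # 00002000 500
```

With the default pattern, the folder name reads like the index of the first file it holds, which is why the trailing 000 pairs naturally with 1000 files per folder.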
Input and output options for splitting files
input_file; config variable input_file; required
Folder with books which will be recursively scanned for files.
-o, --output-folder; config variable output_folder; default value is the current working directory (check with pwd)
The output folder in which all the new consecutively named folders will be created.
Examples
More examples can be found at examples.rst.
Example 1: convert a pdf file to text with OCR
To convert a pdf file to text with OCR:
$ ebooktools convert --ocr always -o converted.txt pdf_to_convert.pdf
By setting --ocr to always, the pdf file will be first OCRed before trying the simple conversion tools (pdftotext or calibre's ebook-convert if the former command is not found).
Running pyebooktools v0.1.0a3
Verbose option disabled
OCR=always, first try OCR then conversion
Will run OCR on file 'pdf_to_convert.pdf' with 1 page...
OCR successful!
Example 2: find ISBNs in a pdf file
Find ISBNs in a pdf file:
$ ebooktools find pdf_file.pdf
Output:
Running pyebooktools v0.1.0a3
Verbose option disabled
Searching file 'pdf_file.pdf' for ISBN numbers...
Extracted ISBNs:
9789580158448
1000100111
The search for ISBNs starts in the first pages of the document to increase the likelihood that the first extracted ISBN is the correct one. Then the last pages are analyzed in reverse. Finally, the rest of the pages are searched.
Thus, in this example, the first extracted ISBN is the correct one for the book, since it was found on the first page.
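The search order described above can be sketched as follows. This is only an illustration: reorder_pages is a hypothetical helper, and the sizes of the "first pages" and "last pages" windows (first and last) are made-up parameters, not the project's actual settings.

```python
def reorder_pages(pages, first=2, last=2):
    """Return pages in search order: first pages, last pages reversed, the rest."""
    # Never let the two windows overlap on short documents.
    first = min(first, len(pages))
    last = min(last, len(pages) - first)
    split_at = len(pages) - last
    head = pages[:first]
    tail = pages[split_at:][::-1]
    middle = pages[first:split_at]
    return head + tail + middle


if __name__ == "__main__":
    print(reorder_pages(list(range(1, 11))))
    # [1, 2, 10, 9, 3, 4, 5, 6, 7, 8]
```

Searching in this order means that an ISBN printed on a title or copyright page is reported before any number that merely looks like an ISBN deeper in the text.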
The last sequence, 1000100111, was found in the middle of the document. It is not the book's ISBN, even though it is technically a valid ISBN that the regular expression isbn_blacklist_regex didn't catch. It may be a binary sequence from an exercise in a book about digital systems.
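To see why 1000100111 passes as a technically valid ISBN, the standard ISBN-10 and ISBN-13 check-digit rules can be applied to both extracted values. The sketch below is a standalone illustration of those rules; the function names are hypothetical and this is not the validation code used by find_isbns.py.

```python
def is_valid_isbn10(isbn):
    """ISBN-10: the weighted sum (weights 10 down to 1) must be divisible by 11."""
    if len(isbn) != 10:
        return False
    total = 0
    for position, char in enumerate(isbn):
        if char in "Xx" and position == 9:
            value = 10  # 'X' stands for 10 and is allowed only as the check digit
        elif "0" <= char <= "9":
            value = int(char)
        else:
            return False
        total += (10 - position) * value
    return total % 11 == 0


def is_valid_isbn13(isbn):
    """ISBN-13: with alternating weights 1 and 3, the sum must be divisible by 10."""
    if len(isbn) != 13 or not all("0" <= char <= "9" for char in isbn):
        return False
    total = sum(int(char) * (3 if index % 2 else 1)
                for index, char in enumerate(isbn))
    return total % 10 == 0


if __name__ == "__main__":
    print(is_valid_isbn13("9789580158448"))  # True
    print(is_valid_isbn10("1000100111"))     # True: the weighted sum is 22
```

A checksum alone cannot tell a real ISBN from a lucky digit sequence, which is why the position in the document and a blacklist regex are used as additional filters.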
Uninstall
To uninstall the pyebooktools package:
$ pip uninstall pyebooktools
When uninstalling the pyebooktools package, you might be informed that the configuration files logging.py and config.py won't be removed by pip. You can remove those files manually by noting their paths returned by pip. Or you can leave them so your saved settings can be re-used the next time you re-install the package.
Example: uninstall the package and remove the config files
$ pip uninstall pyebooktools
Found existing installation: pyebooktools 0.1.0a3
Uninstalling pyebooktools-0.1.0a3:
  Would remove:
    /Users/test/miniconda3/envs/ebooktools_py37/bin/ebooktools
    /Users/test/miniconda3/envs/ebooktools_py37/lib/python3.7/site-packages/pyebooktools-0.1.0a3.dist-info/*
    /Users/test/miniconda3/envs/ebooktools_py37/lib/python3.7/site-packages/pyebooktools/*
  Would not remove (might be manually added):
    /Users/test/miniconda3/envs/ebooktools_py37/lib/python3.7/site-packages/pyebooktools/configs/config.py
    /Users/test/miniconda3/envs/ebooktools_py37/lib/python3.7/site-packages/pyebooktools/configs/logging.py
Proceed (y/n)? y
  Successfully uninstalled pyebooktools-0.1.0a3
$ rm -r /Users/test/miniconda3/envs/ebooktools_py37/lib/python3.7/site-packages/pyebooktools/
Limitations
The same limitations as for ebook-tools apply to this project:
- Automatic organization can be slow - all the scripts are synchronous and single-threaded and metadata lookup by ISBN is not done concurrently. This is intentional so that the execution can be easily traced and so that the online services are not hammered by requests. If you want to optimize the performance, run multiple copies of the script on different folders.
- The default setting for isbn_metadata_fetch_order includes two non-standard metadata sources: Goodreads and WorldCat xISBN. For best results, install the plugins (1, 2) for them in calibre and fine-tune the settings for metadata sources in the calibre GUI.
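The suggestion in the first limitation, running multiple copies of the script on different folders, can be sketched as follows. This is an illustration under assumptions: run_commands is a hypothetical helper, the folder names are made up, and the "ebooktools organize -o ..." command line is assumed rather than taken from this section, so check the subcommand's help before relying on it. Keeping the number of workers small helps avoid hammering the online metadata services.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_commands(commands, max_workers=2):
    """Run each command as a separate process; return the exit codes in order."""
    def run(cmd):
        return subprocess.run(cmd).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, commands))


if __name__ == "__main__":
    folders = ["unsorted/part1", "unsorted/part2"]  # hypothetical folders
    # Assumed command line: one independent copy of the script per folder.
    commands = [["ebooktools", "organize", "-o", "organized", folder]
                for folder in folders]
    print(run_commands(commands))
```

Each copy remains synchronous and easy to trace on its own; only the folders are processed side by side.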
Security and safety
Important security and safety tips from the ebook-tools documentation:
Please keep in mind that this is beta-quality software. To avoid data loss, make sure that you have a backup of any files you want to organize. You may also want to run the scripts with the --dry-run or --symlink-only option the first time to make sure that they would do what you expect them to do.
Also keep in mind that these shell scripts parse and extract complex arbitrary media and archive files and pass them to other external programs written in memory-unsafe languages. This is not very safe and specially-crafted malicious ebook files can probably compromise your system when you use these scripts. If you are cautious and want to organize untrusted or unknown ebook files, use something like QubesOS or at least do it in a separate VM/jail/container/etc.
NOTE: --dry-run and --symlink-only can be applied to the following subcommands:
Roadmap
Tasks are listed in order of priority, highest first.
Short-term
- Port all ebook-tools shell scripts into Python
  - organize-ebooks.sh
  - interactive-organizer.sh
  - find-isbns.sh
  - convert-to-txt.sh
  - rename-calibre-library.sh
  - split-into-folders.sh
  Status: only interactive-organizer.sh remaining, will port later
- Add cache support when converting files to txt
  Status: working on it since it is also needed for my other project search-ebooks which makes heavy use of pyebooktools
- Test on linux
- Create a docker image for this project
Medium-term
- Add tests on Travis CI
- Eventually add documentation on Read the Docs
- Add a fix subcommand that will try to fix corrupted PDF files based on one of the following utilities:
  - gs: Ghostscript
  - pdftocairo: from Poppler
  - mutool: it does not "print" the PDF file
  - cpdf
  It will also check PDF files based on one of the following utilities:
  - pdfinfo
  - pdftotext
  - qpdf
  - jhove
- Add a remove subcommand that can remove annotations (incl. highlights, comments, notes, arrows), bookmarks, attachments and metadata from PDF files based on the cpdf utility
  NOTE: pdftk can also remove annotations
Credits
- Special thanks to na--, the developer of ebook-tools, for having made these very useful tools. I learned a lot (especially bash) while porting them to Python.
- Thanks to all the developers of the different programs used by this project such as calibre, Tesseract, text converters (djvutxt and pdftotext) and many other utilities!
License
This program is licensed under the GNU General Public License v3.0. For more details see the LICENSE file in the repository.
References
[IBR] | https://github.com/na--/ebook-tools#options-related-to-extracting-isbns-from-files-and-finding-metadata-by-isbn |
[IDGF] | https://github.com/na--/ebook-tools#options-related-to-extracting-isbns-from-files-and-finding-metadata-by-isbn |
[IIF] | https://github.com/na--/ebook-tools#options-related-to-extracting-isbns-from-files-and-finding-metadata-by-isbn |
[KM] | https://github.com/na--/ebook-tools#general-control-flags |
[MFO] | https://github.com/na--/ebook-tools#options-related-to-extracting-isbns-from-files-and-finding-metadata-by-isbn |
[OCR] | https://github.com/na--/ebook-tools#options-for-ocr |
[OCRC] | https://github.com/na--/ebook-tools#options-for-ocr |
[OCROP] | https://github.com/na--/ebook-tools#options-for-ocr |
[OFT] | https://github.com/na--/ebook-tools#options-related-to-the-input-and-output-files |
[OME] | https://github.com/na--/ebook-tools#options-related-to-the-input-and-output-files |
[ORG] | https://github.com/na--/ebook-tools#ebook-tools |
[ORGD] | https://github.com/na--/ebook-tools#description |
[OWI] | https://github.com/na--/ebook-tools#specific-options-for-organizing-files |
[OWIS] | https://github.com/na--/ebook-tools#options-related-to-extracting-and-searching-for-non-isbn-metadata |
[PIF] | https://github.com/na--/ebook-tools#specific-options-for-organizing-files |
[RCL] | https://bit.ly/3sPJ9kT |
[RFFG] | https://github.com/na--/ebook-tools#options-related-to-extracting-isbns-from-files-and-finding-metadata-by-isbn |
[SM] | https://bit.ly/3sPJ9kT |
[TI] | https://github.com/na--/ebook-tools#options-related-to-extracting-and-searching-for-non-isbn-metadata |
[TML] | https://github.com/na--/ebook-tools#options-related-to-extracting-and-searching-for-non-isbn-metadata |
[WII] | https://github.com/na--/ebook-tools#specific-options-for-organizing-files |