🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

Overview

ArchiveBox
Open-source self-hosted web archiving.

โ–ถ๏ธ Quickstart | Demo | Github | Documentation | Info & Motivation | Community | Roadmap

"Your own personal internet archive" (网站存档 / 爬虫)




ArchiveBox is a powerful self-hosted internet archiving solution written in Python. You feed it URLs of pages you want to archive, and it saves them to disk in a variety of formats depending on setup and content within.

🔢   Run ArchiveBox via Docker Compose (recommended), Docker, Apt, Brew, or Pip (see below).

apt/brew/pip3 install archivebox

archivebox init                       # run this in an empty folder
archivebox add 'https://example.com'  # start adding URLs to archive
curl https://example.com/rss.xml | archivebox add  # or add via stdin
archivebox schedule --every=day https://example.com/rss.xml

For each URL added, ArchiveBox saves several types of HTML snapshot (wget, Chrome headless, singlefile), a PDF, a screenshot, a WARC archive, any git repositories, images, audio, video, subtitles, article text, and more...

archivebox server --createsuperuser 0.0.0.0:8000   # use the interactive web UI
archivebox list 'https://example.com'  # use the CLI commands (--help for more)
ls ./archive/*/index.json              # or browse directly via the filesystem

You can then manage your snapshots via the filesystem, CLI, Web UI, SQLite DB (./index.sqlite3), Python API (alpha), REST API (alpha), or desktop app (alpha).
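
Because the main index is a plain SQLite file, it can also be read with nothing but the Python standard library. The sketch below is illustrative only: the table name core_snapshot and its timestamp and url columns are assumptions about the schema and may differ between versions, and list_snapshots is a hypothetical helper, not part of the ArchiveBox API.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def list_snapshots(db_path, limit=10):
    """Return up to `limit` (timestamp, url) rows, highest timestamp first.

    Assumes a table named core_snapshot with timestamp and url columns
    (an assumption about the index schema, not a documented contract).
    """
    # Open the database read-only so the index cannot be modified by accident.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        return conn.execute(
            "SELECT timestamp, url FROM core_snapshot "
            "ORDER BY timestamp DESC LIMIT ?",
            (limit,),
        ).fetchall()


if __name__ == "__main__":
    # Run from inside the data folder, next to index.sqlite3.
    for timestamp, url in list_snapshots("index.sqlite3"):
        print(timestamp, url)
```

Opening the file in read-only mode is a deliberate choice here: it keeps a quick inspection script from ever competing with ArchiveBox for writes to the index.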

At the end of the day, the goal is to sleep soundly knowing that the part of the internet you care about will be automatically preserved in multiple, durable long-term formats that will be accessible for decades (or longer).




โšก๏ธ   CLI Usage

# archivebox [subcommand] [--args]
archivebox --version
archivebox help
  • archivebox init/version/status/config/manage to administer your collection
  • archivebox add/remove/update/list to manage Snapshots in the archive
  • archivebox schedule to pull in fresh URLs regularly from bookmarks/history/Pocket/Pinboard/RSS/etc.
  • archivebox oneshot to archive single URLs without starting a whole collection
  • archivebox shell/manage dbshell to open a REPL to use the Python API (alpha) or SQL API

Demo | Screenshots | Usage

Quickstart

🖥   Supported OSs: Linux/BSD, macOS, Windows
🎮   CPU Architectures: x86, amd64, arm7, arm8 (raspi >=3)
📦   Distributions: docker/apt/brew/pip3/npm (in order of completeness)

(full setup instructions for each distribution are listed below)

Get ArchiveBox with docker-compose on any platform (recommended, everything included out-of-the-box)

First make sure you have Docker installed: https://docs.docker.com/get-docker/

# create a new empty directory and initialize your collection (can be anywhere)
mkdir ~/archivebox && cd ~/archivebox
curl -O 'https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/master/docker-compose.yml'
docker-compose run archivebox init
docker-compose run archivebox --version

# start the webserver and open the UI (optional)
docker-compose run archivebox manage createsuperuser
docker-compose up -d
open 'http://127.0.0.1:8000'

# you can also add links and manage your archive via the CLI:
docker-compose run archivebox add 'https://example.com'
docker-compose run archivebox status
docker-compose run archivebox help  # to see more options

# when passing stdin/stdout via the cli, use the -T flag
echo 'https://example.com' | docker-compose run -T archivebox add
docker-compose run -T archivebox list --html --with-headers > index.html

This is the recommended way to run ArchiveBox because it includes all the extractors (chrome, wget, youtube-dl, git, etc.), full-text search with sonic, and many other great features.

Get ArchiveBox with docker on any platform

First make sure you have Docker installed: https://docs.docker.com/get-docker/

# create a new empty directory and initialize your collection (can be anywhere)
mkdir ~/archivebox && cd ~/archivebox
docker run -v $PWD:/data -it archivebox/archivebox init
docker run -v $PWD:/data -it archivebox/archivebox --version

# start the webserver and open the UI (optional)
docker run -v $PWD:/data -it -p 8000:8000 archivebox/archivebox server --createsuperuser 0.0.0.0:8000
open http://127.0.0.1:8000

# you can also add links and manage your archive via the CLI:
docker run -v $PWD:/data -it archivebox/archivebox add 'https://example.com'
docker run -v $PWD:/data -it archivebox/archivebox status
docker run -v $PWD:/data -it archivebox/archivebox help  # to see more options

# when passing stdin/stdout via the cli, use only -i (not -it)
echo 'https://example.com' | docker run -v $PWD:/data -i archivebox/archivebox add
docker run -v $PWD:/data -i archivebox/archivebox list --html --with-headers > index.html

Get ArchiveBox with apt on Ubuntu/Debian

This method should work on all Ubuntu/Debian based systems, including x86, amd64, arm7, and arm8 CPUs (e.g. Raspberry Pis >=3).

If you're on Ubuntu >= 20.04, add the apt repository with add-apt-repository:

(on other Ubuntu/Debian-based systems follow the ♰ instructions below)

# add the repo to your sources and install the archivebox package using apt
sudo apt install software-properties-common
sudo add-apt-repository -u ppa:archivebox/archivebox
sudo apt install archivebox

# create a new empty directory and initialize your collection (can be anywhere)
mkdir ~/archivebox && cd ~/archivebox
npm install --prefix . 'git+https://github.com/ArchiveBox/ArchiveBox.git'
archivebox init
archivebox --version

# start the webserver and open the web UI (optional)
archivebox server --createsuperuser 0.0.0.0:8000
open http://127.0.0.1:8000

# you can also add URLs and manage the archive via the CLI and filesystem:
archivebox add 'https://example.com'
archivebox status
archivebox list --html --with-headers > index.html
archivebox list --json --with-headers > index.json
archivebox help  # to see more options

♰ On other Ubuntu/Debian-based systems, add these sources to a new file in /etc/apt/sources.list.d/ instead:

echo "deb http://ppa.launchpad.net/archivebox/archivebox/ubuntu focal main" | sudo tee /etc/apt/sources.list.d/archivebox.list
echo "deb-src http://ppa.launchpad.net/archivebox/archivebox/ubuntu focal main" | sudo tee -a /etc/apt/sources.list.d/archivebox.list
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys C258F79DCC02E369
sudo apt update
sudo apt install archivebox
sudo snap install chromium
archivebox --version
# then scroll back up and continue the initialization instructions above

(you may need to install some other dependencies manually however)

Get ArchiveBox with brew on macOS

First make sure you have Homebrew installed: https://brew.sh/#install

# install the archivebox package using homebrew
brew install archivebox/archivebox/archivebox

# create a new empty directory and initialize your collection (can be anywhere)
mkdir ~/archivebox && cd ~/archivebox
npm install --prefix . 'git+https://github.com/ArchiveBox/ArchiveBox.git'
archivebox init
archivebox --version

# start the webserver and open the web UI (optional)
archivebox server --createsuperuser 0.0.0.0:8000
open http://127.0.0.1:8000

# you can also add URLs and manage the archive via the CLI and filesystem:
archivebox add 'https://example.com'
archivebox status
archivebox list --html --with-headers > index.html
archivebox list --json --with-headers > index.json
archivebox help  # to see more options

Get ArchiveBox with pip on any platform

First make sure you have Python >= 3.7 installed: https://realpython.com/installing-python/

# install the archivebox package using pip3
pip3 install archivebox

# create a new empty directory and initialize your collection (can be anywhere)
mkdir ~/archivebox && cd ~/archivebox
npm install --prefix . 'git+https://github.com/ArchiveBox/ArchiveBox.git'
archivebox init
archivebox --version
# Install any missing extras like wget/git/chrome/etc. manually as needed

# start the webserver and open the web UI (optional)
archivebox server --createsuperuser 0.0.0.0:8000
open http://127.0.0.1:8000

# you can also add URLs and manage the archive via the CLI and filesystem:
archivebox add 'https://example.com'
archivebox status
archivebox list --html --with-headers > index.html
archivebox list --json --with-headers > index.json
archivebox help  # to see more options

No matter which install method you choose, they all roughly follow this 3-step process and all provide the same CLI, Web UI, and on-disk data format.

  1. Install ArchiveBox: apt/brew/pip3 install archivebox
  2. Start a collection: archivebox init
  3. Start archiving: archivebox add 'https://example.com'


DEMO: https://archivebox.zervice.io
Quickstart | Usage | Configuration

Key Features




Input formats

ArchiveBox supports many input formats for URLs, including Pocket & Pinboard exports, Browser bookmarks, Browser history, plain text, HTML, markdown, and more!

echo 'http://example.com' | archivebox add
archivebox add 'https://example.com/some/page'
archivebox add < ~/Downloads/firefox_bookmarks_export.html
archivebox add < any_text_with_urls_in_it.txt
archivebox add --depth=1 'https://example.com/some/downloads.html'
archivebox add --depth=1 'https://news.ycombinator.com#2020-12-12'

# (if using docker add -i when passing via stdin)
echo 'https://example.com' | docker run -v $PWD:/data -i archivebox/archivebox add

# (if using docker-compose add -T when passing via stdin)
echo 'https://example.com' | docker-compose run -T archivebox add

See the Usage: CLI page for documentation and examples.

It also includes a built-in scheduled import feature with archivebox schedule and browser bookmarklet, so you can pull in URLs from RSS feeds, websites, or the filesystem regularly/on-demand.
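
To give a feel for what "any text with URLs in it" means in practice, here is a minimal sketch of pulling URLs out of free-form text. It is not ArchiveBox's actual parser: the regular expression and the extract_urls helper are simplified assumptions made up for illustration, and real input formats need considerably more care.

```python
import re

# Deliberately simple pattern: an http(s) URL runs until whitespace or a
# common closing delimiter. Real-world parsers handle many more edge cases.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text):
    """Return the unique URLs found in `text`, in order of first appearance."""
    seen = {}
    for match in URL_PATTERN.finditer(text):
        # Drop sentence punctuation that often trails a URL in prose.
        url = match.group(0).rstrip(".,;:!?")
        seen.setdefault(url, None)
    return list(seen)


if __name__ == "__main__":
    sample = "Read https://example.com/some/page, then see <http://example.com>."
    for url in extract_urls(sample):
        print(url)
```

A pre-filter like this can be handy for checking what an export file contains before piping it into archivebox add.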

Output formats

All of ArchiveBox's state (including the index, snapshot data, and config file) is stored in a single folder called the "ArchiveBox data folder". All archivebox CLI commands must be run from inside this folder, and you first create it by running archivebox init.

The on-disk layout is optimized to be easy to browse by hand and durable long-term. The main index is a standard sqlite3 database (it can also be exported as static JSON/HTML), and the archive snapshots are organized by date-added timestamp in the archive/ subfolder. Each snapshot subfolder includes a static JSON and HTML index describing its contents, and the snapshot extractor outputs are plain files within the folder (e.g. media/example.mp4, git/somerepo.git, static/someimage.png, etc.)

# to browse your index statically without running the archivebox server, run:
archivebox list --html --with-headers > index.html
archivebox list --json --with-headers > index.json
# if running these commands with docker-compose, add -T:
# docker-compose run -T archivebox list ...

# then open the static index in a browser
open index.html

# or browse the snapshots via filesystem directly
ls ./archive/<timestamp>/

  • Index (index.html & index.json): HTML and JSON index files containing metadata and details
  • Title, Favicon, Headers: response headers, site favicon, and parsed site title
  • Wget Clone (example.com/page-name.html): wget clone of the site, with a WARC saved to warc/<timestamp>.gz
  • Chrome Headless
    • SingleFile (singlefile.html): HTML snapshot rendered with headless Chrome using SingleFile
    • PDF (output.pdf): printed PDF of the site using headless Chrome
    • Screenshot (screenshot.png): 1440x900 screenshot of the site using headless Chrome
    • DOM Dump (output.html): DOM dump of the HTML after rendering using headless Chrome
    • Readability (article.html/json): article text extraction using Readability
  • Archive.org Permalink (archive.org.txt): a link to the saved site on archive.org
  • Audio & Video (media/): all audio/video files and playlists, including subtitles and metadata, saved with youtube-dl
  • Source Code (git/): clone of any repository found on GitHub, Bitbucket, or GitLab links
  • More coming soon! See the Roadmap...
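
Since every snapshot folder carries its own index.json, the archive can be inspected with a few lines of standard-library Python and no running server. The sketch below assumes only the archive/<timestamp>/index.json layout described above; the "url" and "title" keys it prints are assumed field names, and iter_snapshots is a hypothetical helper rather than part of ArchiveBox.

```python
import json
from pathlib import Path


def iter_snapshots(data_dir):
    """Yield (folder_name, index_dict) for each snapshot with a readable index.json."""
    archive_dir = Path(data_dir) / "archive"
    for index_path in sorted(archive_dir.glob("*/index.json")):
        try:
            with index_path.open(encoding="utf-8") as f:
                index = json.load(f)
        except (OSError, json.JSONDecodeError):
            # Skip unreadable or half-written indexes instead of aborting.
            continue
        yield index_path.parent.name, index


if __name__ == "__main__":
    # Run from inside the ArchiveBox data folder.
    for folder, index in iter_snapshots("."):
        # "url" and "title" are assumed keys; .get() tolerates their absence.
        print(folder, index.get("url", "?"), index.get("title") or "")
```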

It does everything out-of-the-box by default, but you can disable or tweak individual archive methods via environment variables or config file.

archivebox config --set SAVE_ARCHIVE_DOT_ORG=False
archivebox config --set YOUTUBEDL_ARGS='--max-filesize=500m'
archivebox config --help



Dependencies

You don't need to install all the dependencies: ArchiveBox will automatically enable the relevant modules based on whatever you have available. That said, it's recommended to use the official Docker image with everything preinstalled.

If you so choose, you can also install ArchiveBox and its dependencies directly on any Linux or macOS system using the system package manager or by running the automated setup script.

ArchiveBox is written in Python 3, so it requires python3 and pip3 to be available on your system. It also uses a set of optional but highly recommended external dependencies for archiving sites: wget (for plain HTML, static files, and WARC saving), chromium (for screenshots, PDFs, JS execution, and more), youtube-dl (for audio and video), git (for cloning git repos), nodejs (for readability and singlefile), and more.




Caveats

If you're importing URLs containing secret slugs or pages with private content (e.g. Google Docs, CodiMD notepads, etc.), you may want to disable some of the extractor modules to avoid leaking private URLs to 3rd party APIs during the archiving process.

# don't do this:
archivebox add 'https://docs.google.com/document/d/12345somelongsecrethere'
archivebox add 'https://example.com/any/url/you/want/to/keep/secret/'

# without first disabling the extractors that share the URL with 3rd party APIs:
archivebox config --set SAVE_ARCHIVE_DOT_ORG=False   # disable saving all URLs in Archive.org
archivebox config --set SAVE_FAVICON=False      # optional: only the domain is leaked, not full URL
archivebox config --set CHROME_BINARY=chromium  # optional: switch to chromium to avoid Chrome phoning home to Google

Be aware that malicious archived JS can also read the contents of other pages in your archive due to snapshot CSRF and XSS protections being imperfect. See the Security Overview page for more details.

# visiting an archived page with malicious JS:
http://127.0.0.1:8000/archive/1602401954/example.com/index.html

# example.com/index.js can now make a request to read everything:
http://127.0.0.1:8000/index.html
http://127.0.0.1:8000/archive/*
# then example.com/index.js can send it off to some evil server

Support for saving multiple snapshots of each site over time will be added soon (along with the ability to view diffs of the changes between runs). For now, ArchiveBox is designed to archive each URL with each extractor type only once. A workaround to take multiple snapshots of the same URL is to make the URLs slightly different by appending a distinct hash (URL fragment) to each:

archivebox add 'https://example.com#2020-10-24'
...
archivebox add 'https://example.com#2020-10-25'
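
If you script this workaround, the dated fragment can be generated rather than typed by hand. The following is a minimal sketch using only the standard library; with_date_fragment is a hypothetical helper name, and note that it replaces any fragment the URL already has.

```python
from datetime import date
from urllib.parse import urlsplit, urlunsplit


def with_date_fragment(url, day=None):
    """Return `url` with its fragment set to an ISO date (today by default)."""
    day = day or date.today()
    parts = urlsplit(url)
    # Only the fragment changes, so the server still receives the same request.
    return urlunsplit(parts._replace(fragment=day.isoformat()))


if __name__ == "__main__":
    print(with_date_fragment("https://example.com", date(2020, 10, 24)))
    # prints: https://example.com#2020-10-24
```

The resulting string can then be passed to archivebox add in the same way as the hand-written examples above.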







Background & Motivation

Vast treasure troves of knowledge are lost every day on the internet to link rot. As a society, we have an imperative to preserve some important parts of that treasure, just like we preserve our books, paintings, and music in physical libraries long after the originals go out of print or fade into obscurity.

Whether it's to resist censorship by saving articles before they get taken down or edited, or just to save a collection of early-2010s flash games you love to play, having the tools to archive internet content enables you to save the stuff you care most about before it disappears.



The balance between the permanence and ephemeral nature of content on the internet is part of what makes it beautiful. I don't think everything should be preserved in an automated fashion, making all content permanent and never removable, but I do think people should be able to decide for themselves and effectively archive specific content that they care about.

Because modern websites are complicated and often rely on dynamic content, ArchiveBox archives the sites in several different formats beyond what public archiving services like Archive.org and Archive.is are capable of saving. Using multiple methods and the market-dominant browser to execute JS ensures we can save even the most complex, finicky websites in at least a few high-quality, long-term data formats.

All the archived links are stored by date bookmarked in ./archive/<timestamp>, and everything is indexed nicely with JSON & HTML files. The intent is for all the content to be viewable with common software in 50 - 100 years without needing to run ArchiveBox in a VM.

Comparison to Other Projects

▶ Check out our community page for an index of web archiving initiatives and projects.

The aim of ArchiveBox is to go beyond what the Wayback Machine and other public archiving services can do, by adding a headless browser to replay sessions accurately, and by automatically extracting all the content in multiple redundant formats that will survive being passed down to historians and archivists through many generations.

User Interface & Intended Purpose

ArchiveBox differentiates itself from similar projects by being a simple, one-shot CLI for users to ingest bulk feeds of URLs over extended periods, as opposed to being a backend service that ingests individual, manually submitted URLs from a web UI. However, we also have the option to add URLs via a web interface through our Django frontend.

Private Local Archives vs Centralized Public Archives

Unlike crawler software that starts from a seed URL and works outwards, or public tools like Archive.org designed for users to manually submit links from the public internet, ArchiveBox tries to be a set-and-forget archiver suitable for archiving your entire browsing history, RSS feeds, or bookmarks, including private/authenticated content that you wouldn't otherwise share with a centralized service (do not do this until v0.5 is released with some security fixes). Also by having each user store their own content locally, we can save much larger portions of everyone's browsing history than a shared centralized service would be able to handle.

Storage Requirements

Because ArchiveBox is designed to ingest a firehose of browser history and bookmark feeds to a local disk, it can be much more disk-space intensive than a centralized service like the Internet Archive or Archive.today. However, as storage space gets cheaper and compression improves, you should be able to use it continuously over the years without having to delete anything. In my experience, ArchiveBox uses about 5 GB per 1000 articles, but your mileage may vary depending on which options you have enabled and what types of sites you're archiving. By default, it archives everything in as many formats as possible, meaning it takes more space than using a single method, but more content is accurately replayable over extended periods of time. Storage requirements can be reduced by using a compressed/deduplicated filesystem like ZFS/BTRFS, or by setting SAVE_MEDIA=False to skip audio & video files.
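
For rough capacity planning, the rule of thumb above turns into simple arithmetic. The sketch below just scales the quoted figure of about 5 GB per 1000 articles; that figure is one person's anecdotal average rather than a measured constant, and estimate_storage_gb is a hypothetical helper.

```python
# Rough figure quoted above; real usage varies with enabled extractors and content.
GB_PER_1000_SNAPSHOTS = 5.0


def estimate_storage_gb(snapshot_count, gb_per_1000=GB_PER_1000_SNAPSHOTS):
    """Return a back-of-the-envelope disk usage estimate in gigabytes."""
    if snapshot_count < 0:
        raise ValueError("snapshot_count must be non-negative")
    return snapshot_count * gb_per_1000 / 1000


if __name__ == "__main__":
    # 20,000 bookmarks at the default rate:
    print(estimate_storage_gb(20_000))  # prints: 100.0
```

Pass a lower gb_per_1000 if you disable media downloads or use a compressing filesystem, and measure your own collection before trusting any estimate.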



Learn more

Whether you want to learn which organizations are the big players in the web archiving space, want to find a specific open-source tool for your web archiving need, or just want to see where archivists hang out online, our Community Wiki page serves as an index of the broader web archiving community. Check it out to learn about some of the coolest web archiving projects and communities on the web!




Documentation

We use the Github wiki system and Read the Docs (WIP) for documentation.

You can also access the docs locally by looking in the ArchiveBox/docs/ folder.

Getting Started

Reference

More Info




ArchiveBox Development

All contributions to ArchiveBox are welcomed! Check our issues and Roadmap for things to work on, and please open an issue to discuss your proposed implementation before working on things! Otherwise we may have to close your PR if it doesn't align with our roadmap.


Setup the dev environment

1. Clone the main code repo (making sure to pull the submodules as well)

git clone --recurse-submodules https://github.com/ArchiveBox/ArchiveBox
cd ArchiveBox
git checkout dev  # or the branch you want to test
git submodule update --init --recursive
git pull --recurse-submodules

2. Option A: Install the Python, JS, and system dependencies directly on your machine

# Install ArchiveBox + python dependencies
python3 -m venv .venv && source .venv/bin/activate && pip install -e '.[dev]'
# or: pipenv install --dev && pipenv shell

# Install node dependencies
npm install

# Check to see if anything is missing
archivebox --version
# install any missing dependencies manually, or use the helper script:
./bin/setup.sh

2. Option B: Build the docker container and use that for development instead

# Optional: develop via docker by mounting the code dir into the container
# if you edit e.g. ./archivebox/core/models.py on the docker host, runserver
# inside the container will reload and pick up your changes
docker build . -t archivebox
docker run -it --rm archivebox version
docker run -it --rm -p 8000:8000 \
    -v $PWD/data:/data \
    -v $PWD/archivebox:/app/archivebox \
    archivebox server 0.0.0.0:8000 --debug --reload

Common development tasks

See the ./bin/ folder and read the source of the bash scripts within. You can also run all these in Docker. For more examples see the Github Actions CI/CD tests that are run: .github/workflows/*.yaml.

Run in DEBUG mode

archivebox config --set DEBUG=True
# or
archivebox server --debug ...

Build and run a Github branch

docker build -t archivebox:dev https://github.com/ArchiveBox/ArchiveBox.git#dev
docker run -it -v $PWD:/data archivebox:dev ...

Run the linters

./bin/lint.sh

(uses flake8 and mypy)

Run the integration tests

./bin/test.sh

(uses pytest -s)

Make migrations or enter a django shell

Make sure to run this whenever you change things in models.py.

cd archivebox/
./manage.py makemigrations

cd path/to/test/data/
archivebox shell
archivebox manage dbshell


Build the docs, pip package, and docker image

(Normally CI takes care of this, but these scripts can be run to do it manually)

./bin/build.sh

# or individually:
./bin/build_docs.sh
./bin/build_pip.sh
./bin/build_deb.sh
./bin/build_brew.sh
./bin/build_docker.sh

Roll a release

(Normally CI takes care of this, but these scripts can be run to do it manually)

./bin/release.sh

# or individually:
./bin/release_docs.sh
./bin/release_pip.sh
./bin/release_deb.sh
./bin/release_brew.sh
./bin/release_docker.sh




This project is maintained mostly in my spare time with help from generous contributors and Monadical (✨ hire them for dev work!).


Sponsor us on Github




Comments
  • v0.4 (first Django release)


    The v0.4 Release

    A bunch of big changes:

    • pip install archivebox is now available
    • beginnings of transition to Django while maintaining a mostly backwards-compatible CLI
    • using argparse instead of hand-written CLI system: see archivebox/cli/archivebox.py
    • new subcommands-based CLI for archivebox (see below)

    For more info, see: https://github.com/pirate/ArchiveBox/wiki/Roadmap

    Released in this version:

    Install Methods:

    Note: apt, brew are now available as of v0.5

    Command Line Interface:

    Web UI:

    • ✅ / Main index
    • ✅ /add Page to add new links to the archive (but needs improvement)
    • ✅ /archive/<timestamp>/ Snapshot details page
    • ✅ /archive/<timestamp>/<url> live wget archive of page
    • ✅ /archive/<timestamp>/<extractor> get a specific extractor output for a given snapshot
    • ✅ /archive/<url> shortcut to view most recent snapshot of given url
    • ✅ /archive/<url_hash> shortcut to view most recent snapshot of given url
    • ✅ /admin Admin interface to view and edit archive data

    Python API:

    (Red โŒ features are still unfinished and will be released in later versions)

    opened by pirate 46
  • Error on Windows 10 when adding URL: UnicodeEncodeError: 'charmap' codec can't encode: character maps to <undefined>


    [i] [2021-03-27 04:40:48] ArchiveBox v0.5.4: archivebox add https://youtube.com/
        > E:\ArchiveBox
    
    [!] Warning: Missing 6 recommended dependencies
        ! WGET_BINARY: wget (unable to detect version)
        ! SINGLEFILE_BINARY: single-file (unable to detect version)
          Hint: npm install --prefix . "git+https://github.com/ArchiveBox/ArchiveBox.git"
                or archivebox config --set SAVE_SINGLEFILE=False to silence this warning
    
        ! READABILITY_BINARY: readability-extractor (unable to detect version)
          Hint: npm install --prefix . "git+https://github.com/ArchiveBox/ArchiveBox.git"
                or archivebox config --set SAVE_READABILITY=False to silence this warning
    
        ! MERCURY_BINARY: mercury-parser (unable to detect version)
          Hint: npm install --prefix . "git+https://github.com/ArchiveBox/ArchiveBox.git"
                or archivebox config --set SAVE_MERCURY=False to silence this warning
    
        ! CHROME_BINARY: unable to find binary (unable to detect version)
        ! RIPGREP_BINARY: rg (unable to detect version)
    
    [+] [2021-03-27 04:40:52] Adding 1 links to index (crawl depth=0)...
        > Saved verbatim input to sources/E:\ArchiveBox\sources\1616820052-import.txt
        > Parsed 1 URLs from input (Plain Text)
        > Found 1 new URLs not already in index
    
    [*] [2021-03-27 04:40:52] Writing 1 links to main index...
        √ E:\ArchiveBox\index.sqlite3
    
    [▶] [2021-03-27 04:40:52] Starting archiving of 1 snapshots in index...
        ! Failed to archive link: UnicodeEncodeError: 'charmap' codec can't encode character '\u25be' in position 9443: character maps to <undefined>
    
    Traceback (most recent call last):
      File "d:\python\lib\runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "d:\python\lib\runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "D:\Python\Scripts\archivebox.exe\__main__.py", line 7, in <module>
        from .cli import main
      File "d:\python\lib\site-packages\archivebox\cli\__init__.py", line 129, in main
        run_subcommand(
      File "d:\python\lib\site-packages\archivebox\cli\__init__.py", line 69, in run_subcommand
        module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
      File "d:\python\lib\site-packages\archivebox\cli\archivebox_add.py", line 85, in main
        add(
      File "d:\python\lib\site-packages\archivebox\util.py", line 112, in typechecked_function
        return func(*args, **kwargs)
      File "d:\python\lib\site-packages\archivebox\main.py", line 592, in add
        archive_links(new_links, overwrite=False, **archive_kwargs)
      File "d:\python\lib\site-packages\archivebox\util.py", line 112, in typechecked_function
        return func(*args, **kwargs)
      File "d:\python\lib\site-packages\archivebox\extractors\__init__.py", line 173, in archive_links
        archive_link(to_archive, overwrite=overwrite, methods=methods, out_dir=Path(link.link_dir))
      File "d:\python\lib\site-packages\archivebox\util.py", line 112, in typechecked_function
        return func(*args, **kwargs)
      File "d:\python\lib\site-packages\archivebox\extractors\__init__.py", line 95, in archive_link
        write_link_details(link, out_dir=out_dir, skip_sql_index=False)
      File "d:\python\lib\site-packages\archivebox\util.py", line 112, in typechecked_function
        return func(*args, **kwargs)
      File "d:\python\lib\site-packages\archivebox\index\__init__.py", line 333, in write_link_details
        write_html_link_details(link, out_dir=out_dir)
      File "d:\python\lib\site-packages\archivebox\util.py", line 112, in typechecked_function
        return func(*args, **kwargs)
      File "d:\python\lib\site-packages\archivebox\index\html.py", line 79, in write_html_link_details
        atomic_write(str(Path(out_dir) / HTML_INDEX_FILENAME), rendered_html)
      File "d:\python\lib\site-packages\archivebox\util.py", line 112, in typechecked_function
        return func(*args, **kwargs)
      File "d:\python\lib\site-packages\archivebox\system.py", line 47, in atomic_write
        f.write(contents)
      File "d:\python\lib\encodings\cp1252.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\u25be' in position 9443: character maps to <undefined>
    
    is: bug difficulty: easy 
    opened by Leontking 41
  • Bugfix: docker-compose instructions create a sonic container that fails to start


    Describe the bug

    I followed the docker-compose instructions from the README. This is the result:

    [[email protected] archivebox]# docker-compose ps
             Name                        Command                State             Ports
    --------------------------------------------------------------------------------------------
    archivebox_archivebox_1   dumb-init -- /app/bin/dock ...   Up         0.0.0.0:8000->8000/tcp
    archivebox_sonic_1        sonic -c /etc/sonic.cfg          Exit 101
    
    [[email protected] archivebox]# docker-compose logs sonic
    Attaching to archivebox_sonic_1
    sonic_1       | thread 'main' panicked at 'cannot read config file: Os { code: 21, kind: Other, message: "Is a directory" }', src/config/reader.rs:24:14
    sonic_1       | note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    sonic_1       | thread 'main' panicked at 'cannot read config file: Os { code: 21, kind: Other, message: "Is a directory" }', src/config/reader.rs:24:14
    sonic_1       | note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    

    Search seems to work anyway.

    I would expect one of:

    a. sonic container is not created by default if it requires the user to manually create a config and is not necessary to run ArchiveBox
    b. config.cfg is created for me by the init script, using the environment variable I set in the docker-compose file
    c. config.cfg is not required by sonic (however, this is not the case: https://github.com/valeriansaliou/sonic/issues/197)

    Steps to reproduce

    From the README:

    # create a new empty directory and initalize your collection (can be anywhere)
    mkdir ~/archivebox && cd ~/archivebox
    curl -O https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/master/docker-compose.yml
    docker-compose run archivebox init
    docker-compose run archivebox --version
    
    # start the webserver and open the UI (optional)
    docker-compose run archivebox manage createsuperuser
    docker-compose up -d
    open http://127.0.0.1:8000
    
    # you can also add links and manage your archive via the CLI:
    docker-compose run archivebox add 'https://example.com'
    docker-compose run archivebox status
    docker-compose run archivebox help  # to see more options
    

    ArchiveBox version

    [[email protected] archivebox]# docker-compose run archivebox --version
    Starting archivebox_sonic_1 ... done
    Creating archivebox_archivebox_run ... done
    ArchiveBox v0.5.3
    Cpython Linux Linux-5.9.1-arch1-1-x86_64-with-glibc2.28 x86_64 (in Docker)
    
    [i] Dependency versions:
     √  ARCHIVEBOX_BINARY     v0.5.3          valid     /usr/local/bin/archivebox
     √  PYTHON_BINARY         v3.9.1          valid     /usr/local/bin/python3.9
     √  DJANGO_BINARY         v3.1.3          valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py
     √  CURL_BINARY           v7.64.0         valid     /usr/bin/curl
     √  WGET_BINARY           v1.20.1         valid     /usr/bin/wget
     √  NODE_BINARY           v15.5.1         valid     /usr/bin/node
     √  SINGLEFILE_BINARY     v0.1.14         valid     /node/node_modules/single-file/cli/single-file
     √  READABILITY_BINARY    v0.1.0          valid     /node/node_modules/readability-extractor/readability-extractor
     √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js
     √  GIT_BINARY            v2.20.1         valid     /usr/bin/git
     √  YOUTUBEDL_BINARY      v2021.01.03     valid     /usr/local/bin/youtube-dl
     √  CHROME_BINARY         v87.0.4280.88   valid     /usr/bin/chromium
     √  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg
    
    [i] Source-code locations:
     √  PACKAGE_DIR           22 files        valid     /app/archivebox
     √  TEMPLATES_DIR         3 files         valid     /app/archivebox/themes
    
    [i] Secrets locations:
     -  CHROME_USER_DATA_DIR  -               disabled
     -  COOKIES_FILE          -               disabled
    
    [i] Data locations:
     √  OUTPUT_DIR            6 files         valid     /data
     √  SOURCES_DIR           1 files         valid     ./sources
     √  LOGS_DIR              0 files         valid     ./logs
     √  ARCHIVE_DIR           1 files         valid     ./archive
     √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf
     √  SQL_INDEX             204.0 KB        valid     ./index.sqlite3
    
    [[email protected] archivebox]# docker version
    Client:
     Version:           20.10.2
     API version:       1.40
     Go version:        go1.15.6
     Git commit:        2291f610ae
     Built:             Tue Jan  5 19:56:21 2021
     OS/Arch:           linux/amd64
     Context:           default
     Experimental:      true
    
    Server:
     Engine:
      Version:          19.03.13-ce
      API version:      1.40 (minimum version 1.12)
      Go version:       go1.15.2
      Git commit:       4484c46d9d
      Built:            Sat Sep 26 12:03:35 2020
      OS/Arch:          linux/amd64
      Experimental:     false
     containerd:
      Version:          v1.4.1.m
      GitCommit:        c623d1b36f09f8ef6536a057bd658b3aa8632828.m
     runc:
      Version:          1.0.0-rc92
      GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
     docker-init:
      Version:          0.19.0
      GitCommit:        de40ad0
    
    [[email protected] archivebox]# docker-compose version
    docker-compose version 1.27.4, build 40524192
    docker-py version: 4.3.1
    CPython version: 3.7.7
    OpenSSL version: OpenSSL 1.1.0l  10 Sep 2019
    
    is: bug difficulty: easy status: done touches: documentation 
    opened by JohnMaguire 28
  • Question: ... How to fix Permission denied: '/data'

    Question: ... How to fix Permission denied: '/data'

    I'm following the setup instructions using docker-compose.

    When I run docker-compose run archivebox init I get

    [i] [2020-11-16 13:38:31] ArchiveBox v0.4.21: archivebox init
        > /data
    
    Traceback (most recent call last):
      File "/usr/local/bin/archivebox", line 33, in <module>
        sys.exit(load_entry_point('archivebox', 'console_scripts', 'archivebox')())
      File "/app/archivebox/cli/__init__.py", line 123, in main
        run_subcommand(
      File "/app/archivebox/cli/__init__.py", line 63, in run_subcommand
        module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
      File "/app/archivebox/cli/archivebox_init.py", line 33, in main
        init(
      File "/app/archivebox/util.py", line 113, in typechecked_function
        return func(*args, **kwargs)
      File "/app/archivebox/main.py", line 259, in init
        is_empty = not len(set(os.listdir(out_dir)) - ALLOWED_IN_OUTPUT_DIR)
    PermissionError: [Errno 13] Permission denied: '/data'
    

    Please how can I fix this?
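
    One way to narrow this down is to compare the owner of the mounted folder with the user the process runs as. A minimal diagnostic sketch for Linux/macOS (the helper name is hypothetical and not part of ArchiveBox; run it against the host folder that is mounted as /data):

```python
import os


def diagnose_data_dir(path):
    """Collect ownership and access facts for a data directory."""
    st = os.stat(path)
    return {
        "owner_uid": st.st_uid,
        "owner_gid": st.st_gid,
        "current_uid": os.getuid(),
        # Listing a directory needs both read and execute permission on it.
        "listable": os.access(path, os.R_OK | os.X_OK),
        "writable": os.access(path, os.W_OK),
    }


print(diagnose_data_dir("."))
```

    If "listable" is False, os.listdir() on that folder would fail with the same PermissionError shown in the traceback.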

    is: bug touches: config difficulty: easy status: done 
    opened by Prn-Ice 27
  • Discussion: new name!

    Discussion: new name!

    Hey everyone! I have a big refactor in the works with some breaking changes, and I thought I'd take this opportunity to re-release BA with a better name and a 1.0 version. The new release modularizes BA into a python package, which lets people import individual parts for their own uses (e.g. parsers, link archiving, screenshotting, indexing). It fixes a lot of the bad decisions I made early on (e.g. using timestamps as unique keys instead of sha256 hashes of the URLs). It also adds a backend with a web GUI for searching and adding imports.
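
    To illustrate the sha256 idea mentioned above: a key derived from the URL is stable across imports, whereas a timestamp depends on when the link happened to be added. A minimal sketch (the function name is hypothetical, and the exact normalization the refactor uses is not specified here):

```python
import hashlib


def url_key(url):
    """Derive a stable unique key from a URL by hashing its UTF-8 bytes."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


print(url_key("https://example.com"))
```

    The same URL always yields the same 64-character hex key, so re-importing a link cannot produce a second entry under a different key.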

    The new name should be easy to find and type in a python packaging context and should be related to web archiving somehow.

    Requirements for a new name:

    • one word
    • no symbols or spaces (since it's going to be imported as a python package like from webfreeze.pocket import parse_links)
    • should be 1st in google results when released with a new name (i.e. no competing projects/keywords)
    • should be intuitively related to web archiving

    Potential ideas:

    • WebFreeze
    • Freezekit
    • ArchiveKit
    • WebCooler

    Comment with your name suggestions/ideas!

    status: idea phase 
    opened by pirate 25
  • WIP: Create python package from repository

    WIP: Create python package from repository

    This will create a python package installable using pip.

    The package can be later published on pypi for easier access.

    Before merging I would squash everything into one commit if approved.

    Scripts

    The installation provides an archive command that will be available from the shell and will execute the archive.py script.

    Setup

    The important part is the setup.py file as it contains metadata and instructions for pip.

    I filled it with the information I could find and it should be ok but as you are the author please review it.

    config.py

    As this file is considered editable by the user maybe we should move it somewhere suitable (~/.config/bookmark-archiver/config.py) and access it at runtime.

    opened by edoput 23
  • Full-text search

    Full-text search

    Summary

    This PR adds the ability to do full-text search 🎉

    Related issues

    #22 #24

    Changes these areas

    • [ ] Bugfixes
    • [x] Feature behavior
    • [ ] Command line interface
    • [ ] Configuration options
    • [x] Internal architecture
    • [ ] Snapshot data layout on disk
    opened by jdcaballerov 19
  • Link parsing: Pinboard private feeds don't seem to get parsed properly

    Link parsing: Pinboard private feeds don't seem to get parsed properly

    I would love to have the cron job that monitors my Pocket feed also monitor my private Pinboard feed. However, no matter which method from the instructions I use to pass the feed to bookmark-archiver, each has its own unique failure.

    If I pass a public feed, like http://feeds.pinboard.in/rss/u:username/, it works fine. But if I pass a private feed, like https://feeds.pinboard.in/rss/secret:xxxx/u:username/private/, it errors out. I have tried the RSS, JSON, and Text feeds, and none work.

    Examples here (I've simply replaced the actual feed I used to test with the demo URL Pinboard provides):

    ./archive "https://feeds.pinboard.in/rss/secret:xxxx/u:username/private/"

    [*] [2018-10-18 21:14:03] Downloading https://feeds.pinboard.in/rss/secret:xxxx/u:username/private/ > output/sources/feeds.pinboard.in-1539897243.txt
    [X] No links found :(
    

    ./archive "https://feeds.pinboard.in/json/secret:xxxx/u:username/private/"

    [*] [2018-10-18 21:13:46] Downloading https://feeds.pinboard.in/json/secret:xxxx/u:username/private/ > output/sources/feeds.pinboard.in-1539897226.txt
    Traceback (most recent call last):
      File "./archive", line 161, in <module>
        links = merge_links(archive_path=out_dir, import_path=source)
      File "./archive", line 53, in merge_links
        raw_links = parse_links(import_path)
      File "/home/USERNAME/datahoarding/bookmark-archiver/archiver/parse.py", line 54, in parse_links
        links += list(parser_func(file))
      File "/home/USERNAME/bookmark-archiver/archiver/parse.py", line 108, in parse_json_export
        url = erg['url']
    KeyError: 'url'
    

    ./archive "https://feeds.pinboard.in/text/secret:xxxx/u:username/private/"

    [*] [2018-10-18 21:17:57] Downloading https://feeds.pinboard.in/text/secret:xxxx/u:username/private/ > output/sources/feeds.pinboard.in-1539897477.txt
    [X] No links found :(
    

    Even though the script says that links are not found, they are definitely there, and simply pasting the URL into a browser outputs the feed in the proper format. I used this script successfully with other methods, like the Pinboard manual export, Pocket manual export AND RSS feed, and browser export. Is this just not a supported method for importing/monitoring?
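
    The KeyError in the JSON case suggests that entries in this feed do not carry a 'url' key. A defensive sketch that tries several candidate key names before giving up; the alternative names ('href', 'u') are assumptions for illustration, not confirmed Pinboard field names, and extract_url is a hypothetical helper:

```python
def extract_url(entry, candidate_keys=("url", "href", "u")):
    """Return the first non-empty value found under any candidate key, else None."""
    for key in candidate_keys:
        value = entry.get(key)
        if value:
            return value
    return None


print(extract_url({"href": "https://example.com"}))
```

    A parser written this way can skip unparseable entries instead of crashing on the first one.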

    is: bug status: needs followup 
    opened by drpfenderson 19
  • Running `archivebox init` via pip install on Windows 10 triggers

    Running `archivebox init` via pip install on Windows 10 triggers "File not found" error

    I'm on Windows 10. I tried to install archivebox from pip, but after I did "npm install --prefix . 'git+https://github.com/ArchiveBox/ArchiveBox.git'", and did archivebox init, it gave a "File Not Found" Error

    is: bug 
    opened by DUOLabs333 18
  • Pocket and Pinboard imports causing tags to be split incorrectly into individual characters w/ broken hyphenation

    Pocket and Pinboard imports causing tags to be split incorrectly into individual characters w/ broken hyphenation

    A simple bug due to a .split() or set() somewhere on the tags_str instead of the tags list. Should be easy to fix.

    image

    We should also add a filter to prevent emptystring / whitespace-only tags: image
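
    A minimal sketch of both points, assuming tags arrive as one comma-separated string (parse_tags is a hypothetical helper, not the actual ArchiveBox function):

```python
def parse_tags(tags_str):
    """Split a comma-separated tag string, dropping empty and whitespace-only tags."""
    return sorted({tag.strip() for tag in tags_str.split(",") if tag.strip()})


# The bug: calling set() on the string itself iterates over its characters.
broken = sorted(set("ab"))
fixed = parse_tags("python, web-dev, , ")
print(broken, fixed)
```

    Splitting first and building the set from the resulting list keeps hyphenated tags such as web-dev intact.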

    is: bug difficulty: easy good first ticket help wanted 
    opened by pirate 17
  • Add parser for Pocket API

    Add parser for Pocket API

    Pass a URL like pocket://Username to import that username's archived Pocket library. Tokens need to be stored in ArchiveBox.conf with the following keys:

    POCKET_CONSUMER_KEY = key-from-custom-pocket-app
    POCKET_ACCESS_TOKENS = {"YourUsername": "pocket-token-for-app"}
    

    POCKET_ACCESS_TOKENS MUST be on a single line, or the JSON will be misinterpreted by the parser as a new key/value pair.
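
    To see why the single-line requirement matters, here is a minimal sketch of decoding the value as JSON; parse_access_tokens is a hypothetical helper, and the real config loader may work differently:

```python
import json


def parse_access_tokens(raw_value):
    """Decode a single-line JSON object mapping usernames to tokens."""
    tokens = json.loads(raw_value)
    if not isinstance(tokens, dict):
        raise ValueError("POCKET_ACCESS_TOKENS must be a JSON object")
    return tokens


print(parse_access_tokens('{"YourUsername": "pocket-token-for-app"}'))
```

    If the value were wrapped across lines and only the first line (for example a lone "{") reached the decoder, json.loads would raise an error instead of returning the mapping.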

    Summary

    I'm not 100% sure this is the right implementation, but in my experience with the API, it's more reliable and complete than the feed export. It would be nice to use this as a feed source, but the last since value would need to be persisted somewhere.

    Related issues

    None, I wrote this to import my Pocket library locally and was wondering if this would be useful.

    Changes these areas

    • [ ] Bugfixes
    • [x] Feature behavior
    • [ ] Command line interface
    • [ ] Configuration options
    • [ ] Internal architecture
    • [ ] Snapshot data layout on disk
    opened by mAAdhaTTah 17
  • Feature Request: support tag slug non english

    Feature Request: support tag slug non english

    Type

    • [ ] General question or discussion
    • [ ] Propose a brand new feature
    • [X] Request modification of existing behavior or design

    What is the problem that your feature request solves

    I want to write tags in non-English languages.

    Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

    What hacks or alternative solutions have you tried to solve the problem?

    How badly do you want this new feature?

    • [ ] It's an urgent deal-breaker, I can't live without it
    • [ ] It's important to add it in the near-mid term future
    • [X] It would be nice to have eventually

    • [ ] I'm willing to contribute dev time / money to fix this issue
    • [ ] I like ArchiveBox so far / would recommend it to a friend
    • [ ] I've had a lot of difficulty getting ArchiveBox set up
    status: idea phase 
    opened by green1052 0
  • URL Length seems to be limited to 200 characters

    URL Length seems to be limited to 200 characters

    Describe the bug

    If you try to edit a snapshot (e.g. if it failed to process fully or needs a title) whose original URL is longer than 200 characters, the edits are not saved, and an error says that the URL is too long.

    Steps to reproduce

    Save a snapshot of a URL of greater than 200 chars, then try to edit it and save the edits. Save will fail with the error.
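
    For context, Django's URLField defaults to max_length=200, which matches the limit described here; whether that default is the actual cause in ArchiveBox is an assumption. A tiny sketch for finding affected URLs ahead of time (the helper and constant are hypothetical):

```python
DEFAULT_URL_MAX_LENGTH = 200  # Django's default max_length for URLField


def url_too_long(url, max_length=DEFAULT_URL_MAX_LENGTH):
    """Return True when a URL would exceed the given length limit."""
    return len(url) > max_length


long_url = "https://example.com/?q=" + "x" * 200
print(url_too_long(long_url))
```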

    Screenshots or log output

    Can do if needed, but it's fairly easy to reproduce.

    ArchiveBox version

    v0.6.3

    opened by prgarnett 0
  • FreeBSD install instructions need a bit of TLC

    FreeBSD install instructions need a bit of TLC

    Wiki Page URL

    https://github.com/ArchiveBox/ArchiveBox/wiki/Install

    Suggested Edit

    FreeBSD install instructions need a bit of TLC:

    pkg install python git wget curl youtube_dl ripgrep py39-pip py39-sqlite3 npm ffmpeg
    pkg install chromium
    
    opened by mwestza 0
  • Bug: update and list verbs are very slow to start

    Bug: update and list verbs are very slow to start

    Describe the bug

    On a large archive, archivebox update or archivebox list immediately starts a CPU-intensive process that takes a long time to complete.

    Steps to reproduce

    1. Have a large archive (I have 1800+ links)
    2. archivebox update or list

    Screenshots or log output

    docker-compose run archivebox update
    [i] [2022-12-21 23:02:15] ArchiveBox v0.6.2: archivebox update
        > /data
    
    [▶] [2022-12-21 23:05:07] Starting archiving of 1847 snapshots in index...
    

    The user is simply left waiting with no explanation and has to take it on faith that something is happening. Fans are screeching; it could be an infinite loop. If the user waits, the wait is about 3 minutes. If they kill the program instead (who could blame them), the traceback shows that the suspicion of an infinite loop was wrong:

    Traceback (most recent call last):
      File "/usr/local/bin/archivebox", line 33, in <module>
        sys.exit(load_entry_point('archivebox', 'console_scripts', 'archivebox')())
      File "/app/archivebox/cli/__init__.py", line 140, in main
        run_subcommand(
      File "/app/archivebox/cli/__init__.py", line 80, in run_subcommand
        module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
      File "/app/archivebox/cli/archivebox_update.py", line 119, in main
        update(
      File "/app/archivebox/util.py", line 114, in typechecked_function
        return func(*args, **kwargs)
      File "/app/archivebox/main.py", line 788, in update
        matching_folders = list_folders(
      File "/app/archivebox/util.py", line 114, in typechecked_function
        return func(*args, **kwargs)
      File "/app/archivebox/main.py", line 929, in list_folders
        return STATUS_FUNCTIONS[status](links, out_dir=out_dir)
      File "/app/archivebox/index/__init__.py", line 411, in get_indexed_folders
        links = [snapshot.as_link_with_details() for snapshot in snapshots.iterator()]
      File "/app/archivebox/index/__init__.py", line 411, in <listcomp>
        links = [snapshot.as_link_with_details() for snapshot in snapshots.iterator()]
      File "/app/archivebox/core/models.py", line 127, in as_link_with_details
        return load_link_details(self.as_link())
      File "/app/archivebox/util.py", line 114, in typechecked_function
        return func(*args, **kwargs)
      File "/app/archivebox/index/__init__.py", line 348, in load_link_details
        existing_link = parse_json_link_details(out_dir)
      File "/app/archivebox/util.py", line 114, in typechecked_function
        return func(*args, **kwargs)
      File "/app/archivebox/index/json.py", line 110, in parse_json_link_details
        return Link.from_json(link_json, guess)
      File "/app/archivebox/index/schema.py", line 246, in from_json
        cast_result = ArchiveResult.from_json(json_result, guess)
      File "/app/archivebox/index/schema.py", line 97, in from_json
        info['end_ts'] = parse_date(info['end_ts'])
      File "/app/archivebox/util.py", line 114, in typechecked_function
        return func(*args, **kwargs)
      File "/app/archivebox/util.py", line 157, in parse_date
        return dateparser(date, settings={'TIMEZONE': 'UTC'}).replace(tzinfo=timezone.utc)
      File "/usr/local/lib/python3.9/site-packages/dateparser/conf.py", line 89, in wrapper
        return f(*args, **kwargs)
      File "/usr/local/lib/python3.9/site-packages/dateparser/__init__.py", line 54, in parse
        data = parser.get_date_data(date_string, date_formats)
      File "/usr/local/lib/python3.9/site-packages/dateparser/date.py", line 421, in get_date_data
        parsed_date = _DateLocaleParser.parse(
      File "/usr/local/lib/python3.9/site-packages/dateparser/date.py", line 178, in parse
        return instance._parse()
      File "/usr/local/lib/python3.9/site-packages/dateparser/date.py", line 182, in _parse
        date_data = self._parsers[parser_name]()
      File "/usr/local/lib/python3.9/site-packages/dateparser/date.py", line 196, in _try_freshness_parser
        return freshness_date_parser.get_date_data(self._get_translated_date(), self._settings)
      File "/usr/local/lib/python3.9/site-packages/dateparser/freshness_date_parser.py", line 159, in get_date_data
        date, period = self.parse(date_string, settings)
      File "/usr/local/lib/python3.9/site-packages/dateparser/freshness_date_parser.py", line 88, in parse
        self.now = apply_timezone(_now, settings.TIMEZONE)
      File "/usr/local/lib/python3.9/site-packages/dateparser/utils/__init__.py", line 115, in apply_timezone
        new_datetime = apply_dateparser_timezone(date_time, tz_string)
      File "/usr/local/lib/python3.9/site-packages/dateparser/utils/__init__.py", line 103, in apply_dateparser_timezone
        if info['regex'].search(' %s' % offset_or_timezone_abb):
    KeyboardInterrupt
    

    Long story short, I snooped around after setting up the dev environment. Before telling the user anything, archivebox iterates over every matching link, all 1847 of them. But that's not what's slow! It's a particular function, merge_links(), that, when run 1847 times, adds up to a lot of waiting.

    merge_links() is called by load_link_details(), seemingly to combine the information from disk about the link we're currently processing, and prettify it too. So far so good, but why are we doing this in bulk? For example, archivebox list is going to iterate on each link to print it, so why not "merge" as you roll? Do we really need a complete list of merged links before doing anything? Perhaps I'm not seeing the entire picture though...
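
    A minimal sketch of the "merge as you roll" idea, using a generator so the first link can be printed before the rest are loaded. load_details here is a hypothetical stand-in for the expensive per-link work, not the real load_link_details():

```python
calls = []  # records which links have been processed so far


def load_details(link):
    """Stand-in for the expensive per-link merge step."""
    calls.append(link["url"])
    return {**link, "details_loaded": True}


def iter_links_with_details(links):
    """Yield links one at a time, loading details only when each is requested."""
    for link in links:
        yield load_details(link)


for link in iter_links_with_details([{"url": "https://example.com"}]):
    print(link["url"])
```

    With this shape the total work is unchanged, but output starts right away instead of after the whole list has been built.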

    opened by notevenaperson 0
  • Question: ...What's with the users?

    Question: ...What's with the users?

    Hi,

    first of all, thank you for this cool project!

    I was able to easily install it using docker-compose. And I could also register as admin and do some test archivals.

    But I am failing to set up a user to use for actual archival jobs. What I mean is, I set up the user on the GUI, log out as admin, try to log in as the user but get an error message telling me that the user or password are incorrect (and that both could be case sensitive).

    Well, I double and triple checked the name and the password - still nothing. I reset the password using the CLI - still nothing. I tried making the user "staff" - still nothing. I tried making the user "superuser" - still nothing. I wanted to try and add the user to some group - but I can't find any group and can't create any.

    So can anybody tell me why my user can't log in?

    And what does "staff" mean? Or, to turn it around: what would be the use for a user that is not staff and, hence, cannot log in? What could such a user do? And how do I create groups?

    Thanks!

    opened by gitwittidbit 1
Releases(v0.6.2)
  • v0.6.2(Apr 10, 2021)

    New features

    • new ArchiveResult log in the admin web UI, with full editing ability of individual extractor outputs + list of outputs under each Snapshot admin entry
    • ability to save multiple snapshots of the same URL over time using new Re-snapshot button
    • add init --quick and server --quick-init options to quickly update the db version without doing a full re-init (for users with large archive collections this will make version upgrades a lot faster / less painful)
    • add new archivebox setup command and archivebox init --setup flag to aid in automatically installing dependencies and creating a superuser during initial setup
    • new SNAPSHOTS_PER_PAGE=40 and MEDIA_MAX_SIZE=750m config options
    • allow hotlinking directly to specific extractor output on the snapshot detail page using URL #hash e.g. /archive/<timestamp>/index.html#git
    • add ability to view the snapshot matching a given URL by visiting /archive/https://example.com/some/url -> redirects to -> /archive/<timestamp>/index.html (also works without scheme /archive/example.com)
    • #660 add ability to tag URLs while adding them via the web UI and via the CLI using archivebox add --tag=tag1,tag2,tag3 ...
    • #659 add back ability to override visual styling with custom HTML and CSS using new config option CUSTOM_TEMPLATES_DIR
    • ability to add and remove multiple tags at once from the snapshot admin using autocompleting dropdown
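
    The URL-based lookup in the list above (/archive/https://example.com/some/url redirecting to /archive/<timestamp>/index.html) can be pictured with a small sketch. The function, the dict-based index, and the scheme-guessing order are all assumptions for illustration, not the actual ArchiveBox view code:

```python
def find_snapshot_path(requested, snapshots):
    """Map a requested URL (with or without scheme) to a snapshot path, else None.

    snapshots is a dict of archived URL -> timestamp string.
    """
    candidates = [requested]
    if "://" not in requested:
        # No scheme given: try the common ones in turn.
        candidates += ["https://" + requested, "http://" + requested]
    for url in candidates:
        timestamp = snapshots.get(url)
        if timestamp is not None:
            return f"/archive/{timestamp}/index.html"
    return None


index = {"https://example.com": "1617000000"}
print(find_snapshot_path("example.com", index))
```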

    Enhancements

    • lots of performance improvements! (in testing with 100k entries, the main index was brought down from 10-14 second load times to ~110ms once cache warms up)
    • full text search now works on the public snapshot list
    • dates and times are now localized to your browser's timezone instead of showing in UTC
    • integrity and correctness improvements to readability, mercury, warc, and other extractors
    • video subtitles and description are now added to the full-text search index as well (including youtube's autogenerated transcripts in all languages)
    • log all errors with full tracebacks to new data/logs/errors.log file (so users no longer have to run in --debug mode to see error details)
    • better archivebox schedule logging and changed logfile location to ./logs/schedule.log
    • better docker-compose setup experience with sonic config example in docker-compose.yml
    • add Django Debug Toolbar + djdt_flamegraph for developers to profile UI performance
    • add --overwrite flag support to archivebox schedule, archived urls get added similarly to add --overwrite
    • #644 remove bootstrap and jquery network requests to CDNs by inlining them instead
    • #647 allow filtering by ArchiveResult status in the Snapshot admin UI to select only links that have been archived or not archived
    • #550 kill all orphan child processes after each extractor finishes to prevent dangling chromium/node subprocesses and memory leaks
    • 3276434 add new SEARCH_BACKEND_TIMEOUT config option to tune amount of time search backend can take before it gives up
    • more diagnostic info added to the Snapshot admin view including most recent status code, content type, detected server, etc
    • make the order of the table columns, layout, and spacing the same on the public view and private view (also remove DataTable, we're not using it)
    • better snapshot grid page (faster load times, nicer CSS for tags and cards, more actions supported and metadata shown)
    • added Cache-Control headers to dramatically speed up load times by caching favicons, screenshots, etc. in browsers/upstreams
    • new project releases page https://releases.archivebox.io and demo url https://demo.archivebox.io

    Bugfixes

    • #673 fix searching by URL substring in Snapshot admin list
    • #658 fix Snapshot admin action buttons not working in Safari and some other browsers
    • #678 fix AssertionError when archivebox would attempt to archive with CHROME_BINARY=None when Chrome was not found on host system
    • #654 fix some issues with sonic attempting to index massive text blobs or binary blobs on some pages and hanging
    • #674 fix UTF-8 encoding problems with file reading/writing on Windows (supporting a Python pkg on Windows is unreasonably painful y'all)
    • #433 fix deleted items sometimes reappearing on next import/update
    • #473 fix issue preventing use of archivebox python API inside raw REPL (not using archivebox shell)
    • fix stdin/stdout/stderr handling for some edge cases in Docker/Docker-Compose

    image image

    Source code(tar.gz)
    Source code(zip)
    archivebox--0.6.2-1.big_sur.bottle.tar.gz(11.46 MB)
    archivebox-0.6.2-py3-none-any.whl(477.89 KB)
    archivebox-0.6.2.tar.gz(403.89 KB)
    archivebox_0.6.2-1_all.deb(281.89 KB)
    Electron-ArchiveBox-macOS-x64-0.6.2.app.zip(76.54 MB)
  • v0.5.6(Feb 9, 2021)

    • add ARMv7 and ARMv8 CPU support for apt / deb distribution on Launchpad PPA
    • fix nodesource apt repo not supported on i386 b90afc8
    • fix handling of skipped ArchiveResult entries with null output 0aea5ed
    • catch exception on import of old index.json into ArchiveResult 171bbeb
    • move debsign to release not build 66fb5b2
    • skip tests during debian build a32eac3
    • fix emptystrings in cmd_version causing exception a49884a
    • automate deb dist better and bump version 0e6ac39
    • fix assertion 6705354
    • change wording of db not found error 683a087
    Source code(tar.gz)
    Source code(zip)
  • v0.5.4(Feb 1, 2021)

    Thank you contributors who helped with the 181 commits in this release!
    @cdvv7788, @jdcaballerov, @thedanbob, @aggroskater, @mAAdhaTTah, @mario-campos, @mikaelf

    • fix migration failing due to null cmd_versions in older archives a3008c8
    • Publish minor & major versions to DockerHub and set up CodeQL codeql-analysis.yml c5b7d9f, bbb6cc8
    • fix DATABASE_NAME posixpath, and dependencies dict bug 02bdb3b, 5c7842f
    • use relative imports for .util to fix windows import clash 72e2c7b
    • fix COOKIES_FILE config param breaking in wget ef7711f
    • Refactor should_save_extractor methods to accept overwrite parameter 5420903
    • Fix issue #617 by using mark_safe in combination with format_html … 1989275
    • make permission chowning on docker start less fancy, respect PUID/PGID #635
    • add createsuperuser flag to server command 39ec77e
    • fix files icons styling and use the db exclusively for rendering them, instead of filesystem f004058, 7d8fe66, 5c54bcc, 534ead2
    • limit youtubedl download size to 750m and stop splitting out audio files 3227f54
    • also search url, timestamp, tags on public index 8a4edb4
    • fix trailing slash problems and wget not detecting download path 9764a8e
    • add response status code to headers.json c089501
    • fix singlefile path used for sonic 24e2493
    • cleanup template layout in filesystem, new snapshot detail page UI
    Screen Shot 2021-01-30 at 9 53 22 p
    Source code(tar.gz)
    Source code(zip)
    archivebox-0.5.4-py3-none-any.whl(385.10 KB)
    archivebox_0.5.4-1_all.deb(235.85 KB)
  • v0.5.3(Jan 6, 2021)

    • ArchiveResult moved to SQLite3 DB for performance @cdvv7788
    • lots of assorted bugfixes and improvements courtesy of @cdvv7788 and @jdcaballerov
    • new full-text search support with ripgrep and sonic courtesy of @jdcaballerov
    • new archivebox oneshot command for downloading a single site without starting a whole collection
    • new Pocket API importer courtesy of @mAAdhaTTah
    • new Wallabag importer courtesy of @ehainry
    • new extractor options on Add page courtesy of @BlipRanger
    • new apt/deb/homebrew/pip packaging setup into separate repos under new Github Org https://github.com/ArchiveBox
    • new official PPA and Docker Hub accounts https://hub.docker.com/r/archivebox/archivebox (with automatic armv7 builds courtesy of @chrismeller)
    • new Snapshot grid view courtesy of @jdcaballerov image
    Source code(tar.gz)
    Source code(zip)
  • v0.4.24(Dec 3, 2020)

  • v0.4.21(Aug 18, 2020)

  • v0.4.17(Aug 18, 2020)

    • Fix bugs with parsing long URLs as paths
    • html-encoded URLs
    • new generic HTML parser
    • new --init and --overwrite flags on add
    • improve stdout and hints
    • fix Pull title button
    • other small bugfixes
    Source code(tar.gz)
    Source code(zip)
  • v0.4.16(Aug 18, 2020)

  • v0.4.15(Aug 18, 2020)

    • fix a bug where invalid URLs were attempted to be parsed and imported, causing the whole archive process to crash
    • add support for scheduled archiving in docker
    docker run -v $PWD:/data archivebox schedule --foreground --every=day --depth=1 'https://getpocket.com/users/USERNAME/feed/all'
    
    # docker-compose.yml
    
    version: '3.7'
    
    services:
      archivebox:
        image: nikisweeting/archivebox:latest
        command: schedule --foreground --every=day --depth=1 'https://getpocket.com/users/USERNAME/feed/all'
        environment:
          - USE_COLOR=True
          - SHOW_PROGRESS=False
        volumes:
          - ./data:/data
    
    Source code(tar.gz)
    Source code(zip)
  • v0.4.14(Aug 14, 2020)

    Add support for the Readability article text extractor. It runs on the SingleFile, Wget, and DOM dump output by default, but if none of those are available it will download the article from scratch to do text extraction. This release also officially adds Docker support for ARM architectures, including the Raspberry Pi. The image size was also shrunk from 1.5GB to 452MB by making sure unnecessary build tools are uninstalled after the package build process.

    image

    Source code(tar.gz)
    Source code(zip)
  • v0.4.13(Aug 10, 2020)

  • v0.4.12(Aug 10, 2020)

  • v0.4.11(Aug 7, 2020)

    We add a major new archive method in this release: SingleFile. On bare metal it requires installing Node and Chrome/Chromium, but it works out-of-the-box in the Docker version.

    This finally allows ArchiveBox to pass all of the acid tests except one, and the archives for GitHub and many other sites are nicer than what Wget was able to produce on its own.

    Source code(tar.gz)
    Source code(zip)
  • v0.4.9(Jul 28, 2020)

    image

    🌅 v0.4 is officially released. This is a long-awaited 3rd-pass review over every corner of the archivebox UX. It addresses many of the fundamental shortcomings around index consistency by using a new SQLite database, with automatic migrations provided by django. It also smooths many of the rough edges, adds a new admin Web UI, a rich new CLI, closes 40+ github tickets, and is the first official release available on PyPI.

    • https://pypi.org/project/archivebox/ pip install archivebox
    • https://hub.docker.com/r/nikisweeting/archivebox docker run -v $PWD:/data nikisweeting/archivebox
    • https://archivebox.readthedocs.io/en/latest/
    • https://github.com/pirate/ArchiveBox/releases/tag/v0.4.9

    Enjoy!

    🎉 Big thanks to everyone who helped! Especially the Monadical team @cdvv7788 @apkallum @afreydev and also @drpfenderson, who helped us track down the last few index importing bugs! 🎉

    The docs still have some work left to finish updating, but the CLI help text is all up-to-date (when in doubt, just pass --help).
    Let us know if you find any rough edges here: https://github.com/pirate/ArchiveBox/issues/new/choose

    pip install archivebox
    
    cd path/to/your/archive/folder
    
    archivebox init  # this doubles as the migrate command, it will safely upgrade existing index files automatically
    archivebox add 'https://example.com'
    archivebox add 'https://getpocket.com/users/USERNAME/feed/all' --depth=1
    archivebox status
    archivebox server
    archivebox help
    

    Or, if you prefer Docker, the CLI works exactly the same way (archivebox [subcommand] [...args]):

    docker run -v $PWD:/data nikisweeting/archivebox init
    docker run -v $PWD:/data nikisweeting/archivebox add 'https://example.com'
    docker run -v $PWD:/data -p 8000:8000 nikisweeting/archivebox server

    Or with Docker Compose (docker-compose.yml):
    
    version: '3.7'
    
    services:
        archivebox:
            image: nikisweeting/archivebox:latest
            command: server 0.0.0.0:8000
            stdin_open: true
            tty: true
            ports:
                - 8000:8000
            environment:
                - USE_COLOR=True
            volumes:
                - ./data:/data
    

    New Features

    A bunch of big changes:

    • pip install archivebox is now available
    • full transition to a Django SQLite DB with migrations (making upgrades between versions much safer)
    • maintains an intuitive and helpful CLI that's backwards-compatible with all previous archivebox data versions
    • uses argparse instead of hand-written CLI system: see archivebox/cli/archivebox.py
    • new subcommands-based CLI for archivebox (see below)
    • new Web UI with pagination, better search, filtering, permissions, and more
    • 30+ assorted bugfixes, new features, and tickets closed
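
As a rough idea of what a subcommand-based CLI built on argparse looks like, here is a minimal sketch. It is not the contents of archivebox/cli/archivebox.py: the `build_parser` helper and the exact arguments are assumptions for illustration, modelled only on the `init` and `add ... --depth=1` commands shown earlier.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a small subcommand-style parser (illustrative only)."""
    parser = argparse.ArgumentParser(prog="archivebox")
    subparsers = parser.add_subparsers(dest="subcommand", required=True)

    # `init` takes no arguments in this sketch.
    subparsers.add_parser("init", help="initialize an archive in the current folder")

    # `add` takes a URL plus an optional crawl depth (the default is assumed).
    add = subparsers.add_parser("add", help="add a URL to the archive")
    add.add_argument("url")
    add.add_argument("--depth", type=int, default=0)

    return parser
```

Calling `build_parser().parse_args(["add", "https://example.com", "--depth=1"])` yields a namespace with `subcommand`, `url`, and `depth` set. With a layout like this, each subcommand also gets its own `--help` text from argparse.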

    For more info, see: https://github.com/pirate/ArchiveBox/wiki/Roadmap

    Released in this version:

    Install Methods:

    Command Line Interface:

    Web UI:

    • ✅ / Main index
    • ✅ /add Page to add new links to the archive (but needs improvement)
    • ✅ /archive/<timestamp>/ Snapshot details page
    • ✅ /archive/<timestamp>/<url> live wget archive of page
    • ✅ /archive/<timestamp>/<extractor> get a specific extractor output for a given snapshot
    • ✅ /archive/<url> shortcut to view most recent snapshot of given url
    • ✅ /archive/<url_hash> shortcut to view most recent snapshot of given url
    • ✅ /admin Admin interface to view and edit archive data
    • ✅ /old.html Backwards-compatible static HTML index for the previous version

    Python API:

    (Red โŒ features are still unfinished and will be released in later versions)

  • v0.2.4 (Feb 27, 2019)

    • better archive corruption guards (check structure invariants on every parse & save)
    • remove title prefetching in favor of new FETCH_TITLE archive method
    • slightly improved CLI output for parsing and remote url downloading
    • re-save index after archiving completes to update titles and urls
    • remove redundant derivable data from link json schema
    • markdown link parsing support
    • faster link parsing and better symbol handling using a new compiled URL_REGEX
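
The idea behind a compiled URL regex can be sketched as below. The pattern is a simplified stand-in chosen for illustration, not the project's actual URL_REGEX, and `find_urls` is a hypothetical helper.

```python
import re
from typing import List

# Simplified, hypothetical pattern; the project's real URL_REGEX may differ.
# It is compiled once at import time so repeated parsing does not rebuild it.
URL_REGEX = re.compile(r"https?://[^\s<>\"')\]]+", re.IGNORECASE)


def find_urls(text: str) -> List[str]:
    """Return every http(s) URL found in a blob of text, in order."""
    return URL_REGEX.findall(text)
```

Because the character class stops at quotes, angle brackets, and closing parentheses or brackets, the same pattern picks URLs out of both Markdown links and HTML attributes.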
  • v0.2.3 (Feb 19, 2019)

    • fixed issues with parsing titles including trailing tags
    • fixed issues with titles defaulting to URLs instead of attempting to fetch
    • fixed issue where bookmark timestamps from RSS would be ignored and current ts used instead
    • fixed issue where ONLY_NEW would overwrite existing links in archive with only new ones
    • fixed lots of issues with URL parsing by using urllib.parse instead of hand-written lambdas
    • ignore robots.txt when using wget (ssshhh don't tell anyone)
    • fix RSS parser bailing out when there's whitespace around XML tags
    • fix issue with browser history export trying to run ls on wrong directory
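
For the switch to urllib.parse, a small sketch of the kind of helper the standard library makes possible is shown below. The `url_parts` function and its returned keys are illustrative assumptions, not ArchiveBox's own helpers.

```python
from urllib.parse import urlparse


def url_parts(url: str) -> dict:
    """Split a URL into the pieces an archiver typically needs."""
    parsed = urlparse(url)
    return {
        "scheme": parsed.scheme,
        "domain": parsed.netloc,   # includes the port, if one is given
        "path": parsed.path,
        "query": parsed.query,
        "fragment": parsed.fragment,
    }
```

Letting the standard library handle ports, query strings, and fragments avoids the edge cases that hand-written string splitting tends to miss.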
  • v0.2.2 (Feb 7, 2019)

    This is a bugfix release: many parts of the parsing process have been improved or fixed.

    • Shaarli RSS export support
    • Fix issues with plain text link parsing including quotes, whitespace, and closing tags in URLs
    • add USER_AGENT to archive.org submissions so they can track archivebox usage
    • remove all icons similar to archive.org branding from archive UI
    • hide some of the noisier youtubedl and wget errors
    • set permissions on youtubedl media folder
    • fix chrome data dir incorrect path and quoting
    • better chrome binary finding
    • show which parser is used when importing links, show progress when fetching titles
  • v0.2.1 (Jan 11, 2019)

    This is a feature-packed release, so it's likely to be a little buggier than usual!

    New features:

    • ability to load any plain text list of links (also the new fallback for all parsers)
    • WARC file saving via wget: FETCH_WARC=True
    • Git repository downloading with git clone: FETCH_GIT=True GIT_DOMAINS=github.com,gitlab.com,bitbucket.org
    • Media downloading with youtube-dl: FETCH_MEDIA=True MEDIA_TIMEOUT=36000
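
The options above are plain environment variables. As a hedged sketch of how such values could be read, the helpers below parse booleans, comma-separated lists, and integers. The helper names are hypothetical, and the defaults simply mirror the example values in the list; they are assumptions, not the project's documented defaults.

```python
import os
from typing import List, Mapping


def env_bool(env: Mapping[str, str], key: str, default: bool) -> bool:
    """Interpret values such as 'True', 'true', '1' or 'yes' as True."""
    raw = env.get(key)
    if raw is None:
        return default
    return raw.strip().lower() in ("true", "1", "yes")


def env_list(env: Mapping[str, str], key: str, default: str) -> List[str]:
    """Split a comma-separated value into a list of non-empty items."""
    raw = env.get(key, default)
    return [item.strip() for item in raw.split(",") if item.strip()]


def load_config(env: Mapping[str, str] = os.environ) -> dict:
    """Collect the options named in the release note (defaults are assumed)."""
    return {
        "FETCH_WARC": env_bool(env, "FETCH_WARC", True),
        "FETCH_GIT": env_bool(env, "FETCH_GIT", True),
        "GIT_DOMAINS": env_list(env, "GIT_DOMAINS", "github.com,gitlab.com,bitbucket.org"),
        "FETCH_MEDIA": env_bool(env, "FETCH_MEDIA", True),
        "MEDIA_TIMEOUT": int(env.get("MEDIA_TIMEOUT", "36000")),
    }
```

Passing the environment in as a mapping keeps the parsing easy to test with a plain dict.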

    Bugfixes:

    • autodetect the correct chromium binary in almost all cases
    • create browser history export folder automatically
    • higher allowed timestamp precision

    New logo:

  • v0.2.0 (Dec 21, 2018)

  • v0.1.0 (Jun 11, 2018)

    Warning: Running this version will move the old html/ output folder to the new location: output/.

    Changes:

    • entirely new folder structure & code layout
    • moved scripts into bin/ folder, symlinked setup and archive for backwards-compatibility
    • removed TEMPLATE_INDEX* config options, just symlink the files in templates/ to your custom versions
    • added support for ./bin/export-browser-history JSON imports of browsing history from Chrome and Firefox
  • v0.0.3 (Oct 30, 2017)

    New Features:

    • Support for parsing links from RSS feeds
    • Support for specifying a URL as well as local file paths: ./archive.py https://example.com/path/to/rss/feed.xml
    • Support for --user-data-dir for archiving restricted sites with chrome headless
    • Simple & Fancy HTML & JSON indexes for each individual link
    • Archive attempt history stored in link index.json

    Improvements:

    • Append to existing archive instead of overwriting the index each time
    • Reduced unnecessary config options, it should "just work"
    • Smartly dedupe and cleanup messy archive folders
    • Massively cleaned up codebase
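
The first improvement, appending to an existing index rather than overwriting it, could look something like the sketch below. The data layout (a list of dicts with a `url` key) and the `merge_links` helper are assumptions for illustration, not the project's actual index schema.

```python
from typing import Dict, List


def merge_links(existing: List[dict], new: List[dict]) -> List[dict]:
    """Append links whose URL is not already present, keeping old entries as-is."""
    merged: Dict[str, dict] = {link["url"]: link for link in existing}
    for link in new:
        # setdefault leaves an existing entry untouched, so earlier data is kept
        merged.setdefault(link["url"], link)
    return list(merged.values())
```

Keying on the URL also deduplicates repeated imports of the same bookmark file.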
  • v0.0.2 (Jul 4, 2017)

    • refactor codebase into separate files
    • check for minimum python version before running
    • fix utf-8 encoding errors when writing index.html
    • make index easier to customize with templates/ folder
    • WIP audio & video downloading with youtube-dl
  • v0.0.1 (Jul 4, 2017)

    It's reached a point where I'm comfortable bringing Bookmark Archiver out of alpha and into beta. This release supports a broad range of bookmark export files, works well with wget archiving, and produces clean, future-compatible archive folders.

    See the README for more details and a list of features. Future releases will have a changelog.

Owner
ArchiveBox
The self-hosted internet archiving solution by @pirate and @Monadical-SAS. #webarchiving #digipres