A Python client for the Softcite software mention recognizer server

Overview

Softcite software mention recognizer client

Python client for the Softcite software mention recognition service. It can be applied to:

  • individual PDF files

  • a local directory, processed recursively to cover every PDF file it contains

  • a collection of documents harvested by biblio-glutton-harvester and article-dataset-builder, with the benefit of reusing the collection manifest for injecting metadata and keeping track of progress. The collection can be stored locally or on S3 storage.

Requirements

The client has been tested with Python 3.5-3.7.

The client requires a working Softcite software mention recognition service. Service host and port can be changed in the config.json file of the client.

Install

cd software_mention_client/

It is advised to first set up a virtual environment to avoid falling into one of these gloomy Python dependency marshlands:

virtualenv --system-site-packages -p python3 env

source env/bin/activate

To install the dependencies, use:

pip3 install -r requirements.txt

Usage and options

usage: software_mention_client.py [-h] [--repo-in REPO_IN] [--file-in FILE_IN]
                                  [--file-out FILE_OUT]
                                  [--data-path DATA_PATH] [--config CONFIG]
                                  [--reprocess] [--reset] [--load]
                                  [--diagnostic] [--scorched-earth]

Softcite software mention recognizer client

optional arguments:
  -h, --help            show this help message and exit
  --repo-in REPO_IN     path to a directory of PDF files to be processed by
                        the Softcite software mention recognizer
  --file-in FILE_IN     a single PDF input file to be processed by the
                        Softcite software mention recognizer
  --file-out FILE_OUT   path to a single output the software mentions in JSON
                        format, extracted from the PDF file-in
  --data-path DATA_PATH
                        path to the resource files created/harvested by
                        biblio-glutton-harvester
  --config CONFIG       path to the config file, default is ./config.json
  --reprocess           reprocessed failed PDF
  --reset               ignore previous processing states and re-init the
                        annotation process from the beginning
  --load                load json files into the MongoDB instance, the --repo-
                        in parameter must indicate the path to the directory
                        of resulting json files to be loaded
  --diagnostic          perform a full count of annotations and diagnostic
                        using MongoDB regarding the harvesting and
                        transformation process
  --scorched-earth      remove a PDF file after its successful processing in
                        order to save storage space, careful with this!

The logs are written by default in a file ./client.log, but the location of the logs can be changed in the configuration file (default ./config.json).

Processing local PDF files

To process a single file, with the resulting JSON written to the indicated output path:

python3 software_mention_client.py --file-in toto.pdf --file-out toto.json
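Once the output file exists, a quick way to see what it contains is to list its top-level keys. The sketch below deliberately assumes nothing about the schema of the Softcite output, which this README does not describe; the function name `summarize_json` is made up for illustration.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Map each top-level key of a JSON file to the type name of its value.

    No assumption is made about the Softcite output schema: the function
    only reports what it finds at the top level.
    """
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, dict):
        # The root is not a JSON object: report its type only.
        return {"<root>": type(data).__name__}
    return {key: type(value).__name__ for key, value in data.items()}
```

Calling `summarize_json("toto.json")` after the command above returns a small dictionary that can be printed to decide which fields are worth exploring further.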

When processing a directory of PDF files recursively, the results are written:

  • to the MongoDB server and database indicated in the config file

  • to the directory of PDF files, as JSON files stored alongside each processed PDF

python3 software_mention_client.py --repo-in /mnt/data/biblio/pmc_oa_dir/

The default config file is ./config.json, but another file can be specified with the --config parameter:

python3 software_mention_client.py --repo-in /mnt/data/biblio/pmc_oa_dir/ --config ./my_config.json
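When the client has to be launched from another Python script, for example to loop over several directories, the command line can be assembled programmatically. This is only a sketch: the helper names `build_command` and `run_client` are hypothetical, it uses only flags listed in the usage section, and it assumes it is run from the directory containing software_mention_client.py.

```python
import subprocess


def build_command(repo_in=None, file_in=None, file_out=None,
                  data_path=None, config=None, reprocess=False):
    """Assemble the argument list for one client run (hypothetical helper)."""
    command = ["python3", "software_mention_client.py"]
    # Only options documented in the usage section are handled here.
    options = [
        ("--repo-in", repo_in),
        ("--file-in", file_in),
        ("--file-out", file_out),
        ("--data-path", data_path),
        ("--config", config),
    ]
    for flag, value in options:
        if value is not None:
            command += [flag, str(value)]
    if reprocess:
        command.append("--reprocess")
    return command


def run_client(**kwargs):
    """Run the client and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_command(**kwargs), check=True)
```

For instance, `run_client(repo_in="/mnt/data/biblio/pmc_oa_dir/", config="./my_config.json")` is equivalent to the command shown above. Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.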

Processing a collection of PDFs harvested by biblio-glutton-harvester

biblio-glutton-harvester and article-dataset-builder create a collection manifest as an LMDB database to keep track of the harvesting of large collections of files. The resources can be stored on a local file system or on AWS S3 storage. The software-mention client uses the collection manifest to process these harvested documents.

  • locally:

python3 software_mention_client.py --data-path /mnt/data/biblio-glutton-harvester/data/

--data-path indicates the path to the repository of data harvested by biblio-glutton-harvester.

The resulting JSON files are enriched with the metadata records of the processed PDF and stored together with each processed PDF in the data repository.

If the harvested collection is located on S3 storage, the access information must be indicated in the client's configuration file, config.json. The extracted software mentions are written to a file with the extension .software.json, for example:

-rw-rw-r-- 1 lopez lopez 1.1M Aug  8 03:26 0100a44b-6f3f-4cf7-86f9-8ef5e8401567.pdf
-rw-rw-r-- 1 lopez lopez  485 Aug  8 03:41 0100a44b-6f3f-4cf7-86f9-8ef5e8401567.software.json
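Given this naming convention, it is possible to check which PDF files do not yet have a result file next to them. The following sketch assumes the files sit on a local file system, that PDF names end with a lowercase `.pdf`, and that the result shares the PDF's base name as in the listing above; `pending_pdfs` is a hypothetical helper, not part of the client.

```python
from pathlib import Path


def pending_pdfs(root):
    """Return the PDFs under root that have no sibling <name>.software.json."""
    pending = []
    for pdf in sorted(Path(root).rglob("*.pdf")):
        # foo.pdf is expected to be accompanied by foo.software.json
        result = pdf.with_name(pdf.stem + ".software.json")
        if not result.exists():
            pending.append(pdf)
    return pending
```

An empty list means every PDF found under the directory has a matching result file. The check says nothing about whether a missing result is due to a failure or to a file that has not been processed yet.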

If MongoDB server access information is provided in the configuration file config.json, the extracted information is additionally written to MongoDB.

License and contact

Distributed under Apache 2.0 license. The dependencies used in the project are either themselves also distributed under Apache 2.0 license or distributed under a compatible license.

Main author and contact: Patrice Lopez ([email protected])
