SpikeX - SpaCy Pipes for Knowledge Extraction

Overview

SpikeX is a collection of pipes ready to be plugged in a spaCy pipeline. It aims to help in building knowledge extraction tools with almost-zero effort.

What's new in SpikeX 0.5.0

WikiGraph has never been so lightning fast:

  • 🌕 Performance mooning, thanks to the adoption of a sparse adjacency matrix to handle the pages graph instead of igraph (see the sketch after this list)
  • 🚀 Memory optimization, with memory consumption cut by ~40% and the compressed size cut by ~20%, thanks to new bidirectional dictionaries for managing data
  • 📖 New APIs for faster and easier usage and interaction
  • 🛠 Overall fixes, for a better graph and better page matching
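
A rough sketch of the sparse-adjacency idea (illustrative only, not SpikeX's actual internals; assumes NumPy and SciPy are available):

import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical pages and the bidirectional title <-> index dictionaries
titles = ["Apple", "Fruit", "Tree"]
t2i = {t: i for i, t in enumerate(titles)}
i2t = {i: t for t, i in t2i.items()}

# Directed edges: Apple -> Fruit, Apple -> Tree
rows, cols = [0, 0], [1, 2]
adj = csr_matrix((np.ones(2), (rows, cols)), shape=(3, 3))

# A page's neighbors are the non-zero columns of its row: an O(degree) lookup
print([i2t[j] for j in adj[t2i["Apple"]].indices])  # ['Fruit', 'Tree']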

Pipes

  • WikiPageX links Wikipedia pages to chunks in text
  • ClusterX picks noun chunks in a text and clusters them based on a revisited version of the Ball Mapper algorithm, Radial Ball Mapper
  • AbbrX detects abbreviations and acronyms, linking them to their long form. It is based on scispacy's implementation, with improvements
  • LabelX takes labelings of pattern matching expressions and catches them in a text, resolving overlaps, abbreviations and acronyms
  • PhraseX creates a Doc's underscore extension based on a custom attribute name and phrase patterns. Examples are NounPhraseX and VerbPhraseX, which extract noun phrases and verb phrases, respectively
  • SentX detects sentences in a text, based on Splitta with refinements

Tools

  • WikiGraph with pages as leaves linked to categories as nodes
  • Matcher that inherits its interface from spaCy's, but is built on a RegEx engine that boosts its performance

Install SpikeX

Some requirements are inherited from spaCy:

  • spaCy version: 2.3+
  • Operating system: macOS / OS X ยท Linux ยท Windows (Cygwin, MinGW, Visual Studio)
  • Python version: Python 3.6+ (only 64 bit)
  • Package managers: pip

Some dependencies use Cython, which needs to be installed before SpikeX:

pip install cython

Remember that a virtual environment is always recommended, in order to avoid modifying the system state.
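
For example, a minimal setup with Python's built-in venv module (the environment name .venv is just a convention):

python -m venv .venv
source .venv/bin/activate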

pip

At this point, installing SpikeX via pip is a one-line command:

pip install spikex

Usage

Prerequirements

SpikeX pipes work with spaCy, hence a model needs to be installed. Follow the official instructions here. The brand-new spaCy 3.0 is supported!
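
For instance, the small English model used throughout the examples below can be installed with spaCy's own CLI:

python -m spacy download en_core_web_sm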

WikiGraph

A WikiGraph is built starting from some key components of Wikipedia: pages, categories and relations between them.

Auto

Creating a WikiGraph can take time, depending on how large its Wikipedia dump is. For this reason, we provide wikigraphs ready to be used:

Date        WikiGraph        Lang  Size (compressed)  Size (memory)
2021-04-01  enwiki_core      EN    1.1GB              5.9GB
2021-04-01  simplewiki_core  EN    19MB               120MB
2021-04-01  itwiki_core      IT    189MB              1.1GB

More coming...

SpikeX provides a command to shortcut downloading and installing a WikiGraph (Linux or macOS, Windows not supported yet):

spikex download-wikigraph simplewiki_core

Manual

A WikiGraph can be created from command line, specifying which Wikipedia dump to take and where to save it:

spikex create-wikigraph \
  <YOUR-OUTPUT-PATH> \
  --wiki <WIKI-NAME, default: en> \
  --version <DUMP-VERSION, default: latest> \
  --dumps-path <DUMPS-BACKUP-PATH>

Then it needs to be packed and installed:

spikex package-wikigraph \
  <WIKIGRAPH-RAW-PATH> \
  <YOUR-OUTPUT-PATH>

Follow the instructions at the end of the packing process and install the distribution package in your virtual environment.
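
Typically this means pip-installing the generated archive; the placeholder below follows the same convention as the commands above:

pip install <YOUR-WIKIGRAPH-PACKAGE-PATH>

Now you are ready to use your WikiGraph as you wish: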

from spikex.wikigraph import load as wg_load

wg = wg_load("enwiki_core")
page = "Natural_language_processing"
categories = wg.get_categories(page, distance=1)
for category in categories:
    print(category)

>>> Category:Speech_recognition
>>> Category:Artificial_intelligence
>>> Category:Natural_language_processing
>>> Category:Computational_linguistics

Matcher

The Matcher is identical to spaCy's, but faster at handling many patterns at once (on the order of thousands), so follow the official usage instructions here.

A trivial example:

from spikex.matcher import Matcher
from spacy import load as spacy_load

nlp = spacy_load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("TEST", [[{"LOWER": "nlp"}]])
doc = nlp("I love NLP")
for _, s, e in matcher(doc):
  print(doc[s: e])

>>> NLP
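
To see where the Matcher pays off, here is a sketch of the many-patterns scenario it is optimized for (the keyword list is hypothetical; imagine thousands of entries):

from spacy import load as spacy_load
from spikex.matcher import Matcher

nlp = spacy_load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
keywords = ["nlp", "parser", "tokenizer"]  # stand-in for thousands of terms
for kw in keywords:
    # one single-token pattern per keyword, all handled by one matcher
    matcher.add(kw.upper(), [[{"LOWER": kw}]])

doc = nlp("A tokenizer feeds the parser in most NLP pipelines")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], "->", doc[start:end])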

WikiPageX

The WikiPageX pipe uses a WikiGraph in order to find chunks in a text that match Wikipedia page titles.

from spacy import load as spacy_load
from spikex.wikigraph import load as wg_load
from spikex.pipes import WikiPageX

nlp = spacy_load("en_core_web_sm")
doc = nlp("An apple a day keeps the doctor away")
wg = wg_load("simplewiki_core")
wpx = WikiPageX(wg)
doc = wpx(doc)
for span in doc._.wiki_spans:
  print(span._.wiki_pages)

>>> ['An']
>>> ['Apple', 'Apple_(disambiguation)', 'Apple_(company)', 'Apple_(tree)']
>>> ['A', 'A_(musical_note)', 'A_(New_York_City_Subway_service)', 'A_(disambiguation)', 'A_(Cyrillic)']
>>> ['Day']
>>> ['The_Doctor', 'The_Doctor_(Doctor_Who)', 'The_Doctor_(Star_Trek)', 'The_Doctor_(disambiguation)']
>>> ['The']
>>> ['Doctor_(Doctor_Who)', 'Doctor_(Star_Trek)', 'Doctor', 'Doctor_(title)', 'Doctor_(disambiguation)']

ClusterX

The ClusterX pipe takes noun chunks in a text and clusters them using a Radial Ball Mapper algorithm.

from spacy import load as spacy_load
from spikex.pipes import ClusterX

nlp = spacy_load("en_core_web_sm")
doc = nlp("Grab this juicy orange and watch a dog chasing a cat.")
clusterx = ClusterX(min_score=0.65)
doc = clusterx(doc)
for cluster in doc._.cluster_chunks:
  print(cluster)

>>> [this juicy orange]
>>> [a cat, a dog]

AbbrX

The AbbrX pipe finds abbreviations and acronyms in the text, linking short and long forms together:

from spacy import load as spacy_load
from spikex.pipes import AbbrX

nlp = spacy_load("en_core_web_sm")
doc = nlp("a little snippet with an abbreviation (abbr)")
abbrx = AbbrX(nlp.vocab)
doc = abbrx(doc)
for abbr in doc._.abbrs:
  print(abbr, "->", abbr._.long_form)

>>> abbr -> abbreviation

LabelX

The LabelX pipe matches and labels patterns in text, resolving overlaps, abbreviations and acronyms.

from spacy import load as spacy_load
from spikex.pipes import LabelX

nlp = spacy_load("en_core_web_sm")
doc = nlp("looking for a computer system engineer")
patterns = [
  [{"LOWER": "computer"}, {"LOWER": "system"}],
  [{"LOWER": "system"}, {"LOWER": "engineer"}],
]
labelx = LabelX(nlp.vocab, ("TEST", patterns), validate=True, only_longest=True)
doc = labelx(doc)
for labeling in doc._.labelings:
  print(labeling, f"[{labeling.label_}]")

>>> computer system engineer [TEST]

PhraseX

The PhraseX pipe creates a custom Doc underscore extension, named after a custom attribute, and fills it with matches from phrase patterns.

from spacy import load as spacy_load
from spikex.pipes import PhraseX

nlp = spacy_load("en_core_web_sm")
doc = nlp("I have Melrose and McIntosh apples, or Williams pears")
patterns = [
  [{"LOWER": "mcintosh"}],
  [{"LOWER": "melrose"}],
]
phrasex = PhraseX(nlp.vocab, "apples", patterns)
doc = phrasex(doc)
for apple in doc._.apples:
  print(apple)

>>> Melrose
>>> McIntosh
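
The bundled NounPhraseX presumably works the same way; a minimal sketch, assuming it is constructed from the vocab and exposes its matches under a noun_phrases extension (check the package source for the exact names):

from spacy import load as spacy_load
from spikex.pipes import NounPhraseX

nlp = spacy_load("en_core_web_sm")
doc = nlp("Grab this juicy orange and watch a dog chasing a cat.")
# Assumption: NounPhraseX(vocab) registers a `noun_phrases` extension
phrasex = NounPhraseX(nlp.vocab)
doc = phrasex(doc)
for noun_phrase in doc._.noun_phrases:
    print(noun_phrase)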

SentX

The SentX pipe splits sentences in a text. It modifies the tokens' is_sent_start attribute, so it's mandatory to add it before the parser pipe in the spaCy pipeline:

from spacy import load as spacy_load
from spikex.pipes import SentX
from spikex.defaults import spacy_version

if spacy_version >= 3:
    from spacy.language import Language

    @Language.factory("sentx")
    def create_sentx(nlp, name):
        return SentX()

nlp = spacy_load("en_core_web_sm")
sentx_pipe = SentX() if spacy_version < 3 else "sentx"
nlp.add_pipe(sentx_pipe, before="parser")
doc = nlp("A little sentence. Followed by another one.")
for sent in doc.sents:
  print(sent)

>>> A little sentence.
>>> Followed by another one.

That's all folks

Feel free to contribute and have fun!
