nlabel is a library for generating, storing and retrieving tagging information and embedding vectors from various nlp libraries through a unified interface.

Overview

nlabel is currently alpha software and in an early stage of development.

nlabel is a library for generating, storing and retrieving tagging information and embedding vectors from various nlp libraries through a unified interface.

nlabel is also a system to collate results from various taggers and keep track of used models and configurations.

Apart from its standard persistence through sqlite and json files, nlabel's binary arriba format combines especially low storage requirements with high performance (see benchmarks below).

Through arriba, nlabel is thus especially suitable for

  • inspecting many features on few documents
  • inspecting few features on many documents

To support external tool chains, nlabel supports exporting to REFI-QDA.

Quick Start

Processing text occurs in two steps. First, an NLP instance is built from an existing NLP pipeline:

from nlabel import NLP

import spacy

nlp = NLP(spacy.load("en_core_web_sm"), renames={
    'pos': 'upos',
    'tag': 'xpos'
}, require_gpu=False)

In the example above, nlp now contains a pipeline based on spacy's en_core_web_sm model. We instruct nlp to rename two tags, namely pos to upos and tag to xpos. Embedding vectors are configured through the vectors argument, which is described under Computing Embeddings.

In the next step, we run the pipeline and look at its output:

doc = nlp(
    "If you're going to San Francisco, "
    "be sure to wear some flowers in your hair.")

for sent in doc.sentences:
    for token in sent.tokens:
        print(token.text, token.upos, token.vector)

You can ask a doc which tags it carries by calling doc.tags. In the example above, this would give:

['dep', 'ent_iob', 'lemma', 'morph', 'sentence', 'token', 'upos', 'xpos']

In the following sections, some of nlabel's internal concepts will be explained. To get directly to code that generates archives for document collections, skip to Importing a CSV to a local archive.

Tags and Labels

nlabel handles everything as a tag, even if it has no label. That means that nlabel also regards tokens and sentences as tags. Tags can be iterated over and can be asked for their labels. Tags can also be regarded as containers that contain other tags. The following examples illustrate these concepts:

for ent in doc.ents:
    print(ent.label, ent.text)

outputs

GPE San Francisco,

while

for ent in doc.ents:
    for token in ent.tokens:
        print(ent.label, token.text, token.upos)

outputs

GPE San PROPN
GPE Francisco PROPN
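
The container idea can be pictured with plain character offsets. The following is only an illustrative sketch and not nlabel's implementation: the Span class and the contained_in helper are hypothetical names, and the offsets are written by hand for the example sentence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A hypothetical labeled span over character offsets [start, end)."""
    start: int
    end: int
    label: str = ""


def contained_in(outer, spans):
    """Return all spans that lie completely inside `outer`."""
    return [s for s in spans if outer.start <= s.start and s.end <= outer.end]


text = "If you're going to San Francisco, be sure to wear some flowers in your hair."

# Hand-written offsets for three tokens and one entity (assumed for illustration).
tokens = [Span(16, 18), Span(19, 22), Span(23, 32)]
ent = Span(19, 32, "GPE")

# Iterating the tokens of an entity means picking the token spans it contains.
for token in contained_in(ent, tokens):
    print(ent.label, text[token.start:token.end])
```

Under this view, a sentence, an entity and a token are all the same kind of object, and "the tokens of an entity" is simply a containment query.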

NLP engines

To plug in a different nlp engine, set nlp differently:

import stanza
nlp = NLP(stanza.Pipeline('en'))

Since we renamed tag and pos in the spacy example above, code written against upos and xpos would work with stanza without further changes.

At the moment nlabel has implementations for spacy, stanza, flair and deeppavlov. You can also write your own nlp data generators (based on nlabel.nlp.Tagger).

While NLP usually auto-detects the type of NLP parser you provide, there are specialized constructors (NLP.spacy, NLP.flair, etc.) that cover some edge cases.

Saving and Loading Documents

Documents can be saved to disk:

doc.save("path/to/file")

By default, this will generate a json-based format that should be easy to parse, even if you decide not to use nlabel after this point - see the bahia json documentation.

Of course, you can also use nlabel to load its own documents:

from nlabel import Document

with Document.open("path/to/file") as doc:
    for sent in doc.sentences:
        for token in sent.tokens:
            print(token.text, token.upos, token.vector)

Working with Archives

To store data from multiple taggers and texts, the approach from the last section would generate lots of separate files. nlabel offers a much better alternative through Archives.

There will be more detailed information on archives later on. For now, here is a quick run-through of how to use them.

A first example

This creates an archive (or opens an existing one) using the carenero engine (details later on) and adds a newly parsed document to it.

from nlabel import open_archive

with open_archive("/path/to/archive", engine="carenero") as archive:
    doc = nlp(text)  # text is any string to be tagged
    archive.add(doc)

Opening the archive later would allow us to retrieve all documents:

with open_archive("/path/to/archive", "r") as archive:
    for doc in archive.iter():
        print(doc.text)

Archives also know the number of documents they contain - use len(archive) - and carry information about their taggers (see next section).

Multiple Taggers

Things get interesting when using more than one tagger, e.g.:

with open_archive("/path/to/archive", engine="carenero") as archive:
    archive.add(nlp1(text))  # e.g. spacy
    archive.add(nlp2(text))  # e.g. stanza

In such an archive, calling archive.iter() will produce an error:

there are 2 taggers with conflicting tag names in this archive,
please use a selector

The reason for this error message is that spacy's and stanza's tag names clash, and nlabel would not know whether doc.tokens should map to spacy's or to stanza's token data.
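
To make the clash concrete, here is a small sketch of how conflicting tag names could be detected across taggers. It is not nlabel's code: the function name conflicting_tag_names and the tag inventories are assumptions made up for this illustration.

```python
from collections import Counter


def conflicting_tag_names(taggers):
    """Return the tag names that are produced by more than one tagger."""
    counts = Counter(name for tags in taggers.values() for name in set(tags))
    return sorted(name for name, n in counts.items() if n > 1)


# Hypothetical tag inventories for two taggers stored in one archive.
taggers = {
    "spacy": ["sentence", "token", "upos", "xpos", "lemma"],
    "stanza": ["sentence", "token", "upos", "xpos", "feats"],
}

clashes = conflicting_tag_names(taggers)
if clashes:
    print(f"{len(taggers)} taggers share tag names: {clashes}")
```

Any name in the returned list is ambiguous without a selector, which is exactly the situation the error message describes.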

To resolve this issue, we can specify which tagger to use in iter.

To do this, we can first ask the archive for the taggers it knows by calling archive.taggers. Each tagger carries a unique signature that identifies it. For example, print(archive.taggers[0]) might print the following signature:

env:
  machine: arm64
  platform: macOS-12.1-arm64-arm-64bit
  runtime:
    nlabel: 0.0.1.dev0
    python: 3.9.7
library:
  name: spacy
  version: 3.2.1
model:
  lang: en
  name: core_web_sm
  version: 3.2.0
renames:
  pos: upos
  tag: xpos
type: nlp
vectors:
  token:
    type: native

To iterate over documents getting tag data from this tagger, we can use archive.iter(archive.taggers[0]).

More commonly, we want to select a tagger based on its attributes, not on its index in an archive. To do this, we can use a MongoDB style query syntax:

spacy_tagger = archive.taggers[{
    'library': {
        'name': 'spacy'
    }
}]

This will return the tagger that carries the name 'spacy' in the 'library' section of its signature. If no tagger or more than one tagger matches, we will get a KeyError.

As shorthand for the query above, you can also use:

spacy_tagger = archive.taggers[{
    'library.name': 'spacy'
}]
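
The relationship between the nested query and its dotted shorthand can be sketched in a few lines of plain Python. This is an illustration of the idea only and makes no claim about how nlabel implements it: expand_query, matches and select_one are hypothetical helpers, and the signatures are shortened, made-up examples.

```python
def expand_query(query):
    """Turn {'library.name': 'spacy'} into {'library': {'name': 'spacy'}}."""
    expanded = {}
    for key, value in query.items():
        node = expanded
        *parents, leaf = key.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = expand_query(value) if isinstance(value, dict) else value
    return expanded


def matches(signature, query):
    """Check that every key in `query` exists in `signature` with an equal value."""
    for key, expected in query.items():
        if key not in signature:
            return False
        actual = signature[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(actual, expected):
                return False
        elif actual != expected:
            return False
    return True


def select_one(signatures, query):
    """Return the single matching signature; raise KeyError for zero or many."""
    hits = [s for s in signatures if matches(s, expand_query(query))]
    if len(hits) != 1:
        raise KeyError(f"expected exactly one match, found {len(hits)}")
    return hits[0]


# Shortened, made-up signatures for two taggers.
signatures = [
    {"library": {"name": "spacy", "version": "3.2.1"}, "type": "nlp"},
    {"library": {"name": "stanza", "version": "1.3.0"}, "type": "nlp"},
]

print(select_one(signatures, {"library.name": "spacy"})["library"]["version"])
```

A query such as {'type': 'nlp'} would match both signatures here and therefore raise a KeyError, mirroring the behavior described above.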

Mixing and Bridging Taggers

What happens if we want the output from multiple taggers rather than from exactly one?

Archive.iter() also allows specifying individual tags and even renaming them.

Using spacy_tagger from the last section and a new stanza_tagger:

for doc in archive.iter(spacy_tagger.sentence, spacy_tagger.xpos, stanza_tagger.xpos.to('st_xpos')):

With these docs, we can now access spacy's sentence and xpos tags, but also stanza's xpos tag, which we rename to st_xpos to avoid a name clash with spacy's xpos tag:

    for token in doc.tokens:  # spacy tokens
        print(token.xpos)  # spacy xpos
        print(token.st_xpos)  # stanza xpos

Note that this only works if stanza's tokenization for a token exactly matches that of spacy.

The Design of nlabel and Inherent Quirks

nlabel does not differentiate between tags and structuring entities such as sentences and tokens. All of them are the same concept to nlabel: labeled spans, that can be containers to other spans.

What can look like a bug at times is a very conscious design decision: nlabel is completely agnostic to tags, in that it knows only a single concept that it applies to everything.

Due to this design, there are various formulations in the API that are perfectly valid but rather confusing.

Obviously, it is desirable to write code that avoids these valid but quirky formulations.

Anything is a span with a label

The code below will look for a tag called "pos" that is perfectly aligned with the current token. If such a tag exists, nlabel considers it to be the "token's pos tag", and will return this tag's label.

for token in doc.tokens:
    print(token.pos)

Here is a quirky twist on the code above:

for token in doc.tokens:
    print(token.sentence)

This is allowed. The code will do the same thing as above: first it looks for a tag called "sentence" that is perfectly aligned with the current token. If such a tag exists, its label is returned.

Since the "sentence" tags provided by nlp libraries carry no labels, and "sentence" tags are not aligned to "token" tags, this will fail at step one or two, and therefore just return an empty label. Still, it is valid in terms of nlabel's concepts.
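
The lookup described in this section can be sketched as follows. This is a simplified model, not nlabel's implementation: tags are represented as hypothetical (start, end, label) tuples, and aligned_label is a made-up helper that returns an empty string when no perfectly aligned tag exists.

```python
def aligned_label(span, tags, name):
    """Return the label of the tag `name` that covers exactly `span`, else ''."""
    for start, end, label in tags.get(name, []):
        if (start, end) == span:
            return label
    return ""


# Hand-written tag data for the text "San Francisco rocks." (assumed values).
tags = {
    "token": [(0, 3, ""), (4, 13, ""), (14, 19, ""), (19, 20, "")],
    "pos": [(0, 3, "PROPN"), (4, 13, "PROPN"), (14, 19, "VERB"), (19, 20, "PUNCT")],
    "sentence": [(0, 20, "")],
}

for start, end, _ in tags["token"]:
    span = (start, end)
    # token.pos finds an aligned tag; token.sentence finds none and stays empty.
    print(repr(aligned_label(span, tags, "pos")),
          repr(aligned_label(span, tags, "sentence")))
```

Both lookups go through the same code path, which is why token.sentence is valid even though it never yields anything useful.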

Using the "label" attribute

for ent in sentence.ents:
    print(ent.label)

The following code does exactly the same thing (avoid using it):

for ent in sentence.ents:
    print(ent.ent)

Label Types

There are four label types in nlabel:

type    description                  notes
labels  all labels                   consisting of value and score
label   first label only             ignores ensuing labels
strs    string list of label values  ignores scores
str     first label value as string  ignores score and ensuing labels

strs and labels are suitable for getting output from taggers that return multiple labels.

The default type is str. The exception to this rule is morphology tags (e.g. spacy's morph and stanza's feats), which default to strs.

To specify label types, use the .to(label_type=x) method on tags when passing them to Archive.iter or Group.view.
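
The four label types can be read as four ways of reducing the same underlying list of labels. The sketch below illustrates that reading only: the convert function, the dict representation with value and score keys, and the handling of empty lists are all assumptions made for this example, not nlabel's API.

```python
def convert(labels, label_type="str"):
    """Reduce a list of {'value', 'score'} dicts to one of the four label types."""
    if label_type == "labels":
        return list(labels)                       # everything, with scores
    if label_type == "label":
        return labels[0] if labels else None      # first label, with score
    if label_type == "strs":
        return [x["value"] for x in labels]       # all values, no scores
    if label_type == "str":
        return labels[0]["value"] if labels else ""  # first value only
    raise ValueError(f"unknown label type: {label_type}")


# Made-up output of a tagger that returns two labels for one span.
raw = [{"value": "Number=Sing", "score": 0.9}, {"value": "Case=Nom", "score": 0.7}]

for label_type in ("labels", "label", "strs", "str"):
    print(label_type, convert(raw, label_type))
```

This also shows why strs is the natural default for morphology tags: reducing them to str would silently drop every feature after the first.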

Groups and Views

Groups are an underlying building block of nlabel. You might not encounter them directly.

A group contains data from multiple taggers for one shared text. If you need to collect data for multiple texts, use archives.

Documents can be combined into Groups, which will then contain information from multiple taggers:

from nlabel import Group

group = Group.join([doc1, doc2])

Groups have a view method that works similarly to the iter method available in Archives.

Computing Embeddings

The following code uses a spacy model to generate token vectors from spacy's native vector attribute:

nlp = NLP.spacy(
    spacy_model,
    vectors={'token': nlabel.embeddings.native})

Spacy's vector attribute is usually filled via spacy's own Tok2Vec and Transformer components or external extensions such as spacy-sentence-bert.

Alternatively, the following code constructs a model that computes transformer embeddings for tokens via flair:

nlp = NLP.flair(
    vectors={'token': nlabel.embeddings.huggingface(
        "dbmdz/bert-base-german-cased", layers="-1, -2")},
    from_spacy=spacy_model)

from_spacy indicates that the sentence splitter and tokenizer should be taken from the provided spacy model.

Archives

Engines

nlabel comes with three different persistence engines:

  • carenero is for collecting data, especially in a batch setting: it supports restartability and transaction safety, and it can export the full data or subsets of it to bahia or arriba.
  • bahia is suitable for archival purposes, as it is just a thin wrapper around a zip of human-readable json files; it is not the ideal format for exports.
  • arriba is a binary format optimized for read performance; it is suitable for data analysis, but not for exports.

Storage Size

The following graph shows data from a real-world dataset, consisting of 18861 texts (125.3 MB text data), tagged with 4 taggers and a total of 31 tags (no embedding data). Y axis shows size in GB (note logarithmic scale). REFI-QDA is roughly 100 times the size of arriba.

[Figure: storage size requirements for different engines]

Random Access Speeds

The exact speed of arriba depends on the task and data, but arriba often performs 10 to 100 times faster than bahia and carenero on real-world projects. The following measurements come from the same data set as above (extracting all POS tags from one of 4 taggers over 2000 documents):

[Figure: access times for different engines]

The carenero/ALL benchmark shows the time needed to access all tags from all taggers through carenero.

More Engine Details

These engines support storing both tagging data and embedding vectors. In the ordering above, they go from slower to faster.

                       carenero  bahia  arriba
data collection        +         -      -
exporting              +         -      -
read speeds            -         -      +
suitable for archival  -         +      -

(*) bahia supports writes, but does not avoid adding duplicates or support proper restartability in batch settings, i.e. it is not suited to incremental updates.

Additional Examples

Importing a CSV to a local archive

Create a carenero archive from a CSV:

from nlabel.importers import CSV

import spacy

csv = CSV(
    "/path/to/some.csv",
    keys=['zeitung_id', 'text_type_id', 'filename'],
    text='text')
csv.importer(spacy.load("en_core_web_sm")).to_local_archive()

This will create an archive located in the same folder as the CSV. The code above is restartable, i.e. it is okay to interrupt and continue later - it will not add duplicate entries.
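
Restartability of this kind usually rests on a uniqueness constraint over the document keys. The sketch below shows the general technique with Python's sqlite3 module. It is not carenero's actual schema or code: the table layout, the column names and the import_rows helper are assumptions chosen for illustration, and the tagging step is left out entirely.

```python
import csv
import io
import sqlite3

# A tiny in-memory CSV standing in for /path/to/some.csv (made-up content).
CSV_DATA = "doc_id,text\na,First text.\nb,Second text.\n"


def import_rows(conn, csv_text, key="doc_id", text="text"):
    """Insert rows that are not stored yet; return the number of new rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs (doc_key TEXT PRIMARY KEY, body TEXT)")
    added = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        # The primary key makes a repeated insert of the same document a no-op.
        cur = conn.execute(
            "INSERT OR IGNORE INTO docs (doc_key, body) VALUES (?, ?)",
            (row[key], row[text]))
        added += cur.rowcount
    conn.commit()
    return added


conn = sqlite3.connect(":memory:")
print("first run added:", import_rows(conn, CSV_DATA))
print("rerun added:", import_rows(conn, CSV_DATA))
```

Because already stored keys are skipped, an interrupted import can simply be started again from the top of the CSV.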

Once the archive has been created, one can either use it directly, e.g. iterating its documents:

from nlabel import open_archive

with open_archive("some/archive.nlabel", mode="r") as archive:
    for doc in archive.iter(some_selector):
        for x in doc.tokens:
            print(x.text, x.xpos, x.vector)

Or, one can save the archive to different formats for faster traversal:

archive.save("demo2", engine="bahia")
archive.save("demo3", engine="arriba")

The open_archive call from above works with all archive types.

Note that the iter call on archives takes an optional view description that allows picking/renaming tags as described earlier.

Exporting to a remote archive

For larger jobs, it is often useful to separate computation and storage, and to allow multiple computation processes (both often apply in GPU cluster environments). Since carenero's sqlite backend handles concurrent writes poorly, the solution is to start a dedicated web service that handles the writing on a dedicated machine.

On machine A, start an archive server (it will write a carenero archive to the given path):

python -m nlabel.importers.server /path/to/archive.nlabel --password your_pwd

On machine B, you can start one or multiple importers writing to that remote archive. Modifying the example from the local archive:

from nlabel import RemoteArchive

remote_archive = RemoteArchive("http://localhost:8000", ("user", "your_pwd"))
csv.importer(spacy.load("en_core_web_sm")).to_remote_archive(
    remote_archive, batch_size=8)

Exporting REFI-QDA

The following code exports ent tags to a REFI-QDA project.

from nlabel import NLP

import spacy
nlp = NLP(spacy.load("en_core_web_lg"))
text = 'some longer text...'
doc = nlp(text)

doc.save_to_qda(
    "/path/to/your.qdp", {
        'tagger': {
        },
        'tags': {
            'ent'
        }
    })

A save_to_qda method is also part of carenero and bahia archives.

Owner
Bernhard Liebl