Overview
Project Status: Active – The project has reached a stable, usable state and is being actively developed.

Ucto for Python

This is a Python binding to the tokeniser Ucto. Tokenisation is one of the first step in almost any Natural Language Processing task, yet it is not always as trivial a task as it appears to be. This binding makes the power of the ucto tokeniser available to Python. Ucto itself is a regular-expression based, extensible, and advanced tokeniser written in C++ (https://languagemachines.github.io/ucto).

Installation

Easy

Manual (Advanced)

  • Make sure to first install ucto itself (https://languagemachines.github.io/ucto) and all its dependencies.
  • Install Cython if not yet available on your system: $ sudo apt-get install cython cython3 (Debian/Ubuntu; may differ for other distributions)
  • Clone this repository and run: $ sudo python setup.py install (make sure to use the desired version of Python)

Advanced note: If the ucto libraries and includes are installed in a non-standard location, you can set environment variables INCLUDE_DIRS and LIBRARY_DIRS to point to them prior to invocation of setup.py install.
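
For example, a minimal sketch assuming ucto was installed under the hypothetical prefix /opt/ucto (adjust the paths to your system):

# /opt/ucto is a hypothetical install prefix; point these at your actual ucto installation
export INCLUDE_DIRS=/opt/ucto/include
export LIBRARY_DIRS=/opt/ucto/lib
sudo python setup.py install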

Usage

Import and instantiate the Tokenizer class with a configuration file.

import ucto
configurationfile = "tokconfig-eng"
tokenizer = ucto.Tokenizer(configurationfile)

The configuration files supplied with ucto are named tokconfig-xxx, where xxx corresponds to a three-letter ISO 639-3 language code. There is also a tokconfig-generic configuration that has no language-specific rules. Alternatively, you can create and supply your own configuration file. Note that for older versions of ucto you may need to provide the absolute path, but the latest versions find the configurations supplied with ucto automatically. See the ucto documentation for a list of the configurations available in the latest version.

The constructor for the Tokenizer class takes the following keyword arguments:

  • lowercase (defaults to False) -- Lowercase all text
  • uppercase (defaults to False) -- Uppercase all text
  • sentenceperlineinput (defaults to False) -- Set this to True if each sentence in your input is on one line already and you do not require further sentence boundary detection from ucto.
  • sentenceperlineoutput (defaults to False) -- Set this to True if you want each sentence to be output on one line. This has little effect within the context of Python.
  • paragraphdetection (defaults to True) -- Do paragraph detection. Paragraphs are simply delimited by an empty line.
  • quotedetection (defaults to False) -- Set this to True if you want to enable the experimental quote detection, which detects quoted text (enclosed within some sort of single or double quotes)
  • debug (defaults to False) -- Enable verbose debug output
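
For example, using only the keyword arguments listed above, a tokeniser that lowercases all text and treats each input line as a complete sentence can be instantiated as follows:

import ucto

# lowercase everything and skip ucto's own sentence boundary detection
tokenizer = ucto.Tokenizer("tokconfig-eng", lowercase=True, sentenceperlineinput=True)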

Text is passed to the tokeniser using the process() method, which returns the number of tokens rather than the tokens themselves. It may be called multiple times in sequence. The tokens are buffered in the Tokenizer instance and can be obtained by iterating over it, after which the buffer is cleared:

#pass the text (a str); this may be called multiple times
tokenizer.process(text)

#read the tokenised data
for token in tokenizer:
    #token is an instance of ucto.Token, serialise to string using str()
    print(str(token))

    #tokens remember whether they are followed by a space
    if token.isendofsentence():
        print()
    elif not token.nospace():
        print(" ",end="")

The process() method takes a single string (str) as its parameter. The string may contain newlines, and newlines are not necessarily sentence boundaries unless you instantiated the tokenizer with sentenceperlineinput=True.

Each token is an instance of ucto.Token. It can be serialised to string using str() as shown in the example above.

The following methods are available on ucto.Token instances:

  • isendofsentence() -- Returns a boolean indicating whether this is the last token of a sentence.
  • nospace() -- Returns a boolean; if True, there is no space following this token in the original input text.
  • isnewparagraph() -- Returns True if this token is the start of a new paragraph.
  • isbeginofquote() -- Returns True if this token marks the beginning of a quote (see the quotedetection option).
  • isendofquote() -- Returns True if this token marks the end of a quote (see the quotedetection option).
  • tokentype -- This is an attribute, not a method. It contains the type or class of the token (e.g. a string like WORD, ABBREVIATION, PUNCTUATION, URL, EMAIL, SMILEY, etc.)
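
As a minimal sketch tying these together (assuming the tokconfig-eng configuration is installed), the following prints each token alongside its type:

import ucto

tokenizer = ucto.Tokenizer("tokconfig-eng")
tokenizer.process("Dr. Jones sent an e-mail to john@example.org, didn't he?")

# tokentype distinguishes e.g. WORD, ABBREVIATION, PUNCTUATION and EMAIL tokens
for token in tokenizer:
    print(str(token), token.tokentype)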

In addition to the low-level process() method, the tokenizer can also read an input file and produce an output file, in the same fashion as ucto itself does when invoked from the command line. This is achieved using the tokenize(inputfilename, outputfilename) method:

tokenizer.tokenize("input.txt","output.txt")

Input and output files may be either plain text, or in the FoLiA XML format. Upon instantiation of the Tokenizer class, there are two keyword arguments to indicate this:

  • xmlinput or foliainput -- A boolean that indicates whether the input is FoLiA XML (True) or plain text (False). Defaults to False.
  • xmloutput or foliaoutput -- A boolean that indicates whether the output is FoLiA XML (True) or plain text (False). Defaults to False. If this option is enabled, you can set an additional keyword parameter docid (string) to set the document ID.

An example for plain text input and FoLiA output:

tokenizer = ucto.Tokenizer(configurationfile, foliaoutput=True)
tokenizer.tokenize("input.txt", "ucto_output.folia.xml")

FoLiA documents retain all the information ucto can output, unlike the plain text representation. These documents can be read and manipulated from Python using the FoLiaPy library. FoLiA is especially recommended if you intend to further enrich the document with linguistic annotation. A small example of reading ucto's FoLiA output using this library follows, but consult the documentation for more:

import folia.main as folia
doc = folia.Document(file="ucto_output.folia.xml")
for paragraph in doc.paragraphs():
    for sentence in paragraph.sentences():
        for word in sentence.words():
            print(word.text(), end="")
            if word.space:
                print(" ", end="")
        print()
    print()

Test and Example

Run and inspect example.py.

Comments
  • undefined symbol: ...

    Hi there,

    I have a clean ucto installation from sudo apt install ucto. When I compile the Python extension, however, I can't import it, since it fails with:

    ImportError: /home/manjavacas/.pyenv/versions/anaconda3-4.4.0/lib/python3.6/site-packages/ucto.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN9Tokenizer14TokenizerClass4initERKSs
    

    Not sure what might be going wrong, since ucto works perfectly fine and the extension manages to compile without errors.

    Any ideas?

    question 
    opened by emanjavacas 8
  • Compilation fails after latest ucto release

        gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -march=x86-64 -mtune=generic -O3 -pipe -fno-plt -march=x86-64 -mtune=generic -O3 -pipe -fno-plt -march=x86-64 -mtune=generic -O3 -pipe -fno-plt -fPIC -I/home/proycon/envs/dev/include -I/usr/include/ -I/usr/include/libxml2 -I/usr/local/include/ -I/home/proycon/envs/dev/include -I/usr/include/python3.10 -c ucto_wrapper.cpp -o build/temp.linux-x86_64-3.10/ucto_wrapper.o --std=c++0x -D U_USING_ICU_NAMESPACE=1
        ucto_wrapper.cpp: In function ‘PyObject* __pyx_gb_4ucto_9Tokenizer_8generator(__pyx_CoroutineObject*, PyThreadState*, PyObject*)’:
        ucto_wrapper.cpp:3750:86: error: no match for ‘operator=’ (operand types are ‘std::vector<std::__cxx11::basic_string<char> >’ and ‘std::vector<icu_70::UnicodeString>’)
         3750 |   __pyx_cur_scope->__pyx_v_results = __pyx_cur_scope->__pyx_v_self->tok.getSentences();
    
    bug 
    opened by proycon 3
  • Tokenizer does not return lowercase tokens when lowercase = True

    When I call the tokenizer with lowercase=True, the output still contains uppercase tokens.

    t = ucto.Tokenizer("tokconfig-nld",lowercase = True,sentencedetection=False,paragraphdetection=False)
    ucto: textcat configured from: /vol/customopt/lamachine.stable/share/ucto/textcat.cfg

    z = x.article_set.all()[0]

    t.process(z.text)

    [str(token) for token in t]

    ["'", 'oor', 'onze', 'redacteur', 'mr.', 'F.', 'KUITENBROUWER', 'AMSTERDAM',

    bug 
    opened by martijnbentum 3
  • Manual installation fails: config.h: no such file or directory

    I’ve tried to follow the manual installation instructions on Ubuntu 16.04, but it seems to be missing a file:

    ~/git/python-ucto$ git status
    On branch master
    Your branch is up-to-date with 'origin/master'.
    nothing to commit, working directory clean
    ~/git/python-ucto$ uname -a
    Linux unut 4.4.0-124-generic #148-Ubuntu SMP Wed May 2 13:00:18 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
    ~/git/python-ucto$ sudo python setup.py install
    /usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'install_requires'
      warnings.warn(msg)
    running install
    running build
    running build_ext
    cythoning ucto_wrapper2.pyx to ucto_wrapper2.cpp
    building 'ucto' extension
    x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I/usr/include/ -I/usr/include/libxml2 -I/usr/local/include/ -I/usr/include/python2.7 -c ucto_wrapper2.cpp -o build/temp.linux-x86_64-2.7/ucto_wrapper2.o --std=c++0x -D U_USING_ICU_NAMESPACE=1
    cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
    In file included from ucto_wrapper2.cpp:457:0:
    /usr/include/ucto/tokenize.h:33:20: fatal error: config.h: No such file or directory
    compilation terminated.
    error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
    
    opened by texttheater 3
  • TokenRole has no attribute ENDOFQUOTE

    Hi there, I noticed that isendofquote seems to be broken.

    Seems like a typo on this line:

    https://github.com/proycon/python-ucto/blob/65a7f03a92f60fa28e330a5fb735d75230cdbec4/ucto_wrapper.pyx#L29

    which should rather be ENDOFQUOTE.

    bug 
    opened by emanjavacas 1
  • Question: possible to retrieve untokenized sentences?

    May sound silly, but would it be possible to create a method that allows retrieving sentences from the tokenizer without whitespace between punctuation marks (i.e. untokenised)? Maybe by providing a tuple holding two versions of a sentence, both the tokenised one and the original?

    It is practical to keep the untokenised sentence in some scenarios (e.g. showing it to end users), and reconstructing it by script would be rather hacky and imprecise, I guess.

    enhancement 
    opened by pirolen 1
Releases
  • v0.6.1

Owner
Maarten van Gompel
Research software engineer - NLP - AI - 🐧 Linux & open-source enthusiast - 🐍 Python / 🌊 C/C++ / 🦀 Rust / 🐚 Shell - 🔐 Privacy, Security & Decentralisation