An open-source NLP library: fast text cleaning and preprocessing.

Overview

🌴 dobbi 🦕

Takes care of all of this boring NLP stuff

PyPI - Python Version Version GitHub

Description

An open-source NLP library: fast text cleaning and preprocessing.

TL;DR

This library provides a quick and ready-to-use text preprocessing tools for text cleaning and normalization. You can simply remove hashtags, nicknames, emoji, url addresses, punctuation, whitespace and whatever.

Installation

To download dobbi, either fork this GitHub repo or simply use Pypi via pip:

$ pip install dobbi

Usage

Import the library:

import dobbi

Interaction

The library uses method chaining in order to simplify text processing:

dobbi.clean() \
    .hashtag() \
    .nickname() \
    .url() \
    .execute('Check here: https://some-url.com')

Supported methods and patterns

The process consists of three stages:

  1. Initialization methods: initialize a dobbi Work object
  2. Intermediate methods: chain patterns in the needed order
  3. Terminal methods: choose if you need a function or a result

Initialization functions:

  • dobbi.clean()
  • dobbi.collect()
  • dobbi.replace()

Intermediate methods (pattern processing choice):

  • regexp() - custom regular expressions
  • url() - URLs
  • html() - HTML and "<...>" type markups
  • punctuation() - punctuation
  • hashtag() - hashtags
  • emoji() - emoji
  • emoticons() - emoticons
  • whitespace() - any type of whitespaces
  • nickname() - @-starting nicknames

Terminal methods:

  • execute(str) - executes chosen methods on the provided string.
  • function() - returns a function which is a combination of the chosen methods.

Examples

1) Clean a random Twitter message

dobbi.clean() \
    .hashtag() \
    .nickname() \
    .url() \
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result:

'Why is so funny? Check here:'

2) Replace nicknames and urls with tokens

dobbi.replace() \
    .hashtag('') \
    .nickname() \
    .url('__CUSTOM_URL_TOKEN__') \
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result:

'Why TOKEN_NICKNAME is so funny? Check here: __CUSTOM_URL_TOKEN__'

3) Get the text cleanup function (one-liner)

Please, try to avoid the in-line method chaining, as it is less readable. Do as your heart tells you.

func = dobbi.clean().url().hashtag().punctuation().whitespace().html().function()
func('\t #fun #lol    Why  @Alex33 is so... funny? 
    
    \nCheck
    \there: https://some-url.com'
   )

Result:

'Why Alex33 is so funny Check here'
  1. Chain regexp methods
dobbi.clean() \
    .regexp('#\w+') \
    .regexp('@\w+') \
    .regexp('https?://\S+') \
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result:

'Why is so funny? Check here:'

Additional

Please pay attention that the functions are applied in the order you've specified them. So, you're better to chain .punctuation() as one of the last functions.

Call for collaboration 🤗

If you enjoyed the project I would be grateful if you supported it :)

Below is the list of useful features I would be happy to share with you:

  • Finding bugs
  • Making code optimizations
  • Writing tests
  • Help with new features development
You might also like...
Task-based datasets, preprocessing, and evaluation for sequence models.

SeqIO: Task-based datasets, preprocessing, and evaluation for sequence models. SeqIO is a library for processing sequential data to be fed into downst

Kashgari is a production-level NLP Transfer learning framework built on top of tf.keras for text-labeling and text-classification, includes Word2Vec, BERT, and GPT2 Language Embedding.

Kashgari Overview | Performance | Installation | Documentation | Contributing 🎉 🎉 🎉 We released the 2.0.0 version with TF2 Support. 🎉 🎉 🎉 If you

Kashgari is a production-level NLP Transfer learning framework built on top of tf.keras for text-labeling and text-classification, includes Word2Vec, BERT, and GPT2 Language Embedding.

Kashgari Overview | Performance | Installation | Documentation | Contributing 🎉 🎉 🎉 We released the 2.0.0 version with TF2 Support. 🎉 🎉 🎉 If you

Data preprocessing rosetta parser for python

datapreprocessing_rosetta_parser I've never done any NLP or text data processing before, so I wanted to use this hackathon as a learning opportunity,

Develop open-source Python Arabic NLP libraries that the Arab world will easily use in all Natural Language Processing applications

Develop open-source Python Arabic NLP libraries that the Arab world will easily use in all Natural Language Processing applications

Grading tools for Advanced NLP (11-711)Grading tools for Advanced NLP (11-711)

Grading tools for Advanced NLP (11-711) Installation You'll need docker and unzip to use this repo. For docker, visit the official guide to get starte

Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

Pytorch-NLU,一个中文文本分类、序列标注工具包,支持中文长文本、短文本的多类、多标签分类任务,支持中文命名实体识别、词性标注、分词等序列标注任务。 Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

🤗 The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools
🤗 The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools

🤗 The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools

💬   Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants
💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants

Rasa Open Source Rasa is an open source machine learning framework to automate text-and voice-based conversations. With Rasa, you can build contextual

Releases(v0_13)
Owner
Iaroslav
👋 Hi there! I'm a Data Scientist and a Software Engineer. I'm passionate about NLP and Deep Learning. 💭 My dream is to develop General AI.
Iaroslav
This code is the implementation of Text Emotion Recognition (TER) with linguistic features

APSIPA-TER This code is the implementation of Text Emotion Recognition (TER) with linguistic features. The network model is BERT with a pretrained mod

kenro515 1 Feb 08, 2022
A framework for training and evaluating AI models on a variety of openly available dialogue datasets.

ParlAI (pronounced “par-lay”) is a python framework for sharing, training and testing dialogue models, from open-domain chitchat, to task-oriented dia

Facebook Research 9.7k Jan 09, 2023
Use Google's BERT for named entity recognition (CoNLL-2003 as the dataset).

For better performance, you can try NLPGNN, see NLPGNN for more details. BERT-NER Version 2 Use Google's BERT for named entity recognition (CoNLL-2003

Kaiyinzhou 1.2k Dec 26, 2022
Tracking Progress in Natural Language Processing

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.

Sebastian Ruder 21.2k Dec 30, 2022
A 30000+ Chinese MRC dataset - Delta Reading Comprehension Dataset

Delta Reading Comprehension Dataset 台達閱讀理解資料集 Delta Reading Comprehension Dataset (DRCD) 屬於通用領域繁體中文機器閱讀理解資料集。 本資料集期望成為適用於遷移學習之標準中文閱讀理解資料集。 本資料集從2,108篇

272 Dec 15, 2022
CPC-big and k-means clustering for zero-resource speech processing

The CPC-big model and k-means checkpoints used in Analyzing Speaker Information in Self-Supervised Models to Improve Zero-Resource Speech Processing.

Benjamin van Niekerk 5 Nov 23, 2022
A 10000+ hours dataset for Chinese speech recognition

A 10000+ hours dataset for Chinese speech recognition

309 Dec 16, 2022
CLIPfa: Connecting Farsi Text and Images

CLIPfa: Connecting Farsi Text and Images OpenAI released the paper Learning Transferable Visual Models From Natural Language Supervision in which they

Sajjad Ayoubi 66 Dec 14, 2022
Examples of using sparse attention, as in "Generating Long Sequences with Sparse Transformers"

Status: Archive (code is provided as-is, no updates expected) Update August 2020: For an example repository that achieves state-of-the-art modeling pe

OpenAI 1.3k Dec 28, 2022
Bot to connect a real Telegram user, simulating responses with OpenAI's davinci GPT-3 model.

AI-BOT Bot to connect a real Telegram user, simulating responses with OpenAI's davinci GPT-3 model.

Thempra 2 Dec 21, 2022
Python api wrapper for JellyFish Lights

Python api wrapper for JellyFish Lights The hope is to make this a pip installable package Current capabalilities: Connects to a local JellyFish Light

10 Dec 18, 2022
Data manipulation and transformation for audio signal processing, powered by PyTorch

torchaudio: an audio library for PyTorch The aim of torchaudio is to apply PyTorch to the audio domain. By supporting PyTorch, torchaudio follows the

1.9k Jan 08, 2023
MRC approach for Aspect-based Sentiment Analysis (ABSA)

B-MRC MRC approach for Aspect-based Sentiment Analysis (ABSA) Paper: Bidirectional Machine Reading Comprehension for Aspect Sentiment Triplet Extracti

Phuc Phan 1 Apr 05, 2022
Programme de chiffrement et de déchiffrement inverse d'un message en python3.

Chiffrement Inverse En Python3 Programme de chiffrement et de déchiffrement inverse d'un message en python3. Explication du chiffrement inverse avec c

Malik Makkes 2 Mar 26, 2022
Unsupervised text tokenizer focused on computational efficiency

YouTokenToMe YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE)

VK.com 847 Dec 19, 2022
Enterprise Scale NLP with Hugging Face & SageMaker Workshop series

Workshop: Enterprise-Scale NLP with Hugging Face & Amazon SageMaker Earlier this year we announced a strategic collaboration with Amazon to make it ea

Philipp Schmid 161 Dec 16, 2022
Py65 65816 - Add support for the 65C816 to py65

Add support for the 65C816 to py65 Py65 (https://github.com/mnaberez/py65) is a

4 Jan 04, 2023
EdiTTS: Score-based Editing for Controllable Text-to-Speech

Official implementation of EdiTTS: Score-based Editing for Controllable Text-to-Speech

Neosapience 99 Jan 02, 2023
Trains an OpenNMT PyTorch model and SentencePiece tokenizer.

Trains an OpenNMT PyTorch model and SentencePiece tokenizer. Designed for use with Argos Translate and LibreTranslate.

Argos Open Tech 61 Dec 13, 2022
LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language

LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language ⚖️ The library of Natural Language Processing for Brazilian legal lang

Felipe Maia Polo 125 Dec 20, 2022