📝An easy-to-use package to restore punctuation of the text.

Last update: Dec 30, 2022

Overview

✏️ rpunct - Restore Punctuation

This repo contains code for punctuation restoration.

This package is intended for direct use as a punctuation restoration model for general English text. Alternatively, you can use it as a starting point for further fine-tuning on domain-specific texts. It uses HuggingFace's bert-base-uncased model weights, fine-tuned for punctuation restoration.

Punctuation restoration works on arbitrarily long text. It uses a GPU if one is available and otherwise defaults to the CPU.
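
One way to handle arbitrarily long input is to split it into fixed-size word windows and punctuate each window in turn. The sketch below is an assumption about the general approach, not the package's actual code; `chunk_words` and the default window size of 250 are hypothetical and chosen only for illustration.

```python
from typing import Iterator, List


def chunk_words(text: str, words_per_chunk: int = 250) -> Iterator[List[str]]:
    """Yield consecutive windows of at most `words_per_chunk` words.

    Hypothetical helper: it only illustrates how long text could be split
    before being passed to a model with a limited sequence length.
    """
    if words_per_chunk <= 0:
        raise ValueError("words_per_chunk must be positive")
    words = text.split()
    for start in range(0, len(words), words_per_chunk):
        yield words[start:start + words_per_chunk]


# Toy input: five words split into windows of two.
chunks = list(chunk_words("one two three four five", words_per_chunk=2))
print(chunks)
```

Each window could then be punctuated independently and the results joined, at the cost of losing context at window boundaries.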

Casing and punctuation marks that the model restores:

Upper-casing
Period: .
Exclamation: !
Question Mark: ?
Comma: ,
Colon: :
Semi-colon: ;
Apostrophe: '
Dash: -
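
Restoring casing and punctuation can be viewed as labeling each word with a casing decision and an optional trailing mark. The sketch below assumes a hypothetical two-character label scheme (first character: a punctuation mark, or `O` for none; second character: `U` to capitalize, or `O` to leave the word as-is). The scheme and the `apply_labels` helper are illustrative assumptions, not taken from the package.

```python
from typing import List, Tuple


def apply_labels(pairs: List[Tuple[str, str]]) -> str:
    """Rebuild text from (word, label) pairs under the assumed label scheme."""
    restored = []
    for word, label in pairs:
        punct, case = label[0], label[1]
        if case == "U":
            # Capitalize only the first character; keep the rest unchanged.
            word = word[:1].upper() + word[1:]
        if punct != "O":
            # Append the trailing punctuation mark.
            word += punct
        restored.append(word)
    return " ".join(restored)


# Hand-written labels for a fragment of the usage example (not model output).
labeled = [
    ("now", ",U"), ("a", "OO"), ("team", "OO"), ("again", "OO"),
    ("led", "OO"), ("by", "OO"), ("david", "OU"), ("muller", ",U"),
]
text = apply_labels(labeled)
print(text)
```

Marks that sit inside or between words, such as apostrophes and dashes, would need extra handling beyond this trailing-mark sketch.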

🚀 Usage

Below is a quick way to get up and running with the model.

First, install the package.

pip install rpunct

Sample Python code:

from rpunct import RestorePuncts
# The default language is 'english'
rpunct = RestorePuncts()
rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were
a few atoms thick anything thicker would cause the electrons to scatter in ways that could not be disentangled now a team again led by david muller the samuel b eckert
professor of engineering has bested its own record by a factor of two with an electron microscope pixel array detector empad that incorporates even more sophisticated
3d reconstruction algorithms the resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves""")
# Outputs the following:
# In 2018, Cornell researchers built a high-powered detector that, in combination with an algorithm-driven process called Ptychography, set a world record by tripling the
# resolution of a state-of-the-art electron microscope. As successful as it was, that approach had a weakness. It only worked with ultrathin samples that were a few atoms
# thick. Anything thicker would cause the electrons to scatter in ways that could not be disentangled. Now, a team again led by David Muller, the Samuel B. 
# Eckert Professor of Engineering, has bested its own record by a factor of two with an Electron microscope pixel array detector empad that incorporates even more
# sophisticated 3d reconstruction algorithms. The resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves.

🎯 Accuracy

Here is the number of product reviews we used for fine-tuning the model:

Language	Number of text samples
English	560,000

We found the best convergence at around 3 epochs; that checkpoint is the one presented here and available for download.

The fine-tuned model obtained the following accuracy on 45,990 held-out text samples:

Accuracy	Overall F1	Eval Support
91%	90%	45,990
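
For reference, token-level accuracy and an F1 score can be computed from gold and predicted labels as sketched below. The source does not say how its "Overall F1" is averaged or which labels are counted, so the choice here (micro-averaging over every label except a "no punctuation" label `O`) is an assumption, and the toy labels are made up for illustration.

```python
from typing import List, Tuple


def accuracy_and_micro_f1(
    gold: List[str], pred: List[str], ignore: str = "O"
) -> Tuple[float, float]:
    """Return (accuracy, micro-F1), excluding the `ignore` label from F1."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")

    pairs = list(zip(gold, pred))
    accuracy = sum(g == p for g, p in pairs) / len(pairs)

    # Micro-averaged counts over all labels other than `ignore`.
    tp = sum(g == p and g != ignore for g, p in pairs)
    fp = sum(p != ignore and g != p for g, p in pairs)
    fn = sum(g != ignore and g != p for g, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Toy labels: one missed period and one spurious period.
gold = ["O", ",", "O", ".", "O", "?"]
pred = ["O", ",", "O", "O", ".", "?"]
acc, f1 = accuracy_and_micro_f1(gold, pred)
print(round(acc, 3), round(f1, 3))
```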

💻 🎯 Further Fine-Tuning

To start fine-tuning or training, see the training/train.py file. Running python training/train.py will replicate the results of this model.

☕ Contact

Contact Daulet Nurmanbetov for questions, feedback and/or requests for similar models.

Comments

Update requirements.txt

ERROR: Could not find a version that satisfies the requirement torch==1.8.1 (from rpunct) (from versions: 1.11.0, 1.12.0, 1.12.1, 1.13.0)
ERROR: No matching distribution found for torch==1.8.1

opened by Rukaya-lab 0
Forked repo with fixes
I forked this repository (link here) to fix the outdated dependencies and incompatibility with non-CUDA machines. If anyone needs these fixes, feel free to install from the fork:

pip install git+https://github.com/samwaterbury/rpunct.git

Hopefully this repository is updated or another maintainer is assigned. And thanks to the creator @Felflare, this is a useful tool!
opened by samwaterbury 2
Requirements shouldn't ask for such specific versions

First, thanks a lot for providing this package :)

Currently, the requirements.txt, and thus the dependencies in setup.py, pin very specific versions of PyTorch and other packages. This shouldn't be the case if you want this package to be used as a general library (think of a second package that does the same but asks for an incompatible version of PyTorch, which would prevent installing the two together). The end user might also need a more recent version of PyTorch. Given that PyTorch is almost always backward compatible, and quite stable, I think the requirement for it could be changed from ==1.8.1 to >=1.8.1. I believe the same would be true for the other packages.
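
The effect of the pin can be illustrated with a small pure-Python check against the torch versions listed in the install error quoted earlier (1.11.0, 1.12.0, 1.12.1, 1.13.0). This is a simplified sketch, not pip's resolver: `parse`, `satisfies`, and `matching` are hypothetical helpers that handle only plain dotted numeric versions and the `==` and `>=` operators.

```python
from typing import List, Tuple

# Versions reported as available in the install error quoted earlier.
AVAILABLE = ["1.11.0", "1.12.0", "1.12.1", "1.13.0"]


def parse(version: str) -> Tuple[int, ...]:
    """Turn '1.12.1' into (1, 12, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version: str, spec: str) -> bool:
    """Check a version against a simplified '==X' or '>=X' specifier."""
    if spec.startswith("=="):
        return parse(version) == parse(spec[2:])
    if spec.startswith(">="):
        return parse(version) >= parse(spec[2:])
    raise ValueError(f"unsupported specifier: {spec}")


def matching(spec: str) -> List[str]:
    """Return the available versions that satisfy the specifier."""
    return [v for v in AVAILABLE if satisfies(v, spec)]


pinned = matching("==1.8.1")   # the exact pin matches nothing available
relaxed = matching(">=1.8.1")  # the lower bound accepts every listed version
print(pinned, relaxed)
```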

opened by adefossez 2
Added ability to pass additional parameters to simpletransformer ner in RestorePuncts class.
Thanks for the great library! I had problems running it without a GPU, and I think there is a simple fix. The simpletransformers NER model enables CUDA by default. This PR lets the user pass a dictionary of arguments specifically for the simpletransformers NER model, so you can now run the code on a CPU by initializing rpunct like so:

rpunct = RestorePuncts(ner_args={"use_cuda": False})
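
Building on the `ner_args` parameter described in this pull request, the flag could also be chosen automatically instead of being hard-coded. The sketch below is an assumption, not part of the package: `build_ner_args` is a hypothetical helper that relies only on `torch.cuda.is_available()` and falls back to the CPU when torch cannot be imported.

```python
from typing import Dict, Optional


def build_ner_args(cuda_available: Optional[bool] = None) -> Dict[str, bool]:
    """Return a `ner_args` dictionary with `use_cuda` set appropriately.

    Pass `cuda_available` explicitly to override detection; when it is None,
    detection is attempted through torch.
    """
    if cuda_available is None:
        try:
            import torch  # imported lazily so the helper works without torch
            cuda_available = torch.cuda.is_available()
        except ImportError:
            cuda_available = False
    return {"use_cuda": bool(cuda_available)}


ner_args = build_ner_args()
print(ner_args)

# With the change from this pull request installed, the result could be passed as:
# rpunct = RestorePuncts(ner_args=ner_args)
```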

Before this change, when running rpunct examples on the CPU the following error occurs:

from rpunct import RestorePuncts
# The default language is 'english'
rpunct = RestorePuncts()
rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were
a few atoms thick anything thicker would cause the electrons to scatter in ways that could not be disentangled now a team again led by david muller the samuel b eckert
professor of engineering has bested its own record by a factor of two with an electron microscope pixel array detector empad that incorporates even more sophisticated
3d reconstruction algorithms the resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves""")

ValueError                                Traceback (most recent call last)
/var/folders/hx/dhzhl_x51118fm5cd13vzh2h0000gn/T/ipykernel_10548/194907560.py in <module>
      1 from rpunct import RestorePuncts
      2 # The default language is 'english'
----> 3 rpunct = RestorePuncts()
      4 rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
      5 by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were

~/repos/rpunct/rpunct/punctuate.py in __init__(self, wrds_per_pred, ner_args)
     19         if ner_args is None:
     20             ner_args = {}
---> 21         self.model = NERModel("bert", "felflare/bert-restore-punctuation", labels=self.valid_labels,
     22                               args={"silent": True, "max_seq_length": 512}, **ner_args)
     23

~/repos/transformers/transformer-env/lib/python3.8/site-packages/simpletransformers/ner/ner_model.py in __init__(self, model_type, model_name, labels, args, use_cuda, cuda_device, onnx_execution_provider, **kwargs)
    209                 self.device = torch.device(f"cuda:{cuda_device}")
    210             else:
--> 211                 raise ValueError(
    212                     "'use_cuda' set to True when cuda is unavailable."
    213                     "Make sure CUDA is available or set use_cuda=False."

ValueError: 'use_cuda' set to True when cuda is unavailable.Make sure CUDA is available or set use_cuda=False.
opened by nbertagnolli 1
add use_cuda parameter

Using the package in an environment without CUDA support causes it to fail. Adding a parameter to turn CUDA off when necessary allows it to function normally.

opened by mjfox3 1

Releases(1.0.1)

1.0.1(May 24, 2021)


Owner

Daulet Nurmanbetov

Deep Learning, AI and Finance
