Model for recasing and repunctuating ASR transcripts

Overview

Recasing and punctuation model based on BERT

Benoit Favre 2021

This system converts a sequence of lowercase tokens without punctuation to a sequence of cased tokens with punctuation.

It is trained to predict both aspects at the token level in a multitask fashion, from fine-tuned BERT representations.
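
Concretely, this amounts to a shared encoder with one token-level classification head per task. The sketch below illustrates that setup in PyTorch; the class name, the pretrained model name, and the head wiring are illustrative assumptions, not the repository's actual code.

import torch.nn as nn
from transformers import AutoModel

class RecasePuncSketch(nn.Module):
    # Shared BERT encoder feeding two token-level classification heads
    # (a sketch of the multitask setup described above, not the repo's code).
    def __init__(self, model_name='bert-base-multilingual-cased',
                 num_case_labels=4, num_punc_labels=5):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size
        self.case_head = nn.Linear(hidden, num_case_labels)  # lower/upper/capitalize/other
        self.punc_head = nn.Linear(hidden, num_punc_labels)  # o/period/comma/question/exclamation

    def forward(self, input_ids):
        states = self.bert(input_ids).last_hidden_state  # (batch, seq_len, hidden)
        return self.case_head(states), self.punc_head(states)

A natural multitask training objective is then the sum of the two per-token cross-entropy losses, so both heads fine-tune the shared encoder together.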

The model predicts the following recasing labels:

  • lower: keep lowercase
  • upper: convert to upper case
  • capitalize: set first letter as upper case
  • other: leave as is

And the following punctuation labels (a short sketch applying both label sets follows the list):

  • o: no punctuation
  • period: .
  • comma: ,
  • question: ?
  • exclamation: !
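
To make these labels concrete, here is a minimal sketch of how a (case, punctuation) label pair could be applied to each token; the apply_labels helper is hypothetical, not code from this repository.

CASE = {'lower': str.lower, 'upper': str.upper,
        'capitalize': str.capitalize, 'other': lambda t: t}
PUNC = {'o': '', 'period': '.', 'comma': ',', 'question': '?', 'exclamation': '!'}

def apply_labels(tokens, case_labels, punc_labels):
    # Rebuild cased, punctuated text from per-token label predictions.
    words = [CASE[c](t) + PUNC[p]
             for t, c, p in zip(tokens, case_labels, punc_labels)]
    return ' '.join(words)

# apply_labels(['bonjour', 'comment', 'ca', 'va'],
#              ['capitalize', 'lower', 'lower', 'lower'],
#              ['comma', 'o', 'o', 'question'])
# -> 'Bonjour, comment ca va?'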

Input tokens are batched as sequences of length 256 that are processed independently without overlap.

During training, batches containing fewer than 256 tokens are simulated by drawing a length uniformly at random and replacing all tokens and labels after that point with padding (a scheme called Cut-drop).
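
A minimal sketch of that Cut-drop step, assuming per-sequence tensors of shape (max_length,) and a padding id of 0 (both assumptions; the repository's actual implementation may differ):

import random

def cut_drop(x, y_case, y_punc, pad_id=0, rate=0.1, max_length=256):
    # With probability `rate`, truncate the sequence at a uniformly drawn
    # point and pad the remainder, so the model also sees short inputs.
    if random.random() < rate:
        cut = random.randint(1, max_length - 1)
        x[cut:] = pad_id
        y_case[cut:] = pad_id
        y_punc[cut:] = pad_id
    return x, y_case, y_punc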

Changelog:

  • Fix generation when input is smaller than max length

Installation

Use your favourite method for installing Python requirements. For example:

python -m venv env
. env/bin/activate
pip3 install -r requirements.txt -f https://download.pytorch.org/whl/torch_stable.html

Prediction

Predict from raw text:

python recasepunc.py predict checkpoint/path.iteration < input.txt > output.txt
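
For example, a lowercase, unpunctuated input line such as

bonjour comment ca va

should come out roughly as follows (the exact output depends on the model):

Bonjour, comment ca va ?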

Models

  • French: fr-txt.large.19000 trained on 160M tokens from Common Crawl
    • Iterations: 19000
    • Batch size: 16
    • Max length: 256
    • Seed: 871253
    • Cut-drop probability: 0.1
    • Train loss: 0.021128975618630648
    • Valid loss: 0.015684964135289192
    • Recasing accuracy: 96.73
    • Punctuation accuracy: 95.02
      • All punctuation F-score: 67.79
      • Comma F-score: 67.94
      • Period F-score: 72.91
      • Question F-score: 57.57
      • Exclamation mark F-score: 15.78
    • Training data: First 100M words from Common Crawl

Training

Note: adjust the file names in the commands below to your setup. Training tensors are precomputed and loaded into CPU memory.

Stage 0: download text data

Stage 1: tokenize and normalize text with Moses tokenizer, and extract recasing and repunctuation labels

python recasepunc.py preprocess < input.txt > input.case+punc
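
The label extraction in this stage can be pictured as follows; this simplified sketch (the function is illustrative, not the repository's exact preprocessing) maps each cased, possibly punctuated word to a lowercase token plus its two labels:

def extract_labels(word):
    # Map a cased word with optional trailing punctuation to
    # (lowercase token, case label, punctuation label).
    punc_map = {'.': 'period', ',': 'comma', '?': 'question', '!': 'exclamation'}
    punc = 'o'
    if word and word[-1] in punc_map:
        punc = punc_map[word[-1]]
        word = word[:-1]
    if word.islower() or not word:
        case = 'lower'
    elif word.isupper() and len(word) > 1:
        case = 'upper'
    elif word[:1].isupper() and word[1:] == word[1:].lower():
        case = 'capitalize'
    else:
        case = 'other'
    return word.lower(), case, punc

# extract_labels('Bonjour,') -> ('bonjour', 'capitalize', 'comma')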

Stage 2: sub-tokenize with Flaubert tokenizer, and generate pytorch tensors

python recasepunc.py tensorize input.case+punc input.case+punc.x input.case+punc.y

Stage 3: train model

python recasepunc.py train train.x train.y valid.x valid.y checkpoint/path

Stage 4: evaluate performance on a test set

python recasepunc.py eval checkpoint/path.iteration test.x test.y
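
Putting the stages together, a complete run looks like this (file names are placeholders, as noted above):

python recasepunc.py preprocess < train.txt > train.case+punc
python recasepunc.py tensorize train.case+punc train.x train.y
python recasepunc.py preprocess < valid.txt > valid.case+punc
python recasepunc.py tensorize valid.case+punc valid.x valid.y
python recasepunc.py train train.x train.y valid.x valid.y checkpoint/path
python recasepunc.py eval checkpoint/path.iteration test.x test.y
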
Comments
  • Is it possible to customize for new language?

    Dear Benoit Favre,

    Your project is really important! Is it possible to customize it for a new language? If so, could you give some short hints on how?

    Thank you in advance!

    opened by ican24 5
  • Can't get attribute 'WordpieceTokenizer'

    Hi, thanks for your effort on developing recasepunc! I know that you can't provide help for models not trained by you, but maybe you have an idea what's going wrong here:

    I'm loading the model vosk-recasepunc-de-0.21 from https://alphacephei.com/vosk/models. When I do so, torch tells me that it can't find WordpieceTokenizer. Do you know why? Is the model incompatible?

    Punc predict path: C:\Users\admin\meety\vosk-recasepunc-de-0.21\checkpoint
    Traceback (most recent call last):
      File "main2.py", line 120, in <module>
        t = transcriber()
      File "main2.py", line 32, in __init__
        self.casePuncPredictor = CasePuncPredictor(punc_predict_path, lang="de")
      File "C:\Users\admin\meety\recasepunc.py", line 273, in __init__
        loaded = torch.load(checkpoint_path, map_location=device if torch.cuda.is_available() else 'cpu')
      File "C:\Users\admin\Anaconda3\envs\meety\lib\site-packages\torch\serialization.py", line 607, in load
        return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
      File "C:\Users\admin\Anaconda3\envs\meety\lib\site-packages\torch\serialization.py", line 882, in _load
        result = unpickler.load()
      File "C:\Users\admin\Anaconda3\envs\meety\lib\site-packages\torch\serialization.py", line 875, in find_class
        return super().find_class(mod_name, name)
    AttributeError: Can't get attribute 'WordpieceTokenizer' on <module '__main__' from 'main2.py'>

    opened by padmalcom 4
  • Can't do inference

    Hello, I'm trying to use example.py with a French model (fr.22000 or fr-txt.large.19000), but I get this error:

      raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
    RuntimeError: Error(s) in loading state_dict for Model:
        Unexpected key(s) in state_dict: "bert.position_ids".

    I also tried the following command, with the same error in the output: python recasepunc.py predict fr.22000 < toto.txt > output.txt. Do you have any advice? Thanks

    opened by MatFrancois 3
  • Memory usage

    Hi, at startup the punctuation app uses about 9 GB of RAM, but only for a moment (while loading the model); after that it needs about 1.5 GB. Can we reduce the 9 GB at startup? Maybe whatever check of the model happens at startup is a feature that can be turned off?

    opened by gubri 1
  • Russian model doesn't work, while English does

    When I use the Russian model, it gives me this error:

    WARNING: reverting to cpu as cuda is not available
    Some weights of the model checkpoint at DeepPavlov/rubert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
    
     File "C:\pypy\rus\recasepunc.py", line 741, in <module>
        main(config, config.action, config.action_args)
      File "C:\pypy\rus\recasepunc.py", line 715, in main
        generate_predictions(config, *args)
      File "C:\pypy\rus\recasepunc.py", line 349, in generate_predictions
        for line in sys.stdin:
      File "C:\Users\Xenia\AppData\Local\Programs\Python\Python39\lib\codecs.py", line 322, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xef in position 0: invalid continuation byte
    
     File "C:\Users\Xenia\AppData\Local\Programs\Python\Python39\lib\site-packages\flask\app.py", line 1796, in dispatch_request
        return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)      
      File "C:\pypy\app.py", line 32, in process_audio
        cased = subprocess.check_output('python rus/recasepunc.py predict rus/checkpoint', shell=True, text=True, input=text)
      File "C:\Users\Xenia\AppData\Local\Programs\Python\Python39\lib\subprocess.py", 
    line 420, in check_output
        return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
      File "C:\Users\Xenia\AppData\Local\Programs\Python\Python39\lib\subprocess.py", 
    line 524, in run
        raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command 'python rus/recasepunc.py predict rus/checkpoint' returned non-zero exit status 1.
    

    Sorry for the long message; I'm not sure which of these messages is the most important. Should I use another version of transformers? I use transformers==4.16.2 and it works fine with the English model.

    opened by xenia19 0
  • Export model to be Used in C++

    Is it possible to export the model to something that can be used in C++ via libtorch?

    1. export an existing model (a checkpoint provided in this repo)
    2. export a model after I train it on my own data

    Which of the options above is possible, or both?
    opened by leohuang2013 0
  • RuntimeError when predicting with the French models

    I tried to use the French models (both fr.22000 and fr-txt.large.19000) on a very simple text:

    j'aime les fleurs les olives et la raclette

    When running python3 recasepunc.py predict fr.22000 < input.txt > output.txt (or with the other model), I get the following RuntimeError:

    Traceback (most recent call last):
      File "/home/mael/charly/recasepunc/recasepunc.py", line 733, in <module>
        main(config, config.action, config.action_args)
      File "/home/mael/charly/recasepunc/recasepunc.py", line 707, in main
        generate_predictions(config, *args)
      File "/home/mael/charly/recasepunc/recasepunc.py", line 336, in generate_predictions
        model.load_state_dict(loaded['model_state_dict'])
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1497, in load_state_dict
        raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
    RuntimeError: Error(s) in loading state_dict for Model:
    	Unexpected key(s) in state_dict: "bert.position_ids".
    

    I tried the same with the English model, and it worked perfectly. Looks like something is broken with the French ones?

    opened by maelchiotti 2
  • parameters like --dab_rate can't be set from the command line because they are bool

    Look at the parameters below; they really are parsed as bool. I found this bug while debugging:

    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument("action", help="train|eval|predict|tensorize|preprocess", type=str)
        ...
        parser.add_argument("--updates", help="number of training updates to perform", default=default_config.updates, type=bool)
        parser.add_argument("--period", help="validation period in updates", default=default_config.period, type=bool)
        parser.add_argument("--lr", help="learning rate", default=default_config.lr, type=bool)
        parser.add_argument("--dab-rate", help="drop at boundaries rate", default=default_config.dab_rate, type=bool)
        config = Config(**parser.parse_args().__dict__)

        main(config, config.action, config.action_args)

    opened by al-zatv 0
  • Cannot use trained model for validation or prediction

    Hi, thank you for this repo! I'm trying to reproduce the results for a different language, so I'm using multilingual BERT fine-tuned on my language's dataset. Everything goes well during preprocessing and training; the results are comparable with those for English and French (97-99% for case and punctuation).

    But when I try to use the trained model, it gives very poor results even for sentences from the training dataset. It works in that it sometimes puts capital letters or dots, but rarely, and mostly the model can't cope. Also, when I try to evaluate the model with the command from the README (I also tried it on already-used validation sets, for instance with python recasepunc.py eval bertugan_casepunc.24000 valid.case+punc.x valid.case+punc.y), it gives this error:

    File "recasepunc.py", line 220, in batchify x = x[:(len(x) // max_length) * max_length].reshape(-1, max_length) TypeError: unhashable type: 'slice'

    Sorry for pointing to two different problems in one issue, but I thought they might share one common cause.

    opened by khusainovaidar 5
Releases: 0.3

Owner: Benoit Favre