A fast and easy implementation of the Transformer with PyTorch.

Overview

FasySeq

FasySeq is shorthand for "Fast and easy sequential modeling toolkit". It aims to provide researchers and developers with a seq2seq model that can be trained efficiently and modified easily. The toolkit is based on the Transformer (Vaswani et al.), and more seq2seq models will be added in the future.

Dependency

PyTorch >= 1.4
NLTK
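
The dependencies can be installed with pip, for example:

    pip install "torch>=1.4" nltk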

Result

...

Structure

...

To Be Updated

  • top-k and top-p sampling
  • multi-GPU inference
  • length penalty in beam search
  • ...

Preprocess

Build Vocabulary

createVocab.py

Named Arguments:
-f/--file The files used to build the vocabulary.
Type: List
--vocab_num The maximum size of the vocabulary; words beyond this limit will be discarded according to their frequency.
Type: Int Default: -1
--min_freq The minimum frequency of a token in the vocabulary; words with a frequency lower than min_freq will be discarded.
Type: Int Default: 0
--lower Whether to convert all words to lowercase.
--save_path The path to save the vocabulary.
Type: str
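
For example, building a shared vocabulary from tokenized training files might look like this (the file names and values below are illustrative, not prescribed by the toolkit):

    python createVocab.py -f train.src train.tgt --vocab_num 32000 --min_freq 2 --lower --save_path ./vocab.pkl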

Process Data

preprocess.py

Named Arguments:
--source The path of the source file.
Type: str
[--target] The path of the target file.
Type: str
--src_vocab The path of the source vocabulary.
Type: str
[--tgt_vocab] The path of the target vocabulary.
Type: str
--save_path The path to save the processed data.
Type: str
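
A minimal preprocessing call might then be as follows (file names are hypothetical; the bracketed arguments above are optional, e.g. when preparing inference data without references):

    python preprocess.py --source train.src --target train.tgt --src_vocab ./vocab.pkl --tgt_vocab ./vocab.pkl --save_path ./train.pkl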

Train

train.py

Named Arguments:
Model:
--share_embed Source and target share the same vocabulary and word embedding. If the model uses shared embeddings, the maximum position of the embedding is max(max_src_position, max_tgt_position).
--max_src_position The maximum source position; every src-tgt pair whose source sentence is longer than max_src_position will be truncated or discarded. If max_src_position is greater than the maximum source length, it will be set to the maximum source length.
Type: Int Default: inf
--max_tgt_position The maximum target position; every src-tgt pair whose target sentence is longer than max_tgt_position will be truncated or discarded. If max_tgt_position is greater than the maximum target length, it will be set to the maximum target length.
Type: Int Default: inf
--position_method The method used to introduce positional information.
Option: encoding/embedding
--normalize_before Use pre-layer normalization (apply layer normalization before each sub-layer). See Xiong et al.
Checkpoint:
--checkpoint_path The path to save checkpoint files.
Type: str Default: None
--restore_file The checkpoint file to be loaded.
Type: str Default: None
--checkpoint_num Keep only the most recent checkpoint_num checkpoints.
Type: Int Default: inf
Data:
--vocab Vocabulary path. If you use shared embeddings, the vocabulary will be loaded from this path.
Type: str Default: None
--src_vocab Source vocabulary path.
Type: str Default: None
--tgt_vocab Target vocabulary path.
Type: str Default: None
--file The training data file.
Type: str
--max_tokens The maximum number of tokens in each batch.
Type: Int Default: 1000
--discard_invalid_data If set, data whose source or target length exceeds the maximum position will be discarded; otherwise, long sentences will be truncated to the maximum position.
Train:
--cuda_num The device ID(s) of the GPU(s) to use.
Type: List
--grad_accumulate The number of gradient accumulation steps.
Type: Int Default: 1
--epoch The total number of epochs to train.
Type: Int Default: inf
--batch_print_info Print training information every batch_print_info batches.
Type: Int Default: 1000
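
Putting the pieces together, a training run with shared embeddings might be launched as follows (all paths and hyperparameter values are illustrative):

    python train.py --share_embed --vocab ./vocab.pkl --file ./train.pkl --checkpoint_path ./checkpoints --position_method encoding --max_tokens 4096 --grad_accumulate 2 --cuda_num 0

Note that with --max_tokens 4096 and --grad_accumulate 2, each parameter update sees roughly 8192 tokens.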

Inference

generator.py

Named Arguments:
--cuda_num The device ID(s) of the GPU(s) to use.
Type: List
--file The inference data file, which has already been preprocessed.
Type: str
--raw_file The raw inference data file, which will be preprocessed before generation.
Type: str
--ref_file The reference file.
Type: str
--max_length
--max_alpha
--max_add_token
The maximum generated length = min(max_length, max_alpha * src_len, max_add_token + src_len). For example, with max_length=200, max_alpha=1.5, and max_add_token=50, a 100-token source is capped at min(200, 150, 150) = 150 tokens.
Type: Int Default: inf
--max_tokens The maximum number of tokens in each batch.
Type: Int Default: 1000
--src_vocab Source vocabulary path.
Type: str Default: None
--tgt_vocab Target vocabulary path.
Type: str Default: None
--vocab Vocabulary path. If you use shared embeddings, the vocabulary will be loaded from this path.
Type: str Default: None
--model_path The path of the pre-trained model.
Type: str
--output_path The output directory; the result will be saved to output_path/result.txt.
Type: str
--decode_method The decoding method.
Option: greedy/beam
--beam Beam size.
Type: Int Default: 5
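
For instance, beam-search decoding on raw text might look like this (the paths are hypothetical):

    python generator.py --raw_file test.src --vocab ./vocab.pkl --model_path ./checkpoints/model.pt --output_path ./output --decode_method beam --beam 5 --cuda_num 0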

Postprocess

avg_param.py

The parameter-averaging code we employ is the same as fairseq's.
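
As a sketch of what checkpoint averaging does (this shows the general technique, not necessarily the exact interface of avg_param.py; the file names are hypothetical):

    import torch

    # Checkpoints to average; these file names are hypothetical.
    paths = ["checkpoint1.pt", "checkpoint2.pt", "checkpoint3.pt"]

    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        # Some checkpoints nest the parameters under a "model" key.
        params = state["model"] if "model" in state else state
        if avg is None:
            avg = {k: v.clone().float() for k, v in params.items()}
        else:
            for k, v in params.items():
                avg[k] += v.float()

    # Divide the accumulated sums by the number of checkpoints.
    for k in avg:
        avg[k] /= len(paths)

    torch.save({"model": avg}, "averaged.pt")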

License

FasySeq(-py) is released under the Apache-2.0 License. The license applies to the pre-trained models as well.
