使用Mask LM预训练任务来预训练Bert模型。训练垂直领域语料的模型表征,提升下游任务的表现。

Overview

Pretrain_Bert_with_MaskLM

Info

使用Mask LM预训练任务来预训练Bert模型。

基于pytorch框架,训练关于垂直领域语料的预训练语言模型,目的是提升下游任务的表现。

Pretraining Task

Mask Language Model,简称Mask LM,即基于Mask机制的预训练语言模型。

同时支持 原生的MaskLM任务和Whole Words Masking任务。默认使用Whole Words Masking

MaskLM

使用来自于Bert的mask机制,即对于每一个句子中的词(token):

  • 85%的概率,保留原词不变
  • 15%的概率,使用以下方式替换
    • 80%的概率,使用字符[MASK],替换当前token。
    • 10%的概率,使用词表随机抽取的token,替换当前token。
    • 10%的概率,保留原词不变。

Whole Words Masking

与MaskLM类似,但是在mask的步骤有些少不同。

在Bert类模型中,考虑到如果单独使用整个词作为词表的话,那词表就太大了。不利于模型对同类词的不同变种的特征学习,故采用了WordPiece的方式进行分词。

Whole Words Masking的方法在于,在进行mask操作时,对象变为分词前的整个词,而非子词。

Model

使用原生的Bert模型作为基准模型。

Datasets

项目里的数据集来自wikitext,分成两个文件训练集(train.txt)和测试集(test.txt)。

数据以行为单位存储。

若想要替换成自己的数据集,可以使用自己的数据集进行替换。(注意:如果是预训练中文模型,需要修改配置文件Config.py中的self.initial_pretrain_modelself.initial_pretrain_tokenizer,将值修改成 bert-base-chinese

自己的数据集不需要做mask机制处理,代码会处理。

Training Target

本项目目的在于基于现有的预训练模型参数,如google开源的bert-base-uncasedbert-base-chinese等,在垂直领域的数据语料上,再次进行预训练任务,由此提升bert的模型表征能力,换句话说,也就是提升下游任务的表现。

Environment

项目主要使用了Huggingface的datasetstransformers模块,支持CPU、单卡单机、单机多卡三种模式。

可通过以下命令安装依赖包

    pip install -r requirement.txt

主要包含的模块如下:

    python3.6
    torch==1.3.0
    tqdm==4.61.2
    transformers==4.6.1
    datasets==1.10.2
    numpy==1.19.5
    pandas==1.1.3

Get Start

单卡模式

直接运行以下命令

    python train.py

或修改Config.py文件中的变量self.cuda_visible_devices为单卡后,运行

    chmod 755 run.sh
    ./run.sh

多卡模式

如果你足够幸运,拥有了多张GPU卡,那么恭喜你,你可以进入起飞模式。 🚀 🚀

(1)使用torch的nn.parallel.DistributedDataParallel模块进行多卡训练。其中config.py文件中参数如下,默认可以不用修改。

  • self.cuda_visible_devices表示程序可见的GPU卡号,示例:1,2→可在GPU卡号为1和2上跑,亦可以改多张,如0,1,2,3
  • self.device在单卡模式,表示程序运行的卡号;在多卡模式下,表示master的主卡,默认会变成你指定卡号的第一张卡。若只有cpu,那么可修改为cpu
  • self.port表示多卡模式下,进程通信占用的端口号。(无需修改)
  • self.init_method表示多卡模式下进程的通讯地址。(无需修改)
  • self.world_size表示启动的进程数量(无需修改)。在torch==1.3.0版本下,只需指定一个进程。在1.9.0以上,需要与GPU数量相同。

(2)运行程序启动命令

    chmod 755 run.sh
    ./run.sh

Experiment

使用交叉熵(cross-entropy)作为损失函数,困惑度(perplexity)和Loss作为评价指标来进行训练,训练过程如下:

Reference

【Bert】https://arxiv.org/pdf/1810.04805.pdf

【transformers】https://github.com/huggingface/transformers

【datasets】https://huggingface.co/docs/datasets/quicktour.html

Owner
Desmond Ng
NLP Engineer
Desmond Ng
NLTK Source

Natural Language Toolkit (NLTK) NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets, and tutorials supporting

Natural Language Toolkit 11.4k Jan 04, 2023
MicBot - MicBot uses Google Translate to speak everyone's chat messages

MicBot MicBot uses Google Translate to speak everyone's chat messages. It can al

2 Mar 09, 2022
Prithivida 690 Jan 04, 2023
A method to generate speech across multiple speakers

VoiceLoop PyTorch implementation of the method described in the paper VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop. VoiceLoop is a n

Facebook Archive 873 Dec 15, 2022
SentAugment is a data augmentation technique for semi-supervised learning in NLP.

SentAugment SentAugment is a data augmentation technique for semi-supervised learning in NLP. It uses state-of-the-art sentence embeddings to structur

Meta Research 363 Dec 30, 2022
The simple project to separate mixed voice (2 clean voices) to 2 separate voices.

Speech Separation The simple project to separate mixed voice (2 clean voices) to 2 separate voices. Result Example (Clisk to hear the voices): mix ||

vuthede 31 Oct 30, 2022
내부 작업용 django + vue(vuetify) boilerplate. 짠 하면 돌아감.

Pocket Galaxy 아주 간단한 개인용, 혹은 내부용 툴을 만들어야하는데 이왕이면 웹이 편하죠? 그럴때를 위해 만들어둔 django와 vue(vuetify)로 이뤄진 boilerplate 입니다. 각 폴더에 있는 설명서대로 실행을 시키면 일단 당장 뭔가가 돌아갑니

Jamie J. Seol 16 Dec 03, 2021
华为商城抢购手机的Python脚本 Python script of Huawei Store snapping up mobile phones

HUAWEI STORE GO 2021 说明 基于Python3+Selenium的华为商城抢购爬虫脚本,修改自近两年没更新的项目BUY-HW,为女神抢Nova 8(什么时候华为开始学小米玩饥饿营销了?) 原项目的登陆以及抢购部分已经不可用,本项目对原项目进行了改正以适应新华为商城,并增加一些功能

ZhangLiang 111 Dec 22, 2022
Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS)

TOPSIS implementation in Python Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) CHING-LAI Hwang and Yoon introduced TOPSIS

Hamed Baziyad 8 Dec 10, 2022
Textlesslib - Library for Textless Spoken Language Processing

textlesslib Textless NLP is an active area of research that aims to extend NLP t

Meta Research 379 Dec 27, 2022
FactSumm: Factual Consistency Scorer for Abstractive Summarization

FactSumm: Factual Consistency Scorer for Abstractive Summarization FactSumm is a toolkit that scores Factualy Consistency for Abstract Summarization W

devfon 83 Jan 09, 2023
Data manipulation and transformation for audio signal processing, powered by PyTorch

torchaudio: an audio library for PyTorch The aim of torchaudio is to apply PyTorch to the audio domain. By supporting PyTorch, torchaudio follows the

1.9k Jan 08, 2023
Mysticbbs-rjam - rJAM splitscreen message reader for MysticBBS A46+

rJAM splitscreen message reader for MysticBBS A46+

Robbert Langezaal 4 Nov 22, 2022
A repo for open resources & information for people to succeed in PhD in CS & career in AI / NLP

A repo for open resources & information for people to succeed in PhD in CS & career in AI / NLP

420 Dec 28, 2022
A Word Level Transformer layer based on PyTorch and 🤗 Transformers.

Transformer Embedder A Word Level Transformer layer based on PyTorch and 🤗 Transformers. How to use Install the library from PyPI: pip install transf

Riccardo Orlando 27 Nov 20, 2022
Tevatron is a simple and efficient toolkit for training and running dense retrievers with deep language models.

Tevatron Tevatron is a simple and efficient toolkit for training and running dense retrievers with deep language models. The toolkit has a modularized

texttron 193 Jan 04, 2023
wxPython app for converting encodings, modifying and fixing SRT files

Subtitle Converter Program za obradu srt i txt fajlova. Requirements: Python version 3.8 wxPython version 4.1.0 or newer Libraries: srt, PyDispatcher

4 Nov 25, 2022
Open solution to the Toxic Comment Classification Challenge

Starter code: Kaggle Toxic Comment Classification Challenge More competitions 🎇 Check collection of public projects 🎁 , where you can find multiple

minerva.ml 153 Jun 22, 2022
A CSRankings-like index for speech researchers

Speech Rankings This project mimics CSRankings to generate an ordered list of researchers in speech/spoken language processing along with their possib

Mutian He 19 Nov 26, 2022