Jingju baseline - A baseline model of our project of Beijing opera script generation

Overview

Jingju Baseline

It is a baseline of our project about Beijing opera script generation. Our baseline model is based on gpt2-chinese-ancient which is pretrained with 1.5GB literary Chinese.Please refer to our paper for details.

Directory Annotation

jingju_baseline/
	|-- finetuning.py 	#the finetuning script
	|-- jingju_test.py 	#test script
	|-- preprocess.py 	#data preprocess script
	|-- config/ 		#model configuration files 
	|-- corpora/ 		#corpora files
	|-- models/ 		# vocab file, model checkpoints and some necessary files
	|-- scripts/ 		# several functional scripts
	|-- test/ 		#test files
	|-- uer/ 		#files from UER-py

Environment Preparation

Our baseline model is fineturned with a pretraining framework UER-py. Refer to the part for environment requirements.

Finetuning

  1. Data preprocess
python3 preprocess.py --corpus_path corpora/jingju_train.txt\
   	  --vocab_path models/vocab.txt \
   	  --tokenizer bert \
   	  --dataset_path corpora/jingju_train.pt \
   	  --processes_num 32 --seq_length 1024 --target lm
python3 preprocess.py --corpus_path corpora/jingju_dev.txt\
   	  --vocab_path models/vocab.txt \
   	  --tokenizer bert \
   	  --dataset_path corpora/jingju_dev.pt \
   	  --processes_num 32 --seq_length 1024 --target lm
  1. Finetuning
export CUDA_VISIBLE_DEVICES=0
nohup python3 -u finetuning.py --dataset_path corpora/jingju_train.pt\
		 --devset_path corpora/jingju_dev.pt\
		 --vocab_path models/vocab.txt \
		 --config_path config/jingju_config.json \
		 --output_model_path models/finetuned_model.bin\
		 --pretrained_model_path models/uer-ancient-chinese.bin\
		 --world_size 1 --gpu_ranks 0  \
		 --total_steps 100000 --save_checkpoint_steps 50000\
		 --report_steps 1000 --learning_rate 5e-5\
		 --batch_size 5 --accumulation_steps 4 \
		 --embedding word_pos  --fp16 --fp16_opt_level O1 \
		 --remove_embedding_layernorm --encoder transformer \
		 --mask causal --layernorm_positioning pre \
		 --target lm --tie_weights > fineturning.log 2>&1 &

Refer to here for function of every argument.

Specificly, you may change environment variable CUDA_VISIBLE_DEVICES and --world_size paired with --gpu_ranks option for multi-GPU training. To enable --fp16 coordinated with --fp16_opt_level needs apex.

Test

You can finetuning by yourself with instructions above, or download the checkpoint(extracting code: q0yn) to directory ./models Then run as follows:

python3 preprocess.py --corpus_path corpora/jingju_test.txt\
		  --vocab_path models/vocab.txt \
		  --tokenizer bert \
		  --dataset_path corpora/jingju_test.pt \
		  --processes_num 32 --seq_length 1024 --target lm
nohup python3 -u jingju_test.py --load_model_path models/finetuned_model.bin-100000 \
		--vocab_path models/vocab.txt \
		--beginning_path test/jingju_beginning.txt  \
		--reference_path test/jingju_reference.txt \
		--prediction_path test/jingju_candidates.txt \
		--test_path test/jingju_beginning.txt \
		--testset_path datasets/jingju_test.pt \
		--config_path config/jingju_config.json \
		--seq_length 1024 --embedding word_pos \
		--remove_embedding_layernorm \
		--encoder transformer --mask causal \
		--layernorm_positioning pre --target lm \
		--tie_weights > test_candidate_generation.log 2>&1 &

The automatic mertics(i.e., F1, Perplexity, BLEU and Distinct) will be displayed on stdout.

Generation

nohup python3 -u scripts/generate_lm.py \
		--load_model_path models/finetuned_model.bin-100000 \
		--vocab_path models/vocab.txt \
		--test_path test/beginning.txt \
		--prediction_path test/generation.txt \
		--config_path config/jingju_config.json \
		--seq_length 1024 --embedding word_pos \
		--remove_embedding_layernorm --encoder transformer \
		--mask causal --layernorm_positioning pre \
		--target lm --tie_weights > generation_log.log 2>&1 &

Given the beginning, the model will generates script corresponding with it. The generate_lm.py script only generates sequence no longer than 1024. If you want longer script, replace scripts/generate_lm.py with scripts/long_generate_lm.py and revise --seq_length to the length you desire. Note that the generation procedure employs auto-regressive fashion, so generating long sequence is a time-consuming process.

Citation

@article{zhao2019uer,
  title={UER: An Open-Source Toolkit for Pre-training Models},
  author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
  journal={EMNLP-IJCNLP 2019},
  pages={241},
  year={2019}
}
@article{radford2019language,
  title={Language Models are Unsupervised Multitask Learners},
  author={Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
  year={2019}
}
Owner
midon
master from School of Informatics,Xiamen University
midon
Cancer Drug Response Prediction via a Hybrid Graph Convolutional Network

DeepCDR Cancer Drug Response Prediction via a Hybrid Graph Convolutional Network This work has been accepted to ECCB2020 and was also published in the

Qiao Liu 50 Dec 18, 2022
This repository is dedicated to developing and maintaining code for experiments with wide neural networks.

Wide-Networks This repository contains the code of various experiments on wide neural networks. In particular, we implement classes for abc-parameteri

Karl Hajjar 0 Nov 02, 2021
Memory-efficient optimum einsum using opt_einsum planning and PyTorch kernels.

opt-einsum-torch There have been many implementations of Einstein's summation. numpy's numpy.einsum is the least efficient one as it only runs in sing

Haoyan Huo 9 Nov 18, 2022
Code for "Solving Graph-based Public Good Games with Tree Search and Imitation Learning"

Code for "Solving Graph-based Public Good Games with Tree Search and Imitation Learning" This is the code for the paper Solving Graph-based Public Goo

Victor-Alexandru Darvariu 3 Dec 05, 2022
Load What You Need: Smaller Multilingual Transformers for Pytorch and TensorFlow 2.0.

Smaller Multilingual Transformers This repository shares smaller versions of multilingual transformers that keep the same representations offered by t

Geotrend 79 Dec 28, 2022
Python binding for Khiva library.

Khiva-Python Build Documentation Build Linux and Mac OS Build Windows Code Coverage README This is the Khiva Python binding, it allows the usage of Kh

Shapelets 46 Oct 16, 2022
Using Machine Learning to Test Causal Hypotheses in Conjoint Analysis

Readme File for "Using Machine Learning to Test Causal Hypotheses in Conjoint Analysis" by Ham, Imai, and Janson. (2022) All scripts were written and

0 Jan 27, 2022
This is the official implementation of Elaborative Rehearsal for Zero-shot Action Recognition (ICCV2021)

Elaborative Rehearsal for Zero-shot Action Recognition This is an official implementation of: Shizhe Chen and Dong Huang, Elaborative Rehearsal for Ze

DeLightCMU 26 Sep 24, 2022
Repository to run object detection on a model trained on an autonomous driving dataset.

Autonomous Driving Object Detection on the Raspberry Pi 4 Description of Repository This repository contains code and instructions to configure the ne

Ethan 51 Nov 17, 2022
PyTorch implementation of the TTC algorithm

Trust-the-Critics This repository is a PyTorch implementation of the TTC algorithm and the WGAN misalignment experiments presented in Trust the Critic

0 Nov 29, 2021
[ICCV2021] IICNet: A Generic Framework for Reversible Image Conversion

IICNet - Invertible Image Conversion Net Official PyTorch Implementation for IICNet: A Generic Framework for Reversible Image Conversion (ICCV2021). D

felixcheng97 55 Dec 06, 2022
Unifying Global-Local Representations in Salient Object Detection with Transformer

GLSTR (Global-Local Saliency Transformer) This is the official implementation of paper "Unifying Global-Local Representations in Salient Object Detect

11 Aug 24, 2022
QAHOI: Query-Based Anchors for Human-Object Interaction Detection (paper)

QAHOI QAHOI: Query-Based Anchors for Human-Object Interaction Detection (paper) Requirements PyTorch = 1.5.1 torchvision = 0.6.1 pip install -r requ

38 Dec 29, 2022
A PyTorch implementation of EfficientNet and EfficientNetV2 (coming soon!)

EfficientNet PyTorch Quickstart Install with pip install efficientnet_pytorch and load a pretrained EfficientNet with: from efficientnet_pytorch impor

Luke Melas-Kyriazi 7.2k Jan 06, 2023
CLIP: Connecting Text and Image (Learning Transferable Visual Models From Natural Language Supervision)

CLIP (Contrastive Languageā€“Image Pre-training) Experiments (Evaluation) Model Dataset Acc (%) ViT-B/32 (Paper) CIFAR100 65.1 ViT-B/32 (Our) CIFAR100 6

Myeongjun Kim 52 Jan 07, 2023
Code for "Neural Parts: Learning Expressive 3D Shape Abstractions with Invertible Neural Networks", CVPR 2021

Neural Parts: Learning Expressive 3D Shape Abstractions with Invertible Neural Networks This repository contains the code that accompanies our CVPR 20

Despoina Paschalidou 161 Dec 20, 2022
Code to reproduce the experiments from our NeurIPS 2021 paper " The Limitations of Large Width in Neural Networks: A Deep Gaussian Process Perspective"

Code To run: python runner.py new --save SAVE_NAME --data PATH_TO_DATA_DIR --dataset DATASET --model model_name [options] --n 1000 - train - t

Geoff Pleiss 5 Dec 12, 2022
PPLNN is a Primitive Library for Neural Network is a high-performance deep-learning inference engine for efficient AI inferencing

PPLNN is a Primitive Library for Neural Network is a high-performance deep-learning inference engine for efficient AI inferencing

943 Jan 07, 2023
Implementation of GeoDiff: a Geometric Diffusion Model for Molecular Conformation Generation (ICLR 2022).

GeoDiff: a Geometric Diffusion Model for Molecular Conformation Generation [OpenReview] [arXiv] [Code] The official implementation of GeoDiff: A Geome

Minkai Xu 155 Dec 26, 2022
PyTorch deep learning projects made easy.

PyTorch Template Project PyTorch deep learning project made easy. PyTorch Template Project Requirements Features Folder Structure Usage Config file fo

Victor Huang 3.8k Jan 01, 2023