ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information

Overview

ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information

This repository contains code, model, dataset for ChineseBERT at ACL2021.

ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information
Zijun Sun, Xiaoya Li, Xiaofei Sun, Yuxian Meng, Xiang Ao, Qing He, Fei Wu and Jiwei Li

Guide

Section Description
Introduction Introduction to ChineseBERT
Download Download links for ChineseBERT
Quick tour Learn how to quickly load models
Experiment Experiment results on different Chinese NLP datasets
Citation Citation
Contact How to contact us

Introduction

We propose ChineseBERT, which incorporates both the glyph and pinyin information of Chinese characters into language model pretraining.

First, for each Chinese character, we get three kind of embedding.

  • Char Embedding: the same as origin BERT token embedding.
  • Glyph Embedding: capture visual features based on different fonts of a Chinese character.
  • Pinyin Embedding: capture phonetic feature from the pinyin sequence ot a Chinese Character.

Then, char embedding, glyph embedding and pinyin embedding are first concatenated, and mapped to a D-dimensional embedding through a fully connected layer to form the fusion embedding.
Finally, the fusion embedding is added with the position embedding, which is fed as input to the BERT model.
The following image shows an overview architecture of ChineseBERT model.

MODEL

ChineseBERT leverages the glyph and pinyin information of Chinese characters to enhance the model's ability of capturing context semantics from surface character forms and disambiguating polyphonic characters in Chinese.

Download

We provide pre-trained ChineseBERT models in Pytorch version and followed huggingFace model format.

  • ChineseBERT-base:12-layer, 768-hidden, 12-heads, 147M parameters
  • ChineseBERT-large: 24-layer, 1024-hidden, 16-heads, 374M parameters

Our model can be downloaded here:

Model Model Hub Size
ChineseBERT-base Pytorch 564M
ChineseBERT-large Pytorch 1.4G

Note: The model hub contains model, fonts and pinyin config files.

Quick tour

We train our model with Huggingface, so the model can be easily loaded.
Download ChineseBERT model and save at [CHINESEBERT_PATH].
Here is a quick tour to load our model.

>>> from models.modeling_glycebert import GlyceBertForMaskedLM

>>> chinese_bert = GlyceBertForMaskedLM.from_pretrained([CHINESEBERT_PATH])
>>> print(chinese_bert)

The complete example can be find here: Masked word completion with ChineseBERT

Another example to get representation of a sentence:

>>> from datasets.bert_dataset import BertDataset
>>> from models.modeling_glycebert import GlyceBertModel

>>> tokenizer = BertDataset([CHINESEBERT_PATH])
>>> chinese_bert = GlyceBertModel.from_pretrained([CHINESEBERT_PATH])
>>> sentence = '我喜欢猫'

>>> input_ids, pinyin_ids = tokenizer.tokenize_sentence(sentence)
>>> length = input_ids.shape[0]
>>> input_ids = input_ids.view(1, length)
>>> pinyin_ids = pinyin_ids.view(1, length, 8)
>>> output_hidden = chinese_bert.forward(input_ids, pinyin_ids)[0]
>>> print(output_hidden)
tensor([[[ 0.0287, -0.0126,  0.0389,  ...,  0.0228, -0.0677, -0.1519],
         [ 0.0144, -0.2494, -0.1853,  ...,  0.0673,  0.0424, -0.1074],
         [ 0.0839, -0.2989, -0.2421,  ...,  0.0454, -0.1474, -0.1736],
         [-0.0499, -0.2983, -0.1604,  ..., -0.0550, -0.1863,  0.0226],
         [ 0.1428, -0.0682, -0.1310,  ..., -0.1126,  0.0440, -0.1782],
         [ 0.0287, -0.0126,  0.0389,  ...,  0.0228, -0.0677, -0.1519]]],
       grad_fn=)

The complete code can be find HERE

Experiments

ChnSetiCorp

ChnSetiCorp is a dataset for sentiment analysis.
Evaluation Metrics: Accuracy

Model Dev Test
ERNIE 95.4 95.5
BERT 95.1 95.4
BERT-wwm 95.4 95.3
RoBERTa 95.0 95.6
MacBERT 95.2 95.6
ChineseBERT 95.6 95.7
---- ----
RoBERTa-large 95.8 95.8
MacBERT-large 95.7 95.9
ChineseBERT-large 95.8 95.9

Training details and code can be find HERE

THUCNews

THUCNews contains news in 10 categories.
Evaluation Metrics: Accuracy

Model Dev Test
ERNIE 95.4 95.5
BERT 95.1 95.4
BERT-wwm 95.4 95.3
RoBERTa 95.0 95.6
MacBERT 95.2 95.6
ChineseBERT 95.6 95.7
---- ----
RoBERTa-large 95.8 95.8
MacBERT-large 95.7 95.9
ChineseBERT-large 95.8 95.9

Training details and code can be find HERE

XNLI

XNLI is a dataset for natural language inference.
Evaluation Metrics: Accuracy

Model Dev Test
ERNIE 79.7 78.6
BERT 79.0 78.2
BERT-wwm 79.4 78.7
RoBERTa 80.0 78.8
MacBERT 80.3 79.3
ChineseBERT 80.5 79.6
---- ----
RoBERTa-large 82.1 81.2
MacBERT-large 82.4 81.3
ChineseBERT-large 82.7 81.6

Training details and code can be find HERE

BQ

BQ Corpus is a sentence pair matching dataset.
Evaluation Metrics: Accuracy

Model Dev Test
ERNIE 86.3 85.0
BERT 86.1 85.2
BERT-wwm 86.4 85.3
RoBERTa 86.0 85.0
MacBERT 86.0 85.2
ChineseBERT 86.4 85.2
---- ----
RoBERTa-large 86.3 85.8
MacBERT-large 86.2 85.6
ChineseBERT-large 86.5 86.0

Training details and code can be find HERE

LCQMC

LCQMC Corpus is a sentence pair matching dataset.
Evaluation Metrics: Accuracy

Model Dev Test
ERNIE 89.8 87.2
BERT 89.4 87.0
BERT-wwm 89.6 87.1
RoBERTa 89.0 86.4
MacBERT 89.5 87.0
ChineseBERT 89.8 87.4
---- ----
RoBERTa-large 90.4 87.0
MacBERT-large 90.6 87.6
ChineseBERT-large 90.5 87.8

Training details and code can be find HERE

TNEWS

TNEWS is a 15-class short news text classification dataset.
Evaluation Metrics: Accuracy

Model Dev Test
ERNIE 58.24 58.33
BERT 56.09 56.58
BERT-wwm 56.77 56.86
RoBERTa 57.51 56.94
ChineseBERT 58.64 58.95
---- ----
RoBERTa-large 58.32 58.61
ChineseBERT-large 59.06 59.47

Training details and code can be find HERE

CMRC

CMRC is a machin reading comprehension task dataset.
Evaluation Metrics: EM

Model Dev Test
ERNIE 66.89 74.70
BERT 66.77 71.60
BERT-wwm 66.96 73.95
RoBERTa 67.89 75.20
MacBERT - -
ChineseBERT 67.95 95.7
---- ----
RoBERTa-large 70.59 77.95
ChineseBERT-large 70.70 78.05

Training details and code can be find HERE

OntoNotes

OntoNotes 4.0 is a Chinese named entity recognition dataset and contains 18 named entity types.

Evaluation Metrics: Span-Level F1

Model Test Precision Test Recall Test F1
BERT 79.69 82.09 80.87
RoBERTa 80.43 80.30 80.37
ChineseBERT 80.03 83.33 81.65
---- ---- ----
RoBERTa-large 80.72 82.07 81.39
ChineseBERT-large 80.77 83.65 82.18

Training details and code can be find HERE

Weibo

Weibo is a Chinese named entity recognition dataset and contains 4 named entity types.

Evaluation Metrics: Span-Level F1

Model Test Precision Test Recall Test F1
BERT 67.12 66.88 67.33
RoBERTa 68.49 67.81 68.15
ChineseBERT 68.27 69.78 69.02
---- ---- ----
RoBERTa-large 66.74 70.02 68.35
ChineseBERT-large 68.75 72.97 70.80

Training details and code can be find HERE

Contact

If you have any question about our paper/code/modal/data...
Please feel free to discuss through github issues or emails.
You can send email to [email protected] or [email protected]

Computationally efficient algorithm that identifies boundary points of a point cloud.

BoundaryTest Included are MATLAB and Python packages, each of which implement efficient algorithms for boundary detection and normal vector estimation

6 Dec 09, 2022
Official PyTorch Implementation for InfoSwap: Information Bottleneck Disentanglement for Identity Swapping

InfoSwap: Information Bottleneck Disentanglement for Identity Swapping Code usage Please check out the user manual page. Paper Gege Gao, Huaibo Huang,

Grace Hešeri 56 Dec 20, 2022
Robust & Reliable Route Recommendation on Road Networks

NeuroMLR: Robust & Reliable Route Recommendation on Road Networks This repository is the official implementation of NeuroMLR: Robust & Reliable Route

4 Dec 20, 2022
The source code of the ICCV2021 paper "PIRenderer: Controllable Portrait Image Generation via Semantic Neural Rendering"

The source code of the ICCV2021 paper "PIRenderer: Controllable Portrait Image Generation via Semantic Neural Rendering"

Ren Yurui 261 Jan 09, 2023
codebase for "A Theory of the Inductive Bias and Generalization of Kernel Regression and Wide Neural Networks"

Eigenlearning This repo contains code for replicating the experiments of the paper A Theory of the Inductive Bias and Generalization of Kernel Regress

Jamie Simon 45 Dec 02, 2022
Interactive Image Generation via Generative Adversarial Networks

iGAN: Interactive Image Generation via Generative Adversarial Networks Project | Youtube | Paper Recent projects: [pix2pix]: Torch implementation for

Jun-Yan Zhu 3.9k Dec 23, 2022
Server files for UltimateLabeling

UltimateLabeling server files Server files for UltimateLabeling. git clone https://github.com/alexandre01/UltimateLabeling_server.git cd UltimateLabel

Alexandre Carlier 4 Oct 10, 2022
Python library for analysis of time series data including dimensionality reduction, clustering, and Markov model estimation

deeptime Releases: Installation via conda recommended. conda install -c conda-forge deeptime pip install deeptime Documentation: deeptime-ml.github.io

495 Dec 28, 2022
[CVPR 2022] Official Pytorch code for OW-DETR: Open-world Detection Transformer

OW-DETR: Open-world Detection Transformer (CVPR 2022) [Paper] Akshita Gupta*, Sanath Narayan*, K J Joseph, Salman Khan, Fahad Shahbaz Khan, Mubarak Sh

Akshita Gupta 127 Dec 27, 2022
Full Transformer Framework for Robust Point Cloud Registration with Deep Information Interaction

Full Transformer Framework for Robust Point Cloud Registration with Deep Information Interaction. arxiv This repository contains python scripts for tr

12 Dec 12, 2022
Computations and statistics on manifolds with geometric structures.

Geomstats Code Continuous Integration Code coverage (numpy) Code coverage (autograd, tensorflow, pytorch) Documentation Community NEWS: Geomstats is r

875 Dec 31, 2022
Totally Versatile Miscellanea for Pytorch

Totally Versatile Miscellania for PyTorch Thomas Viehmann [email protected] Thi

Thomas Viehmann 428 Dec 28, 2022
The official PyTorch code for 'DER: Dynamically Expandable Representation for Class Incremental Learning' accepted by CVPR2021

DER.ClassIL.Pytorch This repo is the official implementation of DER: Dynamically Expandable Representation for Class Incremental Learning (CVPR 2021)

rhyssiyan 108 Jan 01, 2023
Code repo for "Cross-Scale Internal Graph Neural Network for Image Super-Resolution" (NeurIPS'20)

IGNN Code repo for "Cross-Scale Internal Graph Neural Network for Image Super-Resolution" [paper] [supp] Prepare datasets 1 Download training dataset

Shangchen Zhou 278 Jan 03, 2023
[ICCV 2021] Official PyTorch implementation for Deep Relational Metric Learning.

Ranking Models in Unlabeled New Environments Prerequisites This code uses the following libraries Python 3.7 NumPy PyTorch 1.7.0 + torchivision 0.8.1

Borui Zhang 39 Dec 10, 2022
Code for "Unsupervised State Representation Learning in Atari"

Unsupervised State Representation Learning in Atari Ankesh Anand*, Evan Racah*, Sherjil Ozair*, Yoshua Bengio, Marc-Alexandre Côté, R Devon Hjelm This

Mila 217 Jan 03, 2023
The DL Streamer Pipeline Zoo is a catalog of optimized media and media analytics pipelines.

The DL Streamer Pipeline Zoo is a catalog of optimized media and media analytics pipelines. It includes tools for downloading pipelines and their dependencies and tools for measuring their performace

8 Dec 04, 2022
Bottom-up attention model for image captioning and VQA, based on Faster R-CNN and Visual Genome

bottom-up-attention This code implements a bottom-up attention model, based on multi-gpu training of Faster R-CNN with ResNet-101, using object and at

Peter Anderson 1.3k Jan 09, 2023
Implementation for Simple Spectral Graph Convolution in ICLR 2021

Simple Spectral Graph Convolutional Overview This repo contains an example implementation of the Simple Spectral Graph Convolutional (S^2GC) model. Th

allenhaozhu 64 Dec 31, 2022
Nonuniform-to-Uniform Quantization: Towards Accurate Quantization via Generalized Straight-Through Estimation. In CVPR 2022.

Nonuniform-to-Uniform Quantization This repository contains the training code of N2UQ introduced in our CVPR 2022 paper: "Nonuniform-to-Uniform Quanti

Zechun Liu 60 Dec 28, 2022