Sorce code and datasets for "K-BERT: Enabling Language Representation with Knowledge Graph",

Overview

K-BERT

Sorce code and datasets for "K-BERT: Enabling Language Representation with Knowledge Graph", which is implemented based on the UER framework.

Requirements

Software:

Python3
Pytorch >= 1.0
argparse == 1.1

Prepare

  • Download the google_model.bin from here, and save it to the models/ directory.
  • Download the CnDbpedia.spo from here, and save it to the brain/kgs/ directory.
  • Optional - Download the datasets for evaluation from here, unzip and place them in the datasets/ directory.

The directory tree of K-BERT:

K-BERT
├── brain
│   ├── config.py
│   ├── __init__.py
│   ├── kgs
│   │   ├── CnDbpedia.spo
│   │   ├── HowNet.spo
│   │   └── Medical.spo
│   └── knowgraph.py
├── datasets
│   ├── book_review
│   │   ├── dev.tsv
│   │   ├── test.tsv
│   │   └── train.tsv
│   ├── chnsenticorp
│   │   ├── dev.tsv
│   │   ├── test.tsv
│   │   └── train.tsv
│    ...
│
├── models
│   ├── google_config.json
│   ├── google_model.bin
│   └── google_vocab.txt
├── outputs
├── uer
├── README.md
├── requirements.txt
├── run_kbert_cls.py
└── run_kbert_ner.py

K-BERT for text classification

Classification example

Run example on Book review with CnDbpedia:

CUDA_VISIBLE_DEVICES='0' nohup python3 -u run_kbert_cls.py \
    --pretrained_model_path ./models/google_model.bin \
    --config_path ./models/google_config.json \
    --vocab_path ./models/google_vocab.txt \
    --train_path ./datasets/book_review/train.tsv \
    --dev_path ./datasets/book_review/dev.tsv \
    --test_path ./datasets/book_review/test.tsv \
    --epochs_num 5 --batch_size 32 --kg_name CnDbpedia \
    --output_model_path ./outputs/kbert_bookreview_CnDbpedia.bin \
    > ./outputs/kbert_bookreview_CnDbpedia.log &

Results:

Best accuracy in dev : 88.80%
Best accuracy in test: 87.69%

Options of run_kbert_cls.py:

useage: [--pretrained_model_path] - Path to the pre-trained model parameters.
        [--config_path] - Path to the model configuration file.
        [--vocab_path] - Path to the vocabulary file.
        --train_path - Path to the training dataset.
        --dev_path - Path to the validating dataset.
        --test_path - Path to the testing dataset.
        [--epochs_num] - The number of training epoches.
        [--batch_size] - Batch size of the training process.
        [--kg_name] - The name of knowledge graph, "HowNet", "CnDbpedia" or "Medical".
        [--output_model_path] - Path to the output model.

Classification benchmarks

Accuracy (dev/test %) on different dataset:

Dataset HowNet CnDbpedia
Book review 88.75/87.75 88.80/87.69
ChnSentiCorp 95.00/95.50 94.42/95.25
Shopping 97.01/96.92 96.94/96.73
Weibo 98.22/98.33 98.29/98.33
LCQMC 88.97/87.14 88.91/87.20
XNLI 77.11/77.07 76.99/77.43

K-BERT for named entity recognization (NER)

NER example

Run an example on the msra_ner dataset with CnDbpedia:

CUDA_VISIBLE_DEVICES='0' nohup python3 -u run_kbert_ner.py \
    --pretrained_model_path ./models/google_model.bin \
    --config_path ./models/google_config.json \
    --vocab_path ./models/google_vocab.txt \
    --train_path ./datasets/msra_ner/train.tsv \
    --dev_path ./datasets/msra_ner/dev.tsv \
    --test_path ./datasets/msra_ner/test.tsv \
    --epochs_num 5 --batch_size 16 --kg_name CnDbpedia \
    --output_model_path ./outputs/kbert_msraner_CnDbpedia.bin \
    > ./outputs/kbert_msraner_CnDbpedia.log &

Results:

The best in dev : precision=0.957, recall=0.962, f1=0.960
The best in test: precision=0.953, recall=0.959, f1=0.956

Options of run_kbert_ner.py:

useage: [--pretrained_model_path] - Path to the pre-trained model parameters.
        [--config_path] - Path to the model configuration file.
        [--vocab_path] - Path to the vocabulary file.
        --train_path - Path to the training dataset.
        --dev_path - Path to the validating dataset.
        --test_path - Path to the testing dataset.
        [--epochs_num] - The number of training epoches.
        [--batch_size] - Batch size of the training process.
        [--kg_name] - The name of knowledge graph.
        [--output_model_path] - Path to the output model.

K-BERT for domain-specific tasks

Experimental results on domain-specific tasks (Precision/Recall/F1 %):

KG Finance_QA Law_QA Finance_NER Medicine_NER
HowNet 0.805/0.888/0.845 0.842/0.903/0.871 0.860/0.888/0.874 0.935/0.939/0.937
CN-DBpedia 0.814/0.881/0.846 0.814/0.942/0.874 0.860/0.887/0.873 0.935/0.937/0.936
MedicalKG -- -- -- 0.944/0.943/0.944

Acknowledgement

This work is a joint study with the support of Peking University and Tencent Inc.

If you use this code, please cite this paper:

@inproceedings{weijie2019kbert,
  title={{K-BERT}: Enabling Language Representation with Knowledge Graph},
  author={Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Qi Ju, Haotang Deng, Ping Wang},
  booktitle={Proceedings of AAAI 2020},
  year={2020}
}
Chinese real time voice cloning (VC) and Chinese text to speech (TTS).

Chinese real time voice cloning (VC) and Chinese text to speech (TTS). 好用的中文语音克隆兼中文语音合成系统,包含语音编码器、语音合成器、声码器和可视化模块。

Kuang Dada 6 Nov 08, 2022
Deal or No Deal? End-to-End Learning for Negotiation Dialogues

Introduction This is a PyTorch implementation of the following research papers: (1) Hierarchical Text Generation and Planning for Strategic Dialogue (

Facebook Research 1.4k Dec 29, 2022
Just a basic Telegram AI chat bot written in Python using Pyrogram.

Nikko ChatBot Just a basic Telegram AI chat bot written in Python using Pyrogram. Requirements Python 3.7 or higher. A bot token. Installation $ https

ʀᴇxɪɴᴀᴢᴏʀ 2 Oct 21, 2022
Script and models for clustering LAION-400m CLIP embeddings.

clustering-laion400m Script and models for clustering LAION-400m CLIP embeddings. Models were fit on the first million or so image embeddings. A subje

Peter Baylies 22 Oct 04, 2022
🦆 Contextually-keyed word vectors

sense2vec: Contextually-keyed word vectors sense2vec (Trask et. al, 2015) is a nice twist on word2vec that lets you learn more interesting and detaile

Explosion 1.5k Dec 25, 2022
Natural language Understanding Toolkit

Natural language Understanding Toolkit TOC Requirements Installation Documentation CLSCL NER References Requirements To install nut you need: Python 2

Peter Prettenhofer 119 Oct 08, 2022
A text file containing 479k English words for all your dictionary/word-based projects e.g: auto-completion / autosuggestion

List Of English Words A text file containing over 466k English words. While searching for a list of english words (for an auto-complete tutorial) I fo

dwyl 8.5k Jan 03, 2023
CDLA: A Chinese document layout analysis (CDLA) dataset

CDLA: A Chinese document layout analysis (CDLA) dataset 介绍 CDLA是一个中文文档版面分析数据集,面向中文文献类(论文)场景。包含以下10个label: 正文 标题 图片 图片标题 表格 表格标题 页眉 页脚 注释 公式 Text Title

buptlihang 84 Dec 28, 2022
Prithivida 690 Jan 04, 2023
A repo for materials relating to the tutorial of CS-332 NLP

CS-332-NLP A repo for materials relating to the tutorial of CS-332 NLP Contents Tutorial 1: Introduction Corpus Regular expression Tokenization Tutori

Alok singh 9 Feb 15, 2022
The RWKV Language Model

RWKV-LM We propose the RWKV language model, with alternating time-mix and channel-mix layers: The R, K, V are generated by linear transforms of input,

PENG Bo 877 Jan 05, 2023
Entity Disambiguation as text extraction (ACL 2022)

ExtEnD: Extractive Entity Disambiguation This repository contains the code of ExtEnD: Extractive Entity Disambiguation, a novel approach to Entity Dis

Sapienza NLP group 121 Jan 03, 2023
Code for PED: DETR For (Crowd) Pedestrian Detection

Code for PED: DETR For (Crowd) Pedestrian Detection

36 Sep 13, 2022
String Gen + Word Checker

Creates random strings and checks if any of them are a real words. Mostly a waste of time ngl but it is cool to see it work and the fact that it can generate a real random word within10sec

1 Jan 06, 2022
Translation for Trilium Notes. Trilium Notes 中文版.

Trilium Translation 中文说明 This repo provides a translation for the awesome Trilium Notes. Currently, I have translated Trilium Notes into Chinese. Test

743 Jan 08, 2023
Sploitus - Command line search tool for sploitus.com. Think searchsploit, but with more POCs

Sploitus Command line search tool for sploitus.com. Think searchsploit, but with

watchdog2000 5 Mar 07, 2022
DAGAN - Dual Attention GANs for Semantic Image Synthesis

Contents Semantic Image Synthesis with DAGAN Installation Dataset Preparation Generating Images Using Pretrained Model Train and Test New Models Evalu

Hao Tang 104 Oct 08, 2022
Russian GPT3 models.

Russian GPT-3 models (ruGPT3XL, ruGPT3Large, ruGPT3Medium, ruGPT3Small) trained with 2048 sequence length with sparse and dense attention blocks. We also provide Russian GPT-2 large model (ruGPT2Larg

Sberbank AI 1.6k Jan 05, 2023
✔👉A Centralized WebApp to Ensure Road Safety by checking on with the activities of the driver and activating label generator using NLP.

AI-For-Road-Safety Challenge hosted by Omdena Hyderabad Chapter Original Repo Link : https://github.com/OmdenaAI/omdena-india-roadsafety Final Present

Prathima Kadari 7 Nov 29, 2022