This repository provides the official code for GeNER (an automated dataset Generation framework for NER).

Overview

GeNER

This repository provides the official code for GeNER (an automated dataset Generation framework for NER).

Overview of GeNER

GeNER allows you to build NER models for specific entity types of interest without human-labeled data and and rich dictionaries. The core idea is to ask simple natural language questions to an open-domain question answering (QA) system and then retrieve phrases and sentences, as shown in the query formulation and retrieval stages in the figure below. Please see our paper (Simple Questions Generate Named Entity Recognition Datasets) for details.

Requirements

Please follow the instructions below to set up your environment and install GeNER.

# Create a conda virtual environment
conda create -n GeNER python=3.8
conda activate GeNER

# Install PyTorch
conda install pytorch=1.9.0 cudatoolkit=11.1 -c pytorch -c conda-forge

# Install GeNER
git clone https://github.com/dmis-lab/GeNER.git
cd GeNER
pip install -r requirements.txt

NER Benchmarks

Run unzip data/benchmarks.zip -d ./data to unpack (pre-processed) NER benchmarks.

QA Model and Phrase Index: DensePhrases

We use DensePhrases and a Wikipedia index precomputed by DensePhrases in order to automatically generate NER datasets. After installing DensePhrases v1.0.0, please download the DensePhrases model (densephrases-multi-query-multi) and the phrase index (densephrases-multi_wiki-20181220) in the official DensePhrases repository.

AutoPhrase (Optional)

Using AutoPhrase in the dictionary matching stage usually improves final NER performance. If you are using AutoPhrase to apply Rule 10 (i.e., refining entity boundaries), please check the system requirements in the AutoPhrase repository. If you are not using AutoPhrase, set refine_boundary to false in a configuration file in the configs directory.

Computational Resource

Please see the resource requirement of DensePhrases and self-training, and check available resources of your machine.

  • 100GB RAM and a single 11G GPU to run DensePhrases
  • Single 9G GPU to perform self-training (based on batch size 16)

Reproducing Experiments

GeNER is implemented as a pipeline of DensePhrases, dictionary matching, and AutoPhrase. The entire pipeline is controlled by configuration files located in the configs directory. Please see configs/README.md for details.

We have already set up configuration files and optimal hyperparameters for all benchmarks and experiments so that you can easily reproduce similar or better performance to those presented in our paper. Just follow the instructions below for reproduction!

Example: low-resource NER (CoNLL-2003)

This example is intended to reproduce the experiment in the low-resource NER setting on the CoNLL-2003 benchmark. If you want to reproduce other experiments, you will need to change some arguments including --gener_config_path according to the target benchmark.

Retrieval

Running retrieve.py will create *.json and *.raw files in the data/retrieved/conll-2003 directory.

export CUDA_VISIBLE_DEVICES=0
export DENSEPHRASES_PATH={enter your densephrases path here}
export CONFIG_PATH=./configs/conll_config.json

python retrieve.py \
      --run_mode eval \
      --model_type bert \
      --cuda \
      --aggregate \
      --truecase \
      --return_sent \
      --pretrained_name_or_path SpanBERT/spanbert-base-cased \
      --dump_dir $DENSEPHRASES_PATH/outputs/densephrases-multi_wiki-20181220/dump/ \
      --index_name start/1048576_flat_OPQ96  \
      --load_dir $DENSEPHRASES_PATH/outputs/densephrases-multi-query-multi/  \
      --gener_config_path $CONFIG_PATH

Applying AutoPhrase (optional)

apply_autophrase.sh takes as input all *.raw files in the data/retrieved/conll-2003 directory and outputs *.autophrase files in the same directory.

bash autophrase/apply_autophrase.sh data/retrieved/conll-2003

Dictionary matching

Running annotate.py will create train.json and train_hf.json files in the data/annotated/conll-2003 directory. The first JSON file is used in this repository, especially in the self-training stage. The second one has the same data format as the Hugging Face Transformers library and is provided for your convenience.

python annotate.py --gener_config_path $CONFIG_PATH

Self-training

Finally, you can get the final NER model and see its performance. The model and training logs are stored in the ./outputs directory. See the Makefile file for running experiments on other benchmarks.

make conll-low

Fine-tuning GeNER

While GeNER performs well without any human-labeled data, you can further boost GeNER's performance using some training examples. The way to do this is very simple: load a trained GeNER model from the ./outputs directory and fine-tune it on training examples you have by a standard NER objective (i.e., token classification). We provide a fine-tuning script in this repository (self-training/run_ner.py) and datasets to reproduce fine-grained and few-shot NER experiments (data/fine-grained and data/few-shot directories).

export CUDA_VISIBLE_DEVICES=0

python self-training/run_ner.py \
      --data_dir data/few-shot/conll-2003/conll-2003_0 \
      --model_type bert \
      --model_name_or_path outputs/{enter GeNER model path here} \
      --output_dir outputs/{enter GeNER model path here} \
      --num_train_epochs 100 \
      --per_gpu_train_batch_size 64 \
      --per_gpu_eval_batch_size 64 \
      --learning_rate 1e-5 \
      --do_train \
      --do_eval \
      --do_test \
      --evaluate_during_training

# Note that this hyperparameter setup may not be optimal. It is recommended to search for more effective hyperparameters, especially the learning rate.

Building NER Models for Your Specific Needs

The main benefit of GeNER is that you can create NER datasets of new and different entity types you want to extract. Suppose you want to extract fighter aircraft names. The first thing you have to do is to formulate your needs as natural language questions such as "Which fighter aircraft?." At this stage, we recommend using the DensePhrases demo to manually check the feasibility of your questions. If relevant phrases are retrieved well, you can proceed to the next step.

Next, you should make a configuration file (e.g., fighter_aircraft_config.json) and set up its values. You can reflect questions you made in the configuration file as follows: "subtype": "fighter aircraft". Also, you can fine-tune some hyperparameters such as top_k and normalization rules. See configs/README.md for detailed descriptions of configuration files.

{
    "retrieved_path": "data/retrieved/{file name}",
    "annotated_path": "data/annotated/{file name}",
    "add_abbreviation": true,
    "refine_boundary" : true,
    "subquestion_configs": [
        {
            "type": "{the name of pre-defined entity type}",
            "subtype" : "fighter aircraft",
            "top_k" : 5000,
            "split_composite_mention": true,
            "remove_lowercase_phrase": true,
            "remove_the": false,
            "skip_lowercase_ngram": 1
        }
    ]
}

For subsequent steps (i.e., retrieval, dictionary matching, and self-training), refer to the CoNLL-2003 example described above.

References

Please cite our paper if you consider GeNER to be related to your work. Thanks!

@article{kim2021simple,
      title={Simple Questions Generate Named Entity Recognition Datasets}, 
      author={Hyunjae Kim and Jaehyo Yoo and Seunghyun Yoon and Jinhyuk Lee and Jaewoo Kang},
      year={2021},
      eprint={2112.08808},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contact

Feel free to email Hyunjae Kim ([email protected]) if you have any questions.

License

See the LICENSE file for details.

Owner
DMIS Laboratory - Korea University
Data Mining & Information Systems Laboratory @ Korea University
DMIS Laboratory - Korea University
A PyTorch implementation of "Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks" (KDD 2019).

ClusterGCN ⠀⠀ A PyTorch implementation of "Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks" (KDD 2019). A

Benedek Rozemberczki 697 Dec 27, 2022
Scribble-Supervised LiDAR Semantic Segmentation, CVPR 2022 (ORAL)

Scribble-Supervised LiDAR Semantic Segmentation Dataset and code release for the paper Scribble-Supervised LiDAR Semantic Segmentation, CVPR 2022 (ORA

102 Dec 25, 2022
The official implementation of paper Siamese Transformer Pyramid Networks for Real-Time UAV Tracking, accepted by WACV22

SiamTPN Introduction This is the official implementation of the SiamTPN (WACV2022). The tracker intergrates pyramid feature network and transformer in

Robotics and Intelligent Systems Control @ NYUAD 29 Jan 08, 2023
Deep Crop Rotation

Deep Crop Rotation Paper (to come very soon!) We propose a deep learning approach to modelling both inter- and intra-annual patterns for parcel classi

Félix Quinton 5 Sep 23, 2022
EDPN: Enhanced Deep Pyramid Network for Blurry Image Restoration

EDPN: Enhanced Deep Pyramid Network for Blurry Image Restoration Ruikang Xu, Zeyu Xiao, Jie Huang, Yueyi Zhang, Zhiwei Xiong. EDPN: Enhanced Deep Pyra

69 Dec 15, 2022
Code for our ACL 2021 paper - ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer

ConSERT Code for our ACL 2021 paper - ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer Requirements torch==1.6.0

Yan Yuanmeng 478 Dec 25, 2022
Source code of AAAI 2022 paper "Towards End-to-End Image Compression and Analysis with Transformers".

Towards End-to-End Image Compression and Analysis with Transformers Source code of our AAAI 2022 paper "Towards End-to-End Image Compression and Analy

37 Dec 21, 2022
SMIS - Semantically Multi-modal Image Synthesis(CVPR 2020)

Semantically Multi-modal Image Synthesis Project page / Paper / Demo Semantically Multi-modal Image Synthesis(CVPR2020). Zhen Zhu, Zhiliang Xu, Anshen

316 Dec 01, 2022
Official PyTorch Implementation of SSMix (Findings of ACL 2021)

SSMix: Saliency-based Span Mixup for Text Classification (Findings of ACL 2021) Official PyTorch Implementation of SSMix | Paper Abstract Data augment

Clova AI Research 52 Dec 27, 2022
Code for Understanding Pooling in Graph Neural Networks

Select, Reduce, Connect This repository contains the code used for the experiments of: "Understanding Pooling in Graph Neural Networks" Setup Install

Daniele Grattarola 37 Dec 13, 2022
OCR Post Correction for Endangered Language Texts

📌 Coming soon: an update to the software including features from our paper on semi-supervised OCR post-correction, to be published in the Transaction

Shruti Rijhwani 96 Dec 31, 2022
This project is the official implementation of our accepted ICLR 2021 paper BiPointNet: Binary Neural Network for Point Clouds.

BiPointNet: Binary Neural Network for Point Clouds Created by Haotong Qin, Zhongang Cai, Mingyuan Zhang, Yifu Ding, Haiyu Zhao, Shuai Yi, Xianglong Li

Haotong Qin 59 Dec 17, 2022
pixelNeRF: Neural Radiance Fields from One or Few Images

pixelNeRF: Neural Radiance Fields from One or Few Images Alex Yu, Vickie Ye, Matthew Tancik, Angjoo Kanazawa UC Berkeley arXiv: http://arxiv.org/abs/2

Alex Yu 1k Jan 04, 2023
A Python Package for Portfolio Optimization using the Critical Line Algorithm

PyCLA A Python Package for Portfolio Optimization using the Critical Line Algorithm Getting started To use PyCLA, clone the repo and install the requi

19 Oct 11, 2022
A medical imaging framework for Pytorch

Welcome to MedicalTorch MedicalTorch is an open-source framework for PyTorch, implementing an extensive set of loaders, pre-processors and datasets fo

Christian S. Perone 799 Jan 03, 2023
TilinGNN: Learning to Tile with Self-Supervised Graph Neural Network (SIGGRAPH 2020)

TilinGNN: Learning to Tile with Self-Supervised Graph Neural Network (SIGGRAPH 2020) About The goal of our research problem is illustrated below: give

59 Dec 09, 2022
source code and pre-trained/fine-tuned checkpoint for NAACL 2021 paper LightningDOT

LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval This repository contains source code and pre-trained/fine-tun

Siqi 65 Dec 26, 2022
PyTorch implementation of Convolutional Neural Fabrics http://arxiv.org/abs/1606.02492

PyTorch implementation of Convolutional Neural Fabrics arxiv:1606.02492 There are some minor differences: The raw image is first convolved, to obtain

Anuvabh Dutt 25 Dec 22, 2021
Piotr - IoT firmware emulation instrumentation for training and research

Piotr: Pythonic IoT exploitation and Research Introduction to Piotr Piotr is an emulation helper for Qemu that provides a convenient way to create, sh

Damien Cauquil 51 Nov 09, 2022
Implementation of ICLR 2020 paper "Revisiting Self-Training for Neural Sequence Generation"

Self-Training for Neural Sequence Generation This repo includes instructions for running noisy self-training algorithms from the following paper: Revi

Junxian He 45 Dec 31, 2022