The official implementation of "BERT is to NLP what AlexNet is to CV: Can Pre-Trained Language Models Identify Analogies?" (ACL 2021 main conference)

Overview

BERT is to NLP what AlexNet is to CV

This is the official implementation of "BERT is to NLP what AlexNet is to CV: Can Pre-Trained Language Models Identify Analogies?" (see the Citation section for the paper URL), which was accepted to the ACL 2021 main conference. We evaluate pretrained language models (LMs) on five analogy tests that follow an SAT-style format, as in the example below.

QUERY word:language
OPTION
  (1) paint:portrait
  (2) poetry:rhythm 
  (3) note:music <-- the answer!
  (4) tale:story
  (5) week:year 

We devise a new class of scoring functions, referred to as analogical proportion (AP) scores, to solve word analogies in an unsupervised fashion and to investigate the relational knowledge that LMs learn through pretraining.
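
As a rough illustration of how a scoring function can be used to answer an SAT-style question like the one above, the sketch below scores each candidate pair against the query and picks the highest-scoring option. It is not the AP score and not this repository's API: `AnalogyQuestion`, `to_prompt`, `predict`, and `toy_score` are hypothetical names introduced here, the prompt template is an assumption, and the hard-coded scorer only stands in for a real language-model score (for example, a negative perplexity).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Pair = Tuple[str, str]


@dataclass
class AnalogyQuestion:
    """One SAT-style question: a query pair, candidate pairs, and the gold index."""
    query: Pair
    options: List[Pair]
    answer: int  # 0-based index of the correct option


def to_prompt(query: Pair, option: Pair) -> str:
    # Hypothetical template; the templates used in the paper may differ.
    return f"{query[0]} is to {query[1]} what {option[0]} is to {option[1]}"


def predict(question: AnalogyQuestion, score: Callable[[str], float]) -> int:
    # Score every candidate prompt and return the index of the best one.
    scores = [score(to_prompt(question.query, opt)) for opt in question.options]
    return max(range(len(scores)), key=scores.__getitem__)


def toy_score(prompt: str) -> float:
    # Assumption: placeholder for a language-model score, hard-coded for illustration.
    return 1.0 if "note is to music" in prompt else 0.0


question = AnalogyQuestion(
    query=("word", "language"),
    options=[
        ("paint", "portrait"),
        ("poetry", "rhythm"),
        ("note", "music"),
        ("tale", "story"),
        ("week", "year"),
    ],
    answer=2,
)

prediction = predict(question, toy_score)
print("correct" if prediction == question.answer else "wrong")
```

Because the scorer is passed in as a plain callable, the same selection loop could be reused with any sentence-level score without changing the question format.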

Please see our paper for more information and discussion.

Get started

git clone https://github.com/asahi417/analogy-language-model
cd analogy-language-model
pip install -e .

Run Experiments

The following scripts reproduce our results in the paper.

# get results for our main AP score
python experiments/experiment_ppl_variants.py
# get results for the word embedding baseline
python experiments/experiment_word_embedding.py
# get results for other scoring functions, such as vector difference
python experiments/experiment_scoring_comparison.py

Here is a summary of the results that can be obtained by running those scripts.

Dataset

The datasets used in our experiments can be downloaded from the following link:

Please see the Analogy Tool for more information about the dataset and baselines.

Citation

Please cite our reference paper if you use our data or code:

@inproceedings{ushio-etal-2021-bert,
    title = "{BERT} is to {NLP} what {A}lex{N}et is to {CV}: Can Pre-Trained Language Models Identify Analogies?",
    author = "Ushio, Asahi  and
      Espinosa Anke, Luis  and
      Schockaert, Steven  and
      Camacho-Collados, Jose",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.280",
    doi = "10.18653/v1/2021.acl-long.280",
    pages = "3609--3624",
    abstract = "Analogies play a central role in human commonsense reasoning. The ability to recognize analogies such as {``}eye is to seeing what ear is to hearing{''}, sometimes referred to as analogical proportions, shape how we structure knowledge and understand language. Surprisingly, however, the task of identifying such analogies has not yet received much attention in the language model era. In this paper, we analyze the capabilities of transformer-based language models on this unsupervised task, using benchmarks obtained from educational settings, as well as more commonly used datasets. We find that off-the-shelf language models can identify analogies to a certain extent, but struggle with abstract and complex relations, and results are highly sensitive to model architecture and hyperparameters. Overall the best results were obtained with GPT-2 and RoBERTa, while configurations using BERT were not able to outperform word embedding models. Our results raise important questions for future work about how, and to what extent, pre-trained language models capture knowledge about abstract semantic relations.",
}

Please also cite the relevant reference papers if using any of the analogy datasets.

Owner
Asahi Ushio
PhD student at Cardiff University, working on NLP