Learn meanings behind words is a key element in NLP. This project concentrates on the disambiguation of preposition senses. Therefore, we train a bert-transformer model and surpass the state-of-the-art.

Overview

New State-of-the-Art in Preposition Sense Disambiguation

Supervisor:

Institutions:

Project Description

The disambiguation of words is a central part of NLP tasks. In particular, there is the ambiguity of prepositions, which has been a problem in NLP for over a decade and still is. For example the preposition 'in' can have a temporal (e.g. in 2021) or a spatial (e.g. in Frankuft) meaning. A strong motivation behind the learning of these meanings are current research attempts to transfer text to artifical scenes. A good understanding of the real meaning of prepositions is crucial in order for the machine to create matching scenes.

With the birth of the transformer models in 2017 [1], attention based models have been pushing boundries in many NLP disciplines. In particular, bert, a transformer model by google and pre-trained on more than 3,000 M words, obtained state-of-the-art results on many NLP tasks and Corpus.

The goal of this project is to use modern transformer models to tackle the problem of preposition sense disambiguation. Therefore, we trained a simple bert model on the SemEval 2007 dataset [2], a central benchmark dataset for this task. To the best of our knowledge, the best purposed model for disambiguating the meanings of prepositions on the SemEval achives an accuracy of up to 88% [3]. Neither more recent approaches surpass this frontier[4][5] . Our model achives an accuracy of 90.84%, out-performing the current state-of-the-art.

How to train

To meet our goals, we cleand the SemEval 2007 dataset to only contain the needed information. We have added it to the repository and can be found in ./data/training-data.tsv.

Train a bert model:
First, install the requirements.txt. Afterwards, you can train the bert-model by:

python3 trainer.py --batch-size 16 --learning-rate 1e-4 --epochs 4 --data-path "./data/training_data.tsv"

The chosen hyper-parameters in the above example are tuned and already set by default. After training, this will save the weights and config to a new folder ./model_save/. Feel free to omit this training-step and use our trained weights directly.

Examples

We attach an example tagger, which can be used in an interactive manner. python3 -i tagger.py

Sourrond the preposition, for which you like to know the meaning of, with <head>...</head> and feed it to the tagger:

>>> tagger.tag("I am <head>in</head> big trouble")
Predicted Meaning: Indicating a state/condition/form, often a mental/emotional one that is being experienced 

>>> tagger.tag("I am speaking <head>in</head> portuguese.")
Predicted Meaning: Indicating the language, medium, or means of encoding (e.g., spoke in German)

>>> tagger.tag("He is swimming <head>with</head> his hands.")
Predicted Meaning: Indicating the means or material used to perform an action or acting as the complement of similar participle adjectives (e.g., crammed with, coated with, covered with)

>>> tagger.tag("She blinked <head>with</head> confusion.")
Predicted Meaning: Because of / due to (the physical/mental presence of) (e.g., boiling with anger, shining with dew)

References

[1] Vaswani, Ashish et al. (2017). Attention is all you need. Advances in neural information processing systems. P. 5998--6008.

[2] Litkowski, Kenneth C and Hargraves, Orin (2007). SemEval-2007 Task 06: Word-sense disambiguation of prepositions. Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007). P. 24--29

[3] Litkowski, Ken. (2013). Preposition disambiguation: Still a problem. CL Research, Damascus, MD.

[4] Gonen, Hila and Goldberg, Yoav. (2016). Semi supervised preposition-sense disambiguation using multilingual data. Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. P. 2718--2729

[5] Gong, Hongyu and Mu, Jiaqi and Bhat, Suma and Viswanath, Pramod (2018). Preposition Sense Disambiguation and Representation. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. P. 1510--1521

Owner
Dirk Neuhäuser
Dirk Neuhäuser
FB ID CLONER WUTHOT CHECKPOINT, FACEBOOK ID CLONE FROM FILE

* MY SOCIAL MEDIA : Programming And Memes Want to contact Mr. Error ? CONTACT : [ema

Mr. Error 9 Jun 17, 2021
An open-source NLP library: fast text cleaning and preprocessing.

An open-source NLP library: fast text cleaning and preprocessing

Iaroslav 21 Mar 18, 2022
Chinese Grammatical Error Diagnosis

nlp-CGED Chinese Grammatical Error Diagnosis 中文语法纠错研究 基于序列标注的方法 所需环境 Python==3.6 tensorflow==1.14.0 keras==2.3.1 bert4keras==0.10.6 笔者使用了开源的bert4keras

12 Nov 25, 2022
Official code for Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset

Official code for our Interspeech 2021 - Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset [1]*. Visually-grounded spoken language datasets c

Ian Palmer 3 Jan 26, 2022
A simple Flask site that allows users to create, update, and delete posts in a database, as well as perform basic NLP tasks on the posts.

A simple Flask site that allows users to create, update, and delete posts in a database, as well as perform basic NLP tasks on the posts.

Ian 1 Jan 15, 2022
In this project, we compared Spanish BERT and Multilingual BERT in the Sentiment Analysis task.

Applying BERT Fine Tuning to Sentiment Classification on Amazon Reviews Abstract Sentiment analysis has made great progress in recent years, due to th

Alexander Leonardo Lique Lamas 5 Jan 03, 2022
LSTM model - IMDB review sentiment analysis

NLP - Movie review sentiment analysis The colab notebook contains the code for building a LSTM Recurrent Neural Network that gives 87-88% accuracy on

Sundeep Bhimireddy 1 Jan 29, 2022
A tool helps build a talk preview image by combining the given background image and talk event description

talk-preview-img-builder A tool helps build a talk preview image by combining the given background image and talk event description Installation and U

PyCon Taiwan 4 Aug 20, 2022
A 30000+ Chinese MRC dataset - Delta Reading Comprehension Dataset

Delta Reading Comprehension Dataset 台達閱讀理解資料集 Delta Reading Comprehension Dataset (DRCD) 屬於通用領域繁體中文機器閱讀理解資料集。 本資料集期望成為適用於遷移學習之標準中文閱讀理解資料集。 本資料集從2,108篇

272 Dec 15, 2022
Use Tensorflow2.7.0 Build OpenAI'GPT-2

TF2_GPT-2 Use Tensorflow2.7.0 Build OpenAI'GPT-2 使用最新tensorflow2.7.0构建openai官方的GPT-2 NLP模型 优点 使用无监督技术 拥有大量词汇量 可实现续写(堪比“xx梦续写”) 实现对话后续将应用于FloatTech的Bot

Watermelon 9 Sep 13, 2022
A Domain Specific Language (DSL) for building language patterns. These can be later compiled into spaCy patterns, pure regex, or any other format

RITA DSL This is a language, loosely based on language Apache UIMA RUTA, focused on writing manual language rules, which compiles into either spaCy co

Šarūnas Navickas 60 Sep 26, 2022
This converter will create the exact measure for your cappuccino recipe from the grandiose Rafaella Ballerini!

About CappuccinoJs This converter will create the exact measure for your cappuccino recipe from the grandiose Rafaella Ballerini! Este conversor criar

Arthur Ottoni Ribeiro 48 Nov 15, 2022
Statistics and Mathematics for Machine Learning, Deep Learning , Deep NLP

Stat4ML Statistics and Mathematics for Machine Learning, Deep Learning , Deep NLP This is the first course from our trio courses: Statistics Foundatio

Omid Safarzadeh 83 Dec 29, 2022
Chinese Pre-Trained Language Models (CPM-LM) Version-I

CPM-Generate 为了促进中文自然语言处理研究的发展,本项目提供了 CPM-LM (2.6B) 模型的文本生成代码,可用于文本生成的本地测试,并以此为基础进一步研究零次学习/少次学习等场景。[项目首页] [模型下载] [技术报告] 若您想使用CPM-1进行推理,我们建议使用高效推理工具BMI

Tsinghua AI 1.4k Jan 03, 2023
An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

CRNN paper:An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition 1. create your ow

Tsukinousag1 3 Apr 02, 2022
TruthfulQA: Measuring How Models Imitate Human Falsehoods

TruthfulQA: Measuring How Models Imitate Human Falsehoods

69 Dec 25, 2022
Code for paper: An Effective, Robust and Fairness-awareHate Speech Detection Framework

BiQQLSTM_HS Code and data for paper: Title: An Effective, Robust and Fairness-awareHate Speech Detection Framework. Authors: Guanyi Mou and Kyumin Lee

Guanyi Mou 2 Dec 27, 2022
Transformer training code for sequential tasks

Sequential Transformer This is a code for training Transformers on sequential tasks such as language modeling. Unlike the original Transformer archite

Meta Research 578 Dec 13, 2022
Generating new names based on trends in data using GPT2 (Transformer network)

MLOpsNameGenerator Overall Goal The goal of the project is to develop a model that is capable of creating Pokémon names based on its description, usin

Gustav Lang Moesmand 2 Jan 10, 2022
justCTF [*] 2020 challenges sources

justCTF [*] 2020 This repo contains sources for justCTF [*] 2020 challenges hosted by justCatTheFish. TLDR: Run a challenge with ./run.sh (requires Do

justCatTheFish 25 Dec 27, 2022