Hungarian Preverb Corpus

A gold standard corpus manually annotated with verb-preverb connections for Hungarian.

corpus

The corpus consists of the following 4 files:

filename                  # sentences   # preverbs
difficult_validate1.txt   310           357
difficult_validate2.txt   840           935
difficult_test.txt        327           376
general_test.txt          503           500

Preverbs in the general dataset follow the distribution found in ordinary Hungarian text. The difficult dataset is specially crafted: the most common and easiest-to-handle pattern, i.e. a verb directly followed by its preverb (e.g. megy ki 'go out'), is omitted. The validate files are for development/validation, the test files are for testing. Note that a general_validate dataset would not be useful, because the trivial pattern would make up the vast majority of cases and overwhelm the more interesting, less frequent patterns.

Accordingly, the emPreverb tool, which connects preverbs to their corresponding verbs, was developed using only the interesting, difficult examples, and was tested on both difficult and general data.

(Remark. The difficult_validate dataset is divided into two parts for historical reasons, but you can simply use them together: they contain a total of 1150 sentences and 1292 preverbs.)

corpus annotation guidelines

  • Preverb marked by a suffixed backslash followed by a (single digit!) ID number: meg\1.
  • Word from which the preverb was separated marked by a pipe followed by the same ID number: főzve|1.
  • Within the same line, different verb-preverb pairs must receive different ID numbers.
  • A preverb that does not belong to any word in the sentence (ellipsis etc.) is marked with a zero ID: "Hazakísérhetlek?" "Meg\0 hát." Any number of preverbs can have the 0 ID within the same line.
  • In the difficult dataset, a verb directly followed by its preverb is not annotated: főzte meg, but: főzte|1 volna meg\1.
  • In the general dataset, the first pattern is annotated as well: főzte|1 meg\1.
  • Normally there is a 1:1 correspondence between preverbs and verbs. However, there are exceptions, and these are annotated accordingly, e.g. Se ki\1, se be\1 nem lehetett menni|1 Budakesziről; át-\1 meg átjárták|1.
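
The marking scheme above can be read back with a few lines of Python. The following is only a minimal sketch, not part of emPreverb: the function name parse_line and the returned data layout are hypothetical, and the regular expression assumes that the mark and its single-digit ID directly follow the word, as in the examples above.

```python
import re

# A word (letters, digits, hyphen) followed by a mark and a single-digit ID.
# Backslash marks a preverb (meg\1), pipe marks the verb (főzve|1).
TOKEN = re.compile(r"(?P<word>[\w-]+)(?P<mark>[\\|])(?P<id>\d)")


def parse_line(line):
    """Return (links, orphans) for one annotated corpus line.

    links maps each non-zero ID to its preverbs and verbs;
    orphans lists preverbs carrying the zero ID (no verb in the sentence).
    """
    links = {}
    orphans = []
    for m in TOKEN.finditer(line):
        word = m.group("word")
        idx = int(m.group("id"))
        if m.group("mark") == "\\":
            if idx == 0:
                orphans.append(word)
                continue
            key = "preverbs"
        else:
            key = "verbs"
        links.setdefault(idx, {"preverbs": [], "verbs": []})[key].append(word)
    return links, orphans


# Usage: a raw string keeps the backslash of the annotation intact.
links, orphans = parse_line(r"főzte|1 volna meg\1")
# links == {1: {"preverbs": ["meg"], "verbs": ["főzte"]}}, orphans == []
```

Keeping lists on both sides of a link lets the same structure hold the non-1:1 cases, such as two preverbs sharing one verb.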

Check (see Steps 1 to 4 in evaluate.ipynb) whether tokens annotated as separated preverbs are also analysed as preverbs by e-magyar morph,pos. If not (e.g. if the preverb meg is tagged by emtsv as [/Conj]), remove the annotation from the dataset, or the whole item if no annotation is left. emPreverb will necessarily fail on such tokens because of the incorrect emtsv analysis, and these failures are extraneous to evaluating its performance. Exception: person-inflected preverb-like postpositions such as in utánam\1 dobják|1, which are tagged by emtsv as [/Post], and case-inflected personal pronouns such as in hozzá\1 voltam szokva|1, which are tagged as [/N|Pro], should not be removed from the dataset, since emPreverb should be able to handle these.

If a token is annotated as the verb stem counterpart of a separated preverb but is not tagged by emtsv as a verb, check whether the preverb annotation is correct. If it is, do not remove the annotation from the dataset: emPreverb is supposed to be able to connect such separated preverbs.
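
The filtering rule for preverb annotations can be sketched as follows. This is an illustration under stated assumptions, not the code in evaluate.ipynb: the function names are hypothetical, "[/Prev]" is assumed to be the tag emtsv gives to preverbs, and matching by tag prefix is a simplification that keeps every [/Post] and [/N|Pro] token rather than only the preverb-like ones described above.

```python
# Assumption: emtsv marks preverbs with a tag starting with "[/Prev]".
PREVERB_TAG = "[/Prev]"
# Exceptions named in the text: inflected postpositions and personal pronouns.
EXCEPTION_TAGS = ("[/Post]", "[/N|Pro]")


def keep_annotation(tag):
    """True if a separated-preverb annotation with this tag stays in the dataset."""
    return tag.startswith(PREVERB_TAG) or tag.startswith(EXCEPTION_TAGS)


def filter_item(preverbs):
    """Filter the (token, tag) pairs of the annotated preverbs of one item.

    Returns the surviving pairs, or None if no annotation is left
    and the whole item should be dropped.
    """
    kept = [(token, tag) for token, tag in preverbs if keep_annotation(tag)]
    return kept or None


# Usage: meg mis-tagged as a conjunction is dropped, ki is kept.
result = filter_item([("meg", "[/Conj]"), ("ki", "[/Prev]")])
```

Annotated verb tokens are deliberately not filtered here, since a verb counterpart that emtsv does not tag as a verb stays in the dataset.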

evaluation

An environment for reproducing the evaluation of emPreverb as published in the paper cited below.

git clone https://github.com/ril-lexknowrep/emPreverb
cd emPreverb
make evaluate

Note that make evaluate clones this repository inside emPreverb and runs the evaluation.

The results are written to general_test_results.txt and difficult_test_results.txt. They should match exactly the results in Table 3 of the paper cited below.

development

An environment used for developing emPreverb. It is "for us", but if you insist on using it:

git clone https://github.com/ril-lexknowrep/emPreverb
cd emPreverb
git clone https://github.com/ril-lexknowrep/hungarian-preverb-corpus
cd hungarian-preverb-corpus/development
jupyter notebook evaluate.ipynb

(Remark. Yes, please clone this repo inside emPreverb.)

citation

If you use the corpus, please cite the following paper.

Pethő, Gergely and Sass, Bálint and Kalivoda, Ágnes and Simon, László and Lipp, Veronika: Igekötő-kapcsolás. In: MSZNY 2022.

Owner
RIL Lexical Knowledge Representation Research Group