PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English

Related tags

Deep Learningpastrie
Overview

PASTRIE

CC BY-SA 4.0

Official release of the corpus described in the paper:

Michael Kranzlein, Emma Manning, Siyao Peng, Shira Wein, Aryaman Arora, and Nathan Schneider (2020). PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English [link]. Proceedings of the 14th Linguistic Annotation Workshop.


Overview

PASTRIE is a corpus of English data from Reddit annotated with preposition supersenses from the SNACS inventory.

While the data in PASTRIE is in English, it was produced by presumed speakers of four L1s:

  • English
  • French
  • German
  • Spanish

For details on how L1s were identified, see section 3.1 of Rabinovich et al. (2018).

Annotation Example

Below is an example sentence from the corpus, where annotation targets are bolded and preposition supersenses are annotated with the notation SceneRole↝Function. Together, a scene role and function are known as a construal.


Data Formats

PASTRIE is released in the following formats. We expect that most projects will be best served by one of the JSON formats.

  • .conllulex: the 19-column CoNLL-U-Lex format originally used for STREUSLE.
  • .json: a JSON representation of the CoNLL-U-Lex that does not require a CoNLL-U-Lex parser.
  • .govobj.json: an extended version of the JSON representation that contains information about each preposition's syntactic parent and object.

PASTRIE mostly follows STREUSLE with respect to the data format and SNACS annotation practice. Primary differences in the annotations are:

  • Lemmas, part-of-speech tags, and syntactic dependencies aim to follow the UD standard in both cases. They are gold in STREUSLE, versus automatic with some manual corrections in PASTRIE.
    • PASTRIE does not group together base+clitic combinations, whereas STREUSLE does (multiword tokens—where a single orthographic word contains multiple syntactic words).
    • PASTRIE does not regularly specify SpaceAfter=No to indicate alignment between the tokens and the raw text.
    • In PASTRIE, the raw text string accompanying the sentence may contain two or more consecutive spaces.
    • PASTRIE lacks enhanced dependencies.
  • Multiword expression annotations in PASTRIE are limited to expressions containing a preposition. Depending on the syntactic head, the expression may or may not have a SNACS supersense.
    • Verbal multiword expressions in PASTRIE are not subtyped in the lexcat; they all have a lexcat of V.
  • Noun and verb expressions in PASTRIE do not have supersense labels.
Comments
  • Misc. annotation errors and/or conversion script bugs

    Misc. annotation errors and/or conversion script bugs

    There are some annotations which I'm fairly sure are incorrect and are choking up the JSON conversion script. (These errors occur using the unmodified versions of all scripts taken straight from STRUESLE.) One or two might also be indicative of a bug in the conllulex2json.py file.

    1. vs mistagged as a noun--should be prep

    AssertionError: ('french-fad32caf-e595-e3cb-07bf-aaea891e53cb-02', {'lexlemma': 'versus', 'lexcat': 'CCONJ', 'ss': 'c', 'ss2': 'c', 'toknums': [3]}, {'#': 3, 'word': 'vs', 'lemma': 'versus', 'upos': 'NOUN', 'xpos': 'NN', 'feats': None, 'head': 8, 'deprel': 'nsubj', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-CCONJ-`c'})

    1. ditto

    AssertionError: ('french-fad32caf-e595-e3cb-07bf-aaea891e53cb-02', {'lexlemma': 'versus', 'lexcat': 'CCONJ', 'ss': 'c', 'ss2': 'c', 'toknums': [3]}, {'#': 3, 'word': 'vs', 'lemma': 'versus', 'upos': 'NOUN', 'xpos': 'NN', 'feats': None, 'head': 8, 'deprel': 'nsubj', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-CCONJ-`c'})

    1. Script complains about "to" in this snippet at ID=23. Not immediately clear to me what the issue is--perhaps that "to" is labeled ADP/IN? For its xpos I think it ought to be TO, not sure about its upos. Snippet:
    13      shit    shit    NOUN    NN      _       16      obl:npmod       _       _       _       _       _       _       _       _       _       _       _
    14      this    this    PRON    DT      _       16      nsubj   _       _       _       _       _       _       _       _       _       _       _
    15      can     can     AUX     MD      _       16      aux     _       _       _       _       _       _       _       _       _       _       _
    16      end     end     VERB    VB      _       4       parataxis       _       _       _       _       _       _       _       _       _       _       _
    17      right   right   ADV     RB      _       18      advmod  _       _       _       _       _       _       _       _       _       _       _
    18      now     now     ADV     RB      _       16      advmod  _       _       _       _       _       _       _       _       _       _       _
    19      if      if      SCONJ   IN      _       21      mark    _       _       _       _       _       _       _       _       _       _       _
    20      I       I       PRON    PRP     _       21      nsubj   _       _       _       _       _       _       _       _       _       _       _
    21      want    want    VERB    VBP     _       16      advcl   _       _       _       _       _       _       _       _       _       _       _
    22      it      it      PRON    PRP     _       21      obj     _       _       _       _       _       _       _       _       _       _       _
    23      to      to      ADP     IN      _       21      obl     _       _       _       _       _       `i      `i      _       _       _       _
    24      .       .       PUNCT   .       _       4       punct   _       _       _       _       _       _       _       _       _       _       _
    

    Error:

    AssertionError: ('french-a17a4340-f9c0-8fef-fa1b-1bf13879399b-02', {'lexlemma': 'to', 'lexcat': 'INF', 'ss': 'i', 'ss2': 'i', 'toknums': [23]}, {'#': 23, 'word': 'to', 'lemma': 'to', 'upos': 'ADP', 'xpos': 'IN', 'feats': None, 'head': 21, 'deprel': 'obl', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-INF-`i'})

    Relevant span of code:

                if validate_pos and upos!=lc and (upos,lc) not in {('NOUN','N'),('PROPN','N'),('VERB','V'),
                    ('ADP','P'),('ADV','P'),('SCONJ','P'),
                    ('ADP','DISC'),('ADV','DISC'),('SCONJ','DISC'),
                    ('PART','POSS')}:
                    # most often, the single-word lexcat should match its upos
                    # check a list of exceptions
                    mismatchOK = False
                    if xpos=='TO' and lc.startswith('INF'):
                        mismatchOK = True
                    elif (xpos=='TO')!=lc.startswith('INF'):
                        assert upos in ['SCONJ', "ADP"] and swe['lexlemma']=='for',(sent['sent_id'],swe,tok)
                        mismatchOK = True
    
    1. Originator as function:

    (in french-c02823ec-60bd-adce-7327-01337eb9d1c8-02) AssertionError: ('p.Originator should never be function', {'lexlemma': 'you', 'lexcat': 'PRON.POSS', 'ss': 'p.Originator', 'ss2': 'p.Originator', 'toknums': [1]})

    1. lexcat DISC with ADJ:

    AssertionError: In spanish-a25e8289-e04a-f5af-ce56-ead9faca65b1-02, single-word expression 'like' has lexcat DISC, which is incompatible with its upos ADJ

    1. "her" tagged with Possessor is incorrectly parsed as iobj and tagged as PRP instead of PRP$. Relevant snippet:
    1       My      my      PRON    PRP$    _       2       nmod:poss       _       _       _       _       _       SocialRel       Gestalt _       _       _       _
    2       grandma grandma NOUN    NN      _       3       nsubj   _       _       _       _       _       _       _       _       _       _       _
    3       had     have    VERB    VBD     _       0       root    _       _       _       _       _       _       _       _       _       _       _
    4       her     she     PRON    PRP     _       3       iobj    _       _       _       _       _       Possessor       Possessor       _       _       _       _
    5       super   super   ADV     RB      _       6       advmod  _       _       _       _       _       _       _       _       _       _       _
    6       thick   thick   ADJ     JJ      _       8       amod    _       _       _       _       _       _       _       _       _       _       _
    7       floor   floor   NOUN    NN      _       8       compound        _       _       _       _       _       _       _       _       _       _       _
    8       mats    mat     NOUN    NNS     _       3       obj     _       _       _       _       _       _       _       _       _       _       _
    9       *       *       PUNCT   NFP     _       8       punct   _       _       _       _       _       _       _       _       _       _       _
    10      over    over    ADP     IN      _       13      case    _       _       _       _       _       Locus   Locus   _       _       _       _
    11      *       *       PUNCT   NFP     _       13      punct   _       _       _       _       _       _       _       _       _       _       _
    12      the     the     DET     DT      _       13      det     _       _       _       _       _       _       _       _       _       _       _
    13      accelerator     accelerator     NOUN    NN      _       3       obl     _       _       _       _       _       _       _       _       _       _       _
    14      ,       ,       PUNCT   ,       _       3       punct   _       _       _       _       _       _       _       _       _       _       _
    

    Error:

    AssertionError: In spanish-ebba3c73-2431-c216-8f4d-d469ee8d5564-01, single-word expression 'her' has lexcat P, which is incompatible with its upos PRON

    1. "NA" is misannotated--this is NA as in North America, i.e. a PROPN/NP, but it's lemmatized as "no", and its tags are weird.

    AssertionError: ('german-35000895-1d78-c18a-01ed-f7410b9c0581-01', {'lexlemma': 'no', 'lexcat': 'ADV', 'ss': None, 'ss2': None, 'toknums': [5]}, {'#': 5, 'word': 'NA', 'lemma': 'no', 'upos': 'PART', 'xpos': 'TO', 'feats': None, 'head': 6, 'deprel': 'mark', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-ADV'})

    opened by lgessler 6
  • Prepositional supersense annotations on non-preposition targets

    Prepositional supersense annotations on non-preposition targets

    Is it OK for a verb-headed SMWE to have a prepositional supersense? The validator complains about it. Offending SMWE:

    21	give	give	VERB	VB	_	10	conj	_	_	2:1	_	give up on	p.Theme	p.Theme	_	_	_	_
    22	up	up	ADP	RP	_	21	compound:prt	_	_	2:2	_	_	_	_	_	_	_	_
    23	on	on	ADP	IN	_	24	case	_	_	2:3	_	_	_	_	_	_	_	_
    
    opened by lgessler 5
  • Prepositions unannotated for supersense

    Prepositions unannotated for supersense

    Token 6:

    # sent_id = french-f57dd6ab-5263-4c8a-e360-8ec683e6a37a-02
    # text = Once you have the hang of it it s pretty fast ( and does n't eat your clutch ) .
    1	Once	once	SCONJ	IN	_	3	mark	_	_	_	_	_	_	_	_	_	_	_
    2	you	you	PRON	PRP	_	3	nsubj	_	_	_	_	_	_	_	_	_	_	_
    3	have	have	VERB	VBP	_	11	advcl	_	_	_	_	_	_	_	_	_	_	_
    4	the	the	DET	DT	_	5	det	_	_	_	_	_	_	_	_	_	_	_
    5	hang	hang	NOUN	NN	_	3	obj	_	_	_	_	_	_	_	_	_	_	_
    6	of	of	ADP	IN	_	7	case	_	_	_	_	_	_	_	_	_	_	_
    7	it	it	PRON	PRP	_	5	nmod	_	_	_	_	_	_	_	_	_	_	_
    8	it	it	PRON	PRP	_	11	nsubj	_	_	_	_	_	_	_	_	_	_	_
    9	s	be	AUX	VBZ	_	11	cop	_	_	_	_	_	_	_	_	_	_	_
    10	pretty	pretty	ADV	RB	_	11	advmod	_	_	_	_	_	_	_	_	_	_	_
    11	fast	fast	ADJ	JJ	_	0	root	_	_	_	_	_	_	_	_	_	_	_
    12	(	(	PUNCT	-LRB-	_	16	punct	_	_	_	_	_	_	_	_	_	_	_
    13	and	and	CCONJ	CC	_	16	cc	_	_	_	_	_	_	_	_	_	_	_
    14	does	do	AUX	VBZ	_	16	aux	_	_	_	_	_	_	_	_	_	_	_
    15	n't	not	PART	RB	_	16	advmod	_	_	_	_	_	_	_	_	_	_	_
    16	eat	eat	VERB	VB	_	11	conj	_	_	_	_	_	_	_	_	_	_	_
    17	your	you	PRON	PRP$	_	18	nmod:poss	_	_	_	_	_	Possessor	Possessor	_	_	_	_
    18	clutch	clutch	NOUN	NN	_	16	obj	_	_	_	_	_	_	_	_	_	_	_
    19	)	)	PUNCT	-RRB-	_	11	punct	_	_	_	_	_	_	_	_	_	_	_
    20	.	.	PUNCT	.	_	11	punct	_	_	_	_	_	_	_	_	_	_	_
    

    I assumed that all preps were supposed to be annotated, but perhaps not?

    opened by lgessler 3
  • Apostrophes removed in preprocessing?

    Apostrophes removed in preprocessing?

    Looking through the data, there are a LOT of sentences where clitics are tokenized off but lack an apostrophe. Is that just the genre or did they get lost in preprocessing?

    opened by nschneid 2
  • Dataset requested

    Dataset requested

    Hi all,

    I would like to request the PASTRIE dataset accompanying the paper "PASTRIE: A Corpus of Prepositions Annotated with Supsersense Tags in Reddit International English".

    Thanks for reply.

    opened by fj-morales 2
  • SNACS supersense tags should start with

    SNACS supersense tags should start with "p."

    For compatibility with STREUSLE, it should be p.Locus, p.Theme, etc.

    Special labels like `i `d `c `$ ?? should not start with p.. In fact, the backtick labels from annotation are not represented as such in STREUSLE—they are reflected in the LEXCAT column of the data.

    opened by nschneid 0
  • Questionable adpositional MWEs

    Questionable adpositional MWEs

    • in_male_term — from "in male terms"; should be in_term (at most)
    • in_the_first_place
    • in_my_hand — from "in my hands"; should be in_hand (at most)
    • for_quite_some_time — just Duration for, weak MWE?
    • at_all_time — from what should have been "at all times". OK?
    • on_a_smaller_scale — omit adjective?
    • withouth — typo
    • see_as — "seeing as" (deverbal MWE acting like a preposition)
    opened by nschneid 0
  • Some undersegmentation of sentences

    Some undersegmentation of sentences

    Despite manual editing there are still places where a long sentence ought to be split up (esp. where it consists of a blockquoted sentence with > followed by a response). Looking for multiple consecutive spaces in the raw text uncovers some of these (as well as some discourse appendages like emoticons, which should probably remain in the same UD sentence).

    It would be nice to write a script to help clean these up—the tricky part is updating offsets in each parse.

    opened by nschneid 0
Releases(v2.0.1)
  • v2.0.1(Nov 21, 2021)

  • v2.0(Oct 22, 2021)

    • Switch to full .conllulex format following STREUSLE
      • add lexcats (#3), morphological features, newdoc directives
    • Scripts for validation and format conversion
    • Clean up various annotation issues, including:
      • restore apostrophes and fixing other conversion problems (#6, #9)
      • include pretokenized raw text (#12)
    Source code(tar.gz)
    Source code(zip)
  • v1.0.1(Dec 14, 2020)

    • Added .json file format
    • Switched lemmatization and pos tagging from StanfordNLP 0.2.0 to Stanza 1.1.1
    • Corrected rare encoding issue from v1.0
    Source code(tar.gz)
    Source code(zip)
Owner
NERT @ Georgetown
NERT @ Georgetown
NeWT: Natural World Tasks

NeWT: Natural World Tasks This repository contains resources for working with the NeWT dataset. ❗ At this time the binary tasks are not publicly avail

Visipedia 26 Oct 18, 2022
an implementation of Video Frame Interpolation via Adaptive Separable Convolution using PyTorch

This work has now been superseded by: https://github.com/sniklaus/revisiting-sepconv sepconv-slomo This is a reference implementation of Video Frame I

Simon Niklaus 985 Jan 08, 2023
PAMI stands for PAttern MIning. It constitutes several pattern mining algorithms to discover interesting patterns in transactional/temporal/spatiotemporal databases

Introduction PAMI stands for PAttern MIning. It constitutes several pattern mining algorithms to discover interesting patterns in transactional/tempor

RAGE UDAY KIRAN 43 Jan 08, 2023
一个多模态内容理解算法框架,其中包含数据处理、预训练模型、常见模型以及模型加速等模块。

Overview 架构设计 插件介绍 安装使用 框架简介 方便使用,支持多模态,多任务的统一训练框架 能力列表: bert + 分类任务 自定义任务训练(插件注册) 框架设计 框架采用分层的思想组织模型训练流程。 DATA 层负责读取用户数据,根据 field 管理数据。 Parser 层负责转换原

Tencent 265 Dec 22, 2022
GNEE - GAT Neural Event Embeddings

GNEE - GAT Neural Event Embeddings This repository contains source code for the GNEE (GAT Neural Event Embeddings) method introduced in the paper: "Se

João Pedro Rodrigues Mattos 0 Sep 15, 2021
Simple and Robust Loss Design for Multi-Label Learning with Missing Labels

Simple and Robust Loss Design for Multi-Label Learning with Missing Labels Official PyTorch Implementation of the paper Simple and Robust Loss Design

Xinyu Huang 28 Oct 27, 2022
Reverse engineer your pytorch vision models, in style

🔍 Rover Reverse engineer your CNNs, in style Rover will help you break down your CNN and visualize the features from within the model. No need to wri

Mayukh Deb 32 Sep 24, 2022
Compare outputs between layers written in Tensorflow and layers written in Pytorch

Compare outputs of Wasserstein GANs between TensorFlow vs Pytorch This is our testing module for the implementation of improved WGAN in Pytorch Prereq

Hung Nguyen 72 Dec 20, 2022
Scene-Text-Detection-and-Recognition (Pytorch)

Scene-Text-Detection-and-Recognition (Pytorch) Competition URL: https://tbrain.t

Gi-Luen Huang 9 Jan 02, 2023
FluxTraining.jl gives you an endlessly extensible training loop for deep learning

A flexible neural net training library inspired by fast.ai

86 Dec 31, 2022
DeepGNN is a framework for training machine learning models on large scale graph data.

DeepGNN Overview DeepGNN is a framework for training machine learning models on large scale graph data. DeepGNN contains all the necessary features in

Microsoft 45 Jan 01, 2023
NL-Augmenter 🦎 → 🐍 A Collaborative Repository of Natural Language Transformations

NL-Augmenter 🦎 → 🐍 The NL-Augmenter is a collaborative effort intended to add transformations of datasets dealing with natural language. Transformat

684 Jan 09, 2023
Open source hardware and software platform to build a small scale self driving car.

Donkeycar is minimalist and modular self driving library for Python. It is developed for hobbyists and students with a focus on allowing fast experimentation and easy community contributions.

Autorope 2.4k Jan 04, 2023
Official implementation for paper Render In-between: Motion Guided Video Synthesis for Action Interpolation

Render In-between: Motion Guided Video Synthesis for Action Interpolation [Paper] [Supp] [arXiv] [4min Video] This is the official Pytorch implementat

8 Oct 27, 2022
Tensorforce: a TensorFlow library for applied reinforcement learning

Tensorforce: a TensorFlow library for applied reinforcement learning Introduction Tensorforce is an open-source deep reinforcement learning framework,

Tensorforce 3.2k Jan 02, 2023
Self-Supervised Image Denoising via Iterative Data Refinement

Self-Supervised Image Denoising via Iterative Data Refinement Yi Zhang1, Dasong Li1, Ka Lung Law2, Xiaogang Wang1, Hongwei Qin2, Hongsheng Li1 1CUHK-S

Zhang Yi 72 Jan 01, 2023
WTTE-RNN a framework for churn and time to event prediction

WTTE-RNN Weibull Time To Event Recurrent Neural Network A less hacky machine-learning framework for churn- and time to event prediction. Forecasting p

Egil Martinsson 727 Dec 28, 2022
World Models with TensorFlow 2

World Models This repo reproduces the original implementation of World Models. This implementation uses TensorFlow 2.2. Docker The easiest way to hand

Zac Wellmer 234 Nov 30, 2022
Code for the paper: Sketch Your Own GAN

Sketch Your Own GAN Project | Paper | Youtube Our method takes in one or a few hand-drawn sketches and customizes an off-the-shelf GAN to match the in

677 Dec 28, 2022
Train Scene Graph Generation for Visual Genome and GQA in PyTorch >= 1.2 with improved zero and few-shot generalization.

Scene Graph Generation Object Detections Ground truth Scene Graph Generated Scene Graph In this visualization, woman sitting on rock is a zero-shot tr

Boris Knyazev 93 Dec 28, 2022