PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English

Related tags

Deep Learningpastrie
Overview

PASTRIE

CC BY-SA 4.0

Official release of the corpus described in the paper:

Michael Kranzlein, Emma Manning, Siyao Peng, Shira Wein, Aryaman Arora, and Nathan Schneider (2020). PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English [link]. Proceedings of the 14th Linguistic Annotation Workshop.


Overview

PASTRIE is a corpus of English data from Reddit annotated with preposition supersenses from the SNACS inventory.

While the data in PASTRIE is in English, it was produced by presumed speakers of four L1s:

  • English
  • French
  • German
  • Spanish

For details on how L1s were identified, see section 3.1 of Rabinovich et al. (2018).

Annotation Example

Below is an example sentence from the corpus, where annotation targets are bolded and preposition supersenses are annotated with the notation SceneRole↝Function. Together, a scene role and function are known as a construal.


Data Formats

PASTRIE is released in the following formats. We expect that most projects will be best served by one of the JSON formats.

  • .conllulex: the 19-column CoNLL-U-Lex format originally used for STREUSLE.
  • .json: a JSON representation of the CoNLL-U-Lex that does not require a CoNLL-U-Lex parser.
  • .govobj.json: an extended version of the JSON representation that contains information about each preposition's syntactic parent and object.

PASTRIE mostly follows STREUSLE with respect to the data format and SNACS annotation practice. Primary differences in the annotations are:

  • Lemmas, part-of-speech tags, and syntactic dependencies aim to follow the UD standard in both cases. They are gold in STREUSLE, versus automatic with some manual corrections in PASTRIE.
    • PASTRIE does not group together base+clitic combinations, whereas STREUSLE does (multiword tokens—where a single orthographic word contains multiple syntactic words).
    • PASTRIE does not regularly specify SpaceAfter=No to indicate alignment between the tokens and the raw text.
    • In PASTRIE, the raw text string accompanying the sentence may contain two or more consecutive spaces.
    • PASTRIE lacks enhanced dependencies.
  • Multiword expression annotations in PASTRIE are limited to expressions containing a preposition. Depending on the syntactic head, the expression may or may not have a SNACS supersense.
    • Verbal multiword expressions in PASTRIE are not subtyped in the lexcat; they all have a lexcat of V.
  • Noun and verb expressions in PASTRIE do not have supersense labels.
Comments
  • Misc. annotation errors and/or conversion script bugs

    Misc. annotation errors and/or conversion script bugs

    There are some annotations which I'm fairly sure are incorrect and are choking up the JSON conversion script. (These errors occur using the unmodified versions of all scripts taken straight from STRUESLE.) One or two might also be indicative of a bug in the conllulex2json.py file.

    1. vs mistagged as a noun--should be prep

    AssertionError: ('french-fad32caf-e595-e3cb-07bf-aaea891e53cb-02', {'lexlemma': 'versus', 'lexcat': 'CCONJ', 'ss': 'c', 'ss2': 'c', 'toknums': [3]}, {'#': 3, 'word': 'vs', 'lemma': 'versus', 'upos': 'NOUN', 'xpos': 'NN', 'feats': None, 'head': 8, 'deprel': 'nsubj', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-CCONJ-`c'})

    1. ditto

    AssertionError: ('french-fad32caf-e595-e3cb-07bf-aaea891e53cb-02', {'lexlemma': 'versus', 'lexcat': 'CCONJ', 'ss': 'c', 'ss2': 'c', 'toknums': [3]}, {'#': 3, 'word': 'vs', 'lemma': 'versus', 'upos': 'NOUN', 'xpos': 'NN', 'feats': None, 'head': 8, 'deprel': 'nsubj', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-CCONJ-`c'})

    1. Script complains about "to" in this snippet at ID=23. Not immediately clear to me what the issue is--perhaps that "to" is labeled ADP/IN? For its xpos I think it ought to be TO, not sure about its upos. Snippet:
    13      shit    shit    NOUN    NN      _       16      obl:npmod       _       _       _       _       _       _       _       _       _       _       _
    14      this    this    PRON    DT      _       16      nsubj   _       _       _       _       _       _       _       _       _       _       _
    15      can     can     AUX     MD      _       16      aux     _       _       _       _       _       _       _       _       _       _       _
    16      end     end     VERB    VB      _       4       parataxis       _       _       _       _       _       _       _       _       _       _       _
    17      right   right   ADV     RB      _       18      advmod  _       _       _       _       _       _       _       _       _       _       _
    18      now     now     ADV     RB      _       16      advmod  _       _       _       _       _       _       _       _       _       _       _
    19      if      if      SCONJ   IN      _       21      mark    _       _       _       _       _       _       _       _       _       _       _
    20      I       I       PRON    PRP     _       21      nsubj   _       _       _       _       _       _       _       _       _       _       _
    21      want    want    VERB    VBP     _       16      advcl   _       _       _       _       _       _       _       _       _       _       _
    22      it      it      PRON    PRP     _       21      obj     _       _       _       _       _       _       _       _       _       _       _
    23      to      to      ADP     IN      _       21      obl     _       _       _       _       _       `i      `i      _       _       _       _
    24      .       .       PUNCT   .       _       4       punct   _       _       _       _       _       _       _       _       _       _       _
    

    Error:

    AssertionError: ('french-a17a4340-f9c0-8fef-fa1b-1bf13879399b-02', {'lexlemma': 'to', 'lexcat': 'INF', 'ss': 'i', 'ss2': 'i', 'toknums': [23]}, {'#': 23, 'word': 'to', 'lemma': 'to', 'upos': 'ADP', 'xpos': 'IN', 'feats': None, 'head': 21, 'deprel': 'obl', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-INF-`i'})

    Relevant span of code:

                if validate_pos and upos!=lc and (upos,lc) not in {('NOUN','N'),('PROPN','N'),('VERB','V'),
                    ('ADP','P'),('ADV','P'),('SCONJ','P'),
                    ('ADP','DISC'),('ADV','DISC'),('SCONJ','DISC'),
                    ('PART','POSS')}:
                    # most often, the single-word lexcat should match its upos
                    # check a list of exceptions
                    mismatchOK = False
                    if xpos=='TO' and lc.startswith('INF'):
                        mismatchOK = True
                    elif (xpos=='TO')!=lc.startswith('INF'):
                        assert upos in ['SCONJ', "ADP"] and swe['lexlemma']=='for',(sent['sent_id'],swe,tok)
                        mismatchOK = True
    
    1. Originator as function:

    (in french-c02823ec-60bd-adce-7327-01337eb9d1c8-02) AssertionError: ('p.Originator should never be function', {'lexlemma': 'you', 'lexcat': 'PRON.POSS', 'ss': 'p.Originator', 'ss2': 'p.Originator', 'toknums': [1]})

    1. lexcat DISC with ADJ:

    AssertionError: In spanish-a25e8289-e04a-f5af-ce56-ead9faca65b1-02, single-word expression 'like' has lexcat DISC, which is incompatible with its upos ADJ

    1. "her" tagged with Possessor is incorrectly parsed as iobj and tagged as PRP instead of PRP$. Relevant snippet:
    1       My      my      PRON    PRP$    _       2       nmod:poss       _       _       _       _       _       SocialRel       Gestalt _       _       _       _
    2       grandma grandma NOUN    NN      _       3       nsubj   _       _       _       _       _       _       _       _       _       _       _
    3       had     have    VERB    VBD     _       0       root    _       _       _       _       _       _       _       _       _       _       _
    4       her     she     PRON    PRP     _       3       iobj    _       _       _       _       _       Possessor       Possessor       _       _       _       _
    5       super   super   ADV     RB      _       6       advmod  _       _       _       _       _       _       _       _       _       _       _
    6       thick   thick   ADJ     JJ      _       8       amod    _       _       _       _       _       _       _       _       _       _       _
    7       floor   floor   NOUN    NN      _       8       compound        _       _       _       _       _       _       _       _       _       _       _
    8       mats    mat     NOUN    NNS     _       3       obj     _       _       _       _       _       _       _       _       _       _       _
    9       *       *       PUNCT   NFP     _       8       punct   _       _       _       _       _       _       _       _       _       _       _
    10      over    over    ADP     IN      _       13      case    _       _       _       _       _       Locus   Locus   _       _       _       _
    11      *       *       PUNCT   NFP     _       13      punct   _       _       _       _       _       _       _       _       _       _       _
    12      the     the     DET     DT      _       13      det     _       _       _       _       _       _       _       _       _       _       _
    13      accelerator     accelerator     NOUN    NN      _       3       obl     _       _       _       _       _       _       _       _       _       _       _
    14      ,       ,       PUNCT   ,       _       3       punct   _       _       _       _       _       _       _       _       _       _       _
    

    Error:

    AssertionError: In spanish-ebba3c73-2431-c216-8f4d-d469ee8d5564-01, single-word expression 'her' has lexcat P, which is incompatible with its upos PRON

    1. "NA" is misannotated--this is NA as in North America, i.e. a PROPN/NP, but it's lemmatized as "no", and its tags are weird.

    AssertionError: ('german-35000895-1d78-c18a-01ed-f7410b9c0581-01', {'lexlemma': 'no', 'lexcat': 'ADV', 'ss': None, 'ss2': None, 'toknums': [5]}, {'#': 5, 'word': 'NA', 'lemma': 'no', 'upos': 'PART', 'xpos': 'TO', 'feats': None, 'head': 6, 'deprel': 'mark', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-ADV'})

    opened by lgessler 6
  • Prepositional supersense annotations on non-preposition targets

    Prepositional supersense annotations on non-preposition targets

    Is it OK for a verb-headed SMWE to have a prepositional supersense? The validator complains about it. Offending SMWE:

    21	give	give	VERB	VB	_	10	conj	_	_	2:1	_	give up on	p.Theme	p.Theme	_	_	_	_
    22	up	up	ADP	RP	_	21	compound:prt	_	_	2:2	_	_	_	_	_	_	_	_
    23	on	on	ADP	IN	_	24	case	_	_	2:3	_	_	_	_	_	_	_	_
    
    opened by lgessler 5
  • Prepositions unannotated for supersense

    Prepositions unannotated for supersense

    Token 6:

    # sent_id = french-f57dd6ab-5263-4c8a-e360-8ec683e6a37a-02
    # text = Once you have the hang of it it s pretty fast ( and does n't eat your clutch ) .
    1	Once	once	SCONJ	IN	_	3	mark	_	_	_	_	_	_	_	_	_	_	_
    2	you	you	PRON	PRP	_	3	nsubj	_	_	_	_	_	_	_	_	_	_	_
    3	have	have	VERB	VBP	_	11	advcl	_	_	_	_	_	_	_	_	_	_	_
    4	the	the	DET	DT	_	5	det	_	_	_	_	_	_	_	_	_	_	_
    5	hang	hang	NOUN	NN	_	3	obj	_	_	_	_	_	_	_	_	_	_	_
    6	of	of	ADP	IN	_	7	case	_	_	_	_	_	_	_	_	_	_	_
    7	it	it	PRON	PRP	_	5	nmod	_	_	_	_	_	_	_	_	_	_	_
    8	it	it	PRON	PRP	_	11	nsubj	_	_	_	_	_	_	_	_	_	_	_
    9	s	be	AUX	VBZ	_	11	cop	_	_	_	_	_	_	_	_	_	_	_
    10	pretty	pretty	ADV	RB	_	11	advmod	_	_	_	_	_	_	_	_	_	_	_
    11	fast	fast	ADJ	JJ	_	0	root	_	_	_	_	_	_	_	_	_	_	_
    12	(	(	PUNCT	-LRB-	_	16	punct	_	_	_	_	_	_	_	_	_	_	_
    13	and	and	CCONJ	CC	_	16	cc	_	_	_	_	_	_	_	_	_	_	_
    14	does	do	AUX	VBZ	_	16	aux	_	_	_	_	_	_	_	_	_	_	_
    15	n't	not	PART	RB	_	16	advmod	_	_	_	_	_	_	_	_	_	_	_
    16	eat	eat	VERB	VB	_	11	conj	_	_	_	_	_	_	_	_	_	_	_
    17	your	you	PRON	PRP$	_	18	nmod:poss	_	_	_	_	_	Possessor	Possessor	_	_	_	_
    18	clutch	clutch	NOUN	NN	_	16	obj	_	_	_	_	_	_	_	_	_	_	_
    19	)	)	PUNCT	-RRB-	_	11	punct	_	_	_	_	_	_	_	_	_	_	_
    20	.	.	PUNCT	.	_	11	punct	_	_	_	_	_	_	_	_	_	_	_
    

    I assumed that all preps were supposed to be annotated, but perhaps not?

    opened by lgessler 3
  • Apostrophes removed in preprocessing?

    Apostrophes removed in preprocessing?

    Looking through the data, there are a LOT of sentences where clitics are tokenized off but lack an apostrophe. Is that just the genre or did they get lost in preprocessing?

    opened by nschneid 2
  • Dataset requested

    Dataset requested

    Hi all,

    I would like to request the PASTRIE dataset accompanying the paper "PASTRIE: A Corpus of Prepositions Annotated with Supsersense Tags in Reddit International English".

    Thanks for reply.

    opened by fj-morales 2
  • SNACS supersense tags should start with

    SNACS supersense tags should start with "p."

    For compatibility with STREUSLE, it should be p.Locus, p.Theme, etc.

    Special labels like `i `d `c `$ ?? should not start with p.. In fact, the backtick labels from annotation are not represented as such in STREUSLE—they are reflected in the LEXCAT column of the data.

    opened by nschneid 0
  • Questionable adpositional MWEs

    Questionable adpositional MWEs

    • in_male_term — from "in male terms"; should be in_term (at most)
    • in_the_first_place
    • in_my_hand — from "in my hands"; should be in_hand (at most)
    • for_quite_some_time — just Duration for, weak MWE?
    • at_all_time — from what should have been "at all times". OK?
    • on_a_smaller_scale — omit adjective?
    • withouth — typo
    • see_as — "seeing as" (deverbal MWE acting like a preposition)
    opened by nschneid 0
  • Some undersegmentation of sentences

    Some undersegmentation of sentences

    Despite manual editing there are still places where a long sentence ought to be split up (esp. where it consists of a blockquoted sentence with > followed by a response). Looking for multiple consecutive spaces in the raw text uncovers some of these (as well as some discourse appendages like emoticons, which should probably remain in the same UD sentence).

    It would be nice to write a script to help clean these up—the tricky part is updating offsets in each parse.

    opened by nschneid 0
Releases(v2.0.1)
  • v2.0.1(Nov 21, 2021)

  • v2.0(Oct 22, 2021)

    • Switch to full .conllulex format following STREUSLE
      • add lexcats (#3), morphological features, newdoc directives
    • Scripts for validation and format conversion
    • Clean up various annotation issues, including:
      • restore apostrophes and fixing other conversion problems (#6, #9)
      • include pretokenized raw text (#12)
    Source code(tar.gz)
    Source code(zip)
  • v1.0.1(Dec 14, 2020)

    • Added .json file format
    • Switched lemmatization and pos tagging from StanfordNLP 0.2.0 to Stanza 1.1.1
    • Corrected rare encoding issue from v1.0
    Source code(tar.gz)
    Source code(zip)
Owner
NERT @ Georgetown
NERT @ Georgetown
This is an official implementation of CvT: Introducing Convolutions to Vision Transformers.

Introduction This is an official implementation of CvT: Introducing Convolutions to Vision Transformers. We present a new architecture, named Convolut

Bin Xiao 175 Jan 08, 2023
Use AI to generate a optimized stock portfolio

Use AI, Modern Portfolio Theory, and Monte Carlo simulation's to generate a optimized stock portfolio that minimizes risk while maximizing returns. Ho

Greg James 30 Dec 22, 2022
A tool for making map images from OpenTTD save games

OpenTTD Surveyor A tool for making map images from OpenTTD save games. This is not part of the main OpenTTD codebase, nor is it ever intended to be pa

Aidan Randle-Conde 9 Feb 15, 2022
통일된 DataScience 폴더 구조 제공 및 가상환경 작업의 부담감 해소

Lucas coded by linux shell 목차 Mac버전 CookieCutter (autoenv) 1.How to Install autoenv 2.폴더 진입 시, activate 구현하기 3.폴더 탈출 시, deactivate 구현하기 4.Alias 설정하기 5

ello 3 Feb 21, 2022
Explaining in Style: Training a GAN to explain a classifier in StyleSpace

Explaining in Style: Official TensorFlow Colab Explaining in Style: Training a GAN to explain a classifier in StyleSpace Oran Lang, Yossi Gandelsman,

Google 197 Nov 08, 2022
Reading list for research topics in Masked Image Modeling

awesome-MIM Reading list for research topics in Masked Image Modeling(MIM). We list the most popular methods for MIM, if I missed something, please su

ligang 231 Dec 07, 2022
Finite-temperature variational Monte Carlo calculation of uniform electron gas using neural canonical transformation.

CoulombGas This code implements the neural canonical transformation approach to the thermodynamic properties of uniform electron gas. Building on JAX,

FermiFlow 9 Mar 03, 2022
NaturalProofs: Mathematical Theorem Proving in Natural Language

NaturalProofs: Mathematical Theorem Proving in Natural Language NaturalProofs: Mathematical Theorem Proving in Natural Language Sean Welleck, Jiacheng

Sean Welleck 83 Jan 05, 2023
GPT, but made only out of gMLPs

GPT - gMLP This repository will attempt to crack long context autoregressive language modeling (GPT) using variations of gMLPs. Specifically, it will

Phil Wang 80 Dec 01, 2022
Submission to Twitter's algorithmic bias bounty challenge

Twitter Ethics Challenge: Pixel Perfect Submission to Twitter's algorithmic bias bounty challenge, by Travis Hoppe (@metasemantic). Abstract We build

Travis Hoppe 4 Aug 19, 2022
RTS3D: Real-time Stereo 3D Detection from 4D Feature-Consistency Embedding Space for Autonomous Driving

RTS3D: Real-time Stereo 3D Detection from 4D Feature-Consistency Embedding Space for Autonomous Driving (AAAI2021). RTS3D is efficiency and accuracy s

71 Nov 29, 2022
Pytorch Lightning code guideline for conferences

Deep learning project seed Use this seed to start new deep learning / ML projects. Built in setup.py Built in requirements Examples with MNIST Badges

Pytorch Lightning 1k Jan 06, 2023
Video Matting via Consistency-Regularized Graph Neural Networks

Video Matting via Consistency-Regularized Graph Neural Networks Project Page | Real Data | Paper Installation Our code has been tested on Python 3.7,

41 Dec 26, 2022
Safe Local Motion Planning with Self-Supervised Freespace Forecasting, CVPR 2021

Safe Local Motion Planning with Self-Supervised Freespace Forecasting By Peiyun Hu, Aaron Huang, John Dolan, David Held, and Deva Ramanan Citing us Yo

Peiyun Hu 90 Dec 01, 2022
Collaborative forensic timeline analysis

Timesketch Table of Contents About Timesketch Getting started Community Contributing About Timesketch Timesketch is an open-source tool for collaborat

Google 2.1k Dec 28, 2022
NaijaSenti is an open-source sentiment and emotion corpora for four major Nigerian languages

NaijaSenti is an open-source sentiment and emotion corpora for four major Nigerian languages. This project was supported by lacuna-fund initiatives. Jump straight to one of the sections below, or jus

Hausa Natural Language Processing 14 Dec 20, 2022
Interactive Visualization to empower domain experts to align ML model behaviors with their knowledge.

An interactive visualization system designed to helps domain experts responsibly edit Generalized Additive Models (GAMs). For more information, check

InterpretML 83 Jan 04, 2023
Unofficial PyTorch implementation of SimCLR by Google Brain

Unofficial PyTorch implementation of SimCLR by Google Brain

Rishabh Anand 2 Oct 13, 2021
The implementation of the lifelong infinite mixture model

Lifelong infinite mixture model 📋 This is the implementation of the Lifelong infinite mixture model 📋 Accepted by ICCV 2021 Title : Lifelong Infinit

Fei Ye 5 Oct 20, 2022
Python package facilitating the use of Bayesian Deep Learning methods with Variational Inference for PyTorch

PyVarInf PyVarInf provides facilities to easily train your PyTorch neural network models using variational inference. Bayesian Deep Learning with Vari

342 Dec 02, 2022