GrammarTagger — A Neural Multilingual Grammar Profiler for Language Learning

Overview

GrammarTagger is an open-source toolkit for grammatical profiling, aimed at language learning. It analyzes text in English and Chinese, identifies the grammatical items that appear in the input, and estimates the difficulty level of the input.

Usage

GrammarTagger is written in Python (3.7+) and built on AllenNLP (2.1.0+). If you have conda installed, you can set up the environment as follows:

git clone https://github.com/octanove/grammartagger.git
cd grammartagger
conda create -n grammartagger python=3.7
conda activate grammartagger
pip install -r requirements.txt

Also, download the pretrained models (see below). After these steps, you can run GrammarTagger as follows:

English:

$ echo 'He loves to learn new languages, and last month he practiced some lessons in Spanish.' | python scripts/predict.py model-en-multi.tar.gz | jq
{
  "spans": [
    {
      "span": [0, 3],
      "tokens": ["[CLS]", "he", "loves", "to"],
      "label": "194:VP.SV.AFF"
    },
    {
      "span": [2, 2],
      "tokens": ["loves"],
      "label": "60:TA.PRESENT.does.AFF"
    },
    {
      "span": [2, 4],
      "tokens": ["loves", "to", "learn"],
      "label": "101:TO.VV_to_do"
    },
    ...
  ],
  "tokens": [
      "[CLS]", "he", "loves", "to", "learn", "new", "languages", ",",
      "and", "last", "month", "he", "practiced", "some", "lessons", "in", "spanish", ".", "[SEP]"
  ],
  "level_probs": {
    "c2": 0.008679441176354885,
    "b2": 0.005526999477297068,
    "c1": 0.05267713591456413,
    "b1": 0.06360447406768799,
    "a2": 0.06990284472703934,
    "a1": 0.7954732775688171
  }
}
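
Judging from the sample output above, each `span` appears to be an inclusive `[start, end]` pair of indices into the top-level `tokens` list. The following is a minimal sketch under that assumption; the helper name `span_tokens` is hypothetical and not part of GrammarTagger, and the JSON is an abbreviated copy of the output shown above.

```python
import json

# Abbreviated copy of the English sample output (level_probs omitted).
output = json.loads("""
{
  "spans": [
    {"span": [0, 3], "tokens": ["[CLS]", "he", "loves", "to"], "label": "194:VP.SV.AFF"},
    {"span": [2, 2], "tokens": ["loves"], "label": "60:TA.PRESENT.does.AFF"},
    {"span": [2, 4], "tokens": ["loves", "to", "learn"], "label": "101:TO.VV_to_do"}
  ],
  "tokens": ["[CLS]", "he", "loves", "to", "learn", "new", "languages", ",",
             "and", "last", "month", "he", "practiced", "some", "lessons",
             "in", "spanish", ".", "[SEP]"]
}
""")


def span_tokens(tokens, span):
    """Return the tokens covered by an inclusive [start, end] span (assumed convention)."""
    start, end = span
    return tokens[start:end + 1]


for item in output["spans"]:
    recovered = span_tokens(output["tokens"], item["span"])
    # The slice should reproduce the "tokens" field reported for the span.
    assert recovered == item["tokens"]
    print(item["label"], "->", " ".join(recovered))
```

Note that the indices count the `[CLS]` token at position 0, so they are offsets into the model's tokenization rather than into the raw input string.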

Chinese:

$ echo '她住得很远,我想送她回去。' | python scripts/predict.py model-zh-multi.tar.gz | jq
{
  "spans": [
    {
      "span": [2, 5],
      "tokens": ["住", "得", "很", "远"],
      "label": "2.12.1:V 得 A:(using adverbs)"
    },
    {
      "span": [4, 4],
      "tokens": ["很"],
      "label": "1.06.2:很:very"
    },
    {
      "span": [8, 8],
      "tokens": ["想"],
      "label": "1.08.1:想:to want"
    }
  ],
  "tokens": ["[CLS]", "她", "住", "得", "很", "远", ",", "我", "想", "送", "她", "回", "去", "。", "[SEP]"],
  "level_probs": {
    "HSK 6": 9.971807230613194e-06,
    "HSK 5": 0.0011904890416190028,
    "HSK 3": 0.005279902834445238,
    "HSK 4": 0.00014815296162851155,
    "HSK 2": 0.9917035102844238,
    "HSK 1": 0.0016456041485071182
  }
}
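
The `level_probs` object is not sorted, so it can help to pick out the most probable level programmatically. Here is a small sketch that does this for the two sample outputs above; the function name `most_likely_level` is hypothetical, and the dictionaries are copied from the examples rather than produced by running the model.

```python
def most_likely_level(level_probs):
    """Return the (level, probability) pair with the highest probability."""
    return max(level_probs.items(), key=lambda pair: pair[1])


# Values copied from the English sample output.
english_probs = {
    "c2": 0.008679441176354885,
    "b2": 0.005526999477297068,
    "c1": 0.05267713591456413,
    "b1": 0.06360447406768799,
    "a2": 0.06990284472703934,
    "a1": 0.7954732775688171,
}

# Values copied from the Chinese sample output.
chinese_probs = {
    "HSK 6": 9.971807230613194e-06,
    "HSK 5": 0.0011904890416190028,
    "HSK 3": 0.005279902834445238,
    "HSK 4": 0.00014815296162851155,
    "HSK 2": 0.9917035102844238,
    "HSK 1": 0.0016456041485071182,
}

for name, probs in [("English", english_probs), ("Chinese", chinese_probs)]:
    level, prob = most_likely_level(probs)
    print(f"{name}: most likely level is {level} (p = {prob:.3f})")
```

In a pipeline, the same function could be applied to the dictionary obtained by parsing the output of `scripts/predict.py` with `json.load(sys.stdin)`.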

Technical details

GrammarTagger is based on pretrained contextualizers, namely BERT (Devlin et al. 2019), and span classification. See the following paper for more details.

Hagiwara et al. 2021. GrammarTagger: A Multilingual, Minimally-Supervised Grammar Profiler for Language Education
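
To give a rough idea of what span classification involves, the sketch below enumerates candidate spans over a token sequence, which is the set of spans a classifier would then label. This is a generic illustration and not necessarily how GrammarTagger does it: the function `enumerate_spans` and the width limit of 4 are assumptions made for the example, and the paper should be consulted for the actual model.

```python
def enumerate_spans(num_tokens, max_width):
    """List all inclusive (start, end) spans that cover at most max_width tokens."""
    spans = []
    for start in range(num_tokens):
        # end is inclusive, so a span of width w ends at start + w - 1.
        for end in range(start, min(start + max_width, num_tokens)):
            spans.append((start, end))
    return spans


# Tokens copied from the Chinese sample output above.
tokens = ["[CLS]", "她", "住", "得", "很", "远", ",", "我", "想", "送",
          "她", "回", "去", "。", "[SEP]"]

# max_width=4 is a hypothetical limit chosen only for this illustration.
candidates = enumerate_spans(len(tokens), max_width=4)

# Each candidate would be scored by a classifier; here we only show
# that the spans reported in the sample output are among the candidates.
for start, end in [(2, 5), (4, 4), (8, 8)]:
    assert (start, end) in candidates
    print((start, end), "".join(tokens[start:end + 1]))
```

In a real model, each candidate span would be represented using the contextual embeddings of its tokens and assigned either a grammar label or a "no label" class.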

Pretrained models

The pretrained models are licensed under CC BY-NC-ND 4.0 for academic and personal use. If you are interested in a commercial license, please contact [email protected]. We are also working on improved models with wider grammar coverage and higher accuracy.

Owner
Octanove Labs