AnnIE - Annotation Platform

Overview

AnnIE is a tool for creating open information extraction annotations from text files. The output is delivered as .tsv files.

Installation

This manual assumes Python 3.7 or higher, running on Windows. On Linux, use "pip3" and "python3" instead of "pip" and "python".

First, install all required modules, preferably into a venv:

pip install -r requirements.txt

Then start the tool by running:

python openie.py

On execution, the terminal should show the line 'Running on http://127.0.0.1:5789/'. A window of your default browser should open automatically; if not, open a browser and go to http://127.0.0.1:5789/. If the port differs from 5789, the port value in the URL inside index.html has to be adjusted accordingly. The AnnIE UI should then be displayed.

Hosting the tool

To host the tool on a server, use the AnnIE server files provided inside "annie-server.zip". The endpoint URL in line 5 of the App.js file needs to be set to the URL the tool is bound to. The server version can then be started by executing the following two commands:

pip3 install -r requirements.txt
gunicorn --bind 0.0.0.0:80 --workers=4 openie:app

Settings

The settings of the tool can be adjusted in the config.json file.

Accepted values are:

  • "Language": English, German, French, Chinese
  • "POSLabels": true, false
  • "Coloring": all, verbs, named-entities, none
  • "Word-sort": true, false
  • "Compound-words": true, false
  • "Quotation-marks": true, false
  • "Named-Entities": true, false
  • "Show-Indices": true, false

Changing the value of "Language" selects the spaCy language model used for tagging. Enabling "POSLabels" shows all labels applied to the tokens in small boxes underneath each token's text. The value of "Coloring" adjusts how many distinct colors are used for highlighting the different labels. If "Word-sort" is enabled, all selected words are put into the order of their appearance in the sentence; if it is disabled, they remain in the order in which they were clicked. Enabling "Compound-words" prevents hyphenated words from being split at their "-". Enabling "Quotation-marks" shows quotation marks in their own word boxes. Disabling "Named-Entities" turns off NER labeling. Disabling "Show-Indices" displays the output without any indices after the tokens.
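A minimal example of a config.json built from the accepted values above; the flat key layout is an assumption for illustration, so compare it against the config.json shipped with the tool:

{
    "Language": "English",
    "POSLabels": true,
    "Coloring": "all",
    "Word-sort": true,
    "Compound-words": true,
    "Quotation-marks": true,
    "Named-Entities": true,
    "Show-Indices": true
}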

Special Features

As long as the "CTRL" key is pressed, you can hover over words to select them faster than by clicking on every word explicitly.

Adding languages

To integrate additional languages into the tool, look up the spaCy model name of the required language at https://spacy.io/models. Then add the following lines to the end of the definition "read_config_file(self)" in the POS_Tagger.py file, substituting SPACY_MODEL_NAME with the spaCy model name and DESIRED_LANGUAGE with the name of the added language (e.g. SPACY_MODEL_NAME=en_core_web_sm, DESIRED_LANGUAGE=English):

if configs["Language"] == "DESIRED_LANGUAGE":
    try:
        self.nlp = spacy.load("SPACY_MODEL_NAME")
    except OSError:
        # The model is not installed yet: download it once, then load it
        if platform.system() == "Windows":
            os.system('python -m spacy download SPACY_MODEL_NAME')
        else:
            os.system('python3 -m spacy download SPACY_MODEL_NAME')
        self.nlp = spacy.load("SPACY_MODEL_NAME")
    print("Successfully loaded language: DESIRED_LANGUAGE")

To get the application running with labels in the new language, do not forget to set the "Language" field in the config.json file to the value of DESIRED_LANGUAGE.
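For example, to add Spanish, one could use the spaCy model es_core_news_sm (the model name is taken from https://spacy.io/models; the choice of the small model is an assumption for illustration). The block inserted into "read_config_file(self)" would then look like this:

if configs["Language"] == "Spanish":
    try:
        self.nlp = spacy.load("es_core_news_sm")
    except OSError:
        if platform.system() == "Windows":
            os.system('python -m spacy download es_core_news_sm')
        else:
            os.system('python3 -m spacy download es_core_news_sm')
        self.nlp = spacy.load("es_core_news_sm")
    print("Successfully loaded language: Spanish")

Afterwards, set "Language" in config.json to Spanish.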

Token Labeling

To highlight additional parts of the text and attach different labels to tokens, the Tagger class within the tokenizer.py file needs to be adjusted first. Within this Tagger class, the "tag_input" definition can be modified so that the input text receives other labels or other models are used for POS tagging. This definition is called every time the application sends a sentence of the input text to the backend, requesting the sentence back together with its labels.

The input to this definition is a string value containing, for example, a sentence. In a first step, this sentence is labeled by applying POS tagging to each contained token using spaCy. Each word of the input is saved as its own TaggedWord object, which includes, next to the text representation of the word, the "labels" field in which the tag label is stored, as well as the word's index within the sentence. In the next step, the labels of tokens recognized as named entities are overwritten to contain the NER tags instead of the POS tags, if this functionality is enabled in the config file. Finally, special cases like the appearance of quotation marks and compound words are handled depending on the values set in the configuration.

If additional labels should be added next to the existing ones, the easiest way is to copy the if-clause for named-entity labeling (lines 101 to 117) and adjust it to overwrite the POS labels with the new labels for the relevant tokens; a hypothetical sketch is shown after the code below. The new labels are then shown within the tool underneath the token's text if "POSLabels" is enabled in the configuration.

    """Tags each input word with an according Label depending on the previous set configuration

    Args:
        text (str): The text containing the words (=tokens) which will be tagged with labels

    Returns:
        list: A list of TaggedWord elements representing each word of the input text with additional information,
                of the tokens index and its label content.
    """

    def tag_input(self, text):
        result = []
        counter = 0
        doc = self.nlp(text)

        # Apply POS-Tagging as Labels to the tokens
        for token in doc:
            tagged_word = TaggedWord(word=str(token.text), index=counter, label=str(token.pos_))
            result.append(tagged_word)
            counter += 1

        # Replace the previously set POS-Tags with NER-Tags if they are enabled in the configs
        if self.named_entites:
            for i in range(len(doc.ents)):
                if doc.ents[i].text in text:
                    entity = doc.ents[i].text
                    ent_label = doc.ents[i].label_
                    if ' ' in entity:
                        entity = entity.rsplit(' ')
                        for ele in result:
                            for ent in entity:
                                if ent == ele.word:
                                    ele.label = str(ent_label)
                    else:
                        for ele in result:
                            if entity == ele.word:
                                ele.label = str(ent_label)

        # Filters the input text for special cases like quotation marks and compound words
        changed = False
        if self.compound_words:
            #....

For complete customizations, the tagging functionality can also be exchanged entirely. In that case, it is important that the new definitions still return a list of TaggedWord elements, to ensure correct further processing and transmission to the application's frontend.
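As an illustration, the following is a minimal, hypothetical sketch of such an additional labeling step, modeled on the named-entity if-clause inside "tag_input". The config flag self.custom_labels, the term set, and the label name "DOMAIN" are invented for this example and are not part of AnnIE:

# Hypothetical extra labeling step inside tag_input, placed after the
# named-entity block: overwrite the POS label of every token that
# appears in a custom term set.
CUSTOM_TERMS = {"AnnIE", "Mannheim"}  # illustrative term set

if self.custom_labels:  # assumed config flag, analogous to the NER switch
    for ele in result:
        if ele.word in CUSTOM_TERMS:
            ele.label = "DOMAIN"  # hypothetical label shown under the token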

Word Highlighting

Example of the included coloring schemes.

If the new labels are required to be displayed in a different color, the application's frontend needs to be adjusted accordingly. To adjust the color scheme, the hex color values inside the style.css files need to be set to the desired color codes. This applies to the token button itself as well as to the button's "hover" property, for which usually a darker version of the same color is used. If completely new labels are introduced to the tool, the CSS needs to include an appropriate class to handle them:

/*Add YOURLABEL class to css file */
.btn-YOURLABEL {
  color: white;
  background-color: #87ff00;
  border-color: #87ff00;
  opacity: 0.5;
}
.btn-YOURLABEL:hover,
.btn-YOURLABEL:focus {
  color: white;
  background-color: #75de00;
  border-color: #75de00;
}

If completely new coloring schemes are required, they can either be added alongside or exchanged for the standard functions implementing the included schemes. The different colorings are applied using the functions "fullColoring()", "verbColoring()", "namedEntitiesColoring()", and "noneColoring()" inside "GraphicInterface.js". These functions can be adjusted by changing the switch statements that decide which tokens are colorized depending on their label; new cases can simply be added, and superfluous colored labels can be removed.

// Create a new color labeling function. The HTML inside the template
// literals was lost in rendering; the button markup below is a
// reconstruction based on the CSS classes used in downgrade() and should
// be aligned with the markup actually used in GraphicInterface.js.
function yourLabelColoring(labelText, labelPos, index) {
  let output = "";
  if (labelPos != "") {
    switch (labelPos) {
      // Remove/adjust/add cases
      case "NOUN":
        output += `<button class="btn btn-noun ml-1 mb-1">${labelText} ${index}</button>`;
        break;
      case "VERB":
        output += `<button class="btn btn-verb ml-1 mb-1">${labelText} ${index}</button>`;
        break;
      case "YOURLABEL":
        output += `<button class="btn btn-YOURLABEL ml-1 mb-1">${labelText} ${index}</button>`;
        break;
      default:
        output += `<button class="btn btn-secondary ml-1 mb-1">${labelText} ${index}</button>`;
    }
  } else {
    // Tokens without a label get the default button
    output += `<button class="btn btn-secondary ml-1 mb-1">${labelText} ${index}</button>`;
  }
  return output;
}

New coloring functions can rely on the provided ones. To add one, simply "register" it in the tool by adding it to the first switch statement inside the function "createTaggedContent()"; an example is shown below. For the tool to work properly, the function "downgrade()" additionally needs to be adjusted.

// Register the new labeling scheme:
function createTaggedContent(words) {
    var output = "";
    for (var i = 0; i < words.length; i++) {
        let labelText = words[i].text;
        let labelPos = words[i].posLabel;
        let type = words[i].type;
        let index = words[i].index;
        if (type == '') {
            switch (coloring) {
                case 'full': output += fullColoring(labelText, labelPos, index); break;
                case 'verbs': output += verbColoring(labelText, labelPos, index); break;
                case 'named-entities': output += namedEntitiesColoring(labelText, labelPos, index); break;
                case 'none': output += noneColoring(labelText, labelPos, index); break;
                // "Register" the newly created coloring function by replacing
                // yourLabelColoring with your new function's name
                case 'YOURLABEL': output += yourLabelColoring(labelText, labelPos, index); break;
                // End adjustments
                default: output += verbColoring(labelText, labelPos, index); break;
            }
        }
        // ...
    }
    return output;
}

// ...

// Adjust the downgrade function
function downgrade(targetElement) {
    //...
    else {
        // Insert an if-statement for the new coloring scheme; the string must
        // match the case registered in createTaggedContent()
        if (coloring == 'YOURLABEL') {
            switch (posLabel) {
                case 'NOUN': targetElement.className = "btn btn-noun ml-1 mb-1"; break;
                case 'VERB': targetElement.className = "btn btn-verb ml-1 mb-1"; break;
                case 'ADJ': targetElement.className = "btn btn-adjective ml-1 mb-1"; break;
                case 'ORG':
                case 'LOC':
                case 'PERSON':
                case 'GPE': targetElement.className = "btn btn-namedEntity ml-1 mb-1"; break;
                // Add a case for YOURLABEL
                case 'YOURLABEL': targetElement.className = "btn btn-YOURLABEL ml-1 mb-1"; break;
                // End adjustments
                default: targetElement.className = "btn btn-secondary ml-1 mb-1"; break;
            }
        }
    }
    //...
}

General Code Structure

AnnIE high-level architecture diagram & Data model

The following list points out the relevant functionality of each part of the source code:

Frontend JavaScript files (in the directory ./static/js):

  • App.js: Implements the application's main controller, handles the system logic. Includes the functions for file upload, loading of used files, loading of the configuration settings, as well as the general application logic for handling clusters, triples, etc., by applying EventListeners.
  • DataStructure.js: Contains the internal data models on which the application relies. An input text is processed as an Annotation object, which contains the file data as well as the clustering data for each sentence of the input. A Cluster consists of one or multiple Triples, which are organized in Word objects. The detailed data model is shown above.
  • GraphicInterface.js: This file handles everything related to the visualization of the tokens, tags, clusters, and triples. Besides that, the markdown functions are included there.
  • LoadSave.js: Saves and loads the current annotation progress with all selected cluster data into JSON files. These files are saved within the "data"-folder.
  • Output.js: Creates the text representation of the annotated clusters and triples and displays them at the bottom. The text representation is offered to be downloaded as a .tsv text file.
  • Tokenizer.js: Splits the input text into sentences and sends the text sentence-wise to the backend with the request to perform tokenization on it.
  • Visual.js: Includes helper functions applied to the tokens and the selection buttons so that they work properly. Can be seen as an extension of the GraphicInterface.js file.

Python files (Backend):

  • openie.py: Implementation of the server functionality, hosting a Flask server locally on port 5789 and opening the system's default browser. Offers the required endpoints for the frontend (a minimal structural sketch follows this list).
  • tokenizer.py: Includes the classes Tagger and TaggedWord. Applies part-of-speech tagging and named-entity recognition to the input data using the spaCy module for Python.
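The following is a minimal structural sketch of the pattern described for openie.py; the route, the static file layout, and the use of webbrowser are assumptions for illustration, not the actual implementation:

import webbrowser

from flask import Flask

app = Flask(__name__)

# Assumed endpoint serving the frontend; the real openie.py offers the
# endpoints the AnnIE frontend requires.
@app.route("/")
def index():
    return app.send_static_file("index.html")

if __name__ == "__main__":
    # Open the system's default browser on the port used by AnnIE
    webbrowser.open("http://127.0.0.1:5789/")
    app.run(port=5789)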

Usage Example

The following shows an example of annotating a sample sentence from the Guardian.

A short video walking through AnnIE and showing its functionality can be found here: https://drive.google.com/file/d/1E1gB_-zKE70jnw75LMYj1kkAMsMh37TA/view?usp=sharing

Acknowledgements

Developed by Niklas Friedrich, Kiril Gashteovski, Minying Yu, Bhushan Kotnis, Carolin Lawrence, Mathias Niepert, and Goran Glavas.

This tool has been developed at the Natural Language Processing and Information Retrieval Group at the University of Mannheim.
