Extract knowledge from raw text

Overview

This repository is nearly a copy-paste of "From Text to Knowledge: The Information Extraction Pipeline", with some cosmetic updates. I made an installable version so that it is easy to evaluate. The original code is available @ trinity-ie. To add some value, I added the Luke model to predict relations between entities. Luke is a transformer (from the same family as BERT); its particularity is that, during pre-training, it learns parameters dedicated to entities within the attention mechanism. Luke performs very well on entity-related tasks. Here we use the version of Luke fine-tuned on the TACRED dataset.

In that blog post, Tomaz Bratanic presents a complete pipeline for extracting triples from raw text. The first step resolves coreferences. The second step identifies entities using the Wikifier API. Finally, Tomaz Bratanic proposes using the OpenNRE library to extract relations between the entities found in the text.
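
The last step needs candidate entity pairs to classify. As a minimal sketch, assuming the earlier steps yield a list of entity names per sentence (the `candidate_pairs` helper and the sample data are hypothetical and not part of textokb), ordered pairs can be enumerated so that each direction is scored separately:

```python
from itertools import permutations


def candidate_pairs(entities_per_sentence):
    """Yield (sentence_index, head, tail) for every ordered pair of distinct entities."""
    for index, entities in enumerate(entities_per_sentence):
        # Deduplicate while preserving the order of first appearance.
        unique = list(dict.fromkeys(entities))
        for head, tail in permutations(unique, 2):
            yield index, head, tail


# Hypothetical output of the coreference and entity linking steps.
linked = [
    ["Elon Musk", "SpaceX"],
    ["Elon Musk", "Zip2", "Kimbal Musk"],
]
pairs = list(candidate_pairs(linked))
```

Ordered pairs matter because relations are directional: a relation classifier may score `(Elon Musk, Kimbal Musk)` and `(Kimbal Musk, Elon Musk)` differently.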

🔧 Installation

pip install git+https://github.com/raphaelsty/textokb --upgrade

You will have to download the spaCy English model to do coreference resolution:

pip install spacy==2.1.0 && python -m spacy download en

Quick start

>>> from textokb import pipeline

# A list of types of entities that I search:
>>> types = [
...   "human", 
...   "person", 
...   "company", 
...   "enterprise", 
...   "business", 
...   "geographic region", 
...   "human settlement", 
...   "geographic entity", 
...   "territorial entity type", 
...   "organization",
... ]

>>> device = "cpu" # or device = "cuda" if you have a GPU.

>>> pipeline = pipeline.TextToKnowledge(key="jueidnxsctiurpwykpumtsntlschpx", types=types, device=device)

>>> text = """Elon Musk is a business magnate, industrial designer, and engineer. He is the founder, 
... CEO, CTO, and chief designer of SpaceX. He is also early investor, CEO, and product architect of 
... Tesla, Inc. He is also the founder of The Boring Company and the co-founder of Neuralink. A 
... centibillionaire, Musk became the richest person in the world in January 2021, with an estimated 
... net worth of $185 billion at the time, surpassing Jeff Bezos. Musk was born to a Canadian mother 
... and South African father and raised in Pretoria, South Africa. He briefly attended the University 
... of Pretoria before moving to Canada aged 17 to attend Queen's University. He transferred to the 
... University of Pennsylvania two years later, where he received dual bachelor's degrees in economics 
... and physics. He moved to California in 1995 to attend Stanford University, but decided instead to 
... pursue a business career. He went on co-founding a web software company Zip2 with his brother 
... Kimbal Musk."""

>>> pipeline.process_sentence(text = text)
                          head       relation                        tail     score
0                  Tesla, Inc.      architect                   Elon Musk  0.803398
1                  Tesla, Inc.  field of work          The Boring Company  0.733903
2                    Elon Musk      residence  University of Pennsylvania  0.648434
3                    Elon Musk  field of work          The Boring Company  0.592007
4                    Elon Musk   manufacturer                 Tesla, Inc.  0.553206
5           The Boring Company   manufacturer                 Tesla, Inc.  0.515352
6                    Elon Musk      developer                 Kimbal Musk  0.475639
7   University of Pennsylvania     subsidiary                   Elon Musk  0.435384
8           The Boring Company      developer                   Elon Musk  0.387753
9                       SpaceX         winner                   Elon Musk  0.374090
10                 Kimbal Musk        sibling                   Elon Musk  0.355944
11                   Elon Musk   manufacturer                      SpaceX  0.221294
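
Low-scoring rows in the table above, such as `SpaceX winner Elon Musk`, are often spurious. A minimal sketch of a score filter follows. It assumes the result has been converted to a list of dicts (if the result is a pandas DataFrame, `df.to_dict("records")` does this); `keep_confident` and the 0.5 threshold are illustrative choices, not part of textokb:

```python
def keep_confident(triples, threshold=0.5):
    """Return triples whose score is at least `threshold`, highest score first."""
    kept = [triple for triple in triples if triple["score"] >= threshold]
    return sorted(kept, key=lambda triple: triple["score"], reverse=True)


# Three rows copied from the output above.
rows = [
    {"head": "SpaceX", "relation": "winner", "tail": "Elon Musk", "score": 0.374090},
    {"head": "Tesla, Inc.", "relation": "architect", "tail": "Elon Musk", "score": 0.803398},
    {"head": "Kimbal Musk", "relation": "sibling", "tail": "Elon Musk", "score": 0.355944},
]
confident = keep_confident(rows, threshold=0.5)
```

A fixed threshold is crude: here it also discards the correct `sibling` row (0.355944), so the value is worth tuning for each model.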

By default, the model used is wiki80_cnn_softmax. I also added the Luke model (Language Understanding with Knowledge-based Embeddings), which provides a pre-trained model for relation extraction. The Luke results seem to be of better quality, but the model predicts fewer relations.

Here is how to use Luke:

>>> from textokb import pipeline

# A list of types of entities that I search:
>>> types = [
...   "human", 
...   "person", 
...   "company", 
...   "enterprise", 
...   "business", 
...   "geographic region", 
...   "human settlement", 
...   "geographic entity", 
...   "territorial entity type", 
...   "organization",
... ]

>>> device = "cpu" # or device = "cuda" if you have a GPU.

>>> pipeline = pipeline.TextToKnowledge(key="jueidnxsctiurpwykpumtsntlschpx", types=types, device=device, luke=True)

>>> text = """Elon Musk is a business magnate, industrial designer, and engineer. He is the founder, 
... CEO, CTO, and chief designer of SpaceX. He is also early investor, CEO, and product architect of 
... Tesla, Inc. He is also the founder of The Boring Company and the co-founder of Neuralink. A 
... centibillionaire, Musk became the richest person in the world in January 2021, with an estimated 
... net worth of $185 billion at the time, surpassing Jeff Bezos. Musk was born to a Canadian mother 
... and South African father and raised in Pretoria, South Africa. He briefly attended the University 
... of Pretoria before moving to Canada aged 17 to attend Queen's University. He transferred to the 
... University of Pennsylvania two years later, where he received dual bachelor's degrees in economics 
... and physics. He moved to California in 1995 to attend Stanford University, but decided instead to 
... pursue a business career. He went on co-founding a web software company Zip2 with his brother 
... Kimbal Musk."""

>>> pipeline.process_sentence(text = text)
                 head              relation                        tail      score
0           Elon Musk          per:siblings                 Kimbal Musk  10.436224
1         Kimbal Musk          per:siblings                   Elon Musk  10.040980
2           Elon Musk  per:schools_attended  University of Pennsylvania   9.808870
3  The Boring Company        org:founded_by                   Elon Musk   8.823962
4           Elon Musk       per:employee_of                 Tesla, Inc.   8.245111
5              SpaceX        org:founded_by                   Elon Musk   7.795369
6           Elon Musk       per:employee_of                      SpaceX   7.765485
7           Elon Musk       per:employee_of          The Boring Company   7.217330
8         Tesla, Inc.        org:founded_by                   Elon Musk   7.002990

Here is the list of relations available with Luke (studio-ousia/luke-large-finetuned-tacred):

[
    'no_relation',
    'org:alternate_names',
    'org:city_of_headquarters',
    'org:country_of_headquarters',
    'org:dissolved',
    'org:founded',
    'org:founded_by',
    'org:member_of',
    'org:members',
    'org:number_of_employees/members',
    'org:parents',
    'org:political/religious_affiliation',
    'org:shareholders',
    'org:stateorprovince_of_headquarters',
    'org:subsidiaries',
    'org:top_members/employees',
    'org:website',
    'per:age',
    'per:alternate_names',
    'per:cause_of_death',
    'per:charges',
    'per:children',
    'per:cities_of_residence',
    'per:city_of_birth',
    'per:city_of_death',
    'per:countries_of_residence',
    'per:country_of_birth',
    'per:country_of_death',
    'per:date_of_birth',
    'per:date_of_death',
    'per:employee_of',
    'per:origin',
    'per:other_family',
    'per:parents',
    'per:religion',
    'per:schools_attended',
    'per:siblings',
    'per:spouse',
    'per:stateorprovince_of_birth',
    'per:stateorprovince_of_death',
    'per:stateorprovinces_of_residence',
    'per:title'
]
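
Every label in this list other than `no_relation` has the form `subject_type:relation_name`, with an `org` or `per` prefix. A small sketch (the helper names are hypothetical and not part of textokb) that splits labels and discards `no_relation` predictions:

```python
def split_label(label):
    """Split a TACRED label into (subject_type, relation_name)."""
    if label == "no_relation":
        return None, label
    subject_type, _, relation_name = label.partition(":")
    return subject_type, relation_name


def drop_no_relation(predictions):
    """Keep (head, label, tail) predictions whose label is an actual relation."""
    return [p for p in predictions if p[1] != "no_relation"]


predictions = [
    ("Elon Musk", "per:siblings", "Kimbal Musk"),
    ("SpaceX", "no_relation", "Kimbal Musk"),
]
relations = drop_no_relation(predictions)
```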

Notes

The first time you initialize the model with OpenNRE or Luke, you may have to wait a few minutes for the model to download. Since we use the Wikifier API for named entity linking (NEL), your computer must be connected to the internet. You can create your own credentials for the API here: Wikifier API registration. Tomaz Bratanic mentions the possibility of replacing Wikifier with BLINK; however, this library is very RAM-intensive.
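
To illustrate how a `types` list can restrict the linked entities, here is a hedged sketch that filters an already decoded Wikifier response. It assumes the response is a dict with an `annotations` list whose items carry a `title` and an optional `wikiDataClasses` list of `{"enLabel": ...}` entries; the sample data is made up, and `filter_by_type` is a hypothetical helper, not the code used by textokb:

```python
def filter_by_type(response, types):
    """Return titles of annotations with at least one class label in `types`."""
    wanted = set(types)
    kept = []
    for annotation in response.get("annotations", []):
        # Annotations without class information are skipped.
        labels = {c.get("enLabel") for c in annotation.get("wikiDataClasses", [])}
        if labels & wanted:
            kept.append(annotation["title"])
    return kept


# Made-up response in the assumed shape (no network call is performed).
sample_response = {
    "annotations": [
        {"title": "Elon Musk", "wikiDataClasses": [{"enLabel": "human"}]},
        {"title": "Physics", "wikiDataClasses": [{"enLabel": "academic discipline"}]},
        {"title": "Unknown"},
    ]
}
people = filter_by_type(sample_response, ["human", "person"])
```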

♻️ Work in progress

I failed to use the wiki80_bert_softmax model from OpenNRE because of an error when loading the pre-trained model (i.e. TensorFlow errors on Mac M1). I used the lighter wiki80_cnn_softmax model when reproducing Tomaz Bratanic's blog post. It would be interesting to be able to easily add different models, especially transformers. The APIs I used are not optimized for batch predictions. There is a lot of room for improvement simply by updating the OpenNRE and Luke APIs.

Releases(0.0.1)
Owner
Raphael Sourty
PhD Student @ IRIT and Renault