"Inductive Entity Representations from Text via Link Prediction" @ The Web Conference 2021

Related tags

Deep Learningblp
Overview

Inductive entity representations from text via link prediction





This repository contains the code used for the experiments in the paper "Inductive entity representations from text via link prediction", presented at The Web Conference, 2021. To refer to our work, please use the following:

@inproceedings{daza2021inductive,
    title = {Inductive Entity Representations from Text via Link Prediction},
    author = {Daniel Daza and Michael Cochez and Paul Groth},
    booktitle = {Proceedings of The Web Conference 2021},
    year = {2021},
    doi = {10.1145/3442381.3450141},
}

In this work, we show how a BERT-based text encoder can be fine-tuned with a link prediction objective, in a graph where entities have an associated textual description. We call the resulting model BLP. There are three interesting properties of a trained BLP model:

  • It can predict a link between entities, even if one or both were not present during training.
  • It produces useful representations for a classifier, that don't require retraining the encoder.
  • It improves an information retrieval system, by better matching entities and questions about them.

Usage

Please follow the instructions next to reproduce our experiments, and to train a model with your own data.

1. Install the requirements

Creating a new environment (e.g. with conda) is recommended. Use requirements.txt to install the dependencies:

conda create -n blp python=3.7
conda activate blp
pip install -r requirements.txt

2. Download the data

Download the required compressed datasets into the data folder:

Download link Size (compressed)
UMLS (small graph for tests) 121 KB
WN18RR 6.6 MB
FB15k-237 21 MB
Wikidata5M 1.4 GB
GloVe embeddings 423 MB
DBpedia-Entity 1.3 GB

Then use tar to extract the files, e.g.

tar -xzvf WN18RR.tar.gz

Note that the KG-related files above contain both transductive and inductive splits. Transductive splits are commonly used to evaluate lookup-table methods like ComplEx, while inductive splits contain entities in the test set that are not present in the training set. Files with triples for the inductive case have the ind prefix, e.g. ind-train.txt.

2. Reproduce the experiments

Link prediction

To check that all dependencies are correctly installed, run a quick test on a small graph (this should take less than 1 minute on GPU):

./scripts/test-umls.sh

The following table is a adapted from our paper. The "Script" column contains the name of the script that reproduces the experiment for the corresponding model and dataset. For example, if you want to reproduce the results of BLP-TransE on FB15k-237, run

./scripts/blp-transe-fb15k237.sh
WN18RR FB15k-237 Wikidata5M
Model MRR Script MRR Script MRR Script
GlovE-BOW 0.170 glove-bow-wn18rr.sh 0.172 glove-bow-fb15k237.sh 0.343 glove-bow-wikidata5m.sh
BE-BOW 0.180 bert-bow-wn18rr.sh 0.173 bert-bow-fb15k237.sh 0.362 bert-bow-wikidata5m.sh
GloVe-DKRL 0.115 glove-dkrl-wn18rr.sh 0.112 glove-dkrl-fb15k237.sh 0.282 glove-dkrl-wikidata5m.sh
BE-DKRL 0.139 bert-dkrl-wn18rr.sh 0.144 bert-dkrl-fb15k237.sh 0.322 bert-dkrl-wikidata5m.sh
BLP-TransE 0.285 blp-transe-wn18rr.sh 0.195 blp-transe-fb15k237.sh 0.478 blp-transe-wikidata5m.sh
BLP-DistMult 0.248 blp-distmult-wn18rr.sh 0.146 blp-distmult-fb15k237.sh 0.472 blp-distmult-wikidata5m.sh
BLP-ComplEx 0.261 blp-complex-wn18rr.sh 0.148 blp-complex-fb15k237.sh 0.489 blp-complex-wikidata5m.sh
BLP-SimplE 0.239 blp-simple-wn18rr.sh 0.144 blp-simple-fb15k237.sh 0.493 blp-simple-wikidata5m.sh

Entity classification

After training for link prediction, a tensor of embeddings for all entities is computed and saved in a file with name ent_emb-[ID].pt where [ID] is the id of the experiment in the database (we use Sacred to manage experiments). Another file called ents-[ID].pt contains entity identifiers for every row in the tensor of embeddings.

To ease reproducibility, we provide these tensors, which are required in the entity classification task. Click on the ID, download the file into the output folder, and decompress it. An experiment can be reproduced using the following command:

python train.py node_classification with checkpoint=ID dataset=DATASET

where DATASET is either WN18RR or FB15k-237. For example:

python train.py node_classification with checkpoint=199 dataset=WN18RR
WN18RR FB15k-237
Model Acc. ID Acc. Bal. ID
GloVe-BOW 55.3 219 34.4 293
BE-BOW 60.7 218 28.3 296
GloVe-DKRL 55.5 206 26.6 295
BE-DKRL 48.8 207 30.9 294
BLP-TransE 81.5 199 42.5 297
BLP-DistMult 78.5 200 41.0 298
BLP-ComplEx 78.1 201 38.1 300
BLP-SimplE 83.0 202 45.7 299

Information retrieval

This task runs with a pre-trained model saved from the link prediction task. For example, if the model trained is blp with transe and it was saved as model.pt, then run the following command to run the information retrieval task:

python retrieval.py with model=blp rel_model=transe \
checkpoint='output/model.pt'

Using your own data

If you have a knowledge graph where entities have textual descriptions, you can train a BLP model for the tasks of inductive link prediction, and entity classification (if you also have labels for entities).

To do this, add a new folder inside the data folder (let's call it my-kg). Store in it a file containing the triples in your KG. This should be a text file with one tab-separated triple per line (let's call it all-triples.tsv).

To generate inductive splits, you can use data/utils.py. If you run

python utils.py drop_entities --file=my-kg/all-triples.tsv

this will generate ind-train.tsv, ind-dev.tsv, ind-test.tsv inside my-kg (see Appendix A in our paper for details on how these are generated). You can then train BLP-TransE with

python train.py with dataset='my-kg'

Alternative implementations

Owner
Daniel Daza
PhD student at VU Amsterdam and the University of Amsterdam, working on machine learning and knowledge graphs.
Daniel Daza
StableSims is an open-source project aimed at simulating MakerDAO's Dai stablecoin system

StableSims is an open-source project aimed at simulating MakerDAO's Dai stablecoin system, initially used for researching optimal incentive parameters for Liquidations 2.0.

Blockchain at Berkeley 52 Nov 21, 2022
Continuous Conditional Random Field Convolution for Point Cloud Segmentation

CRFConv This repository is the implementation of "Continuous Conditional Random Field Convolution for Point Cloud Segmentation" 1. Setup 1) Building c

Fei Yang 8 Dec 08, 2022
DIVeR: Deterministic Integration for Volume Rendering

DIVeR: Deterministic Integration for Volume Rendering This repo contains the training and evaluation code for DIVeR. Setup python 3.8 pytorch 1.9.0 py

64 Dec 27, 2022
Similarity-based Gray-box Adversarial Attack Against Deep Face Recognition

Similarity-based Gray-box Adversarial Attack Against Deep Face Recognition Introduction Run attack: SGADV.py Objective function: foolbox/attacks/gradi

1 Jul 18, 2022
Curvlearn, a Tensorflow based non-Euclidean deep learning framework.

English | 简体中文 Why Non-Euclidean Geometry Considering these simple graph structures shown below. Nodes with same color has 2-hop distance whereas 1-ho

Alibaba 123 Dec 12, 2022
When are Iterative GPs Numerically Accurate?

When are Iterative GPs Numerically Accurate? This is a code repository for the paper "When are Iterative GPs Numerically Accurate?" by Wesley Maddox,

Wesley Maddox 1 Jan 06, 2022
CLASP - Contrastive Language-Aminoacid Sequence Pretraining

CLASP - Contrastive Language-Aminoacid Sequence Pretraining Repository for creating models pretrained on language and aminoacid sequences similar to C

Michael Pieler 133 Dec 29, 2022
Official PyTorch implementation for paper Context Matters: Graph-based Self-supervised Representation Learning for Medical Images

Context Matters: Graph-based Self-supervised Representation Learning for Medical Images Official PyTorch implementation for paper Context Matters: Gra

49 Nov 23, 2022
Pytorch implementation of "Get To The Point: Summarization with Pointer-Generator Networks"

About this repository This repo contains an Pytorch implementation for the ACL 2017 paper Get To The Point: Summarization with Pointer-Generator Netwo

wxDai 7 Oct 14, 2022
SegTransVAE: Hybrid CNN - Transformer with Regularization for medical image segmentation

SegTransVAE: Hybrid CNN - Transformer with Regularization for medical image segmentation This repo is the official implementation for SegTransVAE. Seg

Nguyen Truong Hai 4 Aug 04, 2022
The devkit of the nuPlan dataset.

The devkit of the nuPlan dataset.

Motional 264 Jan 03, 2023
[NeurIPS2021] Exploring Architectural Ingredients of Adversarially Robust Deep Neural Networks

Exploring Architectural Ingredients of Adversarially Robust Deep Neural Networks Code for NeurIPS 2021 Paper "Exploring Architectural Ingredients of A

Hanxun Huang 26 Dec 01, 2022
Code repository for "Reducing Underflow in Mixed Precision Training by Gradient Scaling" presented at IJCAI '20

Reducing Underflow in Mixed Precision Training by Gradient Scaling This project implements the gradient scaling method to improve the performance of m

Ruizhe Zhao 5 Apr 14, 2022
GuideDog is an AI/ML-based mobile app designed to assist the lives of the visually impaired, 100% voice-controlled

Guidedog Authors: Kyuhee Jo, Steven Gunarso, Jacky Wang, Raghav Sharma GuideDog is an AI/ML-based mobile app designed to assist the lives of the visua

Kyuhee Jo 5 Nov 24, 2021
TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.

TensorFlowOnSpark TensorFlowOnSpark brings scalable deep learning to Apache Hadoop and Apache Spark clusters. By combining salient features from the T

Yahoo 3.8k Jan 04, 2023
An algorithm study of the 6th iOS 10 set of Boost Camp Web Mobile

알고리즘 스터디 🔥 부스트캠프 웹모바일 6기 iOS 10조의 알고리즘 스터디 입니다. 개인적인 사정 등으로 S034, S055만 참가하였습니다. 스터디 목적 상진: 코테 합격 + 부캠끝나고 아침에 일어나기 위해 필요한 사이클 기완: 꾸준하게 자리에 앉아 공부하기 +

2 Jan 11, 2022
Training DALL-E with volunteers from all over the Internet using hivemind and dalle-pytorch (NeurIPS 2021 demo)

Training DALL-E with volunteers from all over the Internet This repository is a part of the NeurIPS 2021 demonstration "Training Transformers Together

<a href=[email protected]"> 19 Dec 13, 2022
(CVPR 2022 - oral) Multi-View Depth Estimation by Fusing Single-View Depth Probability with Multi-View Geometry

Multi-View Depth Estimation by Fusing Single-View Depth Probability with Multi-View Geometry Official implementation of the paper Multi-View Depth Est

Bae, Gwangbin 138 Dec 28, 2022
Official Implementation of HRDA: Context-Aware High-Resolution Domain-Adaptive Semantic Segmentation

HRDA: Context-Aware High-Resolution Domain-Adaptive Semantic Segmentation by Lukas Hoyer, Dengxin Dai, and Luc Van Gool [Arxiv] [Paper] Overview Unsup

Lukas Hoyer 149 Dec 28, 2022
Books, Presentations, Workshops, Notebook Labs, and Model Zoo for Software Engineers and Data Scientists wanting to learn the TF.Keras Machine Learning framework

Books, Presentations, Workshops, Notebook Labs, and Model Zoo for Software Engineers and Data Scientists wanting to learn the TF.Keras Machine Learning framework

Google Cloud Platform 792 Dec 28, 2022