Precision Medicine Knowledge Graph (PrimeKG)

Overview

PrimeKG


website GitHub Repo stars GitHub Repo forks License: MIT

Website | bioRxiv Paper | Harvard Dataverse

Precision Medicine Knowledge Graph (PrimeKG) presents a holistic view of diseases. PrimeKG integrates 20 high-quality biomedical resources to describe 17,080 diseases with 4,050,249 relationships representing ten major biological scales, considerably expanding previous efforts in disease-rooted knowledge graphs. We accompany PrimeKG’s graph structure with text descriptions of clinical guidelines for drugs and diseases to enable multimodal analyses.

Updates

Unique Features of PrimeKG

  • Diverse coverage of diseases: PrimeKG contains over 17,000 diseases including rare dieases. Disease nodes in PrimeKG are densely connected to other nodes in the graph and have been optimized for clinical relevance in downstream precision medicine tasks.
  • Heterogeneous knowledge graph: PrimeKG contains over 100,000 nodes distributed over various biological scales as depicted below. PrimeKG also contains over 4 million relationships between these nodes distributed over 29 types of edges.
  • Multimodal integration of clinical knowledge: Disease and drug nodes in PrimeKG are augmented with clinical descriptors that come from medical authorities such as Mayo Clinic, Orphanet, Drug Bank, and so forth.
  • Ready-to-use datasets: PrimeKG is minimally dependent on external packages. Our knowledge graph can be retrieved in a ready-to-use format from Harvard Dataverse.
  • Data functions: PrimeKG provides extensive data functions, including processors for primary resources and scripts to build an updated knowledge graph.

overview

PrimeKG-example

Environment setup

Using pip

To install the dependencies required to run the PrimeKG code, use pip:

pip install -r requirements.txt

Or use conda

conda env create --name PrimeKG --file=environments.yml

Building an updated PrimeKG

Downloading primary data resources

All persistent identifiers and weblinks to download the 20 primary data resources used to build PrimeKG are systematically provided in the Data Records section of our article. We have also mentioned the exact filenames that were downloaded from each resource for easy corroboration.

Curating primary data resources

We provide the scripts used to process all primary data resources and the names of the resulting output files generated by those scripts. We would be happy to share the intermediate processing datasets that were used to create PrimeKG on request.

Database Processing scripts Expected script output
Bgee bgee.py anatomy_gene.csv
Comparative Toxicogenomics Database ctd.py exposure_data.csv
DisGeNET - curated_gene_disease_associations.tsv
DrugBank drugbank_drug_drug.py drug_drug.csv
DrugBank parsexml_drugbank.ipynb, Parsed_feature.ipynb 12 drug feature files
DrugBank drugbank_drug_protein.py drug_protein.csv
Drug Central drugcentral_queries.txt drug_disease.csv
Drug Central drugcentral_feature.Rmd dc_features.csv
Entrez Gene ncbigene.py protein_go_associations.csv
Gene Ontology go.py go_terms_info.csv, go_terms_relations.csv
Human Phenotype Ontology hpo.py, hpo_obo_parser.py hp_terms.csv, hp_parents.csv, hp_references.csv
Human Phenotype Ontology hpoa.py disease_phenotype_pos.csv, disease_phenotype_neg.csv
MONDO mondo.py, mondo_obo_parser.py mondo_terms.csv, mondo_parents.csv, mondo_references.csv, mondo_subsets.csv, mondo_definitions.csv
Reactome reactome.py reactome_ncbi.csv, reactome_terms.csv, reactome_relations.csv
SIDER sider.py sider.csv
UBERON uberon.py uberon_terms.csv, uberon_rels.csv, uberon_is_a.csv
UMLS umls.py, map_umls_mondo.py umls_mondo.csv
UMLS umls.ipynb umls_def_disorder_2021.csv, umls_def_disease_2021.csv

Harmonizing datasets into PrimeKG

The code to harmonize datasets and construct PrimeKG is available at build_graph.ipynb. Simply run this jupyter notebook in order to construct the knowledge graph form the outputs of the processing files mentioned above. This jupyter notebook produces all three versions of PrimeKG, kg_raw.csv, kg_giant.csv, and the complete version kg.csv.

Feature extraction

The code required to engineer features can be found at engineer_features.ipynb and mapping_mayo.ipynb.

Cite Us

If you find PrimeKG useful, cite our work:

@article{chandak2022building,
  title={Building a knowledge graph to enable precision medicine},
  author={Chandak, Payal and Huang, Kexin and Zitnik, Marinka},
  journal={bioRxiv},
  doi={10.1101/2022.05.01.489928},
  URL={https://www.biorxiv.org/content/early/2022/05/01/2022.05.01.489928},
  year={2022}
}

Data Server

PrimeKG is hosted on Harvard Dataverse with the following persistent identifier https://doi.org/10.7910/DVN/IXA7BM. When Dataverse is under maintenance, PrimeKG datasets cannot be retrieved. That happens rarely; please check the status on the Dataverse website.

License

PrimeKG codebase is under MIT license. For individual dataset usage, please refer to the dataset license found in the website.

Owner
Machine Learning for Medicine and Science @ Harvard
Machine Learning for Medicine and Science @ Harvard
ADCS cert template modification and ACL enumeration

Purpose This tool is designed to aid an operator in modifying ADCS certificate templates so that a created vulnerable state can be leveraged for privi

Fortalice Solutions, LLC 78 Dec 12, 2022
Twitter Sentiment Analysis using #tag, words and username

Twitter Sentment Analysis Web App using #tag, words and username to fetch data finds Insides of data and Tells Sentiment of the perticular #tag, words or username.

Kumar Saksham 26 Dec 25, 2022
ReCoin - Restoring our environment and businesses in parallel

Shashank Ojha, Sabrina Button, Abdellah Ghassel, Joshua Gonzales "Reduce Reuse R

sabrina button 1 Mar 14, 2022
Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in Pytorch

Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in Pytorch

Phil Wang 5k Jan 02, 2023
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language This repository contains UA-GEC data and an accompanying Python lib

Grammarly 227 Jan 02, 2023
硕士期间自学的NLP子任务,供学习参考

NLP_Chinese_down_stream_task 自学的NLP子任务,供学习参考 任务1 :短文本分类 (1).数据集:THUCNews中文文本数据集(10分类) (2).模型:BERT+FC/LSTM,Pytorch实现 (3).使用方法: 预训练模型使用的是中文BERT-WWM, 下载地

12 May 31, 2022
🚀 RocketQA, dense retrieval for information retrieval and question answering, including both Chinese and English state-of-the-art models.

In recent years, the dense retrievers based on pre-trained language models have achieved remarkable progress. To facilitate more developers using cutt

475 Jan 04, 2023
Subtitle Workshop (subshop): tools to download and synchronize subtitles

SUBSHOP Tools to download, remove ads, and synchronize subtitles. SUBSHOP Purpose Limitations Required Web Credentials Installation, Configuration, an

Joe D 4 Feb 13, 2022
Winner system (DAMO-NLP) of SemEval 2022 MultiCoNER shared task over 10 out of 13 tracks.

KB-NER: a Knowledge-based System for Multilingual Complex Named Entity Recognition The code is for the winner system (DAMO-NLP) of SemEval 2022 MultiC

116 Dec 27, 2022
Python package for Turkish Language.

PyTurkce Python package for Turkish Language. Documentation: https://pyturkce.readthedocs.io. Installation pip install pyturkce Usage from pyturkce im

Mert Cobanov 14 Oct 09, 2022
COVID-19 Related NLP Papers

COVID-19 outbreak has become a global pandemic. NLP researchers are fighting the epidemic in their own way.

xcfeng 28 Oct 30, 2022
Rhasspy 673 Dec 28, 2022
Türkçe küfürlü içerikleri bulan bir yapay zeka kütüphanesi / An ML library for profanity detection in Turkish sentences

"Kötü söz sahibine aittir." -Anonim Nedir? sinkaf uygunsuz yorumların bulunmasını sağlayan bir python kütüphanesidir. Farkı nedir? Diğer algoritmalard

KaraGoz 4 Feb 18, 2022
Korean stereoypte detector with TUNiB-Electra and K-StereoSet

Korean Stereotype Detector Korean stereotype sentence classifier using K-StereoSet with TUNiB-Electra Web demo you can test this model easily in demo

Sae_Chan_Oh 11 Feb 18, 2022
This library is testing the ethics of language models by using natural adversarial texts.

prompt2slip This library is testing the ethics of language models by using natural adversarial texts. This tool allows for short and simple code and v

9 Dec 28, 2021
Malaya-Speech is a Speech-Toolkit library for bahasa Malaysia, powered by Deep Learning Tensorflow.

Malaya-Speech is a Speech-Toolkit library for bahasa Malaysia, powered by Deep Learning Tensorflow. Documentation Proper documentation is available at

HUSEIN ZOLKEPLI 151 Jan 05, 2023
:mag: Transformers at scale for question answering & neural search. Using NLP via a modular Retriever-Reader-Pipeline. Supporting DPR, Elasticsearch, HuggingFace's Modelhub...

Haystack is an end-to-end framework that enables you to build powerful and production-ready pipelines for different search use cases. Whether you want

deepset 6.4k Jan 09, 2023
InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective

InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective This is the official code base for our ICLR 2021 paper

AI Secure 71 Nov 25, 2022
Command Line Text-To-Speech using Google TTS

cli-tts Thanks to gTTS by @pndurette! This is an interactive command line text-to-speech tool using Google TTS. Just type text and the voice will be p

ReekyStive 3 Nov 11, 2022