Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs (CIKM 2020)

Overview

Version License repo size Arxiv build badge coverage badge


Karate Club is an unsupervised machine learning extension library for NetworkX.

Please look at the Documentation, relevant Paper, Promo Video, and External Resources.

Karate Club consists of state-of-the-art methods to do unsupervised learning on graph structured data. To put it simply it is a Swiss Army knife for small-scale graph mining research. First, it provides network embedding techniques at the node and graph level. Second, it includes a variety of overlapping and non-overlapping community detection methods. Implemented methods cover a wide range of network science (NetSci, Complenet), data mining (ICDM, CIKM, KDD), artificial intelligence (AAAI, IJCAI) and machine learning (NeurIPS, ICML, ICLR) conferences, workshops, and pieces from prominent journals.

The newly introduced graph classification datasets are available at SNAP, TUD Graph Kernel Datasets, and GraphLearning.io.


Citing

If you find Karate Club and the new datasets useful in your research, please consider citing the following paper:

@inproceedings{karateclub,
       title = {{Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs}},
       author = {Benedek Rozemberczki and Oliver Kiss and Rik Sarkar},
       year = {2020},
       pages = {3125–3132},
       booktitle = {Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM '20)},
       organization = {ACM},
}

A simple example

Karate Club makes the use of modern community detection techniques quite easy (see here for the accompanying tutorial). For example, this is all it takes to use on a Watts-Strogatz graph Ego-splitting:

import networkx as nx
from karateclub import EgoNetSplitter

g = nx.newman_watts_strogatz_graph(1000, 20, 0.05)

splitter = EgoNetSplitter(1.0)

splitter.fit(g)

print(splitter.get_memberships())

Models included

In detail, the following community detection and embedding methods were implemented.

Overlapping Community Detection

Non-Overlapping Community Detection

Neighbourhood-Based Node Level Embedding

Structural Node Level Embedding

Attributed Node Level Embedding

Meta Node Embedding

Graph Level Embedding

Head over to our documentation to find out more about installation and data handling, a full list of implemented methods, and datasets. For a quick start, check out our examples.

If you notice anything unexpected, please open an issue and let us know. If you are missing a specific method, feel free to open a feature request. We are motivated to constantly make Karate Club even better.


Installation

Karate Club can be installed with the following pip command.

$ pip install karateclub

As we create new releases frequently, upgrading the package casually might be beneficial.

$ pip install karateclub --upgrade

Running examples

As part of the documentation we provide a number of use cases to show how the clusterings and embeddings can be utilized for downstream learning. These can accessed here with detailed explanations.

Besides the case studies we provide synthetic examples for each model. These can be tried out by running the example scripts. In order to run one of the examples, the Graph2Vec snippet:

$ cd examples/whole_graph_embedding/
$ python graph2vec_example.py

Running tests

$ python setup.py test

License

Comments
  • GL2vec : RuntimeError: you must first build vocabulary before training the model

    GL2vec : RuntimeError: you must first build vocabulary before training the model

    Hello, First thanks for your work, it's just great.

    However, I have an error while trying to run GL2vec on my dataset, while it works perfectly with the example. Where is exactly this type of error coming from ?

    Thanks in advance

    opened by hug0prevoteau 14
  • How to build my own dataset?

    How to build my own dataset?

    I have to build graphs, and following that I have to generate graph embedding.

    I checked the documentation i.e. https://karateclub.readthedocs.io/.

    But I didn't understand how to build my own graphs.

    1. Can you please point out a sample code where you create dataset from scratch?
    2. I have already checked code here. But they all load pre-defined dataset.
    3. Can you show any code snippet where you create graph i.e. create nodes and add edges.
    4. How to set attributes (features) for the nodes and edges?

    Thanks in advance for your help.

    I am following the https://karateclub.readthedocs.io/en/latest/notes/installation.html.

    opened by smith-co 9
  • Using Feather-Graph with Node Attributes

    Using Feather-Graph with Node Attributes

    Hi @benedekrozemberczki,

    Thanks for creating and maintaining this awesome toolbox for graph and node level embedding techniques. I've been using Feather-Graph to embed non-attributed graphs and the results have been fantastic.

    Question: I'm working on a new problem where graphs contain nodes with attribute information and I wanted to see if it's possible (or makes sense) to extend Feather-Graph to incorporate node attribute information?

    Current thought process: I went through the source code and saw that Feather-Node can leverage an attribute matrix, while Feather-Graph uses the log-degree and clustering coefficient as node features. I felt like there could be an opportunity to plug the feature generation process of Feather-Node into Feather-Graph here, but couldn't determine if there would be any downsides to this approach?

    I went through your paper "Characteristic Functions on Graphs..." but wasn't able to come to a decision one way or the other. Hoping you can shed some light on it!

    Thanks, Scott

    opened by safreita1 8
  • About GL2vec

    About GL2vec

    Hello, thanks for the awesome work!!

    It seems that there are 2 mistakes in the implementation of GL2vec module.

    The first one is :

    in the code below, """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" def _create_line_graph(self, graph): r"""Getting the embedding of graphs. Arg types: * graph (NetworkX graph) - The graph transformed to be a line graph. Return types: * line_graph (NetworkX graph) - The line graph of the source graph. """ graph = nx.line_graph(graph) node_mapper = {node: i for i, node in enumerate(graph.nodes())} edges = [[node_mapper[edge[0]], node_mapper[edge[1]]] for edge in graph.edges()] line_graph = nx.from_edgelist(edges) return line_graph """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" when converting graph G to line graph LG, the method "_create_line_graph()" ignores the edge attribute of G. (It means that there will be no node arrtibutes in LG) so consequently, the method "WeisfeilerLehmanHashing" will not use the attribute information, and will always use the structural information (degree) instead.

    The second one is :

    The GL2vec module only returns the embedding of line graph. But in the original paper of GL2vec, they concatenate the embedding of graph and of line graph.
    Then named the framework "GL2vec", which means "Graph and Line graph to vector".

    Only use the embedding of line graph for downstream task may lead to worse performance.

    We noticed that when applying the embeddings to the graph classification task, (when graph both have node attribute and edge attribute) the performance (accuracy) are as follow: concat(G , LG) > G > LG

    Hope it helps :)

    opened by cheezy88 8
  • Classifying with graph2vec model

    Classifying with graph2vec model

    I'm able to obtain an embedding for a list of NetworkX graphs using graph2vec, and I was wondering if karateclub has a function to make classifications for graphs outside the training set? That is, given my embedding, I want to input a graph outside my original graph list (used in the model) and obtain a list of most similar graphs (something like a "most similar" function).

    opened by joseluisfalla 6
  • How to improve the performance of Graph2Vec model fit function ?

    How to improve the performance of Graph2Vec model fit function ?

    I tried to increase the performance of the Graph2Vec model by using increasing the worker parameter when initializing the model. But it seems that still, the model takes only 1 core to process the fit function.

    Is method I have used to assign the workers correct ? Is there another method to improve the performance ?

    model =  Graph2Vec(workers=28)
    graphs_list=create_graph_list(graph_df)
    model.fit(graphs_list)
    graph_x = model.get_embedding()
    
    opened by 1209973 6
  • Is consecutive numeric indexing necessary for Graph2Vec?

    Is consecutive numeric indexing necessary for Graph2Vec?

    Thanks for the awesome work, networkx is truly helpful when we are dealing with Graph data structure.

    I'm trying to get graph embedding using Graph2vec so that we could compare similarity among graphs. But I'm stuck in this assertion: assert numeric_indices == node_indices, "The node indexing is wrong."

    Say if we have two graphs, each node in the graph represents a word. We build a mapping so that we could replace text with number. For example, whenever the word "Library" occurs in any graph, we label it with the number "2". In this case, the indexes inside one graph might not be consecutive because the mapping is created from a number of graphs.

    So is it still necessary for enforce consecutive indexing in this case? Or I understand the usage of Graph2Vec wrong?

    opened by bdeng3 5
  • Graph Embeddings using node features and inductivity

    Graph Embeddings using node features and inductivity

    Hello,

    First of all thank you for this amazing library! I have a serie of small graphs where each node contains features and I am trying to learn graph-level embedding in an unsupervised manner. However, I couldn't find how to load node features in the graphs before feeding them to a graph embedding algorithm. Could you describe the input needed by the algorithms ?

    Also, is it possible to generate embedding with some sort of forward function once the models are trained (without retraining the model) ? I.e. does the library support inductivity ?

    Thank you!

    opened by TrovatelliT 5
  • graph2vec implementation and graphs with missing nodes

    graph2vec implementation and graphs with missing nodes

    Hi there,

    first of all, thanks a lot for developing this, it has potential to simplify in-silico experiments on biological networks and I am grateful for that!

    I have a question related to the graph2vec implementation. The requirement of the package for graph notation is that nodes have to be named with integers starting from 0 and have to be consecutive. I am working with a collection of 9.000 small networks and would like to embed all of them into an N-dimensional space. Now, all those networks consist of about 25.000 nodes but in some networks these nodes (here it's really genes) are missing (not all genes are supposed to be present in all networks).

    If I rename all my nodes from actual gene names to integers and know that some networks don't have all the genes, I will end up with some networks without consecutive node names, e.g. there will be (..), 20, 21, 24, 25, (...) in one network and perhaps (...), 20, 21, 22, 24, 25, (...) in another. That would violate the requirement of being consecutive.

    My question is: is the implementation aware that a node 25 is the same object between the different networks? Or is it not important and in reality the embedding only takes into account the structure only and I should 'rename' all my networks separately to keep the node naming consecutive?

    opened by kajocina 5
  • Multithreading for WL hasing function

    Multithreading for WL hasing function

    Hi!

    Maybe just another suggestion. In the embedding algorithms, the WeisfeilerLehmanHashing function in the fit function could be time-consuming and the WL hashing function for each graph is independent. Therefore, maybe using multhreading from python can speed them up and I modify the code for my application of graph2vec:

    ==================================

    def fit(self, graphs):
        """
        Fitting a Graph2Vec model.
    
        Arg types:
            * **graphs** *(List of NetworkX graphs)* - The graphs to be embedded.
        """
        pool = ThreadPool(8)
        args_generator = [(graph, self.wl_iterations, self.attributed) for graph in graphs]
        documents = pool.starmap(WeisfeilerLehmanHashing, args_generator)
        pool.close()
        pool.join()
        #documents = [WeisfeilerLehmanHashing(graph, self.wl_iterations, self.attributed) for graph in graphs]
        documents = [TaggedDocument(words=doc.get_graph_features(), tags=[str(i)]) for i, doc in enumerate(documents)]
    
        model = Doc2Vec(documents,
                        vector_size=self.dimensions,
                        window=0,
                        min_count=self.min_count,
                        dm=0,
                        sample=self.down_sampling,
                        workers=self.workers,
                        epochs=self.epochs,
                        alpha=self.learning_rate,
                        seed=self.seed)
    
        self._embedding = [model.docvecs[str(i)] for i, _ in enumerate(documents)]
    
    opened by zslwyuan 5
  • Update requirements

    Update requirements

    As it stands, setup.py has the following requirements which specify maximum versions:

    install_requires = [
        "numpy<1.23.0",
        "networkx<2.7",
        "decorator==4.4.2",
        "pandas<=1.3.5"
    ]
    

    Is there a reason for the maximum versions, such as expired deprecated features used by karateclub? In my personal research, and in using the included test suite via python3 ./setup.py test, I have not encountered issues in upgrading the packages.

    $ pip3 install --upgrade --user networkx numpy pandas decorator
    
    $ pip3 list | grep "networkx\|numpy\|decorator\|pandas"
    decorator              5.1.1
    networkx               2.8.8
    numpy                  1.23.5
    pandas                 1.5.2
    

    Running the tests with these updated package yields the following:

    $ cd karateclub/
    $ pytest
    ...
    47 passed, 2540 warnings in 210.58s (0:03:30) 
    

    Yes, there are lots of warnings. Many are DeprecationWarnings. The current requirements generate 855 warnings.

    $ cd karateclub/
    $ pip3 install --user .
    $ pytest
    ...
    47 passed, 855 warnings in 225.49s (0:03:45)
    

    I suppose the question is: even with additional instances of DeprecationWarning, can we bump up the maximum requirements for this package? Or would the community feel better addressing the deprecation issues before continuing?

    For context, my motivation is to keep this package current; I'm currently held back (not actually, but per the setup requirements) by this package's maximum requirements. Does anyone have any thoughts?

    opened by WhatTheFuzz 4
  • Parallel BigCLAM Gradient Computation

    Parallel BigCLAM Gradient Computation

    https://github.com/benedekrozemberczki/karateclub/blob/de27e87a92323326b63949eee76c02f8d282adc4/karateclub/community_detection/overlapping/bigclam.py#L111-L115

    I've noticed that the gradient calculation in BigCLAM could be easily parallelized. This should be embarrassingly parallel, doing batches no n_jobs gradient calculation. The only issue would be when some node in the batch is also the neighbor of another node in the same batch, since self._do_updates would have to wait for the entire batch to run and then update the duplicated node embeddings.

    This event may be rare if we consider that the graph is sparse and it is unlikely that such "collision" would happen, still, if it did happen, i believe it would not be a huge problem, as long as it does not happen in further iterations (which is even more unlikely)

    What do you guys think? I could implement this if you think it'd be safe to incur in the issue i've mentioned above.

    opened by AlanGanem 2
  • Randomness in Laplacian Eigenmaps Embeddings

    Randomness in Laplacian Eigenmaps Embeddings

    Hi! I'm using Laplacian Eigenmaps and noticed that the resulting embeddings are not always the same, even though I have explicitly set the seed:

    model = LaplacianEigenmaps(dimensions=3,seed=0)
    

    Running the same algorithm in the same python session for multiple times yields different embeddings each time. Here is a minimal reproducible example:

    import networkx as nx
    g_undirected = nx.newman_watts_strogatz_graph(1000, 20, 0.05, seed=1)
    
    from karateclub.node_embedding.neighbourhood import LaplacianEigenmaps
    import numpy as np
    
    for _ in range(5):
        model = LaplacianEigenmaps(dimensions=3,seed=0)
        model.fit(g_undirected)
        node_emb_le = model.get_embedding()
        print(np.sum(node_emb_le))
    

    It yields the following summed value of the embeddings for me:

    31.647046936812927
    -31.647046936812888
    31.64704693681287
    -31.690999529775908
    -31.581837545720354
    

    How can I control the randomness so that every time the resulting embeddings are exactly the same, even if I run the algorithm for arbitrary times in the same python session?

    opened by wendywangwwt 1
Releases(v_10304)
Owner
Benedek Rozemberczki
PhD candidate at The University of Edinburgh @cdt-data-science working on machine learning and data mining related to graph structured data.
Benedek Rozemberczki
CVPR 2021: "The Spatially-Correlative Loss for Various Image Translation Tasks"

Spatially-Correlative Loss arXiv | website We provide the Pytorch implementation of "The Spatially-Correlative Loss for Various Image Translation Task

Chuanxia Zheng 89 Jan 04, 2023
Image process framework based on plugin like imagej, it is esay to glue with scipy.ndimage, scikit-image, opencv, simpleitk, mayavi...and any libraries based on numpy

Introduction ImagePy is an open source image processing framework written in Python. Its UI interface, image data structure and table data structure a

ImagePy 1.2k Dec 29, 2022
BackgroundRemover lets you Remove Background from images and video with a simple command line interface

BackgroundRemover BackgroundRemover is a command line tool to remove background from video and image, made by nadermx to power https://BackgroundRemov

Johnathan Nader 1.7k Dec 30, 2022
Differential rendering based motion capture blender project.

TraceArmature Summary TraceArmature is currently a set of python scripts that allow for high fidelity motion capture through the use of AI pose estima

William Rodriguez 4 May 27, 2022
covid question answering datasets and fine tuned models

Covid-QA Fine tuned models for question answering on Covid-19 data. Hosted Inference This model has been contributed to huggingface.Click here to see

Abhijith Neil Abraham 19 Sep 09, 2021
This is a repository of our model for weakly-supervised video dense anticipation.

Introduction This is a repository of our model for weakly-supervised video dense anticipation. More results on GTEA, Epic-Kitchens etc. will come soon

2 Apr 09, 2022
Transfer-Learn is an open-source and well-documented library for Transfer Learning.

Transfer-Learn is an open-source and well-documented library for Transfer Learning. It is based on pure PyTorch with high performance and friendly API. Our code is pythonic, and the design is consist

THUML @ Tsinghua University 2.2k Jan 03, 2023
This is the latest version of the PULP SDK

PULP-SDK This is the latest version of the PULP SDK, which is under active development. The previous (now legacy) version, which is no longer supporte

78 Dec 07, 2022
Independent and minimal implementations of some reinforcement learning algorithms using PyTorch (including PPO, A3C, A2C, ...).

PyTorch RL Minimal Implementations There are implementations of some reinforcement learning algorithms, whose characteristics are as follow: Less pack

Gemini Light 4 Dec 31, 2022
Building blocks for uncertainty-aware cycle consistency presented at NeurIPS'21.

UncertaintyAwareCycleConsistency This repository provides the building blocks and the API for the work presented in the NeurIPS'21 paper Robustness vi

EML Tübingen 19 Dec 12, 2022
[CVPR 2022] Pytorch implementation of "Templates for 3D Object Pose Estimation Revisited: Generalization to New objects and Robustness to Occlusions" paper

template-pose Pytorch implementation of "Templates for 3D Object Pose Estimation Revisited: Generalization to New objects and Robustness to Occlusions

Van Nguyen Nguyen 92 Dec 28, 2022
Exploring Cross-Image Pixel Contrast for Semantic Segmentation

Exploring Cross-Image Pixel Contrast for Semantic Segmentation Exploring Cross-Image Pixel Contrast for Semantic Segmentation, Wenguan Wang, Tianfei Z

Tianfei Zhou 510 Jan 02, 2023
使用yolov5训练自己数据集(详细过程)并通过flask部署

使用yolov5训练自己的数据集(详细过程)并通过flask部署 依赖库 torch torchvision numpy opencv-python lxml tqdm flask pillow tensorboard matplotlib pycocotools Windows,请使用 pycoc

HB.com 19 Dec 28, 2022
Official Pytorch Implementation of: "Semantic Diversity Learning for Zero-Shot Multi-label Classification"(2021) paper

Semantic Diversity Learning for Zero-Shot Multi-label Classification Paper Official PyTorch Implementation Avi Ben-Cohen, Nadav Zamir, Emanuel Ben Bar

28 Aug 29, 2022
Self-labelling via simultaneous clustering and representation learning. (ICLR 2020)

Self-labelling via simultaneous clustering and representation learning 🆗 🆗 🎉 NEW models (20th August 2020): Added standard SeLa pretrained torchvis

Yuki M. Asano 469 Jan 02, 2023
[ICCV-2021] An Empirical Study of the Collapsing Problem in Semi-Supervised 2D Human Pose Estimation

An Empirical Study of the Collapsing Problem in Semi-Supervised 2D Human Pose Estimation (ICCV 2021) Introduction This is an official pytorch implemen

rongchangxie 42 Jan 04, 2023
Ranger - a synergistic optimizer using RAdam (Rectified Adam), Gradient Centralization and LookAhead in one codebase

Ranger-Deep-Learning-Optimizer Ranger - a synergistic optimizer combining RAdam (Rectified Adam) and LookAhead, and now GC (gradient centralization) i

Less Wright 1.1k Dec 21, 2022
中文语音识别系列,读者可以借助它快速训练属于自己的中文语音识别模型,或直接使用预训练模型测试效果。

MASR中文语音识别(pytorch版) 开箱即用 自行训练 使用与训练分离(增量训练) 识别率高 说明:因为每个人电脑机器不同,而且有些安装包安装起来比较麻烦,强烈建议直接用我编译好的docker环境跑 目前docker基础环境为ubuntu-cuda10.1-cudnn7-pytorch1.6.

发送小信号 180 Dec 17, 2022
Attempt at implementation of a simple GAN using Keras

Simple GAN This is my attempt to make a wrapper class for a GAN in keras which can be used to abstract the whole architecture process. Simple GAN Over

Deven96 7 May 23, 2019
190 Jan 03, 2023