Python Library for learning (Structure and Parameter) and inference (Statistical and Causal) in Bayesian Networks.

Overview

pgmpy

pgmpy is a Python library for working with Probabilistic Graphical Models.

Documentation and the list of supported algorithms are on our official site: http://pgmpy.org/
Examples on using pgmpy: https://github.com/pgmpy/pgmpy/tree/dev/examples
Basic tutorial on Probabilistic Graphical models using pgmpy: https://github.com/pgmpy/pgmpy_notebook

Our mailing list is at https://groups.google.com/forum/#!forum/pgmpy .

Our community chat is on Gitter: https://gitter.im/pgmpy/pgmpy

Dependencies

pgmpy has the following required dependencies:

  • python 3.6 or higher
  • networkX
  • scipy
  • numpy
  • pytorch

Some functionality also requires:

  • tqdm
  • pandas
  • pyparsing
  • statsmodels
  • joblib
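A quick way to see which of these packages are available in the current environment is to ask Python's import machinery. The sketch below uses only the standard library; the helper name `missing_modules` and the two lists are illustrative assumptions, not part of pgmpy.

```python
from importlib.util import find_spec

# Import names for the packages listed above. The import name can differ
# from the package name (pytorch is imported as "torch").
REQUIRED = ["networkx", "scipy", "numpy", "torch"]
OPTIONAL = ["tqdm", "pandas", "pyparsing", "statsmodels", "joblib"]


def missing_modules(names):
    """Return the subset of `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    print("missing required:", missing_modules(REQUIRED))
    print("missing optional:", missing_modules(OPTIONAL))
```

An empty list means every package in that group can be imported.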

Installation

pgmpy is available on both PyPI and Anaconda. To install through Anaconda, use:

$ conda install -c ankurankan pgmpy

For installing through pip:

$ pip install -r requirements.txt  # only if you want to run unittests
$ pip install pgmpy

To install pgmpy from the source code:

$ git clone https://github.com/pgmpy/pgmpy 
$ cd pgmpy/
$ pip install -r requirements.txt
$ python setup.py install

If you face any problems during installation, let us know via issues, mail, or our Gitter channel.

Development

Code

Our latest codebase is available on the dev branch of the repository.

Contributing

Issues can be reported at our issues section.

Before opening a pull request, please have a look at our contributing guide.

The contributing guide contains some points that will make our lives easier when reviewing and merging your PR.

If you face any problems with a pull request, feel free to ask about them on the mailing list or Gitter.

If you want to implement any new features, please have a discussion about it on the issue tracker or the mailing list before starting to work on it.

Testing

After installation, you can run the tests from the pgmpy source directory (you will need to have the pytest package installed):

$ pytest -v

To see the coverage of the existing code, use the following command:

$ pytest --cov-report html --cov=pgmpy

Documentation and usage

The documentation is hosted at: http://pgmpy.org/

We use sphinx to build the documentation. To build the documentation on your local system use:

$ cd /path/to/pgmpy/docs
$ make html

The generated docs will be in _build/html

Examples

We have a few example jupyter notebooks here: https://github.com/pgmpy/pgmpy/tree/dev/examples
For more detailed jupyter notebooks and basic tutorials on Graphical Models, check: https://github.com/pgmpy/pgmpy_notebook/

Citing

Please use the following bibtex for citing pgmpy in your research:

@inproceedings{ankan2015pgmpy,
  title={pgmpy: Probabilistic graphical models using python},
  author={Ankan, Ankur and Panda, Abinash},
  booktitle={Proceedings of the 14th Python in Science Conference (SCIPY 2015)},
  year={2015},
  organization={Citeseer}
}

License

pgmpy is released under the MIT License. You can read the license text in the repository.

Comments
  • Adds base class for continuous node representation

    This PR deals with the basic continuous node representation feature. It will comprise a base class for Continuous node representation along with various methods to discretize the continuous variables into discrete factors. This involves the first three weeks of my GSoC project.

    opened by yashu-seth 102
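One simple way to turn a continuous variable into a discrete factor is to give each bin the probability mass that the distribution places on it. The sketch below does this for a normal distribution using only the standard library. The function names and the choice of equal-width bins are assumptions made for illustration and are not taken from the PR.

```python
import math


def normal_cdf(x, mean=0.0, std=1.0):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def discretize(cdf, low, high, n_bins):
    """Split [low, high] into equal-width bins and return each bin's mass.

    The masses are renormalized so they sum to 1, because the tails
    outside [low, high] are dropped.
    """
    width = (high - low) / n_bins
    edges = [low + i * width for i in range(n_bins + 1)]
    masses = [cdf(b) - cdf(a) for a, b in zip(edges, edges[1:])]
    total = sum(masses)
    return [m / total for m in masses]


# Six bins of width 1 covering three standard deviations on each side.
probs = discretize(normal_cdf, -3.0, 3.0, 6)
```

The resulting list can serve as the value table of a six-state discrete variable; finer bins trade a larger table for a closer approximation.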
  • Hamiltonian Monte Carlo

    This PR deals with implementing HMC with dual averaging. The implementation is still open for discussion; if you find anything ambiguous, please comment on the relevant line.

    opened by khalibartan 74
  • ContinousFactor and Joint Gaussian Representation

    This PR deals with

    • the creation of a base class ContinuousFactor for multivariate representations.
    • the creation of the class JointGaussainDistribution - a model to represent Gaussian random variables.
    opened by yashu-seth 55
  • Added BIF.py into readwrite

    Not all functions are implemented yet; only get_variable, get_states and get_property are.

    Creating this pull request for easy review. I have tested these methods on munin2.bif and dog-problem.bif and they work fine. The functions follow BIF v0.15. Without importing pgmpy and numpy, a run took 0.06s on average, including printing the complete variable_states for munin2.bif (1003 nodes). See issue #506.

    opened by khalibartan 45
  • updates in check_model method

    @ankurankan I have removed cardinalities from the model attributes, but it seems it is being used in other places as well. Should the cardinalities be computed every single time they are required? Is there a particular problem with having them as an attribute?

    opened by yashu-seth 32
  • Improving Variable Elimination (VE)

    Here, we finish the implementation of VE by adding a few missing steps: computing good elimination orderings (with 4 heuristics: min neighbors, min fill, min weight, and weighted min fill) and safely removing irrelevant variables from the model (barren nodes and nodes made independent by evidence). To make these improvements, a few new methods were necessary, and some existing members had to be modified. In our preliminary experiments, queries that took up to 30 minutes using VE now take less than 2 minutes. Please help us test this new code for robustness, and suggest better ways of coding the algorithms. Thanks.

    opened by jhonatanoliveira 27
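To make the min fill heuristic concrete, here is a minimal pure-Python sketch of a greedy elimination ordering. It assumes the model has already been turned into an undirected graph, given as a dict that maps each node to the set of its neighbors. The function names are hypothetical and this is not the code from the PR.

```python
from itertools import combinations


def fill_in_edges(graph, node):
    """Edges that must be added among `node`'s neighbors if it is eliminated."""
    return [
        (u, v)
        for u, v in combinations(sorted(graph[node]), 2)
        if v not in graph[u]
    ]


def min_fill_order(graph):
    """Greedy elimination ordering for an undirected graph.

    At each step the node whose elimination adds the fewest fill-in edges
    is removed. Ties are broken by node name to keep the result
    deterministic.
    """
    graph = {node: set(nbrs) for node, nbrs in graph.items()}  # work on a copy
    order = []
    while graph:
        node = min(sorted(graph), key=lambda n: len(fill_in_edges(graph, n)))
        # Connect the neighbors of the eliminated node to each other.
        for u, v in fill_in_edges(graph, node):
            graph[u].add(v)
            graph[v].add(u)
        # Remove the node and all references to it.
        for nbr in graph.pop(node):
            graph[nbr].discard(node)
        order.append(node)
    return order


# A 4-cycle A-B-C-D-A with a pendant node E attached to A.
example = {
    "A": {"B", "D", "E"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"A", "C"},
    "E": {"A"},
}
order = min_fill_order(example)
```

Here E is eliminated first because removing it adds no edges. The other heuristics named above differ only in the key function passed to `min`.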
  • Replaced recursive call with `while` loop.

    • Replaces the recursive version of fun with an iterative one, which is much easier to read in my opinion. This should be slightly faster as well, because recursive calls are expensive in Python (each one results in a new stack frame), and Python has a default limit of 1000 for recursive calls (though hitting it is highly unlikely in our case).
    • I am not sure why the library is using Python 2 style super() calls, considering that the dependencies include Python 3.3. In Python 3, thanks to the cell variable __class__, we can simply use super().method_name(...) (PEP 3135 -- New Super).
    opened by ashwch 25
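Both points in this comment can be illustrated with a small standalone sketch. The classes and functions below are hypothetical stand-ins, not the pgmpy code that the PR touched.

```python
import sys


class Node:
    def __init__(self, parent=None):
        self.parent = parent


def root_recursive(node):
    """Follow parent links recursively; each call adds a stack frame."""
    if node.parent is None:
        return node
    return root_recursive(node.parent)


def root_iterative(node):
    """Same result with a `while` loop and constant stack depth."""
    while node.parent is not None:
        node = node.parent
    return node


class Base:
    def describe(self):
        return "base"


class Child(Base):
    def describe(self):
        # Python 3 form (PEP 3135). The Python 2 form would be
        # super(Child, self).describe().
        return "child of " + super().describe()


# Build a parent chain deeper than the current recursion limit.
top = Node()
leaf = top
for _ in range(sys.getrecursionlimit() + 100):
    leaf = Node(parent=leaf)
```

On this chain `root_iterative(leaf)` returns `top`, while `root_recursive(leaf)` raises `RecursionError`.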
  • Added model.predict_probability

    Added a new method that gives the probabilities of missing variables for the given prediction data (#794).

            B_0         B_1
        80  0.439178    0.560822
        81  0.581970    0.418030
        82  0.488275    0.511725
        83  0.581970    0.418030
        84  0.510794    0.489206
        85  0.439178    0.560822
        86  0.439178    0.560822
        87  0.417124    0.582876
        88  0.407978    0.592022
        89  0.429905    0.570095
        90  0.581970    0.418030
        91  0.407978    0.592022
        92  0.429905    0.570095
        93  0.429905    0.570095
        94  0.439178    0.560822
        95  0.407978    0.592022
        96  0.559904    0.440096
        97  0.417124    0.582876
        98  0.488275    0.511725
        99  0.407978    0.592022
    

    Also added a new error test for predict, which increases the coverage.

    opened by raghavg7796 24
  • Strange Behavior HillClimbSearch

    Subject of the issue

    I want to reproduce the example here

    Your environment

    • pgmpy version: 0.1.12
    • Python version: 3.6.9
    • Operating System: Ubuntu 18.04.5 LTS

    Steps to reproduce

    import pandas as pd
    import numpy as np
    from pgmpy.estimators import HillClimbSearch, BicScore
    data = pd.DataFrame(np.random.randint(0, 5, size=(5000, 9)), columns=list('ABCDEFGHI'))
    data['J'] = data['A'] * data['B']
    est = HillClimbSearch(data, scoring_method=BicScore(data))
    best_model = est.estimate()
    best_model.edges()
    

    Expected behaviour

    [('B', 'J'), ('A', 'J')]

    Actual behaviour

    [('A', 'B'), ('J', 'A'), ('J', 'B')]

    opened by ivanDonadello 23
  • Hamiltonian Monte Carlo & Hamiltonian Monte Carlo with dual averaging

    @ankurankan I have sent this PR to aid our discussion. I was experimenting with how to handle gradients (removing the gradient argument). I tested two ways:

    • First, I tried handling the grad_log_pdf argument in place, depending on how the user passed it: if None was passed, I created a lambda function to call model.get_gradient_log_pdf; otherwise I created a lambda function to use the custom class. But this got messy, as I had to handle the parameter in two places, once in the sampling class and once in the BaseSimulateHamiltonianDynamics class.
    • Second (this PR implements it): handle everything in model.get_gradient_log_pdf. This code is less messy, because every call is made to model.get_gradient_log_pdf and the method handles the rest internally, so there is no need to make matching changes in different places.

    How do you suggest I handle the gradients? You can look at the last commit to see specifically the changes I made: https://github.com/pgmpy/pgmpy/pull/702/commits/748eb1fe13488bb8f0cf27a7064a67384ec3315e

    After the discussion I'll close one of the PRs.

    opened by khalibartan 21
  • Efficient factor product

    A factor product following "Probabilistic Graphical Models" (Koller 09) on page 359, Algorithm 10.A.1 - Efficient implementation of a factor product operation. Koller's algorithm was modified to fit the configuration used in pgmpy. For example, in pgmpy the configurations of Factor are supposed to be like (0,0,0) (0,0,1) (0,1,0) (1,0,0) and so on, instead of (0,0,0) (1,0,0) (0,1,0) (0,0,1) as expected for Koller's algorithm.

    Koller's implementation is around 98% faster than the current one in pgmpy. This benchmark was done by using a simple python script as follows:

    from pgmpy.factors import Factor
    from pgmpy.factors import factor_product
    from time import time
    
    phi = Factor(['x1', 'x2'], [2, 2], range(4))
    phi1 = Factor(['x3', 'x4'], [2, 2], range(4))
    t0 = time()
    prod = factor_product(phi, phi1)
    t1 = time()
    print(t1-t0)
    

    After running each implementation 6 times, here are the results:

    [Image: Comparison]

    Unfortunately, we don't know how to use JobLib. But we leave this TODO with the hope that using parallel computation can improve this implementation even further.

    opened by jhonatanoliveira 21
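The role of the configuration ordering is easier to see in code. The sketch below multiplies two table factors stored as flat lists, with the last variable changing fastest, which is how the comment describes pgmpy's layout. It uses a plain stride-based index lookup and is not a transcription of Algorithm 10.A.1. The names `strides` and `table_product` are hypothetical.

```python
from itertools import product


def strides(cards):
    """Stride of each variable when the LAST variable changes fastest."""
    result, step = [], 1
    for card in reversed(cards):
        result.append(step)
        step *= card
    return result[::-1]


def table_product(vars1, cards1, vals1, vars2, cards2, vals2):
    """Multiply two table factors stored as flat lists.

    Returns (variables, cardinalities, values) of the product. The
    variables of the first factor come first, followed by the variables
    that appear only in the second factor.
    """
    card = dict(zip(vars1, cards1))
    card.update(zip(vars2, cards2))
    out_vars = list(vars1) + [v for v in vars2 if v not in vars1]
    out_cards = [card[v] for v in out_vars]
    s1 = dict(zip(vars1, strides(cards1)))
    s2 = dict(zip(vars2, strides(cards2)))

    out_vals = []
    # itertools.product also iterates with the last position fastest.
    for assignment in product(*(range(c) for c in out_cards)):
        state = dict(zip(out_vars, assignment))
        i = sum(state[v] * s1[v] for v in vars1)  # flat index into vals1
        j = sum(state[v] * s2[v] for v in vars2)  # flat index into vals2
        out_vals.append(vals1[i] * vals2[j])
    return out_vars, out_cards, out_vals


# Two factors that share the variable "b".
out_vars, out_cards, out_vals = table_product(
    ["a", "b"], [2, 2], [1, 2, 3, 4],
    ["b", "c"], [2, 2], [5, 6, 7, 8],
)
```

Switching to a first-variable-fastest layout would only change `strides` and the iteration order, which is the kind of adaptation the comment describes.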
  • The DAG is not a DAG!

    Subject of the issue

    I tried to estimate the DAG structure with a tree search but I have repeated nodes with various connections!

    Your environment

    • pgmpy version: 0.1.19
    • Python version: 3.6.5, 3.9, 3.10.0
    • Operating System: Windows and Linux

    Steps to reproduce

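    from pgmpy.estimators import TreeSearch

    # `values` is the reporter's DataFrame; it is not shown in the issue.
    est = TreeSearch(values, root_node='B')
    model = est.estimate(estimator_type='chow-liu')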
    

    Expected behaviour

    The edges should be unique and directed, not duplicated.

    Actual behaviour

    Having duplicated nodes with various edges

    opened by samanemami 1
  • Pandas 1.5 issue for chisquare test.

    Subject of the issue

    Running chi_square fails locally on dataframes from pandas 1.5. Worked again when using pandas 1.4.

    This was found by one of my students on a long-standing coding assignment. The student suspects that the issue is in how pgmpy's power_divergence method iterates through a pandas groupby: it assumes that a tuple is returned when doing so, but pandas 1.5 changed the output to be of length 1, so this line fails.

    Your environment

    The student got the error locally, using pandas>=1.5 and pgmpy==0.1.21. I'll ping them to update this issue with their local Python and OS.

    Steps to reproduce

    test_result = chi_square(X=X, Y=Y, Z=Z, data=data, boolean=True, significance_level=significance)

    Expected behaviour

    Get a Boolean output.

    Actual behaviour

    Error.

    opened by robertness 2
  • DBN: CPD associated with (B, 1) doesn't have proper parents associated with it.

    Subject of the issue

    A Dynamic Bayesian Network is built, but this error is always displayed: CPD associated with (B, 1) doesn't have proper parents associated with it.

    Your environment

    • pgmpy version

    Steps to reproduce

    from pgmpy.factors.discrete import TabularCPD
    from pgmpy.models import DynamicBayesianNetwork as DBN
    from pgmpy.inference import DBNInference

    dbnet = DBN()
    dbnet.add_edges_from(
        [
            (('A', 0), ('B', 0)),
            (('A', 0), ('C', 0)),
            (('C', 0), ('D', 0)),
            (('B', 0), ('B', 1)),
            (('B', 0), ('D', 1)),
        ]
    )
    a_cpds = TabularCPD(('A', 0), 2, [[0.7], [0.3]])
    b_start_cpds = TabularCPD(
        ('B', 0), 2, [[0.3, 0.6], [0.7, 0.4]],
        evidence=[('A', 0)], evidence_card=[2]
    )
    b_trans_cpds = TabularCPD(
        ('B', 1), 2, [[0.1, 0.3, 0.8, 0.6], [0.9, 0.7, 0.2, 0.4]],
        evidence=[('A', 0), ('B', 0)], evidence_card=[2, 2]
    )
    c_cpds = TabularCPD(
        ('C', 0), 2, [[0.3, 0.1], [0.7, 0.9]],
        evidence=[('A', 0)], evidence_card=[2]
    )
    d_start_cpds = TabularCPD(
        ('D', 0), 2, [[0.4, 0.2], [0.6, 0.8]],
        evidence=[('C', 0)], evidence_card=[2]
    )
    d_trans_cpds = TabularCPD(
        ('D', 1), 2, [[0.3, 0.4, 0.8, 0.9], [0.7, 0.6, 0.2, 0.1]],
        evidence=[('B', 0), ('C', 0)], evidence_card=[2, 2]
    )
    dbnet.add_cpds(a_cpds, b_start_cpds, b_trans_cpds, c_cpds, d_start_cpds, d_trans_cpds)
    dbnet.initialize_initial_state()
    dbn_inf = DBNInference(dbnet)
    temp = dbn_inf.query([('D', 1)], {('D', 0): 0, ('D', 2): 1})['D', 1].values
    print(temp)

    Expected behaviour

    When the evidence for D in the first time slice is true and the evidence in the third time slice is false, the probability for the second time slice is computed by inference. The expected final result is D2, True: 0.7365902, False: 0.2634098

    Actual behaviour

    CPD associated with (B, 1) doesn't have proper parents associated with it. Is this a problem with how the network model is built? For the specific network diagram, see [https://www.bilibili.com/video/BV1W3411M7cp/?spm_id_from=333.337.search-card.all.click&vd_source=1a42ddbfb403eda9de41b20ccdca8523]

    opened by tomorrown 1
  • GibbsSampling sometimes fails because of issue with DiscreteFactor product

    Subject of the issue

    If I try to run GibbsSampling for a dataset, it sometimes succeeds and sometimes fails. This seems to be because DiscreteFactor uses set() in product(), which was introduced in #1548 in July by @ankurankan. The order of set() changes between executions, which sometimes causes phi.variable to differ from phi.variables[0].

    I think this can be solved by replacing

        new_variables = list(set(phi.variables).union(phi1.variables))
    

    by something like

        new_variables = phi.variables + [var for var in phi1.variables if var not in phi.variables]
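The difference between the two lines above can be shown without pgmpy. Iteration order over a set of strings can change from one interpreter run to the next, because string hashing is randomized per process unless PYTHONHASHSEED is fixed. A list-based union keeps the first-seen order on every run. The helper name `ordered_union` is hypothetical.

```python
def ordered_union(first, second):
    """Union of two lists that keeps first-seen order and drops duplicates."""
    seen = set()
    merged = []
    for var in list(first) + list(second):
        if var not in seen:
            seen.add(var)
            merged.append(var)
    return merged


# The conditioned variable stays in front on every run.
new_variables = ordered_union(["Sprinkler", "Cloudy"], ["Cloudy", "Rain"])

# An equivalent one-liner, since dicts preserve insertion order in Python 3.7+.
same_result = list(dict.fromkeys(["Sprinkler", "Cloudy"] + ["Cloudy", "Rain"]))
```

With `list(set(...))` the first element of the result is not guaranteed, which matches the intermittent failure described in this issue.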
    

    Your environment

    • pgmpy version : 0.1.20 and also dev branch (as of November 2022)
    • Python version : 3.10.8
    • Operating System : MacOS

    Steps to reproduce

    The following unit test sometimes succeeds completely, and sometimes fails completely.

    import unittest
    
    from pgmpy.factors import factor_product
    from pgmpy.factors.discrete import TabularCPD
    from pgmpy.models import BayesianNetwork
    from pgmpy.sampling import GibbsSampling
    
    
    class TestGibbsSamplingIssue(unittest.TestCase):
        def setUp(self):
            self.cpt_cloudy = TabularCPD(variable='Cloudy', variable_card=2, values=[[0.5], [0.5]])
            self.cpt_sprinkler = TabularCPD(variable='Sprinkler', variable_card=2,
                                            values=[[0.5, 0.9], [0.5, 0.1]],
                                            evidence=['Cloudy'], evidence_card=[2])
            self.cpt_rain = TabularCPD(variable='Rain', variable_card=2,
                                       values=[[0.8, 0.2], [0.2, 0.8]],
                                       evidence=['Cloudy'], evidence_card=[2])
            self.cpt_wet_grass = TabularCPD(variable='Wet_Grass', variable_card=2,
                                            values=[[1, 0.1, 0.1, 0.01],
                                                    [0, 0.9, 0.9, 0.99]],
                                            evidence=['Sprinkler', 'Rain'],
                                            evidence_card=[2, 2])
    
            self.dag = BayesianNetwork(
                [('Cloudy', 'Sprinkler'), ('Cloudy', 'Rain'), ('Sprinkler', 'Wet_Grass'), ('Rain', 'Wet_Grass')]
            )
            self.dag.add_cpds(self.cpt_cloudy, self.cpt_sprinkler, self.cpt_rain, self.cpt_wet_grass)
    
        def test_sampling_gibbs(self):
            # Initial issue
            GibbsSampling(self.dag).sample()
    
        def test_factor_product(self):
            # Trying to get to the root cause
            factor_product(self.cpt_cloudy, self.cpt_sprinkler, self.cpt_rain)
    
        def test_product(self):
            self.cpt_cloudy * self.cpt_sprinkler * self.cpt_rain
    
        def test_mul(self):
            prod = self.cpt_cloudy * self.cpt_sprinkler
            # the order of variables gets mixed up, which I believe should not happen
            assert prod.variable == prod.variables[0]
    
    

    Expected behaviour

    It should never throw an error.

    Actual behaviour

    Traceback (most recent call last):
    
      File "./pgmpy/pgmpy/factors/discrete/CPD.py", line 299, in copy
        return TabularCPD(
      File "./pgmpy/pgmpy/factors/discrete/CPD.py", line 142, in __init__
        super(TabularCPD, self).__init__(
      File "./pgmpy/pgmpy/factors/discrete/DiscreteFactor.py", line 99, in __init__
        raise ValueError("Variable names cannot be same")
    ValueError: Variable names cannot be same
    
    Bug 
    opened by oliver3 3
  • Enhancements for Causal Inference

    List of upcoming enhancements to the Causal Inference class:

    • [ ] estimate_ate and query methods should work with frontdoor adjustment sets.
    • [ ] Remove averaging over different adjustment sets for computing ATE.
    • [ ] Extend instrumental variables (IVs), conditional IVs, instrumental sets, and conditional instrumental sets from SEMs to work with DAGs as well.
    • [ ] Extend estimation methods to work with IVs and Conditional IVs.
    opened by ankurankan 0
  • Memory Leak in BaseEstimator Class

    Subject of the issue (with proposed fix)

    When creating multiple MaximumLikelihoodEstimator objects and calling estimate_cpd for all the model nodes in each, we seem to get a memory leak. This is what happens with repeated calls to BayesianNetwork.fit.

    The state_counts function of the BaseEstimator class has a decorator ([https://github.com/pgmpy/pgmpy/blob/dev/pgmpy/estimators/base.py#L66])

    @lru_cache(maxsize=2048)

    which seems to be the culprit. When I remove it, the memory leak goes away. I haven't bothered to dig deeper on the issue, but it's something that may be worth considering if repeatedly training models.
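One plausible mechanism, offered here as an assumption rather than a confirmed diagnosis, is that functools.lru_cache applied to a method includes `self` in the cache key, so the cache keeps every instance it has seen alive until the entry is evicted or the cache is cleared. The sketch below shows that behavior with a hypothetical stand-in class, not pgmpy's BaseEstimator.

```python
import gc
import weakref
from functools import lru_cache


class Estimator:
    """Stand-in for an object that holds a large dataset (hypothetical)."""

    def __init__(self, data):
        self.data = data

    @lru_cache(maxsize=2048)
    def state_counts(self, variable):
        # `self` is part of the cache key, so the cache keeps it alive.
        return sum(1 for row in self.data if row == variable)


est = Estimator(["a", "b", "a"])
est.state_counts("a")
ref = weakref.ref(est)

# Drop the only explicit reference; the cache entry still points at the object.
del est
gc.collect()
alive_before_clear = ref() is not None

# Clearing the cache releases the last reference.
Estimator.state_counts.cache_clear()
gc.collect()
alive_after_clear = ref() is not None
```

If this is what happens in the report, each discarded estimator and its data would stay in memory until 2048 newer entries push it out of the cache.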

    Your environment

    • Google Colab Notebook
    • PGMPY version: pgmpy-0.1.20
    • Python version: 3.7.15
    • Operating System: Ubuntu 18.04.6 LTS (Bionic Beaver)
    • uname -a output: Linux 5.10.133+

    Steps to reproduce

    Try the following code out. You should see the memory slowly get eaten up.

    import pandas as pd
    from pgmpy.models import BayesianNetwork
    from pgmpy.estimators import MaximumLikelihoodEstimator
    
    data = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
    
    cols = list(data.columns)
    edgelist = [(a,b) for a in cols[0:3] for b in cols[3:] if a != b]
    
    model = BayesianNetwork()
    model.add_nodes_from(cols)
    model.add_edges_from(edgelist)
    
    
    for i in range(1000):
      model.fit(data, MaximumLikelihoodEstimator)
    

    Expected behaviour

    Repeatedly calling the model.fit function with the MLE should not consume memory ad infinitum, the model gets re-trained and so old data should not persist.

    Actual behaviour

    During the 1000 iterations additional RAM is consumed and not freed EVER.

    Bug 
    opened by gregbolet 1
Releases: v0.1.21