DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.

Overview


DoWhy | An end-to-end library for causal inference

Amit Sharma, Emre Kiciman

Introducing DoWhy and the 4 steps of causal inference | Microsoft Research Blog | Video Tutorial | Arxiv Paper | Slides

Read the docs | Try it online! Binder

Case Studies using DoWhy: Hotel booking cancellations | Effect of customer loyalty programs | Optimizing article headlines | Effect of home visits on infant health (IHDP) | Causes of customer churn/attrition

https://raw.githubusercontent.com/microsoft/dowhy/master/docs/images/dowhy-schematic.png

As computing systems intervene more frequently and more actively in societally critical domains such as healthcare, education, and governance, it is critical to correctly predict and understand the causal effects of these interventions. Without an A/B test, conventional machine learning methods, built on pattern recognition and correlational analyses, are insufficient for decision-making.

Much like machine learning libraries have done for prediction, "DoWhy" is a Python library that aims to spark causal thinking and analysis. DoWhy provides a principled four-step interface for causal inference that focuses on explicitly modeling causal assumptions and validating them as much as possible. The key feature of DoWhy is its state-of-the-art refutation API that can automatically test causal assumptions for any estimation method, thus making inference more robust and accessible to non-experts. DoWhy supports estimation of the average causal effect for backdoor, frontdoor, instrumental variable and other identification methods, and estimation of the conditional effect (CATE) through an integration with the EconML library.

For a quick introduction to causal inference, check out amit-sharma/causal-inference-tutorial. We also gave a more comprehensive tutorial at the ACM Knowledge Discovery and Data Mining (KDD 2018) conference: causalinference.gitlab.io/kdd-tutorial. For an introduction to the four steps of causal inference and its implications for machine learning, you can access this video tutorial from Microsoft Research: DoWhy Webinar.

Documentation for DoWhy is available at microsoft.github.io/dowhy.

The need for causal inference

Predictive models uncover patterns that connect the inputs and outcome in observed data. To intervene, however, we need to estimate the effect of changing an input from its current value, for which no data exists. Such questions, involving estimating a counterfactual, are common in decision-making scenarios.

  • Will it work?
    • Does a proposed change to a system improve people's outcomes?
  • Why did it work?
    • What led to a change in a system's outcome?
  • What should we do?
    • What changes to a system are likely to improve outcomes for people?
  • What are the overall effects?
    • How does the system interact with human behavior?
    • What is the effect of a system's recommendations on people's activity?

Answering these questions requires causal reasoning. While many methods exist for causal inference, it is hard to compare their assumptions and the robustness of their results. DoWhy makes three contributions:

  1. Provides a principled way of modeling a given problem as a causal graph so that all assumptions are explicit.
  2. Provides a unified interface for many popular causal inference methods, combining the two major frameworks of graphical models and potential outcomes.
  3. Automatically tests for the validity of assumptions if possible and assesses the robustness of the estimate to violations.

To see DoWhy in action, check out how it can be applied to estimate the effect of a subscription or rewards program for customers [Rewards notebook], and how it can be used to implement and evaluate causal inference methods on benchmark datasets such as the Infant Health and Development Program (IHDP) dataset, the Infant Mortality (Twins) dataset, and the Lalonde Jobs dataset.

Installation

DoWhy supports Python 3.6+. To install, you can use pip or conda.

Latest Release

Install the latest release using pip.

pip install dowhy

Install the latest release using conda.

conda install -c conda-forge dowhy

If you face "Solving environment" problems with conda, then try conda update --all and then install dowhy. If that does not work, then use conda config --set channel_priority false and try to install again. If the problem persists, please add your issue here.

Development Version

If you prefer the latest dev version, clone this repository and run the following command from the top-most folder of the repository.

pip install -e .

Requirements

DoWhy requires the following packages:

  • numpy
  • scipy
  • scikit-learn
  • pandas
  • networkx (for analyzing causal graphs)
  • matplotlib (for general plotting)
  • sympy (for rendering symbolic expressions)

If you face any problems, try installing dependencies manually.

pip install -r requirements.txt

Optionally, if you wish to input graphs in the dot format, then install pydot (or pygraphviz).

For better-looking graphs, you can optionally install pygraphviz. To proceed, first install graphviz and then pygraphviz (on Ubuntu and Ubuntu WSL).

sudo apt install graphviz libgraphviz-dev graphviz-dev pkg-config
## from https://github.com/pygraphviz/pygraphviz/issues/71
pip install pygraphviz --install-option="--include-path=/usr/include/graphviz" \
--install-option="--library-path=/usr/lib/graphviz/"

Sample causal inference analysis in DoWhy

Most DoWhy analyses for causal inference take 4 lines to write, assuming a pandas dataframe df that contains the data:

from dowhy import CausalModel
import dowhy.datasets

# Load some sample data
data = dowhy.datasets.linear_dataset(
    beta=10,
    num_common_causes=5,
    num_instruments=2,
    num_samples=10000,
    treatment_is_binary=True)

DoWhy supports two formats for providing the causal graph: gml (preferred) and dot. After loading the data, we use the four main operations in DoWhy: model, identify, estimate and refute:

# I. Create a causal model from the data and given graph.
model = CausalModel(
    data=data["df"],
    treatment=data["treatment_name"],
    outcome=data["outcome_name"],
    graph=data["gml_graph"])

# II. Identify causal effect and return target estimands
identified_estimand = model.identify_effect()

# III. Estimate the target estimand using a statistical method.
estimate = model.estimate_effect(identified_estimand,
                                 method_name="backdoor.propensity_score_matching")

# IV. Refute the obtained estimate using multiple robustness checks.
refute_results = model.refute_estimate(identified_estimand, estimate,
                                       method_name="random_common_cause")

DoWhy stresses the interpretability of its output. At any point in the analysis, you can inspect the untested assumptions, identified estimands (if any) and the estimate (if any). Here's a sample output of the linear regression estimator.

https://raw.githubusercontent.com/microsoft/dowhy/master/docs/images/regression_output.png

For a full code example, check out the Getting Started with DoWhy notebook. You can also use Conditional Average Treatment Effect (CATE) estimation methods from other libraries such as EconML and CausalML, as shown in the Conditional Treatment Effects notebook. For more examples of using DoWhy, check out the Jupyter notebooks in docs/source/example_notebooks or try them online at Binder.

A high-level Pandas API

We've made an even simpler API for dowhy which is a light layer on top of the standard one. The goal is to make causal analysis feel much more like regular exploratory analysis. To use this API, simply import dowhy.api. This will magically add the causal namespace to your pandas.DataFrame objects. Then, you can use the namespace as follows.

import dowhy.api
import dowhy.datasets

data = dowhy.datasets.linear_dataset(beta=5,
    num_common_causes=1,
    num_instruments = 0,
    num_samples=1000,
    treatment_is_binary=True)

# data['df'] is just a regular pandas.DataFrame
data['df'].causal.do(x='v0', # name of treatment variable
                     variable_types={'v0': 'b', 'y': 'c', 'W0': 'c'},
                     outcome='y',
                     common_causes=['W0']).groupby('v0').mean().plot(y='y', kind='bar')

https://raw.githubusercontent.com/microsoft/dowhy/master/docs/images/do_barplot.png

For some methods, the variable_types field must be specified. It should be a dict where the keys are variable names and the values are 'b' for binary, 'o' for ordered discrete, 'u' for un-ordered discrete, 'd' for discrete, or 'c' for continuous.

Note: If variable_types is not specified, the following implicit conversions are used:

int -> 'c'
float -> 'c'
binary -> 'b'
category -> 'd'

Currently we have not added support for timestamps.

The do method in the causal namespace generates a random sample from $P(outcome|do(X=x))$ of the same length as your data set, and returns this outcome as a new DataFrame. You can continue to perform the usual DataFrame operations with this sample, and so you can compute statistics and create plots for causal outcomes!

The do method is built on top of the lower-level dowhy objects, so can still take a graph and perform identification automatically when you provide a graph instead of common_causes.

For more details, check out the Pandas API notebook or the Do Sampler notebook.

Graphical Models and Potential Outcomes: Best of both worlds

DoWhy builds on two of the most powerful frameworks for causal inference: graphical models and potential outcomes. It uses graph-based criteria and do-calculus for modeling assumptions and identifying a non-parametric causal effect. For estimation, it switches to methods based primarily on potential outcomes.

A unifying language for causal inference

DoWhy is based on a simple unifying language for causal inference. Causal inference may seem tricky, but almost all methods follow four key steps:

  1. Model a causal inference problem using assumptions.
  2. Identify an expression for the causal effect under these assumptions ("causal estimand").
  3. Estimate the expression using statistical methods such as matching or instrumental variables.
  4. Finally, verify the validity of the estimate using a variety of robustness checks.

This workflow can be captured by four key verbs in DoWhy:

  • model
  • identify
  • estimate
  • refute

Using these verbs, DoWhy implements a causal inference engine that can support a variety of methods. model encodes prior knowledge as a formal causal graph, identify uses graph-based methods to identify the causal effect, estimate uses statistical methods for estimating the identified estimand, and finally refute tries to refute the obtained estimate by testing robustness to assumptions.

Key differences compared to available causal inference software

DoWhy brings three key differences compared to available software for causal inference:

Explicit identifying assumptions

Assumptions are first-class citizens in DoWhy.

Each analysis starts with building a causal model. The assumptions can be viewed graphically or in terms of conditional independence statements. Wherever possible, DoWhy can also automatically test stated assumptions using observed data.

Separation between identification and estimation

Identification is the causal problem. Estimation is simply a statistical problem.

DoWhy respects this boundary and treats them separately. This focuses the causal inference effort on identification, and frees up estimation using any available statistical estimator for a target estimand. In addition, multiple estimation methods can be used for a single identified_estimand and vice-versa.

Automated robustness checks

What happens when key identifying assumptions may not be satisfied?

The most critical, and often skipped, part of causal analysis is checking the robustness of an estimate to unverified assumptions. DoWhy makes it easy to automatically run sensitivity and robustness checks on the obtained estimate.

Finally, DoWhy is easily extensible, allowing other implementations of the four verbs to co-exist (e.g., we support implementations of the estimation verb from the EconML and CausalML libraries). The four verbs are mutually independent, so their implementations can be combined in any way.

Below are more details about the current implementation of each of these verbs.

Four steps of causal inference

I. Model a causal problem

DoWhy creates an underlying causal graphical model for each problem. This serves to make each causal assumption explicit. This graph need not be complete---you can provide a partial graph, representing prior knowledge about some of the variables. DoWhy automatically considers the rest of the variables as potential confounders.

Currently, DoWhy supports two formats for graph input: gml (preferred) and dot. We strongly suggest using gml as the input format, as it works well with networkx. You can provide the graph either as a .gml file or as a string. If you prefer the dot format, you will need to install additional packages (pydot or pygraphviz, see the installation section above). Both .dot files and strings are supported.

While not recommended, you can also specify common causes and/or instruments directly instead of providing a graph.

Supported formats for specifying causal assumptions

  • Graph: Provide a causal graph in either gml or dot format. Can be a text file or a string.
  • Named variable sets: Instead of the graph, provide variable names that correspond to relevant categories, such as common causes, instrumental variables, effect modifiers, frontdoor variables, etc.

Examples of how to instantiate a causal model are in the Getting Started notebook.

II. Identify a target estimand under the model

Based on the causal graph, DoWhy finds all possible ways of identifying a desired causal effect. It uses graph-based criteria and do-calculus to find expressions that can identify the causal effect.

Supported identification criteria

  • Back-door criterion
  • Front-door criterion
  • Instrumental Variables
  • Mediation (Direct and indirect effect identification)

Different notebooks illustrate how to use these identification criteria. Check out the Simple Backdoor notebook for the back-door criterion, and the Simple IV notebook for the instrumental variable criterion.

III. Estimate causal effect based on the identified estimand

DoWhy supports methods based on both the back-door criterion and instrumental variables. It also provides non-parametric confidence intervals and a permutation test for assessing the statistical significance of the obtained estimate.

Supported estimation methods

  • Methods based on estimating the treatment assignment
    • Propensity-based Stratification
    • Propensity Score Matching
    • Inverse Propensity Weighting
  • Methods based on estimating the outcome model
    • Linear Regression
    • Generalized Linear Models
  • Methods based on the instrumental variable equation
    • Binary Instrument/Wald Estimator
    • Two-stage least squares
    • Regression discontinuity
  • Methods for front-door criterion and general mediation
    • Two-stage linear regression

Examples of using these methods are in the Estimation methods notebook.

Using EconML and CausalML estimation methods in DoWhy

It is easy to call external estimation methods using DoWhy. Currently we support integrations with the EconML and CausalML packages. Here's an example of estimating conditional treatment effects using EconML's double machine learning estimator.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LassoCV
from sklearn.ensemble import GradientBoostingRegressor
dml_estimate = model.estimate_effect(identified_estimand, method_name="backdoor.econml.dml.DML",
                control_value = 0,
                treatment_value = 1,
                target_units = lambda df: df["X0"]>1,
                confidence_intervals=False,
                method_params={
                    "init_params":{'model_y':GradientBoostingRegressor(),
                                   'model_t': GradientBoostingRegressor(),
                                   'model_final':LassoCV(),
                                   'featurizer':PolynomialFeatures(degree=1, include_bias=True)},
                    "fit_params":{}}
                                        )

More examples are in the Conditional Treatment Effects with DoWhy notebook.

IV. Refute the obtained estimate

Having access to multiple refutation methods to validate an effect estimate from a causal estimator is a key benefit of using DoWhy.

Supported refutation methods

  • Add Random Common Cause: Does the estimation method change its estimate after we add an independent random variable as a common cause to the dataset? (Hint: It should not)
  • Placebo Treatment: What happens to the estimated causal effect when we replace the true treatment variable with an independent random variable? (Hint: The effect should go to zero)
  • Dummy Outcome: What happens to the estimated causal effect when we replace the true outcome variable with an independent random variable? (Hint: The effect should go to zero)
  • Simulated Outcome: What happens to the estimated causal effect when we replace the dataset with a simulated dataset based on a known data-generating process closest to the given dataset? (Hint: It should match the effect parameter from the data-generating process)
  • Add Unobserved Common Causes: How sensitive is the effect estimate when we add an additional common cause (confounder) to the dataset that is correlated with the treatment and the outcome? (Hint: It should not be too sensitive)
  • Data Subsets Validation: Does the estimated effect change significantly when we replace the given dataset with a randomly selected subset? (Hint: It should not)
  • Bootstrap Validation: Does the estimated effect change significantly when we replace the given dataset with bootstrapped samples from the same dataset? (Hint: It should not)

Examples of using refutation methods are in the Refutations notebook. For an advanced refutation that uses a simulated dataset based on user-provided or learnt data-generating processes, check out the Dummy Outcome Refuter notebook. As a practical example, this notebook shows an application of refutation methods on evaluating effect estimators for the Infant Health and Development Program (IHDP) and Lalonde datasets.

Citing this package

If you find DoWhy useful for your research work, please cite us as follows:

Amit Sharma, Emre Kiciman, et al. DoWhy: A Python package for causal inference. 2019. https://github.com/microsoft/dowhy

Bibtex:

@misc{dowhy,
author={Sharma, Amit and Kiciman, Emre and others},
title={Do{W}hy: {A Python package for causal inference}},
howpublished={https://github.com/microsoft/dowhy},
year={2019}
}

Alternatively, you can cite our Arxiv paper on DoWhy.

Amit Sharma, Emre Kiciman. DoWhy: An End-to-End Library for Causal Inference. 2020. https://arxiv.org/abs/2011.04216

Bibtex:

@article{dowhypaper,
title={DoWhy: An End-to-End Library for Causal Inference},
author={Sharma, Amit and Kiciman, Emre},
journal={arXiv preprint arXiv:2011.04216},
year={2020}
}

Roadmap

The projects page lists the next steps for DoWhy. If you would like to contribute, have a look at the current projects. If you have a specific request for DoWhy, please raise an issue.

Contributing

This project welcomes contributions and suggestions. For a guide to contributing and a list of all contributors, check out CONTRIBUTING.md. You can also join the DoWhy development channel on Discord.

Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Comments
  • Functional api/causal estimators


    • Introduce fit() method to estimators.
    • Refactor constructors to avoid using *args and **kwargs and have more explicit parameters.
    • Refactor refuters and other parts of the code to use fit() and modify arguments to estimate_effect()
    enhancement 
    opened by andresmor-ms 24
  • Error in estimation step


    I get an error when I estimate the causal effect using the propensity method. It complains that my treatment feature should be binary. But it is binary! Where could the problem be?

    opened by Jami1141 23
  • Functional api/identify effect


    First of a series of PR to add a functional API according to: https://github.com/py-why/dowhy/wiki/API-proposal-for-v1

    • Refactor identify_effect to have a functional API
      • Created BackdoorIdentifier class and extracted the logic from CausalIdentifier to be just a Protocol
      • Refactor the identify_effect method of BackdoorIdentifier and IDIdentifier to take the graph as parameter
      • Moved constants into enums for easier type checking
      • Backwards compatible with previous CausalModel API
      • Added notebook as demo that CausalModel API and new API behaves the same way
    enhancement 
    opened by andresmor-ms 15
  • Backdoor identification tests


    The PR contains first batch of tests for the backdoor identification

    At the moment, the method being tested is CausalIdentifier.identify_backdoor.

    From my perspective, this function should:

    1. not return unobserved variables in backdoor sets
    2. not return adjustment sets that induce bias
    3. return all expected adjustment sets, at least the minimum sufficient adjustment sets

    TODO:

    • [x] Ensure the expected output is clear. This requires small refactoring in identify_backdoor.
    • [x] Ensure nothing is returned if no adjustment is needed.
    • [x] Ensure all tests are correct and that they pass.
    • [x] Maximal-adjustment set as the default method
    • [x] Backdoor set with minimum number of IVs is selected as default in get_default_backdoor_set_id

    Three tests fail at the moment, one of them related to problems 1. and 2. and should be fixed by some refactoring of identify backdoor to not include unobserved variables and not adjust if there is no need.

    The other two are related to this graph:

    [graph image]

    The possible adjustment sets are: (C), (B,C) and (B,A). The path X <- B <- C -> Y needs to be closed. The easiest way to do it is to adjust for C, and the test passes here.

    In first of them, all variables are observed, but (B,A) is not returned as one of the possible adjustments sets. This is not that terrible, as C is detected, but it should probably still be there.

    In the second of them, C is unobserved, so (B,A) is the only adjustment set possible, but it's not returned. So these two are very similar (at least the core issue is the same). I believe this is a bug and will require a bit more detective work.

    opened by vojavocni 15
  • Implementation of Causal Discovery?


    I'm curious if anyone is interested in folding causal discovery algorithms into the dowhy package? I currently use the 'Causal Discovery Toolkit' (cdt) along with my own code for performing causal discovery. I think that for sufficiently complex problem domains, causal discovery is a necessary first half of causal analysis.

    enhancement discussion 
    opened by dcompgriff 14
  • Linear regression is not reproducible


    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    """
    Created on Thu Dec 27 11:24:48 2018

    @author: mgralle

    Debugging script for the dowhy package, using the Lalonde data example.

    Repetition of estimation using propensity score matching or weighting gives
    reproducible values, as expected. However, repetition of estimation using
    linear regression gives different values.
    """

    # To simplify debugging, I obtained the Lalonde data as described on the DoWhy
    # page and wrote it to a CSV file:
    # from rpy2.robjects import r as R
    # %load_ext rpy2.ipython
    # %R install.packages("Matching")
    # %R library(Matching)
    # %R data(lalonde)
    # %R -o lalonde
    # lfile = open("lalonde.csv", "w")
    # lalonde.to_csv(lfile, index=False)
    # lfile.close()

    import pandas as pd

    lalonde = pd.read_csv("lalonde.csv")

    print("Lalonde data frame:")
    print(lalonde.describe())

    from dowhy.do_why import CausalModel

    # 1. Propensity score weighting
    model = CausalModel(
        data=lalonde,
        treatment='treat',
        outcome='re78',
        common_causes='nodegr+black+hisp+age+educ+married'.split('+'))
    identified_estimand = model.identify_effect()

    psw_estimate = model.estimate_effect(identified_estimand,
        method_name="backdoor.propensity_score_weighting")
    print("\n(1) Causal Estimate from PS weighting is " + str(psw_estimate.value))

    psw_estimate = model.estimate_effect(identified_estimand,
        method_name="backdoor.propensity_score_weighting")
    print("\n(2) Causal Estimate from PS weighting is " + str(psw_estimate.value))

    # 2. Propensity score matching
    psm_estimate = model.estimate_effect(identified_estimand,
        method_name="backdoor.propensity_score_matching")
    print("\n(1) Causal estimate from PS matching is " + str(psm_estimate.value))

    psm_estimate = model.estimate_effect(identified_estimand,
        method_name="backdoor.propensity_score_matching")
    print("\n(2) Causal estimate from PS matching is " + str(psm_estimate.value))

    # 3. Linear regression
    linear_estimate = model.estimate_effect(identified_estimand,
        method_name="backdoor.linear_regression", test_significance=True)
    print("\n(1) Causal estimate from linear regression is " + str(linear_estimate.value))

    linear_estimate = model.estimate_effect(identified_estimand,
        method_name="backdoor.linear_regression", test_significance=True)
    print("\n(2) Causal estimate from linear regression is " + str(linear_estimate.value))

    # Recreate model from scratch for linear regression
    model = CausalModel(
        data=lalonde,
        treatment='treat',
        outcome='re78',
        common_causes='nodegr+black+hisp+age+educ+married'.split('+'))
    identified_estimand = model.identify_effect()

    linear_estimate = model.estimate_effect(identified_estimand,
        method_name="backdoor.linear_regression", test_significance=True)
    print("\n(3) Causal estimate from linear regression is " + str(linear_estimate.value))

    print("\nLalonde Data frame hasn't changed:")
    print(lalonde.describe())

    opened by mgralle 13
  • Basic support for multivalue categorical treatments


    Support multivalue categorical treatments in econml.py; add a utility method to CausalEstimator that returns the effect of the treatment that was actually used, and works with the multivalue changes

    enhancement 
    opened by EgorKraevTransferwise 11
  • Host Docker image on GitHub instead of Docker Hub


    We can simplify our setup by hosting the Docker image used for the documentation build by using GitHub's Container Registry. That will avoid the need for Docker Hub credentials management in GitHub, and also the creation of a DoWhy account on Docker Hub (currently we're hosting the image in personal accounts (first @petergtz's, now @darthtrevino's), which is not great).

    enhancement maintenance 
    opened by petergtz 11
  • DoWhy with Multiple Treatments (T) and Multiple Outcomes (Y)


    Hi, @emrekiciman
    I have been looking to see if DoWhy supports Multiple Treatments (T) and Multiple Outcomes (Y) causal framework and it seems to be the case using DoWhy/EconMl. For example, CausalForest may be a good candidate. My question is how would I define and pass parameters for both the Treatment and the Outcome when setting up the model? Would the below code work or will I run into issues?

    model = CausalModel(data=df, treatment=["T1", "T2", "T3"], outcome=["Y1", "Y2"], common_causes=["W1", "W2", "W3"])

    Also, can my treatments all be continuous variables between 0 and 10, for example? Lastly, can you reference a good example notebook where I can see the predicted CATE for each instance (i.e. user) with each treatment combination (T1, T2, T3)? Something like this output:

    user_id | T1 | T2 | T3   | CATE Y1  | CATE Y2
    1       | 4  | 0  | 0    | 0.2      | 0.1
    1       | 1  | 6  | 1    | 0.22     | 0.13
    1       | 2  | 1  | 8    | 0.62     | -0.2
    ...
    

    Any help is very appreciated.

    help wanted stale 
    opened by titubs 11
  • Not able to install dowhy with Python 3.8.3 as an Anaconda distro.


    I have a fresh install of Anaconda with Python 3.8.3, but I am not able to install with the conda command; I get conflicts that conda is not able to resolve. This is not the first time I've had significant difficulty in installing dowhy. Can installation be made easier? Perhaps more checking of package versions, for greater compatibility?

    Do I just install with the pip command?

    opened by Ackbach 10
  • Creating a list of contributors to the DoWhy project


    I am trying out the all-contributors project to automate the process of acknowledging contributors to this project.

    If this works, then contributors can add themselves to the list on /CONTRIBUTING.md by issuing a simple message in this thread or on the relevant pull request to the all-contributors bot.

    I will start by adding a few contributors manually. This may involve a few test messages to this thread.

    opened by amit-sharma 10
  • do sampling only works for backdoor; it does not work for front door, instrumental variables, or unobserved confounders


    I have created causal models and estimands for a range of cases: backdoor, front door, instrumental variable, and also cases where front door and instrumental variables have unobserved confounders. All of these work for calculating the ATE.

    I have now tried to run the do operator for a graph where there is an instrumental variable and an unobserved backdoor confounder but it crashes with KeyError: 'backdoor'

    The do operator only works where there is an observed backdoor confounder which is very disappointing as it rules out a lot of use cases.

    Here is the code ...

    variable_types = {'engagement': 'd', 'retention': 'd', 'funding': 'd'}
    gml_graph = 'graph [directed 1\n\tnode [id "funding" label "funding"]\n\tnode [id "engagement" label "engagement"]\n\tnode [id "retention" label "retention"]\n\tnode [id "U" label "U"]\n\tedge [source "funding" target "engagement"]\n\tedge [source "engagement" target "retention"]\n\tedge [source "U" target "engagement"]\n\tedge [source "U" target "retention"]\n]'

    df_do = df_student_retention.causal.do(x={"engagement": 1}, outcome="retention", dot_graph=gml_graph, variable_types=variable_types, proceed_when_unidentifiable=True)

    Expected behavior I would expect df_do to be populated with the results of the do operation

    Version information: 0.9

    Additional context As stated above, it all works for calculating the ATE, just not for running a "do" operation.

    question 
    opened by grahamharrison68 0
  • No IVs detected when one variable is unobserved

    In a complex model with several backdoor variables, model.identify_effect does not show any backdoor variables when one of those variables is unobserved (i.e., missing from the data). Here is a slightly adjusted example from an example notebook:

    import numpy as np
    
    from dowhy import CausalModel
    import dowhy.datasets 
    
    data = dowhy.datasets.linear_dataset(beta=10,
            num_common_causes=5,
            num_instruments = 2,
            num_effect_modifiers=1,
            num_samples=5000, 
            treatment_is_binary=True,
            stddev_treatment_noise=10,
            num_discrete_common_causes=1)
    df = data["df"]
    
    df = df.drop(columns=["W0"])
    
    model=CausalModel(
            data = df,
            treatment=data["treatment_name"],
            outcome=data["outcome_name"],
            graph=data["gml_graph"]
            )
    
    identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
    print(identified_estimand)
    

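A quick way to catch this kind of silent mismatch is to compare the graph's node names against the dataframe's columns before identification. The sketch below is not part of DoWhy's API; the variable names are the defaults produced by dowhy.datasets.linear_dataset in the example above (W0-W4, Z0-Z1, X0, treatment v0, outcome y), with W0 dropped:

```python
# Hypothetical pre-check (not a DoWhy function): warn when the causal graph
# references variables that are missing from the data.
graph_nodes = {"W0", "W1", "W2", "W3", "W4", "Z0", "Z1", "X0", "v0", "y"}
df_columns = {"W1", "W2", "W3", "W4", "Z0", "Z1", "X0", "v0", "y"}  # W0 dropped

missing = graph_nodes - df_columns
if missing:
    print(f"Warning: graph variables missing from data: {sorted(missing)}")
```

In practice the two sets would come from the parsed graph and from df.columns, but the set difference is the whole idea.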
    I would assume that the function would either:

    • inform the user that there might be an issue (i.e., a variable does not exist in the data), or
    • use the remaining variables as backdoor variables.

    Am I overlooking something here? Thanks for checking!

    Version information:

    • DoWhy version: 0.9
    question waiting for author 
    opened by Klesel 2
  • Default value significance test: dowhy/causal_refuter.py

    I was wondering why the default value for the significance test is set to 0.85 (line 193): significance_level: float = 0.85

    The comment, however, says the default is 0.05 (line 219).

    Version information:

    • DoWhy version: 0.9

    Can you please let me know if this is intentional?

    question 
    opened by Klesel 0
  • How to model the data

    Hello, I am new to causality analysis. I read the DoWhy docs and, for my research, tried it on dummy data in which I have one dependent variable (revenue) and 10 independent variables. How can I select the treatment variable, and how should I approach the modeling part?

    Thanks in advance

    question waiting for author 
    opened by Harshitha-HM1999 2
  • p-value NaN in some pathological cases with non-bootstrap method

    Describe the bug In some pathological cases it is possible for the p-value of a refuter to be NaN: in particular, if all of the simulations return the same value.

    Identified while looking at https://github.com/py-why/dowhy/issues/804

    Steps to reproduce the behavior

    import random
    random.seed(1)
    
    from dowhy import CausalModel
    import dowhy.datasets
    
    data = dowhy.datasets.linear_dataset(
        beta=10,
        num_common_causes=3,
        num_instruments=2,
        num_samples=10000,
        treatment_is_binary=True)
    
    model = CausalModel(
      data=data["df"],
      treatment=data["treatment_name"],
      outcome=data["outcome_name"],
      graph=data["gml_graph"])
    print("identify")
    identified_estimand = model.identify_effect()
    print("estimate")
    estimate = model.estimate_effect(identified_estimand,
                                     method_name="backdoor.propensity_score_matching")
    print("refute")
    refute_results = model.refute_estimate(identified_estimand,
                                       estimate,
                                       method_name="random_common_cause",
                                       # placebo_type="permute",
                                       num_simulations=20, show_progress_bar=True)
    print(refute_results)
    

    Produces

    Refute: Add a random common cause
    Estimated effect:10.720735355706834
    New effect:10.720735355706834
    p value:nan
    

    Expected behavior This is unclear, which is why I am opening an issue rather than submitting a bug report.

    Version information:

    • DoWhy version installed from main at commit 97e6bdc3db137280fdb8812dfba34de14a248c72

    The root cause of this is in the p-value calculation of the refuter, which assumes that the standard deviation of the simulations is well defined (i.e., non-zero).

    This would be easy to fix by setting the p-value to 1 in this scenario. WDYT?
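A minimal sketch of such a guard, assuming the refuter derives a two-sided p-value from a normal approximation of the simulated estimates (the function name and signature are illustrative, not DoWhy's actual code):

```python
import math
import statistics

def refutation_p_value(simulated_estimates, original_estimate):
    """Two-sided p-value under a normal approximation, with a guard for the
    degenerate case where every simulation returns the same value."""
    mean = statistics.mean(simulated_estimates)
    std = statistics.pstdev(simulated_estimates)
    if std == 0:
        # Normal approximation undefined: return 1.0 when the original
        # estimate matches the (constant) simulations, 0.0 otherwise.
        return 1.0 if math.isclose(original_estimate, mean) else 0.0
    z = abs(original_estimate - mean) / std
    # p = 2 * (1 - Phi(z)), with Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

# Degenerate case from the issue: every simulation reproduces the estimate.
print(refutation_p_value([10.72] * 20, 10.72))  # 1.0
```

The guard makes the "no change under refutation" case report p = 1.0 instead of NaN.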

    bug 
    opened by Padarn 0
Releases(v0.9.1)
  • v0.9.1(Dec 17, 2022)

    Minor update to v0.9.

    • Python 3.10 support
    • Streamlined dependency structure for the dowhy package (fewer required dependencies)
    • Color option for plots (@eeulig)

    Thanks @darthtrevino, @petergtz, @andresmor-ms for driving this release!

    Source code(tar.gz)
    Source code(zip)
  • v0.9(Dec 6, 2022)

    • Preview for the new functional API (see notebook). The new API (in experimental stage) allows for a modular use of the different functionalities and includes separate fit and estimate methods for causal estimators. Please leave your feedback here. The old DoWhy API based on CausalModel should work as before. (@andresmor-ms)

    • Faster, better sensitivity analyses.

    • New API for unit change attribution (@kailashbuki)

    • New quality option BEST for auto-assignment of causal mechanisms, which uses the optional auto-ML library AutoGluon (@bloebp)

    • Better conditional independence tests through the causal-learn package (@bloebp)

    • Algorithms for computing efficient backdoor sets [ example notebook ] (@esmucler)

    • Support for estimating controlled direct effect (@amit-sharma)

    • Support for multi-valued treatments for econml estimators (@EgorKraevTransferwise)

    • New PyData theme for documentation with new homepage, Getting started guide, revised User Guide and examples page (@petergtz)

    • A contributing guide and simplified instructions for new contributors (@MichaelMarien)

    • Streamlined dev environment using Poetry for managing dependencies and project builds (@darthtrevino)

    • Bug fixes

    Source code(tar.gz)
    Source code(zip)
  • v0.8(Jul 18, 2022)

    A big thanks to @petergtz, @kailashbuki, and @bloebp for the GCM package and @anusha0409 for an implementation of partial R2 sensitivity analysis for linear models.

    • Graphical Causal Models: SCMs, root-cause analysis, attribution, what-if analysis, and more.

    • Sensitivity Analysis: Faster, more general partial-R2 based sensitivity analysis for linear models, based on Cinelli & Hazlett (2020).

    • New docs structure: Updated docs structure including user and contributors' guide. Check out the docs.

    • Bug fixes

    Contributors: @amit-sharma, @anusha0409, @bloebp, @EgorKraevTransferwise, @EliKling, @kailashbuki, @itsoum, @MichaelMarien, @petergtz, @ryanrussell

    Source code(tar.gz)
    Source code(zip)
  • v0.7.1(Mar 20, 2022)

    • Graph refuter with conditional independence tests to check whether data conforms to the assumed causal graph

    • Better docs for estimators by adding the method-specific parameters directly in its own init method

    • Support use of custom external estimators

    • Consistent calls for init_params for dowhy and econml estimators

    • Add support for Dagitty graphs

    • Bug fixes for GLM model, causal model with no confounders, and hotel case-study notebook

    Thank you @EgorKraevTransferwise, @ae-foster, and @anusha0409 for your contributions!

    Source code(tar.gz)
    Source code(zip)
  • v0.7(Jan 10, 2022)

    • [Major] Faster backdoor identification with support for minimal adjustment, maximal adjustment or exhaustive search. More test coverage for identification.

    • [Major] Added new functionality of causal discovery [Experimental]. DoWhy now supports discovery algorithms from external libraries like CDT. Example notebook

    • [Major] Implemented ID algorithm for causal identification. [Experimental]

    • Added friendly text-based interpretation for DoWhy's effect estimate.

    • Added a new estimation method, distance matching, that relies on a distance metric between inputs.

    • Heuristics to infer default parameters for refuters.

    • Inferring default strata automatically for propensity score stratification.

    • Added support for custom propensity models in propensity-based estimation methods.

    • Bug fixes for confidence intervals for linear regression. Better version of bootstrap method.

    • Allow effect estimation without needing to refit the model for EconML estimators

    Big thanks to @AndrewC19, @ha2trinh, @siddhanthaldar, and @vojavocni

    Source code(tar.gz)
    Source code(zip)
  • v0.6(Mar 3, 2021)

    • [Major] Placebo refuter now supports instrumental variable methods
    • [Major] Moved matplotlib to an optional dependency. Can be installed using pip install dowhy[plotting]
    • [Major] A new method for generating unobserved confounder for refutation
    • DummyOutcomeRefuter now supports unobserved confounders
    • Update to align with EconML's new API
    • All refuters now support control and treatment values for continuous treatments
    • Better logging configuration

    A big thanks to @arshiaarya, @n8sty, @moprescu and @vojavocni for their contributions!

    Source code(tar.gz)
    Source code(zip)
  • v0.5.1(Dec 12, 2020)

    • Added an optimized version for identify_effect
    • Fixed a bug for direct and indirect effects computation
    • More test coverage: Notebooks are also under automatic tests
    • Updated the conditional-effects notebook to support the latest EconML version
    • EconML metalearners now have the expected behavior: accept both common_causes and effect_modifiers
    • Fixed some bugs in refuter tests
    Source code(tar.gz)
    Source code(zip)
  • v0.5(Nov 21, 2020)

    Installation

    • DoWhy can be installed on Conda now!

    Code

    • Support for identification by mediation formula
    • Support for the front-door criterion
    • Linear estimation methods for mediation
    • Generalized backdoor criterion implementation using paths and d-separation
    • Added GLM estimators, including logistic regression
    • New API for interpreting causal models, estimates and refuters. The first interpreter, by @ErikHambardzumyan, visualizes how the distribution of a confounder changes
    • Friendlier error messages for the propensity score stratification estimator when there is not enough data in a bin.
    • Enhancements to the dummy outcome refuter with machine-learned components; it can now simulate non-zero effects too. Ready for alpha testing
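To illustrate the front-door criterion mentioned above, here is a small self-contained simulation (illustrative only, not DoWhy code; all variable names and coefficients are made up). With U unobserved, X -> M -> Y, and U affecting both X and Y, the front-door formula P(y|do(x)) = sum_m P(m|x) sum_x' P(y|m,x') P(x') recovers the effect of X on Y:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Front-door setup: U -> X, U -> Y, X -> M -> Y; U is unobserved.
u = rng.binomial(1, 0.5, n)
x = rng.binomial(1, 0.2 + 0.6 * u)
m = rng.binomial(1, 0.1 + 0.7 * x)
y = rng.binomial(1, 0.1 + 0.5 * m + 0.3 * u)

def p(arr):  # empirical probability of a boolean/0-1 array
    return arr.mean()

def front_door(x_val):
    # P(y=1 | do(X = x_val)) via the front-door adjustment formula.
    total = 0.0
    for m_val in (0, 1):
        p_m_given_x = p(m[x == x_val] == m_val)
        inner = sum(
            p(y[(m == m_val) & (x == xp)]) * p(x == xp) for xp in (0, 1)
        )
        total += p_m_given_x * inner
    return total

ate = front_door(1) - front_door(0)
print(f"front-door ATE estimate: {ate:.3f}")  # true effect = 0.7 * 0.5 = 0.35
```

A naive comparison of y between x groups would be biased by U; the front-door estimate is not, even though U never appears in the formula.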

    Docs

    Community

    • Created a contributors page with guidelines for contributing
    • Added the all-contributors bot so that new contributors can be added right after their pull requests are merged

    A big thanks to @Tanmay-Kulkarni101, @ErikHambardzumyan, @Sid-darthvader for their contributions.

    Source code(tar.gz)
    Source code(zip)
  • v0.4(May 11, 2020)

    • DummyOutcomeRefuter now includes machine learning functions to increase the power of the refutation.

      • In addition to generating a random dummy outcome, you can now generate a dummy outcome that is an arbitrary function of confounders but always independent of treatment, and then test whether the estimated treatment effect is zero. This is inspired by ideas from the T-learner.
      • We also provide default machine-learning-based methods to estimate such a dummy outcome based on confounders. Of course, you can specify any custom ML method.
    • Added a new BootstrapRefuter that simulates the issue of measurement error with confounders. Rather than a simple bootstrap, you can generate bootstrap samples with noise on the values of the confounders and check how sensitive the estimate is.

      • The refuter supports custom selection of the confounders to add noise to.
    • All refuters now provide confidence intervals and a significance value.

    • Better support for heterogeneous effect libraries like EconML and CausalML

      • All CausalML methods can be called directly from DoWhy, in addition to all methods from EconML.
      • [Change to naming scheme for estimators] To achieve a consistent naming scheme for estimators, we suggest prepending internal DoWhy estimators with the string "dowhy". For example, "backdoor.dowhy.propensity_score_matching". This is not a breaking change, so you can keep using the old naming scheme too.
      • EconML-specific: Since EconML assumes that effect modifiers are a subset of confounders, a warning is issued if a user specifies effect modifiers outside of confounders and tries to use EconML methods.
    • CI and Standard errors: Added bootstrap-based confidence intervals and standard errors for all methods. For linear regression estimator, also implemented the corresponding parametric forms.

    • Convenience functions for getting confidence intervals, standard errors and conditional treatment effects (CATE), that can be called after fitting the estimator if needed

    • Better coverage for tests. Also, tests are now seeded with a random seed, making them more dependable.

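The dummy-outcome idea can be illustrated with a small numpy sketch (a simplified stand-in for DoWhy's refuter, using a linear dummy outcome so that an adjusted linear estimate is well specified): since the dummy outcome depends only on the confounder and never on the treatment, an estimator that adjusts for the confounder should report an effect close to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Confounder W drives both the treatment T and the dummy outcome.
w = rng.normal(size=n)
t = (w + rng.normal(size=n) > 0).astype(float)

# Dummy outcome: a function of W only, independent of T by construction.
y_dummy = 2.0 * w + 0.1 * rng.normal(size=n)

# Regress y_dummy on [1, T, W]; the coefficient on T should be near zero.
X = np.column_stack([np.ones(n), t, w])
coef, *_ = np.linalg.lstsq(X, y_dummy, rcond=None)
print(f"estimated effect of T on the dummy outcome: {coef[1]:.4f}")
```

If an estimator reports a clearly non-zero effect on such an outcome, its adjustment for confounders is suspect, which is exactly what the refuter tests.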
    Thanks to @Tanmay-Kulkarni101 and @Arshiaarya for their contributions!

    Source code(tar.gz)
    Source code(zip)
  • v0.2(Jan 8, 2020)

    This release includes many major updates:

    • (BREAKING CHANGE) The CausalModel import is now simpler: "from dowhy import CausalModel"
    • Multivariate treatments are now supported.
    • Conditional Average Treatment Effects (CATE) can be estimated for any subset of the data. Includes integration with EconML--any method from EconML can be called using DoWhy through the estimate_effect method (see example notebook).
    • Other than CATE, specific target estimands like ATT and ATC are also supported for many of the estimation methods.
    • For reproducibility, you can specify a random seed for all refutation methods.
    • Multiple bug fixes and updates to the documentation.

    Includes contributions from @j-chou, @ktmud, @jrfiedler, @shounak112358, @Lnk2past. Thank you all!

    Source code(tar.gz)
    Source code(zip)
    dowhy-0.2.tar.gz(1016.48 KB)
  • v0.1.1-alpha(Jul 15, 2019)

    This release implements the four steps of causal inference: model, identify, estimate and refute. It also includes a pandas.DataFrame extension for causal inference and the do-sampler.

    Source code(tar.gz)
    Source code(zip)
Owner
Microsoft
Open source projects and samples from Microsoft