Chow-Liu Tree

2022-07-02 22:59:00 【Ancient road】

0. Introduction

Reference for the basic concepts
Paper: Approximating discrete probability distributions with dependence trees [Chow and Liu, 1968]

we consider the problem of best approximating an nth-order distribution by a product of n - 1 second-order distributions.

The Chow–Liu method describes a joint probability distribution $P(X_1, X_2, \ldots, X_n)$ as a product of second-order conditional and marginal distributions. For example, the six-dimensional distribution $P(X_1, X_2, X_3, X_4, X_5, X_6)$ might be approximated as

$P^{\prime}(X_1, X_2, X_3, X_4, X_5, X_6) = P(X_6 \mid X_5)\, P(X_5 \mid X_2)\, P(X_4 \mid X_2)\, P(X_3 \mid X_2)\, P(X_2 \mid X_1)\, P(X_1)$
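
As a rough illustration of how such a factorization is evaluated, the sketch below encodes the dependence tree of the example as a parent map and multiplies the corresponding terms. The names `PARENT`, `tree_joint`, `ROOT` and `COPY` are hypothetical, the variables are assumed to be binary, and all conditionals are assumed to share one made-up table; none of these values come from the paper.

```python
import itertools

# Parent map read off the example: X2 <- X1, X3 <- X2, X4 <- X2, X5 <- X2, X6 <- X5.
# X1 has no parent, so it contributes the marginal P(X1).
PARENT = {1: None, 2: 1, 3: 2, 4: 2, 5: 2, 6: 5}

# Hypothetical tables for binary variables (illustrative values only).
ROOT = {0: 0.5, 1: 0.5}                       # P(X1)
COPY = {(0, 0): 0.9, (1, 0): 0.1,             # COPY[(child, parent)] = P(child | parent)
        (0, 1): 0.1, (1, 1): 0.9}


def tree_joint(x, parent, root_marginal, cond):
    """Evaluate P'(x) as the product of one term per variable.

    x maps a variable index to its value; cond is shared by all edges here.
    """
    p = 1.0
    for var, par in parent.items():
        if par is None:
            p *= root_marginal[x[var]]
        else:
            p *= cond[(x[var], x[par])]
    return p


# Sanity check: a valid factorization sums to 1 over all 2^6 assignments.
total = sum(
    tree_joint(dict(zip(range(1, 7), values)), PARENT, ROOT, COPY)
    for values in itertools.product([0, 1], repeat=6)
)
print(total)
```

Storing the structure as a child-to-parent map keeps the evaluation to one lookup per variable, which is why a tree needs only $n - 1$ second-order tables plus one marginal.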

1. Mutual information

Mutual information is used as the weight of each edge.

2. Chow-Liu Tree: theoretical basis

Given a joint distribution $P(x)$, the KL divergence $D\left(P, P^{\prime}\right)$ is minimized by projecting $P(x)$ onto a maximum-weight spanning tree (MWST) over the nodes in $X$, where the weight on the edge $\left(X_{i}, X_{j}\right)$ is defined by the mutual information measure
$I\left(X_{i} ; X_{j}\right)=\sum_{x_{i}, x_{j}} P\left(x_{i}, x_{j}\right) \log \frac{P\left(x_{i}, x_{j}\right)}{P\left(x_{i}\right) P\left(x_{j}\right)}$

The resulting tree has the smallest Kullback-Leibler divergence from $P(x)$ among all tree-structured approximations.
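
In practice the edge weight is usually estimated from data by plugging empirical frequencies into the formula for $I(X_i; X_j)$. The following is a minimal sketch under that assumption; `mutual_information` is a hypothetical helper name, the inputs are assumed to be two equal-length sequences of discrete samples, and the natural logarithm is used, so the result is in nats.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in nats from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of (x, y) pairs
    count_x = Counter(xs)          # counts of x values
    count_y = Counter(ys)          # counts of y values
    mi = 0.0
    for (x, y), k in joint.items():
        p_xy = k / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        # Pairs that never occur contribute nothing, so only observed pairs are summed.
        mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi


# Two identical fair binary variables share log(2) nats of information;
# two independent ones share none.
print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))
```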

3. Chow-Liu Tree: algorithm flow

(1) For the distribution $P(x)$, compute the pairwise joint distribution $P(X_i, X_j)$ for all $i \neq j$;
(2) Using the distributions obtained in step (1), compute the mutual information $I(X_i; X_j)$ of every pair of nodes, and use $I(X_i; X_j)$ as the weight of the edge connecting the two nodes;
(3) Compute the maximum-weight spanning tree:
- a. Initial state: $n$ variables (nodes), 0 edges;
- b. Insert the edge with the largest weight;
- c. Find the next-largest edge and add it to the tree, provided that adding it does not create a cycle; otherwise skip it and move on to the next-largest edge;
- d. Repeat step c until $n-1$ edges have been inserted (the tree is then complete);
(4) Choose any node as the root and direct the edges from the root toward the leaves;
(5) The approximate joint probability $P^{\prime}(x)$ defined by the resulting tree has the smallest relative entropy with respect to the joint probability $P(x)$ of the original Bayesian network.

In fact, the procedure is the same as for a minimum spanning tree, except that the largest remaining edge is chosen at each step instead of the smallest. The representative algorithms are Kruskal's and Prim's.
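
The tree-building and edge-orientation steps can be sketched as follows, assuming the pairwise weights have already been computed. This is a Kruskal-style variant with a union-find structure for cycle detection; `max_weight_spanning_tree`, `orient` and the example weights are hypothetical, and nodes are assumed to be numbered `0..n-1`.

```python
from collections import defaultdict, deque


def max_weight_spanning_tree(n, weights):
    """Kruskal's algorithm on descending weights.

    weights maps an undirected edge (i, j) to its weight.
    Returns the list of chosen edges (n - 1 of them if the graph is connected).
    """
    root_of = list(range(n))  # union-find forest, one set per node initially

    def find(u):
        while root_of[u] != u:
            root_of[u] = root_of[root_of[u]]  # path halving
            u = root_of[u]
        return u

    tree = []
    for (i, j), _ in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:              # different components, so no cycle is created
            root_of[ri] = rj
            tree.append((i, j))
            if len(tree) == n - 1:
                break
    return tree


def orient(tree, root=0):
    """Direct the edges away from the root; returns a {child: parent} map."""
    neighbours = defaultdict(list)
    for i, j in tree:
        neighbours[i].append(j)
        neighbours[j].append(i)
    parent_of = {root: None}
    queue = deque([root])
    while queue:                  # breadth-first traversal from the root
        u = queue.popleft()
        for v in neighbours[u]:
            if v not in parent_of:
                parent_of[v] = u
                queue.append(v)
    return parent_of


# Illustrative weights on four nodes; edge (0, 2) would close a cycle and is skipped.
example = {(0, 1): 0.9, (1, 2): 0.8, (0, 2): 0.7, (2, 3): 0.1}
edges = max_weight_spanning_tree(4, example)
print(edges)
print(orient(edges, root=0))
```

Sorting once and using union-find keeps the cycle check nearly constant-time per edge; Prim's algorithm would be an equally valid choice for a dense complete graph.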

4. Minimum spanning tree

5. Semi-naive Bayesian classifier

Excerpted from Section 7.4, "Semi-naive Bayesian classifiers", of Zhou Zhihua's "Machine Learning" (the "watermelon book").

The basic idea of semi-naive Bayesian classifiers is to take some of the dependencies between attributes into account, so that the full joint probability need not be computed, while fairly strong attribute dependencies are not ignored altogether. The one-dependent estimator (ODE) is the most commonly used strategy for semi-naive Bayesian classifiers. As the name suggests, "one-dependent" means assuming that each attribute depends on at most one other attribute besides the class, that is,
$P(c \mid \boldsymbol{x}) \propto P(c) \prod_{i=1}^{d} P\left(x_{i} \mid c, pa_{i}\right)$
where $pa_{i}$ is the attribute that $x_{i}$ depends on, called the parent attribute of $x_{i}$. For each attribute $x_{i}$, if its parent attribute $pa_{i}$ is known, the probability $P\left(x_{i} \mid c, pa_{i}\right)$ can be estimated in a way similar to Eq. $(7.20)$. The key question is therefore how to determine the parent attribute of each attribute; different approaches yield different one-dependent classifiers. The most direct approach is to assume that all attributes depend on the same attribute, called the "super-parent", and then to select the super-parent attribute with a model selection method such as cross-validation. This gives the SPODE (Super-Parent ODE) method. For example, in Figure 7.1(b), $x_1$ is the super-parent attribute.

TAN (Tree Augmented naïve Bayes) [Friedman et al., 1997] builds on the maximum weighted spanning tree algorithm (Chow-Liu tree) [Chow and Liu, 1968]. Through the following steps, it reduces the dependencies between attributes to the tree structure shown in Figure $7.1(\mathrm{c})$:

(1) Compute the conditional mutual information between every pair of attributes
$I\left(x_{i}, x_{j} \mid y\right)=\sum_{x_{i}, x_{j} ; c \in \mathcal{Y}} P\left(x_{i}, x_{j} \mid c\right) \log \frac{P\left(x_{i}, x_{j} \mid c\right)}{P\left(x_{i} \mid c\right) P\left(x_{j} \mid c\right)}$
(2) Build a complete graph with the attributes as nodes, setting the weight of the edge between any two nodes to $I\left(x_{i}, x_{j} \mid y\right)$;
(3) Build the maximum weighted spanning tree of this complete graph, pick a root variable, and direct the edges;
(4) Add the class node $y$, and add a directed edge from $y$ to each attribute.

It is easy to see that the conditional mutual information $I\left(x_{i}, x_{j} \mid y\right)$ characterizes the relevance between attributes $x_{i}$ and $x_{j}$ when the class is known. Hence, through the maximum weighted spanning tree algorithm, TAN in effect retains only the dependencies between strongly related attributes.
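
The edge weight used by TAN can be estimated from labelled samples along the following lines. This is only a sketch: `conditional_mutual_information` is a hypothetical helper, and it weights each term by the empirical joint frequency $P(x_i, x_j, c)$, which is the usual definition of conditional mutual information. That weighting is an assumption of the sketch and differs from the $P(x_i, x_j \mid c)$ weighting in the formula as printed in step (1).

```python
import math
from collections import Counter


def conditional_mutual_information(xs, ys, cs):
    """Empirical I(X;Y | C) in nats from aligned sequences of discrete samples."""
    n = len(cs)
    n_xyc = Counter(zip(xs, ys, cs))   # counts of (x, y, c)
    n_xc = Counter(zip(xs, cs))        # counts of (x, c)
    n_yc = Counter(zip(ys, cs))        # counts of (y, c)
    n_c = Counter(cs)                  # counts of c
    cmi = 0.0
    for (x, y, c), k in n_xyc.items():
        # P(x,y|c) / (P(x|c) P(y|c)) written directly in terms of counts.
        ratio = k * n_c[c] / (n_xc[(x, c)] * n_yc[(y, c)])
        cmi += (k / n) * math.log(ratio)   # weighted by the joint frequency P(x,y,c)
    return cmi


# X and Y are identical and still uncertain once the class is known.
print(conditional_mutual_information([0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]))
# X is fully determined by the class, so it carries no extra information about Y.
print(conditional_mutual_information([0, 0, 1, 1], [0, 1, 0, 1], [0, 0, 1, 1]))
```

The resulting pairwise values can be passed as edge weights to any maximum weighted spanning tree routine to obtain the attribute tree of steps (2) and (3).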