Reading Notes on CGNF: Conditional Graph Neural Fields
2022-07-02 05:47:00 【Si Xi is towering】
1. Overview
Most GNNs do not model the dependencies between node labels. To address this, the authors combine conditional random fields (CRF) with graph convolutional networks into CGNF (Conditional Graph Neural Fields), a model that explicitly models the joint probability of the entire set of node labels, so that neighborhood label information can be exploited when predicting a node's label.
2. Background
2.1 Graph Convolutional Networks
The graph convolution in a GCN takes the following mathematical form:
$$\boldsymbol{H}^{(l+1)}=\sigma\left(\tilde{\boldsymbol{D}}^{-\frac{1}{2}} \tilde{\boldsymbol{A}} \tilde{\boldsymbol{D}}^{-\frac{1}{2}} \boldsymbol{H}^{(l)} \boldsymbol{W}^{(l)}\right)$$
where $\tilde{\boldsymbol{A}}=\boldsymbol{A}+\boldsymbol{I}$ is the adjacency matrix with self-loops added, $\tilde{\boldsymbol{D}}$ is the (diagonal) degree matrix of $\tilde{\boldsymbol{A}}$, $\boldsymbol{H}^{(l)}$ is the node representation at layer $l$, $\boldsymbol{W}^{(l)}$ is the weight matrix of layer $l$, and $\sigma$ is an activation function, commonly ReLU.
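To make the propagation rule concrete, here is a minimal PyTorch sketch of a single GCN layer; the function name, the dense-tensor setup, and the toy graph are our own assumptions, not code from the paper.

```python
import torch

def gcn_layer(adj: torch.Tensor, feat: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """One propagation step: ReLU(D~^{-1/2} (A + I) D~^{-1/2} H W)."""
    n = adj.size(0)
    a_tilde = adj + torch.eye(n)                 # A~ = A + I (add self-loops)
    deg = a_tilde.sum(dim=1)                     # degrees of A~
    d_inv_sqrt = torch.diag(deg.pow(-0.5))       # D~^{-1/2}
    a_hat = d_inv_sqrt @ a_tilde @ d_inv_sqrt    # symmetrically normalized adjacency
    return torch.relu(a_hat @ feat @ weight)

# Tiny usage example: a 4-node path graph, 3 input features, 2 output channels.
adj = torch.tensor([[0., 1., 0., 0.],
                    [1., 0., 1., 0.],
                    [0., 1., 0., 1.],
                    [0., 0., 1., 0.]])
h1 = gcn_layer(adj, torch.randn(4, 3), torch.randn(3, 2))  # shape (4, 2)
```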
2.2 Conditional Random Fields
A conditional random field (CRF) is an undirected probabilistic graphical model, typically used for structured prediction tasks. Given input features $\boldsymbol{x} \in \mathbb{R}^{d}$, a CRF looks for the label set $\boldsymbol{y}$ that maximizes the conditional probability $P(\boldsymbol{y} \mid \boldsymbol{x})$. On an undirected graph, the CRF computes the joint probability distribution by factorizing over cliques:
$$P(\boldsymbol{y} \mid \boldsymbol{x})=\frac{1}{Z(\boldsymbol{x})} \prod_{c} \Phi_{c}\left(\boldsymbol{x}_{c}, \boldsymbol{y}_{c}\right)$$
where $c$ ranges over the cliques of the graph, $\boldsymbol{x}_{c}$ denotes the features of all vertices in clique $c$, $\Phi_{c}$ is the potential function, and $Z(\boldsymbol{x})=\sum_{\boldsymbol{y}^{\prime}} \prod_{c} \Phi_{c}\left(\boldsymbol{x}_{c}, \boldsymbol{y}_{c}^{\prime}\right)$ is the normalization factor (it ensures the computed values form a valid probability distribution).
A clique is a subgraph in which every pair of vertices is connected by an edge.
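As a tiny numeric illustration of this factorization (the potential table below is made up purely for this example and is not from the paper), consider a CRF over two binary-labeled nodes joined by a single edge clique:

```python
import itertools

# Illustrative potential table for one edge clique over two binary labels.
phi = {(0, 0): 2.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}

# Partition function Z(x): sum of the clique potential over all label sets.
z = sum(phi[y] for y in itertools.product([0, 1], repeat=2))  # = 5.0

def joint_prob(y1: int, y2: int) -> float:
    """P(y | x) = phi(y1, y2) / Z for this single-clique graph."""
    return phi[(y1, y2)] / z

print(joint_prob(0, 0))  # 0.4: matching labels get higher probability
```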
3. CGNF in Detail
For convenience, the paper first defines its notation in a symbol table (the table figure is not reproduced here).
3.1 Training
CGNF first feeds the input graph $G=\{\boldsymbol{X}, \boldsymbol{Y}, \boldsymbol{A}\}$ through the two-layer GCN model proposed by Kipf and Welling:
$$\boldsymbol{H}=f(\boldsymbol{X}, \boldsymbol{A})=\operatorname{Softmax}\left(\hat{\boldsymbol{A}} \operatorname{ReLU}\left(\hat{\boldsymbol{A}} \boldsymbol{X} \boldsymbol{W}^{0}\right) \boldsymbol{W}^{1}\right)$$
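A direct transcription of this two-layer model might look as follows; this is a sketch, where `a_hat` is the normalized adjacency $\hat{\boldsymbol{A}}$ built as in Section 2.1 and the weight shapes are assumed.

```python
import torch

def cgnf_gcn(a_hat: torch.Tensor, x: torch.Tensor,
             w0: torch.Tensor, w1: torch.Tensor) -> torch.Tensor:
    """H = Softmax(A_hat ReLU(A_hat X W0) W1); each row of H is a class distribution."""
    hidden = torch.relu(a_hat @ x @ w0)               # first graph convolution
    return torch.softmax(a_hat @ hidden @ w1, dim=1)  # second convolution + softmax
```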
Then, to account for both node features and label dependencies, the authors define the following energy function:
$$E(\boldsymbol{Y}, \boldsymbol{X}, \boldsymbol{A})=E_{c}\left(\boldsymbol{Y}_{c}, \boldsymbol{X}_{c}, \boldsymbol{A}\right)=\sum_{i} \psi\left(\boldsymbol{y}_{i}, \boldsymbol{x}_{i}\right)+\gamma \sum_{(i, j) \in \mathcal{E}, i<j} \phi\left(\boldsymbol{y}_{i}, \boldsymbol{y}_{j}, A_{i, j}\right)$$
where $c$ denotes a clique and $\mathcal{E}$ the edge set. The unary potential $\psi(\cdot)$ measures the compatibility between an observed node $\boldsymbol{x}_i$ and a label $\boldsymbol{y}_i$ (that is, the probability that the observation $\boldsymbol{x}_i$ belongs to class $\boldsymbol{y}_i$), while the pairwise potential $\phi(\cdot)$ captures label correlations. From this energy function, a Gibbs distribution can be derived:
$$P(\boldsymbol{Y} \mid \boldsymbol{X}, \boldsymbol{A})=\frac{\exp (-E(\boldsymbol{Y}, \boldsymbol{X}, \boldsymbol{A}))}{\sum_{\boldsymbol{Y}^{\prime} \in \mathcal{Y}} \exp \left(-E\left(\boldsymbol{Y}^{\prime}, \boldsymbol{X}, \boldsymbol{A}\right)\right)}=\frac{\exp (-E(\boldsymbol{Y}, \boldsymbol{X}, \boldsymbol{A}))}{Z(\boldsymbol{X}, \boldsymbol{A})}$$
The authors' goal is to maximize this conditional probability. Substituting the GCN representations into the energy function gives:
$$\begin{aligned} E(\boldsymbol{Y}, \boldsymbol{X}, \boldsymbol{A}) &=\sum_{i} \psi\left(\boldsymbol{y}_{i}, \boldsymbol{h}_{i}\right)+\gamma \sum_{(i, j) \in \mathcal{E}, i<j} \phi\left(\boldsymbol{y}_{i}, \boldsymbol{y}_{j}, \hat{A}_{i, j}\right) \\ &=\sum_{i}\left(\psi\left(\boldsymbol{y}_{i}, \boldsymbol{h}_{i}\right)+\frac{\gamma}{2} \sum_{j \in N(i)} \phi\left(\boldsymbol{y}_{i}, \boldsymbol{y}_{j}, \hat{A}_{i, j}\right)\right) \end{aligned}$$
where $\boldsymbol{h}_i$ is the node representation produced by the two-layer GCN, $\hat{A}_{i, j}$ is an entry of the normalized adjacency matrix, and $N(i)$ is the neighborhood of node $i$. The two potential functions are computed as:
$$\begin{aligned} \psi\left(\boldsymbol{y}_{i}, \boldsymbol{h}_{i}\right) &=-\log p\left(\boldsymbol{y}_{i} \mid \boldsymbol{h}_{i}\right)=-\sum_{k} y_{i, k} \log h_{i, k} \\ \phi\left(\boldsymbol{y}_{i}, \boldsymbol{y}_{j}, \hat{A}_{i, j}\right) &=-2 \hat{A}_{i, j} U_{y_{i}, y_{j}} \end{aligned}$$
From the formulas above, $\psi\left(\boldsymbol{y}_{i}, \boldsymbol{h}_{i}\right)$ is simply a cross-entropy, and $U_{y_{i}, y_{j}} \in \boldsymbol{U}$ is a learnable correlation weight between labels $y_i$ and $y_j$. Following standard CRF practice, the authors use the negative log-likelihood as the training objective:
$$\begin{aligned} -\log P(\boldsymbol{Y} \mid \boldsymbol{X}, \boldsymbol{A}) &=E(\boldsymbol{Y}, \boldsymbol{X}, \boldsymbol{A})+\log Z(\boldsymbol{X}, \boldsymbol{A}) \\ &=E(\boldsymbol{Y}, \boldsymbol{X}, \boldsymbol{A})+\log \sum_{\boldsymbol{Y}^{\prime}} \exp \left(-E\left(\boldsymbol{Y}^{\prime}, \boldsymbol{X}, \boldsymbol{A}\right)\right) \end{aligned}$$
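Putting the two potentials together, a sketch of the energy computation might look like this (the tensor names are ours, and the code assumes one-hot labels and a symmetric $\boldsymbol{U}$). Note that the partition function $Z(\boldsymbol{X}, \boldsymbol{A})$ in the objective above sums over every possible joint labeling, which is what makes the exact objective expensive.

```python
import torch

def energy(y_onehot: torch.Tensor, h: torch.Tensor, a_hat: torch.Tensor,
           u: torch.Tensor, gamma: float) -> torch.Tensor:
    """E(Y, X, A) = sum_i psi(y_i, h_i) + gamma * sum_{(i,j) in E, i<j} phi(y_i, y_j, A_hat_ij).

    y_onehot: (n, k) one-hot labels; h: (n, k) GCN softmax outputs;
    a_hat: (n, n) normalized adjacency; u: (k, k) label-correlation matrix.
    """
    unary = -(y_onehot * torch.log(h + 1e-12)).sum()   # cross-entropy terms
    # (Y U Y^T)_ij = U[y_i, y_j] for one-hot rows; phi = -2 * A_hat_ij * U[y_i, y_j]
    pair = a_hat * (y_onehot @ u @ y_onehot.T)
    off_diag = pair.sum() - pair.diag().sum()          # sum over i != j
    pairwise = -off_diag                               # = -2 * sum_{i<j}, for symmetric U
    return unary + gamma * pairwise
```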
At inference time, it then suffices to solve $\min _{\boldsymbol{Y}} E(\boldsymbol{Y}, \boldsymbol{X}, \boldsymbol{A})$. The training objective above, however, is hard to optimize directly (the partition function sums over all possible label configurations), so the authors approximate the likelihood with the pseudo-likelihood:
$$P(\boldsymbol{Y} \mid \boldsymbol{X}, \boldsymbol{A}) \approx PL(\boldsymbol{Y} \mid \boldsymbol{X}, \boldsymbol{A})=\prod_{i} P\left(\boldsymbol{y}_{i} \mid \boldsymbol{y}_{N(i)}, \boldsymbol{X}, \boldsymbol{A}\right)$$
where:
$$\begin{aligned} P\left(\boldsymbol{y}_{i} \mid \boldsymbol{y}_{N(i)}, \boldsymbol{X}, \boldsymbol{A}\right) &=\frac{\exp \left(-\psi\left(\boldsymbol{y}_{i}, \boldsymbol{h}_{i}\right)-\gamma \sum_{j \in N(i)} \phi\left(\boldsymbol{y}_{i}, \boldsymbol{y}_{j}, \hat{A}_{i, j}\right)\right)}{\sum_{\boldsymbol{y}_{i}^{\prime}} \exp \left(-\psi\left(\boldsymbol{y}_{i}^{\prime}, \boldsymbol{h}_{i}\right)-\gamma \sum_{j \in N(i)} \phi\left(\boldsymbol{y}_{i}^{\prime}, \boldsymbol{y}_{j}, \hat{A}_{i, j}\right)\right)} \\ &=\frac{\exp \left(\log p\left(\boldsymbol{y}_{i} \mid \boldsymbol{h}_{i}\right)+2 \gamma \sum_{j \in N(i)} \hat{A}_{i, j} U_{y_{i}, y_{j}}\right)}{\sum_{\boldsymbol{y}_{i}^{\prime}} \exp \left(\log p\left(\boldsymbol{y}_{i}^{\prime} \mid \boldsymbol{h}_{i}\right)+2 \gamma \sum_{j \in N(i)} \hat{A}_{i, j} U_{y_{i}^{\prime}, y_{j}}\right)} \end{aligned}$$
Here $\boldsymbol{y}_{i}^{\prime}$ ranges over all possible labels of node $i$. The new training objective is therefore:
$$\begin{aligned} -\log PL(\boldsymbol{Y} \mid \boldsymbol{X}, \boldsymbol{A}) &=\sum_{i}-\log P\left(\boldsymbol{y}_{i} \mid \boldsymbol{y}_{N(i)}, \boldsymbol{X}, \boldsymbol{A}\right) \\ &=\sum_{i}\left(\psi\left(\boldsymbol{y}_{i}, \boldsymbol{h}_{i}\right)+\gamma \sum_{j \in N(i)} \phi\left(\boldsymbol{y}_{i}, \boldsymbol{y}_{j}, \hat{A}_{i, j}\right)+\log \sum_{\boldsymbol{y}_{i}^{\prime}} \exp \left(-\psi\left(\boldsymbol{y}_{i}^{\prime}, \boldsymbol{h}_{i}\right)-\gamma \sum_{j \in N(i)} \phi\left(\boldsymbol{y}_{i}^{\prime}, \boldsymbol{y}_{j}, \hat{A}_{i, j}\right)\right)\right) \\ &=-\sum_{i, k}(\boldsymbol{Y} \odot \log \boldsymbol{H})_{i, k}-2 \gamma \sum_{i \neq j}\left(\hat{\boldsymbol{A}} \odot\left(\boldsymbol{Y} \boldsymbol{U} \boldsymbol{Y}^{T}\right)\right)_{i, j}+\sum_{i} \log \sum_{k}(\boldsymbol{H} \odot \exp (2 \gamma \hat{\boldsymbol{A}} \boldsymbol{Y} \boldsymbol{U}))_{i, k} \end{aligned}$$
where $\odot$ denotes element-wise multiplication.
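The last line is fully vectorized, so it can be written down almost verbatim; below is a sketch under the same assumptions as before (one-hot `y`, our own tensor names):

```python
import torch

def pseudo_likelihood_loss(y: torch.Tensor, h: torch.Tensor, a_hat: torch.Tensor,
                           u: torch.Tensor, gamma: float) -> torch.Tensor:
    """-log PL(Y | X, A) in the matrix form given above."""
    term1 = -(y * torch.log(h + 1e-12)).sum()                 # -sum (Y ⊙ log H)
    pair = a_hat * (y @ u @ y.T)
    term2 = -2.0 * gamma * (pair.sum() - pair.diag().sum())   # -2γ sum_{i≠j} (Â ⊙ Y U Yᵀ)
    local = h * torch.exp(2.0 * gamma * (a_hat @ y @ u))      # H ⊙ exp(2γ Â Y U)
    term3 = torch.log(local.sum(dim=1) + 1e-12).sum()         # sum_i log sum_k (...)
    return term1 + term2 + term3
```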
3.2 Inference
As mentioned above, at inference time one only needs to optimize the following objective:
$$\min _{\hat{\boldsymbol{Y}}_{t e}} E\left(\hat{\boldsymbol{Y}}_{t e}, \boldsymbol{X}, \boldsymbol{A}, \boldsymbol{Y}_{t r}\right)=\min _{\hat{\boldsymbol{Y}}_{t e}}\left[-\log p\left(\hat{\boldsymbol{Y}}_{t e} \mid \boldsymbol{H}\right)-\gamma \sum_{i \neq j}\left(\hat{\boldsymbol{A}} \odot\left(\hat{\boldsymbol{Y}} \boldsymbol{U} \hat{\boldsymbol{Y}}^{T}\right)\right)_{i, j}\right]$$
where $\hat{\boldsymbol{Y}}=\operatorname{concatenate}\left(\boldsymbol{Y}_{t r}, \hat{\boldsymbol{Y}}_{t e}\right)$. The paper proposes two inference methods.
3.2.1 Inference method 1
The simplest inference method ignores the label correlations among the test nodes themselves and labels each node independently, i.e.:
$$\boldsymbol{y}_{i}=\underset{\boldsymbol{y}_{i}}{\arg \min }\; E\left(\boldsymbol{y}_{i}, \boldsymbol{Y}_{t r}, \boldsymbol{X}, \boldsymbol{A}\right)=\underset{j}{\arg \min }\left[-\log \left(\boldsymbol{h}_{i}\right)-2 \gamma \hat{\boldsymbol{A}}_{t r} \boldsymbol{Y} \boldsymbol{U}^{T}\right]_{j}$$
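A sketch of this decoder (assuming `y_known` holds one-hot rows for training nodes and zero rows for test nodes, which is our way of realizing the $\hat{\boldsymbol{A}}_{tr} \boldsymbol{Y}$ product):

```python
import torch

def infer_independent(h: torch.Tensor, a_hat: torch.Tensor, y_known: torch.Tensor,
                      u: torch.Tensor, gamma: float) -> torch.Tensor:
    """Method 1: per-node argmin of the energy, using only known neighbor labels."""
    scores = torch.log(h + 1e-12) + 2.0 * gamma * (a_hat @ y_known @ u.T)  # (n, k)
    return scores.argmax(dim=1)   # argmin of the energy == argmax of these scores
```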
3.2.2 Inference method 2
The second scheme searches for the optimal assignment in a dynamic-programming style. It randomly picks a test node as the starting point, randomly orders the remaining test nodes, and then runs beam search along that order (with beam size $K$, i.e., keeping the $K$ best partial label sets at each step). The process is repeated $T$ times, and the best result across all searches is selected. The paper summarizes this as pseudocode (the algorithm figure is not reproduced here); a rough sketch follows.
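This is our reading of the procedure, not the paper's code; it assumes `y_train` has one-hot rows for training nodes and zero rows for test nodes, and scores each step with the unary term plus the pairwise term against already-labeled neighbors.

```python
import torch

def infer_beam(h, a_hat, y_train, test_idx, u, gamma, beam=4, restarts=3):
    """Method 2 (sketch): repeated beam search over randomly ordered test nodes."""
    n, k = h.shape
    best_y, best_e = None, float("inf")
    for _ in range(restarts):                    # the paper repeats the search T times
        order = [test_idx[int(i)] for i in torch.randperm(len(test_idx))]
        beams = [(y_train.clone(), 0.0)]         # (labels assigned so far, energy)
        for node in order:
            candidates = []
            for y_cur, e_cur in beams:
                for c in range(k):               # try every class for this node
                    y_new = y_cur.clone()
                    y_new[node] = 0.0
                    y_new[node, c] = 1.0
                    e_unary = -torch.log(h[node, c] + 1e-12)
                    # phi with already-labeled neighbors: -2 * A_hat_ij * U[c, y_j]
                    e_pair = -2.0 * gamma * (a_hat[node] * (y_cur @ u[c])).sum()
                    candidates.append((y_new, e_cur + float(e_unary + e_pair)))
            candidates.sort(key=lambda t: t[1])  # lower energy is better
            beams = candidates[:beam]            # keep the K best partial labelings
        if beams[0][1] < best_e:
            best_y, best_e = beams[0]
    return best_y
```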
4. Experiments
The authors run experiments on four datasets, Cora, Pubmed, Citeseer, and PPI, and achieve good performance compared with the baselines (the result tables are not reproduced here).