
Paper notes: the graph neural network GAT

2022-07-06 02:14:00 Min fan

Abstract: This post shares my understanding of the paper. See the original: Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, Yoshua Bengio, Graph attention networks, ICLR 2018, 1–12. It can be downloaded from arXiv: 1710.10903v3. Its influence is hard to overstate!

1. Contributions of the paper

  • Overcomes the shortcomings of existing graph convolution methods.
  • No time-consuming matrix operations (such as matrix inversion).
  • No need to know the graph structure in advance.
  • Applicable to both inductive and transductive problems.

2. Basic ideas

Using neighbor information, map the original attributes of the nodes in the graph to a new space, in order to support later learning tasks.
This idea is probably common to different graph neural networks.

3. Scheme

Table 1. Notations.

| Symbol | Meaning | Remarks |
| --- | --- | --- |
| $N$ | Number of nodes | |
| $F$ | Number of original features | |
| $F'$ | Number of new features | 4 in the example |
| $\mathbf{h}$ | Node feature set | $\{\overrightarrow{h}_1,\dots, \overrightarrow{h}_N \}$ |
| $\overrightarrow{h}_i$ | Features of the $i$-th node | Belongs to $\mathbb{R}^F$ |
| $\mathbf{h}'$ | New node feature set | $\{\overrightarrow{h}'_1,\dots, \overrightarrow{h}'_N \}$ |
| $\overrightarrow{h}'_i$ | New features of the $i$-th node | Belongs to $\mathbb{R}^{F'}$ |
| $\mathbf{W}$ | Feature mapping matrix | Belongs to $\mathbb{R}^{F \times F'}$, shared by all nodes |
| $\mathcal{N}_i$ | Neighborhood set of node $i$ | Includes $i$ itself; cardinality 6 in the example |
| $\overrightarrow{\mathbf{a}}$ | Feature weight vector | Belongs to $\mathbb{R}^{2F'}$, shared by all nodes, corresponds to a single-layer network |
| $\alpha_{ij}$ | Influence of node $j$ on node $i$ | The influences of all neighbor nodes sum to 1 |
| $\overrightarrow{\alpha}_{ij}$ | Influence vector of node $j$ on node $i$ | Of length $K$, corresponds to the multi-head case |

Map the node features to the new space, and use an attention mechanism $a$ to compute the relationship between nodes:
$$e_{ij} = a\big(\mathbf{W}\overrightarrow{h}_i, \mathbf{W}\overrightarrow{h}_j\big) \tag{1}$$
Here $e_{ij}$ is only computed when $j$ is a neighbor of $i$ in the graph.
Then apply softmax so that the weights corresponding to node $i$ sum to 1:
$$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})} \tag{2}$$
Because $a$ turns a column vector of length $2F'$ into a scalar, it can be written as a row vector of the same length, $\overrightarrow{\mathbf{a}}^{\mathrm{T}}$. With an activation function added, it can be realized by a single-layer neural network.

$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\overrightarrow{\mathbf{a}}^{\mathrm{T}}[\mathbf{W}\overrightarrow{h}_i \,\|\, \mathbf{W}\overrightarrow{h}_j]\right)\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(\mathrm{LeakyReLU}\left(\overrightarrow{\mathbf{a}}^{\mathrm{T}}[\mathbf{W}\overrightarrow{h}_i \,\|\, \mathbf{W}\overrightarrow{h}_k]\right)\right)} \tag{3}$$
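As a sanity check of equations (1)–(3), here is a minimal NumPy sketch (not the authors' code); the small graph `A`, the dimensions, and the variable names are made up for illustration, and the adjacency matrix includes self-loops so that $\mathcal{N}_i$ contains $i$ itself.

```python
import numpy as np

rng = np.random.default_rng(0)
N, F, F_prime = 5, 3, 4              # nodes, original features, new features
H = rng.normal(size=(N, F))          # row i holds h_i
W = rng.normal(size=(F, F_prime))    # shared mapping matrix W
a = rng.normal(size=2 * F_prime)     # attention vector a of length 2F'

# Adjacency with self-loops: N_i includes i itself.
A = np.array([[1, 1, 0, 0, 1],
              [1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0],
              [0, 0, 1, 1, 1],
              [1, 0, 0, 1, 1]], dtype=float)

def leaky_relu(x, slope=0.2):        # slope 0.2, as in the GAT paper
    return np.where(x > 0, x, slope * x)

Wh = H @ W                           # map every node into the F'-dimensional space
# e_ij = LeakyReLU(a^T [Wh_i || Wh_j]) splits into two dot products, Eqs. (1)/(3).
e = leaky_relu((Wh @ a[:F_prime])[:, None] + (Wh @ a[F_prime:])[None, :])
e = np.where(A > 0, e, -np.inf)      # only neighbors j in N_i contribute
# Row-wise softmax, Eq. (2): the weights for each node i sum to 1.
alpha = np.exp(e - e.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)
print(alpha.sum(axis=1))             # [1. 1. 1. 1. 1.]
```

Masking non-neighbors to $-\infty$ before the softmax is equivalent to summing only over $\mathcal{N}_i$ in equation (2).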

Figure 1. The core of GAT. Left: with $F' = 4$, the new space mapped to by $\mathbf{W}$ is 4-dimensional, and the concatenated input to the attention mechanism is correspondingly $2F' = 8$-dimensional; the vector $\overrightarrow{\mathbf{a}}$ is shared by all nodes. Right: $K = 3$ attention heads.

3.1 Scheme 1: Single head

$$\overrightarrow{h}'_i = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij} \mathbf{W} \overrightarrow{h}_j\right) \tag{4}$$
All neighbor nodes are first mapped to the new space (e.g., 4-dimensional), then a weighted sum is taken according to their influences, and a nonlinear activation such as sigmoid is applied; the result is a 4-dimensional vector.
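The aggregation in equation (4) is just a weighted sum, which for all nodes at once is a single matrix product. A minimal self-contained sketch (the `alpha` below are random stand-in weights normalized as in equation (2), only to demonstrate the shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
N, F_prime = 5, 4
Wh = rng.normal(size=(N, F_prime))         # W h_j for every node (already mapped)
alpha = rng.random(size=(N, N))            # stand-in attention weights
alpha /= alpha.sum(axis=1, keepdims=True)  # rows sum to 1, as in Eq. (2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Eq. (4): h'_i = sigma( sum_j alpha_ij * W h_j ), all nodes in one product.
H_new = sigmoid(alpha @ Wh)
print(H_new.shape)                         # (N, F') = (5, 4)
```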

3.2 Scheme 2: Multi-head concatenation

$$\overrightarrow{h}'_i = \big\Vert_{k = 1}^{K} \, \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha^k_{ij} \mathbf{W}^k \overrightarrow{h}_j\right) \tag{5}$$
The $K$ heads each produce their own new vector; the right side of Figure 1 shows $K = 3$ heads, so the final vector is $3 \times 4 = 12$-dimensional.
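Equation (5) runs $K$ independent heads and concatenates their outputs. A sketch under the same stand-in assumptions, with $K = 3$ and $F' = 4$ as in Figure 1, giving a 12-dimensional result:

```python
import numpy as np

rng = np.random.default_rng(0)
N, F, F_prime, K = 5, 3, 4, 3              # K = 3 heads, as on the right of Figure 1
H = rng.normal(size=(N, F))
W = rng.normal(size=(K, F, F_prime))       # one W^k per head
alpha = rng.random(size=(K, N, N))         # stand-in per-head weights
alpha /= alpha.sum(axis=2, keepdims=True)  # each head's rows sum to 1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Eq. (5): run each head separately, then concatenate along the feature axis.
heads = [sigmoid(alpha[k] @ (H @ W[k])) for k in range(K)]
H_new = np.concatenate(heads, axis=1)
print(H_new.shape)                         # (N, K * F') = (5, 12)
```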

3.3 Scheme 3: Multi-head averaging

$$\overrightarrow{h}'_i = \sigma \left(\frac{1}{K} \sum_{k = 1}^K \sum_{j \in \mathcal{N}_i} \alpha^k_{ij} \mathbf{W}^k \overrightarrow{h}_j\right) \tag{6}$$
Here the head outputs are simply averaged, so the final vector is 4-dimensional.
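Equation (6) keeps the same $K$ heads but averages them inside the nonlinearity, so the output stays $F'$-dimensional. The same sketch with the concatenation replaced by a mean:

```python
import numpy as np

rng = np.random.default_rng(0)
N, F, F_prime, K = 5, 3, 4, 3
H = rng.normal(size=(N, F))
W = rng.normal(size=(K, F, F_prime))
alpha = rng.random(size=(K, N, N))         # stand-in per-head weights
alpha /= alpha.sum(axis=2, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Eq. (6): average the K head outputs before the nonlinearity.
H_new = sigmoid(np.mean([alpha[k] @ (H @ W[k]) for k in range(K)], axis=0))
print(H_new.shape)                         # (N, F') = (5, 4)
```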

4. Questions

  • Question: How are $\mathbf{W}$ and $\overrightarrow{\mathbf{a}}$ learned here?
    Guess: From related work, i.e., the necessary knowledge is obtained within the graph neural network; this paper mainly wants to describe the different core techniques.
    If the output of this network is used as the input of another network (whose final output is class labels, etc.), the corresponding learning becomes possible.
    Tang Wentao's explanation: In essence this is equivalent to matrix multiplication (linear regression), as can be seen from the paper's code: in the training phase, the whole training set is fed in (the feature matrix and adjacency matrix of the samples), and $\mathbf{W}$ and $\overrightarrow{\mathbf{a}}$ produce the predicted labels of the training set (first compute each sample's self-attention weights with respect to all samples, then mask them with the adjacency matrix, then normalize the weights, forming one self-attention layer); the loss is then computed and back-propagated. A sketch of this is given after this list.

  • Question: Why use LeakyReLU when computing the influence, but sigmoid when computing the final feature vector?
    A forced explanation: the former only needs to differ from the latter (it is not strictly necessary), while the latter introduces nonlinearity (which is necessary).
    Tang Wentao's explanation: Using LeakyReLU for the influence puts more attention on the neighbor nodes that are more positively related to the target node.
    Using sigmoid for the final feature vector is probably meant to prevent the values from growing too large and affecting the next layer's learning, because the self-attention mechanism is relatively unstable (based on my earlier experiments) and places high demands on the range and density of the values (a small range such as 0–1, and denser values).
    Besides, from the source code released with the GAT paper, the authors use only two layers of self-attention for all data sets, and dropout is set to 0.5–0.8 throughout, which suggests it is fairly easy to overfit.
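To make the answers above concrete, here is a hypothetical PyTorch re-sketch of a two-layer, single-head GAT; the paper's released code is in TensorFlow, so the module below, its name `GATLayer`, and the random toy data are my own illustrative assumptions. $\mathbf{W}$ and $\overrightarrow{\mathbf{a}}$ are ordinary learnable parameters, the attention scores are masked with the adjacency matrix before the softmax, and the loss gradient flows back into both of them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """A single-head GAT layer: W and a are ordinary learnable parameters."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Parameter(torch.empty(in_dim, out_dim))
        self.a = nn.Parameter(torch.empty(2 * out_dim))
        nn.init.xavier_uniform_(self.W)
        nn.init.normal_(self.a, std=0.1)

    def forward(self, h, adj):
        Wh = h @ self.W                                     # (N, out_dim)
        d = Wh.size(1)
        e = F.leaky_relu((Wh @ self.a[:d]).unsqueeze(1)     # a^T [Wh_i || Wh_j]
                         + (Wh @ self.a[d:]).unsqueeze(0), negative_slope=0.2)
        e = e.masked_fill(adj == 0, float("-inf"))          # mask with the adjacency matrix
        alpha = torch.softmax(e, dim=1)                     # Eq. (2)
        return alpha @ Wh                                   # weighted sum, Eq. (4) before sigma

# Toy data: the whole feature matrix and adjacency matrix are fed in at once.
N, F_in, F_hid, n_class = 6, 8, 4, 2
x = torch.randn(N, F_in)
adj = (torch.rand(N, N) > 0.5).float()
adj.fill_diagonal_(1.0)                                     # N_i includes i itself
y = torch.randint(0, n_class, (N,))

layer1, layer2 = GATLayer(F_in, F_hid), GATLayer(F_hid, n_class)
opt = torch.optim.Adam(list(layer1.parameters()) + list(layer2.parameters()), lr=5e-3)

for step in range(50):
    h = F.elu(layer1(F.dropout(x, p=0.6, training=True), adj))
    logits = layer2(F.dropout(h, p=0.6, training=True), adj)
    loss = F.cross_entropy(logits, y)    # loss is computed and back-propagated into W and a
    opt.zero_grad()
    loss.backward()
    opt.step()
```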

5. Summary

  • Use $\mathbf{W}$ to linearly map the features into a new space.
  • Use $\overrightarrow{\mathbf{a}}$ to compute the influence $\alpha_{ij}$ of each neighbor. $\overrightarrow{\mathbf{a}}$ only acts on the corresponding attributes and is not affected by the number of neighbors. Computing $\alpha_{ij}$ involves the LeakyReLU activation function.
  • Use multiple heads to increase stability.
  • Taking the mean and applying a nonlinear activation do not change the dimension of the vector.
