[Six Papers on Scalable GNN] Around the WWW 2022 Best Paper PaSca
2022-07-25 13:09:00 【chad_ lee】
Scalable GNN
《Simplifying Graph Convolutional Networks》 ICML 2019
《SIGN: Scalable Inception Graph Neural Networks》 ICML 2020 workshop
《Simple Spectral Graph Convolution》 ICLR 2021
《Scalable and Adaptive Graph Neural Networks with Self-Label-Enhanced Training》 arXiv'21, once ranked first on the OGB leaderboard
《Graph Attention Multi-Layer Perceptron》 arXiv'21, by the same authors
《PaSca: a Graph Neural Architecture Search System under the Scalable Paradigm》 WWW 2022 best paper
Defects of existing GNNs
High Memory Cost
The feature propagation operation in GNNs (neighbor aggregation / message passing) involves multiplying a large sparse adjacency matrix with the feature matrix, which is time-consuming. To accelerate training with a GPU, the matrices must fit in GPU memory, and for large graphs this is infeasible. For example, the sparse adjacency matrix of the largest node-classification dataset in OGB exceeds 50 GB; at present only the 80 GB memory version of the Tesla A100 can hold it, whereas the computation can also run with 250 GB of host RAM, which is far easier to obtain. The only option left is distributed GNN training, which in turn brings heavy communication costs.
High Communication Cost
Every feature propagation in a GNN aggregates neighbor features. For a K-layer GNN, each node must aggregate the features of all neighbors within K hops, and the number of nodes involved grows exponentially with the number of layers, causing heavy communication overhead in distributed settings. For densely connected graphs, each node may need to aggregate information from almost the entire graph at every training step. Sampling algorithms (e.g., GraphSAGE) can alleviate this, but because of the recursive neighbor aggregation during training, they still incur high communication costs. The small example below illustrates how quickly a K-hop receptive field grows.
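A tiny self-contained illustration of this neighborhood explosion, assuming networkx is available; the graph and hop count are toy placeholders:

```python
# Count how many nodes fall inside one node's K-hop receptive field
# on a scale-free toy graph, to show why recursive K-hop aggregation
# touches a large fraction of the graph.
import networkx as nx

G = nx.barabasi_albert_graph(10_000, 5, seed=0)   # toy scale-free graph
node = 0
frontier, seen = {node}, {node}
for hop in range(1, 5):
    # expand the BFS frontier by one hop
    frontier = {v for u in frontier for v in G.neighbors(u)} - seen
    seen |= frontier
    print(f"{hop}-hop receptive field of node {node}: {len(seen)} nodes")
```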

As can be seen, the performance bottleneck is the communication overhead, so adding GPUs cannot effectively speed up training.
Inconsistent receptive-field expansion speed

Different nodes' receptive fields expand at different speeds, so different nodes also have different optimal receptive fields. Figure (a) shows test results for 20 randomly chosen nodes, obtained with SGC (which uses only the K-th step features): each configuration is run 100 times and the fraction of correct predictions is recorded. The prediction accuracy of different nodes changes differently as the number of feature propagation steps grows. Node 10, for example, reaches its best accuracy at 1 propagation step and then keeps degrading, while other nodes, such as node 4, keep improving as the number of propagation steps increases. Even when the trend is the same, the optimal number of propagation layers still differs across nodes. Therefore, when features from different propagation steps are combined by a weighted sum, different nodes should be given different weights.
Different tasks require separately designed GNN structures
Existing GNN work designs one model structure at a time; different task scenarios (datasets) may require domain experts to design task-specific architectures and training procedures, so the modeling cost is high.
Scalable GNN summary
The paper first summarizes the architectural paradigm of Scalable GNNs. Sampling-based methods are not Scalable GNNs, because these models still need to aggregate neighbor features during training, so the communication overhead in distributed settings cannot be avoided.
Model decoupling
Some work (SGC) points out that GNN performance comes mainly from the per-layer feature propagation operation rather than from the nonlinear transformation. This gave rise to a series of decoupled GNNs: they decouple feature propagation from nonlinear transformation, complete all feature propagation in advance during preprocessing, and thus spend no time on propagation during model training. In terms of scalability, as long as the sparse-dense matrix multiplications of the propagation step can be finished during preprocessing, the subsequent model training can easily run on a single machine with a single GPU. Because feature propagation happens in preprocessing, it needs no GPU acceleration and the matrix multiplications can be done on CPU alone, as sketched below.
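A minimal sketch of this preprocessing step (not the code of any of the papers): all propagation is done once as sparse-dense matrix multiplications on CPU, using a symmetrically normalized adjacency matrix; the random graph and feature sizes are placeholders.

```python
import numpy as np
import scipy.sparse as sp

def sym_norm_adj(adj: sp.spmatrix) -> sp.csr_matrix:
    """D^{-1/2} (A + I) D^{-1/2}, the propagation matrix commonly used by these models."""
    adj = adj + sp.eye(adj.shape[0])
    deg = np.asarray(adj.sum(axis=1)).flatten()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))
    return (d_inv_sqrt @ adj @ d_inv_sqrt).tocsr()

def precompute_propagation(adj: sp.spmatrix, X: np.ndarray, K: int):
    """Return [X, SX, S^2 X, ..., S^K X]; training never touches the graph afterwards."""
    S = sym_norm_adj(adj)
    feats = [X]
    for _ in range(K):
        feats.append(S @ feats[-1])   # sparse-dense matmul, CPU is enough
    return feats

# toy usage on a random graph
N, d, K = 1000, 32, 3
adj = sp.random(N, N, density=0.01, format="csr")
adj = adj + adj.T                     # make it symmetric
X = np.random.randn(N, d).astype(np.float32)
hop_feats = precompute_propagation(adj, X, K)   # K+1 matrices of shape (N, d)
```

After this step, training only ever sees dense feature matrices, so mini-batching and single-GPU (or CPU-only) training become trivial.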
However, these decoupled GNNs do not distinguish the specificity of different nodes during training. SIGN, S2GC, SAGN, and GAMLP take this node-level specificity into account.
SGC: use only the K-th step features
《Simplifying Graph Convolutional Networks》 ICML‘19

In the preprocessing stage, SGC uses the features obtained after K propagation steps (only the output of the K-th step) as the node input features: $\overline{\mathbf{X}} \leftarrow \mathbf{S}^{K} \mathbf{X}$, where $\mathbf{S}$ is the normalized adjacency matrix. In the training stage, only a single-layer MLP is trained as the classifier.
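A rough SGC-style sketch under these definitions (not the authors' implementation): `S`, `X`, `y`, and `train_idx` are assumed inputs, with `S` the normalized adjacency as a SciPy sparse matrix.

```python
import numpy as np
import torch
import torch.nn as nn

def sgc_features(S, X: np.ndarray, K: int) -> np.ndarray:
    for _ in range(K):                # X <- S^K X, done once in preprocessing
        X = S @ X
    return X

def train_sgc(X_bar: np.ndarray, y: np.ndarray, train_idx, n_classes: int, epochs: int = 100):
    clf = nn.Linear(X_bar.shape[1], n_classes)       # the whole "model": one linear layer
    opt = torch.optim.Adam(clf.parameters(), lr=0.2, weight_decay=1e-5)
    X_t = torch.tensor(X_bar[train_idx], dtype=torch.float32)
    y_t = torch.tensor(y[train_idx], dtype=torch.long)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(clf(X_t), y_t)
        loss.backward()
        opt.step()
    return clf
```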
SIGN: consider different receptive fields
《SIGN: Scalable Inception Graph Neural Networks》 ICML‘20 workshop
SGC uses only the features after K propagation steps as node input, which is clearly not optimal for all nodes. SIGN instead concatenates the features of all 0~K steps, preserving the features of every receptive field:
$$\begin{aligned} \mathbf{Z} &=\sigma\left(\left[\mathbf{X} \boldsymbol{\Theta}_{0}, \mathbf{A}_{1} \mathbf{X} \boldsymbol{\Theta}_{1}, \ldots, \mathbf{A}_{r} \mathbf{X} \boldsymbol{\Theta}_{r}\right]\right) \\ \mathbf{Y} &=\xi(\mathbf{Z} \boldsymbol{\Omega}) \end{aligned}$$
$\boldsymbol{\Theta}$ and $\boldsymbol{\Omega}$ are trainable parameters. The drawback is that the space cost of the concatenation grows in proportion to the number of propagation steps.
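A rough sketch of the SIGN-style input construction, reusing hop features precomputed as above; names are illustrative, and a single first linear layer on the concatenation subsumes the per-hop $\boldsymbol{\Theta}_k$ (block-diagonal case).

```python
import numpy as np
import torch.nn as nn

def sign_inputs(hop_feats):
    # (N, (K+1)*d): the space cost grows with the number of hops kept
    return np.concatenate(hop_feats, axis=1)

def sign_mlp(in_dim: int, hidden: int, n_classes: int) -> nn.Module:
    # a single linear layer over the concatenation generalizes per-hop Theta_k,
    # followed by a nonlinearity and the output map Omega
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, n_classes))
```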
S2GC: PPR + averaging
《Simple Spectral Graph Convolution》 ICLR'21

$$\hat{\mathbf{Y}}=\operatorname{softmax}\left(\frac{1}{K} \sum_{k=1}^{K}\left((1-\alpha) \tilde{\mathbf{T}}^{k} \mathbf{X}+\alpha \mathbf{X}\right) \mathbf{W}\right)$$
$\tilde{\mathbf{T}}$ is again the (normalized) adjacency matrix. S2GC effectively performs PPR-style propagation, keeping a fraction $\alpha$ of the original features (a restart) at every step, and then averages the features of all steps as the node's input features.
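A small sketch of this combination rule, computed once in preprocessing; `T` is assumed to be the normalized adjacency as a SciPy sparse matrix and the value of $\alpha$ is illustrative.

```python
import numpy as np

def s2gc_features(T, X: np.ndarray, K: int, alpha: float = 0.05) -> np.ndarray:
    # (1/K) * sum_{k=1..K} ((1 - alpha) * T^k X + alpha * X)
    out = np.zeros_like(X)
    TkX = X
    for _ in range(K):
        TkX = T @ TkX                        # T^k X, built incrementally
        out += (1.0 - alpha) * TkX + alpha * X
    return out / K                           # the classifier then sees softmax(out @ W)
```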
SAGN: attention-weighted sum over different receptive fields
《Scalable and Adaptive Graph Neural Networks with Self-Label-Enhanced training》 arxiv’21

SAGN replaces the concatenation in SIGN with an attention-weighted sum over the features of different receptive fields; later papers follow this paradigm.
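A generic sketch of such a per-node, attention-weighted combination over hop features; this follows the general recipe rather than SAGN's exact parametrization.

```python
import torch
import torch.nn as nn

class HopAttentionCombine(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.score = nn.Linear(d, 1)                  # one scalar score per hop feature

    def forward(self, hop_feats):                     # list of (N, d) tensors, hops 0..K
        H = torch.stack(hop_feats, dim=1)             # (N, K+1, d)
        w = torch.softmax(self.score(H).squeeze(-1), dim=1)   # (N, K+1) per-node weights
        return (w.unsqueeze(-1) * H).sum(dim=1)       # (N, d) combined feature
```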
GAMLP: designing the attention
《Graph Attention Multi-Layer Perceptron》 arxiv‘21

1. Smoothing Attention
The node feature after infinitely many propagation steps, $X_{i}^{(\infty)}$, is taken as the reference vector, against which the attention is computed:
$$\widetilde{\mathbf{X}}_{i}^{(l)}=\mathbf{X}_{i}^{(l)} \| \mathbf{X}_{i}^{(\infty)}, \quad \widetilde{w}_{i}(l)=\delta\left(\widetilde{\mathbf{X}}_{i}^{(l)} \cdot s\right), \quad w_{i}(l)=e^{\widetilde{w}_{i}(l)} / \sum_{k=0}^{K} e^{\widetilde{w}_{i}(k)}$$
where $s \in \mathbb{R}^{1 \times d}$ is a trainable parameter. A larger $w_{i}(l)$ means that node $v_{i}$'s $l$-step propagated feature is far from the infinite-propagation (over-smoothed) state and thus has a low risk of being noise; propagated features with larger $w_{i}(l)$ should contribute more to the feature combination.
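A sketch of these smoothing-attention weights; the choice of $\delta$ (sigmoid here) and the dimension convention (the concatenated feature has dimension $2d$, so $s$ is given $2d$ entries) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SmoothingAttention(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.s = nn.Parameter(torch.randn(2 * d))     # scores [X^(l) || X^(inf)]

    def forward(self, hop_feats, x_inf):              # hop_feats: list of (N, d); x_inf: (N, d)
        scores = []
        for X_l in hop_feats:
            cat = torch.cat([X_l, x_inf], dim=-1)     # X~_i^(l) = X_i^(l) || X_i^(inf)
            scores.append(torch.sigmoid(cat @ self.s))    # w~_i(l) = delta(X~_i^(l) . s)
        return torch.softmax(torch.stack(scores, dim=1), dim=1)   # (N, K+1): w_i(l)
```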
2. Recursive Attention
The information gain during feature propagation is computed recursively:
$$\widetilde{\mathbf{X}}_{i}^{(l)}=\mathbf{X}_{i}^{(l)} \| \sum_{k=0}^{l-1} w_{i}(k) \mathbf{X}_{i}^{(k)}, \quad \widetilde{w}_{i}(l)=\delta\left(\widetilde{\mathbf{X}}_{i}^{(l)} \cdot s\right), \quad w_{i}(l)=e^{\widetilde{w}_{i}(l)} / \sum_{k=0}^{K} e^{\widetilde{w}_{i}(k)}$$
A larger $w_{i}(l)$ means that feature $\mathbf{X}_{i}^{(l)}$ is more important to $v_{i}$'s current state (it contributes a larger share than the features already combined from the previous steps).
3. JK Attention

The features of all steps are fed into an MLP, whose output serves as the reference vector for computing the attention:
$$\widetilde{\mathbf{X}}_{i}^{(l)}=\mathbf{X}_{i}^{(l)} \| \mathbf{E}_{i}, \quad \widetilde{w}_{i}(l)=\delta\left(\widetilde{\mathbf{X}}_{i}^{(l)} \cdot s\right), \quad w_{i}(l)=e^{\widetilde{w}_{i}(l)} / \sum_{k=0}^{K} e^{\widetilde{w}_{i}(k)}, \qquad \mathbf{E}_{i}=\operatorname{MLP}\left(\mathbf{X}_{i}^{(1)} \| \mathbf{X}_{i}^{(2)} \| \ldots \| \mathbf{X}_{i}^{(K)}\right)$$
Scalable GNN Model design space
《PaSca: a Graph Neural Architecture Search System under the Scalable Paradigm》 WWW 2022 best paper
The Scalable GNNs above are each a specific model architecture. From these works, a Scalable GNN model design space can be summarized; one can then define search objectives (accuracy and speed) and search this space for the Scalable GNN that fits the current task scenario.

The model design of a Scalable GNN can be divided into three stages:
1. Preprocessing stage
Aggregation step ($K_{pre}$): the maximum number of propagation (graph convolution) steps used to produce features.
Graph aggregator ($GA_{pre}$): the propagation matrix used for neighbor aggregation, e.g., the normalized adjacency matrix A, PPR, etc.
Among the models described above, SGC, SIGN, SAGN, and GAMLP use A; S2GC uses PPR.
2. Model training stage
Message aggregator ($MA$): how the precomputed k-step features are used. None keeps only the last step (SGC); Mean is S2GC; Concatenate is SIGN; Weighted is SAGN; Adaptive is GAMLP. Using only a single step is suboptimal, because different nodes have different optimal receptive fields.
Transformation step ($K_{trans}$): the number of MLP layers.
3. Post-processing stage
$K_{post}$ and $GA_{post}$ here mirror $K_{pre}$ and $GA_{pre}$ of the preprocessing stage; their purpose is label propagation: the class probability distribution predicted by the model is propagated a few more steps, similar to APPNP, except that propagation is applied directly to the output of the already trained model. A sketch is given below.
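A small sketch of this post-processing step; the restart weight $\alpha$ and step count are illustrative choices, `S` is the normalized adjacency (SciPy sparse), and `probs` is the (N, C) softmax output of the trained model.

```python
import numpy as np

def post_label_propagation(S, probs: np.ndarray, K_post: int = 3, alpha: float = 0.1) -> np.ndarray:
    # APPNP-style propagation applied after training, not end-to-end:
    # repeatedly smooth the predictions over the graph while keeping a
    # share of the model's original prediction.
    out = probs
    for _ in range(K_post):
        out = (1 - alpha) * (S @ out) + alpha * probs
    return out
```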

The whole model design space contains roughly 180,000 architectures, and the Scalable GNNs summarized above are concrete instances of this space, for example encoded as follows:
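One way to picture the design space, as a sketch rather than PaSca's actual API: each architecture is a point defined by the six fields above, and the models discussed earlier become concrete points (the K values below are illustrative).

```python
from dataclasses import dataclass

@dataclass
class ScalableGnnDesign:
    K_pre: int          # aggregation steps in preprocessing
    GA_pre: str         # graph aggregator: "A" (normalized adjacency), "PPR", ...
    MA: str             # message aggregator: "last", "mean", "concat", "weighted", "adaptive"
    K_trans: int        # MLP depth in the training stage
    K_post: int         # label-propagation steps in post-processing
    GA_post: str        # graph aggregator used for label propagation

# The models above as points of the space (K values are illustrative):
SGC   = ScalableGnnDesign(K_pre=2, GA_pre="A",   MA="last",     K_trans=1, K_post=0, GA_post="A")
SIGN  = ScalableGnnDesign(K_pre=3, GA_pre="A",   MA="concat",   K_trans=2, K_post=0, GA_post="A")
S2GC  = ScalableGnnDesign(K_pre=3, GA_pre="PPR", MA="mean",     K_trans=1, K_post=0, GA_post="A")
SAGN  = ScalableGnnDesign(K_pre=3, GA_pre="A",   MA="weighted", K_trans=2, K_post=0, GA_post="A")
GAMLP = ScalableGnnDesign(K_pre=3, GA_pre="A",   MA="adaptive", K_trans=2, K_post=0, GA_post="A")
```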
Automatic model architecture search system (automatic hyperparameter search)

This part of the paper is quite short. The system is split into two engines: a search engine and an evaluation engine.
The search engine recommends model architectures via Bayesian optimization. It first defines the optimization objectives, accuracy and training speed, then models the relationship between the architectures observed so far and these objectives (i.e., tries to fit a mapping), and returns an architecture that balances the multiple objectives for the evaluation engine to verify.
The evaluation engine trains the proposed model and returns the results to the search engine, which thereby gains one more observation. The loop looks roughly like the skeleton below.
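A skeleton of that loop; real PaSca uses multi-objective Bayesian optimization, whereas the proposal step here is a random stand-in just to show the interaction, and `train_and_evaluate` is a hypothetical callback returning (accuracy, training cost).

```python
import random

def propose_design(history):
    # Stand-in for the BO surrogate: a real system would model
    # design -> (accuracy, training time) from `history` and suggest a
    # Pareto-promising design; here we sample the space uniformly.
    return {
        "K_pre": random.randint(1, 10),
        "GA_pre": random.choice(["A", "PPR"]),
        "MA": random.choice(["last", "mean", "concat", "weighted", "adaptive"]),
        "K_trans": random.randint(1, 3),
        "K_post": random.randint(0, 10),
    }

def search(train_and_evaluate, n_trials: int = 50):
    history = []                                   # observed (design, objectives) pairs
    for _ in range(n_trials):
        design = propose_design(history)           # search engine: recommend a design
        acc, cost = train_and_evaluate(design)     # evaluation engine: train & measure
        history.append((design, (acc, cost)))      # one more observation for the surrogate
    return history
```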
Experimental results
Communication overhead

Model accuracy and efficiency:

The models found by PaSca strike a better balance between training efficiency and model accuracy. As the right figure shows, non-decoupled GCNs have no significant accuracy advantage over decoupled ones, yet are more than 100x less efficient.

Scalable GNNs do not suffer from over-smoothing: the output of deep propagation is only one of the input features, and the attention mechanism is free to attend to it or ignore it.