
How powerful are spectral graph neural networks

2022-06-21 06:27:00 w55100

Preface

The title is a nod to the earlier famous paper "How Powerful are Graph Neural Networks?".
The paper establishes an equivalence between linear GNNs and 1-WL under certain conditions.
Out of personal interest, I won't say much about the model part.

How Powerful are Spectral Graph Neural Networks
https://arxiv.org/abs/2205.11172v1
ICML2022

1. Basic theory

Current theoretical research on spectral GNNs basically all drops the activation.
With an activation in the chain, many properties are hard to push through to a clean convergence-style result.
Of course, you might bring up that subspace paper by an author with a strong math background, which does carry the activation through its proofs.
But other works, whether they start from optimization or from polynomial filters, have essentially pushed the activation off the mainstream stage. (My understanding: for the sake of mathematical elegance.)
And note that even that paper only applies when the activation is ReLU; it cannot be extended to other activations.
So the attitude "keep the activation if it can be merged into the theory, otherwise discard it" has become a very natural one.

The starting point of this ICML 2022 paper is roughly here.
Since everyone's theory analyzes the model with the activation removed, why insist on keeping it in practice? (In the past this gap was explained away as the difference between theory and practice, papered over with idealized theoretical assumptions.)
Can we simply prove that a model without activation is already good enough, and drop it in practice as well?
That, as I understand it, is the starting point of the whole paper:
studying the upper limit of expressive power without activation.

Of course, the term the authors use is "linear GNN",
meaning a GNN with the nonlinear functions removed.
Let $g(\hat{L})$ denote an arbitrary spectral filter. Then
$$\text{linear GNN} = g(\hat{L})XW$$

As is well known, any linear GNN consists of two parts: a filter and a feature transformation.
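To make the notation concrete, here is a minimal numpy sketch of a linear GNN forward pass, with a polynomial filter standing in for an arbitrary $g(\hat{L})$. The toy graph, the filter coefficients `theta`, and all shapes are made up for illustration; this is not the paper's model.

```python
import numpy as np

# Toy undirected graph: 5 nodes, adjacency matrix A (made up for illustration).
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
N, d, k = 5, 3, 2

# Symmetric normalized Laplacian L_hat = I - D^{-1/2} A D^{-1/2}.
deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L_hat = np.eye(N) - D_inv_sqrt @ A @ D_inv_sqrt

# An arbitrary polynomial filter g(L_hat) = sum_i theta_i * L_hat^i.
theta = [0.5, -1.0, 0.25]                       # hypothetical coefficients
g_L = sum(t * np.linalg.matrix_power(L_hat, i) for i, t in enumerate(theta))

# linear GNN = g(L_hat) X W  (no activation anywhere).
rng = np.random.default_rng(0)
X = rng.normal(size=(N, d))                      # node features
W = rng.normal(size=(d, k))                      # feature transformation
Z = g_L @ X @ W                                  # output, shape (N, k)
print(Z.shape)                                   # (5, 2)
```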

Conditions

  1. one-dimensional prediction
  2. no multiple eigenvalues
  3. no missing frequency component

When the above 3 conditions are satisfied, a linear GNN can approximate any 1-dimensional prediction.
For example, for a binary classification problem, the output has out.shape = (N, 1).

The authors note that for multi-dimensional prediction you can use brute force:
one linear GNN per output dimension (each needing its own filter and transformation),
because a single linear GNN can fully express at most one dimension.

Section 4.1 of the original paper is a little messy.
Proposition 4.2 says that if $X$ does not have full row rank, then there always exist multi-dimensional predictions that a linear GNN cannot reach.
Only after reading the proof in the Appendix did I understand what they wanted to say…

Let me restate it in my own words.

2. Section 4.1: unpacking the multi-dimensional part

First, the feature matrix almost always fails to have full row rank: $\text{Shape}(X)=(N,d)$ and usually $N \gg d$, in which case $\text{RowRank}(X) \le d < N$.
Now suppose we use the framework $\text{linear GNN} = gXW$ to approximate a k-dimensional prediction with output shape $(N,k)$.
When $k=1$, the expressive power is sufficient.
When $k>1$, it is not.

That is why, for $k>1$, the paper proposes using $k$ separate linear GNNs to learn the $k$ dimensions individually (each with its own filter and transformation).

Let us first look at the situation in the spatial domain.

Consider $k=1$ first.

2.0 Proving that $k=1$ is sufficient

The proof in the original paper has an issue.

[Screenshot of the original proof omitted.]

The first step: does the fact that each $V_i$ is a proper subspace imply $\mathbb{R}^{d}-\cup_{i=1}^{n} V_{i} \ne \{0\}$?

Q: Is the union of $N$ proper subspaces still a proper subspace?
A: Not in general. But there is a basic theorem: a vector space cannot be written as a finite union of proper subspaces.

So the proof as written is wrong; let's treat it as a slip, since the conclusion still follows from that theorem.
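Whatever the status of that step, the $k=1$ statement itself is easy to check numerically with the obvious construction (a toy sketch under the no-repeated-eigenvalue / no-missing-component assumptions, with a random orthonormal basis standing in for the graph eigenbasis): pick $w$ so that $\hat{X}w$ has no zero entry, then read off the needed filter value at each frequency. Whether those $N$ diagonal values can be realized by a single polynomial of the eigenvalues is exactly where the distinct-eigenvalue condition comes in (polynomial interpolation at $N$ distinct points).

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 3

U, _ = np.linalg.qr(rng.normal(size=(N, N)))     # orthonormal eigenbasis stand-in
X = rng.normal(size=(N, d))                      # generic features: no missing frequency component
z_target = rng.normal(size=N)                    # an arbitrary 1-dimensional target

X_hat = U.T @ X
w = rng.normal(size=d)                           # generic w => (X_hat @ w) has no zero entry
g_diag = (U.T @ z_target) / (X_hat @ w)          # the filter value needed at each frequency

Z = U @ np.diag(g_diag) @ X_hat @ w              # g(L_hat) X w written in the eigenbasis U
print(np.allclose(Z, z_target))                  # True: any 1-d target is reachable
```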

Now back to $k>1$.

2.1 A spatial-domain property

Let $\text{RowRank}(X) = r$.
Pick $r$ rows of $X$ that form a maximal set of linearly independent row vectors; their row indices form the index set $\mathbb{I} = \{s_1,s_2,...,s_r\}$.
The matrix formed by these rows is $X_{\mathbb{I}} = \{X_{s_i}\} \in \mathbb{R}^{r\times d}$.
The original feature matrix can then be expressed in terms of these rows as $X = PX_{\mathbb{I}}$, with $P \in \mathbb{R}^{N\times r}$.

Suppose there exists a $W \in \mathbb{R}^{d\times k}$ that collapses $X_{\mathbb{I}}W$ to a row-rank-1 matrix whose columns are all identical, i.e. $\text{RowRank}(X_{\mathbb{I}}W)=1$ and every column of $X_{\mathbb{I}}W$ is the same.

Then clearly the linear GNN output
$$Z = gXW = g(PX_{\mathbb{I}})W = gP(X_{\mathbb{I}}W)$$
satisfies
$\text{RowRank}(Z)\le \text{RowRank}(X_{\mathbb{I}}W)=1$,
and if $Z \ne \mathbf{0}$, then $\text{RowRank}(Z) > 0$,
so $\text{RowRank}(Z)=1$.
Moreover, since left-multiplication by $gP$ acts on each of the identical columns of $X_{\mathbb{I}}W$ in the same way, every column of $Z$ is also the same.
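A quick numerical sanity check of this spatial-domain claim (toy sizes; $g$, $P$, $X_{\mathbb{I}}$ are random placeholders, and choosing $W$ with identical columns is just one easy way to force $X_{\mathbb{I}}W$ to have identical columns):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, r, k = 8, 5, 3, 4

X_I = rng.normal(size=(r, d))          # maximal linearly independent rows of X
P   = rng.normal(size=(N, r))          # X = P @ X_I
g   = rng.normal(size=(N, N))          # in the spatial domain g has no special structure

w = rng.normal(size=(d, 1))
W = np.tile(w, (1, k))                 # W with identical columns => X_I @ W has identical columns

Z = g @ P @ (X_I @ W)                  # linear GNN output Z = g X W with X = P X_I
print(np.allclose(Z, Z[:, [0]]))       # True: all columns of Z coincide
print(np.linalg.matrix_rank(X_I @ W))  # 1 (generically)
```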

This spatial-domain property by itself is not very strong, because in the spatial domain $g$ has no special structure.
In the frequency domain, however, it becomes much more powerful.

2.2 Frequency domain properties

Write the frequency-domain features as $\hat{X} = U^TX \in \mathbb{R}^{N\times d}$, with $W \in \mathbb{R}^{d\times k}$.
Similarly,
let $\text{RowRank}(\hat{X}) = r$.
Pick $r$ rows of $\hat{X}$ that form a maximal set of linearly independent row vectors; their row indices form the index set $\mathbb{I} = \{s_1,s_2,...,s_r\}$.
The matrix formed by these rows is $\hat{X}_{\mathbb{I}} = \{\hat{X}_{s_i}\} \in \mathbb{R}^{r\times d}$.
The original feature matrix can then be written as $\hat{X} = M\hat{X}_{\mathbb{I}}$, with $M \in \mathbb{R}^{N\times r}$.

Therefore
$$Z = g(\hat{L})XW = Ug(\Lambda)\hat{X}W = Ug(\Lambda)M\hat{X}_{\mathbb{I}}W$$

Left-multiplying by $U^T$ to work directly in the frequency domain makes the manipulation easier.
Write $\hat{Z} = U^TZ = g(\Lambda)M\hat{X}_{\mathbb{I}}W$.
Then
$$\hat{Z}_{\mathbb{I}} = (U^TZ)_{\mathbb{I}} = (g(\Lambda)M\hat{X}_{\mathbb{I}}W)_{\mathbb{I}} = (g(\Lambda)M)_{\mathbb{I}} \hat{X}_{\mathbb{I}}W$$

The special property of $g$ in the frequency domain is that $g(\Lambda)$ is a diagonal matrix. Therefore,

$$g(\Lambda)M = [g_{ii}M_i]_{i\in[N]},$$ where $g_{ii}$ is the $i$-th diagonal element of $g(\Lambda)$ and $M_i$ is the $i$-th row of $M$, and
$$(g(\Lambda)M)_{\mathbb{I}} = ([g_{ii}M_i]_{i\in[N]})_{\mathbb{I}} = g(\Lambda)_{\mathbb{I}\mathbb{I}}M_{\mathbb{I}}.$$
Considering that
$$\hat{X}_{\mathbb{I}} = (M\hat{X}_{\mathbb{I}})_{\mathbb{I}} = M_{\mathbb{I}}\hat{X}_{\mathbb{I}},$$
we can take $M_{\mathbb{I}} = E_{r}$, the $r\times r$ identity matrix.
So in the frequency domain
$$(g(\Lambda)M)_{\mathbb{I}} = g(\Lambda)_{\mathbb{I}\mathbb{I}}M_{\mathbb{I}} = g(\Lambda)_{\mathbb{I}\mathbb{I}}.$$

With the above properties we can further write
$$\hat{Z}_{\mathbb{I}} = g(\Lambda)_{\mathbb{I}\mathbb{I}}\hat{X}_{\mathbb{I}}W$$
and
$$\hat{Z} = g(\Lambda)M\hat{X}_{\mathbb{I}}W.$$

From the formulas above we see that
if $W$ makes $\text{RowRank}(\hat{X}_{\mathbb{I}}W)=1$,
we get a conclusion analogous to the spatial-domain one:
$\text{RowRank}(\hat{Z}_{\mathbb{I}})=1$ and
$\text{RowRank}(\hat{Z})=1$.

That is, when $g(\Lambda)_{\mathbb{I}\mathbb{I}}$ is invertible,
if all columns of $\hat{Z}_{\mathbb{I}}$ are identical, then all columns of $\hat{Z}$ are identical as well.

In other words, when $g(\Lambda)_{\mathbb{I}\mathbb{I}}$ is invertible,
no model can produce a $Z$ whose $\hat{Z}_{\mathbb{I}}$ has identical columns while the columns of $\hat{Z}$ differ.
So for $k>1$ predictions, such targets can never be fully learned by the model.

I will call this the column-equality property.
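A small numpy check of the column-equality property (toy sizes, random data; it only illustrates the forward implication by matching a column-equal target on the rows in $\mathbb{I}$ and watching the rest of the output get dragged along, not the full impossibility argument):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, r, k = 6, 4, 3, 2
I = np.arange(r)                                  # indices of the maximal independent rows

X_hat_I = rng.normal(size=(r, d))                 # frequency-domain features, rows I
M = np.vstack([np.eye(r), rng.normal(size=(N - r, r))])      # X_hat = M X_hat_I, with M_I = E_r
g_diag = rng.uniform(0.5, 2.0, size=N)            # diagonal g(Lambda), nonzero on I => invertible block

# Target: Z_hat_I has identical (nonzero) columns, but the remaining rows do not.
target = rng.normal(size=(N, k))
target[I] = np.tile(rng.uniform(1, 2, size=(r, 1)), (1, k))

# Best the model can do on rows I: solve g(Lambda)_II X_hat_I W = target_I for W.
W = np.linalg.lstsq(g_diag[I, None] * X_hat_I, target[I], rcond=None)[0]
Z_hat = g_diag[:, None] * (M @ X_hat_I @ W)       # the model's full frequency-domain output

print(np.allclose(Z_hat[I], target[I]))           # True: rows I are matched exactly
print(np.allclose(Z_hat, Z_hat[:, [0]]))          # True: but then ALL columns of Z_hat coincide
print(np.allclose(Z_hat, target))                 # False: the target itself is unreachable
```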

2.3 The boundary conditions

Note that an important premise of the above conclusion is that $g(\Lambda)_{\mathbb{I}\mathbb{I}}$ is invertible.
Can this premise be made to hold? Yes.
Our goal is only to prove that some such target $Z$ exists,
so we just need one counterexample.

Simply pick a target $\hat{Z}$ such that no element of $\hat{Z}_{\mathbb{I}}$ is 0.
Then, by $\hat{Z}_{\mathbb{I}} = g(\Lambda)_{\mathbb{I}\mathbb{I}}\hat{X}_{\mathbb{I}}W$, every diagonal element of $g(\Lambda)_{\mathbb{I}\mathbb{I}}$ must be nonzero, so the invertibility condition holds.

2.4 Does this conclusion still hold for classical spectral GNNs?

Q: If we add the activation back, can the model learn a $Z$ whose $\hat{Z}_{\mathbb{I}}$ has identical columns (and contains no 0) while the columns of $\hat{Z}$ differ?

Obviously, in the traditional setup with ReLU as the activation, many such $Z$ still cannot be learned.
For example, suppose we want
$$\hat{Z} = \begin{pmatrix} 1 & 1 \\ 2 & 2 \\ 5 & 6 \\ 7 & 8 \end{pmatrix},$$

with the first two rows serving as $\mathbb{I} = \{1,2\}$.
ReLU can never learn this.
Neither can tanh and the like.

Simply stacking more layers does not help either,
because the output of the final layer faces exactly the same problem.

So one can say this conclusion is quite illuminating for the expressive-power ceiling of traditional spectral GNNs.
Hmm, very elegant!

There are 2 ways out:
1. Abandon spectral filters and switch to spatial GNNs.
As derived above, the column-equality property comes precisely from $g$ being diagonal in the spectral domain.
Once the filter is no longer constrained to be diagonal, the property disappears automatically (see the sketch after this list).
2. Keep spectral layers in front and make the output layer non-spectral.
This may solve part of the problem.
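A quick illustration of point 1, using the same kind of column-equal target as in the sketch above: if the filter is an unconstrained (non-diagonal) matrix $F$ instead of $g(\Lambda)$, the target becomes representable. This is only the linear-algebra point about the diagonal constraint; it says nothing about whether a particular spatial architecture would actually learn such an $F$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, k = 6, 4, 2                                  # d >= k, so rank(X W) = k is achievable

X = rng.normal(size=(N, d))                        # features (not full row rank: rank <= d < N)
U, _ = np.linalg.qr(rng.normal(size=(N, N)))       # orthonormal "eigenbasis" for the toy example

# Target whose frequency-domain rows 0..2 have identical nonzero columns, the rest do not.
Z_hat = rng.normal(size=(N, k))
Z_hat[:3] = np.tile(rng.uniform(1, 2, size=(3, 1)), (1, k))
Z_target = U @ Z_hat

# With a dense learnable filter F (no diagonal constraint), the target is reachable:
W = rng.normal(size=(d, k))                        # any W with rank(X @ W) = k works generically
F = Z_target @ np.linalg.pinv(X @ W)               # solve F (X W) = Z_target
print(np.allclose(F @ X @ W, Z_target))            # True: the column-equality obstruction is gone
```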

2.5 Practical significance

Q: Fine, spectral GNNs cannot learn column-equal targets. What is this observation good for? Is there ever a case where the output needs to have locally equal columns?

For the output layer,
the only case I can think of is certain multi-label, multi-class tasks.
For example, a movie that is both a romance and a thriller (Blind Detective?).

For the hidden layers in the middle it is harder to say.
For some highly correlated dimension pairs, column equality may actually be a rather strong property.

2.6 Connection with some past work

The authors mention:

We can use individual polynomial coefficients for each output channel to solve this problem

I can't remember which paper did exactly this:
split $g$ apart and learn separate coefficients for each channel…
I only remember smiling when I read it…
And there seems to be more than one such paper.
Someday I'll remember to come back and fill in this hole.
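A sketch of the per-channel-coefficients idea (my own illustrative version of the sentence quoted above, not the paper's exact parameterization): each output channel $j$ gets its own polynomial filter $g_j(\hat{L})$ before the feature transformation.

```python
import numpy as np

def per_channel_spectral_gnn(L_hat, X, Theta, W):
    """Theta: (k, P) polynomial coefficients, one row per output channel.
    W: (d, k) feature transformation.
    Output column j is g_j(L_hat) @ X @ W[:, j], with g_j(L) = sum_p Theta[j, p] L^p."""
    k, P = Theta.shape
    powers = [np.linalg.matrix_power(L_hat, p) for p in range(P)]   # precompute L^p
    cols = []
    for j in range(k):
        g_j = sum(Theta[j, p] * powers[p] for p in range(P))        # channel-specific filter
        cols.append(g_j @ X @ W[:, [j]])
    return np.concatenate(cols, axis=1)

rng = np.random.default_rng(0)
N, d, k, P = 6, 4, 3, 3
L_hat = rng.normal(size=(N, N)); L_hat = (L_hat + L_hat.T) / 2      # symmetric stand-in for a Laplacian
X = rng.normal(size=(N, d))
Theta = rng.normal(size=(k, P))
W = rng.normal(size=(d, k))
print(per_channel_spectral_gnn(L_hat, X, Theta, W).shape)           # (6, 3)
```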

3. Sections 4.2 & 4.3

  • Multiple eigenvalues
  • Missing frequency components

These are easy to understand: if a frequency component is entirely missing from the features, no filter can recover it by rescaling.
This corresponds to the third condition.
The good news: in Appendix E the authors collect statistics over 10 benchmark datasets and show that these situations are very rare.

(Hasn't this been discussed long ago? I have the impression it has... Repeated eigenvalues rarely arise naturally on a large graph, and missing frequency components essentially do not occur for datasets with sufficiently rich features.)

On the contrary, the issues really worth exploring are the ones discussed in other earlier work.
We know a large graph has no repeated eigenvalues, but what we care about is how densely the eigenvalues are packed, i.e. the gaps
$$d_{p} = \lambda_{p+1}-\lambda_{p}.$$

I can't remember which paper reached the conclusion that whether this $d$ is above or below some threshold affects performance...
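A sketch of the kind of check one could run on a dataset (along the lines of the Appendix E statistics, though I have not reproduced their exact procedure): count (near-)repeated eigenvalues, look for frequency components missing from $X$, and compute the eigengaps $d_p$.

```python
import numpy as np

def spectral_condition_report(A, X, tol=1e-8):
    """A: adjacency matrix (N, N); X: node features (N, d)."""
    N = A.shape[0]
    deg = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    L_hat = np.eye(N) - D_inv_sqrt @ A @ D_inv_sqrt

    lam, U = np.linalg.eigh(L_hat)                    # eigenvalues (sorted) and eigenvectors
    gaps = np.diff(lam)                               # d_p = lambda_{p+1} - lambda_p
    n_repeated = int(np.sum(gaps < tol))              # (near-)repeated eigenvalues

    X_hat = U.T @ X                                   # frequency-domain features
    missing = int(np.sum(np.linalg.norm(X_hat, axis=1) < tol))   # frequency components absent from X
    return n_repeated, missing, gaps

# Toy example: a 4-cycle (which does have a repeated eigenvalue) with random features.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
X = np.random.default_rng(0).normal(size=(4, 3))
print(spectral_condition_report(A, X))
```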

4. Section 4.4

Proposition 4.3…

I don't know why they prove this again... I remember the GIN paper already proved it once:
GNNs under the message-passing framework are no more expressive than 1-WL.
Wouldn't citing that directly have been good enough?

Corollary 4.4: under the two conditions of no repeated eigenvalues and no missing frequency components, 1-WL can also distinguish all non-isomorphic nodes.

Theorem 4.5: for a graph without repeated eigenvalues, the order of any automorphism is at most 3.
That is, at most $g^3 = E$, where $E$ is the identity matrix;
the infinitely many higher powers just cycle: $g^4 = g^3 g = g$, $g^5 = g^2$, $g^6 = g^3 = E$,
so the group generated is $\{E, g, g^2\}$, of order 3.

Theorem 4.6: further, if there are simultaneously no repeated eigenvalues and no missing frequency components, there is no automorphism other than the identity.
The automorphism group is $\{E\}$, of order 1.

This means that, at this point, running 1-WL and running a linear GNN makes no difference:
both will output completely different labels for different nodes.
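For reference, a minimal color-refinement (1-WL) sketch, optionally seeded with initial node colors, just to make the algorithm referenced above concrete. The toy graph is chosen arbitrarily; this is a generic textbook implementation, not the paper's.

```python
def one_wl(adj_list, init_colors=None, max_iters=None):
    """Plain 1-WL / color refinement. adj_list: list of neighbor lists."""
    n = len(adj_list)
    colors = list(init_colors) if init_colors is not None else [0] * n
    max_iters = max_iters or n
    for _ in range(max_iters):
        # New color = compressed label of (own color, sorted multiset of neighbor colors).
        signatures = [(colors[v], tuple(sorted(colors[u] for u in adj_list[v])))
                      for v in range(n)]
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures)))}
        new_colors = [palette[sig] for sig in signatures]
        if new_colors == colors:          # stable partition reached
            break
        colors = new_colors
    return colors

# Toy graph: a path 0-1-2-3-4 (ends, their neighbors, and the middle get different colors).
adj = [[1], [0, 2], [1, 3], [2, 4], [3]]
print(one_wl(adj))                        # [0, 1, 2, 1, 0] up to relabeling
```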

Appendix C discusses the influence of random features.
To be continued…

5. Section 4.5: nonlinearity

Simply put,
nonlinearity can mix frequency components.

I remember an earlier conclusion was that nonlinearity can introduce high-frequency components.
(Q: how would one prove that?)
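A quick numerical illustration of the mixing claim (my own toy check, not the paper's argument, with a random orthonormal basis standing in for a graph eigenbasis): start from a signal concentrated on a single frequency, apply ReLU, and look at its spectrum again.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8

# Orthonormal eigenbasis U of some symmetric operator (stand-in for a graph Laplacian).
U, _ = np.linalg.qr(rng.normal(size=(N, N)))

x = U[:, 1] * 3.0                       # a signal concentrated on a single frequency component
x_hat = U.T @ x
relu_hat = U.T @ np.maximum(x, 0.0)     # spectrum after a pointwise ReLU

np.set_printoptions(precision=3, suppress=True)
print(x_hat)                            # ~[0, 3, 0, ...]: one nonzero component
print(relu_hat)                         # generically many nonzero components: frequencies got mixed
```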

Baseline

ChebyNet
SGC
APPNP
GNN-LF/HF
GPRGNN
ARMA
BernNet

Looking at this list, progress in this area can only be called very slow.
Only the big names manage to bring something in from other fields;
ordinary researchers can't even churn out filler papers here if they want to.
On average, about 1 to 1.5 new models come out per year.

Reference resources
https://www.mathcounterexamples.net/a-vector-space-written-as-finite-union-of-proper-subspaces/
