Recommendation System (9): PNN Model (Product-based Neural Networks)

Recommendation system series blogs:

  1. Recommendation system (1): Overall overview of recommendation systems
  2. Recommendation system (2): GBDT+LR model
  3. Recommendation system (3): Factorization Machines (FM)
  4. Recommendation system (4): Field-aware Factorization Machines (FFM)
  5. Recommendation system (5): Wide & Deep
  6. Recommendation system (6): Deep & Cross Network (DCN)
  7. Recommendation system (7): xDeepFM model
  8. Recommendation system (8): FNN model (FM+MLP=FNN)

Like FNN, which was introduced in the previous post, the PNN model (Product-based Neural Networks) comes from Weinan Zhang of Shanghai Jiao Tong University and his collaborators. The paper was published at ICDM 2016, a CCF-B conference. Personally, I have not heard of any company in industry applying this model in its own scenario, but the paper is still worth reading and may offer some reference for your own business models. From the name Product-based Neural Networks alone, we can basically tell that PNN achieves feature interaction by introducing products (either inner products or outer products).

This post introduces the motivation and the model details.

1. Motivation

The paper mainly improves on FNN, which was introduced in the previous post. FNN has two main shortcomings:

  1. The quality of the DNN's embedding layer is limited by the quality of the pre-trained FM.
  2. FM crosses features through dot products of latent vectors, but once the FM-pretrained embedding vectors are fed into the MLP, its fully connected layers essentially compute linear weighted sums over features, i.e., "add" operations. This is somewhat inconsistent with FM. Moreover, interactions between different fields are also modeled only by linear weighted summation in the MLP.

The original wording in the paper is:

  1. the quality of embedding initialization is largely limited by the factorization machine.
  2. More importantly, the “add” operations of the perceptron layer might not be useful to explore the interactions of categorical data in multiple fields. Previous work [1], [6] has shown that local dependencies between features from different fields can be effectively explored by feature vector “product” operations instead of “add” operations.

Actually, I think FNN has another big limitation: it is a two-stage model rather than a data-driven, end-to-end model, so FNN still carries a strong flavor of traditional machine learning.

2. PNN model details

2.1 Overall structure of the PNN model

The network structure of PNN is shown in the figure below (taken from the original paper):

[Figure: PNN model architecture, from the original paper]
The core of the PNN model is the product layer, i.e., the layer labelled "product layer pair-wisely connected" in the figure above. Unfortunately, the figure in the original paper hides the most important information: the first impression it gives is that the product layer consists of $z$ and $p$, which are then fed directly into the $L_1$ layer. That is not the case. What the paper actually describes is that $z$ and $p$ each go through their own fully connected transformation inside the product layer, mapping $z$ and $p$ separately into $D_1$-dimensional vectors $l_z$ and $l_p$ ($D_1$ is the number of neurons in the first hidden layer); then $l_z$ and $l_p$ are summed and fed into the $L_1$ layer.
Therefore, I have redrawn the product layer to show the PNN model structure in more detail, as in the figure below:

[Figure: detailed PNN structure with the product layer expanded into $z$, $p$, $l_z$ and $l_p$]
The figure above makes the overall structure and the core details of PNN fairly clear. Let's go through each layer from the bottom up (a minimal code sketch of the forward pass follows this list):

  • Input layer
    The raw features; nothing special here. In the paper there are $N$ features (fields).
  • embedding layer
    An ordinary embedding layer; each feature is embedded into an $M$-dimensional vector.
  • product layer
    The core of the whole paper. It has two parts, $z$ and $p$, both built from the embedding layer.
    【1. About $z$】
    $z$ is the linear signal vector: it simply copies the embedding vectors of the $N$ features (the paper writes this as multiplication by the constant 1). The paper gives the formal definition:
    z = (z_1, z_2, z_3, \ldots, z_N) \triangleq (f_1, f_2, f_3, \ldots, f_N) \tag{1}
    where $\triangleq$ denotes the identity mapping, so the dimension of $z$ is $N \times M$.
    【2. About $p$】
    $p$ is where the real feature crossing happens. For how $p$ is generated, the paper gives the formal definition $p_{i,j} = g(f_i, f_j)$, where $g$ can in theory be any mapping function. The paper gives two choices, the inner product and the outer product, corresponding to IPNN and OPNN. So:
    p = \{p_{i,j}\}, \quad i = 1, 2, \ldots, N; \; j = 1, 2, \ldots, N \tag{2}
    The shape of $p$ depends on $g$: with the inner product each $p_{i,j}$ is a scalar, so $p$ is an $N \times N$ matrix; with the outer product each $p_{i,j}$ is an $M \times M$ matrix.
    【3. About $l_z$】
    Given $z$, we pass it through a fully connected mapping (the blue lines of the product layer in the figure above) and finally obtain a $D_1$-dimensional vector ($D_1$ is the number of hidden units). Formally:
    l_z = (l_z^1, l_z^2, \ldots, l_z^k, \ldots, l_z^{D_1}), \qquad l_z^k = W_z^k \odot z \tag{3}
    where $\odot$ denotes the generalized inner product, defined as $A \odot B \triangleq \sum_{i,j} A_{ij} B_{ij}$. Since $z$ has shape $N \times M$, each $W_z^k$ also has shape $N \times M$.
    【4. About $l_p$】
    Same as $l_z$: $p$ is passed through a fully connected mapping (the green lines of the product layer in the figure above) and finally mapped to a $D_1$-dimensional vector ($D_1$ is the number of hidden units). Formally:
    l_p = (l_p^1, l_p^2, \ldots, l_p^k, \ldots, l_p^{D_1}), \qquad l_p^k = W_p^k \odot p \tag{4}
    Each $W_p^k$ has the same shape as $p$ (for the inner-product version, $N \times N$).
  • Fully connected layers
    These are two ordinary fully connected layers; the formulas are simple and clear:
    l_1 = \mathrm{relu}(l_z + l_p + b_1) \tag{5}
    l_2 = \mathrm{relu}(W_2 l_1 + b_2) \tag{6}
  • Output layer
    Since the scenario is CTR prediction, this is binary classification, so the activation is simply a sigmoid:
    \hat{y} = \sigma(W_3 l_2 + b_3) \tag{7}
  • Loss function
    Cross entropy:
    L(y, \hat{y}) = -y \log \hat{y} - (1 - y) \log(1 - \hat{y}) \tag{8}
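To make the layer-by-layer description concrete, below is a minimal PyTorch sketch of the forward pass (my own illustrative code, not the authors' implementation). The field count, embedding size $M$, and hidden sizes $D_1$, $D_2$ are placeholder hyperparameters, and the cross part already uses the decomposed IPNN weights explained in section 2.2:

```python
import torch
import torch.nn as nn

class PNN(nn.Module):
    """Minimal IPNN-style sketch: embedding -> product layer (l_z + l_p) -> MLP -> sigmoid."""
    def __init__(self, field_dims, M=8, D1=32, D2=16):
        super().__init__()
        self.N = len(field_dims)                                     # number of fields N
        self.embeddings = nn.ModuleList(
            [nn.Embedding(v, M) for v in field_dims])                # one embedding table per field
        self.W_z = nn.Parameter(torch.randn(D1, self.N, M) * 0.01)   # D1 matrices of shape N x M (eq. 3)
        self.theta = nn.Parameter(torch.randn(D1, self.N) * 0.01)    # decomposed W_p^k = theta^k (theta^k)^T
        self.b1 = nn.Parameter(torch.zeros(D1))
        self.fc2 = nn.Linear(D1, D2)
        self.out = nn.Linear(D2, 1)

    def forward(self, x):                                            # x: LongTensor (batch, N) of feature ids
        f = torch.stack([emb(x[:, i]) for i, emb in enumerate(self.embeddings)], dim=1)  # (batch, N, M)
        l_z = torch.einsum('bnm,dnm->bd', f, self.W_z)               # l_z^k = W_z^k ⊙ z  (eq. 3)
        delta = torch.einsum('bnm,dn->bdm', f, self.theta)           # sum_i theta_i^k f_i (section 2.2)
        l_p = (delta ** 2).sum(dim=-1)                               # squared norm -> l_p^k
        l1 = torch.relu(l_z + l_p + self.b1)                         # eq. 5
        l2 = torch.relu(self.fc2(l1))                                # eq. 6
        return torch.sigmoid(self.out(l2)).squeeze(-1)               # eq. 7

# toy usage: 3 fields with vocabulary sizes 10, 20, 30
model = PNN([10, 20, 30])
y_hat = model(torch.randint(0, 10, (4, 3)))                          # 4 samples -> 4 predicted CTRs
```

Training would simply pair this with nn.BCELoss, matching the cross-entropy loss in eq. (8).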

2.2 IPNN

In the product layer, the feature embedding vectors are crossed pairwise; in theory any operation could be used. The paper gives two options, the inner product and the outer product, corresponding to IPNN and OPNN respectively. In terms of complexity, IPNN is cheaper than OPNN, so if you plan to implement this in industry, don't bother with OPNN; accordingly, I will describe IPNN in detail.
IPNN means the product layer uses the inner product. From the definition of the inner product given above, the inner product of two vectors is a scalar. Formally:
g(f_i, f_j) = \langle f_i, f_j \rangle \tag{9}
Let's analyze the time complexity of IPNN. First fix the dimensions of the variables involved: the embedding dimension is $M$, the number of features is $N$, and $l_p$ and $l_z$ both have dimension $D_1$. Since the features are crossed pairwise, $p$ has size $N \times N$, so building $p$ costs $N \cdot N \cdot M$ operations, and mapping $p$ to $l_p$ costs another $N \cdot N \cdot D_1$. The overall time complexity of the product layer in IPNN is therefore $O(N \cdot N \cdot (D_1 + M))$, which is quite high and needs to be optimized. The paper uses a matrix decomposition trick to reduce it to $O(D_1 \cdot M \cdot N)$. Let's see how the paper does it:

Because the feature crossing is pairwise, $p$ is obviously an $N \times N$ symmetric matrix, and $W_p^k$ is likewise an $N \times N$ symmetric matrix. The paper approximates $W_p^k$ with a rank-one (first-order) decomposition into the product of an $N$-dimensional vector with itself: assume $W_p^k = \theta^k (\theta^k)^T$, where $\theta^k \in R^N$.

Therefore, we have:
W_p^k \odot p = \sum_{i=1}^N \sum_{j=1}^N \theta_i^k \theta_j^k \langle f_i, f_j \rangle = \Big\langle \sum_{i=1}^N \theta_i^k f_i, \sum_{j=1}^N \theta_j^k f_j \Big\rangle \tag{10}
If we write $\delta_i^k = \theta_i^k f_i$, then $\delta_i^k \in R^M$ and $\delta^k = (\delta_1^k, \delta_2^k, \ldots, \delta_N^k) \in R^{N \times M}$.
Therefore $l_p$ becomes:
l_p = \Big( \big\|\sum_i \delta_i^1\big\|^2, \ldots, \big\|\sum_i \delta_i^{D_1}\big\|^2 \Big) \tag{11}
After this string of formulas, I suspect fewer than 10% of readers are still with me; it is rather obscure. The simplest way to make it clear is an example, so let's go straight to one. Suppose a sample has 3 features, and each feature's embedding dimension is 2, i.e., $N = 3$, $M = 2$. Then we have the following sample:

| feature (field) | embedding vector |
| --- | --- |
| $f_1$ | [1, 2] |
| $f_2$ | [3, 4] |
| $f_3$ | [5, 6] |

Written as a matrix:
\begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix} \tag{12}
Then define $W_p^1$ as:
W_p^1 = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 6 \\ 3 & 6 & 9 \end{bmatrix} \tag{13}
Let's compute $l_p^1$ both directly (complexity $O(N \cdot N \cdot (D_1 + M))$) and via the decomposition, and compare; the comparison reveals the trick.

  • Before decomposition
    Take pairwise inner products of $(f_1, f_2, f_3)$ directly to obtain the matrix $p$. The computation is:
    p = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix} \cdot \begin{bmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \end{bmatrix} = \begin{bmatrix} 5 & 11 & 17 \\ 11 & 25 & 39 \\ 17 & 39 & 61 \end{bmatrix} \tag{14}
    where $p_{ij} = \langle f_i, f_j \rangle$.
    With $p$, we can compute $l_p^1$, i.e., the value of the first element of $l_p$ (the green part of the product layer in the detailed figure above):
    W_p^1 \odot p = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 6 \\ 3 & 6 & 9 \end{bmatrix} \odot \begin{bmatrix} 5 & 11 & 17 \\ 11 & 25 & 39 \\ 17 & 39 & 61 \end{bmatrix} = \mathrm{sum}\Big( \begin{bmatrix} 5 & 22 & 51 \\ 22 & 100 & 234 \\ 51 & 234 & 549 \end{bmatrix} \Big) = 1268 \tag{15}
  • After decomposition
    $W_p^1$ can obviously be decomposed as:
    W_p^1 = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 6 \\ 3 & 6 & 9 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \begin{bmatrix} 1 & 2 & 3 \end{bmatrix} = \theta^1 (\theta^1)^T \tag{16}
    Then, according to the formula $\delta_i^1 = \theta_i^1 f_i$, we compute each $\delta_i^1$, as in the table below:
| feature (field) | embedding vector | $\theta_i^1$ | $\delta_i^1$ |
| --- | --- | --- | --- |
| $f_1$ | [1, 2] | 1 | [1, 2] |
| $f_2$ | [3, 4] | 2 | [6, 8] |
| $f_3$ | [5, 6] | 3 | [15, 18] |

Since we still need to compute $\sum_{i=1}^N \delta_i^1$, we finally obtain the vector $\delta^1$:
\delta^1 = \sum_{i=1}^N \delta_i^1 = \begin{bmatrix} 22 & 28 \end{bmatrix} \tag{17}
And finally:
\langle \delta^1, \delta^1 \rangle = \delta^1 \odot \delta^1 = \begin{bmatrix} 22 & 28 \end{bmatrix} \odot \begin{bmatrix} 22 & 28 \end{bmatrix} = 1268 \tag{18}

So in the end we find that the value of the first element of $l_p$ is the same before and after the decomposition, while the time complexity after the decomposition is only $O(D_1 \cdot M \cdot N)$.
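To double-check the arithmetic in the toy example, here is a small NumPy snippet (purely illustrative) that computes $W_p^1 \odot p$ both before and after the decomposition:

```python
import numpy as np

F = np.array([[1, 2], [3, 4], [5, 6]])      # embedding matrix: N=3 features, M=2
theta1 = np.array([1, 2, 3])                # theta^1, so W_p^1 = theta1 theta1^T

# before decomposition: build p explicitly and contract it with W_p^1
p = F @ F.T                                 # p[i, j] = <f_i, f_j>, eq. (14)
W_p1 = np.outer(theta1, theta1)             # eq. (13)/(16)
before = np.sum(W_p1 * p)                   # eq. (15)

# after decomposition: delta_i^1 = theta1_i * f_i, then squared norm of the sum
delta1 = (theta1[:, None] * F).sum(axis=0)  # [22, 28], eq. (17)
after = delta1 @ delta1                     # eq. (18)

print(before, after)                        # both print 1268
```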

Therefore, when implementing IPNN, the parameter shapes of the linear part and the cross part are as follows (a small shape-check sketch follows this list):

  • The linear weight tensor $W_z$ has size $D_1 \times N \times M$.
  • The decomposed cross weights $W_p$ (i.e., the $\theta$ vectors) have size $D_1 \times N$.
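A tiny sketch of these parameter shapes (with placeholder sizes), also showing the saving from the decomposition: materializing the full $W_p$ would take $D_1 \times N \times N$ entries, while the decomposed form stores only $D_1 \times N$:

```python
import torch

N, M, D1 = 10, 8, 32                                   # placeholder sizes: fields, embedding dim, D_1

W_z = torch.randn(D1, N, M)                            # linear part: one N x M matrix per output unit
theta = torch.randn(D1, N)                             # cross part after decomposition
W_p_full = torch.einsum('dn,dm->dnm', theta, theta)    # W_p^k = theta^k (theta^k)^T, if materialized

print(W_z.shape, theta.shape, W_p_full.shape)          # (32, 10, 8), (32, 10), (32, 10, 10)
print(theta.numel(), W_p_full.numel())                 # 320 vs 3200 parameters for the cross part
```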

2.3 OPNN

The only difference between OPNN and IPNN is that the product layer computes $l_p$ with the outer product instead of the inner product. The outer product of two embedding vectors yields an $M \times M$ matrix, i.e.:
g(f_i, f_j) = f_i f_j^T \tag{19}
Therefore the overall time complexity of OPNN is $O(D_1 M^2 N^2)$, which is also high. To reduce it, the paper uses a superposition trick. The formula is:
p = \sum_{i=1}^N \sum_{j=1}^N f_i f_j^T = f_{\Sigma} (f_{\Sigma})^T \tag{20}
where $f_{\Sigma} = \sum_{i=1}^N f_i$.

With this simplification, the time complexity of OPNN drops to $O(D_1 M (M + N))$.
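A minimal NumPy sketch of this superposition trick (my own illustration; `F` holds the $N$ embedding vectors of one sample and `W_p` are randomly initialized placeholder weights):

```python
import numpy as np

N, M, D1 = 10, 8, 32
F = np.random.randn(N, M)                  # embedding vectors f_1 ... f_N of one sample
W_p = np.random.randn(D1, M, M)            # one M x M weight matrix per product-layer output unit

# naive: sum over all N^2 outer products f_i f_j^T
p_naive = np.einsum('im,jn->mn', F, F)     # sum_i sum_j f_i f_j^T

# superposition: p = f_sum f_sum^T with f_sum = sum_i f_i   (eq. 20)
f_sum = F.sum(axis=0)
p_fast = np.outer(f_sum, f_sum)

print(np.allclose(p_naive, p_fast))        # True: both give the same M x M matrix
l_p = np.einsum('dmn,mn->d', W_p, p_fast)  # l_p^k = W_p^k ⊙ p, shape (D1,)
```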

But look at formula (20) again: $f_{\Sigma} = \sum_{i=1}^N f_i$ amounts to a sum pooling over all features, i.e., adding up the corresponding dimensions of different features' embedding vectors. In real business scenarios this can be problematic. Take two features, age and gender: their embedding vectors do not lie in the same vector space, and forcing a sum pooling over them may cause unexpected problems. Normally we only pool over multi-valued features, for example pooling the embedding vectors of the items a user has clicked in the past; that is fine, because those items live in the same vector space and the pooling has a clear practical meaning.

In practice, I recommend using IPNN.




References

[1]: Yanru Qu et al. Product-Based Neural Networks for User Response Prediction. ICDM 2016: 1149-1154
