Recommendation System (9): PNN Model (Product-based Neural Networks)
Recommendation system series blogs:
- Recommendation System (1): Overview of Recommendation Systems
- Recommendation System (2): GBDT+LR Model
- Recommendation System (3): Factorization Machines (FM)
- Recommendation System (4): Field-aware Factorization Machines (FFM)
- Recommendation System (5): Wide & Deep
- Recommendation System (6): Deep & Cross Network (DCN)
- Recommendation System (7): xDeepFM Model
- Recommendation System (8): FNN Model (FM+MLP=FNN)
Like the FNN model introduced in the previous post, the PNN model (Product-based Neural Networks) comes from Weinan Zhang of Shanghai Jiao Tong University and his collaborators. The paper was published at ICDM 2016, a CCF-B conference. Personally, I have not heard of any company in industry running this model in its own scenario, but the ideas in the paper are still worth reading and may offer some reference for your own business models. From the name alone, Product-based Neural Networks, we can already tell that PNN achieves feature interaction by introducing a product operation (either inner product or outer product).
This post covers the motivation and the model details.
1. Motivation
This paper mainly improves on FNN, which was introduced in the previous post. FNN has two main shortcomings:
- The quality of the DNN's embedding layer is limited by the quality of the pre-trained FM.
- FM performs feature interaction through inner products of latent vectors, but FNN feeds the FM-pretrained embedding vectors into MLP fully connected layers, and a fully connected layer is essentially a linear weighted sum of features, i.e. an "add" operation. This is somewhat inconsistent with FM. In addition, the MLP ignores the differences between different fields and applies the same linear weighted summation to all of them.
The original wording in the paper is:
- the quality of embedding initialization is largely limited by the factorization machine.
- More importantly, the “add” operations of the perceptron layer might not be useful to explore the interactions of categorical data in multiple fields. Previous work [1], [6] has shown that local dependencies between features from different fields can be effectively explored by feature vector “product” operations instead of “add” operations.
Actually, I think FNN has another big limitation: it is a two-stage training model rather than a data-driven, end-to-end model. FNN still carries a strong flavor of traditional machine learning.
2. PNN Model Details
2.1 Overall structure of the PNN model
The network structure of PNN is shown in the figure below (taken from the original paper).
The core of the PNN model is the product layer, the layer labeled "product layer pair-wisely connected" in the figure above. Unfortunately, the figure in the original paper hides the most important detail: a first reading suggests that the product layer simply consists of $z$ and $p$, which are then fed directly into the $L_1$ layer. That is not the case. What the paper actually describes is that $z$ and $p$ each go through their own fully connected transformation inside the product layer, mapping $z$ and $p$ into $D_1$-dimensional vectors $l_z$ and $l_p$ respectively ($D_1$ is the number of neurons in the first hidden layer); $l_z$ and $l_p$ are then summed and fed into the $L_1$ layer.
I have therefore redrawn the product layer to give a more detailed view of the PNN structure, as shown in the figure below:
The figure above makes the overall structure and the core details of PNN fairly clear. Let's walk through each layer from the bottom up (a minimal end-to-end sketch in code is given after this list).
- Input layer
The raw features; nothing special to say. The paper assumes $N$ features (fields).
- Embedding layer
An ordinary embedding layer; each feature is mapped to an embedding vector of dimension $M$.
- Product layer
The core of the whole paper. There are two parts here, $z$ and $p$, both built from the embedding layer.
【1. About $z$】
$z$ is the linear signal vector; it is simply the $N$ feature embedding vectors copied over directly (the paper multiplies them by the constant 1). The paper gives the formal definition:
$$z = (z_1, z_2, z_3, ..., z_N) \triangleq (f_1, f_2, f_3, ..., f_N) \tag{1}$$
where $\triangleq$ denotes identity, so the dimension of $z$ is $N \times M$.
【2. About $p$】
$p$ is where the real interaction happens. For how $p$ is generated, the paper gives the formal definition $p_{i,j} = g(f_i, f_j)$, where $g$ can in principle be any mapping function; the paper gives two choices, inner product and outer product, corresponding to IPNN and OPNN. So:
$$p = \{p_{i,j}\}, \quad i = 1, 2, ..., N; \; j = 1, 2, ..., N \tag{2}$$
$p$ thus collects $N^2$ pairwise interaction terms; its exact size depends on the choice of $g$ (for the inner product used in IPNN each $p_{i,j}$ is a scalar, so $p$ is an $N \times N$ matrix, as we will see in Section 2.2).
【3. About $l_z$】
Once we have $z$, it goes through a fully connected mapping (the blue lines in the product layer of the figure above) and is mapped to a $D_1$-dimensional vector ($D_1$ being the number of hidden units). Formally:
$$l_z = (l_z^1, l_z^2, ..., l_z^k, ..., l_z^{D_1}), \qquad l_z^k = W_z^k \odot z \tag{3}$$
where $\odot$ is the "inner product" of two matrices, defined as $A \odot B \triangleq \sum_{i,j} A_{ij} B_{ij}$. Since the dimension of $z$ is $N \times M$, the dimension of each $W_z^k$ is also $N \times M$.
【4. About l p l_p lp】
and l z l_z lz equally , Through a fully connected network , Such as in the figure above product layer in The green line , Finally mapped to D 1 D_1 D1 dimension ( Number of hidden layer units ) Vector , The formal expression is :
l p = ( l p 1 , l p 2 , . . . , l p k , . . . , l z D 1 ) l z k = W p k ⊙ p (4) l_p=(l_p^1,l_p^2,...,l_p^k,...,l_z^{D_1}) \ \ \ \ \ \ \ \ \ \ \ \ l_z^k=W_p^k\odot p \tag{4} lp=(lp1,lp2,...,lpk,...,lzD1) lzk=Wpk⊙p(4)
W p k W_p^k Wpk And p p p The dimensions of are the same , by N 2 ∗ M N^2*M N2∗M. - Fully connected layer
- Fully connected layer
Two ordinary fully connected layers; the formulas speak for themselves:
$$l_1 = \mathrm{relu}(l_z + l_p + b_1) \tag{5}$$
$$l_2 = \mathrm{relu}(W_2 l_1 + b_2) \tag{6}$$
- Output layer
Since the scenario is CTR prediction, this is binary classification, so the activation is simply a sigmoid:
$$\hat{y} = \sigma(W_3 l_2 + b_3) \tag{7}$$
- Loss function
The standard cross-entropy loss:
$$L(y, \hat{y}) = -y \log \hat{y} - (1 - y) \log (1 - \hat{y}) \tag{8}$$
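To make the layer-by-layer description above concrete, here is a minimal single-sample sketch of the IPNN forward pass in NumPy, written directly from formulas (1)-(7). It is only an illustrative toy under my own assumptions (the field count N, embedding size M, hidden sizes D1/D2, and all weight names are made up here, and the pairwise inner products are computed in the naive O(N²) way), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

N, M = 3, 2        # number of fields, embedding dimension (toy values)
D1, D2 = 4, 4      # hidden layer sizes (assumed)

# f: the N field embeddings coming out of the embedding layer
f = rng.normal(size=(N, M))

# ----- product layer -----
# z is just the embeddings themselves (formula (1)), shape N x M
z = f
# p collects all pairwise inner products (IPNN), shape N x N
p = f @ f.T

# l_z^k = W_z^k ⊙ z and l_p^k = W_p^k ⊙ p (formulas (3)-(4)),
# where ⊙ is the element-wise product followed by a full sum
W_z = rng.normal(size=(D1, N, M))   # one N x M matrix per hidden unit
W_p = rng.normal(size=(D1, N, N))   # one N x N matrix per hidden unit
l_z = np.einsum('kij,ij->k', W_z, z)
l_p = np.einsum('kij,ij->k', W_p, p)

# ----- fully connected layers (formulas (5)-(6)) -----
relu = lambda x: np.maximum(x, 0.0)
b1 = np.zeros(D1)
l1 = relu(l_z + l_p + b1)

W2, b2 = rng.normal(size=(D2, D1)), np.zeros(D2)
l2 = relu(W2 @ l1 + b2)

# ----- output layer (formula (7)) -----
W3, b3 = rng.normal(size=D2), 0.0
y_hat = 1.0 / (1.0 + np.exp(-(W3 @ l2 + b3)))
print(y_hat)   # a CTR-style probability in (0, 1)
```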
2.2 IPNN
In the product layer, the interaction between feature embedding vectors can in theory be any operation; the paper gives two: inner product and outer product, corresponding to IPNN and OPNN. In terms of complexity, IPNN is cheaper than OPNN, so if you plan to use this model in production, OPNN is hardly worth considering; I will therefore describe IPNN in detail.
IPNN means the product layer uses the inner product. By the definition of the inner product given above, the inner product of two vectors is a scalar. Formally:
$$g(f_i, f_j) = \langle f_i, f_j \rangle \tag{9}$$
Let's analyze the time complexity of IPNN. First define the dimensions of the relevant variables: the embedding dimension is $M$, the number of features is $N$, and the dimensions of $l_p$ and $l_z$ are $D_1$. Because every pair of features interacts, $p$ has size $N \times N$, so computing $p$ costs $O(N \cdot N \cdot M)$, and mapping $p$ to $l_p$ costs $O(N \cdot N \cdot D_1)$. For IPNN, the whole product layer therefore costs $O(N^2(D_1 + M))$, which is very high and needs to be optimized. The paper uses a matrix factorization trick to reduce the time complexity to $O(D_1 \cdot M \cdot N)$. Let's see how the paper does it:
Since the interactions are pairwise, $p$ is clearly an $N \times N$ symmetric matrix, and $W_p^k$ is taken to be an $N \times N$ symmetric matrix as well. The paper then makes a rank-one (first-order) assumption: $W_p^k$ can be written as the outer product of an $N$-dimensional vector with itself, $W_p^k = \theta^k (\theta^k)^T$, where $\theta^k \in \mathbb{R}^N$.
Therefore we have:
$$W_p^k \odot p = \sum_{i=1}^N \sum_{j=1}^N \theta_i^k \theta_j^k \langle f_i, f_j \rangle = \left\langle \sum_{i=1}^N \theta_i^k f_i, \; \sum_{j=1}^N \theta_j^k f_j \right\rangle \tag{10}$$
If we write $\delta_i^k = \theta_i^k f_i$, then $\delta_i^k \in \mathbb{R}^M$ and $\delta^k = (\delta_1^k, \delta_2^k, ..., \delta_N^k) \in \mathbb{R}^{N \times M}$.
So $l_p$ becomes:
$$l_p = \left( \left\| \sum_i \delta_i^1 \right\|, \; ..., \; \left\| \sum_i \delta_i^{D_1} \right\| \right) \tag{11}$$
where $\|x\|$ here denotes $\langle x, x \rangle$, the inner product of the summed vector with itself.
After this string of formulas, I suspect fewer than 10% of readers are still with me; it is admittedly obscure. The easiest way to make it click is a concrete example, so let's go through one. Suppose a sample has 3 features, and each feature's embedding dimension is 2, i.e. $N = 3$, $M = 2$, giving the following sample:
| feature (field) | embedding vector |
|---|---|
| $f_1$ | [1, 2] |
| $f_2$ | [3, 4] |
| $f_3$ | [5, 6] |
Represented as a matrix:
$$\begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix} \tag{12}$$
Then define $W_p^1$ as:
$$W_p^1 = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 6 \\ 3 & 6 & 9 \end{bmatrix} \tag{13}$$
Now let's compute $l_p^1$ both ways, directly (complexity $O(N^2(D_1 + M))$) and with the decomposition, and compare; the trick then becomes obvious.
- Before decomposition
Take pairwise inner products of $(f_1, f_2, f_3)$ directly to obtain the matrix $p$:
$$p = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix} \cdot \begin{bmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \end{bmatrix} = \begin{bmatrix} 5 & 11 & 17 \\ 11 & 25 & 39 \\ 17 & 39 & 61 \end{bmatrix} \tag{14}$$
where $p_{ij} = \langle f_i, f_j \rangle$.
With $p$ in hand, we compute $l_p^1$, i.e. the value of the first element of $l_p$ (the green part) in the product layer of the second figure:
$$W_p^1 \odot p = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 6 \\ 3 & 6 & 9 \end{bmatrix} \odot \begin{bmatrix} 5 & 11 & 17 \\ 11 & 25 & 39 \\ 17 & 39 & 61 \end{bmatrix} = \mathrm{sum}\left(\begin{bmatrix} 5 & 22 & 51 \\ 22 & 100 & 234 \\ 51 & 234 & 549 \end{bmatrix}\right) = 1268 \tag{15}$$
- After decomposition
$W_p^1$ can obviously be decomposed as:
$$W_p^1 = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 6 \\ 3 & 6 & 9 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \begin{bmatrix} 1 & 2 & 3 \end{bmatrix} = \theta^1 (\theta^1)^T \tag{16}$$
So, following $\delta_i^1 = \theta_i^1 f_i$, we compute each $\delta_i^1$, as shown in the table below:
| feature (field) | embedding vector | $\theta_i^1$ | $\delta_i^1$ |
|---|---|---|---|
| $f_1$ | [1, 2] | 1 | [1, 2] |
| $f_2$ | [3, 4] | 2 | [6, 8] |
| $f_3$ | [5, 6] | 3 | [15, 18] |
Since there is still the summation $\sum_{i=1}^N \delta_i^1$ to do, we finally get the vector $\delta^1$:
$$\delta^1 = \sum_{i=1}^N \delta_i^1 = \begin{bmatrix} 22 & 28 \end{bmatrix} \tag{17}$$
And finally:
$$\langle \delta^1, \delta^1 \rangle = \delta^1 \odot \delta^1 = \begin{bmatrix} 22 & 28 \end{bmatrix} \odot \begin{bmatrix} 22 & 28 \end{bmatrix} = 1268 \tag{18}$$
So the value of the first element of $l_p$ is the same before and after the decomposition, but after the decomposition the time complexity is only $O(D_1 \cdot M \cdot N)$.
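To double-check the arithmetic, here is a short NumPy snippet that reproduces the toy example above (same $f_1, f_2, f_3$ and $\theta^1 = [1, 2, 3]$); both routes print 1268.

```python
import numpy as np

f = np.array([[1, 2],
              [3, 4],
              [5, 6]])            # f_1, f_2, f_3 stacked row-wise (N=3, M=2)
theta1 = np.array([1, 2, 3])      # θ^1, so W_p^1 = θ^1 (θ^1)^T

# Before decomposition: build p explicitly, cost O(N*N*M) + O(N*N)
p = f @ f.T                       # pairwise inner products, formula (14)
W_p1 = np.outer(theta1, theta1)   # formula (13)
print((W_p1 * p).sum())           # 1268

# After decomposition: never form p, cost O(N*M)
delta1 = (theta1[:, None] * f).sum(axis=0)   # ∑_i θ_i^1 f_i = [22, 28]
print(delta1 @ delta1)                       # 1268
```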
Therefore, when implementing IPNN, the parameter matrices of the linear part and the interaction part have the following sizes:
- the linear weight tensor $W_z$ has size $D_1 \times N \times M$;
- the interaction (quadratic) weight matrix $W_p$ (i.e. the stacked $\theta^k$) has size $D_1 \times N$.
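Putting the decomposition to work, a batched IPNN product layer then only needs the two weight tensors listed above. The sketch below is a hypothetical PyTorch-style module (the class name, shapes and initialization are my own choices, not taken from the paper or any official code); it computes $l_z + l_p$ for a batch without ever materializing the $N \times N$ matrix $p$.

```python
import torch

class IPNNProductLayer(torch.nn.Module):
    """l_z and l_p for a batch of samples, using the rank-1 trick W_p^k = θ^k (θ^k)^T."""
    def __init__(self, num_fields: int, embed_dim: int, d1: int):
        super().__init__()
        self.W_z = torch.nn.Parameter(torch.randn(d1, num_fields, embed_dim) * 0.01)  # D1 x N x M
        self.theta = torch.nn.Parameter(torch.randn(d1, num_fields) * 0.01)           # D1 x N

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (batch, N, M) field embeddings
        l_z = torch.einsum('knm,bnm->bk', self.W_z, f)          # linear part, formula (3)
        delta = torch.einsum('kn,bnm->bkm', self.theta, f)      # ∑_i θ_i^k f_i for each hidden unit
        l_p = (delta * delta).sum(dim=-1)                       # <δ^k, δ^k>, shape (batch, D1)
        return l_z + l_p                                        # fed into l_1 = relu(l_z + l_p + b_1)

# usage sketch: batch of 8 samples, N=3 fields, M=2-dim embeddings, D1=16 hidden units
layer = IPNNProductLayer(num_fields=3, embed_dim=2, d1=16)
out = layer(torch.randn(8, 3, 2))
print(out.shape)   # torch.Size([8, 16])
```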
2.3 OPNN
The only difference between OPNN and IPNN is that the product layer computes $l_p$ with outer products instead of inner products; the outer product of two embedding vectors yields an $M \times M$ matrix:
$$g(f_i, f_j) = f_i f_j^T \tag{19}$$
So the overall time complexity of OPNN is $O(D_1 M^2 N^2)$, which is also high. The paper therefore uses a superposition trick to reduce it. The formula is:
$$p = \sum_{i=1}^N \sum_{j=1}^N f_i f_j^T = f_\Sigma (f_\Sigma)^T \tag{20}$$
where $f_\Sigma = \sum_{i=1}^N f_i$.
With this simplification, the time complexity of OPNN drops to $O(D_1 M (M + N))$.
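As a quick illustration of the superposition trick (again a toy NumPy sketch with shapes assumed by me, not the paper's code): summing the field embeddings first collapses the $N^2$ outer products into a single $M \times M$ matrix, so each $W_p^k$ only needs to be $M \times M$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, D1 = 3, 2, 4                      # toy sizes (assumed)
f = rng.normal(size=(N, M))             # field embeddings
W_p = rng.normal(size=(D1, M, M))       # one M x M weight matrix per hidden unit

f_sum = f.sum(axis=0)                   # f_Σ = ∑_i f_i
p = np.outer(f_sum, f_sum)              # formula (20): p = f_Σ f_Σ^T, shape M x M
l_p = np.einsum('kij,ij->k', W_p, p)    # D1-dimensional l_p, cost O(D1*M*M + N*M)
print(l_p.shape)                        # (4,)
```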
But look at formula (20): $f_\Sigma = \sum_{i=1}^N f_i$ amounts to sum pooling over all features, i.e. adding up the corresponding dimensions of different feature embeddings. In a real business setting this can be problematic. Take two features such as age and gender: their embedding vectors do not live in the same vector space, and forcing a sum pooling over them may cause unexpected problems. Pooling only over multi-valued features, for example the embeddings of the items a user has clicked in the past, is fine, because those items do live in the same vector space and the sum has a real meaning.
In practice, I recommend using IPNN.
References
[1]: Yanru Qu et al. Product-Based Neural Networks for User Response Prediction. ICDM 2016: 1149-1154