Paper notes: extreme multi-label learning with GalaXC (saved as a draft, not finished)

2022-07-06 02:14:00 【Min fan】

Abstract: These notes share my understanding of the paper. For the original, see D. Saini, A. K. Jain, K. Dave, J. Jiao, A. Singh, R. Zhang and M. Varma, GalaXC: Graph neural networks with labelwise attention for extreme classification, in WWW 2021. Six of the seven authors are from Microsoft Research; trying to compete with them feels rather foolhardy on my part.

1. Contributions of the paper

Handles the setting where labels and documents live in the same space: "labels and documents cohabit the same space".
Uses label text and label correlations: "label text and label correlations, label metadata".
Uses a label-wise attention mechanism.
Performs well in the warm-start setting (some labels are already known): "warm-start scenarios where predictions need to be made on data points with partially revealed label sets".
Scales to millions of labels.
Fast and accurate.

2. Motivation

Prior work has shown that learning dense, application-specific document representations gives better predictions than application-agnostic features (for example, traditional bag-of-words features). "These works have demonstrated that learning dense application-specific document representations can lead to better predictions than using application-agnostic features such as the traditional bag-of-words features."
Short texts of only 5-10 tokens, for example using a title to predict related webpages or ads. "Short textual descriptions with typically only 5-10 tokens. Examples include applications such as predicting related webpages or related products using only the title of a given webpage/product and predicting relevant ads/keywords/searches for user queries."
Use various kinds of metadata, such as label text, label correlations, and label hierarchies, to serve tail labels better. "XC applications often make available label metadata in various forms such as label text, label correlations or label hierarchies."
Label features. "Contemporary XC algorithms have explored utilizing label features."
Warm start and auxiliary data sources. "Warm-start and auxiliary sources of data."
Most existing work uses document-document graphs rather than joint document-label graphs (see Table 1). "Existing works mostly use document-document graphs and not joint document-label graphs at extreme scales."

2. Basic notation

Table 1. Notations.

Symbol	Meaning	Remarks
$\mathbb{G}$	Bipartite graph	$\mathbb{G} = (\mathbb{D} \cup \mathbb{L}, \mathbb{E})$
$\mathbb{D}$	Set of document nodes	Elements are denoted $d$; the cardinality is $N$
$\mathbb{L}$	Set of label nodes	Elements are denoted $l$; the cardinality is $L$
$\mathbf{y}_i$	Ground-truth label vector of the $i$-th document	Takes values in $\{-1, +1\}^L$
$\hat{\mathbf{x}}_i^0$	Feature vector of the $i$-th document	$D$-dimensional
$\hat{\mathbf{z}}_l^0$	Feature vector of the $l$-th label	$D$-dimensional
$\hat{\mathbf{v}}_n^0$	Unified notation for $\hat{\mathbf{x}}_i^0$ and $\hat{\mathbf{z}}_l^0$ (node $n$)	$D$-dimensional
$\mathcal{N}$	Neighborhood operator	$\mathbb{V} \to 2^{\mathbb{V}}$, with $\mathbb{V} = \mathbb{D} \cup \mathbb{L}$
$\mathcal{C}$	Convolution operation
$\mathcal{T}$	Transformation operation	transformation
$\hat{\mathbf{a}}_n^k$	$\mathcal{C}_k(\{\hat{\mathbf{v}}_m^{k-1}, \hat{\mathbf{a}}_m^{k-1}: m \in \mathcal{N}(n)\})$	GNN operation
$\hat{\mathbf{v}}_n^k$	$\mathcal{T}_k(\{\hat{\mathbf{v}}_n^{k-1}, \hat{\mathbf{a}}_n^{k-1}\})$	GNN operation
$\mathbf{W}$	Coefficient matrix	$\times L$ dimensions
$K$	Number of hops
$e_{lk}$	Scalar for label $l$ at the $k$-th hop
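
To make the notation concrete, here is a minimal Python sketch of the bipartite graph $\mathbb{G}$ and the neighborhood operator $\mathcal{N}$. The node names (`d0`, `l0`, ...), the toy edge list, and the helper `build_neighbors` are hypothetical and chosen only for illustration; none of them come from the paper.

```python
from collections import defaultdict


def build_neighbors(edges):
    """Map every node to the set of its neighbors.

    `edges` is an iterable of (document, label) pairs. The graph is
    treated as undirected, so each edge is recorded in both directions.
    """
    neighbors = defaultdict(set)
    for doc, label in edges:
        neighbors[doc].add(label)
        neighbors[label].add(doc)
    return dict(neighbors)


# Toy data (assumed): 3 documents and 2 labels.
edges = [("d0", "l0"), ("d1", "l0"), ("d1", "l1"), ("d2", "l1")]
N = build_neighbors(edges)

print(sorted(N["d1"]))  # ['l0', 'l1']
print(sorted(N["l0"]))  # ['d0', 'd1']
```

Because the graph is bipartite, the neighbors of a document are always labels and the neighbors of a label are always documents.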

3. Method

The graph convolution block is computed as
$\hat{\mathbf{a}}_n^k = \mathcal{C}_k(\hat{\mathbf{a}}_n^{k-1}) = (1 + \epsilon_k) \cdot \hat{\mathbf{a}}_n^{k-1} + \sum_{m \in \mathcal{N}(n)}\hat{\mathbf{a}}_m^{k-1}$
The embedding is computed as
$\hat{\mathbf{v}}_n^k = \mathcal{T}_k(\hat{\mathbf{a}}_n^k)$
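
The two formulas above can be sketched in plain Python as follows. This is only an illustration under assumptions: vectors are Python lists, the toy values are made up, and the transformation $\mathcal{T}_k$ is replaced by an element-wise ReLU as a stand-in, since these notes do not specify its form.

```python
def convolve(a_prev, neighbors, eps):
    """One convolution step:
    a_n^k = (1 + eps) * a_n^{k-1} + sum over m in N(n) of a_m^{k-1}.
    """
    a_next = {}
    for n, vec in a_prev.items():
        out = [(1.0 + eps) * x for x in vec]
        for m in neighbors.get(n, ()):
            out = [o + x for o, x in zip(out, a_prev[m])]
        a_next[n] = out
    return a_next


def transform(vec):
    """Stand-in for T_k (assumption): element-wise ReLU."""
    return [max(0.0, x) for x in vec]


# Toy data (assumed): two documents connected to one label.
a_prev = {"d0": [1.0, 0.0], "d1": [0.0, 1.0], "l0": [0.5, -0.5]}
neighbors = {"d0": {"l0"}, "d1": {"l0"}, "l0": {"d0", "d1"}}

a_next = convolve(a_prev, neighbors, eps=0.5)
v_next = {n: transform(vec) for n, vec in a_next.items()}

print(a_next["d0"])  # [2.0, -0.5]
print(v_next["d0"])  # [2.0, 0.0]
```

All updates read from `a_prev` and write to a fresh dictionary, so every node sees its neighbors' values from the previous hop rather than partially updated ones.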
Let
$\alpha_{lk} = \exp(e_{lk}) / \sum_{k' \in [K]} \exp(e_{lk'})$
denote the weight (proportion) assigned to the $k$-th hop for label $l$.
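
The definition of $\alpha_{lk}$ is a softmax over the $K$ hop scalars of one label. A minimal sketch follows; the function name `hop_attention` and the example scores are hypothetical. Subtracting the maximum before exponentiating is a standard numerical-stability trick that does not change the result.

```python
import math


def hop_attention(e_l):
    """Softmax over the hop scalars e_{l1}, ..., e_{lK} of one label."""
    shift = max(e_l)  # cancels in the ratio; avoids overflow in exp
    exps = [math.exp(e - shift) for e in e_l]
    total = sum(exps)
    return [x / total for x in exps]


# Example scores (assumed) for a label with K = 3 hops.
alpha_l = hop_attention([1.0, 2.0, 3.0])
print(alpha_l)
print(sum(alpha_l))
```

The weights are positive and sum to 1, and a larger $e_{lk}$ gives the $k$-th hop a larger share.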
The label-specific embedding is computed as
$\hat{\mathbf{x}}^{(l)} = \sum_{k \in [K]} \alpha_{lk} \cdot \hat{\mathbf{x}}^{k}$
Note: I have not yet understood the superscript $k$ here (it looks like a $k$-th power).
The score of label $l$ is
$s_l = \langle \mathbf{w}_l, \hat{\mathbf{x}}^{(l)} \rangle$
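
As a worked example of the last two formulas, the sketch below combines per-hop embeddings with the attention weights and then takes the inner product with $\mathbf{w}_l$. It assumes that $\hat{\mathbf{x}}^{k}$ denotes the document embedding after the $k$-th hop, which is only one possible reading of the superscript; the function name `label_score` and all numbers are hypothetical.

```python
def label_score(w_l, hop_embeddings, alpha_l):
    """Return (x_l, s_l) for one document and one label.

    x_l = sum_k alpha_{lk} * x^k   (label-specific embedding)
    s_l = <w_l, x_l>               (score of label l)
    """
    dim = len(w_l)
    x_l = [
        sum(alpha_l[k] * hop_embeddings[k][j] for k in range(len(alpha_l)))
        for j in range(dim)
    ]
    s_l = sum(w * x for w, x in zip(w_l, x_l))
    return x_l, s_l


# Toy data (assumed): K = 2 hops, 2-dimensional embeddings.
hop_embeddings = [[1.0, 0.0], [0.0, 1.0]]  # x^1, x^2
alpha_l = [0.25, 0.75]                     # attention weights, sum to 1
w_l = [2.0, 4.0]                           # column of W for label l

x_l, s_l = label_score(w_l, hop_embeddings, alpha_l)
print(x_l)  # [0.25, 0.75]
print(s_l)  # 3.5
```

Since the weights $\alpha_{lk}$ depend on the label, the same document gets a different combined embedding, and hence a different score, for each label.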