Paper notes: extreme multi-label learning with GalaXC (draft, not yet fully studied)

2022-07-06 02:00:00 【闵帆】

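Abstract: These notes share my understanding of the paper. The original is D. Saini, A. K. Jain, K. Dave, J. Jiao, A. Singh, R. Zhang and M. Varma, GalaXC: Graph neural networks with labelwise attention for extreme classification, in WWW 2021. Six of the 7 authors are from Microsoft Research; picking an argument with them, I feel I must have lost my mind.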

1. Contributions of the paper

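Handles the setting where labels and documents live in the same space: "labels and documents cohabit the same space."
Exploits label text and label correlations: "label text and label correlations, label metadata."
Label-wise attention: "label-wise attention mechanism."
Works well under warm start (part of the label set is already known): "warm-start scenarios where predictions need to be made on data points with partially revealed label sets."
Scales to several million labels.
Fast and accurate.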

2. Motivation

Dense, application-specific document representations predict better than application-agnostic features such as traditional bag-of-words: "These works have demonstrated that learning dense application-specific document representations can lead to better predictions than using application-agnostic features such as the traditional bag-of-words features."
Short texts of 5-10 tokens, e.g. predicting related webpages or ads from a title alone: "Short textual descriptions with typically only 5-10 tokens. Examples include applications such as predicting related webpages or related products using only the title of a given webpage/product and predicting relevant ads/keywords/searches for user queries."
Several kinds of metadata (label text, label correlations, label hierarchies) can be used to serve tail labels better: "XC applications often make available label metadata in various forms such as label text, label correlations or label hierarchies."
Label features: "Contemporary XC algorithms have explored utilizing label features."
Warm start and auxiliary data sources: "Warm-start and auxiliary sources of data."
Most existing work uses document-document graphs rather than document-label graphs (see Table 1): "existing works mostly use document-document graphs and not joint document-label graphs at extreme scales."

2. Basic notation

Table 1. Notations.

Symbol	Meaning	Remarks
$\mathbb{G}$	Bipartite graph	$\mathbb{G} = (\mathbb{D} \cup \mathbb{L}, \mathbb{E})$
$\mathbb{D}$	Set of document nodes	Elements denoted $d$, cardinality $N$
$\mathbb{L}$	Set of label nodes	Elements denoted $l$, cardinality $L$
$\mathbf{y}_i$	Ground-truth label vector of the $i$-th document	Takes values in $\{-1, +1\}^L$
$\hat{\mathbf{x}}_i^0$	Feature vector of the $i$-th document	$D$-dimensional
$\hat{\mathbf{z}}_l^0$	Feature vector of the $l$-th label	$D$-dimensional
$\hat{\mathbf{v}}_n^0$	Unified notation for $\hat{\mathbf{x}}_i^0$ and $\hat{\mathbf{z}}_l^0$	$D$-dimensional
$\mathcal{N}$	Neighborhood operator	$\mathbb{V} \to 2^\mathbb{V}$, where $\mathbb{V} = \mathbb{D} \cup \mathbb{L}$
$\mathcal{C}$	Convolution operator
$\mathcal{T}$	Transformation operator
$\hat{\mathbf{a}}_n^k$	$\mathcal{C}_k(\{\hat{\mathbf{v}}_m^{k-1}, \hat{\mathbf{a}}_m^{k-1}: m \in \mathcal{N}(n)\})$	GNN operation
$\hat{\mathbf{v}}_n^k$	$\mathcal{T}_k(\{\hat{\mathbf{v}}_n^{k-1}, \hat{\mathbf{a}}_n^{k-1}\})$	GNN operation
$\mathbf{W}$	Coefficient matrix	$D \times L$, with column $\mathbf{w}_l$ for label $l$
$K$	Number of hops
$e_{lk}$	Scalar for label $l$ at the $k$-th hop
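
To make the neighborhood operator $\mathcal{N}$ on the joint document-label graph concrete, here is a minimal sketch. It assumes that an edge links document $i$ and label $l$ exactly when the ground-truth entry is $+1$; that edge rule, the toy matrix `Y`, and the names `build_neighbors` and `neighbors_of` are illustrative assumptions, not taken from the paper.

```python
# Hypothetical toy data: 3 documents, 2 labels, ground truth in {-1, +1}^L.
Y = [
    [+1, -1],  # document 0 carries label 0
    [+1, +1],  # document 1 carries labels 0 and 1
    [-1, +1],  # document 2 carries label 1
]


def build_neighbors(Y):
    """Map every node of the bipartite graph to its set of neighbors.

    Nodes are named ("d", i) for documents and ("l", j) for labels.
    Assumption: document i and label j are linked iff Y[i][j] == +1.
    """
    neighbors = {}
    for i, row in enumerate(Y):
        neighbors.setdefault(("d", i), set())
        for j, y in enumerate(row):
            neighbors.setdefault(("l", j), set())
            if y == +1:
                neighbors[("d", i)].add(("l", j))
                neighbors[("l", j)].add(("d", i))
    return neighbors


neighbors_of = build_neighbors(Y)
print(sorted(neighbors_of[("d", 1)]))  # label nodes adjacent to document 1
print(sorted(neighbors_of[("l", 0)]))  # document nodes adjacent to label 0
```

Because the graph is bipartite, a document's neighbors are always labels and a label's neighbors are always documents, so two hops are needed to reach another document.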

3. Method

The graph convolution block computes
$\hat{\mathbf{a}}_n^k = \mathcal{C}_k(\hat{\mathbf{a}}_n^{k-1}) = (1 + \epsilon_k) \cdot \hat{\mathbf{a}}_n^{k-1} + \sum_{m \in \mathcal{N}(n)}\hat{\mathbf{a}}_m^{k-1}$
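The update above can be sketched in plain Python as follows. The three-node graph, the 2-dimensional features, the value of `eps`, and the function name `graph_convolution` are made-up illustrations, not values from the paper.

```python
def graph_convolution(a_prev, neighbors, eps):
    """One hop: a_n^k = (1 + eps) * a_n^{k-1} + sum of a_m^{k-1} for m in N(n)."""
    a_next = {}
    for n, vec in a_prev.items():
        # Scaled self term.
        out = [(1.0 + eps) * x for x in vec]
        # Unweighted sum over the neighbors' previous-hop vectors.
        for m in neighbors[n]:
            for d, x in enumerate(a_prev[m]):
                out[d] += x
        a_next[n] = out
    return a_next


# Hypothetical path graph d0 - l0 - d1 with 2-dimensional features.
neighbors = {"d0": ["l0"], "l0": ["d0", "d1"], "d1": ["l0"]}
a0 = {"d0": [1.0, 0.0], "l0": [0.0, 1.0], "d1": [2.0, 2.0]}

a1 = graph_convolution(a0, neighbors, eps=0.5)
print(a1["l0"])  # 1.5 * [0, 1] + [1, 0] + [2, 2] = [3.0, 3.5]
```

Every node reads only hop $k-1$ values, so the result does not depend on the order in which nodes are visited.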
The embedding step computes
$\hat{\mathbf{v}}_n^k = \mathcal{T}_k(\hat{\mathbf{a}}_n^k)$
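Let
$\alpha_{lk} = \exp(e_{lk}) / \sum_{k' \in [K]} \exp(e_{lk'})$
It is the share of weight that label $l$ places on the $k$-th hop; for each $l$ the values sum to 1 over $k \in [K]$.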
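As a small numeric check of this softmax over hops, here is a sketch for a single label. The scalars passed in are invented, and `hop_attention` is a hypothetical helper name; subtracting the maximum is a standard numerical-stability trick that leaves the result unchanged.

```python
import math


def hop_attention(e_l):
    """Softmax over the per-hop scalars e_{l1}, ..., e_{lK} of one label."""
    shift = max(e_l)  # avoids overflow in exp; cancels out in the ratio
    exps = [math.exp(e - shift) for e in e_l]
    total = sum(exps)
    return [x / total for x in exps]


# Hypothetical scalars for K = 3 hops; the third hop has twice the unnormalized weight.
alpha = hop_attention([0.0, 0.0, math.log(2.0)])
print(alpha)  # approximately [0.25, 0.25, 0.5]
```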
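The embedding for label $l$ is computed as
$\hat{\mathbf{x}}^{(l)} = \sum_{k \in [K]} \alpha_{lk} \cdot \hat{\mathbf{x}}^{k}$
Note: I have not yet understood the superscript $k$ here. Given the notation $\hat{\mathbf{v}}_n^k$ in Table 1, it presumably indexes the hop rather than denoting a power.
The score of label $l$ is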
$s_l = \langle \mathbf{w}_l, \hat{\mathbf{x}}^{(l)} \rangle$
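
Putting the attention weights, the label-wise embedding and the score together, a minimal sketch for one document and one label might look like this. It assumes the superscript on $\hat{\mathbf{x}}^{k}$ is a hop index, so the document has one embedding per hop; the numbers and the name `label_score` are illustrative only.

```python
import math


def label_score(x_hops, e_l, w_l):
    """Score one label from the per-hop embeddings of one document.

    x_hops: list of K vectors, one embedding of the document per hop
    e_l:    list of K attention scalars for this label
    w_l:    classifier vector for this label
    """
    # alpha_lk: softmax of e_l over the K hops.
    shift = max(e_l)
    exps = [math.exp(e - shift) for e in e_l]
    total = sum(exps)
    alpha = [x / total for x in exps]
    # x^(l) = sum_k alpha_lk * x^k
    x_l = [
        sum(alpha[k] * x_hops[k][d] for k in range(len(x_hops)))
        for d in range(len(w_l))
    ]
    # s_l = <w_l, x^(l)>
    return sum(w * x for w, x in zip(w_l, x_l))


# Hypothetical K = 2 hops with equal attention, so x^(l) = [0.5, 0.5].
x_hops = [[1.0, 0.0], [0.0, 1.0]]
print(label_score(x_hops, [0.0, 0.0], [2.0, 4.0]))  # 0.5 * 2 + 0.5 * 4 = 3.0
```

Each label has its own scalars $e_{lk}$, so two labels can weight the same per-hop embeddings differently.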