Paper Study Notes: LIFT for Multi-Label Learning
2022-07-01 05:51:00 【LTA_ ALBlack】
These notes share my understanding of the paper. See the original: Zhang, M.-L., & Wu, L. (2015). LIFT: Multi-label learning with label-specific features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37, 107–120.
Contents
1. Understanding the abstract
2. Basic symbols
3. Details of the LIFT method
4. LIFT procedure and prediction
1. Understanding the abstract
The previous essay on LSML already touched on the idea of "Label-Specific Features" in multi-label learning. Tracing this line of work back to its source, the LIFT algorithm proposed by Min-Ling Zhang in this paper is clearly where "Label-Specific Features" entered multi-label learning.
The paper notes that in multi-label learning, each data row is represented by a single instance (feature vector) that is associated with a set of labels. Most existing multi-label algorithms learn all labels from this same shared feature representation, but the authors argue that this is suboptimal.
However, this popular strategy might be suboptimal as each label is supposed to possess specific characteristics of its own
Since each label is supposed to possess characteristics of its own, a single shared representation inevitably ignores the possibility of exploiting these label-specific features.
The paper therefore proposes a multi-label algorithm that uses different features for different labels: it first clusters the positive and negative instances of each label, then trains and tests on features derived from the clustering results.
That said, LIFT arguably over-emphasizes label-specific characteristics while ignoring the possible correlations between labels; it is a first-order multi-label method, which is its main weakness. Many later algorithms have supplemented it by introducing label correlations on top of label-specific features.
First-order strategy: multi-label classification is handled label-by-label, ignoring the coexistence of the other labels, e.g., by decomposing multi-label learning into a set of independent binary classification problems. The main value of the first-order strategy lies in its conceptual simplicity and efficiency; on the other hand, its results may not be optimal because label correlations are discarded.
2. Basic symbols
| Symbol | Dimension | Meaning |
| --- | --- | --- |
| \(n\) |  | Number of instances |
| \(d\) |  | Dimension of the original feature vectors |
| \(q\) |  | Number of labels |
| \(X\) | \(n \times d\) | Data matrix |
| \(Y\) | \(n \times q\) | Label matrix |
| \(\phi_{k}(\boldsymbol{x})\) | \(1 \times 2m_k\) | New feature representation of an instance, constructed from the positive and negative cluster centers of label \(l_k\), for \(k \in [1,q]\) |
A few more symbols describe data sets: \(\mathcal{B}_{k}\) denotes the binary data set of size \(n \times 2m_k\) constructed for label \(l_k\), \(\mathcal{D}\) denotes the original data set, \(\mathfrak{L}\) is a binary learner, and \(\boldsymbol{u}\) denotes an unseen test instance.
3. Details of the LIFT method
First pick a label \(l_k\) and look at its \(n \times 1\) label column, which splits the instances into positive and negative samples. The positive samples form the set \(\mathcal{P}_{k}\) and the negative samples form the set \(\mathcal{N}_{k}\):
\[\begin{aligned}
\mathcal{P}_{k} &=\left\{\boldsymbol{x}_{i} \mid\left(\boldsymbol{x}_{i}, Y_{i}\right) \in \mathcal{D}, l_{k} \in Y_{i}\right\} \\
\mathcal{N}_{k} &=\left\{\boldsymbol{x}_{i} \mid\left(\boldsymbol{x}_{i}, Y_{i}\right) \in \mathcal{D}, l_{k} \notin Y_{i}\right\}
\end{aligned} \tag{1}\]
These two sets are usually highly imbalanced (\(|\mathcal{P}_{k}| \neq |\mathcal{N}_{k}|\)), which follows from the imbalance of the multi-label matrix itself.
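To make Eq. (1) concrete, here is a minimal Python sketch of the positive/negative split, assuming `X` is an \(n \times d\) NumPy array and `Y` is an \(n \times q\) 0/1 label matrix; the function name `split_by_label` is mine, not from the paper's code.

```python
import numpy as np

def split_by_label(X, Y, k):
    """Eq. (1): split instances into P_k (label l_k present) and N_k (absent)."""
    pos_mask = Y[:, k] == 1      # rows whose label set contains l_k
    P_k = X[pos_mask]            # positive instances for label l_k
    N_k = X[~pos_mask]           # negative instances for label l_k
    return P_k, N_k
```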
The two sets are then clustered separately; this is how LIFT mines further information from \(\mathcal{P}_{k}\) and \(\mathcal{N}_{k}\).
To gain insights on the properties of \(P_k\) and \(N_k\), LIFT chooses to employ clustering techniques which have been widely used as stand-alone tools for data analysis
The original authors use k-means. Using k-means naturally requires choosing a suitable number of clusters \(k\), which is always a troublesome choice; the authors adopt a simple heuristic:
\[m_{k}^{+}=m_{k}^{-}=\left\lceil r \cdot \min \left\{\left|\mathcal{P}_{k}\right|,\left|\mathcal{N}_{k}\right|\right\}\right\rceil \tag{2}\]
Here \(m_{k}^{+}\) is the number of k-means clusters for \(\mathcal{P}_k\), \(m_{k}^{-}\) is the number for \(\mathcal{N}_k\), and \(r\) is a ratio parameter that controls how many clusters are retained. Since the number of cluster centers is the same for positive and negative samples, the class-imbalance problem is handled gracefully.
Multi-label learning tasks usually encounter the issue of class-imbalance [59], where the number of positive instances for each class label is much smaller than the number of negative ones, i.e. \(|P_{k}| ≪ |N_{k}|\). To mitigate potential risks brought by the class-imbalance problem, LIFT sets equivalent number of clusters for \(P_{k}\) and \(N_{k}\), i.e. \(m_{k}^{+}=m_{k}^{-}=m_{k}\). In this way, clustering information gained from positive instances as well as negative instances are treated with equal importance.
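As a sketch of Eq. (2) plus the clustering step, the snippet below uses scikit-learn's `KMeans` as a stand-in for the paper's k-means; `cluster_centers` is an illustrative name of my own, and `r` is the ratio parameter from Eq. (2) (a small value such as 0.1 is typical).

```python
from math import ceil

from sklearn.cluster import KMeans

def cluster_centers(P_k, N_k, r=0.1):
    """Eq. (2): cluster P_k and N_k with the same number of centers m_k."""
    m_k = max(1, ceil(r * min(len(P_k), len(N_k))))   # guard against m_k = 0
    pos_centers = KMeans(n_clusters=m_k, n_init=10).fit(P_k).cluster_centers_
    neg_centers = KMeans(n_clusters=m_k, n_init=10).fit(N_k).cluster_centers_
    return pos_centers, neg_centers                   # each has shape (m_k, d)
```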
Clustering yields two sets of cluster centers of equal size, \(\{\boldsymbol{p}^{k}_{1},\boldsymbol{p}^{k}_{2},\dots,\boldsymbol{p}^{k}_{m^{+}_k}\}\) and \(\{\boldsymbol{n}^{k}_{1},\boldsymbol{n}^{k}_{2},\dots,\boldsymbol{n}^{k}_{m^{-}_k}\}\), where each center is a \(d\)-dimensional vector. For any instance \(\boldsymbol{x}\) from the original data set, we can compute its distances to these \(2m_k\) cluster centers, and the distances form a new \(1 \times 2m_k\) representation \(\phi_{k}(\boldsymbol{x})\) with respect to label \(l_k\):
\[\phi_{k}(\boldsymbol{x})=\left[d\left(\boldsymbol{x}, \boldsymbol{p}_{1}^{k}\right), \cdots, d\left(\boldsymbol{x}, \boldsymbol{p}_{m_{k}}^{k}\right), d\left(\boldsymbol{x}, \boldsymbol{n}_{1}^{k}\right), \cdots, d\left(\boldsymbol{x}, \boldsymbol{n}_{m_{k}}^{k}\right)\right] \tag{3}\]
This constructed feature set is the "label-specific features" for label \(l_k\). The original paper describes the cluster centers as structures "which can be served as appropriate building blocks (prototypes) for the construction of label-specific features."
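A sketch of the mapping in Eq. (3): every instance is re-represented by its distances to the \(2m_k\) cluster centers. I use Euclidean distance for \(d(\cdot,\cdot)\) here, and the name `lift_features` is illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def lift_features(X, pos_centers, neg_centers):
    """Eq. (3): phi_k(x) = distances of x to the positive and negative centers."""
    d_pos = euclidean_distances(X, pos_centers)        # shape (n, m_k)
    d_neg = euclidean_distances(X, neg_centers)        # shape (n, m_k)
    return np.hstack([d_pos, d_neg])                   # shape (n, 2 * m_k)
```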
Note: searching for this sentence online turns up a few hints. My current reading of "appropriate building blocks" is that the original features may be mapped block by block, with some special handling when the last "block" does not contain enough data to fill a full block. (This detail probably needs to be confirmed against the code later.)
After all this work, the original \(n \times d\) matrix \(\mathbf{X}\) has turned into an \(n \times 2m_k\) matrix. Since only label \(l_k\) was selected, the new label matrix is only \(n \times 1\). Seen this way, we have constructed a binary classification problem: more precisely, a temporary \(n \times 2m_k\) data set in which every original instance (data row) \(\boldsymbol{x}_{i}\) is classified with respect to label \(l_k\).
The paper appears to use an SVM as the binary learner at this point.
In this new binary problem, the \(2m_{k}\) dimensions arguably carry more information than the original \(d\) dimensions. The original dimensions only record the raw attribute values of a chosen instance \(\boldsymbol{x}\), whereas each of the \(2m_{k}\) columns records the distance of the instance, taken as a whole, to one cluster center, and such distance measures are a classic kind of feature. In other words, a data row that originally only described itself is expanded into features describing its relationship (distance) to all the feature prototypes (the set of cluster centers). Unfortunately, the prototypes used here belong only to the same label \(l_k\), so label correlations cannot be reflected at all; at the same time, LIFT seems to me to over-emphasize the relationship between a single label and its group of cluster centers while discarding the basic characteristics of the instance itself (the original \(d\)-dimensional information).
4. LIFT procedure and prediction

The inputs of the procedure are easy to understand: the training set, the ratio \(r\) used by the heuristic that determines the number of clusters, the binary learner, and the test data. The goal is to predict the labels of the test data.
The two for-loops in the procedure correspond to the two steps of LIFT:
- label-specific features construction
- classifier induction
The first step was just described. In short, loop over all labels \(k = 1, \dots, q\); for each label \(l_k\), take out the positive and negative samples to form the two sets \(\mathcal{P}_{k}\) and \(\mathcal{N}_{k}\), then cluster each set with k-means using the \(m_k\) obtained from Eq. (2) as the number of clusters. Then, for any \(\boldsymbol{x}_i \in \mathcal{D}\), compute its distances to the \(2m_k\) positive and negative cluster centers, which gives \(\phi_{k}(\boldsymbol{x}_i) = \left[d\left(\boldsymbol{x}_i, \boldsymbol{p}_{1}^{k}\right), \dots, d\left(\boldsymbol{x}_i, \boldsymbol{p}_{m_k}^{k}\right), d\left(\boldsymbol{x}_i, \boldsymbol{n}_{1}^{k}\right), \dots, d\left(\boldsymbol{x}_i, \boldsymbol{n}_{m_k}^{k}\right)\right]\), exactly as in Eq. (3).
The second step takes all the \(\phi_{k}\) information obtained in the first step, loops over all labels \(k = 1, \dots, q\) again, and for each label uses the \(n\) mapped instances \(\phi_{k}(\boldsymbol{x}_i)\) to build a binary data set \(\mathcal{B}_{k}\). Macroscopically, one can also say that \(\mathcal{B}_{k}\) is constructed from \(\mathcal{D}\) based on \(l_k\). The binary data set satisfies:
\[\begin{array}{r}
\mathcal{B}_{k}=\left\{\left(\phi_{k}\left(\boldsymbol{x}_{i}\right), Y_{i}(k)\right) \mid\left(\boldsymbol{x}_{i}, Y_{i}\right) \in \mathcal{D}\right\} \text { where } \\
Y_{i}(k)=+1 \text { if } l_{k} \in Y_{i} ; \text { otherwise } Y_{i}(k)=-1
\end{array} \tag{4}\]
The binary learner \(\mathfrak{L}\) is then trained on each data set \(\mathcal{B}_{k}\), i.e. \(g_{k} \leftarrow \mathfrak{L}\left(\mathcal{B}_{k}\right)\). This \(g_k\) is a fully trained classification model for label \(l_k\) and can be used for prediction. (The data set \(\mathcal{B}_{k}\) for any label \(l_k\) is still built from the same \(n\) training samples.)
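Putting Eq. (4) and the classifier induction together for one label, here is a sketch that reuses the helper functions above and trains a linear SVM (scikit-learn's `SVC`) as a stand-in for the binary learner \(\mathfrak{L}\); the paper's exact learner and settings may differ, and `train_label_model` is my own name.

```python
import numpy as np
from sklearn.svm import SVC

def train_label_model(X, Y, k, r=0.1):
    """Eq. (4): build B_k from phi_k(x_i) and Y_i(k), then induce g_k."""
    P_k, N_k = split_by_label(X, Y, k)               # Eq. (1)
    pos_c, neg_c = cluster_centers(P_k, N_k, r)      # Eq. (2) + k-means
    B_k_X = lift_features(X, pos_c, neg_c)           # phi_k(x_i) for all i
    B_k_y = np.where(Y[:, k] == 1, 1, -1)            # Y_i(k) = +1 / -1
    g_k = SVC(kernel="linear").fit(B_k_X, B_k_y)     # g_k <- L(B_k)
    return g_k, pos_c, neg_c                         # centers are needed at test time
```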
For any unseen instance \(\boldsymbol{u} \in \mathcal{X}\), all \(q\) classifier models are invoked to make predictions. Note that before a test instance enters the model \(g_k\), it must go through the same mapping process: only after \(\phi_{k}(\boldsymbol{u})\) has been computed (from the distances to the cluster centers learned on the training set) can it be fed into the model. The final predicted label set is therefore:
\[Y=\left\{l_{k} \mid g_{k}\left(\phi_{k}(\boldsymbol{u})\right)>0, \; 1 \leq k \leq q\right\} \tag{5}\]
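A matching prediction sketch for Eq. (5): an unseen instance \(\boldsymbol{u}\) is first mapped through each \(\phi_k\) (using the centers learned on the training set) before the corresponding \(g_k\) votes; `predict_labels` is an illustrative name, and `models` is assumed to hold the tuples returned by the training sketch above.

```python
def predict_labels(u, models):
    """Eq. (5): models is a list of (g_k, pos_centers, neg_centers), one per label."""
    u = u.reshape(1, -1)                              # treat u as a 1-row matrix
    Y_pred = []
    for k, (g_k, pos_c, neg_c) in enumerate(models):
        phi_u = lift_features(u, pos_c, neg_c)        # phi_k(u)
        if g_k.decision_function(phi_u)[0] > 0:       # g_k(phi_k(u)) > 0
            Y_pred.append(k)                          # label l_k is predicted relevant
    return Y_pred
```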
(The details of running the code, the paper's Experiments section, and the metrics it uses still need to be filled in by further reading.)
Simple summary
LIFT is noticeably easier to read than LSML from the previous essay. It deserves its status as the opening work on label-specific features: the characteristics and theory of label-specific features are laid out very clearly, and it is the kind of paper whose overall direction you grasp after one reading. The implementation also does not look complicated; it is the kind of algorithm whose code structure you can see at a glance.
LIFT mines the specific characteristics of each label, uses k-means, and transforms the multi-label problem into multiple binary classification problems, which was bold and groundbreaking. Although it falls short by only achieving first-order multi-label learning, this embedding idea should be the most valuable part of the paper and a reference point for later work.
When I first read the paper I did worry about how well k-means would perform here, but the heuristic based on \(r\) seems to work well enough. I hope to compare it against more clustering algorithms once I have studied them.