Learning notes on the multi-label method LSML
2022-07-01 05:50:00 【LTA_ ALBlack】
Original paper: Huang, J., Qin, F., Zheng, X., Cheng, Z.-K., Yuan, Z.-X., Zhang, W.-G., & Huang, Q.-M. (2019). Improving multi-label classification with missing labels by learning label-specific features. Information Sciences, 492, 124–146.
Contents

1. Understanding of the original abstract
2. Basic symbols
3. Formula model
Simple summary
1. Understanding of the original abstract
Much existing multi-label learning represents all labels with the same data, i.e. the same feature set for every label (my understanding: the features specific to each individual label are not extracted?), and it expects to see the complete set of labels in the training data. In multi-label learning, however, each label may be decided by certain features specific to itself, and in some practical applications only a partial label set can be obtained.
LSML (Label-Specific features for multi-label classification with Missing Labels) focuses on exactly this missing-label problem. The phrase "label-specific features" here reminds one of Prof. Min-Ling Zhang's LIFT, which builds a new feature set for each label, but although this paper also speaks of label-specific features for multi-label learning, its style is completely different from LIFT. (It opens up new possibilities for the label-specific-features branch of multi-label learning.)
Zhang, M.-L., & Wu, L. (2015). LIFT: Multi-label learning with label-specific features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37, 107–120.
The way LSML realizes label correlation is to start from the incomplete label matrix \(\mathbf{Y}\) and, by learning high-order label correlations, obtain a new \(l\times l\) supplementary label matrix \(\mathbf{C}\). The learned high-order label correlations are incorporated, and a multi-label classifier is built on top of them at the same time.
(This \(\mathbf{C}\) seems to summarize the characteristics of the original labels? After all, it describes the mapping from one label to all the other labels; perhaps this is what the paper emphasizes as "label-specific features"?)
The challenge here is how to learn accurate label correlations from incomplete label data. Since the label matrix has many missing entries, LSML tries to improve correlation quality and recognition accuracy by constructing the label correlation matrix directly under the missing-information setting, rather than by first filling in the missing label matrix and then learning. The author argues that when class labels of the training data are missing, label correlations learned directly from the incomplete label matrix may be inaccurate, which greatly affects the performance of the multi-label classifier. So the author does not use the kind of matrix-factorization completion of the label matrix used by GLOCAL, which I introduced in an earlier blog post.
2. Basic symbols
| Symbol | Meaning | Notes |
| --- | --- | --- |
| \(\mathbf{X} \in \mathbb{R}^{n \times m}\) | Feature (attribute) matrix | |
| \(\mathbf{Y} \in\{0,1\}^{n \times l}\) | Label matrix | |
| \(\mathbf{W} \in \mathbb{R}^{m \times l}\) | Coefficient matrix | Still a linear model |
| \(\mathbf{w}_{i} \in \mathbb{R}^{m}\) | Coefficient vector of one label | |
| \(\mathbf{C} \in \mathbb{R}^{l \times l}\) | Label correlation matrix | Pairwise correlations; not symmetric in general |
The matrix \(\mathbf{C}\) describes, somewhat like an adjacency matrix, the correlation between any label \(i\) and any label \(j\). It is worth noting that this matrix is not symmetric, i.e. the pairwise correlations in this paper do not commute (original authors' note: \(C_{ij}\) may not equal \(C_{ji}\), although in the experiments they find that in most cases \(C_{ij} = C_{ji}\)). According to the author, this label-specific-features treatment of multi-label learning is similar to LLSF-DL and JFSC, but those schemes compute the pairwise label correlations in advance, while LSML fits \(\mathbf{C}\), letting the machine learn the correlations by itself, on the assumption that any missing label can be reconstructed from the values of the other labels.
LLSF-DL is from the paper:
Huang, J., Li, G., Huang, Q., & Wu, X. (2016). Learning label-specific features and class-dependent labels for multi-label classification. IEEE Transactions on Knowledge and Data Engineering, 28(12), 3309–3323.
JFSC is from the paper:
Huang, J., Li, G., Huang, Q., & Wu, X. (2018). Joint feature selection and classification for multilabel learning. IEEE Transactions on Cybernetics, 48(3), 876–889.
3. Formula model
The author's fitting still starts from the linear model: fit the coefficient matrix \(\mathbf{W}\) so that \(\mathbf{X} \mathbf{W}\approx \mathbf{Y}\), while putting a regularization penalty on \(\mathbf{W}\): \[\min _{\mathbf{W}} \frac{1}{2}\|\mathbf{X} \mathbf{W}-\mathbf{Y}\|_{F}^{2}+\lambda_{3}\|\mathbf{W}\|_{1} \tag{1}\] Regularizing the coefficient matrix is already a common trick. The most precise measure of sparsity would be the 0-norm, which is also the author's original intention, but since the 0-norm is inconvenient to optimize, it is replaced by its convex surrogate, the 1-norm. This substitution has something in common with the formula replacement in PML-NI.
This is part of the charm of machine learning: it does not insist on full mathematical rigor. Most of the time, as long as the loss introduced by the surrogate is acceptable, the replacement is considered fine!
PML-NI is from the paper:
Xie, M.-K., & Huang, S.-J. (2022). Partial multi-label learning with noisy label identification. IEEE Transactions on Pattern Analysis and Machine Intelligence.
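To make objective (1) concrete for myself, here is a minimal NumPy sketch that evaluates it and takes one proximal-gradient (ISTA) step with soft-thresholding. This is only my own illustration with made-up toy shapes and a made-up \(\lambda_3\); it is not the solver used in the paper.

```python
import numpy as np

def soft_threshold(A, tau):
    """Proximal operator of tau * ||A||_1: elementwise soft-thresholding."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def objective_1(X, W, Y, lam3):
    """Objective (1): 0.5 * ||XW - Y||_F^2 + lam3 * ||W||_1."""
    return 0.5 * np.linalg.norm(X @ W - Y, "fro") ** 2 + lam3 * np.abs(W).sum()

rng = np.random.default_rng(0)
n, m, l = 50, 20, 5                          # toy sizes, not from the paper
X = rng.standard_normal((n, m))
Y = (rng.random((n, l)) < 0.3).astype(float)
W = np.zeros((m, l))
lam3 = 0.1

# One ISTA step: gradient step on the smooth part, then soft-thresholding.
step = 1.0 / np.linalg.norm(X, 2) ** 2       # 1 / Lipschitz constant of the gradient
grad = X.T @ (X @ W - Y)
W_new = soft_threshold(W - step * grad, step * lam3)
print(objective_1(X, W, Y, lam3), objective_1(X, W_new, Y, lam3))  # objective decreases
```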
Then we continue by bringing the correlation matrix \(\mathbf{C}\) into the optimization objective:
\[\begin{aligned}
\min _{\mathbf{W}, \mathbf{C}} & \frac{1}{2}\|\mathbf{X W}-\mathbf{Y C}\|_{F}^{2}+\frac{\lambda_{1}}{2}\|\mathbf{Y C}-\mathbf{Y}\|_{F}^{2}+\lambda_{2}\|\mathbf{C}\|_{1}+\lambda_{3}\|\mathbf{W}\|_{1} \\
& \text { s.t. } \mathbf{C} \succeq 0
\end{aligned} \tag{2}\] This model replaces the \(\mathbf{Y}\) in the earlier linear term with \(\mathbf{YC}\), i.e. a completion and reconstruction of \(\mathbf{Y}\). The second and third terms constrain \(\mathbf{C}\): \(\mathbf{YC}\) is fitted toward \(\mathbf{Y}\) while \(\mathbf{C}\) is kept sparse, and during this fitting \(\mathbf{YC}\) gradually fills in the labels missing from the original \(\mathbf{Y}\).
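For reference, here is a small sketch of how I would evaluate the full objective (2) for given matrices. I read the constraint \(\mathbf{C} \succeq 0\) as elementwise non-negativity, which is my own reading, and the projection below is just one simple way to respect it.

```python
import numpy as np

def objective_2(X, Y, W, C, lam1, lam2, lam3):
    """Objective (2): XW is fitted to the reconstructed labels YC,
    YC is kept close to the observed Y, and C and W are kept sparse."""
    fit      = 0.5 * np.linalg.norm(X @ W - Y @ C, "fro") ** 2
    recon    = 0.5 * lam1 * np.linalg.norm(Y @ C - Y, "fro") ** 2
    sparse_c = lam2 * np.abs(C).sum()
    sparse_w = lam3 * np.abs(W).sum()
    return fit + recon + sparse_c + sparse_w

def project_nonneg(C):
    """Clip negative entries to zero, one way to keep C elementwise non-negative."""
    return np.maximum(C, 0.0)
```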
Following the example given by the teacher, take \[\mathbf{Y}=\left[\begin{array}{lll}
0 & 1 & 1 \\
1 & 0 & 0 \\
1 & 0 & 1 \\
1 & 1 & 0
\end{array}\right]\] If \(\mathbf{C}\) is simply the identity matrix, the fit of \(\mathbf{YC}\) to \(\mathbf{Y}\) is exact: with \[\mathbf{I}=\left[\begin{array}{lll}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{array}\right]\] we have \[\mathbf{Y I}=\mathbf{Y}.\] Now suppose instead there is a non-identity matrix \(\mathbf{C}\) of the form \[\mathbf{C}=\left[\begin{array}{lll}
0.9 & 0.1 & 0.2 \\
0.1 & 0.8 & 0.3 \\
0.1 & 0.2 & 0.9
\end{array}\right]\] The diagonal entries of this \(\mathbf{C}\) are relatively large, meaning every label is most strongly correlated with itself; off the diagonal, for example, label 0 is more correlated with label 2 than with label 1 (0.2 > 0.1). Keeping these values in mind and computing \(\mathbf{YC}\), we get \[\mathbf{Y} \mathbf{C}=\left[\begin{array}{ccc}
0.2 & 1 & 1.2 \\
0.9 & 0.1 & 0.2 \\
1 & 0.3 & 1.1 \\
1 & 0.9 & 0.5
\end{array}\right]\] Comparing with the second row of the original \(\mathbf{Y}\), \([1\;0\;0]\), we find that in the second row of \(\mathbf{YC}\), \([0.9\;0.1\;0.2]\), the last value is larger than the middle one even though both of those original labels are 0 (they may be missing or genuinely negative). The rigid binary label matrix \(\mathbf{Y}\) has thus been turned into a "probabilistic" matrix that reflects label confidence, which handles missing labels while also taking label correlation into account.
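A quick NumPy check of the example (my own verification, not part of the paper):

```python
import numpy as np

Y = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)
C = np.array([[0.9, 0.1, 0.2],
              [0.1, 0.8, 0.3],
              [0.1, 0.2, 0.9]])

print(Y @ np.eye(3))  # identical to Y: with C = I the reconstruction is exact
print(Y @ C)          # the soft, "probabilistic" label matrix shown above
# Second row of YC is [0.9, 0.1, 0.2]: label 2 now scores higher than label 1,
# even though both are 0 in the original Y.
```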
The entries of \(\mathbf{C}\) and \(\mathbf{W}\) should not be too large, and this requirement is reflected in the regularization penalties on them. As I understand it, keeping the values small limits how much each multiplication can swing the result at every iteration (I think this is one way to see it?); when the values are large the model cannot adapt to different targets and instead blindly accommodates the current training set, i.e. it overfits. (The reason given in the paper is different: a class label may be related to only a subset of the other class labels, so \(\mathbf{C}\) should be sparse.)
Fitting \(\mathbf{YC}\) toward \(\mathbf{Y}\) essentially pushes \(\mathbf{C}\) toward the identity matrix. From a global point of view we hope each label stays as independent as possible, i.e. apart from the diagonal the other entries of \(\mathbf{C}\) should be small, which preserves information and separability. I also have another way of looking at this: the term guarantees that the result fitted by \(\mathbf{XW}\) stays consistent with the target matrix \(\mathbf{Y}\); if \(\mathbf{YC}\) drifted far from the original \(\mathbf{Y}\), the linear model \(\mathbf{XW}\) would lose its most basic fidelity.
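A small check of the claim that the \(\lambda_1\) term pulls \(\mathbf{C}\) toward the identity: with the toy \(\mathbf{Y}\) above, which happens to have full column rank, the \(\mathbf{C}\) minimizing \(\|\mathbf{YC}-\mathbf{Y}\|_F\) is exactly the identity matrix. This is my own illustration rather than an argument from the paper:

```python
import numpy as np

Y = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)

# Solve min_C ||Y C - Y||_F by least squares; because this Y has full column
# rank, the unique minimizer is the identity matrix.
C_best, *_ = np.linalg.lstsq(Y, Y, rcond=None)
print(np.round(C_best, 6))  # ~ 3x3 identity
```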
As for whether the finally trained \(\mathbf{XW}\) is used directly as the prediction matrix, or whether one still needs to multiply by \(\mathbf{C}^{-1}\), to be honest I am not sure (I need to read the paper further). But I feel that if the fitted \(\mathbf{C}\) is close to the identity matrix, then whether or not this inverse is applied should not make much difference.
(Some of the above are my own guesses; corrections are welcome.)
Going further, LSML also adds a manifold regularizer similar to the one mentioned in yesterday's GLOCAL blog post: \[\sum_{1 \leq i, j \leq l} c_{i j}\left\|\mathbf{w}_{i}-\mathbf{w}_{j}\right\|_{2}^{2}\] The difference is that here the manifold regularization is applied to the columns of the weight matrix \(\mathbf{W}\) rather than to the output matrix. The consideration should be similar to GLOCAL's: \(\mathbf{YC}\) here plays a role similar to the classifier output, which corresponds to what GLOCAL says:
The more positively correlated two labels are, the closer the outputs of the corresponding classifiers should be ("Intuitively, the more positively correlated two labels are, the closer are the corresponding classifier outputs, and vice versa.")
If two columns of \(\mathbf{W}\) (the ones marked by the blue and red boxes in the original figure, not reproduced here) take values close to each other, that is a manifestation of correlation, and it maps directly onto the corresponding positions of \(\mathbf{YC}\). In other words, each label column of \(\mathbf{W}\) effectively represents the classification output, or a mapping of the classifier output.
When the two columns are close enough, a high value in the blue region leads to a high value in the red region, and vice versa; this is a kind of correlated response. So I realized that \(\sum_{1 \leq i, j \leq l} c_{i j}\left\|\mathbf{w}_{i}-\mathbf{w}_{j}\right\|_{2}^{2}\) is really an interplay between the two: the earlier fitting tells us what \(\mathbf{C}\) looks like; when \(c_{ij}\) is large, to realize the correlated response we shrink the Euclidean distance between the weight vectors (the known behavior of \(\mathbf{C}\) coaches the fitting of \(\mathbf{W}\)); when \(c_{ij}\) is small, the constraint on the weight matrix becomes loose and we do not deliberately pull \(\mathbf{w}_{i}\) and \(\mathbf{w}_{j}\) together, reflecting a weak correlation.
Likewise, for ease of computation, \(\sum_{1 \leq i, j \leq l} c_{i j}\left\|\mathbf{w}_{i}-\mathbf{w}_{j}\right\|_{2}^{2}\) is finally rewritten in the equivalent trace form \(\operatorname{tr}\left(\boldsymbol{F}_{0}^{\top} \boldsymbol{L}_{0} \boldsymbol{F}_{0}\right)\); see standard manifold regularization techniques for the details. GLOCAL uses a similar trick.
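To convince myself of that equivalence, here is a small numerical check. I assume the squared Euclidean distance, symmetrize \(\mathbf{C}\) since it need not be symmetric, and let \(\boldsymbol{F}_{0}\) play the role of \(\mathbf{W}^{\top}\); all of these are my own assumptions about the notation.

```python
import numpy as np

rng = np.random.default_rng(1)
m, l = 8, 5
W = rng.standard_normal((m, l))   # columns w_i, one coefficient vector per label
C = rng.random((l, l))            # toy non-negative correlation matrix

S = 0.5 * (C + C.T)               # symmetrized correlations
D = np.diag(S.sum(axis=1))        # degree matrix
L0 = D - S                        # graph Laplacian built from the correlations

lhs = sum(S[i, j] * np.linalg.norm(W[:, i] - W[:, j]) ** 2
          for i in range(l) for j in range(l))
rhs = 2.0 * np.trace(W @ L0 @ W.T)  # trace form with F_0 = W^T; the constant
                                    # factor can be absorbed into the weight
print(np.isclose(lhs, rhs))         # True
```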
(I will keep improving this article as I read further; please point out any mistakes.)
Simple summary
LSML boldly introduces the label correlation matrix \(\mathbf{C}\), which is really a good idea. But on the question of what \(\mathbf{C}\) should be fitted toward I had many doubts at first: why should it approach the identity matrix? Why should its 1-norm be small? Today the teacher's remark about the global and local nuclear norms in DM2L gave me some inspiration: this may be a matter of looking at the problem globally. Often what is not yet understood is just a question of perspective.
This paper also mentions manifold regularization again. We can see that the column vectors whose Euclidean distances are computed in this regularizer do not have to come from the output matrix; matrices that resemble, or embody a mapping of, the output matrix can be used as well.
Machine learning really is flexible. Many conclusions need one's own interpretation before they feel justified, and everything carries a hazy feeling. But perhaps it is acceptable to use a technique appropriately without fully understanding it; collecting more skills and tricks may be what matters most.