Learning notes on the multi-label method LSML
2022-07-01 05:50:00 【LTA_ ALBlack】
Original paper: Huang, J., Qin, F., Zheng, X., Cheng, Z.-K., Yuan, Z.-X., Zhang, W.-G., & Huang, Q.-M. (2019). Improving multi-label classification with missing labels by learning label-specific features. Information Sciences, 492, 124–146.
Contents

1. Understanding of the original abstract
2. Basic symbols
3. Formula model
Simple summary
1. Understanding of the original abstract
Much existing multi-label learning represents all labels with the same data, i.e. the same feature set for every label (my understanding: the features specific to each individual label are not extracted?), and it expects to see the complete set of labels in the training data. In multi-label learning, however, each label may be decided by certain features specific to itself, and in some practical applications only a partial label set can be obtained.
LSML (Label-Specific features for multi-label classification with Missing Labels) focuses on exactly this missing-label problem. The phrase "label-specific features" here reminds one of Prof. Min-Ling Zhang's LIFT, which builds a new feature set for each label, but although this paper also speaks of label-specific features for multi-label learning, its style is completely different from LIFT. (It opens up new possibilities for the label-specific-features branch of multi-label learning.)
Zhang, M.-L., & Wu, L. (2015). LIFT: Multi-label learning with label-specific features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37, 107–120.
The way LSML realizes label correlation is to start from the incomplete label matrix \(\mathbf{Y}\) and, by learning high-order label correlations, obtain a new \(l\times l\) supplementary label matrix \(\mathbf{C}\). The learned high-order label correlations are incorporated, and a multi-label classifier is built on top of them at the same time.
(This \(\mathbf{C}\) seems to summarize the characteristics of the original labels? After all, it describes the mapping from one label to all the other labels; perhaps this is what the paper emphasizes as "label-specific features"?)
The challenge here is how to learn accurate label correlations from incomplete label data. Since the label matrix has many missing entries, LSML tries to improve correlation quality and recognition accuracy by constructing the label correlation matrix directly under the missing-information setting, rather than by first filling in the missing label matrix and then learning. The author argues that when class labels of the training data are missing, label correlations learned directly from the incomplete label matrix may be inaccurate, which greatly affects the performance of the multi-label classifier. So the author does not use the kind of matrix-factorization completion of the label matrix used by GLOCAL, which I introduced in an earlier blog post.
2. Basic symbols
| Symbol | Meaning | Notes |
| --- | --- | --- |
| \(\mathbf{X} \in \mathbb{R}^{n \times m}\) | Feature (attribute) matrix | |
| \(\mathbf{Y} \in\{0,1\}^{n \times l}\) | Label matrix | |
| \(\mathbf{W} \in \mathbb{R}^{m \times l}\) | Coefficient matrix | Still a linear model |
| \(\mathbf{w}_{i} \in \mathbb{R}^{m}\) | Coefficient vector of one label | |
| \(\mathbf{C} \in \mathbb{R}^{l \times l}\) | Label correlation matrix | Pairwise correlations; not symmetric in general |
The matrix \(\mathbf{C}\) describes, somewhat like an adjacency matrix, the correlation between any label \(i\) and any label \(j\). It is worth noting that this matrix is not symmetric, i.e. the pairwise correlations in this paper do not commute (original authors' note: \(C_{ij}\) may not equal \(C_{ji}\), although in the experiments they find that in most cases \(C_{ij} = C_{ji}\)). According to the author, this label-specific-features treatment of multi-label learning is similar to LLSF-DL and JFSC, but those schemes compute the pairwise label correlations in advance, while LSML fits \(\mathbf{C}\), letting the machine learn the correlations by itself, on the assumption that any missing label can be reconstructed from the values of the other labels.
LLSF-DL is from the paper:
Huang, J., Li, G., Huang, Q., & Wu, X. (2016). Learning label-specific features and class-dependent labels for multi-label classification. IEEE Transactions on Knowledge and Data Engineering, 28(12), 3309–3323.
JFSC is from the paper:
Huang, J., Li, G., Huang, Q., & Wu, X. (2018). Joint feature selection and classification for multilabel learning. IEEE Transactions on Cybernetics, 48(3), 876–889.
3. Formula model
The author's fitting still starts from the linear model: fit the coefficient matrix \(\mathbf{W}\) so that \(\mathbf{X} \mathbf{W}\approx \mathbf{Y}\), while putting a regularization penalty on \(\mathbf{W}\): \[\min _{\mathbf{W}} \frac{1}{2}\|\mathbf{X} \mathbf{W}-\mathbf{Y}\|_{F}^{2}+\lambda_{3}\|\mathbf{W}\|_{1} \tag{1}\] Regularizing the coefficient matrix is already a common trick. The most precise measure of sparsity would be the 0-norm, which is also the author's original intention, but since the 0-norm is inconvenient to optimize, it is replaced by its convex surrogate, the 1-norm. This substitution has something in common with the formula replacement in PML-NI.
This is part of the charm of machine learning: it does not insist on full mathematical rigor. Most of the time, as long as the loss introduced by the surrogate is acceptable, the replacement is considered fine!
PML-NI is from the paper:
Xie, M.-K., & Huang, S.-J. (2022). Partial multi-label learning with noisy label identification. IEEE Transactions on Pattern Analysis and Machine Intelligence.
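To make objective (1) concrete for myself, here is a minimal NumPy sketch that evaluates it and takes one proximal-gradient (ISTA) step with soft-thresholding. This is only my own illustration with made-up toy shapes and a made-up \(\lambda_3\); it is not the solver used in the paper.

```python
import numpy as np

def soft_threshold(A, tau):
    """Proximal operator of tau * ||A||_1: elementwise soft-thresholding."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def objective_1(X, W, Y, lam3):
    """Objective (1): 0.5 * ||XW - Y||_F^2 + lam3 * ||W||_1."""
    return 0.5 * np.linalg.norm(X @ W - Y, "fro") ** 2 + lam3 * np.abs(W).sum()

rng = np.random.default_rng(0)
n, m, l = 50, 20, 5                          # toy sizes, not from the paper
X = rng.standard_normal((n, m))
Y = (rng.random((n, l)) < 0.3).astype(float)
W = np.zeros((m, l))
lam3 = 0.1

# One ISTA step: gradient step on the smooth part, then soft-thresholding.
step = 1.0 / np.linalg.norm(X, 2) ** 2       # 1 / Lipschitz constant of the gradient
grad = X.T @ (X @ W - Y)
W_new = soft_threshold(W - step * grad, step * lam3)
print(objective_1(X, W, Y, lam3), objective_1(X, W_new, Y, lam3))  # objective decreases
```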
Then we continue by bringing the correlation matrix \(\mathbf{C}\) into the optimization objective:
\[\begin{aligned}
\min _{\mathbf{W}, \mathbf{C}} & \frac{1}{2}\|\mathbf{X W}-\mathbf{Y C}\|_{F}^{2}+\frac{\lambda_{1}}{2}\|\mathbf{Y C}-\mathbf{Y}\|_{F}^{2}+\lambda_{2}\|\mathbf{C}\|_{1}+\lambda_{3}\|\mathbf{W}\|_{1} \\
& \text { s.t. } \mathbf{C} \succeq 0
\end{aligned} \tag{2}\] This model replaces the \(\mathbf{Y}\) in the earlier linear term with \(\mathbf{YC}\), i.e. a completion and reconstruction of \(\mathbf{Y}\). The second and third terms constrain \(\mathbf{C}\): \(\mathbf{YC}\) is fitted toward \(\mathbf{Y}\) while \(\mathbf{C}\) is kept sparse, and during this fitting \(\mathbf{YC}\) gradually fills in the labels missing from the original \(\mathbf{Y}\).
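For reference, here is a small sketch of how I would evaluate the full objective (2) for given matrices. I read the constraint \(\mathbf{C} \succeq 0\) as elementwise non-negativity, which is my own reading, and the projection below is just one simple way to respect it.

```python
import numpy as np

def objective_2(X, Y, W, C, lam1, lam2, lam3):
    """Objective (2): XW is fitted to the reconstructed labels YC,
    YC is kept close to the observed Y, and C and W are kept sparse."""
    fit      = 0.5 * np.linalg.norm(X @ W - Y @ C, "fro") ** 2
    recon    = 0.5 * lam1 * np.linalg.norm(Y @ C - Y, "fro") ** 2
    sparse_c = lam2 * np.abs(C).sum()
    sparse_w = lam3 * np.abs(W).sum()
    return fit + recon + sparse_c + sparse_w

def project_nonneg(C):
    """Clip negative entries to zero, one way to keep C elementwise non-negative."""
    return np.maximum(C, 0.0)
```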
Following the example given by the teacher, take \[\mathbf{Y}=\left[\begin{array}{lll}
0 & 1 & 1 \\
1 & 0 & 0 \\
1 & 0 & 1 \\
1 & 1 & 0
\end{array}\right]\] If \(\mathbf{C}\) is simply the identity matrix, the fit of \(\mathbf{YC}\) to \(\mathbf{Y}\) is exact: with \[\mathbf{I}=\left[\begin{array}{lll}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{array}\right]\] we have \[\mathbf{Y I}=\mathbf{Y}.\] Now suppose instead there is a non-identity matrix \(\mathbf{C}\) of the form \[\mathbf{C}=\left[\begin{array}{lll}
0.9 & 0.1 & 0.2 \\
0.1 & 0.8 & 0.3 \\
0.1 & 0.2 & 0.9
\end{array}\right]\] The diagonal entries of this \(\mathbf{C}\) are relatively large, meaning every label is most strongly correlated with itself; off the diagonal, for example, label 0 is more correlated with label 2 than with label 1 (0.2 > 0.1). Keeping these values in mind and computing \(\mathbf{YC}\), we get \[\mathbf{Y} \mathbf{C}=\left[\begin{array}{ccc}
0.2 & 1 & 1.2 \\
0.9 & 0.1 & 0.2 \\
1 & 0.3 & 1.1 \\
1 & 0.9 & 0.5
\end{array}\right]\] Comparing with the second row of the original \(\mathbf{Y}\), \([1\;0\;0]\), we find that in the second row of \(\mathbf{YC}\), \([0.9\;0.1\;0.2]\), the last value is larger than the middle one even though both of those original labels are 0 (they may be missing or genuinely negative). The rigid binary label matrix \(\mathbf{Y}\) has thus been turned into a "probabilistic" matrix that reflects label confidence, which handles missing labels while also taking label correlation into account.
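A quick NumPy check of the example (my own verification, not part of the paper):

```python
import numpy as np

Y = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)
C = np.array([[0.9, 0.1, 0.2],
              [0.1, 0.8, 0.3],
              [0.1, 0.2, 0.9]])

print(Y @ np.eye(3))  # identical to Y: with C = I the reconstruction is exact
print(Y @ C)          # the soft, "probabilistic" label matrix shown above
# Second row of YC is [0.9, 0.1, 0.2]: label 2 now scores higher than label 1,
# even though both are 0 in the original Y.
```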
The entries of \(\mathbf{C}\) and \(\mathbf{W}\) should not be too large, and this requirement is reflected in the regularization penalties on them. As I understand it, keeping the values small limits how much each multiplication can swing the result at every iteration (I think this is one way to see it?); when the values are large the model cannot adapt to different targets and instead blindly accommodates the current training set, i.e. it overfits. (The reason given in the paper is different: a class label may be related to only a subset of the other class labels, so \(\mathbf{C}\) should be sparse.)
Fitting \(\mathbf{YC}\) toward \(\mathbf{Y}\) essentially pushes \(\mathbf{C}\) toward the identity matrix. From a global point of view we hope each label stays as independent as possible, i.e. apart from the diagonal the other entries of \(\mathbf{C}\) should be small, which preserves information and separability. I also have another way of looking at this: the term guarantees that the result fitted by \(\mathbf{XW}\) stays consistent with the target matrix \(\mathbf{Y}\); if \(\mathbf{YC}\) drifted far from the original \(\mathbf{Y}\), the linear model \(\mathbf{XW}\) would lose its most basic fidelity.
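A small check of the claim that the \(\lambda_1\) term pulls \(\mathbf{C}\) toward the identity: with the toy \(\mathbf{Y}\) above, which happens to have full column rank, the \(\mathbf{C}\) minimizing \(\|\mathbf{YC}-\mathbf{Y}\|_F\) is exactly the identity matrix. This is my own illustration rather than an argument from the paper:

```python
import numpy as np

Y = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)

# Solve min_C ||Y C - Y||_F by least squares; because this Y has full column
# rank, the unique minimizer is the identity matrix.
C_best, *_ = np.linalg.lstsq(Y, Y, rcond=None)
print(np.round(C_best, 6))  # ~ 3x3 identity
```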
As for whether the finally trained \(\mathbf{XW}\) is used directly as the prediction matrix, or whether one still needs to multiply by \(\mathbf{C}^{-1}\), to be honest I am not sure (I need to read the paper further). But I feel that if the fitted \(\mathbf{C}\) is close to the identity matrix, then whether or not this inverse is applied should not make much difference.
(Some of the above are my own guesses; corrections are welcome.)
Going further, LSML also adds a manifold regularizer similar to the one mentioned in yesterday's GLOCAL blog post: \[\sum_{1 \leq i, j \leq l} c_{i j}\left\|\mathbf{w}_{i}-\mathbf{w}_{j}\right\|_{2}^{2}\] The difference is that here the manifold regularization is applied to the columns of the weight matrix \(\mathbf{W}\) rather than to the output matrix. The consideration should be similar to GLOCAL's: \(\mathbf{YC}\) here plays a role similar to the classifier output, which corresponds to what GLOCAL says:
The more positively correlated two labels are, the closer the outputs of the corresponding classifiers should be ("Intuitively, the more positively correlated two labels are, the closer are the corresponding classifier outputs, and vice versa.")
If two columns of \(\mathbf{W}\) (the ones marked by the blue and red boxes in the original figure, not reproduced here) take values close to each other, that is a manifestation of correlation, and it maps directly onto the corresponding positions of \(\mathbf{YC}\). In other words, each label column of \(\mathbf{W}\) effectively represents the classification output, or a mapping of the classifier output.
When the two columns are close enough, a high value in the blue region leads to a high value in the red region, and vice versa; this is a kind of correlated response. So I realized that \(\sum_{1 \leq i, j \leq l} c_{i j}\left\|\mathbf{w}_{i}-\mathbf{w}_{j}\right\|_{2}^{2}\) is really an interplay between the two: the earlier fitting tells us what \(\mathbf{C}\) looks like; when \(c_{ij}\) is large, to realize the correlated response we shrink the Euclidean distance between the weight vectors (the known behavior of \(\mathbf{C}\) coaches the fitting of \(\mathbf{W}\)); when \(c_{ij}\) is small, the constraint on the weight matrix becomes loose and we do not deliberately pull \(\mathbf{w}_{i}\) and \(\mathbf{w}_{j}\) together, reflecting a weak correlation.
Likewise, for ease of computation, \(\sum_{1 \leq i, j \leq l} c_{i j}\left\|\mathbf{w}_{i}-\mathbf{w}_{j}\right\|_{2}^{2}\) is finally rewritten in the equivalent trace form \(\operatorname{tr}\left(\boldsymbol{F}_{0}^{\top} \boldsymbol{L}_{0} \boldsymbol{F}_{0}\right)\); see standard manifold regularization techniques for the details. GLOCAL uses a similar trick.
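To convince myself of that equivalence, here is a small numerical check. I assume the squared Euclidean distance, symmetrize \(\mathbf{C}\) since it need not be symmetric, and let \(\boldsymbol{F}_{0}\) play the role of \(\mathbf{W}^{\top}\); all of these are my own assumptions about the notation.

```python
import numpy as np

rng = np.random.default_rng(1)
m, l = 8, 5
W = rng.standard_normal((m, l))   # columns w_i, one coefficient vector per label
C = rng.random((l, l))            # toy non-negative correlation matrix

S = 0.5 * (C + C.T)               # symmetrized correlations
D = np.diag(S.sum(axis=1))        # degree matrix
L0 = D - S                        # graph Laplacian built from the correlations

lhs = sum(S[i, j] * np.linalg.norm(W[:, i] - W[:, j]) ** 2
          for i in range(l) for j in range(l))
rhs = 2.0 * np.trace(W @ L0 @ W.T)  # trace form with F_0 = W^T; the constant
                                    # factor can be absorbed into the weight
print(np.isclose(lhs, rhs))         # True
```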
(I will keep improving this article as I read further; please point out any mistakes.)
Simple summary
LSML boldly introduces the label correlation matrix \(\mathbf{C}\), which is really a good idea. But on the question of what \(\mathbf{C}\) should be fitted toward I had many doubts at first: why should it approach the identity matrix? Why should its 1-norm be small? Today the teacher's remark about the global and local nuclear norms in DM2L gave me some inspiration: this may be a matter of looking at the problem globally. Often what is not yet understood is just a question of perspective.
This paper also mentions manifold regularization again. We can see that the column vectors whose Euclidean distances are computed in this regularizer do not have to come from the output matrix; matrices that resemble, or embody a mapping of, the output matrix can be used as well.
Machine learning really is flexible. Many conclusions need one's own interpretation before they feel justified, and everything carries a hazy feeling. But perhaps it is acceptable to use a technique appropriately without fully understanding it; collecting more skills and tricks may be what matters most.