Transfer Learning Notes: Adapting Component Analysis
2022-07-29 06:09:00 【orokok】
Notes on the paper "Adapting Component Analysis"
2012 IEEE 12th International Conference on Data Mining
Abstract
A prediction model can perform satisfactorily only when the training data is a suitable representative of the test data, that is, when the training and test data come from the same underlying probability distribution. However, for various reasons this assumption does not hold in many applications. We propose a method based on kernel embedding of distributions and the Hilbert-Schmidt independence criterion (HSIC) to address this problem. The proposed method finds a new representation of the data in a new feature space with two properties:
- In the new feature space, the distributions of the training and test data sets are as close as possible.
- The important structural information of the data is preserved.
Our method has a closed-form solution, and experimental results show that it works well in practice.
Keywords: domain adaptation; kernel embedding; Hilbert-Schmidt independence criterion
I. Introduction
In machine learning, a model is trained to predict the response variable of a test data set. Training is based on minimizing a loss function, and learning achieves its goal only when the training data set is a suitable representative of the test data.
Traditional prediction models assume that the training and test data sets follow the same distribution; in practice, however, the two distributions differ for various reasons.
Domain adaptation has long been a focus in machine learning. Related research includes:
- Covariate shift
- Class imbalance
- Semi-supervised learning
- Multi-task learning
- Sample selection bias
All of these approaches address the problem mainly in one of two ways:
- Reweighting source instances
- Changing the representation space
The methods proposed in the literature share some common shortcomings:
- In high-dimensional data sets, approximating the underlying distributions makes the problem hard to solve.
- Exploring new, not necessarily linear, representations of the data is usually computationally expensive.
- Some domain adaptation techniques apply only to a limited set of prediction models.
The proposed method addresses these issues. The algorithm finds a new representation of the data in a new feature space such that the underlying probability distributions of the embedded training and test data sets are as close as possible, while the important structural information of the data is retained for further prediction and analysis. These two constraints lead to a single optimization problem with a closed-form solution.
Notation
Let $X$ and $\phi$ denote random variables (the input variables in the original feature space and in the new feature space, respectively), and let $Y$ denote the corresponding response variable (the output variable, e.g., a class label). $P(X,Y)$ is the joint probability distribution of $X$ and $Y$, and $P_{tr}(X)$ and $P_{ts}(X)$ denote the true marginal probability distributions of $X$ in the training and test domains. Similarly, $P_{tr}(Y|X)$ and $P_{ts}(Y|X)$ denote the true conditional probability distributions of the two domains.
Bold lowercase letters, e.g. $\mathbf{x}$, denote $d$-dimensional samples. $X_{tr}$ and $X_{ts}$ are the $d\times n_{tr}$ and $d\times n_{ts}$ matrices of training and test samples, where $n_{tr}$ and $n_{ts}$ are the numbers of samples in the training and test data sets. $X_{d\times n}=[X_{tr}\ X_{ts}]$ is the matrix of all $n$ $d$-dimensional samples, where $n=n_{tr}+n_{ts}$. $\Phi$ is the new representation of the data in the new feature space, defined by the mapping $\Psi:X\rightarrow\Phi$, so that $\Phi:=[\Phi_{tr}\ \Phi_{ts}]$, where $\Phi_{tr}$ and $\Phi_{ts}$ denote the training and test data embedded in the new representation space.
II. Method
The main challenge in domain adaptation is the dissimilarity between the joint probability distributions of the training and test data sets. The joint distributions decompose as
$$p_{tr}(X,Y)=p_{tr}(X)\,p_{tr}(Y|X),\qquad p_{ts}(X,Y)=p_{ts}(X)\,p_{ts}(Y|X)$$
Assumption: all the differences between the joint probability distributions of the training and test data sets are caused by the differences between their marginal distributions. We therefore seek a new data representation $\Phi$ under which the marginal distributions of the embedded training and test data sets become similar, i.e. $p_{tr}(\Phi)\approx p_{ts}(\Phi)$.
1. Minimizing the distance between the two probability distributions
Maximum Mean Discrepancy (MMD) is a nonparametric measure of the distance between two distributions:
$$\operatorname{MMD}\left(\hat{P}_{tr},\hat{P}_{ts}\right)=\left\|\mu_{X_{tr}}\left[\hat{P}_{tr}\right]-\mu_{X_{ts}}\left[\hat{P}_{ts}\right]\right\|_{\mathcal{H}}=\sup_{g\in\mathcal{F},\ \|g\|_{\mathcal{H}}\leq 1}\left(\mathbf{E}_{X_{tr}\sim\hat{P}_{tr}}\,g(\mathbf{x}_{tr})-\mathbf{E}_{X_{ts}\sim\hat{P}_{ts}}\,g(\mathbf{x}_{ts})\right)$$
where $\mathbf{E}_{X\sim P}[g(\mathbf{x})]$ is the expectation of $g(\mathbf{x})$ when the sample is drawn from the distribution $P$. The MMD can be estimated empirically as
$$\left\|\mu_{X_{tr}}\left[\hat{P}_{tr}\right]-\mu_{X_{ts}}\left[\hat{P}_{ts}\right]\right\|_{\mathcal{H}}^{2}\approx\operatorname{tr}\left(HL_MHL_\Phi\right)$$
where $L_\Phi$ is a kernel on $\Phi$, here taken to be the linear kernel $\Phi^T\Phi$, $L_M$ is a predefined kernel, and $H=I_n-\frac{1}{n}\mathbf{1}\mathbf{1}^T$ is the usual centering matrix. The objective function is therefore:
$$\mathop{\mathrm{minimize}}\ \operatorname{tr}\left(HL_MHL_\Phi\right)=\operatorname{tr}\left(HL_MH\Phi^T\Phi\right)$$
A trivial solution would be to collapse all samples of each distribution to a single point and then move these two points close to each other. Such a representation, however, would lose the information in the data that is needed for later prediction and analysis; the new representation must therefore also retain all important data characteristics required downstream.
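To make the trace form above concrete, here is a minimal numpy sketch of the empirical MMD² between embedded training and test samples. It follows the formula $\operatorname{tr}(HL_MHL_\Phi)$ literally; the particular construction of $L_M$ (the standard MMD coefficient matrix with entries $1/n_{tr}^2$, $1/n_{ts}^2$, and $-1/(n_{tr}n_{ts})$) and the function names are my assumptions, not code from the paper.

```python
import numpy as np

def centering_matrix(n):
    """H = I - (1/n) * 1 1^T, the usual centering matrix."""
    return np.eye(n) - np.ones((n, n)) / n

def mmd_coefficient_matrix(n_tr, n_ts):
    """Assumed form of L_M: +1/n_tr^2 on train-train entries,
    +1/n_ts^2 on test-test entries, -1/(n_tr*n_ts) on cross entries."""
    n = n_tr + n_ts
    L = np.empty((n, n))
    L[:n_tr, :n_tr] = 1.0 / n_tr**2
    L[n_tr:, n_tr:] = 1.0 / n_ts**2
    L[:n_tr, n_tr:] = -1.0 / (n_tr * n_ts)
    L[n_tr:, :n_tr] = -1.0 / (n_tr * n_ts)
    return L

def empirical_mmd2(Phi_tr, Phi_ts):
    """Empirical MMD^2 between embedded train/test samples (columns),
    written in the trace form tr(H L_M H L_Phi) with L_Phi = Phi^T Phi."""
    n_tr, n_ts = Phi_tr.shape[1], Phi_ts.shape[1]
    Phi = np.hstack([Phi_tr, Phi_ts])      # d' x n
    L_phi = Phi.T @ Phi                    # linear kernel on the embedding
    H = centering_matrix(n_tr + n_ts)
    L_M = mmd_coefficient_matrix(n_tr, n_ts)
    return np.trace(H @ L_M @ H @ L_phi)
```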
2. Preserving the important characteristics of the data
The dependence between the raw data and its new representation can be used as a measure of how well the structure, and the information important for predicting the response variable, are preserved. HSIC serves as a quantitative measure of the dependence between two random variables; it can be expressed through distances between probability distributions and is estimated empirically as:
$$\operatorname{HSIC}\left(X,\Phi\right)=\left(n-1\right)^{-2}\operatorname{tr}\left(HK_XHL_\Phi\right)$$
where $L_\Phi$ is a kernel on $\Phi$, again taken to be $\Phi^T\Phi$, and $K_X$ is a valid kernel on the raw data. The choice of this kernel determines which structure and information are retained. The complementary objective function is:
$$\mathop{\mathrm{maximize}}\ \operatorname{tr}\left(HK_XHL_\Phi\right)=\operatorname{tr}\left(HK_XH\Phi^T\Phi\right)$$
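A companion sketch for the empirical HSIC estimate, again following the trace formula literally; using linear kernels on both the raw data and the embedding is an illustrative choice of mine, not something prescribed here.

```python
import numpy as np

def empirical_hsic(X, Phi):
    """Empirical HSIC between raw samples X (d x n) and their embedding
    Phi (d' x n), with linear kernels K_X = X^T X and L_Phi = Phi^T Phi."""
    n = X.shape[1]
    H = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    K_X = X.T @ X
    L_phi = Phi.T @ Phi
    return np.trace(H @ K_X @ H @ L_phi) / (n - 1) ** 2
```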
3. Adapting Component Analysis
Minimizing the distance between $P(\Phi_{tr})$ and $P(\Phi_{ts})$ while preserving the important characteristics of $X$ yields a single optimization problem that solves the domain adaptation problem; its solution embeds the data into the new feature space. The objective function is defined as:
$$\mathop{\mathrm{maximize}}\ \frac{\operatorname{tr}\left(HK_XHL_\Phi\right)}{\operatorname{tr}\left(HL_MHL_\Phi\right)}$$
The denominator measures the distance between the distributions of the training and test data sets, and the numerator estimates the dependence between the samples in the original space and their corresponding representations. This gives:
$$\mathop{\mathrm{maximize}}\ \frac{\operatorname{tr}\left(HK_XH\Phi^T\Phi\right)}{\operatorname{tr}\left(HL_MH\Phi^T\Phi\right)}=\frac{\operatorname{tr}\left(\Phi HK_XH\Phi^T\right)}{\operatorname{tr}\left(\Phi HL_MH\Phi^T\right)}$$
The objective function is invariant to any rescaling of $\Phi$, so $\Phi$ can be chosen so that the denominator equals 1:
$$\begin{aligned}\mathop{\mathrm{maximize}}\ &\operatorname{tr}\left(\Phi HK_XH\Phi^T\right)\\ \mathrm{subject\ to}\ &\operatorname{tr}\left(\Phi HL_MH\Phi^T\right)=1\end{aligned}$$
This optimization problem has a closed-form solution, so the optimal $\Phi$ can be found directly. It corresponds to an eigenvector problem in which $\Phi^T$ is the eigenvector matrix of $K_X^{-1}L_M$. The number of selected eigenvectors, $d'\leq d$, is the dimension of the data in the new feature space.
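A sketch of the closed-form step, under my reading of it: the constrained trace problem above is solved as the generalized eigenproblem $(HK_XH)v=\lambda(HL_MH)v$, and the top $d'$ generalized eigenvectors form $\Phi^T$. The ridge regularization and the use of scipy.linalg.eigh are implementation choices of mine, not details from the paper.

```python
import numpy as np
from scipy.linalg import eigh

def aca_embed(K_X, L_M, d_prime, ridge=1e-6):
    """Maximize tr(Phi H K_X H Phi^T) subject to tr(Phi H L_M H Phi^T) = 1,
    solved as the generalized eigenproblem (H K_X H) v = lambda (H L_M H) v.
    The ridge keeps the denominator matrix positive definite."""
    n = K_X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    A = H @ K_X @ H
    B = H @ L_M @ H + ridge * np.eye(n)
    A = (A + A.T) / 2                      # guard against numerical asymmetry
    B = (B + B.T) / 2
    eigvals, eigvecs = eigh(A, B)          # eigenvalues in ascending order
    Phi_T = eigvecs[:, -d_prime:]          # top-d' generalized eigenvectors
    return Phi_T.T                         # Phi is d' x n
```

The first $n_{tr}$ columns of the returned $\Phi$ would then be used to train a downstream classifier, and the remaining columns to evaluate it.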
The proposed domain adaptation algorithm is called Adapting Component Analysis (ACA). Besides the training and test sets, it uses the response variables of the training set to improve performance; they are encoded in the kernel $K_X$. Once a suitable representation is found, further prediction algorithms can be applied to the samples in the new feature space.
Choice of the kernel $K_X$ for classification tasks:
The response variables of the training data set are valuable information that can improve the effectiveness of the algorithm.
Using the response-variable information helps to find a new data representation that is better suited to the subsequent prediction and analysis.
Rewriting the linear kernel $K_X$ in block form, we have:
$$K_X=\begin{bmatrix} X_{tr}^T \\ X_{ts}^T \end{bmatrix}\begin{bmatrix} X_{tr} & X_{ts} \end{bmatrix}=\begin{bmatrix} K_{X_{tr}X_{tr}} & K_{X_{tr}X_{ts}} \\ K_{X_{ts}X_{tr}} & K_{X_{ts}X_{ts}} \end{bmatrix}$$
$K_{X_{tr}X_{tr}}$ and $K_{X_{ts}X_{ts}}$ capture the structural information of the training and test data sets, respectively. These two sub-matrices are essential for learning the prediction model and should be preserved.
The matrix $K_X$ can be replaced by $\hat{K}_X$, which is built from the data together with the known response variables.
$K_{X_{ts}X_{tr}}$ originally expresses the similarity between the training and test data sets.
Therefore, the two sub-matrices $K_{X_{ts}X_{tr}}$ and $K_{X_{tr}X_{ts}}$ are replaced by $\hat{K}_{X_{ts}X_{tr}}=K_{X_{ts}X_{tr}}K_{Y_{tr}}$ and $\hat{K}_{X_{tr}X_{ts}}=K_{Y_{tr}}K_{X_{tr}X_{ts}}$, respectively, where $K_{Y_{tr}}$ is a kernel on the response variables of the training data set $X_{tr}$. It expresses the similarity between training labels, and its main effect is to reduce the differences among samples of the same class.
According to this formula, each training sample $x_i$ is replaced by a weighted mean of its same-class samples, with weights proportional to the similarity between $x_i$ and $x_j$. This reduces the variation among same-class samples.
$$\hat{K}_X=\begin{bmatrix} K_{X_{tr}X_{tr}} & \hat{K}_{X_{tr}X_{ts}} \\ \hat{K}_{X_{ts}X_{tr}} & K_{X_{ts}X_{ts}} \end{bmatrix}$$
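A minimal sketch of assembling the modified kernel $\hat{K}_X$, assuming linear kernels on the inputs and a simple 0/1 label kernel $K_{Y_{tr}}$ (1 when two training labels agree, 0 otherwise); the exact label kernel used in the paper is not specified in these notes, so that choice, and the function name, are assumptions.

```python
import numpy as np

def modified_kernel(X_tr, X_ts, y_tr):
    """Build K_hat_X: keep the train-train and test-test blocks of the
    linear kernel, and multiply the cross blocks by a label kernel K_Ytr.
    Here K_Ytr[i, j] = 1 if y_i == y_j else 0 (an assumed, simple choice)."""
    K_trtr = X_tr.T @ X_tr
    K_tsts = X_ts.T @ X_ts
    K_trts = X_tr.T @ X_ts
    K_tstr = K_trts.T
    K_Ytr = (y_tr[:, None] == y_tr[None, :]).astype(float)
    K_hat_trts = K_Ytr @ K_trts            # K_Ytr * K_{Xtr Xts}
    K_hat_tstr = K_tstr @ K_Ytr            # K_{Xts Xtr} * K_Ytr
    return np.block([[K_trtr, K_hat_trts],
                     [K_hat_tstr, K_tsts]])
```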
III. Experimental Results
This paper compares the proposed method with MMDE and CODA.
3.1 The kernel of the response variable
In ACA, the kernel $K_X$ is therefore replaced by $\hat{K}_X$.
Multiplying the kernel $K_{Y_{tr}}$ with $K_{X_{ts}X_{tr}}$ and $K_{X_{tr}X_{ts}}$ essentially replaces each training sample by the weighted mean of its same-class samples, so that the data vary less along same-class samples.
3.2 A toy classification example
The training and test data sets consist of 100 and 200 samples, respectively, drawn from multivariate normal distributions with means $\mu_{tr}=(-1,3)$ and $\mu_{ts}=(2,1)$ and the same covariance matrix $\Sigma=\begin{bmatrix}2&0.5\\0.5&2\end{bmatrix}$.
Both data sets are divided into two classes: a sample belongs to the first class if its first feature is less than the corresponding mean, and to the second class if its first feature is greater than that mean.
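A minimal numpy sketch of this toy setup as described (the sample counts, means, covariance, and labeling rule are taken from the text; the random seed and variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
cov = np.array([[2.0, 0.5],
                [0.5, 2.0]])
mu_tr, mu_ts = np.array([-1.0, 3.0]), np.array([2.0, 1.0])

# 100 training and 200 test samples from the two Gaussians.
X_tr = rng.multivariate_normal(mu_tr, cov, size=100).T   # d x n_tr
X_ts = rng.multivariate_normal(mu_ts, cov, size=200).T   # d x n_ts

# Class 1 if the first feature is below the corresponding mean, class 2 otherwise
# (the distribution mean is used here; the empirical mean would work similarly).
y_tr = np.where(X_tr[0] < mu_tr[0], 1, 2)
y_ts = np.where(X_ts[0] < mu_ts[0], 1, 2)
```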
Figure 1-a shows the data in the original feature space.
Figure 1-b shows the data embedded by the ACA algorithm.
◦ and × mark the two classes of the training data set.
$\Diamond$ and * mark the two classes of the test data set.
It can be seen that the distance between the distributions of the embedded training and test data sets is reduced.
The new training samples therefore represent the test data set better for classification.
A 1-NN classifier applied to the unmodified original data provides the baseline error rate.
The ACA algorithm yields a clear improvement in the classification error rate.
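For reference, a minimal 1-NN baseline using scikit-learn; it assumes the variables X_tr, X_ts, y_tr, y_ts from the toy-data sketch above, and evaluating the ACA embedding would follow the same pattern on the embedded training and test columns.

```python
from sklearn.neighbors import KNeighborsClassifier

# X_tr, X_ts, y_tr, y_ts as generated in the toy-data sketch above.
baseline = KNeighborsClassifier(n_neighbors=1)
baseline.fit(X_tr.T, y_tr)                 # sklearn expects samples as rows
error_rate = 1.0 - baseline.score(X_ts.T, y_ts)
print(f"1-NN baseline error rate: {error_rate:.3f}")
```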
3.3 Real-world data sets
3.3.1 MNIST handwritten digits
The first data set is the MNIST collection of handwritten digit images.
The domain adaptation problem is defined as follows: the classifier is trained on two digits from the training set and tested on two different digits from the test set.
Each data set in Table 2 has 300 training samples and 500 test samples.
In all experiments in this paper, the dimension of the ACA output is set to 2.
The error rates of the different algorithms are shown in Table 2.
The new two-dimensional representation of Dig-1 gives clearly better classification results.
3.3.2 Newsgroup datasets
The second data set is the 20 Newsgroups collection of about 20,000 newsgroup text documents, divided into four groups of similar topics.
Three data sets are generated; for example, the "Newsgroup-1" data set uses 1000 posts randomly selected from groups 1 and 2 as the training set and 2000 posts randomly selected from groups 3 and 4 as the test set.
The reported error rate is the average over 10 trials, with the samples of each trial drawn at random from the original data set.
ACA outperforms the other methods, except on the second data set, Newsgroup-2.
Wine, German Credit, Indian Diabetes, and Ionosphere are UCI archive data sets; their "bias ratio" is defined as 80%.
3.3.3 Breast cancer datasets
The data contain 699 benign (positive label) and malignant (negative label) samples.
This is a binary classification task based on 9 original features.
The experiment is repeated with bias ratios of 70%, 80%, and 90%.
Another measure of the method's efficiency is the Normalized Improvement (NI), which quantifies how well algorithm A performs relative to algorithm B. It is estimated as:
$$NI=\frac{\lvert Error_A-Error_B\rvert}{Error_A}$$
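For example, with hypothetical error rates $Error_A=0.30$ and $Error_B=0.21$, $NI=\lvert 0.30-0.21\rvert/0.30=0.30$, i.e. a normalized improvement of 30%.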
ACA can also be regarded as a dimensionality reduction technique.
Summary
We propose a domain adaptation algorithm that transfers data samples into a new feature space. It finds a new data representation in which the training and test data sets are as close as possible while the important structural information of the data is preserved. To satisfy both properties, we formulate a single, efficiently solvable optimization problem whose solution is given by the eigenvectors of a known matrix. Experimental results show that the algorithm performs well in practice, is efficient in low dimensions, and can also be used as a dimensionality reduction technique.
References
- Adapting Component Analysis