Transfer Learning Notes: Adapting Component Analysis
2022-07-29 06:09:00 【orokok】
Notes on the paper "Adapting Component Analysis"
2012 IEEE 12th International Conference on Data Mining
Abstract
A prediction model can perform satisfactorily only when the training data is a suitable representative of the test data, that is, when the training and test data come from the same underlying probability distribution. However, for various reasons this assumption does not hold in many applications. We propose a method based on kernel embedding of distributions and the Hilbert-Schmidt independence criterion (HSIC) to address this problem. The proposed method finds a new representation of the data in a new feature space with two properties:
- In the new feature space, the distributions of the training and test data sets are as close as possible.
- The important structural information of the data is preserved.
Our method has a closed-form solution, and experimental results show that it works well in practice.
Keywords: domain adaptation; kernel embedding; Hilbert-Schmidt independence criterion
I. Introduction
In machine learning, a model is trained to predict the response variable of a test data set. Training is based on minimizing a loss function, and learning achieves its goal only when the training data set is a suitable representative of the test data.
Traditional prediction models assume that the training and test data sets follow the same distribution; in practice, however, the two distributions differ for various reasons.
Domain adaptation has long been a focus in machine learning. Related research includes:
- Covariate shift
- Class imbalance
- Semi-supervised learning
- Multi-task learning
- Sample selection bias
All of these approaches address the problem mainly in one of two ways:
- Reweighting source instances
- Changing the representation space
The methods proposed in the literature share some common shortcomings:
- In high-dimensional data sets, approximating the underlying distributions makes the problem hard to solve.
- Exploring new, not necessarily linear, representations of the data is usually computationally expensive.
- Some domain adaptation techniques apply only to a limited set of prediction models.
The proposed method addresses these issues. The algorithm finds a new representation of the data in a new feature space such that the underlying probability distributions of the embedded training and test data sets are as close as possible, while the important structural information of the data is retained for further prediction and analysis. These two constraints lead to a single optimization problem with a closed-form solution.
Notation
Let $X$ and $\phi$ denote random variables (the input variables in the original feature space and in the new feature space, respectively), and let $Y$ denote the corresponding response variable (the output variable, e.g., a class label). $P(X,Y)$ is the joint probability distribution of $X$ and $Y$, and $P_{tr}(X)$ and $P_{ts}(X)$ denote the true marginal probability distributions of $X$ in the training and test domains. Similarly, $P_{tr}(Y|X)$ and $P_{ts}(Y|X)$ denote the true conditional probability distributions of the two domains.
Bold lowercase letters, e.g. $\mathbf{x}$, denote $d$-dimensional samples. $X_{tr}$ and $X_{ts}$ are the $d\times n_{tr}$ and $d\times n_{ts}$ matrices of training and test samples, where $n_{tr}$ and $n_{ts}$ are the numbers of samples in the training and test data sets. $X_{d\times n}=[X_{tr}\ X_{ts}]$ is the matrix of all $n$ $d$-dimensional samples, where $n=n_{tr}+n_{ts}$. $\Phi$ is the new representation of the data in the new feature space, defined by the mapping $\Psi:X\rightarrow\Phi$, so that $\Phi:=[\Phi_{tr}\ \Phi_{ts}]$, where $\Phi_{tr}$ and $\Phi_{ts}$ denote the training and test data embedded in the new representation space.
II. Method
The main challenge in domain adaptation is the dissimilarity between the joint probability distributions of the training and test data sets. The joint distributions decompose as
$$p_{tr}(X,Y)=p_{tr}(X)\,p_{tr}(Y|X),\qquad p_{ts}(X,Y)=p_{ts}(X)\,p_{ts}(Y|X)$$
Assumption: all the differences between the joint probability distributions of the training and test data sets are caused by the differences between their marginal distributions. We therefore seek a new data representation $\Phi$ under which the marginal distributions of the embedded training and test data sets become similar, i.e. $p_{tr}(\Phi)\approx p_{ts}(\Phi)$.
1. Minimizing the distance between the two probability distributions
Maximum Mean Discrepancy (MMD) is a nonparametric measure of the distance between two distributions:
$$\operatorname{MMD}\left(\hat{P}_{tr},\hat{P}_{ts}\right)=\left\|\mu_{X_{tr}}\left[\hat{P}_{tr}\right]-\mu_{X_{ts}}\left[\hat{P}_{ts}\right]\right\|_{\mathcal{H}}=\sup_{g\in\mathcal{F},\ \|g\|_{\mathcal{H}}\leq 1}\left(\mathbf{E}_{X_{tr}\sim\hat{P}_{tr}}\,g(\mathbf{x}_{tr})-\mathbf{E}_{X_{ts}\sim\hat{P}_{ts}}\,g(\mathbf{x}_{ts})\right)$$
where $\mathbf{E}_{X\sim P}[g(\mathbf{x})]$ is the expectation of $g(\mathbf{x})$ when the sample is drawn from the distribution $P$. The MMD can be estimated empirically as
$$\left\|\mu_{X_{tr}}\left[\hat{P}_{tr}\right]-\mu_{X_{ts}}\left[\hat{P}_{ts}\right]\right\|_{\mathcal{H}}^{2}\approx\operatorname{tr}\left(HL_MHL_\Phi\right)$$
where $L_\Phi$ is a kernel on $\Phi$, here taken to be the linear kernel $\Phi^T\Phi$, $L_M$ is a predefined kernel, and $H=I_n-\frac{1}{n}\mathbf{1}\mathbf{1}^T$ is the usual centering matrix. The objective function is therefore:
$$\mathop{\mathrm{minimize}}\ \operatorname{tr}\left(HL_MHL_\Phi\right)=\operatorname{tr}\left(HL_MH\Phi^T\Phi\right)$$
A trivial solution would be to collapse all samples of each distribution to a single point and then move these two points close to each other. Such a representation, however, would lose the information in the data that is needed for later prediction and analysis; the new representation must therefore also retain all important data characteristics required downstream.
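To make the trace form above concrete, here is a minimal numpy sketch of the empirical MMD² between embedded training and test samples. It follows the formula $\operatorname{tr}(HL_MHL_\Phi)$ literally; the particular construction of $L_M$ (the standard MMD coefficient matrix with entries $1/n_{tr}^2$, $1/n_{ts}^2$, and $-1/(n_{tr}n_{ts})$) and the function names are my assumptions, not code from the paper.

```python
import numpy as np

def centering_matrix(n):
    """H = I - (1/n) * 1 1^T, the usual centering matrix."""
    return np.eye(n) - np.ones((n, n)) / n

def mmd_coefficient_matrix(n_tr, n_ts):
    """Assumed form of L_M: +1/n_tr^2 on train-train entries,
    +1/n_ts^2 on test-test entries, -1/(n_tr*n_ts) on cross entries."""
    n = n_tr + n_ts
    L = np.empty((n, n))
    L[:n_tr, :n_tr] = 1.0 / n_tr**2
    L[n_tr:, n_tr:] = 1.0 / n_ts**2
    L[:n_tr, n_tr:] = -1.0 / (n_tr * n_ts)
    L[n_tr:, :n_tr] = -1.0 / (n_tr * n_ts)
    return L

def empirical_mmd2(Phi_tr, Phi_ts):
    """Empirical MMD^2 between embedded train/test samples (columns),
    written in the trace form tr(H L_M H L_Phi) with L_Phi = Phi^T Phi."""
    n_tr, n_ts = Phi_tr.shape[1], Phi_ts.shape[1]
    Phi = np.hstack([Phi_tr, Phi_ts])      # d' x n
    L_phi = Phi.T @ Phi                    # linear kernel on the embedding
    H = centering_matrix(n_tr + n_ts)
    L_M = mmd_coefficient_matrix(n_tr, n_ts)
    return np.trace(H @ L_M @ H @ L_phi)
```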
2. Preserving the important characteristics of the data
The dependence between the raw data and its new representation can be used as a measure of how well the structure, and the information important for predicting the response variable, are preserved. HSIC serves as a quantitative measure of the dependence between two random variables; it can be expressed through distances between probability distributions and is estimated empirically as:
$$\operatorname{HSIC}\left(X,\Phi\right)=\left(n-1\right)^{-2}\operatorname{tr}\left(HK_XHL_\Phi\right)$$
where $L_\Phi$ is a kernel on $\Phi$, again taken to be $\Phi^T\Phi$, and $K_X$ is a valid kernel on the raw data. The choice of this kernel determines which structure and information are retained. The complementary objective function is:
$$\mathop{\mathrm{maximize}}\ \operatorname{tr}\left(HK_XHL_\Phi\right)=\operatorname{tr}\left(HK_XH\Phi^T\Phi\right)$$
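A companion sketch for the empirical HSIC estimate, again following the trace formula literally; using linear kernels on both the raw data and the embedding is an illustrative choice of mine, not something prescribed here.

```python
import numpy as np

def empirical_hsic(X, Phi):
    """Empirical HSIC between raw samples X (d x n) and their embedding
    Phi (d' x n), with linear kernels K_X = X^T X and L_Phi = Phi^T Phi."""
    n = X.shape[1]
    H = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    K_X = X.T @ X
    L_phi = Phi.T @ Phi
    return np.trace(H @ K_X @ H @ L_phi) / (n - 1) ** 2
```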
3. Adapting Component Analysis
Minimizing the distance between $P(\Phi_{tr})$ and $P(\Phi_{ts})$ while preserving the important characteristics of $X$ yields a single optimization problem that solves the domain adaptation problem; its solution embeds the data into the new feature space. The objective function is defined as:
$$\mathop{\mathrm{maximize}}\ \frac{\operatorname{tr}\left(HK_XHL_\Phi\right)}{\operatorname{tr}\left(HL_MHL_\Phi\right)}$$
The denominator measures the distance between the distributions of the training and test data sets, and the numerator estimates the dependence between the samples in the original space and their corresponding representations. This gives:
$$\mathop{\mathrm{maximize}}\ \frac{\operatorname{tr}\left(HK_XH\Phi^T\Phi\right)}{\operatorname{tr}\left(HL_MH\Phi^T\Phi\right)}=\frac{\operatorname{tr}\left(\Phi HK_XH\Phi^T\right)}{\operatorname{tr}\left(\Phi HL_MH\Phi^T\right)}$$
The objective function is invariant to any rescaling of $\Phi$, so $\Phi$ can be chosen so that the denominator equals 1:
$$\begin{aligned}\mathop{\mathrm{maximize}}\ &\operatorname{tr}\left(\Phi HK_XH\Phi^T\right)\\ \mathrm{subject\ to}\ &\operatorname{tr}\left(\Phi HL_MH\Phi^T\right)=1\end{aligned}$$
This optimization problem has a closed-form solution, so the optimal $\Phi$ can be found directly. It corresponds to an eigenvector problem in which $\Phi^T$ is the eigenvector matrix of $K_X^{-1}L_M$. The number of selected eigenvectors, $d'\leq d$, is the dimension of the data in the new feature space.
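A sketch of the closed-form step, under my reading of it: the constrained trace problem above is solved as the generalized eigenproblem $(HK_XH)v=\lambda(HL_MH)v$, and the top $d'$ generalized eigenvectors form $\Phi^T$. The ridge regularization and the use of scipy.linalg.eigh are implementation choices of mine, not details from the paper.

```python
import numpy as np
from scipy.linalg import eigh

def aca_embed(K_X, L_M, d_prime, ridge=1e-6):
    """Maximize tr(Phi H K_X H Phi^T) subject to tr(Phi H L_M H Phi^T) = 1,
    solved as the generalized eigenproblem (H K_X H) v = lambda (H L_M H) v.
    The ridge keeps the denominator matrix positive definite."""
    n = K_X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    A = H @ K_X @ H
    B = H @ L_M @ H + ridge * np.eye(n)
    A = (A + A.T) / 2                      # guard against numerical asymmetry
    B = (B + B.T) / 2
    eigvals, eigvecs = eigh(A, B)          # eigenvalues in ascending order
    Phi_T = eigvecs[:, -d_prime:]          # top-d' generalized eigenvectors
    return Phi_T.T                         # Phi is d' x n
```

The first $n_{tr}$ columns of the returned $\Phi$ would then be used to train a downstream classifier, and the remaining columns to evaluate it.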
The proposed domain adaptation algorithm is called Adapting Component Analysis (ACA). Besides the training and test sets, it uses the response variables of the training set to improve performance; they are encoded in the kernel $K_X$. Once a suitable representation is found, further prediction algorithms can be applied to the samples in the new feature space.
Choice of the kernel $K_X$ for classification tasks:
The response variables of the training data set are valuable information that can improve the effectiveness of the algorithm.
Using the response-variable information helps to find a new data representation that is better suited to the subsequent prediction and analysis.
Rewriting the linear kernel $K_X$ in block form, we have:
$$K_X=\begin{bmatrix} X_{tr}^T \\ X_{ts}^T \end{bmatrix}\begin{bmatrix} X_{tr} & X_{ts} \end{bmatrix}=\begin{bmatrix} K_{X_{tr}X_{tr}} & K_{X_{tr}X_{ts}} \\ K_{X_{ts}X_{tr}} & K_{X_{ts}X_{ts}} \end{bmatrix}$$
$K_{X_{tr}X_{tr}}$ and $K_{X_{ts}X_{ts}}$ capture the structural information of the training and test data sets, respectively. These two sub-matrices are essential for learning the prediction model and should be preserved.
The matrix $K_X$ can be replaced by $\hat{K}_X$, which is built from the data together with the known response variables.
$K_{X_{ts}X_{tr}}$ originally expresses the similarity between the training and test data sets.
Therefore, the two sub-matrices $K_{X_{ts}X_{tr}}$ and $K_{X_{tr}X_{ts}}$ are replaced by $\hat{K}_{X_{ts}X_{tr}}=K_{X_{ts}X_{tr}}K_{Y_{tr}}$ and $\hat{K}_{X_{tr}X_{ts}}=K_{Y_{tr}}K_{X_{tr}X_{ts}}$, respectively, where $K_{Y_{tr}}$ is a kernel on the response variables of the training data set $X_{tr}$. It expresses the similarity between training labels, and its main effect is to reduce the differences among samples of the same class.
According to this formula, each training sample $x_i$ is replaced by a weighted mean of its same-class samples, with weights proportional to the similarity between $x_i$ and $x_j$. This reduces the variation among same-class samples.
$$\hat{K}_X=\begin{bmatrix} K_{X_{tr}X_{tr}} & \hat{K}_{X_{tr}X_{ts}} \\ \hat{K}_{X_{ts}X_{tr}} & K_{X_{ts}X_{ts}} \end{bmatrix}$$
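A minimal sketch of assembling the modified kernel $\hat{K}_X$, assuming linear kernels on the inputs and a simple 0/1 label kernel $K_{Y_{tr}}$ (1 when two training labels agree, 0 otherwise); the exact label kernel used in the paper is not specified in these notes, so that choice, and the function name, are assumptions.

```python
import numpy as np

def modified_kernel(X_tr, X_ts, y_tr):
    """Build K_hat_X: keep the train-train and test-test blocks of the
    linear kernel, and multiply the cross blocks by a label kernel K_Ytr.
    Here K_Ytr[i, j] = 1 if y_i == y_j else 0 (an assumed, simple choice)."""
    K_trtr = X_tr.T @ X_tr
    K_tsts = X_ts.T @ X_ts
    K_trts = X_tr.T @ X_ts
    K_tstr = K_trts.T
    K_Ytr = (y_tr[:, None] == y_tr[None, :]).astype(float)
    K_hat_trts = K_Ytr @ K_trts            # K_Ytr * K_{Xtr Xts}
    K_hat_tstr = K_tstr @ K_Ytr            # K_{Xts Xtr} * K_Ytr
    return np.block([[K_trtr, K_hat_trts],
                     [K_hat_tstr, K_tsts]])
```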
III. Experimental Results
This paper compares the proposed method with MMDE and CODA.
3.1 The kernel of the response variable
In ACA, the kernel $K_X$ is therefore replaced by $\hat{K}_X$.
Multiplying the kernel $K_{Y_{tr}}$ with $K_{X_{ts}X_{tr}}$ and $K_{X_{tr}X_{ts}}$ essentially replaces each training sample by the weighted mean of its same-class samples, so that the data vary less along same-class samples.
3.2 A toy classification example
The training and test data sets consist of 100 and 200 samples, respectively, drawn from multivariate normal distributions with means $\mu_{tr}=(-1,3)$ and $\mu_{ts}=(2,1)$ and the same covariance matrix $\Sigma=\begin{bmatrix}2&0.5\\0.5&2\end{bmatrix}$.
Both data sets are divided into two classes: a sample belongs to the first class if its first feature is less than the corresponding mean, and to the second class if its first feature is greater than that mean.
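A minimal numpy sketch of this toy setup as described (the sample counts, means, covariance, and labeling rule are taken from the text; the random seed and variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
cov = np.array([[2.0, 0.5],
                [0.5, 2.0]])
mu_tr, mu_ts = np.array([-1.0, 3.0]), np.array([2.0, 1.0])

# 100 training and 200 test samples from the two Gaussians.
X_tr = rng.multivariate_normal(mu_tr, cov, size=100).T   # d x n_tr
X_ts = rng.multivariate_normal(mu_ts, cov, size=200).T   # d x n_ts

# Class 1 if the first feature is below the corresponding mean, class 2 otherwise
# (the distribution mean is used here; the empirical mean would work similarly).
y_tr = np.where(X_tr[0] < mu_tr[0], 1, 2)
y_ts = np.where(X_ts[0] < mu_ts[0], 1, 2)
```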
Figure 1-a shows the data in the original feature space.
Figure 1-b shows the data embedded by the ACA algorithm.
◦ and × mark the two classes of the training data set.
$\Diamond$ and * mark the two classes of the test data set.
It can be seen that the distance between the distributions of the embedded training and test data sets is reduced.
The new training samples therefore represent the test data set better for classification.
A 1-NN classifier applied to the unmodified original data provides the baseline error rate.
The ACA algorithm yields a clear improvement in the classification error rate.
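For reference, a minimal 1-NN baseline using scikit-learn; it assumes the variables X_tr, X_ts, y_tr, y_ts from the toy-data sketch above, and evaluating the ACA embedding would follow the same pattern on the embedded training and test columns.

```python
from sklearn.neighbors import KNeighborsClassifier

# X_tr, X_ts, y_tr, y_ts as generated in the toy-data sketch above.
baseline = KNeighborsClassifier(n_neighbors=1)
baseline.fit(X_tr.T, y_tr)                 # sklearn expects samples as rows
error_rate = 1.0 - baseline.score(X_ts.T, y_ts)
print(f"1-NN baseline error rate: {error_rate:.3f}")
```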
3.3 Real-world data sets
3.3.1 MNIST handwritten digits
The first data set is the MNIST collection of handwritten digit images.
The domain adaptation problem is defined as follows: the classifier is trained on two digits from the training set and tested on two different digits from the test set.
Each data set in Table 2 has 300 training samples and 500 test samples.
In all experiments in this paper, the dimension of the ACA output is set to 2.
The error rates of the different algorithms are shown in Table 2.
The new two-dimensional representation of Dig-1 gives clearly better classification results.
3.3.2 Newsgroup datasets
The second data set is the 20 Newsgroups collection of about 20,000 newsgroup text documents, divided into four groups of similar topics.
Three data sets are generated; for example, the "Newsgroup-1" data set uses 1000 posts randomly selected from groups 1 and 2 as the training set and 2000 posts randomly selected from groups 3 and 4 as the test set.
The reported error rate is the average over 10 trials, with the samples of each trial drawn at random from the original data set.
ACA outperforms the other methods, except on the second data set, Newsgroup-2.
Wine, German Credit, Indian Diabetes, and Ionosphere are UCI archive data sets; their "bias ratio" is defined as 80%.
3.3.3 Breast cancer datasets
The data contain 699 benign (positive label) and malignant (negative label) samples.
This is a binary classification task based on 9 original features.
The experiment is repeated with bias ratios of 70%, 80%, and 90%.
Another measure of the method's efficiency is the Normalized Improvement (NI), which quantifies how well algorithm A performs relative to algorithm B. It is estimated as:
$$NI=\frac{\lvert Error_A-Error_B\rvert}{Error_A}$$
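For example, with hypothetical error rates $Error_A=0.30$ and $Error_B=0.21$, $NI=\lvert 0.30-0.21\rvert/0.30=0.30$, i.e. a normalized improvement of 30%.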
ACA can also be regarded as a dimensionality reduction technique.
Summary
We propose a domain adaptation algorithm that transfers data samples into a new feature space. It finds a new data representation in which the training and test data sets are as close as possible while the important structural information of the data is preserved. To satisfy both properties, we formulate a single, efficiently solvable optimization problem whose solution is given by the eigenvectors of a known matrix. Experimental results show that the algorithm performs well in practice, is efficient in low dimensions, and can also be used as a dimensionality reduction technique.
References
- Adapting Component Analysis