Transfer Learning: Robust Visual Domain Adaptation with Low-Rank Reconstruction
2022-07-29 06:10:00 【orokok】
Study notes on *Robust Visual Domain Adaptation with Low-Rank Reconstruction*
Computer Vision
Abstract
This paper proposes a low-rank reconstruction method to reduce the discrepancy between domain distributions.
The method captures the intrinsic relatedness of the source samples during adaptation while identifying the noise and outliers in the source domain that cannot be adapted, which makes it more robust than previous methods.
The problem is formulated as a constrained nuclear-norm and $\ell_{2,1}$-norm minimization objective, which is then optimized with the augmented Lagrange multiplier (ALM) method.
Extensive experiments on a variety of visual adaptation tasks show that the proposed method consistently and significantly outperforms existing domain adaptation methods.
1. Introduction
Visual classification often faces the dilemma of abundant data but scarce labels. To address this, one can use data samples from the Internet to enrich a limited collection of training samples. However, one problem with this strategy is the possible mismatch between the target domain under consideration and the source domain that provides the additional data and labels. Physically, this mismatch arises from variations of visual cues across domains, such as resolution, viewpoint, and lighting.
In a given feature space, the mismatch corresponds to a change in data distribution; to be precise, the marginal distributions of the samples in the source and target domains differ. This makes it harmful to merge data directly from the source domain.
Formally, domain adaptation addresses the problem where the marginal distributions of the source samples $X_s$ and the target samples $X_t$ differ, provided that the conditional distributions of the labels, $P(Y_s \mid X_s)$ and $P(Y_t \mid X_t)$ (where $Y_s$ and $Y_t$ denote the labels in either domain), are similar.
Depending on how the source information is used, existing methods divide into classifier-based and representation-based adaptation.
The former adapts to the target distribution implicitly by adjusting a classifier trained on the source domain; the latter learns a transformation of the source data representation to achieve alignment.
Two issues are common to previous approaches:
- First, during adaptation they usually process the source samples individually, without considering their interdependence. This may, implicitly or explicitly, result in an arbitrarily scattered adapted distribution, and any structural information of the source data beyond individual samples may be destroyed.
- Second, they blindly transfer all samples, including noise and, in particular, possible outliers, from the source domain to the target. The latter may seriously distort or corrupt the recognition model learned on the transferred data.
We propose a new visual domain adaptation method that not only preserves the intrinsic relatedness of the source samples during adaptation, but also achieves more robust adaptation by accounting for noise and removing outliers.
The basic idea is to transform the data samples in the source domain into an intermediate representation such that each transformed sample can be linearly reconstructed from the samples in the target domain.
On top of this linear relationship, a low-rank structure is used to capture the intrinsic relatedness of the source samples, while a sparse structure identifies outlying samples. The whole transformation process is unsupervised and uses no label information.
We then formulate the proposal as a constrained nuclear-norm and $\ell_{2,1}$-norm minimization problem and optimize it with the augmented Lagrange multiplier (ALM) method.
In addition, we extend the approach to scenarios with multiple related source domains and propose a multi-task low-rank domain adaptation method that simultaneously adapts multiple source domains to the target domain through low-rank reconstruction.
2. Related Work
- Daumé III et al. proposed Feature Replication (FR), which trains a support vector machine on simple augmented features of the source and target domains.
- Yang et al. proposed the Adaptive Support Vector Machine (A-SVM), in which the target classifier $f^t(x)$ is adapted from the auxiliary classifier $f^s(x)$; training thus reduces to learning a perturbation $\Delta f(x)$ such that $f^t(x)=f^s(x)+\Delta f(x)$.
- Jiang et al. proposed the Cross-Domain SVM (CDSVM), which defines a weight for each source sample based on its $k$ nearest neighbors and then retrains the SVM classifier with the updated weights.

Some other work uses multiple kernel learning to align the distributions of the source and target domains.
In addition, Saenko et al. proposed a metric learning method that adapts a visual model learned in the source domain to a new domain while minimizing the discrepancy between the feature distributions.
The work most relevant to our proposal is [13], which presents an unsupervised incremental learning algorithm.
Specifically, it creates a series of intermediate representation subspaces between the source and target domains (hence "incremental") to account for the domain shift, along which the source label information can be "propagated" to the target domain.
In comparison, we focus on a direct transformation while emphasizing sample relatedness and noise/outlier removal; like theirs, our setting is also unsupervised during the transformation.
Robust principal component analysis aims to decompose a corrupted low-rank matrix $X$ into a clean low-rank matrix $Z$ plus a sparse matrix $E$ that accounts for the sparse errors.
In addition, Chen et al. proposed using a low-rank structure to capture the relatedness of different tasks in multi-task learning, while using the $\ell_{2,1}$ norm to remove outliers.
In contrast, our method exploits the strengths of both low-rank and group-sparse structures to find a transformation that bridges the distribution gap between domains.
3. Robust Domain Adaptation via Low-Rank Reconstruction
3.1 Single-Source Domain Adaptation
Suppose we have a single source domain with $n$ samples $S=[s_1,\dots,s_n]\in \mathbb{R}^{d\times n}$ and a target domain with $p$ samples $T=[t_1,\dots,t_p]\in \mathbb{R}^{d\times p}$, where $d$ is the dimensionality of the feature vectors. Our goal is to find a transformation matrix $W\in \mathbb{R}^{d\times d}$ that converts the source domain $S$ into an intermediate representation matrix satisfying the following relation:

$$WS=TZ,\tag{1}$$

where $WS=[Ws_1,\dots,Ws_n]\in \mathbb{R}^{d\times n}$ denotes the transformed source matrix to be reconstructed from the target domain, and $Z=[z_1,\dots,z_n]\in \mathbb{R}^{p\times n}$ is the reconstruction coefficient matrix, in which each $z_i\in \mathbb{R}^{p}$ is the reconstruction coefficient vector corresponding to the transformed sample $Ws_i$.
In this way, each transformed source sample is linearly reconstructed from the target samples, which can significantly reduce the discrepancy between the domain distributions.
However, the formulation above reconstructs each source sample independently, and therefore may fail to capture any structural information of the source domain.
Another problem with the reconstruction in Eq. (1) is that it cannot handle gross noise and outliers in the source domain that have no association with the target domain.
To address these problems effectively, we formulate the domain adaptation problem with the following objective function:
$$\begin{aligned} \mathop{\min}_{W,Z,E}\;\; & \operatorname{rank}(Z)+\alpha\|E\|_{2,1}\\ \text{s.t.}\;\; & WS=TZ+E,\\ & WW^T=I, \end{aligned}\tag{2}$$

where $\operatorname{rank}(\cdot)$ denotes the rank of a matrix, $\|E\|_{2,1}=\sum^n_{j=1}\sqrt{\sum^d_{i=1}(E_{ij})^2}$ is called the $\ell_{2,1}$ norm, and $\alpha>0$ is a trade-off parameter.
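As a quick sanity check of this definition, the $\ell_{2,1}$ norm is just the sum of the Euclidean norms of the columns of $E$, so driving it down pushes entire columns to zero. A minimal NumPy sketch (the matrix below is a made-up example, not data from the paper):

```python
import numpy as np

def l21_norm(E):
    """l2,1 norm: sum of the Euclidean norms of the columns of E."""
    return float(np.sum(np.linalg.norm(E, axis=0)))

E = np.array([[3.0, 0.0, 1.0],
              [4.0, 0.0, 2.0]])
print(l21_norm(E))  # 5.0 + 0.0 + sqrt(5) ~= 7.236
```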
The constraint $WW^T=I$ ensures that the obtained $W$ is a basis transformation matrix.
First, minimizing $\operatorname{rank}(Z)$ tends to find a reconstruction coefficient matrix with the lowest-rank structure.
This essentially couples the reconstructions of the different source samples, which encodes the relatedness of all source samples.
Second, minimizing $\|E\|_{2,1}$ encourages the columns of $E$ to be zero, based on the assumption that some samples in the source domain are noise or outliers while the others are clean enough to be adapted successfully.
By factoring the noise and outlier information of the source domain into the matrix $E$, the adaptation algorithm becomes more robust to noise and outliers.
The optimization problem above is difficult to solve due to the discrete nature of the rank function. Fortunately, the following optimization provides a good surrogate for problem (2):
$$\begin{aligned} \mathop{\min}_{W,Z,E}\;\; & \|Z\|_*+\alpha\|E\|_{2,1}\\ \text{s.t.}\;\; & WS=TZ+E,\\ & WW^T=I, \end{aligned}\tag{3}$$

where $\|\cdot\|_*$ denotes the nuclear norm of a matrix, i.e., the sum of its singular values.
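The nuclear norm is easy to evaluate from a singular value decomposition; a small sketch, assuming NumPy, with a random matrix standing in for $Z$:

```python
import numpy as np

def nuclear_norm(Z):
    """Nuclear norm: sum of the singular values of Z."""
    return float(np.sum(np.linalg.svd(Z, compute_uv=False)))

rng = np.random.default_rng(0)
Z = rng.standard_normal((5, 8))
print(nuclear_norm(Z))           # convex surrogate for rank(Z)
print(np.linalg.matrix_rank(Z))  # the quantity it relaxes
```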
Once we obtain the optimal solution $(\hat{W},\hat{Z},\hat{E})$, we can transfer the source data into the target domain as follows:

$$\hat{W}S-\hat{E}=[\hat{W}s_1-\hat{e}_1,\dots,\hat{W}s_n-\hat{e}_n],\tag{4}$$

where $\hat{e}_i$ denotes the $i$-th column of the matrix $\hat{E}$.
Finally, the transformed source samples are mixed with the target samples $T$ as an augmented training set for training a classifier, which is then used to recognize unseen test samples in the target domain.
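The paper gives no code for this step; the sketch below shows the intended flow under stated assumptions: `W_hat` and `E_hat` are outputs of the solver above, `S` and `T` are the $d\times n$ source and $d\times p$ target matrices with columns as samples, and `y_s`, `y_t` are their label vectors (scikit-learn's `SVC` stands in for the one-vs-all SVM used in Section 4):

```python
import numpy as np
from sklearn.svm import SVC

def augmented_training_set(W_hat, E_hat, S, T, y_s, y_t):
    """Mix the transformed, denoised source samples of Eq. (4)
    with the target samples to form the augmented training set."""
    S_adapted = W_hat @ S - E_hat       # d x n, Eq. (4)
    X = np.hstack([S_adapted, T]).T     # samples as rows for sklearn
    y = np.concatenate([y_s, y_t])
    return X, y

# Hypothetical usage, once the ALM solver has returned W_hat and E_hat:
# X_train, y_train = augmented_training_set(W_hat, E_hat, S, T, y_s, y_t)
# clf = SVC(kernel="rbf").fit(X_train, y_train)
```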
3.2 Multi-Source Domain Adaptation
Suppose we have $M$ source domains $S_1,S_2,\dots,S_M$, where each $S_i\in \mathbb{R}^{d\times n}$ is the feature matrix of the $i$-th source domain. Our multi-task low-rank domain adaptation method can be formulated as:
$$\begin{aligned} \mathop{\min}_{W_i,Z_i,E_i}\;\; & \sum^M_{i=1}\left(\|Z_i\|_*+\alpha\|E_i\|_{2,1}\right)+\beta\|Q\|_*\\ \text{s.t.}\;\; & W_iS_i=TZ_i+E_i,\\ & W_iW_i^T=I,\quad i=1,\dots,M, \end{aligned}\tag{5}$$

where $\alpha,\beta>0$ are two trade-off parameters, and $W_i$, $Z_i$, and $E_i$ are the transformation matrix, the coefficient matrix, and the sparse error matrix of the $i$-th source domain, respectively.
The matrix $Q$ is formed as $Q = [W_1S_1\,|\,W_2S_2\,|\,\dots\,|\,W_MS_M]\in \mathbb{R}^{d\times (M\times n)}$, where $W_iS_i\in \mathbb{R}^{d\times n}$ denotes the $i$-th transformed source domain.
Compared with the single-domain adaptation formulation in Eq. (3), the proposed multi-task domain adaptation objective has the following characteristics:
- For each source domain $S_i$, the low-rank and sparsity constraints are still used in finding the transformation matrix $W_i$, which maintains the relatedness structure and provides tolerance to noise.
- The merged matrix $Q$ is forced to be low-rank; this added constraint seeks a low-rank structure across the different source domains, further reducing the distribution discrepancy in a collective way.
As in the single-source case, after obtaining the optimal solutions $(W_i, Z_i, E_i)$, $i = 1,\dots,M$, we can transform each source domain into $W_iS_i-E_i$ and then merge all source domains with the target domain $T$ as training data for the classifier.
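A hedged sketch of this multi-source merge, reusing the conventions of the single-source helper above (the per-domain `W_i`, `E_i` are assumed outputs of the multi-task solver):

```python
import numpy as np

def merge_sources(W_list, E_list, S_list, T):
    """Stack all transformed, denoised source domains W_i S_i - E_i
    together with the target samples T (columns are samples)."""
    adapted = [W @ S - E for W, S, E in zip(W_list, S_list, E_list)]
    return np.hstack(adapted + [T])
```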
3.3 Optimization
Problem (5) is a typical mixed nuclear-norm and $\ell_{2,1}$-norm optimization problem. Unlike existing optimization formulations, however, it carries the matrix orthogonality constraints $W_iW_i^T=I$, $i=1,\dots,M$, which we handle by matrix orthogonalization.
To solve the optimization problem in (5), we first convert it into the following equivalent form:
$$\begin{aligned} \mathop{\min}_{J,F_i,Z_i,E_i,W_i}\;\; & \sum^M_{i=1}\left(\|F_i\|_*+\alpha\|E_i\|_{2,1}\right)+\beta\|J\|_*\\ \text{s.t.}\;\; & W_iS_i=TZ_i+E_i,\\ & Q=J,\\ & Z_i=F_i,\quad i=1,\dots,M, \end{aligned}\tag{6}$$

where $J=[J_1,\dots,J_M]$ and each $J_i$ corresponds to $W_iS_i$; the orthogonality constraints are omitted here and are enforced by matrix orthogonalization during the updates.
The equivalent problem above can be solved by the augmented Lagrange multiplier method [16], which minimizes the following augmented Lagrangian function:

$$\begin{aligned} \min_{J_i, F_i, Z_i, E_i, W_i}\; & \beta\|J\|_{*}+\sum_{i=1}^{M}\left(\|F_i\|_{*}+\alpha\|E_i\|_{2,1}\right) \\ & +\sum_{i=1}^{M}\Big(\langle U_i, W_iS_i-J_i\rangle+\langle Y_i, Z_i-F_i\rangle+\langle V_i, W_iS_i-TZ_i-E_i\rangle \\ & \qquad +\frac{\mu}{2}\|W_iS_i-J_i\|_F^2+\frac{\mu}{2}\|Z_i-F_i\|_F^2+\frac{\mu}{2}\|W_iS_i-TZ_i-E_i\|_F^2\Big) \end{aligned}\tag{7}$$

Here $\langle\cdot,\cdot\rangle$ denotes the matrix inner product, $\mu>0$ is a penalty parameter, and $Y_1,\dots,Y_M$, $U_1,\dots,U_M$, and $V_1,\dots,V_M$ are the Lagrange multipliers.
The optimization procedure is shown in Algorithm 1.
Note that all subproblems involved in the optimization have closed-form solutions.
Steps 2 and 5 can be solved with the singular value thresholding (SVT) operator, and step 6 admits an analytical solution.
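Both closed forms are standard proximal operators: minimizing $\tau\|X\|_*+\frac{1}{2}\|X-Y\|_F^2$ soft-thresholds the singular values of $Y$, and minimizing $\lambda\|E\|_{2,1}+\frac{1}{2}\|E-Q\|_F^2$ shrinks each column of $Q$, zeroing columns whose norm falls below $\lambda$. A sketch of both (generic operators under these assumptions, not a transcription of the paper's Algorithm 1):

```python
import numpy as np

def svt(Y, tau):
    """Singular value thresholding:
    argmin_X  tau * ||X||_* + 0.5 * ||X - Y||_F^2."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def l21_shrink(Q, lam):
    """Column-wise shrinkage:
    argmin_E  lam * ||E||_{2,1} + 0.5 * ||E - Q||_F^2.
    Columns with norm <= lam are zeroed (outlier removal)."""
    E = np.zeros_like(Q)
    norms = np.linalg.norm(Q, axis=0)
    keep = norms > lam
    E[:, keep] = Q[:, keep] * (1.0 - lam / norms[keep])
    return E
```

Inside the ALM loop these would be called with thresholds scaled by the current penalty, e.g. $1/\mu$ or $\beta/\mu$ for `svt` and $\alpha/\mu$ for `l21_shrink`, though the exact scheduling belongs to Algorithm 1.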
Figure 3 shows the convergence of the iterative optimization when adapting the dslr source domain to the webcam target domain on the three-domain object recognition dataset. As can be seen, the objective function converges to its minimum after about 40 iterations.
4. Experiments
In each task, the performance of the following domain adaptation methods is compared:
- Naive combination (NC): we directly use the samples from the source domain to augment the target domain, without any transformation.
- Adaptive SVM (A-SVM): this method first trains an SVM classifier on the source domain and then adapts it with the training samples of the target domain.
- Noisy-domain adaptive reconstruction (NDAR): in this variant we do not remove the noise and outlier information in the source domain, which amounts to dropping the $E_i$ terms in Eq. (5).
- Our proposed RDALR method.
- State-of-the-art domain adaptation methods from the recent literature.
We use a one-vs-all SVM as the classifier for cross-domain classification. After domain adaptation, the (transformed) source domain is combined with the training samples of the target domain for SVM training, and the resulting SVM classifier is used to test unseen samples in the target domain.
To determine appropriate parameter settings for our method, we vary the values of $\alpha$ and $\beta$ over the grid $\{10^{-4},10^{-3},\dots,1\}$ and select the optimal values by five-fold cross-validation. Similarly, the optimal SVM parameter $C$ for A-SVM and SVM is chosen from $\{2^{-5},2^{-2},\dots,2^3\}$ by cross-validation.
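A sketch of how such a search might look with scikit-learn; only the SVM's $C$ is tuned below, while tuning $\alpha$ and $\beta$ would wrap the full adaptation run in the same kind of loop. The step of the printed $C$ grid is ambiguous in the source, so the grid here is one plausible reading:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

alpha_beta_grid = [10.0 ** k for k in range(-4, 1)]  # {1e-4, ..., 1}
C_grid = [2.0 ** k for k in range(-5, 4)]            # reading of {2^-5, ..., 2^3}

# Five-fold cross-validation over C (X_train, y_train assumed given):
# search = GridSearchCV(SVC(kernel="rbf"), {"C": C_grid}, cv=5)
# search.fit(X_train, y_train)
# best_C = search.best_params_["C"]
```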
4.1 An Illustrative Toy Example
As shown in Fig. 4(a), we randomly generate three clouds of samples, each containing about 400 samples. We take the red samples as the target domain and treat the blue and green samples as two different source domains. We apply our method to map the two source domains onto the target domain simultaneously while removing the unwanted noise information.
The result is shown in Fig. 4(b). As can be seen, the two source domains are blended into a compact region of the target domain, which demonstrates the effectiveness of the proposed method in reducing the distribution discrepancy during domain adaptation.
4.2 Three-Domain Object Benchmark Experiment
We test the proposed method on the visual domain adaptation benchmark dataset [21], collected from three different domains: amazon, digital SLR (dslr), and webcam. The dataset consists of 31 object categories, ranging from bikes and laptops to bookshelves and keyboards, with 4652 images in total.
Each category has at least about 30 images, with 90 images per category on average.
For low-level features, we use the SURF features as in [21], and every image is represented by an 800-dimensional bag-of-words (BoW) feature.
For the source domain samples, we randomly select 8 images per category in webcam/dslr and 20 images per category in amazon. Meanwhile, we select 3 images per category for the amazon/webcam/dslr target domain.
These images are used for domain adaptation and classifier training; the remaining unseen images in the target domain serve as the test set for performance evaluation.
We use an SVM with an RBF kernel as the classifier, and the classification accuracy on the test set averaged over the 31 object categories is the evaluation metric. Each experiment is repeated 5 times over 5 random splits, and we report the mean classification accuracy and standard deviation over all categories.
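The metric here is the mean per-class accuracy over the 31 categories, i.e., recall averaged over classes; a small sketch, assuming `y_true` and `y_pred` come from the RBF-kernel SVM above:

```python
import numpy as np

def mean_per_class_accuracy(y_true, y_pred):
    """Accuracy computed within each class, then averaged over classes
    (equivalent to sklearn's balanced_accuracy_score)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(y_true)
    accs = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(accs))
```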
4.2.1 Single-Source Domain Adaptation
Table 1 shows the performance of the different methods, where we also directly quote the results reported in [10, 13, 21].
From the results, we make the following observations:
- All domain adaptation methods produce better results than NC, which confirms the benefit of domain adaptation.
- Our RDALR method clearly outperforms Domain Adaptive Metric Learning (DAML), A-SVM, and Unsupervised Domain Adaptation (UDA), which verifies that, compared with the state-of-the-art methods in the literature, low-rank reconstruction better reduces the discrepancy between domain distributions.
- RDALR is clearly better than NDAR, because the latter does not remove the unwanted noise information in the source domain.
4.2.2 Multi-Source Domain Adaptation
We use the same settings as in the single-domain adaptation experiment, except that the samples in the target domain are combined with samples from multiple source domains to train the classifier.
Table 2 shows the results for three different combinations of multiple source domains. A closely related work is the UDA method, whose authors learn intermediate representation subspaces on the Grassmann manifold between the source and target domains.
The results show that our method is effective for multi-source domain adaptation.
Figure 5 shows the performance under various parameter combinations in the multi-source domain adaptation experiment.
4.3 Caltech 256 Experiment
The Caltech 256 target domain has 30607 images, divided into 256 object categories.
The Bing source domain contains approximately 120924 weakly labeled images, crawled using each text label of Caltech 256 as a search keyword.
For each image, we extract SIFT features at keypoints detected by the Difference-of-Gaussians (DoG) detector [19] and then represent each image as a 5000-dimensional BoW feature.
On the Caltech 256 target domain, we randomly select {5, 10, ..., 50} images per category as training data and use the remaining images as test data.
On the Bing source domain, we randomly select 10 images per category for domain adaptation. A linear SVM is used as the classifier in this experiment.
Figure 6 shows the experimental results comparing all methods under different numbers of training images in the target domain.
4.4 TRECVID MED 2011 Experiment
The TRECVID 2011 Multimedia Event Detection (MED) [26] development dataset contains 10704 video clips drawn from 17566 minutes of video programs, divided into five event classes and a background class.
The five events are "attempting a board trick", "feeding an animal", "landing a fish", "wedding ceremony", and "working on a woodworking project".
The dataset is split into a training set of 8783 videos and a test set of 2021 videos.
Specifically, the training set contains about 8273 background videos that do not belong to any of the five events, and the average number of training videos per event is 100.
In this experiment, we use the TRECVID MED dataset as the target domain and videos crawled from the web as the source domain.
Given a video clip, we sample one frame every two seconds. For each frame, we extract 128-dimensional SIFT features at keypoints detected by two detectors (DoG and Hessian-Affine).
Then the k-means method is applied to group the SIFT features into 5000 clusters.
Finally, we aggregate the 5000-dimensional features of all sampled frames in a video clip to form the clip-level feature representation. A linear SVM is used as the classifier in this experiment.
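A hedged sketch of this clip-level pipeline using OpenCV SIFT and scikit-learn k-means; the histogram normalization and the use of plain DoG/SIFT keypoints are assumptions (the Hessian-Affine detector mentioned above is not in stock OpenCV), so this approximates the described features rather than reproducing them:

```python
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

N_CLUSTERS = 5000
sift = cv2.SIFT_create()  # DoG keypoints + 128-d SIFT descriptors

def frame_descriptors(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    _, desc = sift.detectAndCompute(gray, None)
    return desc  # (num_keypoints, 128) array, or None

# 1) Fit the 5000-word codebook on descriptors pooled from training frames:
# kmeans = MiniBatchKMeans(n_clusters=N_CLUSTERS).fit(all_descriptors)

def clip_feature(frames, kmeans):
    """Aggregate per-frame BoW assignments into one 5000-d clip feature."""
    hist = np.zeros(N_CLUSTERS)
    for frame in frames:                # frames sampled every two seconds
        desc = frame_descriptors(frame)
        if desc is None:
            continue
        words = kmeans.predict(desc.astype(np.float64))
        hist += np.bincount(words, minlength=N_CLUSTERS)
    return hist / max(hist.sum(), 1.0)  # normalization is an assumption
```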
Following the TRECVID evaluation, we use average precision (AP) to evaluate the performance on each event, and compute the mean average precision (MAP) over the five events as the overall evaluation metric.
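AP and MAP are straightforward to compute; a sketch with scikit-learn, where the scores are hypothetical outputs of one binary event detector per event:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true_per_event, scores_per_event):
    """MAP over events: the mean of the per-event average precisions."""
    aps = [average_precision_score(y, s)
           for y, s in zip(y_true_per_event, scores_per_event)]
    return aps, float(np.mean(aps))

# Hypothetical usage with the five MED events:
# aps, map_score = mean_average_precision(labels_by_event, svm_scores_by_event)
```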
From the results, we make the following observations:
(1) Although the MAP of NC is above the baseline, its performance on the "feeding an animal" and "landing a fish" events is even worse than the baseline method.
(2) Compared with the other methods, our method achieves the best average performance.
This demonstrates the great potential of our method for video event detection.
The performance drop on the "working on a woodworking project" event may be caused by unexpectedly large cross-domain content differences.
Another potential reason is that the visual features used as input to the recognition model may be insufficient (e.g., lacking temporal and audio cues) to capture the event attributes that persist across domains.
Summary
We introduced a robust visual domain adaptation method to reduce the distribution discrepancy between the source and target domains.
Its basic idea is to transform the source samples into an intermediate representation such that each sample can be linearly reconstructed from the target samples.
The method uses a low-rank structure to capture the intrinsic relatedness of the source samples, while a sparse structure identifies the noise and outlier information, which gives the method strong robustness in domain adaptation tasks.
We have demonstrated the effectiveness of the proposed method on a wide range of domain adaptation benchmarks.
In the future, we plan to use low-rank reconstruction as a preprocessing step for semi-supervised learning, so as to make the distributions of unlabeled and labeled samples more consistent.
References
I-Hong Jhuo, Dong Liu, D. T. Lee, and Shih-Fu Chang. Robust Visual Domain Adaptation with Low-Rank Reconstruction. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.