Machine learning 07: Interpretation of PCA and its sklearn source code
2022-06-26 05:46:00 【SP FA】
Overview
The main idea of Principal Component Analysis (PCA) is to map n-dimensional features onto k dimensions. These k dimensions are new, mutually orthogonal features, called the principal components; they are k new features reconstructed from the original n-dimensional ones, not simply a subset of them. PCA works by finding a set of mutually orthogonal axes in the original space: the first new axis is the direction of largest variance in the data; the second is chosen, within the subspace orthogonal to the first, to again maximize the variance; the third maximizes the variance among directions orthogonal to the first two, and so on, until n axes are obtained. With axes chosen this way, most of the variance is concentrated in the first k axes, while the variance along the remaining axes is close to 0. We can therefore keep only the first k dimensions and thereby reduce the dimensionality of the data.
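As a concrete illustration, here is a minimal usage sketch of this idea with sklearn's PCA class (the data here is made up for the example):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))         # 100 samples with 5-dimensional features (made-up data)

pca = PCA(n_components=2)             # keep the first k = 2 principal components
X_reduced = pca.fit_transform(X)      # project the samples onto the top-2 axes

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # share of the total variance carried by each kept axis
```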
There are two ways to compute PCA:
- Decompose the covariance matrix. The concrete procedure has three steps:
  - compute the covariance matrix of the data matrix;
  - compute the eigenvalues and eigenvectors of the covariance matrix;
  - take the eigenvectors corresponding to the k largest eigenvalues (i.e., the directions of largest variance) and stack them into a matrix.
- Decompose the inner product matrix.
Decomposing the covariance matrix
Covariance matrix
Covariance describes the correlation between two random variables. A quick comparison of variance and covariance:
Mean: $\bar x=\frac1n\sum\limits^n_{i=1}\vec x_i$

Sample variance: $S^2=\frac1{n-1}\sum\limits^n_{i=1}(\vec x_i-\bar x)^2$

Covariance of samples $X$ and $Y$: $Cov(X,Y)=\frac1{n-1}\sum\limits^n_{i=1}(\vec x_i-\bar x)(\vec y_i-\bar y)$
We can see that variance is computed over a single dimension, while covariance needs at least two dimensions; variance is a special case of covariance. When the covariance is positive, $X$ and $Y$ are positively correlated; when it is negative, they are negatively correlated; when it is 0, the two variables are uncorrelated. When there are more than two dimensions, the pairwise covariances form the covariance matrix.
The covariance matrix is closely related to the divergence (scatter) matrix. From the definition of the scatter matrix, $S=\sum\limits^n_{i=1}(\vec x_i-\bar x)(\vec x_i-\bar x)^T$, we find that for a data set $X$: $S=(n-1)\,Cov(X)$. The two therefore share the same eigenvectors, and their eigenvalues differ only by the factor $(n-1)$.
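A small numpy sketch (my own check, with made-up data) of the relation $S=(n-1)\,Cov(X)$:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))                # 50 samples, 3 features (made-up data)
n = X.shape[0]

Xc = X - X.mean(axis=0)                     # center the data
scatter = Xc.T @ Xc                         # scatter matrix: sum of outer products over samples
cov = np.cov(X, rowvar=False)               # sample covariance, divides by n - 1

print(np.allclose(scatter, (n - 1) * cov))  # True: identical up to the factor (n - 1)
```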
Eigenvalue decomposition and SVD
For the detailed procedure, see my other article: Eigendecomposition and singular value decomposition.
After computing the n eigenvectors, we sort them by their eigenvalues and keep the top k.
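Here is a minimal numpy sketch of this covariance-matrix route; the helper pca_eig and its data are mine, not part of sklearn:

```python
import numpy as np

def pca_eig(X, k):
    """Illustrative PCA via eigendecomposition of the covariance matrix."""
    Xc = X - X.mean(axis=0)                  # step 1: center the data
    cov = np.cov(Xc, rowvar=False)           # step 2: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # step 3: eigenvalues / eigenvectors (ascending order)
    order = np.argsort(eigvals)[::-1][:k]    # sort by decreasing variance and keep the top k
    return Xc @ eigvecs[:, order], eigvals[order]

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))
X_k, variances = pca_eig(X, k=2)
print(X_k.shape, variances)                  # (100, 2) and the two largest variances
```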
Decomposing the inner product matrix
The PCA implementation in sklearn uses this method.
Suppose $X$ is the sample matrix with one sample per row, already centered (PCA requires centering the data). The covariance matrix is then $\frac1{n-1}X^TX$.

Take the SVD of $X$: $X=U\Sigma V^T$, so that $X^T=V\Sigma U^T$.

Hence $\frac1{n-1}X^TX=\frac1{n-1}V\Sigma^2V^T$.

So we never need to form the covariance matrix explicitly: an SVD of $X$ gives the eigenvectors of the covariance matrix as the columns of $V$ (the rows of $V^T$), with eigenvalues $\frac{\sigma_i^2}{n-1}$.
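A quick numerical check (again my own sketch, with made-up data) that the SVD of the centered $X$ yields the same eigenvalues as the covariance matrix:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))
Xc = X - X.mean(axis=0)
n = X.shape[0]

# Route 1: eigenvalues of the covariance matrix, sorted in decreasing order
cov_eigvals = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]

# Route 2: SVD of the centered data matrix
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_eigvals = S**2 / (n - 1)

print(np.allclose(cov_eigvals, svd_eigvals))   # True: the two routes agree
```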
About the source code
I spent an evening reading sklearn's official code and took rough notes. The source is long, so only the key parts are recorded here.
Computing PCA
Through the svd_solver parameter we can choose different ways of decomposing the data, but the basic idea is the same. Here we only look at the svd_solver == 'full' case, i.e. the decomposition is run on the full data.
Implementation of the fit() method (here, its _fit_full path):
Here, explained_variance_ holds the variance of each component, and explained_variance_ratio_ holds each component's contribution ratio, i.e. its share of the total variance; this ratio can be used to decide the concrete value of n.
Note that the n_components parameter can take several kinds of values (a usage sketch follows this list):

- "mle": determine the value of n automatically.
- $0<\text{n\_components}<1$: the fraction of variance you want to keep; PCA automatically selects the smallest number of components whose cumulative explained variance ratio is $\ge\text{n\_components}$.
- $\text{n\_components}\ge1$ and an integer: keep the first n_components components.
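A usage sketch of the three kinds of values (the data is made up; for "mle" I pass svd_solver="full" explicitly, which is the case discussed here):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).normal(size=(200, 10))   # made-up data

PCA(n_components=3).fit(X)                          # integer >= 1: keep the first 3 components
PCA(n_components=0.95).fit(X)                       # fraction in (0, 1): keep enough components for >= 95% of the variance
PCA(n_components="mle", svd_solver="full").fit(X)   # "mle": infer the number of components automatically
```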
def _fit_full(self, X, n_components):
    n_samples, n_features = X.shape
    ...  # validate the input data, omitted here
    # Center the data
    self.mean_ = np.mean(X, axis=0)
    X -= self.mean_
    # Find the eigenvectors by decomposing the inner product matrix, i.e. by an SVD of X
    U, S, Vt = linalg.svd(X, full_matrices=False)
    # The signs returned by the SVD are not unique, so flip the eigenvectors' signs
    # to enforce deterministic output
    U, Vt = svd_flip(U, Vt)
    components_ = Vt
    explained_variance_ = (S**2) / (n_samples - 1)
    total_var = explained_variance_.sum()
    explained_variance_ratio_ = explained_variance_ / total_var
    singular_values_ = S.copy()  # store the singular values
    # Decide how many components to keep, i.e. the value of n, according to n_components
    if n_components == "mle":
        n_components = _infer_dimension(explained_variance_, n_samples)
    elif 0 < n_components < 1.0:
        ratio_cumsum = stable_cumsum(explained_variance_ratio_)
        n_components = np.searchsorted(ratio_cumsum, n_components, side="right") + 1
    # Compute noise covariance using the Probabilistic PCA model
    # (the sigma2 maximum likelihood, cf. eq. 12.46)
    if n_components < min(n_features, n_samples):
        self.noise_variance_ = explained_variance_[n_components:].mean()
    else:
        self.noise_variance_ = 0.0
    ...  # assign the attributes, omitted here
    return U, S, Vt
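To see how the fractional case picks the number of components, here is the cumulative-ratio plus searchsorted logic in isolation (the ratios below are invented):

```python
import numpy as np

# Invented explained-variance ratios for five components
explained_variance_ratio_ = np.array([0.55, 0.25, 0.12, 0.05, 0.03])
ratio_cumsum = np.cumsum(explained_variance_ratio_)         # [0.55, 0.80, 0.92, 0.97, 1.00]

# Index of the first cumulative ratio >= 0.95, plus one to turn the index into a count
n_components = np.searchsorted(ratio_cumsum, 0.95, side="right") + 1
print(n_components)                                         # 4
```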
Implementation of the transform() method:
Whitening can be applied here, i.e. the data is decorrelated to reduce redundancy in the input. Whitened data has two properties: the correlation between features is removed, and every feature has variance 1.
def transform(self, X):
    ...  # validate the estimator and the data, omitted here
    if self.mean_ is not None:
        X = X - self.mean_
    X_transformed = np.dot(X, self.components_.T)
    # Whitening
    if self.whiten:
        X_transformed /= np.sqrt(self.explained_variance_)
    return X_transformed
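A quick check (my own sketch, made-up data) that whitening indeed leaves every transformed feature with unit variance and no correlation:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 6)) * np.array([1, 2, 3, 4, 5, 6])   # features with very different scales

X_white = PCA(n_components=4, whiten=True).fit_transform(X)
print(np.round(X_white.var(axis=0, ddof=1), 3))          # ~[1. 1. 1. 1.]: unit variance
print(np.round(np.corrcoef(X_white, rowvar=False), 3))   # ~identity matrix: decorrelated
```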
Implementation of the fit_transform() method:
The relevant identities are written out in the comments by the sklearn authors.
def fit_transform(self, X, y=None):
    U, S, Vt = self._fit(X)
    U = U[:, : self.n_components_]
    if self.whiten:
        # X_new = X * V / S * sqrt(n_samples) = U * sqrt(n_samples)
        U *= sqrt(X.shape[0] - 1)
    else:
        # X_new = X * V = U * S * Vt * V = U * S
        U *= S[: self.n_components_]
    return U
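The identity used in the non-whitened branch, $X_{new}=XV=US$, can be verified directly with numpy (a sketch with made-up data):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
print(np.allclose(Xc @ Vt.T, U * S))   # True: projecting onto V equals scaling U by the singular values
```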
Automatically choosing the value of n
The number of dimensions to keep, i.e. the value of n, is decided by computing a log-likelihood for each candidate rank. The relevant paper can be found here: link
def _assess_dimension(spectrum, rank, n_samples):
    """Compute the log-likelihood of a rank ``rank`` dataset.

    The dataset is assumed to be embedded in gaussian noise of shape(n, dimf)
    having spectrum ``spectrum``.

    Parameters
    ----------
    spectrum : ndarray of shape (n_features,)
        Data spectrum.
    rank : int
        Tested rank value. It should be strictly lower than n_features,
        otherwise the method isn't specified (division by zero in equation
        (31) from the paper).
    n_samples : int
        Number of samples.

    Returns
    -------
    ll : float
        The log-likelihood.
    """
    n_features = spectrum.shape[0]
    if not 1 <= rank < n_features:
        raise ValueError("the tested rank should be in [1, n_features - 1]")
    eps = 1e-15
    if spectrum[rank - 1] < eps:
        # When the tested rank is associated with a small eigenvalue, there's
        # no point in computing the log-likelihood: it's going to be very
        # small and won't be the max anyway. Also, it can lead to numerical
        # issues below when computing pa, in particular in log((spectrum[i] -
        # spectrum[j]) because this will take the log of something very small.
        return -np.inf
    pu = -rank * log(2.0)
    for i in range(1, rank + 1):
        pu += (
            gammaln((n_features - i + 1) / 2.0)
            - log(np.pi) * (n_features - i + 1) / 2.0
        )
    pl = np.sum(np.log(spectrum[:rank]))
    pl = -pl * n_samples / 2.0
    v = max(eps, np.sum(spectrum[rank:]) / (n_features - rank))
    pv = -np.log(v) * n_samples * (n_features - rank) / 2.0
    m = n_features * rank - rank * (rank + 1.0) / 2.0
    pp = log(2.0 * np.pi) * (m + rank) / 2.0
    pa = 0.0
    spectrum_ = spectrum.copy()
    spectrum_[rank:n_features] = v
    for i in range(rank):
        for j in range(i + 1, len(spectrum)):
            pa += log(
                (spectrum[i] - spectrum[j]) * (1.0 / spectrum_[j] - 1.0 / spectrum_[i])
            ) + log(n_samples)
    ll = pu + pl + pv + pp - pa / 2.0 - rank * log(n_samples) / 2.0
    return ll
def _infer_dimension(spectrum, n_samples):
    """Infers the dimension of a dataset with a given spectrum.

    The returned value will be in [1, n_features - 1].
    """
    ll = np.empty_like(spectrum)
    ll[0] = -np.inf  # we don't want to return n_components = 0
    for rank in range(1, spectrum.shape[0]):
        ll[rank] = _assess_dimension(spectrum, rank, n_samples)
    return ll.argmax()
Sample scores
The fit is assessed by computing the log-likelihood of each sample. The relevant paper can be found here: link
def get_covariance(self):
    """Compute data covariance with the generative model.

    ``cov = components_.T * S**2 * components_ + sigma2 * eye(n_features)``
    where S**2 contains the explained variances, and sigma2 contains the
    noise variances.

    Returns
    -------
    cov : array of shape=(n_features, n_features)
        Estimated covariance of data.
    """
    components_ = self.components_
    exp_var = self.explained_variance_
    if self.whiten:
        components_ = components_ * np.sqrt(exp_var[:, np.newaxis])
    exp_var_diff = np.maximum(exp_var - self.noise_variance_, 0.0)
    cov = np.dot(components_.T * exp_var_diff, components_)
    cov.flat[:: len(cov) + 1] += self.noise_variance_  # modify diag inplace
    return cov
def get_precision(self):
    """Compute data precision matrix with the generative model.

    Equals the inverse of the covariance but computed with the matrix
    inversion lemma for efficiency.

    Returns
    -------
    precision : array, shape=(n_features, n_features)
        Estimated precision of data.
    """
    n_features = self.components_.shape[1]
    # handle corner cases first
    if self.n_components_ == 0:
        return np.eye(n_features) / self.noise_variance_
    if np.isclose(self.noise_variance_, 0.0, atol=0.0):
        return linalg.inv(self.get_covariance())
    # Get precision using matrix inversion lemma
    components_ = self.components_
    exp_var = self.explained_variance_
    if self.whiten:
        components_ = components_ * np.sqrt(exp_var[:, np.newaxis])
    exp_var_diff = np.maximum(exp_var - self.noise_variance_, 0.0)
    precision = np.dot(components_, components_.T) / self.noise_variance_
    precision.flat[:: len(precision) + 1] += 1.0 / exp_var_diff
    precision = np.dot(components_.T, np.dot(linalg.inv(precision), components_))
    precision /= -(self.noise_variance_**2)
    precision.flat[:: len(precision) + 1] += 1.0 / self.noise_variance_
    return precision
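A quick numerical check (my own sketch, made-up data) that get_precision() agrees with directly inverting get_covariance():

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).normal(size=(200, 6))
pca = PCA(n_components=3).fit(X)

# The matrix-inversion-lemma result should match a direct inverse of the model covariance
print(np.allclose(pca.get_precision(), np.linalg.inv(pca.get_covariance())))   # True
```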
def score_samples(self, X):
    """Return the log-likelihood of each sample.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        The data.
    """
    ...  # validate the estimator and the data, omitted here
    Xr = X - self.mean_
    n_features = X.shape[1]
    precision = self.get_precision()
    log_like = -0.5 * (Xr * (np.dot(Xr, precision))).sum(axis=1)
    log_like -= 0.5 * (n_features * log(2.0 * np.pi) - fast_logdet(precision))
    return log_like
def score(self, X, y=None):
    """Return the average log-likelihood of all samples."""
    return np.mean(self.score_samples(X))
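A usage sketch of the two scoring methods (made-up data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 5))

pca = PCA(n_components=2).fit(X)
print(pca.score_samples(X[:3]))   # log-likelihood of the first three samples under the fitted model
print(pca.score(X))               # average log-likelihood over all samples
```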
Also published on: SP-FA's blog