Sparse knowledge points

2022-06-10 08:50:00 【Itchy heart】

Sparsity (sparse)

Definition: sparsity means that, among the parameters of a model, only a few elements are non-zero, or only a few elements are much larger than zero.

`WHY:` Why should we build sparsity into the model?

Example: a top student preparing for the postgraduate entrance exam knows 10,000 words, but the words that actually appear in the exam are only a small part of that 10,000-word vocabulary.

Example:
Test number: 123.456
The first set of basis numbers:
[100, 10, 1] $\Rightarrow$ 123.456 $\approx$ 100 $\times$ 1 + 10 $\times$ 2 + 1 $\times$ 3 (error = 0.456)

The second set of basis numbers:
[100, 50, 10, 1, 0.5, 0.1, 0.03, 0.01, 0.001]
123.456 = 100 $\times$ 1 + 50 $\times$ 0 + 10 $\times$ 2 + 1 $\times$ 3 + 0.5 $\times$ 0 + 0.1 $\times$ 4 + 0.03 $\times$ 0 + 0.01 $\times$ 5 + 0.001 $\times$ 6 (error = 0)

Here the sparse features (kept in reserve, just in case) are the three basis numbers 50, 0.5, and 0.03: their coefficients are zero for this input.
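The arithmetic of the example above can be checked with a minimal sketch. The helper name `reconstruct` and the string inputs are assumptions made for illustration; `Fraction` is used only so that decimals such as 0.03 are represented exactly.

```python
from fractions import Fraction

def reconstruct(bases, coeffs):
    """Return the sum of coefficient * basis number, computed exactly."""
    return sum(Fraction(b) * c for b, c in zip(bases, coeffs))

target = Fraction("123.456")

# First set of basis numbers: small and complete only for whole numbers.
small_bases = ["100", "10", "1"]
small_coeffs = [1, 2, 3]
err_small = target - reconstruct(small_bases, small_coeffs)

# Second set: more basis numbers than this particular input needs.
large_bases = ["100", "50", "10", "1", "0.5", "0.1", "0.03", "0.01", "0.001"]
large_coeffs = [1, 0, 2, 3, 0, 4, 0, 5, 6]
err_large = target - reconstruct(large_bases, large_coeffs)

# Basis numbers whose coefficient is zero: the unused "sparse features".
unused = [b for b, c in zip(large_bases, large_coeffs) if c == 0]

print(float(err_small), float(err_large), unused)
```

The larger set reconstructs the number exactly while three of its nine coefficients stay at zero, which is the sense of sparsity used here.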

Compared with PCA (Principal Component Analysis)
PCA (a complete set of basis vectors: a complete dictionary)
The original data is reconstructed from the basis vectors in this complete dictionary.

Sparse representation (an over-complete set of basis vectors: an over-complete dictionary, the opposite of sparse)
The number of basis vectors is much larger than the dimension of the input vector.

How to ensure sparsity?

Machine learning model $\Rightarrow$ optimize the parameters on the training set (for example, reduce the loss) $\Rightarrow$ add a regularization term to the loss that penalizes the parameter values and pushes them toward 0

Common choices:
Loss = Training Loss + $\lambda$ ${||W||_0}$ ( ${L_0}$ norm, the number of non-zero parameters)

Loss = Training Loss + $\lambda$ ${||W||_1}$ ( ${L_1}$ norm, the sum of absolute values)
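The two penalties above can be sketched as follows. The weight vector, the training loss of 1.0, and $\lambda$ = 0.1 are made-up values for illustration, and the function names are hypothetical.

```python
def l0_norm(weights):
    """Count the non-zero entries (the L0 'norm')."""
    return sum(1 for w in weights if w != 0)

def l1_norm(weights):
    """Sum of absolute values (the L1 norm)."""
    return sum(abs(w) for w in weights)

def regularized_loss(training_loss, weights, lam, norm):
    """Training loss plus lam times the chosen norm of the weights."""
    return training_loss + lam * norm(weights)

weights = [0.0, 2.0, 0.0, -0.5]   # assumed example parameters
training_loss = 1.0               # assumed value, not from a real model
lam = 0.1                         # assumed regularization strength

loss_l0 = regularized_loss(training_loss, weights, lam, l0_norm)
loss_l1 = regularized_loss(training_loss, weights, lam, l1_norm)
print(l0_norm(weights), l1_norm(weights), loss_l0, loss_l1)
```

The $L_0$ term only counts how many parameters are active, while the $L_1$ term also grows with their magnitude.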

Sparse coding (sparse coding loss)
Loss = $\sum_{j=1}^m||x^{(j)}-\sum_{i=1}^k a_i^{(j)}\phi_i||^2 + \lambda\sum_{i=1}^k||a_i||_1$

where $\sum_{j=1}^m||x^{(j)}-\sum_{i=1}^k a_i^{(j)}\phi_i||^2$ is the reconstruction error and $\lambda\sum_{i=1}^k||a_i||_1$ is the sparsity penalty ( $L_1$ norm)
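A minimal sketch of this loss in plain Python follows. It assumes that $a_i$ collects the coefficients $a_i^{(j)}$ over all samples $j$, so the penalty is the sum of $|a_i^{(j)}|$ over every $i$ and $j$; the function name, the toy dictionary, and the two candidate codes are hypothetical.

```python
def sparse_coding_loss(xs, phis, codes, lam):
    """Reconstruction error plus L1 penalty.

    xs:    m input vectors, each of length n
    phis:  k basis vectors, each of length n
    codes: codes[j][i] is the coefficient a_i^(j) of basis i for sample j
    """
    total = 0.0
    for x, a in zip(xs, codes):
        # Reconstruction of x as sum_i a_i * phi_i, one dimension at a time.
        recon = [sum(a_i * phi[d] for a_i, phi in zip(a, phis))
                 for d in range(len(x))]
        total += sum((x_d - r_d) ** 2 for x_d, r_d in zip(x, recon))
        total += lam * sum(abs(a_i) for a_i in a)
    return total

xs = [[1.0, 2.0]]                                 # one 2-dimensional sample
phis = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]       # k = 3 > n = 2: over-complete
lam = 0.1

# Two codes that both reconstruct the sample exactly.
loss_a = sparse_coding_loss(xs, phis, [[1.0, 2.0, 0.0]], lam)  # L1 of code = 3
loss_b = sparse_coding_loss(xs, phis, [[0.0, 1.0, 1.0]], lam)  # L1 of code = 2
print(loss_a, loss_b)
```

Both codes have zero reconstruction error, so the penalty alone decides between them and favors the code with the smaller $L_1$ norm.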

Likewise, in the era of convolutional networks, an $L_1$ norm penalty is added to the convolutional layers to encourage sparsity.
The depth and width of the model are increased so that a larger over-complete dictionary is available.

Is mindless sparsity good or bad?

Over-complete dictionary $\Rightarrow$ requires a large amount of high-quality data.
Too many inactive parameters $\Rightarrow$ the training process becomes very long.

The $L_1$ norm makes the loss non-differentiable at some points $\Rightarrow$ at zero the derivative is not unique, so the model is harder to converge.
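The non-uniqueness at zero can be seen numerically with one-sided difference quotients of $|w|$. This is only an illustrative sketch; the helper name `one_sided_slopes` and the step size `h` are assumptions.

```python
def one_sided_slopes(f, w, h=1e-6):
    """Return (left slope, right slope) of f at w using finite differences."""
    right = (f(w + h) - f(w)) / h
    left = (f(w) - f(w - h)) / h
    return left, right

# At w = 0 the two slopes of |w| disagree, so no single derivative exists there.
left0, right0 = one_sided_slopes(abs, 0.0)

# Away from zero, for example at w = 1, both slopes agree.
left1, right1 = one_sided_slopes(abs, 1.0)

print(left0, right0, left1, right1)
```

At zero the left slope is -1 and the right slope is +1, whereas at w = 1 both are approximately +1.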