On data preprocessing in sklearn

2022-07-02 12:00:00 【raelum】

Contents

Preface
1. Standardization (StandardScaler)
2. Normalization (MinMaxScaler)
3. Unit-norm normalization (Normalizer)
4. Max-abs scaling (MaxAbsScaler)
5. Binarization (Binarizer)

Preface

The sklearn.preprocessing module in sklearn provides functions for data preprocessing. This article focuses mainly on feature scaling.

1. Standardization (StandardScaler)

Let the data matrix be

$X=\begin{bmatrix} \boldsymbol{x}_1^{\mathrm T} \\ \boldsymbol{x}_2^{\mathrm T} \\ \vdots \\ \boldsymbol{x}_n^{\mathrm T} \end{bmatrix}$

where $\boldsymbol{x}_i=(x_{i1}, x_{i2},\cdots,x_{id})^{\mathrm T}$ is the feature vector of the $i$-th sample.

Before going further, we first need to define the mean and standard deviation of the data matrix.

For a data vector $\boldsymbol{a}=(a_1,\cdots,a_n)^{\mathrm T}$ (a "vector" here is simply a set of data; calling it a vector makes the later statements easier), the mean and standard deviation are:

$\mu(\boldsymbol{a})=\frac{a_1+\cdots+a_n}{n},\quad\sigma(\boldsymbol a)=\left(\frac1n \Vert \boldsymbol{a}-\boldsymbol{\mu}(\boldsymbol{a})\Vert^2\right)^{1/2},\quad \text{where}\;\boldsymbol{\mu}(\boldsymbol{a})=(\underbrace{\mu(\boldsymbol{a}),\cdots, \mu(\boldsymbol{a})}_{n\text{ entries}})^{\mathrm T}$

Write $X$ as a row of column vectors: $X=(\boldsymbol{a}_1,\boldsymbol{a}_2,\cdots,\boldsymbol{a}_d)$, where each $\boldsymbol{a}_i$ is a column vector (one feature across all $n$ samples). Then

$\begin{aligned} \mu(X)&=(\mu(\boldsymbol{a}_1),\mu(\boldsymbol{a}_2),\cdots,\mu(\boldsymbol{a}_d))^{\mathrm T} \\ \sigma(X)&=(\sigma(\boldsymbol{a}_1),\sigma(\boldsymbol{a}_2),\cdots,\sigma(\boldsymbol{a}_d))^{\mathrm T} \end{aligned}$

Let $Z$ denote the result of standardizing $X$. Using numpy's broadcasting convention (a scalar is subtracted from, or divided into, every entry of a vector), $Z$ has the form

$Z=(\boldsymbol{z}_1,\boldsymbol{z}_2,\cdots,\boldsymbol{z}_d),\quad \text{where}\; \boldsymbol{z}_i=\frac{\boldsymbol{a}_i-\mu(\boldsymbol{a}_i)}{\sigma(\boldsymbol{a}_i)},\;\;i=1,2,\cdots,d$

Of course, $Z$ can be written more succinctly as

$Z=\frac{X-\mu(X)^{\mathrm T}}{\sigma(X)^{\mathrm T}}$
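
The broadcast form above maps directly onto numpy. The following is a minimal sketch rather than sklearn's implementation: standardize is a hypothetical helper introduced only for illustration, and it assumes every column has a nonzero standard deviation.

```python
import numpy as np

# Hypothetical helper (not part of sklearn): standardize each column of X.
# Assumes no column is constant, so sigma contains no zeros.
def standardize(X):
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)     # shape (d,): one mean per column
    sigma = X.std(axis=0)   # shape (d,): population std (divides by n)
    # Broadcasting applies the (d,) vectors to every row of the (n, d) matrix
    return (X - mu) / sigma

X = np.array([
    [1, 3],
    [0, 1]
])
Z = standardize(X)
print(Z)  # rows: [1, 1] and [-1, -1]
```

Because mu and sigma have shape (d,), numpy stretches them across all n rows, which is the role played by $\mu(X)^{\mathrm T}$ and $\sigma(X)^{\mathrm T}$ in the formula.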

Inspect the mean, variance, and standard deviation of $X$:

from sklearn.preprocessing import StandardScaler
import numpy as np

#  Data matrix 
X = np.array([
    [1, 3],
    [0, 1]
])
# Create a scaler instance and fit it to the data
scaler = StandardScaler().fit(X)
# Inspect the mean, variance, and standard deviation of X
print(scaler.mean_)  # [0.5 2. ]
print(scaler.var_)  # [0.25 1. ]
print(scaler.scale_)  # [0.5 1. ]

The standard deviation is stored in scale_ because it is the quantity the data is scaled (divided) by. Note that if a column of the data matrix has variance $0$, the corresponding entry of scale_ is $1$, i.e., that column is not scaled.
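
The zero-variance case can be checked by fitting a scaler on a matrix whose second column is constant. This is only a small illustrative check; the names X_const and scaler_const are made up for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: the second column is constant, so its variance is 0
X_const = np.array([
    [1.0, 5.0],
    [5.0, 5.0]
])
scaler_const = StandardScaler().fit(X_const)
print(scaler_const.var_)    # [4. 0.]
print(scaler_const.scale_)  # [2. 1.]
# The constant column is centered but not rescaled, so it becomes all zeros
Z_const = scaler_const.transform(X_const)
print(Z_const)
```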

To standardize $X$, simply call the transform() method:

X = np.array([
    [243, 80],
    [19, 47]
])
scaler = StandardScaler().fit(X)
# Scale the data
X_scaled = scaler.transform(X)
print(X_scaled)
# [[ 1. 1.]
# [-1. -1.]]

Inspect the mean and standard deviation of X_scaled:

print(X_scaled.mean(axis=0))
# [0. 0.]
print(X_scaled.std(axis=0))
# [1. 1.]

As we can see, X_scaled has mean $\boldsymbol 0$ and standard deviation $\boldsymbol 1$, i.e., $X$ has been standardized.

We can also use scaler to standardize new samples; the transformation uses the mean and standard deviation of $X$:

X = np.array([
    [243, 80],
    [19, 47]
])
scaler = StandardScaler().fit(X)
#  Scale the new sample 
print(scaler.transform([[2, 3]]))
# [[-1.15178571 -3.66666667]]

2. Normalization (MinMaxScaler)

Normalizing $X$ means scaling all of its elements into $[0, 1]$, column by column. The procedure is as follows.

Let

$\underline{\boldsymbol{a}_i}=\min(x_{1i},x_{2i},\cdots,x_{ni}),\quad \overline{\boldsymbol{a}_i}=\max(x_{1i},x_{2i},\cdots,x_{ni}) \\ \\ \underline{X}=(\underline{\boldsymbol{a}_1},\underline{\boldsymbol{a}_2},\cdots,\underline{\boldsymbol{a}_d})^{\mathrm{T}},\quad \overline{X}=(\overline{\boldsymbol{a}_1},\overline{\boldsymbol{a}_2},\cdots,\overline{\boldsymbol{a}_d})^{\mathrm{T}}$

Let $Z$ denote the result of normalizing $X$. Using numpy's broadcasting convention, we have

$Z=\frac{X-\underline{X}^{\mathrm T}}{\overline{X}^{\mathrm T}-\underline{X}^{\mathrm T}}$
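
The formula can be sketched in plain numpy as follows. This is an illustration under stated assumptions, not sklearn's code: min_max_scale is a hypothetical helper, X_demo is made-up data, and every column is assumed to have distinct minimum and maximum values so the denominator is nonzero.

```python
import numpy as np

# Hypothetical helper (not part of sklearn): scale each column of X into [0, 1].
# Assumes max > min in every column.
def min_max_scale(X):
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)  # column-wise minima, shape (d,)
    col_max = X.max(axis=0)  # column-wise maxima, shape (d,)
    return (X - col_min) / (col_max - col_min)

X_demo = np.array([
    [1, 10],
    [2, 30],
    [5, 20]
])
Z_demo = min_max_scale(X_demo)
print(Z_demo)  # rows: [0, 0], [0.25, 1], [1, 0.5]
```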

First, generate a blob dataset with make_blobs():

from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=6, centers=2, random_state=27)
print(X)
# [[ 5.93412904 6.82960749]
# [-1.66484812 6.53450678]
# [-1.26216614 6.23733539]
# [ 5.26739446 7.73680694]
# [-0.66451524 7.50872847]
# [ 4.14680663 6.35238034]]

Normalize $X$:

scaler = MinMaxScaler().fit(X)
print(scaler.transform(X))
# [[1. 0.39498722]
# [0. 0.19818408]
# [0.0529916 0. ]
# [0.91225996 1. ]
# [0.13164046 0.8478941 ]
# [0.76479434 0.07672366]]

To scale the elements of $X$ into the interval $[1, 2]$ instead, pass the target range to the scaler:

scaler = MinMaxScaler(feature_range=(1, 2)).fit(X)
print(scaler.transform(X))
# [[2. 1.39498722]
# [1. 1.19818408]
# [1.0529916 1. ]
# [1.91225996 2. ]
# [1.13164046 1.8478941 ]
# [1.76479434 1.07672366]]
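
Scaling to a general range $[\text{low}, \text{high}]$ amounts to first mapping each column into $[0, 1]$ and then applying the linear map $Z\cdot(\text{high}-\text{low})+\text{low}$. A minimal numpy sketch, where min_max_scale_to, low, high, and X_demo are names chosen here for illustration and each column is assumed to be non-constant:

```python
import numpy as np

# Hypothetical helper (not part of sklearn): scale each column into [low, high].
# Assumes max > min in every column.
def min_max_scale_to(X, low, high):
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    Z01 = (X - col_min) / (col_max - col_min)  # values in [0, 1]
    return Z01 * (high - low) + low            # stretch and shift to [low, high]

X_demo = np.array([
    [1, 10],
    [2, 30],
    [5, 20]
])
Z_range = min_max_scale_to(X_demo, 1, 2)
print(Z_range)  # rows: [1, 1], [1.25, 2], [2, 1.5]
```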

3. Unit-norm normalization (Normalizer)

Normalizer rescales each sample (each row) of $X$ individually so that it has unit norm. The procedure is as follows:

$\boldsymbol{x}_i:=\frac{\boldsymbol{x}_i}{\Vert \boldsymbol{x}_i\Vert_p},\quad i=1,2,\cdots,n,\quad p=1,2,\infty$

$p = 1$ gives the L1 norm, $p = 2$ the L2 norm, and $p=\infty$ the infinity (maximum) norm. Normalizer uses the L2 norm by default.
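
The row-wise rescaling can be sketched with np.linalg.norm. This is an illustration only: normalize_rows is a hypothetical helper, X_rows is made-up data, and no row is assumed to be all zeros (otherwise the division would fail).

```python
import numpy as np

# Hypothetical helper (not part of sklearn): divide each row by its p-norm.
# p may be 1, 2, or np.inf. Assumes no row is the zero vector.
def normalize_rows(X, p=2):
    X = np.asarray(X, dtype=float)
    # axis=1 computes one norm per row; keepdims=True gives shape (n, 1)
    norms = np.linalg.norm(X, ord=p, axis=1, keepdims=True)
    return X / norms

X_rows = np.array([
    [3, 4],
    [0, -2]
])
Z_l2 = normalize_rows(X_rows)            # rows: [0.6, 0.8], [0, -1]
Z_l1 = normalize_rows(X_rows, p=1)       # rows: [3/7, 4/7], [0, -1]
Z_max = normalize_rows(X_rows, p=np.inf) # rows: [0.75, 1], [0, -1]
print(Z_l2)
```

Unlike the scalers in the previous sections, the norms here are computed along each row, so every sample is rescaled independently of the others.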

Apply L2 normalization to $X$:

from sklearn.preprocessing import Normalizer
import numpy as np

X = np.array([
    [1, 2, 3, 4],
    [5, 6, 7, 8]
])
scaler = Normalizer().fit(X)
print(scaler.transform(X))
# [[0.18257419 0.36514837 0.54772256 0.73029674]
# [0.37904902 0.45485883 0.53066863 0.60647843]]

To use the maximum norm or the L1 norm instead, pass the norm to the constructor:

scaler = Normalizer(norm='max').fit(X)
print(scaler.transform(X))
# [[0.25 0.5 0.75 1. ]
# [0.625 0.75 0.875 1. ]]

scaler = Normalizer(norm='l1').fit(X)
print(scaler.transform(X))
# [[0.1 0.2 0.3 0.4 ]
# [0.19230769 0.23076923 0.26923077 0.30769231]]

4. Max-abs scaling (MaxAbsScaler)

Max-abs scaling divides each column of $X$ by the largest absolute value in that column. The procedure is as follows.

Let

$\mathrm{MaxAbs}(\boldsymbol{a}_i)=\max(|x_{1i}|,|x_{2i}|,\cdots,|x_{ni}|),\quad \mathrm{MaxAbs}(X)=(\mathrm{MaxAbs}(\boldsymbol{a}_1),\cdots, \mathrm{MaxAbs}(\boldsymbol{a}_d))^{\mathrm T}$

Let $Z$ denote the result of max-abs scaling $X$. Using numpy's broadcasting convention, we have

$Z=\frac{X}{\mathrm{MaxAbs}(X)^{\mathrm T}}$
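
In numpy this is a single broadcast division. The sketch below is illustrative rather than sklearn's implementation: max_abs_scale is a hypothetical helper, X_signed is made-up data, and every column is assumed to contain at least one nonzero entry.

```python
import numpy as np

# Hypothetical helper (not part of sklearn): divide each column by its
# largest absolute value. Assumes no column is entirely zero.
def max_abs_scale(X):
    X = np.asarray(X, dtype=float)
    max_abs = np.abs(X).max(axis=0)  # shape (d,): one value per column
    return X / max_abs

X_signed = np.array([
    [2, -8],
    [-4, 4]
])
Z_signed = max_abs_scale(X_signed)
print(Z_signed)  # rows: [0.5, -1], [-1, 0.5]
```

Since the data is only divided and never shifted, signs and zeros are preserved and every entry ends up in $[-1, 1]$.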

Apply max-abs scaling to $X$:

from sklearn.preprocessing import MaxAbsScaler
import numpy as np

X = np.array([
    [1, -1, 2],
    [2, 0, 0],
    [0, 1, -1]
])
scaler = MaxAbsScaler().fit(X)
print(scaler.transform(X))
# [[ 0.5 -1. 1. ]
# [ 1. 0. 0. ]
# [ 0. 1. -0.5]]

5. Binarization (Binarizer)

Binarizing $X$ means choosing a threshold and setting every element of $X$ greater than the threshold to $1$ and every element less than or equal to the threshold to $0$.

The default threshold of Binarizer is $0$.

Binarize $X$:

from sklearn.preprocessing import Binarizer
import numpy as np

X = np.array([
    [1, -1, 2],
    [2, 0, 0],
    [0, 1, -1]
])
transformer = Binarizer().fit(X)
print(transformer.transform(X))
# [[1 0 1]
# [1 0 0]
# [0 1 0]]

If the threshold is set to $1$, the result becomes:

transformer = Binarizer(threshold=1).fit(X)
print(transformer.transform(X))
# [[0 0 1]
# [1 0 0]
# [0 0 0]]

In fact, thanks to numpy's element-wise operations, we can carry out this operation with numpy alone:

import numpy as np

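def binarizer(X, threshold):
    # Build the result from a single comparison against the original values;
    # assigning 1s and then 0s in place would wipe out the 1s when threshold >= 1
    return (X > threshold).astype(X.dtype)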


X = np.array([
    [1, -1, 2], 
    [2, 0, 0], 
    [0, 1, -1]
])
print(binarizer(X, 0))
# [[1 0 1]
# [1 0 0]
# [0 1 0]]