
Machine learning notes - supervised learning memo list

2022-06-13 06:36:00 Sit and watch the clouds rise

One、Introduction to supervised learning

Given a set of data points \{x^{(1)}, ..., x^{(m)}\} associated with a set of outcomes \{y^{(1)}, ..., y^{(m)}\}, we want to build a classifier that learns how to predict y from x.

1、Prediction types

The following table summarizes the different types of prediction models:

2、 Model type

The following table summarizes the different models:

Two、Notation and general concepts

1、Hypothesis

The hypothesis, denoted h_\theta, is the model we choose. For a given input x^{(i)}, the model's predicted output is h_\theta(x^{(i)}).

2、Loss function

A loss function is a function L:(z,y)\in\mathbb{R}\times\mathbb{R}\longmapsto L(z,y)\in\mathbb{R} that takes as input a predicted value z and the corresponding true value y, and outputs how different they are. The following table summarizes the common loss functions:

3、Cost function

The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:

        \boxed{J(\theta)=\sum_{i=1}^mL(h_\theta(x^{(i)}), y^{(i)})}

4、Gradient descent

With \alpha\in\mathbb{R} the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:

        \boxed{\theta\longleftarrow\theta-\alpha\nabla J(\theta)}

Remark: stochastic gradient descent (SGD) updates the parameters based on each training example, while batch gradient descent does so on a batch of training examples.
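To make the two variants concrete, here is a minimal NumPy sketch assuming a caller-supplied gradient function; the names `batch_gradient_descent`, `sgd`, `grad_J`, `grad_L` and the quadratic toy example are illustrative, not part of the original notes.

```python
import numpy as np

def batch_gradient_descent(grad_J, theta, alpha=0.1, n_steps=100):
    """Batch gradient descent: theta <- theta - alpha * grad J(theta)."""
    for _ in range(n_steps):
        theta = theta - alpha * grad_J(theta)
    return theta

def sgd(grad_L, theta, X, y, alpha=0.01, n_epochs=10):
    """Stochastic gradient descent: update theta on each training example."""
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            theta = theta - alpha * grad_L(theta, x_i, y_i)
    return theta

# Toy example: minimize J(theta) = ||theta||^2, whose gradient is 2*theta.
theta_opt = batch_gradient_descent(lambda t: 2 * t, theta=np.array([3.0, -2.0]))
print(theta_opt)  # close to [0, 0]
```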

5、Likelihood

The likelihood of a model L(\theta), given parameters \theta, is used to find the optimal parameters \theta through likelihood maximization. We have:

        \boxed{\theta^{\textrm{opt}}=\underset{\theta}{\textrm{arg max }}L(\theta)}

Remark: in practice, we use the log-likelihood \ell(\theta)=\log(L(\theta)), which is easier to optimize.

6、Newton's algorithm

Newton's algorithm is a numerical method that finds \theta such that \ell'(\theta)=0. Its update rule is as follows: \boxed{\theta\leftarrow\theta-\frac{\ell'(\theta)}{\ell''(\theta)}}

Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:

        \theta\leftarrow\theta-\left(\nabla_\theta^2\ell(\theta)\right)^{-1}\nabla_\theta\ell(\theta)
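A small sketch of the one-dimensional update above, assuming the first and second derivatives of \ell are supplied by the caller; the names `ell_prime`, `ell_double_prime` and the toy objective are illustrative.

```python
def newton_method(ell_prime, ell_double_prime, theta, n_steps=20):
    """One-dimensional Newton update: theta <- theta - l'(theta) / l''(theta)."""
    for _ in range(n_steps):
        theta = theta - ell_prime(theta) / ell_double_prime(theta)
    return theta

# Toy example: maximize l(theta) = -(theta - 2)^2,
# so l'(theta) = -2*(theta - 2) and l''(theta) = -2; the optimum is theta = 2.
print(newton_method(lambda t: -2 * (t - 2), lambda t: -2.0, theta=0.0))
```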

Three、Linear models

1、 Linear regression

We assume here that y|x;\theta\sim\mathcal{N}(\mu,\sigma^2).

(1)Normal equations

With X the design matrix, the value of \theta that minimizes the cost function has a closed-form solution such that:

        \boxed{\theta=(X^TX)^{-1}X^Ty}
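A sketch of the closed-form solution in NumPy, assuming X already contains an intercept column; it solves the normal equations rather than forming the explicit inverse, which is numerically preferable and follows the same formula. The toy data is illustrative.

```python
import numpy as np

def normal_equations(X, y):
    """Return theta = (X^T X)^{-1} X^T y by solving the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy example: fit y = 1 + 2x on noisy data, with a column of ones as intercept.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x + 0.05 * rng.normal(size=50)
print(normal_equations(X, y))  # approximately [1, 2]
```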

(2)LMS algorithm

With \alpha the learning rate, the update rule of the least mean squares (LMS) algorithm for a training set of m data points, also known as the Widrow-Hoff learning rule, is as follows:

        \boxed{\forall j,\quad \theta_j \leftarrow \theta_j+\alpha\sum_{i=1}^m\left[y^{(i)}-h_\theta(x^{(i)})\right]x_j^{(i)}}

Remark: this update rule is a particular case of gradient ascent.

(3)LWR

Locally weighted regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w^{(i)}(x), which is defined with parameter \tau\in\mathbb{R} as:

        \boxed{w^{(i)}(x)=\exp\left(-\frac{(x^{(i)}-x)^2}{2\tau^2}\right)}
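A sketch of an LWR prediction at a single query point, assuming X includes an intercept column; the weighted least-squares solve and the default bandwidth `tau` are illustrative choices.

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    """Locally weighted regression prediction at a single query point x_query."""
    # Weights w^(i)(x) = exp(-||x^(i) - x||^2 / (2 tau^2))
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    # Weighted normal equations: theta = (X^T W X)^{-1} X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta
```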

2、Classification and logistic regression

(1)Sigmoid function

The sigmoid function g, also known as the logistic function, is defined as follows: \forall z\in\mathbb{R},\quad\boxed{g(z)=\frac{1}{1+e^{-z}}\in[0,1]}

(2)Logistic regression

We assume here that y|x;\theta\sim\textrm{Bernoulli}(\phi). We have the following form:

        \boxed{\phi=p(y=1|x;\theta)=\frac{1}{1+\exp(-\theta^Tx)}=g(\theta^Tx)}

Remark: logistic regression does not have a closed-form solution.
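Since there is no closed-form solution, the parameters are typically found iteratively. Below is a minimal sketch using gradient ascent on the log-likelihood (Newton's method is another common choice); function and variable names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, alpha=0.1, n_steps=1000):
    """Gradient ascent on the log-likelihood of y|x ~ Bernoulli(g(theta^T x))."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        # Average gradient of the log-likelihood: X^T (y - g(X theta)) / m
        theta += alpha * X.T @ (y - sigmoid(X @ theta)) / len(y)
    return theta

def predict_proba(theta, X):
    """Predicted probabilities phi = g(theta^T x) for each row of X."""
    return sigmoid(X @ theta)
```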

(3)Softmax regression

Softmax regression, also called multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set \theta_K=0, which makes the Bernoulli parameter \phi_i of each class i equal to:

        \boxed{\displaystyle\phi_i=\frac{\exp(\theta_i^Tx)}{\displaystyle\sum_{j=1}^K\exp(\theta_j^Tx)}}
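A short sketch of computing the class probabilities \phi_i, assuming the rows of `Theta` are the parameter vectors \theta_i; the max-subtraction is a standard numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax_proba(Theta, x):
    """Class probabilities phi_i = exp(theta_i^T x) / sum_j exp(theta_j^T x).

    Theta has shape (K, n): one parameter vector theta_i per class.
    """
    scores = Theta @ x
    scores -= scores.max()          # numerical stability; leaves the ratios unchanged
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()
```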

3、Generalized linear models

(1)Exponential family

A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter (also called the canonical parameter or link function) \eta, a sufficient statistic T(y), and a log-partition function a(\eta) as follows:

        \boxed{p(y;\eta)=b(y)\exp(\eta T(y)-a(\eta))}

Remark: we will often have T(y)=y. Also, \exp(-a(\eta)) can be seen as a normalization parameter that makes sure the probabilities sum to one.

The following table summarizes the most common exponential family distributions:

(2)Assumptions of GLMs

Generalized linear models (GLMs) aim at predicting a random variable y as a function of x\in\mathbb{R}^{n+1} and rely on the following 3 assumptions:

        (1) \quad\boxed{y|x;\theta\sim\textrm{ExpFamily}(\eta)} (2) \quad\boxed{h_\theta(x)=E[y|x;\theta]} (3) \boxed{\eta=\theta^Tx}

Remark: ordinary least squares and logistic regression are special cases of generalized linear models.

Four、Support vector machines

The goal of support vector machines is to find the line that maximizes the minimum distance to the line.

1、Optimal margin classifier

The optimal margin classifier h is such that: \boxed{h(x)=\textrm{sign}(w^Tx-b)}, where (w, b)\in\mathbb{R}^n\times\mathbb{R} is the solution of the following optimization problem: \boxed{\min\frac{1}{2}||w||^2}\quad\quad\textrm{such that }\quad \boxed{y^{(i)}(w^Tx^{(i)}-b)\geqslant1}

Remark: the decision boundary is defined as \boxed{w^Tx-b=0}.

2、Hinge loss

The hinge loss is used in the setting of SVMs and is defined as follows: \boxed{L(z,y)=[1-yz]_+=\max(0,1-yz)}
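A sketch of the hinge loss together with one possible sub-gradient step for a regularized linear SVM with labels y in {-1, +1}; the regularization constant `lam` and the step itself are illustrative choices, not the only way to train an SVM.

```python
import numpy as np

def hinge_loss(z, y):
    """L(z, y) = max(0, 1 - y z), with y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * z)

def svm_subgradient_step(w, b, x, y, alpha=0.01, lam=0.01):
    """One sub-gradient step on lam/2 * ||w||^2 + hinge_loss(w^T x - b, y)."""
    margin = y * (w @ x - b)
    grad_w = lam * w - (y * x if margin < 1 else 0.0)
    grad_b = y if margin < 1 else 0.0
    return w - alpha * grad_w, b - alpha * grad_b
```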

3、Kernel

Given a feature mapping \phi, we define the kernel K as follows: \boxed{K(x,z)=\phi(x)^T\phi(z)}

In practice, the kernel K defined by K(x,z)=\exp\left(-\frac{||x-z||^2}{2\sigma^2}\right) is called the Gaussian kernel and is commonly used.

Remark: we say that we use the "kernel trick" to compute the cost function with the kernel because we do not actually need to know the explicit mapping \phi, which is often very complicated. Instead, only the values K(x,z) are needed.
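A sketch of the Gaussian kernel computed directly from K(x,z), without ever constructing the feature map \phi, which is exactly what the kernel trick allows; function names are illustrative.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def gram_matrix(X, sigma=1.0):
    """Pairwise kernel values K(x^(i), x^(j)) for a data matrix X of shape (m, n)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))
```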

4、Lagrangian

We define the Lagrangian \mathcal{L}(w,b) as follows: \boxed{\mathcal{L}(w,b)=f(w)+\sum_{i=1}^l\beta_ih_i(w)}

Remark: the coefficients \beta_i are called the Lagrange multipliers.

Five、Generative learning

A generative model first tries to learn how the data is generated by estimating P(x|y), which we can then use to estimate P(y|x) with Bayes' rule.

1、Gaussian Discriminant Analysis

(1)Setting        

Gaussian Discriminant Analysis assumes that y, x|y=0 and x|y=1 are such that:
        (1) \quad\boxed{y\sim\textrm{Bernoulli}(\phi)}  (2) \quad\boxed{x|y=0\sim\mathcal{N}(\mu_0,\Sigma)} (3) \quad\boxed{x|y=1\sim\mathcal{N}(\mu_1,\Sigma)}

(2)Estimation

The following table summarizes the estimates found when maximizing the likelihood:
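As a sketch, the standard closed-form maximum-likelihood estimates of (\phi, \mu_0, \mu_1, \Sigma) can be computed from data as follows, with a shared covariance built from the class-specific means; the function name and the assumption that y takes values in {0, 1} are illustrative.

```python
import numpy as np

def gda_fit(X, y):
    """Maximum-likelihood estimates for Gaussian Discriminant Analysis (y in {0, 1})."""
    m = len(y)
    phi = np.mean(y == 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    # Shared covariance: average outer product of examples centered on their class mean.
    centered = X - np.where(y[:, None] == 1, mu1, mu0)
    Sigma = centered.T @ centered / m
    return phi, mu0, mu1, Sigma
```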

2、Naive Bayes

(1)Assumption

The Naive Bayes model assumes that the features of each data point are all independent:

        \boxed{P(x|y)=P(x_1,x_2,...|y)=P(x_1|y)P(x_2|y)...=\prod_{i=1}^nP(x_i|y)}

(2)Solutions

Maximizing the log-likelihood gives the following solutions:

\boxed{P(y=k)=\frac{1}{m}\times\#\{j|y^{(j)}=k\}}\quad\textrm{ and }\quad\boxed{P(x_i=l|y=k)=\frac{\#\{j|y^{(j)}=k\textrm{ and }x_i^{(j)}=l\}}{\#\{j|y^{(j)}=k\}}}

with k\in\{0,1\} and l\in[\![1,L]\!]

Remark: Naive Bayes is widely used for text classification and spam detection.
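A sketch of the count-based solutions above for discrete features; Laplace smoothing is deliberately omitted to stay close to the formulas (though it is commonly added in practice), and encoding feature values as integers starting at 0 is an assumption of this sketch.

```python
import numpy as np

def naive_bayes_fit(X, y, n_values, n_classes=2):
    """Estimate P(y=k) and P(x_i=l | y=k) by counting, as in the formulas above.

    X: integer matrix of shape (m, n) with feature values in {0, ..., n_values-1}.
    Assumes every class k appears at least once in the training set.
    """
    m, n = X.shape
    prior = np.array([(y == k).mean() for k in range(n_classes)])
    cond = np.zeros((n_classes, n, n_values))
    for k in range(n_classes):
        Xk = X[y == k]
        for i in range(n):
            for l in range(n_values):
                cond[k, i, l] = (Xk[:, i] == l).mean()
    return prior, cond
```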

Six、Tree-based and ensemble methods

These methods can be used for both regression and classification problems.

1、CART

Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage of being easily interpretable.

2、Random forest

Random forest is a tree-based technique that uses a large number of decision trees built on randomly selected sets of features. Unlike a simple decision tree, it is highly uninterpretable, but its generally good performance makes it a popular algorithm. Remark: random forests are a type of ensemble method.
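In practice, a random forest is usually trained with an off-the-shelf library rather than written by hand; the snippet below uses scikit-learn's RandomForestClassifier as one such option, with a placeholder dataset and illustrative hyperparameters.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each grown on a bootstrap sample with a random subset of features per split.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```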

3、Boosting

The idea of boosting methods is to combine several weak learners to form a stronger one. The main types are summarized below:

Adaptive boosting:
• High weights are put on errors to improve at the next boosting step
• Known as Adaboost

Gradient boosting:
• Weak learners are trained on residuals
• Examples include XGBoost

Seven、Other non-parametric methods

1、k-nearest neighbors

The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors in the training set. It can be used in both classification and regression settings.

Remark: the higher the parameter k, the higher the bias; the lower the parameter k, the higher the variance.
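A minimal sketch of k-NN classification with Euclidean distance and majority vote; the tie-breaking behavior (np.bincount returns the smallest label on ties) and integer-encoded labels are assumptions of this sketch.

```python
import numpy as np

def knn_predict(x_query, X_train, y_train, k=5):
    """Classify x_query by majority vote among its k nearest training points.

    y_train is assumed to contain non-negative integer class labels.
    """
    dists = np.sum((X_train - x_query) ** 2, axis=1)
    nearest = np.argsort(dists)[:k]
    return np.bincount(y_train[nearest]).argmax()
```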

Eight、Learning theory

1、Union bound

Let A_1, ..., A_k be k events. We have: \boxed{P(A_1\cup ...\cup A_k)\leqslant P(A_1)+...+P(A_k)}

2、Hoeffding inequality

Let Z_1, ..., Z_m be m iid variables drawn from a Bernoulli distribution of parameter \phi. Let \widehat{\phi} be their sample mean and \gamma>0 be fixed. We have: \boxed{P(|\phi-\widehat{\phi}|>\gamma)\leqslant2\exp(-2\gamma^2m)}

Remark: this inequality is also known as the Chernoff bound.

3、Training error

For a given classifier h, we define the training error \widehat{\epsilon}(h), also known as the empirical risk or empirical error, as follows:

        \boxed{\widehat{\epsilon}(h)=\frac{1}{m}\sum_{i=1}^m1_{\{h(x^{(i)})\neq y^{(i)}\}}}
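Given a vectorized classifier, the empirical risk is a one-line computation; the sketch below assumes h maps a data matrix to a vector of predicted labels.

```python
import numpy as np

def training_error(h, X, y):
    """Empirical risk: fraction of training examples that the classifier h gets wrong."""
    return np.mean(h(X) != y)
```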

4、Probably Approximately Correct (PAC)

Probably Approximately Correct (PAC) is a framework under which numerous results on learning theory have been proved, and it relies on the following set of assumptions:

         The training set and the test set follow the same distribution

         Training samples are drawn independently

5、Shattering

Given a set S=\{x^{(1)},...,x^{(d)}\} and a set of classifiers \mathcal{H}, we say that \mathcal{H} shatters S if for any set of labels \{y^{(1)}, ..., y^{(d)}\} we have: \boxed{\exists h\in\mathcal{H}, \quad \forall i\in[\![1,d]\!],\quad h(x^{(i)})=y^{(i)}}

6、Upper bound theorem

Let \mathcal{H} be a finite hypothesis class such that |\mathcal{H}|=k, and let \delta and the sample size m be fixed. Then, with probability of at least 1-\delta, we have:

        \boxed{\epsilon(\widehat{h})\leqslant\left(\min_{h\in\mathcal{H}}\epsilon(h)\right)+2\sqrt{\frac{1}{2m}\log\left(\frac{2k}{\delta}\right)}}

7、VC dimension

The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class \mathcal{H}, denoted \textrm{VC}(\mathcal{H}), is the size of the largest set that is shattered by \mathcal{H}.

Remark: the VC dimension of {\small\mathcal{H}=\{\textrm{set of linear classifiers in 2 dimensions}\}} is 3.

 8、Theorem (Vapnik)

Let \mathcal{H} be given, with \textrm{VC}(\mathcal{H})=d, and let m be the number of training examples. With probability of at least 1-\delta, we have:

        \boxed{\epsilon(\widehat{h})\leqslant \left(\min_{h\in\mathcal{H}}\epsilon(h)\right) + O\left(\sqrt{\frac{d}{m}\log\left(\frac{m}{d}\right)+\frac{1}{m}\log\left(\frac{1}{\delta}\right)}\right)}


Copyright notice
This article was written by [Sit and watch the clouds rise]. Please include a link to the original when reposting.
https://yzsam.com/2022/164/202206130617323140.html