Notes on Andrew Ng's Machine Learning Course 04: Multiple Linear Regression
4 Multiple Linear Regression
4.1 Multiple Features
The linear regression problem studied earlier had only a single feature, but many problems involve multiple features. Linear regression with multiple features is called multiple linear regression.
Notation
Use $n$ to denote the number of features, and $x^{(i)}$ to denote the input feature vector of the $i$-th sample; this vector is $n$-dimensional. (Note that the feature vector here is not the eigenvector of matrix theory.)
Use $x^{(i)}_{j}$ to denote the value of the $j$-th feature of the $i$-th sample.
Hypothesis function
The hypothesis $h$ for multivariable linear regression is written as $h_{\theta}(x)=\theta_{0}+\theta_{1} x_{1}+\theta_{2} x_{2}+\ldots+\theta_{n} x_{n}$.
To simplify the notation, introduce $x_{0}=1$, so that $h_{\theta}(x)=\theta_{0} x_{0}+\theta_{1} x_{1}+\theta_{2} x_{2}+\ldots+\theta_{n} x_{n}$. The feature vector then becomes $(n+1)$-dimensional.
Writing the features as a vector $X$ and the parameters as a vector $\Theta$, the hypothesis becomes $h_{\theta}(x)=\Theta^{T}X$.
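A minimal sketch of this vectorized form, assuming NumPy (the data values here are illustrative, not from the course):

```python
import numpy as np

# Illustrative training data: m = 3 samples, n = 2 features
X = np.array([[2104.0, 3.0],
              [1416.0, 2.0],
              [1534.0, 3.0]])

# Introduce x0 = 1 by prepending a column of ones: X is now m x (n+1)
X = np.hstack([np.ones((X.shape[0], 1)), X])

theta = np.array([80.0, 0.1, 25.0])  # (n+1)-dimensional parameter vector

# h_theta(x) = Theta^T x, computed for all samples at once
predictions = X @ theta
print(predictions)
```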
Summary
With the convention $x_{0}=1$, the multivariable hypothesis can be written compactly as $h_{\theta}(x)=\Theta^{T}X$.
4.2 Gradient Descent for Multiple Variables
Parameter update rule for multiple variables
All parameters are updated simultaneously:
$\theta_{0}:=\theta_{0}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{0}^{(i)}$
$\theta_{1}:=\theta_{1}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{1}^{(i)}$
$\theta_{2}:=\theta_{2}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{2}^{(i)}$
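A minimal sketch of these updates in NumPy (the function name and data are illustrative assumptions, not from the course); one vectorized step updates every $\theta_{j}$ simultaneously:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent for linear regression.

    X: m x (n+1) feature matrix whose first column is x0 = 1.
    y: m-dimensional vector of targets.
    """
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        errors = X @ theta - y               # h_theta(x^(i)) - y^(i) for every i
        theta -= alpha * (X.T @ errors) / m  # simultaneous update of all theta_j
    return theta

# Tiny illustrative dataset (features already on similar scales)
X = np.array([[1.0, 0.5], [1.0, -0.3], [1.0, 0.1], [1.0, -0.2]])
y = np.array([2.0, 0.5, 1.0, 0.7])
print(gradient_descent(X, y))
```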
Summary
The parameter update rule for multivariable gradient descent has the same form as in the univariate case.
4.3-4.4 Practical Tips for Gradient Descent
Tip 1: Feature scaling
For problems with multiple features, gradient descent converges faster when the features are on similar scales, that is, when the values of different features fall in similar ranges.
The remedy is to scale all features into roughly the range $-1$ to $1$ (values close to that range are fine). One way is to divide each feature by its maximum value. Mean normalization can also be used: for $i \ge 1$, set $x_{i}=\frac{x_{i}-\mu_{i}}{s_{i}}$, where $\mu_{i}$ is the mean of the feature and $s_{i}$ is its range (maximum minus minimum); this puts the feature $x_{i}$ roughly between $-0.5$ and $0.5$. Alternatively, $s_{i}$ can be set to the standard deviation.
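A minimal sketch of mean normalization, assuming NumPy and using the range for $s_{i}$ (the standard deviation would work too); the data is illustrative:

```python
import numpy as np

def mean_normalize(X):
    """Scale each feature column to (x - mu) / s, with s = max - min."""
    mu = X.mean(axis=0)                # per-feature mean mu_i
    s = X.max(axis=0) - X.min(axis=0)  # per-feature range s_i
    return (X - mu) / s

# Illustrative features on very different scales (size, number of rooms)
X = np.array([[2104.0, 3.0],
              [1416.0, 2.0],
              [1534.0, 3.0],
              [ 852.0, 1.0]])
print(mean_normalize(X))  # every column now lies roughly in [-0.5, 0.5]
```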
Tip 2: Learning rate
The number of iterations gradient descent needs can vary greatly from problem to problem. Plotting the cost function against the number of iterations makes it possible to judge whether gradient descent has converged. An automatic convergence test can also be used, comparing the decrease in the cost function after one iteration against a threshold (for example $0.001$); however, choosing an appropriate threshold is usually difficult, so judging convergence from the plot is more practical.
The same plot also reveals whether the algorithm is working correctly: if the curve is rising, or rises in places, the learning rate $\alpha$ is likely too large.
Every iteration of gradient descent is affected by the learning rate. If the learning rate is too small, convergence requires a very large number of iterations; if it is too large, an iteration may fail to decrease the cost function and may overshoot the local minimum, so that the algorithm never converges.
A common approach is to try learning rates such as $\alpha = 0.001, 0.003, 0.01, 0.03, 0.1, 1$, roughly tripling the value at each step, pick out the values that make the cost function decrease quickly, and take the largest of them as the final choice of $\alpha$.
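A minimal sketch of that procedure, assuming NumPy and Matplotlib (data and helper names are illustrative). Plotting $J(\theta)$ against the iteration count for each candidate $\alpha$ shows which rates descend quickly and which rise or stall:

```python
import numpy as np
import matplotlib.pyplot as plt

def cost(X, y, theta):
    """Squared-error cost J(theta)."""
    errors = X @ theta - y
    return errors @ errors / (2 * len(y))

def cost_history(X, y, alpha, num_iters=200):
    """Run gradient descent and record J(theta) after every iteration."""
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(num_iters):
        theta -= alpha * (X.T @ (X @ theta - y)) / len(y)
        history.append(cost(X, y, theta))
    return history

X = np.array([[1.0, 0.5], [1.0, -0.3], [1.0, 0.1], [1.0, -0.2]])
y = np.array([2.0, 0.5, 1.0, 0.7])

for alpha in (0.001, 0.003, 0.01, 0.03, 0.1, 1):
    plt.plot(cost_history(X, y, alpha), label=f"alpha = {alpha}")
plt.xlabel("number of iterations")
plt.ylabel("cost J(theta)")
plt.legend()
plt.show()
```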
Summary
Feature scaling makes gradient descent converge faster, requiring fewer iterations.
Plotting the cost function against the number of iterations helps in choosing an appropriate learning rate.
4.5 Features and Polynomial Regression
Choosing features
Two features such as the frontage and depth of a lot can sometimes be replaced by a single feature, their product the area; defining new features in this way may lead to a better model.
Linear regression does not fit all data; sometimes a quadratic or cubic model is needed. Such models can still be cast as linear regression: for example, $h_{\theta}(x)=\theta_{0}+\theta_{1} x+\theta_{2} x^{2}+\theta_{3} x^{3}$ becomes $h_{\theta}(x)=\theta_{0}+\theta_{1} x_{1}+\theta_{2} x_{2}+\theta_{3} x_{3}$ by setting $x_{1}=x$, $x_{2}=x^{2}$, $x_{3}=x^{3}$. Polynomial regression is therefore closely tied to the choice of features, and with features like $x_{2}=x^{2}$, $x_{3}=x^{3}$, feature scaling becomes very important.
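A minimal sketch of that substitution, assuming NumPy (the sizes are illustrative). Note how wildly the ranges of $x$, $x^{2}$, and $x^{3}$ differ, which is why feature scaling matters here:

```python
import numpy as np

# Illustrative single feature, e.g. house size
x = np.array([50.0, 80.0, 120.0, 200.0])

# New features x1 = x, x2 = x^2, x3 = x^3 turn the cubic model
# into a linear regression over (x1, x2, x3)
X_poly = np.column_stack([x, x**2, x**3])

# The columns span ranges differing by orders of magnitude,
# so feature scaling is essential before gradient descent
print(X_poly.min(axis=0))  # [5.00e+01 2.50e+03 1.25e+05]
print(X_poly.max(axis=0))  # [2.00e+02 4.00e+04 8.00e+06]
```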
Summary
For multivariable linear regression problems, choosing appropriate features makes the learning algorithm more effective.
With suitably chosen features, polynomial regression can be fitted using linear regression.
4.6 The Normal Equation
Normal equation method
To minimize the cost function, gradient descent converges to the global minimum over many iterations. The normal equation, by contrast, is an analytic solution for $\theta$: it computes the optimal value in a single step.
Following calculus, take the partial derivatives of the cost function and solve the resulting system of linear equations $\frac{\partial}{\partial \theta_{j}} J(\theta)=0$ to minimize the cost function.
Let $X$ be the feature matrix of the training set (including the column $x^{(i)}_{0}=1$) and let $y$ be the vector of training targets. Then the $\theta$ that minimizes the cost function is given by the normal equation $\theta=\left(X^{T} X\right)^{-1} X^{T} y$.
Note that feature scaling is not needed when using the normal equation method.
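A minimal sketch of the normal equation in NumPy (illustrative data; `np.linalg.solve` is used here instead of forming the inverse explicitly, which is numerically preferable):

```python
import numpy as np

# Illustrative training set with x0 = 1 already in the first column
X = np.array([[1.0, 2104.0],
              [1.0, 1416.0],
              [1.0, 1534.0],
              [1.0,  852.0]])
y = np.array([460.0, 232.0, 315.0, 178.0])

# theta = (X^T X)^{-1} X^T y, solved without inverting X^T X explicitly
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)

# Equivalent result via the pseudoinverse, which also copes
# with a non-invertible X^T X
print(np.linalg.pinv(X) @ y)
```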
Comparison of gradient descent and the normal equation
| Gradient descent | Normal equation |
|---|---|
| Requires trying and choosing a learning rate | No learning rate to choose |
| Needs many iterations; computation may be slower | Solved in one computation; no extra steps needed to test convergence |
| Works well even when the number of features is large | Must compute $\left(X^{T} X\right)^{-1}$; expensive when the number of features $n$ is large, since matrix inversion has time complexity $O(n^{3})$, though generally acceptable when $n$ is below about 10,000 |
| Suitable for many kinds of models | Only suitable for linear models; does not apply to logistic regression, classification, and other models |
Summary
For a linear regression model, the normal equation method is preferable when the number of features $n$ is small ($n$ below about 10,000); gradient descent is preferable when $n$ is large.
For more complex models, the normal equation method does not apply.