Notes on Teacher Wu Enda (Andrew Ng)'s Machine Learning Course 04: Multiple Linear Regression
2022-07-29 06:53:00
4 Multiple linear regression
4.1 Multiple features
The linear regression problems studied so far had only a single feature, but many problems involve several features; linear regression with multiple features is called multiple linear regression.
Notation conventions
Use $n$ to denote the number of features, and $x^{(i)}$ to denote the input feature vector of the $i$-th training sample; this vector is $n$-dimensional. (Note that the feature vector here is not the eigenvector of a matrix from linear algebra.)
Use $x^{(i)}_j$ to denote the value of the $j$-th feature of the $i$-th sample.
Hypothesis function
The hypothesis function $h$ of multiple linear regression is written as $h_{\theta}(x)=\theta_{0}+\theta_{1} x_{1}+\theta_{2} x_{2}+\ldots+\theta_{n} x_{n}$.
To simplify the notation, introduce $x_{0}=1$, so that $h_{\theta}(x)=\theta_{0} x_{0}+\theta_{1} x_{1}+\theta_{2} x_{2}+\ldots+\theta_{n} x_{n}$. The feature vector then becomes $(n+1)$-dimensional.
Writing the features as a vector $X$ and the parameters as a vector $\Theta$, the hypothesis becomes $h_{\theta}(x)=\Theta^{T}X$.
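As an illustration (not part of the original notes), here is a minimal NumPy sketch of the vectorized hypothesis; the function name `predict` and the sample values are made up for the example.

```python
import numpy as np

def predict(X, theta):
    """Vectorized hypothesis h_theta(x) = Theta^T x for all m samples at once.

    X     : (m, n+1) design matrix whose first column is all ones (x0 = 1)
    theta : (n+1,) parameter vector
    """
    return X @ theta

# Example: m = 3 samples, n = 2 features (values are illustrative only)
X_raw = np.array([[2104.0, 3.0],
                  [1416.0, 2.0],
                  [ 852.0, 1.0]])
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])  # prepend x0 = 1
theta = np.array([0.5, 0.1, 20.0])
print(predict(X, theta))  # one prediction per sample
```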
Summary
With the convention $x_{0}=1$, the hypothesis of multiple linear regression can be written compactly in vectorized form as $h_{\theta}(x)=\Theta^{T}X$.
4.2 Gradient descent for multiple variables
Parameter update rule of multivariate gradient descent
$\theta_{0}:=\theta_{0}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{0}^{(i)}$
$\theta_{1}:=\theta_{1}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{1}^{(i)}$
$\theta_{2}:=\theta_{2}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{2}^{(i)}$
Summary
The parameter update rule of multivariate gradient descent has the same form as in the univariate case; it is simply applied to every parameter $\theta_{j}$ simultaneously.
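The update rule translates directly into a vectorized implementation. The following NumPy sketch is an illustrative addition (the function name and default values are assumptions, not from the course); note that all parameters are updated simultaneously in each iteration.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for multiple linear regression.

    Implements theta_j := theta_j - alpha * (1/m) * sum_i (h(x^(i)) - y^(i)) * x_j^(i)
    for every j at once.

    X : (m, n+1) design matrix with a leading column of ones
    y : (m,) target vector
    """
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        errors = X @ theta - y                # h_theta(x^(i)) - y^(i) for every sample
        theta -= alpha * (X.T @ errors) / m   # simultaneous update of all theta_j
    return theta
```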
4.3-4.4 Gradient descent in practice
Trick 1: Feature scaling
For problems with several features, gradient descent converges faster when the features are on similar scales, that is, when the values of the different features lie in comparable ranges.
The solution is to scale every feature to roughly the range $-1$ to $1$ (values close to this range are fine). A simple way is to divide each feature by its maximum value. Mean normalization can also be used: for $i \ge 1$, let $x_{i}=\frac{x_{i}-\mu_{i}}{s_{i}}$, where $\mu_{i}$ is the mean of the feature and $s_{i}$ is its range (maximum minus minimum); $s_{i}$ can also be set to the standard deviation. Using the range puts each feature $x_{i}$ roughly between $-0.5$ and $0.5$.
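For illustration (not from the original notes), a small NumPy sketch of mean normalization; the function name is assumed, and $s_i$ is taken as the range, as described above. The constant $x_{0}=1$ column should be added only after scaling.

```python
import numpy as np

def mean_normalize(X):
    """Mean normalization: x_i := (x_i - mu_i) / s_i for each raw feature column.

    s_i is taken as the feature's range (max - min); the standard deviation
    could be used instead. Apply this to the raw features only, before
    prepending the x0 = 1 column (whose range would be zero).
    """
    mu = X.mean(axis=0)
    s = X.max(axis=0) - X.min(axis=0)
    return (X - mu) / s, mu, s
```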
Trick 2: Learning rate
The number of iterations gradient descent needs can vary greatly from problem to problem. Plotting the cost function against the number of iterations makes it easy to judge whether gradient descent has converged. An automatic convergence test can also be used, comparing the decrease of the cost function in one iteration against a threshold (for example $0.001$); however, choosing a suitable threshold is usually difficult, so judging convergence from the plot is generally more practical.
The same plot also shows whether the algorithm is working properly: if the curve is rising, or rising over part of its course, the learning rate $\alpha$ is most likely too large.
Every iteration of gradient descent is affected by the learning rate. If the learning rate is too small, a very large number of iterations is needed to reach convergence; if it is too large, an iteration may fail to decrease the cost function and may overshoot the local minimum, so the algorithm does not converge.
In practice, try learning rates such as $\alpha=0.001, 0.003, 0.01, 0.03, 0.1, 1$, with each candidate roughly three times the previous one, pick out the values that make the cost function decrease quickly, and choose the largest of these as the final $\alpha$.
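As an illustrative sketch (not from the notes), the code below records the cost after every iteration so the cost-versus-iterations curve can be plotted for several candidate learning rates; `compute_cost` and `run_with_history` are hypothetical helper names.

```python
import numpy as np

def compute_cost(X, y, theta):
    """Squared-error cost J(theta) = (1/(2m)) * sum_i (h(x^(i)) - y^(i))^2."""
    m = X.shape[0]
    errors = X @ theta - y
    return (errors @ errors) / (2 * m)

def run_with_history(X, y, alpha, num_iters=400):
    """Gradient descent that records J(theta) after each iteration,
    so the cost-vs-iterations curve can be inspected for convergence."""
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(num_iters):
        theta -= alpha * (X.T @ (X @ theta - y)) / m
        history.append(compute_cost(X, y, theta))
    return theta, history

# Candidate learning rates spaced roughly 3x apart; keep the largest one
# whose cost history still decreases steadily, e.g.:
# for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 1]:
#     _, history = run_with_history(X, y, alpha)
```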
Summary
Feature scaling makes gradient descent converge faster, so fewer iterations are needed.
Observing the curve of the cost function against the number of iterations helps in choosing a suitable learning rate.
4.5 Features and polynomial regression
Feature selection
For example, two features such as the length and width of a rectangular plot can sometimes be replaced by a single feature (their product, the area); defining new features in this way can lead to a better model.
Linear regression does not fit every data set; sometimes a quadratic or cubic model is needed. Such models can still be cast as linear regression. For $h_{\theta}(x)=\theta_{0}+\theta_{1} x+\theta_{2} x^{2}+\theta_{3} x^{3}$, let $x_{1}=x$, $x_{2}=x^{2}$, $x_{3}=x^{3}$, which turns the model into $h_{\theta}(x)=\theta_{0}+\theta_{1} x_{1}+\theta_{2} x_{2}+\theta_{3} x_{3}$. Polynomial regression is therefore closely tied to feature selection, and when features such as $x_{2}=x^{2}$ and $x_{3}=x^{3}$ are chosen, feature scaling becomes very important.
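A short sketch, added for illustration, of how the new features $x_{1}=x$, $x_{2}=x^{2}$, $x_{3}=x^{3}$ can be constructed so that the cubic model is fitted with ordinary linear regression (the helper name `polynomial_features` is an assumption):

```python
import numpy as np

def polynomial_features(x, degree=3):
    """Build the design matrix [1, x, x^2, ..., x^degree] from a single feature x,
    so a degree-`degree` polynomial can be fitted by linear regression.

    Because x, x^2, x^3 have very different ranges, feature scaling of the
    resulting columns is important before running gradient descent."""
    x = np.asarray(x, dtype=float).reshape(-1)
    return np.column_stack([x ** d for d in range(degree + 1)])
```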
Summary
For multiple linear regression problems, choosing appropriate features makes the learning algorithm more effective.
Through feature selection, polynomial regression can be fitted with linear regression.
4.6 Normal equation
Normal equation method
To minimize the cost function, gradient descent converges to the minimum through many iterations, whereas the normal equation method gives an analytic solution for $\theta$: the optimal $\theta$ is obtained in a single computation.
Following calculus, take the partial derivatives of the cost function and set them to zero, $\frac{\partial}{\partial \theta_{j}} J(\theta)=0$; solving the resulting system of linear equations yields the $\theta$ that minimizes the cost function.
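A brief sketch of where the closed form comes from (this derivation is an addition to the notes, assuming the usual squared-error cost written in vectorized form):

$$J(\theta)=\frac{1}{2m}(X\theta-y)^{T}(X\theta-y), \qquad \nabla_{\theta}J(\theta)=\frac{1}{m}X^{T}(X\theta-y)=0 \;\Longrightarrow\; X^{T}X\,\theta=X^{T}y \;\Longrightarrow\; \theta=\left(X^{T}X\right)^{-1}X^{T}y.$$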
Let $X$ be the design (feature) matrix of the training set, including $x^{(i)}_{0}=1$, and let $y$ be the vector of training-set targets. Then the $\theta$ that minimizes the cost function is given by the normal equation $\theta=\left(X^{T} X\right)^{-1} X^{T} y$.
Note that when the normal equation method is used, feature scaling is not required.
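For illustration (not part of the original notes), a one-line NumPy sketch of the normal equation; `np.linalg.pinv` is used here as a defensive choice in case $X^{T}X$ is singular, for example when features are redundant.

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form solution theta = (X^T X)^(-1) X^T y.

    X : (m, n+1) design matrix with a leading column of ones
    y : (m,) target vector
    """
    return np.linalg.pinv(X.T @ X) @ X.T @ y
```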
Comparison of gradient descent and the normal equation
| Gradient descent | Normal equation |
|---|---|
| Requires trying out and choosing a learning rate | No learning rate to choose |
| Needs many iterations; computation may be slower | Solved in a single computation; no extra steps needed to check convergence |
| Still works well when the number of features $n$ is large | Must compute $\left(X^{T} X\right)^{-1}$, which is expensive for large $n$ because matrix inversion costs roughly $O(n^{3})$; generally acceptable while $n$ is below about $10{,}000$ |
| Applicable to many kinds of models | Applies only to linear models; not suitable for logistic regression, classification models, and the like |
Summary
For the linear regression model, the normal equation method is preferable when the number of features $n$ is small ($n$ below roughly $10{,}000$); when $n$ is large, gradient descent is preferable.
For more complex models, the normal equation method does not apply.