当前位置:网站首页>Introduction to ML regression analysis of AI zhetianchuan
Introduction to ML regression analysis of AI zhetianchuan
2022-07-08 00:55:00 【Teacher, I forgot my homework】

I believe everyone has studied in junior high school Solve the regression line equation , Chapter 9 of University probability theory also talks about , It doesn't matter if you forget , Here is a brief recall :
The linear regression equation is :
We can find out x、y The average of :
![]()
For coefficients
:
For coefficients
:
example : It is known that x、y A set of data between :
| x | 0 | 1 | 2 | 3 |
| y | 1 | 3 | 5 | 7 |
seek y And x The regression equation of :
answer :
In fact, the connection is a line segment
One 、 What is regression analysis
Regression
Regression analysis is usually called Regression , It is actually a large class of methods . What we learned before Predicition It includes Regression It also includes Classification, Regression and classification . It seems that the decision tree is suitable discrete Output , We usually call it classification ; And for Continuous type Output problem , Such as user satisfaction 、 One year's expenditure of a family or star rating of users 、 User clicks, or some probability, etc , I will use this introduction Regression Method .
Regression analysis is a statistical analysis method to describe the relationship between variables
• example : Online education scene
• The dependent variable Y: Online learning course satisfaction
• The independent variables X: Platform interactivity 、 Teaching resources 、 curriculum design
• Predictability It's a new modeling technology , Usually used for predictive analysis
• Most of the predicted results are Continuous value ( But it can also be a discrete value , Even binary )
Two 、 Simple linear regression
Linear regression (Linear regression)
There is a linear relationship between dependent variables and independent variables , You can use linear regression to model

The purpose of linear regression is to find the best match ( explain ) Data intercept and Slope
- The linear relationship between some variables is deterministic
| x | 1 | 2 | 3 | 4 | 5 | 6 |
| y | 3 | 5 | 7 | 9 | 11 | 13 |


So when x=7 when , We forecast by 15.
- But usually , There is an approximate linear relationship between variables
| x | 1 | 2 | 3 | 4 | 5 | 6 |
| y | 3 | 2 | 8 | 8 | 11 | 13 |

The problem we have to solve is How to get a straight line can best explain the data ?
Fitting data
- Suppose there is only one dependent variable and independent variable , Each training example represents (𝑥𝑖 , 𝑦𝑖)
- use
Indicates that according to the fitting line and x𝑖 Yes 𝑦𝑖 The predicted value of : 
- Definition
by Error term / residual
A new definition is introduced here : Error term , It subtracts the estimated value of the sample from the true value of the sample .

Our goal is Get a straight line so that the error term for all training samples is as small as possible
Basic assumptions of linear regression
We assume that :
- Suppose there is a relationship between the independent variable and the dependent variable linear relationship
- Between data points Independent
Output results y1,y2,y3... It doesn't matter.
- There is no collinearity between independent variables , Are independent of each other
Are you tired of walking : If the feature is The umbrella and a bag The two variables umbrella and schoolbag have nothing to do
If it is The weather The umbrella a bag be The weather and The umbrella We don't think they are independent
- Residual independence 、 Equivariance 、 accord with Normal distribution
error Independent 、 Equivariance ( Facing the same problem , It is also identically distributed )
according to Central limit theorem : Set from mean value to μ、 The variance of σ^2;( Co., LTD. ) The number of samples taken from any population is n The sample of , When n Sufficiently large , The sampling distribution of the sample mean value approximately follows that the mean value is μ、 The variance of σ^2/n Is a normal distribution .
3、 ... and 、 Loss function (loss function) The definition of
Various loss functions are feasible , You can think of it intuitively :
- Sum of all error terms
- The sum of the absolute values of all error terms
Considering optimization and other issues , The most common is Based on the sum of squares of errors Loss function of

• Using the sum of squares of errors as the loss function has many advantages
• The loss function is strictly convex , There is a unique solution
• The solution process is simple and easy to calculate
• At the same time, there are also some shortcomings
• The result is for “ outliers ”(outlier) Very sensitive
• resolvent : Detect outliers in advance and remove
• The loss function is equivalent for predictions that are above and below the true value
• But in some real cases, the impact of the two is different
We need to find the appropriate parameters b1、b2 Minimize the sum of squares of errors .
Least square method (Least Square, LS)
In order to solve the optimal intercept and slope , It can be transformed into a loss function Convex optimization problem , be called Least square method :
We are respectively right b1、b2 Finding partial derivatives :

This is the linear regression equation we recall at the beginning of this article , Of course, we don't have to calculate the partial derivative when we use it , Direct use .
Gradient descent method (Gradient Descent, GD)
Except least squares , The intercept and slope can also be updated iteratively with a gradient based method :
- You can initialize randomly first 𝑏1, 𝑏2
- repeat :
With an initialized set b1、b2, We can get the corresponding sample 1 The error term error1, Update based on the error term b,b=b-a, among a Is the update of the coefficient ( Function related to error , such as 0.1*error), In this way, there is a new b1、b2, Use a sample 2 The error term error2 Find out a Keep updating and iterating ... Until it converges .
Four 、 Multiple linear regression (Multiple Linear Regression)
When there are multiple dependent variables , We can express in matrix form

Based on the above matrix representation , Can be written as

here :

notes :
- matrix X The first column of is all 1, And β Multiplication means intercept .
- The result of the loss function is still a number
- Obtained by the least square method solve β Formula :


for example :
Recorded 25 Families are selling fast-moving goods and daily services every year
- The total cost (𝑌)
- Annual fixed income ( 𝑋2)、 Current assets held ( 𝑋3)
The following linear regression model can be built :

5、 ... and 、 The phase covariance of linear regression 、 Number of relationships 、 Coefficient of determination
covariance : covariance , Describe two variables X and Y The degree of linear correlation

The correlation coefficient : Value range [-1,1]

Such as :

Coefficient of determination : Coefficient of determination
, Also called decision coefficient 、 Goodness of fit


Be careful :
It may be less than 0, It is not the square of a number .
It measures the degree to which the model interprets the data
- y What percentage of fluctuations can be x Described by fluctuations in
- 𝑅 2 The closer the 1, It means that in regression analysis, the better the independent variable explains the dependent variable
Particular attention : Variable correlation ≠ There is a causal relationship
Prediction of comprehensive scores of World Universities Based on regression analysis
University ranking is a very important, challenging and controversial issue , The comprehensive strength of a university involves scientific research 、 Teachers' 、 Students and other aspects . At present, hundreds of evaluation institutions around the world will evaluate the comprehensive scores of universities to sort , And the scores of these institutions are often inconsistent . Among these rating agencies , World university ranking Center (Center for World University Rankings, abbreviation CWUR) To assess the quality of Education 、 Alumni employment 、 Research results and citations , Instead of relying on surveys and data submitted by universities , Is a very influential .
In this task, we will base on CWUR The ranking of famous universities around the world ( Teachers' 、 Scientific research, etc ), On the one hand, observe the characteristics of different universities through data visualization , On the other hand, I hope to build a machine learning model ( Linear regression ) Predict the comprehensive score of a University .
Data sources :World University Rankings | Kaggle
Data observation and processing :
import pandas as pd
import numpy as np
data_df = pd.read_csv('./cwurData.csv')
data_df.head(3).T # Observe the first few columns and transpose them for convenient observation
Remove the inclusion NaN The data of
data_df = data_df.dropna()
len(data_df) # 2000Set up the matrix
feature_cols = ['quality_of_faculty', 'publications', 'citations', 'alumni_employment',
'influence', 'quality_of_education', 'broad_impact', 'patents'] # Extract eigenvalues
X = data_df[feature_cols]
Y = data_df['score']
# X Y They are independent variables Dependent variable matrix Data visualization
Observe the average scores of the top ten schools in the world , Therefore, we need to average the scores of the same school in different years . We can use groupby() function , Integrate the records of the same school and pass mean() The function averages . Then we sort them in descending order according to the average score , Take the top ten schools as the data to be observed .
import matplotlib.pyplot as plt
import seaborn as sns
mean_df = data_df.groupby('institution').mean() # Aggregate by school and average the aggregated columns
top_df = mean_df.sort_values(by='score', ascending=False).head(10) # Take the top ten schools
sns.set()
x = top_df['score'].values # Comprehensive score list
y = top_df.index.values # List of school names
sns.barplot(x, y, orient='h', palette="Blues_d") # Draw a bar chart
plt.xlim(75, 101) # Limit x Axis range
plt.show()
use pairplot Observe the correlation between variables , You can see from the figure , There is a linear relationship between a few variables ; Between variables and results , Approximate logarithmic relationship .
sns.pairplot(data_df[feature_cols + ['score']], height=3, diag_kind="kde")
plt.show()
The correlation matrix can also be presented in the form of thermal diagram :
Build the model
Take out the columns of the corresponding independent variables and dependent variables , Then you can segment the training set and the test set based on this , And carry out model construction and analysis .
all_y = data_df['score'].values
all_x = data_df[feature_cols].values
# take values It's to start from pandas Of Series Turn into numpy Of array
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(all_x, all_y, test_size=0.2, random_state=2020)
all_y.shape, all_x.shape, x_train.shape, x_test.shape, y_train.shape, y_test.shape # Output data row and column information
# ((2000,), (2000, 8), (1600, 8), (400, 8), (1600,), (400,))from sklearn.linear_model import LinearRegression
LR = LinearRegression() # linear regression model
LR.fit(x_train, y_train) # Train on the training set
p_test = LR.predict(x_test) # Predict on the test set , Obtain the predicted value
test_error = p_test - y_test # Prediction error
test_rmse = (test_error**2).mean()**0.5 # Calculation RMSE
'rmse: {:.4}'.format(test_rmse)
# rmse: 3.999Get the test set RMSE by 3.999, Calculate a fair result under the prediction goal of the percentage system . Judging from the evaluation indicators, it seems that we can estimate the comprehensive score according to the better ranking in all aspects , Next, let's observe the learned parameters , That is, the influence weight of each index ranking on the comprehensive score .
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
sns.barplot(x=LR.coef_, y=feature_cols)
plt.show()
It will be found here that the prediction of comprehensive score is basically 「 The quality of teachers 」 This independent variable dominates ,「 employment 」 and 「 The quality of education 」 These two factors also have some influence , Other indicators play a small role .
To observe 「 The quality of teachers 」 The relationship between this dominant factor and the comprehensive score , We can go through seaborn Medium regplot() Function draws its distribution in the form of a scatter diagram .
sns.regplot(data_df['quality_of_faculty'], data_df['score'], marker="+")
plt.show()
边栏推荐
猜你喜欢

接口测试进阶接口脚本使用—apipost(预/后执行脚本)

ReentrantLock 公平锁源码 第0篇

1293_FreeRTOS中xTaskResumeAll()接口的实现分析

【GO记录】从零开始GO语言——用GO语言做一个示波器(一)GO语言基础

RPA cloud computer, let RPA out of the box with unlimited computing power?

What does interface testing test?

新库上线 | CnOpenData中国星级酒店数据

A network composed of three convolution layers completes the image classification task of cifar10 data set

浪潮云溪分布式数据库 Tracing(二)—— 源码解析

9.卷积神经网络介绍
随机推荐
ABAP ALV LVC template
Codeforces Round #804 (Div. 2)(A~D)
牛客基础语法必刷100题之基本类型
新库上线 | CnOpenData中国星级酒店数据
They gathered at the 2022 ecug con just for "China's technological power"
Fofa attack and defense challenge record
Kubernetes Static Pod (静态Pod)
Kubernetes static pod (static POD)
基于人脸识别实现课堂抬头率检测
Installation and configuration of sublime Text3
他们齐聚 2022 ECUG Con,只为「中国技术力量」
Invalid V-for traversal element style
Vscode software
Stock account opening is free of charge. Is it safe to open an account on your mobile phone
13.模型的保存和载入
[C language] objective questions - knowledge points
What is load balancing? How does DNS achieve load balancing?
Marubeni official website applet configuration tutorial is coming (with detailed steps)
图像数据预处理
NTT template for Tourism


: 
: 
Indicates that according to the fitting line and x𝑖 Yes 𝑦𝑖 The predicted value of : 
by
