Introduction to ML regression analysis of AI zhetianchuan

2022-07-08 00:55:00 【Teacher, I forgot my homework】

Most of us learned how to find the regression line equation in junior high school, and Chapter 9 of a university probability course covers it as well. It doesn't matter if you have forgotten it; here is a brief recap:

The regression line equation is: $\widehat{y}=\widehat{b}x+\widehat{a}$
First find the means of x and y: $\overline{x} = \frac{1}{n}(x_{1}+x_{2}+x_{3}+...+x_{n})$, $\overline{y} = \frac{1}{n}(y_{1}+y_{2}+y_{3}+...+y_{n})$
The coefficient $\widehat{b}$: $\widehat{b} = \frac{\sum_{i=1}^{n}(x_{i}-\overline{x})(y_{i}-\overline{y})}{\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}}=\frac{(x_{1}y_{1}+x_{2}y_{2}+...+x_{n}y_{n})-n\overline{x}\overline{y}}{(x_{1}^{2}+x_{2}^{2}+...+x_{n}^{2})-n\overline{x}^{2}}$
The coefficient $\widehat{a}$: $\widehat{a} = \overline{y} - \widehat{b}\overline{x}$

Example: given the following set of data for x and y:

x	0	1	2	3
y	1	3	5	7

Find the regression equation of y on x.

Answer: $\widehat{y}=2x+1$. In fact, the four points lie exactly on this straight line.
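
The answer can be checked by plugging the table into the formulas for $\widehat{b}$ and $\widehat{a}$. The following is a minimal sketch using NumPy; the variable names `b_hat` and `a_hat` are chosen here for illustration.

```python
import numpy as np

# Data from the example table.
x = np.array([0, 1, 2, 3], dtype=float)
y = np.array([1, 3, 5, 7], dtype=float)

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of cross-deviations divided by the sum of squared x-deviations.
b_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# Intercept: the fitted line passes through the point (x_bar, y_bar).
a_hat = y_bar - b_hat * x_bar

print(f"y_hat = {b_hat:g}x + {a_hat:g}")  # y_hat = 2x + 1
```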

1. What is regression analysis

Regression

Regression analysis is usually just called regression, and it is actually a large family of methods. The prediction we studied earlier includes both regression and classification. Decision trees suit discrete outputs, which we usually call classification. For continuous outputs, such as user satisfaction, a family's annual expenditure, users' star ratings, user click counts, or some probability, we use the regression methods introduced here.

Regression analysis is a statistical method for describing the relationship between variables.

• Example: an online education scenario

• Dependent variable Y: satisfaction with an online course

• Independent variables X: platform interactivity, teaching resources, course design

• It is a predictive modeling technique, usually used for predictive analysis

• Most predicted results are continuous values (but they can also be discrete, even binary)

2. Simple linear regression

Linear regression

When there is a linear relationship between the dependent variable and the independent variable, we can model it with linear regression.

The purpose of linear regression is to find the intercept and slope that best fit (explain) the data.

For some variables the linear relationship is deterministic:

x	1	2	3	4	5	6
y	3	5	7	9	11	13

y=2x+1

So when x=7, we predict y=15.

Usually, however, the relationship between variables is only approximately linear:

x	1	2	3	4	5	6
y	3	2	8	8	11	13

The problem we have to solve is: how do we find the straight line that best explains the data?

Fitting data

Suppose there is only one dependent variable and one independent variable, and each training example is written $(x_{i}, y_{i})$.
Let $\widehat{y}_{i}$ denote the value of $y_{i}$ predicted from the fitted line and $x_{i}$: $\widehat{y}_{i} = b_{1}+b_{2}x_{i}$
Define $e_{i} = y_{i} - \widehat{y}_{i}$ as the error term (residual).

The error term introduced here is the true value of a sample minus its estimated value.

Our goal is to find a straight line that makes the error terms of all training samples as small as possible.
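
As a small illustration of the error term, the sketch below computes $e_{i}$ for the approximately linear table above, using the candidate line $b_{1}=1$, $b_{2}=2$. That candidate is an assumption made here for illustration, not the fitted optimum, and the helper name `residuals` is hypothetical.

```python
import numpy as np

# The approximately linear data from the second table.
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([3, 2, 8, 8, 11, 13], dtype=float)

def residuals(b1, b2, x, y):
    """Return e_i = y_i - (b1 + b2 * x_i) for every sample."""
    return y - (b1 + b2 * x)

# Candidate line y = 2x + 1 (an illustrative guess, not the best fit).
e = residuals(1.0, 2.0, x, y)
print(e)
```

For this candidate the residuals are 0, -3, 1, -1, 0, 0: the line passes through three of the points and misses the other three.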

Basic assumptions of linear regression

We assume that:

There is a linear relationship between the independent variables and the dependent variable.
The data points are independent of each other.

That is, the outputs y1, y2, y3, ... are unrelated to one another.

There is no collinearity between the independent variables; they are independent of each other.

For example: if the features are "umbrella" and "schoolbag", the two variables have nothing to do with each other.

If the features are "weather", "umbrella", and "schoolbag", then we do not consider "weather" and "umbrella" independent.

The residuals are independent, have equal variance, and follow a normal distribution.

That is, the errors are independent and have equal variance (for the same problem, they are also identically distributed).

According to the central limit theorem: draw samples of size n from any population with mean μ and (finite) variance σ^2. When n is sufficiently large, the sampling distribution of the sample mean is approximately normal with mean μ and variance σ^2/n.
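
The central limit theorem statement can be checked numerically. The following is a minimal simulation sketch; the exponential population, the sample size `n = 50`, the number of trials, and the random seed are all assumptions chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

n = 50            # size of each sample
trials = 20_000   # number of samples drawn

# Exponential(scale=1) is clearly non-normal, with mean 1 and variance 1.
samples = rng.exponential(scale=1.0, size=(trials, n))
sample_means = samples.mean(axis=1)

# The sample means should centre on mu = 1 with variance near sigma^2 / n = 0.02.
print("mean of sample means:", sample_means.mean())
print("variance of sample means:", sample_means.var())
```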

3. Definition of the loss function

Various loss functions are feasible. Intuitively, you might think of:

The sum of all error terms
The sum of the absolute values of all error terms

Taking optimization and other issues into account, the most common choice is a loss function based on the sum of squared errors:

$\sum_{i=1}^{n}e_{i}^{2}=\sum_{i=1}^{n}(y_{i}-\widehat{y}_{i})^{2}=\sum_{i=1}^{n}(y_{i}-b_{1}-b_{2}x_{i})^{2}$

• Using the sum of squared errors as the loss function has many advantages
        • The loss function is strictly convex, so there is a unique solution
        • The solution process is simple and easy to compute
• It also has some shortcomings
        • The result is very sensitive to outliers
                • Remedy: detect and remove outliers in advance
        • The loss function treats predictions above and below the true value the same
                • But in some real cases the two have different impacts

We need to find the parameters b1 and b2 that minimize the sum of squared errors.
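
To see why the plain sum of errors is a poor loss, the sketch below evaluates the three candidate losses on the approximately linear table from earlier. The two candidate lines (y = 2x + 1 and a flat line at the mean of y) and the function names are assumptions chosen here for illustration.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([3, 2, 8, 8, 11, 13], dtype=float)

def errors(b1, b2):
    """Error terms e_i = y_i - (b1 + b2 * x_i)."""
    return y - (b1 + b2 * x)

def sum_of_errors(b1, b2):
    return np.sum(errors(b1, b2))

def sum_of_abs_errors(b1, b2):
    return np.sum(np.abs(errors(b1, b2)))

def sum_of_squared_errors(b1, b2):
    return np.sum(errors(b1, b2) ** 2)

# Two illustrative candidates: a sensible line and a flat line at mean(y) = 7.5.
candidates = {"y = 2x + 1": (1.0, 2.0), "y = 7.5": (7.5, 0.0)}
for name, (b1, b2) in candidates.items():
    print(name,
          sum_of_errors(b1, b2),
          sum_of_abs_errors(b1, b2),
          sum_of_squared_errors(b1, b2))
```

The flat line has a sum of errors of exactly 0 because positive and negative errors cancel, yet its sum of squared errors (93.5) is far larger than that of y = 2x + 1 (11). Squaring or taking absolute values prevents this cancellation.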

Least squares method (Least Squares, LS)

To solve for the optimal intercept and slope, we can treat minimizing the loss function as a convex optimization problem. This is called the least squares method:

We take the partial derivatives with respect to b1 and b2 and set them to zero:
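
As a sketch of this step, write the loss as $L(b_{1},b_{2})=\sum_{i=1}^{n}(y_{i}-b_{1}-b_{2}x_{i})^{2}$ (the symbol $L$ is introduced here for convenience). Setting both partial derivatives to zero and solving the two equations gives:

```latex
\begin{aligned}
\frac{\partial L}{\partial b_{1}} &= -2\sum_{i=1}^{n}(y_{i}-b_{1}-b_{2}x_{i}) = 0 \\
\frac{\partial L}{\partial b_{2}} &= -2\sum_{i=1}^{n}x_{i}(y_{i}-b_{1}-b_{2}x_{i}) = 0 \\
\Rightarrow\quad b_{1} &= \overline{y}-b_{2}\overline{x}, \qquad
b_{2} = \frac{\sum_{i=1}^{n}(x_{i}-\overline{x})(y_{i}-\overline{y})}{\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}}
\end{aligned}
```

Here $b_{2}$ plays the role of the slope $\widehat{b}$ and $b_{1}$ the role of the intercept $\widehat{a}$.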

The result is the regression line equation we recalled at the beginning of this article. Of course, we don't have to compute the partial derivatives every time; we can use the formulas directly.

Gradient descent method (Gradient Descent, GD)

Besides least squares, the intercept and slope can also be updated iteratively with a gradient-based method:

First initialize $b_{1}$ and $b_{2}$ randomly
Repeat: $b_{1}=b_{1}-a$, $b_{2}=b_{2}-a$

With an initial b1 and b2, we can compute the error term error1 for sample 1 and update b based on it: b = b - a, where a is the size of the update to the coefficient (a function of the error, such as 0.1*error). This gives a new b1 and b2. We then use the error term error2 of sample 2 to compute a, and keep updating and iterating until convergence.
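
The per-sample update loop can be sketched as follows. This is one possible concrete version, not necessarily the exact rule intended above: it takes the update `a` to be a learning rate times the gradient of the squared error of the current sample. The learning rate `lr = 0.01`, the 2000 passes over the data, the seed, and the function name `fit_sgd` are all assumptions made for illustration.

```python
import numpy as np

# The deterministic table from earlier: y = 2x + 1.
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = 2 * x + 1

def fit_sgd(x, y, lr=0.01, epochs=2000, seed=0):
    """Fit y ~ b1 + b2 * x with one update per sample."""
    rng = np.random.default_rng(seed)
    b1, b2 = rng.normal(size=2)  # random initialization
    for _ in range(epochs):
        for xi, yi in zip(x, y):
            error = yi - (b1 + b2 * xi)
            # Gradient of error**2 is -2*error w.r.t. b1 and -2*error*xi w.r.t. b2.
            b1 -= lr * (-2 * error)
            b2 -= lr * (-2 * error * xi)
    return b1, b2

b1, b2 = fit_sgd(x, y)
print(b1, b2)  # approaches intercept 1 and slope 2
```

The learning rate has to be small enough for the largest x value: too large a step makes the slope update overshoot and diverge.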

4. Multiple linear regression

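When there are multiple independent variables, we can express the model in matrix form:

$Y=X\beta +\epsilon$

Here, for $n$ samples and $p$ coefficients:

$Y=\begin{pmatrix}y_{1}\\ \vdots \\ y_{n}\end{pmatrix},\; X=\begin{pmatrix}1 & x_{12} & \cdots & x_{1p}\\ \vdots & \vdots & & \vdots \\ 1 & x_{n2} & \cdots & x_{np}\end{pmatrix},\; \beta =\begin{pmatrix}\beta _{1}\\ \vdots \\ \beta _{p}\end{pmatrix},\; \epsilon =\begin{pmatrix}\epsilon _{1}\\ \vdots \\ \epsilon _{n}\end{pmatrix}$

Notes:

The first column of the matrix X is all 1s; multiplied with β, it gives the intercept.
The loss function still evaluates to a single number.
The least squares method gives the formula for β: $\beta =(X^{T}X)^{-1}X^{T}Y$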
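
The formula for β can be tried on a tiny made-up data set. In the sketch below the arrays `x2` and `x3` and the true coefficients (1, 2, 3) are invented for illustration. The code solves the equivalent linear system $(X^{T}X)\beta = X^{T}Y$ with `np.linalg.solve` rather than forming the inverse explicitly, which gives the same β and is numerically more stable.

```python
import numpy as np

# Made-up data: two independent variables, noise-free for clarity.
x2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x3 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
Y = 1.0 + 2.0 * x2 + 3.0 * x3

# Design matrix: the first column of 1s carries the intercept.
X = np.column_stack([np.ones_like(x2), x2, x3])

# Solve (X^T X) beta = X^T Y, equivalent to beta = (X^T X)^{-1} X^T Y.
beta = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta)  # approximately [1, 2, 3]
```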

For example:

Records for 25 families give each family's total annual spending on fast-moving consumer goods and daily services (𝑌), together with its annual fixed income (𝑋2) and the current assets it holds (𝑋3).

The following linear regression model can be built:

$y_{i}=\beta _{1}+\beta _{2}x_{i2}+\beta _{3}x_{i3}+\epsilon _{i}\: \: \: \: \: \: \: \: \: i=1,...,25$

5. Covariance, correlation coefficient, and coefficient of determination in linear regression

Covariance: describes the degree of linear association between two variables X and Y

Correlation coefficient: takes values in the range [-1, 1]
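
Both quantities are easy to compute directly. The sketch below uses the standard population definitions, $\mathrm{cov}(X,Y)=\frac{1}{n}\sum (x_{i}-\overline{x})(y_{i}-\overline{y})$ and $r=\mathrm{cov}(X,Y)/(\sigma_{X}\sigma_{Y})$, which are assumed here because the text does not spell them out; the function names are hypothetical. The data is the approximately linear table from earlier.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([3, 2, 8, 8, 11, 13], dtype=float)

def covariance(x, y):
    """Population covariance: mean product of deviations from the means."""
    return np.mean((x - x.mean()) * (y - y.mean()))

def correlation(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / (x.std() * y.std())

r = correlation(x, y)
print(covariance(x, y), r)  # r is roughly 0.95 for this data
```

For the deterministic table y = 2x + 1, the same function returns a correlation of 1.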

Coefficient of determination: $R^{2}$, also called the goodness of fit

$R^{2}=1-\frac{\sum (y_{i}-\widehat{y}_{i})^{2}}{\sum (y_{i}-\overline{y})^{2}}$

$R^{2}= 1-\frac{\sum (y_{i}-\widehat{y}_{i})^{2}/n}{\sum (y_{i}-\overline{y})^{2}/n}=1-\frac{MSE}{VAR}$

Note: $R^{2}$ may be less than 0; it is not the square of a number.

It measures how well the model explains the data:

What percentage of the fluctuation in y can be described by the fluctuation in x
The closer $R^{2}$ is to 1, the better the independent variables explain the dependent variable in the regression analysis
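
The behaviour of $R^{2}$, including the negative case, can be seen in a short sketch. The three sets of predictions below (a reasonable line, the constant mean, and a line with the wrong sign of slope) are assumptions chosen here for illustration, as is the function name `r_squared`.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([3, 2, 8, 8, 11, 13], dtype=float)

good = 1.0 + 2.0 * x                   # the line y = 2x + 1
mean_only = np.full_like(y, y.mean())  # always predict the mean of y
bad = 13.0 - 2.0 * x                   # slope with the wrong sign

print(r_squared(y, good))       # about 0.88
print(r_squared(y, mean_only))  # 0: no better than the mean
print(r_squared(y, bad))        # negative: worse than the mean
```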

Note in particular: correlation between variables ≠ a causal relationship

Predicting the comprehensive scores of world universities with regression analysis

University ranking is an important, challenging, and controversial issue. The comprehensive strength of a university involves scientific research, faculty, students, and other aspects. At present, hundreds of evaluation institutions around the world score universities and rank them, and their scores are often inconsistent. Among these rating agencies, the Center for World University Rankings (CWUR) is a highly influential one: it assesses quality of education, alumni employment, research output, and citations, rather than relying on surveys and data submitted by universities.

In this task we use CWUR's ranking data for well-known universities around the world (faculty, scientific research, and so on). On the one hand, we observe the characteristics of different universities through data visualization; on the other hand, we hope to build a machine learning model (linear regression) to predict the comprehensive score of a university.

Data source: World University Rankings | Kaggle

Data observation and processing:

import pandas as pd
import numpy as np

data_df = pd.read_csv('./cwurData.csv')
data_df.head(3).T  # Look at the first few rows, transposed for easier reading

Remove the rows that contain NaN

data_df = data_df.dropna()
len(data_df) # 2000

Set up the matrix

feature_cols = ['quality_of_faculty', 'publications', 'citations', 'alumni_employment', 
                'influence', 'quality_of_education', 'broad_impact', 'patents']  # Feature columns
X = data_df[feature_cols]
Y = data_df['score']
# X is the matrix of independent variables; Y is the dependent variable

Data visualization

We want to look at the average scores of the top ten schools in the world, so we need to average each school's scores across years. We can use groupby() to gather the records of the same school and mean() to average them. We then sort by average score in descending order and take the top ten schools as the data to observe.

import matplotlib.pyplot as plt 
import seaborn as sns 

mean_df = data_df.groupby('institution').mean(numeric_only=True)  # Group by school and average the numeric columns
top_df = mean_df.sort_values(by='score', ascending=False).head(10)  # Take the top ten schools
sns.set()
x = top_df['score'].values  # Comprehensive scores
y = top_df.index.values  # School names
sns.barplot(x=x, y=y, orient='h', palette="Blues_d")  # Draw a horizontal bar chart
plt.xlim(75, 101)  # Limit the x-axis range
plt.show()

Use pairplot to observe the relationships between variables. The figure shows a linear relationship between a few of the variables, and an approximately logarithmic relationship between the variables and the result.

sns.pairplot(data_df[feature_cols + ['score']], height=3, diag_kind="kde")
plt.show()

The correlation matrix can also be presented as a heat map:
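
A heat map of this kind could be drawn roughly as follows. This is a sketch on a small synthetic DataFrame so that it runs without the data file: `demo_df` and its column names are invented here. With the real data, the correlation matrix would instead come from `data_df[feature_cols + ['score']].corr()`. The colour map and annotation format are arbitrary choices.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for the real table (column names are hypothetical).
rng = np.random.default_rng(0)
feature_a = rng.normal(size=200)
demo_df = pd.DataFrame({
    "feature_a": feature_a,
    "feature_b": 0.8 * feature_a + 0.2 * rng.normal(size=200),  # correlated with feature_a
    "score": rng.normal(size=200),                              # unrelated noise
})

corr = demo_df.corr()  # pairwise Pearson correlation matrix

# Annotated heat map with a fixed colour scale of [-1, 1].
sns.heatmap(corr, annot=True, fmt=".2f", cmap="Blues", vmin=-1, vmax=1)
plt.show()
```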

Build the model

Take out the columns for the independent and dependent variables, then split them into a training set and a test set, and build and analyze the model.

all_y = data_df['score'].values  
all_x = data_df[feature_cols].values
# .values converts the pandas Series / DataFrame into a numpy array

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(all_x, all_y, test_size=0.2, random_state=2020)
all_y.shape, all_x.shape, x_train.shape, x_test.shape, y_train.shape, y_test.shape #  Output data row and column information 
# ((2000,), (2000, 8), (1600, 8), (400, 8), (1600,), (400,))

from sklearn.linear_model import LinearRegression
LR = LinearRegression()  # Linear regression model
LR.fit(x_train, y_train)  # Train on the training set
p_test = LR.predict(x_test)  # Predict on the test set to get the predicted values
test_error = p_test - y_test  # Prediction error
test_rmse = (test_error**2).mean()**0.5  # Compute the RMSE
'rmse: {:.4}'.format(test_rmse) 

# rmse: 3.999

The RMSE on the test set is 3.999, a fair result for a prediction target on a 100-point scale. Judging by this metric, it seems we can estimate the comprehensive score reasonably well from the rankings in each area. Next, let's look at the learned parameters, that is, the weight of each indicator's ranking on the comprehensive score.

import matplotlib.pyplot as plt
import seaborn as sns

sns.set()
sns.barplot(x=LR.coef_, y=feature_cols)
plt.show()

Here we find that the prediction of the comprehensive score is basically dominated by the independent variable 「quality of faculty」. 「Alumni employment」 and 「quality of education」 also have some influence, while the other indicators play a small role.

To observe the relationship between the dominant factor 「quality of faculty」 and the comprehensive score, we can use the regplot() function in seaborn to draw its distribution as a scatter plot.

sns.regplot(x=data_df['quality_of_faculty'], y=data_df['score'], marker="+")
plt.show()

Copyright notice
This article was written by [Teacher, I forgot my homework]. Please include a link to the original when reposting.
https://yzsam.com/2022/189/202207072310193523.html