
[Machine learning notes] Univariate linear regression: principle, formulas, and code implementation

2022-07-06 05:24:00 Catching sheep on the green grassland

Summary

Linear regression is the foundation of logistic regression, and logistic regression is in turn a building block of neural networks, used to solve binary classification problems.

Linear regression underlies many other machine learning algorithms.

Linear and nonlinear relationships

Concept:

  • A linear relationship means the relationship between the variables is a first-degree (linear) function: one independent variable x and the dependent variable y form a straight line, while two independent variables and the dependent variable y form a plane.
  • A nonlinear relationship means the relationship between the independent variable x and the dependent variable y forms a curve, while two independent variables and the dependent variable y form a surface.

One independent variable x corresponds to one feature, so the fit is a line; with multiple features, the fit is a plane (or hyperplane).

A linear relationship can be understood as a first-degree function, no matter how many independent variables there are, as the form below shows.
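
For instance, with two independent variables a linear relationship still has the first-degree form

\[y = a_1 x_1 + a_2 x_2 + b \]

which describes a plane, matching the bullet above.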

Examples:

  • Linear relationship: $$y = a\times x + b$$
  • Nonlinear relationship: $$y = x^2$$

Regression problems

Concept: predicting a continuous numerical value.

Linear regression is mainly used for regression problems; only in a few cases is it used for classification.

Univariate linear regression

Concept: a regression model with only one independent variable and one dependent variable (the single independent variable is what "univariate" refers to), where the relationship between the independent and dependent variables is linear (a first-degree function).

Representation: $$y = a \times x + b$$

Here x is the only independent variable, y is the dependent variable, a is the slope (also called the weight of x), and b is the intercept.

Purpose: the univariate linear regression model finds a straight line that best fits the relationship between the independent variable x and the dependent variable y, so that for any given x value the fitted line yields the most likely y.

Learning a univariate linear model means using the training data to find suitable values of a and b; that is, a and b are the model's parameters. Once trained, the model can produce a prediction for any new test data point.

How to evaluate the quality of the model

Goal: the smaller the difference between the predicted values and the true values, the better the model fits.

A natural idea: for each point x, compute the difference between the true value and the predicted value,

\[y - y\_predict \]

then sum all these differences and divide by the number of samples, so that the sample size does not affect the result.

The formula:

\[\displaystyle\frac{1}{n}\displaystyle \sum_{i=1}^{n}(y^i-y\_predict^i) \]

The problem with this formula: a prediction can be either above or below the true value, so positive and negative errors cancel each other out, and the accumulated error can be close to 0 even when the individual errors are large.
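
A quick illustration of this cancellation (a toy sketch with made-up numbers, not from the original post):

import numpy as np

# made-up toy data: the two predictions are off by -2 and +2
y_true = np.array([3.0, 5.0])
y_predict = np.array([5.0, 3.0])

mean_error = np.mean(y_true - y_predict)
print(mean_error)  # 0.0 -- the signed errors cancel even though both predictions are wrong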

Improvement: take the absolute value of each point's error, i.e. \(|y^i-y\_predict^i|\), and sum these.

Problem: absolute values complicate the later error calculation and differentiation.

The absolute value function $$y = |x|$$ is continuous at x = 0, but its left derivative there is -1 while its right derivative is 1. Since these are not equal (a differentiable function must be smooth at the point), the function is not differentiable at x = 0.

Further improvement: square the error at each point, then average over the samples to remove the influence of the sample size.

The formula:

\[\displaystyle\frac{1}{n}\displaystyle \sum_{i=1}^{n}(y^i-y\_predict^i)^2 \]
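
A minimal sketch of this metric, the mean squared error (MSE), on the same made-up numbers as above:

import numpy as np

# made-up toy data from the cancellation example above
y_true = np.array([3.0, 5.0])
y_predict = np.array([5.0, 3.0])

mse = np.mean((y_true - y_predict) ** 2)
print(mse)  # 4.0 -- squaring keeps every error positive, so nothing cancels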

The least squares method

Since

\[\displaystyle y\_predict^i=ax^i+b \]

substituting into the formula from the previous section gives

\[\displaystyle\frac{1}{n}\displaystyle \sum_{i=1}^{n}(y^i-ax^i-b)^2 \]

The least squares method finds the optimal parameters a and b that make this expression as small as possible.

Concept: a mathematical optimization technique that finds the optimal parameters by minimizing the sum of squared errors.

The formulas:

\[\displaystyle a=\frac{\sum_{i=1}^{n}(x^i-\overline x)(y^i-\overline y)}{\sum_{i=1}^{n}(x^i-\overline x)^2} \]

\[\displaystyle b = \overline y - a\overline x \]
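
These closed-form formulas follow from setting the partial derivatives of the cost function with respect to b and a to zero. A sketch of the derivation:

\[\displaystyle \frac{\partial}{\partial b}\frac{1}{n}\sum_{i=1}^{n}(y^i-ax^i-b)^2 = -\frac{2}{n}\sum_{i=1}^{n}(y^i-ax^i-b) = 0 \implies b = \overline y - a\overline x \]

\[\displaystyle \frac{\partial}{\partial a}\frac{1}{n}\sum_{i=1}^{n}(y^i-ax^i-b)^2 = -\frac{2}{n}\sum_{i=1}^{n}x^i(y^i-ax^i-b) = 0 \]

Substituting \(b = \overline y - a\overline x\) into the second equation and rearranging yields the formula for a above.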

Code implementation

import numpy as np
from matplotlib import pyplot as plt

if __name__ == '__main__':
    # Prepare the data
    x = np.array([1, 2, 4, 6, 8])  # the univariate model works on 1-D vectors (one feature), not matrices
    y = np.array([2, 5, 7, 8, 9])
    x_mean = np.mean(x)
    y_mean = np.mean(y)

    # Compute a and b by the least squares formulas
    denominator = 0.0
    numerator = 0.0
    for x_i, y_i in zip(x, y):  # zip pairs x and y element-wise: (1, 2), (2, 5), ...
        numerator += (x_i - x_mean) * (y_i - y_mean)
        denominator += (x_i - x_mean) ** 2
    a = numerator / denominator
    b = y_mean - a * x_mean

    # Build the fitted line from a and b; the predictions are stored in y_predict
    y_predict = a * x + b  # predictions on the training inputs x

    # Plot the fitted line together with the training data
    plt.scatter(x, y, color='b')
    plt.plot(x, y_predict, color='r')
    plt.xlabel('x', fontsize=15)
    plt.ylabel('y', fontsize=15)
    plt.show()

    # Predict for a new test input
    x_test = 7
    y_predict_test = a * x_test + b
    print(y_predict_test)

Note that this univariate linear regression model operates on vectors (a single feature), not matrices.
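
As a quick sanity check (a sketch, not part of the original post), NumPy's built-in np.polyfit performs the same degree-1 least squares fit and should return essentially the same slope and intercept:

import numpy as np

x = np.array([1, 2, 4, 6, 8])
y = np.array([2, 5, 7, 8, 9])

a, b = np.polyfit(x, y, 1)  # degree-1 (linear) least squares fit; coefficients are [slope, intercept]
print(a, b)  # should match the hand-computed a and b above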

After encapsulating the model in a class:

import numpy as np
import matplotlib.pyplot as plt

class SimpleLinearRegressionSelf:

    # Initialize the model parameters
    def __init__(self):
        """Initialize the simple linear regression model."""
        self.a_ = None  # learned inside the class, not supplied by the user
        self.b_ = None  # learned inside the class, not supplied by the user

    # Train the model
    def fit(self, x_train, y_train):
        assert x_train.ndim == 1  # only a single feature (a 1-D vector) is supported
        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)
        denominator = 0.0
        numerator = 0.0

        for x_i, y_i in zip(x_train, y_train):
            numerator += (x_i - x_mean) * (y_i - y_mean)
            denominator += (x_i - x_mean) ** 2
        self.a_ = numerator / denominator
        self.b_ = y_mean - self.a_ * x_mean

        return self

    # Predict
    def predict(self, x_test_group):  # the input is a collection of test values
        # predict each value in the input; the single-point logic is encapsulated in _predict
        return np.array([self._predict(x_test) for x_test in x_test_group])

    def _predict(self, x_test):
        # compute the predicted value for a single input x_test
        return self.a_ * x_test + self.b_

    # Model evaluation metrics
    def mean_squared_error(self, y_true, y_predict):
        return np.sum((y_true - y_predict) ** 2) / len(y_true)

    def r_square(self, y_true, y_predict):
        # R^2 = 1 - MSE(y_true, y_predict) / Var(y_true)
        return 1 - (self.mean_squared_error(y_true, y_predict) / np.var(y_true))


if __name__ == '__main__':
    x = np.array([1, 2, 4, 6, 8])
    y = np.array([2, 5, 7, 8, 9])

    lr = SimpleLinearRegressionSelf()
    lr.fit(x, y)
    print(lr.predict([7]))
    print(lr.r_square([8, 9], lr.predict([6, 8])))

The formula used here to score the model is R-squared (the coefficient of determination):

\[\displaystyle R^2 = 1-\frac{\frac{1}{n}\sum_{i=1}^{n}(y\_predict^i-y^i)^2}{\frac{1}{n}\sum_{i=1}^{n}(\overline y-y^i)^2} \]
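
As a sanity check (assuming scikit-learn is installed; this comparison is not part of the original post), the same score can be computed with sklearn.metrics.r2_score, which uses an equivalent definition:

import numpy as np
from sklearn.metrics import r2_score

# hypothetical true/predicted pairs, mirroring the shapes used above
y_true = np.array([8.0, 9.0])
y_predict = np.array([7.9, 9.1])

manual = 1 - np.mean((y_true - y_predict) ** 2) / np.var(y_true)
print(manual, r2_score(y_true, y_predict))  # both values agree (0.96)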


Copyright notice
This article was created by [Catching sheep on the green grassland]; when reposting, please include a link to the original. Thanks.
https://yzsam.com/2022/02/202202132112569010.html