2022-07-06 12:00:00 Want to be a kite

Data analysis process:

  1. Clarify the purpose
  2. Get the data
  3. Explore and preprocess the data
  4. Analyze the data
  5. Draw conclusions
  6. Verify the conclusions
  7. Present the results

Linear regression: Linear regression is a statistical analysis method that uses regression analysis from mathematical statistics to determine the quantitative relationship between two or more variables, and it is widely used. It is expressed as y = w'x + e, where the error e follows a normal distribution with mean 0. When a regression analysis includes only one independent variable and one dependent variable, and the relationship between them can be approximated by a straight line, it is called univariate linear regression analysis. When it includes two or more independent variables, and the dependent variable is linearly related to them, it is called multiple linear regression analysis. (Commonly used in demand forecasting, sales forecasting, and ranking forecasting.)

Univariate linear regression equation: $y = b + aX$
b is the intercept and a is the slope of the regression line.
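
As a minimal sketch of how a and b can be estimated, the snippet below applies the closed-form ordinary least squares formulas to a small made-up data set. The numbers are illustrative assumptions, not the salary data used later.

```python
import numpy as np

# Illustrative data only: generated from y = 2 + 3x with no noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 + 3.0 * x

# Ordinary least squares estimates for y = b + aX
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
b = y.mean() - a * x.mean()  # intercept

print(a, b)
```

Because the data lie exactly on a line, the estimates recover the slope and intercept used to generate them.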

Multiple linear regression equation: $y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n$
$b_0$ is the constant term, and $b_1, b_2, \dots, b_n$ are the partial regression coefficients of y with respect to $X_1, X_2, \dots, X_n$.
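
One way to see where the coefficients come from is to solve the least squares problem directly. The sketch below assumes a tiny made-up data set and uses NumPy's `np.linalg.lstsq`; the column of ones is added so that the first coefficient plays the role of b0.

```python
import numpy as np

# Illustrative data only: y = 1 + 2*X1 + 3*X2 exactly
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the first coefficient is the constant term b0
X_design = np.column_stack([np.ones(len(X)), X])

# Least squares solution of X_design @ b = y
b, residuals, rank, singular_values = np.linalg.lstsq(X_design, y, rcond=None)

print(b)  # b[0] is the constant term, b[1:] are the partial regression coefficients
```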

sklearn library - Linear regression (LinearRegression)

Parameters and usage:

from sklearn.linear_model import LinearRegression

Parameters:
1. fit_intercept: Boolean. Specifies whether to calculate the intercept (the b value) in the linear regression. If False, the intercept is not calculated.
2. normalize: Boolean. If True, the training samples are normalized before regression. (This parameter was deprecated in scikit-learn 1.0 and removed in 1.2.)
3. copy_X: Boolean. If True, the training data is copied; otherwise it may be overwritten.
4. n_jobs: Integer. The number of CPUs to use when tasks run in parallel. If the value is -1, all available CPUs are used.
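
To illustrate the effect of `fit_intercept`, here is a small sketch on made-up data (the values are assumptions chosen for the example). With `fit_intercept=False` the model is forced through the origin, so the slope changes to compensate.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data only: y = 5 + 2x
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 5.0 + 2.0 * X.ravel()

with_b = LinearRegression(fit_intercept=True).fit(X, y)
without_b = LinearRegression(fit_intercept=False).fit(X, y)

print(with_b.intercept_, with_b.coef_)        # intercept is estimated from the data
print(without_b.intercept_, without_b.coef_)  # intercept is fixed at 0
```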

Attributes:
1. coef_: the weight vector (the regression coefficients)
2. intercept_: the intercept (the b value)

Methods:
1. fit(X, y): Trains the model.
2. predict(X): Uses the trained model to predict and returns the predicted values.
3. score(X, y): Returns a score of the prediction performance (the coefficient of determination R²). The formula is score = 1 - u/v,
where u = ((y_true - y_pred)**2).sum() and v = ((y_true - y_true.mean())**2).sum().
The maximum score is 1, but it can be negative (when the predictions are very poor). The larger the score, the better the prediction performance.
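
The score formula can be checked by hand. The sketch below fits a model on made-up data (the values are assumptions), computes u and v as defined above, and compares the result with `score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data only
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_true = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

model = LinearRegression().fit(X, y_true)
y_pred = model.predict(X)

u = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
v = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
manual_score = 1 - u / v

print(manual_score, model.score(X, y_true))  # the two values should agree
```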

Salary prediction case study

Univariate linear regression (years of work and salary). The data is shown in the figure.

# Call the library necessary for data analysis 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model  # Linear model 
# Import data 
single_variable = pd.read_csv(r"E:\Data analysis\single_variable.csv")
print(single_variable) # View the data 
print(single_variable.isnull().any())  # Whether there are missing values 

# Prepare the data 
length = len(single_variable['work_length'])
X = np.array(single_variable['work_length']).reshape([length,1])
Y = np.array(single_variable['year_salary'])
# Plot the observed data
# Draw a scatter plot of X and Y, setting the color, marker style and transparency
# (the styling values here are illustrative; the original call was lost from the source)
plt.scatter(X, Y, color='blue', marker='o', alpha=0.6)
# Add the x-axis title
plt.xlabel('work years')
# Add the y-axis title
plt.ylabel('year salary')
# Add the chart title
plt.title('work years and year salary')
# Set the background grid line color, style, width and transparency
plt.grid(color='#95a5a6',linestyle='--', linewidth=1,axis='both',alpha=0.4)
# Show the chart
plt.show()

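# Create and fit the linear regression model
linear = linear_model.LinearRegression()
linear.fit(X, Y)
# Check the intercept and coefficient
print(linear.intercept_)
print(linear.coef_)
# Check the goodness-of-fit score
print(linear.score(X, Y))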

# Predict for new data (8 years of work)
x_new = np.array(8).reshape(1, -1)
y_pred = linear.predict(x_new)
print(y_pred)
# The fitted model has the form y = ax + b, with a = linear.coef_[0] and b = linear.intercept_
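
Since the CSV file is not included here, the following self-contained sketch repeats the same steps on made-up values (the numbers are hypothetical stand-ins for the `work_length` and `year_salary` columns). It shows how the fitted equation can be written out from `coef_` and `intercept_`, and that `predict` agrees with it.

```python
import numpy as np
from sklearn import linear_model

# Hypothetical stand-in for the CSV columns; values are made up
work_length = np.array([1, 2, 3, 4, 5, 6], dtype=float)
year_salary = np.array([6.0, 7.5, 9.1, 10.4, 12.2, 13.5])

X = work_length.reshape(-1, 1)  # sklearn expects a 2-D feature array
linear = linear_model.LinearRegression().fit(X, year_salary)

a = linear.coef_[0]    # slope
b = linear.intercept_  # intercept
print(f"y = {a:.2f}x + {b:.2f}")

# predict() and the hand-written equation give the same value
x_new = np.array([[8.0]])
y_pred = linear.predict(x_new)[0]
print(y_pred, a * 8.0 + b)
```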

Multiple linear regression (years of work, city, education level, job title, and salary). The data is shown in the figure.

# Import the libraries needed for data analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model  # Linear model 
# Import data 
many_variable = pd.read_csv(r"E:\Data analysis\many_variable.csv")
print(many_variable) # View the data 
print(many_variable.isnull().any())  # Whether there are missing values 
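# Data processing: encode the categorical columns as integers
# NOTE: the replacement codes were cut off in the source; [1, 2] and [1, 2, 3, 4, 5] are assumed placeholders,
# and the category labels must match the values actually stored in the CSV
many_variable['education'] = many_variable['education'].replace(['Undergraduate', 'Graduate student'], [1, 2])
many_variable['city'] = many_variable['city'].replace(['Beijing', 'Shanghai', 'Guangzhou', 'Hangzhou', 'Shenzhen'], [1, 2, 3, 4, 5])
# View the data
print(many_variable.head())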
# Prepare the data 
x = np.array(many_variable[['work_length','education','title','city']])
y = np.array(many_variable['year_salary'])
# Split the data set (training set and test set)
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(x,y,test_size=0.4,random_state=1)
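# Create the linear regression model and fit it on the training set
linear2 = linear_model.LinearRegression()
linear2.fit(X_train, y_train)
# Check the intercept and coefficients
print(linear2.intercept_)
print(linear2.coef_)
# Check the goodness-of-fit score on the test set
print(linear2.score(X_test, y_test))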

# Predict on the test set
y_pred = list(linear2.predict(X_test))
print(y_pred)
# Final fitted equation: y = 1.35 + 1.1*work_length + 5.19*education + 5.92*title + 0.09*city
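
Encoding city as integers, as above, implicitly treats the cities as ordered. One possible alternative, not used in the original script, is one-hot encoding with `pd.get_dummies`. The sketch below uses hypothetical rows whose column names follow the article; the values are made up.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical rows; column names follow the article, values are made up
df = pd.DataFrame({
    "work_length": [1, 3, 5, 2, 8, 6],
    "city": ["Beijing", "Shanghai", "Guangzhou", "Beijing", "Shenzhen", "Hangzhou"],
    "year_salary": [8.0, 12.5, 15.0, 9.5, 22.0, 17.0],
})

# One indicator column per city; drop_first removes one redundant column
# because the model also fits an intercept
features = pd.get_dummies(
    df[["work_length", "city"]], columns=["city"], drop_first=True, dtype=float
)

model = LinearRegression().fit(features, df["year_salary"])
print(list(features.columns))
print(model.coef_)
```

Each city coefficient is then read as an offset relative to the dropped city, rather than as a slope along an arbitrary numeric code.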

