E-commerce data analysis -- salary prediction (linear regression)
2022-07-06 12:00:00 【Want to be a kite】
Data analysis process:
- Clarify the purpose
- Get the data
- Explore and preprocess the data
- Analyze the data
- Draw conclusions
- Verify the conclusions
- Present the results
Linear regression: Linear regression is a statistical analysis method that uses regression analysis from mathematical statistics to determine the quantitative relationship between two or more variables. It is widely used. The model is written as y = w'x + e, where the error term e follows a normal distribution with mean 0. When the analysis includes only one independent variable and one dependent variable, and their relationship can be approximated by a straight line, it is called univariate linear regression. When the analysis includes two or more independent variables, and the relationship between the dependent variable and the independent variables is linear, it is called multiple linear regression. (Commonly used in demand forecasting, sales forecasting, and ranking forecasting.)
Univariate linear regression equation: $y = b + aX$
$b$ is the intercept and $a$ is the slope of the regression line.
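As a minimal sketch of how the slope and intercept can be estimated, the block below applies the standard least-squares formulas with NumPy. The arrays `X` and `y` are made-up toy values (an assumption for illustration, not the article's salary data).

```python
import numpy as np

# Hypothetical toy data, generated exactly from y = 1 + 2X
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Least-squares slope: covariance of X and y divided by the variance of X
a = np.sum((X - X.mean()) * (y - y.mean())) / np.sum((X - X.mean()) ** 2)

# Least-squares intercept: the fitted line passes through the point of means
b = y.mean() - a * X.mean()

print("slope a:", a)
print("intercept b:", b)
```

Because the toy data lies exactly on a line, the estimates recover the slope and intercept used to generate it.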
Multiple linear regression equation: $y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n$
$b_0$ is the constant term; $b_1, b_2, \dots, b_n$ are the partial regression coefficients of $y$ with respect to $X_1, X_2, \dots, X_n$.
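The multiple regression equation can also be solved directly as a least-squares problem. The sketch below uses `np.linalg.lstsq` on a design matrix with a leading column of ones, so that the constant term is estimated together with the other coefficients. The data is a hypothetical toy example, not the article's dataset.

```python
import numpy as np

# Hypothetical toy data, generated exactly from y = 1 + 2*X1 + 3*X2
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the constant term b0 is part of the solution
A = np.column_stack([np.ones(len(X)), X])

# Solve min ||A @ coeffs - y||^2
coeffs, residuals, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)
b0, b1, b2 = coeffs

print("b0, b1, b2:", b0, b1, b2)
```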
sklearn library: linear regression (LinearRegression)
Parameters and usage:
from sklearn.linear_model import LinearRegression
LinearRegression(fit_intercept=True, normalize=False, copy_X=True, n_jobs=1)  # normalize was removed in scikit-learn 1.2; omit it on newer versions
Parameter meaning:
1、fit_intercept: Boolean. Whether to calculate the intercept (the b value) of the linear regression. If False, no intercept is calculated.
2、normalize: Boolean. If True, the training samples are normalized before fitting. (This parameter was deprecated in scikit-learn 1.0 and removed in 1.2.)
3、copy_X: Boolean. If True, the training data is copied; otherwise it may be overwritten.
4、n_jobs: Integer. The number of CPUs to use for parallel computation. If -1, all available CPUs are used.
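To see what fit_intercept changes in practice, the following sketch fits the same hypothetical toy data twice, once with and once without an intercept. The variable names `with_intercept` and `no_intercept` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data, generated exactly from y = 1 + 2x
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Default behaviour: the intercept b is estimated from the data
with_intercept = LinearRegression(fit_intercept=True).fit(X, y)

# Without an intercept the fitted line is forced through the origin
no_intercept = LinearRegression(fit_intercept=False).fit(X, y)

print(with_intercept.coef_, with_intercept.intercept_)
print(no_intercept.coef_, no_intercept.intercept_)
```

With fit_intercept=False the intercept is fixed at 0, so the slope has to absorb the offset and no longer matches the value used to generate the data.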
Attributes:
1、coef_: the weight vector (regression coefficients)
2、intercept_: the intercept (b value)
Methods:
1、fit(X, y): Train the model.
2、predict(X): Use the trained model to predict, and return the predicted values.
3、score(X, y): Return the prediction performance score (the coefficient of determination R²). The formula is: score = 1 - u/v,
where u = ((y_true - y_pred)**2).sum() and v = ((y_true - y_true.mean())**2).sum().
The maximum score is 1, but it can be negative (when the predictions are very poor). The larger the score, the better the prediction performance.
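The score formula can be checked by hand. The sketch below computes u and v with NumPy and compares the result with `model.score`. It then uses `sklearn.metrics.r2_score` on deliberately bad predictions (the true values in reverse order) to show that the score can be negative. All numbers are hypothetical toy values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical toy data, roughly linear
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# u: residual sum of squares, v: total sum of squares
u = ((y - y_pred) ** 2).sum()
v = ((y - y.mean()) ** 2).sum()
manual_score = 1 - u / v

print("manual score:", manual_score)
print("model.score :", model.score(X, y))

# Deliberately bad predictions: the true values in reverse order
bad_score = r2_score(y, y[::-1])
print("score of bad predictions:", bad_score)
```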
Salary prediction case study
Univariate linear regression (years of work and salary). The data is shown in the figure.
# Import the libraries needed for data analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model # Linear model
# Import data
single_variable = pd.read_csv(r"E:\ Data analysis \ingle_variable.csv")
print(single_variable) # View the data
print(single_variable.shape)
print(single_variable.isnull().any()) # Check each column for missing values
# Prepare the data
length = len(single_variable['work_length'])
X = np.array(single_variable['work_length']).reshape([length,1])
Y = np.array(single_variable['year_salary'])
# Plot the observed data
# Draw a scatter plot of X and Y; set the marker size, color, marker style, line width, and transparency
plt.scatter(X, Y, s=60, color='blue', marker='o', linewidth=3, alpha=0.8)
# Add the x-axis title
plt.xlabel('work years')
# Add the y-axis title
plt.ylabel('year salary')
# Add the chart title
plt.title('work years and year salary')
# Set the color, style, width, and transparency of the background grid lines
plt.grid(color='#95a5a6', linestyle='--', linewidth=1, axis='both', alpha=0.4)
# Show the chart
plt.show()
# Create and fit the linear regression model
linear = linear_model.LinearRegression()
linear.fit(X, Y)
# Inspect the coefficient and the intercept
print(linear.coef_)
print(linear.intercept_)
# Check the goodness of fit (R² score)
print(linear.score(X, Y))
# Predict for new data (8 years of work)
x_new = np.array(8).reshape(1, -1)
y_pred = linear.predict(x_new)
print(y_pred)
# The fitted model is y = a*x + b, with a = linear.coef_[0] and b = linear.intercept_
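To make the final step concrete, the sketch below reads a and b from a fitted model and checks that a*x + b reproduces what predict returns. The arrays are hypothetical stand-ins for the work_length and year_salary columns, since the CSV itself is not available here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-ins for the CSV columns work_length and year_salary
work_length = np.array([1, 2, 3, 5, 6, 8, 10], dtype=float)
year_salary = np.array([10, 12, 15, 20, 22, 27, 33], dtype=float)

# scikit-learn expects a 2-D feature matrix of shape (n_samples, n_features)
X = work_length.reshape(-1, 1)
linear = LinearRegression().fit(X, year_salary)

a = linear.coef_[0]    # slope
b = linear.intercept_  # intercept

# Evaluating y = a*x + b by hand matches linear.predict
x_new = 8.0
manual = a * x_new + b
predicted = linear.predict(np.array([[x_new]]))[0]

print("manual   :", manual)
print("predicted:", predicted)
```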
Multiple linear regression (years of work, city, education level, job title, and salary). The data is shown in the figure.
# Import the libraries needed for data analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model # Linear model
# Import data
many_variable = pd.read_csv(r"E:\ Data analysis \many_variable.csv")
print(many_variable) # View the data
print(many_variable.shape)
print(many_variable.isnull().any()) # Check each column for missing values
# Data processing: encode the categorical columns as integers
# (the category labels must match the values in the CSV exactly)
many_variable['education'] = many_variable['education'].replace(['Undergraduate', 'Graduate student'],
    [1, 2])
many_variable['city'] = many_variable['city'].replace(['Beijing', 'Shanghai', 'Guangzhou', 'Hangzhou', 'Shenzhen'],
    [1, 2, 3, 4, 5])
many_variable['title'] = many_variable['title'].replace(['P4', 'P5', 'P6', 'P7'],
    [1, 2, 3, 4])
# View the data
print(many_variable)
# Prepare the data
x = np.array(many_variable[['work_length','education','title','city']])
y = np.array(many_variable['year_salary'])
# Split the dataset (training set and test set)
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(x,y,test_size=0.4,random_state=1)
# Create and fit the linear regression model on the training set
linear2 = linear_model.LinearRegression()
linear2.fit(X_train, y_train)
# Inspect the coefficients and the intercept
print(linear2.coef_)
print(linear2.intercept_)
# Check the goodness of fit (R² score) on the held-out test set
print(linear2.score(X_test, y_test))
# Predict on the test set
y_pred = list(linear2.predict(X_test))
print(y_pred)
# The fitted model is y = 1.35 + 1.1*work_length + 5.19*education + 5.92*title + 0.09*city
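Since the model is trained on one split and evaluated on another, it can help to compare the two scores side by side. The sketch below does this on synthetic data. The feature ranges and the coefficients used to generate `y` are assumptions chosen only for illustration, not values from many_variable.csv.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical synthetic stand-ins for the four integer-encoded features
rng = np.random.default_rng(1)
n = 50
work_length = rng.integers(1, 11, size=n)
education = rng.integers(1, 3, size=n)
title = rng.integers(1, 5, size=n)
city = rng.integers(1, 6, size=n)
x = np.column_stack([work_length, education, title, city]).astype(float)

# Assumed relationship for this illustration only, plus a little noise
y = 1.0 + 1.0 * work_length + 5.0 * education + 6.0 * title + rng.normal(0, 0.5, size=n)

# Same split settings as in the article: 40% test set, fixed random_state
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=1)

model = LinearRegression().fit(X_train, y_train)

# Score on the data the model was fitted on, and on the held-out data
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

print("train R^2:", train_score)
print("test  R^2:", test_score)
```

A test score far below the training score would suggest that the model does not generalize well to unseen data.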