E-commerce data analysis -- salary prediction (linear regression)
2022-07-06 12:00:00 【Want to be a kite】
Data analysis process:
- Define the objective
- Collect the data
- Explore and preprocess the data
- Analyze the data
- Draw conclusions
- Validate the conclusions
- Present the results
Linear regression: Linear regression is a widely used statistical method that applies regression analysis from mathematical statistics to determine the quantitative relationship between two or more variables. It is expressed as y = w'x + e, where the error term e follows a normal distribution with mean 0. When the analysis includes only one independent variable and one dependent variable, and their relationship can be approximated by a straight line, it is called univariate linear regression. When it includes two or more independent variables and the relationship between the dependent variable and the independent variables is linear, it is called multiple linear regression. (Commonly used for demand forecasting, sales forecasting, and ranking forecasting.)
Univariate linear regression equation: $y = b + aX$
b is the intercept and a is the slope of the regression line.
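As a minimal sketch of how a and b can be estimated, ordinary least squares for a single variable has a closed form: a is the covariance of X and y divided by the variance of X, and b is the mean of y minus a times the mean of X. The toy data below are made up for illustration and are not taken from the article's data set.

```python
import numpy as np

# Hypothetical toy data (not from the article), generated exactly from y = 1 + 2X
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Closed-form ordinary least squares for a single feature
a = np.sum((X - X.mean()) * (y - y.mean())) / np.sum((X - X.mean()) ** 2)  # slope
b = y.mean() - a * X.mean()  # intercept

print(a, b)
```

Because the toy data lie exactly on a line, the estimates recover the slope and intercept used to generate them.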
Multiple linear regression equation: $y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n$
$b_0$ is the constant term, and $b_1, b_2, \dots, b_n$ are the partial regression coefficients of y with respect to $X_1, X_2, \dots, X_n$.
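The multiple regression equation can be fitted the same way by stacking the features into a matrix. The sketch below uses NumPy's least-squares solver on made-up data with two features; the column of ones is an illustrative way to obtain the constant term, and none of the values come from the article.

```python
import numpy as np

# Hypothetical toy data: two features, generated exactly from y = 1 + 2*X1 + 3*X2
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so that the first coefficient plays the role of b0
design = np.column_stack([np.ones(len(X)), X])

# Solve the least-squares problem: minimize ||design @ coefs - y||^2
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
b0, b1, b2 = coefs
print(b0, b1, b2)
```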
sklearn library - Linear regression (LinearRegression)
Parameters and usage:
from sklearn.linear_model import LinearRegression
# normalize is omitted from the call: it was deprecated in scikit-learn 1.0 and removed in 1.2
LinearRegression(fit_intercept=True, copy_X=True, n_jobs=1)
Parameter meanings:
1. fit_intercept: Boolean. Whether to calculate the intercept of the linear model, namely the b value. If False, no intercept is calculated.
2. normalize: Boolean (older scikit-learn versions only). If True, the training samples are normalized before fitting; if False, they are not. This parameter was deprecated in scikit-learn 1.0 and removed in 1.2.
3. copy_X: Boolean. If True, the training data X is copied; otherwise it may be overwritten.
4. n_jobs: Integer. The number of CPUs to use when the computation runs in parallel. If the value is -1, all available CPUs are used.
Attributes:
1. coef_: The weight vector (regression coefficients)
2. intercept_: The intercept, namely the b value
Methods:
1. fit(X, y): Train the model.
2. predict(X): Use the trained model to predict, and return the predicted values.
3. score(X, y): Return the score of the prediction performance (the coefficient of determination R^2). The formula is score = 1 - u/v,
where u = ((y_true - y_pred)**2).sum() and v = ((y_true - y_true.mean())**2).sum().
The maximum score is 1, but it can be negative when the predictions are worse than always predicting the mean of y_true. The larger the score, the better the prediction performance.
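The score formula can be checked by hand. The sketch below is a plain NumPy implementation of 1 - u/v (the helper name r2_score_manual is hypothetical, not part of sklearn), applied to three made-up prediction vectors to show the perfect, zero, and negative cases.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Return 1 - u/v, following the score formula described above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    u = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
    v = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
    return 1 - u / v

y_true = [3.0, 5.0, 7.0, 9.0]
perfect = r2_score_manual(y_true, y_true)                  # predictions match exactly
mean_only = r2_score_manual(y_true, [6.0, 6.0, 6.0, 6.0])  # always predict the mean
reversed_ = r2_score_manual(y_true, [9.0, 7.0, 5.0, 3.0])  # worse than the mean
print(perfect, mean_only, reversed_)
```

Predicting the mean everywhere gives a score of 0, which is why a negative score signals a model that does worse than that baseline.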
Salary prediction case study
Univariate linear regression (years of work and salary). The data is shown in the figure.
# Import the libraries needed for the analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model  # Linear models
# Load the data
single_variable = pd.read_csv(r"E:\ Data analysis \ingle_variable.csv")
print(single_variable)  # View the data
print(single_variable.shape)
print(single_variable.isnull().any())  # Check for missing values
# Prepare the data
length = len(single_variable['work_length'])
X = np.array(single_variable['work_length']).reshape([length, 1])
Y = np.array(single_variable['year_salary'])
# Plot the observed data
# Draw a scatter plot of X against Y, setting the marker size, color, marker style, edge width, and transparency
plt.scatter(X, Y, 60, color='blue', marker='o', linewidth=3, alpha=0.8)
# Add the x-axis label
plt.xlabel('work years')
# Add the y-axis label
plt.ylabel('year salary')
# Add the chart title
plt.title('work years and year salary')
# Set the color, style, width, and transparency of the background grid lines
plt.grid(color='#95a5a6', linestyle='--', linewidth=1, axis='both', alpha=0.4)
# Show the chart
plt.show()
# Fit the linear regression model
linear = linear_model.LinearRegression()
linear.fit(X, Y)
# Check the coefficient (slope) and the intercept
print(linear.coef_)
print(linear.intercept_)
# Check the goodness of fit (R^2 score on the training data)
print(linear.score(X, Y))
# Predict for new data (8 years of work)
x_new = np.array(8).reshape(1, -1)
y_pred = linear.predict(x_new)
print(y_pred)
# The fitted model is y = a*x + b, where a = linear.coef_[0] and b = linear.intercept_
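To make the relationship between the fitted attributes and the prediction concrete, the sketch below fits LinearRegression on made-up data standing in for work_length and year_salary (the numbers are assumptions, not the article's data) and checks that evaluating y = a*x + b by hand matches predict.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data standing in for work_length and year_salary
X = np.array([1, 2, 3, 4, 5], dtype=float).reshape(-1, 1)
Y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

model = LinearRegression().fit(X, Y)
a = model.coef_[0]    # slope
b = model.intercept_  # intercept

x_new = np.array([[8.0]])
manual = a * x_new[0, 0] + b         # y = a*x + b evaluated by hand
predicted = model.predict(x_new)[0]  # the same value from the fitted model
print(f"y = {a:.2f} * x + {b:.2f}")
print(manual, predicted)
```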
Multiple linear regression (years of work, city, education level, job grade, and salary). The data is shown in the figure.
# Import the libraries needed for the analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model  # Linear models
# Load the data
many_variable = pd.read_csv(r"E:\ Data analysis \many_variable.csv")
print(many_variable) # View the data
print(many_variable.shape)
print(many_variable.isnull().any())  # Check for missing values
# Data processing: encode the categorical columns as integers
# (the labels listed here must match the values stored in the CSV exactly)
many_variable['education'] = many_variable['education'].replace(
    ['Undergraduate', 'Graduate student'], [1, 2])
many_variable['city'] = many_variable['city'].replace(
    ['Beijing', 'Shanghai', 'Guangzhou', 'Hangzhou', 'Shenzhen'], [1, 2, 3, 4, 5])
many_variable['title'] = many_variable['title'].replace(
    ['P4', 'P5', 'P6', 'P7'], [1, 2, 3, 4])
# View the encoded data
print(many_variable)
# Prepare the data
x = np.array(many_variable[['work_length', 'education', 'title', 'city']])
y = np.array(many_variable['year_salary'])
# Split the data set into a training set and a test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=1)
# Fit the linear regression model
linear2 = linear_model.LinearRegression()
linear2.fit(X_train, y_train)
# Check the coefficients and the intercept
print(linear2.coef_)
print(linear2.intercept_)
# Check the goodness of fit (R^2 score) on the held-out test set
print(linear2.score(X_test, y_test))
# Predict on the test set
y_pred = list(linear2.predict(X_test))
print(y_pred)
# The fitted model is y = 1.35 + 1.1*work_length + 5.19*education + 5.92*title + 0.09*city
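The integer codes assigned to city in the data-processing step give the cities an order and spacing that a linear model treats as numeric. One common alternative, which the article does not use, is one-hot encoding. The sketch below applies pandas get_dummies to made-up rows that reuse the article's column names; the values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical rows using the same column names as the article's data set
df = pd.DataFrame({
    "work_length": [1, 3, 5, 7],
    "city": ["Beijing", "Shanghai", "Beijing", "Shenzhen"],
    "year_salary": [10, 15, 20, 30],
})

# One indicator column per city; drop_first removes one category so the
# indicators are not redundant with the intercept fitted by the model
encoded = pd.get_dummies(df, columns=["city"], drop_first=True)
print(encoded.columns.tolist())
```

With this encoding each remaining city gets its own coefficient, measured relative to the dropped category, instead of sharing a single slope across arbitrary codes.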