Machine Learning in Action - Logistic Regression - 19
2022-07-28 11:51:00 【gemoumou】
Machine Learning in Action - Logistic Regression - Customer Churn Prediction



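import numpy as np
# Load both CSV files as strings so that text and numeric columns fit in one array
train_data = np.genfromtxt('Churn-Modelling.csv',delimiter=',',dtype=str)
test_data = np.genfromtxt('Churn-Modelling-Test-Data.csv',delimiter=',',dtype=str)
# Skip the header row; the last column is the label, the others are features
x_train = train_data[1:,:-1]
y_train = train_data[1:,-1].astype(int)
x_test = test_data[1:,:-1]
y_test = test_data[1:,-1].astype(int)
# Drop the first three columns, which are not used as features
x_train = np.delete(x_train,[0,1,2],axis=1)
x_test = np.delete(x_test,[0,1,2],axis=1)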
x_train[:5]

y_train[:5]

# x_train[x_train=='Female'] = 0
# x_train[x_train=='Male'] = 1
from sklearn.preprocessing import LabelEncoder
# Encode the two categorical columns as integers.
# Fit each encoder on the training set only, then reuse the same mapping on the test set.
labelencoder1 = LabelEncoder()
x_train[:,1] = labelencoder1.fit_transform(x_train[:,1])
x_test[:,1] = labelencoder1.transform(x_test[:,1])
labelencoder2 = LabelEncoder()
x_train[:,2] = labelencoder2.fit_transform(x_train[:,2])
x_test[:,2] = labelencoder2.transform(x_test[:,2])

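# Convert the string arrays to floating-point numbers
x_train = x_train.astype(np.float32)
x_test = x_test.astype(np.float32)
y_train = y_train.astype(np.float32)
y_test = y_test.astype(np.float32)
# Standardize the features using the mean and variance of the training set
from sklearn.preprocessing import StandardScaler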
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
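
The preprocessing above follows a fit-on-train, transform-on-test pattern. The sketch below illustrates that pattern on toy data; the category names and numbers are hypothetical and are not taken from the churn dataset.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical categorical column, split into train and test parts
train_col = np.array(['France', 'Spain', 'Germany', 'France'])
test_col = np.array(['Germany', 'France'])

encoder = LabelEncoder()
train_codes = encoder.fit_transform(train_col)  # learns the sorted class list
test_codes = encoder.transform(test_col)        # reuses the learned mapping
print(encoder.classes_)
print(train_codes, test_codes)

# Hypothetical numeric column
train_num = np.array([[1.0], [2.0], [3.0]])
test_num = np.array([[2.0], [4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_num)  # mean and std come from train only
test_scaled = scaler.transform(test_num)        # test is scaled with train statistics
print(train_scaled.ravel(), test_scaled.ravel())
```

Reusing the fitted encoder and scaler keeps the test set on the same scale and coding as the training set, and avoids leaking test-set statistics into preprocessing.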

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# Train a logistic regression classifier and report precision, recall and F1 on the test set
LR = LogisticRegression()
LR.fit(x_train,y_train)
predictions = LR.predict(x_test)
print(classification_report(y_test, predictions))

Machine Learning in Action - Logistic Regression - Diabetes Prediction Model


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data
diabetes_data = pd.read_csv('diabetes.csv')
diabetes_data.head()

# Column types and non-null counts
diabetes_data.info(verbose=True)

# Summary statistics
diabetes_data.describe()

# Shape of the data
diabetes_data.shape

# Inspect the label distribution
print(diabetes_data.Outcome.value_counts())
# Plot the label counts as a bar chart
p=diabetes_data.Outcome.value_counts().plot(kind="bar")
plt.show()

# Visualize the feature distributions
p=sns.pairplot(diabetes_data, hue = 'Outcome')
plt.show()

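The plots here are of two main types: histograms and scatter plots. A panel that pairs a feature with itself shows a histogram of that feature, while a panel that pairs two different features shows a scatter plot of the relationship between them. Looking at the distributions, we can spot some anomalous values: Glucose, BloodPressure, SkinThickness, Insulin and BMI (body mass index) should never be 0, so zeros in these columns are treated as missing values.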
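
One way to confirm this observation numerically is to count the zeros in each suspect column. This is a minimal sketch on a toy DataFrame with hypothetical values; on the real data, `toy` would be replaced by `diabetes_data` and `suspect_cols` by the five columns named above.

```python
import pandas as pd

# Toy frame with hypothetical values; a zero in Pregnancies is legitimate,
# while a zero in Glucose or BMI is not physiologically plausible
toy = pd.DataFrame({
    'Glucose': [148, 0, 183, 89],
    'BMI': [33.6, 26.6, 0.0, 28.1],
    'Pregnancies': [6, 1, 0, 0],
})

suspect_cols = ['Glucose', 'BMI']
# Compare each cell with 0, then sum the boolean mask column by column
zero_counts = (toy[suspect_cols] == 0).sum()
print(zero_counts)
```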
# Replace the zeros in Glucose, BloodPressure, SkinThickness, Insulin and BMI with NaN
colume = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
diabetes_data[colume] = diabetes_data[colume].replace(0,np.nan)
# pip install missingno
import missingno as msno
p=msno.bar(diabetes_data)
plt.show()

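# Set the threshold: the minimum number of non-missing values a column must have
thresh_count = diabetes_data.shape[0]*0.8
# Any column with more than 20% of its values missing is dropped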
diabetes_data = diabetes_data.dropna(thresh=thresh_count, axis=1)
p=msno.bar(diabetes_data)
plt.show()
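
The `thresh` argument of `dropna` is a count of required non-missing values, not a fraction, which is why the row count is multiplied by 0.8 above. The following sketch shows the effect on a toy DataFrame; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'mostly_present': [1.0, 2.0, np.nan, 4.0, 5.0],        # 4 of 5 values present
    'mostly_missing': [np.nan, np.nan, 3.0, np.nan, 5.0],  # 2 of 5 values present
})

# Require at least 80% of the rows to be non-missing in each column
min_non_missing = int(toy.shape[0] * 0.8)
kept = toy.dropna(thresh=min_non_missing, axis=1)
print(list(kept.columns))
```

Only columns that meet the minimum count survive, so `mostly_missing` is dropped here and `mostly_present` is kept.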

# Import the imputer
from sklearn.impute import SimpleImputer
# For numeric variables, fill the missing values with the column mean
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
colume = ['Glucose', 'BloodPressure', 'BMI']
# Run the imputation
diabetes_data[colume] = imr.fit_transform(diabetes_data[colume])
p=msno.bar(diabetes_data)
plt.show()
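
To make the mean imputation step concrete, here is a small sketch on a hypothetical array. Each NaN is replaced by the mean of the non-missing values in the same column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value per column
values = np.array([[1.0, 10.0],
                   [np.nan, 20.0],
                   [3.0, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
filled = imputer.fit_transform(values)
# Column 0 mean is (1 + 3) / 2, column 1 mean is (10 + 20) / 2
print(filled)
```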

plt.figure(figsize=(12,10))
# Draw a heatmap; each cell is the correlation coefficient between two variables
p=sns.heatmap(diabetes_data.corr(), annot=True)
plt.show()

# Split the data into features x and labels y
x = diabetes_data.drop("Outcome",axis = 1)
y = diabetes_data.Outcome
from sklearn.model_selection import train_test_split
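# Split the dataset; stratify=y keeps the class proportions of y the same in the training and test sets
# For example, if the ratio of 0 to 1 in y is 1:2 before the split, it is also 1:2 in both y_train and y_test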
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3, stratify=y)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
LR = LogisticRegression()
LR.fit(x_train,y_train)
predictions = LR.predict(x_test)
print(classification_report(y_test, predictions))
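
The effect of `stratify=y` can be checked by comparing class proportions before and after the split. This is a minimal sketch on synthetic labels; the 60/30 class counts and `random_state=0` are assumptions made for the illustration and are not part of the original code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 60 negatives and 30 positives, so one third are positive
y = np.array([0] * 60 + [1] * 30)
x = np.arange(len(y)).reshape(-1, 1)

x_tr, x_te, y_tr, y_te = train_test_split(
    x, y, test_size=0.3, stratify=y, random_state=0)

# The positive rate should match in the full set, the training set and the test set
print(y.mean(), y_tr.mean(), y_te.mean())
```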
