Dataset Splitting and Cross-Validation
2022-07-31 05:09:00 【Erosion_ww】
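Dataset splitting
Training set, validation set, test set
- The training set is used to build (fit) the model.
- The validation set is used to tune the model while it is being built and to guide model selection; it can be reused. (When a validation set is used, part of the original training data is held out as the validation set and the model is fit on the remainder.)
- The test set is used to check the final model and estimate its accuracy.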
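The three-way split described above can be sketched with two calls to scikit-learn's train_test_split. The 60/20/20 proportions, the fixed random_state, and the use of stratify are assumptions made for this illustration, not something the post prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Step 1: hold out 20% of the data as the final test set (assumed ratio).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Step 2: carve the validation set out of the remaining training data.
# 25% of the remaining 80% equals 20% of the original dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest)

print(len(X_train), len(X_val), len(X_test))
```

The model would be fit on X_train, tuned against X_val, and scored once on X_test at the very end.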
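Hold-out method
By default (for example in scikit-learn's train_test_split), 75% of the dataset is used as the training set and 25% as the test set.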
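A minimal hold-out sketch, assuming the iris dataset and a logistic regression model as in the rest of this post. The random_state and max_iter values are arbitrary choices for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# No test_size is given, so train_test_split falls back to test_size=0.25.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Accuracy on the held-out 25%.
holdout_accuracy = model.score(X_test, y_test)
print(len(X_train), len(X_test), holdout_accuracy)
```

With 150 samples this gives 112 training samples and 38 test samples, because the test size is rounded up.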
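Cross-validation
1. Leave-one-out cross-validation
The dataset is split into k subsets, where k equals the number of samples in the dataset. Each round uses a single sample as the test set and all the remaining samples as the training set. The resulting estimate is the closest to what a model trained on the entire dataset would achieve, but the computational cost is very high, so the method is best suited to small datasets.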
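Leave-one-out can be sketched with scikit-learn's LeaveOneOut splitter passed to cross_val_score. The dataset and the model settings below are assumptions carried over from the other examples in this post.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

loo = LeaveOneOut()
n_rounds = loo.get_n_splits(X)  # one round per sample

model = LogisticRegression(max_iter=1000)

# Each score is the accuracy on a single test sample, so it is 0.0 or 1.0.
scores = cross_val_score(model, X, y, cv=loo)
print(n_rounds, scores.mean())
```

The model is refit once per sample, which is why the cost grows quickly with dataset size.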
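2. K-fold cross-validation
The dataset is split into k subsets. Each round uses k-1 subsets as the training set and the remaining subset as the test set. This is repeated k times, and the average accuracy over the k rounds is reported as the result. For comparison, train_test_split uses a training-to-test ratio of 3:1 by default. With 5-fold cross-validation the ratio is 4:1; with 10-fold cross-validation it is 9:1. A larger k leaves more data for training in each round, and more training data generally tends to improve model accuracy.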
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn import metrics

data = load_iris()  # load the iris dataset
X = data.data
y = data.target

# 5-fold cross-validation. Without shuffle=True the folds follow the
# stored sample order (iris is sorted by class).
kf = KFold(n_splits=5, random_state=None)
i = 1
for train_index, test_index in kf.split(X, y):
    print('\n{} of kfold {}'.format(i, kf.n_splits))
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model = LogisticRegression(random_state=1)
    model.fit(X_train, y_train)
    pred_test = model.predict(X_test)
    score = metrics.accuracy_score(y_test, pred_test)
    print('accuracy_score', score)
    i += 1

# Predicted probability of class 1 on the last fold's test set
pred = model.predict_proba(X_test)[:, 1]
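To check the 4:1 and 9:1 ratios mentioned above, here is a small sketch that only counts how many samples land in each side of the split. The helper fold_sizes and the 150 placeholder samples are made up for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(150).reshape(-1, 1)  # 150 placeholder samples


def fold_sizes(k):
    """Return the (train, test) sizes of every round of a k-fold split."""
    kf = KFold(n_splits=k)
    return [(len(train), len(test)) for train, test in kf.split(X)]


sizes_5 = fold_sizes(5)    # 120 train / 30 test per round, i.e. 4:1
sizes_10 = fold_sizes(10)  # 135 train / 15 test per round, i.e. 9:1
print(sizes_5[0], sizes_10[0])
```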
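3. Stratified cross-validation
Stratification rearranges the data so that each fold is a good representative of the whole dataset.
For example, in a binary classification problem the original data has two classes (F and M) with an F:M ratio of roughly 1:3. When the data is split into 5 folds, each fold keeps the same F:M ratio as the original data (1:3).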
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

data = load_iris()
X = data.data
y = data.target

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Inspect the indices of each stratified split
for train_index, test_index in skf.split(X, y):
    print("Train:", train_index, "Validation:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# cross_val_score does the splitting, fitting, and scoring for each fold itself
model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=skf)
print("Stratified cross-validation scores: {}".format(scores))
print("Mean score of stratified cross-validation: {:.2f}".format(scores.mean()))
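The F:M example from the text can be checked with a short sketch. The labels below are hypothetical (25 F and 75 M, made up to match the 1:3 ratio), and the features are dummy zeros because only the labels matter for the split.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 25 "F" and 75 "M", i.e. F:M = 1:3.
y = np.array(["F"] * 25 + ["M"] * 75)
X = np.zeros((len(y), 1))  # dummy features, irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_counts = []
for _, test_index in skf.split(X, y):
    n_f = int(np.sum(y[test_index] == "F"))
    n_m = int(np.sum(y[test_index] == "M"))
    fold_counts.append((n_f, n_m))

print(fold_counts)
```

Each test fold ends up with 5 F and 15 M samples, the same 1:3 ratio as the full label set.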
4. Repeated cross-validation
This simply repeats k-fold cross-validation n times, with a different random split in each repetition.
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn import metrics

data = load_iris()
X = data.data
y = data.target

# 5 folds repeated 2 times gives 10 train/test splits in total
kf = RepeatedKFold(n_splits=5, n_repeats=2, random_state=None)

# Inspect the indices of each split
for train_index, test_index in kf.split(X):
    print("Train:", train_index, "Validation:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# Fit and score a model on each split. With random_state=None, this second
# call to split() draws different splits from the ones printed above.
i = 1
for train_index, test_index in kf.split(X, y):
    print('\n{} of kfold {}'.format(i, kf.get_n_splits()))
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model = LogisticRegression(random_state=1)
    model.fit(X_train, y_train)
    pred_test = model.predict(X_test)
    score = metrics.accuracy_score(y_test, pred_test)
    print('accuracy_score', score)
    i += 1

# Predicted probability of class 1 on the last split's test set
pred = model.predict_proba(X_test)[:, 1]
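The manual loop above can probably be shortened by passing the RepeatedKFold splitter to cross_val_score. This is a sketch under assumptions: random_state is fixed here so the splits are reproducible, and max_iter is raised to give the solver more room to converge.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 5 folds x 2 repetitions = 10 scores
rkf = RepeatedKFold(n_splits=5, n_repeats=2, random_state=0)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=rkf)
print(len(scores), scores.mean(), scores.std())
```

Reporting the standard deviation alongside the mean gives a rough idea of how much the score depends on the particular split.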