Dataset Splitting and Cross-Validation
2022-07-31 05:09:00 【Erosion_ww】
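Dataset splitting
Training set, validation set, test set
- The training set is used to build (fit) the model.
- The validation set is used to tune the model while it is being built, for example to choose hyperparameters, and it can be reused across tuning rounds. (When a validation set is used, it is held out from the training data: part of the training data fits the model, and the rest is kept as the validation set.)
- The test set is used to check the final model and estimate its accuracy.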
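One way to obtain the three sets is to call scikit-learn's `train_test_split` twice. The sketch below is only an illustration: the 20% test fraction, the 25% validation fraction, the `stratify` arguments, and the variable names are assumptions, not values from the original post.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Step 1: set the test set aside (the 20% fraction is an assumption).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Step 2: hold out part of the remaining training data as the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest)

print("train:", len(X_train), "validation:", len(X_val), "test:", len(X_test))
```

The model would then be fitted on `X_train`, tuned against `X_val`, and scored once on `X_test` at the very end.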
Hold-out method
By default, scikit-learn's train_test_split uses 75% of the dataset as the training set and the remaining 25% as the test set.
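A minimal hold-out sketch, assuming the iris dataset and a logistic regression model (both chosen here only for illustration). Leaving `test_size` and `train_size` unset lets the default split apply.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# No test_size or train_size is given, so the 75% / 25% default applies.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print("train samples:", X_train.shape[0], "test samples:", X_test.shape[0])

# Fit on the training part only, then score once on the held-out part.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))
```

Because the score comes from a single split, it can vary noticeably with `random_state`; the cross-validation methods reduce that dependence.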
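Cross-validation
1. Leave-one-out validation
The dataset is split into k subsets, where k equals the number of samples, so every subset holds exactly one sample. In each round a single sample is used as the test set and all the others form the training set. The resulting estimate is the closest to what a model trained on the entire dataset would be expected to achieve, but the cost is very high (one model fit per sample), so the method is best suited to small datasets.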
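The procedure above can be sketched with scikit-learn's `LeaveOneOut` splitter passed to `cross_val_score`. The iris dataset and the logistic regression model are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

loo = LeaveOneOut()
# One fit per sample: each round tests on a single held-out sample.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)

print("number of rounds:", len(scores))
print("mean accuracy:", scores.mean())
```

Each individual score is either 0 or 1, since only one sample is predicted per round; only the mean is informative.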
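2. K-fold cross-validation
The dataset is split into k subsets. In each round, k-1 subsets form the training set and the remaining subset is the test set. This is repeated k times so that every subset serves as the test set once, and the mean accuracy over the k rounds is reported as the result. For comparison, train_test_split defaults to a training-to-test ratio of 3:1. With 5-fold cross-validation the ratio is 4:1, and with 10-fold cross-validation it is 9:1. In general, the more training data a model sees, the higher its accuracy tends to be.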
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn import metrics

data = load_iris()  # load the iris dataset
X = data.data
y = data.target

# 5-fold cross-validation. The iris samples are ordered by class,
# so shuffle before splitting to get a mix of classes in every fold.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
i = 1
for train_index, test_index in kf.split(X, y):
    print('\n{} of kfold {}'.format(i, kf.n_splits))
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # max_iter is raised from the default to avoid convergence warnings
    model = LogisticRegression(random_state=1, max_iter=1000)
    model.fit(X_train, y_train)
    pred_test = model.predict(X_test)
    score = metrics.accuracy_score(y_test, pred_test)
    print('accuracy_score', score)
    i += 1

# Predicted probability of class 1 for the last fold's test samples
pred = model.predict_proba(X_test)[:, 1]
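The 4:1 and 9:1 ratios quoted for 5-fold and 10-fold cross-validation can be checked by counting the indices that `KFold` returns. This is a small sketch on 100 dummy samples; the helper `fold_sizes` is a hypothetical name introduced for the example.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(-1, 1)  # 100 dummy samples (illustrative data)

def fold_sizes(k):
    """Return (train size, test size) of the first split of a k-fold scheme."""
    train_idx, test_idx = next(KFold(n_splits=k).split(X))
    return len(train_idx), len(test_idx)

for k in (5, 10):
    n_train, n_test = fold_sizes(k)
    print("k={}: train={}, test={}, ratio={}:1".format(
        k, n_train, n_test, n_train // n_test))
```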
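3. Stratified cross-validation
Stratification rearranges the data so that each fold is a good representative of the whole dataset.
For example, take a binary classification problem whose two classes (F and M) occur in a ratio of roughly 1:3. When the data is split into 5 stratified folds, the F:M ratio in every fold stays the same as in the original data (1:3).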
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

data = load_iris()
X = data.data
y = data.target

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# Inspect the sample indices used by each fold
for train_index, test_index in skf.split(X, y):
    print("Train:", train_index, "Validation:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# cross_val_score fits and scores a fresh copy of the model on every fold
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=skf)
print("Stratified cross validation scores:{}".format(scores))
print("Mean score of stratified cross validation:{:.2f}".format(scores.mean()))
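The F:M example can be reproduced on synthetic labels. The sketch below assumes 25 F and 75 M samples (an exact 1:3 ratio chosen for illustration) and counts the classes in each validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels with an exact 1:3 ratio (illustrative data)
y = np.array(["F"] * 25 + ["M"] * 75)
X = np.zeros((len(y), 1))  # feature values do not affect the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_counts = []
for _, test_index in skf.split(X, y):
    n_f = int(np.sum(y[test_index] == "F"))
    n_m = int(np.sum(y[test_index] == "M"))
    fold_counts.append((n_f, n_m))
    print("F:", n_f, "M:", n_m)
```

A plain `KFold` on the same labels gives no such guarantee, which is why stratification is usually preferred for imbalanced classification data.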
4. Repeated cross-validation
This simply repeats k-fold cross-validation n times, with a different random split in each repetition.
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn import metrics

data = load_iris()
X = data.data
y = data.target

# 5 folds repeated 2 times gives 10 train/validation splits in total
kf = RepeatedKFold(n_splits=5, n_repeats=2, random_state=None)
n_total = kf.get_n_splits()
i = 1
for train_index, test_index in kf.split(X, y):
    print('\n{} of kfold {}'.format(i, n_total))
    print("Train:", train_index, "Validation:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # max_iter is raised from the default to avoid convergence warnings
    model = LogisticRegression(random_state=1, max_iter=1000)
    model.fit(X_train, y_train)
    pred_test = model.predict(X_test)
    score = metrics.accuracy_score(y_test, pred_test)
    print('accuracy_score', score)
    i += 1

# Predicted probability of class 1 for the last split's test samples
pred = model.predict_proba(X_test)[:, 1]
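When only the scores are needed, the manual loop can be shortened by passing the `RepeatedKFold` splitter to `cross_val_score`. This is a sketch under the same assumptions (iris data, logistic regression), with `random_state=0` fixed here so the run is reproducible.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 5 folds x 2 repetitions = 10 scores, each from a differently shuffled split
rkf = RepeatedKFold(n_splits=5, n_repeats=2, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=rkf)

print("number of scores:", len(scores))
print("mean accuracy: {:.3f} (std {:.3f})".format(scores.mean(), scores.std()))
```

Reporting the standard deviation alongside the mean shows how much the estimate depends on the particular split.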