Data set partitioning and cross-validation
2022-07-31 05:32:00 【Erosion_ww】
Dataset partitioning
Training set, validation set, test set
- The training set is used to build the model
- The validation set is used to tune the model while it is being built. It assists model construction and can be reused. (When a validation set is used, a portion of the training data is held out for it and used to check the model during training.)
- The test set is used to check the finished model and evaluate its accuracy.
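The three roles above can be sketched with two successive calls to scikit-learn's train_test_split. The 60/20/20 proportions and the iris data are assumptions chosen for illustration; the text does not prescribe any ratio.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Step 1: hold out 20% of the data as the test set (assumed ratio).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 2: take 25% of the remainder (20% of the whole) as the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```

The model would be fitted on X_train, tuned against X_val, and scored once on X_test at the end.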
Hold-out method
By default (for example in scikit-learn's train_test_split), 75% of the dataset is used as the training set and 25% as the test set.
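A minimal sketch of the hold-out method, assuming scikit-learn's train_test_split with its default split and the iris data used elsewhere in this post. The max_iter value is an assumption added only so the solver converges.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# With no test_size given, train_test_split keeps 25% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000)  # max_iter is an assumed setting
model.fit(X_train, y_train)

print("train size:", len(X_train), "test size:", len(X_test))
print("hold-out accuracy:", model.score(X_test, y_test))
```

Because the score comes from a single split, it can change noticeably with a different random_state. The cross-validation methods described next address that.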
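Cross-validation
1. Leave-one-out cross-validation
Divide the dataset into k subsets, where k equals the number of samples. Each round uses a single sample as the test set and all the remaining samples as the training set. The estimate from this method is the closest to what training on the whole dataset would give, but the computational cost is very high, so it is suitable only for small datasets.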
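Leave-one-out can be sketched with scikit-learn's LeaveOneOut splitter. The choice of LogisticRegression and of max_iter=1000 is an assumption for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

loo = LeaveOneOut()
# One split per sample: k equals the number of samples.
n_splits = loo.get_n_splits(X)

# Each score is 0 or 1 because every test set holds exactly one sample.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)

print("number of fits:", n_splits)
print("leave-one-out accuracy:", scores.mean())
```

The model is refitted once per sample, which is why the cost grows quickly with dataset size.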
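2. K-fold cross-validation
Split the dataset into k subsets. Each round uses k-1 subsets as the training set and the remaining subset as the test set. This is repeated k times, and the average accuracy over the k rounds is taken as the result. For comparison, train_test_split defaults to a training-to-test ratio of 3:1; with 5-fold cross-validation the ratio is 4:1, and with 10-fold cross-validation it is 9:1. In general, the more training data is available, the more accurate the model tends to be.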
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn import metrics

data = load_iris()  # load the iris dataset
X = data.data
y = data.target

# 5-fold cross-validation; without shuffle=True the folds follow the row order
kf = KFold(n_splits=5, random_state=None)
i = 1
for train_index, test_index in kf.split(X, y):
    print('\n{} of kfold {}'.format(i, kf.n_splits))
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model = LogisticRegression(random_state=1)
    model.fit(X_train, y_train)
    pred_test = model.predict(X_test)
    score = metrics.accuracy_score(y_test, pred_test)
    print('accuracy_score', score)
    i += 1
# predicted probability of class 1 for the last fold's test samples
pred = model.predict_proba(X_test)[:, 1]
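The 4:1 and 9:1 ratios mentioned above can be checked by counting the indices that KFold returns. The 100 dummy samples here are hypothetical data chosen so the sizes are easy to read.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(-1, 1)  # 100 dummy samples (hypothetical data)

ratios = {}
for k in (5, 10):
    # Look at the first split only; all folds have the same size here.
    train_idx, test_idx = next(KFold(n_splits=k).split(X))
    ratios[k] = (len(train_idx), len(test_idx))
    print(f"{k}-fold: train={len(train_idx)}, test={len(test_idx)}")
```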
3. Stratified cross-validation
Stratification arranges the data so that each fold better represents the whole dataset.
For example, in a binary classification problem the raw data has two classes (F and M), and the ratio of F to M samples is approximately 1:3. When the data is split into 5 folds, the proportion of F to M in every fold stays the same as in the original data (1:3).
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

data = load_iris()
X = data.data
y = data.target

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# Inspect the indices that each stratified fold produces
for train_index, test_index in skf.split(X, y):
    print("Train:", train_index, "Validation:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# cross_val_score fits and scores the model once per fold
model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=skf)
print("stratified cross validation scores:{}".format(scores))
print("Mean score of stratified cross validation:{:.2f}".format(scores.mean()))
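The F/M example above can be reproduced on a small synthetic label array. The 25 F and 75 M samples are hypothetical data chosen to give the 1:3 ratio.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 25 samples of class "F" and 75 of class "M" (1:3).
y = np.array(["F"] * 25 + ["M"] * 75)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_counts = []
for _, test_idx in skf.split(X, y):
    n_f = int(np.sum(y[test_idx] == "F"))
    n_m = int(np.sum(y[test_idx] == "M"))
    fold_counts.append((n_f, n_m))
    print("F:", n_f, "M:", n_m)
```

Every test fold keeps the 1:3 class ratio. A plain KFold split gives no such guarantee.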
4. Repeated cross-validation
This simply repeats k-fold n times, with a different random split in each repetition.
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn import metrics

data = load_iris()
X = data.data
y = data.target

# 5 folds repeated 2 times gives 10 train/test splits in total
kf = RepeatedKFold(n_splits=5, n_repeats=2, random_state=None)
# Inspect the indices of every split
for train_index, test_index in kf.split(X):
    print("Train:", train_index, "Validation:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

i = 1
for train_index, test_index in kf.split(X, y):
    # get_n_splits() returns n_splits * n_repeats
    print('\n{} of kfold {}'.format(i, kf.get_n_splits()))
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model = LogisticRegression(random_state=1)
    model.fit(X_train, y_train)
    pred_test = model.predict(X_test)
    score = metrics.accuracy_score(y_test, pred_test)
    print('accuracy_score', score)
    i += 1
# predicted probability of class 1 for the last split's test samples
pred = model.predict_proba(X_test)[:, 1]
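A small sketch showing how the repetitions are laid out, using 10 hypothetical dummy samples. Each block of n_splits consecutive splits is one full k-fold pass over the data.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.arange(10).reshape(-1, 1)  # 10 dummy samples (hypothetical data)

rkf = RepeatedKFold(n_splits=5, n_repeats=2, random_state=0)
splits = [test_idx for _, test_idx in rkf.split(X)]

total = rkf.get_n_splits()  # n_splits * n_repeats
# Within one repetition, the test folds together cover every sample once.
first_repeat = np.sort(np.concatenate(splits[:5]))
second_repeat = np.sort(np.concatenate(splits[5:]))

print("total splits:", total)
print("each repetition covers every sample once:",
      np.array_equal(first_repeat, np.arange(10)),
      np.array_equal(second_repeat, np.arange(10)))
```

Averaging the scores over all repetitions usually gives a steadier estimate than a single k-fold pass, at the cost of n times as many model fits.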