Data set partitioning and cross-validation
2022-07-31 05:32:00 【Erosion_ww】
Dataset partitioning
Training set, validation set, test set
- The training set is used to build the model.
- The validation set is used to tune the model while it is being built; it assists model construction and can be reused. (When a validation set is used, a portion of the training data is held back as the validation set and used to check the model during training.)
- The test set is used to test the final model and evaluate its accuracy.
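The three roles above can be sketched with two successive calls to scikit-learn's train_test_split. This is only a minimal illustration: the split sizes (30 test samples, 30 validation samples), the stratify arguments and the random_state values are assumptions chosen for the example, not something the text prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples

# Step 1: set aside the test set (30 samples, an illustrative size).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=30, random_state=0, stratify=y
)

# Step 2: reserve part of the remaining training data as the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=30, random_state=0, stratify=y_rest
)

print(len(X_train), len(X_val), len(X_test))
```

The validation set can be reused while tuning, whereas the test set is only touched once the model is final.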
Hold-out method
By default (for example with scikit-learn's train_test_split), 75% of the dataset is used as the training set and 25% as the test set.
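A minimal hold-out sketch, assuming the iris data and a LogisticRegression model as in the rest of this post. The random_state value and max_iter=1000 (set only to avoid convergence warnings) are choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# No test_size is given, so the default 25% test split is used.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(len(X_train), len(X_test))

# Fit on the training part only, then evaluate once on the held-out part.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
holdout_accuracy = model.score(X_test, y_test)
print("hold-out accuracy:", holdout_accuracy)
```

On the 150-sample iris data this yields 112 training samples and 38 test samples, because the test size is rounded up.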
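Cross-validation
1. Leave-one-out cross-validation
Split a dataset of n samples into k subsets with k equal to n, so that each subset holds exactly one sample. In each round a single sample is used as the test set and all remaining samples form the training set. Because almost all of the data is used for training in every round, the result is the closest to what training on the entire dataset would give, but the computational cost is very high, so the method is best suited to small datasets.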
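As a rough sketch of the procedure described above, scikit-learn's LeaveOneOut splitter can be passed to cross_val_score. The model and max_iter=1000 are assumptions for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

loo = LeaveOneOut()
n_rounds = loo.get_n_splits(X)  # one round per sample

model = LogisticRegression(max_iter=1000)
# Each round tests on a single sample, so every score is either 0.0 or 1.0.
scores = cross_val_score(model, X, y, cv=loo)

print("rounds:", n_rounds)
print("mean accuracy:", scores.mean())
```

The model is refitted once per sample, which is why the cost grows quickly with dataset size.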
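2. K-fold cross-validation
Split the dataset into k subsets. In each round, k-1 subsets are used as the training set and the remaining subset as the test set. This is repeated k times, and the average accuracy over the k rounds is taken as the result. For comparison, train_test_split uses a default training-to-test ratio of 3:1. With 5-fold cross-validation the ratio is 4:1, and with 10-fold cross-validation it is 9:1. In general, the more data available for training, the higher the model accuracy tends to be.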
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn import metrics

data = load_iris()  # load the iris dataset
X = data.data
y = data.target

kf = KFold(n_splits=5, random_state=None)  # 5-fold cross-validation, no shuffling
i = 1
for train_index, test_index in kf.split(X, y):
    print('\n{} of kfold {}'.format(i, kf.n_splits))
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Fit a fresh model on this fold's training part
    model = LogisticRegression(random_state=1)
    model.fit(X_train, y_train)
    # Evaluate on this fold's held-out part
    pred_test = model.predict(X_test)
    score = metrics.accuracy_score(y_test, pred_test)
    print('accuracy_score', score)
    i += 1
# Predicted probability of class index 1 for the last fold's test samples
pred = model.predict_proba(X_test)[:, 1]
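The 4:1 and 9:1 ratios mentioned for 5-fold and 10-fold cross-validation can be checked directly by counting the indices that KFold returns. This is a small sketch; shuffle=True and random_state=0 are assumptions for the example and do not affect the fold sizes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)  # 150 samples

fold_sizes = {}
for k in (5, 10):
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    # Look at the first split only; all folds have the same size here
    # because 150 is divisible by both 5 and 10.
    train_idx, test_idx = next(kf.split(X))
    fold_sizes[k] = (len(train_idx), len(test_idx))
    print("k =", k, "train/test sizes:", fold_sizes[k])
```

With 150 samples, 5 folds give 120 training and 30 test samples (4:1), and 10 folds give 135 and 15 (9:1).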
3. Stratified cross-validation
Stratification rearranges the data so that each fold is a better representation of the whole.
Take a binary classification problem in which the raw data has two classes (F and M) and the ratio of F to M samples is approximately 1:3. If the data is split into 5 folds, the proportion of F to M in every fold stays the same as in the original data (1:3).
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

data = load_iris()
X = data.data
y = data.target

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# Inspect the indices that end up in each fold
for train_index, test_index in skf.split(X, y):
    print("Train:", train_index, "Validation:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# cross_val_score performs the fit/score loop over the same splitter
model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=skf)
print("stratified cross validation scores:{}".format(scores))
print("Mean score of stratified cross validation:{:.2f}".format(scores.mean()))
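To make the F/M example concrete, the sketch below builds a synthetic label array with 25 F and 75 M samples (an assumed dataset, invented purely for illustration) and counts the classes in each test fold produced by StratifiedKFold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 25 "F" and 75 "M", i.e. a 1:3 ratio.
y = np.array(["F"] * 25 + ["M"] * 75)
# Dummy features; the stratified split depends only on the labels.
X = np.zeros((len(y), 1))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_counts = []
for _, test_idx in skf.split(X, y):
    n_f = int(np.sum(y[test_idx] == "F"))
    n_m = int(np.sum(y[test_idx] == "M"))
    fold_counts.append((n_f, n_m))

print(fold_counts)
```

Every test fold contains 5 F and 15 M samples, so the 1:3 proportion of the full dataset is preserved in each fold.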
4. Repeated cross-validation
This simply repeats k-fold n times, with a different random split in each repetition.
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn import metrics

data = load_iris()
X = data.data
y = data.target

# 5 folds repeated 2 times = 10 train/test rounds in total
kf = RepeatedKFold(n_splits=5, n_repeats=2, random_state=None)
# Inspect the indices used in each round
for train_index, test_index in kf.split(X):
    print("Train:", train_index, "Validation:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# Fit and score a model in each round
# (with random_state=None these splits differ from the ones printed above)
i = 1
for train_index, test_index in kf.split(X, y):
    print('\n{} of kfold {}'.format(i, kf.get_n_splits()))
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model = LogisticRegression(random_state=1)
    model.fit(X_train, y_train)
    pred_test = model.predict(X_test)
    score = metrics.accuracy_score(y_test, pred_test)
    print('accuracy_score', score)
    i += 1
# Predicted probability of class index 1 for the last round's test samples
pred = model.predict_proba(X_test)[:, 1]
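The same repeated scheme can also be written more compactly by handing the RepeatedKFold splitter to cross_val_score. This is a sketch under assumptions: random_state=0 is fixed here only so the splits are reproducible, and max_iter=1000 is set only to avoid convergence warnings.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 5 folds, repeated 2 times with a different shuffle in each repetition.
rkf = RepeatedKFold(n_splits=5, n_repeats=2, random_state=0)
total_rounds = rkf.get_n_splits()

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=rkf)  # one score per round

print("rounds:", total_rounds)
print("mean accuracy over all rounds:", scores.mean())
```

Averaging over all 10 rounds reduces the dependence of the estimate on any single random partition.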