How to solve the problem of large distribution gap between training set and test set

2022-07-24 05:57:00 【Didi'cv】

StratifiedKFold

You can use StratifiedKFold from sklearn to perform K-fold cross-validation while splitting the data so that each fold preserves the proportion of each class in the labels. This helps when the classes are imbalanced.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
"""
@author: xcd
@file: StratifiedKFold-test.py
@time: 2021/1/26 10:14
@desc:
"""

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.array([
    [1, 2, 3, 4],
    [11, 12, 13, 14],
    [21, 22, 23, 24],
    [31, 32, 33, 34],
    [41, 42, 43, 44],
    [51, 52, 53, 54],
    [61, 62, 63, 64],
    [71, 72, 73, 74]
])

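# Imbalanced labels: six samples of class 1 and two of class 0.
# Class 0 has fewer members than n_splits=4 below, so scikit-learn warns
# and only two of the four test folds can contain a class-0 sample.
y = np.array([1, 1, 1, 1, 1, 1, 0, 0])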

sfolder = StratifiedKFold(n_splits=4, random_state=0, shuffle=True)
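# random_state only has an effect when shuffle=True; recent scikit-learn
# versions raise a ValueError if it is set while shuffle=False.
folder = KFold(n_splits=4, shuffle=False)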

for train, test in sfolder.split(X, y):
    print(train, test)

print("-------------------------------")
for train, test in folder.split(X, y):
    print(train, test)

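# Use the fold indices to build the training and validation subsets.
for fold, (train_idx, val_idx) in enumerate(sfolder.split(X, y)):
    train_set, val_set = X[train_idx], X[val_idx]
    train_labels, val_labels = y[train_idx], y[val_idx]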

The contrast with KFold is clear: KFold ignores the labels when splitting, whereas StratifiedKFold, which is used in much the same way, performs stratified sampling. It keeps the class proportions in each training and test split as close as possible to those in the original dataset.
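One way to check the stratification claim is to count the labels in each test fold. The sketch below is an illustration and not part of the original script: the 12-sample dataset and the names `X_demo`, `y_demo`, and `fold_counts` are assumptions, chosen so that both classes divide evenly across 4 folds.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 8 samples of class 1 and 4 of class 0 (a 2:1 ratio).
X_demo = np.arange(24).reshape(12, 2)
y_demo = np.array([1] * 8 + [0] * 4)

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

fold_counts = []
for train_idx, test_idx in skf.split(X_demo, y_demo):
    # np.bincount returns [count of class 0, count of class 1].
    counts = np.bincount(y_demo[test_idx], minlength=2)
    fold_counts.append(counts.tolist())
    print(counts)
```

Every test fold should hold one class-0 sample and two class-1 samples, matching the 2:1 ratio of the full dataset. Which samples land in which fold depends on the seed, but the per-class counts do not.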

###
Parameters

n_splits : int, default=5
Number of folds. Must be at least 2. (The default was 3 before scikit-learn 0.22.)

shuffle : bool, default=False
Whether to shuffle each class's samples before splitting into batches.

random_state : int, RandomState instance or None, default=None
If int, random_state is the seed used by the random number generator.
If RandomState instance, random_state is the random number generator.
If None, the random number generator is the RandomState instance used by `np.random`.
Only used when ``shuffle`` is True.
###
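The effect of `random_state` can be seen by splitting the same data twice with the same seed. This is a minimal sketch under assumed toy data. The helper `fold_test_indices` and the arrays `X_demo` and `y_demo` are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 8 samples of class 1 and 4 of class 0.
X_demo = np.arange(24).reshape(12, 2)
y_demo = np.array([1] * 8 + [0] * 4)


def fold_test_indices(seed):
    """Return the test indices of every fold for a given random_state."""
    skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=seed)
    return [test_idx.tolist() for _, test_idx in skf.split(X_demo, y_demo)]


# With shuffle=True and a fixed integer seed, the folds are reproducible.
same_seed_matches = fold_test_indices(0) == fold_test_indices(0)
print(same_seed_matches)
```

Fixing the seed is useful when comparing models, because every model is then evaluated on identical folds.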

Copyright notice
This article was written by [Didi'cv]. When reposting, please include a link to the original article:
https://yzsam.com/2022/205/202207240517104814.html
