[Machine Learning Notes] Several Methods of Splitting Data into Training Sets and Test Sets
2022-07-05 13:49:00 [Weichi Begonia]
Problem description:
In general, we are used to taking 80% of the data as the training set and 20% as the test set (when the dataset is large enough, 10% can also serve as the test set). With a small dataset, if the training set is randomly re-split on every run, then after many runs the model may eventually have seen the entire dataset. This is what we want to avoid.
There are several solutions to this problem:
- Save the test set and training set generated on the first run, and in subsequent training runs load the saved training and test data instead of re-splitting
- Set random_state, which guarantees that the same test set is produced each time
- Both of these methods still mix test data into the training data once the original dataset grows. So when the raw dataset keeps growing, you can derive an identifier from a column of the dataset, convert that identifier to a hash value, and put a row into the test set if its hash value is < maximum hash value * 20%
The code for Method 2 is as follows:
from sklearn.model_selection import train_test_split

# test_size=0.2 makes the test set 20% of the original data.
# Setting random_state guarantees that every run produces the same
# train/test split; it can be any integer, as long as the same value
# is used on each run.
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)
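To see the effect of a fixed random_state, here is a minimal sketch on a hypothetical toy DataFrame (the `data` variable here is stand-in data, not the article's dataset): calling train_test_split twice with the same seed yields the identical split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset (hypothetical data).
data = pd.DataFrame({'x': np.arange(10)})

train_a, test_a = train_test_split(data, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(data, test_size=0.2, random_state=42)

# Same random_state -> identical split on every call.
print(test_a.index.tolist() == test_b.index.tolist())  # True
print(len(test_a))  # 2 rows, i.e. 20% of 10
```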
Alternatively, use numpy to generate a random permutation of the indices and split the test set and training set manually:
import numpy as np

def split_train_test(data, test_ratio):
    np.random.seed(42)
    # Generate a shuffled index array the same length as the original data
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
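A quick sketch on hypothetical toy data: because the seed is reset inside the function, every call over the same data reproduces the same split. (Note the limitation from the bullet list above still applies: once new rows are appended, the permutation is recomputed over the longer index, so test membership can change.)

```python
import numpy as np
import pandas as pd

def split_train_test(data, test_ratio):
    # Reset the seed so the permutation is identical on every call.
    np.random.seed(42)
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

# Hypothetical toy dataset.
data = pd.DataFrame({'x': range(10)})
_, test_a = split_train_test(data, 0.2)
_, test_b = split_train_test(data, 0.2)

# Same seed, same data -> the same rows land in the test set.
print(test_a.index.tolist() == test_b.index.tolist())  # True
```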
The code for Method 3 is as follows:
# Split the training set and test set - Method 3
# Rows currently in the test set are guaranteed to stay out of the
# training set even after new data is added.
# Build the unique identifier from a stable feature, such as an id column.
from zlib import crc32
import numpy as np

def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2 ** 32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]
Problem description:
Suppose we want to run a sampling survey, and the male-to-female ratio in the population is known to be 47% : 53%. The sampled data then needs to preserve this ratio, and the purely random splitting methods above clearly cannot guarantee it. Dividing the population into homogeneous subsets, each called a stratum, and then drawing the right number of instances from each stratum so that the test set is representative of the overall population, is a process called stratified sampling.
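The 47% : 53% survey example can be sketched directly with scikit-learn's StratifiedShuffleSplit on a hypothetical population (the DataFrame below is made-up stand-in data): the 20% sample preserves the population's sex ratio.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical population with the 47% : 53% male/female ratio from the text.
people = pd.DataFrame({'sex': ['M'] * 470 + ['F'] * 530})

# One stratified split with a 20% sample, reproducible via random_state.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(people, people['sex']))
sample = people.iloc[test_idx]

# The 200-row sample preserves the population ratio almost exactly.
print(sample['sex'].value_counts(normalize=True))
```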
# Example: predicting house prices. Income is known to be strongly
# correlated with house prices, but income is a continuous value,
# so we first bin the income data into 5 categories.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.model_selection import StratifiedShuffleSplit

housing['income_cat'] = pd.cut(housing['median_income'],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])  # bin income into 5 categories
housing['income_cat'].hist()  # draw a histogram
plt.show()

# n_splits=1 produces a single split; the test set is 20% of the data;
# random_state=42 keeps the generated training and test sets identical
# on every run.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing['income_cat']):  # stratify by income category
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

# Inspect the distribution of income categories in the test set
print('Income category proportions under stratified sampling:')
print(strat_test_set['income_cat'].value_counts() / len(strat_test_set))
print('Income category proportions in the original data:')
print(housing['income_cat'].value_counts() / len(housing))
As the output shows, the proportion of each income category after stratified sampling is consistent with the original data:
Income category proportions under stratified sampling:
3 0.350533
2 0.318798
4 0.176357
5 0.114583
1 0.039729
Name: income_cat, dtype: float64
Income category proportions in the original data:
3 0.350581
2 0.318847
4 0.176308
5 0.114438
1 0.039826
Name: income_cat, dtype: float64