[Machine learning notes] Several methods of splitting data into training and test sets
2022-07-05 13:49:00 【Weichi Begonia】
Problem description:
In general, we use 80% of the data as the training set and 20% as the test set (when the dataset is large enough, 10% can serve as the test set). When the dataset is small, if the split is re-randomized on every run, then after many training runs the model may have seen the complete dataset. This is what we want to avoid.
There are several solutions to this problem:
- Method 1: save the training set and test set generated by the first run, and in later training runs load the saved sets first instead of splitting again.
- Method 2: set random_state, so that the same test set is produced on every run.
- Method 3: both methods above still mix test data and training data once new data is added to the original dataset. So when the raw dataset keeps growing, you can turn a field of the dataset into an identifier, compute a hash of the identifier, and put the instance into the test set if its hash value < maximum hash value * 20%.
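The first approach (save the split once, then reload it) can be sketched as follows. This is a minimal illustration rather than code from the original notes: the helper name `load_or_create_split`, the file names `train.csv` and `test.csv`, and the toy DataFrame are all assumptions made for the example.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def load_or_create_split(data, test_ratio, directory):
    """Hypothetical helper: reuse a saved split if one exists, else create and save it."""
    train_path = os.path.join(directory, "train.csv")  # assumed file name
    holdout_path = os.path.join(directory, "test.csv")  # assumed file name

    if os.path.exists(train_path) and os.path.exists(holdout_path):
        # Later runs: load the split produced by the first run.
        return (pd.read_csv(train_path, index_col=0),
                pd.read_csv(holdout_path, index_col=0))

    # First run: shuffle the row positions and cut off the test portion.
    shuffled = np.random.permutation(len(data))
    n_holdout = int(len(data) * test_ratio)
    holdout_set = data.iloc[shuffled[:n_holdout]]
    train_set = data.iloc[shuffled[n_holdout:]]
    train_set.to_csv(train_path)
    holdout_set.to_csv(holdout_path)
    return train_set, holdout_set


# Toy data standing in for the real dataset.
data = pd.DataFrame({"x": range(100), "y": range(100, 200)})

with tempfile.TemporaryDirectory() as tmp:
    train_1, holdout_1 = load_or_create_split(data, 0.2, tmp)  # creates and saves
    train_2, holdout_2 = load_or_create_split(data, 0.2, tmp)  # loads from disk
    same_rows = sorted(holdout_1.index) == sorted(holdout_2.index)

print(same_rows)
```

Because the first run is not seeded here, the saved files are the only record of the split, so they need to be kept alongside the dataset.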
The code for method 2 is as follows:
from sklearn.model_selection import train_test_split
# Use 20% of the raw data as the test set. Setting random_state ensures that every
# training run produces the same training/test split. It can be any integer,
# as long as the same value is used on every run.
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)
Alternatively, use numpy to generate a randomly shuffled index sequence and split the test set and training set by hand:
import numpy as np

def split_train_test(data, test_ratio):
    np.random.seed(42)  # fix the seed so that the shuffle is the same on every run
    shuffled_indices = np.random.permutation(len(data))  # shuffled row positions, same length as the data
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
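The behaviour of the fixed-seed split can be checked with a small experiment. The sketch below repeats `split_train_test` so that it runs on its own, and uses a toy DataFrame with an `id` column (an assumption for the example). It first confirms that two runs on the same data give the same test set, then counts how many of the old test rows land in the training set after 20 rows are appended. That count depends on the shuffle, so no figure is quoted here.

```python
import numpy as np
import pandas as pd


def split_train_test(data, test_ratio):
    np.random.seed(42)
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]


# Toy dataset with 100 rows.
data = pd.DataFrame({"id": range(100)})
_, holdout_a = split_train_test(data, 0.2)
_, holdout_b = split_train_test(data, 0.2)

# Same seed and same data: the two test sets are identical.
print(holdout_a["id"].tolist() == holdout_b["id"].tolist())

# Append 20 new rows and split again with the same seed.
grown = pd.DataFrame({"id": range(120)})
train_c, holdout_c = split_train_test(grown, 0.2)

# Old test rows that now sit in the training set.
moved = set(holdout_a["id"]) & set(train_c["id"])
print(f"{len(moved)} of {len(holdout_a)} old test rows are now in the training set")
```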
The code for method 3 is as follows:
# Split the training set and the test set - method 3
# An instance that is currently in the test set must never move into the training set after new data is added.
# A unique identifier can be built from a stable feature, such as an id.
from zlib import crc32
import numpy as np

def test_set_check(identifier, test_ratio):
    # An instance goes into the test set if the CRC32 hash of its id falls in the lowest test_ratio share of the 32-bit range
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2 ** 32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]
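A usage sketch for the hash-based split follows. It is self-contained, so the check function is repeated under the hypothetical name `is_holdout_id`, and the toy DataFrames and the choice of the row index as the id column are assumptions for illustration. Using the row index as an id only stays stable if new rows are always appended at the end and existing rows are never deleted.

```python
from zlib import crc32

import numpy as np
import pandas as pd


def is_holdout_id(identifier, test_ratio):
    # Same check as the hash test above, renamed for this standalone sketch.
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2 ** 32


def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_holdout = ids.apply(lambda id_: is_holdout_id(id_, test_ratio))
    return data.loc[~in_holdout], data.loc[in_holdout]


# reset_index() exposes the row index as an "index" column, used here as the id.
original = pd.DataFrame({"value": np.arange(1000)}).reset_index()
grown = pd.DataFrame({"value": np.arange(1500)}).reset_index()

old_train, old_holdout = split_train_test_by_id(original, 0.2, "index")
new_train, new_holdout = split_train_test_by_id(grown, 0.2, "index")

# Every id that was in the old test set is still in the new test set.
print(set(old_holdout["index"]) <= set(new_holdout["index"]))

# Fraction of rows actually assigned to the test set.
print(len(old_holdout) / len(original))
```

The assignment depends only on each row's own id, which is why adding rows cannot move an existing row from one set to the other.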
Problem description:
Suppose we want to run a sampling survey and the ratio of men to women in the population is known to be 47% : 53%. The sample should then keep the same ratio, and the purely random sampling methods above cannot guarantee this. The solution is to divide the population into homogeneous subsets, each called a stratum, and then draw the right number of instances from each stratum so that the test set is representative of the whole population. This process is called stratified sampling.
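The survey example can be illustrated with a few lines of pandas. This is a minimal sketch under stated assumptions: the population size of 1000, the column names, and the 20% sampling fraction are made up for the example, and `groupby(...).sample` requires pandas 1.1 or later.

```python
import pandas as pd

# Toy population with the 47% : 53% ratio from the text (assumed size: 1000 people).
population = pd.DataFrame({
    "person_id": range(1000),
    "gender": ["male"] * 470 + ["female"] * 530,
})

# Draw 20% from each stratum (gender) separately.
sample = population.groupby("gender").sample(frac=0.2, random_state=42)

# Everything that was not drawn stays in the remainder.
remainder = population.drop(sample.index)

# Share of each gender in the sample.
print(sample["gender"].value_counts(normalize=True))
```

Since each stratum is sampled on its own, the sample keeps the population ratio by construction instead of by chance.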
# The following example predicts house prices. Income is strongly related to house prices,
# but income is a continuous value, so we first divide it into 5 categories.
# housing is assumed to be a pandas DataFrame that is already loaded and has a 'median_income' column.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.model_selection import StratifiedShuffleSplit

housing['income_cat'] = pd.cut(housing['median_income'], bins=[0., 1.5, 3.0, 4.5, 6., np.inf], labels=[1, 2, 3, 4, 5])  # divide the income data into 5 categories
housing['income_cat'].hist()  # draw a histogram
plt.show()

# n_splits=1 produces a single split, the test set size is 20%, and random_state=42
# ensures that the test set and training set are the same on every run
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing['income_cat']):  # stratify by income category
    # split() yields positional indices, so iloc is used to select the rows
    strat_train_set = housing.iloc[train_index]
    strat_test_set = housing.iloc[test_index]

# View the proportion of each income category in the test set
print('The proportion distribution of income categories by stratified sampling:')
print(strat_test_set['income_cat'].value_counts() / len(strat_test_set))
print('The proportion distribution of income categories in the original data:')
print(housing['income_cat'].value_counts() / len(housing))
The output shows that the proportion of each income category in the stratified test set closely matches the original data:
The proportion distribution of income categories by stratified sampling:
3    0.350533
2    0.318798
4    0.176357
5    0.114583
1    0.039729
Name: income_cat, dtype: float64
The proportion distribution of income categories in the original data:
3    0.350581
2    0.318847
4    0.176308
5    0.114438
1    0.039826
Name: income_cat, dtype: float64
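For a single stratified split, scikit-learn's `train_test_split` also accepts a `stratify` argument, which gives a shorter route to the same kind of result. The sketch below is an illustration, not part of the original notes: the `housing` DataFrame here is synthetic stand-in data (uniform random incomes), not the real housing dataset, so the proportions it prints will differ from the figures above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (assumption: not the real dataset).
rng = np.random.RandomState(0)
housing = pd.DataFrame({"median_income": rng.uniform(0.5, 15.0, size=2000)})
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
    labels=[1, 2, 3, 4, 5],
)

# stratify= keeps the income_cat proportions the same in both parts.
strat_train, strat_holdout = train_test_split(
    housing, test_size=0.2, random_state=42, stratify=housing["income_cat"]
)

overall = housing["income_cat"].value_counts(normalize=True).sort_index()
holdout = strat_holdout["income_cat"].value_counts(normalize=True).sort_index()

# Largest gap between the two distributions across the five categories.
print((overall - holdout).abs().max())
```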