[Machine learning notes] Several methods of splitting data into training sets and test sets
2022-07-05 13:49:00 【Weichi Begonia】
Problem description:
In general, we take 80% of the data as the training set and 20% as the test set (when the dataset is large enough, 10% can also serve as the test set). With a small dataset, if the split is re-randomized on every run, then after many training runs the model will effectively have seen the complete dataset, which is exactly what we want to avoid.
There are several solutions to this problem:
- Save the training set and test set generated on the first run, and load them in subsequent training runs
- Set random_state so that the split produces the same test set every time
- Both of these methods will still mix test data into the training data once the original dataset grows. So when the dataset keeps growing, take a field of the dataset as an identifier, hash it, and put an instance into the test set if its hash value is < maximum hash value * 20% (see method 3 below)
The code for method 2 is as follows:
from sklearn.model_selection import train_test_split

# Make the test set 20% of the original data. Setting random_state guarantees
# that every run produces the same train/test split; any integer works, as long
# as the same value is used on each run.
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)
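As a quick check of the reproducibility claim, here is a minimal sketch (assuming data is a pandas DataFrame that is already loaded; train_test_split is the import from above):

# Two calls with the same random_state select exactly the same rows
a_train, a_test = train_test_split(data, test_size=0.2, random_state=42)
b_train, b_test = train_test_split(data, test_size=0.2, random_state=42)
assert a_test.index.equals(b_test.index)  # identical test sets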
Alternatively, use NumPy to generate a shuffled index sequence and split the test set and training set manually:
import numpy as np

def split_train_test(data, test_ratio):
    np.random.seed(42)  # fix the seed so the shuffle is reproducible across runs
    shuffled_indices = np.random.permutation(len(data))  # shuffled index array of the same length as the data
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
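A minimal usage sketch (assuming data is a pandas DataFrame, as in the snippet above):

train_set, test_set = split_train_test(data, 0.2)
print(len(train_set), 'train +', len(test_set), 'test')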
The code for method 3 is as follows:
# Split the training set and test set - method 3
# Instances currently in the test set stay there even after new data is added,
# guaranteeing they are never later selected for training.
# A unique identifier can be built from a stable feature, such as an id.
from zlib import crc32
import numpy as np

def test_set_check(identifier, test_ratio):
    # Put an instance in the test set if the hash of its id falls into the
    # lowest test_ratio fraction of the 32-bit hash range
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2 ** 32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]
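A hedged usage sketch for this method: data is assumed to have no natural id column, so the row index is turned into one. Note the row index is only a stable identifier if new data is always appended to the end and no row is ever deleted:

data_with_id = data.reset_index()  # adds an `index` column to serve as the id
train_set, test_set = split_train_test_by_id(data_with_id, 0.2, 'index')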
Problem description:
Suppose we want to run a sampling survey, and the male-to-female ratio of the overall population is known to be 47% : 53%. The proportions of men and women in our sample then need to preserve this ratio, which the purely random splits above clearly cannot guarantee. Instead, divide the population into homogeneous subsets, each called a stratum, and draw the right number of instances from each stratum so that the test set is representative of the overall population. This process is called stratified sampling.
# For example, in the house-price prediction task below, income is known to be
# strongly correlated with house prices. Income is a continuous value, so we
# first bucket it into 5 categories.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.model_selection import StratifiedShuffleSplit

# Bucket the income data into 5 categories
housing['income_cat'] = pd.cut(housing['median_income'], bins=[0., 1.5, 3.0, 4.5, 6., np.inf], labels=[1, 2, 3, 4, 5])
housing['income_cat'].hist()  # draw a histogram of the category distribution
plt.show()

# n_splits=1 produces a single split; the test set is 20% of the data, and
# random_state=42 keeps the generated training and test sets identical across runs
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing['income_cat']):  # stratify by income category
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

# Inspect the income category proportions in the test set
print('Income category proportions under stratified sampling:')
print(strat_test_set['income_cat'].value_counts() / len(strat_test_set))
print('Income category proportions in the original data:')
print(housing['income_cat'].value_counts() / len(housing))
As the output shows, the proportion of each income category after stratified sampling closely matches the original data:
Income category proportions under stratified sampling:
3 0.350533
2 0.318798
4 0.176357
5 0.114583
1 0.039729
Name: income_cat, dtype: float64
Income category proportions in the original data:
3 0.350581
2 0.318847
4 0.176308
5 0.114438
1 0.039826
Name: income_cat, dtype: float64
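One common follow-up step, shown here as a hedged sketch rather than part of the original post: income_cat exists only to stratify the split, so it can be dropped afterwards to restore the data to its original form.

# income_cat served only as the stratification key; remove it from both sets
for set_ in (strat_train_set, strat_test_set):
    set_.drop('income_cat', axis=1, inplace=True)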