[Machine learning notes] Several methods of splitting data into training and test sets
2022-07-05 13:49:00 【Weichi Begonia】
Problem description:
In general, we take 80% of the data as the training set and 20% as the test set (when the dataset is large enough, 10% can be enough for the test set). With a small dataset, if the split is redrawn at random on every run, then after many training runs the model may end up having seen the complete dataset, test rows included. This is what we want to avoid.
There are several solutions to this problem:
- Save the test set and training set generated by the first run, and load the saved sets in subsequent training runs.
- Set random_state so that the same test set is allocated every time.
- Both of the methods above still mix test data and training data once the original dataset grows. When the raw data keeps growing, you can turn a field of the dataset into an identifier, hash the identifier, and put the row into the test set if its hash value is less than 20% of the maximum hash value.
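For the first option in the list above, a minimal sketch might look like the following. The helper name `load_or_create_split`, the file names `train.csv` and `test.csv`, and the toy DataFrame are assumptions made for illustration and do not come from the original notes.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def load_or_create_split(data, split_dir, test_size=0.2):
    """Reuse a saved split if one exists; otherwise create it and save it."""
    split_dir = Path(split_dir)
    train_path = split_dir / "train.csv"  # hypothetical file names
    test_path = split_dir / "test.csv"
    if train_path.exists() and test_path.exists():
        return (pd.read_csv(train_path, index_col=0),
                pd.read_csv(test_path, index_col=0))
    train_set, test_set = train_test_split(data, test_size=test_size)
    split_dir.mkdir(parents=True, exist_ok=True)
    train_set.to_csv(train_path)
    test_set.to_csv(test_path)
    return train_set, test_set


# Toy data standing in for a real dataset
data = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) % 2})

with tempfile.TemporaryDirectory() as tmp:
    first_train, first_test = load_or_create_split(data, tmp)    # creates and saves
    second_train, second_test = load_or_create_split(data, tmp)  # loads from disk
    same_test = list(first_test.index) == list(second_test.index)

print(same_test)
```

No random_state is needed here, because the split is drawn only once and every later run reads the saved files.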
The code for method 2 is as follows:
from sklearn.model_selection import train_test_split

# Use 20% of the raw data as the test set. Setting random_state ensures the split
# is the same on every training run; any integer works, as long as the same value
# is used each time.
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)
Alternatively, use numpy to generate a random permutation of the row indices and split on it:
import numpy as np

def split_train_test(data, test_ratio):
    np.random.seed(42)  # fix the seed so the permutation is the same on every run
    shuffled_indices = np.random.permutation(len(data))  # shuffled indices, same length as the data
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
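The limitation named in the third bullet can be checked with a small experiment. The sketch below restates the permutation-based function so that it runs on its own; the toy DataFrames (100 rows, then 120 rows) are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def split_train_test(data, test_ratio):
    np.random.seed(42)
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]


data = pd.DataFrame({"value": np.arange(100)})

# Two runs on the same data pick the same test rows
_, test_a = split_train_test(data, 0.2)
_, test_b = split_train_test(data, 0.2)
same_split = test_a.index.equals(test_b.index)

# After 20 new rows arrive, the permutation is drawn over 120 rows,
# so rows that used to be test rows can land in the new training set
grown = pd.DataFrame({"value": np.arange(120)})
new_train, _ = split_train_test(grown, 0.2)
leaked = test_a.index.intersection(new_train.index)

print("same split on unchanged data:", same_split)
print("old test rows now in the training set:", len(leaked))
```

A fixed seed only guarantees a stable split while the number of rows stays the same, which is why the hash-based method is needed for growing datasets.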
The code for method 3 is as follows:
# Split into training and test sets - method 3
# Rows currently selected for the test set must stay in the test set after new
# data is added, and never move into the training set.
# The unique identifier can be built from a stable feature, such as id.
from zlib import crc32
import numpy as np

def test_set_check(identifier, test_ratio):
    # Hash the id to 32 bits; the row is a test row if the hash falls in the lowest test_ratio share
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2 ** 32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]
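A short usage sketch can show why this split stays stable as data is added. The helpers are restated so the block runs standalone, with the check renamed `is_test_id` (a name chosen for this sketch); the toy DataFrames with an integer `id` column are assumptions, not data from the original notes.

```python
from zlib import crc32

import numpy as np
import pandas as pd


def is_test_id(identifier, test_ratio):
    # Same check as above: hash the id, compare against test_ratio of the 32-bit range
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2 ** 32


def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: is_test_id(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]


# Original data: ids 0..999
data = pd.DataFrame({"id": np.arange(1000), "value": np.arange(1000) * 2})
_, test_before = split_train_test_by_id(data, 0.2, "id")

# Grown data: the same rows plus 200 new ids
grown = pd.DataFrame({"id": np.arange(1200), "value": np.arange(1200) * 2})
train_after, test_after = split_train_test_by_id(grown, 0.2, "id")

old_test_ids = set(test_before["id"])
still_in_test = old_test_ids.issubset(set(test_after["id"]))
leaked = old_test_ids & set(train_after["id"])

print("old test rows still in the test set:", still_in_test)
print("old test rows leaked into training:", len(leaked))
```

Each row's assignment depends only on its own id, so adding rows cannot move an existing row between sets. The test share is only approximately test_ratio, and the identifier must never change for an existing row.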
Problem description:
Suppose we run a sample survey and the known ratio of men to women in the population is 47% : 53%. The sample should keep the same ratio, and the purely random sampling methods above do not guarantee this. Instead, the population is divided into homogeneous subgroups, each called a stratum (layer), and the right number of instances is drawn from each stratum so that the test set is representative of the overall population. This process is called stratified sampling.
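The survey example can be sketched with the `stratify` parameter of scikit-learn's `train_test_split`. The population below (470 men and 530 women, with a made-up `person_id` column) is a hypothetical stand-in that matches the 47% : 53% ratio in the text.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical survey population: 470 men and 530 women
population = pd.DataFrame({
    "person_id": np.arange(1000),
    "gender": ["male"] * 470 + ["female"] * 530,
})

# stratify= makes the split keep the gender ratio in both parts
train_set, test_set = train_test_split(
    population,
    test_size=0.2,
    stratify=population["gender"],
    random_state=42,
)

print(test_set["gender"].value_counts(normalize=True))
```

With 200 test rows, the stratified split draws 94 men and 106 women, which reproduces the 47% : 53% ratio of the population.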
# The following example predicts house prices. Income is strongly related to house
# prices, but it is a continuous value, so we first divide it into 5 categories.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.model_selection import StratifiedShuffleSplit

# housing is assumed to be an already loaded DataFrame with a median_income column
housing['income_cat'] = pd.cut(housing['median_income'], bins=[0., 1.5, 3.0, 4.5, 6., np.inf], labels=[1, 2, 3, 4, 5])  # 5 income categories
housing['income_cat'].hist()  # draw a histogram
plt.show()

# n_splits=1 produces a single split; the test set size is 20%; random_state=42 keeps
# the training and test sets unchanged across training runs
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing['income_cat']):  # stratify by income category
    # split() yields positional indices, so select rows with iloc
    strat_train_set = housing.iloc[train_index]
    strat_test_set = housing.iloc[test_index]

# Compare the income category proportions in the test set and in the original data
print('The proportion distribution of income categories by stratified sampling:')
print(strat_test_set['income_cat'].value_counts() / len(strat_test_set))
print('The proportion distribution of income categories in the original data:')
print(housing['income_cat'].value_counts() / len(housing))
The output shows that the proportion of each income category in the stratified test set is almost identical to that in the original data:
The proportion distribution of income categories by stratified sampling:
3 0.350533
2 0.318798
4 0.176357
5 0.114583
1 0.039729
Name: income_cat, dtype: float64
The proportion distribution of income categories in the original data:
3 0.350581
2 0.318847
4 0.176308
5 0.114438
1 0.039826
Name: income_cat, dtype: float64
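To see how a purely random split compares, the sketch below puts both approaches side by side. It uses synthetic lognormal incomes as a stand-in for the housing data (an assumption, since the real dataset is not included in these notes), so its numbers are illustrative only and will differ from the output above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (not the real dataset)
rng = np.random.default_rng(0)
housing = pd.DataFrame(
    {"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=5000)}
)
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

# Purely random split versus a split stratified by income category
_, random_test = train_test_split(housing, test_size=0.2, random_state=42)
_, strat_test = train_test_split(
    housing, test_size=0.2, stratify=housing["income_cat"], random_state=42
)


def proportions(frame):
    """Share of rows in each income category, ordered by category."""
    return frame["income_cat"].value_counts(normalize=True).sort_index()


comparison = pd.DataFrame({
    "overall": proportions(housing),
    "random": proportions(random_test),
    "stratified": proportions(strat_test),
})
comparison["random_abs_err"] = (comparison["random"] - comparison["overall"]).abs()
comparison["strat_abs_err"] = (comparison["stratified"] - comparison["overall"]).abs()
print(comparison)
```

The stratified column can differ from the overall proportions only by rounding (at most about one row per category), while the random column has no such guarantee.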