Introduction to Resampling
2022-07-05 18:13:00 【Dreamer DBA】
Data sampling refers to statistical methods for selecting observations from the domain with the objective of estimating a population parameter, whereas data resampling refers to methods for economically using a collected dataset to improve the estimate of the population parameter and to help quantify the uncertainty of the estimate. Both data sampling and data resampling are methods that are required in a predictive modeling problem.
- Sampling is an active process of gathering observations with the intent of estimating a population variable.
- Resampling is a methodology of economically using a data sample to improve the accuracy and quantify the uncertainty of a population parameter.
- Resampling methods in fact make use of a nested resampling method.
1.1 Statistical Sampling
Observations made in a domain represent samples of some broader, idealized, and unknown population of all possible observations that could be made in the domain.
Sampling consists of selecting some part of the population to observe so that one may estimate something about the whole population.
1.1.1 How to Sample
Some aspects to consider prior to collecting a data sample include:
- Sample Goal
- Population
- Selection Criteria
- Sample Size
Statistical sampling is a large field of study, but in applied machine learning there are three types of sampling that you are likely to use: simple random sampling, systematic sampling, and stratified sampling.
- Simple Random Sampling: Samples are drawn with a uniform probability from the domain.
- Systematic Sampling: Samples are drawn using a pre-specified pattern, such as at intervals.
- Stratified Sampling: Samples are drawn within pre-specified categories.
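The three sampling types above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a reference implementation: the function names, the toy population of the integers 0 to 99, and the even/odd strata are all assumptions made for the example.

```python
import random


def simple_random_sample(population, n, seed=0):
    """Draw n items without replacement, each item equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)


def systematic_sample(population, step, start=0):
    """Take every `step`-th item, beginning at index `start`."""
    return population[start::step]


def stratified_sample(population, key, fraction, seed=0):
    """Draw the same fraction from each category returned by `key`."""
    rng = random.Random(seed)
    strata = {}
    for item in population:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        n = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, n))
    return sample


# Hypothetical toy population: the integers 0..99.
population = list(range(100))

srs = simple_random_sample(population, 10)
sys_s = systematic_sample(population, step=10)
# Strata are assumed to be "even" and "odd" purely for illustration.
strat = stratified_sample(population, key=lambda x: x % 2, fraction=0.1)
```

In the stratified case each category contributes in proportion to its size, so the even/odd balance of the population is preserved in the sample.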
1.1.2 Sampling Errors
The two main types of error are selection bias and sampling error.
- Selection Bias: Caused when the method of drawing observations skews the sample in some way.
- Sampling Error: Caused by the random nature of drawing observations, which skews the sample in some way.
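A small simulation can make the two error types concrete. The sketch below assumes a synthetic population drawn from a Gaussian with mean 50 and standard deviation 10; the population, the sample sizes, and the "only observations above 50" selection rule are hypothetical choices for illustration.

```python
import random
import statistics

rng = random.Random(1)

# Synthetic population (an assumption for this example).
population = [rng.gauss(50, 10) for _ in range(10_000)]
true_mean = statistics.mean(population)


def mean_estimates(sample_size, repeats=200):
    """Estimate the mean from many independent random samples."""
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(repeats)
    ]


# Sampling error: estimates vary from sample to sample,
# and the variation shrinks as the sample size grows.
spread_small = statistics.stdev(mean_estimates(10))
spread_large = statistics.stdev(mean_estimates(1000))

# Selection bias: the drawing method itself skews the sample.
# Here only observations above 50 can ever be selected.
biased_pool = [x for x in population if x > 50]
biased_mean = statistics.mean(rng.sample(biased_pool, 1000))

print(f"true mean:            {true_mean:.2f}")
print(f"spread with n=10:     {spread_small:.2f}")
print(f"spread with n=1000:   {spread_large:.2f}")
print(f"biased-sample mean:   {biased_mean:.2f}")
```

A larger sample reduces sampling error, but it does nothing for selection bias: the biased estimate stays far from the true mean no matter how many observations are drawn from the skewed pool.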
1.1.3 Statistical Resampling
Statistical resampling methods are procedures that describe how to economically use available data to estimate a population parameter. Resampling methods are very easy to use, requiring little mathematical knowledge. They are easy to understand and implement compared to specialized statistical methods that may require deep technical skill to select and interpret.
Two commonly used resampling methods that you may encounter are k-fold cross-validation and the bootstrap.
- Bootstrap. Samples are drawn from the dataset with replacement, where those instances not drawn into the data sample may be used for the test set.
- k-fold Cross-Validation. A dataset is partitioned into k groups, where each group is given the opportunity to be used as the held-out test set while the remaining groups are used as the training set.
The k-fold cross-validation method specifically lends itself to use in the evaluation of predictive models that are repeatedly trained on one subset of the data and evaluated on a second held-out subset of the data.
Generally, resampling techniques for estimating model performance operate similarly: a subset of samples are used to fit a model and the remaining samples are used to estimate the efficacy of the model. This process is repeated multiple times and the results are aggregated and summarized. The differences in techniques usually center around the method in which subsamples are chosen.
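The fit, evaluate, and aggregate loop described above can be sketched for k-fold cross-validation as follows. This is a minimal illustration under stated assumptions: the "model" is simply the mean of the training targets, the score is mean absolute error, and the helper names and synthetic data are hypothetical.

```python
import random
import statistics


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and partition them into k groups."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(y, k=5):
    """Return one held-out score per fold."""
    folds = k_fold_indices(len(y), k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Every group except the i-th forms the training set.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        # "Fit": the placeholder model predicts the training mean.
        prediction = statistics.mean(y[j] for j in train_idx)
        # "Evaluate": mean absolute error on the held-out group.
        mae = statistics.mean(abs(y[j] - prediction) for j in test_idx)
        scores.append(mae)
    return scores


# Synthetic targets (an assumption for this example).
rng = random.Random(42)
y = [rng.gauss(10, 2) for _ in range(100)]

scores = cross_validate(y, k=5)
# Aggregate and summarize the per-fold results.
print(f"MAE: {statistics.mean(scores):.3f} (+/- {statistics.stdev(scores):.3f})")
```

Each observation appears in exactly one test group, so every data point is used for evaluation once and for training k-1 times.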
The bootstrap method can be used for the same purpose, but is a more general and simpler method intended for estimating a population parameter.
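As a rough sketch of the bootstrap used to estimate a population parameter, the example below resamples a dataset with replacement, collects the mean of each resample, and reads an approximate 95% percentile interval from the results. The synthetic data, the number of resamples, the helper names, and the simple percentile rule are assumptions made for illustration.

```python
import random
import statistics


def bootstrap_split(n, rng):
    """Draw n indices with replacement; the rest are out-of-bag."""
    drawn = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(drawn))
    return drawn, out_of_bag


def bootstrap_means(data, n_resamples=1000, seed=0):
    """Mean of each bootstrap resample of `data`."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        drawn, _ = bootstrap_split(len(data), rng)
        means.append(statistics.mean(data[i] for i in drawn))
    return means


def percentile_interval(values, alpha=0.05):
    """Approximate (1 - alpha) interval from sorted resample statistics."""
    ordered = sorted(values)
    lo = ordered[int(len(ordered) * alpha / 2)]
    hi = ordered[int(len(ordered) * (1 - alpha / 2)) - 1]
    return lo, hi


# Synthetic dataset (an assumption for this example).
rng = random.Random(7)
data = [rng.gauss(100, 15) for _ in range(50)]

means = bootstrap_means(data)
lo, hi = percentile_interval(means)
print(f"sample mean: {statistics.mean(data):.2f}")
print(f"approx. 95% interval: [{lo:.2f}, {hi:.2f}]")

# The out-of-bag indices are the instances not drawn into a resample;
# they could serve as a test set when evaluating a model.
drawn, out_of_bag = bootstrap_split(len(data), random.Random(0))
```

The spread of the resampled means is what quantifies the uncertainty of the estimate; no distributional formula is required.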