Data processing methods: the SMOTE series and ADASYN
2022-07-06 04:09:00 【Code Taoist】
Introduction
An imbalanced dataset is one in which the sample counts of the classes differ drastically. Take binary classification as an example: suppose the positive class has far more samples than the negative class. When the ratio of majority to minority samples approaches 100:1, the data is usually called imbalanced. Learning from imbalanced data means extracting useful information from such an unevenly distributed dataset.
Methods for handling imbalanced datasets fall into two main groups:
1. Data-level methods. The main technique is sampling, which covers undersampling, oversampling, and various improvements on both.
2. Algorithm-level methods. The algorithm is optimized to account for the different costs of different misclassification cases, mainly through cost-sensitive learning. A representative algorithm is AdaCost.
In addition, an imbalanced-data problem can be treated as a one-class learning or novelty detection problem. A representative algorithm is the One-class SVM.
SMOTE series
SMOTE
SMOTE (Synthetic Minority Oversampling Technique) is an oversampling algorithm that improves on random oversampling. First, select a sample $x_i$ from the minority class. Second, according to the sampling rate N, randomly select N samples $x_{zi}$ from the K nearest neighbors of $x_i$. Finally, synthesize a new sample between each $x_{zi}$ and $x_i$ with the following formula:
$$x_n = x_i + \beta \left(x_{zi} - x_i\right)$$
where $\beta \in [0,1]$ is a random number.
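The interpolation step can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function name `smote`, the parameters `n_per_sample`, `k` and `seed`, and the toy data are all assumptions made for the example, and neighbors are drawn with replacement.

```python
import math
import random

def smote(minority, n_per_sample=2, k=3, seed=0):
    """Generate n_per_sample synthetic points for every minority sample."""
    rng = random.Random(seed)
    synthetic = []
    for i, x_i in enumerate(minority):
        # K nearest minority neighbors of x_i (Euclidean), excluding x_i itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(x_i, p))[:k]
        for _ in range(n_per_sample):
            x_zi = rng.choice(neighbors)  # random neighbor (with replacement)
            beta = rng.random()           # random number in [0, 1)
            # x_n = x_i + beta * (x_zi - x_i), applied per coordinate
            synthetic.append([a + beta * (b - a) for a, b in zip(x_i, x_zi)])
    return synthetic

# Toy minority class (hypothetical data)
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority)
print(len(new_points))
```

Because every synthetic point lies on the segment between two existing minority samples, the new points stay inside the region already occupied by the minority class.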
Paper
SMOTE: Synthetic Minority Over-sampling Technique
Borderline SMOTE
Borderline SMOTE is an oversampling algorithm that improves on SMOTE. It synthesizes new samples only from the minority-class samples that lie on the class boundary, which improves the class distribution of the data.
Borderline SMOTE first divides the minority samples into three groups, Safe, Danger and Noise:
- Safe: more than half of the surrounding samples belong to the minority class.
- Danger: more than half of the surrounding samples belong to the majority class. These are treated as samples on the boundary.
- Noise: all of the surrounding samples belong to the majority class. These are treated as noise (point C in the figure).
Finally, only the minority samples labeled Danger are oversampled.
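The three-way split can be sketched as follows. This is an illustrative sketch only: the function name `classify_minority`, the choice `k=3` and the toy points are assumptions, "surrounding samples" is taken to mean the k nearest neighbors in the whole dataset, and with an even `k` an exact tie would fall into Safe here, which is also an assumption.

```python
import math

def classify_minority(minority, majority, k=3):
    """Label each minority sample as 'Safe', 'Danger' or 'Noise'."""
    # Minority points come first, so index i in `minority` is index i in `labeled`
    labeled = [(p, "min") for p in minority] + [(p, "maj") for p in majority]
    labels = []
    for i, x in enumerate(minority):
        others = [pc for j, pc in enumerate(labeled) if j != i]
        nearest = sorted(others, key=lambda pc: math.dist(x, pc[0]))[:k]
        n_maj = sum(1 for _, c in nearest if c == "maj")
        if n_maj == k:
            labels.append("Noise")    # surrounded only by majority samples
        elif n_maj > k / 2:
            labels.append("Danger")   # more than half are majority samples
        else:
            labels.append("Safe")     # minority samples dominate the neighborhood
    return labels

# Toy data (hypothetical): a minority cluster, one boundary point, one isolated point
minority = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (1.0, 1.0), (5.0, 5.0)]
majority = [(1.2, 1.0), (1.0, 1.2), (5.2, 5.0), (5.0, 5.2), (4.8, 5.0), (3.0, 3.0)]
labels = classify_minority(minority, majority)
print(labels)
```

Only the points labeled Danger would then be passed to the SMOTE interpolation step.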
Paper
Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning
ADASYN series
ADASYN
ADASYN (adaptive synthetic sampling) is similar to Borderline SMOTE: it assigns different weights to different minority samples and therefore generates a different number of synthetic samples for each of them.
Steps
- Calculate the total number of samples to synthesize:
$$G = \left(m_{l} - m_{s}\right) \times \beta$$
where $m_{l}$ is the number of majority-class samples, $m_{s}$ is the number of minority-class samples, and $\beta \in [0,1]$ is a coefficient that sets the desired balance level. If $\beta$ equals 1, the ratio of the two classes after sampling is $1:1$.
- For each minority sample, calculate the proportion of majority-class samples among its K nearest neighbors:
$$r_{i} = \Delta_{i} / K$$
where $\Delta_{i}$ is the number of majority-class samples among the $K$ nearest neighbors of $x_{i}$, and $i = 1, 2, 3, \ldots, m_{s}$.
- Normalize $r_{i}$:
$$\hat{r}_{i} = r_{i} / \sum_{i=1}^{m_{s}} r_{i}$$
- Using these weights, calculate the number of new samples to generate for each minority sample:
$$g_{i} = \hat{r}_{i} \times G$$
- Generate $g_{i}$ samples for each minority sample with the SMOTE interpolation:
$$s_{i} = x_{i} + \left(x_{zi} - x_{i}\right) \times \lambda$$
where $s_{i}$ is the synthetic sample, $x_{i}$ is the $i$-th minority sample, $x_{zi}$ is a minority sample chosen at random from the K nearest neighbors of $x_{i}$, and $\lambda \in [0,1]$ is a random number.
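The steps above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not a reference implementation: the function name `adasyn`, its parameters and the toy data are made up for illustration, $g_i$ is rounded to an integer (so the counts may not sum exactly to $G$), and $x_{zi}$ is drawn from the k nearest minority neighbors of $x_i$ so that a neighbor is always available.

```python
import math
import random

def adasyn(minority, majority, beta=1.0, k=3, seed=0):
    """Return (synthetic samples, number generated per minority sample)."""
    rng = random.Random(seed)
    labeled = [(p, True) for p in minority] + [(p, False) for p in majority]
    G = (len(majority) - len(minority)) * beta           # total to synthesize

    # r_i = (majority samples among the k nearest neighbors of x_i) / k
    r = []
    for i, x in enumerate(minority):
        others = [pm for j, pm in enumerate(labeled) if j != i]
        nearest = sorted(others, key=lambda pm: math.dist(x, pm[0]))[:k]
        r.append(sum(1 for _, is_min in nearest if not is_min) / k)

    total = sum(r)
    if total == 0:                                        # no minority point near the majority
        return [], [0] * len(minority)
    r_hat = [v / total for v in r]                        # normalized weights
    counts = [round(v * G) for v in r_hat]                # g_i, rounded (assumption)

    synthetic = []
    for i, (x, g) in enumerate(zip(minority, counts)):
        # Assumption: interpolate toward the k nearest *minority* neighbors
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(g):
            x_z = rng.choice(neighbors)
            lam = rng.random()                            # random number in [0, 1)
            synthetic.append([a + lam * (b - a) for a, b in zip(x, x_z)])
    return synthetic, counts

# Toy data (hypothetical): 4 minority and 10 majority samples, so G = 6 for beta = 1
minority = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (1.0, 1.0)]
majority = [(1.2, 1.0), (1.0, 1.2), (3.0, 3.0), (3.2, 3.0), (3.0, 3.2),
            (3.2, 3.2), (4.0, 4.0), (4.2, 4.0), (4.0, 4.2), (4.2, 4.2)]
synthetic, counts = adasyn(minority, majority)
print(counts)
```

In this toy data only the point (1.0, 1.0) has majority samples among its nearest neighbors, so it receives all of the weight and every synthetic sample is generated from it.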
Paper
ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning