Data processing methods: the SMOTE series and ADASYN
2022-07-06 04:09:00 【Code Taoist】
Introduction
An imbalanced dataset is one in which the number of samples differs greatly from class to class. Take binary classification as an example: suppose the positive class has far more samples than the negative class. When the ratio of majority to minority samples approaches 100:1, the data is usually called imbalanced. Learning from imbalanced data means extracting useful information from a dataset with such a skewed class distribution.
Methods for handling imbalanced datasets fall into two main groups:
1. Data-level methods. The main technique is sampling, which covers undersampling, oversampling, and various improvements on both.
2. Algorithm-level methods. The algorithm is optimized to account for the different costs of different misclassification cases, mainly through cost-sensitive learning (Cost-Sensitive Learning). A representative algorithm is AdaCost.
In addition, the imbalanced-data problem can be treated as a one-class learning (One Class Learning) or novelty detection (Novelty Detection) problem. A representative algorithm is One-class SVM.
SMOTE series
SMOTE
SMOTE (Synthetic Minority Oversampling Technique) is an oversampling algorithm that improves on random oversampling. First, select a sample $x_i$ from the minority class. Next, according to the sampling rate N, randomly select N samples $x_{zi}$ from the K nearest minority-class neighbors of $x_i$. Finally, synthesize a new sample between each $x_{zi}$ and $x_i$ with the following formula, where $\beta$ is a random number in $[0,1]$:
$$x_n = x_i + \beta (x_{zi} - x_i)$$
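The interpolation step can be illustrated with a minimal pure-Python sketch. The function names (`k_nearest`, `smote`), the parameters, and the toy data are hypothetical and chosen only for illustration. The sketch assumes Euclidean distance and draws each neighbor with replacement, which is one possible reading of "randomly select N samples".

```python
import math
import random


def k_nearest(point, candidates, k):
    """Return the k candidates closest to `point` (Euclidean), excluding `point` itself."""
    others = [c for c in candidates if c is not point]
    return sorted(others, key=lambda c: math.dist(point, c))[:k]


def smote(minority, n_per_sample=2, k=3, seed=0):
    """Create n_per_sample synthetic points for every minority sample."""
    rng = random.Random(seed)
    synthetic = []
    for x_i in minority:
        neighbours = k_nearest(x_i, minority, k)
        for _ in range(n_per_sample):
            x_zi = rng.choice(neighbours)  # one of the K nearest minority neighbours
            beta = rng.random()            # interpolation factor in [0, 1)
            # x_n = x_i + beta * (x_zi - x_i), applied per coordinate
            synthetic.append(tuple(a + beta * (b - a) for a, b in zip(x_i, x_zi)))
    return synthetic


# Toy data (assumed): four minority samples in 2-D.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_per_sample=2, k=2)
print(len(new_points))
```

Because every synthetic point lies on the line segment between two existing minority samples, the new points stay inside the region already covered by the minority class.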
Paper
SMOTE: Synthetic Minority Over-sampling Technique
Borderline SMOTE
Borderline SMOTE is an oversampling algorithm that improves on SMOTE. It synthesizes new samples only from minority samples near the class boundary, which improves the class distribution of the samples.
Borderline SMOTE first divides the minority samples into three categories: Safe, Danger, and Noise. Safe: more than half of the surrounding samples are minority samples. Danger: more than half of the surrounding samples are majority samples, so the sample is treated as lying on the boundary. Noise: all surrounding samples are majority samples, so the sample is treated as noise (point C in the original figure). Finally, only the minority samples labeled Danger are oversampled.
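The Safe / Danger / Noise split can be sketched as below. This is an illustration only: the function name `classify_minority`, the value of `k`, and the toy data are hypothetical, Euclidean distance is assumed, and a tie (exactly half of the neighbors being majority samples) is assigned to Danger as an assumption.

```python
import math


def classify_minority(minority, majority, k=3):
    """Label each minority sample as 'safe', 'danger' or 'noise'."""
    labelled = [(p, "min") for p in minority] + [(p, "maj") for p in majority]
    labels = {}
    for x in minority:
        others = [(p, lab) for p, lab in labelled if p is not x]
        nearest = sorted(others, key=lambda pl: math.dist(x, pl[0]))[:k]
        n_major = sum(1 for _, lab in nearest if lab == "maj")
        if n_major == k:
            labels[x] = "noise"    # every neighbour is a majority sample
        elif n_major >= k / 2:
            labels[x] = "danger"   # boundary sample (tie handling is an assumption)
        else:
            labels[x] = "safe"     # mostly minority neighbours
    return labels


# Toy data (assumed): a tight minority cluster, two boundary points, one isolated point.
minority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
            (2.0, 2.0), (2.1, 2.0), (5.0, 5.0)]
majority = [(2.0, 2.2), (1.8, 2.0),
            (5.1, 5.0), (5.0, 5.1), (4.9, 5.0), (5.0, 4.9)]
labels = classify_minority(minority, majority, k=3)
danger = [p for p, lab in labels.items() if lab == "danger"]
print(danger)
```

Only the samples collected in `danger` would then be passed to the SMOTE interpolation step.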
Paper
Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning
ADASYN series
ADASYN
ADASYN (adaptive synthetic sampling) is similar to Borderline SMOTE: it assigns different weights to different minority samples and therefore generates a different number of synthetic samples for each.
Steps
- Calculate the total number of samples to synthesize:
$$G = (m_l - m_s) \times \beta$$
where $m_l$ is the number of majority-class samples, $m_s$ is the number of minority-class samples, and $\beta \in [0,1]$ is a coefficient that sets the desired balance level. If $\beta$ equals 1, the ratio of positive to negative samples after sampling is $1:1$.
- Calculate the proportion of majority-class samples among the K nearest neighbors:
$$r_i = \Delta_i / K$$
where $\Delta_i$ is the number of majority-class samples among the $K$ nearest neighbors of the $i$-th minority sample, $i = 1, 2, 3, \ldots, m_s$.
- Normalize $r_i$:
$$\hat{r}_i = r_i / \sum_{i=1}^{m_s} r_i$$
- From these weights, calculate the number of new samples to generate for each minority sample:
$$g_i = \hat{r}_i \times G$$
- For each minority sample, generate $g_i$ samples with the SMOTE algorithm:
$$s_i = x_i + (x_{zi} - x_i) \times \lambda$$
where $s_i$ is the synthetic sample, $x_i$ is the $i$-th minority sample, $x_{zi}$ is a minority sample randomly selected from the $K$ nearest neighbors of $x_i$, and $\lambda \in [0,1]$ is a random number.
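The steps above can be sketched end to end in pure Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `adasyn`, the toy data, the use of Euclidean distance, and the rounding of each per-sample count to the nearest integer are all choices made here. It also assumes at least two minority samples, so that an interpolation partner always exists.

```python
import math
import random


def adasyn(minority, majority, beta=1.0, k=3, seed=0):
    """Return (synthetic_samples, per_sample_counts) following the steps above."""
    rng = random.Random(seed)
    # Step 1: total number of samples to synthesize, G = (m_l - m_s) * beta
    G = (len(majority) - len(minority)) * beta
    labelled = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    # Step 2: r_i = share of majority samples among the K nearest neighbours
    r = []
    for x in minority:
        others = [(p, lab) for p, lab in labelled if p is not x]
        nearest = sorted(others, key=lambda pl: math.dist(x, pl[0]))[:k]
        r.append(sum(lab for _, lab in nearest) / k)
    # Step 3: normalise r_i so the weights sum to 1
    total = sum(r)
    if total == 0:
        return [], [0] * len(minority)
    r_hat = [r_i / total for r_i in r]
    # Step 4: g_i = r_hat_i * G, rounded to an integer count (an assumption)
    g = [round(rh * G) for rh in r_hat]
    # Step 5: SMOTE-style interpolation, g_i times per minority sample
    synthetic = []
    for x_i, g_i in zip(minority, g):
        partners = sorted((p for p in minority if p is not x_i),
                          key=lambda p: math.dist(x_i, p))[:k]
        for _ in range(g_i):
            x_zi = rng.choice(partners)
            lam = rng.random()
            synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x_i, x_zi)))
    return synthetic, g


# Toy data (assumed): 3 minority and 9 majority samples, so G = 6 when beta = 1.
minority = [(0.0, 0.0), (0.2, 0.0), (1.0, 0.0)]
majority = [(1.1, 0.0), (0.9, 0.0), (1.0, 0.1),
            (3.0, 0.0), (3.0, 1.0), (3.0, 2.0),
            (4.0, 0.0), (4.0, 1.0), (4.0, 2.0)]
synthetic, counts = adasyn(minority, majority, beta=1.0, k=3)
print(counts)  # [1, 1, 4]
```

In this toy example the third minority sample is surrounded by majority samples ($r_3 = 1$), while the first two each have one majority neighbor ($r_1 = r_2 = 1/3$), so the normalized weights are 0.2, 0.2 and 0.6 and most of the synthetic samples are generated around the hardest point. Because of rounding, the counts do not always sum exactly to $G$.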
Paper
ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning