Data processing methods: SMOTE series and ADASYN
2022-07-06 04:09:00 【Code Taoist】
Introduction
An imbalanced dataset is one in which the number of samples differs greatly from class to class. Take binary classification as an example: if the positive class has far more samples than the negative class, typically with a ratio approaching 100:1, the data is called imbalanced. Learning from imbalanced data means extracting useful information from an unevenly distributed dataset.
Methods for handling imbalanced datasets fall into two main groups:
1. Data level: the main approach is sampling, which covers undersampling, oversampling, and improved variants of both.
2. Algorithm level: the algorithm is optimized to account for the different costs of different misclassification cases, mainly through cost-sensitive learning (Cost-Sensitive Learning). A representative algorithm is AdaCost.
In addition, the imbalanced-data problem can be treated as a one-class learning (One Class Learning) or novelty detection (Novelty Detection) problem. A representative algorithm is One-class SVM.
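To make the cost-sensitive idea concrete, here is a minimal sketch that scores predictions with unequal misclassification costs. The function name `total_cost`, the label convention (1 marks the minority class) and the cost values are all assumptions chosen for illustration. They are not taken from AdaCost or any library.

```python
def total_cost(y_true, y_pred, cost_missed_minority=10.0, cost_false_alarm=1.0):
    """Sum misclassification costs; label 1 is the minority class (assumption)."""
    cost = 0.0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 0:
            # A minority sample was predicted as majority: the expensive error.
            cost += cost_missed_minority
        elif truth == 0 and pred == 1:
            # A majority sample was predicted as minority: the cheap error.
            cost += cost_false_alarm
    return cost


y_true = [1, 1, 0, 0, 0]
y_pred = [0, 1, 1, 0, 0]
print(total_cost(y_true, y_pred))
```

A cost-sensitive learner would try to minimize a quantity of this kind instead of the plain error rate, so that missing a rare sample is penalized more heavily than a false alarm.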
SMOTE series
SMOTE
SMOTE (Synthetic Minority Oversampling Technique) is an oversampling algorithm that improves on random oversampling. First, select a sample $x_i$ from the minority class. Second, according to the sampling rate N, randomly select N samples $x_{zi}$ from the K nearest neighbors of $x_i$. Finally, synthesize a new sample between each $x_{zi}$ and $x_i$ in turn, using the following formula:
$$x_n = x_i + \beta\,(x_{zi} - x_i)$$
where $x_n$ is the new sample and $\beta \in [0,1]$ is a random number.
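The interpolation step can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the helper names `k_nearest` and `smote`, the brute-force Euclidean neighbor search, and the choice to draw neighbors with replacement are all assumptions of this sketch.

```python
import math
import random


def k_nearest(idx, points, k):
    """Indices of the k points closest to points[idx], excluding idx itself."""
    dists = [(math.dist(points[idx], p), j) for j, p in enumerate(points) if j != idx]
    dists.sort()
    return [j for _, j in dists[:k]]


def smote(minority, n_per_sample, k, rng=None):
    """Create n_per_sample synthetic points for every minority sample."""
    rng = rng or random.Random()
    synthetic = []
    for i, x_i in enumerate(minority):
        neighbours = k_nearest(i, minority, k)
        for _ in range(n_per_sample):
            # Pick one of the K nearest minority neighbors at random.
            x_zi = minority[rng.choice(neighbours)]
            # Random interpolation factor in [0, 1).
            beta = rng.random()
            # New point on the segment between x_i and x_zi.
            synthetic.append([a + beta * (b - a) for a, b in zip(x_i, x_zi)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_per_sample=2, k=2, rng=random.Random(0))
print(len(new_points))
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region the minority class already occupies.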
Paper
SMOTE: Synthetic Minority Over-sampling Technique
Borderline SMOTE
Borderline SMOTE is an oversampling algorithm that improves on SMOTE. It uses only the minority class samples on the class boundary to synthesize new samples, which improves the class distribution of the data.
Borderline SMOTE first divides the minority samples into three categories: Safe, Danger and Noise. Safe: more than half of the surrounding samples belong to the minority class. Danger: more than half of the surrounding samples belong to the majority class, so the sample is regarded as lying on the boundary. Noise: all surrounding samples belong to the majority class, so the sample is regarded as noise (point C in the original figure). Finally, only the minority samples labeled Danger are oversampled.
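The three-way split can be sketched as below. The function name `classify_minority`, the brute-force Euclidean neighbor search, and the handling of an exact half-and-half neighborhood (treated as Danger here) are assumptions of this sketch.

```python
import math


def classify_minority(minority, majority, k):
    """Label each minority sample as 'safe', 'danger' or 'noise'."""
    # Tag every sample: 0 for minority, 1 for majority.
    labelled = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    labels = []
    for i, x in enumerate(minority):
        # Distances from x to every other sample (index i is x itself).
        others = [
            (math.dist(x, p), is_major)
            for j, (p, is_major) in enumerate(labelled)
            if j != i
        ]
        others.sort(key=lambda t: t[0])
        # Count majority samples among the k nearest neighbors.
        n_major = sum(is_major for _, is_major in others[:k])
        if n_major == k:
            labels.append("noise")
        elif n_major >= k / 2:
            labels.append("danger")
        else:
            labels.append("safe")
    return labels


minority = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
            [2.0, 2.0], [2.1, 2.0], [5.0, 5.0]]
majority = [[2.0, 2.15], [1.85, 2.0], [5.1, 5.0], [5.0, 5.1], [4.9, 5.0]]
print(classify_minority(minority, majority, k=3))
```

Only the samples labeled `danger` would then be passed to the SMOTE interpolation step.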
Paper
Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning
ADASYN series
ADASYN
ADASYN (adaptive synthetic sampling) is similar to Borderline SMOTE: it assigns different weights to different minority samples and therefore generates a different number of synthetic samples for each of them.
Steps
- Calculate the number of samples to be synthesized:
$$G=\left(m_{l}-m_{s}\right) \times \beta$$
where $m_{l}$ is the number of majority class samples, $m_{s}$ is the number of minority class samples, and $\beta \in[0,1]$ is a coefficient that sets the desired balance level. If $\beta$ equals 1, the positive-to-negative ratio after sampling is $1:1$.
- Calculate the proportion of majority class samples among the K nearest neighbors:
$$r_{i}=\Delta_{i} / K$$
where $\Delta_{i}$ is the number of majority class samples among the $K$ nearest neighbors of the $i$-th minority sample, $i=1,2,3,\ldots,m_{s}$.
- Normalize $r_{i}$:
$$\hat{r}_{i}=r_{i} / \sum_{i=1}^{m_{s}} r_{i}$$
- Using these sample weights, calculate the number of new samples to generate for each minority sample:
$$g_{i}=\hat{r}_{i} \times G$$
- For each minority sample, generate $g_{i}$ samples with the SMOTE algorithm:
$$s_{i}=x_{i}+\left(x_{zi}-x_{i}\right) \times \lambda$$
where $s_{i}$ is the synthetic sample, $x_{i}$ is the $i$-th minority sample, $x_{zi}$ is a minority sample randomly selected from the K nearest neighbors of $x_{i}$, and $\lambda \in[0,1]$ is a random number.
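The allocation part of the steps above (everything before the final interpolation) can be sketched as follows. The function name `adasyn_counts`, the brute-force Euclidean neighbor search, rounding each count to the nearest integer, and returning all zeros when no minority sample has a majority neighbor are assumptions of this sketch.

```python
import math


def adasyn_counts(minority, majority, k, beta=1.0):
    """Number of synthetic samples to generate for each minority sample."""
    m_s, m_l = len(minority), len(majority)
    # Total number of samples to synthesize.
    total_to_generate = (m_l - m_s) * beta

    # Tag every sample: 0 for minority, 1 for majority.
    labelled = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    ratios = []
    for i, x in enumerate(minority):
        others = sorted(
            ((math.dist(x, p), is_major)
             for j, (p, is_major) in enumerate(labelled) if j != i),
            key=lambda t: t[0],
        )
        # Share of majority samples among the k nearest neighbors.
        delta_i = sum(is_major for _, is_major in others[:k])
        ratios.append(delta_i / k)

    ratio_sum = sum(ratios)
    if ratio_sum == 0:
        # No minority sample borders the majority class (assumed fallback).
        return [0] * m_s
    # Normalize the ratios and scale them by the total.
    return [round(r / ratio_sum * total_to_generate) for r in ratios]


minority = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [2.0, 2.0]]
majority = [[2.0, 2.15], [1.85, 2.0], [2.15, 2.0]] + [[10.0 + i, 10.0] for i in range(10)]
print(adasyn_counts(minority, majority, k=3))
```

In this toy data only the last minority sample is surrounded by majority samples, so it receives the whole budget. Each count would then drive that many SMOTE-style interpolations from the corresponding sample.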
Paper
ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning