Using the 3σ rule to eliminate outliers for data cleaning in data modeling
2022-07-07 09:43:00 【Data stroller】
Principle of the method:
The 3σ criterion is also known as the Laida (Pauta) criterion. It assumes that a set of measurements contains only random errors. The standard deviation is computed from the data, and an interval is determined according to a chosen probability. Any error that falls outside this interval is treated as a gross error rather than a random error, and the data containing it should be eliminated.
In the normal distribution, σ denotes the standard deviation and μ the mean. The line x = μ is the axis of symmetry of the curve.
The 3σ rule:
The probability that a value falls in (μ-σ, μ+σ) is 0.6827
The probability that a value falls in (μ-2σ, μ+2σ) is 0.9545
The probability that a value falls in (μ-3σ, μ+3σ) is 0.9973
In other words, the values of Y are almost all concentrated in the interval (μ-3σ, μ+3σ); the probability of falling outside it is less than 0.3%.
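These three probabilities can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper name `prob_within` is hypothetical and not part of the original article. It relies on the identity P(|X-μ| < kσ) = erf(k/√2) for a normally distributed X.

```python
import math


def prob_within(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


# Print the coverage of the 1-sigma, 2-sigma and 3-sigma intervals
for k in (1, 2, 3):
    print(f"within {k} sigma: {prob_within(k):.4f}")
```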
Example data:
index | date | Commodity code | sales volumes |
---|---|---|---|
0 | 2020-12-01 | A005 | 1 |
1 | 2020-12-01 | A014 | 2 |
2 | 2020-12-01 | A007 | 3 |
3 | 2020-12-01 | A012 | 4 |
4 | 2020-12-01 | A009 | 5 |
5 | 2020-12-01 | A019 | 6 |
6 | 2020-12-01 | A008 | 7 |
7 | 2020-12-01 | A019 | 8 |
8 | 2020-12-01 | A002 | 9 |
9 | 2020-12-02 | A005 | 10 |
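The article does not show how `df` is created. As a rough sketch, the table above could be built by hand as follows. Typing the values in directly and keeping the dates as plain strings are assumptions; the original data may have been loaded from a file instead.

```python
import pandas as pd

# Rebuild the ten example rows shown in the table above (assumed construction)
df = pd.DataFrame({
    "date": ["2020-12-01"] * 9 + ["2020-12-02"],
    "Commodity code": ["A005", "A014", "A007", "A012", "A009",
                       "A019", "A008", "A019", "A002", "A005"],
    "sales volumes": list(range(1, 11)),
})
print(df)
```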
Now apply the 3σ rule to eliminate outliers.
df_avg = df['sales volumes'].mean()  # Compute the mean
df_std = df['sales volumes'].std()  # Compute the sample standard deviation
df['z_score'] = (df['sales volumes'] - df_avg) / df_std  # Standardize each value
print(df)  # Inspect the z-scores
# Keep only rows with -3 < z < 3; both conditions must hold, so combine them with &
df = df.loc[(df['z_score'] > -3) & (df['z_score'] < 3)]
df = df.drop('z_score', axis=1)  # Remove the helper column
print('Mean:', df_avg)
print('Standard deviation:', df_std)
Mean: 5.5
Standard deviation: 3.0276503540974917
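Since the same steps are run again later in the article, they could be wrapped in a small helper. This is only a sketch: the function name `drop_3sigma_outliers` and its parameter `k` are hypothetical and do not appear in the original.

```python
import pandas as pd


def drop_3sigma_outliers(df: pd.DataFrame, column: str, k: float = 3.0) -> pd.DataFrame:
    """Return the rows whose value in `column` lies within k sample standard deviations of the mean."""
    z = (df[column] - df[column].mean()) / df[column].std()
    # If the standard deviation is 0, z is NaN everywhere and no row passes the filter
    return df.loc[z.abs() < k]


# The ten example values contain no outlier, so every row is kept
sample = pd.DataFrame({"sales volumes": list(range(1, 11))})
cleaned = drop_3sigma_outliers(sample, "sales volumes")
print(len(sample), len(cleaned))
```

Because the z-scores are held in a temporary Series, no helper column has to be added to the DataFrame and dropped again.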
View the processed results:
index | date | Commodity code | sales volumes |
---|---|---|---|
0 | 2020-12-01 | A005 | 1 |
1 | 2020-12-01 | A014 | 2 |
2 | 2020-12-01 | A007 | 3 |
3 | 2020-12-01 | A012 | 4 |
4 | 2020-12-01 | A009 | 5 |
5 | 2020-12-01 | A019 | 6 |
6 | 2020-12-01 | A008 | 7 |
7 | 2020-12-01 | A019 | 8 |
8 | 2020-12-01 | A002 | 9 |
9 | 2020-12-02 | A005 | 10 |
You can see that the result has not changed: every z-score lies within about ±1.49, so no row is removed. Now manually add an outlier and see whether it is filtered out.
Use the loc indexer to add the row.
df.loc[10]=['2020-12-02','A007',1000]
df
Run the 3σ outlier-removal code again and view the results. The abnormal row with a sales volume of 1000 has been eliminated.
index | date | Commodity code | sales volumes |
---|---|---|---|
0 | 2020-12-01 00:00:00 | A005 | 1 |
1 | 2020-12-01 00:00:00 | A014 | 2 |
2 | 2020-12-01 00:00:00 | A007 | 3 |
3 | 2020-12-01 00:00:00 | A012 | 4 |
4 | 2020-12-01 00:00:00 | A009 | 5 |
5 | 2020-12-01 00:00:00 | A019 | 6 |
6 | 2020-12-01 00:00:00 | A008 | 7 |
7 | 2020-12-01 00:00:00 | A019 | 8 |
8 | 2020-12-01 00:00:00 | A002 | 9 |
9 | 2020-12-02 00:00:00 | A005 | 10 |
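To see why the added row is removed, it can help to look at its z-score directly. The sketch below works on the sales column alone; the variable names `sales`, `z`, and `kept` are illustrative and not from the original.

```python
import pandas as pd

# The ten original sales values plus the manually added outlier
sales = pd.Series(list(range(1, 11)) + [1000], name="sales volumes")

# pandas std() uses the sample standard deviation (ddof=1) by default
z = (sales - sales.mean()) / sales.std()
print(z.iloc[-1])  # z-score of the 1000 entry

# Keep only the values whose absolute z-score is below 3
kept = sales[z.abs() < 3]
print(kept.tolist())
```

The z-score of the outlier is only slightly above 3, because the extreme value itself inflates both the mean and the standard deviation. On very small samples the 3σ rule may therefore fail to flag even an obvious outlier.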