Using the 3σ rule to remove outliers for data cleaning in data modeling
2022-07-07 09:43:00 【Data stroller】
Method and principle:
The 3σ criterion, also known as the Pauta (Laida) criterion, assumes that a set of measurements contains only random errors. The standard deviation of the data is computed and used to define an interval that covers a chosen probability. A value whose deviation falls outside this interval is treated as a gross error rather than a random error, and the data point containing it should be removed.
For a normal distribution, σ is the standard deviation and μ is the mean; x = μ is the axis of symmetry of the density curve.
The 3σ rule:
The probability that a value falls within (μ-σ, μ+σ) is 0.6827
The probability that a value falls within (μ-2σ, μ+2σ) is 0.9545
The probability that a value falls within (μ-3σ, μ+3σ) is 0.9973
In other words, almost all values of a normally distributed variable lie within (μ-3σ, μ+3σ); the probability of a value falling outside this interval is less than 0.3%.
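The three probabilities above can be checked numerically. This is a minimal sketch using only the Python standard library; the helper name `prob_within` is made up for illustration. It relies on the identity P(|X - μ| < kσ) = erf(k / √2) for a normally distributed X.

```python
import math

def prob_within(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    # Print the coverage probability for 1, 2 and 3 standard deviations
    print(f"within {k} sigma: {prob_within(k):.4f}")
```

Rounded to four decimals, the values are 0.6827, 0.9545 and 0.9973.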
Example data:
| | date | Commodity code | sales volumes |
|---|---|---|---|
| 0 | 2020-12-01 | A005 | 1 |
| 1 | 2020-12-01 | A014 | 2 |
| 2 | 2020-12-01 | A007 | 3 |
| 3 | 2020-12-01 | A012 | 4 |
| 4 | 2020-12-01 | A009 | 5 |
| 5 | 2020-12-01 | A019 | 6 |
| 6 | 2020-12-01 | A008 | 7 |
| 7 | 2020-12-01 | A019 | 8 |
| 8 | 2020-12-01 | A002 | 9 |
| 9 | 2020-12-02 | A005 | 10 |
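The article does not show how the DataFrame is created. One possible way to build the table above is sketched here, assuming the variable is called `df`, the column names match the table header exactly, and the date column is stored as datetimes (the original may have loaded the data from a file instead).

```python
import pandas as pd

# Rebuild the ten example rows; the index 0..9 is the pandas default
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-12-01"] * 9 + ["2020-12-02"]),
    "Commodity code": ["A005", "A014", "A007", "A012", "A009",
                       "A019", "A008", "A019", "A002", "A005"],
    "sales volumes": list(range(1, 11)),
})
print(df)
```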
Now apply the 3σ rule to remove outliers.
df_avg = df['sales volumes'].mean()  # mean of the sales column
df_std = df['sales volumes'].std()  # sample standard deviation (pandas default, ddof=1)
df['z_score'] = (df['sales volumes'] - df_avg) / df_std  # distance from the mean, in standard deviations
print(df)
df = df.loc[(df['z_score'] > -3) & (df['z_score'] < 3)]  # keep rows within 3 standard deviations; & (not |) so that both bounds must hold
df = df.drop('z_score', axis=1)  # the helper column is no longer needed
print('mean:', df_avg)
print('standard deviation:', df_std)
Output:
mean: 5.5
standard deviation: 3.0276503540974917
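Because the same steps are run again later with an extra row, it can be convenient to wrap them in a function. The following is a sketch, not the author's code: the function name `remove_outliers_3sigma` and the parameter `k` are made up for illustration, and it uses the sample standard deviation just like the snippet above.

```python
import pandas as pd

def remove_outliers_3sigma(df: pd.DataFrame, column: str, k: float = 3.0) -> pd.DataFrame:
    """Return the rows whose `column` value lies within k sample standard deviations of the mean."""
    mean = df[column].mean()
    std = df[column].std()  # sample standard deviation (ddof=1)
    if pd.isna(std) or std == 0:
        # Single-row or constant column: no value can be flagged
        return df.copy()
    z_score = (df[column] - mean) / std
    # |z| < k is equivalent to (z > -k) & (z < k)
    return df.loc[z_score.abs() < k]

# Usage on the ten example sales values: nothing is removed
sales = pd.DataFrame({"sales volumes": list(range(1, 11))})
cleaned = remove_outliers_3sigma(sales, "sales volumes")
print(len(sales), len(cleaned))
```

The function leaves its input untouched and returns the filtered rows, so no temporary `z_score` column has to be added and dropped.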
The result after processing:
| | date | Commodity code | sales volumes |
|---|---|---|---|
| 0 | 2020-12-01 | A005 | 1 |
| 1 | 2020-12-01 | A014 | 2 |
| 2 | 2020-12-01 | A007 | 3 |
| 3 | 2020-12-01 | A012 | 4 |
| 4 | 2020-12-01 | A009 | 5 |
| 5 | 2020-12-01 | A019 | 6 |
| 6 | 2020-12-01 | A008 | 7 |
| 7 | 2020-12-01 | A019 | 8 |
| 8 | 2020-12-01 | A002 | 9 |
| 9 | 2020-12-02 | A005 | 10 |
The result is unchanged, because none of the ten values lies more than 3 standard deviations from the mean. Now add an outlier by hand and see whether it is filtered out.
Use `loc` to append the new row.
df.loc[10]=['2020-12-02','A007',1000]
df
Run the 3σ outlier-removal code again and check the result: the abnormal row with a sales volume of 1000 has been removed.
| | date | Commodity code | sales volumes |
|---|---|---|---|
| 0 | 2020-12-01 00:00:00 | A005 | 1 |
| 1 | 2020-12-01 00:00:00 | A014 | 2 |
| 2 | 2020-12-01 00:00:00 | A007 | 3 |
| 3 | 2020-12-01 00:00:00 | A012 | 4 |
| 4 | 2020-12-01 00:00:00 | A009 | 5 |
| 5 | 2020-12-01 00:00:00 | A019 | 6 |
| 6 | 2020-12-01 00:00:00 | A008 | 7 |
| 7 | 2020-12-01 00:00:00 | A019 | 8 |
| 8 | 2020-12-01 00:00:00 | A002 | 9 |
| 9 | 2020-12-02 00:00:00 | A005 | 10 |
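It is worth checking how narrowly the value 1000 crosses the threshold. The sketch below recomputes the z-score for the eleven sales values (the names `sales`, `z_score` and `max_abs_z` are made up for illustration). It also evaluates a known bound: when the sample standard deviation is used, no single value among n observations can have |z| larger than (n - 1) / √n. With 10 or fewer rows, the 3σ rule therefore cannot flag anything, however extreme the value is.

```python
import math
import pandas as pd

# The ten original sales values plus the manually added outlier
sales = pd.Series(list(range(1, 11)) + [1000], name="sales volumes")
z_score = (sales - sales.mean()) / sales.std()  # sample standard deviation (ddof=1)
print("z-score of the 1000 row:", round(z_score.iloc[-1], 3))

def max_abs_z(n: int) -> float:
    """Largest |z| a single value can reach among n observations (sample std)."""
    return (n - 1) / math.sqrt(n)

for n in (10, 11):
    print(n, "rows: maximum possible |z| =", round(max_abs_z(n), 3))
```

Here the outlier's z-score is only about 3.015, just above the cutoff, because the outlier itself inflates both the mean and the standard deviation. On small samples, a lower threshold or a more robust statistic may be worth considering.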