Using the 3σ rule to eliminate outliers for data cleaning in data modeling
2022-07-07 09:43:00 【Data stroller】
How the method works:
The 3σ criterion, also known as the Laida (Pauta) criterion, assumes that a set of measurements contains only random error. The standard deviation of the data is computed and an interval is fixed for a chosen probability. Any value whose deviation from the mean falls outside that interval is treated as a gross error rather than a random error, and the data point containing it should be removed.
In a normal distribution, σ is the standard deviation and μ is the mean; x = μ is the axis of symmetry of the density curve.
The 3σ rule:
The probability that a value lies in (μ-σ, μ+σ) is 0.6827
The probability that a value lies in (μ-2σ, μ+2σ) is 0.9545
The probability that a value lies in (μ-3σ, μ+3σ) is 0.9973
In other words, the values of Y are almost entirely concentrated in the interval (μ-3σ, μ+3σ); the probability of falling outside it is less than 0.3%.
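These probabilities follow from the normal cumulative distribution: for a normally distributed variable, the probability of lying within k standard deviations of the mean is erf(k/√2). A minimal sketch using only Python's standard library (the helper name `prob_within` is illustrative and does not come from the article):

```python
import math

def prob_within(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

# Print the coverage of the 1-sigma, 2-sigma and 3-sigma intervals, rounded to four decimals
for k in (1, 2, 3):
    print(f"P(|X - mu| < {k} sigma) = {prob_within(k):.4f}")
```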
Example data:
| | date | Commodity code | sales volumes |
|---|---|---|---|
| 0 | 2020-12-01 | A005 | 1 |
| 1 | 2020-12-01 | A014 | 2 |
| 2 | 2020-12-01 | A007 | 3 |
| 3 | 2020-12-01 | A012 | 4 |
| 4 | 2020-12-01 | A009 | 5 |
| 5 | 2020-12-01 | A019 | 6 |
| 6 | 2020-12-01 | A008 | 7 |
| 7 | 2020-12-01 | A019 | 8 |
| 8 | 2020-12-01 | A002 | 9 |
| 9 | 2020-12-02 | A005 | 10 |
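The article does not show how df is created. One way to build a DataFrame that matches the table above, assuming pandas is installed and taking the column names from the table header (parsing the date column as datetime is an assumption):

```python
import pandas as pd

# Rebuild the ten example rows; the default integer index 0..9 matches the first table column
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-12-01"] * 9 + ["2020-12-02"]),
    "Commodity code": ["A005", "A014", "A007", "A012", "A009",
                       "A019", "A008", "A019", "A002", "A005"],
    "sales volumes": list(range(1, 11)),
})
print(df)
```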
Now apply the 3σ rule to eliminate outliers.
df_avg = df['sales volumes'].mean()  # mean of the sales column
df_std = df['sales volumes'].std()   # sample standard deviation (pandas default, ddof=1)
df['z_score'] = (df['sales volumes'] - df_avg) / df_std  # distance from the mean in units of σ
print(df)
# Keep only rows with -3 < z < 3. Both conditions must hold, so combine them with & (with | every row would pass)
df = df.loc[(df['z_score'] > -3) & (df['z_score'] < 3)]
df = df.drop('z_score', axis=1)
print('mean value:', df_avg)
print('Standard deviation:', df_std)
The last two lines print:
mean value: 5.5
Standard deviation: 3.0276503540974917
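pandas computes the sample standard deviation (ddof=1) by default, which is why the printed value is about 3.028 rather than the population value of about 2.872. A standard-library sketch, independent of the article's code, that reproduces these numbers and shows the largest absolute z-score in the sample (variable names are illustrative):

```python
import statistics

sales = list(range(1, 11))        # the ten sales volumes from the example table
mean = statistics.mean(sales)     # arithmetic mean
std = statistics.stdev(sales)     # sample standard deviation, same convention as the pandas default
max_abs_z = max(abs(x - mean) / std for x in sales)

print("mean:", mean)
print("std:", std)
print("largest |z|:", round(max_abs_z, 3))
```

Since the largest |z| is well below 3, no row of the example data is treated as an outlier.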
View the processed results:
| | date | Commodity code | sales volumes |
|---|---|---|---|
| 0 | 2020-12-01 | A005 | 1 |
| 1 | 2020-12-01 | A014 | 2 |
| 2 | 2020-12-01 | A007 | 3 |
| 3 | 2020-12-01 | A012 | 4 |
| 4 | 2020-12-01 | A009 | 5 |
| 5 | 2020-12-01 | A019 | 6 |
| 6 | 2020-12-01 | A008 | 7 |
| 7 | 2020-12-01 | A019 | 8 |
| 8 | 2020-12-01 | A002 | 9 |
| 9 | 2020-12-02 | A005 | 10 |
The result is unchanged, because no value lies more than 3σ from the mean. Now add an outlier manually and see whether it is filtered out.
Use loc to append the new row.
df.loc[10]=['2020-12-02','A007',1000]
df
Run the 3σ outlier-elimination code again and view the result: the abnormal record with a sales volume of 1000 has been eliminated.
| | date | Commodity code | sales volumes |
|---|---|---|---|
| 0 | 2020-12-01 00:00:00 | A005 | 1 |
| 1 | 2020-12-01 00:00:00 | A014 | 2 |
| 2 | 2020-12-01 00:00:00 | A007 | 3 |
| 3 | 2020-12-01 00:00:00 | A012 | 4 |
| 4 | 2020-12-01 00:00:00 | A009 | 5 |
| 5 | 2020-12-01 00:00:00 | A019 | 6 |
| 6 | 2020-12-01 00:00:00 | A008 | 7 |
| 7 | 2020-12-01 00:00:00 | A019 | 8 |
| 8 | 2020-12-01 00:00:00 | A002 | 9 |
| 9 | 2020-12-02 00:00:00 | A005 | 10 |
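The filtering step can also be wrapped in a small reusable function. This is a sketch, not the author's code: the name `remove_outliers_3sigma` and its `threshold` parameter are hypothetical, and the sample frame is rebuilt here (with dates kept as plain strings) so the block runs on its own.

```python
import pandas as pd

def remove_outliers_3sigma(frame: pd.DataFrame, column: str, threshold: float = 3.0) -> pd.DataFrame:
    """Return the rows whose value in `column` lies within `threshold` sample standard deviations of the mean."""
    z_score = (frame[column] - frame[column].mean()) / frame[column].std()
    return frame.loc[z_score.abs() < threshold]

# Ten regular rows plus the manually added outlier (index 10, sales volume 1000)
df = pd.DataFrame({
    "date": ["2020-12-01"] * 9 + ["2020-12-02"] * 2,
    "Commodity code": ["A005", "A014", "A007", "A012", "A009", "A019",
                       "A008", "A019", "A002", "A005", "A007"],
    "sales volumes": list(range(1, 11)) + [1000],
})

cleaned = remove_outliers_3sigma(df, "sales volumes")
print(cleaned["sales volumes"].tolist())
```

Note that with the sample standard deviation, the largest z-score any single value can reach in a sample of n values is (n-1)/√n, which is about 3.015 for n = 11 and about 2.846 for n = 10. The value 1000 is therefore only just flagged here, and with ten or fewer rows this test cannot flag anything.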