Using the 3σ rule in data modeling to eliminate outliers for data cleaning
2022-07-07 09:43:00 【Data stroller】
Principle of the method:
The 3σ criterion, also known as the Pauta criterion, assumes that a set of measurements contains only random errors. The standard deviation of the data is computed and an interval is fixed for a chosen probability. Any error that falls outside this interval is treated as a gross error rather than a random one, and the data point containing it should be eliminated.
In a normal distribution, σ denotes the standard deviation and μ the mean; x = μ is the axis of symmetry of the curve.
The 3σ principle:
The probability that a value falls in (μ-σ, μ+σ) is 0.6827
The probability that a value falls in (μ-2σ, μ+2σ) is 0.9545
The probability that a value falls in (μ-3σ, μ+3σ) is 0.9973
In other words, the values of the variable are almost all concentrated in the interval (μ-3σ, μ+3σ); the probability of falling outside it is less than 0.3%.
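The three probabilities above can be checked numerically. For a normal distribution, the probability of lying within k standard deviations of the mean equals erf(k/√2). The sketch below uses only Python's standard library; the helper name `coverage` is made up for this illustration.

```python
import math

def coverage(k: float) -> float:
    """Probability that a normally distributed value lies within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"P(|X - mu| < {k} sigma) = {coverage(k):.4f}")
# P(|X - mu| < 1 sigma) = 0.6827
# P(|X - mu| < 2 sigma) = 0.9545
# P(|X - mu| < 3 sigma) = 0.9973
```

These figures hold only when the data are approximately normally distributed, which is the assumption the 3σ criterion rests on.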
Example data:
| | date | Commodity code | sales volumes |
|---|---|---|---|
| 0 | 2020-12-01 | A005 | 1 |
| 1 | 2020-12-01 | A014 | 2 |
| 2 | 2020-12-01 | A007 | 3 |
| 3 | 2020-12-01 | A012 | 4 |
| 4 | 2020-12-01 | A009 | 5 |
| 5 | 2020-12-01 | A019 | 6 |
| 6 | 2020-12-01 | A008 | 7 |
| 7 | 2020-12-01 | A019 | 8 |
| 8 | 2020-12-01 | A002 | 9 |
| 9 | 2020-12-02 | A005 | 10 |
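
The article does not show how the DataFrame `df` is created. One possible way to rebuild it from the table above is sketched here, assuming pandas is installed; the column names are copied from the table header, and storing the dates as datetimes is an assumption.

```python
import pandas as pd

# Values copied from the example table; the construction itself is an assumption.
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-12-01"] * 9 + ["2020-12-02"]),
    "Commodity code": ["A005", "A014", "A007", "A012", "A009",
                       "A019", "A008", "A019", "A002", "A005"],
    "sales volumes": list(range(1, 11)),
})
print(df)
```
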
Now apply the 3σ rule to eliminate outliers.
df_avg = df['sales volumes'].mean()  # Calculate the mean
df_std = df['sales volumes'].std()  # Calculate the sample standard deviation
df['z_score'] = (df['sales volumes'] - df_avg) / df_std  # Distance from the mean, in standard deviations
print(df)
# Keep only rows whose z-score lies inside (-3, 3). Both conditions must hold, so combine them with & (using | would keep every row).
df = df.loc[(df['z_score'] > -3) & (df['z_score'] < 3)]
df = df.drop('z_score', axis=1)
print('Mean:', df_avg)
print('Standard deviation:', df_std)
The last two lines print:
Mean: 5.5
Standard deviation: 3.0276503540974917
View the processed results:
| | date | Commodity code | sales volumes |
|---|---|---|---|
| 0 | 2020-12-01 | A005 | 1 |
| 1 | 2020-12-01 | A014 | 2 |
| 2 | 2020-12-01 | A007 | 3 |
| 3 | 2020-12-01 | A012 | 4 |
| 4 | 2020-12-01 | A009 | 5 |
| 5 | 2020-12-01 | A019 | 6 |
| 6 | 2020-12-01 | A008 | 7 |
| 7 | 2020-12-01 | A019 | 8 |
| 8 | 2020-12-01 | A002 | 9 |
| 9 | 2020-12-02 | A005 | 10 |
As you can see, the result is unchanged, because no value lies more than three standard deviations from the mean. Now add an outlier manually and see whether it is filtered out.
Use loc to append the new row:
df.loc[10]=['2020-12-02','A007',1000]
df
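Before rerunning the filter, it can help to work out by hand where the new value falls. The sketch below uses only the standard library and recomputes the mean, the sample standard deviation (n - 1 in the denominator, the same convention as the pandas default) and the z-score of the added value 1000. The variable names are illustrative.

```python
import statistics

sales = list(range(1, 11)) + [1000]  # the ten original values plus the added outlier

mean = statistics.mean(sales)
std = statistics.stdev(sales)  # sample standard deviation, denominator n - 1
z_outlier = (1000 - mean) / std

print(f"mean = {mean:.2f}, std = {std:.2f}, z(1000) = {z_outlier:.3f}")
```

The z-score comes out at about 3.015, only just above the threshold of 3. The outlier itself inflates both the mean and the standard deviation, so on very small samples the 3σ rule has little margin.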
Run the 3σ outlier-elimination code again and view the result: the abnormal record with a sales volume of 1000 has been eliminated.
| | date | Commodity code | sales volumes |
|---|---|---|---|
| 0 | 2020-12-01 00:00:00 | A005 | 1 |
| 1 | 2020-12-01 00:00:00 | A014 | 2 |
| 2 | 2020-12-01 00:00:00 | A007 | 3 |
| 3 | 2020-12-01 00:00:00 | A012 | 4 |
| 4 | 2020-12-01 00:00:00 | A009 | 5 |
| 5 | 2020-12-01 00:00:00 | A019 | 6 |
| 6 | 2020-12-01 00:00:00 | A008 | 7 |
| 7 | 2020-12-01 00:00:00 | A019 | 8 |
| 8 | 2020-12-01 00:00:00 | A002 | 9 |
| 9 | 2020-12-02 00:00:00 | A005 | 10 |
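
The steps used in this article can also be wrapped in a small reusable helper. This is a minimal sketch, assuming pandas; the function name `remove_outliers_3sigma` and its parameter `k` are hypothetical and not part of pandas. Unlike the step-by-step version, it leaves the input DataFrame untouched and never adds a temporary z_score column.

```python
import pandas as pd

def remove_outliers_3sigma(df: pd.DataFrame, column: str, k: float = 3.0) -> pd.DataFrame:
    """Return the rows whose value in `column` lies within k sample standard deviations of the mean."""
    z_score = (df[column] - df[column].mean()) / df[column].std()
    return df.loc[z_score.abs() < k]

# Same sales figures as the example: 1..10 plus the manually added outlier 1000
data = pd.DataFrame({"sales volumes": list(range(1, 11)) + [1000]})
cleaned = remove_outliers_3sigma(data, "sales volumes")
print(cleaned["sales volumes"].tolist())  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

Testing `z_score.abs() < k` is equivalent to requiring both `z_score > -k` and `z_score < k`, which avoids the mistake of joining the two conditions with `|`.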