Data Cleaning in Data Modeling: Removing Outliers with the 3σ Rule
2022-07-07 06:40:00 【数据闲逛人】
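How the method works:
The 3σ rule, also known as the Pauta criterion (拉依达准则), starts from the assumption that a set of measurements contains only random error. The standard deviation is computed from the data, and an interval is fixed for a chosen probability. Any error that falls outside this interval is treated as a gross error rather than a random one, and the data point containing it should be removed.
In a normal distribution, σ denotes the standard deviation and μ the mean; x = μ is the axis of symmetry of the density curve.
The 3σ rule:
The probability that a value falls in (μ-σ, μ+σ) is 0.6827
The probability that a value falls in (μ-2σ, μ+2σ) is 0.9545
The probability that a value falls in (μ-3σ, μ+3σ) is 0.9973
In other words, the values of a normally distributed variable lie almost entirely within (μ-3σ, μ+3σ); the probability of falling outside that range is less than 0.3%.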
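The three probabilities can be checked numerically. For a normally distributed variable, P(|X - μ| < kσ) equals erf(k/√2). The sketch below uses only the standard library; the helper name `prob_within` is made up for this illustration.

```python
import math


def prob_within(k: float) -> float:
    """P(|X - mu| < k * sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} sigma: {prob_within(k):.4f}")
# within 1 sigma: 0.6827
# within 2 sigma: 0.9545
# within 3 sigma: 0.9973
```

These figures hold only when the data is at least roughly normal; for heavily skewed data the 3σ interval can cover noticeably more or less than 99.7% of the values.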
Sample data (columns: 日期 = date, 商品编码 = product code, 销售数量 = sales quantity):
| | 日期 | 商品编码 | 销售数量 |
|---|---|---|---|
| 0 | 2020-12-01 | A005 | 1 |
| 1 | 2020-12-01 | A014 | 2 |
| 2 | 2020-12-01 | A007 | 3 |
| 3 | 2020-12-01 | A012 | 4 |
| 4 | 2020-12-01 | A009 | 5 |
| 5 | 2020-12-01 | A019 | 6 |
| 6 | 2020-12-01 | A008 | 7 |
| 7 | 2020-12-01 | A019 | 8 |
| 8 | 2020-12-01 | A002 | 9 |
| 9 | 2020-12-02 | A005 | 10 |
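Now apply the 3σ rule to remove outliers.
df_avg = df['销售数量'].mean()  # mean of the sales quantity
df_std = df['销售数量'].std()  # sample standard deviation (pandas default, ddof=1)
df['z_score'] = (df['销售数量'] - df_avg) / df_std  # distance from the mean, in standard deviations
print(df)
# Keep only rows whose z-score lies strictly between -3 and 3.
# Both conditions must hold, so they are combined with & (using | would keep every row).
df = df.loc[(df['z_score'] > -3) & (df['z_score'] < 3)]
df = df.drop('z_score', axis=1)
print('Mean:', df_avg)
print('Standard deviation:', df_std)
Output:
Mean: 5.5
Standard deviation: 3.0276503540974917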
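The steps above can be wrapped in a small reusable helper. This is a minimal sketch, not the author's code: the function name `remove_outliers_3sigma`, its parameter `k`, and the single-column frame `sales` are assumptions introduced for illustration.

```python
import pandas as pd


def remove_outliers_3sigma(df: pd.DataFrame, column: str, k: float = 3.0) -> pd.DataFrame:
    """Return the rows whose value in `column` lies strictly within k sample
    standard deviations of the column mean. The input frame is not modified."""
    mean = df[column].mean()
    std = df[column].std()  # pandas default: sample standard deviation (ddof=1)
    if pd.isna(std) or std == 0:
        # Fewer than two rows, or all values identical: nothing can be flagged.
        return df.copy()
    z_score = (df[column] - mean) / std
    # Rows with a missing value get a NaN z-score and are dropped by this comparison.
    return df.loc[z_score.abs() < k].copy()


# Hypothetical frame: quantities 1..10 plus one injected value of 1000.
sales = pd.DataFrame({"销售数量": list(range(1, 11)) + [1000]})
cleaned = remove_outliers_3sigma(sales, "销售数量")
print(cleaned["销售数量"].tolist())
```

Computing the z-score as a temporary Series instead of a new column avoids having to drop `z_score` afterwards and leaves the caller's frame untouched.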
Result after processing:
| | 日期 | 商品编码 | 销售数量 |
|---|---|---|---|
| 0 | 2020-12-01 | A005 | 1 |
| 1 | 2020-12-01 | A014 | 2 |
| 2 | 2020-12-01 | A007 | 3 |
| 3 | 2020-12-01 | A012 | 4 |
| 4 | 2020-12-01 | A009 | 5 |
| 5 | 2020-12-01 | A019 | 6 |
| 6 | 2020-12-01 | A008 | 7 |
| 7 | 2020-12-01 | A019 | 8 |
| 8 | 2020-12-01 | A002 | 9 |
| 9 | 2020-12-02 | A005 | 10 |
As you can see, the result is unchanged. Now add an outlier by hand and see whether it gets filtered out.
Use loc to append the new row.
df.loc[10]=['2020-12-02','A007',1000]
df
Run the 3σ outlier-removal code again and check the result: the abnormal row with a sales quantity of 1000 has been removed.
| | 日期 | 商品编码 | 销售数量 |
|---|---|---|---|
| 0 | 2020-12-01 00:00:00 | A005 | 1 |
| 1 | 2020-12-01 00:00:00 | A014 | 2 |
| 2 | 2020-12-01 00:00:00 | A007 | 3 |
| 3 | 2020-12-01 00:00:00 | A012 | 4 |
| 4 | 2020-12-01 00:00:00 | A009 | 5 |
| 5 | 2020-12-01 00:00:00 | A019 | 6 |
| 6 | 2020-12-01 00:00:00 | A008 | 7 |
| 7 | 2020-12-01 00:00:00 | A019 | 8 |
| 8 | 2020-12-01 00:00:00 | A002 | 9 |
| 9 | 2020-12-02 00:00:00 | A005 | 10 |
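One caveat is worth checking on small samples like this one. The outlier itself inflates both the mean and the standard deviation, and with the sample standard deviation (ddof=1) no single point in a sample of n values can have |z| larger than (n-1)/√n. That bound is about 2.85 for n = 10 and about 3.02 for n = 11, so the 3σ filter cannot remove anything from ten or fewer rows, and the value 1000 only just clears the threshold here. The sketch below, standard library only and with a made-up helper name `max_abs_z`, verifies this.

```python
import math
import statistics


def max_abs_z(n: int) -> float:
    """Largest |z| any single point can reach in a sample of size n
    when z is computed with the sample standard deviation (ddof=1)."""
    return (n - 1) / math.sqrt(n)


# The ten original quantities plus the injected outlier.
values = list(range(1, 11)) + [1000]
mean = statistics.mean(values)
std = statistics.stdev(values)  # sample standard deviation, same convention as pandas .std()
z_outlier = (1000 - mean) / std

print(f"bound for n=10: {max_abs_z(10):.3f}")
print(f"bound for n=11: {max_abs_z(11):.3f}")
print(f"z-score of 1000: {z_outlier:.3f}")
```

For short series, a median-based rule (for example one built on the median absolute deviation) is a common alternative, since the median is far less affected by a single extreme value.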