Data cleaning in data modeling: removing outliers with the 3σ rule
2022-07-07 06:40:00 【数据闲逛人】
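How the method works:
The 3σ rule, also known as the Pauta criterion (拉依达准则), starts from the assumption that a set of measurements contains only random error. The standard deviation is computed from the data and an interval is fixed for a chosen probability. Any error that falls outside this interval is treated as a gross error rather than a random error, and the data point carrying it should be removed.
In a normal distribution, σ denotes the standard deviation and μ the mean; x = μ is the axis of symmetry of the density curve.
The 3σ rule:
The probability that a value falls in (μ-σ, μ+σ) is 0.6827
The probability that a value falls in (μ-2σ, μ+2σ) is 0.9545
The probability that a value falls in (μ-3σ, μ+3σ) is 0.9973
In other words, the values of X lie almost entirely within (μ-3σ, μ+3σ); the chance of falling outside this range is less than 0.3%.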
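The three probabilities above can be checked numerically. The sketch below assumes an exactly normal distribution and uses the identity P(|X - μ| < kσ) = erf(k/√2); the helper name `prob_within` is made up for this illustration.

```python
import math

def prob_within(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

# Print the coverage of the 1σ, 2σ and 3σ intervals, rounded to 4 decimals.
for k in (1, 2, 3):
    print(f"P(|X - mu| < {k} sigma) = {prob_within(k):.4f}")
```

Real sales data is often skewed rather than normal, so these percentages are best read as a rough guide and not as a guarantee.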
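Sample data (the column names 日期, 商品编码 and 销售数量 mean date, product code and sales quantity; they are kept in Chinese because the code refers to them):
| | 日期 | 商品编码 | 销售数量 |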
|---|---|---|---|
| 0 | 2020-12-01 | A005 | 1 |
| 1 | 2020-12-01 | A014 | 2 |
| 2 | 2020-12-01 | A007 | 3 |
| 3 | 2020-12-01 | A012 | 4 |
| 4 | 2020-12-01 | A009 | 5 |
| 5 | 2020-12-01 | A019 | 6 |
| 6 | 2020-12-01 | A008 | 7 |
| 7 | 2020-12-01 | A019 | 8 |
| 8 | 2020-12-01 | A002 | 9 |
| 9 | 2020-12-02 | A005 | 10 |
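Now apply the 3σ rule to remove outliers.
df_avg = df['销售数量'].mean() # mean
df_std = df['销售数量'].std() # sample standard deviation (pandas default, ddof=1)
df['z_score'] = (df['销售数量'] - df_avg) / df_std
print(df) # show the data together with the z-scores
df = df.loc[(df['z_score'] > -3) & (df['z_score'] < 3)] # keep only rows within 3σ of the mean; both conditions must hold, so use & (with | every row would pass)
df = df.drop('z_score', axis=1)
print('mean:', df_avg)
print('std:', df_std)
The last two lines print:
mean: 5.5
std: 3.0276503540974917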
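The same steps can be wrapped in a reusable function. This is only a sketch under a few assumptions: `drop_3sigma_outliers` and its parameters are hypothetical names, the sample frame is rebuilt from the table above, and the zero-variance guard is an addition not present in the original code.

```python
import pandas as pd

def drop_3sigma_outliers(df: pd.DataFrame, column: str, k: float = 3.0) -> pd.DataFrame:
    """Return the rows whose z-score in `column` lies strictly inside (-k, k)."""
    mean = df[column].mean()
    std = df[column].std()  # sample standard deviation (ddof=1), the pandas default
    if pd.isna(std) or std == 0:
        # Fewer than two rows or all values equal: z-scores are undefined, keep everything.
        return df.copy()
    z = (df[column] - mean) / std
    return df.loc[z.abs() < k]

# Sample data rebuilt from the table above.
sample = pd.DataFrame({
    "日期": ["2020-12-01"] * 9 + ["2020-12-02"],
    "商品编码": ["A005", "A014", "A007", "A012", "A009",
                 "A019", "A008", "A019", "A002", "A005"],
    "销售数量": list(range(1, 11)),
})

cleaned = drop_3sigma_outliers(sample, "销售数量")
print(len(sample), len(cleaned))
```

Using `z.abs() < k` expresses the two-sided condition in a single comparison, which avoids mixing up `&` and `|`.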
Check the processed result:
| | 日期 | 商品编码 | 销售数量 |
|---|---|---|---|
| 0 | 2020-12-01 | A005 | 1 |
| 1 | 2020-12-01 | A014 | 2 |
| 2 | 2020-12-01 | A007 | 3 |
| 3 | 2020-12-01 | A012 | 4 |
| 4 | 2020-12-01 | A009 | 5 |
| 5 | 2020-12-01 | A019 | 6 |
| 6 | 2020-12-01 | A008 | 7 |
| 7 | 2020-12-01 | A019 | 8 |
| 8 | 2020-12-01 | A002 | 9 |
| 9 | 2020-12-02 | A005 | 10 |
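The table above is identical to the input, which is expected for a sample this small. When the standard deviation is the sample one (ddof=1, as in pandas), no point in a sample of size n can have |z| larger than (n-1)/√n, and for n = 10 that limit is below 3. A minimal sketch using only the standard library, with the ten quantities from the table:

```python
import math
import statistics

values = list(range(1, 11))      # the 10 sales quantities from the table
n = len(values)
mean = statistics.mean(values)
std = statistics.stdev(values)   # sample standard deviation (ddof=1), like pandas .std()

# Largest absolute z-score actually present in the data.
max_abs_z = max(abs(v - mean) / std for v in values)

# Largest |z| that any sample of size n can contain when ddof=1 is used.
bound = (n - 1) / math.sqrt(n)

print(round(max_abs_z, 3), round(bound, 3))
```

With this definition of the standard deviation, at least 11 observations are needed before the 3σ filter can remove anything at all.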
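The result is unchanged. Now add an outlier by hand and see whether it gets filtered out.
Append the row with loc:
df.loc[10] = ['2020-12-02', 'A007', 1000]
df
Run the 3σ filtering code again and check the result: the abnormal row with a sales quantity of 1000 has been removed (its z-score is about 3.02, just above the threshold).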
| | 日期 | 商品编码 | 销售数量 |
|---|---|---|---|
| 0 | 2020-12-01 00:00:00 | A005 | 1 |
| 1 | 2020-12-01 00:00:00 | A014 | 2 |
| 2 | 2020-12-01 00:00:00 | A007 | 3 |
| 3 | 2020-12-01 00:00:00 | A012 | 4 |
| 4 | 2020-12-01 00:00:00 | A009 | 5 |
| 5 | 2020-12-01 00:00:00 | A019 | 6 |
| 6 | 2020-12-01 00:00:00 | A008 | 7 |
| 7 | 2020-12-01 00:00:00 | A019 | 8 |
| 8 | 2020-12-01 00:00:00 | A002 | 9 |
| 9 | 2020-12-02 00:00:00 | A005 | 10 |