Data cleaning in data modeling: removing outliers with the 3σ rule
2022-07-07 06:40:00 【数据闲逛人】
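How the method works:
The 3σ rule, also known as the Pauta criterion (拉依达准则), starts from the assumption that a set of measurements contains only random error. The standard deviation is computed from the data and an interval is fixed at a chosen probability level. Any error that falls outside this interval is treated as a gross error rather than a random one, and the observation carrying it should be discarded.
In a normal distribution, σ denotes the standard deviation and μ the mean; x = μ is the axis of symmetry of the density curve.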
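The 3σ rule:
The probability that a value falls in (μ-σ, μ+σ) is 0.6827
The probability that a value falls in (μ-2σ, μ+2σ) is 0.9545
The probability that a value falls in (μ-3σ, μ+3σ) is 0.9973
In other words, almost all values of a normally distributed variable lie inside (μ-3σ, μ+3σ); the chance of falling outside this range is less than 0.3%.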
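These coverage probabilities can be checked directly. The sketch below assumes an exactly normal distribution and uses the identity P(|X - μ| < kσ) = erf(k/√2); the helper name prob_within is made up for illustration.

```python
import math


def prob_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


# Print the coverage for 1, 2 and 3 standard deviations, rounded to four decimals.
for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
```

Real sales data is rarely exactly normal, so these figures are best read as a rule of thumb rather than a guarantee.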
Sample data (columns: 日期 = date, 商品编码 = product code, 销售数量 = sales quantity):
| | 日期 | 商品编码 | 销售数量 |
|---|---|---|---|
| 0 | 2020-12-01 | A005 | 1 |
| 1 | 2020-12-01 | A014 | 2 |
| 2 | 2020-12-01 | A007 | 3 |
| 3 | 2020-12-01 | A012 | 4 |
| 4 | 2020-12-01 | A009 | 5 |
| 5 | 2020-12-01 | A019 | 6 |
| 6 | 2020-12-01 | A008 | 7 |
| 7 | 2020-12-01 | A019 | 8 |
| 8 | 2020-12-01 | A002 | 9 |
| 9 | 2020-12-02 | A005 | 10 |
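The post does not show how df is created. A minimal sketch for reproducing the table, assuming it is typed in by hand with pandas and that the 日期 column holds datetimes (the original may well have been loaded from a file instead):

```python
import pandas as pd

# Hand-built copy of the sample table; the construction itself is an assumption,
# only the values come from the table above.
df = pd.DataFrame({
    '日期': pd.to_datetime(['2020-12-01'] * 9 + ['2020-12-02']),
    '商品编码': ['A005', 'A014', 'A007', 'A012', 'A009',
                 'A019', 'A008', 'A019', 'A002', 'A005'],
    '销售数量': list(range(1, 11)),
})
print(df)
```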
Now apply the 3σ rule to remove outliers.
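df_avg = df['销售数量'].mean()  # mean of the sales quantity
df_std = df['销售数量'].std()  # sample standard deviation (pandas uses ddof=1)
df['z_score'] = (df['销售数量'] - df_avg) / df_std  # distance from the mean in units of std
print(df)
# Keep only rows within 3 standard deviations of the mean. The two bounds must be
# combined with & (and); with | (or) every row passes and nothing is ever removed.
df = df.loc[(df['z_score'] > -3) & (df['z_score'] < 3)]
df = df.drop('z_score', axis=1)
print('Mean:', df_avg)
print('Std:', df_std)
Mean: 5.5
Std: 3.0276503540974917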
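Since the same steps are run again later in the post, they can be wrapped in a small helper. This is only a sketch: the function name drop_3sigma_outliers and its parameters are made up here, and it assumes the column is numeric and not constant (a zero standard deviation yields NaN z-scores, so every row would be dropped).

```python
import pandas as pd


def drop_3sigma_outliers(frame, column, k=3.0):
    """Return the rows of frame whose column value lies within k sample std of the mean."""
    values = frame[column]
    z = (values - values.mean()) / values.std()  # Series.std() defaults to ddof=1
    return frame.loc[z.abs() < k]  # same as (z > -k) & (z < k)


# Illustrative data: the ten sample quantities from the table.
demo = pd.DataFrame({'销售数量': list(range(1, 11))})
cleaned = drop_3sigma_outliers(demo, '销售数量')
print(len(demo), len(cleaned))
```

The helper returns a filtered copy and leaves the input frame without a temporary z_score column, so it can be re-applied after new rows are added.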
Inspect the result after processing:
| | 日期 | 商品编码 | 销售数量 |
|---|---|---|---|
| 0 | 2020-12-01 | A005 | 1 |
| 1 | 2020-12-01 | A014 | 2 |
| 2 | 2020-12-01 | A007 | 3 |
| 3 | 2020-12-01 | A012 | 4 |
| 4 | 2020-12-01 | A009 | 5 |
| 5 | 2020-12-01 | A019 | 6 |
| 6 | 2020-12-01 | A008 | 7 |
| 7 | 2020-12-01 | A019 | 8 |
| 8 | 2020-12-01 | A002 | 9 |
| 9 | 2020-12-02 | A005 | 10 |
The result is unchanged. Now add an outlier by hand and see whether it gets filtered out.
The new row is appended through the df.loc indexer.
df.loc[10]=['2020-12-02','A007',1000]
df
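Rerun the 3σ filtering code (with the two bounds combined using &) and check the result: the abnormal row with a sales quantity of 1000 has a z-score just above 3 and has been removed.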
| | 日期 | 商品编码 | 销售数量 |
|---|---|---|---|
| 0 | 2020-12-01 00:00:00 | A005 | 1 |
| 1 | 2020-12-01 00:00:00 | A014 | 2 |
| 2 | 2020-12-01 00:00:00 | A007 | 3 |
| 3 | 2020-12-01 00:00:00 | A012 | 4 |
| 4 | 2020-12-01 00:00:00 | A009 | 5 |
| 5 | 2020-12-01 00:00:00 | A019 | 6 |
| 6 | 2020-12-01 00:00:00 | A008 | 7 |
| 7 | 2020-12-01 00:00:00 | A019 | 8 |
| 8 | 2020-12-01 00:00:00 | A002 | 9 |
| 9 | 2020-12-02 00:00:00 | A005 | 10 |
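One point worth checking is how close this example sits to the threshold. When the mean and the sample standard deviation are computed from the same n values, including the outlier, no single point can have a z-score larger than (n-1)/√n (a consequence of Samuelson's inequality). The sketch below, using only the standard library, computes the z-score of the value 1000 among the eleven quantities and compares it with that bound; the helper name max_sample_z is made up for illustration.

```python
import math
import statistics


def max_sample_z(n):
    """Largest possible |z| of one point among n values when the sample std (ddof=1) is used."""
    return (n - 1) / math.sqrt(n)


data = list(range(1, 11)) + [1000]
mean = statistics.mean(data)
std = statistics.stdev(data)  # sample standard deviation, matching pandas Series.std()
z_outlier = (1000 - mean) / std

print('z-score of 1000:', z_outlier)
print('upper bound for n = 11:', max_sample_z(11))
print('upper bound for n = 10:', max_sample_z(10))
```

The value 1000 lands only slightly above 3 even though it is about a hundred times larger than the other quantities, and with ten rows or fewer the bound is below 3, so this filter could not remove any row at all. On small samples the 3σ rule should therefore be applied with some caution.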