Data Cleaning in Data Modeling: Removing Outliers with the 3σ Rule
2022-07-07 06:40:00 【数据闲逛人】
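Principle:
The 3σ criterion, also known as the Pauta criterion (拉依达准则), starts from the assumption that a set of measurements contains only random error. The standard deviation is computed from the data, an interval is chosen for a given probability, and any error that falls outside this interval is treated as a gross error rather than a random one. Data containing such an error should be removed.
In a normal distribution, σ denotes the standard deviation and μ the mean; x = μ is the axis of symmetry of the density curve.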
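The 3σ rule:
A value falls in (μ-σ, μ+σ) with probability 0.6827
A value falls in (μ-2σ, μ+2σ) with probability 0.9545
A value falls in (μ-3σ, μ+3σ) with probability 0.9973
In other words, a normally distributed variable takes almost all of its values inside (μ-3σ, μ+3σ); the chance of landing outside this range is less than 0.3%.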
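These probabilities can be checked numerically. A minimal sketch using only Python's standard library: for a normal variable, P(|X - μ| < kσ) = erf(k/√2). The helper name `prob_within` is ours and does not come from the original post.

```python
import math

def prob_within(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {prob_within(k):.4f}")
```

The result depends only on k, not on the particular μ and σ, which is why the rule can be stated once for every normal distribution.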
Sample data (columns: 日期 = date, 商品编码 = product code, 销售数量 = sales quantity):
|   | 日期 | 商品编码 | 销售数量 |
|---|---|---|---|
| 0 | 2020-12-01 | A005 | 1 |
| 1 | 2020-12-01 | A014 | 2 |
| 2 | 2020-12-01 | A007 | 3 |
| 3 | 2020-12-01 | A012 | 4 |
| 4 | 2020-12-01 | A009 | 5 |
| 5 | 2020-12-01 | A019 | 6 |
| 6 | 2020-12-01 | A008 | 7 |
| 7 | 2020-12-01 | A019 | 8 |
| 8 | 2020-12-01 | A002 | 9 |
| 9 | 2020-12-02 | A005 | 10 |
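The post does not show how `df` was created. A minimal sketch, assuming pandas is installed and that 日期 is stored as a datetime column (an assumption on our part), that rebuilds the ten rows above:

```python
import pandas as pd

# Rebuild the sample table; storing the dates as datetimes is an assumption.
df = pd.DataFrame({
    "日期": pd.to_datetime(["2020-12-01"] * 9 + ["2020-12-02"]),
    "商品编码": ["A005", "A014", "A007", "A012", "A009",
                 "A019", "A008", "A019", "A002", "A005"],
    "销售数量": list(range(1, 11)),
})
print(df)
```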
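Now apply the 3σ rule to remove outliers.
df_avg = df['销售数量'].mean()  # mean
df_std = df['销售数量'].std()   # sample standard deviation (pandas default, ddof=1)
df['z_score'] = (df['销售数量'] - df_avg) / df_std
print(df)  # inspect the z-scores
# Keep only rows within 3 standard deviations of the mean, in either direction.
# Both conditions must hold, so they are combined with & (using | would keep every row).
df = df.loc[(df['z_score'] > -3) & (df['z_score'] < 3)]
df = df.drop('z_score', axis=1)
print('Mean:', df_avg)
print('Std:', df_std)
Output of the last two print calls:
Mean: 5.5
Std: 3.0276503540974917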
Check the result after processing:
|   | 日期 | 商品编码 | 销售数量 |
|---|---|---|---|
| 0 | 2020-12-01 | A005 | 1 |
| 1 | 2020-12-01 | A014 | 2 |
| 2 | 2020-12-01 | A007 | 3 |
| 3 | 2020-12-01 | A012 | 4 |
| 4 | 2020-12-01 | A009 | 5 |
| 5 | 2020-12-01 | A019 | 6 |
| 6 | 2020-12-01 | A008 | 7 |
| 7 | 2020-12-01 | A019 | 8 |
| 8 | 2020-12-01 | A002 | 9 |
| 9 | 2020-12-02 | A005 | 10 |
The result is unchanged. Now add an outlier by hand and see whether it gets filtered out.
Use the loc indexer to append the row.
df.loc[10]=['2020-12-02','A007',1000]
df
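Before re-running the filter, it can help to see how extreme the new row actually is. A minimal sketch using only the standard library; the variable names are ours. The bound (n - 1)/√n on the largest possible |z| when the sample standard deviation is used is a known property (Samuelson's inequality), not something stated in the original post.

```python
import math
import statistics

values = list(range(1, 11)) + [1000]  # the ten original quantities plus the injected 1000
mean = statistics.mean(values)
std = statistics.stdev(values)        # sample standard deviation (n - 1), like pandas' default
z_outlier = (1000 - mean) / std

n = len(values)
max_possible_z = (n - 1) / math.sqrt(n)  # upper bound on |z| for any sample of size n

print(f"z-score of 1000: {z_outlier:.3f}")
print(f"largest possible |z| with n={n}: {max_possible_z:.3f}")
```

Both numbers come out just above 3, so the row is flagged only narrowly. With the original ten rows the bound is 9/√10 ≈ 2.846, which means no value, however large, could have exceeded the threshold of 3. On very small samples the 3σ rule should therefore be used with caution.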
Run the 3σ outlier removal code again and check the result: the abnormal row with a sales quantity of 1000 has been removed.
|   | 日期 | 商品编码 | 销售数量 |
|---|---|---|---|
| 0 | 2020-12-01 00:00:00 | A005 | 1 |
| 1 | 2020-12-01 00:00:00 | A014 | 2 |
| 2 | 2020-12-01 00:00:00 | A007 | 3 |
| 3 | 2020-12-01 00:00:00 | A012 | 4 |
| 4 | 2020-12-01 00:00:00 | A009 | 5 |
| 5 | 2020-12-01 00:00:00 | A019 | 6 |
| 6 | 2020-12-01 00:00:00 | A008 | 7 |
| 7 | 2020-12-01 00:00:00 | A019 | 8 |
| 8 | 2020-12-01 00:00:00 | A002 | 9 |
| 9 | 2020-12-02 00:00:00 | A005 | 10 |
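The steps in this post can be wrapped into a small reusable function. A minimal sketch, assuming pandas is installed; the function name `drop_3sigma_outliers` and the parameter `k` are hypothetical and do not appear in the original post.

```python
import pandas as pd

def drop_3sigma_outliers(df: pd.DataFrame, column: str, k: float = 3.0) -> pd.DataFrame:
    """Return the rows whose value in `column` lies within k sample standard deviations of the mean."""
    values = df[column]
    # Series.std() uses ddof=1 by default. If every value is identical the
    # z-scores are NaN and no row passes the filter.
    z = (values - values.mean()) / values.std()
    return df[z.abs() < k]

# Illustrative data: the ten quantities from the post plus the injected outlier.
sales = pd.DataFrame({"销售数量": list(range(1, 11)) + [1000]})
cleaned = drop_3sigma_outliers(sales, "销售数量")
print(cleaned["销售数量"].tolist())
```

Unlike the step-by-step version, this returns a filtered copy and never adds a temporary `z_score` column to the caller's DataFrame.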