Using the 3σ rule to eliminate outliers for data cleaning in data modeling
2022-07-07 09:43:00 【Data stroller】
Principle of the method:
The 3σ criterion, also known as the Pauta (Laida) criterion, assumes that a set of measurements contains only random errors. The standard deviation of the data is computed and an interval is determined for a chosen probability. Any value whose deviation from the mean falls outside this interval is treated as a gross error rather than a random error, and the data point containing it should be removed.
In a normal distribution, σ is the standard deviation and μ is the mean; x = μ is the axis of symmetry of the density curve.
3σ rule:
The probability that a value lies in (μ-σ, μ+σ) is 0.6827
The probability that a value lies in (μ-2σ, μ+2σ) is 0.9545
The probability that a value lies in (μ-3σ, μ+3σ) is 0.9973
In other words, almost all values of Y are concentrated in the interval (μ-3σ, μ+3σ); the probability of falling outside it is less than 0.3%.
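The coverage probabilities of the 3σ rule can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper name `coverage` is made up for illustration. It relies on the identity P(|X-μ| < kσ) = erf(k/√2) for a normally distributed X.

```python
import math

def coverage(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {coverage(k):.4f}")
```

Rounded to four decimals, the three values are 0.6827, 0.9545 and 0.9973.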
Example data:
| | date | Commodity code | sales volumes |
|---|---|---|---|
| 0 | 2020-12-01 | A005 | 1 |
| 1 | 2020-12-01 | A014 | 2 |
| 2 | 2020-12-01 | A007 | 3 |
| 3 | 2020-12-01 | A012 | 4 |
| 4 | 2020-12-01 | A009 | 5 |
| 5 | 2020-12-01 | A019 | 6 |
| 6 | 2020-12-01 | A008 | 7 |
| 7 | 2020-12-01 | A019 | 8 |
| 8 | 2020-12-01 | A002 | 9 |
| 9 | 2020-12-02 | A005 | 10 |
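The article does not show how `df` is created. One possible way to rebuild the example table is sketched below; it assumes pandas is installed, that the column names match the table header, and that the date column is parsed as datetimes, none of which is stated in the original.

```python
import pandas as pd

# Rebuild the ten example rows; the column names are taken from the table header (an assumption).
df = pd.DataFrame({
    'date': pd.to_datetime(['2020-12-01'] * 9 + ['2020-12-02']),
    'Commodity code': ['A005', 'A014', 'A007', 'A012', 'A009',
                       'A019', 'A008', 'A019', 'A002', 'A005'],
    'sales volumes': list(range(1, 11)),
})
print(df)
```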
Now apply the 3σ rule to eliminate outliers.
df_avg = df['sales volumes'].mean()  # Calculate the mean
df_std = df['sales volumes'].std()  # Calculate the sample standard deviation (ddof=1)
df['z_score'] = (df['sales volumes'] - df_avg) / df_std  # Distance from the mean in units of σ
print(df)
# Keep only rows with -3 < z_score < 3; both conditions must hold, so combine them with & (not |)
df = df.loc[(df['z_score'] > -3) & (df['z_score'] < 3)]
df = df.drop('z_score', axis=1)
print('mean value:', df_avg)
print('Standard deviation:', df_std)
Output:
mean value: 5.5
Standard deviation: 3.0276503540974917
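The same steps can be wrapped in a reusable function. This is a sketch rather than the author's code: the function name `remove_outliers_3sigma`, its parameter `k`, and the one-column frame `sample` are all hypothetical, and pandas is assumed to be installed.

```python
import pandas as pd

def remove_outliers_3sigma(frame: pd.DataFrame, column: str, k: float = 3.0) -> pd.DataFrame:
    """Return the rows whose value in `column` lies within k standard deviations of the mean."""
    values = frame[column]
    std = values.std()  # sample standard deviation (ddof=1), the pandas default
    if pd.isna(std) or std == 0:
        return frame.copy()  # too few rows or all values identical: nothing to remove
    z = (values - values.mean()) / std
    return frame.loc[z.abs() < k].copy()

# Hypothetical check: the ten example sales values plus one extreme value.
sample = pd.DataFrame({'sales volumes': list(range(1, 11)) + [1000]})
cleaned = remove_outliers_3sigma(sample, 'sales volumes')
print(len(sample), len(cleaned))
```

Returning a copy leaves the input frame untouched and avoids adding a temporary `z_score` column that has to be dropped afterwards.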
View the processed result:
| | date | Commodity code | sales volumes |
|---|---|---|---|
| 0 | 2020-12-01 | A005 | 1 |
| 1 | 2020-12-01 | A014 | 2 |
| 2 | 2020-12-01 | A007 | 3 |
| 3 | 2020-12-01 | A012 | 4 |
| 4 | 2020-12-01 | A009 | 5 |
| 5 | 2020-12-01 | A019 | 6 |
| 6 | 2020-12-01 | A008 | 7 |
| 7 | 2020-12-01 | A019 | 8 |
| 8 | 2020-12-01 | A002 | 9 |
| 9 | 2020-12-02 | A005 | 10 |
As you can see, the result is unchanged, because none of the values lies more than 3σ from the mean. Now manually add an outlier and see whether it is filtered out.
Use loc to append the new row.
df.loc[10]=['2020-12-02','A007',1000]
df
Run the 3σ outlier-elimination code again and view the result: the abnormal record with a sales volume of 1000 has been eliminated.
| | date | Commodity code | sales volumes |
|---|---|---|---|
| 0 | 2020-12-01 00:00:00 | A005 | 1 |
| 1 | 2020-12-01 00:00:00 | A014 | 2 |
| 2 | 2020-12-01 00:00:00 | A007 | 3 |
| 3 | 2020-12-01 00:00:00 | A012 | 4 |
| 4 | 2020-12-01 00:00:00 | A009 | 5 |
| 5 | 2020-12-01 00:00:00 | A019 | 6 |
| 6 | 2020-12-01 00:00:00 | A008 | 7 |
| 7 | 2020-12-01 00:00:00 | A019 | 8 |
| 8 | 2020-12-01 00:00:00 | A002 | 9 |
| 9 | 2020-12-02 00:00:00 | A005 | 10 |
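One caveat is worth checking before relying on this threshold for small tables. When the sample standard deviation is used, the largest possible |z| in a sample of n values is (n-1)/√n, so a single extreme value can only exceed 3σ once n is at least 11. The sketch below uses only the standard library, and the helper `max_abs_z` is hypothetical. It compares the eleven-row case from this example with a ten-row variant.

```python
import math
import statistics

def max_abs_z(values):
    """Largest absolute z-score in the sample, using the sample standard deviation (ddof=1)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return max(abs(v - mean) / std for v in values)

eleven_rows = list(range(1, 11)) + [1000]  # the example data plus the added outlier
ten_rows = list(range(1, 10)) + [1000]     # a hypothetical variant with one row fewer

for rows in (eleven_rows, ten_rows):
    n = len(rows)
    bound = (n - 1) / math.sqrt(n)         # theoretical upper limit for |z| with n values
    print(n, max_abs_z(rows), bound)
```

With eleven rows the value 1000 scores just above 3 and is removed; with ten rows it stays below 3 and would survive the filter, however large it is.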