Illustrated Data Analysis | Data Cleaning and Preprocessing
2022-06-12 02:01:00 【ShowMeAI】

Author: Han Xinzi @ShowMeAI
Tutorial address: http://www.showmeai.tech/tutorials/33
Article address: http://www.showmeai.tech/article-detail/138
Statement: All rights reserved. For reprints, please contact the platform and the author and credit the source.

Data analysis is divided into three core steps: business cognition and data exploration, data preprocessing, and business analysis and data mining. This article covers the second step, data preprocessing.
Data cannot be assumed to be valid.
In the real world, data is generally heterogeneous, incomplete, and dimensional. Some data comes from several different sources. Each of these heterogeneous datasets is correct within its own system, but each has its own "personality".
For example, some systems use 0 and 1 for gender, while others use f and m.
- Before using the data, regularize it first: use consistent units, use uniform text to describe objects, and so on.
- Some data contains many duplicates, missing values, or outliers. Before starting the analysis, check that the data is valid and preprocess it.
- Identifying outliers and analyzing them sometimes leads to major discoveries.
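As a minimal sketch of this kind of regularization, the gender example above could be unified with pandas roughly as follows. The two series, the label names, and the assumption that 0 means female and 1 means male are illustrative only; the real meaning of each code has to come from the source system's documentation.

```python
import pandas as pd

# Hypothetical extracts from two systems that encode gender differently.
system_a = pd.Series([0, 1, 1, 0])       # numeric codes
system_b = pd.Series(["f", "m", "f"])    # letter codes

# Assumption for illustration only: 0 = female, 1 = male.
map_a = {0: "female", 1: "male"}
map_b = {"f": "female", "m": "male"}

# Translate both encodings to one shared vocabulary, then combine.
unified = pd.concat(
    [system_a.map(map_a), system_b.map(map_b)],
    ignore_index=True,
)
print(unified.tolist())
```

Codes that are missing from a mapping become NaN after `map`, which makes unexpected values easy to spot before the analysis starts.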
1. Data regularization
1.1 Dimension
Simply put, the dimension is the unit of the data. Some data is dimensional, such as height; some is dimensionless, such as the male-to-female ratio. Different evaluation indicators often have different dimensions, and their values can differ greatly in magnitude. Left unprocessed, this affects the results of data analysis.

1.2 Data standardization
To eliminate the influence of differences in dimension and value range between indicators on the analysis results, the data needs to be standardized. That is, the data is scaled proportionally so that it falls into a specific interval, which makes comprehensive analysis easier.
1.3 Data normalization
Normalization is the simplest form of data standardization. Its purpose is to map numbers to decimals in [0, 1], converting dimensional data into dimensionless pure quantities. Normalization removes the influence of value range and dimension on the data, which makes comprehensive analysis easier.
Example
A simple example: in an exam, Xiao Ming scores 100 in Chinese and 100 in English. Judging from the raw scores alone, Xiao Ming is as good at Chinese as at English. However, if you know that the full score for Chinese is 150 and the full score for English is only 120, do you still think the two results are the same?
Apply a simple normalization to Xiao Ming's scores:
Using min-max (range) normalization, the formula is y = (x - min) / range. Setting min = 0 gives range = max - min = max, so Xiao Ming's Chinese score becomes 100/150 = 4/6 and his English score becomes 100/120 = 5/6. On this basis, Xiao Ming's English is better than his Chinese.
Back in the real scenario, the subjects differ in difficulty. Suppose the lowest Chinese score in the class is min_Chinese = 60 and the lowest English score is min_English = 85. Then Xiao Ming's Chinese score is (100-60)/(150-60) = 0.44 and his English score is (100-85)/(120-85) = 0.43. On this basis, Xiao Ming's English is slightly worse than his Chinese.
Normalization makes data with different value ranges and different dimensions comparable, so that the results of data analysis are more comprehensive and closer to the truth.
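The arithmetic in the Xiao Ming example can be reproduced with a short sketch. The function name `min_max_normalize` and the variable names are illustrative choices, not part of the original tutorial code.

```python
def min_max_normalize(x, min_value, max_value):
    """Map x into [0, 1] using y = (x - min) / (max - min)."""
    value_range = max_value - min_value
    if value_range == 0:
        raise ValueError("max_value and min_value must differ")
    return (x - min_value) / value_range

# Case 1: min = 0, so the range equals the full score.
chinese_simple = min_max_normalize(100, 0, 150)   # 100/150
english_simple = min_max_normalize(100, 0, 120)   # 100/120

# Case 2: min is the lowest score in the class.
chinese_class = min_max_normalize(100, 60, 150)   # 40/90
english_class = min_max_normalize(100, 85, 120)   # 15/35

print(round(chinese_simple, 2), round(english_simple, 2))
print(round(chinese_class, 2), round(english_class, 2))
```

The two cases rank the subjects differently, which is the point of the example: the choice of min and max determines what the normalized value means.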
2. Outlier detection and analysis
In statistics, the full name of an abnormal value is "suspected abnormal value". It is also called an outlier, and the analysis of abnormal values is also called outlier analysis.
Outlier analysis checks whether the data contains unreasonable values. In data analysis, outliers can neither be ignored nor simply dropped from the analysis. Paying attention to outliers and analyzing their causes often becomes an opportunity to discover new problems and improve decisions.

In the figure above, the outlier deviates strongly from the other observations. Note that an outlier is an abnormal data point, but not necessarily an erroneous one.
2.1 Outlier detection
(1) Descriptive analysis methods
During data processing, you can run a descriptive analysis of the data and then look for unreasonable values. The most commonly used statistics are the maximum and minimum, which show whether a variable's values fall outside a reasonable range. For example, a maximum customer age of 199 is an abnormal value.
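A minimal sketch of this check with pandas is shown below. The sample ages and the plausible range of 0 to 120 are assumptions made for illustration; a real range should come from business knowledge.

```python
import pandas as pd

# Hypothetical customer ages; 199 is the kind of value a range check catches.
ages = pd.Series([23, 35, 41, 199, 28], name="age")

# describe() reports count, mean, std, min, quartiles, and max in one call.
summary = ages.describe()
print(summary[["min", "max"]])

# Assumed plausible range for a human age (illustrative only).
AGE_MIN, AGE_MAX = 0, 120
suspicious = ages[(ages < AGE_MIN) | (ages > AGE_MAX)]
print(suspicious.tolist())
```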
(2) Z-Score method

[1] 3σ principle
Before introducing the Z-score method, consider the 3σ principle. If the data follows a normal distribution, the 3σ principle defines an outlier as "a value in a set of measurements that deviates from the mean by more than three standard deviations".
Under a normal distribution, the probability of a value lying more than 3σ from the mean is P(|x-μ|>3σ) <= 0.003, which is a very low-probability event. So, under the 3σ principle, an observation whose difference from the mean exceeds 3 standard deviations can be treated as an outlier.
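The probability bound quoted above can be checked numerically. The sketch below uses the standard library's error function; the helper name `normal_tail_probability` is an illustrative choice.

```python
import math

def normal_tail_probability(k):
    """Return P(|x - mu| > k * sigma) for a normally distributed x."""
    # For a normal distribution, P(|x - mu| <= k * sigma) = erf(k / sqrt(2)).
    return 1.0 - math.erf(k / math.sqrt(2.0))

p3 = normal_tail_probability(3)
print(f"{p3:.4f}")  # about 0.0027, consistent with the bound of 0.003
```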
[2] Z-Score
If the data does not follow a normal distribution, you can still describe a value by "how many standard deviations it lies from the mean". This multiple is the Z-score.
The Z-score uses the standard deviation (σ) as its unit to measure how far a raw score (X) deviates from the mean (μ). The Z-score threshold has to be chosen based on experience and the actual situation. Usually, data points more than 3 standard deviations from the mean are treated as outliers.
The Python implementation is as follows:
import numpy as np
import pandas as pd

def detect_outliers(data, threshold=3):
    # Mean and (population) standard deviation of the whole sample
    mean_d = np.mean(data)
    std_d = np.std(data)
    outliers = []
    for y in data:
        # Z-score: distance from the mean in units of standard deviation
        z_score = (y - mean_d) / std_d
        if np.abs(z_score) > threshold:
            outliers.append(y)
    return outliers
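As a usage sketch, the same idea can be written in vectorized form and applied to a small sample. The function name `zscore_outliers`, the zero-variance guard, and the sample values are illustrative additions, not part of the original tutorial code.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    arr = np.asarray(values, dtype=float)
    std = arr.std()  # population standard deviation (ddof=0), like np.std above
    if std == 0:
        # All values are identical, so nothing can be an outlier.
        return arr[:0]
    z_scores = (arr - arr.mean()) / std
    return arr[np.abs(z_scores) > threshold]

# Hypothetical sample: 19 ordinary readings and one extreme value.
sample = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
          12, 10, 11, 13, 12, 11, 10, 12, 11, 100]
found = zscore_outliers(sample)
print(found.tolist())
```

With the population standard deviation used here, no value in a sample of n points can have an absolute Z-score above sqrt(n - 1), so a threshold of 3 cannot flag anything in samples of 10 or fewer points. For small samples, a lower threshold or the IQR method may be a better fit.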
(3) IQR outlier detection
The interquartile range (IQR) is the difference between the 75th percentile and the 25th percentile, that is, the difference between the upper quartile and the lower quartile.

The IQR is a measure of statistical dispersion, and the degree of dispersion can be observed with a box plot. Usually, data points smaller than Q1 - 1.5*IQR or larger than Q3 + 1.5*IQR are treated as outliers.
A box plot shows the following important characteristics of a dataset at a glance:
- Center: the median marks the center of the dataset. Looking up or down from the center shows which way the data leans.
- Dispersion: the box plot is divided into several intervals. A short interval indicates that the points falling in it are concentrated.
- Symmetry: if the median is in the middle of the box, the distribution is fairly symmetric. If the extreme values are far from the median, the distribution is skewed.
- Outliers: outliers lie outside the upper and lower whiskers of the box plot.
The Python implementation is as follows, where the parameter sr is a variable of type Series:
def detect_outliers(sr):
    q1 = sr.quantile(0.25)
    q3 = sr.quantile(0.75)
    iqr = q3 - q1  # Interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    outliers = sr.loc[(sr < fence_low) | (sr > fence_high)]
    return outliers
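A worked usage sketch of the IQR rule on a small Series is shown below. The helper name `iqr_fences` and the sample values are illustrative. The quartiles rely on the default linear interpolation of `Series.quantile`.

```python
import pandas as pd

def iqr_fences(sr):
    """Return (fence_low, fence_high) using the 1.5 * IQR rule."""
    q1 = sr.quantile(0.25)
    q3 = sr.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical sample with one extreme value.
sr = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 100])
fence_low, fence_high = iqr_fences(sr)
iqr_found = sr[(sr < fence_low) | (sr > fence_high)]

# Q1 = 11, Q3 = 12.75, IQR = 1.75 for this sample.
print(fence_low, fence_high)   # 8.375 15.375
print(iqr_found.tolist())      # [100]
```

Unlike the Z-score, the fences are built from quartiles, so a single extreme value barely moves them. That is why the rule still works on a sample of only ten points.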
2.2 Outlier handling
How to handle outliers during data processing depends on the situation. Sometimes an outlier is a legitimate value that simply happens to be unusually large or small. In many cases, therefore, first analyze the possible causes of the outliers and then decide how to handle them. Common methods are:
- Delete the records that contain outliers.
- Imputation: treat the outliers as missing values and handle them with missing-value methods. The advantage is that the outliers are replaced using existing data.
- Leave them as they are and run the analysis directly on the dataset that contains the outliers.
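The three options above can be sketched with pandas as follows. Detecting outliers with the 1.5 * IQR rule and imputing with the median are assumptions made for illustration; the text does not prescribe a particular detector or replacement value.

```python
import pandas as pd

# Hypothetical sample with one extreme value.
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 100])

# Flag outliers with the 1.5 * IQR rule (one possible detector).
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Option 1: delete the records that contain outliers.
dropped = values[~is_outlier]

# Option 2: treat outliers as missing, then impute (here: the median
# of the remaining values, an illustrative choice).
as_missing = values.mask(is_outlier)
imputed = as_missing.fillna(as_missing.median())

# Option 3: keep the data unchanged and analyze it as is.
untouched = values.copy()

print(len(dropped), imputed.iloc[-1], untouched.iloc[-1])
```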
3. Handling missing values
Not all data is complete, and some observations may be missing. The usual treatments for missing values are to delete the rows that contain them, fill them in, or impute them.
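A minimal pandas sketch of these treatments is shown below. The DataFrame, its column names, and the choice of mean, mode, and linear interpolation as replacement strategies are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical records with one missing age and one missing city.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["A", "B", None, "A"],
})

# Delete: drop every row that contains at least one missing value.
deleted = df.dropna()

# Fill: replace missing values with a fixed statistic per column
# (mean for the numeric column, most frequent value for the text column).
filled = df.fillna({"age": df["age"].mean(), "city": df["city"].mode()[0]})

# Impute: estimate the missing age from its neighbours by linear interpolation.
interpolated_age = df["age"].interpolate()

print(deleted.index.tolist())
print(filled["age"].tolist(), filled["city"].tolist())
print(interpolated_age.tolist())
```

Deleting rows is the simplest option but discards the valid fields in those rows, so filling or imputing is often preferable when missing values are common.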

Data and code download
The code for this tutorial series can be downloaded from ShowMeAI's corresponding GitHub repository and run in a local Python environment. Readers with access to Google can also use Google Colab to run it with one click and learn interactively.
