Data preprocessing in data mining
2022-07-26 15:17:00 【Caaaaaan】
Data quality
Widely accepted measures of data quality:
- Accuracy
- Completeness (are there missing values?)
- Consistency
- Timeliness (is the data out of date?)
- Believability (is the data source trusted?)
- Interpretability
Data preprocessing
The purpose of data preprocessing is to improve data quality.
Main tasks
- Data cleaning
  - Fill in missing values
  - Smooth noisy data
  - Identify or delete outliers
  - Resolve inconsistencies
- Data integration
  - Consolidate multiple databases, cubes, or files
- Data reduction
  - Dimensionality reduction
  - Numerosity reduction
  - Data compression
- Data transformation and discretization
  - Normalization
  - Discretization
Data cleaning
Handling missing values
Ignore the tuple (i.e., delete the whole record)
This is usually done when the class label is missing (a training example in supervised learning lacks its class label).
- The class label is the target to be predicted; here it means the final prediction target is missing from the training set.
When the proportion of missing values in the attributes (i.e., fields) is relatively large, this method works poorly:
- in that case deleting tuples would make the data set too small;
- consider dropping the affected attribute (column) instead.
Fill in manually: heavy workload.
Fill in automatically: use the attribute's mean value (commonly used).
# drop every attribute whose missing-value count (column 'total' of miss_data) exceeds 200
df_values = df_values.drop((miss_data[miss_data['total'] > 200]).index, axis=1)
# fill the remaining missing values in each attribute with that attribute's mean
df_values['pres'].fillna(df_values['pres'].mean(), inplace=True)
df_values['mass'].fillna(df_values['mass'].mean(), inplace=True)
df_values['plas'].fillna(df_values['plas'].mean(), inplace=True)
Handling noisy data
A box plot can detect outliers: delete the outliers.
When there are many outliers, deleting them also shrinks the data set.
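Below is a minimal sketch of the box-plot rule, assuming the df_values DataFrame and the column name 'pres' from the earlier fill-in example (both are assumptions, not part of the original text):

```python
# Flag and drop outliers with the usual 1.5 * IQR whisker rule (box-plot rule);
# df_values and the column 'pres' are carried over from the example above.
q1 = df_values['pres'].quantile(0.25)
q3 = df_values['pres'].quantile(0.75)
iqr = q3 - q1
inside = df_values['pres'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_values = df_values[inside]   # keep only the rows inside the whiskers
```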

Handling inconsistent data
- Infer the correct value computationally and replace the inconsistent one
- Global replacement
Data integration
Data integration: combine data from multiple data sources into a coherent data store.

Schema integration
- That is, when two data sets use different field names for fields that mean the same thing, integrate them.

Entity identification
- One data set may record a name in Chinese while another records it in English,
- yet both refer to the same person (the same entity); in that situation we must first identify the entity and then integrate.

Data conflict detection and resolution
- For the same real-world entity, attribute values from different sources may differ.
- Possible causes: different representations, different scales (e.g., metric vs. imperial units).
As shown in the figure above, the same height entity is recorded with different values because the units differ.
Handling redundant information
For example: one data set records 3000 m race results and another records 5000 m results; they can be integrated into a single measure of running ability.
- The same attribute or object may have different names in different databases.
- An attribute may be a "derived" attribute computed from another table, e.g., running ability.
- Redundant attributes can be detected with correlation analysis and covariance analysis.
- Careful integration of data from multiple sources can help reduce or avoid redundancy and inconsistency and improve mining speed and quality.

Correlation analysis: discrete variables
Chi-square ($\chi^2$) test:
$$\chi^2=\sum\frac{(Observed-Expected)^2}{Expected}$$
- The larger the $\chi^2$ value, the more likely the two variables are correlated.
- Correlation does not imply causation.

In the example contingency table, the first number in each cell is the observed count, e.g., the number of people who both play chess and like science fiction.
The value in parentheses is the expected count.
The expected count is computed as the corresponding row total * the corresponding column total / the grand total,
e.g., 450 * 300 / 1500 = 90.
Once the observed and expected counts are known, the corresponding chi-square statistic can be computed.
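A minimal sketch of that computation, assuming a textbook-style 2x2 contingency table whose counts (rows: play chess / do not; columns: like science fiction / do not) are illustrative assumptions:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative observed counts; row totals 450 and 1050, column totals 300 and
# 1200, grand total 1500, so the expected count for the first cell is
# 450 * 300 / 1500 = 90.
observed = np.array([[250, 200],
                     [50, 1000]])
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, p_value)
print(expected)
```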

Correlation analysis: continuous variables
For continuous variables we cannot tabulate observed and expected counts, so instead we use:
- the correlation coefficient (Pearson correlation coefficient);
- in pandas, corr() gives the correlation-coefficient matrix, which can then be shown as a heat map (see the sketch at the end of this subsection).
Pearson correlation coefficient:
$$r_{p,q}=\frac{\sum(p-\overline{p})(q-\overline{q})}{(n-1)\sigma_p\sigma_q}=\frac{\sum(pq)-n\,\overline{p}\,\overline{q}}{(n-1)\sigma_p\sigma_q}$$
where $n$ is the number of tuples, $p$ and $q$ are the values of the respective attributes, and $\sigma_p$ and $\sigma_q$ are their standard deviations.
When r > 0, the two variables are positively correlated; when r < 0, they are negatively correlated.
When |r| = 1, the two variables are perfectly linearly correlated (a functional relationship).
When r = 0, there is no linear correlation between the two variables.
When 0 < |r| < 1, there is some degree of linear correlation:
- the closer |r| is to 1, the stronger the linear relationship between the two variables;
- the closer |r| is to 0, the weaker the linear correlation.
Generally, correlation can be divided into three levels:
- |r| < 0.4: low linear correlation
- 0.4 <= |r| < 0.7: significant correlation
- 0.7 <= |r| < 1: high linear correlation
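As mentioned above, a minimal sketch of the corr()-plus-heat-map approach, assuming the df_values DataFrame from the earlier examples contains only numeric attributes:

```python
import matplotlib.pyplot as plt
import seaborn as sns

corr_matrix = df_values.corr(method='pearson')   # pairwise Pearson r
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.show()
```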
Covariance
- Covariance is also used to indicate the correlation between two sets of data.
Relation between covariance and the correlation coefficient:
$$r_{p,q}=\frac{Cov(p,q)}{\sigma_p\sigma_q}$$
Covariance formula:
$$Cov(p,q)=E\big((p-\overline{p})(q-\overline{q})\big)=\frac{\sum_{i=1}^n(p_i-\overline{p})(q_i-\overline{q})}{n}$$
which can be simplified to:
$$Cov(A,B)=E(A\cdot B)-\overline{A}\,\overline{B}$$
where $n$ is the number of tuples, $p$ and $q$ are the values of the respective attributes, and $\sigma_p$ and $\sigma_q$ are their standard deviations.
Positive correlation: $Cov(p,q)>0$
Negative correlation: $Cov(p,q)<0$
Independence: $Cov(p,q)=0$
Some pairs of random variables have covariance 0 yet are not independent.
Additional assumptions are needed; for example, if the data follow a multivariate normal distribution, then zero covariance does imply independence.
Note:
independence $\Rightarrow Cov(p,q)=0$
$Cov(p,q)=0\nRightarrow$ independence
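A minimal numerical check of the relation $r_{p,q}=Cov(p,q)/(\sigma_p\sigma_q)$, using two small illustrative arrays (the values are assumptions):

```python
import numpy as np

p = np.array([2.0, 4.0, 6.0, 8.0])
q = np.array([1.0, 3.0, 2.0, 5.0])

cov_pq = np.cov(p, q, ddof=1)[0, 1]                        # sample covariance
r_manual = cov_pq / (np.std(p, ddof=1) * np.std(q, ddof=1))
r_numpy = np.corrcoef(p, q)[0, 1]
print(r_manual, r_numpy)                                   # the two values agree
```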
Data reduction
- A data warehouse can store terabytes of data, so complex data analysis on the complete data set may take a very long time.
Dimensionality reduction
Combine high-dimensional data, using suitable methods to turn high-dimensional data into low-dimensional data.
For example: given a data set of exam scores with 6 subject attributes (Chinese, math, English, physics, chemistry, biology), we can reduce them to two dimensions: a liberal-arts score and a science score.
Reasons:
As the dimensionality increases, the data become more and more sparse.
- For example, in the case-study data set, adding dimensions floods the data with normal values, drowning out the disease records we care about.
The number of possible subspace combinations grows exponentially.
- For rule-based classification, the number of rules to build multiplies.
- The higher the dimensionality, the more complex the feature rules may become.
Machine learning methods such as neural networks mainly need to **learn the weight parameters of each feature**: the more features there are, the more parameters must be learned and the more complex the model becomes.
$$\widehat{y}=\operatorname{sign}(\omega_1x_1+\omega_2x_2+...+\omega_dx_d-t)$$
A machine-learning principle for training sets: the more complex the model, the more training data is needed to learn its parameters; otherwise the model will underfit.
Therefore, when the data set is high-dimensional but the training set is small, dimensionality reduction is preferred before applying a complex machine learning model.
Visualization needs
- The higher the dimensionality, the harder the data are to visualize.
Dimensionality reduction method: PCA (principal component analysis)
- The core idea of PCA
  - Many attributes in the data may be correlated with one another in some way.
  - Can we find a way to combine several correlated attributes into a single attribute?

- Main content of principal component analysis
  - Regroup the original, numerous, partially correlated attributes into a set of uncorrelated composite attributes that replace the original ones.
  - The usual mathematical treatment takes linear combinations of the original p attributes as the new composite attributes, i.e., a linearly weighted combination.
Definition: let $x_1,x_2,...,x_p$ be the original variables (indicators) and $z_1,z_2,...,z_m\ (m\leq p)$ the new composite variables:
$$\begin{cases} z_1=l_{11}x_1+l_{12}x_2+...+l_{1p}x_p\\ z_2=l_{21}x_1+l_{22}x_2+...+l_{2p}x_p\\ \vdots\\ z_m=l_{m1}x_1+l_{m2}x_2+...+l_{mp}x_p \end{cases}$$
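A minimal sketch of PCA with scikit-learn, assuming df_values from the earlier examples holds only numeric attributes (standardizing first is common practice, not part of the definition above):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(df_values)   # zero mean, unit variance per attribute
pca = PCA(n_components=2)                       # keep 2 principal components
Z = pca.fit_transform(X)                        # the new composite attributes z_1, z_2
print(pca.explained_variance_ratio_)            # variance explained by each component
```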
Numerosity reduction
The data may be so large that the computer runs out of memory;
or we may simply not plan to use all of the data for training.
- Simple random sampling
  - Each object is selected with equal probability.
  - Sampling without replacement
    - Once an object is selected, it is removed from the pool.
  - Sampling with replacement
    - Selected objects are not removed, so they may be drawn again.
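A minimal sketch of simple random sampling with pandas, again assuming the df_values DataFrame; frac sets the sample size and replace switches between sampling with and without replacement:

```python
# 10% simple random sample without replacement
sample_wor = df_values.sample(frac=0.1, replace=False, random_state=42)
# 10% simple random sample with replacement (objects may be drawn more than once)
sample_wr = df_values.sample(frac=0.1, replace=True, random_state=42)
```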
The impact of sample size on data quality

Data compression

Data transformation
- Function mapping: each given attribute value is replaced by a new representation, and each old value can be identified with its new value.
Normalization
Main idea: scale the data set into a specific interval.
Reasons:
- For example, with college entrance exam scores, Guangdong Province grades by Guangdong's standard while Beijing grades by Beijing's.
- In a data set this shows up as attributes whose value ranges differ enormously.
Min-max normalization
Definition:
$$v'=\frac{v-min_A}{max_A-min_A}(new\_max_A-new\_min_A)+new\_min_A$$
where $v$ is the value to be normalized.
$new\_max_A$ and $new\_min_A$ depend on how you want to rescale; to normalize the data into the interval [0, 1], the new maximum is 1 and the new minimum is 0.
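A minimal sketch, assuming df_values and the column 'pres' from the earlier examples, scaling into [0, 1] (so new_max = 1 and new_min = 0):

```python
col = df_values['pres']
df_values['pres_scaled'] = (col - col.min()) / (col.max() - col.min())
```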
Z-score normalization
Definition:
$$v'=\frac{v-\overline{A}}{\sigma_A}$$
where $v$ is the original value to be normalized, $\overline{A}$ is the mean of attribute A, and $\sigma_A$ is its standard deviation.
If the data set is streaming data (new data keeps arriving) and we assume the distribution of the stream is stationary,
we can sample part of the stream and compute its mean and standard deviation.
In that situation, z-score normalization is the more reasonable choice.
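A minimal sketch, assuming df_values and the column 'mass' from the earlier examples:

```python
col = df_values['mass']
df_values['mass_zscore'] = (col - col.mean()) / col.std(ddof=0)   # population std
```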
Decimal scaling
- Move the decimal point of attribute A (the number of places moved depends on the maximum absolute value of A).
$$v'=\frac{v}{10^j}$$
where $j$ is the smallest integer such that $Max(|v'|)<1$.
For example, if the minimum value in the data is 12000 and the maximum is 98000, then j = 5 (98000 / 10^5 = 0.98 < 1).
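A minimal sketch, assuming df_values and the column 'plas' from the earlier examples; j is chosen so that the largest scaled absolute value is below 1:

```python
import numpy as np

max_abs = df_values['plas'].abs().max()
j = int(np.floor(np.log10(max_abs))) + 1        # smallest j with max(|v'|) < 1
df_values['plas_scaled'] = df_values['plas'] / (10 ** j)
```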
Discretization
Discretize numerical data.
e.g., age is turned into categories such as young and middle-aged.

Unsupervised discretization: equal-width method
- Partition based on the value range so that every interval has the same width.
- That is, divide the range between the attribute's minimum and maximum into equal-width bins (see the sketch below).
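A minimal sketch with pandas, assuming df_values has a hypothetical numeric 'age' column; cut() makes three bins of equal width between min(age) and max(age):

```python
import pandas as pd

df_values['age_bin'] = pd.cut(df_values['age'], bins=3,
                              labels=['young', 'middle-aged', 'old'])
```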

Unsupervised discretization: equal-frequency method
- Partition by the frequency of values: divide the attribute's value range into several intervals such that each interval contains roughly the same number of samples (see the sketch below).
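A minimal sketch with pandas, reusing the hypothetical 'age' column; qcut() puts roughly one third of the rows in each bin:

```python
import pandas as pd

df_values['age_qbin'] = pd.qcut(df_values['age'], q=3,
                                labels=['young', 'middle-aged', 'old'])
```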

Clustering
- Clustering can also be used to divide the data into discrete categories (see the sketch below).
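A minimal sketch of discretization by clustering, reusing the hypothetical 'age' column; k-means assigns each value to one of three discrete clusters:

```python
from sklearn.cluster import KMeans

ages = df_values[['age']].values
df_values['age_cluster'] = KMeans(n_clusters=3, n_init=10,
                                  random_state=42).fit_predict(ages)
```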