
Data preprocessing in data mining

2022-07-26 15:17:00 Caaaaaan

Data quality

Widely accepted measures of data quality:

  • Accuracy
  • Completeness (are values missing?)
  • Consistency
  • Timeliness (is the data out of date?)
  • Believability (is the data source trustworthy?)
  • Interpretability

Data preprocessing

The purpose of data preprocessing is to improve data quality.

Main tasks

  • Data cleaning
    • Fill in missing values
    • Smooth noisy data
    • Identify or remove outliers
    • Resolve inconsistencies
  • Data integration
    • Combine multiple databases
    • Combine data cubes or files
  • Data reduction
    • Dimensionality reduction
    • Numerosity reduction
    • Data compression
  • Data transformation and discretization
    • Normalization
    • Discretization

Data cleaning

Handling missing values

  • Ignore the tuple (i.e., delete the whole record)

    This is usually done when the class label is missing (the training set for supervised machine learning lacks class labels).

    • The class label is the attribute to be predicted; here the final prediction target is missing from the training set.

    When the proportion of missing values in an attribute (i.e., a field) is large, this approach works poorly:

    • it can make the data set too small;
    • in that case, consider deleting the attribute itself instead.

  • Fill in values by hand: heavy workload.

  • Fill in values automatically: use the mean of the attribute (commonly used), as in the snippet below.

# df_values is the data set (a pandas DataFrame) and miss_data is assumed to be a
# per-attribute count of missing values, both created earlier (not shown here).
# Drop attributes with more than 200 missing values, then fill the remaining
# numeric attributes with their column means.
df_values = df_values.drop(miss_data[miss_data['total'] > 200].index, axis=1)
df_values['pres'] = df_values['pres'].fillna(df_values['pres'].mean())
df_values['mass'] = df_values['mass'].fillna(df_values['mass'].mean())
df_values['plas'] = df_values['plas'].fillna(df_values['plas'].mean())

Handling noisy data

  • Use a box plot to detect outliers, then delete them (see the sketch below).

    When there are many outliers, this also shrinks the data set.

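A minimal sketch of the box-plot (1.5 * IQR) rule, assuming the hypothetical df_values DataFrame and the 'mass' column from the snippet above; values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers and dropped.

# Hypothetical: apply the 1.5 * IQR rule (what a box plot visualizes) to one column.
q1, q3 = df_values['mass'].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = df_values['mass'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_values = df_values[in_range]   # keep only the non-outlier rows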

Handling inconsistent data

  • Resolve by computation or inference, then replace the value
  • Global replacement

Data integration

Data integration: combining data from multiple data sources into one coherent data store.


Schema integration

  • That is, when two data sets use different field names for the same thing, the schemas must be aligned before integration (a small sketch follows).

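A minimal pandas sketch of schema alignment, using hypothetical frames df_a and df_b in which the same key field is named 'cust_id' in one source and 'customer_id' in the other.

import pandas as pd

# Hypothetical sources: df_a calls the key 'cust_id', df_b calls it 'customer_id'.
df_a = df_a.rename(columns={'cust_id': 'customer_id'})        # align the field names
merged = pd.merge(df_a, df_b, on='customer_id', how='outer')  # integrate the two sources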

Entity recognition

  • One data set may record a name in Chinese while another records the same name in English,
  • yet both refer to the same person (the same real-world entity). In this situation the entities must be identified and matched before the data are integrated.


Data conflict detection and resolution

  • For the same real-world entity, attribute values coming from different sources may conflict.
  • Likely causes: different representations or different scales (e.g., metric vs. imperial units).

For example, the recorded height of the same entity may differ in value simply because different units were used.

Handling redundant information

For example: one data set records 3000 m race results and another records 5000 m results; after integration, both can be combined into a single measure of running ability.

  • The same attribute or object may be named differently in different databases.
  • An attribute may be "derivable" from attributes in another table, as in the running-ability example above.
  • Redundant attributes can be detected through correlation analysis and covariance analysis.
  • Carefully integrating data from multiple sources helps reduce or avoid redundancy and inconsistency, and improves the speed and quality of subsequent analysis.


Correlation analysis —— Discrete variables

Chi-square ($\chi^2$) test:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

  • The larger the $\chi^2$ value, the more likely it is that the two variables are correlated.
  • Correlation does not imply causation.

[Figure: contingency table of "plays chess" vs. "likes science fiction", with observed counts and expected counts in parentheses]

  • The first number in each cell is the observed count, e.g., the number of people who both play chess and like science fiction.

  • The value in parentheses is the expected count.

  • Each expected count is computed as (row total * column total) / grand total.

    For example, 450 * 300 / 1500 = 90.

  • Once the observed and expected counts are available, the corresponding chi-square statistic can be computed (see the sketch below).

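A minimal sketch of the chi-square computation with SciPy. Only the totals 450, 300, and 1500 (and the expected count 90) come from the example above; the individual cell counts below are made up for illustration.

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table: rows = likes science fiction / does not,
# columns = plays chess / does not. Row totals 450 and 1050, column totals 300 and 1200.
observed = np.array([[250, 200],
                     [50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(expected[0, 0])   # 90.0 == 450 * 300 / 1500
print(chi2, p_value)    # a large chi-square suggests the two attributes are correlated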

Correlation analysis —— Continuous variable

For continuous variables there is no way to count observed and expected frequencies, so the chi-square test does not apply.

  • Use the correlation coefficient, i.e., the Pearson correlation coefficient.
  • You can compute the correlation coefficient matrix with corr() and then visualize it with a heat map (a short sketch follows the notes below).

Pearson correlation coefficient:

$$r_{p,q} = \frac{\sum (p - \overline{p})(q - \overline{q})}{(n-1)\,\sigma_p \sigma_q} = \frac{\sum (pq) - n\,\overline{p}\,\overline{q}}{(n-1)\,\sigma_p \sigma_q}$$

  • where $n$ is the number of tuples, $p$ and $q$ are the values of the respective attributes, and $\sigma_p$ and $\sigma_q$ are their standard deviations.

  • When $r > 0$ the two variables are positively correlated; when $r < 0$ they are negatively correlated.

  • When $|r| = 1$ the two variables are perfectly linearly correlated, i.e., they have an exact functional relationship.

  • When $r = 0$ there is no linear correlation between the two variables.

  • When $0 < |r| < 1$ there is some degree of linear correlation.

    • The closer $|r|$ is to 1, the stronger the linear relationship between the two variables;
    • the closer $|r|$ is to 0, the weaker the linear correlation.
  • A common rule of thumb distinguishes three levels:

    • $|r| < 0.4$: low linear correlation
    • $0.4 \le |r| < 0.7$: significant correlation
    • $0.7 \le |r| < 1$: high linear correlation
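A minimal sketch of the corr() plus heat-map approach mentioned above, assuming the hypothetical df_values DataFrame from the data-cleaning example and the seaborn plotting library.

import seaborn as sns
import matplotlib.pyplot as plt

# Pearson correlation matrix of all numeric attributes, drawn as a heat map.
corr_matrix = df_values.corr(method='pearson')
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation coefficient matrix')
plt.show()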

Covariance

  • Covariance can also be used to measure how two sets of data vary together (a small numeric sketch follows the notes below).

Relationship between covariance and the correlation coefficient:

$$r_{p,q} = \frac{Cov(p,q)}{\sigma_p \sigma_q}$$

The covariance formula:

$$Cov(p,q) = E\big((p - \overline{p})(q - \overline{q})\big) = \frac{\sum_{i=1}^{n}(p_i - \overline{p})(q_i - \overline{q})}{n}$$

which can be simplified to:

$$Cov(A,B) = E(A \cdot B) - \overline{A}\,\overline{B}$$

  • where $n$ is the number of tuples, $p$ and $q$ are the values of the respective attributes, and $\sigma_p$ and $\sigma_q$ are their standard deviations.

  • Positive correlation: $Cov(p,q) > 0$

  • Negative correlation: $Cov(p,q) < 0$

  • Independence: $Cov(p,q) = 0$

  • Some pairs of random variables have covariance 0 and yet are not independent.

  • Additional assumptions are needed; for example, if the data follow a multivariate normal distribution, then a covariance of 0 does imply independence.

Note:

  • Independence $\Rightarrow Cov(p,q) = 0$

  • $Cov(p,q) = 0 \nRightarrow$ independence
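A minimal NumPy sketch of the covariance formulas above, using two small made-up attribute vectors p and q.

import numpy as np

p = np.array([2.0, 4.0, 6.0, 8.0])
q = np.array([1.0, 3.0, 5.0, 11.0])

cov_pq = (p * q).mean() - p.mean() * q.mean()   # Cov(p, q) = E(pq) - mean(p) * mean(q)
r_pq = cov_pq / (p.std() * q.std())             # r = Cov(p, q) / (sigma_p * sigma_q)

print(cov_pq, r_pq)
print(np.corrcoef(p, q)[0, 1])                  # the same r, computed by NumPy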

Data reduction

  • Because a data warehouse may store terabytes of data, complex data analysis run on the complete data set can take a very long time.

Dimensionality reduction

Dimensionality reduction uses suitable methods to transform high-dimensional data into a lower-dimensional representation.

For example: given a data set of exam scores with six subjects as attributes (Chinese, mathematics, English, physics, chemistry, and biology), we can reduce the attributes to two dimensions: a liberal-arts score and a science score.

  • Reasons:

    • As the dimensionality increases, the data become increasingly sparse.

      • For example, in a medical case data set, adding more dimensions lets a flood of normal values swamp the disease records we actually care about.
    • The number of possible subspace combinations grows exponentially.

      • In rule-based classification, the number of rules that must be built multiplies;
      • the higher the dimensionality, the more complex the feature rules may become.
    • Machine-learning methods such as neural networks mainly need to learn a weight parameter for each feature. The more features there are, the more parameters must be learned and the more complex the model becomes:

      $$\widehat{y} = \mathrm{sign}(\omega_1 x_1 + \omega_2 x_2 + \dots + \omega_d x_d - t)$$

    • A training-set rule of thumb in machine learning: the more complex the model, the more training data is needed to learn its parameters; otherwise the model will underfit.

    • Therefore, when a data set has high dimensionality but only a small number of training samples, dimensionality reduction is preferred before using a complex machine-learning model.

    • Visualization is another reason:

      • the higher the dimensionality, the harder the data are to visualize.

Dimensionality reduction method: PCA (principal component analysis)

  • The core idea of PCA (principal component analysis):
    • Many attributes in the data may be correlated with one another in some way.
    • Can we find a way to combine several correlated attributes into a single attribute?


  • Main content of principal component analysis:
    • Try to regroup the original, numerous, somewhat correlated attributes into a set of uncorrelated composite attributes that replace the original ones.
    • Mathematically, this is usually done by taking linear combinations of the original p attributes as the new composite attributes, i.e., a linear weighted combination.

Definition: let $x_1, x_2, \dots, x_p$ be the original variables and $z_1, z_2, \dots, z_m$ ($m \le p$) the new composite attributes:

$$\begin{cases} z_1 = l_{11}x_1 + l_{12}x_2 + \dots + l_{1p}x_p \\ z_2 = l_{21}x_1 + l_{22}x_2 + \dots + l_{2p}x_p \\ \quad\vdots \\ z_m = l_{m1}x_1 + l_{m2}x_2 + \dots + l_{mp}x_p \end{cases}$$
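A minimal scikit-learn sketch of PCA, assuming a hypothetical numeric feature matrix X with p attributes (for instance, the six subject scores from the example above); the m new attributes z are exactly linear combinations of the original ones, with weights given by pca.components_.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix X: n samples by p correlated attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to attribute scales
pca = PCA(n_components=2)                      # keep m = 2 composite attributes
Z = pca.fit_transform(X_scaled)                # the new attributes z_1 and z_2
print(pca.components_)                         # the weights l_ij of the linear combinations
print(pca.explained_variance_ratio_)           # variance captured by each component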

Numerosity reduction

The data set may be so large that the computer runs out of memory; secondly, we often do not plan to use all of the data for training anyway.

  • Simple random sampling (see the sketch below)
    • Each object is chosen with equal probability.
    • Sampling without replacement:
      • once an object is selected, it is removed from the pool.
    • Sampling with replacement:
      • selected objects are not removed and may be drawn again.
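A minimal pandas sketch of simple random sampling, again assuming the hypothetical df_values DataFrame; random_state is fixed only to make the draw reproducible.

# Draw 100 rows by simple random sampling.
sample_wo = df_values.sample(n=100, replace=False, random_state=0)   # without replacement
sample_w = df_values.sample(n=100, replace=True, random_state=0)     # with replacement (duplicates possible)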

The impact of sample size on data quality

[Figure: the impact of sample size on data quality]

Data compression

[Figure: data compression]

Data transformation

  • Function mapping: each given attribute value is replaced by a new representation, so that every old value can be identified with a new value.

Normalization

Main idea: scale the data to a specific interval.

Reasons:

  • For example, with college entrance exam scores, Guangdong Province has its own evaluation standard and Beijing has Beijing's standard.
  • In a data set this shows up as attributes whose ranges of values differ enormously from one another.

Min-max normalization

Definition:

$$v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$

where $v$ is the value to be normalized.

$new\_max_A$ and $new\_min_A$ depend on the target interval. If the goal is normalization to $[0, 1]$, then the new maximum is 1 and the new minimum is 0 (see the sketch below).
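A minimal sketch of min-max normalization with scikit-learn, assuming a hypothetical numeric feature matrix X; feature_range sets new_min and new_max.

from sklearn.preprocessing import MinMaxScaler

# Rescale every attribute to [new_min, new_max] = [0, 1].
scaler = MinMaxScaler(feature_range=(0, 1))
X_minmax = scaler.fit_transform(X)   # X: hypothetical numeric feature matrix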

Z-score normalization

Definition:

$$v' = \frac{v - \overline{A}}{\sigma_A}$$

where $v$ is the original value to be normalized, $\overline{A}$ is the mean of attribute $A$, and $\sigma_A$ its standard deviation.

If the data set is streaming data (i.e., new records keep arriving) and we assume the distribution of the stream does not change, we can sample part of the stream and compute the mean and standard deviation from that sample. In this situation, Z-score normalization is the more reasonable choice, because the global minimum and maximum required by min-max normalization are not known in advance (see the sketch below).
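A minimal Z-score sketch, assuming the same hypothetical matrix X plus a later batch X_new from the stream; the scaler is fitted on the sampled data and then reused.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_z = scaler.fit_transform(X)      # every attribute now has mean 0 and standard deviation 1

# For streaming data: keep the fitted mean/std and apply them to new batches.
X_new_z = scaler.transform(X_new)  # X_new: hypothetical newly arrived data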

Decimal scaling

  • Move the decimal point of attribute A (the number of places moved depends on the maximum absolute value of A).

$$v' = \frac{v}{10^j}$$

where $j$ is the smallest integer such that $\max(|v'|) < 1$.

For example, if the minimum value in the data is 12,000 and the maximum is 98,000, then j = 5 (see the sketch below).
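A minimal NumPy sketch of decimal scaling; decimal_scaling is a hypothetical helper, and the values 12000 and 98000 are the ones from the example above.

import numpy as np

def decimal_scaling(v):
    # j is the smallest integer such that max(|v|) / 10**j < 1
    j = int(np.floor(np.log10(np.abs(v).max()))) + 1
    return v / 10 ** j, j

values = np.array([12000.0, 98000.0])
scaled, j = decimal_scaling(values)
print(j, scaled)   # 5 [0.12 0.98]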

Discretization

Discretization converts numerical data into discrete categories.

For example, age can be turned into groups such as "young" and "middle-aged".


Unsupervised discretization: the equal-width method

  • The value range is divided so that each interval has the same width.
  • That is, the interval boundaries are computed from the attribute's maximum and minimum values so that all bins are equally wide.


Unsupervised discretization: the equal-frequency method

  • The values are divided according to their frequency of occurrence: the attribute's value range is split into several intervals such that roughly the same number of samples falls into each interval (a sketch of both methods follows).

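A minimal pandas sketch of both unsupervised methods, using a small made-up series of ages.

import pandas as pd

ages = pd.Series([18, 22, 25, 31, 40, 45, 52, 60, 67, 73])

equal_width = pd.cut(ages, bins=3)   # equal-width: three intervals of the same length
equal_freq = pd.qcut(ages, q=3)      # equal-frequency: roughly the same number of samples per bin

print(equal_width.value_counts())
print(equal_freq.value_counts())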

Clustering

  • Clustering can also be used to divide the values into discrete categories, with each cluster becoming one category (see the sketch below).
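A minimal sketch of clustering-based discretization with scikit-learn's KMeans, reusing the hypothetical ages from the previous sketch; each cluster label becomes one discrete category.

import numpy as np
from sklearn.cluster import KMeans

ages = np.array([18, 22, 25, 31, 40, 45, 52, 60, 67, 73]).reshape(-1, 1)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(ages)
print(labels)   # each age is assigned to one of three discrete categories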

Copyright notice

This article was created by [Caaaaaan]. Please include a link to the original when reposting:
https://yzsam.com/2022/207/202207261446180706.html