Data preprocessing in data mining
2022-07-26 15:17:00 【Caaaaaan】
Data quality
Widely accepted measures of data quality:
- Accuracy
- Completeness (are there missing values?)
- Consistency
- Timeliness (is the data out of date?)
- Believability (is the data source trusted?)
- Interpretability
Data preprocessing
The purpose of data preprocessing is to improve data quality.
Main tasks
- Data cleaning
  - Fill in missing values
  - Smooth noisy data
  - Identify or delete outliers
  - Resolve inconsistencies
- Data integration
  - Consolidate multiple databases, cubes, or files
- Data reduction
  - Dimensionality reduction
  - Numerosity reduction
  - Data compression
- Data transformation and discretization
  - Normalization
  - Discretization
Data cleaning
Handling missing values
Ignore the tuple (i.e., delete the whole record)
This is usually done when the class label is missing (a training example in supervised learning lacks its class label).
- The class label is the target to be predicted; here it means the final prediction target is missing from the training set.
When the proportion of missing values in the attributes (i.e., fields) is relatively large, this method works poorly:
- in that case deleting tuples would make the data set too small;
- consider dropping the affected attribute (column) instead.
Fill in manually: heavy workload.
Fill in automatically: use the attribute's mean value (commonly used).
# drop every attribute whose missing-value count (column 'total' of miss_data) exceeds 200
df_values = df_values.drop((miss_data[miss_data['total'] > 200]).index, axis=1)
# fill the remaining missing values in each attribute with that attribute's mean
df_values['pres'].fillna(df_values['pres'].mean(), inplace=True)
df_values['mass'].fillna(df_values['mass'].mean(), inplace=True)
df_values['plas'].fillna(df_values['plas'].mean(), inplace=True)
Handling noisy data
A box plot can detect outliers: delete the outliers.
When there are many outliers, deleting them also shrinks the data set.
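Below is a minimal sketch of the box-plot rule, assuming the df_values DataFrame and the column name 'pres' from the earlier fill-in example (both are assumptions, not part of the original text):

```python
# Flag and drop outliers with the usual 1.5 * IQR whisker rule (box-plot rule);
# df_values and the column 'pres' are carried over from the example above.
q1 = df_values['pres'].quantile(0.25)
q3 = df_values['pres'].quantile(0.75)
iqr = q3 - q1
inside = df_values['pres'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_values = df_values[inside]   # keep only the rows inside the whiskers
```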

Handling inconsistent data
- Infer the correct value computationally and replace the inconsistent one
- Global replacement
Data integration
Data integration: combine data from multiple data sources into a coherent data store.

Schema integration
- That is, when two data sets use different field names for fields that mean the same thing, integrate them.

Entity identification
- One data set may record a name in Chinese while another records it in English,
- yet both refer to the same person (the same entity); in that situation we must first identify the entity and then integrate.

Data conflict detection and resolution
- For the same real-world entity, attribute values from different sources may differ.
- Possible causes: different representations, different scales (e.g., metric vs. imperial units).
As shown in the figure above, the same height entity is recorded with different values because the units differ.
Handling redundant information
For example: one data set records 3000 m race results and another records 5000 m results; they can be integrated into a single measure of running ability.
- The same attribute or object may have different names in different databases.
- An attribute may be a "derived" attribute computed from another table, e.g., running ability.
- Redundant attributes can be detected with correlation analysis and covariance analysis.
- Careful integration of data from multiple sources can help reduce or avoid redundancy and inconsistency and improve mining speed and quality.

Correlation analysis: discrete variables
Chi-square ($\chi^2$) test:
$$\chi^2=\sum\frac{(Observed-Expected)^2}{Expected}$$
- The larger the $\chi^2$ value, the more likely the two variables are correlated.
- Correlation does not imply causation.

In the example contingency table, the first number in each cell is the observed count, e.g., the number of people who both play chess and like science fiction.
The value in parentheses is the expected count.
The expected count is computed as the corresponding row total * the corresponding column total / the grand total,
e.g., 450 * 300 / 1500 = 90.
Once the observed and expected counts are known, the corresponding chi-square statistic can be computed.
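A minimal sketch of that computation, assuming a textbook-style 2x2 contingency table whose counts (rows: play chess / do not; columns: like science fiction / do not) are illustrative assumptions:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative observed counts; row totals 450 and 1050, column totals 300 and
# 1200, grand total 1500, so the expected count for the first cell is
# 450 * 300 / 1500 = 90.
observed = np.array([[250, 200],
                     [50, 1000]])
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, p_value)
print(expected)
```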

Correlation analysis: continuous variables
For continuous variables we cannot tabulate observed and expected counts, so instead we use:
- the correlation coefficient (Pearson correlation coefficient);
- in pandas, corr() gives the correlation-coefficient matrix, which can then be shown as a heat map (see the sketch at the end of this subsection).
Pearson correlation coefficient:
$$r_{p,q}=\frac{\sum(p-\overline{p})(q-\overline{q})}{(n-1)\sigma_p\sigma_q}=\frac{\sum(pq)-n\,\overline{p}\,\overline{q}}{(n-1)\sigma_p\sigma_q}$$
where $n$ is the number of tuples, $p$ and $q$ are the values of the respective attributes, and $\sigma_p$ and $\sigma_q$ are their standard deviations.
When r > 0, the two variables are positively correlated; when r < 0, they are negatively correlated.
When |r| = 1, the two variables are perfectly linearly correlated (a functional relationship).
When r = 0, there is no linear correlation between the two variables.
When 0 < |r| < 1, there is some degree of linear correlation:
- the closer |r| is to 1, the stronger the linear relationship between the two variables;
- the closer |r| is to 0, the weaker the linear correlation.
Generally, correlation can be divided into three levels:
- |r| < 0.4: low linear correlation
- 0.4 <= |r| < 0.7: significant correlation
- 0.7 <= |r| < 1: high linear correlation
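As mentioned above, a minimal sketch of the corr()-plus-heat-map approach, assuming the df_values DataFrame from the earlier examples contains only numeric attributes:

```python
import matplotlib.pyplot as plt
import seaborn as sns

corr_matrix = df_values.corr(method='pearson')   # pairwise Pearson r
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.show()
```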
Covariance
- Covariance is also used to indicate the correlation between two sets of data.
Relation between covariance and the correlation coefficient:
$$r_{p,q}=\frac{Cov(p,q)}{\sigma_p\sigma_q}$$
Covariance formula:
$$Cov(p,q)=E\big((p-\overline{p})(q-\overline{q})\big)=\frac{\sum_{i=1}^n(p_i-\overline{p})(q_i-\overline{q})}{n}$$
which can be simplified to:
$$Cov(A,B)=E(A\cdot B)-\overline{A}\,\overline{B}$$
where $n$ is the number of tuples, $p$ and $q$ are the values of the respective attributes, and $\sigma_p$ and $\sigma_q$ are their standard deviations.
Positive correlation: $Cov(p,q)>0$
Negative correlation: $Cov(p,q)<0$
Independence: $Cov(p,q)=0$
Some pairs of random variables have covariance 0 yet are not independent.
Additional assumptions are needed; for example, if the data follow a multivariate normal distribution, then zero covariance does imply independence.
Note:
independence $\Rightarrow Cov(p,q)=0$
$Cov(p,q)=0\nRightarrow$ independence
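A minimal numerical check of the relation $r_{p,q}=Cov(p,q)/(\sigma_p\sigma_q)$, using two small illustrative arrays (the values are assumptions):

```python
import numpy as np

p = np.array([2.0, 4.0, 6.0, 8.0])
q = np.array([1.0, 3.0, 2.0, 5.0])

cov_pq = np.cov(p, q, ddof=1)[0, 1]                        # sample covariance
r_manual = cov_pq / (np.std(p, ddof=1) * np.std(q, ddof=1))
r_numpy = np.corrcoef(p, q)[0, 1]
print(r_manual, r_numpy)                                   # the two values agree
```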
Data reduction
- A data warehouse can store terabytes of data, so complex data analysis on the complete data set may take a very long time.
Dimensionality reduction
Combine high-dimensional data, using suitable methods to turn high-dimensional data into low-dimensional data.
For example: given a data set of exam scores with 6 subject attributes (Chinese, math, English, physics, chemistry, biology), we can reduce them to two dimensions: a liberal-arts score and a science score.
Reasons:
As the dimensionality increases, the data become more and more sparse.
- For example, in the case-study data set, adding dimensions floods the data with normal values, drowning out the disease records we care about.
The number of possible subspace combinations grows exponentially.
- For rule-based classification, the number of rules to build multiplies.
- The higher the dimensionality, the more complex the feature rules may become.
Machine learning methods such as neural networks mainly need to **learn the weight parameters of each feature**: the more features there are, the more parameters must be learned and the more complex the model becomes.
$$\widehat{y}=\operatorname{sign}(\omega_1x_1+\omega_2x_2+...+\omega_dx_d-t)$$
A machine-learning principle for training sets: the more complex the model, the more training data is needed to learn its parameters; otherwise the model will underfit.
Therefore, when the data set is high-dimensional but the training set is small, dimensionality reduction is preferred before applying a complex machine learning model.
Visualization needs
- The higher the dimensionality, the harder the data are to visualize.
Dimensionality reduction method: PCA (principal component analysis)
- The core idea of PCA
  - Many attributes in the data may be correlated with one another in some way.
  - Can we find a way to combine several correlated attributes into a single attribute?

- Main content of principal component analysis
  - Regroup the original, numerous, partially correlated attributes into a set of uncorrelated composite attributes that replace the original ones.
  - The usual mathematical treatment takes linear combinations of the original p attributes as the new composite attributes, i.e., a linearly weighted combination.
Definition: let $x_1,x_2,...,x_p$ be the original variables (indicators) and $z_1,z_2,...,z_m\ (m\leq p)$ the new composite variables:
$$\begin{cases} z_1=l_{11}x_1+l_{12}x_2+...+l_{1p}x_p\\ z_2=l_{21}x_1+l_{22}x_2+...+l_{2p}x_p\\ \vdots\\ z_m=l_{m1}x_1+l_{m2}x_2+...+l_{mp}x_p \end{cases}$$
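A minimal sketch of PCA with scikit-learn, assuming df_values from the earlier examples holds only numeric attributes (standardizing first is common practice, not part of the definition above):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(df_values)   # zero mean, unit variance per attribute
pca = PCA(n_components=2)                       # keep 2 principal components
Z = pca.fit_transform(X)                        # the new composite attributes z_1, z_2
print(pca.explained_variance_ratio_)            # variance explained by each component
```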
Numerosity reduction
The data may be so large that the computer runs out of memory;
or we may simply not plan to use all of the data for training.
- Simple random sampling
  - Each object is selected with equal probability.
  - Sampling without replacement
    - Once an object is selected, it is removed from the pool.
  - Sampling with replacement
    - Selected objects are not removed, so they may be drawn again.
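A minimal sketch of simple random sampling with pandas, again assuming the df_values DataFrame; frac sets the sample size and replace switches between sampling with and without replacement:

```python
# 10% simple random sample without replacement
sample_wor = df_values.sample(frac=0.1, replace=False, random_state=42)
# 10% simple random sample with replacement (objects may be drawn more than once)
sample_wr = df_values.sample(frac=0.1, replace=True, random_state=42)
```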
The impact of sample size on data quality

Data compression

Data transformation
- Function mapping: each given attribute value is replaced by a new representation, and each old value can be identified with its new value.
Normalization
Main idea: scale the data set into a specific interval.
Reasons:
- For example, with college entrance exam scores, Guangdong Province grades by Guangdong's standard while Beijing grades by Beijing's.
- In a data set this shows up as attributes whose value ranges differ enormously.
Min-max normalization
Definition:
$$v'=\frac{v-min_A}{max_A-min_A}(new\_max_A-new\_min_A)+new\_min_A$$
where $v$ is the value to be normalized.
$new\_max_A$ and $new\_min_A$ depend on how you want to rescale; to normalize the data into the interval [0, 1], the new maximum is 1 and the new minimum is 0.
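A minimal sketch, assuming df_values and the column 'pres' from the earlier examples, scaling into [0, 1] (so new_max = 1 and new_min = 0):

```python
col = df_values['pres']
df_values['pres_scaled'] = (col - col.min()) / (col.max() - col.min())
```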
Z-score normalization
Definition:
$$v'=\frac{v-\overline{A}}{\sigma_A}$$
where $v$ is the original value to be normalized, $\overline{A}$ is the mean of attribute A, and $\sigma_A$ is its standard deviation.
If the data set is streaming data (new data keeps arriving) and we assume the distribution of the stream is stationary,
we can sample part of the stream and compute its mean and standard deviation.
In that situation, z-score normalization is the more reasonable choice.
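A minimal sketch, assuming df_values and the column 'mass' from the earlier examples:

```python
col = df_values['mass']
df_values['mass_zscore'] = (col - col.mean()) / col.std(ddof=0)   # population std
```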
Decimal scaling
- Move the decimal point of attribute A (the number of places moved depends on the maximum absolute value of A).
$$v'=\frac{v}{10^j}$$
where $j$ is the smallest integer such that $Max(|v'|)<1$.
For example, if the minimum value in the data is 12000 and the maximum is 98000, then j = 5 (98000 / 10^5 = 0.98 < 1).
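A minimal sketch, assuming df_values and the column 'plas' from the earlier examples; j is chosen so that the largest scaled absolute value is below 1:

```python
import numpy as np

max_abs = df_values['plas'].abs().max()
j = int(np.floor(np.log10(max_abs))) + 1        # smallest j with max(|v'|) < 1
df_values['plas_scaled'] = df_values['plas'] / (10 ** j)
```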
Discretization
Discretize numerical data.
e.g., age is turned into categories such as young and middle-aged.

Unsupervised discretization: equal-width method
- Partition based on the value range so that every interval has the same width.
- That is, divide the range between the attribute's minimum and maximum into equal-width bins (see the sketch below).
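A minimal sketch with pandas, assuming df_values has a hypothetical numeric 'age' column; cut() makes three bins of equal width between min(age) and max(age):

```python
import pandas as pd

df_values['age_bin'] = pd.cut(df_values['age'], bins=3,
                              labels=['young', 'middle-aged', 'old'])
```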

Unsupervised discretization: equal-frequency method
- Partition by the frequency of values: divide the attribute's value range into several intervals such that each interval contains roughly the same number of samples (see the sketch below).
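A minimal sketch with pandas, reusing the hypothetical 'age' column; qcut() puts roughly one third of the rows in each bin:

```python
import pandas as pd

df_values['age_qbin'] = pd.qcut(df_values['age'], q=3,
                                labels=['young', 'middle-aged', 'old'])
```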

Clustering
- Clustering can also be used to divide the data into discrete categories (see the sketch below).
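A minimal sketch of discretization by clustering, reusing the hypothetical 'age' column; k-means assigns each value to one of three discrete clusters:

```python
from sklearn.cluster import KMeans

ages = df_values[['age']].values
df_values['age_cluster'] = KMeans(n_clusters=3, n_init=10,
                                  random_state=42).fit_predict(ages)
```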