Data preprocessing of data mining
2022-07-26 15:17:00 【Caaaaaan】
Data quality
Widely accepted measures of data quality:
- Accuracy
- Completeness (are there missing values?)
- Consistency
- Timeliness (is the data out of date?)
- Believability (how trustworthy is the data source?)
- Interpretability
Data preprocessing
The purpose of data preprocessing is to improve data quality.
Main tasks
- Data cleaning
  - Fill in missing values
  - Smooth noisy data
  - Identify or remove outliers
  - Resolve inconsistencies
- Data integration
  - Integrate multiple databases, data cubes, or files
- Data reduction
  - Dimensionality reduction
  - Numerosity reduction
  - Data compression
- Data transformation and data discretization
  - Normalization
  - Discretization
Data cleaning
Handling missing values
Ignore the tuple (i.e., delete the whole record)
This is usually done when the class label is missing (in supervised machine learning, a training example that lacks its class label).
- The class label is the attribute to be predicted; such a tuple is missing the very value the training set is supposed to provide.
When many attributes (i.e., fields) have a large proportion of missing values, this approach works poorly.
- In that case, deleting tuples would make the dataset too small.
- Consider deleting the affected attribute instead.
Fill in manually: labor-intensive.
Fill in automatically: commonly, fill with the mean value of the attribute, e.g.:
```python
# Drop attributes whose missing-value count (the 'total' column of miss_data) exceeds 200
df_values = df_values.drop(miss_data[miss_data['total'] > 200].index, axis=1)
# Fill the remaining missing values with the mean of each attribute
df_values['pres'].fillna(df_values['pres'].mean(), inplace=True)
df_values['mass'].fillna(df_values['mass'].mean(), inplace=True)
df_values['plas'].fillna(df_values['plas'].mean(), inplace=True)
```
Handling noisy data
Box plots can be used to detect outliers, which are then deleted.
When there are many outliers, this again leads to a smaller dataset.
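As a sketch of the usual box-plot rule (not code from the original post), values outside $[Q_1 - 1.5\,IQR,\ Q_3 + 1.5\,IQR]$ are flagged as outliers; the DataFrame and column name below are assumed for illustration.

```python
import pandas as pd

def remove_outliers_iqr(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Drop rows whose value in `col` falls outside the box-plot whiskers."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return df[(df[col] >= lower) & (df[col] <= upper)]

# Hypothetical usage: df_values = remove_outliers_iqr(df_values, 'pres')
```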

Handling inconsistent data
- Correct by computation or inference, then replace
- Global replacement
Data integration
Data integration: combining data from multiple data sources into a coherent data store.

Schema integration
- That is, when two datasets use different field names that express the same attribute, integrate them into one.

Entity identification
- For example, one dataset records a person's name in Chinese while another dataset records it in English,
- but both refer to the same person (the same real-world entity); in this situation we must identify the entity before integrating.

Data conflict detection and resolution
- For the same real-world entity, attribute values from different sources may differ
- Possible causes: different representations, different scales (e.g., metric vs. imperial units)
As shown in the figure above, the same height entity has different values because the units differ.
Handling redundant information
For example: one dataset contains 3000 m race results and another contains 5000 m results; they can be integrated into a single measure of running ability.
- The same attribute or object may have different names in different databases
- An attribute may be a "derived" attribute computable from another table, such as running ability
- Redundant attributes can be detected through correlation analysis and covariance analysis
- Carefully integrating data from multiple sources may help reduce/avoid redundancy and inconsistency, and improve mining speed and quality

Correlation analysis —— Discrete variables
Chi square test χ 2 ( c h i − s q u a r e ) t e s t χ 2 = ∑ ( O b s e r v e d − E x p e c t e d ) 2 E x p e c t e d ∙ χ 2 The bigger the value is. , The more likely it is that the variable is relevant ∙ Correlation doesn't mean causation Chi square test \\ \chi^2(chi-square)test\\ \chi^2=\sum\frac{(Observed-Expected)^2}{Expected}\\ \bullet \chi^2 The bigger the value is. , The more likely it is that the variable is relevant \\ \bullet Correlation doesn't mean causation Chi square test χ2(chi−square)testχ2=∑Expected(Observed−Expected)2∙χ2 The bigger the value is. , The more likely it is that the variable is relevant ∙ Correlation doesn't mean causation

The first number in each cell is the observed count, e.g., people who both like playing chess and like science fiction.
The value in parentheses is the expected count.
The expected count is calculated as (corresponding row total × corresponding column total) / grand total.
For example, 450 × 300 / 1500 = 90.
Once the observed and expected counts are available, the corresponding chi-square statistic can be computed.
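A minimal SciPy sketch of this calculation, using observed counts chosen to match the 450 × 300 / 1500 = 90 example (the exact table is assumed for illustration):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows = plays chess / does not, columns = likes science fiction / does not
observed = np.array([[250, 200],
                     [50, 1000]])

# correction=False applies the plain chi-square formula without Yates' continuity correction
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(expected[0, 0])   # 90.0, i.e. 450 * 300 / 1500
print(chi2, p_value)    # a large chi-square suggests the two attributes are correlated
```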

Correlation analysis: continuous variables
For continuous variables there are no observed and expected counts to tabulate.
- Correlation coefficient: the Pearson correlation coefficient
- You can compute the correlation coefficient matrix with corr() and then visualize it with a heat map (see the sketch after the interpretation below)
Pearson correlation coefficient:

$$r_{p,q} = \frac{\sum (p - \overline{p})(q - \overline{q})}{(n-1)\,\sigma_p \sigma_q} = \frac{\sum (pq) - n\,\overline{p}\,\overline{q}}{(n-1)\,\sigma_p \sigma_q}$$
where n is the number of tuples, p and q are the values of the respective attributes, and $\sigma_p$ and $\sigma_q$ are the respective standard deviations.
When r > 0, the two variables are positively correlated; when r < 0, they are negatively correlated.
When |r| = 1, the two variables are perfectly linearly correlated, i.e., they have an exact functional relationship.
When r = 0, there is no linear correlation between the two variables.
When 0 < |r| < 1, there is some degree of linear correlation between the two variables.
- The closer |r| is to 1, the stronger the linear relationship between the two variables;
- the closer |r| is to 0, the weaker the linear correlation between them.
This is commonly divided into three levels:
- |r| < 0.4: low linear correlation
- 0.4 <= |r| < 0.7: significant correlation
- 0.7 <= |r| < 1: high linear correlation
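As mentioned above, corr() plus a heat map is a quick way to inspect pairwise correlations; a minimal sketch, assuming a numeric pandas DataFrame df_values and that seaborn and matplotlib are installed:

```python
import seaborn as sns
import matplotlib.pyplot as plt

corr_matrix = df_values.corr()   # Pearson correlation coefficients by default
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.show()
```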
Covariance
- Covariance is also used to indicate the correlation between two sets of data
Relationship between covariance and the correlation coefficient:

$$r_{p,q} = \frac{Cov(p,q)}{\sigma_p \sigma_q}$$
Covariance formula:

$$Cov(p,q) = E\big[(p - \overline{p})(q - \overline{q})\big] = \frac{\sum_{i=1}^{n}(p_i - \overline{p})(q_i - \overline{q})}{n}$$

which can be simplified to:

$$Cov(A,B) = E(A \cdot B) - \overline{A}\,\overline{B}$$
where n is the number of tuples, p and q are the values of the respective attributes, and $\sigma_p$ and $\sigma_q$ are the respective standard deviations.
- Positive correlation: $Cov(p,q) > 0$
- Negative correlation: $Cov(p,q) < 0$
- Independence implies: $Cov(p,q) = 0$
Note that two random variables can have covariance 0 and still not be independent.
Additional assumptions are needed; for example, if the data follow a multivariate normal distribution, then a covariance of 0 does imply independence.
Note:
- Independence $\Rightarrow Cov(p,q) = 0$
- $Cov(p,q) = 0 \nRightarrow$ independence
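A small numerical illustration of this caveat (toy data assumed): with p symmetric around 0 and q = p², the covariance is 0 even though q is completely determined by p.

```python
import numpy as np

p = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
q = p ** 2                              # q depends on p, but not linearly

# np.cov returns the covariance matrix; the off-diagonal entry is Cov(p, q)
print(np.cov(p, q, bias=True)[0, 1])    # 0.0: uncorrelated, yet clearly not independent
```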
Data reduction
- A data warehouse can store terabytes of data, so complex data analysis over the complete dataset may take a very long time
Dimension reduction
Combine high-dimensional data: use some method to turn high-dimensional data into low-dimensional data.
For example: given a dataset of exam results with 6 subjects as attributes (Chinese, math, English, physics, chemistry, biology), we can reduce the attributes to two dimensions: a liberal-arts score and a science score.
Reasons:
As the number of dimensions increases, the data become more and more sparse.
- For example, in the case dataset, as dimensions increase, large numbers of normal values pour in and the disease records we care about get drowned out.
The number of possible subspace combinations grows exponentially.
- In rule-based classification, the number of rules to be built multiplies.
- The higher the dimensionality, the more complex the feature rules may become.
Machine learning methods such as neural networks mainly need to learn a weight parameter for each feature. The more features there are, the more parameters must be learned and the more complex the model:

$$\hat{y} = \mathrm{sign}(\omega_1 x_1 + \omega_2 x_2 + ... + \omega_d x_d - t)$$

A basic training-set principle in machine learning: the more complex the model, the more training data are needed to learn its parameters; otherwise the model will underfit.
Therefore, if the dataset has high dimensionality but only a small number of training examples, dimensionality reduction should be preferred before using a complex machine learning model.
Visualization is also a concern.
- The higher the dimensionality, the more complex the visualization becomes.
Dimensionality reduction method: PCA (principal component analysis)
- The core idea of PCA (principal component analysis)
  - Many attributes in the data may be correlated with one another in some way
  - Can we find a way to combine several correlated attributes into a single attribute?

- What principal component analysis mainly does
  - It tries to regroup the original, numerous, somewhat correlated attributes into a set of uncorrelated comprehensive attributes that replace the original attributes
  - The usual mathematical treatment takes linear combinations of the original p attributes as the new comprehensive attributes, i.e., linearly weighted combinations
Definition: let $x_1, x_2, ..., x_p$ be the original variables and $z_1, z_2, ..., z_m\ (m \le p)$ the new comprehensive variables:

$$\begin{cases} z_1 = l_{11}x_1 + l_{12}x_2 + ... + l_{1p}x_p \\ z_2 = l_{21}x_1 + l_{22}x_2 + ... + l_{2p}x_p \\ \vdots \\ z_m = l_{m1}x_1 + l_{m2}x_2 + ... + l_{mp}x_p \end{cases}$$
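A hedged scikit-learn sketch of this idea; the 6-subject score matrix X is a randomly generated placeholder standing in for the exam-score example above:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# X: assumed (n_students, 6) score matrix, one column per subject
X = np.random.default_rng(0).normal(70, 10, size=(100, 6))

X_std = StandardScaler().fit_transform(X)   # standardize so every attribute is on the same scale
pca = PCA(n_components=2)                   # keep two comprehensive attributes z1, z2
Z = pca.fit_transform(X_std)

print(pca.explained_variance_ratio_)        # share of variance carried by each principal component
print(pca.components_)                      # the weights l_ij of the linear combinations above
```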
Numerosity reduction
The data may be so large that they do not fit in the computer's memory;
moreover, we may not plan to use all of the data for training.
- Simple random sampling (Simple Random Sampling); a pandas sketch follows this list
  - Each object is selected with equal probability
  - Sampling without replacement
    - Once an object is selected, it is removed from the pool
  - Sampling with replacement
    - Selected objects are not removed, so the same object may be drawn again
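A minimal pandas sketch of both schemes (the DataFrame df_values and the sample size are assumed for illustration):

```python
# Simple random sampling without replacement: a selected object cannot be drawn again
sample_wo = df_values.sample(n=100, replace=False, random_state=42)

# Simple random sampling with replacement: the same object may be drawn more than once
sample_w = df_values.sample(n=100, replace=True, random_state=42)
```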
The impact of sample size on data quality

Data compression

Data transformation
- Function mapping: each attribute value is replaced by a new representation, such that every old value can be identified with a new value
Normalization
Main idea: scale the dataset into a specified interval.
Reasons:
- For example, with college entrance exam results, Guangdong Province has Guangdong's grading standard while Beijing has Beijing's standard
- In a dataset, this shows up as attributes whose ranges of variation differ enormously
Min-max normalization
Definition:

$$v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$$

where v is the value to be normalized.
$new\_max_A$ and $new\_min_A$ depend on how you want to standardize; if normalizing to the interval [0, 1], the new maximum is 1 and the new minimum is 0.
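A minimal sketch of min-max normalization to [0, 1], with the column name assumed for illustration (scikit-learn's MinMaxScaler does the same thing):

```python
# Rescale one attribute so that new_min = 0 and new_max = 1
col = df_values['pres']
df_values['pres_scaled'] = (col - col.min()) / (col.max() - col.min())
```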
Z-score normalization
Definition:

$$v' = \frac{v - \overline{A}}{\sigma_A}$$

where v is the value to be standardized, $\overline{A}$ is the mean of attribute A, and $\sigma_A$ is its standard deviation.
If the dataset is streaming data (new data may arrive at any time) and we assume the distribution of the stream is constant,
we can sample part of the stream and compute the mean and standard deviation from that sample.
In this situation, z-score standardization is the more reasonable choice (see the sketch below).
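A sketch of this streaming setup, under the stated assumption that the stream's distribution does not change: estimate the mean and standard deviation once from a sample, then apply the frozen estimates to later data (sample_df is an assumed DataFrame holding the sampled records).

```python
# Estimate mean and standard deviation from a sample of the stream
mu = sample_df['mass'].mean()
sigma = sample_df['mass'].std()

def z_score(v: float) -> float:
    """Standardize a newly arriving value with the frozen estimates."""
    return (v - mu) / sigma
```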
Decimal scaling
- Move the decimal point of attribute A (the number of positions moved depends on the maximum absolute value of A)
$$v' = \frac{v}{10^j}$$

where j is the smallest integer such that $Max(|v'|) < 1$.
For example, if the minimum value in the data is 12000 and the maximum value is 98000, then j = 5.
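A small sketch of decimal scaling that reproduces the 98000 → j = 5 example (the value list is assumed):

```python
import numpy as np

values = np.array([12000, 45000, 98000])
# Smallest j such that max(|v|) / 10**j < 1 (the +1 handles exact powers of ten)
j = int(np.ceil(np.log10(np.abs(values).max() + 1)))
scaled = values / 10 ** j
print(j, scaled)   # j == 5, so 98000 becomes 0.98
```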
Discretization
Discretize numerical data.
Example: age is transformed into categories such as young and middle-aged.

Unsupervised discretization: the equal-width method
- Partition based on the range of values, making every interval equally wide
- That is, divide the range between the attribute's maximum and minimum into intervals of equal width

Unsupervised discretization: the equal-frequency method
- Partition according to how frequently values occur: divide the attribute's value range into several intervals, requiring each interval to contain the same number of samples (a pandas sketch of both methods follows)
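A minimal pandas sketch of both unsupervised schemes, using an assumed list of ages and three bins:

```python
import pandas as pd

ages = pd.Series([18, 22, 25, 31, 35, 42, 48, 55, 63, 70])

# Equal-width binning: split the range [min, max] into intervals of equal width
equal_width = pd.cut(ages, bins=3, labels=['young', 'middle-aged', 'old'])

# Equal-frequency binning: each bin receives (roughly) the same number of samples
equal_freq = pd.qcut(ages, q=3, labels=['young', 'middle-aged', 'old'])

print(equal_width.value_counts(), equal_freq.value_counts())
```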

Clustering
- Clustering can be used to group the data into different discrete categories