Data mining knowledge points summary (final review edition)
2022-06-30 16:52:00 【A cute little monkey】
Contents
- == Chapter 1: Introduction ==
- What is the background of data mining? What drives it?
- What are the characteristics of big data?
- What is data mining?
- What is the general process of data mining?
- What is the industry process for data integration & analysis?
- What are the four main tasks of data mining? How do they differ?
- Common data mining concepts, illustrated through classification
- == Chapter 2: Data ==
- Data attribute types
- What is an asymmetric attribute?
- General characteristics of data sets
- What is the curse of dimensionality?
- How should the curse of dimensionality be understood?
- How to avoid the curse of dimensionality?
- What are the types of data sets?
- What are the common data quality problems?
- Measuring the similarity and dissimilarity of data
- == Chapter 3: Data Preprocessing ==
- Why is data preprocessing needed?
- What are the main tasks of data preprocessing?
- Data cleaning
- Missing data
- How to deal with missing data
- Abnormal data
- How to smooth abnormal data (mainly outliers)?
- Type conversion
- Sampling
- Data standardization
- Statistical description and visualization of data
- Feature selection and extraction
Chapter 1: Introduction
What is the background of data mining? What drives it?
DRIP (Data Rich, Information Poor): far more data is collected than can be turned into information.
What are the characteristics of big data?
The 3 Vs: Volume, Velocity, Variety
Volume: data sizes range from the TB level up to the ZB level
Variety: data has shifted from structured to a mix of structured and unstructured
Velocity: data is generated and transmitted very quickly
Together, these make big data difficult to store and compute on

What is data mining?
Data mining is discovering knowledge from data.
It is mining, from large amounts of data, patterns or knowledge that are interesting, helpful, implicit, previously unknown, and potentially useful.
Data mining is not a fully automatic process; human participation may be required at every step.
What is the general process of data mining?

What is the industry process for data integration & analysis?

Extract data from various data sources, then clean and fuse them (Extract, Transform, Load); load the processed data into a data warehouse; then draw data from the warehouse for all kinds of analysis and mining.
### Examples of data mining applications in various fields
**Public safety**: mine patterns of crime to prevent or reduce its occurrence
**Personalized medicine**: analyze DNA and tailor treatment to the patient's genes
**City planning**: use big data to analyze traffic heat maps across time periods and help staff plan routes
**Precision sales**: leverage customer information to make accurate recommendations
**Sports**: use data analysis to pick undervalued players with potential
What are the four main tasks of data mining? How do they differ?
Main tasks: cluster analysis, classification/prediction, association analysis, anomaly detection.
Differences:
Classification uses labels to build a model and then uses the model to predict; it is a supervised learning method.
Clustering minimizes the distance within clusters and maximizes the distance between clusters; it is an unsupervised learning method.
Common data mining concepts, illustrated through classification
What is a classification boundary?
A model is built to learn the classification boundary; the boundary can be a line, a surface, or a hyperplane.
What is overfitting?
The trained classification boundary fits the training data too closely, so the model may perform well on the training set but poorly on the test set.

What is a confusion matrix?

|                 | Predicted positive | Predicted negative |
|-----------------|--------------------|--------------------|
| Actual positive | TP (true positive) | FN (false negative) |
| Actual negative | FP (false positive) | TN (true negative) |
Evaluation metrics:
TPR = TP / (TP + FN) (the proportion of actual positives that are predicted correctly)
TNR = TN / (TN + FP) (the proportion of actual negatives that are predicted correctly)
Accuracy = (TP + TN) / (P + N) (the proportion of all instances predicted correctly, where P and N are the total numbers of actual positives and negatives)
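As a quick check of these formulas, here is a minimal Python sketch; the counts are made up for illustration.

```python
# Confusion-matrix metrics; the counts below are hypothetical.
def tpr(tp, fn):
    """True positive rate (recall): share of actual positives predicted positive."""
    return tp / (tp + fn)

def tnr(tn, fp):
    """True negative rate (specificity): share of actual negatives predicted negative."""
    return tn / (tn + fp)

def accuracy(tp, tn, p, n):
    """Share of all P positives and N negatives predicted correctly."""
    return (tp + tn) / (p + n)

# Example: 90 actual positives (80 caught), 110 actual negatives (100 caught).
print(tpr(80, 10))                  # 0.888...
print(tnr(100, 10))                 # 0.909...
print(accuracy(80, 100, 90, 110))   # 0.9
```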
What are the ROC curve and the AUC criterion?
The ROC curve plots TPR (y-axis) against FPR = FP / (FP + TN) (x-axis) as the classification threshold varies; AUC is the area under the ROC curve, and the closer it is to 1, the better the classifier.
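For computing these in practice, a hedged sketch with scikit-learn (assuming it is available; the labels and scores are made up):

```python
# ROC curve points and AUC for a toy set of labels and classifier scores.
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]                    # actual classes
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6]   # classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per threshold
print(roc_auc_score(y_true, y_score))              # closer to 1.0 is better
```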
What is cost-sensitive learning?
The confusion matrix contains two kinds of errors: predicting an actual positive as negative, and predicting an actual negative as positive.
In practical problems these two errors carry different costs, so learning should focus on reducing the error with the higher cost.
For example, in medicine, diagnosing a sick person as healthy and diagnosing a healthy person as sick are both errors, but the former is clearly more costly, so its occurrence should be reduced first.
Chapter 2: Data
Data attribute types
Attributes are divided into continuous and discrete.
What is an asymmetric attribute?
An attribute is called asymmetric when only its few non-zero values are meaningful (for example, in supermarket shopping only what you bought matters, not everything you did not buy).
General characteristics of data sets
(1) Dimensionality
The number of attributes in the data set. Analyzing high-dimensional data easily runs into the curse of dimensionality, so an important motivation of data preprocessing is dimensionality reduction.
(2) Sparsity
In some data sets, such as those with asymmetric attributes, fewer than 1% of the entries are non-zero. Storing only the non-zero values greatly reduces computing time and storage space, and there are algorithms designed for sparse data (sparse matrices).
(3) Resolution
Data collected at different frequencies have different resolutions. For example, at a resolution of a few meters the earth looks very uneven, but at a resolution of tens of kilometers it looks relatively flat. The patterns in the data depend on the resolution: if the resolution is too fine, a pattern may be buried in noise and not appear; if it is too coarse, the pattern may disappear.
What is the curse of dimensionality?
To get better classification performance we can add more features, but once the number of features passes a certain point, the classifier's performance starts to decline instead.

How should the curse of dimensionality be understood?

A high-dimensional classifier learns the noise and anomalies in the training data but fits out-of-sample data poorly, resulting in overfitting.
In other words, as the dimensionality increases while the amount of data stays fixed, the data become more and more sparse in the feature space; the model then easily overfits, learning noise and outliers, and this is the curse of dimensionality.
How to avoid the curse of dimensionality?
(1) The amount of training data
In theory, if the number of training samples grows exponentially with the dimensionality (toward infinity), the curse of dimensionality will not occur.
(2) The type of model
Classifiers with nonlinear decision boundaries, such as neural networks, KNN, and decision trees, classify the training data well but generalize poorly.
Therefore the dimensionality cannot be too high when using these classifiers; the amount of data needs to be increased instead.
Classifiers with good generalization ability, such as naive Bayes and linear classifiers, can use more features.
What are the types of data sets?
(1) Record data (data matrices, transaction data, text data)
The standard form of a data set is the data matrix (all data objects share the same fixed set of numeric attributes; it is a table).

What is the bag-of-words model? (Each document is represented as a word vector; each word is one component of the vector; the value of each component is the number of times the word appears in the document.)

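A minimal bag-of-words sketch in Python (the two toy documents are made up):

```python
# Turn each document into a vector of word counts over a shared vocabulary.
from collections import Counter

docs = ["data mining finds patterns in data",
        "big data needs mining"]

vocab = sorted({w for d in docs for w in d.split()})            # one component per word
vectors = [[Counter(d.split())[w] for w in vocab] for d in docs]

print(vocab)
print(vectors)   # "data" appears twice in the first document, so its component is 2
```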
(2) Graph data (the web, molecular structures)
(3) Sequential data (time series, spatial data, image data, video data)
What are the common data quality problems?
Poor data quality negatively affects much data processing work (for example, people in good standing getting refused loans).
Common data quality problems: noise, outliers, missing values, duplicate values, inconsistent values, imbalanced data.
Noise: irrelevant data objects or random distortion of values.
Outliers: legitimate data objects whose characteristics differ significantly from most objects in the data set.
Measuring the similarity and dissimilarity of data
Similarity measure: measures how alike data objects are. The more similar, the higher the value; the value usually falls in [0, 1].
Dissimilarity measure: measures how different data objects are. The more dissimilar, the higher the value; the value usually falls in [0, +∞), with no fixed upper bound.
Similarity measures: binary-vector similarity (SMC, Jaccard coefficient), cosine similarity, Pearson correlation
Dissimilarity measures: Euclidean distance, Minkowski distance, Mahalanobis distance
Similarity measures between binary vectors
With binary vectors x and y, let f11, f00, f10, f01 count the positions where (x, y) take the values (1,1), (0,0), (1,0), (0,1):
SMC = (f11 + f00) / (f00 + f01 + f10 + f11)
Jaccard = f11 / (f01 + f10 + f11) (ignores 0-0 matches, so it suits asymmetric attributes)
Similarity measure between multivariate vectors (cosine similarity)
cos(x, y) = (x · y) / (‖x‖ ‖y‖)
Pearson correlation coefficient
corr(x, y) = cov(x, y) / (std(x) × std(y))
The correlation coefficient lies in [-1, 1] and measures linear correlation only; variables related through a nonlinear function can therefore be uncorrelated (correlation coefficient 0).
The Pearson test can only establish linear correlation between variables; to test whether two variables are related at all, the chi-square test can be used.
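A minimal NumPy sketch of the formula above (the data are made up and roughly linear):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])   # approximately y = 2x

# corr(x, y) = cov(x, y) / (std(x) * std(y))
r = np.cov(x, y, bias=True)[0, 1] / (x.std() * y.std())
print(r)                        # close to 1: strong positive linear correlation
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in agrees
```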
Euclidean distance
d(x, y) = sqrt( Σ_k (x_k − y_k)² )
Minkowski distance
The Minkowski distance is a generalization of the Euclidean distance:
d(x, y) = ( Σ_k |x_k − y_k|^r )^(1/r)
where r is a parameter, n is the number of dimensions (attributes), and x_k and y_k are the k-th attributes (components) of x and y.
r = 1: Manhattan distance (L1 norm)
r = 2: Euclidean distance (L2 norm)
r → ∞: supremum distance (Lmax or L∞ norm)
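A minimal sketch of all three cases of the Minkowski distance:

```python
import numpy as np

def minkowski(x, y, r):
    """(sum_k |x_k - y_k|^r)^(1/r); pass np.inf for the supremum distance."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return diff.max() if np.isinf(r) else (diff ** r).sum() ** (1 / r)

x, y = [0, 2], [3, 6]
print(minkowski(x, y, 1))       # 7.0  Manhattan (L1)
print(minkowski(x, y, 2))       # 5.0  Euclidean (L2)
print(minkowski(x, y, np.inf))  # 4.0  supremum (L∞)
```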
Mahalanobis distance
d(x, y) = sqrt( (x − y)ᵀ Σ⁻¹ (x − y) ), where Σ is the covariance matrix of the data
Advantages of the Mahalanobis distance:
(1) Not affected by units of measurement
Dividing by the covariance matrix removes the variance of each component and eliminates the units, so the Mahalanobis distance between two points is independent of the measurement units of the original data, which is more scientific and reasonable.
(2) The Mahalanobis distance also eliminates the interference caused by correlations between variables.
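A minimal Mahalanobis-distance sketch with NumPy (the toy data set is made up):

```python
import numpy as np

# Five 2-D points; the covariance matrix is estimated from them.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))   # inverse covariance Sigma^-1

def mahalanobis(x, y, cov_inv):
    """sqrt((x - y)^T Sigma^-1 (x - y)): unit-free, decorrelated distance."""
    d = x - y
    return np.sqrt(d @ cov_inv @ d)

print(mahalanobis(X[0], X[4], cov_inv))
```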
Chapter 3: Data Preprocessing
Why is data preprocessing needed?
Because real-world data is very "dirty"; with so much data, the following problems can arise.
What are the main tasks of data preprocessing?

Data cleaning

Missing data

How to deal with missing data
Common approaches: delete the records (or attributes) with missing values; fill in values manually; fill with a global constant, or with the attribute's mean/median/mode; or estimate the missing value with a model built from the other attributes.
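A hedged sketch of these treatments with pandas (assuming it is available; the tiny table is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age":    [25, np.nan, 40, 35],
                   "income": [3000, 4500, np.nan, 5200]})

dropped      = df.dropna()              # delete records with missing values
filled_mean  = df.fillna(df.mean())     # fill with each attribute's mean
filled_const = df.fillna(0)             # fill with a global constant
print(filled_mean)
```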
Abnormal data
Random errors in measured variables (noise) or deviating values (outliers),
i.e., noise and outliers.
How to smooth abnormal data (mainly outliers)?
Common techniques: binning (sort the values into bins and replace each value with the bin mean or the nearest bin boundary), regression (fit the data with a function and use the fitted values), and clustering (values falling outside all clusters are treated as outliers and removed).
Type conversion
A categorical attribute with several categories can be converted through encoding (for example, one binary indicator per category), as sketched below.

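A minimal encoding sketch with pandas (assuming it is available; the attribute is made up):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
encoded = pd.get_dummies(df, columns=["color"])   # one binary column per category
print(encoded)
```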
Discretization (converting continuous variables into discrete ones)




Supervised discretization: use the class labels to find breakpoints; new samples can then be discretized according to these breakpoints and classified.
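A minimal (unsupervised) discretization sketch, equal-width binning with pandas; the ages and bin labels are made up:

```python
import pandas as pd

ages = pd.Series([3, 17, 25, 35, 48, 61, 79])
bins = pd.cut(ages, bins=3, labels=["young", "middle", "old"])  # 3 equal-width intervals
print(bins.tolist())
```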
Sampling
Sampling reduces the time complexity of reading and processing data.
Sampling can also be used to adjust the class distribution (useful for imbalanced data sets).
What is an imbalanced data set?
An imbalanced data set is one in which, for a classification problem, the classes have very unequal sample sizes.
What is the problem with imbalanced data sets?
An example: out of 100 people, 99 are healthy and 1 has cancer. A classifier trained on this imbalanced data set can predict that everyone is healthy and still reach 99% accuracy, so the trained model is meaningless.
This is the drawback of imbalanced data sets.
How to avoid the disadvantages of imbalanced data sets?
(1) Adjust the class distribution by sampling (see the sketch after this list)
Oversampling: sample the minority class with replacement to increase its size (add copies of some samples)
Undersampling: sample the majority class to reduce its size (delete some samples)

(2) Define a new accuracy evaluation criterion that accounts for the class imbalance
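A minimal sketch of both sampling strategies with NumPy (the class sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
majority = np.arange(95)         # indices of 95 majority-class samples
minority = np.arange(95, 100)    # indices of 5 minority-class samples

# Oversampling: draw minority samples with replacement until the classes balance.
over = rng.choice(minority, size=95, replace=True)

# Undersampling: keep only as many majority samples as the minority has.
under = rng.choice(majority, size=5, replace=False)

print(len(over), len(under))     # 95 5
```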
Data standardization
When the attribute has known upper and lower bounds: min-max standardization
x' = (x − min) / (max − min)
When there are no known upper and lower bounds: z-score standardization
z = (x − μ) / σ
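A minimal sketch of both standardizations with NumPy:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

minmax = (x - x.min()) / (x.max() - x.min())  # maps to [0, 1]; needs known bounds
zscore = (x - x.mean()) / x.std()             # mean 0, std 1; no bounds required

print(minmax)   # [0.   0.25 0.5  0.75 1.  ]
print(zscore)
```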
Statistical description and visualization of data
Descriptive statistics of data


Data visualization





Feature selection and extraction
Why is feature extraction needed?
Too many attributes make the dimensionality of the whole space too large (possibly leading to the curse of dimensionality). For example, classification in 100 dimensions requires searching for a decision boundary in a 100-dimensional feature space, which makes the problem too difficult.
Therefore feature extraction is needed: pick out the most relevant attributes so that the problem becomes easier.
How to judge the quality of attributes?
Qualitatively: class histograms (for discrete attributes), class distribution plots (for continuous attributes)
Quantitatively: entropy, information gain
Entropy measures the uncertainty of a system, i.e., how confidently we can tell which value an object takes or which class it belongs to. It is the mathematical expectation of the amount of information, and in information theory it measures a system's uncertainty. (Smaller is better.)
Information gain: how much the uncertainty of the whole system is reduced once an additional attribute is known. (Bigger is better.)
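A minimal sketch of entropy and information gain (the tiny label sets are made up):

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum p*log2(p): uncertainty of the class distribution (smaller is better)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

parent = ["+", "+", "+", "-", "-", "-"]         # labels before the split
splits = [["+", "+", "+"], ["-", "-", "-"]]     # partition induced by some attribute

gain = entropy(parent) - sum(len(s) / len(parent) * entropy(s) for s in splits)
print(gain)   # 1.0: this attribute removes all uncertainty (bigger is better)
```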