[Data mining] Chapter 2: Understanding data
2022-07-25 09:46:00 【clusters of stars ¹ ⁸⁹⁵】
Data mining
Chapter 2: Understanding data
The world of the future is a ternary world made up of three spaces: physical space, human social space, and information space.
2.1 Types of data
Examples:
- Transaction data
- Document data
- Network data
- Gene sequences
- Environmental data
What is data?
Data is a collection of objects and their attributes, and it can be represented as a matrix. Each object can be understood as a point in a high-dimensional space.
Attribute types
- Categorical
* Nominal: male or female
* Ordinal: bachelor's, master's, doctorate
- Numerical
* Interval: today's temperature is 15~28 degrees
* Ratio: today the A-share market fell 2.26%
Normalization, distance, and angle
- Normalization: rescale the data, for example by standardization, which subtracts the mean and divides by the standard deviation.
- Euclidean distance: the most commonly used distance.
- Cosine similarity: the cosine of the angle between two vectors. The larger the value, the smaller the angle, meaning the two vectors point in nearly the same direction and are more strongly related. It can be used to judge how related two vectors are, for example in face recognition.
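The three operations above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names are chosen here for the example, and using the population standard deviation (`pstdev`) is an assumption, since the notes do not say which variant is meant.

```python
import math
from statistics import mean, pstdev


def z_score(values):
    """Standardize values: subtract the mean, divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    return [(v - mu) / sigma for v in values]


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


scaled = z_score([2.0, 4.0, 6.0])                           # mean 0, std 1
dist = euclidean_distance([0.0, 0.0], [3.0, 4.0])           # 3-4-5 triangle
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # right angle
```

Parallel vectors give a cosine close to 1 regardless of their lengths, while perpendicular vectors give 0, which is why the cosine is useful when only direction matters.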
Probabilistic view
Data can also be analyzed probabilistically, for example by examining the probability distribution of an attribute.
- Probability distribution function
- Probability density function (pdf): describes the distribution of a continuous attribute
- Probability mass function (pmf): the probability that the sample takes a given discrete value
- Common probability distributions
* Bernoulli distribution
* Binomial distribution
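As a small illustration of a probability mass function, the sketch below evaluates the pmf of the two distributions listed above. The function names are hypothetical and chosen for this example; `math.comb` requires Python 3.8 or later.

```python
from math import comb


def bernoulli_pmf(k, p):
    """P(X = k) for a single trial with success probability p; k is 0 or 1."""
    if k not in (0, 1):
        return 0.0
    return p if k == 1 else 1.0 - p


def binomial_pmf(k, n, p):
    """P(X = k) successes in n independent Bernoulli(p) trials."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


# Probabilities of 0..4 heads in 4 fair coin flips; they sum to 1.
pmf_values = [binomial_pmf(k, 4, 0.5) for k in range(5)]
```

A binomial distribution with n = 1 reduces to the Bernoulli distribution, which is one way to sanity-check the two functions against each other.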
Data quality
- Noise and outliers
- Missing values
- Duplicate data
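The notes only name these problems. As one possible way of handling two of them, the sketch below removes duplicate records and fills a missing value with the mean of the observed values. The records are made up for illustration, and mean imputation is just one simple choice, not a method prescribed by the notes.

```python
from statistics import mean

# Hypothetical records: (id, temperature); None marks a missing value.
records = [(1, 20.0), (2, None), (3, 22.0), (1, 20.0), (4, 24.0)]

# Remove exact duplicates, keeping the first occurrence and the original order.
deduplicated = list(dict.fromkeys(records))

# Fill missing values with the mean of the observed values (one simple choice).
observed = [t for _, t in deduplicated if t is not None]
fill_value = mean(observed)
cleaned = [(i, fill_value if t is None else t) for i, t in deduplicated]
```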
Sampling
- Simple random sampling
* Every object has the same probability of being selected
* Sampling without replacement
* Sampling with replacement
- Stratified sampling
- Sample size
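The sampling schemes above might be sketched with Python's standard `random` module as follows. The population, the class labels, and the `stratified_sample` helper are hypothetical, and drawing the same fraction from every stratum is only one of several ways to stratify.

```python
import random
from collections import defaultdict

rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = list(range(100))

# Without replacement: each object can be selected at most once.
without_replacement = rng.sample(population, k=10)

# With replacement: the same object may be selected more than once.
with_replacement = rng.choices(population, k=10)


def stratified_sample(items, key, fraction, rng):
    """Draw the same fraction from each stratum (group) defined by key."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for group in strata.values():
        size = max(1, round(len(group) * fraction))  # at least one per stratum
        sample.extend(rng.sample(group, size))
    return sample


# Hypothetical labeled data: 90 objects of class "a", 10 of class "b".
labeled = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]
stratified = stratified_sample(labeled, key=lambda item: item[0], fraction=0.1, rng=rng)
```

With a 10% fraction, the stratified sample contains nine objects of class "a" and one of class "b", so the rare class is guaranteed to appear, which simple random sampling does not ensure.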
Attribute transformation
- Nonlinear functions: power functions, exponential functions, logarithmic functions
- Standardization
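As a brief illustration of a nonlinear transformation, the sketch below applies a logarithm and a square root (a power function with exponent 1/2) to a made-up attribute whose values span several orders of magnitude.

```python
import math

# Hypothetical attribute whose values span several orders of magnitude.
values = [1.0, 10.0, 100.0, 1000.0]

log_values = [math.log10(v) for v in values]  # logarithmic function
sqrt_values = [math.sqrt(v) for v in values]  # power function, exponent 1/2
```

After the logarithm the values are evenly spaced, so the largest value no longer dominates the others.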
Curse of dimensionality and dimensionality reduction
Purpose:
- Avoid the curse of dimensionality
- Reduce the time and memory required by data mining algorithms
- Make data easier to visualize
- Help eliminate irrelevant features or reduce noise
Methods:
- Principal component analysis (PCA)
- Singular value decomposition (SVD)
- t-SNE: projects high-dimensional data into 2 or 3 dimensions for visualization
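To make the idea of principal component analysis concrete, here is a minimal sketch that finds the first principal component of 2-D points and projects them onto it, reducing two dimensions to one. It uses power iteration on the covariance matrix so that it runs with the standard library alone; that is a choice made for this example, and library implementations typically rely on SVD or an eigen-solver instead. The data and the function name are hypothetical, and the sketch assumes the points are not all identical.

```python
import math


def first_principal_component(points, iterations=100):
    """Return the unit direction of greatest variance for 2-D points."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(p[0] - mean_x, p[1] - mean_y) for p in points]

    # Entries of the 2x2 covariance matrix [[cxx, cxy], [cxy, cyy]].
    cxx = sum(x * x for x, _ in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    cyy = sum(y * y for _, y in centered) / n

    vx, vy = 1.0, 1.0  # arbitrary starting vector
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalize to unit length.
        nx = cxx * vx + cxy * vy
        ny = cxy * vx + cyy * vy
        norm = math.hypot(nx, ny)
        vx, vy = nx / norm, ny / norm
    return vx, vy


# Hypothetical data lying close to the line y = 2x.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
pc = first_principal_component(points)

# Project each centered point onto the component: 2-D data reduced to 1-D.
mean_x = sum(p[0] for p in points) / len(points)
mean_y = sum(p[1] for p in points) / len(points)
projected = [(x - mean_x) * pc[0] + (y - mean_y) * pc[1] for x, y in points]
```

Because the points lie near the line y = 2x, the recovered direction has a slope close to 2, and the single projected coordinate keeps almost all of the variance in the data.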
Feature subset selection
Selecting a subset of features is also a form of dimensionality reduction.
Post-processing
- Visualization: it is intuitive, and the human eye is a powerful analytical tool