[Data mining] Final review: Chapter 3

2022-06-21 06:17:00 【A delicious little pig】

Chapter 3: Classification

1. Definition of classification

Classification learns a prediction function (a classification model) from a data set and uses it to predict the class labels of unknown samples. For example, it can predict whether an email is spam from its subject line and content. Both classification and regression make predictions, but the output of classification is discrete or nominal, while the output of regression is a continuous value. For example, predicting whether a bank customer will churn in the future is a classification task, whereas predicting the total turnover of a shopping mall in the next year is a regression task.

2. Application fields of classification

Classification and regression methods are now widely used across industries. In finance, classifiers are used to predict the future direction of a stock. In medical diagnosis, they are used to predict the diagnosis of a disease. In marketing, historical sales data is used to predict whether certain goods will sell and in which area an advertisement should be placed.

3. General steps for classification

(1) Divide the data set into a training set and a test set;
(2) Learn from the training set to build a classification model (the model can be, for example, a decision tree or a set of classification rules);
(3) Use the classification model to classify the test set, and evaluate its classification accuracy and other performance measures;
(4) Use a classification model with high classification accuracy to classify future samples whose class labels are unknown.
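The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy data, the function names (`split_dataset`, `train_1nn`, `accuracy`) and the choice of a 1-nearest-neighbor model are all made up for this example (nearest neighbor is one of the methods listed in section 4), and a real project would normally use a library instead.

```python
import random

def split_dataset(samples, test_ratio=0.3, seed=0):
    """Step 1: shuffle the labeled samples and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

def train_1nn(train_set):
    """Step 2: 'learn' a 1-nearest-neighbor model (it simply remembers the training set)."""
    def predict(features):
        # Pick the training sample with the smallest squared Euclidean distance.
        nearest = min(
            train_set,
            key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], features)),
        )
        return nearest[1]
    return predict

def accuracy(model, test_set):
    """Step 3: fraction of test samples whose predicted label equals the true label."""
    correct = sum(1 for features, label in test_set if model(features) == label)
    return correct / len(test_set)

# Hypothetical toy data: (features, class label), two well-separated groups.
data = [
    ((1.0, 1.1), "A"), ((0.9, 1.0), "A"), ((1.2, 0.8), "A"),
    ((1.1, 1.3), "A"), ((0.8, 0.9), "A"),
    ((5.0, 5.2), "B"), ((5.1, 4.9), "B"), ((4.8, 5.0), "B"),
    ((5.3, 5.1), "B"), ((4.9, 4.7), "B"),
]

train_set, test_set = split_dataset(data)   # step 1
model = train_1nn(train_set)                # step 2
test_accuracy = accuracy(model, test_set)   # step 3
new_label = model((4.9, 5.0))               # step 4: sample with unknown label
```

Because the two groups in the toy data are far apart, any split that leaves both classes in the training set classifies the test set perfectly; real data is rarely this clean, which is why step (3) matters.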

4. Categories of classification algorithms

Classification methods:

Decision tree based classification
Bayesian classification
Nearest neighbor classification
Neural networks
Support vector machines, etc.

Regression methods:

Linear regression
Nonlinear regression
Logistic regression, etc.

5. Decision tree classification algorithms

ID3, C4.5, CART, etc.

6. ID3 decision tree

The ID3 classification algorithm uses information gain as its attribute selection criterion. The basic idea is as follows: first, examine all attributes and select the one with the maximum information gain as a decision tree node; create one branch for each distinct value of that attribute; then recursively apply the same method to the subset of data in each branch to build the subtree below it, until every subset contains only data of a single category. The result is a decision tree that can be used to classify new samples.

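Definition of information entropy:

Entropy is calculated from the class probabilities. For a data set $D$ in which a fraction $p_k$ of the samples belongs to class $k$ ($k = 1, \dots, K$):

$$H(D) = -\sum_{k=1}^{K} p_k \log_2 p_k$$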
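As a small illustration of "calculate with probability", the following sketch computes the information entropy of a list of class labels using the standard formula, the negative sum of p * log2(p) over the classes. The function name `entropy` and the label values are chosen only for this example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Information entropy (in bits) of a list of class labels."""
    total = len(labels)
    # p = count / total is the probability of each class; sum -p * log2(p).
    return sum(-(count / total) * log2(count / total)
               for count in Counter(labels).values())

pure = entropy(["yes", "yes", "yes", "yes"])          # only one class
balanced = entropy(["yes", "no", "yes", "no"])        # two equally likely classes
mixed = entropy(["yes"] * 9 + ["no"] * 5)             # 9 "yes" vs 5 "no"
```

A set with a single class has entropy 0, two equally likely classes give 1 bit, and the 9-to-5 mix gives roughly 0.940 bits: the more mixed the labels, the higher the entropy.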

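Definition of information gain:

Information gain is the entropy before the division minus the entropy after the division. If attribute $A$ divides the data set $D$ into subsets $D_1, \dots, D_V$ (one per value of $A$), and $H(\cdot)$ denotes information entropy:

$$\mathrm{Gain}(D, A) = H(D) - \sum_{v=1}^{V} \frac{|D_v|}{|D|} H(D_v)$$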
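To make "before division minus after division" concrete, here is a minimal sketch that computes information gain and then uses it to grow an ID3-style tree as described in section 6. Several things are assumptions made for the example, not part of the notes: the toy weather-style data, the function names, the nested-dict tree representation, and the fallback to the majority class when no attributes are left.

```python
from collections import Counter
from math import log2

def entropy(rows, target):
    """Entropy of the target column over a list of dict rows."""
    total = len(rows)
    counts = Counter(row[target] for row in rows)
    return sum(-(c / total) * log2(c / total) for c in counts.values())

def partition(rows, attribute):
    """Group rows by their value of the given attribute."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attribute], []).append(row)
    return subsets

def information_gain(rows, attribute, target):
    """Entropy before the division minus the weighted entropy after it."""
    total = len(rows)
    after = sum(len(s) / total * entropy(s, target)
                for s in partition(rows, attribute).values())
    return entropy(rows, target) - after

def build_id3(rows, attributes, target):
    """Recursively build a tree: a leaf is a label, a node is {attribute: {value: subtree}}."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:          # subset is pure: stop
        return labels[0]
    if not attributes:                 # assumption: fall back to the majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    return {best: {value: build_id3(subset, remaining, target)
                   for value, subset in partition(rows, best).items()}}

def classify(tree, sample):
    """Walk the tree; raises KeyError for attribute values never seen in training."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute][sample[attribute]]
    return tree

# Hypothetical toy data.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]

tree = build_id3(rows, ["outlook", "windy"], "play")
prediction = classify(tree, {"outlook": "rainy", "windy": "yes"})
```

On this data `outlook` has the larger information gain (about 0.667 versus about 0.082 for `windy`), so it becomes the root; only the `rainy` branch is still mixed and is split again on `windy`.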

7. C4.5 algorithm

Characteristics:

Can handle both continuous and discrete attribute data
Uses the information gain ratio as the attribute selection criterion of the decision tree
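The notes do not spell out the gain ratio, so the sketch below assumes its usual definition: the information gain divided by the split information, where the split information is the entropy of the attribute's own value distribution. The function names and the toy data (including the `id` column) are made up for illustration.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Entropy (in bits) of a list of discrete values."""
    total = len(values)
    return sum(-(c / total) * log2(c / total) for c in Counter(values).values())

def information_gain(rows, attribute, target):
    """Entropy of the target before the split minus the weighted entropy after it."""
    total = len(rows)
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    after = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - after

def gain_ratio(rows, attribute, target):
    """Information gain divided by split information (0.0 if the attribute has one value)."""
    split_info = entropy([row[attribute] for row in rows])
    if split_info == 0:
        return 0.0
    return information_gain(rows, attribute, target) / split_info

# Hypothetical toy data; "id" is unique per row and useless for prediction.
rows = [
    {"id": 1, "outlook": "sunny", "play": "no"},
    {"id": 2, "outlook": "sunny", "play": "no"},
    {"id": 3, "outlook": "overcast", "play": "yes"},
    {"id": 4, "outlook": "overcast", "play": "yes"},
    {"id": 5, "outlook": "rainy", "play": "yes"},
    {"id": 6, "outlook": "rainy", "play": "no"},
]

gain_id = information_gain(rows, "id", "play")
gain_outlook = information_gain(rows, "outlook", "play")
ratio_id = gain_ratio(rows, "id", "play")
ratio_outlook = gain_ratio(rows, "outlook", "play")
```

Plain information gain prefers the many-valued `id` column, because every one-row subset is trivially pure. Dividing by the split information penalizes such attributes, so the gain ratio ranks `outlook` above `id` on this data.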