
[Data Mining] Final Review: Chapter 3

2022-06-21 06:17:00 A delicious little pig

Chapter 3: Classification

1. Definition of classification

Classification learns a prediction function, called the classification model, from a data set and uses it to predict the class labels of unknown samples. For example: predicting whether an email is spam from its subject and content. Both classification and regression make predictions, but the output of classification is discrete or nominal, while the output of regression is a continuous value. For example, predicting whether a bank customer will churn is a classification task, while predicting a shopping mall's total turnover for the next year is a regression task.

2. Application areas of classification

Classification and regression methods are now widely used across industries. In finance, classifiers are used to predict the future direction of a stock. In medical diagnosis, they are used to predict the diagnosis of a disease. In marketing, historical sales data is used to predict whether certain goods will sell and in which regions an advertisement should be placed.

3. General steps for classification

(1) Split the data set into a training set and a test set.
(2) Learn from the training set to build a classification model (the model can be, for example, a decision tree or a set of classification rules).
(3) Use the classification model to classify the test set, then evaluate its classification accuracy and other performance measures.
(4) Use a model with high classification accuracy to classify future samples whose class labels are unknown.
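The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy data, the function names, and the very simple "single threshold rule" model are all invented here and are not taken from the course material.

```python
import random

def train_test_split(samples, test_ratio=0.3, seed=0):
    """Step 1: shuffle the labeled samples and split them into train and test sets."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def train_threshold_model(train):
    """Step 2: learn one classification rule, 'x >= t means class 1'.

    The threshold t is the midpoint between the two class means (an assumed,
    deliberately simple learning rule).
    """
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict(threshold, x):
    """Apply the learned rule to one feature value."""
    return 1 if x >= threshold else 0

def accuracy(threshold, data):
    """Step 3: fraction of samples whose predicted label equals the true label."""
    correct = sum(predict(threshold, x) == y for x, y in data)
    return correct / len(data)

# Hypothetical toy data: (feature value, class label).
data = [(x, 0) for x in range(0, 50, 5)] + [(x, 1) for x in range(55, 105, 5)]

train, test = train_test_split(data)
model = train_threshold_model(train)
test_accuracy = accuracy(model, test)

# Step 4: classify a new sample whose label is unknown.
new_label = predict(model, 80)
```

In practice the model in step 2 would be a decision tree or another classifier from this chapter; the threshold rule is used here only to keep the sketch short.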

4. Categories of classification algorithms

Classification methods:

  • Decision tree based classification
  • Bayesian classification
  • Nearest neighbor classification
  • Neural networks
  • Support vector machines, etc.

Regression methods:

  • Linear regression
  • Nonlinear regression
  • Logistic regression, etc.

5. Decision tree classification algorithm

Common decision tree algorithms include ID3, C4.5, and CART.

6. ID3 Decision tree

The ID3 classification algorithm uses information gain as its attribute selection criterion. The basic idea is as follows: first examine all attributes and choose the one with the maximum information gain as the decision tree node; create one branch for each value of that attribute; then recursively apply the same method to the subset of each branch to build the subtree below the node, until every subset contains only samples of the same class. The result is a decision tree that can be used to classify new samples.
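The recursive procedure described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the tree representation (a leaf is a label, a node is a tuple), the function names, and the toy data are assumptions made here, and the fallback to the majority class when no attributes remain is an added stopping rule that the description above does not mention. Entropy is computed with base-2 logarithms.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Information entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy before the split on `attr` minus the weighted entropy after it."""
    after = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        after += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - after

def id3(rows, labels, attrs):
    """Build a tree: a leaf is a class label, a node is (attr, {value: subtree})."""
    if len(set(labels)) == 1:      # all samples share one class: make a leaf
        return labels[0]
    if not attrs:                  # assumed fallback: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    branches = {}
    for value in set(row[best] for row in rows):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = id3([rows[i] for i in keep],
                              [labels[i] for i in keep], remaining)
    return (best, branches)

def classify(tree, row):
    """Follow the branches that match `row` until a leaf label is reached."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[row[attr]]
    return tree

# Hypothetical toy data set, invented for illustration.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "overcast", "windy": "no"},
    {"outlook": "overcast", "windy": "yes"},
]
labels = ["play", "play", "play", "stay", "play", "play"]

tree = id3(rows, labels, ["outlook", "windy"])
prediction = classify(tree, {"outlook": "rainy", "windy": "yes"})
```

On this toy data the root splits on "outlook", because that attribute has the larger information gain. Note that `classify` raises a `KeyError` for an attribute value that never appeared in training; a fuller implementation would need a policy for that case.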

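Definition of information entropy:

Entropy is computed from the class probabilities. For a data set D whose samples fall into m classes, with p_i the proportion of samples in class i:

$$H(D) = -\sum_{i=1}^{m} p_i \log_2 p_i$$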

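Definition of information gain:

The entropy before the split minus the entropy after the split. If attribute A partitions the data set D into subsets D_v, one for each value v of A, and H denotes the information entropy:

$$Gain(D, A) = H(D) - \sum_{v} \frac{|D_v|}{|D|} H(D_v)$$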
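As a worked illustration of "before minus after", assume a hypothetical data set of 8 samples (4 positive, 4 negative) that an attribute splits into two subsets of 4 samples each, both with a 3:1 class ratio. The numbers are invented for this example; H denotes the information entropy, with base-2 logarithms:

$$
\begin{aligned}
H_{\text{before}} &= -\tfrac{4}{8}\log_2\tfrac{4}{8} - \tfrac{4}{8}\log_2\tfrac{4}{8} = 1 \\
H_{\text{subset}} &= -\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} \approx 0.811 \\
H_{\text{after}} &= \tfrac{4}{8}(0.811) + \tfrac{4}{8}(0.811) \approx 0.811 \\
Gain &= H_{\text{before}} - H_{\text{after}} \approx 1 - 0.811 = 0.189
\end{aligned}
$$

A split that separated the two classes perfectly would give an entropy of 0 after the split and therefore the maximum possible gain of 1 for this data set.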

7. C4.5 Algorithm

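Characteristics:

  • Can handle both continuous and discrete attribute data
  • Uses the information gain ratio as the attribute selection criterion of the decision tree

Split information:

If attribute A partitions the data set D into subsets D_v, one for each value v of A:

$$SplitInfo(D, A) = -\sum_{v} \frac{|D_v|}{|D|} \log_2 \frac{|D_v|}{|D|}$$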

Information gain ratio:

The information gain of attribute A divided by its split information:

$$GainRatio(D, A) = \frac{Gain(D, A)}{SplitInfo(D, A)}$$
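One way to see why C4.5 divides the gain by the split information is to compare an attribute with a unique value per sample against a two-valued attribute that separates the classes equally well. The sketch below does this on invented data; the function names are hypothetical, and returning 0.0 when the split information is zero is an assumption made here to avoid division by zero.

```python
from collections import Counter
from math import log2

def entropy(items):
    """Entropy of the value distribution in `items` (base-2 logarithm)."""
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())

def gain_ratio(values, labels):
    """Information gain of the split on `values`, divided by its split information."""
    after = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        after += len(subset) / len(labels) * entropy(subset)
    gain = entropy(labels) - after
    # Split information is the entropy of the partition sizes themselves.
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0  # assumed guard

# Hypothetical data: both attributes separate the classes perfectly.
labels = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
sample_id = list(range(8))                          # unique value per sample
group = ["a", "a", "a", "a", "b", "b", "b", "b"]    # two-valued attribute

ratio_id = gain_ratio(sample_id, labels)
ratio_group = gain_ratio(group, labels)
```

Both attributes have the same information gain (1 bit), so information gain alone cannot distinguish them. The gain ratio is about 0.33 for `sample_id` (split information 3) and 1.0 for `group` (split information 1), so the attribute with fewer values is preferred.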

8. CART Algorithm

Gini index:

For a data set D whose samples fall into m classes, with p_i the proportion of samples in class i:

$$Gini(D) = 1 - \sum_{i=1}^{m} p_i^2$$
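A short sketch of the Gini computation follows. The function names and data are invented for illustration. The size-weighted Gini of a binary split is included because CART builds binary trees and compares candidate splits by this weighted value; it is shown here as the commonly used form, not as a formula from the notes above.

```python
from collections import Counter

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_split(left, right):
    """Size-weighted Gini index of a binary split into `left` and `right`."""
    total = len(left) + len(right)
    return len(left) / total * gini(left) + len(right) / total * gini(right)

# Hypothetical data: 4 "yes" and 4 "no" samples.
labels = ["yes"] * 4 + ["no"] * 4
g_all = gini(labels)

# A candidate split that leaves a 3:1 class ratio on each side.
g_split = gini_split(["yes", "yes", "yes", "no"], ["yes", "no", "no", "no"])
```

Here the Gini index drops from 0.5 before the split to 0.375 after it; a split that produced two pure subsets would reach 0, and lower values indicate purer subsets.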

Example:

[Figure: example problem (two images)]

Answer:

[Figure: answer]
