Hand-written perceptron, KNN, and decision tree (ID3) for binary classification of iris
2022-06-09 02:44:00 【Can you bug, please】
Hand-writing a perceptron, KNN, decision tree (ID3), Bayes, logistic regression, and SVM for binary classification of the iris dataset.
At the end of this course we need to apply several binary classification methods to the same dataset and compare them. The articles I found online were not very relevant, so here I record my own comparison of different methods on one dataset.
We use the iris dataset, which ships with the sklearn library: 150 rows in total, split into three classes. In this experiment we use only the first 100 rows, which cover exactly two classes.
There are different ways to preprocess the data, depending on the situation; I will put the dataset at the end. Disclaimer: the code is based on others' work, to which I made the necessary changes; the links are below.
1. Perceptron
I will not go over the concept of the perceptron here.
The main idea is to check for misclassified points: a point is misclassified when y(wx+b) <= 0. While any misclassified point remains, take a gradient step; repeat until everything is classified correctly. Note that this requires the dataset to be linearly separable; if it is not, the weights keep updating forever and the loop never terminates. Of course, the training code could be changed to tolerate a fixed number of misclassified points, bounding the loss. This iris subset separates very well, so the result usually comes quickly. One more thing: the test-set accuracy would otherwise always be 1, so I added some noise to the data to make the comparison meaningful.
The code explanation and the data preprocessing are all in the comments.
# A perceptron is a binary linear classification model: the input is a sample's
# feature vector, the output is a class label in {+1, -1}. The goal of the
# perceptron is to find a hyperplane that linearly separates the training data.
from sklearn import metrics
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['label'] = iris.target  # put the label in the last column

class Model():
    def __init__(self):
        # data has 5 columns; the last one is the label, so there are 4 weights to train
        self.w = np.ones(len(data[0]) - 1, dtype=np.float32)
        self.b = 0
        self.l_rate = 0.05

    def sign(self, x, w, b):
        y = np.dot(x, w) + b
        return y

    # Fit with stochastic gradient descent
    def fit(self, X_train, y_train):
        is_wrong = False
        while not is_wrong:
            wrong_count = 0
            for d in range(len(X_train)):
                X = X_train[d]
                y = y_train[d]
                # On a misclassified point, take a gradient step
                if y * self.sign(X, self.w, self.b) <= 0:
                    self.w = self.w + self.l_rate * np.dot(y, X)
                    self.b = self.b + self.l_rate * y
                    wrong_count += 1
            if wrong_count == 0:  # everything classified correctly: stop fitting
                is_wrong = True
        return 'Perceptron Model'

    def score(self):
        pass

# Run the perceptron
# Use the first 100 rows (two classes); the last column is the label
data = np.array(df.iloc[:100])
# Hand-made noisy samples, some deliberately mislabeled
dataDiyList = [[5, 2.8, 1.9, 0.3, 0], [6.2, 2.8, 1.9, 0.3, 1], [5.8, 2.3, 3.4, 0.3, 1], [5, 2.8, 1.9, 0.3, 0],
               [7.0, 2.0, 3.4, 0.2, 1], [5.0, 3.4, 1.5, 0.8, 0], [4.78, 3.1, 1.7, 0.9, 1], [5.6, 2.9, 3.4, .99, 0]]
dataDiyArray = np.array(dataDiyList)
data = np.vstack((data, dataDiyArray))
X = data[:, :-1]  # features
y = data[:, -1]   # labels
y = np.array([1 if i == 1 else -1 for i in y])  # raw labels are 0/1; map class 0 to -1
for i in range(1, 10):
    print("Round %d ---------------\n" % i)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.50, random_state=10, shuffle=True)  # note: a fixed random_state gives the same split every round; vary the seed for different data
    # fit
    perceptron = Model()
    perceptron.fit(X_train, y_train)
    y_predict = perceptron.sign(X_test, perceptron.w, perceptron.b)
    y_predict = np.array([1 if v > 0 else -1 for v in y_predict])  # threshold the raw scores into +1/-1 labels
    print(y_predict)
    M = metrics.confusion_matrix(y_predict, y_test)
    print('Confusion matrix:\n', M)
    # Per-class precision, recall, F1 (rows of M are the predicted class, columns the actual class)
    n = len(M)
    for j in range(n):
        rowsum, colsum = sum(M[j]), sum(M[r][j] for r in range(n))
        precision = M[j][j] / float(rowsum)  # row sum = everything predicted as class j
        recall = M[j][j] / float(colsum)     # column sum = everything actually of class j
        F1 = precision * recall * 2 / (precision + recall)
        print('class %d precision: %s' % (j, precision), 'class %d recall: %s' % (j, recall), 'class %d F1: %s' % (j, F1))
    print("Model accuracy: {:.5f}".format(np.mean(y_predict == y_test)))
    print('Weights:', perceptron.w[0], perceptron.w[1], perceptron.w[2], perceptron.w[3])
    print('Bias:', perceptron.b)
Result: even with the extra noise and only half the data used for training, the results are still very good, mainly because this dataset is easy to classify.
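As a quick cross-check (my addition, not part of the original post), the same two-class subset could be fed to scikit-learn's built-in Perceptron; the split and `eta0` below are chosen to mirror the hand-written version, and the noisy hand-made samples are left out.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data[:100], iris.target[:100]  # first two classes only
y = np.where(y == 1, 1, -1)                # same +1/-1 labels as the hand-written version

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1, shuffle=True)

clf = Perceptron(eta0=0.05, max_iter=1000)  # eta0 plays the role of l_rate above
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print("sklearn Perceptron accuracy: %.5f" % acc)
```

Without the injected noise the two classes are linearly separable, so the accuracy should be at or near 1.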
2. KNN
The idea of KNN is very simple, and no training is needed: given a query point, compute its distance to every stored point, then let the K nearest points vote. Whichever class holds the majority among those K neighbors is the class assigned to the query point. Note that the features should be normalized first.
Code:
# -*- coding:utf-8 -*-
import operator as opt
from sklearn import metrics
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['label'] = iris.target  # put the label in the last column

# Min-max normalization, to remove the influence of different feature ranges
def normData(dataset):
    maxVals = dataset.max(axis=0)  # column maxima
    minVals = dataset.min(axis=0)  # column minima
    ranges = maxVals - minVals
    retData = (dataset - minVals) / ranges
    return retData, minVals, ranges

# KNN with Euclidean distance
def kNN(dataset, labels, testdata, k):
    distSquareMat = (dataset - testdata) ** 2  # squared differences
    distSquareSums = distSquareMat.sum(axis=1)  # sum of squared differences per row
    distances = distSquareSums ** 0.5  # square root: distance from each sample to the test point
    sortedIndices = distances.argsort()  # indices that would sort the distances ascending
    indices = sortedIndices[:k]  # indices of the k nearest samples
    labelCount = {}  # number of occurrences of each label
    for i in indices:
        label = labels[i]
        labelCount[label] = labelCount.get(label, 0) + 1  # dict.get(k, val) returns val when k is absent
    # operator.itemgetter() combined with sorted() sorts by the chosen field
    sortedCount = sorted(labelCount.items(), key=opt.itemgetter(1), reverse=True)
    return sortedCount[0][0]  # return the most common label

# Main
if __name__ == "__main__":
    data = np.array(df.iloc[:100])
    # Hand-made noisy samples
    dataDiyList = [[5, 2.8, 1.9, 0.3, 1], [6.2, 2.8, 1.9, 0.3, 0], [5.8, 2.3, 3.4, 0.3, 0], [5, 2.8, 1.9, 0.3, 1],
                   [7.0, 2.0, 3.4, 0.2, 0], [5.0, 3.4, 1.5, 0.8, 0], [4.78, 3.1, 1.7, 0.9, 0], [5.6, 2.9, 3.4, .99, 0]]
    dataDiyArray = np.array(dataDiyList)
    data = np.vstack((data, dataDiyArray))
    X = data[:, :-1]  # features (note: normData is defined above but not applied in this run)
    y = data[:, -1]  # labels
    for i in range(1, 10):
        print("Round %d ---------------\n" % i)
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.50, random_state=i, shuffle=True)  # vary random_state so each round gets a different split
        y_pred = []
        for j in range(len(X_test)):
            onedata = X_test[j]
            result = kNN(X_train, y_train, onedata, 2)
            y_pred.append(result)
        print(y_pred)
        M = metrics.confusion_matrix(y_pred, y_test)
        print('Confusion matrix:\n', M)
        # Rows of M are the predicted class, columns the actual class
        n = len(M)
        for j in range(n):
            rowsum, colsum = sum(M[j]), sum(M[r][j] for r in range(n))
            precision = M[j][j] / float(rowsum)  # row sum = everything predicted as class j
            recall = M[j][j] / float(colsum)     # column sum = everything actually of class j
            F1 = precision * recall * 2 / (precision + recall)
            print('class %d precision: %s' % (j, precision), 'class %d recall: %s' % (j, recall), 'class %d F1: %s' % (j, F1))
        print("Model accuracy: {:.5f}".format(np.mean(np.array(y_pred) == y_test)))
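Note that `normData` above is defined but never called in the main loop. A minimal sketch (my addition) of how it could be wired in, using made-up toy features: fit the ranges on the training set only, then reuse those ranges on the test set, so no information leaks from test to train.

```python
import numpy as np

def normData(dataset):
    # min-max normalization, same as in the KNN code above
    maxVals = dataset.max(axis=0)
    minVals = dataset.min(axis=0)
    ranges = maxVals - minVals
    retData = (dataset - minVals) / ranges
    return retData, minVals, ranges

# toy features standing in for X_train / X_test (values are made up)
X_train = np.array([[5.1, 3.5], [6.2, 2.9], [4.8, 3.1]])
X_test = np.array([[5.5, 3.0]])

# fit the ranges on the training set only, then apply them to the test set
X_train_norm, minVals, ranges = normData(X_train)
X_test_norm = (X_test - minVals) / ranges

print(X_train_norm.min(axis=0))  # each training column now starts at 0
print(X_train_norm.max(axis=0))  # and ends at 1
```

The normalized training features would then be passed to `kNN` in place of the raw ones, with each test vector scaled the same way.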
3. Decision tree
The key point of a decision tree is that, at each split, it chooses the feature with the largest information gain. The most basic algorithm is ID3; its derivatives are C4.5 and CART, plus pruning. Here we implement ID3. The code handles discrete features (handsome or not, rich or not, melon sweet or not) and continuous ones (1, 2, 3, 4) separately, since the two types need different treatment. It also includes functions to save the generated tree and to plot it. I found this code online; I honestly could not write it by hand myself, so I just studied what the authors wrote. Respect to them.
Code:
# https://blog.csdn.net/kai123wen/article/details/105769970
# https://github.com/login?return_to=%2Fjiajia0%2FMachineLearningTopic
import collections
import random
from math import log
import pandas as pd
import operator
import treePlotter  # treePlotter.py is listed below; originally imported from a local package path

def calcShannonEnt(dataSet):
    """Compute the information entropy (Shannon entropy) of a dataset."""
    # Total number of samples
    numEntries = len(dataSet)
    # Count each label
    labelCounts = collections.defaultdict(int)
    # Walk the dataset and collect the class label of each row
    for featVec in dataSet:
        # The label is the last column
        currentLabel = featVec[-1]
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        # Proportion of the current label among all samples
        prob = float(labelCounts[key]) / numEntries
        # Logarithm base 2
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt
def splitDataSetForSeries(dataSet, axis, value):
    """Split the dataset by a continuous value into two parts: not greater
    than, and greater than. axis is the feature index, value the split point."""
    # Samples not greater than the split value
    eltDataSet = []
    # Samples greater than the split value
    gtDataSet = []
    # Split, keeping the feature column
    for feat in dataSet:
        if feat[axis] <= value:
            eltDataSet.append(feat)
        else:
            gtDataSet.append(feat)
    return eltDataSet, gtDataSet

def splitDataSet(dataSet, axis, value):
    """Split the dataset by a discrete feature: keep only the rows whose
    feature at index axis equals value, and drop that feature column."""
    # New list, so the original is not modified
    retDataSet = []
    for featVec in dataSet:
        # Row matches the requested feature value
        if featVec[axis] == value:
            # Keep the columns before the feature...
            reducedFeatVec = featVec[:axis]
            # ...and after it, so the chosen feature is removed
            reducedFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
def calcInfoGainForSeries(dataSet, i, baseEntropy):
    """Compute the information gain of a continuous feature. i is the feature
    index, baseEntropy is H(D). Returns the best gain and its split point."""
    # Best information gain so far
    maxInfoGain = 0.0
    # Best split point
    bestMid = -1
    # All values of this feature in the dataset
    featList = [example[i] for example in dataSet]
    # All class labels
    classList = [example[-1] for example in dataSet]
    dictList = dict(zip(featList, classList))
    # Sort the continuous values ascending
    sortedFeatList = sorted(dictList.items(), key=operator.itemgetter(0))
    # Number of distinct continuous values
    numberForFeatList = len(sortedFeatList)
    # Candidate split points: midpoints of adjacent values, rounded to 3 decimals
    midFeatList = [round((sortedFeatList[j][0] + sortedFeatList[j + 1][0]) / 2.0, 3) for j in
                   range(numberForFeatList - 1)]
    # Compute the information gain at each candidate split point
    for mid in midFeatList:
        # Split into "not greater than" and "greater than" the current point
        eltDataSet, gtDataSet = splitDataSetForSeries(dataSet, i, mid)
        # Conditional entropy: weighted sum of the entropies of the two parts
        newEntropy = len(eltDataSet) / len(dataSet) * calcShannonEnt(eltDataSet) + len(gtDataSet) / len(
            dataSet) * calcShannonEnt(gtDataSet)
        # Information gain
        infoGain = baseEntropy - newEntropy
        # print('split value: ' + str(mid) + ', information gain: ' + str(infoGain))
        if infoGain > maxInfoGain:
            bestMid = mid
            maxInfoGain = infoGain
    return maxInfoGain, bestMid

def calcInfoGain(dataSet, featList, i, baseEntropy):
    """Compute the information gain of a discrete feature. featList holds the
    feature's values, i is its index, baseEntropy is H(D)."""
    # The distinct values this feature takes
    uniqueVals = set(featList)
    # Conditional entropy of the dataset given this feature
    newEntropy = 0.0
    for value in uniqueVals:
        # The subset whose feature at index i equals the current value
        subDataSet = splitDataSet(dataSet=dataSet, axis=i, value=value)
        # Weight of the subset
        prob = len(subDataSet) / float(len(dataSet))
        # Weighted entropy of the subset
        newEntropy += prob * calcShannonEnt(subDataSet)
    # Information gain
    infoGain = baseEntropy - newEntropy
    return infoGain
def chooseBestFeatureToSplit(dataSet, labels):
    """Choose the best feature to split on, by information gain.
    Handles both continuous and discrete features."""
    # Number of features (the last column is the label)
    numFeatures = len(dataSet[0]) - 1
    # Base entropy H(D)
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    # Whether the current best feature is continuous
    flagSeries = 0
    # Split point, if the best feature is continuous
    bestSeriesMid = 0.0
    # Compute the information gain of each feature
    for i in range(numFeatures):
        # Column i of the dataset
        featList = [example[i] for example in dataSet]
        # Discrete (string) and continuous features are handled differently
        if isinstance(featList[0], str):
            infoGain = calcInfoGain(dataSet, featList, i, baseEntropy)
        else:
            # print('current split attribute: ' + str(labels[i]))
            infoGain, bestMid = calcInfoGainForSeries(dataSet, i, baseEntropy)
        # print('feature: ' + labels[i] + ', information gain: ' + str(infoGain))
        # Keep the feature with the largest gain so far
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
            flagSeries = 0
            if not isinstance(dataSet[0][bestFeature], str):
                flagSeries = 1
                bestSeriesMid = bestMid
    # print('feature with the largest information gain: ' + labels[bestFeature])
    if flagSeries:
        return bestFeature, bestSeriesMid
    else:
        return bestFeature

def getDataSet(test_size):
    """Load the dataset (continuous features) and split it into train and test.
    Note: iris.data here must contain only the first two classes (100 rows),
    since the evaluation below only knows Iris-setosa and Iris-versicolor."""
    dataSet = pd.read_csv("iris.data").values.tolist()
    random.shuffle(dataSet)
    train_dataset = dataSet[:int(len(dataSet) * (1 - test_size))]  # training data
    test_dataset = dataSet[int(len(dataSet) * (1 - test_size)):]  # test data
    # Feature names
    labels = ['sepal length', 'sepal width', 'petal length', 'petal width']
    return train_dataset, test_dataset, labels

def majorityCnt(classList):
    """Return the most frequent class label."""
    # Vote counter, defaults to 0
    classCount = collections.defaultdict(int)
    for vote in classList:
        classCount[vote] += 1
    # Sort descending by count
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    # The most frequent label
    return sortedClassCount[0][0]
def createTree(dataSet, labels):
    """Build the decision tree recursively from dataSet, with feature names in labels."""
    # Class labels of all rows
    classList = [example[-1] for example in dataSet]
    # If every row has the same label, stop splitting
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # If only the label column remains, all features are used up: vote
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    # Choose the best feature to split on
    bestFeat = chooseBestFeatureToSplit(dataSet=dataSet, labels=labels)
    # Name of the best feature
    bestFeatLabel = ''
    # Whether the chosen feature is continuous (1) or discrete (0)
    flagSeries = 0
    # Split point, if continuous
    midSeries = 0.0
    # A tuple means the chosen feature is continuous
    if isinstance(bestFeat, tuple):
        # The node label records the feature and its split point
        bestFeatLabel = str(labels[bestFeat[0]]) + '=' + str(bestFeat[1])
        midSeries = bestFeat[1]
        bestFeat = bestFeat[0]
        flagSeries = 1
    else:  # discrete feature
        bestFeatLabel = labels[bestFeat]
        flagSeries = 0
    # The tree is a nested dict; each fork is named after the splitting feature
    myTree = {bestFeatLabel: {}}
    # All values the chosen feature takes
    featValues = [example[bestFeat] for example in dataSet]
    # Continuous feature
    if flagSeries:
        # Split into "not greater than" and "greater than" the split point
        eltDataSet, gtDataSet = splitDataSetForSeries(dataSet, bestFeat, midSeries)
        # Remaining feature names
        subLabels = labels[:]
        # Recurse on the "not greater than" side
        subTree = createTree(eltDataSet, subLabels)
        myTree[bestFeatLabel]['less than'] = subTree
        # Recurse on the "greater than" side
        subTree = createTree(gtDataSet, subLabels)
        myTree[bestFeatLabel]['greater than'] = subTree
        return myTree
    # Discrete feature
    else:
        # Remove the chosen feature's name from the list
        del (labels[bestFeat])
        # Distinct feature values
        uniqueVals = set(featValues)
        for value in uniqueVals:
            # Remaining feature names
            subLabels = labels[:]
            # Recurse on the subset with this feature value (the feature itself is removed)
            subTree = createTree(splitDataSet(dataSet=dataSet, axis=bestFeat, value=value), subLabels)
            # Hang the subtree under this fork
            myTree[bestFeatLabel][value] = subTree
        return myTree

# Classify one test vector with the tree (continuous splits only)
def classify(inputTree, featLables, testVec):
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]  # the branches of this node
    # Position of this node's feature in featLables (node labels look like 'petal length=2.45')
    featIndex = featLables.index(firstStr[:firstStr.index('=')])
    if testVec[featIndex] > float(firstStr[firstStr.index('=') + 1:]):
        branch = secondDict['greater than']
    else:
        branch = secondDict['less than']
    # A dict branch is a subtree: recurse; anything else is a leaf label
    if isinstance(branch, dict):
        return classify(branch, featLables, testVec)
    return branch

# Save the decision tree to disk
def storeTree(inputTree, filename):
    import pickle
    fw = open(filename, 'wb')
    pickle.dump(inputTree, fw)
    fw.close()

# Load the decision tree from disk
def grabTree(filename):
    import pickle
    fr = open(filename, 'rb')
    return pickle.load(fr)

def calConfuMatrix(myTree, label, test_dataSet):
    """Build the confusion matrix (rows: actual, columns: predicted) on the test set."""
    matrix = {'Iris-setosa': {'Iris-setosa': 0, 'Iris-versicolor': 0},
              'Iris-versicolor': {'Iris-setosa': 0, 'Iris-versicolor': 0}}
    for example in test_dataSet:
        predict = classify(myTree, label, example)  # predicted class
        actual = example[-1]  # actual class
        matrix[actual][predict] += 1  # fill the confusion matrix
    return matrix
# Precision per class
def precision(classes, matrix):
    precisionDict = {'Iris-setosa': 0, 'Iris-versicolor': 0}
    for classItem in classes:
        true_predict_num = matrix[classItem][classItem]  # correctly predicted
        all_predict_num = 0  # everything predicted as this class
        for temp_class_item in classes:
            all_predict_num += matrix[temp_class_item][classItem]
        precisionDict[classItem] = round(true_predict_num / all_predict_num, 5)
    return precisionDict

# Overall accuracy
def score(classes, matrix):
    all_num = 0
    all_right = 0
    for classItem in classes:
        all_right += matrix[classItem][classItem]  # correctly predicted
        for temp_class_item in classes:
            all_num += matrix[temp_class_item][classItem]
    return round(all_right / all_num, 5)

# Recall per class
def recall(classes, matrix):
    recallDict = {'Iris-setosa': 0, 'Iris-versicolor': 0}
    for classItem in classes:
        true_predict_num = matrix[classItem][classItem]  # correctly predicted
        all_actual_num = 0  # everything that actually is this class
        for temp_class_item in classes:
            all_actual_num += matrix[classItem][temp_class_item]
        recallDict[classItem] = round(true_predict_num / all_actual_num, 5)
    return recallDict

# Print the results
def showResult(precisionValue, recallValue, classes):
    print('\t\t\t\t', 'precision', '\t', 'recall', '\t', 'F1')
    for classItem in classes:
        f1 = round((2 * precisionValue[classItem] * recallValue[classItem])
                   / (precisionValue[classItem] + recallValue[classItem]), 5)
        print(classItem, '\t', precisionValue[classItem], '\t', recallValue[classItem], '\t', f1)

if __name__ == '__main__':
    # Decision tree over continuous features
    train_dataSet, test_dataSet, labels = getDataSet(0.5)
    temp_label = labels[:]  # copy, since createTree mutates labels for discrete features
    myTree = createTree(train_dataSet, labels)
    storeTree(myTree, 'tree.txt')
    myTree = grabTree('tree.txt')
    print(myTree)
    treePlotter.createPlot(myTree)
    matrix = calConfuMatrix(myTree, temp_label, test_dataSet)  # confusion matrix
    classes = ['Iris-setosa', 'Iris-versicolor']  # all classes
    precisionValue = precision(classes, matrix)
    recallValue = recall(classes, matrix)
    acc = score(classes, matrix)  # renamed so the score() function is not shadowed
    showResult(precisionValue, recallValue, classes)
    print("Accuracy: %f" % acc)
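For comparison (my addition, not from the original post), the hand-written ID3 tree can be checked against scikit-learn's DecisionTreeClassifier with the entropy criterion, which also chooses splits by information gain, on the same two-class iris subset loaded from sklearn rather than from iris.data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:100], iris.target[:100]  # setosa vs versicolor only

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1, shuffle=True)

# criterion='entropy' makes the splits information-gain based, like ID3
clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print('sklearn decision tree accuracy: %.5f' % acc)
```

Since one petal-length threshold already separates these two classes, the accuracy should come out at or near 1.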
treePlotter.py
# @Time : 2017/12/18 19:46
# @Author : Leafage
# @File : treePlotter.py
# @Software: PyCharm
import matplotlib.pylab as plt
import matplotlib
# Allow Chinese characters to display in the plot
matplotlib.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['font.serif'] = ['SimHei']
# Decision (internal) node style
decisionNode = dict(boxstyle="sawtooth", fc="0.8")
# Leaf node style
leafNode = dict(boxstyle="round4", fc="0.8")
# Arrow style
arrow_args = dict(arrowstyle="<-")

def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    """Draw one node. nodeTxt is the node's text, centerPt its position,
    parentPt the parent's position, nodeType decision node or leaf node."""
    createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va="center", ha="center", bbox=nodeType, arrowprops=arrow_args)
def getNumLeafs(myTree):
    """Count the leaf nodes of the tree."""
    numLeafs = 0
    # The first key is the root node
    firstStr = list(myTree.keys())[0]
    # Its value holds the branches
    secondDict = myTree[firstStr]
    # Recurse into the branches
    for key in secondDict.keys():
        # A dict value is a subtree: recurse
        if type(secondDict[key]).__name__ == 'dict':
            numLeafs += getNumLeafs(secondDict[key])
        # Otherwise it is a leaf
        else:
            numLeafs += 1
    return numLeafs

def getTreeDepth(myTree):
    """Compute the depth of the tree."""
    # Deepest level seen so far
    maxDepth = 0
    # Root node
    firstStr = list(myTree.keys())[0]
    # Its branches
    secondDic = myTree[firstStr]
    for key in secondDic.keys():
        # A dict value is a subtree: recurse
        if type(secondDic[key]).__name__ == 'dict':
            # Child depth plus one
            thisDepth = 1 + getTreeDepth(secondDic[key])
        # Otherwise it is a leaf
        else:
            thisDepth = 1
        # Keep the maximum
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth

def plotMidText(cntrPt, parentPt, txtString):
    """Write txtString halfway between the child (cntrPt) and parent (parentPt)."""
    # Midpoint on the x axis
    xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
    # Midpoint on the y axis
    yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString)
def plotTree(myTree, parentPt, nodeTxt):
    """Draw all nodes of the tree recursively. parentPt is the parent's
    position, nodeTxt the text on the edge from the parent."""
    # Number of leaves under this subtree
    numLeafs = getNumLeafs(myTree=myTree)
    # Depth of this subtree
    depth = getTreeDepth(myTree=myTree)
    # Root node label
    firstStr = list(myTree.keys())[0]
    # x position: current x offset plus half the width this subtree spans
    # (first call: initial offset -1/(2W) plus (1+W)/(2W) gives 1/2); y is the current y offset
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW, plotTree.yOff)
    # Text on the edge to the parent
    plotMidText(cntrPt, parentPt, nodeTxt)
    # Draw the node itself
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    # Branches of this node
    secondDict = myTree[firstStr]
    # Move the y offset one level down (1/D) for the children
    plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD
    for key in secondDict.keys():
        # A dict is a subtree: recurse
        if isinstance(secondDict[key], dict):
            plotTree(secondDict[key], cntrPt, str(key))
        else:
            # Next leaf: move the x offset one slot (1/W) to the right
            plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalW
            # Uncomment to watch the leaf coordinates
            # print((plotTree.xOff, plotTree.yOff), secondDict[key])
            # Draw the leaf
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            # Text on the edge between leaf and parent
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    # Before returning, move the y offset one level (1/D) back up
    plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD
def createPlot(inTree):
    """Draw the decision tree given as a nested dict."""
    # New figure
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)
    # Total width of the tree (number of leaves)
    plotTree.totalW = float(getNumLeafs(inTree))
    # Total depth of the tree
    plotTree.totalD = float(getTreeDepth(inTree))
    # Initial x offset -1/(2W); each leaf moves it right by 1/W, so the leaves
    # land at 1/(2W), 3/(2W), 5/(2W), ..., (2W-1)/(2W)
    plotTree.xOff = -0.5 / plotTree.totalW
    # Initial y offset; each level moves it down or up by 1/D
    plotTree.yOff = 1.0
    # Draw recursively from the root
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()

if __name__ == '__main__':
    testTree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}, 3: 'maybe'}}
    createPlot(testTree)
Result: the remaining cases you can test on your own.
I also planned SVM, logistic regression, and Bayesian binary classification on this dataset, but I can only do a little at a time; updates will follow ~
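In the meantime, a hedged sketch (my addition, not the promised hand-written implementations) of how those remaining methods could be run on the same two-class iris subset with scikit-learn, for a quick side-by-side comparison:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
X, y = iris.data[:100], iris.target[:100]  # first two classes only

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1, shuffle=True)

# one sklearn stand-in per remaining method
models = {
    'SVM': SVC(kernel='linear'),
    'Logistic regression': LogisticRegression(max_iter=1000),
    'Naive Bayes': GaussianNB(),
}
scores = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)
    print('%s accuracy: %.5f' % (name, scores[name]))
```

On this easily separable subset all three should score at or near 1, which is why the noise injection used earlier matters for a meaningful comparison.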
The dataset can be produced through the sklearn library; if you need it as a txt file, send me a private message and I will share it for free.
The code comes from:
https://blog.csdn.net/kai123wen/article/details/105769970
https://github.com/login?return_to=%2Fjiajia0%2FMachineLearningTopic
I forgot the source of the KNN code.
I only built on these to understand them; from there you can change the parameters and code into whatever you need.
Good luck!!!