
Handwritten perceptron, KNN, and decision tree (ID3) for binary classification of the iris dataset

2022-06-09 02:44:00 Can you bug, please

Handwritten perceptron, KNN, decision tree (ID3), Bayes, logistic regression, and SVM for binary classification of the iris dataset

At the end of this course we need to apply several different binary classification methods to the same data set and compare them. Most of the articles I found online were not quite what I needed, so here I record how the different methods classify the same data and how they compare.
We use the iris data set, which ships with the sklearn library: 150 rows in total, divided into three classes. In this experiment we only use the first 100 rows, i.e. just two classes.
The data can be processed in different ways, depending on the situation; I will put the data set at the end. Disclaimer: the code comes from the articles linked at the bottom, and I only made the necessary changes on top of it.

1. Perceptron

I won't go over the concept of the perceptron here.
The main idea is to look for misclassified points: a point is misclassified when y(wx+b) <= 0. As long as misclassified points exist, we take a gradient step, until every point is classified correctly. Note that the data set must be linearly separable; otherwise the weights keep updating forever and the loop never ends. Of course, the code could also allow a fixed number of misclassified points in the trained model, so as to keep the loss as small as possible. The iris data set is very well behaved, so the result usually comes out quickly. One more thing: the accuracy on the test set would otherwise simply be 1, so I added some noise rows in the code to make the comparison more meaningful.
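For reference, this is the update the fit loop below applies to a misclassified point (x, y), i.e. one with y(w·x + b) ≤ 0, using learning rate η (l_rate in the code):

w ← w + η · y · x
b ← b + η · y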
The code explanation and the data processing are all in the code below.

#  The perceptron is a binary linear classification model: the input is the feature vector of an instance,
#  the output is a category, taking the values +1 and -1. The goal of the perceptron is to find a hyperplane that linearly separates the training data.
from sklearn import metrics
from sklearn.model_selection import train_test_split

import pandas as pd
import numpy as np
from sklearn.datasets import load_iris


iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['label'] = iris.target   # Put the label at the end 

class Model():
    def __init__(self):
        #  data has 5 columns: 4 features plus the label in the last column, so we need len(data[0]) - 1 = 4 weights
        self.w = np.ones(len(data[0]) - 1, dtype=np.float32)
        self.b = 0
        self.l_rate = 0.05

    def sign(self, x, w, b):
        #  returns the raw value w·x + b; the caller takes its sign to get the class
        y = np.dot(x, w) + b
        return y

    #  Fit using stochastic gradient descent
    def fit(self, X_train, y_train):
        is_wrong = False
        while not is_wrong:
            wrong_count = 0
            for d in range(len(X_train)):
                X = X_train[d]
                y = y_train[d]
                #  If the point is misclassified, update the weights and bias
                if y * self.sign(X, self.w, self.b) <= 0:
                    self.w = self.w + self.l_rate * np.dot(y, X)
                    self.b = self.b + self.l_rate * y
                    wrong_count += 1

            if wrong_count == 0:  #  all points classified correctly, stop fitting
                is_wrong = True
        return 'Perceptron Model'

    def score(self):
        pass
#  Run the perceptron
#  Take the first 100 rows (two classes); the last column is the label
data = np.array(df.iloc[:100])
#  A few hand-made noise rows so that the test accuracy is not trivially 1
dataDiyList = [[5,2.8,1.9,0.3,0],[6.2,2.8,1.9,0.3,1],[5.8,2.3,3.4,0.3,1],[5,2.8,1.9,0.3,0],[7.0,2.0,3.4,0.2,1],
               [5.0,3.4,1.5,0.8,0],[4.78,3.1,1.7,0.9,1],[5.6,2.9,3.4,.99,0]]
dataDiyArray = np.array(dataDiyList)
data = np.row_stack((data,dataDiyArray))

X = data[:, :-1]  #  Take out the features 
y = data[:, -1]  #  Take out the label 
y = np.array([1 if i == 1 else -1 for i in y])  # the original labels are 0 and 1; map class 0 to -1

for i in range(1,10):
    print("Round %d ---------------\n"%i)
    X_train, X_test, y_train, y_test = train_test_split( \
        X, y, test_size=0.50, random_state=10, shuffle=True)  # note: random_state is fixed here, so every round sees the same split; the KNN code below uses random_state=i to vary it
    #  fitting 
    perceptron = Model()
    perceptron.fit(X_train, y_train)
    y_predict = perceptron.sign(X_test, perceptron.w, perceptron.b)
    y_predict = np.array([1 if i > 0 else -1 for i in y_predict])  # map the raw output to a class: positive -> +1, otherwise -1
    print(y_predict)
    M = metrics.confusion_matrix(y_predict, y_test)
    print('Confusion matrix:\n', M)
    #  Per-class precision, recall and F1
    n = len(M)
    for c in range(n):
        rowsum, colsum = sum(M[c]), sum(M[r][c] for r in range(n))
        precision = M[c][c] / float(colsum)
        recall = M[c][c] / float(rowsum)
        F1 = precision * recall * 2 / (precision + recall)
        print('class %d precision: %s' % (c, precision), 'class %d recall: %s' % (c, recall), 'class %d F1: %s' % (c, F1))
    print("Model accuracy: {:.5f}".format(np.mean(y_predict == y_test)))
    print('Weights:', perceptron.w[0], perceptron.w[1], perceptron.w[2], perceptron.w[3])
    print('Bias:', perceptron.b)

Result: the code itself is commented in detail. The takeaway is that even with the added noise and with only half the data used for training, the results are still very good, mainly because this data set is easy to separate.
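As a quick cross-check (my own addition, not part of the original post), sklearn's built-in Perceptron can be fitted on the same train/test split used above; the exact numbers depend on the noise rows and the split, so treat this as a sketch rather than a reference result.

from sklearn.linear_model import Perceptron

# Sanity check with the library implementation on the same train/test split
clf = Perceptron(max_iter=1000, tol=1e-3)
clf.fit(X_train, y_train)
print("sklearn Perceptron accuracy: {:.5f}".format(clf.score(X_test, y_test)))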

2. KNN

The idea of KNN is very simple and needs no training: given a query point, compute its distance to every training point, take the K nearest ones and let them vote. Whichever class gets the most votes among those K neighbours is the predicted class. One thing to pay attention to is normalizing the data; a min-max helper (normData) is defined below, and a sketch of how to wire it in follows the code.
Code:

# -*- coding:utf-8 -*-
import operator as opt

import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['label'] = iris.target   # Put the label in the last column


# To remove the effect of different feature ranges, this min-max normalizes the data set
def normData(dataset):
    maxVals = dataset.max(axis=0) # Find the maximum value of the column 
    minVals = dataset.min(axis=0) # Find the minimum value of the column 
    ranges = maxVals - minVals
    retData = (dataset - minVals) / ranges
    return retData, minVals, ranges

# KNN with Euclidean distance
def kNN(dataset, labels, testdata, k):
    distSquareMat = (dataset - testdata) ** 2  # squared differences per feature
    distSquareSums = distSquareMat.sum(axis=1)  # sum of squared differences per row
    distances = distSquareSums ** 0.5  # square root: distance from each sample to the test point
    sortedIndices = distances.argsort()  # argsort gives the indices that would sort the distances ascending
    indices = sortedIndices[:k]  # indices of the k smallest distances
    labelCount = {}  # number of occurrences of each label among the neighbours
    for i in indices:  # iterate over the k nearest neighbours
        label = labels[i]
        labelCount[label] = labelCount.get(label, 0) + 1  # dict.get(key, default) returns default if key is absent

    sortedCount = sorted(labelCount.items(), key=opt.itemgetter(1), reverse=True)  # sort by count, descending
    return sortedCount[0][0]  # return the most common label

# Main program
if __name__ == "__main__":
    data = np.array(df.iloc[:100])   # first 100 rows, i.e. two classes
    #  A few hand-made noise rows
    dataDiyList = [[5, 2.8, 1.9, 0.3, 1], [6.2, 2.8, 1.9, 0.3, 0], [5.8, 2.3, 3.4, 0.3, 0], [5, 2.8, 1.9, 0.3, 1],[7.0, 2.0, 3.4, 0.2, 0],
                   [5.0, 3.4, 1.5, 0.8, 0], [4.78, 3.1, 1.7, 0.9, 0], [5.6, 2.9, 3.4, .99, 0]]
    dataDiyArray = np.array(dataDiyList)
    data = np.row_stack((data, dataDiyArray))
    X = data[:, :-1]  #  features
    y = data[:, -1]  #  labels
    for i in range(1, 10):
        print("Round %d ---------------\n" % i)
        X_train, X_test, y_train, y_test = train_test_split( \
            X, y, test_size=0.50, random_state=i, shuffle=True)  # a different random_state each round gives a different split
        y_pred = []

        for j in range(len(X_test)):
            onedata = X_test[j]
            result = kNN(X_train, y_train, onedata, 2)
            y_pred.append(result)
        print(y_pred)
        M = metrics.confusion_matrix(y_pred, y_test)
        print('Confusion matrix:\n', M)
        n = len(M)
        for c in range(n):
            rowsum, colsum = sum(M[c]), sum(M[r][c] for r in range(n))
            precision = M[c][c] / float(colsum)
            recall = M[c][c] / float(rowsum)
            F1 = precision * recall * 2 / (precision + recall)
            print('class %d precision: %s' % (c, precision), 'class %d recall: %s' % (c, recall), 'class %d F1: %s' % (c, F1))
        print("Model accuracy: {:.5f}".format(np.mean(y_pred == y_test)))

3. Decision tree

The key point of a decision tree is that each split uses the feature with the largest information gain. The most basic algorithm is ID3; C4.5, CART and pruning are later refinements of it. Here we go with ID3. The code handles discrete features (e.g. handsome or not, rich or not, melon sweet or not) and continuous features (1, 2, 3, 4, ...) separately, since the two types need different treatment. It also contains functions to save the generated tree and to draw it. I found this code online; I honestly could not write all of it by hand, so I just worked through what its authors wrote. Respect to them.
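Before the full listing, here is a tiny worked example of the two quantities ID3 relies on, entropy and information gain; it is my own illustration, not part of the original code.

# Toy check of entropy and information gain
from math import log2

def entropy(class_list):
    counts = {}
    for c in class_list:
        counts[c] = counts.get(c, 0) + 1
    total = len(class_list)
    return -sum(n / total * log2(n / total) for n in counts.values())

toy_labels = [1, 1, 0, 0]
print(entropy(toy_labels))            # 1.0 bit for a 50/50 split
# A split that separates the classes perfectly has conditional entropy 0,
# so its information gain equals the original entropy:
left, right = [1, 1], [0, 0]
cond = len(left) / 4 * entropy(left) + len(right) / 4 * entropy(right)
print(entropy(toy_labels) - cond)     # 1.0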
Code :


#https://blog.csdn.net/kai123wen/article/details/105769970

#https://github.com/login?return_to=%2Fjiajia0%2FMachineLearningTopic
import collections
import random
import string
from math import log
import pandas as pd
import operator
import treePlotter  # treePlotter.py is given at the end of this post; keep it in the same directory


def calcShannonEnt(dataSet):
    """  Calculate the information entropy of a given data set ( Shannon entropy ) :param dataSet: :return: """
    #  Calculate the total number of data sets 
    numEntries = len(dataSet)

    #  Used to count tags 
    labelCounts = collections.defaultdict(int)

    #  Loop the entire dataset , Get the classification label of the data 
    for featVec in dataSet:
        #  Get the current tag 
        currentLabel = featVec[-1]

        #  Add one to the corresponding tag value 
        labelCounts[currentLabel] += 1

    #  Default information entropy 
    shannonEnt = 0.0

    for key in labelCounts:
        #  Calculate the proportion of current category labels in the total labels 
        prob = float(labelCounts[key]) / numEntries

        #  With 2 Find the logarithm of the base 
        shannonEnt -= prob * log(prob, 2)

    return shannonEnt


def splitDataSetForSeries(dataSet, axis, value):
    """  According to the given value , Divide the data set into two parts no greater than and greater than  :param dataSet:  The data set to be divided  :param i:  The subscript of the characteristic value  :param value:  Division value  :return: """
    #  Used to save a set that is not greater than the partition value 
    eltDataSet = []
    #  Used to save sets larger than the partition value 
    gtDataSet = []
    #  division , Keep the characteristic value 
    for feat in dataSet:
        if feat[axis] <= value:
            eltDataSet.append(feat)
        else:
            gtDataSet.append(feat)

    return eltDataSet, gtDataSet


def splitDataSet(dataSet, axis, value):
    """  According to the given eigenvalue , Divide the dataset into  :param dataSet:  Data sets  :param axis:  Coordinates of a given eigenvalue  :param value:  Conditions satisfied by a given eigenvalue , Only a given eigenvalue is equal to this value I'll be back when  :return: """
    #  Create a new list , Prevent changes to the original list 
    retDataSet = []

    #  Traversing the entire dataset 
    for featVec in dataSet:
        #  If the given eigenvalue is equal to the desired eigenvalue 
        if featVec[axis] == value:
            #  Save the contents in front of the characteristic value 
            reducedFeatVec = featVec[:axis]
            #  Save the contents after the characteristic value , So the given eigenvalue is removed 
            reducedFeatVec.extend(featVec[axis + 1:])
            #  Add to the return list 
            retDataSet.append(reducedFeatVec)

    return retDataSet


def calcInfoGainForSeries(dataSet, i, baseEntropy):
    """  Calculate the information gain of continuous values  :param dataSet: The whole dataset  :param i:  Corresponding eigenvalue subscript  :param baseEntropy:  Basic information entropy  :return:  Returns an information gain value , And the current partition point  """

    #  Record the maximum information gain 
    maxInfoGain = 0.0

    #  The best dividing point 
    bestMid = -1

    #  Get the list of all current eigenvalues in the dataset 
    featList = [example[i] for example in dataSet]

    #  Get a list of categories 
    classList = [example[-1] for example in dataSet]

    dictList = dict(zip(featList, classList))

    #  Sort them from small to large , Arrange according to the size of continuous values 
    sortedFeatList = sorted(dictList.items(), key=operator.itemgetter(0))

    #  Calculate the number of consecutive values 
    numberForFeatList = len(sortedFeatList)

    #  Candidate split points (rounded to 3 decimals): the midpoint of each pair of adjacent sorted values
    midFeatList = [round((sortedFeatList[i][0] + sortedFeatList[i + 1][0]) / 2.0, 3) for i in
                   range(numberForFeatList - 1)]

    #  Calculate the information gain of each partition point 
    for mid in midFeatList:
        #  The continuous value is divided into two parts: not greater than the current partition point and greater than the current partition point 
        eltDataSet, gtDataSet = splitDataSetForSeries(dataSet, i, mid)

        #  Calculate the sum of the product of the eigenvalue entropy and weight of the two parts 
        # Conditional entropy 
        newEntropy = len(eltDataSet) / len(dataSet) * calcShannonEnt(eltDataSet) + len(gtDataSet) / len(
            dataSet) * calcShannonEnt(gtDataSet)

        #  Calculate the information gain 
        infoGain = baseEntropy - newEntropy

        # print(' The current partition value is :' + str(mid) + ', At this time, the information gain is :' + str(infoGain))
        if infoGain > maxInfoGain:
            bestMid = mid
            maxInfoGain = infoGain

    return maxInfoGain, bestMid


def calcInfoGain(dataSet, featList, i, baseEntropy):
    """  Calculated information gain  :param dataSet:  Data sets  :param featList:  Current feature list  :param i:  Current eigenvalue subscript  :param baseEntropy:  Basic information entropy  :return: """
    #  Make the current feature unique , That is, how many kinds of eigenvalues are there at present 
    uniqueVals = set(featList)

    #  New entropy , Entropy representing the current eigenvalue 
    newEntropy = 0.0

    #  The possibility of traversing existing features 
    for value in uniqueVals:
        #  At the current feature location of all data sets , Find the set whose characteristic value is equal to the current value 
        subDataSet = splitDataSet(dataSet=dataSet, axis=i, value=value)
        #  Calculate the weight 
        prob = len(subDataSet) / float(len(dataSet))
        #  Calculate the entropy of the current eigenvalue 
        newEntropy += prob * calcShannonEnt(subDataSet)

    #  To calculate the “ Information gain ”
    infoGain = baseEntropy - newEntropy

    return infoGain


def chooseBestFeatureToSplit(dataSet, labels):
    """  Select the best dataset partition feature , Calculate... According to the information gain value , Can handle continuous values  :param dataSet: :return: """
    #  Get the total number of eigenvalues of the data 
    numFeatures = len(dataSet[0]) - 1

    #  Calculate the basic information entropy , To calculate the H(D).
    baseEntropy = calcShannonEnt(dataSet)

    #  Best information gain so far
    bestInfoGain = 0.0

    #  Index of the best feature
    bestFeature = -1

    #  Mark whether the current best characteristic value is a continuous value 
    flagSeries = 0

    #  If it is a continuous value , A dividing point used to record continuous values 
    bestSeriesMid = 0.0

    #  Calculate the information entropy for each eigenvalue 
    for i in range(numFeatures):

        #  Values of the i-th feature for every row
        featList = [example[i] for example in dataSet]
        #  Discrete (string) features and continuous features are handled separately
        if isinstance(featList[0], str):
            infoGain = calcInfoGain(dataSet, featList, i, baseEntropy)
        else:
            # print(' The current partition attribute is :' + str(labels[i]))
            infoGain, bestMid = calcInfoGainForSeries(dataSet, i, baseEntropy)

        # print(' The current eigenvalue is :' + labels[i] + ', The corresponding information gain value is :' + str(infoGain))

        #  If the current information gain is greater than the original 
        if infoGain > bestInfoGain:
            #  The best information gain 
            bestInfoGain = infoGain
            #  The new best eigenvalue used to divide 
            bestFeature = i

            flagSeries = 0
            if not isinstance(dataSet[0][bestFeature], str):
                flagSeries = 1
                bestSeriesMid = bestMid

    # print(' The characteristic of the largest information gain is :' + labels[bestFeature])
    if flagSeries:
        return bestFeature, bestSeriesMid
    else:
        return bestFeature


def getDataSet(test_size):
    """  Create a test data set , The values inside have continuous values  :return: """
    dataSet = pd.read_csv("iris.data").values.tolist()
    random.shuffle(dataSet)
    train_dataset = dataSet[:int(len(dataSet) * (1 - test_size))]  #  Training data 
    test_dataset = dataSet[int(len(dataSet) * (1 - test_size)):]  #  Test data 
    #  List of eigenvalues 
    labels = ['sepal length', 'sepal width', 'petal length', 'petal width']

    return train_dataset, test_dataset, labels


def majorityCnt(classList):
    """  The most frequently found category labels  :param classList: :return: """
    #  defaultdict counts votes per label, with a default of 0
    classCount = collections.defaultdict(int)

    #  Traverse all tag categories 
    for vote in classList:
        classCount[vote] += 1

    #  Sort from large to small 
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)

    #  The most frequently returned tag 
    return sortedClassCount[0][0]


def createTree(dataSet, labels):
    """  Create a decision tree  :param dataSet:  Data sets  :param labels:  Feature tags  :return: """
    #  Get the classification labels of all data sets 
    classList = [example[-1] for example in dataSet]

    #  Count the number of times the first tag appears , Compare with the total number of tags , If equal, it means that all in the current list are one kind of label , Stop dividing at this time 
    if classList.count(classList[0]) == len(classList):
        return classList[0]

    #  Calculate how many data there are in the first row , If there is only one, it means that all feature attributes have been traversed , The remaining one is the category label 
    if len(dataSet[0]) == 1:
        #  Return the one that appears more frequently in the remaining tags 
        return majorityCnt(classList)

    #  Select the best division feature , Get the subscript of the feature 
    bestFeat = chooseBestFeatureToSplit(dataSet=dataSet, labels=labels)

    #  Get the name of the best feature 
    bestFeatLabel = ''

    #  Records whether the chosen feature is continuous (1) or discrete (0)
    flagSeries = 0

    #  If it's a continuous value , Record the dividing points of continuous values 
    midSeries = 0.0

    #  If it's a tuple , This indicates that this is a continuous value 
    if isinstance(bestFeat, tuple):
        #  Modify the bifurcation point information again 
        bestFeatLabel = str(labels[bestFeat[0]]) + '=' + str(bestFeat[1])
        #  Get the current partition point 
        midSeries = bestFeat[1]
        #  Get the subscript value 
        bestFeat = bestFeat[0]
        #  Continuous value flag 
        flagSeries = 1
    else: # If it's not continuous 
        #  Get bifurcation information 
        bestFeatLabel = labels[bestFeat]
        #  Discrete value flag 
        flagSeries = 0

    #  Store the tree structure in a dictionary; the key is the name of the chosen feature
    myTree = {bestFeatLabel: {}}

    #  Get all possible values of the current feature tag 
    featValues = [example[bestFeat] for example in dataSet]

    #  Continuous value processing 
    if flagSeries:

        #  The continuous value is divided into two parts: not greater than the current partition point and greater than the current partition point 
        eltDataSet, gtDataSet = splitDataSetForSeries(dataSet, bestFeat, midSeries)
        #  Get the remaining feature tags 
        subLabels = labels[:]
        #  Recursively build the subtree for values not greater than the split point
        subTree = createTree(eltDataSet, subLabels)
        myTree[bestFeatLabel]['less than'] = subTree

        #  Recursively build the subtree for values greater than the split point
        subTree = createTree(gtDataSet, subLabels)
        myTree[bestFeatLabel]['greater than'] = subTree

        return myTree

    #  Discrete value processing 
    else:

        #  Delete the characteristic value of this division from the list 
        del (labels[bestFeat])
        #  The only change , Remove duplicate eigenvalues 
        uniqueVals = set(featValues)
        #  Traverse all eigenvalues 
        for value in uniqueVals:
            #  Get the remaining feature tags 
            subLabels = labels[:]
            #  Recursively call , Divide all data in the data set whose characteristic is equal to the current characteristic value into the current node , When calling recursively, you need to remove the current feature first 
            subTree = createTree(splitDataSet(dataSet=dataSet, axis=bestFeat, value=value), subLabels)
            #  Put the subtree under the fork 
            myTree[bestFeatLabel][value] = subTree
        return myTree


#  Classify one sample (inputs: decision tree, list of feature names, test vector)
def classify(inputTree, featLables, testVec):
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]  #  subtrees of the current node
    #  firstStr looks like 'petal length=2.45'; find that feature's position in featLables
    featIndex = featLables.index(firstStr[:firstStr.index('=')])
    threshold = float(firstStr[firstStr.index('=') + 1:])
    branch = 'greater than' if testVec[featIndex] > threshold else 'less than'
    if type(secondDict[branch]).__name__ == 'dict':
        return classify(secondDict[branch], featLables, testVec)
    return secondDict[branch]


#  Save the decision tree 
def storeTree(inputTree, filename):
    import pickle
    fw = open(filename, 'wb')
    pickle.dump(inputTree, fw)
    fw.close()


#  Read the decision tree 
def grabTree(filename):
    import pickle
    fr = open(filename, 'rb')
    return pickle.load(fr)


def calConfuMatrix(myTree, label, test_dataSet):
    matrix = {'Iris-setosa': {'Iris-setosa': 0, 'Iris-versicolor': 0},
              'Iris-versicolor': {'Iris-setosa': 0, 'Iris-versicolor': 0}}
    for example in test_dataSet:
        predict = classify(myTree, label, example)  #  predicted class
        actual = example[-1]  #  actual class
        matrix[actual][predict] += 1  #  fill the confusion matrix
    return matrix


#  Precision
def precision(classes, matrix):
    precisionDict = {'Iris-setosa': 0, 'Iris-versicolor': 0}
    for classItem in classes:
        true_predict_num = matrix[classItem][classItem]  #  correctly predicted count
        all_predict_num = 0  #  number of samples predicted as this class
        for temp_class_item in classes:
            all_predict_num += matrix[temp_class_item][classItem]
        precisionDict[classItem] = round(true_predict_num / all_predict_num, 5)
    return precisionDict

# Accuracy rate 
def score(classes, matrix):
    all_num=0
    all_right=0
    for classItem in classes:
        all_right += matrix[classItem][classItem]  #  Accurately predict the quantity 
        for temp_class_item in classes:
            all_num += matrix[temp_class_item][classItem]
    score = round(all_right / all_num, 5)
    return score

#  Recall
def recall(classes, matrix):
    recallDict = {'Iris-setosa': 0, 'Iris-versicolor': 0}
    for classItem in classes:
        true_predict_num = matrix[classItem][classItem]  #  correctly predicted count
        all_predict_num = 0  #  number of samples whose actual class is this class
        for temp_class_item in classes:
            all_predict_num += matrix[classItem][temp_class_item]
        recallDict[classItem] = round(true_predict_num / all_predict_num, 5)
    return recallDict


#  Display the results
def showResult(precisionValue, recallValue, classes):
    print('\t\t\t\t', 'precision', '\t', 'recall', '\t', 'F1')
    for classItem in classes:
        print(classItem, '\t', precisionValue[classItem], '\t', recallValue[classItem], '\t', round((2*precisionValue[classItem]*recallValue[classItem])/(precisionValue[classItem]+recallValue[classItem]), 5))


if __name__ == '__main__':
    """ Decision tree with continuous-valued features """
    train_dataSet, test_dataSet, labels = getDataSet(0.5)
    temp_label = labels[:]  #  copy, because createTree deletes entries from labels
    myTree = createTree(train_dataSet, labels)
    storeTree(myTree, 'tree.txt')
    myTree = grabTree('tree.txt')
    print(myTree)
    treePlotter.createPlot(myTree)
    matrix = calConfuMatrix(myTree, temp_label, test_dataSet)  #  confusion matrix
    classes = ['Iris-setosa', 'Iris-versicolor']  #  all classes
    precisionValue = precision(classes, matrix)
    recallValue = recall(classes, matrix)
    acc = score(classes, matrix)   # do not overwrite the score() function
    showResult(precisionValue, recallValue, classes)
    print("Accuracy: %f" % acc)


treePlotter.py

# @Time : 2017/12/18 19:46
# @Author : Leafage
# @File : treePlotter.py
# @Software: PyCharm

import matplotlib
import matplotlib.pyplot as plt

#  Allow Chinese characters in the plots
matplotlib.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['font.serif'] = ['SimHei']

#  Bifurcation node , That is, the decision node 
decisionNode = dict(boxstyle="sawtooth", fc="0.8")

#  Leaf node 
leafNode = dict(boxstyle="round4", fc="0.8")

#  Arrow pattern 
arrow_args = dict(arrowstyle="<-")


def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    """  Draw a node  :param nodeTxt:  Text information describing the node  :param centerPt:  The coordinates of the text  :param parentPt:  The coordinates of point , This also refers to the coordinates of the parent node  :param nodeType:  Node type , It is divided into leaf node and decision node  :return: """
    createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va="center", ha="center", bbox=nodeType, arrowprops=arrow_args)


def getNumLeafs(myTree):
    """  Get the number of leaf nodes  :param myTree: :return: """
    #  Count the total number of leaf nodes 
    numLeafs = 0

    #  Get the current first key, Root node 
    firstStr = list(myTree.keys())[0]

    #  Get the first one key Corresponding content 
    secondDict = myTree[firstStr]

    #  Recursively traverses leaf nodes 
    for key in secondDict.keys():
        #  If key It corresponds to a dictionary , Call recursively 
        if type(secondDict[key]).__name__ == 'dict':
            numLeafs += getNumLeafs(secondDict[key])
        #  otherwise this is a leaf node
        else:
            numLeafs += 1
    return numLeafs


def getTreeDepth(myTree):
    """  Get the number of depth layers  :param myTree: :return: """
    #  Used to save the maximum number of layers 
    maxDepth = 0

    #  Get the root node 
    firstStr = list(myTree.keys())[0]

    #  obtain key Corresponding content 
    secondDic = myTree[firstStr]

    #  Traverse all child nodes 
    for key in secondDic.keys():
        #  If the node is a dictionary , Call recursively 
        if type(secondDic[key]).__name__ == 'dict':
            #  The depth of the child node plus 1
            thisDepth = 1 + getTreeDepth(secondDic[key])

        #  This indicates that this is the leaf node 
        else:
            thisDepth = 1

        #  Replace the maximum number of layers 
        if thisDepth > maxDepth:
            maxDepth = thisDepth

    return maxDepth


def plotMidText(cntrPt, parentPt, txtString):
    """  Calculate the middle position between the parent node and the child node , Fill in the information  :param cntrPt:  Child node coordinates  :param parentPt:  Parent node coordinates  :param txtString:  Filled text information  :return: """
    #  Calculation x The middle position of the shaft 
    xMid = (parentPt[0]-cntrPt[0])/2.0 + cntrPt[0]
    #  Calculation y The middle position of the shaft 
    yMid = (parentPt[1]-cntrPt[1])/2.0 + cntrPt[1]
    #  Drawing 
    createPlot.ax1.text(xMid, yMid, txtString)


def plotTree(myTree, parentPt, nodeTxt):
    """  Draw all nodes of the tree , Recursive drawing  :param myTree:  Trees  :param parentPt:  The coordinates of the parent node  :param nodeTxt:  The text information of the node  :return: """
    #  Calculate the number of leaf nodes 
    numLeafs = getNumLeafs(myTree=myTree)

    #  Calculate the depth of the tree 
    depth = getTreeDepth(myTree=myTree)

    #  Get the information content of the root node 
    firstStr = list(myTree.keys())[0]

    #  Calculate the middle coordinates of the current root node in all child nodes , That is, at present x The offset of the axis plus the calculated center position of the root node is used as x Axis ( For example, for the first time : Initial x The offset for the :-1/2W, The calculated center position of the root node is :(1+W)/2W, Add up to get :1/2), At present y The axis offset is taken as y Axis 
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW, plotTree.yOff)

    #  Draw the connection between the node and the parent node 
    plotMidText(cntrPt, parentPt, nodeTxt)

    #  Draw the node 
    plotNode(firstStr, cntrPt, parentPt, decisionNode)

    #  Get the subtree corresponding to the current root node 
    secondDict = myTree[firstStr]

    #  Work out a new y Shaft offset , Move down the 1/D, That is, the drawing of the next layer y Axis 
    plotTree.yOff = plotTree.yOff - 1.0/plotTree.totalD

    #  Loop through all key
    for key in secondDict.keys():
        #  If the current key If it's a dictionary , Represents a subtree , Then recursively traverse 
        if isinstance(secondDict[key], dict):
            plotTree(secondDict[key], cntrPt, str(key))
        else:
            #  Calculate the new x Shaft offset , That's what the next leaf draws x The axis coordinates move to the right 1/W
            plotTree.xOff = plotTree.xOff + 1.0/plotTree.totalW
            #  Open the annotation to observe the coordinate changes of leaf nodes 
            # print((plotTree.xOff, plotTree.yOff), secondDict[key])
            #  Draw leaf nodes 
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            #  Draw the content of the middle line between the leaf node and the parent node 
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))

    #  Before returning recursion , Need to put y The offset of the axis increases , Move up 1/D, That is, go back and draw the upper layer y Axis 
    plotTree.yOff = plotTree.yOff + 1.0/plotTree.totalD


def createPlot(inTree):
    """  The decision tree that needs to be drawn  :param inTree:  Decision tree Dictionary  :return: """
    #  Create an image 
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)
    #  Calculate the total width of the decision tree 
    plotTree.totalW = float(getNumLeafs(inTree))
    #  Calculate the total depth of the decision tree 
    plotTree.totalD = float(getTreeDepth(inTree))
    #  Initial x Shaft offset , That is to say -1/2W, Every time I move to the right 1/W, That is, the first leaf node is drawn x Coordinate for :1/2W, the second :3/2W, Third :5/2W, the last one :(W-1)/2W
    plotTree.xOff = -0.5/plotTree.totalW
    #  Initial y Shaft offset , Every time you move down or up 1/D
    plotTree.yOff = 1.0
    #  Call the function to draw the node image 
    plotTree(inTree, (0.5, 1.0), '')
    #  draw 
    plt.show()


if __name__ == '__main__':
    testTree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}, 3: 'maybe'}}
    createPlot(testTree)

Result:
(The plotted decision tree appears as an image in the original post.)
You can experiment with the rest yourself.
SVM, logistic regression and Bayes classification on this same data set are still to come; I will add them bit by bit in later updates ~

The data set can be obtained directly through the sklearn library; if you need it as a txt file, send me a private message and I will share it for free.
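For completeness, a sketch (my own addition) of writing an iris.data-style file from sklearn so that getDataSet() above can read it; only the first two classes are kept, and the class names are prefixed to match the classic file format.

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data)
df['label'] = [iris.target_names[t] for t in iris.target]   # 'setosa', 'versicolor', 'virginica'
df = df[df['label'] != 'virginica']                          # keep two classes for binary classification
df['label'] = 'Iris-' + df['label'].astype(str)              # 'Iris-setosa', 'Iris-versicolor'
df.to_csv('iris.data', index=False, header=False)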
The code came from these sources:
https://blog.csdn.net/kai123wen/article/details/105769970

https://github.com/login?return_to=%2Fjiajia0%2FMachineLearningTopic

I forgot where the KNN code originally came from.
Once you understand the code, you can change the parameters and adapt it to whatever you need.
Good luck!!!
