
ML from scratch / Logistic regression / Binary classification

2022-07-08 01:58:00 xcrj

Principle

Prediction function:

  • Classification requires $0 \leq h_\theta(x) \leq 1$
  • The traditional (linear regression) hypothesis can yield $h_\theta(x) \gg 1$ or $h_\theta(x) \ll 0$

Transforming the traditional prediction function into a classification prediction function:

  • Traditional prediction function $h_\theta(x)$ $\stackrel{sigmoid}{\longrightarrow}$ classification prediction function $h_\theta(x)$

The process:
sigmoid: $g(z)=\frac{1}{1+e^{-z}} \stackrel{z=\theta^Tx}{\longrightarrow} h_\theta(x)=\frac{1}{1+e^{-\theta^Tx}}$
Original: $h_\theta(x)=z=\theta^Tx=\theta_0+\theta_1x_1+\theta_2x_2+...+\theta_nx_n$
New: $h_\theta(x)=P(y=1|x;\theta)=\frac{1}{1+e^{-z}}$
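As a quick sanity check of these formulas, here is a minimal NumPy sketch (the parameter and feature values are made up for illustration) that computes $z=\theta^Tx$ and $h_\theta(x)=g(z)$ for one sample:

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1 / (1 + np.exp(-z))

# Hypothetical parameters and one sample with 3 features
theta0 = -1.0
theta = np.array([0.5, 0.2, -0.3])
x = np.array([2.0, 1.0, 0.5])

z = theta0 + np.dot(theta, x)   # z = theta^T x (with intercept theta0)
h = sigmoid(z)                  # h_theta(x) = P(y=1 | x; theta)
print(z, h)                     # z = 0.05, h ≈ 0.512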

Decision boundary:

  • When the new $h_\theta(x)=g(z) \geq 0.5$, predict $y=1$ $\Rightarrow z \geq 0 \Rightarrow \theta^Tx \geq 0 \Rightarrow$ old $h_\theta(x) \geq 0$
  • When the new $h_\theta(x)=g(z) \leq 0.5$, predict $y=0$ $\Rightarrow z \leq 0 \Rightarrow \theta^Tx \leq 0 \Rightarrow$ old $h_\theta(x) \leq 0$
  • The old $h_\theta(x)=0$ is exactly the decision boundary (see the sketch below)
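Continuing the same toy example, a minimal sketch showing that thresholding the probability at 0.5 and checking the sign of $\theta^Tx$ give the same prediction (parameters are again assumed):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

theta0, theta = -1.0, np.array([0.5, 0.2, -0.3])   # assumed parameters
x = np.array([2.0, 1.0, 0.5])

z = theta0 + np.dot(theta, x)
pred_by_prob = 1 if sigmoid(z) >= 0.5 else 0   # threshold on g(z)
pred_by_sign = 1 if z >= 0 else 0              # threshold on theta^T x
assert pred_by_prob == pred_by_sign            # the two rules always agree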

Cost function:
The original cost function:

  • The original (squared-error) cost function cannot be used, because it has too many local optima
  • $J(\theta)=\frac{1}{m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$, where $0\leq h_\theta(x^{(i)})\leq 1$ and $y=0$ or $1$; this creates many local optima, so $J(\theta)$ is not a convex function

New cost function (reverse thinking):
$cost(h_\theta(x),y)= \begin{cases} -\log(h_\theta(x)) & y=1 \\ -\log(1-h_\theta(x)) & y=0 \end{cases}$
Interpretation:

  • For $y=1$: $\begin{cases} h_\theta(x)\rightarrow 1 \Rightarrow cost\rightarrow 0 \\ h_\theta(x)\rightarrow 0 \Rightarrow cost\rightarrow +\infty \end{cases}$
  • For $y=0$: $\begin{cases} h_\theta(x)\rightarrow 0 \Rightarrow cost\rightarrow 0 \\ h_\theta(x)\rightarrow 1 \Rightarrow cost\rightarrow +\infty \end{cases}$

Unified cost function:

  • $cost(h_\theta(x),y)=-y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))$
  • When $y=1$, only the first term of the formula remains
  • When $y=0$, only the second term remains
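A quick numeric check that the unified formula reproduces both branches of the piecewise cost (the probability value 0.9 is arbitrary):

import numpy as np

def unified_cost(h, y):
    # -y*log(h) - (1-y)*log(1-h)
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

h = 0.9                                    # an arbitrary predicted probability
print(unified_cost(h, 1), -np.log(h))      # y=1: both print 0.105...
print(unified_cost(h, 0), -np.log(1 - h))  # y=0: both print 2.302...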

Overall cost function (averaging the per-sample cost over all m training examples):
$\begin{aligned} J(\theta) &=\frac{1}{m}\sum\limits_{i=1}^m cost(h_\theta(x^{(i)}),y^{(i)}) \\ &=\frac{1}{m}\sum\limits_{i=1}^m\left[-y^{(i)}\log(h_\theta(x^{(i)}))-(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right] \end{aligned}$
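Below is a minimal vectorized sketch of this averaged cost, using assumed toy data; the same expression appears later inside gradient_descent:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost_J(theta0, theta, X, y):
    # J(theta) = (1/m) * sum of the unified per-sample costs
    m = X.shape[0]
    h = sigmoid(theta0 + X @ theta)
    return (1 / m) * np.sum(-y * np.log(h) - (1 - y) * np.log(1 - h))

# Assumed toy data: 3 samples, 2 features
X = np.array([[1.0, 2.0], [0.5, 0.1], [2.0, 1.0]])
y = np.array([[1], [0], [1]])
print(cost_J(0.0, np.zeros((2, 1)), X, y))   # log(2) ≈ 0.693 for all-zero parameters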

Batch gradient descent algorithm:

  • Repeat until convergence {
    $\theta_0:=\theta_0-\alpha\frac{\partial{J(\theta)}}{\partial{\theta_0}}$
    $\theta_j:=\theta_j-\alpha\frac{\partial{J(\theta)}}{\partial{\theta_j}}$
    }
  • Substituting the partial derivatives, repeat until convergence {
    $\theta_0:=\theta_0-\alpha\frac{1}{m}\sum\limits_{i=1}^m[h_\theta(x^{(i)})-y^{(i)}]$
    $\theta_j:=\theta_j-\alpha\frac{1}{m}\sum\limits_{i=1}^m[h_\theta(x^{(i)})-y^{(i)}]x_j^{(i)}$
    }
  • Note: batch gradient descent must update all $\theta_j$ simultaneously, as the sketch below demonstrates
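A minimal sketch of the batch update with assumed toy data: both gradients are computed from the old parameters, then $\theta_0$ and all $\theta_j$ are assigned together (the simultaneous update noted above):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Assumed toy data: 4 samples, 2 features
X = np.array([[1.0, 2.0], [2.0, 0.5], [0.2, 0.1], [3.0, 1.5]])
y = np.array([[1], [1], [0], [1]])
theta0, theta = 0.0, np.zeros((2, 1))
alpha, m = 0.1, X.shape[0]

for _ in range(100):                    # "repeat until convergence" (fixed count here)
    h = sigmoid(theta0 + X @ theta)     # predictions for all samples
    grad0 = (1 / m) * np.sum(h - y)     # dJ/dtheta_0
    grad = (1 / m) * X.T @ (h - y)      # dJ/dtheta_j for all j at once
    theta0, theta = theta0 - alpha * grad0, theta - alpha * grad   # simultaneous update

print(theta0, theta.ravel())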

Data set

Spam classification

  • Download the spambase.data file (the UCI Spambase dataset)
  • Binary classification problem: spam or non-spam
  • This experiment uses only the first 3 columns as features and the last column as the target: a value of 1 means spam, 0 means non-spam

Code

from mpl_toolkits.mplot3d import Axes3D
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams['font.family'] = 'STSong'
matplotlib.rcParams['font.size'] = 20


class DataSet(object):
    """ X_train  Training set samples  y_train  Training set sample value  X_test  Test set samples  y_test  Test set sample values  """

    def __init__(self, X_train, y_train, X_test, y_test):
        self.X_train = X_train
        self.y_train = y_train
        self.X_test = X_test
        self.y_test = y_test


class LogisticRegression(object):
    """  Logical regression  """

    def __init__(self, n_feature):
        self.theta0 = 0
        self.theta = np.zeros((n_feature, 1))

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def gradient_descent(self, X, y, alpha=0.001, num_iter=100):
        costs = []
        m, _ = X.shape
        for i in range(num_iter):
            # Predicted probabilities for all training samples
            h = self.sigmoid(np.dot(X, self.theta) + self.theta0)
            # Cross-entropy cost J(theta)
            cost = (1 / m) * np.sum(-y * np.log(h) - (1 - y) * (np.log(1 - h)))
            costs.append(cost)
            # Gradients of J(theta) w.r.t. theta0 and theta
            dJ_dtheta0 = (1 / m) * np.sum(h - y)
            dJ_dtheta = (1 / m) * np.dot((h - y).T, X).T
            # Update all theta simultaneously
            self.theta0 = self.theta0 - alpha * dJ_dtheta0
            self.theta = self.theta - alpha * dJ_dtheta

        return costs

    def show_train(self, costs, num_iter):
        """  Show the training process  """
        fig = plt.figure(figsize=(10, 6))
        plt.plot(np.arange(num_iter), costs)
        plt.title(" Cost changes ")
        plt.xlabel(" The number of iterations ")
        plt.ylabel(" cost ")
        plt.show()

    def hypothesis(self, X, theta0, theta):
        """  Prediction function  """
        h0 = self.sigmoid(self.theta0 + np.dot(X, self.theta))
        h = [1 if elem > 0.5 else 0 for elem in h0]
        return np.array(h)[:, np.newaxis]


def read_data():
    """  Reading data  """
    # names: Header 
    # sep: Separator 
    # skipinitialspace: Ignore the space after the delimiter 
    # comment: Ignore \t Note after 
    # na_values: Use ? Replace NA Value 
    origin_data = pd.read_csv("./data/spambase.data", sep=",", skipinitialspace=True, comment="\t", na_values="?")
    data = origin_data.copy()
    # tail() prints the last n rows
    print(data.tail())
    return data


def clean_data(data):
    """  Data cleaning : Handling outliers  """
    # dataset Does it contain NA data 
    # pandas 0.22.0+ Only then isna(), Upgrade order :pip install --upgrade pandas==0.22.0
    print('NA Row number :', data.isna().sum())
    #  Delete the exception line 
    cleaned_data = data.dropna()
    return cleaned_data


def show_data(data):
    """  Show the data  """
    count_spam = 0
    count_non_spam = 0
    for c in data.iloc[:, -1]:
        if c == 1:
            count_spam += 1
        else:
            count_non_spam += 1

    print(" Number of spam :", count_spam)
    print(" Number of normal mail :", count_non_spam)


def split_data(data):
    """  Divide the data   Divided into train, test;train Used to train the prediction function ,test Used to test the generalization ability of the predicted function value  """
    copied_data = data.copy()
    # frac: fraction of rows to sample; random_state: random seed
    train_dataset = copied_data.sample(frac=0.8, random_state=1)
    # The remaining rows form the test set
    test_dataset = copied_data.drop(train_dataset.index)

    X_train = train_dataset.iloc[:, 0:3]
    y_train = train_dataset.iloc[:, -1]
    X_test = test_dataset.iloc[:, 0:3]
    y_test = test_dataset.iloc[:, -1]
    dataset = DataSet(X_train, y_train, X_test, y_test)

    return dataset


def evaluate_model(y_test, h):
    """  Evaluation model  """
    # MSE: Mean square error 
    print("MSE: %f" % (np.sum((h - y_test) ** 2) / len(y_test)))
    # RMSE: Root mean square difference 
    print("RMSE: %f" % (np.sqrt(np.sum((h - y_test) ** 2) / len(y_test))))

def show_result(X_test, y_test, h):
    # figure canvas 
    fig = plt.figure(figsize=(16, 8), facecolor='w')
    # Adjust subplot spacing
    plt.subplots_adjust(left=0.05, right=0.95, bottom=0.05, top=0.9)

    # 121: nrows=1, ncols=2, index=1
    ax = fig.add_subplot(121, projection='3d')
    ax.set_title("y_test")
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
    ax.set_zlabel('Feature 3')
    # x, y, z, c (color), marker (shape)
    ax.scatter(X_test[:, 0], X_test[:, 1], X_test[:, 2], c=y_test, marker='o')
    plt.grid(True)

    ax1 = fig.add_subplot(122, projection='3d')
    ax1.set_title("h")
    ax1.set_xlabel('Feature 1')
    ax1.set_ylabel('Feature 2')
    ax1.set_zlabel('Feature 3')
    # x, y, z, c (color), marker (shape)
    ax1.scatter(X_test[:, 0], X_test[:, 1], X_test[:, 2], c=h, marker='*')
    plt.grid(True)

    plt.show()

def main():
    #  Reading data 
    data = read_data()
    #  Data cleaning 
    cleaned_data = clean_data(data)
    # Mean normalization: the first 3 feature columns have similar value ranges, so normalization is skipped
    #  Display data 
    show_data(cleaned_data)
    #  Split data 
    dataset = split_data(cleaned_data)
    #  Build the model 
    _, n = dataset.X_train.shape
    logistic_regression = LogisticRegression(n)
    num_iteration = 300
    costs = logistic_regression.gradient_descent(dataset.X_train, dataset.y_train.values[:, np.newaxis], alpha=0.5,
                                                 num_iter=num_iteration)
    #  Show the training process 
    logistic_regression.show_train(costs, num_iteration)
    # Evaluate the model
    h = logistic_regression.hypothesis(dataset.X_test, logistic_regression.theta0, logistic_regression.theta)
    evaluate_model(dataset.y_test.values[:, np.newaxis], h)
    #  Display the results 
    show_result(dataset.X_test.values, dataset.y_test.values, h.ravel())


if __name__ == '__main__':
    main()
