Spam filtering challenges
2022-07-01 11:05:00 【I'm not zzy1231a】
With the continued growth of network applications, e-mail has become an integral part of people's daily work and life. At the same time, spam plagues many e-mail users: it adds to their reading burden and takes up limited mailbox space. This article therefore proposes a mail classification method based on a multi-layer perceptron (MLP), used to classify unknown messages and to detect and identify spam.
Goal: solve the binary classification problem of spam filtering.
The dataset contains 3017 e-mails; the training set includes 2082 ham (normal) messages and 935 spam messages. The dataset link is at the end of this article.
TF-IDF module
TF-IDF (term frequency-inverse document frequency) is a statistical method suited to classification problems; it is mainly used to evaluate how important a word is to one document within a document set or corpus. If a word's term frequency (TF) in an article is high, and the word rarely appears in other articles, the word or phrase is considered to have good discriminating power for classification. When turning text data into feature vectors, the common representation is the bag-of-words model: word order is ignored and each distinct word becomes a separate feature. The set of distinct words is called the vocabulary, and every text can be counted into a feature vector over this (usually long) vocabulary. Words that appear in almost every text carry little information; they are usually marked as "stop words" and excluded from the feature vector. This article uses the TfidfVectorizer class from the scikit-learn machine learning library. Compared with CountVectorizer, TfidfVectorizer considers not only the frequency of a word within a text but also the number of texts that contain the word. This reduces the impact of high-frequency but meaningless words and surfaces more informative features.
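To make the difference concrete, here is a minimal sketch on a toy corpus (the three example sentences are invented for illustration, and scikit-learn ≥ 1.0 is assumed for get_feature_names_out): CountVectorizer only counts how often each word occurs, while TfidfVectorizer re-weights those counts by inverse document frequency, so a word like 'free' that appears in several documents is weighted down relative to rarer words.

# Minimal sketch on an invented toy corpus: CountVectorizer keeps raw
# term counts, TfidfVectorizer re-weights them by inverse document frequency.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['win a free prize now',       # spam-like
        'meeting agenda for monday',  # ham-like
        'free free prize inside']     # spam-like

count_vec = CountVectorizer()
print(count_vec.fit_transform(docs).toarray())    # raw term counts
print(count_vec.get_feature_names_out())          # the learned vocabulary

tfidf_vec = TfidfVectorizer(norm='l2')
print(tfidf_vec.fit_transform(docs).toarray())    # idf-weighted, l2-normalized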
MLP (multilayer perceptron) module
The MLP is the most common feedforward artificial neural network model. Besides the input and output layers, it can have multiple hidden layers in between; the simplest MLP has only one hidden layer, i.e. a three-layer structure. The layers of a multilayer perceptron are fully connected: the bottom layer is the input layer, the middle layers are hidden layers, and the last is the output layer. Suppose the input is an n-dimensional vector, so the input layer has n neurons, and the output layer applies the softmax function. The hidden-layer neurons are fully connected to the input layer: if the input is a vector X, the hidden layer's output is f(W1X + b1), where W1 is the weight matrix (also called connection coefficients), b1 is the bias, and f can be a common activation such as the sigmoid or tanh function. The output layer then produces softmax(W2X1 + b2), where X1 denotes the hidden-layer output f(W1X + b1). Thus all of an MLP's parameters are the connection weights and biases between layers. The MLP parameters set in this project are: the activation function is the default relu; the solver is lbfgs (a quasi-Newton optimizer) used to optimize the weights; alpha is set to 1e-5; there are 2 hidden layers with 50 neurons each; and the maximum number of iterations is 100.
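Before the full project code, the following is a minimal NumPy sketch of the forward pass just described, h = f(W1X + b1) followed by softmax(W2X1 + b2). The layer sizes (4 inputs, 3 hidden units, 2 classes) and random weights are made up for illustration and are not the project's configuration:

# Minimal sketch of one MLP forward pass; sizes/weights invented for illustration.
import numpy as np

def relu(z):                      # the activation f used by the project
    return np.maximum(0.0, z)

def softmax(z):                   # output layer: probabilities over classes
    e = np.exp(z - z.max())       # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=4)                            # input vector X
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)     # hidden weights W1, bias b1
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)     # output weights W2, bias b2

h = relu(W1 @ x + b1)             # hidden output X1 = f(W1X + b1)
y = softmax(W2 @ h + b2)          # class scores softmax(W2X1 + b2)
print(y, y.sum())                 # two probabilities that sum to 1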
Source code
import os
import re
from html import unescape

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
def html_div(html):
    # Strip HTML: drop the <head> block, replace links with ' HYPERLINK ',
    # remove remaining tags, collapse blank lines, and unescape entities.
    text = re.sub(r'<head.*?>.*?</head>', '', html, flags=re.M | re.S | re.I)
    text = re.sub(r'<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I)
    text = re.sub(r'<.*?>', '', text, flags=re.M | re.S)
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S)
    return unescape(text)
def main(cv=1):
    path_spam = 'C:/cmail/trainspam/'
    path_ham = 'C:/cmail/trainham/'
    path_test = 'C:/cmail/test/'
    dataset_x_train = []
    dataset_y_train = []
    dataset_file_train = []
    # Load spam mails, label 1
    for filename in os.listdir(path_spam):
        with open(path_spam + filename, encoding='ISO-8859-1') as f:
            content = html_div(f.read())
        dataset_x_train.append(content)
        dataset_y_train.append(1)
        dataset_file_train.append(filename)
    # Load ham (normal) mails, label 0
    for filename in os.listdir(path_ham):
        with open(path_ham + filename, encoding='ISO-8859-1') as f:
            content = html_div(f.read())
        dataset_x_train.append(content)
        dataset_y_train.append(0)
        dataset_file_train.append(filename)
    x_train_data = np.array(dataset_x_train)
    y_train_data = np.array(dataset_y_train)
    file_train_data = np.array(dataset_file_train)
    # Load the unlabelled test mails
    dataset_x_test = []
    dataset_file_test = []
    for filename in os.listdir(path_test):
        with open(path_test + filename, encoding='ISO-8859-1') as f:
            content = html_div(f.read())
        dataset_x_test.append(content)
        dataset_file_test.append(filename)
    x_test_data = np.array(dataset_x_test)
    file_test_data = np.array(dataset_file_test)
    # max_df: float in [0.0, 1.0] or int (default 1.0). When building the
    #   vocabulary, a word whose document frequency is greater than max_df is
    #   not used as a keyword: a float means a fraction of the documents, an
    #   int an absolute document count. Ignored if a fixed vocabulary is given.
    # min_df: like max_df, except that a word whose document frequency is
    #   *below* min_df is not used as a keyword.
    # ngram_range (min, max): split the text into phrases of min, min+1, ...,
    #   max words. E.g. for 'I love China', ngram_range=(1, 3) yields 'I',
    #   'love', 'China', 'I love', 'love China' and 'I love China', while
    #   ngram_range=(1, 1) yields only the single words.
    # stop_words: 'english' uses the built-in English stop-word list.
    vectorizer = TfidfVectorizer(min_df=2, max_df=0.6, ngram_range=(1, 2), stop_words='english',
                                 strip_accents='unicode', norm='l2')
    train_x_vec = vectorizer.fit_transform(x_train_data)
    test_x_vec = vectorizer.transform(x_test_data)
    # Hold out 20% of the labelled data to estimate model accuracy
    train_x, test_x, train_y, test_y = train_test_split(
        train_x_vec, y_train_data, test_size=0.2, random_state=cv)
    mlp = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(50, 50),
                        random_state=0, max_iter=100)
    mlp.fit(train_x, train_y)
    acc = accuracy_score(test_y, mlp.predict(test_x))
    print('acc: ', acc)
    # Refit on all labelled data, then label the unlabelled test mails
    mlp.fit(train_x_vec, y_train_data)
    prediction_mlp = mlp.predict(test_x_vec)
    # Write one 'ham <file>' or 'spam <file>' line per test mail
    with open('C:/cmail/re.txt', 'a', encoding='ISO-8859-1') as f:
        for i in range(len(x_test_data)):
            label = 'ham' if int(prediction_mlp[i]) == 0 else 'spam'
            f.write(label + ' ' + file_test_data[i] + '\n')
    return acc
if __name__ == '__main__':
    # Average accuracy over n random train/validation splits
    # (the original note mentions ten-fold cross validation)
    n = 1
    avg_acc = 0.0
    for i in range(1, n + 1):
        print('cv {}:'.format(i))
        avg_acc += float(main(i))
    avg_acc /= n
    print('avg acc: {}'.format(avg_acc))
Dataset link
Mail dataset
Link: https://pan.baidu.com/s/1ncsf4SiqMc0bfGpgKvzOvw
Extraction code: spwd