Spam filtering challenges
2022-07-01 11:05:00 【I'm not zzy1231a】
With the continuing development of network applications, e-mail has become an integral part of people's daily work and life. At the same time, spam plagues many e-mail users: it adds to their reading burden and takes up limited mailbox space. This study therefore proposes a mail classification method based on a multilayer perceptron, used to classify unknown messages and to detect and identify spam.
Goal: solve the binary classification problem of spam filtering.
The dataset contains 3017 mails; the training set includes 2082 ham (normal) mails and 935 spam mails. The dataset link is at the end of this article.
TF-IDF module
TF-IDF is a statistical method based on word frequency and is well suited to classification problems; it is mainly used to evaluate how important a word is to a document within a document set or corpus. If a word has a high term frequency (TF) in one article but rarely appears in other articles, the word is considered to have good discriminative power for classification. When transforming text data into feature vectors, the most common representation is the bag-of-words model: word order is ignored, and each distinct word acts as a separate feature. These distinct words form the vocabulary, and each text can be encoded as a feature vector over that long vocabulary. Words that appear in nearly every text are usually marked as "stop words" and excluded from the feature vector. This article uses the TfidfVectorizer method from the sklearn machine learning library. Compared with CountVectorizer, TfidfVectorizer considers not only how often a word occurs within a text but also how many texts contain the word, which reduces the impact of high-frequency but meaningless words and extracts more meaningful features.
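As a minimal sketch of this difference (the toy corpus below is hypothetical, not drawn from the mail dataset): CountVectorizer assigns raw term counts, while TfidfVectorizer down-weights words that appear in many documents.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# A tiny hypothetical corpus: 'free' appears in two of three texts,
# so TF-IDF scores it lower than rarer, more discriminative words.
corpus = [
    'free money free offer',
    'meeting schedule for monday',
    'free meeting invite',
]
counts = CountVectorizer().fit_transform(corpus)  # raw term counts
tfidf = TfidfVectorizer().fit_transform(corpus)   # counts weighted by inverse document frequency
print(counts.toarray())
print(tfidf.toarray().round(2))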
MLP (multilayer perceptron) module
The MLP is the most common feedforward artificial neural network model. Besides the input and output layers, it can have multiple hidden layers in between; the simplest MLP has a single hidden layer, giving a three-layer structure. Adjacent layers of a multilayer perceptron are fully connected: the bottom layer is the input layer, the middle layers are hidden layers, and the last layer is the output layer. Suppose the input is an n-dimensional vector, so the input layer has n neurons, and the output layer uses the softmax function. The hidden-layer neurons are fully connected to the input layer: if the input layer is represented by a vector X, the hidden layer's output is f(W1X + b1), where W1 is the weight matrix (also called the connection coefficients), b1 is the bias, and f can be a common activation function such as sigmoid or tanh. The output layer produces softmax(W2X1 + b2), where X1 denotes the hidden-layer output f(W1X + b1). All of an MLP's parameters are therefore the connection weights and biases between layers. The multilayer perceptron in this project is configured as follows: the activation function is the default relu; the lbfgs solver (a quasi-Newton optimizer) is chosen to optimize the weights; alpha is set to 1e-5; there are 2 hidden layers with 50 neurons each; and the maximum number of iterations is 100.
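For intuition, here is a minimal numpy sketch of the forward pass described above, using this project's configuration (two hidden layers of 50 neurons, relu activation, softmax output); the input dimension and the random weights are illustrative, not scikit-learn internals.

import numpy as np

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
n = 8                                        # illustrative input dimension
X = rng.random(n)                            # one input vector
W1, b1 = rng.random((50, n)), np.zeros(50)   # input layer -> hidden layer 1
W2, b2 = rng.random((50, 50)), np.zeros(50)  # hidden layer 1 -> hidden layer 2
W3, b3 = rng.random((2, 50)), np.zeros(2)    # hidden layer 2 -> output (ham/spam)

X1 = relu(W1 @ X + b1)     # hidden-layer output f(W1X + b1)
X2 = relu(W2 @ X1 + b2)    # second hidden layer
y = softmax(W3 @ X2 + b3)  # class probabilities
print(y)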
Source code
import os
import re
from html import unescape

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

def html_div(html):
    # Strip the <head> block, replace hyperlink tags with a placeholder token,
    # remove all remaining tags, collapse blank lines, and unescape HTML entities.
    text = re.sub('<head.*?>.*?</head>', '', html, flags=re.M | re.S | re.I)
    text = re.sub(r'<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I)
    text = re.sub('<.*?>', '', text, flags=re.M | re.S)
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S)
    return unescape(text)
def main(cv=1):
    path_spam = 'C:/cmail/trainspam/'
    path_ham = 'C:/cmail/trainham/'
    path_test = 'C:/cmail/test/'
    # Load the labelled training mails: spam is labelled 1, ham 0.
    dataset_x_train = []
    dataset_y_train = []
    dataset_file_train = []
    for filename in os.listdir(path_spam):
        with open(path_spam + filename, encoding='ISO-8859-1') as f:
            dataset_x_train.append(html_div(f.read()))
        dataset_y_train.append(1)
        dataset_file_train.append(filename)
    for filename in os.listdir(path_ham):
        with open(path_ham + filename, encoding='ISO-8859-1') as f:
            dataset_x_train.append(html_div(f.read()))
        dataset_y_train.append(0)
        dataset_file_train.append(filename)
    x_train_data = np.array(dataset_x_train)
    y_train_data = np.array(dataset_y_train)
    file_train_data = np.array(dataset_file_train)
    # Load the unlabelled test mails.
    dataset_x_test = []
    dataset_file_test = []
    for filename in os.listdir(path_test):
        with open(path_test + filename, encoding='ISO-8859-1') as f:
            dataset_x_test.append(html_div(f.read()))
        dataset_file_test.append(filename)
    x_test_data = np.array(dataset_x_test)
    file_test_data = np.array(dataset_file_test)
    vectorizer = TfidfVectorizer(min_df=2, max_df=0.6, ngram_range=(1, 2), stop_words='english',
                                 strip_accents='unicode', norm='l2')
    # max_df: a float in [0.0, 1.0] or an int (default 1.0). When building the vocabulary,
    # a term whose document frequency exceeds max_df is not kept as a keyword. A float is
    # interpreted as a proportion of the documents, an int as an absolute document count.
    # This parameter is ignored if a vocabulary is given explicitly.
    # min_df: like max_df, except that a term whose document frequency is below min_df is dropped.
    # ngram_range=(min, max): split the text into phrases of min, min+1, ..., max words.
    # For 'I Love China', ngram_range=(1, 3) yields 'I', 'Love', 'China', 'I Love',
    # 'Love China' and 'I Love China'; ngram_range=(1, 1) yields only the single words
    # 'I', 'Love' and 'China'.
    # stop_words: set to 'english' to use the built-in English stop-word list.
    train_x_vec = vectorizer.fit_transform(x_train_data)
    test_x_vec = vectorizer.transform(x_test_data)
    # Hold out 20% of the labelled data to estimate the model's accuracy.
    train_x, test_x, train_y, test_y = train_test_split(train_x_vec, y_train_data, test_size=0.2)
    mlp = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(50, 50), random_state=0,
                        max_iter=100)
    mlp.fit(train_x, train_y)
    acc = accuracy_score(test_y, mlp.predict(test_x))
    print('acc: ', acc)
    # Refit on the full training set and label the unknown test mails.
    mlp.fit(train_x_vec, y_train_data)
    prediction_mlp = mlp.predict(test_x_vec)
    # Write a 'ham'/'spam' prediction for each test file.
    with open('C:/cmail/re.txt', 'a', encoding='ISO-8859-1') as f:
        for i in range(len(x_test_data)):
            label = 'ham' if int(prediction_mlp[i]) == 0 else 'spam'
            f.write(label + ' ' + file_test_data[i] + '\n')
    return acc
if __name__ == '__main__':
    main()
    # Optional cross-validation driver; each run splits the training set 4:1 (80/20):
    # n = 1
    # avg_acc = 0.0
    # for i in range(1, n + 1):
    #     print('cv {}:'.format(i))
    #     avg_acc += float(main(i))
    # avg_acc /= n
    # print('avg acc: {}'.format(avg_acc))
Dataset Links
Mail dataset
Link: https://pan.baidu.com/s/1ncsf4SiqMc0bfGpgKvzOvw
Extraction code: spwd