Spam filtering challenges
2022-07-01 11:05:00 【I'm not zzy1231a】
With the steady growth of network applications, e-mail has become an integral part of people's daily work and life. At the same time, spam plagues many e-mail users: it adds a reading burden and takes up limited mailbox space. This study therefore proposes a mail classification method based on a multilayer perceptron, which classifies unknown messages in order to detect and identify spam.
Goal: solve the binary classification problem of spam filtering.
The dataset contains 3017 e-mails; the training set includes 2082 normal (ham) messages and 935 spam messages. A link to the dataset is given at the end of this article.
TF-IDF module
TF-IDF is a term-frequency statistic that is well suited to classification problems; it is mainly used to evaluate how important a word is to one document within a document collection or corpus. If a word has a high term frequency (TF) in one article but rarely appears in other articles, the word is considered to have good discriminative power for classification. When converting text into feature vectors, the most common representation is the bag-of-words model: word order is ignored, and each distinct word that appears becomes a separate feature. The set of these unique words is the vocabulary, so every text can be encoded as a long feature vector over the vocabulary. Words that appear in virtually every text are usually marked as "stop words" and excluded from the feature vector. This article uses the TfidfVectorizer class from the scikit-learn machine learning library. Compared with CountVectorizer, TfidfVectorizer considers not only how often a word appears in a text but also how many texts contain the word. This reduces the impact of high-frequency but uninformative words and surfaces more meaningful features.
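As a quick illustration (a minimal sketch that is not part of the project code; the three toy documents are invented for demonstration), the snippet below contrasts CountVectorizer with TfidfVectorizer: a word that occurs in several documents receives a lower TF-IDF weight than an equally frequent word that is unique to one document.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['free money offer', 'meeting agenda money', 'free free prize']
count_vec = CountVectorizer()
print(count_vec.fit_transform(docs).toarray())      # raw term counts per document
tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(docs)             # l2-normalized tf-idf weights
print(tfidf_vec.get_feature_names_out())
print(weights.toarray().round(2))                   # in doc 0, 'offer' (1 doc) outweighs
                                                    # 'money' and 'free' (2 docs each)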
MLP (multilayer perceptron) module
The MLP is the most common feedforward artificial neural network model. Besides the input and output layers, it can have multiple hidden layers in between; the simplest MLP has only one hidden layer, giving a three-layer structure. The layers of a multilayer perceptron are fully connected: the bottom layer is the input layer, the middle layers are hidden layers, and the last layer is the output layer. Suppose the input is an n-dimensional vector, so the input layer has n neurons, and the output layer applies the softmax function. The hidden-layer neurons are fully connected to the input layer: if the input is denoted by the vector X, the hidden layer's output is f(W1·X + b1), where W1 is the weight matrix (also called the connection coefficients), b1 is the bias, and f can be a common activation such as the sigmoid or tanh function. The output layer computes softmax(W2·X1 + b2), where X1 = f(W1·X + b1) is the hidden layer's output. Thus all the parameters of an MLP are the connection weights and biases between layers. The multilayer perceptron in this project is configured as follows: the activation function is the default relu; the lbfgs solver (a quasi-Newton optimizer) is used to optimize the weights; alpha is set to 1e-5; there are 2 hidden layers with 50 neurons each; and the maximum number of iterations is 100.
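To make the two formulas above concrete, here is a minimal NumPy sketch of a single forward pass through an MLP with one hidden layer (illustration only: the dimensions and random weights are invented, and the actual project uses scikit-learn's MLPClassifier, shown in the source code below).

import numpy as np

def relu(z):
    return np.maximum(0, z)                      # the activation used in this project

def softmax(z):
    e = np.exp(z - z.max())                      # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
n, h, k = 4, 50, 2                               # input dim, hidden width, number of classes
X = rng.normal(size=n)                           # one n-dimensional input vector
W1, b1 = rng.normal(size=(h, n)), np.zeros(h)
W2, b2 = rng.normal(size=(k, h)), np.zeros(k)

X1 = relu(W1 @ X + b1)                           # hidden layer: f(W1*X + b1)
y = softmax(W2 @ X1 + b2)                        # output layer: softmax(W2*X1 + b2)
print(y)                                         # class probabilities, sums to 1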
Source code
import os
import re
from html import unescape
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

def html_div(html):
    # Strip the <head> block, replace hyperlinks with a HYPERLINK token,
    # drop all remaining tags, collapse blank lines, and unescape HTML entities.
    text = re.sub(r'<head.*?>.*?</head>', '', html, flags=re.M | re.S | re.I)
    text = re.sub(r'<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I)
    text = re.sub(r'<.*?>', '', text, flags=re.M | re.S)
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S)
    return unescape(text)
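# A quick sanity check of html_div on a hypothetical input (illustration only):
#   html_div('<head><title>t</title></head><p>Hi <a href="x">link</a></p>')
#   returns 'Hi  HYPERLINK link' - head removed, anchor replaced by the
#   HYPERLINK token, remaining tags stripped, entities unescaped.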
def main(cv=1):
    path_spam = 'C:/cmail/trainspam/'
    path_ham = 'C:/cmail/trainham/'
    path_test = 'C:/cmail/test/'
    # Load the labelled training mails: label 1 = spam, label 0 = ham.
    dataset_x_train = []
    dataset_y_train = []
    dataset_file_train = []
    for filename in os.listdir(path_spam):
        with open(path_spam + filename, encoding='ISO-8859-1') as f:
            dataset_x_train.append(html_div(f.read()))
        dataset_y_train.append(1)
        dataset_file_train.append(filename)
    for filename in os.listdir(path_ham):
        with open(path_ham + filename, encoding='ISO-8859-1') as f:
            dataset_x_train.append(html_div(f.read()))
        dataset_y_train.append(0)
        dataset_file_train.append(filename)
    x_train_data = np.array(dataset_x_train)
    y_train_data = np.array(dataset_y_train)
    file_train_data = np.array(dataset_file_train)
    # Load the unlabelled test mails to be classified.
    dataset_x_test = []
    dataset_file_test = []
    for filename in os.listdir(path_test):
        with open(path_test + filename, encoding='ISO-8859-1') as f:
            dataset_x_test.append(html_div(f.read()))
        dataset_file_test.append(filename)
    x_test_data = np.array(dataset_x_test)
    file_test_data = np.array(dataset_file_test)
    vectorizer = TfidfVectorizer(min_df=2, max_df=0.6, ngram_range=(1, 2), stop_words='english',
                                 strip_accents='unicode', norm='l2')
    # max_df: a float in [0.0, 1.0] or an int with no upper bound (default 1.0). When building the
    #   vocabulary, a term whose document frequency exceeds max_df is not kept as a feature; a float
    #   is read as a proportion of the documents, an int as an absolute document count. Ignored if
    #   an explicit vocabulary is given.
    # min_df: analogous to max_df, except that a term whose document frequency is below min_df is
    #   dropped from the vocabulary.
    # ngram_range=(min, max) splits the text into phrases of min, min+1, ..., max words. With
    #   ngram_range=(1, 3), 'I love China' yields 'I', 'love', 'China', 'I love', 'love China',
    #   and 'I love China'; with ngram_range=(1, 1) only the single words are produced.
    # stop_words: the stop-word list; 'english' selects the built-in English stop words.
    train_x_vec = vectorizer.fit_transform(x_train_data)
    test_x_vec = vectorizer.transform(x_test_data)
    # Hold out 20% of the labelled data to estimate the model's accuracy.
    train_x, test_x, train_y, test_y = train_test_split(train_x_vec, y_train_data, test_size=0.2)
    mlp = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(50, 50), random_state=0, max_iter=100)
    mlp.fit(train_x, train_y)
    prediction_val = mlp.predict(test_x)
    acc = accuracy_score(test_y, prediction_val)
    print('acc: ', acc)
    # Refit on all labelled data, then label the unknown test mails.
    mlp.fit(train_x_vec, y_train_data)
    prediction_mlp = mlp.predict(test_x_vec)
    # Write one line per test mail: predicted class followed by the file name.
    with open('C:/cmail/re.txt', 'a', encoding='ISO-8859-1') as f:
        for i in range(len(x_test_data)):
            if int(prediction_mlp[i]) == 0:
                text = 'ham ' + file_test_data[i]
            else:
                text = 'spam ' + file_test_data[i]
            f.write(text + '\n')
    return acc
if __name__ == '__main__':
    main()
    # Ten-fold cross-validation (the training data are split 4:1 on each run):
    # n = 1
    # avg_acc = 0.0
    # for i in range(1, n + 1):
    #     print('cv {}:'.format(i))
    #     b_acc = main(i)
    #     avg_acc = avg_acc + float(b_acc)
    # avg_acc /= n
    # print('avg acc:', avg_acc)
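As an aside, the averaging loop sketched in the comments above can also be done with scikit-learn's built-in helper. A minimal sketch, assuming the classifier mlp and the vectorized training data train_x_vec, y_train_data from main() are in scope:

from sklearn.model_selection import cross_val_score

# 10-fold cross-validation: train and score the classifier on ten different splits.
scores = cross_val_score(mlp, train_x_vec, y_train_data, cv=10)
print('avg acc:', scores.mean())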
Dataset Links
Mail dataset
Link: https://pan.baidu.com/s/1ncsf4SiqMc0bfGpgKvzOvw
Extraction code: spwd