Spam filtering challenges
2022-07-01 11:05:00 【I'm not zzy1231a】
With the continued growth of network applications, e-mail has become an integral part of people's daily work and life. At the same time, spam plagues many e-mail users: it not only adds a reading burden but also takes up limited mailbox space. This study therefore proposes a mail classification method based on a multilayer perceptron, used to classify unknown messages and to detect and identify spam.
Goal: solve the binary classification problem of spam filtering.
The dataset contains 3,017 emails; the training set consists of 2,082 ham (normal) emails and 935 spam emails. The dataset link is at the end of this article.
TF-IDF module
TF-IDF is a word-frequency-based statistical measure that is well suited to classification problems. It is mainly used to evaluate how important a word is to a document within a document set or corpus. If a word has a high term frequency (TF) in one article but rarely appears in other articles, the word or phrase is considered to have good discriminative power for classification. When transforming text into feature vectors, the most common representation is the bag-of-words model: word order is ignored, and each distinct word becomes a separate feature. The set of unique words forms the vocabulary, so every text can be counted into a feature vector as long as that vocabulary. Words that occur in nearly every text are usually marked as "stop words" and excluded from the feature vector. This article uses the TfidfVectorizer method from the scikit-learn machine learning library. Compared with CountVectorizer, TfidfVectorizer considers not only the frequency of a word within a text but also the number of texts that contain the word. This reduces the impact of high-frequency but meaningless words and mines more meaningful features.
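As a minimal sketch of the difference (a hypothetical three-sentence corpus, not the mail dataset), the snippet below vectorizes the same texts both ways; 'free' appears in every document, so TfidfVectorizer gives it the lowest idf weight:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['spam offer free money',
        'meeting notes for the free project',
        'free free free offer']

# CountVectorizer: raw term counts only
counts = CountVectorizer().fit_transform(docs)

# TfidfVectorizer: term frequency scaled down by document frequency,
# so 'free', which occurs in all three documents, gets the lowest idf
tfidf = TfidfVectorizer()
x = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())
print(x.toarray().round(2))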
MLP (multilayer perceptron) module
The MLP is the most common feedforward artificial neural network model. Besides the input and output layers, it can have multiple hidden layers in between; the simplest MLP has only one hidden layer, i.e. a three-layer structure. The layers of a multilayer perceptron are fully connected: the bottom layer is the input layer, the middle layers are hidden layers, and the last layer is the output layer. Suppose the input is an n-dimensional vector, so the input layer has n neurons, and the output uses the softmax function. The hidden-layer neurons are fully connected to the input layer: if the input is the vector X, the hidden layer's output is f(W1X + b1), where W1 is the weight matrix (also called the connection coefficients), b1 is the bias, and f is commonly the sigmoid or tanh function. The output layer produces softmax(W2X1 + b2), where X1 denotes the hidden-layer output f(W1X + b1). The parameters of an MLP are therefore the connection weights and biases between layers. The multilayer perceptron in this project is configured as follows: the activation function is the default relu; the lbfgs optimizer (a quasi-Newton method) is used to optimize the weights; alpha is set to 1e-5; there are 2 hidden layers with 50 neurons each; and the maximum number of iterations is 100.
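As a minimal sketch (plain NumPy, separate from the project code), the forward pass described above looks like this for a single hidden layer, with h playing the role of X1; the project stacks two such hidden layers:

import numpy as np

def relu(z):
    # the project's chosen activation f
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mlp_forward(x, W1, b1, W2, b2):
    h = relu(W1 @ x + b1)          # hidden layer: f(W1·x + b1)
    return softmax(W2 @ h + b2)    # output layer: softmax(W2·h + b2)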
Source code
import os
import re
from html import unescape

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
def html_div(html):
    # Strip an email body down to plain text: drop the <head> block,
    # replace links with a HYPERLINK marker, remove the remaining tags,
    # collapse blank lines, and unescape HTML entities.
    text = re.sub('<head.*?>.*?</head>', '', html, flags=re.M | re.S | re.I)
    text = re.sub(r'<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I)
    text = re.sub('<.*?>', '', text, flags=re.M | re.S)
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S)
    return unescape(text)
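# Example (hypothetical input, not from the dataset):
#   html_div('<head><title>t</title></head><p>Hi &amp; bye</p>')
# returns 'Hi & bye': the head block and tags are gone, entities unescaped.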
def main(cv=1):
    path_spam = 'C:/cmail/trainspam/'
    path_ham = 'C:/cmail/trainham/'
    path_test = 'C:/cmail/test/'
    # Load the labeled training mails: label 1 = spam, label 0 = ham
    dataset_x_train = []
    dataset_y_train = []
    dataset_file_train = []
    for filename in os.listdir(path_spam):
        with open(path_spam + filename, encoding='ISO-8859-1') as f:
            dataset_x_train.append(html_div(f.read()))
        dataset_y_train.append(1)
        dataset_file_train.append(filename)
    for filename in os.listdir(path_ham):
        with open(path_ham + filename, encoding='ISO-8859-1') as f:
            dataset_x_train.append(html_div(f.read()))
        dataset_y_train.append(0)
        dataset_file_train.append(filename)
    x_train_data = np.array(dataset_x_train)
    y_train_data = np.array(dataset_y_train)
    file_train_data = np.array(dataset_file_train)
    # Load the unlabeled test mails to be classified
    dataset_x_test = []
    dataset_file_test = []
    for filename in os.listdir(path_test):
        with open(path_test + filename, encoding='ISO-8859-1') as f:
            dataset_x_test.append(html_div(f.read()))
        dataset_file_test.append(filename)
    x_test_data = np.array(dataset_x_test)
    file_test_data = np.array(dataset_file_test)
    vectorizer = TfidfVectorizer(min_df=2, max_df=0.6, ngram_range=(1, 2), stop_words='english',
                                 strip_accents='unicode', norm='l2')
    # max_df: a float in [0.0, 1.0] or an int (default 1.0). When building the vocabulary,
    #   a term whose document frequency exceeds this threshold is dropped. A float is read
    #   as a proportion of the corpus documents, an int as an absolute document count.
    #   Ignored if an explicit vocabulary is given.
    # min_df: like max_df, except that a term whose document frequency falls *below* the
    #   threshold is dropped.
    # ngram_range=(min, max): split the text into phrases of min, min+1, ..., max words.
    #   For example, with ngram_range=(1, 3) the text 'I Love China' yields 'I', 'Love',
    #   'China', 'I Love', 'Love China' and 'I Love China'; with (1, 1) only the single
    #   words 'I', 'Love' and 'China'.
    # stop_words='english': use the built-in English stop-word list.
    train_x_vec = vectorizer.fit_transform(x_train_data)
    test_x_vec = vectorizer.transform(x_test_data)
    # Hold out 20% of the labeled data to estimate the model's accuracy
    train_x, test_x, train_y, test_y = train_test_split(train_x_vec, y_train_data, test_size=0.2)
    mlp = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(50, 50), random_state=0, max_iter=100)
    mlp.fit(train_x, train_y)
    acc = accuracy_score(test_y, mlp.predict(test_x))
    print('acc: ', acc)
    # Refit on all labeled data, then label the unknown test mails
    mlp.fit(train_x_vec, y_train_data)
    prediction_mlp = mlp.predict(test_x_vec)
    # Write one 'ham'/'spam' line per test mail
    with open('C:/cmail/re.txt', 'a', encoding='ISO-8859-1') as f:
        for i in range(len(x_test_data)):
            if int(prediction_mlp[i]) == 0:
                text = 'ham ' + file_test_data[i]
            else:
                text = 'spam ' + file_test_data[i]
            f.write(text + '\n')
    return acc
if __name__ == '__main__':
    main()
    # Ten-fold cross validation (sketch; raise n to enable):
    # n = 1
    # avg_acc = 0.0
    # for i in range(1, n + 1):
    #     print('cv {}:'.format(i))
    #     avg_acc += float(main(i))
    # avg_acc /= n
    # print('avg acc: {}'.format(avg_acc))
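The commented-out loop above re-runs main() once per fold. A lighter sketch (assuming it runs inside main(), where the vectorized labeled data train_x_vec and y_train_data are in scope) uses scikit-learn's built-in cross-validation helper:

from sklearn.model_selection import cross_val_score

# 10-fold cross validation over the labeled, vectorized training mails
scores = cross_val_score(
    MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(50, 50),
                  random_state=0, max_iter=100),
    train_x_vec, y_train_data, cv=10)
print('avg acc:', scores.mean())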
Dataset links
Mail dataset
Link: https://pan.baidu.com/s/1ncsf4SiqMc0bfGpgKvzOvw
Extraction code: spwd