Naive Bayes -- Document Classification
2022-07-27 03:09:00 【weixin_ nine hundred and sixty-one million eight hundred and se】
- $W$ is the feature vector of a given document (word-frequency statistics, supplied by the document to be predicted) and $C$ is the document category. Then
$$P(C \mid W) = \frac{P(W \mid C)\,P(C)}{P(W)}$$
- $P(C)$: the probability of each document category (number of documents in the category / total number of documents)
- $P(W \mid C)$: the probability of the features (the words in the document to be predicted) given the category
- The features $W$ are the feature words $F1, F2, F3, \ldots$, so
$$P(C \mid F1, F2, \ldots) = \frac{P(F1, F2, \ldots \mid C)\,P(C)}{P(F1, F2, \ldots)}$$
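- Calculation: naive Bayes assumes the feature words are conditionally independent given the category, so $P(F1, F2, \ldots \mid C) = P(F1 \mid C)\,P(F2 \mid C)\cdots$ and only the per-word probabilities are needed. Each one is estimated on the training documents as $P(F1 \mid C) = N_i / N$
- $N_i$ is the number of times the word $F1$ appears in all documents of category $C$
- $N$ is the total number of occurrences of all words in category $C$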
- Laplace smoothing coefficient
If many entries in the word-frequency list are 0, the product of probabilities is likely to come out as zero. Laplace smoothing avoids this:
$$P(F1 \mid C) = \frac{N_i + \alpha}{N + \alpha m}$$
The Laplace smoothing coefficient $\alpha$ is usually 1, and $m$ is the number of distinct feature words in the training documents.
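The estimates above can be tried on a toy corpus. The following is a minimal sketch, not the scikit-learn implementation: the three training documents, the category names, and the helper names `fit` and `predict` are made up for illustration, and the prior $P(C)$ is assumed to be estimated from document counts. It works in log space so that multiplying many small probabilities does not underflow.

```python
import math
from collections import Counter

# Hypothetical toy training set: (tokens, category).
train_docs = [
    (["ball", "goal", "team"], "sport"),
    (["team", "match", "goal", "goal"], "sport"),
    (["chip", "code", "team"], "tech"),
]

def fit(docs, alpha=1.0):
    """Return log priors, Laplace-smoothed log likelihoods, and the vocabulary."""
    vocab = sorted({w for tokens, _ in docs for w in tokens})
    m = len(vocab)  # number of distinct feature words
    doc_counts = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in doc_counts}
    for tokens, label in docs:
        word_counts[label].update(tokens)
    # Assumption of this sketch: P(C) = documents in C / all documents.
    log_prior = {c: math.log(n / len(docs)) for c, n in doc_counts.items()}
    log_like = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())  # N: all word occurrences in category c
        log_like[c] = {
            w: math.log((counts[w] + alpha) / (total + alpha * m)) for w in vocab
        }
    return log_prior, log_like, vocab

def predict(tokens, log_prior, log_like, vocab):
    """Pick the category with the largest log P(C) + sum of log P(F_i | C)."""
    known = [w for w in tokens if w in vocab]  # words never seen in training are skipped
    scores = {c: log_prior[c] + sum(log_like[c][w] for w in known) for c in log_prior}
    return max(scores, key=scores.get)

log_prior, log_like, vocab = fit(train_docs)
# "chip" never occurs in a sport document, yet its smoothed probability is
# (0 + 1) / (7 + 1 * 6) = 1/13 rather than 0.
p_chip_sport = math.exp(log_like["sport"]["chip"])
print(p_chip_sport)
print(predict(["chip", "code"], log_prior, log_like, vocab))
```

Without smoothing ($\alpha = 0$) the same word would get probability 0 in the sport category, and any document containing it could never be assigned to that category regardless of its other words.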
- Text classification example
a. Load the dataset
from sklearn.datasets import fetch_20newsgroups

# subset: 'train', 'test' or 'all'; selects which part of the dataset to load.
# fetch_* datasets are large and are downloaded on first use; data_home is the download directory.
news = fetch_20newsgroups(subset='all', data_home='data')
print(news.target)
print(news.target_names)
Result:
[10 3 17 ... 3 1 7]
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
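b. Split into training and test sets, then extract features
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Split the data: 75% for training, 25% for testing
x_train, x_test, y_train, y_test = train_test_split(news.data, news.target, test_size=0.25, random_state=1)
# Extract TF-IDF features: learn the vocabulary on the training set only, then reuse it on the test set
tf = TfidfVectorizer()
x_train = tf.fit_transform(x_train)
x_test = tf.transform(x_test)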
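Why `fit_transform` is called on the training set but only `transform` on the test set can be seen in a small pure-Python sketch. It is an illustration, not the library code: it takes pre-tokenized documents (the real `TfidfVectorizer` also lowercases and tokenizes), the helper names `fit_idf` and `transform` are hypothetical, and the weighting is assumed to follow the vectorizer's documented defaults (smoothed IDF, L2 normalisation).

```python
import math
from collections import Counter

def fit_idf(docs):
    """Learn the vocabulary and smoothed IDF weights from the training documents."""
    n = len(docs)
    df = Counter(w for tokens in docs for w in set(tokens))  # document frequency
    vocab = sorted(df)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    return vocab, idf

def transform(tokens, vocab, idf):
    """Map one document to an L2-normalised TF-IDF vector over the fitted vocabulary."""
    tf = Counter(tokens)
    raw = [tf[w] * idf[w] for w in vocab]  # words outside vocab are ignored
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

# Hypothetical toy data.
train = [["goal", "team"], ["team", "code"]]
vocab, idf = fit_idf(train)

# The test document contains a word never seen in training; it gets no column.
test_vec = transform(["goal", "unseen", "goal"], vocab, idf)
print(vocab)
print(test_vec)
```

Fitting on the test set as well would change the vocabulary and the IDF weights, so the columns would no longer match what the classifier was trained on and information from the test set would leak into the features.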
c. Naive Bayes prediction
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB

# Multinomial naive Bayes. alpha is the Laplace smoothing coefficient: alpha is added to the
# numerator and alpha * (number of feature words) to the denominator
mlt = MultinomialNB(alpha=1.0)
mlt.fit(x_train, y_train)
y_predict = mlt.predict(x_test)
print("Predicted article categories:", y_predict)
print("Accuracy:", mlt.score(x_test, y_test))
print("Precision and recall per category:", classification_report(y_test, y_predict, target_names=news.target_names))
Result:
Predicted article categories: [16 19 18 ... 13 7 14]
Accuracy: 0.8518675721561969
Precision and recall per category:
                           precision    recall  f1-score   support
alt.atheism                     0.91      0.77      0.83       199
comp.graphics                   0.83      0.79      0.81       242
comp.os.ms-windows.misc         0.89      0.83      0.86       263
comp.sys.ibm.pc.hardware        0.80      0.83      0.81       262
comp.sys.mac.hardware           0.90      0.88      0.89       234
comp.windows.x                  0.92      0.85      0.88       230
misc.forsale                    0.96      0.67      0.79       257
rec.autos                       0.90      0.87      0.88       265
rec.motorcycles                 0.90      0.95      0.92       251
rec.sport.baseball              0.89      0.96      0.93       226
rec.sport.hockey                0.95      0.98      0.96       262
sci.crypt                       0.76      0.97      0.85       257
sci.electronics                 0.84      0.80      0.82       229
sci.med                         0.97      0.86      0.91       249
sci.space                       0.92      0.96      0.94       256
soc.religion.christian          0.55      0.98      0.70       243
talk.politics.guns              0.76      0.96      0.85       234
talk.politics.mideast           0.93      0.99      0.96       224
talk.politics.misc              0.98      0.56      0.72       197
talk.religion.misc              0.97      0.26      0.41       132

accuracy                                            0.85      4712
macro avg                       0.88      0.84      0.84      4712
weighted avg                    0.87      0.85      0.85      4712
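The columns of the report can be reproduced by hand for one class at a time. The sketch below uses made-up labels and a hypothetical helper `per_class_report`; it is meant only to show how precision, recall, f1-score and support relate. It also explains rows such as talk.religion.misc above (precision 0.97, recall 0.26): the class is rarely predicted, but when it is predicted it is usually correct.

```python
def per_class_report(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for one class, treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # correctly predicted
    fp = sum(1 for t, p in pairs if t != label and p == label)  # predicted but wrong
    fn = sum(1 for t, p in pairs if t == label and p != label)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn  # number of true samples of this class
    return precision, recall, f1, support

# Hypothetical labels for a 3-class problem.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]

for label in sorted(set(y_true)):
    print(label, per_class_report(y_true, y_pred, label))
```

Averaging the per-class values with equal weight corresponds to the "macro avg" row, while weighting them by support corresponds to "weighted avg".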
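d. Compute AUC
import numpy as np
from sklearn.metrics import roc_auc_score

# Collapse the 20 classes (0-19) into a binary problem: class 5 becomes 1, everything else 0
y_test1 = np.where(y_test == 5, 1, 0)
y_predict1 = np.where(y_predict == 5, 1, 0)
# With one score per sample, roc_auc_score expects binary labels, hence the reduction above.
# y_predict1 holds hard 0/1 predictions, so this evaluates a single operating point of the ROC curve.
# Open question: how should AUC be computed for the multi-class case?
print("AUC:", roc_auc_score(y_test1, y_predict1))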
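One common answer to the multi-class question is a one-vs-rest average: compute a binary AUC per class from that class's predicted probability, then average over classes. Recent scikit-learn versions support this through `roc_auc_score(y_true, y_score, multi_class="ovr")`, where `y_score` would be the output of `mlt.predict_proba(x_test)`. The pure-Python sketch below shows the idea on made-up probabilities; the helper names `binary_auc` and `macro_ovr_auc` are hypothetical, and the pairwise loop is quadratic, so it is meant for illustration only.

```python
def binary_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # A tie between a positive and a negative counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_ovr_auc(y_true, proba):
    """Unweighted mean of the one-vs-rest AUC of every class."""
    n_classes = len(proba[0])
    aucs = []
    for k in range(n_classes):
        binary = [1 if t == k else 0 for t in y_true]  # class k versus the rest
        aucs.append(binary_auc(binary, [row[k] for row in proba]))
    return sum(aucs) / n_classes

# Hypothetical class probabilities for four samples and three classes.
y_true = [0, 1, 2, 1]
proba = [
    [0.70, 0.20, 0.10],
    [0.20, 0.50, 0.30],
    [0.10, 0.30, 0.60],
    [0.40, 0.25, 0.35],
]
print(macro_ovr_auc(y_true, proba))

# With hard 0/1 predictions instead of probabilities, most pairs are ties and
# the AUC reduces to the mean of the true-positive and true-negative rates.
print(binary_auc([1, 1, 0, 0], [1, 0, 0, 0]))
```

Using probabilities rather than hard predictions lets the metric reflect how well the positive samples are ranked above the negative ones at every threshold.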