Naive Bayes -- Document Classification
2022-07-27 03:09:00 【weixin_ nine hundred and sixty-one million eight hundred and se】
- $W$ is the feature vector of a given document (word-frequency statistics, supplied by the document to be predicted) and $C$ is the document category. Then
$$P(C|W)=\frac{P(W|C)P(C)}{P(W)}$$
- $P(C)$: the probability of each document category (number of words in that category / total number of words in all documents)
- $P(W|C)$: the probability of the features (the words in the document to be predicted) given the category
- The feature vector $W$ consists of the feature words $F_1, F_2, F_3, \dots$
$$P(C|F_1,F_2,\dots)=\frac{P(F_1,F_2,\dots|C)P(C)}{P(F_1,F_2,\dots)}$$
- Computation: $P(F_1|C)=N_i/N$ (estimated from the training documents)
- $N_i$ is the number of times the word $F_1$ appears in all documents of category $C$
- $N$ is the total number of occurrences of all words in category $C$
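The joint likelihood over all feature words is not estimated directly. Under the conditional-independence ("naive") assumption that gives the method its name, it factors into per-word terms, which is why only the individual $P(F_i|C)$ values need to be counted. A sketch of that step in the notation used above:

```latex
P(F_1, F_2, \dots \mid C) = \prod_i P(F_i \mid C)
\quad\Longrightarrow\quad
P(C \mid F_1, F_2, \dots) \propto P(C) \prod_i P(F_i \mid C)
```

The denominator $P(F_1,F_2,\dots)$ is the same for every category, so it can be dropped when categories are only being compared.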
- Laplace smoothing coefficient
If many entries in the word-frequency table are 0, the computed probability is likely to be zero: a single word that never appears in a category gives $P(F_i|C)=0$, which zeroes the whole result for that category. Smoothing avoids this:
$$P(F_1|C)=\frac{N_i+\alpha}{N+\alpha m}$$
The Laplace smoothing coefficient $\alpha$ is usually 1, and $m$ is the number of feature words in the training documents.
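To see what the smoothing term does, here is a minimal sketch of the formula on a made-up three-document corpus. The documents, the category names and the `word_prob` helper are illustrative assumptions, not part of the original example.

```python
from collections import Counter

# Toy training data (hypothetical): tokenized documents with their category.
train_docs = [
    (["goal", "team", "match"], "sport"),
    (["team", "coach", "goal"], "sport"),
    (["chip", "circuit", "voltage"], "tech"),
]

# Vocabulary of feature words; m is its size.
vocab = sorted({word for words, _ in train_docs for word in words})
m = len(vocab)

# Count how often each word occurs within each category.
counts = {}
for words, category in train_docs:
    counts.setdefault(category, Counter()).update(words)


def word_prob(word, category, alpha=1.0):
    """P(word | category) = (N_i + alpha) / (N + alpha * m)."""
    n_i = counts[category][word]        # occurrences of the word in the category
    n = sum(counts[category].values())  # total word occurrences in the category
    return (n_i + alpha) / (n + alpha * m)


print(word_prob("circuit", "sport", alpha=0.0))  # unsmoothed estimate
print(word_prob("circuit", "sport", alpha=1.0))  # smoothed estimate
```

Here "circuit" never occurs in the "sport" documents, so the unsmoothed estimate is 0. With $\alpha=1$, six word occurrences in "sport" and $m=7$ feature words, the estimate becomes $1/13$, and the smoothed probabilities over the vocabulary still sum to 1.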
- Text classification example
a. Load the dataset
from sklearn.datasets import fetch_20newsgroups

# subset selects which part of the dataset to load: 'train', 'test' or 'all'.
# The fetch_* datasets are large and have to be downloaded; data_home is the download directory.
news = fetch_20newsgroups(subset='all', data_home='data')
print(news.target)
print(news.target_names)
Result:
[10 3 17 ... 3 1 7]
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
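b. Split into training and test sets, then extract features
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Split the data: 75% for training, 25% for testing
x_train, x_test, y_train, y_test = train_test_split(news.data, news.target, test_size=0.25, random_state=1)
# Extract TF-IDF features: fit on the training set only, then apply the same vocabulary to the test set
tf = TfidfVectorizer()
x_train = tf.fit_transform(x_train)
x_test = tf.transform(x_test)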
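The vectorizer is fitted on the training split and only applied to the test split, so both matrices share one vocabulary and words that occur only in test documents are ignored. A minimal sketch on a made-up corpus (the texts and variable names are illustrative, not taken from the 20 Newsgroups data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for the training and test splits.
train_texts = [
    "the team scored a goal",
    "the chip runs at low voltage",
    "the coach praised the team",
]
test_texts = ["a new goal for the chip team", "quantum voltage"]

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_texts)  # learns vocabulary and IDF weights
test_matrix = vectorizer.transform(test_texts)        # reuses them without refitting

# Both matrices have one column per word of the training vocabulary.
print(train_matrix.shape, test_matrix.shape)
# A word seen only in the test texts is not part of the vocabulary.
print("quantum" in vectorizer.vocabulary_)
```

Calling `fit_transform` on the test texts instead would build a different vocabulary, and the resulting columns would no longer line up with what the classifier was trained on.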
c. Naive Bayes prediction
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB

# Naive Bayes prediction. alpha is the Laplace smoothing coefficient: alpha is added to the numerator and alpha * (number of feature words) to the denominator
mlt = MultinomialNB(alpha=1.0)
mlt.fit(x_train, y_train)
y_predict = mlt.predict(x_test)
print("Predicted article categories:", y_predict)
print("Accuracy:", mlt.score(x_test, y_test))
print("Precision and recall for each category:", classification_report(y_test, y_predict, target_names=news.target_names))
Result:
Predicted article categories: [16 19 18 ... 13 7 14]
Accuracy: 0.8518675721561969
Precision and recall for each category:
                          precision    recall  f1-score   support
             alt.atheism       0.91      0.77      0.83       199
           comp.graphics       0.83      0.79      0.81       242
 comp.os.ms-windows.misc       0.89      0.83      0.86       263
comp.sys.ibm.pc.hardware       0.80      0.83      0.81       262
   comp.sys.mac.hardware       0.90      0.88      0.89       234
          comp.windows.x       0.92      0.85      0.88       230
            misc.forsale       0.96      0.67      0.79       257
               rec.autos       0.90      0.87      0.88       265
         rec.motorcycles       0.90      0.95      0.92       251
      rec.sport.baseball       0.89      0.96      0.93       226
        rec.sport.hockey       0.95      0.98      0.96       262
               sci.crypt       0.76      0.97      0.85       257
         sci.electronics       0.84      0.80      0.82       229
                 sci.med       0.97      0.86      0.91       249
               sci.space       0.92      0.96      0.94       256
  soc.religion.christian       0.55      0.98      0.70       243
      talk.politics.guns       0.76      0.96      0.85       234
   talk.politics.mideast       0.93      0.99      0.96       224
      talk.politics.misc       0.98      0.56      0.72       197
      talk.religion.misc       0.97      0.26      0.41       132
                accuracy                           0.85      4712
               macro avg       0.88      0.84      0.84      4712
            weighted avg       0.87      0.85      0.85      4712
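The f1-score column is the harmonic mean of precision and recall, which is why categories with lopsided precision and recall score poorly even when one of the two is high. A small check against two rows of the report (the helper name `f1_from` is only for illustration):

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the report above.
rows = {
    "soc.religion.christian": (0.55, 0.98),
    "talk.religion.misc": (0.97, 0.26),
}

for name, (precision, recall) in rows.items():
    print(f"{name}: {f1_from(precision, recall):.2f}")
```

Both values agree with the report (0.70 and 0.41). talk.religion.misc is almost always correct when it is predicted (precision 0.97) but only about a quarter of its documents are found (recall 0.26), while soc.religion.christian shows the opposite pattern.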
d. Compute AUC
import numpy as np
from sklearn.metrics import roc_auc_score

# Collapse the 20 classes (0-19) into a binary problem: class 5 becomes 1, every other class becomes 0
y_test1 = np.where(y_test == 5, 1, 0)
y_predict1 = np.where(y_predict == 5, 1, 0)
# Called with its default arguments, roc_auc_score accepts only binary labels in y_test; how to compute AUC for multi-class labels is left open here
print("AUC:", roc_auc_score(y_test1, y_predict1))
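The comment above leaves open how AUC could be computed for more than two classes. One option, assuming scikit-learn 0.22 or later, is to pass the class probabilities from `predict_proba` to `roc_auc_score` with `multi_class="ovr"`, which averages the one-vs-rest AUC of each class. A self-contained sketch on a tiny made-up corpus (texts, labels and variable names are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical three-class toy data: 0 = sport, 1 = electronics, 2 = space.
train_texts = [
    "goal team match", "team coach goal",
    "chip circuit voltage", "circuit board chip",
    "rocket orbit launch", "orbit moon rocket",
]
train_labels = [0, 0, 1, 1, 2, 2]
test_texts = [
    "coach match goal", "voltage board", "moon launch",
    "team goal", "chip voltage circuit", "rocket moon",
]
test_labels = [0, 1, 2, 0, 1, 2]

vectorizer = CountVectorizer()
model = MultinomialNB(alpha=1.0)
model.fit(vectorizer.fit_transform(train_texts), train_labels)

# One probability column per class; each row sums to 1.
proba = model.predict_proba(vectorizer.transform(test_texts))

# One-vs-rest AUC, macro-averaged over the three classes.
auc = roc_auc_score(test_labels, proba, multi_class="ovr")
print("One-vs-rest macro AUC:", auc)
```

Passing probabilities rather than hard 0/1 predictions lets the ROC curve use the full ranking of the documents. On the 20 Newsgroups model the corresponding inputs would presumably be `mlt.predict_proba(x_test)` and `y_test`; that variant has not been run here.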