Naive Bayes -- Document Classification
- $W$ is the feature value of a given document (word-frequency counts, provided by the document to be predicted), and $C$ is the document category; then
$P(C|W)=\frac{P(W|C)P(C)}{P(W)}$
- $P(C)$: the prior probability of each document category (number of documents in that category / total number of documents)
- $P(W|C)$: the probability of the features (the words appearing in the document to be predicted) given the category
- The feature $W$ consists of the feature words $F_1, F_2, F_3, \dots$, so
$P(C|F_1,F_2,\dots)=\frac{P(F_1,F_2,\dots|C)P(C)}{P(F_1,F_2,\dots)}$
- Computing $P(F_1|C)=N_i/N$ (estimated from the training documents)
- $N_i$ is the number of times the word $F_1$ appears in all documents of category $C$
- $N$ is the sum of the occurrence counts of all words in the documents of category $C$
- Laplace smoothing coefficient
If the word-frequency table contains many zero counts, the computed probability is very likely to be zero, so a smoothing term is added:
$P(F_1|C)=\frac{N_i+\alpha}{N+\alpha m}$
The Laplace smoothing coefficient $\alpha$ is usually 1, and $m$ is the number of feature words in the training documents.
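A minimal numeric sketch of the smoothing formula, using made-up counts for a single category (the word_counts array below is purely hypothetical), shows how the zero probabilities disappear after smoothing:
import numpy as np

# Hypothetical counts N_i: how often each feature word appears across all training documents of one category
word_counts = np.array([3, 0, 5, 0, 2])
alpha = 1.0                 # Laplace smoothing coefficient
N = word_counts.sum()       # total word occurrences in this category (here 10)
m = len(word_counts)        # number of feature words (here 5)

p_unsmoothed = word_counts / N                        # zero counts give zero probabilities
p_smoothed = (word_counts + alpha) / (N + alpha * m)  # every probability is now strictly positive

print(p_unsmoothed)   # [0.3 0.  0.5 0.  0.2]
print(p_smoothed)     # roughly [0.267 0.067 0.4 0.067 0.2]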
- Text classification example
a. Load the dataset
from sklearn.datasets import fetch_20newsgroups
# subset: 'train', 'test' or 'all' selects which part of the dataset to load;
# fetch_* datasets are large and downloaded on first use, data_home is the download/cache path
news = fetch_20newsgroups(subset='all', data_home='data')
print(news.target)
print(news.target_names)
Result:
[10 3 17 ... 3 1 7]
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
b. Split the data into training and test sets, and extract features
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(news.data, news.target, test_size=0.25, random_state=1)
# Extract TF-IDF features: fit the vectorizer on the training set only,
# then reuse the same vocabulary to transform the test set
tf = TfidfVectorizer()
x_train = tf.fit_transform(x_train)
x_test = tf.transform(x_test)
c. Naive Bayes prediction
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Multinomial naive Bayes; alpha is the Laplace smoothing coefficient:
# alpha is added to each count in the numerator and alpha * (number of feature words) to the denominator
mlt = MultinomialNB(alpha=1.0)
mlt.fit(x_train, y_train)
y_predict = mlt.predict(x_test)
print("Predicted article categories:", y_predict)
print("Accuracy:", mlt.score(x_test, y_test))
print("Precision and recall for each category:", classification_report(y_test, y_predict, target_names=news.target_names))
Result:
Predicted article categories: [16 19 18 ... 13 7 14]
Accuracy: 0.8518675721561969
Precision and recall for each category:
precision recall f1-score support
alt.atheism 0.91 0.77 0.83 199
comp.graphics 0.83 0.79 0.81 242
comp.os.ms-windows.misc 0.89 0.83 0.86 263
comp.sys.ibm.pc.hardware 0.80 0.83 0.81 262
comp.sys.mac.hardware 0.90 0.88 0.89 234
comp.windows.x 0.92 0.85 0.88 230
misc.forsale 0.96 0.67 0.79 257
rec.autos 0.90 0.87 0.88 265
rec.motorcycles 0.90 0.95 0.92 251
rec.sport.baseball 0.89 0.96 0.93 226
rec.sport.hockey 0.95 0.98 0.96 262
sci.crypt 0.76 0.97 0.85 257
sci.electronics 0.84 0.80 0.82 229
sci.med 0.97 0.86 0.91 249
sci.space 0.92 0.96 0.94 256
soc.religion.christian 0.55 0.98 0.70 243
talk.politics.guns 0.76 0.96 0.85 234
talk.politics.mideast 0.93 0.99 0.96 224
talk.politics.misc 0.98 0.56 0.72 197
talk.religion.misc 0.97 0.26 0.41 132
accuracy 0.85 4712
macro avg 0.88 0.84 0.84 4712
weighted avg 0.87 0.85 0.85 4712
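These smoothed conditional probabilities are exactly what MultinomialNB stores internally after fitting. As a quick cross-check against the formulas above, here is a minimal sketch (assuming the mlt model trained above) that inspects them:
import numpy as np
# class_log_prior_ holds log P(C); feature_log_prob_ holds log P(F|C) with Laplace smoothing already applied
print(np.exp(mlt.class_log_prior_)[:5])   # prior probabilities of the first five categories
print(mlt.feature_log_prob_.shape)        # (number of categories, number of feature words)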
d. Calculate the AUC
import numpy as np
from sklearn.metrics import roc_auc_score

# Collapse the 20 classes (labelled 0-19) into a binary problem: class 5 vs. all the rest
y_test1 = np.where(y_test == 5, 1, 0)
y_predict1 = np.where(y_predict == 5, 1, 0)
# roc_auc_score needs binary labels in this form; a multi-class AUC sketch follows below
print("AUC score:", roc_auc_score(y_test1, y_predict1))