Naive Bayes -- Document Classification
- $W$ is the feature value of a given document (word-frequency counts, provided by the document to be predicted), and $C$ is the document category; then
$P(C|W)=\frac{P(W|C)P(C)}{P(W)}$
- $P(C)$: the prior probability of each document category (number of documents in that category / total number of documents)
- $P(W|C)$: the probability of the features (the words appearing in the document to be predicted) given the category
- The feature $W$ consists of the feature words $F_1, F_2, F_3, \dots$, so
$P(C|F_1,F_2,\dots)=\frac{P(F_1,F_2,\dots|C)P(C)}{P(F_1,F_2,\dots)}$
- Computing $P(F_1|C)=N_i/N$ (estimated from the training documents)
- $N_i$ is the number of times the word $F_1$ appears in all documents of category $C$
- $N$ is the sum of the occurrence counts of all words in the documents of category $C$
- Laplace smoothing coefficient
If the word-frequency table contains many zero counts, the computed probability is very likely to be zero, so a smoothing term is added:
$P(F_1|C)=\frac{N_i+\alpha}{N+\alpha m}$
The Laplace smoothing coefficient $\alpha$ is usually 1, and $m$ is the number of feature words in the training documents.
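A minimal numeric sketch of the smoothing formula, using made-up counts for a single category (the word_counts array below is purely hypothetical), shows how the zero probabilities disappear after smoothing:
import numpy as np

# Hypothetical counts N_i: how often each feature word appears across all training documents of one category
word_counts = np.array([3, 0, 5, 0, 2])
alpha = 1.0                 # Laplace smoothing coefficient
N = word_counts.sum()       # total word occurrences in this category (here 10)
m = len(word_counts)        # number of feature words (here 5)

p_unsmoothed = word_counts / N                        # zero counts give zero probabilities
p_smoothed = (word_counts + alpha) / (N + alpha * m)  # every probability is now strictly positive

print(p_unsmoothed)   # [0.3 0.  0.5 0.  0.2]
print(p_smoothed)     # roughly [0.267 0.067 0.4 0.067 0.2]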
- Text classification example
a. Load the dataset
from sklearn.datasets import fetch_20newsgroups
# subset: 'train', 'test' or 'all' selects which part of the dataset to load;
# fetch_* datasets are large and downloaded on first use, data_home is the download/cache path
news = fetch_20newsgroups(subset='all', data_home='data')
print(news.target)
print(news.target_names)
Result:
[10 3 17 ... 3 1 7]
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
b. Split the data into training and test sets, and extract features
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(news.data, news.target, test_size=0.25, random_state=1)
# Extract TF-IDF features: fit the vectorizer on the training set only,
# then reuse the same vocabulary to transform the test set
tf = TfidfVectorizer()
x_train = tf.fit_transform(x_train)
x_test = tf.transform(x_test)
c. Naive Bayes prediction
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Multinomial naive Bayes; alpha is the Laplace smoothing coefficient:
# alpha is added to each count in the numerator and alpha * (number of feature words) to the denominator
mlt = MultinomialNB(alpha=1.0)
mlt.fit(x_train, y_train)
y_predict = mlt.predict(x_test)
print("Predicted article categories:", y_predict)
print("Accuracy:", mlt.score(x_test, y_test))
print("Precision and recall for each category:", classification_report(y_test, y_predict, target_names=news.target_names))
Result:
Predicted article categories: [16 19 18 ... 13 7 14]
Accuracy: 0.8518675721561969
Precision and recall for each category:
precision recall f1-score support
alt.atheism 0.91 0.77 0.83 199
comp.graphics 0.83 0.79 0.81 242
comp.os.ms-windows.misc 0.89 0.83 0.86 263
comp.sys.ibm.pc.hardware 0.80 0.83 0.81 262
comp.sys.mac.hardware 0.90 0.88 0.89 234
comp.windows.x 0.92 0.85 0.88 230
misc.forsale 0.96 0.67 0.79 257
rec.autos 0.90 0.87 0.88 265
rec.motorcycles 0.90 0.95 0.92 251
rec.sport.baseball 0.89 0.96 0.93 226
rec.sport.hockey 0.95 0.98 0.96 262
sci.crypt 0.76 0.97 0.85 257
sci.electronics 0.84 0.80 0.82 229
sci.med 0.97 0.86 0.91 249
sci.space 0.92 0.96 0.94 256
soc.religion.christian 0.55 0.98 0.70 243
talk.politics.guns 0.76 0.96 0.85 234
talk.politics.mideast 0.93 0.99 0.96 224
talk.politics.misc 0.98 0.56 0.72 197
talk.religion.misc 0.97 0.26 0.41 132
accuracy 0.85 4712
macro avg 0.88 0.84 0.84 4712
weighted avg 0.87 0.85 0.85 4712
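These smoothed conditional probabilities are exactly what MultinomialNB stores internally after fitting. As a quick cross-check against the formulas above, here is a minimal sketch (assuming the mlt model trained above) that inspects them:
import numpy as np
# class_log_prior_ holds log P(C); feature_log_prob_ holds log P(F|C) with Laplace smoothing already applied
print(np.exp(mlt.class_log_prior_)[:5])   # prior probabilities of the first five categories
print(mlt.feature_log_prob_.shape)        # (number of categories, number of feature words)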
d. Calculate the AUC
import numpy as np
from sklearn.metrics import roc_auc_score

# Collapse the 20 classes (labelled 0-19) into a binary problem: class 5 vs. all the rest
y_test1 = np.where(y_test == 5, 1, 0)
y_predict1 = np.where(y_predict == 5, 1, 0)
# roc_auc_score needs binary labels in this form; a multi-class AUC sketch follows below
print("AUC score:", roc_auc_score(y_test1, y_predict1))