Machine Learning Fundamentals - Bayesian Analysis - 14
2022-07-28 12:52:00 【gemoumou】
Bayesian analysis

Bayes - Iris classification
# Import algorithm package and data set
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.naive_bayes import MultinomialNB,BernoulliNB,GaussianNB
# Load data
iris = datasets.load_iris()
x_train,x_test,y_train,y_test = train_test_split(iris.data, iris.target)
# Gaussian naive Bayes suits the continuous iris measurements
gnb = GaussianNB()
gnb.fit(x_train, y_train)

# classification_report expects (y_true, y_pred)
print(classification_report(y_test, gnb.predict(x_test)))
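All three naive Bayes variants are imported above; as a minimal, hedged comparison on the same split (exact scores depend on the random train/test split, and BernoulliNB is a poor fit for continuous features, included only for contrast):

# Compare the three imported naive Bayes variants on the same iris split
for Model in (GaussianNB, MultinomialNB, BernoulliNB):
    clf = Model()
    clf.fit(x_train, y_train)
    print(Model.__name__, clf.score(x_test, y_test))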


Bayes - News classification
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
news = fetch_20newsgroups(subset='all')
print(news.target_names)
print(len(news.data))
print(len(news.target))

print(len(news.target_names))

print(news.data[0])

print(news.target[0])
print(news.target_names[news.target[0]])

x_train,x_test,y_train,y_test = train_test_split(news.data,news.target)
# train = fetch_20newsgroups(subset='train')
# x_train = train.data
# y_train = train.target
# test = fetch_20newsgroups(subset='test')
# x_test = test.data
# y_test = test.target

from sklearn.feature_extraction.text import CountVectorizer
texts=["dog cat fish","dog cat cat","fish bird", 'bird']
cv = CountVectorizer()
cv_fit=cv.fit_transform(texts)
# Vocabulary and the document-term count matrix
# (on newer scikit-learn versions use cv.get_feature_names_out() instead)
print(cv.get_feature_names())
print(cv_fit.toarray())
print(cv_fit.toarray().sum(axis=0))

from sklearn import model_selection
from sklearn.naive_bayes import MultinomialNB
cv = CountVectorizer()
cv_data = cv.fit_transform(x_train)
mul_nb = MultinomialNB()
scores = model_selection.cross_val_score(mul_nb, cv_data, y_train, cv=3, scoring='accuracy')
print("Accuracy: %0.3f" % (scores.mean()))

TfidfVectorizer uses a more refined weighting scheme called Term Frequency-Inverse Document Frequency (TF-IDF). It is a statistical measure of how important a word is to a document within a corpus. Intuitively, it looks for words that occur frequently in the current document while comparing that against how often they occur across the whole corpus. This normalization avoids giving weight to words that appear everywhere yet say little about any particular document (for example, 'a' and 'and' are very frequent in English but contribute little to representing a specific text).
from sklearn.feature_extraction.text import TfidfVectorizer
# Text document list
text = ["The quick brown fox jumped over the lazy dog.",
"The dog.",
"The fox"]
# Create the transformer
vectorizer = TfidfVectorizer()
# Tokenize the documents and build the vocabulary
vectorizer.fit(text)
# Summarize what was learned
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# Encode one document
vector = vectorizer.transform([text[0]])
# Summarize the encoded document
print(vector.shape)
print(vector.toarray())
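For reference, the default weighting that TfidfVectorizer applies (smooth IDF plus l2 row normalization, to the best of my understanding of the default parameters) can be reproduced by hand from raw counts and compared with vector.toarray() above:

# A hedged sketch of the default TF-IDF weighting (smooth_idf=True, norm='l2'),
# computed from CountVectorizer counts on the same toy corpus
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The quick brown fox jumped over the lazy dog.", "The dog.", "The fox"]
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                  # document frequency of each term
idf = np.log((1 + n_docs) / (1 + df)) + 1      # smoothed inverse document frequency
weights = counts * idf                         # tf * idf
weights /= np.linalg.norm(weights, axis=1, keepdims=True)  # l2-normalize each row
print(weights[0])                              # should be close to the encoding of text[0] above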

# Create the transformer
vectorizer = TfidfVectorizer()
# Tokenize the training texts and build the vocabulary
tfidf_train = vectorizer.fit_transform(x_train)
scores = model_selection.cross_val_score(mul_nb, tfidf_train, y_train, cv=3, scoring='accuracy')
print("Accuracy: %0.3f" % (scores.mean()))

def get_stop_words():
    result = set()
    with open('stopwords_en.txt', 'r') as f:
        for line in f:
            result.add(line.strip())
    return result
# Load stop words
stop_words = get_stop_words()
# Create the transformer, this time filtering out stop words
vectorizer = TfidfVectorizer(stop_words=stop_words)
mul_nb = MultinomialNB(alpha=0.01)
# Tokenize the training texts and build the vocabulary
tfidf_train = vectorizer.fit_transform(x_train)
scores = model_selection.cross_val_score(mul_nb, tfidf_train, y_train, cv=3, scoring='accuracy')
print("Accuracy: %0.3f" % (scores.mean()))

# Vectorize the full corpus, then split it into training and test sets
tfidf_data = vectorizer.fit_transform(news.data)
x_train,x_test,y_train,y_test = train_test_split(tfidf_data,news.target)
mul_nb.fit(x_train,y_train)
print(mul_nb.score(x_train, y_train))
print(mul_nb.score(x_test, y_test))
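For a per-class view like the iris example earlier, classification_report (already imported above) can be applied to the held-out split; the exact numbers depend on the random split:

# Per-class precision/recall/F1 on the held-out test set
y_pred = mul_nb.predict(x_test)
print(classification_report(y_test, y_pred, target_names=news.target_names))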

The bag-of-words model (Bag of Words)
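A bag-of-words representation describes a document only by which words occur in it and how often, ignoring word order; the CountVectorizer example above implements exactly this idea. Below is a minimal sketch of the same representation built by hand with the standard library (the toy sentences are the ones used earlier and are only illustrative):

# A hand-rolled bag of words using collections.Counter, equivalent in spirit
# to the CountVectorizer demo above
from collections import Counter

docs = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
bags = [Counter(doc.split()) for doc in docs]
vocabulary = sorted(set(word for bag in bags for word in bag))
matrix = [[bag[word] for word in vocabulary] for bag in bags]
print(vocabulary)
for row in matrix:
    print(row)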
Bayesian spell checker
Spell checker principle
Of all the correctly spelled words, we want to find the word c that, given the typed word w, has the largest conditional probability. By Bayes' theorem:
P(c|w) = P(w|c) P(c) / P(w)
For example: 'appla' is the typed word w, and 'apple' and 'apply' are candidate correct words c. P(w) is the same for both 'apple' and 'apply', so we can drop it from the formula and simply compare:
P(w|c) P(c)
P(c) is the probability that the correctly spelled word c appears in an English text, in other words, how common c is. Assuming that the more often a word appears in the corpus the more likely it is to be the intended spelling, we can use raw word counts in place of this probability. For example, P('the') is relatively high in English, while P('zxzxzxzyy') is close to 0 (supposing the latter were even a word).
P(w|c) is the probability of typing w when the user meant c, i.e. the probability of mistyping c as w.
import re
# Read the corpus
text = open('big.txt').read()
# Lowercase the text and keep only runs of a-z characters (i.e. the words)
text = re.findall('[a-z]+', text.lower())
# Count how many times each word appears in the corpus
dic_words = {}
for t in text:
    dic_words[t] = dic_words.get(t, 0) + 1
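As a quick sanity check, the most frequent words in the corpus should be common English function words (exact counts depend on big.txt):

# The five most frequent corpus words; output depends on big.txt
top5 = sorted(dic_words.items(), key=lambda kv: kv[1], reverse=True)[:5]
print(top5)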

Edit distance:
The edit distance between two words is defined as the number of insertions (inserting a single letter), deletions (deleting a single letter), transpositions (swapping two adjacent letters), and replacements (changing one letter into another) needed to turn one word into the other.
# alphabet
alphabet = 'abcdefghijklmnopqrstuvwxyz'
# Return the set of all strings whose edit distance from word is 1
def edits1(word):
    n = len(word)
    return set([word[0:i]+word[i+1:] for i in range(n)] +                     # deletion
               [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + # transposition
               [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] + # alteration
               [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])  # insertion
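For a word of length n this generates roughly 54n + 25 candidate strings before deduplication, most of which are not real words. For example (the exact count is simply whatever the expression above yields):

# Number of candidates at edit distance 1 from 'something' (a few hundred strings)
print(len(edits1('something')))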

# Return the set of all strings whose edit distance from word is 2
# (among these, only actual dictionary words will later be used as candidates)
def edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1))
e1 = edits1('something')
e2 = edits2('something')
print(len(e1) + len(e2))

There are as many as 114,818 strings within edit distance 1 or 2 of 'something'.
Optimization: use only real words as candidates. After this optimization, edits2 yields only 3 words for 'something': 'smoothing', 'something' and 'soothing'.
Solving for P(w|c): in reality, mistyping one vowel as another is more likely than mistyping a consonant (people often type hello as hallo), the first letter of a word is misspelled relatively rarely, and so on. For simplicity, a cruder rule is chosen here: a correct word at edit distance 1 takes priority over one at edit distance 2, and a word at edit distance 0 takes priority over one at edit distance 1. In general, turning hello into hallo is considerably more likely than turning hello into halo.
# Keep only the candidates that are real words (i.e. appear in the corpus dictionary)
def known(words):
    w = set()
    for word in words:
        if word in dic_words:
            w.add(word)
    return w

# First generate candidates by edit distance, then pick the best-matching word
def correct(word):
    # Get candidate words; `or` short-circuits, so if known(...) returns a
    # non-empty set it is used as the candidates and the later calls are skipped
    candidates = known([word]) or known(edits1(word)) or known(edits2(word)) or word
    # No similar word was found in the dictionary
    if word == candidates:
        return word
    # Return the candidate with the highest corpus frequency
    max_num = 0
    for c in candidates:
        if dic_words[c] >= max_num:
            max_num = dic_words[c]
            candidate = c
    return candidate
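A few example calls, as a sanity check (the exact corrections depend on the word frequencies in big.txt):

# Example usage; the outputs depend on the big.txt corpus
print(correct('appla'))      # expected to return 'apple' or 'apply', whichever is more frequent
print(correct('something'))  # already a dictionary word, returned unchanged
print(correct('zzzzzzzz'))   # no dictionary word within edit distance 2, returned as-is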
