Machine Learning Fundamentals: Bayesian Analysis - 14
2022-07-28 12:52:00 【gemoumou】
Bayesian analysis
Bayes - iris
# Import algorithm package and data set
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.naive_bayes import MultinomialNB,BernoulliNB,GaussianNB
# Load data
iris = datasets.load_iris()
x_train,x_test,y_train,y_test = train_test_split(iris.data, iris.target)
# GaussianNB assumes continuous, roughly Gaussian features, which suits the iris measurements
mul_nb = GaussianNB()
mul_nb.fit(x_train, y_train)

print(classification_report(y_test, mul_nb.predict(x_test)))
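confusion_matrix is imported above but never used; as a small follow-up (assuming the same fitted model and train/test split), it can be printed alongside the report:

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, mul_nb.predict(x_test)))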


Bayes - News classification
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
news = fetch_20newsgroups(subset='all')
print(news.target_names)
print(len(news.data))
print(len(news.target))

print(len(news.target_names))

news.data[0]

print(news.target[0])
print(news.target_names[news.target[0]])

x_train,x_test,y_train,y_test = train_test_split(news.data,news.target)
# train = fetch_20newsgroups(subset='train')
# x_train = train.data
# y_train = train.target
# test = fetch_20newsgroups(subset='test')
# x_test = test.data
# y_test = test.target

from sklearn.feature_extraction.text import CountVectorizer
texts=["dog cat fish","dog cat cat","fish bird", 'bird']
cv = CountVectorizer()
cv_fit=cv.fit_transform(texts)
#
print(cv.get_feature_names())  # on scikit-learn >= 1.2, use cv.get_feature_names_out() instead
print(cv_fit.toarray())
print(cv_fit.toarray().sum(axis=0))

from sklearn import model_selection
from sklearn.naive_bayes import MultinomialNB
cv = CountVectorizer()
cv_data = cv.fit_transform(x_train)
mul_nb = MultinomialNB()
scores = model_selection.cross_val_score(mul_nb, cv_data, y_train, cv=3, scoring='accuracy')
print("Accuracy: %0.3f" % (scores.mean()))

TfidfVectorizer uses a more advanced weighting scheme called Term Frequency-Inverse Document Frequency (TF-IDF). This is a statistic that measures how important a word is to a document within a corpus. Intuitively, it compares a word's frequency in the current document with its frequency across the whole corpus and highlights words that are frequent in this document but rare elsewhere. The result is a normalized weighting that keeps words which occur very often yet say little about a particular document (for example, 'a' and 'and' in English) from dominating the representation.
from sklearn.feature_extraction.text import TfidfVectorizer
# Text document list
text = ["The quick brown fox jumped over the lazy dog.",
"The dog.",
"The fox"]
# Create transform functions
vectorizer = TfidfVectorizer()
# Entry and vocabulary creation
vectorizer.fit(text)
# summary
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# Coding documents
vector = vectorizer.transform([text[0]])
# Summary coding document
print(vector.shape)
print(vector.toarray())

# Create transform functions
vectorizer = TfidfVectorizer()
# Entry and vocabulary creation
tfidf_train = vectorizer.fit_transform(x_train)
scores = model_selection.cross_val_score(mul_nb, tfidf_train, y_train, cv=3, scoring='accuracy')
print("Accuracy: %0.3f" % (scores.mean()))

def get_stop_words():
    result = set()
    with open('stopwords_en.txt', 'r') as f:
        for line in f:
            result.add(line.strip())
    return result
# Load stop words
stop_words = get_stop_words()
# Create transform functions
vectorizer = TfidfVectorizer(stop_words=stop_words)
mul_nb = MultinomialNB(alpha=0.01)
# Entry and vocabulary creation
tfidf_train = vectorizer.fit_transform(x_train)
scores = model_selection.cross_val_score(mul_nb, tfidf_train, y_train, cv=3, scoring='accuracy')
print("Accuracy: %0.3f" % (scores.mean()))

# Split the data set
tfidf_data = vectorizer.fit_transform(news.data)
x_train,x_test,y_train,y_test = train_test_split(tfidf_data,news.target)
mul_nb.fit(x_train,y_train)
print(mul_nb.score(x_train, y_train))
print(mul_nb.score(x_test, y_test))
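As a quick sanity check (a minimal sketch; the predicted label depends on the random split), the fitted model can classify a single held-out document and map the numeric label back to a newsgroup name:

# Predict the newsgroup of the first held-out document
pred = mul_nb.predict(x_test[0])
print(news.target_names[pred[0]])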

The bag-of-words model (Bag of Words)
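A minimal, self-contained sketch of the idea (the two example sentences are made up for illustration): a bag-of-words vector simply counts how often each vocabulary word occurs in a document and ignores word order.

# Build a tiny bag-of-words representation by hand
docs = ["dog cat fish", "dog cat cat"]
vocab = sorted({w for d in docs for w in d.split()})
vectors = [[d.split().count(w) for w in vocab] for d in docs]
print(vocab)     # ['cat', 'dog', 'fish']
print(vectors)   # [[1, 1, 1], [2, 1, 0]]

CountVectorizer, used earlier, produces exactly this kind of count matrix, just as a sparse matrix over a much larger vocabulary.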
Bayesian spell checker
Spell checker principle
Among all correctly spelled words, we want to find the word c that is most likely given the observed (possibly misspelled) word w, i.e. the c that maximizes the conditional probability P(c|w). By Bayes' theorem:
P(c|w) = P(w|c) P(c) / P(w)
For example: appla is the observed word w, and apple and apply are candidate correct words c. P(w) is the same for both apple and apply, so it can be dropped from the comparison and we only need to maximize:
P(w|c) P(c)
P(c) is the probability that the correctly spelled word c appears in a text, in other words, how often c occurs in English writing.
We assume that the more often a word appears in the corpus, the more likely it is to be the intended correct spelling, so this probability can be approximated by the word's occurrence count. For example, in English the probability P('the') is relatively high, while P('zxzxzxzyy') is close to 0 (assuming the latter were even a word).
P(w|c) is the probability that the user types w when they intended to type c, i.e. the probability of mistyping c as w.
import re
# Read the content
text = open('big.txt').read()
# Convert to lowercase and keep only runs of the letters a-z (i.e. individual words)
text = re.findall('[a-z]+', text.lower())
# Count how often each word occurs; this count approximates P(c)
dic_words = {}
for t in text:
    dic_words[t] = dic_words.get(t, 0) + 1
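As a quick check of the P(c) approximation (the counts depend on the contents of big.txt, so the exact numbers will vary):

# Frequent words get large counts, non-words get 0
print(dic_words.get('the', 0))        # a large count
print(dic_words.get('zxzxzxzyy', 0))  # 0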

Edit distance:
The edit distance between two words is the number of insertions (inserting a single letter), deletions (deleting a single letter), transpositions (swapping two adjacent letters), and substitutions (changing one letter into another) needed to turn one word into the other.
# alphabet
alphabet = 'abcdefghijklmnopqrstuvwxyz'

# Return the set of all strings at edit distance 1 from word
def edits1(word):
    n = len(word)
    return set([word[0:i] + word[i+1:] for i in range(n)] +                           # deletion
               [word[0:i] + word[i+1] + word[i] + word[i+2:] for i in range(n-1)] +   # transposition
               [word[0:i] + c + word[i+1:] for i in range(n) for c in alphabet] +     # alteration
               [word[0:i] + c + word[i:] for i in range(n+1) for c in alphabet])      # insertion

# Return the set of all strings at edit distance 2 from word
# Of these, only strings that are real dictionary words will be used as candidates
def edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1))

e1 = edits1('something')
e2 = edits2('something')
len(e1) + len(e2)

There are as many as 114,818 strings at edit distance 1 or 2 from 'something'.
Optimization: use only real dictionary words as candidates. With this restriction, the edit-distance-2 candidates for 'something' shrink to just 3 words: 'smoothing', 'something' and 'soothing'.
Solving for P(w|c): normally, mistyping one vowel as another is more likely than mistyping a consonant (people often turn hello into hallo, for example), and getting the first letter of a word wrong is comparatively rare, and so on. For simplicity, however, a cruder rule is used: a correct word at edit distance 1 takes priority over one at edit distance 2, and a correct word at edit distance 0 takes priority over one at edit distance 1. Turning hello into hallo (distance 1) is generally more likely than turning hello into halo (distance 2).
def known(words):
    w = set()
    for word in words:
        if word in dic_words:
            w.add(word)
    return w

# First generate candidates by edit distance, then pick the best-matching word among them
def correct(word):
    # Get candidate words
    # If known(...) returns a non-empty set, candidates takes that set and the remaining calls are skipped
    candidates = known([word]) or known(edits1(word)) or known(edits2(word)) or word
    # No similar word was found in the dictionary
    if word == candidates:
        return word
    # Return the candidate with the highest frequency
    max_num = 0
    for c in candidates:
        if dic_words[c] >= max_num:
            max_num = dic_words[c]
            candidate = c
    return candidate
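A short usage sketch (the exact corrections depend on the word frequencies in big.txt, so the results in the comments are only what one would typically expect):

print(correct('appla'))      # typically 'apple' or 'apply', whichever is more frequent
print(correct('knon'))       # typically 'known' or 'know'
print(correct('something'))  # already a dictionary word, returned unchanged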
