Fundamentals of Machine Learning: Bayesian Analysis - 14
2022-07-28 12:52:00 【gemoumou】
Bayesian analysis
Bayes - Iris
# Import algorithm packages and the data set
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB

# Load the data and split it into training and test sets
iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target)

# Gaussian naive Bayes suits the continuous iris measurements
gnb = GaussianNB()
gnb.fit(x_train, y_train)

print(classification_report(y_test, gnb.predict(x_test)))
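
Although only GaussianNB is used here, the imports also pull in MultinomialNB and BernoulliNB. As a quick illustration (a sketch added here, reusing the split above, not part of the original walkthrough), the three variants can be compared directly: the Gaussian variant is the natural fit for continuous measurements, while the multinomial and Bernoulli variants target count and binary features.

# Compare the three naive Bayes variants on the same train/test split
for clf in (GaussianNB(), MultinomialNB(), BernoulliNB()):
    clf.fit(x_train, y_train)
    print(clf.__class__.__name__, clf.score(x_test, y_test))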


Bayes - News classification
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
news = fetch_20newsgroups(subset='all')
print(news.target_names)
print(len(news.data))
print(len(news.target))

print(len(news.target_names))

print(news.data[0])

print(news.target[0])
print(news.target_names[news.target[0]])

x_train,x_test,y_train,y_test = train_test_split(news.data,news.target)
# train = fetch_20newsgroups(subset='train')
# x_train = train.data
# y_train = train.target
# test = fetch_20newsgroups(subset='test')
# x_test = test.data
# y_test = test.target

from sklearn.feature_extraction.text import CountVectorizer
texts=["dog cat fish","dog cat cat","fish bird", 'bird']
cv = CountVectorizer()
cv_fit=cv.fit_transform(texts)
# Vocabulary and the document-term count matrix
print(cv.get_feature_names())
print(cv_fit.toarray())
print(cv_fit.toarray().sum(axis=0))
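
For reference, a worked check of what these prints produce for the toy corpus (assuming scikit-learn's default tokenization, which lowercases the text and sorts the vocabulary alphabetically):

# Vocabulary: ['bird', 'cat', 'dog', 'fish']
# "dog cat fish" -> [0, 1, 1, 1]
# "dog cat cat"  -> [0, 2, 1, 0]
# "fish bird"    -> [1, 0, 0, 1]
# "bird"         -> [1, 0, 0, 0]
# Column sums    -> [2, 3, 2, 2]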

from sklearn import model_selection
from sklearn.naive_bayes import MultinomialNB
cv = CountVectorizer()
cv_data = cv.fit_transform(x_train)
mul_nb = MultinomialNB()
scores = model_selection.cross_val_score(mul_nb, cv_data, y_train, cv=3, scoring='accuracy')
print("Accuracy: %0.3f" % (scores.mean()))

TfidfVectorizer uses a more refined weighting scheme called Term Frequency-Inverse Document Frequency (TF-IDF). It is a statistic that measures how important a word is to a document within a corpus. Intuitively, it compares a word's frequency across the whole corpus with its frequency in the current document, favoring words that are frequent in the document but rare elsewhere. This normalizes the raw counts and keeps words that appear everywhere, yet say little about any particular document, from dominating (for example, 'a' and 'and' occur very frequently in English but contribute almost nothing to characterizing a text).
from sklearn.feature_extraction.text import TfidfVectorizer
# List of text documents
text = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The fox"]
# Create the vectorizer
vectorizer = TfidfVectorizer()
# Tokenize the documents and build the vocabulary
vectorizer.fit(text)
# Summary of the fitted vocabulary and IDF weights
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# Encode one document
vector = vectorizer.transform([text[0]])
# Summarize the encoded vector
print(vector.shape)
print(vector.toarray())
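
To connect these numbers with the TF-IDF definition, the following sketch (an addition, not from the original post) recomputes the smoothed IDF of one term by hand, assuming scikit-learn's defaults (smooth_idf=True; the final per-document weights are count * idf followed by L2 normalization):

import numpy as np
# Recompute the IDF of 'fox' by hand and compare with the fitted vectorizer
n_docs = len(text)
term = 'fox'
df = sum(1 for doc in text if term in doc.lower())   # documents containing the term
idf = np.log((1 + n_docs) / (1 + df)) + 1            # smoothed inverse document frequency
print(idf, vectorizer.idf_[vectorizer.vocabulary_[term]])  # the two values should match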

# Create the vectorizer
vectorizer = TfidfVectorizer()
# Tokenize the training texts and build the vocabulary
tfidf_train = vectorizer.fit_transform(x_train)
scores = model_selection.cross_val_score(mul_nb, tfidf_train, y_train, cv=3, scoring='accuracy')
print("Accuracy: %0.3f" % (scores.mean()))

def get_stop_words():
    result = set()
    for line in open('stopwords_en.txt', 'r').readlines():
        result.add(line.strip())
    return result

# Load the stop words
stop_words = get_stop_words()
# Create the vectorizer with stop words removed
vectorizer = TfidfVectorizer(stop_words=stop_words)
mul_nb = MultinomialNB(alpha=0.01)
# Tokenize the training texts and build the vocabulary
tfidf_train = vectorizer.fit_transform(x_train)
scores = model_selection.cross_val_score(mul_nb, tfidf_train, y_train, cv=3, scoring='accuracy')
print("Accuracy: %0.3f" % (scores.mean()))
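
If a stopwords_en.txt file is not at hand, scikit-learn also ships a built-in English stop-word list; a minimal alternative (a suggestion added here, not what the original post does) is:

# Alternative: rely on scikit-learn's built-in English stop-word list
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_train = vectorizer.fit_transform(x_train)
scores = model_selection.cross_val_score(MultinomialNB(alpha=0.01), tfidf_train, y_train, cv=3, scoring='accuracy')
print("Accuracy: %0.3f" % (scores.mean()))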

# Vectorize the full corpus, then split it into training and test sets
tfidf_data = vectorizer.fit_transform(news.data)
x_train, x_test, y_train, y_test = train_test_split(tfidf_data, news.target)
mul_nb.fit(x_train, y_train)
print(mul_nb.score(x_train, y_train))
print(mul_nb.score(x_test, y_test))
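
To classify new text with the fitted model, transform the raw string with the same vectorizer and map the predicted label back to a category name (a minimal sketch; the sample sentence is a made-up example, not from the original post):

# Predict the category of a new, unseen document
sample = ["The graphics card renders 3D scenes much faster than the CPU"]
sample_vec = vectorizer.transform(sample)   # reuse the fitted TF-IDF vocabulary
pred = mul_nb.predict(sample_vec)[0]
print(news.target_names[pred])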

The bag-of-words model (Bag of Words)
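
The CountVectorizer example above is exactly this model: each document is reduced to a vector of word counts and word order is discarded. A minimal hand-rolled sketch of the same idea (an added illustration, reusing the earlier toy corpus):

from collections import Counter

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
# Build the vocabulary from every word that appears in the corpus
vocab = sorted({word for doc in texts for word in doc.split()})
print(vocab)
# Represent each document as a vector of word counts over that vocabulary
for doc in texts:
    counts = Counter(doc.split())
    print([counts[word] for word in vocab])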
Bayesian spell checker
Spell checker principle
Among all correctly spelled words, we want to find the word c that maximizes the conditional probability given the typed word w. By Bayes' theorem:
P(c|w) = P(w|c) P(c) / P(w)
For example, if appla is the typed word w, then apple and apply are candidate correct words c. P(w) is the same for every candidate, so it can be dropped from the comparison, leaving:
P(w|c) P(c)
P(c) is the probability that the correctly spelled word c appears in a corpus, i.e. how often c occurs in English text.
If we assume that the more frequently a word appears in the corpus, the more likely it is to be the intended spelling, we can use raw word counts in place of this probability. For instance, P('the') is relatively high in English, while P('zxzxzxzyy') is close to 0 (supposing the latter were even a word).
P(w|c) is the probability that a user who meant to type c actually typed w, i.e. the probability of making that particular typo.
import re
# Read the corpus
text = open('big.txt').read()
# Lowercase and keep only runs of the letters a-z
text = re.findall('[a-z]+', text.lower())
# Count how often each word occurs in the corpus
dic_words = {}
for t in text:
    dic_words[t] = dic_words.get(t, 0) + 1
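
These counts play the role of the (unnormalized) prior P(c) discussed above; a quick check of a common word against rarer ones (a small sketch assuming big.txt has been loaded as shown):

# Word counts stand in for P(c): common words get much larger values
print(dic_words.get('the', 0))        # very frequent in English text
print(dic_words.get('soothing', 0))   # a real but much rarer word
print(dic_words.get('zxzxzxzyy', 0))  # not a word, so the count is 0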

Edit distance:
The edit distance between two words is the number of single-character edits needed to turn one into the other: insertions (insert one letter), deletions (delete one letter), transpositions (swap two adjacent letters), and replacements (change one letter into another).
# The alphabet
alphabet = 'abcdefghijklmnopqrstuvwxyz'

# Return the set of all strings whose edit distance from word is 1
def edits1(word):
    n = len(word)
    return set([word[0:i] + word[i+1:] for i in range(n)] +                         # deletion
               [word[0:i] + word[i+1] + word[i] + word[i+2:] for i in range(n-1)] + # transposition
               [word[0:i] + c + word[i+1:] for i in range(n) for c in alphabet] +   # alteration
               [word[0:i] + c + word[i:] for i in range(n+1) for c in alphabet])    # insertion

# Return the set of all strings whose edit distance from word is 2
# (later, only correctly spelled words among these are kept as candidates)
def edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1))

e1 = edits1('something')
e2 = edits2('something')
print(len(e1) + len(e2))

There are 114,818 strings within edit distance 1 or 2 of 'something'.
Optimization: keep only correctly spelled words as candidates. After this optimization edits2 returns just 3 words: 'smoothing', 'something' and 'soothing'.
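
The filtering described here is not shown in the edits2 above; a minimal sketch of the optimized version (a reconstruction, assuming the dic_words dictionary built earlier):

# Keep only edit-distance-2 candidates that are real words in the corpus
def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in dic_words)

print(known_edits2('something'))  # per the text above: 'smoothing', 'something', 'soothing'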
Solving for P(w|c): in reality, mistyping one vowel as another is more likely than mistyping a consonant (people often type hello as hallo, for example), and the first letter of a word is misspelled relatively rarely, and so on. For simplicity, though, we use a cruder rule: a correct word at edit distance 1 gets higher priority than one at edit distance 2, and a correct word at edit distance 0 gets higher priority than one at edit distance 1. Typing hello as hallo is generally more likely than typing hello as halo.
# Keep only the candidates that are real words in the corpus
def known(words):
    w = set()
    for word in words:
        if word in dic_words:
            w.add(word)
    return w

# First generate candidates by edit distance, then pick the best-matching word
def correct(word):
    # Candidate words: if known(...) returns a non-empty set, the alternatives
    # after `or` are not evaluated (edit distance 0 beats 1, which beats 2)
    candidates = known([word]) or known(edits1(word)) or known(edits2(word)) or word
    # No similar word in the dictionary: return the input unchanged
    if word == candidates:
        return word
    # Return the candidate with the highest corpus frequency
    max_num = 0
    for c in candidates:
        if dic_words[c] >= max_num:
            max_num = dic_words[c]
            candidate = c
    return candidate
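
A quick usage check (a sketch assuming big.txt and the functions above have been loaded; the exact output depends on the corpus frequencies):

# Typical corrections for misspelled inputs
print(correct('appla'))      # likely 'apple' or 'apply', whichever is more frequent
print(correct('korrectud'))  # no distance-1 word matches, so edits2 candidates are tried
print(correct('something'))  # already correct, returned unchanged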
