Machine Learning Fundamentals - Bayesian Analysis - 14
2022-07-28 12:52:00 【gemoumou】
Bayesian analysis

Bayes - Iris classification
# Import algorithm package and data set
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.naive_bayes import MultinomialNB,BernoulliNB,GaussianNB
# Load data
iris = datasets.load_iris()
x_train,x_test,y_train,y_test = train_test_split(iris.data, iris.target)
# Gaussian naive Bayes suits the continuous iris measurements
gnb = GaussianNB()
gnb.fit(x_train, y_train)

# classification_report expects (y_true, y_pred)
print(classification_report(y_test, gnb.predict(x_test)))
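All three naive Bayes variants are imported above; as a minimal, hedged comparison on the same split (exact scores depend on the random train/test split, and BernoulliNB is a poor fit for continuous features, included only for contrast):

# Compare the three imported naive Bayes variants on the same iris split
for Model in (GaussianNB, MultinomialNB, BernoulliNB):
    clf = Model()
    clf.fit(x_train, y_train)
    print(Model.__name__, clf.score(x_test, y_test))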


Bayes - News classification
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
news = fetch_20newsgroups(subset='all')
print(news.target_names)
print(len(news.data))
print(len(news.target))

print(len(news.target_names))

print(news.data[0])

print(news.target[0])
print(news.target_names[news.target[0]])

x_train,x_test,y_train,y_test = train_test_split(news.data,news.target)
# train = fetch_20newsgroups(subset='train')
# x_train = train.data
# y_train = train.target
# test = fetch_20newsgroups(subset='test')
# x_test = test.data
# y_test = test.target

from sklearn.feature_extraction.text import CountVectorizer
texts=["dog cat fish","dog cat cat","fish bird", 'bird']
cv = CountVectorizer()
cv_fit=cv.fit_transform(texts)
# Vocabulary and the document-term count matrix
# (on newer scikit-learn versions use cv.get_feature_names_out() instead)
print(cv.get_feature_names())
print(cv_fit.toarray())
print(cv_fit.toarray().sum(axis=0))

from sklearn import model_selection
from sklearn.naive_bayes import MultinomialNB
cv = CountVectorizer()
cv_data = cv.fit_transform(x_train)
mul_nb = MultinomialNB()
scores = model_selection.cross_val_score(mul_nb, cv_data, y_train, cv=3, scoring='accuracy')
print("Accuracy: %0.3f" % (scores.mean()))

TfidfVectorizer uses a more refined weighting scheme called Term Frequency-Inverse Document Frequency (TF-IDF). It is a statistical measure of how important a word is to a document within a corpus. Intuitively, it looks for words that occur frequently in the current document while comparing that against how often they occur across the whole corpus. This normalization avoids giving weight to words that appear everywhere yet say little about any particular document (for example, 'a' and 'and' are very frequent in English but contribute little to representing a specific text).
from sklearn.feature_extraction.text import TfidfVectorizer
# Text document list
text = ["The quick brown fox jumped over the lazy dog.",
"The dog.",
"The fox"]
# Create the transformer
vectorizer = TfidfVectorizer()
# Tokenize the documents and build the vocabulary
vectorizer.fit(text)
# Summarize what was learned
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# Encode one document
vector = vectorizer.transform([text[0]])
# Summarize the encoded document
print(vector.shape)
print(vector.toarray())
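For reference, the default weighting that TfidfVectorizer applies (smooth IDF plus l2 row normalization, to the best of my understanding of the default parameters) can be reproduced by hand from raw counts and compared with vector.toarray() above:

# A hedged sketch of the default TF-IDF weighting (smooth_idf=True, norm='l2'),
# computed from CountVectorizer counts on the same toy corpus
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The quick brown fox jumped over the lazy dog.", "The dog.", "The fox"]
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                  # document frequency of each term
idf = np.log((1 + n_docs) / (1 + df)) + 1      # smoothed inverse document frequency
weights = counts * idf                         # tf * idf
weights /= np.linalg.norm(weights, axis=1, keepdims=True)  # l2-normalize each row
print(weights[0])                              # should be close to the encoding of text[0] above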

# Create the transformer
vectorizer = TfidfVectorizer()
# Tokenize the training texts and build the vocabulary
tfidf_train = vectorizer.fit_transform(x_train)
scores = model_selection.cross_val_score(mul_nb, tfidf_train, y_train, cv=3, scoring='accuracy')
print("Accuracy: %0.3f" % (scores.mean()))

def get_stop_words():
    result = set()
    with open('stopwords_en.txt', 'r') as f:
        for line in f:
            result.add(line.strip())
    return result
# Load stop words
stop_words = get_stop_words()
# Create the transformer, this time filtering out stop words
vectorizer = TfidfVectorizer(stop_words=stop_words)
mul_nb = MultinomialNB(alpha=0.01)
# Tokenize the training texts and build the vocabulary
tfidf_train = vectorizer.fit_transform(x_train)
scores = model_selection.cross_val_score(mul_nb, tfidf_train, y_train, cv=3, scoring='accuracy')
print("Accuracy: %0.3f" % (scores.mean()))

# Vectorize the full corpus, then split it into training and test sets
tfidf_data = vectorizer.fit_transform(news.data)
x_train,x_test,y_train,y_test = train_test_split(tfidf_data,news.target)
mul_nb.fit(x_train,y_train)
print(mul_nb.score(x_train, y_train))
print(mul_nb.score(x_test, y_test))
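For a per-class view like the iris example earlier, classification_report (already imported above) can be applied to the held-out split; the exact numbers depend on the random split:

# Per-class precision/recall/F1 on the held-out test set
y_pred = mul_nb.predict(x_test)
print(classification_report(y_test, y_pred, target_names=news.target_names))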

The bag-of-words model (Bag of Words)
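A bag-of-words representation describes a document only by which words occur in it and how often, ignoring word order; the CountVectorizer example above implements exactly this idea. Below is a minimal sketch of the same representation built by hand with the standard library (the toy sentences are the ones used earlier and are only illustrative):

# A hand-rolled bag of words using collections.Counter, equivalent in spirit
# to the CountVectorizer demo above
from collections import Counter

docs = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
bags = [Counter(doc.split()) for doc in docs]
vocabulary = sorted(set(word for bag in bags for word in bag))
matrix = [[bag[word] for word in vocabulary] for bag in bags]
print(vocabulary)
for row in matrix:
    print(row)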
Bayesian spell checker
Spell checker principle
Of all the correctly spelled words, we want to find the word c that, given the typed word w, has the largest conditional probability. By Bayes' theorem:
P(c|w) = P(w|c) P(c) / P(w)
For example: 'appla' is the typed word w, and 'apple' and 'apply' are candidate correct words c. P(w) is the same for both 'apple' and 'apply', so we can drop it from the formula and simply compare:
P(w|c) P(c)
P(c) is the probability that the correctly spelled word c appears in an English text, in other words, how common c is. Assuming that the more often a word appears in the corpus the more likely it is to be the intended spelling, we can use raw word counts in place of this probability. For example, P('the') is relatively high in English, while P('zxzxzxzyy') is close to 0 (supposing the latter were even a word).
P(w|c) is the probability of typing w when the user meant c, i.e. the probability of mistyping c as w.
import re
# Read the corpus
text = open('big.txt').read()
# Lowercase the text and keep only runs of a-z characters (i.e. the words)
text = re.findall('[a-z]+', text.lower())
# Count how many times each word appears in the corpus
dic_words = {}
for t in text:
    dic_words[t] = dic_words.get(t, 0) + 1
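As a quick sanity check, the most frequent words in the corpus should be common English function words (exact counts depend on big.txt):

# The five most frequent corpus words; output depends on big.txt
top5 = sorted(dic_words.items(), key=lambda kv: kv[1], reverse=True)[:5]
print(top5)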

Edit distance:
The edit distance between two words is defined as the number of insertions (inserting a single letter), deletions (deleting a single letter), transpositions (swapping two adjacent letters), and replacements (changing one letter into another) needed to turn one word into the other.
# alphabet
alphabet = 'abcdefghijklmnopqrstuvwxyz'
# Return the set of all strings whose edit distance from word is 1
def edits1(word):
    n = len(word)
    return set([word[0:i]+word[i+1:] for i in range(n)] +                     # deletion
               [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + # transposition
               [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] + # alteration
               [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])  # insertion
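For a word of length n this generates roughly 54n + 25 candidate strings before deduplication, most of which are not real words. For example (the exact count is simply whatever the expression above yields):

# Number of candidates at edit distance 1 from 'something' (a few hundred strings)
print(len(edits1('something')))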

# Return the set of all strings whose edit distance from word is 2
# (among these, only actual dictionary words will later be used as candidates)
def edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1))
e1 = edits1('something')
e2 = edits2('something')
print(len(e1) + len(e2))

There are as many as 114,818 strings within edit distance 1 or 2 of 'something'.
Optimization: use only real words as candidates. After this optimization, edits2 yields only 3 words for 'something': 'smoothing', 'something' and 'soothing'.
Solving for P(w|c): in reality, mistyping one vowel as another is more likely than mistyping a consonant (people often type hello as hallo), the first letter of a word is misspelled relatively rarely, and so on. For simplicity, a cruder rule is chosen here: a correct word at edit distance 1 takes priority over one at edit distance 2, and a word at edit distance 0 takes priority over one at edit distance 1. In general, turning hello into hallo is considerably more likely than turning hello into halo.
# Keep only the candidates that are real words (i.e. appear in the corpus dictionary)
def known(words):
    w = set()
    for word in words:
        if word in dic_words:
            w.add(word)
    return w

# First generate candidates by edit distance, then pick the best-matching word
def correct(word):
    # Get candidate words; `or` short-circuits, so if known(...) returns a
    # non-empty set it is used as the candidates and the later calls are skipped
    candidates = known([word]) or known(edits1(word)) or known(edits2(word)) or word
    # No similar word was found in the dictionary
    if word == candidates:
        return word
    # Return the candidate with the highest corpus frequency
    max_num = 0
    for c in candidates:
        if dic_words[c] >= max_num:
            max_num = dic_words[c]
            candidate = c
    return candidate
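A few example calls, as a sanity check (the exact corrections depend on the word frequencies in big.txt):

# Example usage; the outputs depend on the big.txt corpus
print(correct('appla'))      # expected to return 'apple' or 'apply', whichever is more frequent
print(correct('something'))  # already a dictionary word, returned unchanged
print(correct('zzzzzzzz'))   # no dictionary word within edit distance 2, returned as-is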
