当前位置:网站首页>[machine learning] naive Bayesian classification of text -- Classification of people's names and countries
[machine learning] naive Bayesian classification of text -- Classification of people's names and countries
2022-07-28 22:03:00 【Du Hengzhi】
Naive Bayes
- Based on a priori probability 、 Conditional probability gets joint probability , Then we get the posterior probability ;
- Satisfy all characteristic independence assumptions , And equally important ;
- It's a generative model ;
- It can be used for multi classification problems of small data sets .
Code
import re
from sklearn.feature_extraction.text import CountVectorizer # The text means , Until n-gram
from sklearn.model_selection import train_test_split # Divide the data set
from sklearn.naive_bayes import MultinomialNB # Introduce Bayesian formula
class LanguageDetector():
def __init__(self, classifier=MultinomialNB()):
self.classifier = classifier
# ngram_range=(1,2) N-GRAM by 1,2
self.vectorizer = CountVectorizer(ngram_range=(1,2), max_features=1000, preprocessor=self._remove_noise)
# Denoise
def _remove_noise(self, document):
noise_pattern = re.compile("|".join(["http\S+", "\@\w+", "\#\w+"]))
clean_text = re.sub(noise_pattern, "", document)
return clean_text
# feature extraction
def features(self, X):
return self.vectorizer.transform(X)
# Training
def fit(self, X, y):
self.vectorizer.fit(X)
self.classifier.fit(self.features(X), y)
# forecast
def predict(self, x):
return self.classifier.predict(self.features([x]))
# Accuracy rate
def score(self, X, y):
return self.classifier.score(self.features(X), y)
if __name__ == "__main__":
in_f = open('data.csv')
lines = in_f.readlines()
in_f.close()
dataset = [(line.strip()[:-3], line.strip()[-2:]) for line in lines]
x, y = zip(*dataset)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)
language_detector = LanguageDetector()
language_detector.fit(x_train, y_train)
print(language_detector.predict('This is an English sentence')) # Test statement
print(language_detector.score(x_test, y_test))
Running results
['en']
0.9779444199382443
Data display of this article

边栏推荐
- 管理区解耦架构见过吗?能帮客户搞定大难题的
- 中国科学家首次用DNA构造卷积人工神经网络,可完成32类分子模式识别任务,或用于生物标志物信号分析和诊断
- System Analyst
- 基于属性词补全的武器装备属性抽取研究
- 90. 子集 II
- 基于对象的实时空间音频渲染丨Dev for Dev 专栏
- Is it necessary to calibrate the fluke dtx-1800 test accuracy?
- Pytoch learning record (III): random gradient descent, neural network and full connection
- 标准C语言学习总结10
- 搞事摸鱼一天有一天
猜你喜欢

Information fusion method and application of expert opinion and trust in large group emergency decision-making based on complex network

Introduction to wechat applet development, develop your own applet

Miscellaneous records of powersploit, evaluation, weevery and other tools in Kali

株洲市九方中学开展防溺水、消防安全教育培训活动

Oracle, SQL Foundation

如何高效、精准地进行图片搜索?看看轻量化视觉预训练模型

KubeEdge发布云原生边缘计算威胁模型及安全防护技术白皮书

详解visual studio 2015在局域网中远程调试程序

msfvenom制作主控与被控端

Research on the recognition method of move function information of scientific paper abstract based on paragraph Bert CRF
随机推荐
39. Combined sum
Research on the recognition method of move function information of scientific paper abstract based on paragraph Bert CRF
Zhuzhou Jiufang middle school carried out drowning prevention and fire safety education and training activities
Chinese patent keyword extraction based on LSTM and logistic regression
网格数据生成函数meshgrid
什么是质因数,质因数(素因数或质因子)在数论里是指能整除给定正整数的质数
拥抱开源指南
ESP8266-Arduino编程实例-SPIFFS及数据上传(Arduino IDE和PlatformIO IDE)
[极客大挑战 2019]Secret File&文件包含常用伪协议以及姿势
Technology selection rust post analysis
管理区解耦架构见过吗?能帮客户搞定大难题的
Oracle triggers
详解visual studio 2015在局域网中远程调试程序
How is nanoid faster and more secure than UUID implemented? (glory Collection Edition)
display 各值的区别
Professional Committee of agricultural water and soil engineering of China Association of Agricultural Engineering - 12th session - Notes
Skiasharp's WPF self drawn drag ball (case version)
小程序 canvas 生成海报
世界肝炎日 | 基层也能享受三甲资源,智慧医疗系统如何解决“看病难”?
Two global variables__ Dirname and__ Further introduction to common functions of filename and FS modules