【Machine Learning】Text Classification with Naive Bayes: Classifying Personal Names by Nationality
2022-07-28 20:18:00 【Du恒之】
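Naive Bayes
- Combines the prior probability with the class-conditional probabilities to obtain the joint probability, and from that the posterior probability;
- Assumes that all features are conditionally independent given the class, and that they are equally important;
- Is a generative model;
- Is suitable for multi-class classification on small datasets.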
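The chain from prior to posterior described above can be sketched by hand in a few lines. This is only a minimal illustration, not the scikit-learn implementation: the toy sentences, the function names `train_nb` and `posterior`, and the smoothing constant `alpha` are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

# Toy training data, invented purely for illustration.
train = [
    ("the cat sat", "en"),
    ("the dog ran", "en"),
    ("le chat dort", "fr"),
    ("le chien court", "fr"),
]

def train_nb(samples):
    """Collect class counts, per-class word counts and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        class_counts[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def posterior(text, class_counts, word_counts, vocab, alpha=1.0):
    """Return P(class | text) for every class, with additive smoothing."""
    total_docs = sum(class_counts.values())
    log_joint = {}
    for label, n_docs in class_counts.items():
        # Log prior: log P(class)
        score = math.log(n_docs / total_docs)
        n_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # skip words never seen in training
            # Log likelihood: log P(word | class), words treated as independent
            score += math.log(
                (word_counts[label][word] + alpha) / (n_words + alpha * len(vocab))
            )
        log_joint[label] = score
    # Normalise the joint probabilities to obtain the posterior
    norm = sum(math.exp(s) for s in log_joint.values())
    return {label: math.exp(s) / norm for label, s in log_joint.items()}

model = train_nb(train)
probs = posterior("the cat", *model)
print(probs)
```

Working in log space avoids underflow when many small probabilities are multiplied, and the final normalisation step is what turns the joint probabilities into posteriors that sum to 1.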
Code
import re

from sklearn.feature_extraction.text import CountVectorizer  # text representation: n-gram count features
from sklearn.model_selection import train_test_split  # train/test split
from sklearn.naive_bayes import MultinomialNB  # multinomial Naive Bayes classifier


class LanguageDetector:
    def __init__(self, classifier=None):
        # Create the classifier here so that instances do not share one default object
        self.classifier = classifier if classifier is not None else MultinomialNB()
        # ngram_range=(1, 2): use word unigrams and bigrams; keep the 1000 most frequent
        # Note: a custom preprocessor replaces the default lowercasing step
        self.vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=1000, preprocessor=self._remove_noise)

    # Noise removal: strip URLs, @mentions and #hashtags
    def _remove_noise(self, document):
        noise_pattern = re.compile("|".join([r"http\S+", r"@\w+", r"#\w+"]))
        return re.sub(noise_pattern, "", document)

    # Feature extraction
    def features(self, X):
        return self.vectorizer.transform(X)

    # Training
    def fit(self, X, y):
        self.vectorizer.fit(X)
        self.classifier.fit(self.features(X), y)

    # Prediction for a single string
    def predict(self, x):
        return self.classifier.predict(self.features([x]))

    # Accuracy
    def score(self, X, y):
        return self.classifier.score(self.features(X), y)


if __name__ == "__main__":
    with open('data.csv') as in_f:
        lines = in_f.readlines()
    # Each line is expected to end with one separator character and a two-letter label
    dataset = [(line.strip()[:-3], line.strip()[-2:]) for line in lines]
    x, y = zip(*dataset)
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)
    language_detector = LanguageDetector()
    language_detector.fit(x_train, y_train)
    print(language_detector.predict('This is an English sentence'))  # test sentence
    print(language_detector.score(x_test, y_test))
Output
['en']
0.9779444199382443
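To make the `ngram_range=(1,2)` setting more concrete, here is a rough plain-Python sketch of the word unigrams and bigrams that end up as count features. It is not the scikit-learn code: the helper `word_ngrams` is hypothetical, and it only imitates `CountVectorizer`'s default word tokenization (tokens of two or more word characters). Case is left untouched because, as far as I can tell, passing a custom `preprocessor` replaces the default lowercasing step.

```python
import re
from collections import Counter

# Same shape as CountVectorizer's default token pattern:
# a token is a run of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def word_ngrams(text, ngram_range=(1, 2)):
    """Return all word n-grams of text for every n in ngram_range (inclusive)."""
    tokens = TOKEN_PATTERN.findall(text)
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        # Slide a window of n tokens across the sentence
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

counts = Counter(word_ngrams("This is an English sentence"))
print(sorted(counts))
```

The five-word test sentence yields five unigrams and four bigrams. In the real pipeline, `max_features=1000` then keeps only the most frequent of these n-grams across the training set.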
Data used in this article
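The loading code slices off the last two characters of each line as the label and drops one more character before it, which suggests a "text, separator, two-letter code" layout. The sketch below assumes the separator is a comma; the sample lines and the helper `parse_line` are hypothetical and are not taken from `data.csv`.

```python
# Hypothetical lines in the assumed "text,label" layout (not taken from data.csv).
sample_lines = [
    "This is an English sentence,en\n",
    "Ceci est une phrase en français,fr\n",
]

def parse_line(line):
    """Split one line into (text, label) at the last comma."""
    text, label = line.strip().rsplit(",", 1)
    return text, label

dataset = [parse_line(line) for line in sample_lines]
texts, labels = zip(*dataset)
print(labels)
```

Splitting at the last comma gives the same result as the fixed-width slicing for two-letter labels, and it keeps working if the text itself contains commas.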
