[Machine Learning] Text Classification with Naive Bayes: Classifying Person Names by Country
2022-07-28 20:18:00 【Du恒之】
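Naive Bayes
- Combines the prior probability and the class-conditional probabilities into a joint probability, and from that obtains the posterior probability;
- Assumes that all features are independent of each other given the class, and that they are equally important;
- Is a generative model;
- Works for multi-class classification on small datasets.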
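As a rough illustration of the first two bullets, the sketch below computes a posterior by hand for a tiny two-class problem. The class labels `en` and `fr`, the priors and the per-word likelihoods are all made-up numbers chosen for illustration; they are not estimated from the article's data.

```python
import math

# Hypothetical priors P(c) and per-word likelihoods P(w | c); all numbers are made up.
priors = {"en": 0.5, "fr": 0.5}
likelihoods = {
    "en": {"the": 0.20, "le": 0.01, "cat": 0.05},
    "fr": {"the": 0.01, "le": 0.20, "cat": 0.02},
}


def posterior(words):
    """Return P(c | words) under the naive independence assumption."""
    # Joint probability P(c, words) = P(c) * prod_i P(w_i | c), computed in log space
    log_joint = {
        c: math.log(priors[c]) + sum(math.log(likelihoods[c][w]) for w in words)
        for c in priors
    }
    # Normalize the joint over all classes to get the posterior (Bayes' rule)
    total = sum(math.exp(v) for v in log_joint.values())
    return {c: math.exp(v) / total for c, v in log_joint.items()}


result = posterior(["the", "cat"])
print(result)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which is the usual reason for doing so in Naive Bayes implementations.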
Code
import re

from sklearn.feature_extraction.text import CountVectorizer  # bag-of-n-grams text features
from sklearn.model_selection import train_test_split  # train/test split
from sklearn.naive_bayes import MultinomialNB  # multinomial Naive Bayes classifier


class LanguageDetector:
    def __init__(self, classifier=None):
        # Create the default classifier per instance instead of sharing one default object
        self.classifier = classifier if classifier is not None else MultinomialNB()
        # ngram_range=(1, 2): word unigrams and bigrams; keep the 1000 most frequent.
        # A custom preprocessor replaces the default one, so the text is not lowercased.
        self.vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=1000, preprocessor=self._remove_noise)

    # Noise removal: strip URLs, @mentions and #hashtags
    def _remove_noise(self, document):
        noise_pattern = re.compile("|".join([r"http\S+", r"@\w+", r"#\w+"]))
        clean_text = re.sub(noise_pattern, "", document)
        return clean_text

    # Feature extraction
    def features(self, X):
        return self.vectorizer.transform(X)

    # Training
    def fit(self, X, y):
        self.vectorizer.fit(X)
        self.classifier.fit(self.features(X), y)

    # Prediction for a single string
    def predict(self, x):
        return self.classifier.predict(self.features([x]))

    # Accuracy
    def score(self, X, y):
        return self.classifier.score(self.features(X), y)


if __name__ == "__main__":
    with open('data.csv') as in_f:
        lines = in_f.readlines()
    # Split each line by position: the last two characters are the label,
    # and the text is everything before the one-character separator preceding it
    dataset = [(line.strip()[:-3], line.strip()[-2:]) for line in lines]
    x, y = zip(*dataset)
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)
    language_detector = LanguageDetector()
    language_detector.fit(x_train, y_train)
    print(language_detector.predict('This is an English sentence'))  # test sentence
    print(language_detector.score(x_test, y_test))
Output
['en']
0.9779444199382443
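To see what the `_remove_noise` preprocessor strips before vectorization, the same pattern can be run on its own. This is only a small standalone sketch; the sample string and the free function name `remove_noise` are made up for the demonstration.

```python
import re

# Same three alternatives as in _remove_noise: URLs, @mentions, #hashtags
noise_pattern = re.compile("|".join([r"http\S+", r"@\w+", r"#\w+"]))


def remove_noise(document):
    """Delete every match of the noise pattern; surrounding whitespace is left as is."""
    return noise_pattern.sub("", document)


sample = "Check http://example.com now @alice #nlp rocks"  # hypothetical input
cleaned = remove_noise(sample)
print(repr(cleaned))
```

The removed spans leave extra spaces behind. That should be harmless here, since `CountVectorizer` tokenizes on word characters by default and ignores runs of whitespace.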
Data used in this article
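The data preview itself is not reproduced here. Judging only from the slicing in the code (`[:-3]` for the text and `[-2:]` for the label), each line of `data.csv` presumably ends with a one-character separator followed by a two-letter code. The sketch below parses lines under that assumption; the sample lines and the helper name `parse_line` are invented, not taken from the real file.

```python
# Hypothetical lines in the assumed "text<sep>xx" layout; not taken from the real data.csv
sample_lines = [
    "This is an English sentence,en\n",
    "Ceci est une phrase, en français,fr\n",
]


def parse_line(line):
    """Split a line into (text, label) by position, as the article's code does."""
    stripped = line.strip()
    # Last two characters are the label; the character before them is the separator
    return stripped[:-3], stripped[-2:]


dataset = [parse_line(line) for line in sample_lines]
x, y = zip(*dataset)
print(x)
print(y)
```

Slicing by position keeps any commas inside the text intact, which splitting on every comma would not. It does rely on every label being exactly two characters long.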
