Chinese word segmentation with traditional methods (N-gram, HMM, etc.), neural network methods (CNN, LSTM, etc.), and pre-trained models (BERT, etc.)
2022-07-02 06:08:00 【JackHCC】
Natural language processing, Chinese word segmentation
This project implements the Chinese word segmentation task using traditional methods (N-gram, HMM, etc.), neural network methods (CNN, LSTM, etc.), and pre-trained methods (BERT, etc.).
Project repository: https://github.com/JackHCC/Chinese-Tokenization
Method overview
- Traditional algorithms: Chinese word segmentation with N-gram, HMM, maximum entropy, CRF, etc.
- Neural network methods: CNN, Bi-LSTM, Transformer, etc.
- Pre-trained language model methods: BERT, etc.
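As an illustration of the simplest traditional method in the list, the following is a minimal sketch of unigram segmentation: it picks the split whose words have the highest product of unigram probabilities, found by dynamic programming. It is not the project's code. The function name `unigram_segment`, the toy frequency dictionary, the `max_word_len` limit, and the fixed pseudo-count for unknown single characters are all assumptions made for this example.

```python
import math


def unigram_segment(sentence, word_freq, max_word_len=4, unk_count=0.5):
    """Return the segmentation of `sentence` with the highest unigram log-probability.

    word_freq: dict mapping word -> count (hypothetical toy dictionary).
    unk_count: pseudo-count given to single characters missing from the
               dictionary (an assumption, so every sentence can be segmented).
    """
    total = sum(word_freq.values())
    n = len(sentence)
    # best[i] = (best log-probability of sentence[:i], start index of its last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = sentence[start:end]
            count = word_freq.get(word)
            if count is None:
                if len(word) > 1:
                    continue  # unknown multi-character strings are not candidates
                count = unk_count
            score = best[start][0] + math.log(count / total)
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end of the sentence to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(sentence[start:end])
        end = start
    return words[::-1]


# Toy counts, invented for illustration only.
toy_freq = {"研究": 10, "研究生": 5, "生命": 8, "生": 3, "命": 2, "的": 50, "起源": 6}
print(unigram_segment("研究生命的起源", toy_freq))
```

With these toy counts the split 研究 / 生命 / 的 / 起源 scores higher than 研究生 / 命 / 的 / 起源, because the product of its word probabilities is larger.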
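Dataset overview
- PKU and MSR are the datasets used in the Chinese word segmentation bakeoff organized by SIGHAN in 2005, and they are also standard benchmarks in academia for evaluating word segmentation tools.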
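Sequence-labelling models such as HMM, CRF, Bi-LSTM, and BERT are usually trained on one tag per character rather than on word lists. The source does not say which tag set the project uses. The sketch below assumes the common BMES scheme (Begin, Middle, End, Single) and a corpus in which each sentence is already split into words. The helper names `words_to_bmes` and `bmes_to_words` are hypothetical.

```python
def words_to_bmes(words):
    """Convert a list of words into one BMES tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first char = B, last char = E, everything in between = M
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def bmes_to_words(chars, tags):
    """Rebuild words from characters and their BMES tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends here
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends without E or S
        words.append(current)
    return words


# A whitespace-segmented line, as assumed for the training files.
line = "研究 生命 的 起源"
words = line.split()
tags = words_to_bmes(words)
print(tags)
print(bmes_to_words("".join(words), tags))
```

A model then predicts the tag sequence for an unsegmented sentence, and `bmes_to_words` turns the prediction back into words.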
Experimental procedure
Experimental results
PKU dataset
| Model | Precision | Recall | F1 score |
|---|---|---|---|
| Uni-Gram | 0.8550 | 0.9342 | 0.8928 |
| Uni-Gram + rules | 0.9111 | 0.9496 | 0.9300 |
| HMM | 0.7936 | 0.8090 | 0.8012 |
| CRF | 0.9409 | 0.9396 | 0.9400 |
| Bi-LSTM | 0.9248 | 0.9236 | 0.9240 |
| Bi-LSTM+CRF | 0.9366 | 0.9354 | 0.9358 |
| BERT | 0.9712 | 0.9635 | 0.9673 |
| BERT-CRF | 0.9705 | 0.9619 | 0.9662 |
| jieba | 0.8559 | 0.7896 | 0.8214 |
| pkuseg | 0.9512 | 0.9224 | 0.9366 |
| THULAC | 0.9287 | 0.9295 | 0.9291 |
MSR dataset
| Model | Precision | Recall | F1 score |
|---|---|---|---|
| Uni-Gram | 0.9119 | 0.9633 | 0.9369 |
| Uni-Gram + rules | 0.9129 | 0.9634 | 0.9375 |
| HMM | 0.7786 | 0.8189 | 0.7983 |
| CRF | 0.9675 | 0.9676 | 0.9675 |
| Bi-LSTM | 0.9624 | 0.9625 | 0.9624 |
| Bi-LSTM+CRF | 0.9631 | 0.9632 | 0.9632 |
| BERT | 0.9841 | 0.9817 | 0.9829 |
| BERT-CRF | 0.9805 | 0.9787 | 0.9796 |
| jieba | 0.8204 | 0.8145 | 0.8174 |
| pkuseg | 0.8701 | 0.8894 | 0.8796 |
| THULAC | 0.8428 | 0.8880 | 0.8648 |
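The source does not describe how the three columns in the tables above were scored. A common convention for word segmentation is to count a predicted word as correct only when both of its boundaries match the gold segmentation. The sketch below follows that convention and should be read as an assumption about the metric, not as the project's evaluation script. The function names `to_spans` and `segmentation_prf` are hypothetical.

```python
def to_spans(words):
    """Map a word list to a set of (start, end) character offsets."""
    spans, pos = set(), 0
    for word in words:
        spans.add((pos, pos + len(word)))
        pos += len(word)
    return spans


def segmentation_prf(gold_sentences, pred_sentences):
    """Word-level precision, recall and F1 over a corpus of segmented sentences."""
    correct = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans, pred_spans = to_spans(gold), to_spans(pred)
        correct += len(gold_spans & pred_spans)  # words with both boundaries right
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold = [["研究", "生命", "的", "起源"]]
pred = [["研究生", "命", "的", "起源"]]
print(segmentation_prf(gold, pred))  # (0.5, 0.5, 0.5)
```

In this toy case only 的 and 起源 match on both boundaries, so 2 of 4 predicted words and 2 of 4 gold words are correct.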