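Implementing Chinese word segmentation with traditional methods (N-gram, HMM, etc.), neural network methods (CNN, LSTM, etc.), and pre-trained methods (BERT, etc.)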
2022-07-02 06:08:00 【JackHCC】
Natural language processing, Chinese word segmentation
This project implements the Chinese word segmentation task with traditional methods (N-gram, HMM, etc.), neural network methods (CNN, LSTM, etc.), and pre-trained methods (BERT, etc.), and compares them on standard benchmarks.
Project repository: https://github.com/JackHCC/Chinese-Tokenization
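Method overview
- Traditional algorithms: Chinese word segmentation with N-gram, HMM, maximum entropy, CRF, etc.
- Neural network methods: CNN, Bi-LSTM, Transformer, etc.
- Pre-trained language model methods: BERT, etc.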
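To make the simplest of the traditional methods concrete, here is a minimal sketch of uni-gram segmentation: choose the split whose words have the highest total log-probability, found by dynamic programming. The word counts, the smoothing constant, and the function names are hypothetical and chosen only for illustration; this is not the project's own implementation.

```python
import math

# Hypothetical toy word counts; a real model would estimate them from a training corpus.
WORD_COUNTS = {"研究": 5, "研究生": 2, "生命": 4, "命": 1, "的": 10, "起源": 3, "生": 1}
TOTAL = sum(WORD_COUNTS.values())
MAX_WORD_LEN = max(len(w) for w in WORD_COUNTS)


def log_prob(word):
    """Uni-gram log-probability of a candidate word."""
    count = WORD_COUNTS.get(word)
    if count is None:
        if len(word) > 1:
            return float("-inf")  # unseen multi-character strings are rejected
        count = 0.5  # assumed smoothing count for unseen single characters
    return math.log(count / TOTAL)


def segment(sentence):
    """Return the split that maximises the sum of uni-gram log-probabilities."""
    n = len(sentence)
    best = [float("-inf")] * (n + 1)  # best[i]: best score for sentence[:i]
    back = [0] * (n + 1)              # back[i]: start index of the last word on that path
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            score = best[start] + log_prob(sentence[start:end])
            if score > best[end]:
                best[end], back[end] = score, start
    # Walk the back-pointers from the end of the sentence to recover the words.
    words, end = [], n
    while end > 0:
        words.append(sentence[back[end]:end])
        end = back[end]
    return words[::-1]


print(segment("研究生命的起源"))  # ['研究', '生命', '的', '起源']
```

With these toy counts, the split 研究 / 生命 scores higher than 研究生 / 命, so the ambiguity is resolved in favour of the more frequent words. Rule-based post-processing, such as the "Uni-Gram + rules" variant reported in the results, would sit on top of a step like this.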
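Dataset overview
- PKU and MSR are the datasets used in the Chinese word segmentation bakeoff organized by SIGHAN in 2005. They are also the standard datasets used in academia to benchmark word segmentation tools.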
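The bakeoff training files store one sentence per line with words separated by whitespace. Sequence-labelling models such as HMM, CRF, Bi-LSTM, and BERT are commonly trained on per-character tags derived from that format. The sketch below uses the widely used B/M/E/S scheme (begin, middle, end, single); whether the project uses exactly this tag set is an assumption, and the example line and function names are illustrative only.

```python
def words_to_bmes(words):
    """Map each word to per-character tags: S for a single character, else B, M..., E."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def bmes_to_words(chars, tags):
    """Rebuild words from characters and tags; a word closes at every E or S."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends without E or S
        words.append(current)
    return words


line = "共同  创造  美好  的  新世纪"  # illustrative whitespace-separated sentence
words = line.split()
chars = "".join(words)
tags = words_to_bmes(words)
print(list(zip(chars, tags)))
print(bmes_to_words(chars, tags) == words)  # True: the mapping round-trips
```

Framing segmentation as tagging means the model predicts one label per character, and the decoding step in `bmes_to_words` turns predicted labels back into words.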
Experimental procedure
Experimental results
PKU dataset
| Model | Precision | Recall | F1 score |
|---|---|---|---|
| Uni-Gram | 0.8550 | 0.9342 | 0.8928 |
| Uni-Gram + rules | 0.9111 | 0.9496 | 0.9300 |
| HMM | 0.7936 | 0.8090 | 0.8012 |
| CRF | 0.9409 | 0.9396 | 0.9400 |
| Bi-LSTM | 0.9248 | 0.9236 | 0.9240 |
| Bi-LSTM+CRF | 0.9366 | 0.9354 | 0.9358 |
| BERT | 0.9712 | 0.9635 | 0.9673 |
| BERT-CRF | 0.9705 | 0.9619 | 0.9662 |
| jieba | 0.8559 | 0.7896 | 0.8214 |
| pkuseg | 0.9512 | 0.9224 | 0.9366 |
| THULAC | 0.9287 | 0.9295 | 0.9291 |
MSR dataset
| Model | Precision | Recall | F1 score |
|---|---|---|---|
| Uni-Gram | 0.9119 | 0.9633 | 0.9369 |
| Uni-Gram + rules | 0.9129 | 0.9634 | 0.9375 |
| HMM | 0.7786 | 0.8189 | 0.7983 |
| CRF | 0.9675 | 0.9676 | 0.9675 |
| Bi-LSTM | 0.9624 | 0.9625 | 0.9624 |
| Bi-LSTM+CRF | 0.9631 | 0.9632 | 0.9632 |
| BERT | 0.9841 | 0.9817 | 0.9829 |
| BERT-CRF | 0.9805 | 0.9787 | 0.9796 |
| jieba | 0.8204 | 0.8145 | 0.8174 |
| pkuseg | 0.8701 | 0.8894 | 0.8796 |
| THULAC | 0.8428 | 0.8880 | 0.8648 |
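For readers who want to reproduce scores of this kind, the sketch below shows one common way to compute word-level precision, recall, and F1 for segmentation: each word becomes a character span, and a predicted word counts as correct only if the identical span appears in the gold segmentation. This is an assumed evaluation procedure with hypothetical function names, not necessarily the script used to produce the tables above.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def prf(gold_sentences, pred_sentences):
    """Corpus-level precision, recall, and F1 over exactly matching word spans."""
    correct = gold_total = pred_total = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        g, p = to_spans(gold), to_spans(pred)
        correct += len(g & p)
        gold_total += len(g)
        pred_total += len(p)
    precision = correct / pred_total if pred_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [["研究", "生命", "的", "起源"]]
pred = [["研究生", "命", "的", "起源"]]
print(prf(gold, pred))  # (0.5, 0.5, 0.5): two of the four spans match
```

F1 is the harmonic mean of precision and recall. As a check against the tables, the Uni-Gram row on PKU gives 2 × 0.8550 × 0.9342 / (0.8550 + 0.9342) ≈ 0.8928, which matches the reported F1 score.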