Chinese word segmentation with traditional methods (N-gram, HMM, etc.), neural network methods (CNN, LSTM, etc.), and pre-trained methods (BERT, etc.)
2022-07-02 06:08:00 【JackHCC】
Natural language processing, Chinese word segmentation
This project implements the Chinese word segmentation task using traditional methods (N-gram, HMM, etc.), neural network methods (CNN, LSTM, etc.), and pre-trained methods (BERT, etc.).
Project repository: https://github.com/JackHCC/Chinese-Tokenization
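Method overview
- Traditional algorithms: Chinese word segmentation with N-gram, HMM, maximum entropy, CRF, etc.
- Neural network methods: CNN, Bi-LSTM, Transformer, etc.
- Pre-trained language model methods: BERT, etc.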
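To make the simplest of the traditional methods concrete, here is a minimal sketch of unigram segmentation by dynamic programming. It is not taken from the project repository: the function name `unigram_segment`, the add-one smoothing, the `max_word_len` limit, and the toy counts are all assumptions chosen for illustration.

```python
import math


def unigram_segment(sentence, word_counts, max_word_len=4):
    """Return the segmentation with the highest sum of unigram log-probabilities."""
    total = sum(word_counts.values())
    vocab_size = len(word_counts)

    def log_prob(word):
        count = word_counts.get(word, 0)
        # Unseen multi-character strings are not treated as candidate words;
        # unseen single characters are kept so every sentence can be segmented.
        if count == 0 and len(word) > 1:
            return None
        # Add-one smoothing (an assumption of this sketch).
        return math.log((count + 1) / (total + vocab_size + 1))

    n = len(sentence)
    best = [float("-inf")] * (n + 1)  # best[i]: best score for sentence[:i]
    back = [0] * (n + 1)              # back[i]: start index of the last word
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            lp = log_prob(sentence[start:end])
            if lp is None:
                continue
            score = best[start] + lp
            if score > best[end]:
                best[end] = score
                back[end] = start

    # Follow the back-pointers from the end of the sentence to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(sentence[back[i]:i])
        i = back[i]
    return words[::-1]


# Toy counts made up for this example; a real model would count words in the training corpus.
toy_counts = {"研究": 5, "生命": 5, "研究生": 2, "命": 1, "的": 10, "起源": 3}
print(unigram_segment("研究生命的起源", toy_counts))
```

With these toy counts the more frequent words "研究" and "生命" outscore the competing split "研究生" + "命", which is the kind of ambiguity that frequency alone can sometimes resolve.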
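Dataset overview
- PKU and MSR are the datasets used in the Chinese word segmentation bakeoff organized by SIGHAN in 2005, and they are standard benchmarks in academia for evaluating word segmenters.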
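The sequence-labeling models listed above (HMM, CRF, Bi-LSTM, BERT) need per-character labels rather than word lists. The sketch below shows one common way to derive them, assuming the corpus files store one sentence per line with words separated by whitespace and assuming a BMES tag scheme (Begin, Middle, End, Single). The tag scheme and the helper names `words_to_bmes` and `parse_line` are illustrative assumptions; the project may use a different scheme.

```python
def words_to_bmes(words):
    """Map a list of words to one BMES tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # First character B, last character E, everything in between M.
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def parse_line(line):
    """Split a whitespace-segmented corpus line into characters and their tags."""
    words = line.split()
    chars = [ch for word in words for ch in word]
    return chars, words_to_bmes(words)


chars, tags = parse_line("共同  创造  美好  的  新  世纪")
print(list(zip(chars, tags)))
```

A tagger trained on such pairs predicts one tag per character, and word boundaries are then read off after every `E` or `S` tag.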
Experimental procedure
Experimental results
PKU dataset
Model | Precision | Recall | F1 score |
---|---|---|---|
Uni-Gram | 0.8550 | 0.9342 | 0.8928 |
Uni-Gram + rules | 0.9111 | 0.9496 | 0.9300 |
HMM | 0.7936 | 0.8090 | 0.8012 |
CRF | 0.9409 | 0.9396 | 0.9400 |
Bi-LSTM | 0.9248 | 0.9236 | 0.9240 |
Bi-LSTM+CRF | 0.9366 | 0.9354 | 0.9358 |
BERT | 0.9712 | 0.9635 | 0.9673 |
BERT-CRF | 0.9705 | 0.9619 | 0.9662 |
jieba | 0.8559 | 0.7896 | 0.8214 |
pkuseg | 0.9512 | 0.9224 | 0.9366 |
THULAC | 0.9287 | 0.9295 | 0.9291 |
MSR dataset
Model | Precision | Recall | F1 score |
---|---|---|---|
Uni-Gram | 0.9119 | 0.9633 | 0.9369 |
Uni-Gram + rules | 0.9129 | 0.9634 | 0.9375 |
HMM | 0.7786 | 0.8189 | 0.7983 |
CRF | 0.9675 | 0.9676 | 0.9675 |
Bi-LSTM | 0.9624 | 0.9625 | 0.9624 |
Bi-LSTM+CRF | 0.9631 | 0.9632 | 0.9632 |
BERT | 0.9841 | 0.9817 | 0.9829 |
BERT-CRF | 0.9805 | 0.9787 | 0.9796 |
jieba | 0.8204 | 0.8145 | 0.8174 |
pkuseg | 0.8701 | 0.8894 | 0.8796 |
THULAC | 0.8428 | 0.8880 | 0.8648 |
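For readers who want to see how scores of this kind are usually obtained, the sketch below computes word-level precision, recall, and F1 by comparing the character spans of predicted words against the gold words. It follows the usual convention for segmentation evaluation and is not the project's scoring code; the function names `to_spans` and `prf` are hypothetical, and the project's exact averaging may differ.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character offsets."""
    spans = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def prf(gold_sentences, pred_sentences):
    """Corpus-level precision, recall, and F1 over word spans."""
    n_gold = n_pred = n_correct = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans, pred_spans = to_spans(gold), to_spans(pred)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
        # A predicted word is correct only if both of its boundaries match the gold word.
        n_correct += len(gold_spans & pred_spans)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold = [["研究", "生命", "的", "起源"]]
pred = [["研究生", "命", "的", "起源"]]
print(prf(gold, pred))
```

In this toy case two of the four predicted words match the gold spans, so precision, recall, and F1 are all 0.5.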