Implementing Chinese word segmentation with traditional methods (n-gram, HMM, etc.), neural network methods (CNN, LSTM, etc.), and pre-trained models (BERT, etc.)
2022-07-02 06:14:00 【JackHCC】
Tags: Natural language processing, Chinese word segmentation
Project address :https://github.com/JackHCC/Chinese-Tokenization
Method overview
- Traditional algorithms: n-gram, HMM, maximum entropy, CRF, etc.
- Neural network methods: CNN, Bi-LSTM, Transformer, etc.
- Pre-trained language models: BERT, etc.
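As a sketch of the simplest traditional approach, a unigram model scores every possible segmentation by the sum of word log-probabilities and picks the best one with dynamic programming. The tiny vocabulary below is illustrative only; in the project the word probabilities would be estimated from the PKU/MSR training corpora.

```python
import math

# Hypothetical toy unigram "model": word -> probability (illustrative values).
UNIGRAM = {"北京": 0.3, "大学": 0.3, "北京大学": 0.2, "生": 0.1, "大学生": 0.1}
OOV_LOGP = math.log(1e-8)  # heavy penalty for out-of-vocabulary single characters

def unigram_segment(text: str) -> list:
    """Best segmentation by maximizing the sum of word log-probabilities
    (Viterbi-style DP over prefix positions)."""
    n = len(text)
    best = [(-math.inf, 0)] * (n + 1)  # best[i] = (score, backpointer) for text[:i]
    best[0] = (0.0, 0)
    for i in range(1, n + 1):
        for j in range(max(0, i - 4), i):  # cap candidate word length at 4 characters
            word = text[j:i]
            if word in UNIGRAM:
                logp = math.log(UNIGRAM[word])
            elif i - j == 1:
                logp = OOV_LOGP  # unknown single character fallback
            else:
                continue
            score = best[j][0] + logp
            if score > best[i][0]:
                best[i] = (score, j)
    # Recover the word sequence by following backpointers from the end.
    words, i = [], n
    while i > 0:
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return words[::-1]

print(unigram_segment("北京大学生"))
```

Note how the model prefers "北京 / 大学生" (joint probability 0.03) over "北京大学 / 生" (0.02): the DP compares whole-sentence scores, not greedy left-to-right matches.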
Dataset overview
- PKU and MSR are the datasets used in the SIGHAN 2005 Chinese word segmentation bakeoff; they are also the standard benchmark datasets for academic evaluation of word segmentation tools.
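The result tables below report word-level precision, recall, and F1. A simplified sketch of how these are computed (modeled on the SIGHAN bakeoff scorer, which matches predicted word spans against gold spans; function names are illustrative):

```python
def word_spans(words):
    """Convert a word list into character-offset spans,
    e.g. ["我", "爱北京"] -> {(0, 1), (1, 4)}."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def prf(gold_words, pred_words):
    """Word-level precision / recall / F1: a predicted word counts as
    correct only if its exact character span appears in the gold standard."""
    gold, pred = word_spans(gold_words), word_spans(pred_words)
    correct = len(gold & pred)
    p = correct / len(pred)
    r = correct / len(gold)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For example, scoring the prediction "我 / 爱北京" against the gold "我 / 爱 / 北京" gives precision 1/2, recall 1/3, and F1 0.4, since only "我" is an exact span match.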
Experimental process
Traditional methods
Neural network methods
Pre-trained model methods
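The neural and pre-trained approaches typically cast segmentation as per-character tagging with a BMES scheme (B = word begin, M = middle, E = end, S = single-character word). A minimal decoding step (illustrative, not the project's exact code) converts a model's predicted tag sequence back into words:

```python
def tags_to_words(chars, tags):
    """Decode a BMES character-tag sequence (as emitted by Bi-LSTM/CRF/BERT
    taggers) back into a word list."""
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        buf += ch
        if tag in ("E", "S"):  # this character closes the current word
            words.append(buf)
            buf = ""
    if buf:  # tolerate a malformed sequence that ends mid-word
        words.append(buf)
    return words

print(tags_to_words("我爱北京", ["S", "S", "B", "E"]))
```

A CRF layer on top of the tagger (as in Bi-LSTM+CRF and BERT-CRF) constrains transitions so that invalid sequences such as B followed by S are never predicted.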
Experimental results
PKU dataset
Model | Precision | Recall | F1 score |
---|---|---|---|
Uni-Gram | 0.8550 | 0.9342 | 0.8928 |
Uni-Gram + rules | 0.9111 | 0.9496 | 0.9300 |
HMM | 0.7936 | 0.8090 | 0.8012 |
CRF | 0.9409 | 0.9396 | 0.9400 |
Bi-LSTM | 0.9248 | 0.9236 | 0.9240 |
Bi-LSTM+CRF | 0.9366 | 0.9354 | 0.9358 |
BERT | 0.9712 | 0.9635 | 0.9673 |
BERT-CRF | 0.9705 | 0.9619 | 0.9662 |
jieba | 0.8559 | 0.7896 | 0.8214 |
pkuseg | 0.9512 | 0.9224 | 0.9366 |
THULAC | 0.9287 | 0.9295 | 0.9291 |
MSR dataset
Model | Precision | Recall | F1 score |
---|---|---|---|
Uni-Gram | 0.9119 | 0.9633 | 0.9369 |
Uni-Gram + rules | 0.9129 | 0.9634 | 0.9375 |
HMM | 0.7786 | 0.8189 | 0.7983 |
CRF | 0.9675 | 0.9676 | 0.9675 |
Bi-LSTM | 0.9624 | 0.9625 | 0.9624 |
Bi-LSTM+CRF | 0.9631 | 0.9632 | 0.9632 |
BERT | 0.9841 | 0.9817 | 0.9829 |
BERT-CRF | 0.9805 | 0.9787 | 0.9796 |
jieba | 0.8204 | 0.8145 | 0.8174 |
pkuseg | 0.8701 | 0.8894 | 0.8796 |
THULAC | 0.8428 | 0.8880 | 0.8648 |