NLP word segmentation
2022-07-29 06:46:00 【yc_ ZZ】
Word segmentation
What is word segmentation
Word segmentation is an important step in natural language processing (NLP). It breaks long text such as sentences, paragraphs, and articles down into a data structure whose unit is the word, which makes subsequent processing and analysis easier.
Why word segmentation is needed
1. Turn complex problems into mathematical problems
Machine learning appears to solve many complex problems because it turns them into mathematical problems. NLP follows the same idea: text is "unstructured data", and it must first be converted into "structured data". Structured data can then be turned into a mathematical problem, and word segmentation is the first step of that conversion.
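As a small illustration of the "unstructured text to structured data" step, the sketch below segments a sentence and turns the resulting words into a count vector. The sample sentence, the whitespace split, and the helper name `bag_of_words` are assumptions chosen for illustration; the original post does not prescribe any particular representation.

```python
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Return one count per vocabulary word (a simple bag-of-words vector)."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Step 1: segmentation. Splitting on whitespace is enough for this English example.
tokens = "the cat sat on the mat".split()

# Step 2: fix a vocabulary so that every position in the vector has a meaning.
vocabulary = sorted(set(tokens))

# Step 3: the text is now a numeric vector that a model can consume.
vector = bag_of_words(tokens, vocabulary)
print(vocabulary)  # ['cat', 'mat', 'on', 'sat', 'the']
print(vector)      # [1, 1, 1, 1, 2]
```

Once the text is a vector of numbers, standard mathematical tools (distances, linear models, and so on) can be applied to it.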
2. Words are a more appropriate granularity
The word is the smallest unit that expresses a complete meaning. A single character is too fine-grained to express a complete meaning: for example, 鼠 can be part of 老鼠 (mouse, the animal) or of 鼠标 (computer mouse). A sentence is too coarse-grained: it carries too much information and is hard to reuse. Consider the sentence "Traditional methods require word segmentation, largely because they are weak at modeling long-distance dependencies."
Differences between Chinese and English word segmentation
1. The segmentation methods differ, and Chinese is harder
English has spaces as natural delimiters, but Chinese does not. Deciding where to split is therefore a difficulty in itself. In addition, Chinese words very often have more than one meaning, so ambiguity arises easily.
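The contrast can be sketched in a few lines. English can be split on spaces, while Chinese needs some other source of word boundaries. The example below uses forward maximum matching against a dictionary, which is one classic dictionary-based approach and not necessarily the one the author has in mind. The tiny dictionary and the sample sentences are made up for illustration.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Greedy dictionary segmentation: at each position take the longest known word."""
    if max_len is None:
        max_len = max(len(word) for word in dictionary)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so that the loop makes progress.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# English: whitespace already marks the word boundaries.
english_tokens = "I like natural language processing".split()

# Chinese: no spaces, so boundaries come from a (hypothetical) dictionary.
toy_dictionary = {"我", "喜欢", "自然", "语言", "处理", "自然语言", "自然语言处理"}
chinese_tokens = forward_max_match("我喜欢自然语言处理", toy_dictionary)

print(english_tokens)  # ['I', 'like', 'natural', 'language', 'processing']
print(chinese_tokens)  # ['我', '喜欢', '自然语言处理']
```

Because the matching is greedy, the result depends entirely on the dictionary: a different word list can produce a different split of the same sentence, which is one way the ambiguity mentioned above shows up in practice.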
2. English words have many forms
English words undergo rich inflection and derivation. To handle these variations, English NLP has some processing steps that Chinese does not need: lemmatization and stemming.
Lemmatization: does, done, doing, and did need to be restored to do.
Stemming: words such as cities, children, and teeth need to be converted to the base forms city, child, and tooth. (Strictly speaking, a purely rule-based stemmer only strips affixes, so irregular forms such as children and teeth usually require the dictionary lookup that lemmatization performs.)
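To make the two steps concrete, here is a toy sketch that uses only the standard library. The lookup table and the suffix rules are hypothetical and cover only the example words above; real lemmatizers and stemmers rely on much larger lexicons and rule sets.

```python
# Hypothetical lookup table: lemmatization maps each inflected form to its dictionary form.
LEMMA_TABLE = {
    "does": "do", "done": "do", "doing": "do", "did": "do",
    "children": "child", "teeth": "tooth",
}

# Hypothetical suffix rules, checked in order: (suffix to strip, replacement).
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("es", ""), ("s", "")]


def lemmatize(word):
    """Dictionary lookup; unknown words are returned unchanged (lowercased)."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)


def stem(word):
    """Rule-based suffix stripping; keeps at least two characters of the word."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)] + replacement
    return word


print([lemmatize(w) for w in ["does", "done", "doing", "did"]])  # ['do', 'do', 'do', 'do']
print([stem(w) for w in ["cities", "doing", "does"]])            # ['city', 'do', 'do']
print([stem(w) for w in ["children", "teeth"]])                  # ['children', 'teeth']
```

The last line shows why both steps exist: suffix rules handle regular forms cheaply, while irregular forms only resolve through a lookup.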
3. Chinese word segmentation must consider granularity
For example, 中国科学技术大学 (University of Science and Technology of China) can be segmented in several ways:
中国科学技术大学
中国 \ 科学技术 \ 大学
中国 \ 科学 \ 技术 \ 大学
The coarser the granularity, the more precise the meaning, but the lower the recall. Chinese segmentation therefore has to choose a granularity according to the scenario and requirements. English has no such problem.
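The recall trade-off can be illustrated with the three segmentations listed above of 中国科学技术大学 (University of Science and Technology of China). The sketch assumes a retrieval setting in which a query only matches when it equals one of the indexed tokens. That exact-match rule and the function name `matches` are simplifying assumptions for illustration.

```python
# The three granularities from the text, from coarsest to finest.
coarse = ["中国科学技术大学"]
medium = ["中国", "科学技术", "大学"]
fine = ["中国", "科学", "技术", "大学"]


def matches(query, tokens):
    """Return True if the query is exactly one of the indexed tokens."""
    return query in tokens


for query in ["中国科学技术大学", "大学", "科学"]:
    hits = {
        "coarse": matches(query, coarse),
        "medium": matches(query, medium),
        "fine": matches(query, fine),
    }
    print(query, hits)
```

Under this assumption the coarse segmentation only answers the full name (precise, but low recall), while the fine segmentation also answers 大学 (university) and 科学 (science), at the cost of matching many unrelated documents that merely contain those words.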
4. Word segmentation tools