Natural Language Processing: jieba
2022-07-24 07:53:00 【Shadow follows】
Introduction
jieba is one of the better-performing Chinese word segmentation components for Python.
It supports four segmentation modes:
- Precise mode
- Full mode
- Search engine mode
- paddle mode
It also supports Traditional Chinese segmentation and custom dictionaries, and is released under the MIT License.
Installation
pip install jieba
import jieba
import jieba.posseg as pseg
Precise mode segmentation
Precise mode tries to cut the sentence as accurately as possible, which makes it suitable for text analysis. It corresponds to cut_all=False, which is the default.
content = "工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作"
# jieba.cut returns a generator
print(jieba.cut(content, cut_all=False))
# To get a list directly, use jieba.lcut
print(jieba.lcut(content, cut_all=False))
Running results
<generator object Tokenizer.cut at 0x000001D2FD21D970>
['工信处', '女干事', '每月', '经过', '下属', '科室', '都', '要', '亲口', '交代', '24', '口', '交换机', '等', '技术性', '器件', '的', '安装', '工作']
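The jieba README describes precise mode as building a directed acyclic graph (DAG) of every dictionary word in the sentence and then using dynamic programming to pick the most probable path by word frequency. The following is a minimal sketch of that idea, not jieba's implementation: the FREQ table and its counts are invented for illustration, and the names build_dag and segment are hypothetical.

```python
import math

# Hypothetical toy dictionary: word -> frequency (made-up counts, not jieba's data).
FREQ = {"我": 50, "爱": 30, "北": 5, "京": 5, "北京": 40,
        "天": 10, "安": 5, "门": 10, "天安门": 20}
TOTAL = sum(FREQ.values())


def build_dag(sentence):
    """Map each start index to the end indices of dictionary words starting there."""
    dag = {}
    for i in range(len(sentence)):
        ends = [j for j in range(i, len(sentence)) if sentence[i:j + 1] in FREQ]
        dag[i] = ends or [i]  # unknown character: fall back to a one-character word
    return dag


def segment(sentence):
    """Choose the path through the DAG with the highest total log probability."""
    dag = build_dag(sentence)
    n = len(sentence)
    # best[i] = (best log probability from i to the end, end index of the first word)
    best = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        best[i] = max(
            (math.log(FREQ.get(sentence[i:j + 1], 1) / TOTAL) + best[j + 1][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words


print(segment("我爱北京天安门"))  # ['我', '爱', '北京', '天安门']
```

With these toy counts, the multi-character words 北京 and 天安门 win over their single-character splits because one frequent word scores higher than several rare ones.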
Full mode segmentation
Full mode scans the sentence for every substring that can form a word. It is very fast, but it cannot resolve ambiguity.
content = "工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作"
# jieba.cut returns a generator
print(jieba.cut(content, cut_all=True))
# To get a list directly, use jieba.lcut
print(jieba.lcut(content, cut_all=True))
Running results
<generator object Tokenizer.cut at 0x000001D2FD21D970>
['工信处', '处女', '女干事', '干事', '每月', '月经', '经过', '下属', '科室', '都', '要', '亲口', '口交', '交代', '24', '口交', '交换', '交换机', '换机', '等', '技术', '技术性', '性器', '器件', '的', '安装', '安装工', '装工', '工作']
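To see why full mode produces overlapping words, it can help to enumerate every dictionary word at every start position. The sketch below does only that, under stated assumptions: the vocabulary is a hypothetical four-word set, full_mode is an illustrative name, and unlike jieba this toy version silently drops characters that belong to no dictionary word.

```python
def full_mode(sentence, vocab, max_len=4):
    """Return every substring of the sentence that appears in vocab, by start position."""
    found = []
    for i in range(len(sentence)):
        # Try every candidate word starting at i, shortest first.
        for j in range(i + 1, min(len(sentence), i + max_len) + 1):
            if sentence[i:j] in vocab:
                found.append(sentence[i:j])
    return found


# Hypothetical vocabulary for illustration only.
vocab = {"安装", "交换", "交换机", "换机"}
print(full_mode("安装交换机", vocab))  # ['安装', '交换', '交换机', '换机']
```

No choice is made between 交换, 交换机, and 换机, which is the ambiguity that full mode leaves unresolved.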
Search engine mode segmentation
Building on precise mode, search engine mode splits long words again to improve recall, which makes it suitable for search engine indexing.
content = "工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作"
print(jieba.lcut_for_search(content))
# A second sentence, segmented in the default precise mode
content = "烦恼即是菩提,我暂且不提"
print(jieba.lcut(content))
Running results
['工信处', '干事', '女干事', '每月', '经过', '下属', '科室', '都', '要', '亲口', '交代', '24', '口', '交换', '换机', '交换机', '等', '技术', '技术性', '器件', '的', '安装', '工作']
['烦恼', '即', '是', '菩提', ',', '我', '暂且', '不', '提']
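The re-splitting step can be pictured as follows: for each word from precise mode, emit any shorter dictionary words found inside it, then the word itself. This is a simplified sketch of that behavior rather than jieba's code; the function name split_for_search, the input word list, and the vocabulary are all assumptions made for the example.

```python
def split_for_search(words, vocab):
    """For each word, emit its 2- and 3-character sub-words found in vocab, then the word."""
    out = []
    for w in words:
        for size in (2, 3):
            if len(w) > size:  # only split words longer than the sub-word size
                for i in range(len(w) - size + 1):
                    gram = w[i:i + size]
                    if gram in vocab:
                        out.append(gram)
        out.append(w)
    return out


# Hypothetical precise-mode output and vocabulary.
words = ["女干事", "交换机", "安装"]
vocab = {"干事", "交换", "换机"}
print(split_for_search(words, vocab))
# ['干事', '女干事', '交换', '换机', '交换机', '安装']
```

The sub-words appear before the full word, which matches the ordering in the output above, where '干事' precedes '女干事'.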
Custom dictionary
Write the custom entries to ./userdict.txt, one per line. Each line holds the word, an optional frequency, and an optional part-of-speech tag, separated by spaces.
云计算 5 n
李小福 2 nr
easy_install 3 eng
好用 300
韩玉赏鉴 3 nz
八一双鹿 3 nz
# Before loading the custom dictionary
print(jieba.lcut("八一双鹿更名为八一南昌篮球队!"))
jieba.load_userdict("./userdict.txt")
# After loading the custom dictionary
print(jieba.lcut("八一双鹿更名为八一南昌篮球队!"))
Running results
['八', '一双', '鹿', '更名', '为', '八一', '南昌', '篮球队', '!']
['八一双鹿', '更名', '为', '八一', '南昌', '篮球队', '!']
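Since each dictionary line is a word followed by an optional frequency and an optional tag, it can be useful to check a file before loading it. The parser below is a small sketch under that assumption; LINE_RE and parse_userdict_line are hypothetical helpers written for this example and are not part of jieba.

```python
import re

# word, then optionally " <digits>", then optionally " <lowercase tag>"
LINE_RE = re.compile(r"^(.+?)(?: (\d+))?(?: ([a-z]+))?$")


def parse_userdict_line(line):
    """Return (word, frequency or None, tag or None), or None for a blank line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    word, freq, tag = match.groups()
    return word, (int(freq) if freq else None), tag


for line in ["云计算 5 n", "好用 300", "easy_install 3 eng"]:
    print(parse_userdict_line(line))
# ('云计算', 5, 'n')
# ('好用', 300, None)
# ('easy_install', 3, 'eng')
```

One limitation of this sketch: a trailing lowercase token is always read as a tag, so an entry whose word itself ends in a space plus lowercase letters would be misparsed.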
Part-of-speech tagging
Common tags include:
r: pronoun
v: verb
n: noun
vn: verbal noun
ns: place name
print(pseg.lcut('我爱北京天安门'))
Running results
[pair('我', 'r'), pair('爱', 'v'), pair('北京', 'ns'), pair('天安门', 'ns')]
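A common next step is to keep only words with certain tags, for example everything whose tag starts with n. The sketch below shows that filter on hand-written data so it runs without jieba installed: Pair is a hypothetical stand-in that mimics the word and flag attributes of the tagged results, and keep_by_prefix is an illustrative name.

```python
from collections import namedtuple

# Hypothetical stand-in for the tagged pairs shown above (word plus flag).
Pair = namedtuple("Pair", ["word", "flag"])

tagged = [Pair("我", "r"), Pair("爱", "v"), Pair("北京", "ns"), Pair("天安门", "ns")]


def keep_by_prefix(pairs, prefix):
    """Keep the words whose part-of-speech flag starts with the given prefix."""
    return [p.word for p in pairs if p.flag.startswith(prefix)]


print(keep_by_prefix(tagged, "n"))  # ['北京', '天安门']
```

Matching on a prefix rather than an exact tag keeps the noun subtypes such as ns and nr together with plain n.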