Jieba ("stutter") word segmentation: how the word segmenter works
2022-06-28 09:24:00 【Java architects must see】
Install the jieba library: pip3 install jieba
# Jieba word segmentation
# -*- coding:utf-8 -*-
import sys
import os
import jieba

# Sample text to segment. jieba targets Chinese text, so in practice sent holds a Chinese sentence.
sent = 'Tianshan Intelligence is a technical community (www.hellobi.com) focused on business intelligence (BI), data analysis, data mining and big data technology. Its content has expanded from the original BI field to data analysis, data mining and big data technologies, including R, Python, SPSS, Hadoop, Spark, Hive and Kylin, making it a vertical community focused on the data field. Tianshan Intelligence is committed to building an ecosystem around the data field, linking all data-related resources through the community, such as the data itself, people, data solution providers and enterprises, and working with everyone to promote the adoption and development of big data and BI in China.'
print(sent)

The jieba module has three word segmentation modes:
1. Full mode: scans out every span in the sentence that can form a word. It is very fast, but it does not resolve ambiguity. Because it emits every dictionary match, the output contains overlapping, repeated words, which is clearly not what we need here.
2. Accurate mode: tries to cut the sentence as precisely as possible. It is suitable for text analysis (similar to LTP word segmentation), and its output is closer to what we want.
3. Search engine mode: builds on accurate mode and splits long words again to increase recall. It is suitable for search engine indexing, and its output is also good, with finer granularity.
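To make the difference between the three modes concrete, here is a minimal toy sketch in pure Python. It is not jieba's code: the dictionary, its frequencies, and the function names `full_mode`, `accurate_mode` and `search_mode` are all hypothetical. It assumes, as jieba's README describes for accurate mode, that the best split is the maximum-probability path found by dynamic programming over word frequencies.

```python
import math

# Hypothetical toy dictionary: word -> frequency (not jieba's real data).
FREQ = {"我": 100, "来到": 50, "北京": 60, "清华": 20,
        "华大": 5, "大学": 40, "清华大学": 30}
TOTAL = sum(FREQ.values())
MAX_LEN = max(len(word) for word in FREQ)


def full_mode(sentence):
    """Emit every substring that is a dictionary word (overlaps included)."""
    words = []
    for start in range(len(sentence)):
        for end in range(start + 1, min(len(sentence), start + MAX_LEN) + 1):
            if sentence[start:end] in FREQ:
                words.append(sentence[start:end])
    return words


def accurate_mode(sentence):
    """Pick the single segmentation with the highest total log-probability."""
    n = len(sentence)
    # best[i] = (score of the best split of sentence[i:], end index of its first word)
    best = [(0.0, n)] * (n + 1)
    for start in range(n - 1, -1, -1):
        candidates = []
        for end in range(start + 1, min(n, start + MAX_LEN) + 1):
            piece = sentence[start:end]
            # Unknown single characters are allowed with a floor frequency of 1.
            if piece in FREQ or end == start + 1:
                score = math.log(FREQ.get(piece, 1) / TOTAL) + best[end][0]
                candidates.append((score, end))
        best[start] = max(candidates)
    words, i = [], 0
    while i < n:
        end = best[i][1]
        words.append(sentence[i:end])
        i = end
    return words


def search_mode(sentence):
    """Accurate mode, plus the shorter dictionary words inside each long word."""
    words = []
    for word in accurate_mode(sentence):
        for size in (2, 3):
            if len(word) > size:
                for i in range(len(word) - size + 1):
                    if word[i:i + size] in FREQ:
                        words.append(word[i:i + size])
        words.append(word)
    return words


sentence = "我来到北京清华大学"
print("|".join(full_mode(sentence)))      # 我|来到|北京|清华|清华大学|华大|大学
print("|".join(accurate_mode(sentence)))  # 我|来到|北京|清华大学
print("|".join(search_mode(sentence)))    # 我|来到|北京|清华|华大|大学|清华大学
```

In this toy, full mode returns overlapping words such as 清华, 清华大学 and 大学, accurate mode keeps 清华大学 whole because one frequent word scores higher than two shorter ones, and search mode adds the shorter words back for recall.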
# Full mode
wordlist = jieba.cut(sent, cut_all=True)
print('|'.join(wordlist))

# Accurate mode (the default)
wordlist = jieba.cut(sent)
print('|'.join(wordlist))

# Search engine mode
wordlist = jieba.cut_for_search(sent)
print('|'.join(wordlist))

A new problem: adding a user-defined dictionary

Looking back at the accurate-mode output, some new or domain-specific terms, such as "Tianshan Intelligence" and "big data", are cut apart when they should stay whole. To fix this, we can load a custom dictionary on top of the default one. In the jieba module directory there is a dictionary file, dict.txt. Opening it shows that each entry has three fields: 1. the word, 2. a number (the word frequency; the higher it is, the more readily the word is matched), 3. the part of speech. For convenience, we define a dictionary named userdict.txt in the same format and add it.
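The sketch below illustrates why adding a dictionary entry changes the result. It is a toy in pure Python, not jieba's implementation: the base dictionary, its frequencies, the sample entry, and the helper names `parse_userdict` and `segment` are hypothetical. It assumes the user dictionary format described in jieba's README (one entry per line: word, optional frequency, optional part-of-speech tag, separated by spaces) and a frequency-based best-path segmentation.

```python
import math


def parse_userdict(text, default_freq=10):
    """Parse lines of 'word [freq] [tag]' into {word: freq}; the tag is ignored here."""
    entries = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        has_freq = len(parts) > 1 and parts[1].isdecimal()
        # default_freq is a hypothetical fallback used only by this toy.
        entries[parts[0]] = int(parts[1]) if has_freq else default_freq
    return entries


def segment(sentence, freq):
    """Return the split of sentence with the highest total log-probability."""
    total = sum(freq.values())
    max_len = max(len(word) for word in freq)
    n = len(sentence)
    # best[i] = (score of the best split of sentence[i:], end index of its first word)
    best = [(0.0, n)] * (n + 1)
    for start in range(n - 1, -1, -1):
        candidates = []
        for end in range(start + 1, min(n, start + max_len) + 1):
            piece = sentence[start:end]
            # Unknown single characters are allowed with a floor frequency of 1.
            if piece in freq or end == start + 1:
                score = math.log(freq.get(piece, 1) / total) + best[end][0]
                candidates.append((score, end))
        best[start] = max(candidates)
    words, i = [], 0
    while i < n:
        end = best[i][1]
        words.append(sentence[i:end])
        i = end
    return words


# Hypothetical default dictionary that does not know the term 大数据 ("big data").
base = {"大": 30, "数据": 50, "技术": 40}
# Hypothetical userdict.txt content: word, frequency, part-of-speech tag.
user_text = "大数据 20 n\n"
merged = {**base, **parse_userdict(user_text)}

print("|".join(segment("大数据技术", base)))    # 大|数据|技术
print("|".join(segment("大数据技术", merged)))  # 大数据|技术
```

Once the term has its own entry with a reasonable frequency, the single-word path outscores the path that splits it, which is the effect the higher-frequency rule above describes.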
# Load the user-defined dictionary, then segment again in accurate mode
jieba.load_userdict('D:\\Anaconda3\\Lib\\site-packages\\jieba\\userdict.txt')
wordlist = jieba.cut(sent)
print('|'.join(wordlist))

Reference material:
https://zhuanlan.zhihu.com/p/29747350?utm_source=qq&utm_medium=social&utm_oi=780081763178258432