[Notes] Drawing a word cloud with jieba word segmentation
2022-07-30 01:50:00 【Sprite.Nym】
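1. The three modes of jieba segmentation
(1) Precise mode: cuts the sentence into the most probable sequence of words, with no redundant tokens.
(2) Full mode: lists every possible word found in the sentence, so the output contains redundant, overlapping tokens.
(3) Search-engine mode: builds on precise mode by splitting long words again, which suits search-engine tokenization.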
2. Extracting the data with a regular expression
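import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import jieba
%matplotlib inline
# Load the data
douyin = pd.read_csv('data/douyin.csv')
# Extract runs of Chinese characters with a regex and join them per row,
# which mimics group_concat in MySQL
temp = douyin['signature'].str.extractall(r'[^一-龥]*([一-龥]+)[^一-龥]*').copy()
temp = temp.reset_index().groupby('level_0').agg({0: list})[0].apply(lambda x: ','.join(x))
# Placeholder: these notes do not show how the stop-word list is loaded,
# so replace this empty set with your own collection of stop words
stop_words = set()
# Tokenize the extracted strings and remove stop words
result0 = ','.join([','.join(jieba.lcut(statement)) for statement in temp]).split(',')
result1 = [x for x in result0 if x not in stop_words]
# Build a Series of word frequencies; [1:200] skips the most frequent entry and keeps the next 199
important_words = pd.Series(result1).value_counts()[1:200]
important_words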
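To see what the extraction and counting steps do without the original CSV, here is a small sketch on made-up data. The sample signatures and the one-word stop list are assumptions, and the jieba step is left out, so each extracted run of Chinese characters is treated as a single token.

```python
import pandas as pd

# Hypothetical sample data standing in for douyin['signature'].
signatures = pd.Series([
    'Hello 世界, 你好!',
    'no chinese here',
    '世界 love 爱抖音',
])

# One row per match; the result has a MultiIndex of (original row, match number).
matches = signatures.str.extractall(r'([\u4e00-\u9fa5]+)')

# Join the matches of each original row, similar to group_concat in MySQL.
# Rows without any Chinese characters produce no matches and drop out.
joined = matches.groupby(level=0)[0].agg(','.join)

# Flatten into one token list, filter stop words, then count frequencies.
tokens = ','.join(joined).split(',')
stop_words = {'你好'}  # hypothetical stop-word list
kept = [t for t in tokens if t not in stop_words]
freq = pd.Series(kept).value_counts()

print(joined)
print(freq)
```

The simpler pattern `([\u4e00-\u9fa5]+)` captures the same runs as the pattern in the notes, since the surrounding `[^一-龥]*` parts only consume non-Chinese characters.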
3. Drawing the word cloud
import wordcloud
from PIL import Image
# Load the reference image used for drawing
bgimg = np.array(Image.open('data/bgimg.png'))
# Use the colors of the reference image as the colors of the word cloud
genclr = wordcloud.ImageColorGenerator(bgimg)
# Create the word cloud object
wc = wordcloud.WordCloud(
    # Path to the font file (a font with Chinese glyphs is needed)
    font_path='data/FZZJ-LongYTJW.TTF',
    # Background color
    background_color='white',
    # Maximum number of words
    max_words=200,
    # Maximum and minimum font size
    max_font_size=300,
    min_font_size=5,
    # Random seed
    random_state=4,
    # Image whose outline shapes the word cloud
    mask=bgimg,
    # Color function for the words
    color_func=genclr)
# Render the words
wc.generate_from_frequencies(important_words)
# Show the word cloud with plt
plt.figure(figsize=(24, 24))
plt.imshow(wc)
plt.axis('off')
plt.show()
