[NLP] Generating a Word Cloud
2022-07-28 20:18:00 【Du恒之】
import warnings
warnings.filterwarnings("ignore")

import jieba  # Chinese word segmentation
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from imageio import imread
from wordcloud import WordCloud, ImageColorGenerator  # word cloud package

# Jupyter-only magic: render figures inline in the notebook
%matplotlib inline

# Figure size for the word cloud
matplotlib.rcParams['figure.figsize'] = (15.0, 15.0)
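def createSuperWordCloud(text_path, image_path):
    # Load the news CSV, e.g. "./data/entertainment_news.csv"
    df = pd.read_csv(text_path, encoding='utf-8')
    # Drop rows with missing values
    df = df.dropna()
    # Convert the content column to a list
    content = df.content.values.tolist()
    segment = []
    for line in content:
        try:
            # jieba.lcut returns the tokens as a list
            segs = jieba.lcut(line)
            for seg in segs:
                # Keep tokens longer than one character that are not a line break
                if len(seg) > 1 and seg != '\r\n':
                    segment.append(seg)
        except Exception:
            # Print the line that could not be segmented and move on
            print(line)
            continue
    words_df = pd.DataFrame({'segment': segment})
    # quoting=3 (csv.QUOTE_NONE): quote characters get no special treatment
    stopwords = pd.read_csv("data/stopwords.txt", index_col=False, quoting=3, sep="\t", names=['stopword'], encoding='utf-8')
    # Remove every token that appears in the stopword list
    words_df = words_df[~words_df.segment.isin(stopwords.stopword)]
    # Word-frequency count (the tricky part): group identical tokens, take each
    # group's size, then sort by count descending. The column name 计数 means "count".
    words_stat = words_df.groupby('segment').agg(计数=pd.NamedAgg(column='segment', aggfunc='size')).reset_index().sort_values(
        by='计数', ascending=False)
    # Read the image used as the mask and as the color source
    bimg = imread(image_path)
    # Configure the word cloud; font_path must point to a font with Chinese glyphs
    wordcloud = WordCloud(background_color="white", mask=bimg, font_path='data/simhei.ttf', max_font_size=200)
    # Build a {word: count} dict from the 1000 most frequent words
    word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}
    wordcloud = wordcloud.fit_words(word_frequence)
    # Take the colors from the mask image
    bimgColors = ImageColorGenerator(bimg)
    # Hide the axes
    plt.axis("off")
    # Draw the recolored word cloud
    plt.imshow(wordcloud.recolor(color_func=bimgColors))

# Call the function
createSuperWordCloud("./data/entertainment_news.csv", 'image/entertainment.jpeg')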
Result
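The word-frequency step is the part the code flags as tricky. As a rough sketch of the same idea without pandas, the filtering and counting can be done with `collections.Counter`. The function name `count_words`, its parameters, and the sample tokens below are illustrative assumptions and not part of the original code; the tokens stand in for what `jieba.lcut` would return.

```python
from collections import Counter


def count_words(tokens, stopwords, top_n=1000):
    """Return a {word: count} dict for the top_n most frequent kept tokens.

    Hypothetical helper: mirrors the filters used in the post (token longer
    than one character, not a line break, not a stopword).
    """
    kept = (
        tok for tok in tokens
        if len(tok) > 1 and tok != "\r\n" and tok not in stopwords
    )
    # most_common sorts by count, highest first
    return dict(Counter(kept).most_common(top_n))


# Made-up tokens standing in for jieba.lcut output
tokens = ["电影", "的", "上映", "电影", "\r\n", "导演", "电影", "导演"]
stopwords = {"上映"}

freq = count_words(tokens, stopwords)
print(freq)  # {'电影': 3, '导演': 2}
```

A dict of this shape is what `WordCloud.fit_words` receives in the function above, so it could be passed in place of `word_frequence`.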
