[NLP] Generating a word cloud
2022-07-28 22:04:00 【Du Hengzhi】
import codecs  # codecs.open lets you specify a file's encoding; the text is decoded to unicode as it is read
import warnings

import jieba  # Chinese word segmentation package
import matplotlib
import matplotlib.pyplot as plt
import numpy  # numerical computing package
import pandas as pd
from imageio import imread
from wordcloud import WordCloud, ImageColorGenerator  # word cloud package

%matplotlib inline
warnings.filterwarnings("ignore")

# Figure size used when the word cloud is displayed
matplotlib.rcParams['figure.figsize'] = (15.0, 15.0)

def createSuperWordCloud(text_path, image_path):
    # text_path is a CSV file with a "content" column, e.g. "./data/entertainment_news.csv"
    df = pd.read_csv(text_path, encoding='utf-8')
    # Drop rows with missing values
    df = df.dropna()
    # Convert the content column to a list
    content = df.content.values.tolist()
    segment = []
    for line in content:
        try:
            # jieba.lcut returns the tokens as a list
            segs = jieba.lcut(line)
            for seg in segs:
                # Keep tokens longer than one character and skip line breaks
                if len(seg) > 1 and seg != '\r\n':
                    segment.append(seg)
        except Exception:
            # Print the line that could not be segmented and move on
            print(line)
            continue
    words_df = pd.DataFrame({'segment': segment})
    # quoting=3 (csv.QUOTE_NONE): quote characters are treated as ordinary text
    stopwords = pd.read_csv("data/stopwords.txt", index_col=False, quoting=3, sep="\t",
                            names=['stopword'], encoding='utf-8')
    # Remove every token that appears in the stop word list
    words_df = words_df[~words_df.segment.isin(stopwords.stopword)]
    # Word frequency statistics: count each token and sort, most frequent first
    words_stat = (words_df.groupby('segment')
                  .agg(Count=pd.NamedAgg(column='segment', aggfunc='size'))
                  .reset_index()
                  .sort_values(by='Count', ascending=False))
    # Read the picture used as the mask and as the colour source
    bimg = imread(image_path)
    # Configure the word cloud (simhei.ttf is needed to render Chinese characters)
    wordcloud = WordCloud(background_color="white", mask=bimg,
                          font_path='data/simhei.ttf', max_font_size=200)
    # Map each of the 1000 most frequent tokens to its count
    word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}
    wordcloud = wordcloud.fit_words(word_frequence)
    # Take the word colours from the picture
    bimgColors = ImageColorGenerator(bimg)
    # Draw the recoloured word cloud and hide the axes
    plt.imshow(wordcloud.recolor(color_func=bimgColors))
    plt.axis("off")


# Call the function
createSuperWordCloud("./data/entertainment_news.csv", 'image/entertainment.jpeg')
Result: [figure: the generated word cloud]
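
The word frequency step in createSuperWordCloud can also be tried on its own, without pandas or any input files. The following is a minimal sketch using only the standard library; the helper name `top_word_frequencies`, the `limit` parameter, and the toy tokens and stop words are assumptions made for illustration and are not part of the original code. It applies the same rules as the listing above: keep tokens longer than one character, skip line breaks, drop stop words, and keep the most frequent tokens.

```python
from collections import Counter


def top_word_frequencies(tokens, stopwords, limit=1000):
    """Return {token: count} for the `limit` most frequent tokens.

    Hypothetical helper: it mirrors the filtering and counting that
    createSuperWordCloud does with pandas, using only the standard library.
    """
    kept = [
        tok for tok in tokens
        # Same filter as the original loop, plus the stop word check
        if len(tok) > 1 and tok != "\r\n" and tok not in stopwords
    ]
    # most_common returns (token, count) pairs sorted by count, highest first
    return dict(Counter(kept).most_common(limit))


# Toy data, made up for illustration
tokens = ["电影", "明星", "电影", "的", "我们", "票房", "电影", "明星", "\r\n"]
stopwords = {"我们"}

frequencies = top_word_frequencies(tokens, stopwords)
print(frequencies)  # {'电影': 3, '明星': 2, '票房': 1}
```

A dict of this shape (token mapped to count) is what the listing passes to `fit_words` as `word_frequence`.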
