[NLP] Generating a Word Cloud
2022-07-28 20:18:00 【Du恒之】
import warnings
warnings.filterwarnings("ignore")

import jieba  # Chinese word segmentation
import numpy  # numerical computing
import codecs  # codecs.open lets you specify a file's encoding; text is decoded to Unicode on read
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
# Jupyter/IPython magic; remove this line when running as a plain script
%matplotlib inline
from imageio import imread
from wordcloud import WordCloud, ImageColorGenerator  # word cloud package

# Figure size for the word cloud
matplotlib.rcParams['figure.figsize'] = (15.0, 15.0)
def createSuperWordCloud(text_path, image_path):
    # e.g. "./data/entertainment_news.csv"
    df = pd.read_csv(text_path, encoding='utf-8')
    # Drop rows with missing values
    df = df.dropna()
    # df.head()
    # Convert the content column to a list
    content = df.content.values.tolist()
    segment = []
    for line in content:
        try:
            # jieba.lcut returns the tokens as a list
            segs = jieba.lcut(line)
            for seg in segs:
                # Keep tokens longer than one character that are not a line break
                if len(seg) > 1 and seg != '\r\n':
                    segment.append(seg)
        except Exception:
            # Print the line that could not be segmented and move on
            print(line)
            continue
    words_df = pd.DataFrame({'segment': segment})
    # quoting=3 (csv.QUOTE_NONE) disables quote handling
    stopwords = pd.read_csv("data/stopwords.txt", index_col=False, quoting=3, sep="\t",
                            names=['stopword'], encoding='utf-8')
    # Find the tokens that appear in the stop-word list, then remove them
    words_df = words_df[~words_df.segment.isin(stopwords.stopword)]
    # The tricky part: word-frequency counting
    words_stat = words_df.groupby('segment').agg(
        计数=pd.NamedAgg(column='segment', aggfunc='size')
    ).reset_index().sort_values(by='计数', ascending=False)
    # Read the image used as the mask and as the color source
    bimg = imread(image_path)
    # Configure the word cloud
    wordcloud = WordCloud(background_color="white", mask=bimg,
                          font_path='data/simhei.ttf', max_font_size=200)
    # Build a {word: count} dict from the 1000 most frequent words
    word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}
    wordcloud = wordcloud.fit_words(word_frequence)
    # Recolor the words using the colors of the image
    bimgColors = ImageColorGenerator(bimg)
    # Hide the axes
    plt.axis("off")
    # Draw the recolored word cloud
    plt.imshow(wordcloud.recolor(color_func=bimgColors))

# Call the function
createSuperWordCloud("./data/entertainment_news.csv", 'image/entertainment.jpeg')
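
The filtering and word-frequency steps in `createSuperWordCloud` can also be sketched without pandas. This is a minimal sketch, not the author's method: it swaps the `groupby`/`NamedAgg` step for `collections.Counter` from the standard library. The helper name `count_words`, its parameters, and the sample tokens are hypothetical, and it assumes the text has already been segmented (for example with `jieba.lcut`).

```python
from collections import Counter


def count_words(tokens, stopwords, top_n=1000):
    """Return a {word: count} dict of the top_n most frequent kept tokens.

    Hypothetical helper: applies the same filter as the function above
    (longer than one character, not a line break, not a stop word).
    """
    kept = [
        t for t in tokens
        if len(t) > 1 and t != "\r\n" and t not in stopwords
    ]
    # most_common sorts by count in descending order
    return dict(Counter(kept).most_common(top_n))


# Made-up sample data standing in for segmented news text
tokens = ["电影", "上映", "的", "电影", "票房", "\r\n", "上映", "电影"]
stopwords = {"票房"}

freq = count_words(tokens, stopwords)
print(freq)
```

The resulting dict has the same `{word: count}` shape as `word_frequence`, so it could be passed to `fit_words` in the same way.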
Result
