Word Frequency Counting with the jieba Library
2022-06-12 05:26:00 【算法与编程之美】
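0 Introduction
When reading an article or a classic novel, we often want to know how many times each word appears and how frequent it is. The third-party Python library jieba makes this straightforward.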
1 Problem
Counting word frequencies in an article or a book shows which things or people the author spent the most words mentioning and describing.
2 Method

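- encoding='ANSI': opens the file with the ANSI encoding, that is, the Windows system code page. Python only provides this codec name on Windows; on other platforms, pass the file's actual encoding instead (for example 'utf-8' or 'gbk').
- read(size): reads up to size characters from the current position in the file (size bytes if the file is opened in binary mode). If size is omitted, it reads to the end of the file. For a file opened in text mode, it returns a string object.
- items = list(counts.items()): stores the (word, count) pairs of counts in the list items. Note the parentheses after items: it is a method call.
- key = lambda x: x[1]: equivalent to def func(x): return x[1]. It makes the sort compare the second element of each pair, which is the count.
- reverse = True: sorts in descending order. Without reverse = True the list is sorted in ascending order.
- {0:<10}{1:>5}: < means left-aligned, > means right-aligned, and the number is the field width. <10 left-aligns the value in 10 columns, and >5 right-aligns it in 5 columns.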
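The sorting and formatting steps described above can be tried in isolation, without jieba. The following is a minimal sketch; the counts dictionary is made up for illustration and is not a result from any real text.

```python
# Hypothetical counts, made up for illustration only.
counts = {"曹操": 3, "孔明": 5, "将军": 2}

# dict.items() yields (word, count) pairs; list() copies them into a list.
items = list(counts.items())

# Sort by the count (index 1 of each pair), largest first.
items.sort(key=lambda x: x[1], reverse=True)

# Left-align the word in 10 columns, right-align the count in 5.
lines = ["{0:<10}{1:>5}".format(word, count) for word, count in items]
for line in lines:
    print(line)
```

The field width counts characters, not display columns, so Chinese words, which most terminals draw at double width, may not line up exactly on screen.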
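3 Results and Discussion
Running the program in Listing 1 shows that the method works and solves the problem stated above: it prints the 15 most frequent words of the text, excluding single-character words, together with their counts.
Listing 1

    import jieba

    # 'ANSI' is a Windows-only codec alias for the system code page;
    # elsewhere, pass the file's actual encoding (e.g. 'utf-8' or 'gbk').
    with open("三国演义.txt", "r", encoding='ANSI') as f:
        txt = f.read()

    words = jieba.lcut(txt)  # segment the text into a list of words

    counts = {}
    for word in words:
        if len(word) == 1:  # skip single-character tokens
            continue
        counts[word] = counts.get(word, 0) + 1

    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)  # most frequent first

    # Slicing avoids an IndexError when there are fewer than 15 words.
    for word, count in items[:15]:
        print("{0:<10}{1:>5}".format(word, count))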
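As an alternative sketch, the counting and ranking steps of the listing above can also be written with collections.Counter from the standard library. This is not the article's method. The helper name top_words and the sample token list are assumptions for illustration; in practice the tokens would come from jieba.lcut(txt).

```python
from collections import Counter


def top_words(tokens, n=15):
    """Return the n most frequent tokens that are longer than one character."""
    # Skip single-character tokens, as the listing above does.
    counter = Counter(tok for tok in tokens if len(tok) > 1)
    # most_common(n) returns at most n (token, count) pairs, highest count first.
    return counter.most_common(n)


# Hand-written tokens standing in for the output of jieba.lcut(txt).
sample = ["曹操", "说", "孔明", "曹操", "的", "孔明", "曹操"]
result = top_words(sample, n=2)
print(result)
```

Counter replaces the manual counts.get(word, 0) + 1 bookkeeping, and most_common handles the case where fewer than n distinct words exist.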
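4 Conclusion
Counting word frequencies in a text with jieba is an enjoyable exercise: with this one third-party library we can obtain a text's most frequent words without reading it. jieba can do much more than this, and its other features are worth exploring.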