Word frequency statistics for English words in a txt file
2022-08-05 05:25:00 【回首思】
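1. Requirements analysis
Sort the English words in a txt file by number of occurrences and write the result to a CSV file. Words with the same count are ordered by the MD5 hash of the word.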
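The ordering rule above can be sketched with the standard library alone. This is a minimal illustration, not the code used later in the post: the helper names `md5_hex` and `sort_by_count_then_md5` and the sample counts are hypothetical, and ascending order is assumed because the requirement does not state a direction.

```python
import hashlib


def md5_hex(word: str) -> str:
    """Return the hexadecimal MD5 digest of a word (UTF-8 encoded)."""
    return hashlib.md5(word.encode("utf-8")).hexdigest()


def sort_by_count_then_md5(counts: dict) -> list:
    """Sort (word, count) pairs by count, breaking ties by the word's MD5 digest."""
    return sorted(counts.items(), key=lambda item: (item[1], md5_hex(item[0])))


# Hypothetical counts, for illustration only.
example_counts = {"hello": 2, "a": 2, "world": 1}
ordered = sort_by_count_then_md5(example_counts)

# "world" comes first because it has the lowest count; "hello" and "a"
# share a count, so their relative order is decided by their MD5 digests.
for word, count in ordered:
    print(word, count, md5_hex(word))
```

Sorting on a `(count, digest)` tuple gives a deterministic order for ties that does not depend on the order in which words were first seen.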
2. Libraries used
- pandas
- re
- collections
- hashlib
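One detail of `re` is worth knowing before using it to strip punctuation: inside a character class, a hyphen between two characters denotes a range. The sketch below (the sample string is made up) compares a class built from the raw string `+-=` with one built through `re.escape`, which treats every character literally.

```python
import re

# Inside a character class, "+-=" is a range from "+" to "=", so it also
# matches the digits and characters such as "," "." "/" ":" ";" "<".
unescaped = re.compile(r"[+-=]")

# re.escape makes every character literal, so only "+", "-" and "=" match.
escaped = re.compile("[%s]" % re.escape("+-="))

sample = "a+b 2022 c-d"  # made-up input for illustration
range_result = unescaped.sub(" ", sample)
literal_result = escaped.sub(" ", sample)

print(repr(range_result))    # the digits are replaced as well
print(repr(literal_result))  # the digits survive
```

If the digits should be removed too, it is clearer to say so explicitly (for example by adding `0-9` to the class) than to rely on an accidental range.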
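3. The code
- Open the file
    # Assumes the text file is UTF-8 encoded
    txt_file = open(file_path, 'r', encoding='utf-8')
- Read the file contents
    txt_data = txt_file.read()
    txt_file.close()
- Convert all letters to lowercase
    txt_lower = txt_data.lower()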
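- Strip special symbols with a regular expression
    # Punctuation and special symbols to strip
    punc = '~`!#$%^&*()_+-=|\';":/.,?><~·!@#¥%……&*()——+-=“:’;、。,?》《{}\n'
    # re.escape keeps '-' literal; unescaped, '+-=' is a range that also matches digits,
    # so digits are listed explicitly here to keep them out of the word list.
    # Each run is replaced with a space so that neighbouring words are not glued together.
    txt_query = re.sub(r"[%s0-9]+" % re.escape(punc), " ", txt_lower)
- Count word frequencies with Counter
    # Split on whitespace; split() without arguments also drops empty strings
    txt_list = txt_query.split()
    # Count how often each word occurs
    word = Counter(txt_list)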
- Write the information for each word into a list
    # Rows that will become the DataFrame
    pa_list = []
    # Store each word, its count, and the MD5 digest of the word
    for key, value in word.items():
        pa_list.append([key, value, hashlib.md5(key.encode('utf-8')).hexdigest()])
- Sort and export the file with pandas
    # Build the DataFrame
    pd_data = pd.DataFrame(pa_list)
    # Sort by count (column 1), then by MD5 (column 2), both ascending
    dataexclex = pd_data.sort_values([1, 2])
    # Export to CSV
    dataexclex.to_csv(f'./{new_file_name}.csv')
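- Complete code
    import hashlib
    import re
    from collections import Counter

    import pandas as pd


    def Word_frequency_statistics(file_path, new_file_name):
        # Open the file and read its contents (assumes UTF-8 text)
        with open(file_path, 'r', encoding='utf-8') as txt_file:
            txt_data = txt_file.read()
        # Convert all letters to lowercase
        txt_lower = txt_data.lower()
        # Punctuation and special symbols to strip
        punc = '~`!#$%^&*()_+-=|\';":/.,?><~·!@#¥%……&*()——+-=“:’;、。,?》《{}\n'
        # re.escape keeps '-' literal; digits are listed explicitly.
        # Each run is replaced with a space so that words are not glued together.
        txt_query = re.sub(r"[%s0-9]+" % re.escape(punc), " ", txt_lower)
        # Split on whitespace; split() without arguments also drops empty strings
        txt_list = txt_query.split()
        # Count how often each word occurs
        word = Counter(txt_list)
        # Rows that will become the DataFrame
        pa_list = []
        # Store each word, its count, and the MD5 digest of the word
        for key, value in word.items():
            pa_list.append([key, value, hashlib.md5(key.encode('utf-8')).hexdigest()])
        # Build the DataFrame
        pd_data = pd.DataFrame(pa_list)
        # Sort by count (column 1), then by MD5 (column 2), both ascending
        dataexclex = pd_data.sort_values([1, 2])
        # Export to CSV
        dataexclex.to_csv(f'./{new_file_name}.csv')


    # Path of the file to analyse
    file_path = ''
    # Name of the exported file (without the .csv extension)
    new_file_name = ''
    # Run the word frequency statistics
    Word_frequency_statistics(file_path, new_file_name)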
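For comparison, the same pipeline can be sketched without pandas, using only the standard library. This is an alternative under stated assumptions, not the author's method: the function names, the column header and the sample text are hypothetical, and words are taken to be runs of the letters a-z, so an apostrophe splits a word such as "don't" in two.

```python
import csv
import hashlib
import re
from collections import Counter


def count_words(text: str) -> Counter:
    """Count lowercase runs of ASCII letters; everything else is a separator."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def ranked_rows(counts: Counter) -> list:
    """Return (word, count, md5) rows sorted by count, then by MD5 digest."""
    rows = [
        (word, n, hashlib.md5(word.encode("utf-8")).hexdigest())
        for word, n in counts.items()
    ]
    return sorted(rows, key=lambda row: (row[1], row[2]))


def write_csv(rows: list, csv_path: str) -> None:
    """Write the rows to a CSV file with a header line."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "count", "md5"])
        writer.writerows(rows)


# Made-up sample text, for illustration only.
sample_text = "The cat saw the dog.\nThe dog ran."
rows = ranked_rows(count_words(sample_text))
for row in rows:
    print(row)

# To export, call for example: write_csv(rows, "word_counts.csv")
```

Extracting words with `re.findall` avoids maintaining a list of punctuation to delete, at the cost of ignoring anything that is not an ASCII letter.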
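4. Some issues
I have only tried wav files; mp3 files should work as well, provided that every file in the folder is one you want to process. If you run into problems while installing the library, I suggest rolling back one major version (0.9.0 => 0.8.0); in practice there is not much difference.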