Cleaning Crawled Text Data
2022-07-31 01:25:00 【浪里摸鱼】
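import re


def filter_chars(text):
    """Filter useless characters out of crawled text.

    :param text: the text to clean
    :return: the cleaned text
    """
    # Collect every character that is not Chinese, an English letter, or a digit
    add_chars = set(re.findall(r'[^\u4e00-\u9fa5a-zA-Z0-9]', text))
    # Punctuation to keep; everything else collected above is removed at the end
    extra_chars = set(r"""!!¥$%*()()-——【】::“”";;'‘’,。?,.?、""")
    add_chars = add_chars.difference(extra_chars)
    # Replace special character sequences (a tab is \t and is covered by the \s rule below)
    text = re.sub(r'\{IMG:.?.?.?\}', '', text)  # image placeholders such as {IMG:1}
    text = re.sub(r'<!--IMG_\d+-->', '', text)  # image placeholder comments
    text = re.sub(r'(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|]', '', text)  # strip URLs
    text = re.sub(r'<a[^>]*>', '', text).replace("</a>", "")  # strip a tags
    text = text.replace("</P>", "")
    text = text.replace("nbsp;", "")  # the leading & is dropped by the final clean-up loop
    text = re.sub(r'<P[^>]*>', '', text, flags=re.IGNORECASE).replace("</p>", "")  # strip p tags
    text = re.sub(r'<strong[^>]*>', ',', text).replace("</strong>", "")  # opening strong tag becomes a comma
    text = re.sub(r'<br\s*/?>', ',', text, flags=re.IGNORECASE)  # br tags (including <br/>) become a comma
    # The dot after www is escaped so that it matches a literal "."
    text = re.sub(r'www\.[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|]', '', text).replace("()", "")  # strip URLs starting with www
    text = re.sub(r'\s', '', text)  # strip whitespace (spaces, tabs, newlines)
    text = re.sub('Ⅴ', 'V', text)  # Roman numeral five (U+2164) to the ASCII letter V
    # Clean-up: remove every collected character that is not whitelisted
    for c in add_chars:
        text = text.replace(c, '')
    return text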
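
The core of filter_chars is a whitelist: collect every character that is not Chinese, an ASCII letter, or a digit, spare a fixed set of punctuation, and delete the rest. The sketch below isolates just that step so it can be tried on its own. The names strip_unlisted_symbols and KEEP_PUNCT and the shortened punctuation set are assumptions made for illustration and do not appear in the original post. It also uses str.translate in place of the original loop of replace calls, which deletes all unwanted characters in a single pass.

```python
import re

# Hypothetical, shortened punctuation whitelist (the original post keeps a longer set).
KEEP_PUNCT = set("!?,.:;,。、")


def strip_unlisted_symbols(text, keep=KEEP_PUNCT):
    """Remove characters that are not CJK, ASCII alphanumeric, or listed in `keep`."""
    # Every character outside \u4e00-\u9fa5, a-z, A-Z and 0-9 is a removal candidate.
    candidates = set(re.findall(r'[^\u4e00-\u9fa5a-zA-Z0-9]', text))
    # Whitelisted punctuation is spared.
    to_remove = candidates - set(keep)
    if not to_remove:
        return text
    # Mapping a code point to None in a translation table deletes that character.
    return text.translate({ord(c): None for c in to_remove})


sample = "今日★头条: Python3 爬虫~入门!"
cleaned = strip_unlisted_symbols(sample)
print(cleaned)  # 今日头条:Python3爬虫入门!
```

Because the removal set is derived from the input itself, whitespace and any symbol outside the whitelist disappear without having to be listed one by one; only the punctuation to keep has to be maintained.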