Crawler text data cleaning
2022-07-31 01:38:00 【In the sea fishing】
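import re


def filter_chars(text):
    """Filter out useless characters.

    :param text: the text to clean
    """
    # Collect every character that is not Chinese, an English letter, or a digit
    add_chars = set(re.findall(r'[^\u4e00-\u9fa5a-zA-Z0-9]', text))
    # Punctuation to keep (duplicates are harmless inside a set)
    extra_chars = set(r"""!!¥$%*()()-——【】::“”";;'‘’,.?,.?、""")
    add_chars = add_chars.difference(extra_chars)
    # Note: a tab is \t; it is removed by the \s substitution further down
    # Replace special character combinations
    text = re.sub(r'\{IMG:.?.?.?\}', '', text)  # image placeholders
    text = re.sub(r'<!--IMG_\d+-->', '', text)  # image comment markers
    text = re.sub(r'(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|]', '', text)  # filter URLs
    text = re.sub(r'<a\b[^>]*>', '', text).replace("</a>", "")  # filter <a> tags
    text = re.sub(r'&?nbsp;', '', text)  # filter non-breaking space entities, including the leading "&"
    # Filter <p> tags, opening and closing, in either case; \b keeps <pre> and similar tags from matching
    text = re.sub(r'</?p\b[^>]*>', '', text, flags=re.IGNORECASE)
    text = re.sub(r'<strong[^>]*>', ',', text).replace("</strong>", "")  # replace <strong> with a comma
    text = re.sub(r'<br\s*/?>', ',', text)  # replace <br> and <br/> with a comma
    # Filter URLs that start with www (the dot is escaped so it matches literally)
    text = re.sub(r'www\.[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|]', '', text).replace("()", "")
    text = re.sub(r'\s', '', text)  # filter whitespace and other invisible characters
    text = re.sub('Ⅴ', 'V', text)  # normalize the Roman numeral Ⅴ to the letter V
    # Cleaning: drop every remaining unwanted character
    for c in add_chars:
        text = text.replace(c, '')
    return text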
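The core of the function above is a whitelist: any character that is not Chinese, an ASCII letter, a digit, or an explicitly kept punctuation mark gets dropped. The following is a minimal sketch of just that step, not the author's implementation. The names `KEEP_PUNCTUATION` and `strip_unlisted_chars` are hypothetical, the keep-list is a shortened assumption, and `str.translate` is swapped in for the `replace` loop so that all removals happen in a single pass.

```python
import re

# Hypothetical, shortened keep-list for illustration (an assumption, not the post's list).
KEEP_PUNCTUATION = set("!?,.;:()")


def strip_unlisted_chars(text, keep=KEEP_PUNCTUATION):
    """Drop every character that is not Chinese, ASCII alphanumeric, or in `keep`."""
    # Same character class as in filter_chars: anything outside it is a removal candidate.
    unwanted = set(re.findall(r'[^\u4e00-\u9fa5a-zA-Z0-9]', text)) - set(keep)
    # Mapping a code point to None tells str.translate to delete that character.
    return text.translate({ord(c): None for c in unwanted})


sample = "Price: 100元★(tax incl.)~OK!"
print(strip_unlisted_chars(sample))  # prints: Price:100元(taxincl.)OK!
```

Spaces, `★`, and `~` are removed because they are neither in the character class nor in the keep-list, while the colon, parentheses, period, and exclamation mark survive.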