A framework for cleaning Chinese dialog data


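This project is a multithreaded framework for cleaning dialog data, targeting sources such as Zhihu, Weibo, and Tieba. It is still fairly rudimentary; bug reports and optimizations are welcome, for example a better regex or a suffix-based algorithm for the function that reduces repeated phrases within a sentence. The code is still being improved, and comments and attributions for some functions have yet to be completed.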


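--scripts: run scripts
  ---run.sh: runs run_dist.py with a few rules I selected
--src: main directory of the cleaning framework
  ---inputters: dataloader and utility functions for reading and writing data
  ---rules: rule functions at each level
  ---single_filter.py: main program for a single thread, called by run_dist.py; loads and processes a single piece of data, then saves the filtered data and the dirty data
--tool_data: blacklist dictionaries, one word per line
--run_dist.py: main entry point; loads the dataloader and the blacklists, and creates the thread pool
--utils: data statistics and result checking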
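
To illustrate how a main script can hand batches to per-worker filters as described in the layout above, here is a minimal sketch. It is not the code in run_dist.py or single_filter.py: the names `chunked`, `filter_batch`, and `run_pool` are hypothetical, the only rule applied is a substring blacklist check, and a thread pool is assumed even though the real framework may use processes.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def filter_batch(batch, blacklist):
    """Hypothetical worker: split one batch into kept and dirty sessions."""
    clean, dirty = [], []
    for session in batch:
        # A session is a list of utterances; one blacklisted word rejects it.
        if any(word in utter for utter in session for word in blacklist):
            dirty.append(session)
        else:
            clean.append(session)
    return clean, dirty


def run_pool(sessions, blacklist, n_p=2, batch_size=2):
    """Fan batches out to `n_p` workers and merge results in input order."""
    clean, dirty = [], []
    with ThreadPoolExecutor(max_workers=n_p) as pool:
        batches = chunked(sessions, batch_size)
        # Executor.map returns results in the order the batches were submitted.
        for kept, rejected in pool.map(lambda b: filter_batch(b, blacklist), batches):
            clean.extend(kept)
            dirty.extend(rejected)
    return clean, dirty
```

Keeping the rejected sessions in a separate list mirrors the idea of saving dirty data next to the filtered output, so that the effect of each rule can be inspected afterwards.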


bash ./scripts/run.sh 2>&1 | tee -a cleaning.log



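1 Blacklist filtering, including special characters and profanity
2 Emoji
3 Privacy filtering for emails, phone numbers, etc.; person names are replaced with NAME1, NAME2, ...
4 URL filtering
5 Unicode-related fixes
6 Deduplication: reducing repeated words, filtering out sentences identical to their context, and removing duplicate dialogs
7 Removing ads and generic responses, as done in Meena and DialoGPT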
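
As a rough illustration of items 3 and 4, the sketch below masks emails and phone numbers and strips URLs using only the standard `re` module. It is an assumption-laden example, not the framework's implementation (which, per the parameter list, relies on clean-text for this): the patterns are deliberately simplified, the phone pattern only covers 11-digit mainland China mobile numbers, and the `<EMAIL>` and `<PHONE>` placeholders are hypothetical.

```python
import re

# Simplified patterns for illustration only; real-world data needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")  # 11-digit mobile numbers
URL_RE = re.compile(r"https?://[^\s\u4e00-\u9fff]+")  # stop at whitespace or Chinese text


def mask_private(text):
    """Drop URLs and replace emails and phone numbers with placeholders."""
    text = URL_RE.sub("", text)
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    return text.strip()
```

URLs are removed before emails are masked so that an address embedded in a link is not turned into a placeholder inside a half-deleted URL.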


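NOTE THAT:
1. When changing a rule, check whether it affects other rules; the rules must be applied in a specific order.
2. Blacklists such as person names and special topics can be placed under ./tool_data/ as needed. File names are configurable; see the dataloader in ./run_dist.py. Blacklists can be found on GitHub, for example https://github.com/fighting41love/funNLP
3. Test samples will be given above each function, and expected outputs below it.
4. The parameters currently used in run.sh are the features I am using myself.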
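
The one-word-per-line blacklist format can be read with a few lines of Python. The sketch below is only an illustration of that format: `load_word_list`, `load_tool_data`, and `hits_blacklist` are hypothetical names, the key-to-file mapping is an assumption, and the actual loading logic lives in the dataloader in ./run_dist.py.

```python
import os


def load_word_list(path):
    """Read one word per line, ignoring blank lines and surrounding whitespace."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def load_tool_data(tool_dir, file_names):
    """Load several word lists; a missing file yields an empty set.

    `file_names` maps a hypothetical key (e.g. "person_name") to a file name
    inside `tool_dir`.
    """
    lists = {}
    for key, file_name in file_names.items():
        path = os.path.join(tool_dir, file_name)
        lists[key] = load_word_list(path) if os.path.exists(path) else set()
    return lists


def hits_blacklist(session, words):
    """Return True if any utterance in the session contains a listed word."""
    return any(word in utter for utter in session for word in words)
```

Treating a missing file as an empty list lets a rule stay enabled without failing when its optional blacklist has not been configured.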


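| Parameter | Description |
| :--------------- | :------------------- |
| n_p | Number of processes |
| batch_size | Maximum number of sessions handled by a single process |
| tool_dir | Directory containing tool data (e.g. blacklists) |
| out_dir | Output directory for the cleaned files |
| raw_dir | Directory containing the files to process |
| dirty_dir | Where the dirty data removed during cleaning is stored; nothing is stored if empty |
| :--------------- | :------------------- |
| split_multi_repost | Split Weibo repost data of the form "//@aaa XXXX //@bbb XXX" into multiple utterances |
| no_utter_dup | Drop the dialog if context == response |
| re_name | Replace person names with NAME1, NAME2, ... |
| no_ad | Remove dialogs that may be ads (the same response paired with many contexts); borrowed from a paper |
| de_generic_dialog | Remove generic responses; borrowed from a paper |
| no_short_response | Remove all overly short responses at the end of a dialog |
| :--------------- | :------------------- |
| bert_clean | Clean sentences with functions from BertTokenizer |
| cleantext_clean | Clean with clean-text (phone numbers, emails, unicode errors, etc.) |
| :--------------- | :------------------- |
| no_short | Remove overly short sentences |
| no_long | Remove overly long sentences |
| de_reply_tag | Remove the Weibo reply tag "回复 @XXX:" |
| de_hashtag | Remove "# XXX#" within a sentence |
| de_emotion | Remove ": XXX:" within a sentence |
| de_mention | Remove "@Cindy", "@Bob:", "@Amy:", etc. within a sentence |
| no_mention | Remove sentences containing @XXX |
| de_repost | Remove "//XXX" within a sentence |
| de_duplicated | Reduce repeated phrases within a sentence (to be optimized with a suffix algorithm) |
| de_emoji | Remove emoji (to be completed) |
| no_special_topic | Filter dialogs containing words from the special-topic list |
| no_str_blacklist | Filter dialogs containing blacklisted words |
| no_toupiao | Check whether the text is a Weibo poll |
| no_specific_utter | Delete certain specific sentences |
| contain_zh | Delete sentences that contain no Chinese |
| de_single_repost_mention | Remove "@XXX:" |
| de_weibo_url | Remove "http:\t.c" |
| de_url | Remove URLs |
| de_angle | Remove `<XX>`, where XX is non-Chinese |
| de_alpha_num | Remove long meaningless strings of digits and letters |
| de_specific | Remove fixed patterns within a sentence |
| :--------------- | :------------------- |
| de_showall | Remove "...显示全部" ("show all") in certain specific files |
| de_brackets | Remove "[XXX]" in certain specific files |
| :--------------- | :------------------- |
| no_word_blacklist | Filter dialogs containing blacklisted words after word segmentation |
| no_alpha_noise | Filter sentences containing letter combinations that do not form English words |
| check_confuse_word | Save dialogs containing words from the confusion list for recall |
| yda_dedupl | Discard a sentence if a single word's share of the sentence exceeds a threshold |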
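
To make a few of the utterance-level rules concrete, here is a minimal sketch of how rules such as de_reply_tag, de_hashtag, de_mention, contain_zh, and a yda_dedupl-style ratio check could look. The regular expressions, the helper names `too_repetitive` and `clean_utterance`, and the 0.5 threshold are assumptions for illustration and are not taken from the code in src/rules.

```python
import re
from collections import Counter

# Assumed patterns; the framework's own regexes may differ.
REPLY_TAG_RE = re.compile(r"^回复\s*@[^::]+[::]\s*")
HASHTAG_RE = re.compile(r"#[^#]+#")
MENTION_RE = re.compile(r"@[^\s::@]+[::]?")
ZH_RE = re.compile(r"[\u4e00-\u9fff]")


def de_reply_tag(utter):
    """Strip a leading Weibo reply tag such as '回复 @XXX:'."""
    return REPLY_TAG_RE.sub("", utter)


def de_hashtag(utter):
    """Strip '#XXX#' topic tags."""
    return HASHTAG_RE.sub("", utter)


def de_mention(utter):
    """Strip '@name' mentions, with an optional trailing colon."""
    return MENTION_RE.sub("", utter)


def contain_zh(utter):
    """Return True if the utterance has at least one Chinese character."""
    return bool(ZH_RE.search(utter))


def too_repetitive(tokens, threshold=0.5):
    """Return True if the most frequent token's share exceeds `threshold`."""
    if not tokens:
        return False
    top_count = Counter(tokens).most_common(1)[0][1]
    return top_count / len(tokens) > threshold


def clean_utterance(utter):
    """Apply the rewrite rules in a fixed order, then drop non-Chinese results."""
    for rule in (de_reply_tag, de_hashtag, de_mention):
        utter = rule(utter)
    utter = utter.strip()
    return utter if contain_zh(utter) else None
```

The order inside `clean_utterance` matters: if `de_mention` ran before `de_reply_tag`, the mention would be stripped first and a stray "回复" would be left at the start of the sentence.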