The Toxicity Dataset
by Surge AI
Saving the internet is fun. Combing through thousands of online comments to build a toxicity dataset isn't. That's why we're creating the world's largest dataset of social media toxicity — so you can skip the slog and get to work.
We hope you find this dataset useful, whether you want to flag hateful speech, develop content moderation tools, or build classifiers to detect toxic messages.
Need a larger dataset of toxicity to train your ML models, or toxicity in other languages (Spanish, French, German, Japanese, Portuguese, and 17+ more)? We work with top AI and Safety companies around the world. Reach out to [email protected]!
Dataset
This repo contains 500 toxic and 500 non-toxic comments from a variety of popular social media platforms. Click on toxicity_en.csv to see a spreadsheet of 1000 English examples. Rather than operating under a strict definition of toxicity, we asked our team to identify comments that they personally found toxic.
Columns
text: the text of the comment
is_toxic: whether or not the comment is toxic
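As a quick sanity check on the two columns, here is a minimal sketch that reads the file and counts how many rows carry each label. It assumes toxicity_en.csv sits in the current directory, is UTF-8 encoded, and has a header row with the column names listed above. The README does not say how is_toxic values are spelled, so the code simply tallies whatever values it finds. The helper name label_counts is hypothetical and not part of this repo.

```python
import csv
from collections import Counter


def label_counts(path):
    """Count how many rows carry each distinct value of the is_toxic column."""
    # Assumption: the file is UTF-8 and its first row is a header
    # containing the `text` and `is_toxic` column names.
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        return Counter(row["is_toxic"] for row in reader)


if __name__ == "__main__":
    # Assumption: toxicity_en.csv is in the current working directory.
    for label, count in label_counts("toxicity_en.csv").items():
        print(f"{label}: {count}")
```

If the file matches the description above, the two counts should come out to 500 each.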
Future
We'll be adding more languages and annotations over time (e.g., a severity ranking for each comment and category labels).
If you're also interested in a dataset of profanity, check out our obscenity list.
Follow us on Twitter at @HelloSurgeAI.