[LREC] MMChat: Multi-Modal Chat Dataset on Social Media

Overview

MMChat

This repo contains the code and data for the LREC 2022 paper MMChat: Multi-Modal Chat Dataset on Social Media.

Dataset

MMChat is a large-scale dialogue dataset that contains image-grounded dialogues in Chinese. Each dialogue in MMChat is associated with one or more images (maximum 9 images per dialogue). We design various strategies to ensure the quality of the dialogues in MMChat. Please read our paper for more details. The images in the dataset are hosted on Weibo's static image server. You can refer to the scripts provided in data_processing/weibo_image_crawler to download these images.
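As a rough illustration of the download step, the sketch below saves a list of image URLs to a local folder. It is not the crawler shipped in data_processing/weibo_image_crawler; the function names `fetch_url` and `download_images`, and the assumption that each image is addressed by a plain URL whose last path component can serve as a file name, are ours.

```python
import os
import urllib.request
from urllib.parse import urlparse


def fetch_url(url, timeout=10):
    """Return the raw bytes served at `url` (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_images(urls, out_dir, fetch=fetch_url):
    """Save each URL under out_dir; return (saved_paths, failed_urls).

    Assumption: the last path component of the URL is a usable file name.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved, failed = [], []
    for url in urls:
        name = os.path.basename(urlparse(url).path)
        if not name:
            # No file name can be derived from this URL.
            failed.append(url)
            continue
        path = os.path.join(out_dir, name)
        if os.path.exists(path):
            # Already downloaded in a previous run; skip the request.
            saved.append(path)
            continue
        try:
            data = fetch(url)
        except OSError:
            # urllib's URLError and HTTPError are subclasses of OSError.
            failed.append(url)
            continue
        with open(path, "wb") as f:
            f.write(data)
        saved.append(path)
    return saved, failed
```

Passing the fetch function as a parameter keeps the loop testable offline and makes it easy to swap in a client with retries or rate limiting.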

Two sample dialogues from MMChat are given below (translated from Chinese):

[Image: A sample dialogue from MMChat]

MMChat is released in different versions:

Rule Filtered Raw MMChat

This version of MMChat contains raw dialogues filtered by our rules. The following table shows some basic statistics:

| Item Description | Count |
| --- | --- |
| Sessions | 4.257 M |
| Sessions with more than 4 utterances | 2.304 M |
| Utterances | 18.590 M |
| Images | 4.874 M |
| Avg. utterance per session | 4.367 |
| Avg. image per session | 1.670 |
| Avg. character per utterance | 14.104 |
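The statistics in the table can be recomputed from the dialogues with a few lines of Python. The sketch below is only an illustration: it assumes each session is a dict with an `utterances` list of strings and an `images` list of image identifiers, which may not match the released file format. It also assumes that "Images" counts distinct images while "Avg. image per session" averages the per-session counts, since the reported image total is smaller than sessions times the per-session average.

```python
def dialogue_stats(sessions):
    """Compute basic statistics for a non-empty list of sessions.

    Assumed (hypothetical) session format:
        {"utterances": [str, ...], "images": [str, ...]}
    """
    n_sessions = len(sessions)
    utterances = [u for s in sessions for u in s["utterances"]]
    # Distinct images across the whole dataset (assumption, see lead-in).
    unique_images = {img for s in sessions for img in s["images"]}
    images_per_session = sum(len(s["images"]) for s in sessions)
    return {
        "sessions": n_sessions,
        "sessions_gt4": sum(len(s["utterances"]) > 4 for s in sessions),
        "utterances": len(utterances),
        "images": len(unique_images),
        "avg_utterances_per_session": len(utterances) / n_sessions,
        "avg_images_per_session": images_per_session / n_sessions,
        "avg_chars_per_utterance": sum(len(u) for u in utterances) / len(utterances),
    }
```

Character counts use `len` on Python strings, so each Chinese character counts as one.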

We divide the above dialogues into 9 splits to make them easier to download:

  1. Split0 Google Drive, Baidu Netdisk
  2. Split1 Google Drive, Baidu Netdisk
  3. Split2 Google Drive, Baidu Netdisk
  4. Split3 Google Drive, Baidu Netdisk
  5. Split4 Google Drive, Baidu Netdisk
  6. Split5 Google Drive, Baidu Netdisk
  7. Split6 Google Drive, Baidu Netdisk
  8. Split7 Google Drive, Baidu Netdisk
  9. Split8 Google Drive, Baidu Netdisk

LCCC Filtered MMChat

This version of MMChat contains the dialogues that are filtered based on the LCCC (Large-scale Cleaned Chinese Conversation) dataset. Specifically, some dialogues in MMChat are also contained in LCCC. We regard these dialogues as cleaner, since LCCC applies sophisticated schemes to filter out noise. This version of MMChat is obtained using the script data_processing/LCCC_filter.py. The following table shows some basic statistics:

| Item Description | Count |
| --- | --- |
| Sessions | 492.6 K |
| Sessions with more than 4 utterances | 208.8 K |
| Utterances | 1.986 M |
| Images | 1.066 M |
| Avg. utterance per session | 4.031 |
| Avg. image per session | 2.514 |
| Avg. character per utterance | 11.336 |
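The idea behind the LCCC filter, keeping only those MMChat dialogues that also appear in LCCC, can be sketched as a set-membership test. This is not the logic of data_processing/LCCC_filter.py, which may match dialogues differently. The function names, the session format, and the choice to compare utterances after removing all whitespace (on the assumption that one corpus may be word-segmented with spaces) are ours.

```python
def normalize(utterances):
    """Build a hashable key for a dialogue by stripping all whitespace
    from each utterance (assumption: whitespace is not significant)."""
    return tuple("".join(u.split()) for u in utterances)


def lccc_filter(mmchat_sessions, lccc_dialogues):
    """Keep the MMChat sessions whose full utterance sequence occurs in LCCC.

    mmchat_sessions: list of {"utterances": [str, ...], ...}  (hypothetical format)
    lccc_dialogues:  list of [str, ...]                       (hypothetical format)
    """
    lccc_keys = {normalize(d) for d in lccc_dialogues}
    return [s for s in mmchat_sessions if normalize(s["utterances"]) in lccc_keys]
```

Building the set of LCCC keys once makes each lookup constant time on average, which matters at the scale of millions of sessions.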

We divide the above dialogues into 9 splits to make them easier to download:

  1. Split0 Google Drive, Baidu Netdisk
  2. Split1 Google Drive, Baidu Netdisk
  3. Split2 Google Drive, Baidu Netdisk
  4. Split3 Google Drive, Baidu Netdisk
  5. Split4 Google Drive, Baidu Netdisk
  6. Split5 Google Drive, Baidu Netdisk
  7. Split6 Google Drive, Baidu Netdisk
  8. Split7 Google Drive, Baidu Netdisk
  9. Split8 Google Drive, Baidu Netdisk

MMChat

The MMChat dataset reported in our paper is given here. The Weibo content corresponding to each of these dialogues is "分享图片" (i.e., "Share Images" in English). The following table shows some basic statistics:

| Item Description | Count |
| --- | --- |
| Sessions | 120.84 K |
| Sessions with more than 4 utterances | 17.32 K |
| Utterances | 314.13 K |
| Images | 198.82 K |
| Avg. utterance per session | 2.599 |
| Avg. image per session | 2.791 |
| Avg. character per utterance | 8.521 |

The above dialogues can be downloaded from either Google Drive or Baidu Netdisk.

MMChat-hf

We perform human annotation on sampled dialogues to determine whether the given images are related to the corresponding dialogues. The following table shows the statistics only for dialogues that are annotated as image-related.

| Item Description | Count |
| --- | --- |
| Sessions | 19.90 K |
| Sessions with more than 4 utterances | 8.91 K |
| Utterances | 81.06 K |
| Images | 52.66 K |
| Avg. utterance per session | 4.07 |
| Avg. image per session | 2.70 |
| Avg. character per utterance | 11.93 |

We annotated about 100K dialogues. All the annotated dialogues can be downloaded from either Google Drive or Baidu Netdisk.
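Because the download contains all annotated dialogues, selecting the image-related subset is a simple filter over the annotation label. The sketch below assumes each annotated dialogue is a dict carrying a boolean field named `image_related`. That field name is hypothetical, so check the released files for the actual label format.

```python
def split_by_annotation(annotated):
    """Split annotated dialogues into (image_related, not_image_related).

    Assumption: each item has a boolean "image_related" field
    (hypothetical name, not taken from the released data).
    """
    related = [d for d in annotated if d["image_related"]]
    unrelated = [d for d in annotated if not d["image_related"]]
    return related, unrelated
```

The first returned list corresponds to the subset summarized in the table above.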

Code

We are also releasing all the code used in our experiments. You can use the script run_training.sh in each folder to launch distributed training.

For models that require image features, you can extract the image features using the scripts in data_processing/extract_image_features.

The model shown in our paper can be found in dialog_image.

Reference

Please cite our paper if you find our work useful ;)

@inproceedings{zheng2022MMChat,
  author    = {Zheng, Yinhe and Chen, Guanyi and Liu, Xin and Sun, Jian},
  title     = {MMChat: Multi-Modal Chat Dataset on Social Media},
  booktitle = {Proceedings of The 13th Language Resources and Evaluation Conference},
  year      = {2022},
  publisher = {European Language Resources Association},
}
@inproceedings{wang2020chinese,
  title     = {A Large-Scale Chinese Short-Text Conversation Dataset},
  author    = {Wang, Yida and Ke, Pei and Zheng, Yinhe and Huang, Kaili and Jiang, Yong and Zhu, Xiaoyan and Huang, Minlie},
  booktitle = {NLPCC},
  year      = {2020},
  url       = {https://arxiv.org/abs/2008.03946}
}