Contains links to publicly available datasets for modeling health outcomes using speech and language.

Overview

speech-nlp-datasets

Speech-based Corpora

TalkBank Project

  • [Corpus] CHILDES Database
    Contains speech of children with different conditions (e.g. autism, Down syndrome, hearing impairment) across different languages (e.g. English, Dutch, Greek, Mandarin).
    MacWhinney, B. (2014). The CHILDES project: Tools for analyzing talk, Volume II: The database. Psychology Press.

  • [Corpus] DementiaBank (from TalkBank)
    Contains recordings of individuals with dementia across different languages, covering around 400 subjects. The most notable corpus, both for its size and for including control subjects, is:

    • English Pitt: Longitudinal neuropsychological assessments of 319 subjects (dementia + control) performing the Cookie Theft, Word Fluency, Story Recall, and Sentence Construction tasks (Becker et al., 1994).

  • [Corpus] Clinical TalkBank
    In addition to DementiaBank, TalkBank contains:

    • RHDBank: individuals with right-hemisphere disorder.
    • TBIBank: individuals with traumatic brain injury.
    • AphasiaBank: individuals with aphasia, a communication disorder that affects the ability to speak, write, and understand language, caused by damage to the language areas of the brain.
    • FluencyBank: individuals with disfluencies, either because they are second-language learners or because they stutter.

Text-based Corpora

  • [Corpus] Reddit Self-reported Depression Diagnosis (RSDD) dataset
    Contains Reddit posts from ~9,000 users who self-reported a depression diagnosis and ~107,000 control users (Yates et al., 2017).

  • [Corpus] MIMIC-III (Medical Information Mart for Intensive Care)
    Contains medical details and outcomes for 40,000+ patients (e.g. demographics, vital signs, laboratory tests, medications) as well as 2M+ free-text notes written by medical personnel (e.g. physicians, nurses) (Johnson et al., 2016).

  • i2b2/UTHealth NLP Task (contact authors for corpus?)
    Contains emergency medical records for 296 patients at Partners HealthCare, along with medical discharge and correspondence notes between medical personnel. Kumar et al. (2014) describe how the data was processed, and Stubbs et al. (2014) describe the 2014 task of identifying risk factors for heart disease over time.

  • Nun Study (contact authors for corpus?)
    Diaries of 93 nuns, used to evaluate cognitive impairment (Alzheimer's disease) in later life. Also contains neuropsychological tests and autopsy information (Snowdon et al., 1996).

Owner
Tuka Alhanai
Building technology to improve quality of life.