The aim of this task is to predict someone's English proficiency based on a text input.

Last update: Dec 13, 2021

Overview

English_proficiency_prediction_NLP

This task uses the NICT JLE Corpus, available here: https://alaginrc.nict.go.jp/nict_jle/index_E.html

The corpus consists of transcripts of audio-recorded speech samples from 1,281 participants (1.2 million words, 300 hours in total) taking an English oral proficiency interview test. Each participant received an SST (Standard Speaking Test) score between 1 (low proficiency) and 9 (high proficiency) based on this test.

The goal is to build a machine learning algorithm for predicting the SST score of each participant based on their transcript.

Steps:

1 - Pre-process the dataset: extract the participant transcript (all tags). Inside the participant transcript, remove all other tags and keep only the English words.

2 - Process the dataset: extract features with the Bag of Words (BoW) technique.

3 - Train a classifier to predict the SST score

4 - Compute the accuracy of your system (the proportion of participants classified correctly) and plot the confusion matrix.

5 - Try to improve your system (for example, you can try using GloVe embeddings instead of BoW).
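Steps 1 to 4 can be sketched in plain Python as follows. This is only a minimal illustration under stated assumptions, not the repository's actual code: the helper names (`preprocess`, `bag_of_words`, `NaiveBayes`, `accuracy`, `confusion_matrix`) are hypothetical, the toy transcripts and their tags are invented placeholders rather than real NICT JLE markup, and multinomial Naive Bayes is just one possible choice of classifier. Plotting the confusion matrix is left out.

```python
import math
import re
from collections import Counter, defaultdict

TAG_RE = re.compile(r"<[^>]+>")              # any markup tag, e.g. <x> or </x>
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")  # lowercase words, optional apostrophe part


def preprocess(transcript):
    """Step 1: drop markup tags and return lowercase English word tokens."""
    text = TAG_RE.sub(" ", transcript)
    return WORD_RE.findall(text.lower())


def bag_of_words(tokens):
    """Step 2: represent a transcript as word -> count."""
    return Counter(tokens)


class NaiveBayes:
    """Step 3: multinomial Naive Bayes with add-one smoothing (an assumed choice)."""

    def fit(self, bags, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for bag, label in zip(bags, labels):
            self.word_counts[label].update(bag)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, bag):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior of the class
            for word, freq in bag.items():
                if word in self.vocab:   # words unseen in training are ignored
                    score += freq * math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(y_true, y_pred):
    """Step 4: proportion of participants classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_matrix(y_true, y_pred, labels):
    """Step 4: rows are true SST scores, columns are predicted scores."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


if __name__ == "__main__":
    # Invented toy data; the tags are placeholders, not real corpus markup.
    train = [("<x>er</x> I go school", 2), ("I would have gone to school", 8)]
    bags = [bag_of_words(preprocess(text)) for text, _ in train]
    scores = [score for _, score in train]
    model = NaiveBayes().fit(bags, scores)
    preds = [model.predict(bag) for bag in bags]
    print(accuracy(scores, preds))
    print(confusion_matrix(scores, preds, labels=[2, 8]))
```

Note that the tag regex removes only the markup itself and keeps the text between tags, so spans that should be discarded entirely would need extra handling. In practice, library components such as scikit-learn's `CountVectorizer`, `MultinomialNB`, and `confusion_matrix` could likely replace the hand-written parts.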
