Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].

Last update: Dec 30, 2022


Overview

PLBART

Code pre-release of our work, Unified Pre-training for Program Understanding and Generation, accepted at NAACL 2021.

Note: Detailed documentation is coming soon.

Pre-training data

PLBART is pre-trained on Java and Python functions and on natural language descriptions collected from GitHub and Stack Overflow.

Evaluation tasks

We evaluated PLBART on five tasks.

Code summarization [REF]
Code generation [REF]
Code translation [REF]
Clone detection [REF]
Vulnerability detection [REF]

Notes

We will publish the pre-trained PLBART checkpoint soon.
We list all the files in this repository here.

Acknowledgement

PLBART uses Fairseq, CodeXGLUE, and TransCoder. We thank the authors of these works for their contributions.

Citation

@inproceedings{ahmad2020summarization,
    author = {Ahmad, Wasi Uddin and Chakraborty, Saikat and Ray, Baishakhi and Chang, Kai-Wei},
    booktitle = {Proceedings of the 2021 Conference of the North {A}merican Chapter of the Association for Computational Linguistics},
    title = {Unified Pre-training for Program Understanding and Generation},
    year = {2021}
}


Owner

Wasi Ahmad
