Official code of our work, AVATAR: A Parallel Corpus for Java-Python Program Translation.

Related tags

Deep LearningAVATAR
Overview

AVATAR

  • Official code of our work, AVATAR: A Parallel Corpus for Java-Python Program Translation.
  • AVATAR stands for jAVA-pyThon progrAm tRanslation.
  • AVATAR is a corpus of 8,475 programming problems and their solutions written in Java and Python.
  • Supervised fine-tuning and evaluation in terms of Computational Accuracy, see details here.

Table of Contents

Dataset

We have collected the programming problems and their solutions from competitive programming sites, online platforms, and open source repositories. We list the sources below.

  • CodeForces
  • AtCoder
  • CodeJam
  • GeeksforGeeks
  • LeetCode
  • ProjectEuler

Data collected can be downloaded by following:

cd data
bash download.sh

To prepare the data, we perform the following steps.

  • Removing docstrings, comments, etc.
  • Use baseline models' tokenizer to perform tokenization.
  • Filter data based on length threshold (~512).
  • Perform de-duplication. (remove examples that are duplicates)

To perform the preparation, run:

cd data
bash prepare.sh

Models

We studied 8 models for program translation.

Models trained from scratch

Pre-trained models

Training & Evaluation

To train and evaluate a model, go to the corresponding model directory and execute the run.sh script.

# Seq2Seq+Attn.
cd seq2seq
bash rnn.sh GPU_ID LANG1 LANG2

# Transformer
cd seq2seq
bash transformer.sh GPU_ID LANG1 LANG2

# CodeGPT
cd codegpt
bash run.sh GPU_ID LANG1 LANG2 CodeGPT

# CodeGPT-adapted
cd codegpt
bash run.sh GPU_ID LANG1 LANG2

# CodeBERT
cd codebert
bash run.sh GPU_ID LANG1 LANG2

# GraphCoderBERT
cd graphcodebert
bash run.sh GPU_ID LANG1 LANG2

# PLBART
cd plbart
# fine-tuning either for Java->Python or Python-Java
bash run.sh GPU_ID LANG1 LANG2
# multilingual fine-tuning
bash multilingual.sh GPU_ID

# Naive Copy
cd naivecopy
bash run.sh
  • Here, LANG1 LANG2=Java Python or LANG1 LANG2=Python Java.
  • Download pre-trained PLBART, GraphCodeBERT, and Transcoder model files by running download.sh script.
  • We trained the models on GeForce RTX 2080 ti GPUs (11019MiB).

Benchmarks

We evaluate the models' performances on the test set in terms of Compilation Accuracy (CA), BLEU, Syntax Match (SM), Dataflow Match (DM), CodeBLEU (CB), Exact Match (EM). We report the model performances below.

Training Models Java to Python Python to Java
CA BLEU SM DM CB EM CA BLEU SM DM CB EM
None Naive Copy - 23.4 - - - 0.0 - 26.9 - - - 0.0
TransCoder 76.9 36.8 31.0 17.1 29.1 0.1 100 49.4 37.6 18.5 31.9 0.0
TC-DOBF 77.7 43.4 29.7 33.9 34.8 0.0 100 46.1 36.0 12.6 28.8 0.0
From Scratch Seq2Seq+Attn. 66.5 56.3 39.1 18.4 37.9 1.0 71.8 62.7 46.6 28.5 43.0 0.8
Transformer 61.5 38.9 34.2 16.5 29.1 0.0 67.4 45.6 45.7 26.4 37.4 0.1
Pre-trained CodeGPT 47.3 38.2 32.5 11.5 26.1 1.1 71.2 44.0 38.8 26.7 33.8 0.1
CodeGPT-adapted 48.1 38.2 32.5 12.1 26.2 1.2 68.6 42.4 37.2 27.2 33.1 0.5
CodeBERT 62.3 59.3 37.7 16.2 36.7 0.5 74.7 55.3 38.4 22.5 36.1 0.6
GraphCodeBERT 65.7 59.7 38.9 16.4 37.1 0.7 57.2 60.6 48.4 20.6 40.1 0.4
PLBARTmono 76.4 67.1 42.6 19.3 43.3 2.4 34.4 69.1 57.1 34.0 51.4 1.2
PLBARTmulti 70.4 67.1 42.0 17.6 42.4 2.4 30.8 69.4 56.6 34.5 51.8 1.0

License

This dataset is licensed under a Creative Commons Attribution-ShareAlike 4.0 International license, see the LICENSE file for details.

Citation

@article{ahmad-etal-2021-avatar,
  title={AVATAR: A Parallel Corpus for Java-Python Program Translation},
  author={Ahmad, Wasi Uddin and Tushar, Md Golam Rahman and Chakraborty, Saikat and Chang, Kai-Wei},
  journal={arXiv preprint arXiv:2108.11590},
  year={2021}
}
Owner
Wasi Ahmad
I am a Ph.D. student in CS at UCLA.
Wasi Ahmad
DL course co-developed by YSDA, HSE and Skoltech

Deep learning course This repo supplements Deep Learning course taught at YSDA and HSE @fall'21. For previous iteration visit the spring21 branch. Lec

Yandex School of Data Analysis 1.3k Dec 30, 2022
CrossNorm and SelfNorm for Generalization under Distribution Shifts (ICCV 2021)

CrossNorm (CN) and SelfNorm (SN) (Accepted at ICCV 2021) This is the official PyTorch implementation of our CNSN paper, in which we propose CrossNorm

100 Dec 28, 2022
Learning to Map Large-scale Sparse Graphs on Memristive Crossbar

Release of AutoGMap:Learning to Map Large-scale Sparse Graphs on Memristive Crossbar For reproduction of our searched model, the Ubuntu OS is recommen

2 Aug 23, 2022
A pre-trained model with multi-exit transformer architecture.

ElasticBERT This repository contains finetuning code and checkpoints for ElasticBERT. Towards Efficient NLP: A Standard Evaluation and A Strong Baseli

fastNLP 48 Dec 14, 2022
Log4j JNDI inj. vuln scanner

Log-4-JAM - Log 4 Just Another Mess Log4j JNDI inj. vuln scanner Requirements pip3 install requests_toolbelt Usage # make sure target list has http/ht

Ashish Kunwar 66 Nov 09, 2022
GAN Image Generator and Characterwise Image Recognizer with python

MODEL SUMMARY 모델의 구조는 크게 6단계로 나뉩니다. STEP 0: Input Image Predict 할 이미지를 모델에 입력합니다. STEP 1: Make Black and White Image STEP 1 은 입력받은 이미지의 글자를 흑색으로, 배경을

Juwan HAN 1 Feb 09, 2022
AITUS - An atomatic notr maker for CYTUS

AITUS an automatic note maker for CYTUS. 利用AI根据指定乐曲生成CYTUS游戏谱面。 效果展示:https://www

GradiusTwinbee 6 Feb 24, 2022
Python implementation of Lightning-rod Agent, the Stack4Things board-side probe

Iotronic Lightning-rod Agent Python implementation of Lightning-rod Agent, the Stack4Things board-side probe. Free software: Apache 2.0 license Websit

2 May 19, 2022
Skyformer: Remodel Self-Attention with Gaussian Kernel and Nystr\"om Method (NeurIPS 2021)

Skyformer This repository is the official implementation of Skyformer: Remodel Self-Attention with Gaussian Kernel and Nystr"om Method (NeurIPS 2021).

Qi Zeng 46 Sep 20, 2022
Lucid Sonic Dreams syncs GAN-generated visuals to music.

Lucid Sonic Dreams Lucid Sonic Dreams syncs GAN-generated visuals to music. By default, it uses NVLabs StyleGAN2, with pre-trained models lifted from

731 Jan 02, 2023
Tilted Empirical Risk Minimization (ICLR '21)

Tilted Empirical Risk Minimization This repository contains the implementation for the paper Tilted Empirical Risk Minimization ICLR 2021 Empirical ri

Tian Li 40 Nov 28, 2022
CPF: Learning a Contact Potential Field to Model the Hand-object Interaction

Contact Potential Field This repo contains model, demo, and test codes of our paper: CPF: Learning a Contact Potential Field to Model the Hand-object

Lixin YANG 99 Dec 26, 2022
An experiment on the performance of homemade Q-learning AIs in Agar.io depending on their state representation and available actions

Agar.io_Q-Learning_AI An experiment on the performance of homemade Q-learning AIs in Agar.io depending on their state representation and available act

1 Jun 09, 2022
Bottom-up Human Pose Estimation

Introduction This is the official code of Rethinking the Heatmap Regression for Bottom-up Human Pose Estimation. This paper has been accepted to CVPR2

108 Dec 01, 2022
PyTorch evaluation code for Delving Deep into the Generalization of Vision Transformers under Distribution Shifts.

Out-of-distribution Generalization Investigation on Vision Transformers This repository contains PyTorch evaluation code for Delving Deep into the Gen

Chongzhi Zhang 72 Dec 13, 2022
Baseline inference Algorithm for the STOIC2021 challenge.

STOIC2021 Baseline Algorithm This codebase contains an example submission for the STOIC2021 COVID-19 AI Challenge. As a baseline algorithm, it impleme

Luuk Boulogne 10 Aug 08, 2022
MODALS: Modality-agnostic Automated Data Augmentation in the Latent Space

Update (20 Jan 2020): MODALS on text data is avialable MODALS MODALS: Modality-agnostic Automated Data Augmentation in the Latent Space Table of Conte

38 Dec 15, 2022
This repository contains the re-implementation of our paper deSpeckNet: Generalizing Deep Learning Based SAR Image Despeckling

deSpeckNet-TF-GEE This repository contains the re-implementation of our paper deSpeckNet: Generalizing Deep Learning Based SAR Image Despeckling publi

Adugna Mullissa 16 Sep 07, 2022
Pseudo-rng-app - whos needs science to make a random number when you have pseudoscience?

Pseudo-random numbers with pseudoscience rng is so complicated! Why cant we have a horoscopic, vibe-y way of calculating a random number? Why cant rng

Andrew Blance 1 Dec 27, 2021
Music Source Separation; Train & Eval & Inference piplines and pretrained models we used for 2021 ISMIR MDX Challenge.

Music Source Separation with Channel-wise Subband Phase Aware ResUnet (CWS-PResUNet) Introduction This repo contains the pretrained Music Source Separ

Lau 100 Dec 25, 2022