Empirical Study of Transformers for Source Code & A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code

Last update: Nov 15, 2022

Overview

Transformers for variable misuse, function naming and code completion tasks

The official PyTorch implementation of:

Empirical Study of Transformers for Source Code [arxiv] (accepted to ESEC/FSE'21)
A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code [arxiv] (accepted to NAACL'21)

The repository also contains code for resplitting the Python150k and JavaScript150k datasets: splitting by repository, removing duplicates, and using the redistributable version of Py150k.
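To illustrate what splitting by repository with duplicate removal means, here is a minimal sketch. It is not the code in data_utils. The names repo_of, assign_split and resplit are hypothetical, and the "owner/repo/file" path layout and the 80/10/10 proportions are assumptions made for the example. Only exact duplicates are detected here; the actual scripts may use a different deduplication procedure.

```python
import hashlib
from collections import defaultdict


def repo_of(path):
    """Return the repository key of a file path.

    Assumption for this sketch: paths look like 'owner/repo/.../file.py',
    so the first two components identify the repository.
    """
    return "/".join(path.split("/")[:2])


def assign_split(repo, val_pct=10, test_pct=10):
    """Deterministically map a repository name to train, val or test.

    Hashing the repository name (not the file name) guarantees that all
    files of one repository end up in the same split.
    """
    bucket = int(hashlib.md5(repo.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


def resplit(files):
    """Split (path, source_text) pairs by repository, dropping exact duplicates.

    The first occurrence of each distinct file content is kept.
    Returns a dict mapping split name to a list of paths.
    """
    seen = set()
    splits = defaultdict(list)
    for path, text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:  # same content already kept elsewhere
            continue
        seen.add(digest)
        splits[assign_split(repo_of(path))].append(path)
    return dict(splits)
```

Because the split is chosen from the repository name alone, no repository contributes files to both the training and the evaluation sets, which is the property that repository-level splitting is meant to provide.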

Repository structure

data_utils: scripts for downloading the Python150k and JavaScript150k datasets and producing new train / val / test splits (split by repository, with duplicates removed, using the redistributable version of Py150k)
vm_fn: code for the Variable Misuse (VM) and Function Naming (FN) tasks (additional preprocessing, models, training, etc.)
cc: code for the Code Completion (CC) task (additional preprocessing, models, training, etc.)

See README in each directory for details.

Run

The code was tested on Linux (kernel 3.10.0). Experiments were run on a Tesla V100 GPU. Required libraries are listed in requirments.txt in the vm_fn and cc directories. The implementation is based on PyTorch>=1.5.
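Before running anything, it can help to confirm that the installed PyTorch satisfies the PyTorch>=1.5 requirement stated above. The following is a small sketch, not part of the repository. The helpers version_tuple and meets_minimum are hypothetical names introduced for this example.

```python
import re


def version_tuple(version):
    """Extract the leading numeric components, e.g. '1.5.0+cu101' -> (1, 5, 0)."""
    match = re.match(r"(\d+(?:\.\d+)*)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(version, minimum="1.5"):
    """Return True if `version` is at least `minimum` (numeric comparison)."""
    return version_tuple(version) >= version_tuple(minimum)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        status = "OK" if meets_minimum(torch.__version__) else "older than 1.5"
        print("PyTorch", torch.__version__, status)
        print("CUDA available:", torch.cuda.is_available())
```

Comparing integer tuples instead of raw strings avoids the pitfall where "1.10" sorts before "1.5" lexicographically.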

Running experiments:

Download and resplit data, see data_utils for details;
Preprocess data for a task you are interested in (VM, FN or CC), see vm_fn or cc for details;
Run the experiment you are interested in, see vm_fn or cc for details.

Attribution

Parts of this code are based on the following repositories:

Citation

If you find this code useful, please cite our papers:

@misc{chirkova2020empirical,
      title={Empirical Study of Transformers for Source Code}, 
      author={Nadezhda Chirkova and Sergey Troshin},
      year={2020},
      eprint={2010.07987},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

@inproceedings{chirkova2020simple,
      title={A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code}, 
      author={Nadezhda Chirkova and Sergey Troshin},
      booktitle={North American Chapter of the Association for Computational Linguistics},
      year={2021}
}

Owner

Bayesian Methods Research Group
