Code for our ACL 2021 (Findings) Paper - Fingerprinting Fine-tuned Language Models in the wild .

Overview

🌳 Fingerprinting Fine-tuned Language Models in the wild

This is the code and dataset for our ACL 2021 (Findings) Paper - Fingerprinting Fine-tuned Language Models in the wild .

Clone the repo

git clone https://github.com/LCS2-IIITD/ACL-FFLM.git
pip3 install -r requirements.txt 

Dataset

The dataset includes both organic and synthetic text.

  • Synthetic -

    Collected from posts of r/SubSimulatorGPT2. Each user on the subreddit is a GPT2 small (345 MB) bot that is fine-tuned on 500k posts and comments from a particular subreddit (e.g., r/askmen, r/askreddit,r/askwomen). The bots generate posts on r/SubSimulatorGPT2, starting off with the main post followed by comments (and replies) from other bots. The bots also interact with each other by using the synthetic text in the preceding comment/reply as their prompt. In total, the sub-reddit contains 401,214 comments posted between June 2019 and January 2020 by 108 fine-tuned GPT2 LMs (or class).

  • Organic -

    Collected from comments of 108 subreddits the GPT2 bots have been fine-tuned upon. We randomly collected about 2000 comments between the dates of June 2019 - Jan 2020.

The complete dataset is available here. Download the dataset as follows -

  1. Download the 2 folders organic and synthetic, containing the comments from individual classes.
  2. Store them in the data folder in the following format.
data
├── organic
├── synthetic
└── authors.csv

Note -
For the below TL;DR run you also need to download dataset.json and dataset.pkl files which contain pre-processed data.
Organize them in the dataset/synthetic folder as follows -

dataset
├── organic
├── synthetic
  ├── splits (Folder already present)
    ├── 6 (Folder already present)
      └── 108_800_100_200_dataset.json (File already present)
  ├── dataset.json (to be added via drive link)
  └── dataset.pkl (to be added via drive link)
└── authors.csv (File already present)

108_800_100_200_dataset.json is a custom dataset which contains the comment ID's, the labels and their separation into train, test and validation splits.
Upon running the models, the comments for each split are fetched from the dataset.json using the comment ID's in the 108_800_100_200_dataset.json file .

Running the code

TL;DR

You can skip the pre-processing and the Create Splits if you want to run the code on some custom datasets available in the dataset/synthetic....splits folder. Make sure to follow the instructions mentioned in the Note of the Dataset section for the setting up the dataset folders.

Pre-process the dataset

First, we pre-process the complete dataset using the data present in the folder create-splits. Select the type of data (organic/synthetic) you want to pre-process using the parameter synthetic in the file. By deafult the parameter is set for synthetic data i.e True. This would create a pre-processed dataset.json and dataset.pkl files in the dataset/[organic OR synthetic] folder.

Create Train, Test and Validation Splits

We create splits of train, test and validation data. The parameters such as min length of sentences (default 6), lowercase sentences, size of train (max and default 800/class), validation (max and default 100/class) and test (max and default 200/class),number of classes (max and default 108) can be set internally in the create_splits.py in the splits folder under the commented PARAMETERS Section.

cd create-splits.py
python3 create_splits.py

This creates a folder in the folder dataset/synthetic/splits/[min_len_of_sentence/min_nf_tokens = 6]/. The train, validation and test datasets are all stored in the same file with the filename [#CLASSES]_[#TRAIN_SET_SIZE]_[#VAL_SET_SIZE]_[#TEST_SET_SIZE]_dataset.json like 108_800_100_200_dataset.json.

Running the model

Now fix the same parameters in the seq_classification.py file. To train and test the best model (Fine-tuned GPT2/ RoBERTa) -

cd models/generate-embed/ft/
python3 seq_classification.py 

A results folder will be generated which will contain the results of each epoch.

Note -
For the other models - pretrained and writeprints, first generate the embeddings using the files in the folders models/generate-embed/[pre-trained or writeprints]. The generated embeddings are stored in the results/generate-embed folder. Then, use the script in the models/classifiers/[pre-trained or writeprints] to train sklearn classifiers on generated embeddings. The final results will be in the results/classifiers/[pre-trained or writeprints] folder.

👪 Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. For any detailed clarifications/issues, please email to nirav17072[at]iiitd[dot]ac[dot]in .

⚖️ License

MIT

Owner
LCS2-IIITDelhi
Laboratory for Computation Social Systems (LCS2) is a research group led by Dr. Tanmoy Chakraborty and Dr. Md. Shad Akhtar at IIIT-D
LCS2-IIITDelhi
SimCTG - A Contrastive Framework for Neural Text Generation

A Contrastive Framework for Neural Text Generation Authors: Yixuan Su, Tian Lan,

Yixuan Su 345 Jan 03, 2023
Conversational text Analysis using various NLP techniques

Conversational text Analysis using various NLP techniques

Rita Anjana 159 Jan 06, 2023
This repository is home to the Optimus data transformation plugins for various data processing needs.

Transformers Optimus's transformation plugins are implementations of Task and Hook interfaces that allows execution of arbitrary jobs in optimus. To i

Open Data Platform 37 Dec 14, 2022
A sentence aligner for comparable corpora

About Yalign is a tool for extracting parallel sentences from comparable corpora. Statistical Machine Translation relies on parallel corpora (eg.. eur

Machinalis 128 Aug 24, 2022
2021搜狐校园文本匹配算法大赛baseline

sohu2021-baseline 2021搜狐校园文本匹配算法大赛baseline 简介 分享了一个搜狐文本匹配的baseline,主要是通过条件LayerNorm来增加模型的多样性,以实现同一模型处理不同类型的数据、形成不同输出的目的。 线下验证集F1约0.74,线上测试集F1约0.73。

苏剑林(Jianlin Su) 45 Sep 06, 2022
CCF BDCI BERT系统调优赛题baseline(Pytorch版本)

CCF BDCI BERT系统调优赛题baseline(Pytorch版本) 此版本基于Pytorch后端的huggingface进行实现。由于此实现使用了Oneflow的dataloader作为数据读入的方式,因此也需要安装Oneflow。其它框架的数据读取可以参考OneflowDataloade

Ziqi Zhou 9 Oct 13, 2022
Local cross-platform machine translation GUI, based on CTranslate2

DesktopTranslator Local cross-platform machine translation GUI, based on CTranslate2 Download Windows Installer You can either download a ready-made W

Yasmin Moslem 29 Jan 05, 2023
NLP applications using deep learning.

NLP-Natural-Language-Processing NLP applications using deep learning like text generation etc. 1- Poetry Generation: Using a collection of Irish Poem

KASHISH 1 Jan 27, 2022
Pervasive Attention: 2D Convolutional Networks for Sequence-to-Sequence Prediction

This is a fork of Fairseq(-py) with implementations of the following models: Pervasive Attention - 2D Convolutional Neural Networks for Sequence-to-Se

Maha 490 Dec 15, 2022
SimCSE: Simple Contrastive Learning of Sentence Embeddings

SimCSE: Simple Contrastive Learning of Sentence Embeddings This repository contains the code and pre-trained models for our paper SimCSE: Simple Contr

Princeton Natural Language Processing 2.5k Jan 07, 2023
nlabel is a library for generating, storing and retrieving tagging information and embedding vectors from various nlp libraries through a unified interface.

nlabel is a library for generating, storing and retrieving tagging information and embedding vectors from various nlp libraries through a unified interface.

Bernhard Liebl 2 Jun 10, 2022
Text editor on python tkinter to convert english text to other languages with the help of ployglot.

Transliterator Text Editor This is a simple transliteration program which is used to convert english word to phonetically matching word in another lan

Merin Rose Tom 1 Jan 16, 2022
LeBenchmark: a reproducible framework for assessing SSL from speech

LeBenchmark: a reproducible framework for assessing SSL from speech

11 Nov 30, 2022
Dust model dichotomous performance analysis

Dust-model-dichotomous-performance-analysis Using a collated dataset of 90,000 dust point source observations from 9 drylands studies from around the

1 Dec 17, 2021
Conditional probing: measuring usable information beyond a baseline

Conditional probing: measuring usable information beyond a baseline

John Hewitt 20 Dec 15, 2022
Code for CodeT5: a new code-aware pre-trained encoder-decoder model.

CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation This is the official PyTorch implementation

Salesforce 564 Jan 08, 2023
Auto translate textbox from Japanese to English or Indonesia

priconne-auto-translate Auto translate textbox from Japanese to English or Indonesia How to use Install python first, Anaconda is recommended Install

Aji Priyo Wibowo 5 Aug 25, 2022
A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)

MMF is a modular framework for vision and language multimodal research from Facebook AI Research. MMF contains reference implementations of state-of-t

Facebook Research 5.1k Dec 26, 2022
Creating a chess engine using GPT-3

GPT3Chess Creating a chess engine using GPT-3 Code for my article : https://towardsdatascience.com/gpt-3-play-chess-d123a96096a9 My game (white) vs GP

19 Dec 17, 2022