NLG evaluation via Statistical Measures of Similarity: BaryScore, DepthScore, InfoLM

Overview

NLG evaluation via Statistical Measures of Similarity: BaryScore, DepthScore, InfoLM

Automatic Evaluation Metric described in the papers BaryScore (EMNLP 2021) , DepthScore (Submitted), InfoLM (AAAI 2022).

Authors:

Goal :

This repository deals with automatic evaluation of NLG and addresses the special case of reference based evaluation. The goal is to build a metric m: where is the space of sentences. An example is given below:

Metric examples: similar sentences should have a high score, dissimilar should have a low score according to m.

Overview

We start by giving an overview of the proposed metrics.

DepthScore (Submitted)

DepthScore is a single layer metric based on pretrained contextualized representations. Similar to BertScore, it embeds both the candidate (C: It is freezing this morning) and the reference (R: The weather is cold today) using a single layer of Bert to obtain discrete probability measures and . Then, a similarity score is computed using the pseudo metric introduced here.

Depth Score

This statistical measure has been tested on Data2text and Summarization.

BaryScore (EMNLP 2021)

BaryScore is a multi-layers metric based on pretrained contextualized representations. Similar to MoverScore, it aggregates the layers of Bert before computing a similarity score. By modelling the layer output of deep contextualized embeddings as a probability distribution rather than by a vector embedding; BaryScore aggregates the different outputs through the Wasserstein space topology. MoverScore (right) leverages the information available in other layers by aggregating the layers using a power mean and then use a Wasserstein distance ().

BaryScore (left) vs MoverScore (right)

This statistical measure has been tested on Data2text, Summarization, Image captioning and NMT.

InfoLM (AAAI 2022)

InfoLM is a metric based on a pretrained language model ( PLM) (). Given an input sentence S mask at position i (), the PLM outputs a discret probability distribution () over the vocabulary (). The second key ingredient of InfoLM is a measure of information () that computes a measure of similarity between the aggregated distributions. Formally, InfoLM involes 3 steps:

  • 1. Compute individual distributions using for the candidate C and the reference R.
  • 2. Aggregate individual distributions using a weighted sum.
  • 3. Compute similarity using .
InfoLM

InfoLM is flexible as it can adapte to different criteria using different measures of information. This metric has been tested on Data2text and Summarization.

References

If you find this repo useful, please cite our papers:

@article{infolm_aaai2022,
  title={InfoLM: A New Metric to Evaluate Summarization \& Data2Text Generation},
  author={Colombo, Pierre and Clavel, Chloe and Piantanida, Pablo},
  journal={arXiv preprint arXiv:2112.01589},
  year={2021}
}
@inproceedings{colombo-etal-2021-automatic,
    title = "Automatic Text Evaluation through the Lens of {W}asserstein Barycenters",
    author = "Colombo, Pierre  and Staerman, Guillaume  and Clavel, Chlo{\'e}  and Piantanida, Pablo",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    year = "2021",
    pages = "10450--10466"
}
@article{depth_score,
  title={A pseudo-metric between probability distributions based on depth-trimmed regions},
  author={Staerman, Guillaume and Mozharovskyi, Pavlo and Colombo, Pierre and Cl{\'e}men{\c{c}}on, St{\'e}phan and d'Alch{\'e}-Buc, Florence},
  journal={arXiv preprint arXiv:2103.12711},
  year={2021}
}

Usage

Python Function

Running our metrics can be computationally intensive (because it relies on pretrained models). Therefore, a GPU is usually necessary. If you don't have access to a GPU, you can use light pretrained representations such as TinyBERT, DistilBERT.

We provide example inputs under <metric_name>.py. For example for BaryScore

metric_call = BaryScoreMetric()

ref = [
        'I like my cakes very much',
        'I hate these cakes!']
hypothesis = ['I like my cakes very much',
                  'I like my cakes very much']

metric_call.prepare_idfs(ref, hypothesis)
final_preds = metric_call.evaluate_batch(ref, hypothesis)
print(final_preds)

Command Line Interface (CLI)

We provide a command line interface (CLI) of BERTScore as well as a python module. For the CLI, you can use it as follows:

export metric=infolm
export measure_to_use=fisher_rao
CUDA_VISIBLE_DEVICES=0 python score_cli.py --ref="samples/refs.txt" --cand="samples/hyps.txt" --metric_name=${metric} --measure_to_use=${measure_to_use}

See more options by python score_cli.py -h.

Practical Tips

  • Unlike BERT, RoBERTa uses GPT2-style tokenizer which creates addition " " tokens when there are multiple spaces appearing together. It is recommended to remove addition spaces by sent = re.sub(r' +', ' ', sent) or sent = re.sub(r'\s+', ' ', sent).
  • Using inverse document frequency (idf) on the reference sentences to weigh word importance may correlate better with human judgment. However, when the set of reference sentences become too small, the idf score would become inaccurate/invalid. To use idf, please set --idf when using the CLI tool.
  • When you are low on GPU memory, consider setting batch_size to a low number.

Practical Limitation

  • Because pretrained representations have learned positional embeddings with max length 512, our scores are undefined between sentences longer than 510 (512 after adding [CLS] and [SEP] tokens) . The sentences longer than this will be truncated. Please consider using larger models which can support much longer inputs.

Acknowledgements

Our research was granted access to the HPC resources of IDRIS under the allocation 2021-AP010611665 as well as under the project 2021-101838 made by GENCI.

Owner
Pierre Colombo
Pierre Colombo
Código de um painel de auto atendimento feito em Python.

Painel de Auto-Atendimento O intuito desse projeto era fazer em Python um programa que simulasse um painel de auto atendimento, no maior estilo Mac Do

Calebe Alves Evangelista 2 Nov 09, 2022
Points2Surf: Learning Implicit Surfaces from Point Clouds (ECCV 2020 Spotlight)

Points2Surf: Learning Implicit Surfaces from Point Clouds (ECCV 2020 Spotlight)

Philipp Erler 329 Jan 06, 2023
WSDM2022 "A Simple but Effective Bidirectional Extraction Framework for Relational Triple Extraction"

BiRTE WSDM2022 "A Simple but Effective Bidirectional Extraction Framework for Relational Triple Extraction" Requirements The main requirements are: py

9 Dec 27, 2022
This is the official code of our paper "Diversity-based Trajectory and Goal Selection with Hindsight Experience Relay" (PRICAI 2021)

Diversity-based Trajectory and Goal Selection with Hindsight Experience Replay This is the official implementation of our paper "Diversity-based Traje

Tianhong Dai 6 Jul 18, 2022
Udacity's CS101: Intro to Computer Science - Building a Search Engine

Udacity's CS101: Intro to Computer Science - Building a Search Engine All soluti

Phillip 0 Feb 26, 2022
DI-HPC is an acceleration operator component for general algorithm modules in reinforcement learning algorithms

DI-HPC: Decision Intelligence - High Performance Computation DI-HPC is an acceleration operator component for general algorithm modules in reinforceme

OpenDILab 185 Dec 29, 2022
DLL: Direct Lidar Localization

DLL: Direct Lidar Localization Summary This package presents DLL, a direct map-based localization technique using 3D LIDAR for its application to aeri

Service Robotics Lab 127 Dec 16, 2022
QAHOI: Query-Based Anchors for Human-Object Interaction Detection (paper)

QAHOI QAHOI: Query-Based Anchors for Human-Object Interaction Detection (paper) Requirements PyTorch = 1.5.1 torchvision = 0.6.1 pip install -r requ

38 Dec 29, 2022
The spiritual successor to knockknock for PyTorch Lightning, get notified when your training ends

Who's there? The spiritual successor to knockknock for PyTorch Lightning, to get a notification when your training is complete or when it crashes duri

twsl 70 Oct 06, 2022
Source code of all the projects of Udacity Self-Driving Car Engineer Nanodegree.

self-driving-car In this repository I will share the source code of all the projects of Udacity Self-Driving Car Engineer Nanodegree. Hope this might

Andrea Palazzi 2.4k Dec 29, 2022
[ICCV'21] Official implementation for the paper Social NCE: Contrastive Learning of Socially-aware Motion Representations

CrowdNav with Social-NCE This is an official implementation for the paper Social NCE: Contrastive Learning of Socially-aware Motion Representations by

VITA lab at EPFL 125 Dec 23, 2022
PyTorch implementation DRO: Deep Recurrent Optimizer for Structure-from-Motion

DRO: Deep Recurrent Optimizer for Structure-from-Motion This is the official PyTorch implementation code for DRO-sfm. For technical details, please re

Alibaba Cloud 56 Dec 12, 2022
Intrusion Detection System using ensemble learning (machine learning)

IDS-ML implementation of an intrusion detection system using ensemble machine learning methods Data set This project is carried out using the UNSW-15

4 Nov 25, 2022
Official implementation for "QS-Attn: Query-Selected Attention for Contrastive Learning in I2I Translation" (CVPR 2022)

QS-Attn: Query-Selected Attention for Contrastive Learning in I2I Translation (CVPR2022) https://arxiv.org/abs/2203.08483 Unpaired image-to-image (I2I

Xueqi Hu 50 Dec 16, 2022
Safe Local Motion Planning with Self-Supervised Freespace Forecasting, CVPR 2021

Safe Local Motion Planning with Self-Supervised Freespace Forecasting By Peiyun Hu, Aaron Huang, John Dolan, David Held, and Deva Ramanan Citing us Yo

Peiyun Hu 90 Dec 01, 2022
AI-based, context-driven network device ranking

Batea A batea is a large shallow pan of wood or iron traditionally used by gold prospectors for washing sand and gravel to recover gold nuggets. Batea

Secureworks Taegis VDR 269 Nov 26, 2022
SAT Project - The first project I had done at General Assembly, performed EDA, data cleaning and created data visualizations

Project 1: Standardized Test Analysis by Adam Klesc Overview This project covers: Basic statistics and probability Many Python programming concepts Pr

Adam Muhammad Klesc 1 Jan 03, 2022
Keras-1D-NN-Classifier

Keras-1D-NN-Classifier This code is based on the reference codes linked below. reference 1, reference 2 This code is for 1-D array data classification

Jae-Hoon Shim 6 May 18, 2021
Hidden-Fold Networks (HFN): Random Recurrent Residuals Using Sparse Supermasks

Hidden-Fold Networks (HFN): Random Recurrent Residuals Using Sparse Supermasks by Ángel López García-Arias, Masanori Hashimoto, Masato Motomura, and J

Ángel López García-Arias 4 May 19, 2022
A Moonraker plug-in for real-time compensation of frame thermal expansion

Frame Expansion Compensation A Moonraker plug-in for real-time compensation of frame thermal expansion. Installation Credit to protoloft, from whom I

58 Jan 02, 2023