The Top-k passages retrieved by the Bi-Encoder are reranked using a CrossEncoder.

Overview

Cross-Encoder-with-Bi-Encoder

The Top-k passages retrieved by the Bi-Encoder are reranked using a CrossEncoder.

Data

The data comes from the "Open-Domain Question Answering Competition" hosted by Aistages and may be used under the CC-BY-2.0 license.

+- data
|   +- train_dataset
|   |   +- train
|   |   |   +- dataset.arrow
|   |   |   +- dataset_info.json
|   |   |   +- indices.arrow
|   |   |   +- state.json
|   |   +- validation
|   |   |   +- dataset.arrow
|   |   |   +- dataset_info.json
|   |   |   +- indices.arrow
|   |   |   +- state.json
|   |   +- dataset_dict.json
|   +- test_dataset
|   |   +- validation
|   |   |   +- dataset.arrow
|   |   |   +- dataset_info.json
|   |   |   +- indices.arrow
|   |   |   +- state.json
|   |   +- dataset_dict.json
|   +- wikipedia_documents.json
  • Place the Wikipedia data (wikipedia_documents.json) in the folder shown above before use.
!git clone https://github.com/jjonhwa/Cross-Encoder-with-Bi-Encoder.git  # clone the repository
%cd ./Cross-Encoder-with-Bi-Encoder/_data                                # change directory (use your own path)

!gdown --id 1O-kxt4DupOibNhkwmg3luTLt07faRgvO                            # download the Wikipedia data

Setup

Dependencies

  • datasets==1.5.0
  • transformers==4.5.0
  • tqdm==4.41.1
  • pandas==1.1.4
  • CUDA==11.0

Install Requirements

bash install_requirements.sh

Hardware

  • GPU : Tesla V100 (32GB)

Checkpoint

  • You can try the code in a Colab environment using the Demo.
  • It does not run on Colab Basic.

What can we do to improve the performance of Retriever?

1. Examine how the dataset was built.

  • Sparse embedding may work better on datasets such as SQuAD, where annotators wrote each question while looking at the passage (which introduces annotation bias).
  • For most other datasets, Dense Passage Retrieval (DPR) tends to retrieve documents more accurately.

2. Sparse Embedding & Dense Embedding

  • Most of what follows is based on findings from papers, and applying them improved our Retriever performance.
  • The 'KLUE MRC database' was built in the same way as SQuAD, so before applying DPR we expected a sparse embedding technique such as BM25 to work better than DPR.
  • In practice, until the ReRank strategy was applied, the best performance came from Elasticsearch based on BM25.
  • With the Bi-Encoder alone, retrieval accuracy in the 'KLUE MRC competition' was far below that of Elasticsearch.
  • Retrieval accuracy on our data:

| Retriever      | Top-5 | Top-50 | Top-100 |
|----------------|-------|--------|---------|
| Elasticsearch  | 0.852 | 0.945  | 0.962   |
| DPR Bi-Encoder | -     | 0.775  | 0.85    |
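
The README does not say how the Top-k figures above are computed. A minimal sketch, assuming the usual definition (the fraction of questions whose gold passage appears among the top-k retrieved passages), is shown below. The function name, the id format and the toy data are hypothetical and are not taken from the repository.

```python
from typing import Dict, List


def top_k_accuracy(ranked: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    """Fraction of questions whose gold passage id is among the top-k retrieved ids.

    ranked maps a question id to passage ids sorted from best to worst;
    gold maps a question id to the id of its correct passage (assumed format).
    """
    if not gold:
        raise ValueError("gold must not be empty")
    hits = sum(
        1 for qid, gold_pid in gold.items()
        if gold_pid in ranked.get(qid, [])[:k]
    )
    return hits / len(gold)


# Toy data: four questions, three retrieved passages each.
ranked = {
    "q1": ["p3", "p1", "p7"],
    "q2": ["p2", "p9", "p4"],
    "q3": ["p8", "p5", "p6"],
    "q4": ["p0", "p4", "p1"],
}
gold = {"q1": "p1", "q2": "p2", "q3": "p6", "q4": "p9"}

for k in (1, 2, 3):
    print(f"Top-{k} accuracy: {top_k_accuracy(ranked, gold, k):.2f}")
```

Under this definition accuracy can only stay the same or rise as k grows, which matches the left-to-right trend in the table.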

3. ReRank Strategy with CrossEncoder (In-Batch Negative Samples)

  • Our goal was high end-to-end performance in the KLUE MRC competition, from retrieval through to the Reader, so we adopted a ReRank strategy using a CrossEncoder.
  • When training the CrossEncoder, the key point is to draw negative samples from within the batch and use them to compute the loss.
  • The Bi-Encoder first retrieves the Top-500 passages; the CrossEncoder then re-scores them, and only a small number of passages are kept.
  • Retrieval accuracy on our data:

| Retriever                | Top-5 | Top-50 | Top-100 |
|--------------------------|-------|--------|---------|
| Elasticsearch            | 0.852 | 0.945  | 0.962   |
| DPR without CrossEncoder | -     | 0.775  | 0.85    |
| DPR with CrossEncoder    | 0.825 | 0.95   | -       |
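
To make the in-batch negative idea concrete, here is a minimal sketch in plain Python. It is not the repository's training code. It assumes the model has already produced a B x B score matrix in which scores[i][j] is the score of question i paired with passage j, so that the diagonal holds the positive pairs and every other passage in the batch serves as a negative. The function name and this matrix layout are assumptions.

```python
import math
from typing import List


def in_batch_negative_loss(scores: List[List[float]]) -> float:
    """Mean cross-entropy over a square score matrix.

    scores[i][j] is the (assumed) model score for question i and passage j.
    Passage i is the positive for question i; all other passages in the
    batch act as negatives.
    """
    batch_size = len(scores)
    if batch_size == 0 or any(len(row) != batch_size for row in scores):
        raise ValueError("scores must be a non-empty square matrix")

    total = 0.0
    for i, row in enumerate(scores):
        # Numerically stable log-sum-exp over all passages in the batch.
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(s - row_max) for s in row))
        # Negative log-softmax probability of the positive passage.
        total += log_sum_exp - row[i]
    return total / batch_size


# The loss is low when the diagonal (positive) scores dominate ...
good = [[5.0, 0.0], [0.0, 5.0]]
# ... and high when the negatives score higher than the positives.
bad = [[0.0, 5.0], [5.0, 0.0]]
print(in_batch_negative_loss(good), in_batch_negative_loss(bad))
```

With a CrossEncoder, each entry of the matrix needs its own forward pass over the concatenated question and passage, so a batch of B examples costs B x B passes. In a framework such as PyTorch the same loss is typically written as a cross-entropy with the targets 0..B-1.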

4. Ensemble

  • This README focuses on the CrossEncoder; the details of the ensemble are omitted.
  • We ran an ensemble of sparse and dense embeddings, on the assumption that combining different types of retrieval would improve performance.
  • The Top-100 passages were selected with Elasticsearch and another Top-100 with DPR and the CrossEncoder; the final score was computed by combining the two 1:1 and normalizing.
  • The final Reader model performed best when given the Top-5 passages, so the experiment limited the number of returned passages to five.
  • Top-5 retrieval accuracy rose to 0.9082, compared with 0.852 for Elasticsearch alone and 0.825 for DPR with the CrossEncoder:

| Retriever             | Top-5  | Top-50 | Top-100 |
|-----------------------|--------|--------|---------|
| Elasticsearch         | 0.852  | 0.945  | 0.962   |
| DPR with CrossEncoder | 0.825  | 0.95   | -       |
| Ensemble              | 0.9082 | -      | -       |
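
The ensemble code is not part of this README, so the following is only a sketch of one way to combine two retrievers 1:1. It assumes min-max normalization of each retriever's scores before an equal-weight sum; the README does not specify the normalization or the order of the two steps. All function names and the toy scores are hypothetical.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so the two retrievers are comparable."""
    if not scores:
        return {}
    lowest, highest = min(scores.values()), max(scores.values())
    if highest == lowest:
        return {pid: 1.0 for pid in scores}
    return {pid: (s - lowest) / (highest - lowest) for pid, s in scores.items()}


def ensemble_top_k(
    sparse: Dict[str, float], dense: Dict[str, float], k: int = 5
) -> List[Tuple[str, float]]:
    """Equal-weight (1:1) sum of normalized scores; returns the k best passages.

    A passage missing from one retriever's candidates contributes 0 from it
    (an assumption made for this sketch).
    """
    sparse_n = min_max_normalize(sparse)
    dense_n = min_max_normalize(dense)
    combined = {
        pid: sparse_n.get(pid, 0.0) + dense_n.get(pid, 0.0)
        for pid in set(sparse_n) | set(dense_n)
    }
    # Sort by score descending, then by id so ties are deterministic.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))[:k]


# Toy scores: BM25-style scores and dense scores live on different scales.
sparse_scores = {"a": 12.0, "b": 9.0, "c": 3.0}
dense_scores = {"b": 0.9, "c": 0.8, "d": 0.1}
print(ensemble_top_k(sparse_scores, dense_scores, k=2))
```

Passage "b" ranks first here because both retrievers score it highly, which is the behaviour an ensemble of sparse and dense retrieval is meant to reward.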

Train CrossEncoder & BiEncoder

  • Train the CrossEncoder and the Bi-Encoder and save them.
  • Only the data path needs to be changed to match your data (search for "your_dataset_path").
python train.py --encoder 'cross' --output_directory './save_directory/'

or

python train.py --encoder 'bi' --output_directory './save_directory/'

Run ReRank

  • Reranking requires trained CrossEncoder and Bi-Encoder models, so run 'train.py' for both encoders before running ReRank.
  • Only the data path needs to be changed to match your data (search for "your_dataset_path").
python rerank.py --input_directory './save_directory/'

Run Retriever Demo

  • The Bi-Encoder retrieves the Top-500 passages from about 60,000 passages, and the CrossEncoder then selects the final Top-5.
  • Passage embeddings for the wiki data, the CrossEncoder and the Bi-Encoder can be downloaded and used.
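
The two-stage flow described above can be sketched as follows. This is an illustration, not the demo's code: the scoring functions are toy word-overlap stand-ins for the Bi-Encoder (in practice a dot product against precomputed passage embeddings) and the CrossEncoder (a joint forward pass over question and passage). All names here are hypothetical.

```python
from typing import Callable, List, Sequence

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    question: str,
    passages: Sequence[str],
    bi_score: Scorer,
    cross_score: Scorer,
    first_k: int = 500,
    final_k: int = 5,
) -> List[str]:
    """Stage 1: cheap scorer keeps first_k candidates. Stage 2: costly scorer reranks them."""
    candidates = sorted(
        range(len(passages)),
        key=lambda i: bi_score(question, passages[i]),
        reverse=True,
    )[:first_k]
    reranked = sorted(
        candidates,
        key=lambda i: cross_score(question, passages[i]),
        reverse=True,
    )
    return [passages[i] for i in reranked[:final_k]]


def toy_bi_score(question: str, passage: str) -> float:
    """Stand-in for the Bi-Encoder: number of shared words."""
    return float(len(set(question.lower().split()) & set(passage.lower().split())))


def toy_cross_score(question: str, passage: str) -> float:
    """Stand-in for the CrossEncoder: question-word matches per passage word."""
    passage_words = passage.lower().split()
    matches = sum(passage_words.count(w) for w in question.lower().split())
    return matches / max(len(passage_words), 1)


passages = [
    "seoul is the capital of south korea",
    "the capital of france is paris",
    "kimchi is a korean dish",
    "busan is a port city in south korea",
]
result = retrieve_then_rerank(
    "capital of south korea", passages, toy_bi_score, toy_cross_score,
    first_k=3, final_k=2,
)
print(result)
```

Only the first_k candidates reach the second stage, which is why the expensive CrossEncoder can be afforded at query time.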