Repo for WWW 2022 paper: Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval

Related tags

Deep LearningBiDR
Overview

BiDR

Repo for WWW 2022 paper: Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval.

Requirements

torch==1.7
transformers==4.6
faiss-gpu==1.6.4.post2

Data Download and Preprocess

bash download_data.sh
python preprocess.py

These commands will download and preprocess the MSMARCO Passage and Doc dataset, then the resutls will be saved to ./data.
We take the Passage dataset as the example to show the running workflow.

Conventional Workflow

Representation Learning

Train the encoder with random negative (or set --hardneg_json to provied bm25/hard negatives) :

mkdir log
dataset=passage
savename=dense_global_model
python train.py --model_name_or_path roberta-base \
--max_query_length 24 --max_doc_length 128 \
--data_dir ./data/${dataset}/preprocess \
--learning_rate 1e-4 --optimizer_str adamw \
--per_device_train_batch_size 128 \
--per_query_neg_num 1 \
--generate_batch_method random \
--loss_method multi_ce  \
--savename ${savename} --save_model_path ./model \
--world_size 8 --gpu_rank 0_1_2_3_4_5_6_7  --master_port 13256 \
--num_train_epochs 30  \
--use_pq False \
|tee ./log/${savename}.log

Unsupervised Quantization

Generate dense embeddings of queries and docs:

data_type=passage
savename=dense_global_model
epoch=20
python ./inference.py \
--data_type ${data_type} \
--preprocess_dir ./data/${data_type}/preprocess/ \
--max_doc_length 256 --max_query_length 32 \
--eval_batch_size 512 \
--ckpt_path ./model/${savename}/${epoch}/ \
--output_dir  evaluate/${savename}_${epoch} 

Product Quantization based on Faiss and recall performance:

data_type=passage
savename=dense_global_model
epoch=20
python ./test/lightweight_ann.py \
--output_dir ./data/${data_type}/evaluate/${savename}_${epoch} \
--ckpt_path /model/${savename}/${epoch}/ \
--subvector_num 96 \
--index opq \
--topk 1000 \
--data_type ${data_type} \
--MRR_cutoff 10 \
--Recall_cutoff 5 10 30 50 100

Progressively Optimized Bi-Granular Document Representation

Sparse Representation Learning

Instead of running unsupervised quantization for the well-learned dense embeddings, the sparse embeddings are generated from contrastive learning, which optimizes the global discrimination and helps to enable high-quality answers to be covered in candidate search.

Train

We find that using Faiss OPQ to initialize the PQ module has a significant gain for MSMARCO dataset. But for the largest dataset: Ads dataset, initialization with Faiss OPQ is redundant and has no promotion.

dataset=passage
savename=sparse_global_model
python train.py --model_name_or_path ./model/dense_global_model/20 \
--max_query_length 24 --max_doc_length 128 \
--data_dir ./data/${dataset}/preprocess \
--learning_rate 1e-4 --optimizer_str adamw \
--per_device_train_batch_size 128 \
--per_query_neg_num 1 \
--generate_batch_method random \
--loss_method multi_ce  \
--savename ${savename} --save_model_path ./model \
--world_size 8 --gpu_rank 0_1_2_3_4_5_6_7  --master_port 13256 \
--num_train_epochs 30  \
--use_pq True \
--init_index_path ./data/${data_type}/evaluate/dense_global_model_20/OPQ96,PQ96x8.index \
--partition 96 --centroids 256 --quantloss_weight 0.0 \
|tee ./log/${savename}.log

where the ./model/dense_global_model/20 and ./data/${data_type}/evaluate/dense_global_model_20/OPQ96,PQ96x8.index is generated by conventional workflow.

Test

data_type=passage
savename=sparse_global_model
epoch=20

python ./inference.py \
--data_type ${data_type} \
--preprocess_dir ./data/${data_type}/preprocess/ \
--max_doc_length 256 --max_query_length 32 \
--eval_batch_size 512 \
--ckpt_path ./model/${savename}/${epoch}/ \
--output_dir  evaluate/${savename}_${epoch} 

python ./test/lightweight_ann.py \
--output_dir ./data/${data_type}/evaluate/${savename}_${epoch} \
--subvector_num 96 \
--index opq \
--topk 1000 \
--data_type ${data_type} \
--MRR_cutoff 10 \
--Recall_cutoff 5 10 30 50 100 \
--ckpt_path ./model/${savename}/${epoch}/ \
--init_index_path ./data/${data_type}/evaluate/dense_global_model_20/OPQ96,PQ96x8.index

Dense Representation Learning

The dense embeddings are optimized based on the candidate distribution generated by sparse embeddings. We propose a novel sampling strategy called locality-centric sampling to enhance local discrimination: construct a bipartite proximity graph and conduct random walk or snow sample on it.

Train

Encode the quries in train set and generate the candidates for all train queries:

data_type=passage
savename=sparse_global_model
epoch=20

python ./inference.py \
--data_type ${data_type} \
--preprocess_dir ./data/${data_type}/preprocess/ \
--max_doc_length 256 --max_query_length 32 \
--eval_batch_size 512 \
--ckpt_path ./model/${savename}/${epoch}/ \
--output_dir  evaluate/${savename}_${epoch} \
--mode train

python ./test/lightweight_ann.py \
--output_dir ./data/${data_type}/evaluate/${savename}_${epoch} \
--subvector_num 96 \
--index opq \
--topk 1000 \
--data_type ${data_type} \
--MRR_cutoff 10 \
--Recall_cutoff 5 10 30 50 100 \
--ckpt_path ./model/${savename}/${epoch}/ \
--init_index_path ./data/${data_type}/evaluate/dense_global_model_20/OPQ96,PQ96x8.index \
--mode train \
--save_hardneg_to_json

This command will save the train_hardneg.json to output_dir. Then train the dense embeddings to distinguish the ground truth from the negative in candidate:

dataset=passage
savename=dense_local_model
python train.py --model_name_or_path roberta-base \
--max_query_length 24 --max_doc_length 128 \
--data_dir ./data/${dataset}/preprocess \
--learning_rate 1e-4 --optimizer_str adamw \
--per_device_train_batch_size 128 \
--per_query_neg_num 1 \
--generate_batch_method {random_walk or snow_sample} \
--loss_method multi_ce  \
--savename ${savename} --save_model_path ./model \
--world_size 8 --gpu_rank 0_1_2_3_4_5_6_7  --master_port 13256 \
--num_train_epochs 30  \
--use_pq False \
--hardneg_json ./data/${data_type}/evaluate/sparse_global_model_20/train_hardneg.json \
--mink 0  --maxk 200 \
|tee ./log/${savename}.log

Test

data_type=passage
savename=dense_local_model
epoch=10

python ./inference.py \
--data_type ${data_type} \
--preprocess_dir ./data/${data_type}/preprocess/ \
--ckpt_path ./model/${savename}/${epoch}/ \
--max_doc_length 256 --max_query_length 32 \
--eval_batch_size 512 \
--ckpt_path ./model/${savename}/${epoch}/ \
--output_dir  evaluate/${savename}_${epoch} 

python ./test/post_verification.py \
--data_type ${data_type} \
--output_dir  evaluate/${savename}_${epoch} \
--candidate_from_ann ./data/${data_type}/evaluate/sparse_global_model_20/dev.rank_1000_score_faiss_opq.tsv \
--MRR_cutoff 10 \
--Recall_cutoff 5 10 30 50 100

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

Owner
Microsoft
Open source projects and samples from Microsoft
Microsoft
An efficient 3D semantic segmentation framework for Urban-scale point clouds like SensatUrban, Campus3D, etc.

An efficient 3D semantic segmentation framework for Urban-scale point clouds like SensatUrban, Campus3D, etc.

Zou 33 Jan 03, 2023
SEAN: Image Synthesis with Semantic Region-Adaptive Normalization (CVPR 2020, Oral)

SEAN: Image Synthesis with Semantic Region-Adaptive Normalization (CVPR 2020 Oral) Figure: Face image editing controlled via style images and segmenta

Peihao Zhu 579 Dec 30, 2022
LightningFSL: Pytorch-Lightning implementations of Few-Shot Learning models.

LightningFSL: Few-Shot Learning with Pytorch-Lightning In this repo, a number of pytorch-lightning implementations of FSL algorithms are provided, inc

Xu Luo 76 Dec 11, 2022
Code Release for the paper "TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation"

TriBERT This repository contains the code for the NeurIPS 2021 paper titled "TriBERT: Full-body Human-centric Audio-visual Representation Learning for

UBC Computer Vision Group 8 Aug 31, 2022
Official PaddlePaddle implementation of Paint Transformer

Paint Transformer: Feed Forward Neural Painting with Stroke Prediction [Paper] [Paddle Implementation] Update We have optimized the serial inference p

TianweiLin 284 Dec 31, 2022
A high-performance Python-based I/O system for large (and small) deep learning problems, with strong support for PyTorch.

WebDataset WebDataset is a PyTorch Dataset (IterableDataset) implementation providing efficient access to datasets stored in POSIX tar archives and us

1.1k Jan 08, 2023
Wenzhou-Kean University AI-LAB

AI-LAB This is Wenzhou-Kean University AI-LAB. Our research interests are in Computer Vision and Natural Language Processing. Computer Vision Please g

WKU AI-LAB 10 May 05, 2022
Adaptive, interpretable wavelets across domains (NeurIPS 2021)

Adaptive wavelets Wavelets which adapt given data (and optionally a pre-trained model). This yields models which are faster, more compressible, and mo

Yu Group 50 Dec 16, 2022
Super Resolution for images using deep learning.

Neural Enhance Example #1 — Old Station: view comparison in 24-bit HD, original photo CC-BY-SA @siv-athens. As seen on TV! What if you could increase

Alex J. Champandard 11.7k Dec 29, 2022
simple_pytorch_example project is a toy example of a python script that instantiates and trains a PyTorch neural network on the FashionMNIST dataset

simple_pytorch_example project is a toy example of a python script that instantiates and trains a PyTorch neural network on the FashionMNIST dataset

Ramón Casero 1 Jan 07, 2022
The repository offers the official implementation of our BMVC 2021 paper in PyTorch.

CrossMLP Cascaded Cross MLP-Mixer GANs for Cross-View Image Translation Bin Ren1, Hao Tang2, Nicu Sebe1. 1University of Trento, Italy, 2ETH, Switzerla

Bingoren 16 Jul 27, 2022
PyTorch Implementation for AAAI'21 "Do Response Selection Models Really Know What's Next? Utterance Manipulation Strategies for Multi-turn Response Selection"

UMS for Multi-turn Response Selection Implements the model described in the following paper Do Response Selection Models Really Know What's Next? Utte

Taesun Whang 47 Nov 22, 2022
This library provides an abstraction to perform Model Versioning using Weight & Biases.

Description This library provides an abstraction to perform Model Versioning using Weight & Biases. Features Version a new trained model Promote a mod

Hector Lopez Almazan 2 Jan 28, 2022
DIRL: Domain-Invariant Representation Learning

DIRL: Domain-Invariant Representation Learning Domain-Invariant Representation Learning (DIRL) is a novel algorithm that semantically aligns both the

Ajay Tanwani 30 Nov 07, 2022
JUSTICE: A Benchmark Dataset for Supreme Court’s Judgment Prediction

JUSTICE: A Benchmark Dataset for Supreme Court’s Judgment Prediction CSCI 544 Final Project done by: Mohammed Alsayed, Shaayan Syed, Mohammad Alali, S

Smit Patel 3 Dec 28, 2022
🔥3D-RecGAN in Tensorflow (ICCV Workshops 2017)

3D Object Reconstruction from a Single Depth View with Adversarial Learning Bo Yang, Hongkai Wen, Sen Wang, Ronald Clark, Andrew Markham, Niki Trigoni

Bo Yang 125 Nov 26, 2022
Robbing the FED: Directly Obtaining Private Data in Federated Learning with Modified Models

Robbing the FED: Directly Obtaining Private Data in Federated Learning with Modified Models This repo contains a barebones implementation for the atta

16 Dec 04, 2022
PyTorch implementations of algorithms for density estimation

pytorch-flows A PyTorch implementations of Masked Autoregressive Flow and some other invertible transformations from Glow: Generative Flow with Invert

Ilya Kostrikov 546 Dec 05, 2022
This python-based package offers a way of creating a parametric OpenMC plasma source from plasma parameters.

openmc-plasma-source This python-based package offers a way of creating a parametric OpenMC plasma source from plasma parameters. The OpenMC sources a

Fusion Energy 10 Oct 18, 2022
PyTorch implementation of popular datasets and models in remote sensing

PyTorch Remote Sensing (torchrs) (WIP) PyTorch implementation of popular datasets and models in remote sensing tasks (Change Detection, Image Super Re

isaac 222 Dec 28, 2022