[NLP] Putting vector retrieval models into production: bottlenecks and solutions!
2022-06-26 07:34:00 【Demeanor 78】
Author | Maple Xiao Qi
Edited by | NewBeeNLP
The huge memory consumption of dense vector retrieval has always been a bottleneck limiting its deployment. In fact, the 768-dimensional dense vectors generated by DPR contain a great deal of redundant information, and we can use compression methods to trade a small loss in accuracy for a significant reduction in memory usage.
Today I share a paper from EMNLP 2021 that examines three simple and effective compression methods:
Unsupervised PCA dimensionality reduction
Supervised fine-tuning with dimensionality reduction
Product quantization
The experiments show that simple PCA dimensionality reduction is highly cost-effective: it achieves a 48x compression ratio with less than 3% loss in top-100 accuracy, and a 96x compression ratio with less than 4% loss.

Simple and Effective Unsupervised Redundancy Elimination to Compress Dense Vectors for Passage Retrieval
Introduction
Over the past two years, dense retrieval models typified by DPR have been widely used in open-domain question answering and related fields. Although DPR provides more accurate retrieval results, the vector index it generates consumes a very large amount of memory.
For example, when indexing Wikipedia, a BM25 inverted index occupies only 2.4GB of memory, while the 768-dimensional dense vectors generated by DPR require 61GB, more than 24 times as much. Across multiple datasets, this extra 24x memory buys only a 2.5% average improvement in top-100 accuracy.
One can guess that the 768-dimensional dense vectors generated by DPR may be larger than necessary and contain a lot of redundancy, so we can try to trade a small loss of accuracy for a significant reduction in memory usage. To this end, the paper explores three simple and effective dense vector compression methods: principal component analysis (PCA), product quantization (PQ), and supervised dimensionality reduction.
Quantifying Redundancy
First, let's verify whether DPR's dense vectors really contain redundancy. The authors use two common measures: PCA's 「explained variance ratio」 and 「mutual information」.
PCA uses eigendecomposition to transform a set of possibly correlated vectors into a coordinate system of linearly independent eigenvectors, keeping the directions with large variance and discarding those with small variance. The explained variance ratio measures how well a PCA reduction works:

$$\mathrm{EVR}(k) = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i}$$

where $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$ are the eigenvalues (variances) sorted from largest to smallest, and $k$ and $d$ are the dimensions after and before the PCA reduction, respectively. This ratio tells us what fraction of the original vectors' variance is retained by keeping the top $k$ eigenvectors.
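As a minimal sketch (not from the paper; the function name and the synthetic data are illustrative), the explained variance ratio can be computed from the eigenvalues of the covariance matrix with numpy:

```python
import numpy as np

def explained_variance_ratio(vectors: np.ndarray, k: int) -> float:
    """Fraction of total variance kept by the top-k principal components."""
    centered = vectors - vectors.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # sorted largest -> smallest
    return eigvals[:k].sum() / eigvals.sum()

# Toy example: 1000 "768-dim" vectors that really live in a 200-dim subspace
rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 200))
mixing = rng.normal(size=(200, 768))   # embed 200 latent dims into 768
vecs = base @ mixing
ratio = explained_variance_ratio(vecs, k=200)
print(round(ratio, 3))  # close to 1.0: 200 components explain nearly all variance
```

Because the toy vectors have only 200 truly independent directions, the top 200 components capture essentially all the variance, which mirrors the redundancy argument the paper makes about DPR's 768-dimensional vectors.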
Another way to measure vector redundancy is to compute the mutual information between the query vectors $h_q$ and the passage vectors $h_p$:

$$I(h_q; h_p) = \mathbb{E}_{p(h_q, h_p)}\left[\log \frac{p(h_q, h_p)}{p(h_q)\,p(h_p)}\right]$$

This quantity is intractable to compute exactly, but DPR's contrastive training objective (an InfoNCE-style loss) gives a lower-bound estimate of it. The estimate is upper-bounded by $\log N$, where $N$ is the number of passages scored against each query, so to compare it with PCA's explained variance ratio we can normalize the mutual information to $\hat{I} = I(h_q; h_p) / \log N$.
After PCA reductions of varying strength, the explained variance ratio and normalized mutual information vary with vector dimension as shown in the figure below. We can see that around 200 dimensions is a sweet spot: 「reducing the 768-dimensional vectors to 200 dimensions retains 90% of the variance and 99% of the mutual information, while further reduction causes the amount of information to drop rapidly.」

Dense Vector Compression
Next, we try three simple ways of compressing the dense vectors:
「Supervised Approach:」 We can simply add two linear layers on top of the bi-encoder, one for queries and one for passages, to perform the dimensionality reduction. During training we freeze the lower (encoder) parameters and fine-tune only the linear layers. We can also add an orthogonalization loss that encourages the two projection matrices to be orthogonal, which keeps the scale of the dot-product similarity after reduction consistent with that before reduction.

「Unsupervised Approach:」 We can pool the query vectors and passage vectors together and fit a PCA transformation on this vector set. At inference time, the fitted PCA transformation is applied to reduce the dimensionality of the vectors generated by DPR.
「Product Quantization:」 We can also use product quantization to further compress the vectors. Its basic principle is to split a $d$-dimensional vector into $M$ sub-vectors, quantize each sub-vector with $k$-means, and store each one as a $b$-bit centroid index. For example, a 768-dimensional float32 vector occupies $768 \times 32$ bits; decomposing it into $M$ sub-vectors each stored in $b$ bits compresses it to $M \cdot b$ bits, so the average number of bits per dimension drops from 32 to $M \cdot b / 768$.
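To make the mechanism concrete, here is a toy product quantizer in numpy. This is illustrative only: the paper provides no code, the function names are my own, and a production system would use a library such as faiss rather than this tiny hand-rolled k-means.

```python
import numpy as np

def pq_train(vectors, n_sub, n_centroids=4, n_iter=20, seed=0):
    """Train a product quantizer: independent k-means per sub-vector block
    (a minimal Lloyd's loop, for illustration only)."""
    rng = np.random.default_rng(seed)
    sub_d = vectors.shape[1] // n_sub
    codebooks = []
    for m in range(n_sub):
        block = vectors[:, m * sub_d:(m + 1) * sub_d]
        centroids = block[rng.choice(len(block), n_centroids, replace=False)].copy()
        for _ in range(n_iter):
            # Assign each sub-vector to its nearest centroid, then recenter
            assign = ((block[:, None, :] - centroids[None]) ** 2).sum(-1).argmin(1)
            for c in range(n_centroids):
                members = block[assign == c]
                if len(members):
                    centroids[c] = members.mean(0)
        codebooks.append(centroids)
    return codebooks

def pq_encode(vectors, codebooks):
    """Store each sub-vector as the index of its nearest centroid."""
    sub_d = vectors.shape[1] // len(codebooks)
    codes = np.empty((len(vectors), len(codebooks)), dtype=np.uint8)
    for m, cb in enumerate(codebooks):
        block = vectors[:, m * sub_d:(m + 1) * sub_d]
        codes[:, m] = ((block[:, None, :] - cb[None]) ** 2).sum(-1).argmin(1)
    return codes

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 8)).astype(np.float32)
codebooks = pq_train(data, n_sub=4, n_centroids=4)  # 4 centroids -> 2-bit codes
codes = pq_encode(data, codebooks)
print(codes.shape)  # (200, 4): each 8-dim vector stored as four 2-bit indices
```

In this toy setup, each 8-dimensional float32 vector (256 bits) shrinks to 4 sub-vectors x 2 bits = 8 bits, a 32x compression, ignoring the small shared codebooks.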
Experiment & Results
The authors test DPR's top-$k$ accuracy (the proportion of queries for which at least one of the top $k$ retrieved passages is correct) on NQ, TriviaQA, WQ, CuratedTREC, and SQuAD; see the original paper for the experimental details.
Dimensionality Reduction
Let's first compare supervised and unsupervised dimensionality reduction. Here PCA-* denotes unsupervised PCA reduction, Linear-* denotes supervised fine-tuning with the lower parameters frozen (only the linear layers are tuned), and DPR-* denotes joint fine-tuning of the linear layers with the lower parameters unfrozen. We can see that when the reduced dimension is relatively large, unsupervised PCA works better; when the dimension is small, supervised fine-tuning performs better, but by then the model's performance has also dropped significantly. So overall, unsupervised PCA is the more practical choice.
Although in theory Linear-* could learn the same linear mapping that PCA-* fits, it is not easy to make its parameters converge to such a good solution. In addition, freezing the lower parameters (Linear-*) yields better results than not freezing them (DPR-*), which is also caused by insufficient training. To sum up, in most cases a simple linear PCA transformation is enough to obtain a very good compression ratio.
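The unsupervised recipe above, fitting PCA on the pooled vectors and projecting at inference time, can be sketched in a few lines of numpy (the function names and synthetic data are my own, not the paper's code):

```python
import numpy as np

def fit_pca(vectors: np.ndarray, k: int):
    """Fit a k-dim PCA projection on the pooled query + passage vectors."""
    mean = vectors.mean(axis=0)
    cov = np.cov(vectors - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # ascending eigenvalue order
    components = eigvecs[:, ::-1][:, :k]       # top-k eigenvectors as columns
    return mean, components

def pca_reduce(vectors: np.ndarray, mean, components):
    """Project new DPR vectors into the reduced space at inference time."""
    return (vectors - mean) @ components

rng = np.random.default_rng(0)
pooled = rng.normal(size=(500, 768))   # stand-in for pooled query/passage vectors
mean, comps = fit_pca(pooled, k=200)
reduced = pca_reduce(pooled[:10], mean, comps)
print(reduced.shape)  # (10, 200)
```

Because the projection is linear, the reduced dot-product similarity approximates the original one, which is why this works as a drop-in index compression step.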

Product Quantization
Product quantization is a very effective compression method. Building on the results above, the authors further apply product quantization; the results are shown in the table below, where PQ-$b$ means each dimension occupies $b$ bits after quantization. As the table shows, PQ-1 compresses too aggressively: although its compression ratio is twice that of PQ-2, its accuracy drops by more than twice as much, which is very uneconomical.
To sum up, we consider PCA dimensionality reduction plus product quantization the best compression scheme. If we cap the average accuracy drop at 4%, we can compress the dense vectors 96x with PCA-128 + PQ-2, shrinking the memory footprint of the Wikipedia vector index from 61GB to 642MB and cutting retrieval time from 7570ms to 416ms.
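The 96x figure can be sanity-checked with simple arithmetic, assuming the original vectors are stored as float32 (the index sizes are from the paper; the byte math below is my own):

```python
# Original DPR vector: 768 dims x 32 bits = 3072 bytes per passage
orig_bits = 768 * 32
# PCA-128 + PQ-2: 128 remaining dims at 2 bits per dimension = 32 bytes
compressed_bits = 128 * 2
ratio = orig_bits // compressed_bits
print(ratio)  # 96
# Index level: 61 GB / 96 is about 651 MB, close to the reported 642 MB
print(round(61 * 1024 / ratio))  # 651
```

The small gap between 651MB and the reported 642MB is expected, since the raw index size is itself a rounded figure.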

Hybrid Search
A large body of work has shown that combining sparse retrieval (BM25) with dense retrieval improves performance. The simplest way is a linear weighted sum of the two scores:

$$s(q, p) = s_{\text{dense}}(q, p) + \alpha \cdot s_{\text{sparse}}(q, p)$$

Here we simply give the dense and sparse scores equal weight.
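As a minimal sketch of this score fusion (the function is hypothetical, and real systems typically normalize the two score distributions before summing):

```python
def hybrid_rank(dense, sparse, alpha=1.0, k=3):
    """Fuse two score dicts (doc_id -> score) by s = s_dense + alpha * s_sparse.
    Docs missing from one retriever contribute 0 from that side."""
    ids = set(dense) | set(sparse)
    fused = {i: dense.get(i, 0.0) + alpha * sparse.get(i, 0.0) for i in ids}
    return sorted(fused, key=fused.get, reverse=True)[:k]

# d2 is mediocre in both lists but wins once the scores are summed
print(hybrid_rank({"d1": 0.9, "d2": 0.5}, {"d2": 0.7, "d3": 0.6}))
# ['d2', 'd1', 'd3']
```

This illustrates why hybrid search helps: documents that both retrievers find moderately relevant are promoted above documents that only one retriever likes.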
Adding hybrid retrieval further improves performance. The figure below plots retrieval accuracy against index size for the different compression methods; on each curve the points from left to right are PQ-1, PQ-2, and w/o PQ, and the black dotted line is the Pareto frontier. The original 768-dimensional DPR vectors do not lie on the Pareto frontier, which shows there is room for improvement. Specifically, 「the PCA-256 + PQ-2 + hybrid search strategy reduces the 61GB index to 3.7GB, and its top-100 accuracy is even better than the original DPR's (+0.2%).」

Discussion
Among the bottlenecks restricting the deployment of dense retrieval models are inference latency and memory consumption. This paper shows experimentally that simple principal component analysis plus product quantization, supplemented by sparse retrieval, can greatly reduce memory usage and speed up retrieval while preserving accuracy, which makes it quite practical.