[Tricks] WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach
2022-07-02 07:22:00 【lwgkzl】
Executive summary
This article introduces three small tricks for doing sentence embedding with BERT:
- Use the average of all token embeddings as the sentence representation, rather than only the representation at the [CLS] position.
- Superimpose (average) the sentence vectors from multiple BERT layers, rather than using only the last layer.
- When judging sentence similarity with cosine similarity, apply a Whitening operation to normalize the distribution of the sentence embedding vectors, which yields better sentence representations.
Model
The first two points do not involve any changes to the model, so only the third point, the Whitening operation, is briefly introduced here.
Starting point: cosine similarity is a meaningful measure of vector similarity only when the vectors are expressed in an orthonormal basis; if the basis vectors change, the meaning of each coordinate of the vector changes with them. The sentence vectors extracted from BERT, however, may not live in a coordinate system built on such an orthonormal basis.
Solution: normalize every vector into the coordinate system of a common orthonormal basis. A reasonable assumption is that a good set of sentence vectors produced by a pre-trained language model should be distributed fairly uniformly in every direction of the coordinate system, i.e. should be isotropic. Based on this assumption, we can normalize all sentence vectors to make them isotropic. A feasible approach is to transform the distribution of the sentence vectors toward a standard normal distribution, because the standard normal distribution is isotropic (a mathematical fact).
Practice:
The derivation is shown in a screenshot from Su Shen's blog: link
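The screenshot itself is not reproduced here. As a hedged reconstruction based on the code in the Code section below (and on the usual BERT-whitening formulation), the transform can be written as follows, where x_1, ..., x_N are the sentence vectors of the corpus:

\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
\Sigma = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^{\top}(x_i - \mu)

\Sigma = U \Lambda U^{\top} \ (\text{SVD of the covariance}), \qquad
W = U \Lambda^{-1/2}, \qquad
\tilde{x}_i = (x_i - \mu)\, W

The transformed vectors \tilde{x}_i then have zero mean and (approximately) identity covariance. The code below drops the 1/N factor in the covariance, which only rescales all vectors by a constant and does not affect cosine similarity.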
Experiments and conclusions
- Using the average of all token embeddings as the sentence representation works better than using only the representation at the [CLS] position.
- Superimposing (averaging) the vectors from layers 1, 2, and 12 of BERT gives the best results (a sketch of extracting such sentence vectors follows this list).


- The Whitening operation is effective for most pre-trained language models.
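Neither the paper nor this article fixes a single extraction recipe, so the following is only a minimal sketch of one way to obtain such sentence vectors, assuming the Hugging Face transformers library and bert-base-uncased; the function name get_sentence_embeddings and the exact pooling details are illustrative, not part of the original article.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def get_sentence_embeddings(sentences, layers=(1, 2, 12)):
    # Tokenize a batch of sentences with padding
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states is a tuple of (num_layers + 1) tensors, each (bs, seq_len, emb_dim);
    # index 0 is the embedding layer, index i is transformer layer i
    hidden_states = outputs.hidden_states
    # Superimpose (average) the selected layers
    layer_avg = torch.stack([hidden_states[i] for i in layers], dim=0).mean(dim=0)
    # Average over tokens, ignoring padding positions
    mask = inputs["attention_mask"].unsqueeze(-1).float()   # (bs, seq_len, 1)
    return (layer_avg * mask).sum(dim=1) / mask.sum(dim=1)  # (bs, emb_dim)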
Code
import torch

def whitening_torch_final(embeddings):
    # embeddings: (batch_size, emb_dim) sentence vectors produced by BERT
    mu = torch.mean(embeddings, dim=0, keepdim=True)
    # Unnormalized covariance of the centered vectors (emb_dim * emb_dim);
    # for torch >= 1.10, torch.cov(embeddings.t()) gives the same matrix up to a 1/(N-1) factor
    cov = torch.mm((embeddings - mu).t(), embeddings - mu)
    # SVD of the symmetric covariance: cov = u @ diag(s) @ vt
    u, s, vt = torch.svd(cov)
    # Whitening matrix W = U * diag(1 / sqrt(s))
    W = torch.mm(u, torch.diag(1 / torch.sqrt(s)))
    # Center the embeddings and map them into the whitened coordinate system
    embeddings = torch.mm(embeddings - mu, W)
    return embeddings
The vectors produced by the BERT encoder are passed to whitening_torch_final, which performs the whitening operation.
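As a quick sanity check (not from the original article), the function can be run on random vectors standing in for BERT sentence embeddings; in practice the sentence vectors of the whole corpus should be whitened together, since the statistics need many samples to be stable.

# Toy check with random stand-in vectors; replace with the (num_sentences, emb_dim)
# matrix of real sentence embeddings for the whole corpus
embeddings = torch.randn(1000, 768)
whitened = whitening_torch_final(embeddings)
print(whitened.shape)                              # torch.Size([1000, 768])
# Coordinates are decorrelated after whitening: whitened.t() @ whitened ≈ identity
print(torch.mm(whitened.t(), whitened)[:3, :3])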
Optimization
According to Su Shen's blog, keeping only the top N singular values (and their directions) from the SVD can further improve the results. Since only the top N components are kept, the principle is similar to PCA: it amounts to an extra dimensionality-reduction step on the sentence vectors.
Change the code to:
def whitening_torch_final(embeddings, keep_dim=256):
    mu = torch.mean(embeddings, dim=0, keepdim=True)                 # 1 * emb_dim
    cov = torch.mm((embeddings - mu).t(), embeddings - mu)           # emb_dim * emb_dim
    u, s, vt = torch.svd(cov)            # u: emb_dim * emb_dim, s: emb_dim (descending)
    W = torch.mm(u, torch.diag(1 / torch.sqrt(s)))                   # W: emb_dim * emb_dim
    # Keep only the first keep_dim columns (largest singular values): a PCA-like truncation
    embeddings = torch.mm(embeddings - mu, W[:, :keep_dim])
    return embeddings                                                # bs * keep_dim
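Again as an illustrative check (the 1000 random vectors are only a stand-in for real corpus embeddings):

embeddings = torch.randn(1000, 768)
reduced = whitening_torch_final(embeddings, keep_dim=256)
print(reduced.shape)   # torch.Size([1000, 256])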