[Recommendation System Basics] Sampling and Constructing Positive and Negative Samples

2022-07-07 23:58:00 Evening scenery at the top of the mountain

1. Review: negative sampling in word2vec

  • Negative sampling in word2vec: when training CBOW or skip-gram models with a large vocabulary and limited computing resources, the probability normalization in the output layer (softmax) drags down computational efficiency, and training slows to a crawl.
  • Negative sampling offers a different angle: given the current word and its context, the task is to maximize their co-occurrence probability.
    The multi-class classification problem is thereby reduced to a binary classification problem over the pair (w, c) (co-occur or not), which avoids the normalization over the large vocabulary.

Let $P(D=1 \mid w, c)$ denote the probability that $c$ and $w$ co-occur: $P(D=1 \mid w, c)=\sigma\left(v_{w} \cdot v_{c}^{\prime}\right)$.
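For reference, the per-pair objective maximized under negative sampling, in the standard SGNS formulation (with $k$ negative contexts $c_i$ drawn from a noise distribution $P_n$), is:

$\log \sigma\left(v_{w} \cdot v_{c}^{\prime}\right)+\sum_{i=1}^{k} \mathbb{E}_{c_{i} \sim P_{n}}\left[\log \sigma\left(-v_{w} \cdot v_{c_{i}}^{\prime}\right)\right]$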

1.1 The sliding window

To obtain a high-quality embedding for each word (words that appear in similar contexts should have similar vectors), word2vec slides a window over the text and, at each position, computes $P\left(w_{t+j} \mid w_{t}\right)$. Here is an example with window_size = 2.
(Figure: sliding-window example with window_size = 2)
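A minimal Python sketch of this pair extraction (the sentence here is made up purely for illustration):

sentence = ["the", "quick", "brown", "fox", "jumps"]
window_size = 2

pairs = []
for t, center in enumerate(sentence):
    # collect context words within window_size positions of the center word
    for j in range(-window_size, window_size + 1):
        if j != 0 and 0 <= t + j < len(sentence):
            pairs.append((center, sentence[t + j]))

print(pairs[:4])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]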

1.2 Objective function

(1) We start by multiplying together all of the probabilities $P\left(w_{t+j} \mid w_{t}\right)$ obtained above, over every center position $t$ and every offset $j$ within the window, so the likelihood contains two nested products:
(Figure: the likelihood $L(\theta)=\prod_{t=1}^{T} \prod_{-m \leq j \leq m, j \neq 0} P\left(w_{t+j} \mid w_{t} ; \theta\right)$)
(2) Since we usually minimize an objective function, we take the log, negate, and average, giving the modified objective:
(Figure: the objective $J(\theta)=-\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \leq j \leq m, j \neq 0} \log P\left(w_{t+j} \mid w_{t} ; \theta\right)$)
To compute the innermost probability $P\left(w_{t+j} \mid w_{t} ; \theta\right)$ in the loss above, each word is represented by two vectors:

  • when $w$ is the center word, it is represented as $v_w$;
  • when $w$ is a context word, it is represented as $u_w$.

Why use two vectors per word? Because it makes optimization easier.

1.3 Prediction function

So for a center word $c$ and a context word $o$: $P(o \mid c)=\frac{\exp \left(u_{o}^{T} v_{c}\right)}{\sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)}$. This is a softmax, which maps arbitrary values $x_i$ to a probability distribution:
(Figure: the softmax function $\operatorname{softmax}\left(x_{i}\right)=\frac{\exp \left(x_{i}\right)}{\sum_{j} \exp \left(x_{j}\right)}$)
The dot product in the numerator measures the similarity between $o$ and $c$, while the denominator normalizes over the entire vocabulary to produce a probability distribution.
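A minimal PyTorch sketch of this prediction function (the matrices and sizes are hypothetical placeholders):

import torch

V, d = 10000, 64                       # hypothetical vocabulary size and embedding dimension
U = torch.randn(V, d)                  # context ("outside") vectors u_w, one row per word
v_c = torch.randn(d)                   # center-word vector v_c

logits = U @ v_c                       # dot products u_w^T v_c for every word in the vocabulary
probs = torch.softmax(logits, dim=0)   # P(o | c): normalization over the whole vocabulary
print(probs.sum())                     # ~1.0; this full normalization is what negative sampling avoids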

2. Implementation of negative sampling in word2vec

Below is a skip-gram model with negative sampling. For each positive training sample, the corresponding negative samples must be generated according to some negative-sampling distribution. There are two ways to do this:

  • generate the negative samples while building the dataset, so no negatives need to be constructed during training; the drawback is that every epoch reuses the same negatives, which lacks diversity;
  • construct the negative samples during training, as done below: collate_fn in SGNSDataset performs negative sampling over the samples inside each batch.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset
from torch.nn.utils.rnn import pad_sequence
from tqdm.auto import tqdm
from utils import BOS_TOKEN, EOS_TOKEN, PAD_TOKEN
from utils import load_reuters, save_pretrained, get_loader, init_weights

class SGNSDataset(Dataset):
    def __init__(self, corpus, vocab, context_size=2, n_negatives=5, ns_dist=None):
        self.data = []
        self.bos = vocab[BOS_TOKEN]
        self.eos = vocab[EOS_TOKEN]
        self.pad = vocab[PAD_TOKEN]
        for sentence in tqdm(corpus, desc="Dataset Construction"):
            sentence = [self.bos] + sentence + [self.eos]
            for i in range(1, len(sentence)-1):
                # input: (w, context); the 0/1 label (co-occur or not) depends on whether the context word is observed or negatively sampled
                w = sentence[i]
                left_context_index = max(0, i - context_size)
                right_context_index = min(len(sentence), i + context_size)
                context = sentence[left_context_index:i] + sentence[i+1:right_context_index+1]
                context += [self.pad] * (2 * context_size - len(context))
                self.data.append((w, context))

        # number of negative samples (per context word)
        self.n_negatives = n_negatives
        # negative-sampling distribution; if ns_dist is None, use a uniform distribution
        self.ns_dist = ns_dist if ns_dist is not None else torch.ones(len(vocab))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        return self.data[i]

    def collate_fn(self, examples):
        words = torch.tensor([ex[0] for ex in examples], dtype=torch.long)
        contexts = torch.tensor([ex[1] for ex in examples], dtype=torch.long)
        batch_size, context_size = contexts.shape
        neg_contexts = []
        # perform negative sampling for each sample in the batch
        for i in range(batch_size):
            # ensure negatives exclude the current sample's context words (zero out their probabilities)
            ns_dist = self.ns_dist.index_fill(0, contexts[i], .0)
            neg_contexts.append(torch.multinomial(ns_dist, self.n_negatives * context_size, replacement=True))
        neg_contexts = torch.stack(neg_contexts, dim=0)
        return words, contexts, neg_contexts

class SGNSModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(SGNSModel, self).__init__()
        #  Word embedding 
        self.w_embeddings = nn.Embedding(vocab_size, embedding_dim)
        #  Context embedding 
        self.c_embeddings = nn.Embedding(vocab_size, embedding_dim)

    def forward_w(self, words):
        w_embeds = self.w_embeddings(words)
        return w_embeds

    def forward_c(self, contexts):
        c_embeds = self.c_embeddings(contexts)
        return c_embeds


def get_unigram_distribution(corpus, vocab_size):
    # compute the unigram distribution from the corpus
    token_counts = torch.tensor([0] * vocab_size)
    total_count = 0
    for sentence in corpus:
        total_count += len(sentence)
        for token in sentence:
            token_counts[token] += 1
    unigram_dist = torch.div(token_counts.float(), total_count)
    return unigram_dist

embedding_dim = 64
context_size = 2
hidden_dim = 128
batch_size = 1024
num_epoch = 10
n_negatives = 10

#  Read text data 
corpus, vocab = load_reuters()
# compute the unigram distribution
unigram_dist = get_unigram_distribution(corpus, len(vocab))
# derive the negative-sampling distribution from the unigram distribution: p(w) ** 0.75
negative_sampling_dist = unigram_dist ** 0.75
negative_sampling_dist /= negative_sampling_dist.sum()
# build the SGNS training dataset
dataset = SGNSDataset(
    corpus,
    vocab,
    context_size=context_size,
    n_negatives=n_negatives,
    ns_dist=negative_sampling_dist
)
data_loader = get_loader(dataset, batch_size)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SGNSModel(len(vocab), embedding_dim)
model.to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)

model.train()
for epoch in range(num_epoch):
    total_loss = 0
    for batch in tqdm(data_loader, desc=f"Training Epoch {epoch}"):
        words, contexts, neg_contexts = [x.to(device) for x in batch]
        optimizer.zero_grad()
        batch_size = words.shape[0]
        # look up embeddings for the words, contexts, and negative contexts in the batch
        word_embeds = model.forward_w(words).unsqueeze(dim=2)
        context_embeds = model.forward_c(contexts)
        neg_context_embeds = model.forward_c(neg_contexts)
        # log-likelihood of classifying positive samples correctly
        context_loss = F.logsigmoid(torch.bmm(context_embeds, word_embeds).squeeze(dim=2))
        context_loss = context_loss.mean(dim=1)
        # log-likelihood of classifying negative samples correctly
        neg_context_loss = F.logsigmoid(torch.bmm(neg_context_embeds, word_embeds).squeeze(dim=2).neg())
        neg_context_loss = neg_context_loss.view(batch_size, -1, n_negatives).sum(dim=2)
        neg_context_loss = neg_context_loss.mean(dim=1)
        # loss: negative log-likelihood
        loss = -(context_loss + neg_context_loss).mean()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Loss: {
      total_loss:.2f}")

# combine the word and context embedding matrices as the final pretrained word vectors
combined_embeds = model.w_embeddings.weight + model.c_embeddings.weight
save_pretrained(vocab, combined_embeds.data, "sgns.vec")

3. Recall basics in recommender systems

Recall model training and evaluation (and the corresponding loss functions):

  • Point-wise sample construction: BCE loss
  • Pair-wise sample construction: BPR / hinge loss
  • List-wise sample construction: softmax loss
  • Vectorized recall: using annoy (a sketch follows this list)
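For the vectorized-recall bullet, here is a minimal sketch using annoy (the item_embeddings and user_vec arrays are hypothetical placeholders for trained embeddings):

import numpy as np
from annoy import AnnoyIndex

dim = 64
item_embeddings = np.random.rand(1000, dim).astype("float32")   # hypothetical item vectors
user_vec = np.random.rand(dim).astype("float32")                # hypothetical user vector

index = AnnoyIndex(dim, "angular")           # angular distance, close to cosine similarity
for item_id, vec in enumerate(item_embeddings):
    index.add_item(item_id, vec)
index.build(10)                              # 10 trees; more trees -> better accuracy, slower build

top_items = index.get_nns_by_vector(user_vec, 100)   # approximate top-100 items for this user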

3.1 Three training methods in recall

For recall there are three general training paradigms: point-wise, pair-wise, and list-wise. In datawhale's RecHub, the parameter mode specifies the training paradigm, and each paradigm corresponds to a different loss.

The figure below illustrates the three training paradigms, where a denotes the user embedding, b+ the embedding of a positive sample, and b- the embedding of a negative sample.
(Figure: the three training paradigms: point-wise, pair-wise, and list-wise)

(1) Point-wise (mode = 0)

Idea: treat recall as binary classification and consider each positive or negative sample independently.

For a recall model:

  • Input: a pair <User, Item>;
  • Output: $P(\text{User}, \text{Item})$, the user's interest in the item;
  • Training objective: the output should be as close to 1 as possible for a positive sample and as close to 0 as possible for a negative sample. The most common loss is BCELoss (Binary Cross Entropy Loss); see the sketch below.
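A minimal point-wise sketch with BCE loss (the dot-product scoring and tensor shapes are illustrative, not RecHub's exact implementation):

import torch
import torch.nn.functional as F

user_emb = torch.randn(8, 32)                  # 8 <User, Item> pairs, 32-dim embeddings
item_emb = torch.randn(8, 32)
labels = torch.tensor([1., 0., 1., 0., 0., 1., 0., 0.])   # 1 = positive, 0 = sampled negative

scores = (user_emb * item_emb).sum(dim=1)      # interest score per pair (dot product)
loss = F.binary_cross_entropy_with_logits(scores, labels)  # BCE over independent samples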

(2) Pair-wise (mode = 1)

Idea: the user should be more interested in the positive sample than in the negative sample.

For a recall model:

  • Input: a triple <User, ItemPositive, ItemNegative>;
  • Output: the interest scores $P(\text{User}, \text{ItemPositive})$ and $P(\text{User}, \text{ItemNegative})$, the user's interest in the positive and the negative item;
  • Training objective: the interest score of the positive sample should be higher than that of the negative sample.

The torch-rechub framework uses BPRLoss (Bayesian Personalized Ranking Loss). The formula is given below; for details see 【Bayesian Personalized Ranking (BPR) algorithm summary】 (the formula in that link differs slightly from the one below, but the idea is the same):

$Loss=\frac{1}{N} \sum_{i=1}^{N}-\log \left(\operatorname{sigmoid}\left(\text{pos\_score}_{i}-\text{neg\_score}_{i}\right)\right)$
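A minimal sketch of this BPR loss (tensor names are illustrative):

import torch
import torch.nn.functional as F

def bpr_loss(pos_score, neg_score):
    # -log(sigmoid(pos_score - neg_score)), averaged over the batch
    return -F.logsigmoid(pos_score - neg_score).mean()

pos_score = torch.randn(8)    # interest scores for <User, ItemPositive>
neg_score = torch.randn(8)    # interest scores for <User, ItemNegative>
loss = bpr_loss(pos_score, neg_score)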

(3) List-wise (mode = 2)

Idea: the same as pair-wise, but implemented differently.

For a recall model:

  • Input: an (N+2)-tuple ⟨User, ItemPositive, ItemNeg_1, …, ItemNeg_N⟩;
  • Output: the user's interest scores for the 1 positive sample and the N negative samples;
  • Training objective: the interest score of the positive sample should be higher than those of all the negative samples.

The torch-rechub framework uses torch.nn.CrossEntropyLoss, i.e., the scores are passed through a softmax and the cross entropy is then taken; see the sketch below.
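A minimal list-wise sketch (scores are illustrative; the positive sample's score is placed in column 0):

import torch
import torch.nn.functional as F

batch_size, n_neg = 8, 4
scores = torch.randn(batch_size, 1 + n_neg)          # column 0: positive, columns 1..N: negatives
target = torch.zeros(batch_size, dtype=torch.long)   # the "correct class" is always index 0

loss = F.cross_entropy(scores, target)               # softmax over the 1 + N scores, then cross entropy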

PS: the list-wise approach here is easily confused with list-wise in ranking. They share the same name, but list-wise in ranking takes the order relation among samples into account; for example, ranking uses order-aware metrics such as MAP and NDCG for evaluation, whereas list-wise in matching does not consider order.

4. Negative sampling in recommender systems

During model training, both positive examples (items the user likes) and negative examples (items the user dislikes) must be fed to the model. However, because of how data is collected in real recommendation scenarios, explicit feedback (e.g., a user's rating of an item) is generally hard to obtain, whereas implicit feedback (items the user consumed or interacted with) is much easier to collect.

It is generally assumed that items the user has interacted with are positive examples, and a subset of the items the user has not interacted with is selected, by sampling, as negative examples.

Negative sampling: the process of selecting negative examples, according to some strategy, from the set of items the user has not interacted with.

  • Samples for DSSM recall:
    • Positive samples: items that were exposed to the user and clicked;
    • Negative samples: a common mistake is to directly use items that were exposed but not clicked, which causes the SSB (sample selection bias) problem, because online recall always retrieves from the full candidate item pool rather than only from exposed items.

The practice in the original DSSM paper: only positive samples are collected, denoted $D^{+}$. For a user $u_{1}$, the positive samples are the items it has clicked, and the negative samples are 4 items chosen at random from $D^{+}$ (excluding the items clicked by $u_{1}$); see the sketch below.
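A minimal sketch of this sampling scheme (the data structures and helper name are hypothetical):

import random

def sample_negatives(clicked_pool, user_clicked, k=4):
    # clicked_pool: all items appearing as positives (D+); user_clicked: items clicked by this user
    negatives = []
    while len(negatives) < k:
        candidate = random.choice(clicked_pool)
        if candidate not in user_clicked:
            negatives.append(candidate)
    return negatives

clicked_pool = list(range(1000))    # hypothetical D+
user_clicked = {3, 17, 256}         # items clicked by u1
print(sample_negatives(clicked_pool, user_clicked))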

4.1 Six common ways to construct negative samples

(1) Exposed but not clicked data

Using only this can lead to the SSB problem; whether that matters depends on the scenario.

(2) Randomly select negatives from the global pool

Randomly select negative samples from the full global item pool, for recall or pre-ranking.

(3) Randomly select negatives within the batch

Within the same training batch, items other than the positive item are used to construct negatives, which alleviates the SSB problem to some extent; see the sketch below.
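A minimal sketch of in-batch negatives (a sampled-softmax formulation, one common implementation rather than RecHub's exact code):

import torch
import torch.nn.functional as F

user_emb = torch.randn(8, 32)    # one <User, positive Item> pair per row of the batch
item_emb = torch.randn(8, 32)

logits = user_emb @ item_emb.t()            # (8, 8): diagonal = positives, off-diagonal = in-batch negatives
labels = torch.arange(user_emb.size(0))     # each user's own item is the "correct class"
loss = F.cross_entropy(logits, labels)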

(4) Randomly select negatives from exposure data

(5) Randomly select negatives based on popularity

The more popular an item is, the more likely it is to be a true negative for a user who has not clicked it.

(6) Select hard negatives

As a supplement to easy negatives, hard negatives are difficult negative samples: the match is moderate, so the user might or might not like the item, but in fact does not. See Airbnb's attempt at screening hard negatives (hard examples bring more loss and more information to the model).

  • Selection by business logic (taking Airbnb as an example):

    • i. add listings in the same city as the positive samples as negatives; this strengthens the regional similarity between positive and negative samples and makes the model harder to learn;
    • ii. add listings "rejected by the host" as negatives; this strengthens the similarity between positive and negative samples in terms of "matching the user's interests" and makes the model harder to learn.
  • Mining with a model (a sketch follows this list):

    • EBR and Baidu's Mobius take very similar approaches: <user, doc> pairs judged "not that similar" by the previous version of the recall model are used as extra negative samples to train the next version of the recall model.
    • EBR's approach: use items ranked 101-500 by the previous version of the model as hard negatives (negative samples are still mostly easy negatives; the empirical ratio in the paper is easy:hard = 100:1).
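A minimal sketch of the EBR-style mining step (the function and data layout are hypothetical; the 101-500 range and the easy:hard ratio follow the description above):

import random

def mine_hard_negatives(ranked_item_ids, k_start=100, k_end=500, n_hard=1):
    # ranked_item_ids: items for one user, sorted by the previous-version recall model's score (descending)
    hard_pool = ranked_item_ids[k_start:k_end]   # ranking positions 101-500
    return random.sample(hard_pool, n_hard)

ranked_item_ids = list(range(1000))              # hypothetical ranking from the old model
hard_negatives = mine_hard_negatives(ranked_item_ids)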


Copyright notice: this article was written by [Evening scenery at the top of the mountain]; please include the original link when reposting:
https://yzsam.com/2022/188/202207072148004881.html