Difference between bow and cbow
2022-06-30 09:47:00 【A grain of sand in the vast sea of people】
1. Bag-of-words Model
The Bag-of-words model, also called the "word bag" model, is a common document representation method in information retrieval. In IR, the BOW model assumes that a document can be treated simply as a collection of words, ignoring word order, grammar, and syntax: each word in the document occurs independently, regardless of whether other words occur. In other words, any word appearing anywhere in the document is selected independently, without being influenced by the semantics of the document. For example, consider two documents:
1:Bob likes to play basketball, Jim likes too.
2:Bob also likes to play football games.
Based on these two text documents, construct a dictionary:
Dictionary = {1: "Bob", 2: "likes", 3: "to", 4: "play", 5: "basketball", 6: "also", 7: "football", 8: "games", 9: "Jim", 10: "too"}
This dictionary contains 10 distinct words. Using the dictionary's index numbers, each of the two documents can be represented by a 10-dimensional vector, where each integer (0 to n) counts how many times the corresponding word appears in the document:
1:[1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
2:[1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
Each element of the vector is the number of times the corresponding dictionary word appears in the document, i.e. a histogram of word counts. Notice, however, that the document vectors do not preserve the order in which words appear in the original sentences. This is one of the drawbacks of the Bag-of-words model, although for many applications it matters little.
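As a minimal sketch (plain Python; the token lists are written out by hand to mirror the dictionary above, not produced by a tokenizer), the two count vectors can be built like this:

# Minimal bag-of-words sketch for the two example documents above.
docs = [
    ["Bob", "likes", "to", "play", "basketball", "Jim", "likes", "too"],
    ["Bob", "also", "likes", "to", "play", "football", "games"],
]

# Dictionary: word -> index (order matches the 10-word dictionary above)
dictionary = ["Bob", "likes", "to", "play", "basketball",
              "also", "football", "games", "Jim", "too"]
word_to_index = {w: i for i, w in enumerate(dictionary)}

def bow_vector(tokens):
    vec = [0] * len(dictionary)
    for t in tokens:
        vec[word_to_index[t]] += 1   # count occurrences of each dictionary word
    return vec

for d in docs:
    print(bow_vector(d))
# [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
# [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]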
2. Bag-of-words Model Application
Application in natural language processing
Imagine a huge document collection D containing M documents. After extracting all the words from the documents, we obtain a dictionary of N words. Using the Bag-of-words model, each document can then be represented as an N-dimensional vector, and computers are very good at processing numerical vectors. In this way, a computer can be used to classify massive numbers of documents.
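As an illustration (a sketch assuming scikit-learn is available; it is not part of the original post), an entire corpus can be turned into such count vectors and fed to any standard classifier:

# Sketch: BOW features for document classification with scikit-learn (assumed dependency).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

corpus = ["Bob likes to play basketball, Jim likes too.",
          "Bob also likes to play football games."]
labels = [0, 1]  # toy labels, just to show the pipeline

vectorizer = CountVectorizer()          # builds the dictionary and counts words
X = vectorizer.fit_transform(corpus)    # shape: (M documents, N dictionary words)

clf = LogisticRegression().fit(X, labels)
print(vectorizer.get_feature_names_out())
print(clf.predict(vectorizer.transform(["Jim likes basketball"])))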
Applications in computer vision
Bag of words is also very popular in computer vision for describing image features. The general idea is as follows. Suppose there are 5 classes of images, with 10 images per class. Each image is divided into patches (either by a rigid grid partition or around detected keypoints such as SIFT), so each image is represented by many patches, and each patch is described by a feature vector. Assuming SIFT descriptors, an image may have several hundred patches, and each patch's feature vector has dimension 128.
The next step is to build the Bag of words model. Suppose the dictionary size is 100, i.e. there are 100 visual "words". We can run the K-means algorithm on all the patches with k = 100; when K-means converges, we obtain the final centroid of each cluster. These 100 centroids (each of dimension 128) are the 100 words of the dictionary, and the dictionary is built.
How is the dictionary used once it is built? First, initialize a histogram h with 100 bins, all set to 0. Each image contains many patches, so for each patch we compute its distance to every centroid and find the nearest one, then add 1 to the corresponding bin of h. After processing all the patches of an image, we obtain a 100-bin histogram; normalizing it gives a 100-dimensional vector that represents the image. Once all the images have been processed this way, we can perform classification, clustering, training, and prediction. A rough sketch of this pipeline is given below.
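The following is a rough numpy/scikit-learn sketch of the bag-of-visual-words pipeline; the SIFT extraction step is replaced by random descriptors here, purely to illustrate the clustering and histogram steps (neither the data nor the exact parameters come from the original post):

# Bag-of-visual-words sketch: cluster patch descriptors into a 100-word visual
# dictionary, then describe each image by a normalized histogram of word counts.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_images, dict_size, feat_dim = 50, 100, 128

# Stand-in for SIFT: each image yields a few hundred 128-dim patch descriptors.
image_patches = [rng.normal(size=(rng.integers(200, 400), feat_dim))
                 for _ in range(n_images)]

# Build the dictionary: K-means over all patches; the 100 centroids are the "words".
kmeans = KMeans(n_clusters=dict_size, n_init=10).fit(np.vstack(image_patches))

def image_histogram(patches):
    words = kmeans.predict(patches)              # nearest centroid for each patch
    h = np.bincount(words, minlength=dict_size)  # 100-bin histogram
    return h / h.sum()                           # normalize

features = np.stack([image_histogram(p) for p in image_patches])
print(features.shape)  # (50, 100): one 100-dimensional vector per image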
3. CBOW (Continuous Bag-of-Words)
3.1 One-word context
Suppose the vocabulary size is V and the hidden layer has N neurons. With only one context word, the target word is predicted from that single context word, similar to a bigram model.
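The architecture figure from the original post is not reproduced here. As a summary in the standard word2vec notation (my own notation, not the original figure): the input is the one-hot vector x of the context word w_I, W is the V×N input weight matrix, and W' is the N×V output weight matrix. The forward pass is
h = W^T x = v_{w_I}                                   (hidden layer: the input vector of w_I)
u_j = v'_{w_j}^T h                                    (score for each vocabulary word w_j)
p(w_j | w_I) = softmax(u_j) = exp(u_j) / Σ_{j'=1..V} exp(u_{j'})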
For details on hierarchical softmax, see: "Understanding hierarchical softmax" (BGoodHabit's CSDN blog).
3.2 Multi-word context
The above describes the case of a single context word. When there are multiple context words, the vectors of the context words are simply averaged and used as the input. (The original post shows the multi-word architecture figure and its formula here, which are not reproduced.) For the forward- and back-propagation derivations, see: "The Continuous Bag-of-Words Model (CBOW) for word vectors" (guangyacyb's CSDN blog).
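In the standard formulation (my summary, not the missing figure), with C context words w_1, ..., w_C the hidden layer is simply the average of their input vectors:
h = (1/C) W^T (x_1 + x_2 + ... + x_C) = (1/C) (v_{w_1} + v_{w_2} + ... + v_{w_C})
and the output layer is the same softmax over the vocabulary as in the one-word case.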
4. Difference between BOW and CBOW
https://arxiv.org/pdf/1301.3781.pdf
The authors call CBOW a bag-of-words model because, like BOW, it is insensitive to word order: the context word vectors are averaged before being projected to the hidden layer, so the order of the words in the context does not matter.
Unlike BOW, however, CBOW uses a continuous distributed representation of the context, which is where the "continuous" in its name comes from.
5. CBOW Code
The CBOW model consists of three layers: an input layer, a projection layer, and an output layer. Compared with the NNLM, the hidden layer is removed.
CBOW predicts the center word from its context, somewhat like a cloze test. The size of the context window is a hyperparameter that you can tune yourself.
When building the dataset, following the CBOW setup, the context words are used as the input and the center word is used as the label.
import torch
from torch import nn, optim
import torch.nn.functional as F

CONTEXT_SIZE = 2  # number of context words on each side of the center word

raw_text = "We are about to study the idea of a computational process. Computational processes are abstract beings that inhabit computers. As they evolve, processes manipulate other abstract things called data. The evolution of a process is directed by a pattern of rules called a program. People create programs to direct processes. In effect, we conjure the spirits of the computer with our spells.".split(' ')

# Build the vocabulary and a word -> index mapping.
vocab = set(raw_text)
word_to_idx = {word: i for i, word in enumerate(vocab)}

# Build (context, target) pairs: the two words on each side form the context,
# the center word is the label.
data = []
for i in range(CONTEXT_SIZE, len(raw_text) - CONTEXT_SIZE):
    context = [raw_text[i - 2], raw_text[i - 1], raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]
    data.append((context, target))

class CBOW(nn.Module):
    def __init__(self, n_word, n_dim, context_size):
        super(CBOW, self).__init__()
        self.embedding = nn.Embedding(n_word, n_dim)
        # The 2*context_size context embeddings are concatenated and fed to an MLP.
        self.linear1 = nn.Linear(2 * context_size * n_dim, 128)
        self.linear2 = nn.Linear(128, n_word)

    def forward(self, x):
        x = self.embedding(x)            # (2*context_size, n_dim)
        x = x.view(1, -1)                # flatten into a single input vector
        x = F.relu(self.linear1(x), inplace=True)
        x = self.linear2(x)
        x = F.log_softmax(x, dim=1)      # log-probabilities over the vocabulary
        return x

model = CBOW(len(word_to_idx), 100, CONTEXT_SIZE)
if torch.cuda.is_available():
    model = model.cuda()

# The model already outputs log-probabilities, so NLLLoss is used here
# (CrossEntropyLoss would expect raw logits instead).
criterion = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-3)

for epoch in range(40):
    print('epoch {}'.format(epoch))
    print('*' * 10)
    running_loss = 0
    for context, target in data:
        context = torch.LongTensor([word_to_idx[w] for w in context])
        target = torch.LongTensor([word_to_idx[target]])
        if torch.cuda.is_available():
            context = context.cuda()
            target = target.cuda()
        out = model(context)
        loss = criterion(out, target)
        running_loss += loss.item()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print('loss: {:.6f}'.format(running_loss / len(data)))
Other
Distributional semantics (distributed representation: representing words by their context)
Distributional semantics says that a word's meaning is determined by the words around it (a word's meaning is given by the words that frequently appear close by).
"Distributed" also means that each dimension of a dense vector can encode multiple features, and a single feature can be spread across many dimensions.
We represent each word as a dense vector so that two words with similar surrounding words (context words) have similar dense vectors. Similarity between words can then be measured directly on these vectors (see the sketch below).
A good word representation captures both syntactic information (e.g. subject-predicate-object structure) and semantic information about a word.
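As a small illustration of measuring similarity with dense vectors, the sketch below computes the cosine similarity between two word vectors taken from the embedding matrix of the CBOW model trained above (the helper function and the chosen word pair are my own, not part of the original post):

# Cosine similarity between two word vectors from the trained CBOW model above.
# Relies on the `model` and `word_to_idx` objects defined in the training code.
import torch.nn.functional as F

def similarity(w1, w2):
    emb = model.embedding.weight.detach()          # (vocab_size, embedding_dim)
    v1, v2 = emb[word_to_idx[w1]], emb[word_to_idx[w2]]
    return F.cosine_similarity(v1, v2, dim=0).item()

print(similarity('processes', 'programs'))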
Reference material
An excellent introduction to the Bag-of-words model: "love pattern recognition", CSDN blog (bag-of-words).