Difference between bow and cbow
2022-06-30 09:47:00 【A grain of sand in the vast sea of people】
1. Bag-of-words Model
The Bag-of-words model, also called the "word bag" model, is a common document representation method in information retrieval. In IR, the BOW model assumes that a document can be treated simply as a collection of words, ignoring word order, grammar, and syntax: each word in the document occurs independently, regardless of whether other words occur. In other words, any word appearing anywhere in the document is selected independently, without being influenced by the semantics of the document. For example, consider two documents:
1:Bob likes to play basketball, Jim likes too.
2:Bob also likes to play football games.
Based on these two text documents, construct a dictionary:
Dictionary = {1: "Bob", 2: "likes", 3: "to", 4: "play", 5: "basketball", 6: "also", 7: "football", 8: "games", 9: "Jim", 10: "too"}
This dictionary contains 10 distinct words. Using the dictionary's index numbers, each of the two documents can be represented by a 10-dimensional vector, where each integer (0 to n) counts how many times the corresponding word appears in the document:
1:[1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
2:[1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
Each element of the vector is the number of times the corresponding dictionary word appears in the document, i.e. a histogram of word counts. Notice, however, that the document vectors do not preserve the order in which words appear in the original sentences. This is one of the drawbacks of the Bag-of-words model, although for many applications it matters little.
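As a minimal sketch (plain Python; the token lists are written out by hand to mirror the dictionary above, not produced by a tokenizer), the two count vectors can be built like this:

# Minimal bag-of-words sketch for the two example documents above.
docs = [
    ["Bob", "likes", "to", "play", "basketball", "Jim", "likes", "too"],
    ["Bob", "also", "likes", "to", "play", "football", "games"],
]

# Dictionary: word -> index (order matches the 10-word dictionary above)
dictionary = ["Bob", "likes", "to", "play", "basketball",
              "also", "football", "games", "Jim", "too"]
word_to_index = {w: i for i, w in enumerate(dictionary)}

def bow_vector(tokens):
    vec = [0] * len(dictionary)
    for t in tokens:
        vec[word_to_index[t]] += 1   # count occurrences of each dictionary word
    return vec

for d in docs:
    print(bow_vector(d))
# [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
# [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]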
2. Bag-of-words Model Application
Application in natural language processing
Imagine a huge document collection D containing M documents. After extracting all the words from the documents, we obtain a dictionary of N words. Using the Bag-of-words model, each document can then be represented as an N-dimensional vector, and computers are very good at processing numerical vectors. In this way, a computer can be used to classify massive numbers of documents.
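As an illustration (a sketch assuming scikit-learn is available; it is not part of the original post), an entire corpus can be turned into such count vectors and fed to any standard classifier:

# Sketch: BOW features for document classification with scikit-learn (assumed dependency).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

corpus = ["Bob likes to play basketball, Jim likes too.",
          "Bob also likes to play football games."]
labels = [0, 1]  # toy labels, just to show the pipeline

vectorizer = CountVectorizer()          # builds the dictionary and counts words
X = vectorizer.fit_transform(corpus)    # shape: (M documents, N dictionary words)

clf = LogisticRegression().fit(X, labels)
print(vectorizer.get_feature_names_out())
print(clf.predict(vectorizer.transform(["Jim likes basketball"])))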
Applications in computer vision
Bag of words is also very popular in computer vision for describing image features. The general idea is as follows. Suppose there are 5 classes of images, with 10 images per class. Each image is divided into patches (either by a rigid grid partition or around detected keypoints such as SIFT), so each image is represented by many patches, and each patch is described by a feature vector. Assuming SIFT descriptors, an image may have several hundred patches, and each patch's feature vector has dimension 128.
The next step is to build the Bag of words model. Suppose the dictionary size is 100, i.e. there are 100 visual "words". We can run the K-means algorithm on all the patches with k = 100; when K-means converges, we obtain the final centroid of each cluster. These 100 centroids (each of dimension 128) are the 100 words of the dictionary, and the dictionary is built.
How is the dictionary used once it is built? First, initialize a histogram h with 100 bins, all set to 0. Each image contains many patches, so for each patch we compute its distance to every centroid and find the nearest one, then add 1 to the corresponding bin of h. After processing all the patches of an image, we obtain a 100-bin histogram; normalizing it gives a 100-dimensional vector that represents the image. Once all the images have been processed this way, we can perform classification, clustering, training, and prediction. A rough sketch of this pipeline is given below.
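The following is a rough numpy/scikit-learn sketch of the bag-of-visual-words pipeline; the SIFT extraction step is replaced by random descriptors here, purely to illustrate the clustering and histogram steps (neither the data nor the exact parameters come from the original post):

# Bag-of-visual-words sketch: cluster patch descriptors into a 100-word visual
# dictionary, then describe each image by a normalized histogram of word counts.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_images, dict_size, feat_dim = 50, 100, 128

# Stand-in for SIFT: each image yields a few hundred 128-dim patch descriptors.
image_patches = [rng.normal(size=(rng.integers(200, 400), feat_dim))
                 for _ in range(n_images)]

# Build the dictionary: K-means over all patches; the 100 centroids are the "words".
kmeans = KMeans(n_clusters=dict_size, n_init=10).fit(np.vstack(image_patches))

def image_histogram(patches):
    words = kmeans.predict(patches)              # nearest centroid for each patch
    h = np.bincount(words, minlength=dict_size)  # 100-bin histogram
    return h / h.sum()                           # normalize

features = np.stack([image_histogram(p) for p in image_patches])
print(features.shape)  # (50, 100): one 100-dimensional vector per image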
3. CBOW (Continuous Bag-of-Words)
3.1 One-word context
Suppose the vocabulary size is V and the hidden layer has N neurons. With only one context word, the target word is predicted from that single context word, similar to a bigram model.
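The architecture figure from the original post is not reproduced here. As a summary in the standard word2vec notation (my own notation, not the original figure): the input is the one-hot vector x of the context word w_I, W is the V×N input weight matrix, and W' is the N×V output weight matrix. The forward pass is
h = W^T x = v_{w_I}                                   (hidden layer: the input vector of w_I)
u_j = v'_{w_j}^T h                                    (score for each vocabulary word w_j)
p(w_j | w_I) = softmax(u_j) = exp(u_j) / Σ_{j'=1..V} exp(u_{j'})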
For details on hierarchical softmax, see: "Understanding hierarchical softmax" (BGoodHabit's CSDN blog).
3.2 Multi-word context
The above describes the case of a single context word. When there are multiple context words, the vectors of the context words are simply averaged and used as the input. (The original post shows the multi-word architecture figure and its formula here, which are not reproduced.) For the forward- and back-propagation derivations, see: "The Continuous Bag-of-Words Model (CBOW) for word vectors" (guangyacyb's CSDN blog).
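In the standard formulation (my summary, not the missing figure), with C context words w_1, ..., w_C the hidden layer is simply the average of their input vectors:
h = (1/C) W^T (x_1 + x_2 + ... + x_C) = (1/C) (v_{w_1} + v_{w_2} + ... + v_{w_C})
and the output layer is the same softmax over the vocabulary as in the one-word case.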
4. Difference between BOW and CBOW
https://arxiv.org/pdf/1301.3781.pdf
The authors call CBOW a bag-of-words model because, like BOW, it is insensitive to word order: the context word vectors are averaged before being projected to the hidden layer, so the order of the words in the context does not matter.
Unlike BOW, however, CBOW uses a continuous distributed representation of the context, which is where the "continuous" in its name comes from.
5. CBOW Code
The CBOW model consists of three layers: an input layer, a projection layer, and an output layer. Compared with the NNLM, the hidden layer is removed.
CBOW predicts the center word from its context, somewhat like a cloze test. The size of the context window is a hyperparameter that you can tune yourself.
When building the dataset, following the CBOW setup, the context words are used as the input and the center word is used as the label.
import torch
from torch import nn, optim
import torch.nn.functional as F

CONTEXT_SIZE = 2  # number of context words on each side of the center word

raw_text = "We are about to study the idea of a computational process. Computational processes are abstract beings that inhabit computers. As they evolve, processes manipulate other abstract things called data. The evolution of a process is directed by a pattern of rules called a program. People create programs to direct processes. In effect, we conjure the spirits of the computer with our spells.".split(' ')

# Build the vocabulary and a word -> index mapping.
vocab = set(raw_text)
word_to_idx = {word: i for i, word in enumerate(vocab)}

# Build (context, target) pairs: the two words on each side form the context,
# the center word is the label.
data = []
for i in range(CONTEXT_SIZE, len(raw_text) - CONTEXT_SIZE):
    context = [raw_text[i - 2], raw_text[i - 1], raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]
    data.append((context, target))

class CBOW(nn.Module):
    def __init__(self, n_word, n_dim, context_size):
        super(CBOW, self).__init__()
        self.embedding = nn.Embedding(n_word, n_dim)
        # The 2*context_size context embeddings are concatenated and fed to an MLP.
        self.linear1 = nn.Linear(2 * context_size * n_dim, 128)
        self.linear2 = nn.Linear(128, n_word)

    def forward(self, x):
        x = self.embedding(x)            # (2*context_size, n_dim)
        x = x.view(1, -1)                # flatten into a single input vector
        x = F.relu(self.linear1(x), inplace=True)
        x = self.linear2(x)
        x = F.log_softmax(x, dim=1)      # log-probabilities over the vocabulary
        return x

model = CBOW(len(word_to_idx), 100, CONTEXT_SIZE)
if torch.cuda.is_available():
    model = model.cuda()

# The model already outputs log-probabilities, so NLLLoss is used here
# (CrossEntropyLoss would expect raw logits instead).
criterion = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-3)

for epoch in range(40):
    print('epoch {}'.format(epoch))
    print('*' * 10)
    running_loss = 0
    for context, target in data:
        context = torch.LongTensor([word_to_idx[w] for w in context])
        target = torch.LongTensor([word_to_idx[target]])
        if torch.cuda.is_available():
            context = context.cuda()
            target = target.cuda()
        out = model(context)
        loss = criterion(out, target)
        running_loss += loss.item()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print('loss: {:.6f}'.format(running_loss / len(data)))
Other
Distributional semantics (distributed representation: representing words by their context)
Distributional semantics says that a word's meaning is determined by the words around it (a word's meaning is given by the words that frequently appear close by).
"Distributed" also means that each dimension of a dense vector can encode multiple features, and a single feature can be spread across many dimensions.
We represent each word as a dense vector so that two words with similar surrounding words (context words) have similar dense vectors. Similarity between words can then be measured directly on these vectors (see the sketch below).
A good word representation captures both syntactic information (e.g. subject-predicate-object structure) and semantic information about a word.
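As a small illustration of measuring similarity with dense vectors, the sketch below computes the cosine similarity between two word vectors taken from the embedding matrix of the CBOW model trained above (the helper function and the chosen word pair are my own, not part of the original post):

# Cosine similarity between two word vectors from the trained CBOW model above.
# Relies on the `model` and `word_to_idx` objects defined in the training code.
import torch.nn.functional as F

def similarity(w1, w2):
    emb = model.embedding.weight.detach()          # (vocab_size, embedding_dim)
    v1, v2 = emb[word_to_idx[w1]], emb[word_to_idx[w2]]
    return F.cosine_similarity(v1, v2, dim=0).item()

print(similarity('processes', 'programs'))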
Reference material
An excellent introduction to the Bag-of-words model: "love pattern recognition", CSDN blog (bag-of-words).