Week 6 Learning Representation: Word Embedding (symbolic → numeric)
2022-07-26 05:09:00 【Jinzhou hungry bully】
One. Learning representation in machine learning and deep learning
1. Review of RNN


2. Comparison of traditional and modern feature extraction



Two. Word embedding
1. Definition of word embedding
- Embedding is a term from mathematics: it refers to mapping one object X into another object Y, i.e., a map f : X → Y, for example embedding the rational numbers into the real numbers.
- Word embedding is the collective name in NLP for a family of language modeling and feature learning techniques that map the words or phrases in a vocabulary to vectors of real numbers.
- Word embedding learns, automatically from data, a mapping f from the input space to a distributed-representation space.
- The simplest word embedding methods are the one-hot representation based on the bag-of-words (BOW) model and the co-occurrence matrix.
This process is called word embedding: high-dimensional word vectors are embedded into a lower-dimensional space, as shown in the figure.
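To make one of these simple representations concrete, here is a minimal Python sketch that builds a word-level co-occurrence matrix from a toy corpus (the corpus and window size are invented purely for illustration):

```python
import numpy as np

# Toy corpus and window size, chosen only for illustration.
corpus = [
    "i like deep learning",
    "i like nlp",
    "i enjoy flying",
]
window = 1  # count neighbours within +/- 1 position

# Build the vocabulary and an index for each word.
vocab = sorted({w for sent in corpus for w in sent.split()})
index = {w: i for i, w in enumerate(vocab)}

# counts[i, j] = how often word j appears within `window` positions of word i.
counts = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in corpus:
    words = sent.split()
    for pos, w in enumerate(words):
        left = max(0, pos - window)
        for ctx in words[left:pos] + words[pos + 1:pos + window + 1]:
            counts[index[w], index[ctx]] += 1

print(vocab)
print(counts)  # each row is a crude, sparse word vector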

2. One-hot representation
2.1 Definition
One-hot encoding, also known as one-bit-effective encoding, uses an N-bit status register to encode N states: each state has its own register bit, and at any moment only one bit is active. For example, suppose we have four samples (rows), each with three features (columns), as shown in the figure:

feature_1 has two possible values, e.g., male/female; here male is represented by 1 and female by 2. feature_2 and feature_3 each have 4 possible values (states). One-hot encoding guarantees that, for every feature of every sample, exactly one bit is 1 and all the others are 0. The one-hot encoding of the states above is shown in the figure below:

Consider three features:
- ["male", "female"]
- ["from Europe", "from US", "from Asia"]
- ["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]
After replacing them with one-hot codes, we get:
- feature1=[01,10]
- feature2=[001,010,100]
- feature3=[0001,0010,0100,1000]
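Below is a minimal sketch of how such a one-hot code could be produced by hand (the category ordering used here is an assumption; a library such as scikit-learn's OneHotEncoder would choose its own ordering):

```python
# Categories for each feature, in the order listed above (ordering is an assumption).
features = {
    "feature1": ["male", "female"],
    "feature2": ["from Europe", "from US", "from Asia"],
    "feature3": ["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"],
}

def one_hot(value, categories):
    """Return a list with a single 1 at the position of `value`, 0 elsewhere."""
    vec = [0] * len(categories)
    vec[categories.index(value)] = 1
    return vec

# Encode one sample: concatenating the per-feature codes gives a sparse
# binary vector of length 2 + 3 + 4 = 9.
sample = {"feature1": "male", "feature2": "from US", "feature3": "uses Safari"}
encoded = []
for name, categories in features.items():
    encoded += one_hot(sample[name], categories)

print(encoded)  # [1, 0, 0, 1, 0, 0, 0, 1, 0]
```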
2.2 Advantages and disadvantages
Advantages:
- It solves the problem that classifiers have difficulty handling discrete (categorical) data;
- To some extent it also plays the role of expanding the feature space.
Disadvantages:
- As a representation of text features it has several shortcomings:
- First, it is a bag-of-words model that ignores the order of words (the order of words in a text is also important information);
- Second, it assumes that words are independent of each other (in most cases, words influence one another);
- Finally, the features it produces are discrete and sparse.
Three. Word2vec
1. Definition of Word2vec
The word2vec model is in fact a simplified neural network. Word2vec uses a neural network with a single hidden layer (e.g., CBOW) to map sparse one-hot word vectors to dense n-dimensional vectors (n is usually a few hundred). To speed up training, it relies on tricks such as hierarchical softmax, negative sampling, and Huffman trees.
In NLP, the most fine-grained objects are words. If we want to do part-of-speech tagging, the usual approach is to collect a set of samples (x, y), where x is a word and y is its part of speech, and then to find a mapping x -> y; traditional methods include Bayes, SVM, and so on. But mathematical models generally take numerical input, whereas the words used in NLP are abstract, symbolic summaries created by humans (Chinese, English, Latin, and so on). They therefore need to be converted into numerical form, or in other words, embedded into a mathematical space. This way of embedding is called word embedding, and Word2vec is one kind of word embedding.

The input is a one-hot vector; the hidden layer has no activation function, i.e., it is purely linear; the output layer has the same dimensionality as the input layer and uses softmax regression. Once the model is trained, we do not use the trained model itself for new tasks; what we really need are the parameters the model has learned from the training data, such as the hidden-layer weight matrix. How does the model define its input and output? There are generally two variants: CBOW (Continuous Bag-of-Words) and Skip-Gram.
- In the CBOW model, the training input is the word vectors of the context words surrounding a target word, and the output is the word vector of that target word. CBOW is better suited to small corpora, while Skip-Gram performs better on large corpora.
- The Skip-Gram model works the other way around: the input is the word vector of a specific word, and the output is the word vectors of its context words.
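As a minimal illustration of the two variants, here is a sketch using the gensim library (assuming gensim 4.x is installed; the toy corpus is invented). The sg parameter switches between CBOW and Skip-Gram, while negative and hs correspond to the negative-sampling and hierarchical-softmax tricks mentioned earlier:

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences, made up for illustration.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

# CBOW (sg=0): predict the centre word from its context, with negative sampling.
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                sg=0, negative=5, epochs=50)

# Skip-Gram (sg=1): predict the context from the centre word,
# here with hierarchical softmax instead of negative sampling.
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                    sg=1, hs=1, negative=0, epochs=50)

print(cbow.wv["cat"].shape)                     # a dense 50-dimensional vector
print(skipgram.wv.most_similar("cat", topn=3))  # nearest neighbours of "cat"
```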
Word2Vec actually consists of two parts: first build and train the model, then obtain the embedded word vectors from it. The whole modeling process is very similar in spirit to an auto-encoder: we build a neural network on the training data, and once the model is trained, we do not use it directly for new tasks; what we really need are the parameters the model has learned from the training data, such as the hidden-layer weight matrix — in Word2Vec these weights are exactly the "word vectors" we are trying to learn.
The approach described above also appears in unsupervised feature learning, most typically in the auto-encoder: the input is encoded and compressed in the hidden layer, and then decoded at the output layer to reconstruct the original input; after training, the output layer is "cut off" and only the hidden layer is kept.
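To make the "train, then keep only the hidden-layer weights" idea concrete, here is a minimal NumPy sketch of the architecture described above (one-hot input, linear hidden layer, softmax output over the vocabulary); the vocabulary size and embedding dimension are arbitrary, the weights are random, and no training loop is shown:

```python
import numpy as np

vocab_size, embed_dim = 10, 4                  # arbitrary sizes for illustration

rng = np.random.default_rng(0)
W_in = rng.normal(size=(vocab_size, embed_dim))   # hidden-layer weight matrix
W_out = rng.normal(size=(embed_dim, vocab_size))  # output-layer weight matrix

def forward(word_index):
    """One forward pass: one-hot input -> linear hidden layer -> softmax output."""
    x = np.zeros(vocab_size)
    x[word_index] = 1.0                        # one-hot input vector
    h = x @ W_in                               # linear hidden layer, no activation
    scores = h @ W_out                         # output layer, same size as the input
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()                     # softmax over the whole vocabulary

probs = forward(3)
print(probs.shape, probs.sum())                # (10,) 1.0 (approximately)

# After training, the output layer is discarded: row i of W_in is kept
# as the dense word vector of word i.
print(W_in[3])
```

Because the input is one-hot, multiplying it by W_in simply selects one row of the matrix, which is why the rows of the trained hidden-layer weight matrix can be read off directly as word vectors.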
2. Continuous Bag-of-Words (CBOW)
3. Skip-gram
4. Negative sampling
Four. Something to Vector
1. Node2Vec
2. Doc2Vec