Basic usage of word2vec and Bert
2022-07-28 06:11:00 【Alan and fish】
1. How to use word2vec
Generating word vectors with word2vec can be divided into three steps:
word segmentation -> training -> calling the model
# The data set is a novel picked more or less at random
import jieba
from gensim.models import word2vec

# Data preprocessing: keep only lines with at least 10 characters
def load_train_data(filename):
    sentences = []
    with open(filename, 'r', encoding='utf-8') as reader:
        for line in reader:
            line = line.strip()
            if len(line) >= 10:
                sentences.append(line)
    return sentences

# Segment with jieba
def segment(sentences):
    words = []
    for sentence in sentences:
        # word = pseg.cut(sentence)  # segmentation with part-of-speech tags
        word = jieba.cut(sentence)   # segmentation only, no part of speech
        result = ''
        for w in word:               # join the tokens of one line with spaces
            result += ' ' + w
        words.append(result)
    # Write every segmented line into one text file
    with open('F:\\python\\NLPBase\\data\\test.txt', 'a', encoding='utf-8') as fw:
        for result in words:
            fw.write(result)
    return words

# Train the word2vec model and generate the word vectors
def word2vect(filepath):
    sentences = word2vec.LineSentence(filepath)
    model = word2vec.Word2Vec(sentences, hs=1, min_count=1, window=3, vector_size=10)
    model.save('model')  # save the model

# ======================================
# Load the data set
sentences = load_train_data('F:\\python\\NLPBase\\data\\dataset.txt')
# Word segmentation
words = segment(sentences)
# Training
word2vect('F:\\python\\NLPBase\\data\\test.txt')
model = word2vec.Word2Vec.load('model')  # load the model
# Find the words whose vectors are closest to a given word's vector
for val in model.wv.similar_by_word("south", topn=10):
    print(val[0], val[1])
Result: the ten most similar words and their similarity scores are printed.
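Besides similar_by_word, the trained model also exposes the raw vector of a word and the similarity between two specific words through its wv attribute. A minimal sketch, assuming gensim 4.x (consistent with the vector_size argument above); the words are taken from the vocabulary at random rather than being specific corpus words:

from gensim.models import word2vec

model = word2vec.Word2Vec.load('model')        # load the model saved above

some_word = next(iter(model.wv.key_to_index))  # pick an arbitrary word from the vocabulary
vec = model.wv[some_word]                      # its 10-dimensional vector (vector_size=10)
print(some_word, vec.shape)

# cosine similarity between two vocabulary words
other_word = list(model.wv.key_to_index)[1]
print(model.wv.similarity(some_word, other_word))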
2. Simple use of BERT
Using BERT can be broken down into these steps:
load the BERT tokenizer -> load the BERT model -> tokenize -> convert the tokens to vocabulary indices -> run the model -> get the word vectors
Note: when loading BERT from files downloaded from the Internet, there are usually three of them: a json config file, the pretrained BERT model weights, and the vocabulary file. Once downloaded, the names of these three files must not be changed, otherwise loading will fail.
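For reference, loading from such a local copy might look like the sketch below. The directory path is hypothetical, and the expected file names (bert_config.json, pytorch_model.bin, vocab.txt for pytorch_pretrained_bert) are an assumption about the downloaded archive; the point is only that they must stay exactly as downloaded:

from pytorch_pretrained_bert import BertModel, BertTokenizer

# hypothetical local directory holding bert_config.json, pytorch_model.bin and vocab.txt
local_dir = 'F:\\python\\NLPBase\\bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(local_dir)   # looks for the vocabulary file inside
bert = BertModel.from_pretrained(local_dir)            # looks for the config and weight files inside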
import torch
from pytorch_pretrained_bert import BertModel, BertTokenizer

# Note: the BERT configuration files can be downloaded from the Internet or loaded online
# directly by name. Here they are loaded online; if you downloaded the files locally,
# pass the local directory path instead.
# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Load the BERT model; a local directory would contain the bert_config.json configuration
# file and the model weight file
bert = BertModel.from_pretrained('bert-base-uncased')
# Tokenize
s = "I'm not sure, this can work, lol -.-"
tokens = tokenizer.tokenize(s)
# ['i', "'", 'm', 'not', 'sure', ',', 'this', 'can', 'work', ',', 'lo', '##l', '-', '.', '-']
# Convert the tokens to vocabulary indices
ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
# Feed the indices into the BERT model
result = bert(ids, output_all_encoded_layers=True)
print(result)
Result:
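With output_all_encoded_layers=True, pytorch_pretrained_bert returns a list with the hidden states of every encoder layer plus a pooled sentence vector; the last layer gives one contextual vector per token. A minimal sketch of unpacking the result above:

# encoded_layers: list of 12 tensors for bert-base, each of shape [batch, seq_len, 768]
encoded_layers, pooled_output = result
last_layer = encoded_layers[-1]    # contextual vector for every token of the sentence
print(len(encoded_layers), last_layer.shape, pooled_output.shape)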
3. Summary
Both BERT and word2vec generate word vectors. The difference is that word2vec's vectors are static: it only places words with similar meanings at nearby positions in the vector space.
The problem with this approach is that the same word can mean different things in different contexts. For example, the word "it" can refer to different things in two different sentences, so its relationship to every other word is also different in each sentence.
BERT therefore adopts the self-attention mechanism from the Transformer, which lets the weight between each word and its context vary with the sentence, so the semantic relations between words are expressed better. That is BERT.
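To make that concrete, here is a minimal sketch of scaled dot-product self-attention, the core mechanism, using PyTorch with random toy embeddings (BERT's real implementation is multi-headed and trained at scale, so this is only an illustration):

import torch
import torch.nn.functional as F

d = 16                          # toy embedding size
x = torch.randn(1, 5, d)        # one sentence of 5 token embeddings
Wq = torch.nn.Linear(d, d)      # query projection
Wk = torch.nn.Linear(d, d)      # key projection
Wv = torch.nn.Linear(d, d)      # value projection

q, k, v = Wq(x), Wk(x), Wv(x)
scores = q @ k.transpose(-2, -1) / d ** 0.5   # [1, 5, 5] token-to-token scores
weights = F.softmax(scores, dim=-1)           # each row: how much one token attends to the others
out = weights @ v                             # contextualized token vectors, [1, 5, d]
print(weights.shape, out.shape)

Because the attention weights are computed from the whole sentence, the output vector of the same word changes with its context; stacking many such (multi-head) layers is essentially what BERT does.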