Basic usage of word2vec and Bert
2022-07-28 06:11:00 【Alan and fish】
1. How to use word2vec
Generating word vectors with word2vec can be divided into three steps:
word segmentation -> training -> calling the model
# The data set is a novel picked more or less at random
import jieba
from gensim.models import word2vec

# Data preprocessing: keep only lines with at least 10 characters
def load_train_data(filename):
    sentences = []
    with open(filename, 'r', encoding='utf-8') as reader:
        for line in reader:
            line = line.strip()
            if len(line) >= 10:
                sentences.append(line)
    return sentences

# Segment with jieba
def segment(sentences):
    words = []
    for sentence in sentences:
        # word = pseg.cut(sentence)  # segmentation with part-of-speech tags
        word = jieba.cut(sentence)   # segmentation only, no part of speech
        result = ''
        for w in word:               # join the tokens of one line with spaces
            result += ' ' + w
        words.append(result)
    # Write every segmented line into one text file
    with open('F:\\python\\NLPBase\\data\\test.txt', 'a', encoding='utf-8') as fw:
        for result in words:
            fw.write(result)
    return words

# Train the word2vec model and generate the word vectors
def word2vect(filepath):
    sentences = word2vec.LineSentence(filepath)
    model = word2vec.Word2Vec(sentences, hs=1, min_count=1, window=3, vector_size=10)
    model.save('model')  # save the model

# ======================================
# Load the data set
sentences = load_train_data('F:\\python\\NLPBase\\data\\dataset.txt')
# Word segmentation
words = segment(sentences)
# Training
word2vect('F:\\python\\NLPBase\\data\\test.txt')
model = word2vec.Word2Vec.load('model')  # load the model
# Find the words whose vectors are closest to a given word's vector
for val in model.wv.similar_by_word("south", topn=10):
    print(val[0], val[1])
Result: the ten most similar words and their similarity scores are printed.
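Besides similar_by_word, the trained model also exposes the raw vector of a word and the similarity between two specific words through its wv attribute. A minimal sketch, assuming gensim 4.x (consistent with the vector_size argument above); the words are taken from the vocabulary at random rather than being specific corpus words:

from gensim.models import word2vec

model = word2vec.Word2Vec.load('model')        # load the model saved above

some_word = next(iter(model.wv.key_to_index))  # pick an arbitrary word from the vocabulary
vec = model.wv[some_word]                      # its 10-dimensional vector (vector_size=10)
print(some_word, vec.shape)

# cosine similarity between two vocabulary words
other_word = list(model.wv.key_to_index)[1]
print(model.wv.similarity(some_word, other_word))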
2. Simple use of BERT
Using BERT can be broken down into these steps:
load the BERT tokenizer -> load the BERT model -> tokenize -> convert the tokens to vocabulary indices -> run the model -> get the word vectors
Note: when loading BERT from files downloaded from the Internet, there are usually three of them: a json config file, the pretrained BERT model weights, and the vocabulary file. Once downloaded, the names of these three files must not be changed, otherwise loading will fail.
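For reference, loading from such a local copy might look like the sketch below. The directory path is hypothetical, and the expected file names (bert_config.json, pytorch_model.bin, vocab.txt for pytorch_pretrained_bert) are an assumption about the downloaded archive; the point is only that they must stay exactly as downloaded:

from pytorch_pretrained_bert import BertModel, BertTokenizer

# hypothetical local directory holding bert_config.json, pytorch_model.bin and vocab.txt
local_dir = 'F:\\python\\NLPBase\\bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(local_dir)   # looks for the vocabulary file inside
bert = BertModel.from_pretrained(local_dir)            # looks for the config and weight files inside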
import torch
from pytorch_pretrained_bert import BertModel, BertTokenizer

# Note: the BERT configuration files can be downloaded from the Internet or loaded online
# directly by name. Here they are loaded online; if you downloaded the files locally,
# pass the local directory path instead.
# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Load the BERT model; a local directory would contain the bert_config.json configuration
# file and the model weight file
bert = BertModel.from_pretrained('bert-base-uncased')
# Tokenize
s = "I'm not sure, this can work, lol -.-"
tokens = tokenizer.tokenize(s)
# ['i', "'", 'm', 'not', 'sure', ',', 'this', 'can', 'work', ',', 'lo', '##l', '-', '.', '-']
# Convert the tokens to vocabulary indices
ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
# Feed the indices into the BERT model
result = bert(ids, output_all_encoded_layers=True)
print(result)
Result:
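With output_all_encoded_layers=True, pytorch_pretrained_bert returns a list with the hidden states of every encoder layer plus a pooled sentence vector; the last layer gives one contextual vector per token. A minimal sketch of unpacking the result above:

# encoded_layers: list of 12 tensors for bert-base, each of shape [batch, seq_len, 768]
encoded_layers, pooled_output = result
last_layer = encoded_layers[-1]    # contextual vector for every token of the sentence
print(len(encoded_layers), last_layer.shape, pooled_output.shape)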
3. Summary
Both BERT and word2vec generate word vectors. The difference is that word2vec's vectors are static: it only places words with similar meanings at nearby positions in the vector space.
The problem with this approach is that the same word can mean different things in different contexts. For example, the word "it" can refer to different things in two different sentences, so its relationship to every other word is also different in each sentence.
BERT therefore adopts the self-attention mechanism from the Transformer, which lets the weight between each word and its context vary with the sentence, so the semantic relations between words are expressed better. That is BERT.
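To make that concrete, here is a minimal sketch of scaled dot-product self-attention, the core mechanism, using PyTorch with random toy embeddings (BERT's real implementation is multi-headed and trained at scale, so this is only an illustration):

import torch
import torch.nn.functional as F

d = 16                          # toy embedding size
x = torch.randn(1, 5, d)        # one sentence of 5 token embeddings
Wq = torch.nn.Linear(d, d)      # query projection
Wk = torch.nn.Linear(d, d)      # key projection
Wv = torch.nn.Linear(d, d)      # value projection

q, k, v = Wq(x), Wk(x), Wv(x)
scores = q @ k.transpose(-2, -1) / d ** 0.5   # [1, 5, 5] token-to-token scores
weights = F.softmax(scores, dim=-1)           # each row: how much one token attends to the others
out = weights @ v                             # contextualized token vectors, [1, 5, d]
print(weights.shape, out.shape)

Because the attention weights are computed from the whole sentence, the output vector of the same word changes with its context; stacking many such (multi-head) layers is essentially what BERT does.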