Word2vec training Chinese word vector
2022-07-01 13:33:00 【Full stack programmer webmaster】
Hello everyone, nice to meet you again. I'm your friend Quan Jun.
Word vectors are the basic building block for modeling text at the word level. A good set of word vectors places semantically similar words close together in the vector space, which is very convenient for downstream tasks such as text classification and text clustering. This post briefly introduces word vector training; it mainly records the training models, how to save the word vectors, and the usage of some related functions.
Part One: Sohu News
1. Chinese corpus preparation
This post uses the Sogou news corpus from the Sogou Lab. Data link: http://www.sogou.com/labs/resource/cs.php
The downloaded file is named news_sohusite_xml.full.tar.gz.
2. Data preprocessing
2.1 Decompress the data and take out the contents
(1) cd into the directory containing the downloaded file and run the decompression command:

tar -zvxf news_sohusite_xml.full.tar.gz

(2) Extract the content

Because each piece of text in this Sogou corpus is stored inside <content></content> tags, pull out the content between those tags by executing the following command:

cat news_tensite_xml.dat | iconv -f gbk -t utf-8 -c | grep "<content>" > corpus.txt

This produces a file named corpus.txt, which can be opened with vim:

vim corpus.txt

2.2 Segment the text with jieba
The documents fed into word2vec need to be segmented into words first. Segmentation can be done with jieba, which is easy to install, so installation is not covered here. The segmentation code is as follows:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Tue Sep 11 18:46:22 2018
@author: lilong

Segment the raw text and save the result to a new file.
"""
import jieba
import numpy as np

filePath = 'corpus_1.txt'
fileSegWordDonePath = 'corpusSegDone_1.txt'

# Print a list of Chinese strings
def PrintListChinese(list):
    for i in range(len(list)):
        print(list[i])

# Read the contents of the file into a list
fileTrainRead = []
with open(filePath, 'r') as fileTrainRaw:
    for line in fileTrainRaw:          # read the file line by line
        fileTrainRead.append(line)

# Segment each line with jieba and store the results in a list
# (the slice [9:-11] strips the leading <content> tag and the trailing </content> tag plus newline)
fileTrainSeg = []
for i in range(len(fileTrainRead)):
    fileTrainSeg.append([' '.join(list(jieba.cut(fileTrainRead[i][9:-11], cut_all=False)))])
    if i % 100 == 0:
        print(i)

# Save the segmentation results to a file
with open(fileSegWordDonePath, 'w', encoding='utf-8') as fW:
    for i in range(len(fileTrainSeg)):
        fW.write(fileTrainSeg[i][0])
        fW.write('\n')

"""
Get the word vectors with gensim word2vec.
"""
import warnings
import logging
import os.path
import sys
import multiprocessing
import gensim
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Ignore gensim UserWarnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])   # name of the current script
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # inp: input corpus; out_model: output model; out_vector: model in word2vec vector format
    inp = 'corpusSegDone_1.txt'
    out_model = 'corpusSegDone_1.model'
    out_vector = 'corpusSegDone_1.vector'

    # Train the model (sg is not set, so gensim's default CBOW architecture is used;
    # pass sg=1 to train a skip-gram model instead)
    model = Word2Vec(LineSentence(inp), size=50, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())

    # Save the model
    model.save(out_model)
    # Save the word vectors
    model.wv.save_word2vec_format(out_vector, binary=False)

The segmentation result is written out one document per line, with the words separated by spaces.
Three files are saved: corpusSegDone_1.txt, corpusSegDone_1.model and corpusSegDone_1.vector.
Because this takes a while to run, no validation test was done here.
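As a minimal sketch that is not part of the original post, the saved model could be checked once training finishes by querying a few nearest neighbours; the query word 北京 (Beijing) is only an illustrative example:

from gensim.models import Word2Vec

# Load the model saved by the script above and print the 10 nearest neighbours of one word
model = Word2Vec.load('corpusSegDone_1.model')
for word, score in model.wv.most_similar('北京', topn=10):
    print(word, score)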
Part Two: Wikipedia
Because training takes quite some time, this part only records the overall approach.
1. Data preprocessing
The Chinese Wikipedia corpus alone is not large enough, while Baidu Baike is more comprehensive: Baidu Baike covers mainland-related content in more detail, and Wikipedia covers Hong Kong, Macao, Taiwan and overseas topics in more detail. The two corpora are therefore put into training together so that they complement each other. In addition, industry data for 11,000 companies was added.
- Model: the word2vec model from the gensim toolkit; easy to install and use, and fast to train.
- Corpus: 5 million Baidu Baike entries + 300,000 Wikipedia entries + 11,000 domain-specific records.
- Segmentation: jieba, with industry terms added to a custom dictionary and stop words removed.
- Hardware: depends on your machine.
2. Word segmentation
- Prepare a stop-word list; stop words should be filtered out during training to remove their interference.
- Available segmentation tools include the ICTCLAS segmenter from the Chinese Academy of Sciences, the LTP segmenter from Harbin Institute of Technology, and jieba. The Chinese Academy of Sciences segmenter gives good results, but jieba is used directly here because it is easy to use and fast.
- Custom dictionary: encyclopedia data contains many proper nouns, and many of them are long. Segmenting directly will in most cases cut them apart, which is not the result we want. For example, 中国人民解放军 (the Chinese People's Liberation Army) may be split into 中国 / 人民 / 解放军 (China / people / liberation army). Although jieba has a new-word discovery feature, to guarantee segmentation accuracy the author of jieba still recommends using a custom dictionary.
- Building the custom dictionary: 2 million entry titles were extracted from Baidu Baike. Because the dictionary contains English words, which would cause jieba to segment English tokens, regular expressions are needed to remove the English data from the entries and drop some words. The entries also contain some very short items, such as 在北京 ("in Beijing"), that cause problems during segmentation and also need to be removed with regular expressions; a simpler, cruder method is to keep only Chinese entries of three or more characters. After cleaning, a custom dictionary of 1.7 million entries remains (a small cleaning sketch is shown after this list).
- Segmentation: the script used is shown below.
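As a rough sketch of the dictionary-cleaning step described above, using the simple "keep only Chinese entries of three or more characters" rule; the input file name baike_titles.txt is hypothetical:

import re

# Keep only entries made up of 3 or more Chinese characters; English words and
# very short items such as 在北京 are dropped.
chinese_entry = re.compile(r'^[\u4e00-\u9fa5]{3,}$')

with open('baike_titles.txt', encoding='utf8') as fin, \
        open('baike_word_chinese', 'w', encoding='utf8') as fout:
    for line in fin:
        word = line.strip()
        if chinese_entry.match(word):
            fout.write(word + '\n')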
The segmentation code:
import os
import logging
import jieba

logging.basicConfig(level=logging.INFO)

# jieba's built-in parallel (multi-process) segmentation, Linux only (see notes below)
# jieba.enable_parallel()

# Load the custom dictionary
jieba.load_userdict("F:/baike_spider/dict/baike_word_chinese")

# Load the stop words
def getStopwords():
    stopwords = []
    with open("stop_words.txt", "r", encoding='utf8') as f:
        lines = f.readlines()
        for line in lines:
            stopwords.append(line.strip())
    return stopwords

stopwords = getStopwords()

# Word segmentation
def segment():
    file_nums = 0
    count = 0
    url = base_url + 'processed_data/demo/'   # base_url: corpus root path, defined elsewhere in the original script
    fileNames = os.listdir(url)
    for file in fileNames:                    # traverse each corpus file
        # Log information
        logging.info('starting ' + str(file_nums) + ' file word segmentation')
        segment_file = open(url + file + '_segment', 'a', encoding='utf8')
        # Each file is processed separately
        with open(url + file, encoding='utf8') as f:
            text = f.readlines()
            for sentence in text:
                sentence = list(jieba.cut(sentence))
                sentence_segment = []
                for word in sentence:
                    if word not in stopwords:
                        sentence_segment.append(word)
                segment_file.write(" ".join(sentence_segment))
            del text
            f.close()
        segment_file.close()
        # Log information
        logging.info('finished ' + str(file_nums) + ' file word segmentation')
        file_nums += 1

- Because Python multithreading runs on only one core at a time (the GIL), threads cannot make effective use of a multi-core CPU. jieba is written in Python, so the only parallelism it supports is parallel segmentation, which means multi-process segmentation, and that mode is not supported on Windows.
- On Linux I tried jieba's built-in parallel segmentation. After it is enabled, jieba automatically starts several worker processes in the background. However, parallel segmentation requires reading the whole training corpus into memory at once and passing it in as jieba.cut(file.read()); with line-by-line input as in my code, starting multiple processes does not help. jieba's multi-process mode works by automatically splitting the corpus across the worker processes, segmenting each piece, and then merging the results.
- On an 8-core, 16 GB Linux virtual machine, enabling jieba's parallel segmentation on 1 GB of corpus data quickly exhausted the memory.
- Single-process jieba segmentation does not need to load all of the corpus at once: the corpus can be read line by line, memory usage stays low, and it runs stably. So the corpus was split into 8 parts and 8 processes were started manually, one per part, as sketched below. This keeps each process's memory usage stable and performs better than jieba's built-in parallel mode. For 20 GB of data with HMM mode enabled, segmentation took about 10 hours.
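A minimal sketch of this manual approach, assuming the raw corpus has already been split into eight shard files; the shard_*.txt file names are hypothetical and stop-word filtering is omitted for brevity:

import multiprocessing
import jieba

def segment_shard(in_path, out_path):
    # Read line by line so each process keeps a small, stable memory footprint
    with open(in_path, encoding='utf8') as fin, open(out_path, 'w', encoding='utf8') as fout:
        for line in fin:
            words = jieba.cut(line.strip())          # HMM mode is on by default
            fout.write(' '.join(words) + '\n')

if __name__ == '__main__':
    jobs = []
    for i in range(8):
        p = multiprocessing.Process(target=segment_shard,
                                    args=('shard_%d.txt' % i, 'shard_%d_segment.txt' % i))
        p.start()
        jobs.append(p)
    for p in jobs:
        p.join()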
3. word2vec Training
Training uses the word2vec implementation in the gensim toolkit: it is easy to use, trains quickly, and gives better results than Google's original word2vec. Running a word2vec model with TensorFlow instead was simply not feasible with 16 GB of RAM. The gensim word2vec training code is as follows; it is quite simple:
import logging
import multiprocessing
import os.path
import sys
import jieba
from gensim.models import Word2Vec
from gensim.models.word2vec import PathLineSentences

if __name__ == '__main__':
    # Logging setup
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    # if len(sys.argv) < 4:
    #     print(globals()['__doc__'] % locals())
    #     sys.exit(1)
    # input_dir, outp1, outp2 = sys.argv[1:4]
    input_dir = 'segment'
    outp1 = 'baike.model'
    outp2 = 'word2vec_format'
    fileNames = os.listdir(input_dir)

    # Train the model
    # Corpus directory: PathLineSentences(input_dir)
    # embedding size 256, window size 10, drop words that occur fewer than 5 times,
    # one worker per CPU core, 10 iterations
    model = Word2Vec(PathLineSentences(input_dir),
                     size=256, window=10, min_count=5,
                     workers=multiprocessing.cpu_count(), iter=10)
    model.save(outp1)
    model.wv.save_word2vec_format(outp2, binary=False)

# Run command (from the directory containing the training files):
# python word2vec_model.py data baike.model baike.vector

- Because the corpus is too large to load into memory all at once, gensim provides the PathLineSentences(input_dir) class, which reads the corpus files in the specified directory one by one and feeds the training data to the model through an iterator.
- You can see from the training log that the process first reads every file in turn to build the overall vocab dictionary and count word frequencies; this count is used during training to filter out words occurring fewer than min_count times. After the vocab dictionary is built, the corpus is read through again for model training, and training itself is very fast.
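As a small sketch that is not in the original post, the vocabulary built in this first pass can be inspected once baike.model has been trained as above (the word 中国 is only an illustrative key):

from gensim.models import Word2Vec

model = Word2Vec.load('baike.model')
print(len(model.wv.vocab))            # number of words kept after min_count filtering
print(model.wv.vocab['中国'].count)    # raw corpus frequency of one kept word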
Part Three: Saving and loading word2vec word vectors
- Saving and loading with model.save()

Save the model:

import gensim
model = gensim.models.Word2Vec(documents, size=300)
model.train(documents, total_examples=len(documents), epochs=10)
model.save("../input/Word2vec.w2v")

Load the word vectors:

import gensim
word2vec = gensim.models.word2vec.Word2Vec.load("./input/Quora.w2v").wv

- Storing the word vectors in binary format

Save the word vectors:

model.wv.save_word2vec_format(embedding_path, binary=True)
# model.wv.save_word2vec_format(embedding_path, binary=False)   # non-binary (plain text) format

Load the word vectors:

import gensim
word2vec = gensim.models.KeyedVectors.load_word2vec_format(embedding_path, binary=True)

- Saving and loading with numpy
Files holding array data can be in binary or text format; binary files can use NumPy's dedicated binary type (.npy) or an unformatted raw type. Use np.save() to save an .npy file and np.load() to load it.
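For example, here is a minimal sketch that is not in the original post, loading the model trained above and round-tripping its embedding matrix through an .npy file (older gensim versions expose the matrix as model.wv.syn0 rather than model.wv.vectors):

import numpy as np
from gensim.models import Word2Vec

# Load the trained model and save the raw embedding matrix in NumPy's binary .npy format
model = Word2Vec.load('baike.model')
np.save('word_vectors.npy', model.wv.vectors)

# Load it back as an ndarray of shape (vocab_size, vector_size)
vectors = np.load('word_vectors.npy')
print(vectors.shape)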
Model export and import:

The simplest export and import:

(1) word2vec.save can export the model file; here it is not exported as .bin:

# Model saving and loading
model.save('/tmp/mymodel')
new_model = gensim.models.Word2Vec.load('/tmp/mymodel')
model = Word2Vec.load_word2vec_format('/tmp/vectors.txt', binary=False)    # load a .txt file
# using gzipped/bz2 input works too, no need to unzip:
model = Word2Vec.load_word2vec_format('/tmp/vectors.bin.gz', binary=True)  # load a .bin file
word2vec = gensim.models.word2vec.Word2Vec(sentences(), size=256, window=10, min_count=64,
                                           sg=1, hs=1, iter=10, workers=25)
word2vec.save('word2vec_wx')

(2) Import with gensim.models.Word2Vec.load:

model = gensim.models.Word2Vec.load('xxx/word2vec_wx')
pd.Series(model.most_similar(u'微信', topn=360000))   # u'微信' ("WeChat")

(3) NumPy: you can use numpy.load:

import numpy
word_2x = numpy.load('xxx/word2vec_wx.wv.syn0.npy')

(4) Other import methods, loading the txt and bin formats:

from gensim.models.keyedvectors import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format('/tmp/vectors.txt', binary=False)  # C text format
word_vectors = KeyedVectors.load_word2vec_format('/tmp/vectors.bin', binary=True)   # C binary format

Incremental training
# Incremental training
model = gensim.models.Word2Vec.load(temp_path)
more_sentences = [['Advanced', 'users', 'can', 'load', 'a', 'model', 'and', 'continue', 'training', 'it', 'with', 'more', 'sentences']]
model.build_vocab(more_sentences, update=True)
model.train(more_sentences, total_examples=model.corpus_count, epochs=model.iter)

Note that a model generated by the original C word2vec tool cannot be incrementally retrained this way.
This is only a record for learning purposes.
References:
https://www.cnblogs.com/Newsteinwell/p/6034747.html
https://www.jianshu.com/p/87798bccee48
https://blog.csdn.net/sinat_26917383/article/details/69803018
Publisher: Full Stack Programmer. Please credit the source when reprinting: https://javaforall.cn/131448.html (original site: https://javaforall.cn)