Training Chinese word vectors with word2vec
Word vectors are the basic building block of text models: each word is represented as a vector. A good set of word vectors places semantically similar words close together in the vector space, which is very convenient for downstream tasks such as text classification and text clustering. This post is a brief introduction to training word vectors; it mainly records the training of the model, how to save the word vectors, and the usage of a few related functions.
I. Sohu News
1. Chinese corpus preparation
This post uses the Sohu news corpus from Sogou Labs; download link: http://www.sogou.com/labs/resource/cs.php
The downloaded file is named news_sohusite_xml.full.tar.gz.
2. Data preprocessing
2.1 Decompress the data and extract the content
(1) cd into the directory containing the downloaded file and run the decompression command:
tar -zvxf news_sohusite_xml.full.tar.gz
(2) Extract the content
In this corpus the text of each article is stored between <content> and </content> tags, so to pull out only those content lines, run the following command:
cat news_tensite_xml.dat | iconv -f gbk -t utf-8 -c | grep "<content>" > corpus.txt
This converts the data from GBK to UTF-8 and produces a file named corpus.txt, which can be opened with vim:
vim corpus.txt
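If you prefer to do this step in Python (for example on Windows, where the shell tools above may not be available), a minimal sketch that mirrors the cat | iconv | grep pipeline could look like the following; the file names simply follow the example above:

# Convert the raw GBK dump to UTF-8 and keep only the lines containing <content> tags,
# mirroring the iconv/grep pipeline above.
with open('news_tensite_xml.dat', 'r', encoding='gbk', errors='ignore') as fin, \
     open('corpus.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        if '<content>' in line:
            fout.write(line)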
2.2 Word segmentation with jieba
The documents fed to word2vec must first be segmented into words. This can be done with jieba, which is easy to install (not covered here). The word segmentation code is as follows:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Tue Sep 11 18:46:22 2018
@author: lilong
"""

"""
Segment the raw text with jieba and save the result to a new file
"""
import jieba
import numpy as np

filePath = 'corpus_1.txt'
fileSegWordDonePath = 'corpusSegDone_1.txt'

# Print a list of Chinese strings
def PrintListChinese(list):
    for i in range(len(list)):
        print(list[i])

# Read the file into a list, line by line
fileTrainRead = []
with open(filePath, 'r') as fileTrainRaw:
    for line in fileTrainRaw:
        fileTrainRead.append(line)

# Segment with jieba and store the result in a list
# [9:-11] strips the leading "<content>" and trailing "</content>\n" of each line
fileTrainSeg = []
for i in range(len(fileTrainRead)):
    fileTrainSeg.append([' '.join(list(jieba.cut(fileTrainRead[i][9:-11], cut_all=False)))])
    if i % 100 == 0:
        print(i)

# Save the segmentation result to a file
with open(fileSegWordDonePath, 'w', encoding='utf-8') as fW:
    for i in range(len(fileTrainSeg)):
        fW.write(fileTrainSeg[i][0])
        fW.write('\n')


"""
Train word vectors with gensim word2vec
"""
import warnings
import logging
import os.path
import sys
import multiprocessing

import gensim
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Ignore the warning
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])   # name of the current script
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # inp is the input corpus, out_model the saved model, out_vector the word2vec-format vectors
    inp = 'corpusSegDone_1.txt'
    out_model = 'corpusSegDone_1.model'
    out_vector = 'corpusSegDone_1.vector'

    # Train the skip-gram model
    # (in gensim >= 4.0 the parameter size is called vector_size)
    model = Word2Vec(LineSentence(inp), size=50, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())

    # Save the model
    model.save(out_model)
    # Save the word vectors in word2vec text format
    model.wv.save_word2vec_format(out_vector, binary=False)

The segmentation result is one article per line, with words separated by spaces.
Running the script saves three files: corpusSegDone_1.txt (the segmented corpus), corpusSegDone_1.model (the gensim model) and corpusSegDone_1.vector (the word vectors in word2vec text format).
Because this takes quite a while to run, no validation test was done here; a quick sanity check is sketched below.
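As a sanity check one can load the saved model and look at nearest neighbours. A minimal sketch, assuming the file names used above; the query words are only examples and must exist in the trained vocabulary:

# Load the trained model and inspect a few nearest neighbours.
from gensim.models import Word2Vec

model = Word2Vec.load('corpusSegDone_1.model')
# Nearest neighbours of an example word (any word that survived min_count works).
for word, score in model.wv.most_similar(u'北京', topn=10):
    print(word, score)
# Cosine similarity between two example words.
print(model.wv.similarity(u'北京', u'上海'))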
II. Wikipedia
Because training takes quite some time, this section only outlines the approach.
1. Data preprocessing
The Wikipedia dump alone is not large enough, while Baidu Baike offers a much larger amount of data. Baidu Baike covers mainland-China-related content more comprehensively, and Wikipedia is more detailed on Hong Kong, Macao, Taiwan and foreign topics, so the two corpora are trained together and complement each other. In addition, about 11,000 entries of company/industry data were added.
- Model: the word2vec model from the gensim toolkit, which is easy to install and use and trains quickly
- Corpus: 5 million Baidu Baike entries + 300,000 Wikipedia entries + 11,000 domain-specific entries
- Word segmentation: jieba, with industry terms added to a custom dictionary and stop words removed
- Hardware: depends on your machine
2. Word segmentation
- Prepare a stop-word list; the noise from stop words should be removed before training.
- Available segmenters include the Chinese Academy of Sciences segmenter, HIT's LTP segmenter and jieba. The Academy of Sciences segmenter gives good results, but jieba is used directly here because it is easy to use and fast.
- Custom dictionary: encyclopedia data contains many proper nouns, and many of them are long. Segmenting them directly usually cuts them apart, which is not what we want; for example, "the Chinese People's Liberation Army" (中国人民解放军) may be split into "China" / "people" / "liberation army". Although jieba can discover new words, to guarantee segmentation accuracy the jieba author recommends using a custom dictionary.
- Custom dictionary extraction: about 2 million entry titles were extracted from Baidu Baike. Because the list contains English words, which would cause jieba to segment English tokens, regular expressions are used to remove the English parts of the titles. Some very short titles, such as "in Beijing", also cause segmentation problems and need to be removed with regular expressions as well; a simpler, cruder rule is to keep only titles with 3 or more Chinese characters. After this cleaning, a custom dictionary of about 1.7 million entries remains (a sketch of this filtering step is shown after this list).
- Word segmentation
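Before the segmentation code, here is a minimal sketch of the custom-dictionary filtering step described above. The input file name is a placeholder; the output name matches the dictionary loaded with jieba.load_userdict() below, and the rule applied is simply "at least 3 Chinese characters and nothing else":

# Filter the raw Baidu Baike entry titles into a custom dictionary for jieba:
# drop titles containing English letters or other symbols, keep titles of >= 3 Chinese characters.
import re

chinese_only = re.compile(r'^[\u4e00-\u9fa5]{3,}$')

with open('baike_entries_raw.txt', encoding='utf8') as fin, \
     open('baike_word_chinese', 'w', encoding='utf8') as fout:
    for line in fin:
        word = line.strip()
        if chinese_only.match(word):
            fout.write(word + '\n')   # one word per line is a valid jieba user dictionary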
The word segmentation code is as follows:

import os
import logging
import jieba

logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level=logging.INFO)

# Parallel (multi-process) word segmentation; only works on Linux
# jieba.enable_parallel()

# Load the custom dictionary
jieba.load_userdict("F:/baike_spider/dict/baike_word_chinese")

# Load the stop words
def getStopwords():
    stopwords = []
    with open("stop_words.txt", "r", encoding='utf8') as f:
        for line in f:
            stopwords.append(line.strip())
    return stopwords

stopwords = getStopwords()

# Word segmentation
def segment():
    file_nums = 0
    count = 0
    url = base_url + 'processed_data/demo/'   # base_url: root directory of the corpus, defined elsewhere
    fileNames = os.listdir(url)
    for file in fileNames:                    # traverse each file
        logging.info('starting ' + str(file_nums) + ' file word segmentation')
        segment_file = open(url + file + '_segment', 'a', encoding='utf8')
        # process each file separately
        with open(url + file, encoding='utf8') as f:
            text = f.readlines()
            for sentence in text:
                sentence = list(jieba.cut(sentence))
                sentence_segment = []
                for word in sentence:
                    if word not in stopwords:
                        sentence_segment.append(word)
                segment_file.write(" ".join(sentence_segment))
            del text
        segment_file.close()
        logging.info('finished ' + str(file_nums) + ' file word segmentation')
        file_nums += 1

- Because of the GIL, Python multithreading runs on a single core and cannot make effective use of multiple CPUs. jieba is written in Python, so the parallelism it offers is multi-process word segmentation, and that mode is not supported on Windows.
- On Linux I tried jieba's built-in parallel segmentation. After it is enabled, jieba automatically starts several worker processes in the background, but parallel segmentation requires reading the whole training corpus into memory at once and passing it in as jieba.cut(file.read()); feeding the text line by line, as in the code above, gains nothing from the extra processes. The principle is that jieba splits the in-memory corpus into chunks, dispatches them to the worker processes for segmentation, and then merges the results.
- On an 8-core, 16 GB Linux virtual machine with jieba parallel segmentation enabled, a 1 GB corpus quickly exhausted the memory.
- Single-process jieba segmentation does not need to load all the corpus data at once; the corpus can be read line by line, memory usage stays low and it runs stably. So the corpus was split into 8 parts and 8 processes were started by hand, one per part. Each process then has a stable memory footprint and the throughput is better than jieba's built-in parallel mode. For about 20 GB of data with HMM mode enabled, segmentation took roughly 10 hours (a sketch of this approach follows).
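A minimal sketch of this "split the corpus into shards and run one single-process worker per shard" approach, using multiprocessing.Pool instead of starting the processes by hand. The directory layout and dictionary name are assumptions, and stop-word filtering is omitted for brevity:

# Run single-process jieba segmentation over several corpus shards in parallel,
# one worker process per shard, reading each shard line by line to keep memory low.
import os
import jieba
from multiprocessing import Pool

def segment_one_file(path):
    jieba.load_userdict('baike_word_chinese')          # each worker loads the dictionary itself
    with open(path, encoding='utf8') as fin, \
         open(path + '_segment', 'w', encoding='utf8') as fout:
        for line in fin:
            words = [w for w in jieba.cut(line.strip()) if w.strip()]
            fout.write(' '.join(words) + '\n')
    return path

if __name__ == '__main__':
    shard_dir = 'processed_data/demo/'                 # directory holding the corpus shards
    shards = [os.path.join(shard_dir, f) for f in os.listdir(shard_dir)]
    with Pool(processes=8) as pool:                    # one process per shard
        for done in pool.imap_unordered(segment_one_file, shards):
            print('finished', done)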
3. word2vec Training
Training uses gensim's word2vec, which is easy to use and fast, and in my experience its results are better than Google's original word2vec. (Running a word2vec model with TensorFlow was not feasible here: 16 GB of memory was not enough.) The gensim word2vec training code is simple:
import logging
import multiprocessing
import os.path
import sys

import jieba
from gensim.models import Word2Vec
from gensim.models.word2vec import PathLineSentences

if __name__ == '__main__':
    # Logging
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    # if len(sys.argv) < 4:
    #     print(globals()['__doc__'] % locals())
    #     sys.exit(1)
    # input_dir, outp1, outp2 = sys.argv[1:4]
    input_dir = 'segment'          # directory of segmented corpus files
    outp1 = 'baike.model'          # output model
    outp2 = 'word2vec_format'      # output vectors in word2vec text format
    fileNames = os.listdir(input_dir)

    # Train the model
    # corpus directory: PathLineSentences(input_dir)
    # embedding size 256, window size 10, drop words occurring fewer than 5 times,
    # one worker per CPU core, 10 iterations
    # (in gensim >= 4.0, size is vector_size and iter is epochs)
    model = Word2Vec(PathLineSentences(input_dir),
                     size=256, window=10, min_count=5,
                     workers=multiprocessing.cpu_count(), iter=10)
    model.save(outp1)
    model.wv.save_word2vec_format(outp2, binary=False)

    # Run: go to the training file directory and execute
    # python word2vec_model.py data baike.model baike.vector

- Because the corpus is too large to be loaded into memory at once, gensim provides the PathLineSentences(input_dir) class, which reads the corpus files under the specified directory one after another and feeds the training data to the model through an iterator.
- The training log shows the process: each file is read in turn to build the overall vocabulary dictionary, with word counts used to filter out words that occur fewer than min_count times; once the vocabulary has been built, the corpus files are read again one by one for model training, which is very fast.
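After training, the effect of min_count can be checked by inspecting the final vocabulary. A minimal sketch; the attribute names follow gensim 3.x, which the code above uses (gensim >= 4.0 renames wv.vocab to wv.key_to_index and wv.index2word to wv.index_to_key):

# Inspect the vocabulary that survived min_count filtering (gensim 3.x attribute names).
from gensim.models import Word2Vec

model = Word2Vec.load('baike.model')
print('vocabulary size:', len(model.wv.vocab))            # words kept after min_count filtering
print('embedding matrix shape:', model.wv.vectors.shape)  # (vocabulary size, 256)
print('most frequent words:', model.wv.index2word[:10])   # index2word is sorted by frequency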
III. Saving and loading word2vec word vectors
- Saving and loading with model.save()
Save the word vectors:

import gensim

model = gensim.models.Word2Vec(documents, size=300)
model.train(documents, total_examples=len(documents), epochs=10)
model.save("../input/Word2vec.w2v")

Load the word vectors:

import gensim

word2vec = gensim.models.word2vec.Word2Vec.load("./input/Quora.w2v").wv

- Saving the word vectors in binary format
Save the word vectors:

model.wv.save_word2vec_format(embedding_path, binary=True)
# model.wv.save_word2vec_format(embedding_path, binary=False)   # non-binary (text) format

Load the word vectors:

import gensim

word2vec = gensim.models.KeyedVectors.load_word2vec_format(embedding_path, binary=True)

- Saving and loading with numpy
The file holding the array data can be in binary or text format; a binary file can use numpy's own binary type or a raw, unformatted type.
Use np.save() to save an .npy file and np.load() to load it.
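A minimal sketch of saving just the embedding matrix and its word list with numpy (attribute names as in gensim 3.x; older gensim versions expose the matrix as wv.syn0, which is where the .npy file name used further below comes from):

# Save the raw embedding matrix plus its word list with numpy, then load them back.
import numpy as np
from gensim.models import Word2Vec

model = Word2Vec.load('baike.model')
np.save('embedding_matrix.npy', model.wv.vectors)          # shape: (vocabulary size, embedding size)
with open('vocab.txt', 'w', encoding='utf8') as f:
    f.write('\n'.join(model.wv.index2word))                 # row i of the matrix belongs to word i

embedding = np.load('embedding_matrix.npy')
with open('vocab.txt', encoding='utf8') as f:
    words = f.read().split('\n')
word2id = {w: i for i, w in enumerate(words)}               # map word -> row index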
Model export and import:
The simplest export and import:
(1) word2vec.save() can export the model to files (note this is not the .bin format):
# Save and load the model
model.save('/tmp/mymodel')
new_model = gensim.models.Word2Vec.load('/tmp/mymodel')

model = Word2Vec.load_word2vec_format('/tmp/vectors.txt', binary=False)    # load a .txt file
# using gzipped/bz2 input works too, no need to unzip:
model = Word2Vec.load_word2vec_format('/tmp/vectors.bin.gz', binary=True)  # load a .bin file

word2vec = gensim.models.word2vec.Word2Vec(sentences(), size=256, window=10, min_count=64,
                                           sg=1, hs=1, iter=10, workers=25)
word2vec.save('word2vec_wx')

(2) Import with gensim.models.Word2Vec.load:
model = gensim.models.Word2Vec.load('xxx/word2vec_wx')
pd.Series(model.most_similar(u'微信', topn=360000))   # u'微信' means 'WeChat'

(3) Numpy arrays can be loaded with numpy.load:
import numpy
word_2x = numpy.load('xxx/word2vec_wx.wv.syn0.npy')

(4) Other import methods, importing the txt format and the bin format:

from gensim.models.keyedvectors import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format('/tmp/vectors.txt', binary=False)  # C text format
word_vectors = KeyedVectors.load_word2vec_format('/tmp/vectors.bin', binary=True)   # C binary format

Incremental training:
# Incremental training
model = gensim.models.Word2Vec.load(temp_path)
more_sentences = [['Advanced', 'users', 'can', 'load', 'a', 'model', 'and', 'continue', 'training', 'it', 'with', 'more', 'sentences']]
model.build_vocab(more_sentences, update=True)
model.train(more_sentences, total_examples=model.corpus_count, epochs=model.iter)

You cannot continue training a model that was loaded from the C-generated (word2vec) format.
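The reason is that only a model saved with model.save() keeps the internal weights needed to continue training; the C / word2vec text or binary format stores just the vectors. A minimal sketch of the distinction, using the file names from the training example above:

# A full gensim model can be trained further; vectors loaded from the C format cannot.
from gensim.models import Word2Vec, KeyedVectors

more_sentences = [[u'新增', u'的', u'训练', u'句子']]   # already-segmented new sentences (example)

model = Word2Vec.load('baike.model')                    # full model: trainable
model.build_vocab(more_sentences, update=True)
model.train(more_sentences, total_examples=model.corpus_count, epochs=model.iter)

vectors = KeyedVectors.load_word2vec_format('word2vec_format', binary=False)
# 'vectors' is a KeyedVectors object: it supports most_similar() and similarity(),
# but it has no train() method, so it cannot be trained further.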
These notes are only a record of my own learning.
References:
https://www.cnblogs.com/Newsteinwell/p/6034747.html
https://www.jianshu.com/p/87798bccee48
https://blog.csdn.net/sinat_26917383/article/details/69803018