Word2vec training Chinese word vector
2022-07-01 13:33:00 【Full stack programmer webmaster】
Hello everyone, nice to meet you again. I'm your friend Quan Jun.
Word vectors are the basic building block for modeling text at the word level. A good set of word vectors places semantically similar words close together in the vector space, which is very convenient for downstream tasks such as text classification and text clustering. This post briefly introduces word vector training; it mainly records the training models, how to save the word vectors, and the usage of some related functions.
Part One: Sohu News
1. Chinese corpus preparation
This post uses the Sogou news corpus from the Sogou Lab. Data link: http://www.sogou.com/labs/resource/cs.php
The downloaded file is named news_sohusite_xml.full.tar.gz.
2. Data preprocessing
2.1 Decompress the data and take out the contents
(1) cd into the directory containing the downloaded file and run the decompression command:

tar -zvxf news_sohusite_xml.full.tar.gz

(2) Extract the content

Because each piece of text in this Sogou corpus is stored inside <content></content> tags, pull out the content between those tags by executing the following command:

cat news_tensite_xml.dat | iconv -f gbk -t utf-8 -c | grep "<content>" > corpus.txt

This produces a file named corpus.txt, which can be opened with vim:

vim corpus.txt

2.2 Segment the text with jieba
The documents fed into word2vec need to be segmented into words first. Segmentation can be done with jieba, which is easy to install, so installation is not covered here. The segmentation code is as follows:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Tue Sep 11 18:46:22 2018
@author: lilong

Segment the raw text and save the result to a new file.
"""
import jieba
import numpy as np

filePath = 'corpus_1.txt'
fileSegWordDonePath = 'corpusSegDone_1.txt'

# Print a list of Chinese strings
def PrintListChinese(list):
    for i in range(len(list)):
        print(list[i])

# Read the contents of the file into a list
fileTrainRead = []
with open(filePath, 'r') as fileTrainRaw:
    for line in fileTrainRaw:          # read the file line by line
        fileTrainRead.append(line)

# Segment each line with jieba and store the results in a list
# (the slice [9:-11] strips the leading <content> tag and the trailing </content> tag plus newline)
fileTrainSeg = []
for i in range(len(fileTrainRead)):
    fileTrainSeg.append([' '.join(list(jieba.cut(fileTrainRead[i][9:-11], cut_all=False)))])
    if i % 100 == 0:
        print(i)

# Save the segmentation results to a file
with open(fileSegWordDonePath, 'w', encoding='utf-8') as fW:
    for i in range(len(fileTrainSeg)):
        fW.write(fileTrainSeg[i][0])
        fW.write('\n')

"""
Get the word vectors with gensim word2vec.
"""
import warnings
import logging
import os.path
import sys
import multiprocessing
import gensim
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Ignore gensim UserWarnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])   # name of the current script
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # inp: input corpus; out_model: output model; out_vector: model in word2vec vector format
    inp = 'corpusSegDone_1.txt'
    out_model = 'corpusSegDone_1.model'
    out_vector = 'corpusSegDone_1.vector'

    # Train the model (sg is not set, so gensim's default CBOW architecture is used;
    # pass sg=1 to train a skip-gram model instead)
    model = Word2Vec(LineSentence(inp), size=50, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())

    # Save the model
    model.save(out_model)
    # Save the word vectors
    model.wv.save_word2vec_format(out_vector, binary=False)

The segmentation result is written out one document per line, with the words separated by spaces.
Three files are saved: corpusSegDone_1.txt, corpusSegDone_1.model and corpusSegDone_1.vector.
Because this takes a while to run, no validation test was done here.
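As a minimal sketch that is not part of the original post, the saved model could be checked once training finishes by querying a few nearest neighbours; the query word 北京 (Beijing) is only an illustrative example:

from gensim.models import Word2Vec

# Load the model saved by the script above and print the 10 nearest neighbours of one word
model = Word2Vec.load('corpusSegDone_1.model')
for word, score in model.wv.most_similar('北京', topn=10):
    print(word, score)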
Part Two: Wikipedia
Because training takes quite some time, this part only records the overall approach.
1. Data preprocessing
The Chinese Wikipedia corpus alone is not large enough, while Baidu Baike is more comprehensive: Baidu Baike covers mainland-related content in more detail, and Wikipedia covers Hong Kong, Macao, Taiwan and overseas topics in more detail. The two corpora are therefore put into training together so that they complement each other. In addition, industry data for 11,000 companies was added.
- Model: the word2vec model from the gensim toolkit; easy to install and use, and fast to train.
- Corpus: 5 million Baidu Baike entries + 300,000 Wikipedia entries + 11,000 domain-specific records.
- Segmentation: jieba, with industry terms added to a custom dictionary and stop words removed.
- Hardware: depends on your machine.
2. Word segmentation
- Prepare a stop-word list; stop words should be filtered out during training to remove their interference.
- Available segmentation tools include the ICTCLAS segmenter from the Chinese Academy of Sciences, the LTP segmenter from Harbin Institute of Technology, and jieba. The Chinese Academy of Sciences segmenter gives good results, but jieba is used directly here because it is easy to use and fast.
- Custom dictionary: encyclopedia data contains many proper nouns, and many of them are long. Segmenting directly will in most cases cut them apart, which is not the result we want. For example, 中国人民解放军 (the Chinese People's Liberation Army) may be split into 中国 / 人民 / 解放军 (China / people / liberation army). Although jieba has a new-word discovery feature, to guarantee segmentation accuracy the author of jieba still recommends using a custom dictionary.
- Building the custom dictionary: 2 million entry titles were extracted from Baidu Baike. Because the dictionary contains English words, which would cause jieba to segment English tokens, regular expressions are needed to remove the English data from the entries and drop some words. The entries also contain some very short items, such as 在北京 ("in Beijing"), that cause problems during segmentation and also need to be removed with regular expressions; a simpler, cruder method is to keep only Chinese entries of three or more characters. After cleaning, a custom dictionary of 1.7 million entries remains (a small cleaning sketch is shown after this list).
- Segmentation: the script used is shown below.
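As a rough sketch of the dictionary-cleaning step described above, using the simple "keep only Chinese entries of three or more characters" rule; the input file name baike_titles.txt is hypothetical:

import re

# Keep only entries made up of 3 or more Chinese characters; English words and
# very short items such as 在北京 are dropped.
chinese_entry = re.compile(r'^[\u4e00-\u9fa5]{3,}$')

with open('baike_titles.txt', encoding='utf8') as fin, \
        open('baike_word_chinese', 'w', encoding='utf8') as fout:
    for line in fin:
        word = line.strip()
        if chinese_entry.match(word):
            fout.write(word + '\n')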
The segmentation code:
import os
import logging
import jieba

logging.basicConfig(level=logging.INFO)

# jieba's built-in parallel (multi-process) segmentation, Linux only (see notes below)
# jieba.enable_parallel()

# Load the custom dictionary
jieba.load_userdict("F:/baike_spider/dict/baike_word_chinese")

# Load the stop words
def getStopwords():
    stopwords = []
    with open("stop_words.txt", "r", encoding='utf8') as f:
        lines = f.readlines()
        for line in lines:
            stopwords.append(line.strip())
    return stopwords

stopwords = getStopwords()

# Word segmentation
def segment():
    file_nums = 0
    count = 0
    url = base_url + 'processed_data/demo/'   # base_url: corpus root path, defined elsewhere in the original script
    fileNames = os.listdir(url)
    for file in fileNames:                    # traverse each corpus file
        # Log information
        logging.info('starting ' + str(file_nums) + ' file word segmentation')
        segment_file = open(url + file + '_segment', 'a', encoding='utf8')
        # Each file is processed separately
        with open(url + file, encoding='utf8') as f:
            text = f.readlines()
            for sentence in text:
                sentence = list(jieba.cut(sentence))
                sentence_segment = []
                for word in sentence:
                    if word not in stopwords:
                        sentence_segment.append(word)
                segment_file.write(" ".join(sentence_segment))
            del text
            f.close()
        segment_file.close()
        # Log information
        logging.info('finished ' + str(file_nums) + ' file word segmentation')
        file_nums += 1

- Because Python multithreading runs on only one core at a time (the GIL), threads cannot make effective use of a multi-core CPU. jieba is written in Python, so the only parallelism it supports is parallel segmentation, which means multi-process segmentation, and that mode is not supported on Windows.
- On Linux I tried jieba's built-in parallel segmentation. After it is enabled, jieba automatically starts several worker processes in the background. However, parallel segmentation requires reading the whole training corpus into memory at once and passing it in as jieba.cut(file.read()); with line-by-line input as in my code, starting multiple processes does not help. jieba's multi-process mode works by automatically splitting the corpus across the worker processes, segmenting each piece, and then merging the results.
- On an 8-core, 16 GB Linux virtual machine, enabling jieba's parallel segmentation on 1 GB of corpus data quickly exhausted the memory.
- Single-process jieba segmentation does not need to load all of the corpus at once: the corpus can be read line by line, memory usage stays low, and it runs stably. So the corpus was split into 8 parts and 8 processes were started manually, one per part, as sketched below. This keeps each process's memory usage stable and performs better than jieba's built-in parallel mode. For 20 GB of data with HMM mode enabled, segmentation took about 10 hours.
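A minimal sketch of this manual approach, assuming the raw corpus has already been split into eight shard files; the shard_*.txt file names are hypothetical and stop-word filtering is omitted for brevity:

import multiprocessing
import jieba

def segment_shard(in_path, out_path):
    # Read line by line so each process keeps a small, stable memory footprint
    with open(in_path, encoding='utf8') as fin, open(out_path, 'w', encoding='utf8') as fout:
        for line in fin:
            words = jieba.cut(line.strip())          # HMM mode is on by default
            fout.write(' '.join(words) + '\n')

if __name__ == '__main__':
    jobs = []
    for i in range(8):
        p = multiprocessing.Process(target=segment_shard,
                                    args=('shard_%d.txt' % i, 'shard_%d_segment.txt' % i))
        p.start()
        jobs.append(p)
    for p in jobs:
        p.join()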
3. word2vec Training
Training uses the word2vec implementation in the gensim toolkit: it is easy to use, trains quickly, and gives better results than Google's original word2vec. Running a word2vec model with TensorFlow instead was simply not feasible with 16 GB of RAM. The gensim word2vec training code is as follows; it is quite simple:
import logging
import multiprocessing
import os.path
import sys
import jieba
from gensim.models import Word2Vec
from gensim.models.word2vec import PathLineSentences

if __name__ == '__main__':
    # Logging setup
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    # if len(sys.argv) < 4:
    #     print(globals()['__doc__'] % locals())
    #     sys.exit(1)
    # input_dir, outp1, outp2 = sys.argv[1:4]
    input_dir = 'segment'
    outp1 = 'baike.model'
    outp2 = 'word2vec_format'
    fileNames = os.listdir(input_dir)

    # Train the model
    # Corpus directory: PathLineSentences(input_dir)
    # embedding size 256, window size 10, drop words that occur fewer than 5 times,
    # one worker per CPU core, 10 iterations
    model = Word2Vec(PathLineSentences(input_dir),
                     size=256, window=10, min_count=5,
                     workers=multiprocessing.cpu_count(), iter=10)
    model.save(outp1)
    model.wv.save_word2vec_format(outp2, binary=False)

# Run command (from the directory containing the training files):
# python word2vec_model.py data baike.model baike.vector

- Because the corpus is too large to load into memory all at once, gensim provides the PathLineSentences(input_dir) class, which reads the corpus files in the specified directory one by one and feeds the training data to the model through an iterator.
- You can see from the training log that the process first reads every file in turn to build the overall vocab dictionary and count word frequencies; this count is used during training to filter out words occurring fewer than min_count times. After the vocab dictionary is built, the corpus is read through again for model training, and training itself is very fast.
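As a small sketch that is not in the original post, the vocabulary built in this first pass can be inspected once baike.model has been trained as above (the word 中国 is only an illustrative key):

from gensim.models import Word2Vec

model = Word2Vec.load('baike.model')
print(len(model.wv.vocab))            # number of words kept after min_count filtering
print(model.wv.vocab['中国'].count)    # raw corpus frequency of one kept word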
Part Three: Saving and loading word2vec word vectors
- Saving and loading with model.save()

Save the model:

import gensim
model = gensim.models.Word2Vec(documents, size=300)
model.train(documents, total_examples=len(documents), epochs=10)
model.save("../input/Word2vec.w2v")

Load the word vectors:

import gensim
word2vec = gensim.models.word2vec.Word2Vec.load("./input/Quora.w2v").wv

- Storing the word vectors in binary format

Save the word vectors:

model.wv.save_word2vec_format(embedding_path, binary=True)
# model.wv.save_word2vec_format(embedding_path, binary=False)   # non-binary (plain text) format

Load the word vectors:

import gensim
word2vec = gensim.models.KeyedVectors.load_word2vec_format(embedding_path, binary=True)

- Saving and loading with numpy
Files holding array data can be in binary or text format; binary files can use NumPy's dedicated binary type (.npy) or an unformatted raw type. Use np.save() to save an .npy file and np.load() to load it.
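For example, here is a minimal sketch that is not in the original post, loading the model trained above and round-tripping its embedding matrix through an .npy file (older gensim versions expose the matrix as model.wv.syn0 rather than model.wv.vectors):

import numpy as np
from gensim.models import Word2Vec

# Load the trained model and save the raw embedding matrix in NumPy's binary .npy format
model = Word2Vec.load('baike.model')
np.save('word_vectors.npy', model.wv.vectors)

# Load it back as an ndarray of shape (vocab_size, vector_size)
vectors = np.load('word_vectors.npy')
print(vectors.shape)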
Model export and import:

The simplest export and import:

(1) word2vec.save can export the model file; here it is not exported as .bin:

# Model saving and loading
model.save('/tmp/mymodel')
new_model = gensim.models.Word2Vec.load('/tmp/mymodel')
model = Word2Vec.load_word2vec_format('/tmp/vectors.txt', binary=False)    # load a .txt file
# using gzipped/bz2 input works too, no need to unzip:
model = Word2Vec.load_word2vec_format('/tmp/vectors.bin.gz', binary=True)  # load a .bin file
word2vec = gensim.models.word2vec.Word2Vec(sentences(), size=256, window=10, min_count=64,
                                           sg=1, hs=1, iter=10, workers=25)
word2vec.save('word2vec_wx')

(2) Import with gensim.models.Word2Vec.load:

model = gensim.models.Word2Vec.load('xxx/word2vec_wx')
pd.Series(model.most_similar(u'微信', topn=360000))   # u'微信' ("WeChat")

(3) NumPy: you can use numpy.load:

import numpy
word_2x = numpy.load('xxx/word2vec_wx.wv.syn0.npy')

(4) Other import methods, loading the txt and bin formats:

from gensim.models.keyedvectors import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format('/tmp/vectors.txt', binary=False)  # C text format
word_vectors = KeyedVectors.load_word2vec_format('/tmp/vectors.bin', binary=True)   # C binary format

Incremental training
# Incremental training
model = gensim.models.Word2Vec.load(temp_path)
more_sentences = [['Advanced', 'users', 'can', 'load', 'a', 'model', 'and', 'continue', 'training', 'it', 'with', 'more', 'sentences']]
model.build_vocab(more_sentences, update=True)
model.train(more_sentences, total_examples=model.corpus_count, epochs=model.iter)

Note that a model generated by the original C word2vec tool cannot be incrementally retrained this way.
This is only a record for learning purposes.
References:
https://www.cnblogs.com/Newsteinwell/p/6034747.html
https://www.jianshu.com/p/87798bccee48
https://blog.csdn.net/sinat_26917383/article/details/69803018
Publisher: Full Stack Programmer. Please credit the source when reprinting: https://javaforall.cn/131448.html (original site: https://javaforall.cn)