Word2vec vector model of Wiki Chinese corpus based on deep learning
2022-06-29 16:56:00 【biyezuopinvip】
Resource download address :https://download.csdn.net/download/sheziqiong/85820613
This example introduces how to obtain the Wiki Chinese corpus and build a Word2vec model with Python. It does not cover the underlying theory; the aim is to walk step by step through the basic methods of natural language processing. The article covers development environment preparation, data acquisition, data preprocessing, model building, and model testing, corresponding to the five steps of building the model.
1. Development environment preparation
1.1 Python environment
Download the appropriate Python version from the official Python website; I use Python 2.7.13.
1.2 The gensim module
(1) Download the modules
Word2vec requires the third-party gensim module, which in turn depends on the numpy and scipy packages, so download matching versions of numpy, scipy, and gensim in that order. Download address: http://www.lfd.uci.edu/~gohlke/pythonlibs/
(2) Install the modules
After downloading, open a cmd window in the Scripts directory under the Python installation directory and run the following commands to install the wheels.
pip install numpy*.whl
pip install scipy*.whl
pip install gensim*.whl
(3) Verify that the modules were installed successfully
Type python to enter the Python interactive shell and run "import numpy; import scipy; import gensim;" in turn. If no error is reported, the installation succeeded.
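As an optional quick check (my addition, not part of the original post), the following snippet confirms that all three packages import and prints their versions:
# quick sanity check: all three packages should import and report a version
import numpy
import scipy
import gensim
print(numpy.__version__)
print(scipy.__version__)
print(gensim.__version__)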
2. Wiki data acquisition
2.1 Download the Wiki Chinese data
Download the Chinese corpus from the Wikipedia dump site. The downloaded file is named zhwiki-latest-pages-articles.xml.bz2, is about 1.3 GB in size, and contains a single XML file.
The download address is as follows :https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
2.2 Convert the XML Wiki data to text format
(1) Python implementation
Write a Python program to convert the XML file to text format, using the WikiCorpus class in gensim.corpora to process the Wikipedia data. The program is saved as 1_process.py; a sketch is shown below.

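The original listing is not reproduced in this copy of the post; the following is a minimal sketch of what 1_process.py looks like, assuming the older gensim API used with Python 2.7 (WikiCorpus with the lemmatize argument, which was removed in gensim 4.0) and matching the command-line usage and "Saved N articles" log messages shown below.
# 1_process.py -- convert the Wikipedia XML dump to plain text, one article per line
import logging
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s',
                        level=logging.INFO)
    logger = logging.getLogger(sys.argv[0])

    if len(sys.argv) != 3:
        print('Usage: python 1_process.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.txt')
        sys.exit(1)
    inp, outp = sys.argv[1], sys.argv[2]

    i = 0
    output = open(outp, 'w')
    # WikiCorpus parses the compressed dump and yields tokenised articles
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        output.write(' '.join(text) + '\n')
        i += 1
        if i % 10000 == 0:
            logger.info('Saved %d articles.' % i)
    output.close()
    logger.info('Finished Saved %d articles.' % i)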
(2) Run the program
Run the following cmd command in the code folder to obtain the converted file wiki.zh.txt.
D:\PyRoot\iDemo\wiki_zh>python 1_process.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.txt
(3) Run results
2017-04-18 09:24:28,901: INFO: running 1_process.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.txt
2017-04-18 09:25:31,154: INFO: Saved 10000 articles.
2017-04-18 09:26:21,582: INFO: Saved 20000 articles.
2017-04-18 09:27:05,642: INFO: Saved 30000 articles.
2017-04-18 09:27:48,917: INFO: Saved 40000 articles.
2017-04-18 09:28:35,546: INFO: Saved 50000 articles.
2017-04-18 09:29:21,102: INFO: Saved 60000 articles.
2017-04-18 09:30:04,540: INFO: Saved 70000 articles.
2017-04-18 09:30:48,022: INFO: Saved 80000 articles.
2017-04-18 09:31:30,665: INFO: Saved 90000 articles.
2017-04-18 09:32:17,599: INFO: Saved 100000 articles.
2017-04-18 09:33:13,811: INFO: Saved 110000 articles.
2017-04-18 09:34:06,316: INFO: Saved 120000 articles.
2017-04-18 09:35:01,007: INFO: Saved 130000 articles.
2017-04-18 09:35:52,628: INFO: Saved 140000 articles.
2017-04-18 09:36:47,148: INFO: Saved 150000 articles.
2017-04-18 09:37:41,137: INFO: Saved 160000 articles.
2017-04-18 09:38:33,684: INFO: Saved 170000 articles.
2017-04-18 09:39:37,957: INFO: Saved 180000 articles.
2017-04-18 09:43:36,299: INFO: Saved 190000 articles.
2017-04-18 09:45:21,509: INFO: Saved 200000 articles.
2017-04-18 09:46:40,865: INFO: Saved 210000 articles.
2017-04-18 09:47:55,453: INFO: Saved 220000 articles.
2017-04-18 09:49:07,835: INFO: Saved 230000 articles.
2017-04-18 09:50:27,562: INFO: Saved 240000 articles.
2017-04-18 09:51:38,755: INFO: Saved 250000 articles.
2017-04-18 09:52:50,240: INFO: Saved 260000 articles.
2017-04-18 09:53:57,526: INFO: Saved 270000 articles.
2017-04-18 09:55:01,720: INFO: Saved 280000 articles.
2017-04-18 09:55:22,565: INFO: finished iterating over Wikipedia corpus of 282855 documents with 63427579 positions (total 2908316 articles, 75814559 positions before pruning articles shorter than 50 words)
2017-04-18 09:55:22,568: INFO: Finished Saved 282855 articles.
According to the log, the run processed 282,855 articles in about 31 minutes and produced a 931 MB txt file.
3. Wiki data preprocessing
3.1 Convert traditional Chinese to simplified Chinese
The Wiki Chinese corpus contains many traditional characters, which need to be converted to simplified characters before further processing. The OpenCC tool is used here for the conversion.
(1) Install OpenCC
Download the appropriate version of OpenCC from the link below; the version I downloaded is opencc-1.0.1-win32.
https://bintray.com/package/files/byvoid/opencc/OpenCC
In addition, there is reportedly also a Python version that can be installed with pip install opencc-python; I have not tried it, so it is not covered here.
(2) Convert traditional to simplified characters with OpenCC
Go into the OpenCC directory (opencc-1.0.1-win32), which contains the opencc.exe executable. Open a DOS window in that directory (Shift + right mouse button -> Open command window here) and enter the following command:
opencc -i wiki.zh.txt -o wiki.zh.simp.txt -c t2s.json
This produces the file wiki.zh.simp.txt, which contains the corpus converted to simplified Chinese.
(3) Check the result
The converted txt file is more than 900 MB and cannot be opened with Notepad++, so Python's built-in I/O is used to read it. The Python code is as follows:
import codecs, sys
# read the first line to spot-check the simplified-Chinese conversion
f = codecs.open('wiki.zh.simp.txt', 'r', encoding="utf8")
line = f.readline()
print(line)
The screenshot of the traditional Chinese example is shown below :

The simplified Chinese screenshot after conversion is as follows :

3.2 Word segmentation with jieba
In this example, the jieba segmenter is used to segment the simplified Wiki Chinese corpus. Install the jieba module before running the code. Since punctuation has already been removed from this corpus, no cleaning is needed in the segmentation program and the text can be segmented directly; if you collect data yourself, you need to remove punctuation and stop words first.
The Python implementation is as follows:

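The segmentation listing is not reproduced here; the sketch below segments wiki.zh.simp.txt line by line with jieba and writes space-separated tokens to wiki.zh.simp.seg.txt. The file name 2_jieba_segment.py is my assumption, since the post does not state it.
# 2_jieba_segment.py (hypothetical name) -- segment the simplified corpus with jieba
import codecs

import jieba

infile = 'wiki.zh.simp.txt'        # simplified-Chinese corpus produced by OpenCC
outfile = 'wiki.zh.simp.seg.txt'   # space-separated segmentation result

fin = codecs.open(infile, 'r', encoding='utf8')
fout = codecs.open(outfile, 'w', encoding='utf8')
for line in fin:
    # punctuation is already absent from this corpus, so segment directly
    words = jieba.cut(line.strip())
    fout.write(' '.join(words) + '\n')
fin.close()
fout.close()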
After the code runs, you get a 1.12 GB file named wiki.zh.simp.seg.txt. A screenshot of the segmentation result is shown below:

4. Word2Vec model training
(1) Word2vec model implementation
The segmented document can now be used to train the Word2vec word vector model. The file is large: my 4 GB Win7 machine reported a memory error, but after switching to an 8 GB Mac the training completed, and quickly. The Python code, saved as 3_train_word2vec_model.py, is shown below.

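The training listing is likewise missing from this copy; the following is a minimal sketch of 3_train_word2vec_model.py, assuming a pre-4.0 gensim (where the parameter is size rather than vector_size). size=400 is inferred from the 733434x400 projection weights in the log below; the other hyperparameters are assumptions.
# 3_train_word2vec_model.py -- train word vectors on the segmented corpus
import logging
import multiprocessing

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s',
                        level=logging.INFO)

    inp = 'wiki.zh.simp.seg.txt'         # segmented corpus, one article per line
    outp_model = 'wiki.zh.text.model'    # trained model
    outp_vector = 'wiki.zh.text.vector'  # word vectors in text format

    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())
    model.save(outp_model)
    model.wv.save_word2vec_format(outp_vector, binary=False)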
(2) View the run results
2017-05-03 21:54:14,887: INFO: training on 822697865 raw words (765330910 effective words) took 1655.2s, 462390 effective words/s
2017-05-03 21:54:14,888: INFO: saving Word2Vec object under /Users/sy/Desktop/pyRoot/wiki_zh_vec/wiki.zh.text.model, separately None
2017-05-03 21:54:14,888: INFO: not storing attribute syn0norm
2017-05-03 21:54:14,889: INFO: storing np array 'syn0' to /Users/sy/Desktop/pyRoot/wiki_zh_vec/wiki.zh.text.model.wv.syn0.npy
2017-05-03 21:54:16,505: INFO: storing np array 'syn1neg' to /Users/sy/Desktop/pyRoot/wiki_zh_vec/wiki.zh.text.model.syn1neg.npy
2017-05-03 21:54:18,123: INFO: not storing attribute cum_table
2017-05-03 21:54:26,542: INFO: saved /Users/sy/Desktop/pyRoot/wiki_zh_vec/wiki.zh.text.model
2017-05-03 21:54:26,543: INFO: storing 733434x400 projection weights into /Users/sy/Desktop/pyRoot/wiki_zh_vec/wiki.zh.text.vector
The log above shows the last few lines of the run. After the code finishes, the following four files are produced, of which wiki.zh.text.model is the trained model and wiki.zh.text.vector contains the word vectors.

5. Model testing
After the model is trained, it needs to be tested. The Python code, saved as 4_model_match.py, is as follows.

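The test listing is not reproduced here either; the sketch below shows the idea of 4_model_match.py, again assuming a pre-4.0 gensim API (model.most_similar; newer versions require model.wv.most_similar). The query word is only an example and must exist in the vocabulary.
# 4_model_match.py -- load the trained model and list words related to a query word
# -*- coding: utf-8 -*-
from gensim.models import Word2Vec

if __name__ == '__main__':
    model = Word2Vec.load('wiki.zh.text.model')

    word = u'足球'  # example query word; replace with any word in the vocabulary
    for candidate, score in model.most_similar(word, topn=10):
        print(candidate, score)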
Run the file to see the result: the words most related to a given word.

At this point, word vector modeling of the Chinese Wiki corpus with Python is complete. wiki.zh.text.vector stores the word vector for each word, and on this basis text feature extraction and classification can be carried out.