Word2vec vector model of Wiki Chinese corpus based on deep learning
2022-06-29 16:56:00 【biyezuopinvip】
Resource download address :https://download.csdn.net/download/sheziqiong/85820613
This example introduces how to obtain the Wiki Chinese corpus and build a Word2vec model with Python. It does not cover the underlying theory; the aim is to walk step by step through the basic methods of natural language processing. The article covers development environment preparation, data acquisition, data preprocessing, model building, and model testing, corresponding to the five steps of building the model.
1. Development environment preparation
1.1 Python environment
Download the appropriate Python version from the official Python website; I use Python 2.7.13.
1.2 The gensim module
(1) Download the modules
Word2vec requires the third-party gensim module, which in turn depends on the numpy and scipy packages, so download matching versions of numpy, scipy, and gensim in that order. Download address: http://www.lfd.uci.edu/~gohlke/pythonlibs/
(2) Install the modules
After downloading, open a cmd window in the Scripts directory under the Python installation directory and run the following commands to install the wheels.
pip install numpy*.whl
pip install scipy*.whl
pip install gensim*.whl
(3) Verify that the modules were installed successfully
Type python to enter the Python interactive shell and run "import numpy; import scipy; import gensim;" in turn. If no error is reported, the installation succeeded.
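As an optional quick check (my addition, not part of the original post), the following snippet confirms that all three packages import and prints their versions:
# quick sanity check: all three packages should import and report a version
import numpy
import scipy
import gensim
print(numpy.__version__)
print(scipy.__version__)
print(gensim.__version__)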
2. Wiki data acquisition
2.1 Download the Wiki Chinese data
Download the Chinese corpus from the Wikipedia dump site. The downloaded file is named zhwiki-latest-pages-articles.xml.bz2, is about 1.3 GB in size, and contains a single XML file.
The download address is as follows :https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
2.2 Convert the XML Wiki data to text format
(1) Python implementation
Write a Python program to convert the XML file to text format, using the WikiCorpus class in gensim.corpora to process the Wikipedia data. The program is saved as 1_process.py; a sketch is shown below.

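The original listing is not reproduced in this copy of the post; the following is a minimal sketch of what 1_process.py looks like, assuming the older gensim API used with Python 2.7 (WikiCorpus with the lemmatize argument, which was removed in gensim 4.0) and matching the command-line usage and "Saved N articles" log messages shown below.
# 1_process.py -- convert the Wikipedia XML dump to plain text, one article per line
import logging
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s',
                        level=logging.INFO)
    logger = logging.getLogger(sys.argv[0])

    if len(sys.argv) != 3:
        print('Usage: python 1_process.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.txt')
        sys.exit(1)
    inp, outp = sys.argv[1], sys.argv[2]

    i = 0
    output = open(outp, 'w')
    # WikiCorpus parses the compressed dump and yields tokenised articles
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        output.write(' '.join(text) + '\n')
        i += 1
        if i % 10000 == 0:
            logger.info('Saved %d articles.' % i)
    output.close()
    logger.info('Finished Saved %d articles.' % i)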
(2) Run the program
Run the following cmd command in the code folder to obtain the converted file wiki.zh.txt.
D:\PyRoot\iDemo\wiki_zh>python 1_process.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.txt
(3) Run results
2017-04-18 09:24:28,901: INFO: running 1_process.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.txt
2017-04-18 09:25:31,154: INFO: Saved 10000 articles.
2017-04-18 09:26:21,582: INFO: Saved 20000 articles.
2017-04-18 09:27:05,642: INFO: Saved 30000 articles.
2017-04-18 09:27:48,917: INFO: Saved 40000 articles.
2017-04-18 09:28:35,546: INFO: Saved 50000 articles.
2017-04-18 09:29:21,102: INFO: Saved 60000 articles.
2017-04-18 09:30:04,540: INFO: Saved 70000 articles.
2017-04-18 09:30:48,022: INFO: Saved 80000 articles.
2017-04-18 09:31:30,665: INFO: Saved 90000 articles.
2017-04-18 09:32:17,599: INFO: Saved 100000 articles.
2017-04-18 09:33:13,811: INFO: Saved 110000 articles.
2017-04-18 09:34:06,316: INFO: Saved 120000 articles.
2017-04-18 09:35:01,007: INFO: Saved 130000 articles.
2017-04-18 09:35:52,628: INFO: Saved 140000 articles.
2017-04-18 09:36:47,148: INFO: Saved 150000 articles.
2017-04-18 09:37:41,137: INFO: Saved 160000 articles.
2017-04-18 09:38:33,684: INFO: Saved 170000 articles.
2017-04-18 09:39:37,957: INFO: Saved 180000 articles.
2017-04-18 09:43:36,299: INFO: Saved 190000 articles.
2017-04-18 09:45:21,509: INFO: Saved 200000 articles.
2017-04-18 09:46:40,865: INFO: Saved 210000 articles.
2017-04-18 09:47:55,453: INFO: Saved 220000 articles.
2017-04-18 09:49:07,835: INFO: Saved 230000 articles.
2017-04-18 09:50:27,562: INFO: Saved 240000 articles.
2017-04-18 09:51:38,755: INFO: Saved 250000 articles.
2017-04-18 09:52:50,240: INFO: Saved 260000 articles.
2017-04-18 09:53:57,526: INFO: Saved 270000 articles.
2017-04-18 09:55:01,720: INFO: Saved 280000 articles.
2017-04-18 09:55:22,565: INFO: finished iterating over Wikipedia corpus of 282855 documents with 63427579 positions (total 2908316 articles, 75814559 positions before pruning articles shorter than 50 words)
2017-04-18 09:55:22,568: INFO: Finished Saved 282855 articles.
According to the log, the run processed 282,855 articles in about 31 minutes and produced a 931 MB txt file.
3. Wiki data preprocessing
3.1 Convert traditional Chinese to simplified Chinese
The Wiki Chinese corpus contains many traditional characters, which need to be converted to simplified characters before further processing. The OpenCC tool is used here for the conversion.
(1) Install OpenCC
Download the appropriate version of OpenCC from the link below; the version I downloaded is opencc-1.0.1-win32.
https://bintray.com/package/files/byvoid/opencc/OpenCC
In addition, there is reportedly also a Python version that can be installed with pip install opencc-python; I have not tried it, so it is not covered here.
(2) Convert traditional to simplified characters with OpenCC
Go into the OpenCC directory (opencc-1.0.1-win32), which contains the opencc.exe executable. Open a DOS window in that directory (Shift + right mouse button -> Open command window here) and enter the following command:
opencc -i wiki.zh.txt -o wiki.zh.simp.txt -c t2s.json
This produces the file wiki.zh.simp.txt, which contains the corpus converted to simplified Chinese.
(3) Check the result
The converted txt file is more than 900 MB and cannot be opened with Notepad++, so Python's built-in I/O is used to read it. The Python code is as follows:
import codecs, sys
# read the first line to spot-check the simplified-Chinese conversion
f = codecs.open('wiki.zh.simp.txt', 'r', encoding="utf8")
line = f.readline()
print(line)
The screenshot of the traditional Chinese example is shown below :

The simplified Chinese screenshot after conversion is as follows :

3.2 Word segmentation with jieba
In this example, the jieba segmenter is used to segment the simplified Wiki Chinese corpus. Install the jieba module before running the code. Since punctuation has already been removed from this corpus, no cleaning is needed in the segmentation program and the text can be segmented directly; if you collect data yourself, you need to remove punctuation and stop words first.
The Python implementation is as follows:

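The segmentation listing is not reproduced here; the sketch below segments wiki.zh.simp.txt line by line with jieba and writes space-separated tokens to wiki.zh.simp.seg.txt. The file name 2_jieba_segment.py is my assumption, since the post does not state it.
# 2_jieba_segment.py (hypothetical name) -- segment the simplified corpus with jieba
import codecs

import jieba

infile = 'wiki.zh.simp.txt'        # simplified-Chinese corpus produced by OpenCC
outfile = 'wiki.zh.simp.seg.txt'   # space-separated segmentation result

fin = codecs.open(infile, 'r', encoding='utf8')
fout = codecs.open(outfile, 'w', encoding='utf8')
for line in fin:
    # punctuation is already absent from this corpus, so segment directly
    words = jieba.cut(line.strip())
    fout.write(' '.join(words) + '\n')
fin.close()
fout.close()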
After the code runs, you get a 1.12 GB file named wiki.zh.simp.seg.txt. A screenshot of the segmentation result is shown below:

4. Word2Vec model training
(1) Word2vec model implementation
The segmented document can now be used to train the Word2vec word vector model. The file is large: my 4 GB Win7 machine reported a memory error, but after switching to an 8 GB Mac the training completed, and quickly. The Python code, saved as 3_train_word2vec_model.py, is shown below.

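The training listing is likewise missing from this copy; the following is a minimal sketch of 3_train_word2vec_model.py, assuming a pre-4.0 gensim (where the parameter is size rather than vector_size). size=400 is inferred from the 733434x400 projection weights in the log below; the other hyperparameters are assumptions.
# 3_train_word2vec_model.py -- train word vectors on the segmented corpus
import logging
import multiprocessing

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s',
                        level=logging.INFO)

    inp = 'wiki.zh.simp.seg.txt'         # segmented corpus, one article per line
    outp_model = 'wiki.zh.text.model'    # trained model
    outp_vector = 'wiki.zh.text.vector'  # word vectors in text format

    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())
    model.save(outp_model)
    model.wv.save_word2vec_format(outp_vector, binary=False)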
(2) View the run results
2017-05-03 21:54:14,887: INFO: training on 822697865 raw words (765330910 effective words) took 1655.2s, 462390 effective words/s
2017-05-03 21:54:14,888: INFO: saving Word2Vec object under /Users/sy/Desktop/pyRoot/wiki_zh_vec/wiki.zh.text.model, separately None
2017-05-03 21:54:14,888: INFO: not storing attribute syn0norm
2017-05-03 21:54:14,889: INFO: storing np array 'syn0' to /Users/sy/Desktop/pyRoot/wiki_zh_vec/wiki.zh.text.model.wv.syn0.npy
2017-05-03 21:54:16,505: INFO: storing np array 'syn1neg' to /Users/sy/Desktop/pyRoot/wiki_zh_vec/wiki.zh.text.model.syn1neg.npy
2017-05-03 21:54:18,123: INFO: not storing attribute cum_table
2017-05-03 21:54:26,542: INFO: saved /Users/sy/Desktop/pyRoot/wiki_zh_vec/wiki.zh.text.model
2017-05-03 21:54:26,543: INFO: storing 733434x400 projection weights into /Users/sy/Desktop/pyRoot/wiki_zh_vec/wiki.zh.text.vector
The log above shows the last few lines of the run. After the code finishes, the following four files are produced, of which wiki.zh.text.model is the trained model and wiki.zh.text.vector contains the word vectors.

5. Model testing
After the model is trained, it needs to be tested. The Python code, saved as 4_model_match.py, is as follows.

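The test listing is not reproduced here either; the sketch below shows the idea of 4_model_match.py, again assuming a pre-4.0 gensim API (model.most_similar; newer versions require model.wv.most_similar). The query word is only an example and must exist in the vocabulary.
# 4_model_match.py -- load the trained model and list words related to a query word
# -*- coding: utf-8 -*-
from gensim.models import Word2Vec

if __name__ == '__main__':
    model = Word2Vec.load('wiki.zh.text.model')

    word = u'足球'  # example query word; replace with any word in the vocabulary
    for candidate, score in model.most_similar(word, topn=10):
        print(candidate, score)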
Run the file to see the result: the words most related to a given word.

At this point, word vector modeling of the Chinese Wiki corpus with Python is complete. wiki.zh.text.vector stores the word vector for each word, and on this basis text feature extraction and classification can be carried out.