Downloading wiki corpus and aligning with multilingual wikis
2022-06-13 01:04:00 【kaims】
Using wikiextractor to extract the wikidumps corpus
Generating parallel sentence pairs from the Wikipedia corpus
Building a parallel corpus from Wikipedia
Downloading the wikidumps corpus
The wikidumps site is en-wikidumps.
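As a minimal sketch, the download step could be scripted as follows. The base URL and the `<wiki>/latest/<wiki>-latest-<suffix>` path layout are assumptions about the Wikimedia dump site (the original link did not survive), and `dump_url` and `download` are hypothetical helper names, not part of any tool mentioned here.

```python
import shutil
import urllib.request

# Assumed base URL of the Wikimedia dump site; not stated in the original post.
DUMP_BASE = "https://dumps.wikimedia.org"


def dump_url(lang: str, suffix: str = "pages-articles.xml.bz2") -> str:
    """Build the URL of the latest dump file for one language edition."""
    wiki = f"{lang}wiki"
    return f"{DUMP_BASE}/{wiki}/latest/{wiki}-latest-{suffix}"


def download(url: str, dest: str) -> None:
    """Stream a (potentially very large) dump file to disk."""
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)


if __name__ == "__main__":
    url = dump_url("en")
    print(url)
    # Uncomment to actually download; the English dump is tens of gigabytes.
    # download(url, "enwiki-latest-pages-articles.xml.bz2")
```

The same helper can build the names of other dump files by changing `suffix`, for example `dump_url("zh", "langlinks.sql.gz")`.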
Processing the wikidumps corpus
The wikidumps corpus can be extracted with the wikiextractor tool, which must be installed first:
pip install wikiextractor
There are then two ways to use it. One is to run the Python module as a script:
python -m wikiextractor.WikiExtractor enwiki-latest-pages-articles.xml.bz2
The other is to go to the installed wikiextractor directory and run WikiExtractor.py on the wikidumps corpus:
python WikiExtractor.py enwiki-latest-pages-articles.xml.bz2
Some common parameters:
1. -b: maximum size of each output file. For example, with -b 100M a new file is started whenever the current output file reaches 100M, so multiple files may be generated.
2. -o: the output directory, optionally with a leading path, for example -o AA_yue or -o /extract/AA_yue. The default output directory is text.
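The two parameters above can be combined in a single invocation. The sketch below assembles that command from Python; `build_command` and `run_extractor` are hypothetical helper names, and only the `-b` and `-o` flags described above are used.

```python
import subprocess
import sys
from typing import List


def build_command(dump_path: str, output_dir: str = "text",
                  max_bytes: str = "100M") -> List[str]:
    """Assemble the wikiextractor command line with -b and -o."""
    return [
        sys.executable, "-m", "wikiextractor.WikiExtractor",
        "-b", max_bytes,    # start a new output file once this size is reached
        "-o", output_dir,   # directory that receives the extracted files
        dump_path,
    ]


def run_extractor(dump_path: str, output_dir: str = "text",
                  max_bytes: str = "100M") -> None:
    """Run the extraction; requires wikiextractor to be installed."""
    subprocess.run(build_command(dump_path, output_dir, max_bytes), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, since extraction takes hours.
    print(" ".join(build_command("enwiki-latest-pages-articles.xml.bz2", "AA_yue")))
```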
The processed files have the following format:
<doc id="244" url="https://zh.wikipedia.org/wiki?curid=244" title="Historians">
Historians
Historians, also known by several other names, are people who write historical works as a profession, or intellectuals who have made great efforts to establish, develop, and apply the study of history. Historians include compilers of historical records and researchers of historical materials. People studying history must rely on the records left by their predecessors. Historians study the events of the past and the authenticity of their records, and record their research. A historian can study one person's experience or the development of a city, a place, or a country. According to their subjects, history can be divided into different categories, for example:
Personal history
Personal history is the study of what happened to someone in the past.
Local history
Local history is the study of events that have occurred in a city or place.
...
</doc>
<doc id="256" url="https://zh.wikipedia.org/wiki?curid=256" title="Open source">
...
</doc>
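Files in this format can be split back into articles with a small parser. The following is a minimal sketch that assumes each article is wrapped exactly as in the sample above (a `<doc id=... url=... title=...>` line, the body, then `</doc>`); `iter_docs` and `SAMPLE` are hypothetical names introduced for illustration.

```python
import re
from typing import Dict, Iterator

# Matches one <doc ...> ... </doc> block in the layout shown above.
DOC_RE = re.compile(
    r'<doc id="(?P<id>[^"]*)" url="(?P<url>[^"]*)" title="(?P<title>[^"]*)">\n'
    r'(?P<text>.*?)</doc>',
    re.DOTALL,
)

# Tiny stand-in for an extracted file, with the bodies shortened.
SAMPLE = (
    '<doc id="244" url="https://zh.wikipedia.org/wiki?curid=244" title="Historians">\n'
    'Historians\nHistorians include compilers of historical records.\n</doc>\n'
    '<doc id="256" url="https://zh.wikipedia.org/wiki?curid=256" title="Open source">\n'
    'Open source\n</doc>\n'
)


def iter_docs(raw: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per article with its id, url, title, and body text."""
    for match in DOC_RE.finditer(raw):
        yield {
            "id": match.group("id"),
            "url": match.group("url"),
            "title": match.group("title").strip(),
            "text": match.group("text").strip(),
        }


if __name__ == "__main__":
    for doc in iter_docs(SAMPLE):
        print(doc["id"], doc["title"])
```

In practice the function would be applied to the contents of each file in the output directory; the titles it yields are what the alignment step matches on.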
Aligning the wikidumps corpus by title
Online mode
Online, you can use dedicated APIs such as wikipedia or wikipediaapi; for details, see the wikipedia usage guide and the Wikipedia API usage guide.
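For illustration only, the sketch below looks up the title of the corresponding article in another language online. Note that it queries the public MediaWiki web API directly with the standard library rather than using the `wikipedia` or `wikipediaapi` packages named above. The endpoint, the `prop=langlinks` and `lllang` parameters, and the response layout (link title stored under the `"*"` key) are assumptions about that API, and all function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional


def langlink_url(title: str, source_lang: str = "en",
                 target_lang: str = "zh") -> str:
    """Build a MediaWiki API query for one title's inter-language link."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllang": target_lang,  # restrict the result to one target language
        "format": "json",
    }
    return (f"https://{source_lang}.wikipedia.org/w/api.php?"
            + urllib.parse.urlencode(params))


def extract_langlink(response: dict) -> Optional[str]:
    """Pull the aligned title out of a decoded API response, if present."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        for link in page.get("langlinks", []):
            return link.get("*")
    return None


def fetch_langlink(title: str, source_lang: str = "en",
                   target_lang: str = "zh") -> Optional[str]:
    """Perform the request; needs network access."""
    request = urllib.request.Request(
        langlink_url(title, source_lang, target_lang),
        headers={"User-Agent": "title-alignment-sketch/0.1"},
    )
    with urllib.request.urlopen(request) as resp:
        return extract_langlink(json.load(resp))
```

Calling `fetch_langlink("Open source")` would return the Chinese title if the article has one, or `None` otherwise. One request per title is slow for a whole corpus, which is the motivation for the offline mode.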
Offline mode
Offline mode requires downloading the dump files needed for alignment first, processing them with a tool, and then writing your own program to obtain the alignment information.
First, download the files required for alignment from wikidumps. They are named as follows:
*-page.sql.gz
*-langlinks.sql.gz
Here * is a prefix that generally contains the language abbreviation and the dump date. Next, use the wikipedia-parallel-titles tool to generate the title alignment information. The tool's repository contains a build-corpus.sh script, and running it produces the title alignment file. The command is ./build-corpus.sh en zhwiki-latest > titles.txt, which uses the zhwiki-latest dump files to produce titles.txt, the file of titles aligned with en.
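To make the idea behind the offline alignment concrete, here is a minimal sketch of the underlying join, not the actual implementation of wikipedia-parallel-titles. It assumes the two SQL dumps have already been parsed into row tuples, and that the relevant columns are `page_id`, `page_namespace`, `page_title` in the page table and `ll_from`, `ll_lang`, `ll_title` in the langlinks table. `align_titles` is a hypothetical name and the rows in the usage example are made-up toy data.

```python
from typing import Iterable, List, Tuple

PageRow = Tuple[int, int, str]      # (page_id, page_namespace, page_title)
LangLinkRow = Tuple[int, str, str]  # (ll_from, ll_lang, ll_title)


def align_titles(pages: Iterable[PageRow],
                 langlinks: Iterable[LangLinkRow],
                 target_lang: str = "en") -> List[Tuple[str, str]]:
    """Join page ids to their inter-language links for one target language."""
    # Keep only articles (namespace 0); page titles use "_" in place of spaces.
    id_to_title = {
        page_id: title.replace("_", " ")
        for page_id, namespace, title in pages
        if namespace == 0
    }
    pairs = []
    for ll_from, ll_lang, ll_title in langlinks:
        if ll_lang == target_lang and ll_title and ll_from in id_to_title:
            pairs.append((id_to_title[ll_from], ll_title))
    return pairs


if __name__ == "__main__":
    # Toy rows for illustration only; real rows come from the SQL dumps.
    toy_pages = [(244, 0, "史学家"), (256, 0, "开源"), (300, 1, "Some_talk_page")]
    toy_links = [(244, "en", "Historian"), (256, "en", "Open source"),
                 (256, "fr", "Open source"), (300, "en", "Some talk page")]
    for source_title, target_title in align_titles(toy_pages, toy_links):
        print(source_title, "|", target_title)
```

Filtering on namespace 0 drops talk pages, templates, and similar non-article pages, which would otherwise add noise to the aligned title list.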