A comprehensive collection of Chinese natural language processing datasets, platforms, and tools
2022-07-03 23:47:00 【Necther】
This resource list covers a large number of datasets for text classification, entity recognition and part-of-speech tagging, search matching, recommendation systems, coreference resolution, encyclopedia data, pre-trained word vectors and models, and Chinese cloze tasks, along with Chinese dataset platforms and NLP tools.
This article is compiled from: https://github.com/InsaneLife/ChineseNLPCorpus
Text classification
News classification
Toutiao (Today's Headlines) Chinese short-text classification dataset: https://github.com/fateleak/toutiao-text-classfication-dataset
Data size: 380,000 items in total, distributed across 15 categories.
Collection date: May 2018.
Split in a 0.7 / 0.15 / 0.15 ratio.
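A 0.7 / 0.15 / 0.15 split like the one above can be sketched as follows. This is a minimal illustration, not the dataset author's script: it assumes the three parts are the training, validation, and test sets, and the function name `split_dataset` and the fixed seed are hypothetical choices.

```python
import random


def split_dataset(items, ratios=(0.7, 0.15, 0.15), seed=42):
    """Shuffle items and cut them into three parts by the given ratios.

    Assumption: the parts are used as train / validation / test sets.
    """
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(items)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the last part
    return train, val, test


if __name__ == "__main__":
    samples = [f"sample_{i}" for i in range(1000)]
    train, val, test = split_dataset(samples)
    print(len(train), len(val), len(test))
```

With 15 categories of uneven size, a stratified split (splitting each category separately) may be preferable; the sketch above does not do that.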
Tsinghua news classification corpus (THUCNews):
Generated by filtering historical data from the Sina News RSS subscription channels, 2005 to 2011.
Data volume: 740,000 news documents (2.19 GB)
For small-scale experiments, you can filter down to these categories: sports, finance, real estate, home furnishing, education, technology, fashion, current affairs, games, entertainment
http://thuctc.thunlp.org/#%E8%8E%B7%E5%8F%96%E9%93%BE%E6%8E%A5
RNN and CNN experiments: https://github.com/gaussic/text-classification-cnn-rnn
University of Science and Technology of China (USTC) news classification corpus: http://www.nlpir.org/?action-viewnews-itemid-145
Sentiment / opinion / review polarity analysis

Entity recognition & part-of-speech tagging
Weibo entity recognition
https://github.com/hltcoe/golden-horse
Boson data
Contains 6 entity types
https://github.com/InsaneLife/ChineseNLPCorpus/tree/master/NER/boson
People's Daily dataset
Three entity types: person names, place names, and organization names
1998: https://github.com/InsaneLife/ChineseNLPCorpus/tree/master/NER/renMinRiBao
2004: https://pan.baidu.com/s/1LDwQjoj7qc-HT9qwhJ3rcA password: 1fa3
MSRA (Microsoft Research Asia) dataset
More than 50,000 annotated Chinese named entity recognition samples (covering locations, organizations, and persons)
https://github.com/InsaneLife/ChineseNLPCorpus/tree/master/NER/MSRA
SIGHAN Bakeoff 2005: four datasets in total, covering both traditional and simplified Chinese. The simplified Chinese word segmentation datasets are listed below.
MSR: http://sighan.cs.uchicago.edu/bakeoff2005/
PKU: http://sighan.cs.uchicago.edu/bakeoff2005/
Search matching
OPPO mobile search ranking
OPPO mobile search ranking query-title semantic matching dataset.
Link: https://pan.baidu.com/s/1Hg2Hubsn3GEuu4gubbHCzw Extraction code: 7p3n
Web search result evaluation (SogouE)
User queries and lists of related URLs
https://www.sogou.com/labs/resource/e.php
Recommendation system

Encyclopedia data
Wikipedia
Wikipedia regularly packages and publishes its corpus as dumps:
Data processing blog
https://dumps.wikimedia.org/zhwiki/
Baidu Baike (Baidu Encyclopedia)
It can only be obtained by crawling it yourself. A previously crawled copy: https://pan.baidu.com/share/init?surl=i3wvfil Extraction code: neqs
Coreference resolution
CoNLL 2012: http://conll.cemantix.org/2012/data.html
Pre-training (word vectors or models)
BERT
Open-source code: https://github.com/google-research/bert
Model download: BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
ELMo
Open-source code: https://github.com/allenai/bilm-tf
Pre-trained models: https://allennlp.org/elmo
Tencent word vectors
The Chinese word vector dataset released by Tencent AI Lab contains over 8 million Chinese words and phrases, each mapped to a 200-dimensional vector.
Download: https://ai.tencent.com/ailab/nlp/embedding.html
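Word vector files of this kind are often distributed in the word2vec text format. The sketch below parses that format and compares two words by cosine similarity. It assumes a header line of the form "count dimension" followed by one "word v1 v2 ..." line per entry; check the downloaded file before relying on this. The function names `load_word_vectors` and `cosine` are hypothetical, and the demo uses a tiny 3-dimensional in-memory sample rather than the real 200-dimensional data.

```python
import io
import math


def load_word_vectors(handle, limit=None):
    """Parse word2vec-style text from an open text file handle.

    Assumed layout: first line 'count dim', then 'word v1 ... v_dim' per line.
    """
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        parts = line.split()
        if len(parts) < dim + 1:
            continue  # skip blank or malformed lines
        # The last `dim` fields are the vector; anything before is the word
        # (a phrase may itself contain spaces).
        word = " ".join(parts[:-dim])
        vectors[word] = [float(x) for x in parts[-dim:]]
        if limit is not None and len(vectors) >= limit:
            break  # useful for sampling a very large file
    return dim, vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    sample = io.StringIO("2 3\n你好 1.0 0.0 0.0\n世界 0.0 1.0 0.0\n")
    dim, vectors = load_word_vectors(sample)
    print(dim, cosine(vectors["你好"], vectors["世界"]))
```

For the real file, pass a handle opened with `open(path, encoding="utf-8")` and a `limit`, since loading millions of 200-dimensional vectors into Python lists takes a lot of memory.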
Hundreds of pre-trained Chinese word vectors
https://github.com/Embedding/Chinese-Word-Vectors
Chinese cloze dataset
https://github.com/ymcui/Chinese-RC-Dataset
Chinese classical poetry database
The most comprehensive dataset of classical Chinese poetry: nearly 14,000 poets of the Tang and Song dynasties, nearly 55,000 Tang poems plus 260,000 Song poems, and 21,050 ci (lyric poems) by 1,564 Song-dynasty ci poets.
https://github.com/chinese-poetry/chinese-poetry
Insurance industry corpus
https://github.com/Samurais/insuranceqa-corpus-zh
Chinese character decomposition dictionary
English can use character embeddings; for Chinese, it may be worth trying character decomposition (splitting each character into its components).
https://github.com/kfcd/chaizi
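As a rough sketch of the character decomposition idea, the snippet below expands each character into its components before building embedding inputs. The table `DECOMPOSITIONS` is a small hand-written, hypothetical example and is not read from the chaizi repository; the function name `decompose` is likewise an assumption. A real pipeline would load the mapping from the dictionary files after checking their format.

```python
# Hypothetical decomposition table for illustration only;
# a real one would be loaded from a decomposition dictionary.
DECOMPOSITIONS = {
    "好": ["女", "子"],
    "明": ["日", "月"],
    "林": ["木", "木"],
}


def decompose(text, table):
    """Replace each character with its components; keep unknown characters as-is."""
    components = []
    for ch in text:
        components.extend(table.get(ch, [ch]))
    return components


if __name__ == "__main__":
    print(decompose("明林", DECOMPOSITIONS))
```

The resulting component sequence can be fed to an embedding layer in the same way character sequences are used for English.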
Chinese dataset platforms
Sogou Labs
Sogou Labs provides some high-quality Chinese text datasets, though they are fairly old, mostly from before 2012.
https://www.sogou.com/labs/resource/list_pingce.php
Zhongke natural language processing and information retrieval sharing platform (NLPIR)
http://www.nlpir.org/?action-category-catid-28
Small Chinese corpora
Includes small datasets for Chinese named entity recognition, Chinese relation recognition, Chinese reading comprehension, and more.
https://github.com/crownpku/Small-Chinese-Corpus
Wikipedia dataset
NLP Tools
THULAC: https://github.com/thunlp/THULAC (Chinese word segmentation and part-of-speech tagging)
HanLP: https://github.com/hankcs/HanLP
Harbin Institute of Technology LTP: https://github.com/HIT-SCIR/ltp
NLPIR: https://github.com/NLPIR-team/NLPIR
jieba word segmentation: https://github.com/yanyiwu/cppj