当前位置:网站首页>Collation of the most complete Chinese naturallanguageprocessing data sets, platforms and tools
Collation of the most complete Chinese naturallanguageprocessing data sets, platforms and tools
2022-07-03 23:47:00 【Necther】
Resources sort out text classification 、 Entity recognition & Part of speech tagging 、 Search for matches 、 Recommendation system 、 Deixis disambiguation 、 Encyclopedia data 、 Pre training word vector or Model 、 A large number of data sets such as Chinese cloze , Chinese dataset platform and NLP Tools etc. .
This article is organized from :https://github.com/InsaneLife/ChineseNLPCorpus
Text classification
News classification
Today's headlines in Chinese ( Short text ) Classification data set :https://github.com/fateleak/toutiao-text-classfication-dataset
Data scale : common 38 Ten thousand , Distributed in 15 In categories .
Acquisition time :2018 year 05 month .
With 0.7 0.15 0.15 Do segmentation .
Tsinghua news classification corpus :
According to Sina News RSS Subscribed Channels 2005~2011 Years of historical data filtering generation .
Data volume :74 Ten thousand news documents (2.19 GB)
Small data experiments can filter categories : sports , Finance and economics, , Real estate , Home Furnishing , education , Technology , fashion , Current affairs , game , entertainment
http://thuctc.thunlp.org/#%E8%8E%B7%E5%8F%96%E9%93%BE%E6%8E%A5
rnn and cnn experiment :https://github.com/gaussic/text-classification-cnn-rnn
University of science and technology news classification corpus :http://www.nlpir.org/?action-viewnews-itemid-145
emotional / Point of view / Comment on Tendentiousness analysis

Entity recognition & Part of speech tagging
Weibo entity recognition
https://github.com/hltcoe/golden-horse
boson data
contain 6 Types of entities
https://github.com/InsaneLife/ChineseNLPCorpus/tree/master/NER/boson
People's daily data set
The person's name 、 Place names 、 Organization name three entity types
1998:https://github.com/InsaneLife/ChineseNLPCorpus/tree/master/NER/renMinRiBao
2004:https://pan.baidu.com/s/1LDwQjoj7qc-HT9qwhJ3rcA password: 1fa3
MSRA Microsoft Research Asia data set
5 More than 10000 pieces of Chinese named entity recognition and annotation data ( Including location 、 Institutions 、 figure )
https://github.com/InsaneLife/ChineseNLPCorpus/tree/master/NER/MSRA
SIGHAN Bakeoff 2005: There are four data sets , Including traditional Chinese and simplified Chinese , The following is simplified Chinese word segmentation data .
MSR: http://sighan.cs.uchicago.edu/bakeoff2005/
PKU :http://sighan.cs.uchicago.edu/bakeoff2005/
Search for matches
OPPO Mobile search sorting
OPPO Mobile search sorting query-title Semantic matching dataset .
link :https://pan.baidu.com/s/1Hg2Hubsn3GEuu4gubbHCzw Extraction code :7p3n
Web search results evaluation (SogouE)
User inquiry and related URL list
https://www.sogou.com/labs/resource/e.php
Recommendation system

Encyclopedia data
Wikipedia
Wikipedia will regularly package and publish the corpus :
Data processing blog
https://dumps.wikimedia.org/zhwiki/
Baidu Encyclopedia
You can only climb by yourself , Crawled link :https://pan.baidu.com/share/init?surl=i3wvfil Extraction code neqs .
Deixis disambiguation
CoNLL 2012 :http://conll.cemantix.org/2012/data.html
Preliminary training :( The word vector or Model )
BERT
Open source code :https://github.com/google-research/bert
Model download :BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
ELMO
Open source code :https://github.com/allenai/bilm-tf
Pre training model :https://allennlp.org/elmo
Tencent word vector
tencent AI The Chinese word vector data set published by the laboratory contains 800 A million Chinese vocabulary , Each of these words corresponds to a 200 Dimension vector .
Download address :https://ai.tencent.com/ailab/nlp/embedding.html
Hundreds of pre trained Chinese word vectors
https://github.com/Embedding/Chinese-Word-Vectors
Chinese cloze data set
https://github.com/ymcui/Chinese-RC-Dataset
Chinese ancient poetry database
The most complete data set of ancient Chinese poetry , Nearly 14000 ancient poets in Tang and Song Dynasties , near 5.5 Ten thousand Tang poems add 26 Wansong poetry . In the Song Dynasty 1564 A poet ,21050 First word .
https://github.com/chinese-poetry/chinese-poetry
Insurance industry corpus
https://github.com/Samurais/insuranceqa-corpus-zh
Chinese word splitting Dictionary
English can do char embedding, Chinese may as well try to open characters
https://github.com/kfcd/chaizi
Chinese dataset platform
Sogou lab
Sogou lab provides some high-quality Chinese text datasets , It's early , Mostly for 2012 Years ago
https://www.sogou.com/labs/resource/list_pingce.php
Zhongke nature language processing and information retrieval sharing platform
http://www.nlpir.org/?action-category-catid-28
Small data of Chinese corpus
Including Chinese Named Entity Recognition 、 Chinese relationship recognition 、 Some small amount of data such as Chinese reading comprehension .
https://github.com/crownpku/Small-Chinese-Corpus
Wikipedia dataset
NLP Tools
THULAC:https://github.com/thunlp/THULAC : Including Chinese participle 、 Part of speech tagging function .
HanLP:https://github.com/hankcs/HanLP
Harbin Institute of technology LTP: https://github.com/HIT-SCIR/ltp
NLPIR: https://github.com/NLPIR-team/NLPIR
jieba participle : https://github.com/yanyiwu/cppj
边栏推荐
- Idea a method for starting multiple instances of a service
- 炒股开户佣金优惠怎么才能获得,网上开户安全吗
- No qualifying bean of type ‘com. netflix. discovery. AbstractDiscoveryClientOptionalArgs<?>‘ available
- A preliminary study on the middleware of script Downloader
- Op amp related - link
- 2022 examination of safety production management personnel of hazardous chemical production units and examination skills of safety production management personnel of hazardous chemical production unit
- 炒股開戶傭金優惠怎麼才能獲得,網上開戶安全嗎
- "Learning notes" recursive & recursive
- Yyds dry goods inventory three JS source code interpretation - getobjectbyproperty method
- Gossip about redis source code 74
猜你喜欢

Ningde times and BYD have refuted rumors one after another. Why does someone always want to harm domestic brands?

Private project practice sharing populate joint query in mongoose makes the template unable to render - solve the error message: syntaxerror: unexpected token r in JSON at

How will the complete NFT platform work in 2022? How about its core functions and online time?

Report on prospects and future investment recommendations of China's assisted reproductive industry, 2022-2028 Edition

Ningde times and BYD have refuted rumors one after another. Why does someone always want to harm domestic brands?

The difference between single power amplifier and dual power amplifier

It is the most difficult to teach AI to play iron fist frame by frame. Now arcade game lovers have something

Recursive least square adjustment

Selenium check box

Qtoolbutton available signal
随机推荐
Enter MySQL in docker container by command under Linux
Ningde times and BYD have refuted rumors one after another. Why does someone always want to harm domestic brands?
D27:mode of sequence (maximum, translation)
炒股开户佣金优惠怎么才能获得,网上开户安全吗
Selenium library 4.5.0 keyword explanation (II)
Deep learning ----- using NN, CNN, RNN neural network to realize MNIST data set processing
Idea set class header comments
A preliminary study on the middleware of script Downloader
C # basic knowledge (1)
Ningde times and BYD have refuted rumors one after another. Why does someone always want to harm domestic brands?
Gossip about redis source code 75
2022 t elevator repair registration examination and the latest analysis of T elevator repair
Common mode interference of EMC
Les sociétés de valeurs mobilières dont la Commission d'ouverture d'un compte d'actions est la plus faible ont ce que tout le monde recommande.
Kubedl hostnetwork: accelerating the efficiency of distributed training communication
Is the controller a single instance or multiple instances? How to ensure the safety of concurrency
想请教一下,十大劵商如何开户?在线开户是安全么?
How to prevent malicious crawling of information by one-to-one live broadcast source server
Selenium check box
C # basic knowledge (2)