A collection of the most complete Chinese natural language processing datasets, platforms and tools
2022-07-03 23:47:00 【Necther】
This collection gathers a large number of datasets for text classification, entity recognition and part-of-speech tagging, search matching, recommendation systems, coreference resolution, encyclopedia data, pre-trained word vectors or models, and Chinese cloze tests, along with Chinese dataset platforms and NLP tools.
This article is compiled from: https://github.com/InsaneLife/ChineseNLPCorpus
Text classification
News classification
Toutiao (Today's Headlines) Chinese short-text classification dataset: https://github.com/fateleak/toutiao-text-classfication-dataset
Data scale: 380,000 items in total, distributed across 15 categories.
Collection time: May 2018.
Split at a 0.7 / 0.15 / 0.15 ratio.
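The 0.7 / 0.15 / 0.15 split can be reproduced on any list of samples with a few lines of Python. This is only a minimal sketch: the source does not say which part is which, so treating the three parts as train, dev and test is an assumption, and `split_dataset`, the seed, and the toy samples are hypothetical names and values chosen for illustration.

```python
import random


def split_dataset(samples, ratios=(0.7, 0.15, 0.15), seed=42):
    """Shuffle samples and cut them into three parts by the given ratios."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(samples)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    first_end = round(n * ratios[0])
    second_end = first_end + round(n * ratios[1])
    # The last part takes whatever remains, so no sample is lost to rounding.
    return shuffled[:first_end], shuffled[first_end:second_end], shuffled[second_end:]


# Toy stand-in for the real data: 100 (text, label) pairs over 15 labels.
samples = [(f"headline {i}", i % 15) for i in range(100)]
# Assumption: the parts are used as train / dev / test, in that order.
train, dev, test = split_dataset(samples)
print(len(train), len(dev), len(test))
```

Shuffling before cutting matters when the source file is grouped by category, since otherwise some categories could end up only in one part.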
Tsinghua news classification corpus:
Generated by filtering historical data from the Sina News RSS subscription channels, 2005 to 2011.
Data volume: 740,000 news documents (2.19 GB)
For small-scale experiments, the data can be filtered to these categories: sports, finance, real estate, home furnishing, education, technology, fashion, current affairs, games, entertainment
http://thuctc.thunlp.org/#%E8%8E%B7%E5%8F%96%E9%93%BE%E6%8E%A5
RNN and CNN experiments: https://github.com/gaussic/text-classification-cnn-rnn
University of Science and Technology news classification corpus: http://www.nlpir.org/?action-viewnews-itemid-145
Sentiment / opinion / review polarity analysis
Entity recognition & Part of speech tagging
Weibo entity recognition
https://github.com/hltcoe/golden-horse
Boson data
Contains 6 entity types
https://github.com/InsaneLife/ChineseNLPCorpus/tree/master/NER/boson
People's Daily dataset
Three entity types: person names, place names, and organization names
1998:https://github.com/InsaneLife/ChineseNLPCorpus/tree/master/NER/renMinRiBao
2004:https://pan.baidu.com/s/1LDwQjoj7qc-HT9qwhJ3rcA password: 1fa3
MSRA (Microsoft Research Asia) dataset
More than 50,000 annotated Chinese named entity recognition samples (covering locations, organizations, and persons)
https://github.com/InsaneLife/ChineseNLPCorpus/tree/master/NER/MSRA
SIGHAN Bakeoff 2005: four datasets in total, covering Traditional and Simplified Chinese. The Simplified Chinese word segmentation data is listed below.
MSR: http://sighan.cs.uchicago.edu/bakeoff2005/
PKU: http://sighan.cs.uchicago.edu/bakeoff2005/
Search matching
OPPO mobile search ranking
OPPO mobile search ranking query-title semantic matching dataset.
Link: https://pan.baidu.com/s/1Hg2Hubsn3GEuu4gubbHCzw Extraction code: 7p3n
Web search result evaluation (SogouE)
User queries and the related URL lists
https://www.sogou.com/labs/resource/e.php
Recommendation system
Encyclopedia data
Wikipedia
Wikipedia regularly packages and publishes its corpus:
Data processing blog
https://dumps.wikimedia.org/zhwiki/
Baidu Encyclopedia
No official dump is available, so you have to crawl it yourself. A crawled copy: https://pan.baidu.com/share/init?surl=i3wvfil Extraction code: neqs
Coreference resolution
CoNLL 2012 :http://conll.cemantix.org/2012/data.html
Pre-training (word vectors or models)
BERT
Open source code: https://github.com/google-research/bert
Model download: BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
ELMo
Open source code: https://github.com/allenai/bilm-tf
Pre-trained model: https://allennlp.org/elmo
Tencent word vectors
The Chinese word vector dataset published by Tencent AI Lab contains more than 8 million Chinese words, each mapped to a 200-dimensional vector.
Download address: https://ai.tencent.com/ailab/nlp/embedding.html
Hundreds of pre-trained Chinese word vectors
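Once a word vector file is downloaded, each word can be looked up and compared by cosine similarity. The sketch below assumes the file uses the common word2vec text layout (a "count dim" header line, then one "word v1 ... vd" line per word); check that against the actual download before relying on it. `load_text_vectors` and `cosine` are hypothetical helper names, and the 3-dimensional toy vectors stand in for the real 200-dimensional ones.

```python
import io
import math


def load_text_vectors(handle, limit=None):
    """Parse word2vec-style text: a 'count dim' header, then 'word v1 ... vd' lines."""
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip lines that do not match the declared dimension
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        # A limit is handy for a file with millions of entries.
        if limit is not None and len(vectors) >= limit:
            break
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Toy in-memory file with made-up 3-dimensional vectors (not real data).
sample = io.StringIO("3 3\n北京 1.0 0.0 0.0\n上海 0.9 0.1 0.0\n苹果 0.0 0.0 1.0\n")
vectors = load_text_vectors(sample)
close = cosine(vectors["北京"], vectors["上海"])
far = cosine(vectors["北京"], vectors["苹果"])
print(close > far)
```

For the real file, open it with `open(path, encoding="utf-8")` and pass a `limit`, since holding every vector in plain Python lists is memory-hungry.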
https://github.com/Embedding/Chinese-Word-Vectors
Chinese cloze data set
https://github.com/ymcui/Chinese-RC-Dataset
Classical Chinese poetry database
The most complete dataset of classical Chinese poetry: nearly 14,000 poets of the Tang and Song dynasties, close to 55,000 Tang poems plus 260,000 Song poems, as well as 1,564 ci poets and 21,050 ci from the Song period.
https://github.com/chinese-poetry/chinese-poetry
Insurance industry corpus
https://github.com/Samurais/insuranceqa-corpus-zh
Chinese character decomposition dictionary
English can use char embeddings. For Chinese, it may be worth trying to decompose characters into their components.
https://github.com/kfcd/chaizi
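The idea of decomposing characters can be sketched as a lookup that replaces each character with its components, so that a model can embed components much like English characters. The tab-separated "character, then space-separated components" layout used here is an assumption about the dictionary file, not something verified against the repository, and `load_decompositions` and `to_components` are hypothetical helper names.

```python
def load_decompositions(lines):
    """Build {character: [components]} from 'char<TAB>comp comp ...' lines."""
    table = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip lines without a decomposition
        # Keep only the first listed decomposition of each character.
        table[fields[0]] = fields[1].split()
    return table


def to_components(text, table):
    """Replace each character with its components; unknown characters stay whole."""
    components = []
    for ch in text:
        components.extend(table.get(ch, [ch]))
    return components


# Toy entries standing in for the real dictionary file.
toy_lines = ["好\t女 子\n", "明\t日 月\n"]
table = load_decompositions(toy_lines)
components = to_components("明天好", table)
print(components)
```

Falling back to the whole character for unknown entries keeps the output aligned with the input even when the dictionary has gaps.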
Chinese dataset platform
Sogou lab
Sogou lab provides some high-quality Chinese text datasets. The data is fairly old, mostly from before 2012.
https://www.sogou.com/labs/resource/list_pingce.php
Zhongke natural language processing and information retrieval sharing platform
http://www.nlpir.org/?action-category-catid-28
Small Chinese corpora
Includes small datasets for Chinese named entity recognition, Chinese relation recognition, Chinese reading comprehension, and similar tasks.
https://github.com/crownpku/Small-Chinese-Corpus
Wikipedia dataset
NLP Tools
THULAC: https://github.com/thunlp/THULAC (includes Chinese word segmentation and part-of-speech tagging)
HanLP: https://github.com/hankcs/HanLP
Harbin Institute of Technology LTP: https://github.com/HIT-SCIR/ltp
NLPIR: https://github.com/NLPIR-team/NLPIR
jieba word segmentation: https://github.com/yanyiwu/cppj