A collection of the most complete Chinese natural language processing datasets, platforms, and tools
2022-07-03 23:47:00 【Necther】
This resource list covers a large number of datasets for text classification, entity recognition & part-of-speech tagging, search matching, recommendation systems, coreference resolution, encyclopedia data, pre-trained word vectors or models, and Chinese cloze tasks, as well as Chinese dataset platforms and NLP tools.
This article is compiled from: https://github.com/InsaneLife/ChineseNLPCorpus
Text classification
News classification
Toutiao (Today's Headlines) Chinese short-text classification dataset: https://github.com/fateleak/toutiao-text-classfication-dataset
Data scale: about 380,000 items in total, distributed across 15 categories.
Collection time: May 2018.
Split in a 0.7 / 0.15 / 0.15 ratio.
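The 0.7 / 0.15 / 0.15 split above can be sketched roughly as follows. This is a minimal illustration, not the dataset author's script: the function name `split_dataset`, the fixed seed, the toy records, and the reading of the three parts as training, validation, and test sets are all assumptions.

```python
import random


def split_dataset(items, ratios=(0.7, 0.15, 0.15), seed=42):
    """Shuffle items and cut them into three parts by the given ratios."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(items)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    first_end = round(n * ratios[0])
    second_end = first_end + round(n * ratios[1])
    # The last part takes whatever remains, so no item is dropped.
    return shuffled[:first_end], shuffled[first_end:second_end], shuffled[second_end:]


# Toy stand-in for the real records (hypothetical layout).
records = [f"sample_{i}" for i in range(100)]
train, dev, test = split_dataset(records)
print(len(train), len(dev), len(test))
```

Shuffling before cutting matters if the file is ordered by category; otherwise some categories could end up only in one part.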
Tsinghua news classification corpus:
Generated by filtering historical data from the Sina News RSS subscription channels from 2005 to 2011.
Data volume: 740,000 news documents (2.19 GB)
For small-scale experiments, you can filter down to these categories: sports, finance, real estate, home furnishing, education, technology, fashion, current affairs, games, entertainment
http://thuctc.thunlp.org/#%E8%8E%B7%E5%8F%96%E9%93%BE%E6%8E%A5
RNN and CNN experiments: https://github.com/gaussic/text-classification-cnn-rnn
University of Science and Technology news classification corpus: http://www.nlpir.org/?action-viewnews-itemid-145
Sentiment / opinion / review polarity analysis
Entity recognition & part-of-speech tagging
Weibo entity recognition
https://github.com/hltcoe/golden-horse
Boson data
Contains 6 entity types
https://github.com/InsaneLife/ChineseNLPCorpus/tree/master/NER/boson
People's Daily dataset
Three entity types: person names, place names, and organization names
1998:https://github.com/InsaneLife/ChineseNLPCorpus/tree/master/NER/renMinRiBao
2004:https://pan.baidu.com/s/1LDwQjoj7qc-HT9qwhJ3rcA password: 1fa3
MSRA (Microsoft Research Asia) dataset
More than 50,000 annotated Chinese named entity recognition examples (covering locations, organizations, and persons)
https://github.com/InsaneLife/ChineseNLPCorpus/tree/master/NER/MSRA
SIGHAN Bakeoff 2005: four datasets in total, covering both Traditional and Simplified Chinese. The Simplified Chinese word segmentation data are listed below.
MSR: http://sighan.cs.uchicago.edu/bakeoff2005/
PKU: http://sighan.cs.uchicago.edu/bakeoff2005/
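Word segmentation data like the above is often turned into per-character labels before training a tagger. The sketch below shows one common way to do that with the B/M/E/S scheme (begin, middle, end, single). It assumes each line holds words separated by whitespace; the function name `words_to_bmes` and the example sentence are illustrative and not taken from the corpus.

```python
def words_to_bmes(words):
    """Map each character of a segmented sentence to a B/M/E/S tag."""
    tagged = []
    for word in words:
        if len(word) == 1:
            # A one-character word stands alone.
            tagged.append((word, "S"))
        else:
            tagged.append((word[0], "B"))
            tagged.extend((ch, "M") for ch in word[1:-1])
            tagged.append((word[-1], "E"))
    return tagged


# Illustrative segmented line; split() also tolerates repeated spaces.
line = "我们 在 北京大学 学习"
pairs = words_to_bmes(line.split())
print(pairs)
```

Joining the characters back together reproduces the unsegmented sentence, so the tag sequence alone is enough to recover the word boundaries.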
Search matching
OPPO mobile search ranking
OPPO mobile search ranking query-title semantic matching dataset.
Link: https://pan.baidu.com/s/1Hg2Hubsn3GEuu4gubbHCzw Extraction code: 7p3n
Web search results evaluation (SogouE)
User queries and lists of related URLs
https://www.sogou.com/labs/resource/e.php
Recommendation system
Encyclopedia data
Wikipedia
Wikipedia regularly packages and publishes its corpus:
Data processing blog
https://dumps.wikimedia.org/zhwiki/
Baidu Encyclopedia
You have to crawl it yourself. Link to crawled data: https://pan.baidu.com/share/init?surl=i3wvfil Extraction code: neqs
Coreference resolution
CoNLL 2012: http://conll.cemantix.org/2012/data.html
Pre-training (word vectors or models)
BERT
Open source code: https://github.com/google-research/bert
Model download: BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
ELMo
Open source code: https://github.com/allenai/bilm-tf
Pre-trained model: https://allennlp.org/elmo
Tencent word vectors
The Chinese word vector dataset published by Tencent AI Lab contains 8 million Chinese words and phrases, each mapped to a 200-dimensional vector.
Download address: https://ai.tencent.com/ailab/nlp/embedding.html
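As a rough sketch of how such word vectors are typically used, the code below reads vectors from a word2vec-style text layout and compares words by cosine similarity. The file layout (a "count dimension" header line followed by one "word v1 ... vN" row per entry) is an assumption about the download, and the tiny 4-dimensional sample, `load_vectors`, and `cosine` are made up for illustration.

```python
import io
import math


def load_vectors(handle, limit=None):
    """Read word2vec-style text: a 'count dim' header, then 'word v1 ... vN' rows."""
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        # Split from the right so an entry containing spaces stays in one piece.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip malformed rows
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break
    return vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Made-up 4-dimensional sample standing in for the real 200-dimensional file.
sample = io.StringIO(
    "3 4\n"
    "北京 1.0 0.0 1.0 0.0\n"
    "上海 1.0 0.0 0.5 0.0\n"
    "苹果 0.0 1.0 0.0 1.0\n"
)
vecs = load_vectors(sample)
print(cosine(vecs["北京"], vecs["上海"]) > cosine(vecs["北京"], vecs["苹果"]))
```

With a file of several million rows, the `limit` argument is a simple way to load only the first entries while experimenting.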
Over a hundred pre-trained Chinese word vectors
https://github.com/Embedding/Chinese-Word-Vectors
Chinese cloze data set
https://github.com/ymcui/Chinese-RC-Dataset
Chinese ancient poetry database
The most complete dataset of classical Chinese poetry: nearly 14,000 poets from the Tang and Song dynasties, close to 55,000 Tang poems plus 260,000 Song poems, and 21,050 ci (lyric poems) by 1,564 ci poets of the Song dynasty.
https://github.com/chinese-poetry/chinese-poetry
Insurance industry corpus
https://github.com/Samurais/insuranceqa-corpus-zh
Chinese character decomposition dictionary
English can use char embeddings; for Chinese, it may be worth trying to decompose characters into their components.
https://github.com/kfcd/chaizi
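The idea of decomposing characters can be sketched as below: each character is replaced by its components, which can then be embedded much like English characters. The input layout (tab-separated, with the character first and each following field an alternative decomposition whose components are separated by spaces) is an assumption about the dictionary file, and the two sample entries, `parse_decompositions`, and `decompose` are illustrative only.

```python
def parse_decompositions(lines):
    """Build {character: [list of components, ...]} from tab-separated lines."""
    table = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # no decomposition listed on this line
        table[fields[0]] = [alt.split() for alt in fields[1:]]
    return table


def decompose(text, table):
    """Replace each character with its first listed decomposition, if any."""
    out = []
    for ch in text:
        # Characters missing from the table are kept unchanged.
        out.extend(table[ch][0] if ch in table else [ch])
    return out


# Illustrative entries, not copied from the dictionary.
sample_lines = ["好\t女 子\n", "明\t日 月\n"]
table = parse_decompositions(sample_lines)
print(decompose("明天好", table))
```

Taking only the first alternative is a simplification; a real pipeline might keep every alternative or pick one by rule.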
Chinese dataset platform
Sogou Labs
Sogou Labs provides some high-quality Chinese text datasets. They are fairly old, mostly from before 2012.
https://www.sogou.com/labs/resource/list_pingce.php
Zhongke natural language processing and information retrieval sharing platform
http://www.nlpir.org/?action-category-catid-28
Small Chinese corpora
Includes small datasets for Chinese named entity recognition, Chinese relation extraction, Chinese reading comprehension, and similar tasks.
https://github.com/crownpku/Small-Chinese-Corpus
Wikipedia dataset
NLP Tools
THULAC: https://github.com/thunlp/THULAC (includes Chinese word segmentation and part-of-speech tagging)
HanLP: https://github.com/hankcs/HanLP
Harbin Institute of Technology LTP: https://github.com/HIT-SCIR/ltp
NLPIR: https://github.com/NLPIR-team/NLPIR
jieba word segmentation: https://github.com/yanyiwu/cppj