A comprehensive collection of Chinese natural language processing datasets, platforms, and tools

2022-07-03 23:47:00 Necther

This collection covers a large number of datasets for text classification, entity recognition and part-of-speech tagging, search matching, recommendation systems, coreference resolution, encyclopedia data, pre-trained word vectors and models, and Chinese cloze tasks, as well as Chinese dataset platforms and NLP tools.

This article is compiled from: https://github.com/InsaneLife/ChineseNLPCorpus

Text classification

News classification

Toutiao (Today's Headlines) Chinese short-text classification dataset: https://github.com/fateleak/toutiao-text-classfication-dataset

Data size: about 380,000 items in total, spread across 15 categories.

Collection date: May 2018.

Split in a 0.7 / 0.15 / 0.15 ratio.
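A 0.7 / 0.15 / 0.15 split like the one above can be sketched as follows. This is a minimal illustration rather than the dataset author's script: it assumes the three parts are training, validation, and test sets (the source does not say), and the function name `split_dataset`, the seed, and the toy samples are all made up for the example.

```python
import random


def split_dataset(items, ratios=(0.7, 0.15, 0.15), seed=42):
    """Shuffle items and cut them into three parts according to ratios."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(items)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    first_end = round(n * ratios[0])
    second_end = first_end + round(n * ratios[1])
    # The last part takes whatever remains, so no item is dropped.
    return shuffled[:first_end], shuffled[first_end:second_end], shuffled[second_end:]


# Toy stand-in for the real headlines (hypothetical data).
samples = [f"headline {i}" for i in range(1000)]
train, dev, test = split_dataset(samples)
print(len(train), len(dev), len(test))
```

For a classification dataset with uneven category sizes, a per-category (stratified) split may be preferable to a plain shuffle.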

Tsinghua news classification corpus

Generated by filtering historical data from the Sina News RSS subscription channels for 2005 to 2011.

Data volume: 740,000 news documents (2.19 GB).

For small-data experiments you can filter down to these categories: sports, finance, real estate, home furnishing, education, technology, fashion, current affairs, games, entertainment.

http://thuctc.thunlp.org/#%E8%8E%B7%E5%8F%96%E9%93%BE%E6%8E%A5

RNN and CNN experiments: https://github.com/gaussic/text-classification-cnn-rnn

University of Science and Technology of China (USTC) news classification corpus: http://www.nlpir.org/?action-viewnews-itemid-145

Sentiment / opinion / review polarity analysis

Entity recognition & part-of-speech tagging

Weibo entity recognition

https://github.com/hltcoe/golden-horse

Boson data

Contains 6 entity types.

https://github.com/InsaneLife/ChineseNLPCorpus/tree/master/NER/boson

People's Daily dataset

Three entity types: person names, place names, and organization names.

1998: https://github.com/InsaneLife/ChineseNLPCorpus/tree/master/NER/renMinRiBao

2004: https://pan.baidu.com/s/1LDwQjoj7qc-HT9qwhJ3rcA (password: 1fa3)

MSRA (Microsoft Research Asia) dataset

More than 50,000 annotated items for Chinese named entity recognition (covering locations, organizations, and persons).

https://github.com/InsaneLife/ChineseNLPCorpus/tree/master/NER/MSRA

SIGHAN Bakeoff 2005: four datasets in total, covering both Traditional and Simplified Chinese. The Simplified Chinese word segmentation data are listed below.

MSR: http://sighan.cs.uchicago.edu/bakeoff2005/

PKU: http://sighan.cs.uchicago.edu/bakeoff2005/

Search matching

OPPO mobile search ranking

Query-title semantic matching dataset from OPPO mobile search ranking.

Link: https://pan.baidu.com/s/1Hg2Hubsn3GEuu4gubbHCzw (extraction code: 7p3n)

Web search results evaluation (SogouE)

User queries and lists of related URLs.

https://www.sogou.com/labs/resource/e.php

Recommendation system

Encyclopedia data

Wikipedia

Wikipedia regularly packages and publishes its corpus as dumps:

Data processing blog

https://dumps.wikimedia.org/zhwiki/

Baidu Encyclopedia

There is no packaged release, so you have to crawl it yourself. A crawled copy: https://pan.baidu.com/share/init?surl=i3wvfil (extraction code: neqs)


Coreference resolution

CoNLL 2012: http://conll.cemantix.org/2012/data.html


Pre-training (word vectors or models)

BERT

Open-source code: https://github.com/google-research/bert

Model download: BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

ELMo

Open-source code: https://github.com/allenai/bilm-tf

Pre-trained models: https://allennlp.org/elmo

Tencent word vector

The Chinese word vector dataset released by Tencent AI Lab contains more than 8 million Chinese words and phrases, each mapped to a 200-dimensional vector.

Download: https://ai.tencent.com/ailab/nlp/embedding.html
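Word vectors of this kind are commonly distributed as plain text in the word2vec layout: a header line with the vocabulary size and dimension, then one word and its values per line. The sketch below assumes that layout (check the download page for the actual format) and uses a made-up 4-dimensional sample in place of the real 200-dimensional file. The helper names `load_text_vectors` and `cosine` are illustrative only.

```python
import io
import math


def load_text_vectors(handle, limit=None):
    """Parse word2vec-style text: a 'count dim' header, then 'word v1 v2 ...' lines."""
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip lines that do not match the declared dimension
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break  # stop early so a huge file need not be read in full
    return vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy data; the real vectors have 200 dimensions.
sample = io.StringIO(
    "3 4\n"
    "北京 0.9 0.1 0.0 0.2\n"
    "上海 0.8 0.2 0.1 0.2\n"
    "苹果 0.0 0.9 0.7 0.1\n"
)
vectors = load_text_vectors(sample)
print(cosine(vectors["北京"], vectors["上海"]) > cosine(vectors["北京"], vectors["苹果"]))
```

With a real file, pass an open file handle (for example `open(path, encoding="utf-8")`) and a `limit` to load only the first rows.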

Hundreds of pre-trained Chinese word vectors

https://github.com/Embedding/Chinese-Word-Vectors

Chinese cloze dataset

https://github.com/ymcui/Chinese-RC-Dataset

Classical Chinese poetry database

The most complete dataset of classical Chinese poetry: nearly 14,000 poets from the Tang and Song dynasties, close to 55,000 Tang poems, and 260,000 Song poems. It also covers 1,564 ci (lyric) poets of the Song dynasty, with 21,050 ci.

https://github.com/chinese-poetry/chinese-poetry

Insurance industry corpus

https://github.com/Samurais/insuranceqa-corpus-zh

Chinese character decomposition dictionary

English can use character embeddings; for Chinese, it may be worth trying to decompose characters into their components.

https://github.com/kfcd/chaizi
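To illustrate the idea of decomposing characters before embedding them, here is a small sketch. The lookup table is a hypothetical three-entry stand-in, not data loaded from the chaizi repository, whose file format should be checked there. The function name `to_components` is also made up for the example.

```python
# Hypothetical toy table: character -> list of components.
DECOMPOSITION = {
    "好": ["女", "子"],
    "明": ["日", "月"],
    "林": ["木", "木"],
}


def to_components(text, table=DECOMPOSITION):
    """Replace each character with its components; unknown characters stay whole."""
    components = []
    for char in text:
        components.extend(table.get(char, [char]))
    return components


print(to_components("明天好"))
```

The resulting component sequence could then be fed to an embedding layer the way character sequences are for English.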

Chinese dataset platforms

Sogou Labs

Sogou Labs provides some high-quality Chinese text datasets, though they are fairly old, mostly from before 2012.

https://www.sogou.com/labs/resource/list_pingce.php

Zhongke natural language processing and information retrieval sharing platform (NLPIR)

http://www.nlpir.org/?action-category-catid-28

Small Chinese corpus datasets

Includes small datasets for Chinese named entity recognition, Chinese relation recognition, Chinese reading comprehension, and similar tasks.

https://github.com/crownpku/Small-Chinese-Corpus

Wikipedia dataset

https://dumps.wikimedia.org/

NLP tools

THULAC: https://github.com/thunlp/THULAC (includes Chinese word segmentation and part-of-speech tagging)

HanLP: https://github.com/hankcs/HanLP

Harbin Institute of Technology LTP: https://github.com/HIT-SCIR/ltp

NLPIR: https://github.com/NLPIR-team/NLPIR

jieba word segmentation: https://github.com/yanyiwu/cppj

Copyright notice
This article was written by [Necther]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/02/202202142043171959.html