A free machine learning dataset website (6,300+ datasets)

2022-06-26 13:45:00 The star light blog in 2021 cloud computing top3

Today I'd like to share a free website for obtaining machine learning datasets:

Machine Learning Datasets | Papers With Code

This is good news for students who have ideas but no datasets. The website is very simple to use and provides all kinds of commonly used datasets, including collections of images, reviews, and point clouds.

 

CIFAR-10

Introduced by Krizhevsky et al. in Learning Multiple Layers of Features from Tiny Images

The CIFAR-10 dataset (Canadian Institute for Advanced Research, 10 classes) is a subset of the Tiny Images dataset and consists of 60,000 32x32 color images. Each image is labeled with one of 10 mutually exclusive classes: airplane, automobile (but not truck or pickup truck), bird, cat, deer, dog, frog, horse, ship, and truck (but not pickup truck). There are 6,000 images per class, split into 5,000 training images and 1,000 test images per class.
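The per-class counts above fully determine the overall split sizes. The following is a minimal sketch that checks the arithmetic; the constant and function names are hypothetical, and the three color channels per pixel are an assumption (the text only says "color images").

```python
# Class names as listed in the CIFAR-10 description.
CIFAR10_CLASSES = [
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck",
]

TRAIN_PER_CLASS = 5000  # training images per class (from the text)
TEST_PER_CLASS = 1000   # test images per class (from the text)

# Assumption: a "color image" means 3 channels (RGB), so 32 * 32 * 3 values.
VALUES_PER_IMAGE = 32 * 32 * 3


def cifar10_split_sizes():
    """Return (train_total, test_total, grand_total) from the per-class counts."""
    n_classes = len(CIFAR10_CLASSES)
    train_total = n_classes * TRAIN_PER_CLASS
    test_total = n_classes * TEST_PER_CLASS
    return train_total, test_total, train_total + test_total


train, test, total = cifar10_split_sizes()
print(f"train={train}, test={test}, total={total}")
```

The grand total matches the 60,000 images stated in the description, which is a quick sanity check to run after downloading any copy of the dataset.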

The criteria for deciding whether an image belongs to a class were as follows:

  • The class name should be high on the list of likely answers to the question "What is in this picture?"
  • The image should be photo-realistic. Labelers were instructed to reject line drawings.
  • The image should contain only one prominent instance of the object the class refers to. The object may be partially occluded or seen from an unusual viewpoint, as long as its identity is still clear to the labeler.

Source: CIFAR-10 and CIFAR-100 datasets

 

Cityscapes

Introduced by Cordts et al. in The Cityscapes Dataset for Semantic Urban Scene Understanding

Cityscapes is a large-scale database focused on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense pixel annotations for 30 classes grouped into 8 categories (flat surfaces, humans, vehicles, constructions, objects, nature, sky, and void). The dataset consists of approximately 5,000 finely annotated images and 20,000 coarsely annotated images. The data was captured in 50 cities over several months, during daytime and in good weather. It was originally recorded as video, so the frames were manually selected to have the following features: a large number of dynamic objects, varying scene layout, and varying background.

Source: A survey of deep learning techniques applied to semantic segmentation

 

Penn Treebank

Introduced by Mitchell P. Marcus et al. in Building a Large Annotated Corpus of English: The Penn Treebank

The English Penn Treebank (PTB) corpus, and in particular the portion corresponding to Wall Street Journal (WSJ) articles, is one of the best-known and most widely used corpora for evaluating sequence labeling models. The task consists of annotating each word with its part-of-speech tag. In the most common split of this corpus, sections 0 to 18 are used for training (38 219 sentences, 912 344 tokens), sections 19 to 21 for validation (5 527 sentences, 131 768 tokens), and sections 22 to 24 for testing (5 462 sentences, 129 654 tokens). The corpus is also commonly used for character-level and word-level language modeling.
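The section-based split described above can be written down as a small lookup. This is only a sketch of the split convention from the text; the function and dictionary names are hypothetical, and it does not read any actual corpus files.

```python
# (sentences, tokens) per split, copied from the description above.
PTB_SPLIT_STATS = {
    "train": (38219, 912344),
    "validation": (5527, 131768),
    "test": (5462, 129654),
}


def ptb_split(section: int) -> str:
    """Map a WSJ section number (0-24) to its split in the common PTB setup."""
    if 0 <= section <= 18:
        return "train"
    if 19 <= section <= 21:
        return "validation"
    if 22 <= section <= 24:
        return "test"
    raise ValueError(f"WSJ section must be between 0 and 24, got {section}")


# Totals across the three splits.
total_sentences = sum(s for s, _ in PTB_SPLIT_STATS.values())
total_tokens = sum(t for _, t in PTB_SPLIT_STATS.values())
print(total_sentences, total_tokens)
```

Keeping the split as a function of the section number makes it easy to verify that no section ends up in more than one split.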

Source: Seq2Biseq: A bi-directional output recurrent neural network for sequence modeling

 

IMDb Movie Reviews

Introduced by Andrew L. Maas et al. in Learning Word Vectors for Sentiment Analysis

The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb), each labeled as positive or negative. The dataset contains an equal number of positive and negative reviews. Only highly polarized reviews are included: a negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. No more than 30 reviews are included per movie. The dataset also contains additional unlabeled data.
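The labeling rule and the per-movie cap described above can be sketched as follows. This is an illustration of the stated thresholds, not the dataset authors' actual pipeline; the function names are hypothetical, the 1 to 10 rating range is an assumption (the text only gives the maximum of 10), and keeping the first 30 reviews per movie is an assumption about how the cap is applied.

```python
from collections import defaultdict
from typing import List, Optional, Tuple


def imdb_label(rating: int) -> Optional[str]:
    """Return 'negative', 'positive', or None for ratings that are excluded."""
    # Assumption: ratings are integers from 1 to 10.
    if not 1 <= rating <= 10:
        raise ValueError(f"rating must be between 1 and 10, got {rating}")
    if rating <= 4:
        return "negative"
    if rating >= 7:
        return "positive"
    return None  # ratings 5 and 6 are not polarized enough to be included


def cap_reviews_per_movie(
    reviews: List[Tuple[str, int]], limit: int = 30
) -> List[Tuple[str, int]]:
    """Keep at most `limit` (movie_id, rating) pairs per movie, in input order."""
    kept = []
    seen = defaultdict(int)
    for movie_id, rating in reviews:
        if seen[movie_id] < limit:
            kept.append((movie_id, rating))
            seen[movie_id] += 1
    return kept


sample = [("m1", 9)] * 35 + [("m2", 2)] * 3
print(len(cap_reviews_per_movie(sample)), imdb_label(9), imdb_label(5))
```

Dropping the middle ratings is what makes the task binary: every review that remains falls clearly on one side.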

Source: Sentiment analysis | NLP-progress

 

ModelNet

Introduced by Wu et al. in 3D ShapeNets: A Deep Representation for Volumetric Shapes

The ModelNet40 dataset contains synthetic object point clouds. As the most widely used benchmark for point cloud analysis, ModelNet40 is popular because of its varied categories, clean shapes, and well-constructed data. The original ModelNet40 consists of object meshes in 40 categories (such as airplane, car, plant, and lamp), of which 9,843 are used for training and the remaining 2,468 for testing. The corresponding point clouds are uniformly sampled from the mesh surfaces, then further preprocessed by moving them to the origin and scaling them into a unit sphere.
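The preprocessing step mentioned above (move to the origin, then scale into a unit sphere) can be sketched in plain Python. This sketch assumes that "moving to the origin" means subtracting the centroid and that scaling divides by the largest distance from the origin, which is a common convention but is not spelled out in the text; the function name is hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def normalize_to_unit_sphere(points: List[Point]) -> List[Point]:
    """Center a point cloud on its centroid and scale it into the unit sphere."""
    n = len(points)
    if n == 0:
        raise ValueError("point cloud is empty")

    # Step 1: translate so the centroid sits at the origin (assumed meaning).
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    cz = sum(p[2] for p in points) / n
    centered = [(x - cx, y - cy, z - cz) for x, y, z in points]

    # Step 2: divide by the largest distance from the origin, so the
    # farthest point lands exactly on the unit sphere.
    max_norm = max(math.sqrt(x * x + y * y + z * z) for x, y, z in centered)
    if max_norm == 0.0:
        return centered  # all points coincide; nothing to scale
    return [(x / max_norm, y / max_norm, z / max_norm) for x, y, z in centered]


cloud = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (2.0, 2.0, 0.0)]
print(normalize_to_unit_sphere(cloud))
```

Normalizing this way removes differences in position and overall size between objects, so a classifier sees only shape.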

Source: Geometric Feedback Network for Point Cloud Classification

CARLA (CAR Learning to Act)

Introduced by Dosovitskiy et al. in CARLA: An Open Urban Driving Simulator

CARLA (CAR Learning to Act) is an open simulator for urban driving, developed as an open-source layer over Unreal Engine 4. It provides sensors in the form of RGB cameras (with customizable positions), ground-truth depth maps, ground-truth semantic segmentation maps with 12 semantic classes designed for driving (road, lane marking, traffic sign, sidewalk, and so on), bounding boxes for dynamic objects in the environment, and measurements of the agent itself (vehicle location and orientation).

Source: Synthetic Data for Deep Learning

 

The above is a brief introduction to several commonly used datasets. Please visit the website for more.

Copyright notice
This article was written by [The star light blog in 2021 cloud computing top3]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/177/202206261252428278.html