Free machine learning dataset website (6300+ datasets)

2022-06-26 13:45:00 【The star light blog in 2021 cloud computing top3】

Today, I'd like to share a free website for finding machine learning datasets:

Machine Learning Datasets | Papers With Code

This is good news for students who have ideas but no datasets. The website is very simple, and it collects all kinds of commonly used datasets, including images, reviews, and point clouds.

CIFAR-10

Introduced by Krizhevsky et al. in Learning Multiple Layers of Features from Tiny Images

The CIFAR-10 dataset (Canadian Institute for Advanced Research, 10 classes) is a subset of the Tiny Images dataset and consists of 60000 32x32 color images. Each image is labeled with one of 10 mutually exclusive classes: airplane, automobile (but not truck or pickup truck), bird, cat, deer, dog, frog, horse, ship, and truck (but not pickup truck). There are 6000 images per class, with 5000 training and 1000 test images per class.
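As a quick sanity check of the numbers above, here is a minimal Python sketch that derives the split sizes from the per-class counts. The integer label order and the assumption of three RGB channels per image are illustrative choices, not something the description states.

```python
# Class names in the order listed in the description. Using list position as
# the integer label is an assumption made for illustration.
CIFAR10_CLASSES = [
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck",
]

IMAGE_SIDE = 32  # images are 32x32
CHANNELS = 3     # assumption: "color" means three RGB channels


def split_sizes(num_classes, train_per_class, test_per_class):
    """Return the train, test, and total image counts for a balanced dataset."""
    train = num_classes * train_per_class
    test = num_classes * test_per_class
    return {"train": train, "test": test, "total": train + test}


def values_per_image(side=IMAGE_SIDE, channels=CHANNELS):
    """Number of raw pixel values stored for one image."""
    return side * side * channels


sizes = split_sizes(len(CIFAR10_CLASSES), train_per_class=5000, test_per_class=1000)
print(sizes)
print(values_per_image())
```

The totals agree with the 60000 images quoted in the description, which is a useful check before training on a downloaded copy.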

The criteria for deciding whether an image belongs to a class are as follows:

The class name should be high on the list of likely answers to the question "What is in this picture?"
The image should be photo-realistic. Labelers were instructed to reject line drawings.
The image should contain only one prominent instance of the object the class refers to. The object may be partially occluded or seen from an unusual viewpoint, as long as its identity is still clear to the labeler.

Source: CIFAR-10 and CIFAR-100 datasets

Cityscapes

Introduced by Cordts et al. in The Cityscapes Dataset for Semantic Urban Scene Understanding

Cityscapes is a large-scale database focused on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense pixel annotations for 30 classes grouped into 8 categories (flat surfaces, humans, vehicles, constructions, objects, nature, sky, and void). The dataset consists of around 5000 finely annotated images and 20000 coarsely annotated ones. The data was captured in 50 cities over several months, in daytime and in good weather. It was originally recorded as video, so the frames were manually selected to have the following features: a large number of dynamic objects, varying scene layout, and varying background.

Source: A Review on Deep Learning Techniques Applied to Semantic Segmentation

Penn Treebank

Introduced by Mitchell P. Marcus et al. in Building a Large Annotated Corpus of English: The Penn Treebank

The English Penn Treebank (PTB) corpus, and in particular the portion corresponding to Wall Street Journal (WSJ) articles, is one of the best-known and most widely used corpora for evaluating sequence labeling models. The task consists of annotating each word with its part-of-speech tag. In the most common split of this corpus, sections 0 to 18 are used for training (38 219 sentences, 912 344 tokens), sections 19 to 21 for validation (5 527 sentences, 131 768 tokens), and sections 22 to 24 for testing (5 462 sentences, 129 654 tokens). The corpus is also commonly used for character-level and word-level language modeling.

Source: Seq2Biseq: Bidirectional Output-wise Recurrent Neural Networks for Sequence Modelling
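The section-based split described above can be written down as a small lookup. This is a minimal sketch: the function names are hypothetical, and it only encodes the section ranges and counts quoted in the description, not any official loader.

```python
# Section ranges for the common split (upper bounds are exclusive in range()).
SPLIT_SECTIONS = {
    "train": range(0, 19),        # sections 0 to 18
    "validation": range(19, 22),  # sections 19 to 21
    "test": range(22, 25),        # sections 22 to 24
}

# (sentences, tokens) per split, as quoted in the description.
SPLIT_STATS = {
    "train": (38219, 912344),
    "validation": (5527, 131768),
    "test": (5462, 129654),
}


def split_for_section(section):
    """Return the split name that a WSJ section number belongs to."""
    for name, sections in SPLIT_SECTIONS.items():
        if section in sections:
            return name
    raise ValueError(f"section {section} is outside the 0-24 range")


def mean_tokens_per_sentence(split):
    """Average sentence length in tokens for one split."""
    sentences, tokens = SPLIT_STATS[split]
    return tokens / sentences


for name in SPLIT_SECTIONS:
    print(name, round(mean_tokens_per_sentence(name), 2))
```

All three splits come out at a similar average sentence length, which suggests the split is reasonably homogeneous.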

IMDb Movie Reviews

Introduced by Andrew L. Maas et al. in Learning Word Vectors for Sentiment Analysis

The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb), each labeled as positive or negative. The dataset contains an equal number of positive and negative reviews. Only highly polarized reviews are included: a negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. No more than 30 reviews are included per movie. The dataset also contains additional unlabeled data.

Source: Sentiment analysis | NLP-progress
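The labeling rule above is easy to express in code. The sketch below is illustrative only: the function names and the (movie_id, score, text) tuple layout are hypothetical, and keeping the first 30 reviews per movie is an assumption, since the description does not say how the 30 were chosen.

```python
from collections import defaultdict


def label_from_score(score):
    """Map a 1-10 rating to a sentiment label, or None for excluded reviews."""
    if score <= 4:
        return "negative"
    if score >= 7:
        return "positive"
    return None  # scores of 5 and 6 are not polarized enough


def select_reviews(reviews, max_per_movie=30):
    """Keep polarized reviews, with at most max_per_movie per movie.

    reviews is an iterable of (movie_id, score, text) tuples.
    Assumption: the first qualifying reviews encountered are the ones kept.
    """
    kept = []
    per_movie = defaultdict(int)
    for movie_id, score, text in reviews:
        label = label_from_score(score)
        if label is None or per_movie[movie_id] >= max_per_movie:
            continue
        per_movie[movie_id] += 1
        kept.append((movie_id, label, text))
    return kept


sample = [("m1", 9, "great"), ("m1", 5, "okay"), ("m2", 2, "dull")]
print(select_reviews(sample))
```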

ModelNet

Introduced by Wu et al. in 3D ShapeNets: A Deep Representation for Volumetric Shapes

The ModelNet40 dataset contains synthetic object point clouds. As the most widely used benchmark for point cloud analysis, ModelNet40 is popular because of its varied categories, clean shapes, and well-constructed data. The original ModelNet40 consists of meshes in 40 categories (such as airplane, car, plant, and lamp), of which 9,843 are used for training and the remaining 2,468 for testing. The corresponding point clouds are uniformly sampled from the mesh surfaces and then preprocessed by moving them to the origin and scaling them into a unit sphere.

Source: Geometric Back-projection Network for Point Cloud Classification
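The preprocessing step mentioned above (move to the origin, then scale into a unit sphere) might look like the following minimal sketch. It assumes that "moving to the origin" means subtracting the centroid, which is a common convention but is not stated in the description. The function name is hypothetical.

```python
import math


def normalize_to_unit_sphere(points):
    """Center a point cloud on its centroid and scale it into a unit sphere.

    points is a list of (x, y, z) tuples. After normalization the farthest
    point lies at distance 1 from the origin.
    """
    n = len(points)
    if n == 0:
        return []
    # Assumption: "moving to the origin" means subtracting the centroid.
    centroid = tuple(sum(p[i] for p in points) / n for i in range(3))
    centered = [tuple(p[i] - centroid[i] for i in range(3)) for p in points]
    radius = max(math.sqrt(x * x + y * y + z * z) for x, y, z in centered)
    if radius == 0:
        return centered  # all points coincide, so there is nothing to scale
    return [(x / radius, y / radius, z / radius) for x, y, z in centered]


cloud = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 2.0)]
normalized = normalize_to_unit_sphere(cloud)
print(max(math.sqrt(x * x + y * y + z * z) for x, y, z in normalized))
```

Normalizing this way removes differences in position and scale between objects, so a classifier sees only shape.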

CARLA (CAR Learning to Act)

Introduced by Dosovitskiy et al. in CARLA: An Open Urban Driving Simulator

CARLA (CAR Learning to Act) is an open simulator for urban driving, developed as an open-source layer over Unreal Engine 4. It provides sensors in the form of RGB cameras (with customizable positions), ground-truth depth maps, ground-truth semantic segmentation maps with 12 semantic classes designed for driving (road, lane marking, traffic sign, sidewalk, and so on), bounding boxes for dynamic objects in the environment, and measurements of the agent itself (vehicle location and orientation).

Source: Synthetic Data for Deep Learning

That concludes this brief introduction to several commonly used datasets. Please visit the website for more.