当前位置:网站首页>We made a medical version of the MNIST dataset, and found that the common automl algorithm is not so easy to use
We made a medical version of the MNIST dataset, and found that the common automl algorithm is not so easy to use
2020-11-08 13:02:00 【U4u5y4 assault rifle】
author | Devil 、 Zhang Qian
source | Almost Human
Shanghai Jiaotong University researchers create a new open medical image data set MedMNIST, And Design 「MedMNIST Categorical decathlon 」, To promote AutoML Algorithm in the field of medical image analysis research .
stay AI In the development of Technology , Data sets play an important role . However , There are many difficulties in the creation of medical data sets , Such as data acquisition 、 Data tagging, etc .
In the near future , Researchers at Shanghai Jiaotong University created a medical image dataset MedMNIST, common contain 10 Preprocessing open medical image datasets ( Its data comes from many different data sources , And after pretreatment ).
Project address :
https://medmnist.github.io/
Address of thesis :
https://arxiv.org/pdf/2010.14925v1.pdf
GitHub Address :
https://github.com/MedMNIST/MedMNIST
Dataset download address :
https://www.dropbox.com/sh/upxrsyb5v8jxbso/AADOV0_6pC9Tb3cIACro1uUPa?dl=0
and MNIST The dataset is the same ,MedMNIST Data sets In lightweight 28 × 28 Performing classification tasks on images , The tasks involved cover the main medical image modes and diverse data scales . According to the researchers' design ,MedMNIST Data sets have the following features :
educative nature : The multimodal data in this dataset comes from multiple open medical image datasets with knowledge sharing license , It can be used for educational purposes .
Standardization : The researchers preprocessed the data , Convert it to the same format , therefore Users do not need to have background knowledge to use .
diversity : Multimodal datasets cover multiple data scales ( from 100 To 100,000) And tasks ( Two classification / Many classification 、 Ordered regression and multi label ).
Lightweight : The image size is 28 × 28, It is convenient for rapid prototyping and testing, and multimodal machine learning and AutoML Algorithm .
suffer Medical Segmentation Decathlon( Medical split decathlon ) Inspired by the , The study also designed MedMNIST Classification Decathlon(MedMNIST Categorical decathlon ), As AutoML Benchmark in the field of medical image classification .
It's all about 10 Evaluation on data sets AutoML Performance of the algorithm , The algorithm is not adjusted manually . The researchers compared the performance of several baseline methods , Including early stop ResNet [6]、 Open source AutoML Tools (auto-sklearn [7] and AutoKeras [8]), And commercialization AutoML Tools (Google AutoML Vision). The researchers hope that MedMNIST Classification Decathlon Can promote AutoML Research in the field of medical image analysis .
Ten preprocessed datasets
MedMNIST Data set containing 10 Preprocessing data sets , Covering the main data modes ( Such as X Photo chip 、OCT、 ultrasonic 、CT)、 Diverse classification tasks ( Two classification / Many classification 、 Ordered regression and multi label ) And data scale . As shown in the table 1 Shown , The diversity of data set design leads to the diversity of task difficulty , And that's what AutoML What benchmarks need . The researchers preprocessed each data set , Divide it into training - verification - Test subsets .
surface 1:MedMNIST Data set Overview , Covers the name of the dataset 、 source 、 Data mode 、 Task and dataset segmentation .
The data sets of these modes cover X Photo chip 、OCT、 ultrasonic 、CT、 Pathological section 、 Dermoscopy, etc , It's about colorectal cancer 、 Retinal diseases 、 Breast disease 、 Liver tumor and many other medical fields .
new type AutoML Medical image benchmark
As mentioned earlier , The researchers were inspired by the medical split decathlon , Designed 「MedMNIST Categorical decathlon 」, Designed to create lightweight... For medical image analysis AutoML The benchmark . It's all about 10 Evaluation on data sets AutoML Performance of the algorithm , The algorithm is not adjusted manually . The researchers compared the performance of several baseline methods , See the table below 2:
From the table 2 It can be seen that ,Google AutoML Vision The overall performance is good , But it's not always the best , Sometimes even lose to ResNet-18 and ResNet-50.auto-sklearn It doesn't perform well on most datasets , This shows that the performance of the typical statistical machine learning algorithm on the medical image data set is poor .AutoKeras Good performance on large data sets , Relatively poor performance on small data sets . No algorithm can achieve good generalization performance on these ten datasets , It helps to explore AutoML The algorithm is in different data modes 、 Generalization effects on task and scale datasets .
Next , Let's look at different methods in the training set 、 Performance on verification set and test set . Here's the picture 2 Shown , The algorithm is easy to over fit on small data sets .
Google AutoML Vision It can better control the over fitting problem , and auto-sklearn There is a serious over fitting . It can be inferred from this that , For learning algorithms , Appropriate reductive bias It's very important . We can still do that MedMNIST Explore different regularization techniques on datasets , Such as data enhancement 、 Model integration 、 Optimization algorithm, etc .
How to find data sets ?
Besides the medical field , Data sets from other fields are sometimes difficult to access , This requires us to master some common data collection methods and common resources . lately ,Medium A blogger on introduced several commonly used data collection sources :
1. Awesome Data
This is a GitHub The repository , Contains multiple different categories of datasets .
link :
https://github.com/awesomedata/awesome-public-datasets
2. Data Is Plural
This is a dataset resource presented in spreadsheet form , from 2015 It's been updated regularly since , The latest issue is 2020 year 10 month 28 The resources of the day , So some of the resources are very new .
link :https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit#gid=0
3. Kaggle Datasets
Kaggle Datasets Provides preview and summary information about many datasets , Very suitable for retrieving data sets for specific topics .
link :
https://www.kaggle.com/datasets
4. Data.world
and Kaggle equally ,Data.world Provides a series of user contributed datasets , It also provides a platform for companies to store and organize their own data .
link :
https://data.world/
5. Google Dataset Search
Dataset search It's Google 2018 A new search function launched in . If you're looking for data from a particular topic or source , This tool is worth trying .
link :
https://datasetsearch.research.google.com/
6. OpenDaL
OpenDal It's also a dataset search tool , You can search in many ways , For example, according to the creation time or frame a certain area on the map .
link :
https://opendatalibrary.com/
7. Pandas Data Reader
Pandas Data Reader It can help you pull data from online resources , And then apply it to Python pandas DataFrame in . Most of this is financial data .
link :
https://pandas-datareader.readthedocs.io/en/latest/remote_data.html
8. from API get data
utilize Python from API Data acquisition is also a common method used by data scientists , Please refer to the following tutorial for specific operation steps .
link :
https://towardsdatascience.com/how-to-get-data-from-apis-with-python-dfb83fdc5b5b
Reference link :https://towardsdatascience.com/the-top-10-best-places-to-find-datasets-8d3b4e31c442
????
Now? , stay 「 You know 」 We can also be found
Go to Zhihu home page and search 「PaperWeekly」
Click on 「 Focus on 」 Subscribe to our column
About PaperWeekly
PaperWeekly It's a recommendation 、 Reading 、 Discuss 、 An academic platform for reporting the achievements of the frontier papers on artificial intelligence . If you study or engage in AI field , Welcome to clicking on the official account 「 Communication group 」, The little assistant will take you into PaperWeekly In the communication group .
版权声明
本文为[U4u5y4 assault rifle]所创,转载请带上原文链接,感谢
边栏推荐
- 谷歌开源能翻译101种语言的AI模型,只比Facebook多一种
- 金融领域首个开源中文BERT预训练模型,熵简科技推出FinBERT 1.0
- 2018中国云厂商TOP5:阿里云、腾讯云、AWS、电信、联通 ...
- Service architecture and transformation optimization process of e-commerce trading platform in mogujie (including ppt)
- 【Python 1-6】Python教程之——数字
- rabbitmq(一)-基础入门
- Entry level! Teach you how to develop small programs without asking for help (with internet disk link)
- Flink: from introduction to Zhenxiang (3. Reading data from collection and file)
- PMP experience sharing
- python基础教程python opencv pytesseract 验证码识别的实现
猜你喜欢
吐血整理!阿里巴巴 Android 开发手册!(附网盘链接)
Google's AI model, which can translate 101 languages, is only one more than Facebook
阿里撕下电商标签
Major changes in Huawei's cloud: Cloud & AI rises to Huawei's fourth largest BG with full fire
Why is Schnorr Signature known as the biggest technology update after bitcoin segwit
2018中国云厂商TOP5:阿里云、腾讯云、AWS、电信、联通 ...
2018中国云厂商TOP5:阿里云、腾讯云、AWS、电信、联通 ...
2018中国云厂商TOP5:阿里云、腾讯云、AWS、电信、联通 ...
Ali tear off the e-commerce label
笔试面试题目:求丢失的猪
随机推荐
【Python 1-6】Python教程之——数字
Istio traffic management -- progress gateway
Harbor项目高手问答及赠书活动
Get PMP certificate at 51CTO College
用科技赋能教育创新与重构 华为将教育信息化落到实处
Written interview topic: looking for the lost pig
分布式文档存储数据库之MongoDB基础入门
Top 5 Chinese cloud manufacturers in 2018: Alibaba cloud, Tencent cloud, AWS, telecom, Unicom
This time Kwai tiktok is faster than shaking.
It's worth seeing! EMR elastic low cost offline big data analysis best practice (with network disk link)
Powershell 使用.Net对象发送邮件
Where is the new target market? What is the anchored product? |Ten questions 2021 Chinese enterprise service
Shell uses. Net objects to send mail
Python basic syntax
Python Gadgets: code conversion
On the confirmation of original data assets
Top 5 Chinese cloud manufacturers in 2018: Alibaba cloud, Tencent cloud, AWS, telecom, Unicom
阿里云加速增长,进一步巩固领先优势
2018中国云厂商TOP5:阿里云、腾讯云、AWS、电信、联通 ...
wanxin finance