当前位置:网站首页>We made a medical version of the MNIST dataset, and found that the common automl algorithm is not so easy to use
We made a medical version of the MNIST dataset, and found that the common automl algorithm is not so easy to use
2020-11-08 13:02:00 【U4u5y4 assault rifle】

author | Devil 、 Zhang Qian
source | Almost Human
Shanghai Jiaotong University researchers create a new open medical image data set MedMNIST, And Design 「MedMNIST Categorical decathlon 」, To promote AutoML Algorithm in the field of medical image analysis research .
stay AI In the development of Technology , Data sets play an important role . However , There are many difficulties in the creation of medical data sets , Such as data acquisition 、 Data tagging, etc .
In the near future , Researchers at Shanghai Jiaotong University created a medical image dataset MedMNIST, common contain 10 Preprocessing open medical image datasets ( Its data comes from many different data sources , And after pretreatment ).
Project address :
https://medmnist.github.io/
Address of thesis :
https://arxiv.org/pdf/2010.14925v1.pdf
GitHub Address :
https://github.com/MedMNIST/MedMNIST
Dataset download address :
https://www.dropbox.com/sh/upxrsyb5v8jxbso/AADOV0_6pC9Tb3cIACro1uUPa?dl=0
and MNIST The dataset is the same ,MedMNIST Data sets In lightweight 28 × 28 Performing classification tasks on images , The tasks involved cover the main medical image modes and diverse data scales . According to the researchers' design ,MedMNIST Data sets have the following features :
educative nature : The multimodal data in this dataset comes from multiple open medical image datasets with knowledge sharing license , It can be used for educational purposes .
Standardization : The researchers preprocessed the data , Convert it to the same format , therefore Users do not need to have background knowledge to use .
diversity : Multimodal datasets cover multiple data scales ( from 100 To 100,000) And tasks ( Two classification / Many classification 、 Ordered regression and multi label ).
Lightweight : The image size is 28 × 28, It is convenient for rapid prototyping and testing, and multimodal machine learning and AutoML Algorithm .
suffer Medical Segmentation Decathlon( Medical split decathlon ) Inspired by the , The study also designed MedMNIST Classification Decathlon(MedMNIST Categorical decathlon ), As AutoML Benchmark in the field of medical image classification .
It's all about 10 Evaluation on data sets AutoML Performance of the algorithm , The algorithm is not adjusted manually . The researchers compared the performance of several baseline methods , Including early stop ResNet [6]、 Open source AutoML Tools (auto-sklearn [7] and AutoKeras [8]), And commercialization AutoML Tools (Google AutoML Vision). The researchers hope that MedMNIST Classification Decathlon Can promote AutoML Research in the field of medical image analysis .


Ten preprocessed datasets
MedMNIST Data set containing 10 Preprocessing data sets , Covering the main data modes ( Such as X Photo chip 、OCT、 ultrasonic 、CT)、 Diverse classification tasks ( Two classification / Many classification 、 Ordered regression and multi label ) And data scale . As shown in the table 1 Shown , The diversity of data set design leads to the diversity of task difficulty , And that's what AutoML What benchmarks need . The researchers preprocessed each data set , Divide it into training - verification - Test subsets .

surface 1:MedMNIST Data set Overview , Covers the name of the dataset 、 source 、 Data mode 、 Task and dataset segmentation .
The data sets of these modes cover X Photo chip 、OCT、 ultrasonic 、CT、 Pathological section 、 Dermoscopy, etc , It's about colorectal cancer 、 Retinal diseases 、 Breast disease 、 Liver tumor and many other medical fields .

new type AutoML Medical image benchmark
As mentioned earlier , The researchers were inspired by the medical split decathlon , Designed 「MedMNIST Categorical decathlon 」, Designed to create lightweight... For medical image analysis AutoML The benchmark . It's all about 10 Evaluation on data sets AutoML Performance of the algorithm , The algorithm is not adjusted manually . The researchers compared the performance of several baseline methods , See the table below 2:

From the table 2 It can be seen that ,Google AutoML Vision The overall performance is good , But it's not always the best , Sometimes even lose to ResNet-18 and ResNet-50.auto-sklearn It doesn't perform well on most datasets , This shows that the performance of the typical statistical machine learning algorithm on the medical image data set is poor .AutoKeras Good performance on large data sets , Relatively poor performance on small data sets . No algorithm can achieve good generalization performance on these ten datasets , It helps to explore AutoML The algorithm is in different data modes 、 Generalization effects on task and scale datasets .
Next , Let's look at different methods in the training set 、 Performance on verification set and test set . Here's the picture 2 Shown , The algorithm is easy to over fit on small data sets .

Google AutoML Vision It can better control the over fitting problem , and auto-sklearn There is a serious over fitting . It can be inferred from this that , For learning algorithms , Appropriate reductive bias It's very important . We can still do that MedMNIST Explore different regularization techniques on datasets , Such as data enhancement 、 Model integration 、 Optimization algorithm, etc .

How to find data sets ?
Besides the medical field , Data sets from other fields are sometimes difficult to access , This requires us to master some common data collection methods and common resources . lately ,Medium A blogger on introduced several commonly used data collection sources :
1. Awesome Data
This is a GitHub The repository , Contains multiple different categories of datasets .
link :
https://github.com/awesomedata/awesome-public-datasets
2. Data Is Plural
This is a dataset resource presented in spreadsheet form , from 2015 It's been updated regularly since , The latest issue is 2020 year 10 month 28 The resources of the day , So some of the resources are very new .
link :https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit#gid=0
3. Kaggle Datasets
Kaggle Datasets Provides preview and summary information about many datasets , Very suitable for retrieving data sets for specific topics .
link :
https://www.kaggle.com/datasets
4. Data.world
and Kaggle equally ,Data.world Provides a series of user contributed datasets , It also provides a platform for companies to store and organize their own data .
link :
https://data.world/
5. Google Dataset Search
Dataset search It's Google 2018 A new search function launched in . If you're looking for data from a particular topic or source , This tool is worth trying .
link :
https://datasetsearch.research.google.com/
6. OpenDaL
OpenDal It's also a dataset search tool , You can search in many ways , For example, according to the creation time or frame a certain area on the map .
link :
https://opendatalibrary.com/
7. Pandas Data Reader
Pandas Data Reader It can help you pull data from online resources , And then apply it to Python pandas DataFrame in . Most of this is financial data .
link :
https://pandas-datareader.readthedocs.io/en/latest/remote_data.html
8. from API get data
utilize Python from API Data acquisition is also a common method used by data scientists , Please refer to the following tutorial for specific operation steps .
link :
https://towardsdatascience.com/how-to-get-data-from-apis-with-python-dfb83fdc5b5b
Reference link :https://towardsdatascience.com/the-top-10-best-places-to-find-datasets-8d3b4e31c442
????
Now? , stay 「 You know 」 We can also be found
Go to Zhihu home page and search 「PaperWeekly」
Click on 「 Focus on 」 Subscribe to our column
About PaperWeekly
PaperWeekly It's a recommendation 、 Reading 、 Discuss 、 An academic platform for reporting the achievements of the frontier papers on artificial intelligence . If you study or engage in AI field , Welcome to clicking on the official account 「 Communication group 」, The little assistant will take you into PaperWeekly In the communication group .


版权声明
本文为[U4u5y4 assault rifle]所创,转载请带上原文链接,感谢
边栏推荐
- What is SVG?
- 阿里云视频云技术专家 LVS 演讲全文:《“云端一体”的智能媒体生产制作演进之路》
- wanxin finance
- Top 5 Chinese cloud manufacturers in 2018: Alibaba cloud, Tencent cloud, AWS, telecom, Unicom
- Adobe media encoder /Me 2021软件安装包(附安装教程)
- C language I blog assignment 03
- PMP experience sharing
- STM32CubeIDE下载安装-GPIO基本配置操作-Debug调试(基于CMSIS DAP Debug)
- 华为云重大变革:Cloud&AI 升至华为第四大 BG ,火力全开
- 啥是数据库范式
猜你喜欢

这次,快手终于比抖音'快'了!

Or talk No.19 | Facebook Dr. Tian Yuandong: black box optimization of hidden action set based on Monte Carlo tree search

“他,程序猿,35岁,被劝退”:不要只懂代码,会说话,胜过10倍默默努力

This paper analyzes the top ten Internet of things applications in 2020!

2018中国云厂商TOP5:阿里云、腾讯云、AWS、电信、联通 ...

Adobe Lightroom /Lr 2021软件安装包(附安装教程)

Xamarin 从零开始部署 iOS 上的 Walterlv.CloudKeyboard 应用

啥是数据库范式

Rust: performance test criteria Library

Xamarin deploys IOS from scratch Walterlv.CloudKeyboard application
随机推荐
华为云重大变革:Cloud&AI 升至华为第四大 BG ,火力全开
What can your cloud server do? What is the purpose of cloud server?
Windows下快递投递柜、寄存柜的软件初探
wanxin finance
From a friend recently Ali, Tencent, meituan and other P7 Python development post interview questions
Win10 Terminal + WSL 2 安装配置指南,精致开发体验
Top 5 Chinese cloud manufacturers in 2018: Alibaba cloud, Tencent cloud, AWS, telecom, Unicom
2018中国云厂商TOP5:阿里云、腾讯云、AWS、电信、联通 ...
后端程序员必备:分布式事务基础篇
[Python 1-6] Python tutorial 1 -- number
Python Gadgets: code conversion
2018中国云厂商TOP5:阿里云、腾讯云、AWS、电信、联通 ...
C语言I博客作业03
适合c/c++新手学习的一些项目,别给我错过了!
Stm32uberide download and install - GPIO basic configuration operation - debug (based on CMSIS DAP debug)
On the software of express delivery cabinet and deposit cabinet under Windows
Google's AI model, which can translate 101 languages, is only one more than Facebook
Eight ways to optimize if else code
Introduction to mongodb foundation of distributed document storage database
值得一看!EMR弹性低成本离线大数据分析最佳实践(附网盘链接)