We built a medical version of the MNIST dataset and found that common AutoML algorithms are not so easy to use
2020-11-08 13:02:00 【U4u5y4 assault rifle】
Authors | Devil, Zhang Qian
Source | Almost Human
Researchers at Shanghai Jiao Tong University have created MedMNIST, a new open medical image dataset, and designed the "MedMNIST Classification Decathlon" to promote research on AutoML algorithms for medical image analysis.
Datasets play an important role in the development of AI technology. However, building medical datasets involves many difficulties, such as data acquisition and data annotation.
Recently, researchers at Shanghai Jiao Tong University created the medical image dataset MedMNIST, which comprises 10 preprocessed open medical image datasets (the data comes from several different sources and has been preprocessed).
Project page:
https://medmnist.github.io/
Paper:
https://arxiv.org/pdf/2010.14925v1.pdf
GitHub:
https://github.com/MedMNIST/MedMNIST
Dataset download:
https://www.dropbox.com/sh/upxrsyb5v8jxbso/AADOV0_6pC9Tb3cIACro1uUPa?dl=0
Like MNIST, MedMNIST performs classification tasks on lightweight 28 × 28 images, and the tasks cover the main medical imaging modalities and a variety of data scales. By the researchers' design, MedMNIST has the following features:
Educational: the multimodal data comes from several open medical image datasets released under Creative Commons licenses, so it can be used for educational purposes.
Standardized: the researchers preprocessed the data into a common format, so users need no background knowledge to use it.
Diverse: the multimodal datasets cover several data scales (from 100 to 100,000) and tasks (binary/multi-class classification, ordinal regression, and multi-label classification).
Lightweight: at 28 × 28, the images are convenient for rapid prototyping and for testing multimodal machine learning and AutoML algorithms.
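To make the "standardized" and "lightweight" points concrete, here is a minimal sketch of reading one such dataset into NumPy arrays. It assumes each dataset ships as a single `.npz` archive with arrays named `train_images`, `train_labels`, `val_images`, and so on. Those key names, and the helper functions `load_splits` and `make_fake_archive`, are assumptions for illustration and are not stated in this article, so check them against the file you actually download. The demo writes a tiny synthetic archive so that it runs offline.

```python
import os
import tempfile

import numpy as np

# Assumed split names and key layout; verify against the real archive.
SPLITS = ("train", "val", "test")


def load_splits(npz_path):
    """Return {split: (images, labels)} from an .npz archive."""
    with np.load(npz_path) as archive:
        return {
            split: (archive[f"{split}_images"], archive[f"{split}_labels"])
            for split in SPLITS
        }


def make_fake_archive(path, n=8, seed=0):
    """Write a tiny synthetic archive in the assumed layout (not real data)."""
    rng = np.random.default_rng(seed)
    arrays = {}
    for split in SPLITS:
        # n grayscale images of 28 x 28 pixels, one binary label per image
        arrays[f"{split}_images"] = rng.integers(
            0, 256, size=(n, 28, 28), dtype=np.uint8
        )
        arrays[f"{split}_labels"] = rng.integers(
            0, 2, size=(n, 1), dtype=np.uint8
        )
    np.savez(path, **arrays)


def demo():
    """Round-trip a synthetic archive and report the array shapes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "fake_medmnist.npz")
        make_fake_archive(path)
        splits = load_splits(path)
        return {name: (x.shape, y.shape) for name, (x, y) in splits.items()}


if __name__ == "__main__":
    for name, (x_shape, y_shape) in demo().items():
        print(name, x_shape, y_shape)
```

Because every image is only 28 × 28, a whole split fits comfortably in memory as one array, which is what makes quick prototyping practical.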
Inspired by the Medical Segmentation Decathlon, the study also designs the MedMNIST Classification Decathlon as an AutoML benchmark for medical image classification.
The benchmark evaluates an AutoML algorithm on all 10 datasets, with no manual tuning of the algorithm. The researchers compared the performance of several baseline methods, including ResNets with early stopping [6], open-source AutoML tools (auto-sklearn [7] and AutoKeras [8]), and a commercial AutoML tool (Google AutoML Vision). The researchers hope that the MedMNIST Classification Decathlon will promote AutoML research in medical image analysis.
Ten preprocessed datasets
MedMNIST contains 10 preprocessed datasets covering the main data modalities (such as X-ray, OCT, ultrasound, and CT), diverse classification tasks (binary/multi-class classification, ordinal regression, and multi-label classification), and data scales. As Table 1 shows, the diversity in dataset design leads to diversity in task difficulty, which is exactly what an AutoML benchmark needs. The researchers preprocessed each dataset and split it into training, validation, and test subsets.
Table 1: Overview of the MedMNIST datasets, covering each dataset's name, source, data modality, task, and split.
The datasets span modalities including X-ray, OCT, ultrasound, CT, pathology slides, and dermatoscopy, and involve medical domains such as colorectal cancer, retinal diseases, breast diseases, and liver tumors.
A new AutoML benchmark for medical images
As mentioned above, inspired by the Medical Segmentation Decathlon, the researchers designed the MedMNIST Classification Decathlon to provide a lightweight AutoML benchmark for medical image analysis. It evaluates an AutoML algorithm on all 10 datasets, with no manual tuning of the algorithm. The researchers compared the performance of several baseline methods; see Table 2:
Table 2 shows that Google AutoML Vision performs well overall but is not always the best, and sometimes even loses to ResNet-18 and ResNet-50. auto-sklearn performs poorly on most datasets, which suggests that typical statistical machine learning algorithms do poorly on medical image datasets. AutoKeras performs well on large datasets and relatively poorly on small ones. No algorithm generalizes well across all ten datasets, which makes the benchmark useful for exploring how AutoML algorithms generalize across datasets of different modalities, tasks, and scales.
Next, consider how the different methods perform on the training, validation, and test sets. As Figure 2 shows, the algorithms overfit easily on small datasets.
Google AutoML Vision controls overfitting relatively well, whereas auto-sklearn overfits severely. This suggests that an appropriate inductive bias is very important for a learning algorithm. Different regularization techniques, such as data augmentation, model ensembling, and optimization algorithms, can also be explored on the MedMNIST datasets.
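The overfitting discussed above, and the early stopping used for the ResNet baselines, can be illustrated with a small generic sketch. This is not the authors' training code: the function names, the `patience` parameter, and the accuracy curves are all made up for illustration. The idea is to keep the epoch with the best validation score and to stop once that score has failed to improve for a fixed number of epochs.

```python
def early_stopping_epoch(val_scores, patience=3):
    """Return the index of the best epoch seen before stopping.

    Training stops once the validation score has gone `patience`
    consecutive epochs without beating the best score so far.
    """
    best_epoch, best_score = 0, float("-inf")
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            break
    return best_epoch


def generalization_gap(train_score, val_score):
    """Training score minus validation score; a large gap hints at overfitting."""
    return train_score - val_score


# Hypothetical accuracy curves for a model trained on a small dataset.
train_acc = [0.60, 0.72, 0.81, 0.88, 0.93, 0.97, 0.99]
val_acc = [0.58, 0.69, 0.75, 0.76, 0.75, 0.74, 0.73]

best = early_stopping_epoch(val_acc, patience=2)
gap_at_best = generalization_gap(train_acc[best], val_acc[best])
gap_at_end = generalization_gap(train_acc[-1], val_acc[-1])

if __name__ == "__main__":
    print(f"best epoch: {best}")
    print(f"gap at best epoch: {gap_at_best:.2f}, gap at last epoch: {gap_at_end:.2f}")
```

In these invented curves the training accuracy keeps rising while the validation accuracy peaks early and then declines, so stopping at the validation peak yields a much smaller gap than training to the end.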
How to find data sets ?
Outside the medical field, datasets are sometimes hard to obtain too, so it helps to know some common data collection methods and resources. Recently, a blogger on Medium introduced several commonly used sources of data:
1. Awesome Data
This is a GitHub repository containing datasets in many different categories.
Link:
https://github.com/awesomedata/awesome-public-datasets
2. Data Is Plural
This is a dataset resource presented as a spreadsheet. It has been updated regularly since 2015, and the latest edition at the time of writing is dated October 28, 2020, so some of the resources are very recent.
Link: https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit#gid=0
3. Kaggle Datasets
Kaggle Datasets provides previews and summary information for many datasets, which makes it well suited to finding datasets on a specific topic.
Link:
https://www.kaggle.com/datasets
4. Data.world
Like Kaggle, Data.world offers a collection of user-contributed datasets, and it also gives companies a platform for storing and organizing their own data.
Link:
https://data.world/
5. Google Dataset Search
Dataset Search is a search feature that Google launched in 2018. If you are looking for data on a particular topic or from a particular source, this tool is worth a try.
Link:
https://datasetsearch.research.google.com/
6. OpenDaL
OpenDaL is also a dataset search tool. It supports several ways of searching, for example by creation time or by drawing a region on a map.
Link:
https://opendatalibrary.com/
7. Pandas Data Reader
Pandas Data Reader helps you pull data from online sources into a Python pandas DataFrame. Most of the data is financial.
Link:
https://pandas-datareader.readthedocs.io/en/latest/remote_data.html
8. Getting data from APIs
Using Python to get data from an API is another common approach among data scientists. See the tutorial below for the specific steps.
Link:
https://towardsdatascience.com/how-to-get-data-from-apis-with-python-dfb83fdc5b5b
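As a rough illustration of the API approach, the sketch below issues a GET request with Python's standard library and decodes the JSON body. The endpoint `https://api.example.com/items` and the shape of the payload are hypothetical, and `fake_opener` is a stand-in defined here so the example runs without network access. For a real API, drop the `opener` argument and follow that API's own documentation for URLs, parameters, and authentication.

```python
import io
import json
import urllib.request


def fetch_json(url, opener=urllib.request.urlopen, timeout=10):
    """GET `url` and decode the response body as JSON."""
    with opener(url, timeout=timeout) as response:
        return json.load(response)


def fake_opener(url, timeout=10):
    """Offline stand-in for urlopen that returns a canned JSON payload."""
    payload = json.dumps({"url": url, "records": [{"id": 1}, {"id": 2}]})
    return io.BytesIO(payload.encode("utf-8"))


if __name__ == "__main__":
    # Hypothetical endpoint, served here by the offline stand-in.
    data = fetch_json("https://api.example.com/items", opener=fake_opener)
    print(len(data["records"]), "records from", data["url"])
```

Passing the opener as a parameter keeps the parsing logic testable offline, and the decoded records can then be loaded into a pandas DataFrame or any other structure.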
Reference: https://towardsdatascience.com/the-top-10-best-places-to-find-datasets-8d3b4e31c442