img2dataset

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.

Also supports saving captions for url+caption datasets.

Install

pip install img2dataset

Usage

First, get a list of image urls. For example:

echo 'https://placekitten.com/200/305' >> myimglist.txt
echo 'https://placekitten.com/200/304' >> myimglist.txt
echo 'https://placekitten.com/200/303' >> myimglist.txt

Then, run the tool:

img2dataset --url_list=myimglist.txt --output_folder=output_folder --thread_count=64 --image_size=256

The tool will then automatically download the urls, resize them, and store them in the following format:

  • output_folder
    • 0
      • 0.jpg
      • 1.jpg
      • 2.jpg

or in this format if choosing webdataset:

  • output_folder
    • 0.tar containing:
      • 0.jpg
      • 1.jpg
      • 2.jpg

with each number being the position in the list. The subfolders avoid having too many files in a single folder.

If captions are provided, they will be saved as 0.txt, 1.txt, ...

This can then easily be fed into machine learning training or any other use case.
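
For example, here is a minimal sketch of iterating over the default files output with Pillow (assuming Pillow is installed and captions were saved alongside the images):

from glob import glob
from pathlib import Path

from PIL import Image

# walk the numbered subfolders produced by the files output format
for img_path in sorted(glob("output_folder/*/*.jpg")):
    caption_path = Path(img_path).with_suffix(".txt")
    image = Image.open(img_path)
    caption = caption_path.read_text() if caption_path.exists() else None
    print(img_path, image.size, caption)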

If the save_metadata option is turned on (the default), then .json files named 0.json, 1.json, ... are saved with these keys:

  • url
  • caption
  • key
  • shard_id
  • status: whether the download succeeded
  • error_message
  • width
  • height
  • original_width
  • original_height
  • exif

A .parquet file will also be saved with the same name as the subfolder/tar file, containing the same metadata. It can be used to analyze the results efficiently.
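
For example, a minimal sketch of checking one shard with pandas (assuming pandas and pyarrow are installed, and that the first shard's metadata file is named 0.parquet):

import pandas as pd

# load the metadata written next to the first subfolder/tar
df = pd.read_parquet("output_folder/0.parquet")
print(df["status"].value_counts())  # how many downloads succeeded
print(df.loc[df["status"] != "success", "error_message"].value_counts().head(10))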

Integration with Weights & Biases

Performance metrics are monitored through Weights & Biases.

W&B metrics

In addition, the most frequent errors are logged for easier debugging.

W&B table

Other features are available:

  • logging of environment configuration (OS, python version, CPU count, Hostname, etc)
  • monitoring of hardware resources (GPU/CPU, RAM, Disk, Networking, etc)
  • custom graphs and reports
  • comparison of runs (convenient when optimizing parameters such as number of threads/cpus)

When running the script for the first time, you can decide to either associate your metrics with your account or log them anonymously.

You can also log in (or create an account) beforehand by running wandb login.

API

This module exposes a single function, download, which takes the same arguments as the command-line tool (an example call follows the option list):

  • url_list A file with the list of urls of images to download. It can be a folder of such files. (required)
  • image_size The size to resize image to (default 256)
  • output_folder The path to the output folder. If existing subfolders are present, the tool will continue to the next number. (default "images")
  • processes_count The number of processes used for downloading the pictures. Setting this high is important for performance. (default 1)
  • thread_count The number of threads used for downloading the pictures. Setting this high is important for performance. (default 256)
  • resize_mode The way to resize pictures, can be no, border, keep_ratio or center_crop (default border)
    • no doesn't resize at all
    • border will make the image image_size x image_size and add a border
    • keep_ratio will keep the ratio and make the smallest side of the picture image_size
    • center_crop will keep the ratio and center crop the largest side so the picture is squared
  • resize_only_if_bigger resize pictures only if they are bigger than image_size (default False)
  • output_format decides how to save pictures (default files)
    • files saves as a set of subfolders containing pictures
    • webdataset saves as tars containing pictures
  • input_format decides how to load the urls (default txt)
    • txt loads the urls as a text file of urls, one per line
    • csv loads the urls and optional caption as a csv
    • tsv loads the urls and optional caption as a tsv
    • parquet loads the urls and optional caption as a parquet
  • url_col the name of the url column for parquet and csv (default url)
  • caption_col the name of the caption column for parquet and csv (default None)
  • number_sample_per_shard the number of samples that will be downloaded in one shard (default 10000)
  • save_metadata if true, saves one parquet file per folder/tar and json files with metadata (default True)
  • save_additional_columns list of additional columns to take from the csv/parquet files and save in metadata files (default None)
  • timeout maximum time (in seconds) to wait when trying to download an image (default 10)
  • wandb_project name of W&B project used (default img2dataset)
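
For example, a minimal call equivalent to the command-line example above (assuming img2dataset is installed and myimglist.txt exists):

from img2dataset import download

download(
    url_list="myimglist.txt",
    output_folder="output_folder",
    thread_count=64,
    image_size=256,
)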

How to tweak the options

The default values should be good enough for small datasets. For larger ones, these tips may help you get the best performance (a sketch of a tuned call follows the list):

  • set the processes_count as the number of cores your machine has
  • increase thread_count as long as your bandwidth and cpu are below the limits
  • I advise setting output_format to webdataset if your dataset has more than 1M elements; it is easier to manipulate a few tars than millions of files
  • keeping save_metadata to True can be useful to check which items were already saved and avoid redownloading them
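
As a sketch, a call tuned along these lines for a larger parquet-based dataset might look like this (the input file name and core count are assumptions about your setup):

from img2dataset import download

download(
    url_list="large_url_list.parquet",  # hypothetical input file
    input_format="parquet",
    url_col="url",
    caption_col="caption",
    output_folder="output_folder",
    output_format="webdataset",  # a few tars instead of millions of files
    processes_count=16,          # set to the number of cores of the machine
    thread_count=256,
    image_size=256,
    save_metadata=True,          # keeps track of what was already saved
)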

Road map

This tool works very well in the current state for up to 100M elements. Future goals include:

  • a benchmark for 1B pictures which may require
    • further optimization on the resizing part
    • better multi node support
    • integrated incremental support (only download new elements)

Architecture notes

This tool is designed to download pictures as fast as possible. This puts stress on various kinds of resources. Some numbers, assuming 1350 images/s (a quick back-of-envelope check follows the list):

  • Bandwidth: downloading a thousand average images per second requires about 130MB/s
  • CPU: resizing one image may take several milliseconds; several thousand per second can use up to 16 cores
  • DNS querying: millions of urls mean millions of domains; the default OS settings are usually not enough. Setting up a local bind9 resolver may be required
  • Disk: if using resizing, up to 30MB/s write speed is necessary. If not using resizing, up to 130MB/s. Writing to a few tar files makes it possible to use rotational drives instead of an SSD.
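
As a quick back-of-envelope check of these numbers (the average image sizes are assumptions, roughly consistent with the bandwidth and disk figures above):

images_per_second = 1350
avg_original_kb = 100  # assumed average size of a downloaded image
avg_resized_kb = 20    # assumed average size after resizing to 256x256

print(f"bandwidth: ~{images_per_second * avg_original_kb / 1000:.0f} MB/s")      # ~135 MB/s
print(f"disk (resized): ~{images_per_second * avg_resized_kb / 1000:.0f} MB/s")  # ~27 MB/s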

With this information in mind, the design choices were made as follows:

  • the list of urls is split into N shards. N is usually chosen as the number of cores
  • N processes are started (using multiprocessing process pool)
    • each process starts M threads. M should be maximized in order to use as much network as possible while keeping cpu usage below 100%.
    • each of these threads downloads 1 image and returns it
    • the parent thread handles resizing (which means there are at most N resizes running at once, using up the cores but not more)
    • the parent thread saves to a tar file that is different from those of the other processes

This design makes it possible to use the CPU efficiently by doing only 1 resize per core, reduces disk overhead by opening 1 file per core, and uses the bandwidth as much as possible with M threads per process.
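
Below is a minimal, illustrative sketch of this design, not the actual implementation: N processes from a multiprocessing pool, each running M download threads, with the parent thread of each process doing the resizing (here with Pillow) and writing its own tar.

import io
import tarfile
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Pool

from PIL import Image

IMAGE_SIZE = 256
THREADS_PER_PROCESS = 16  # M

def fetch(item):
    key, url = item
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return key, response.read(), None
    except Exception as err:  # record the error instead of crashing the shard
        return key, None, str(err)

def process_shard(shard):
    shard_id, urls = shard  # urls is a list of (key, url) pairs
    with tarfile.open(f"{shard_id}.tar", "w") as tar, \
         ThreadPoolExecutor(THREADS_PER_PROCESS) as threads:
        # M threads download; the parent thread resizes and writes sequentially
        for key, data, err in threads.map(fetch, urls):
            if err is not None:
                continue
            image = Image.open(io.BytesIO(data)).convert("RGB")
            image = image.resize((IMAGE_SIZE, IMAGE_SIZE))
            buffer = io.BytesIO()
            image.save(buffer, format="JPEG")
            info = tarfile.TarInfo(name=f"{key}.jpg")
            info.size = buffer.getbuffer().nbytes
            buffer.seek(0)
            tar.addfile(info, buffer)

if __name__ == "__main__":
    shards = [(0, [(0, "https://placekitten.com/200/305")])]
    with Pool(processes=2) as pool:  # N processes, usually the core count
        pool.map(process_shard, shards)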

Setting up a bind9 resolver

In order to keep the success rate high, it is necessary to use an efficient DNS resolver. I tried several options: systemd-resolved, dnsmasq and bind9, and reached the conclusion that bind9 performs best for this use case. Here is how to set it up on Ubuntu:

sudo apt install bind9
sudo vim /etc/bind/named.conf.options

Add this in the options section:
        recursive-clients 10000;
        resolver-query-timeout 30000;
        max-clients-per-query 10000;
        max-cache-size 2000m;

sudo systemctl restart bind9

sudo vim /etc/resolv.conf

Put this content:
nameserver 127.0.0.1

This makes it possible to keep a high success rate while doing thousands of dns queries. You may also want to set up bind9 logging in order to check that few dns errors happen.
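
To sanity check the resolver from Python, a small sketch like this resolves the hostnames of a url list (myimglist.txt from the usage example) and reports how many fail:

import socket
from urllib.parse import urlparse

with open("myimglist.txt") as f:
    hosts = {urlparse(line.strip()).hostname for line in f if line.strip()}

failures = 0
for host in hosts:
    try:
        socket.gethostbyname(host)
    except OSError:
        failures += 1
print(f"{failures}/{len(hosts)} hostnames failed to resolve")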

For development

Either locally, or in gitpod (do export PIP_USER=false there)

Set up a virtualenv:

python3 -m venv .env
source .env/bin/activate
pip install -e .

to run tests:

pip install -r requirements-test.txt

then

python -m pytest -v tests -s

Benchmarks

10000 image benchmark

cd tests
bash benchmark.sh

18M image benchmark

Download the first part of the crawling at home dataset, then:

cd tests
bash large_bench.sh

It takes 3.7h to download 18M pictures

1350 images/s is the currently observed performance. 4.8M images per hour, 116M images per 24h.

36M image benchmark

Downloading 2 parquet files of 18M items each (936GB of output) took 7h24, an average of 1345 images/s.

190M benchmark

Downloading 190M images from the crawling at home dataset took 41h (5TB of output), an average of 1280 images/s.

Comments
  • Downloader is not producing full set of expected outputs

    Downloader is not producing full set of expected outputs

    Heya, I was trying to download the LAION400M dataset and noticed that I am not getting the full set of data for some reason.

    Any tips on debugging further?

    TL;DR - I was expecting ~12M files to be downloaded, but am only seeing successes in the *_stats.json files indicating ~2M files were actually downloaded

    For example - I recently tried to download this dataset in a distributed manner on EMR:

    https://deploy.laion.ai/8f83b608504d46bb81708ec86e912220/dataset/part-00000-5b54c5d5-bbcf-484d-a2ce-0d6f73df1a36-c000.snappy.parquet

    I applied some light NSFW filtering on it to produce a new parquet

    # rest of the script is redacted, but there is some code before this to normalize the NSFW row to make filtering more convenient
    sampled_df = df[df["NSFW"] == "unlikely"]
    sampled_df.reset_index(inplace=True)
    

    Verified its row count is ~12M samples:

    import glob
    import json
    from pyarrow.parquet import ParquetDataset
    
    files = glob.glob("*.parquet")
    
    d = {}
    
    for file in files:
        d[file] = 0
        dataset = ParquetDataset(file)
        for piece in dataset.pieces:
            d[file] += piece.get_metadata().num_rows
    
    print(json.dumps(d, indent=2, sort_keys=True))
    
    {
      "part00000.parquet": 12026281
    }
    

    Ran the download, and scanned over the output s3 bucket:

    aws s3 cp\
    	s3://path/to/s3/download/ . \
    	--exclude "*" \
    	--include "*.json" \
    	--recursive
    

    Ran this script to get the total count of images downloaded:

    import json
    import glob
    
    files = glob.glob("/path/to/json/files/*.json")
    
    count = {}
    successes = {}
    
    for file in files:
        with open(file) as f:
            j = json.load(f)
            count[file] = j["count"]
            successes[file] = j["successes"]
    
    rate = 100 * sum(successes.values()) / sum(count.values())
    print(f"Success rate: {rate}. From {sum(successes.values())} / {sum(count.values())}")
    

    which gave me the following output:

    Success rate: 56.15816066896948. From 1508566 / 2686281
    

    The high error rate here is not of major concern; I was running at a low worker node count for experimentation, so we have a lot of dns issues (I'll use a knot resolver later)

    unknown url type: '21nicrmo2'                                                      1.0
    <urlopen error [errno 22] invalid argument>                                        1.0
    encoding with 'idna' codec failed (unicodeerror: label empty or too long)          1.0
    http/1.1 401.2 unauthorized\r\n                                                    4.0
    <urlopen error no host given>                                                      5.0
    <urlopen error unknown url type: "https>                                          11.0
    incomplete read                                                                   14.0
    <urlopen error [errno 101] network is unreachable>                                38.0
    <urlopen error [errno 104] connection reset by peer>                              75.0
    [errno 104] connection reset by peer                                              92.0
    opencv                                                                           354.0
    <urlopen error [errno 113] no route to host>                                     448.0
    remote end closed connection without response                                    472.0
    <urlopen error [errno 111] connection refused>                                  1144.0
    encoding issue                                                                  2341.0
    timed out                                                                       2850.0
    <urlopen error timed out>                                                       4394.0
    the read operation timed out                                                    4617.0
    image decoding error                                                            5563.0
    ssl                                                                             6174.0
    http error                                                                     62670.0
    <urlopen error [errno -2] name or service not known>                         1086446.0
    success                                                                      1508566.0
    

    I also noticed there were only 270 json files produced, but given that each shard should contain 10,000 images, I expected ~1,200 json files to be produced. Not sure where this discrepancy is coming from

    > ls
    00000_stats.json  00051_stats.json  01017_stats.json  01066_stats.json  01112_stats.json  01157_stats.json
    00001_stats.json  00052_stats.json  01018_stats.json  01067_stats.json  01113_stats.json  01159_stats.json
    ...
    > ls -l | wc -l 
    270
    
    opened by PranshuBansalDev 33
  • Increasing mem and no output files

    Increasing mem and no output files

    Currently using your tool to download the laion dataset, thank you for your contribution. The program grows in memory until it uses all of my 32G of RAM and 64G of SWAP. No tar files are ever output. Am I doing something wrong?

    Using the following command (slightly modified from the official command provided by laion):

    img2dataset --url_list laion400m-meta --input_format "parquet" \
        --url_col "URL" --caption_col "TEXT" --output_format webdataset \
        --output_folder webdataset --processes_count 1 --thread_count 12 --image_size 384 \
        --save_additional_columns '["NSFW","similarity","LICENSE"]'

    opened by pbatk 23
  • feat: support tfrecord

    feat: support tfrecord

    Add support for tfrecords.

    The webdataset format is not very convenient on TPUs due to poor support of pytorch dataloaders in multiprocessing at the moment, so tfrecords allow better usage of CPUs.

    opened by borisdayma 22
  • Download stall at the end

    Download stall at the end

    I'm trying to download the CC3M dataset on an AWS Sagemaker Notebook instance. I first do pip install img2dataset, then fire up a terminal and run

    img2dataset --url_list cc3m.tsv --input_format "tsv"\
             --url_col "url" --caption_col "caption" --output_format webdataset\
               --output_folder cc3m --processes_count 16 --thread_count 64 --resize_mode no\
                 --enable_wandb False
    

    The code runs and downloads but stalls towards the end. I tried terminating by restarting the instance; as a result, some .tar files give a read error "Unexpected end of file" when used for training. I also tried terminating with Ctrl-C on a second run, which resulted in the same read error when using the tar files for training. The difference between the two termination methods is that the latter seemed to do some cleanup, which removed the "_tmp" folder within the download folder.

    opened by xiankgx 13
  • Respect noai and noimageai directives when downloading image files

    Respect noai and noimageai directives when downloading image files

    Media owners can use the X-Robots-Tag header to communicate usage directives for the associated media, including instructions that the image not be used in any indexes (noindex) or included in datasets used for machine learning purposes (noai).

    This PR makes img2dataset respect such directives by not including the associated media in the generated dataset. It also updates the user agent string, introducing an img2dataset user agent token so that requests made using the tool are identifiable by media hosts.

    Refs:

    • https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#xrobotstag
    • https://www.deviantart.com/team/journal/A-New-Directive-for-Opting-Out-of-AI-Datasets-934500371
    opened by raincoastchris 12
  • How to download SBUcaptions and Visual Genome (VG) dataset in webdataset format

    How to download SBUcaptions and Visual Genome (VG) dataset in webdataset format

    For Vision and Language pretraining, cc3m, mscoco, SBUcaptions and VG are very relevant datasets. I haven't been able to download SBU captions and VG. Here are my questions.

    1. How to download SBU captions and VG's metadata?
    2. How to download these datasets on webdataset format?

    Could you also please provide me with a tutorial or just some hints to download it in webdataset format using img2dataset? Thank you in advance.

    opened by sanyalsunny111 8
  • clip-retrieval-getting-started.ipynb giving errors (Urgent)

    clip-retrieval-getting-started.ipynb giving errors (Urgent)

    Hello there, I am new to the world of deep learning. I am trying to run clip-retrieval-getting-started.ipynb but am getting the error attached as a snip. Please help, it's urgent.

    opened by minakshimathpal 8
  • Decrease memory usage

    Decrease memory usage

    Currently the memory usage is about 1.5GB per core. That's way too much; it must be possible to decrease it. Figure out what's using all that ram (is it because the resize queue is full? Should there be some backpressure on the downloader, etc.) and solve it.

    opened by rom1504 8
  • Interest in supporting video datasets?

    Interest in supporting video datasets?

    Hi. Thanks for the amazing repository. It really makes the workflow very easy. I was wondering if you are considering adding video datasets as well. Some are based on urls, while others are derived from youtube or segments from youtube.

    opened by TheShadow29 7
  • Add checksum of image

    Add checksum of image

    I think it could be useful to add a checksum in the parquet files since we're downloading the images anyway and it's fast to compute. It would help us do a real deduplication, not only on urls but on actual image content.

    opened by borisdayma 7
  • add list of int, float feature in TFRecordSampleWriter

    add list of int, float feature in TFRecordSampleWriter

    We use list-of-int and list-of-float attributes in the coyo-labeled-300m dataset (it will be released soon). To create such a dataset using img2dataset in tfrecord format, we need to add the above features.

    opened by justHungryMan 6
  • Figure out how to timeout

    Figure out how to timeout

    I implemented some new metrics and found that many urls time out after 20s, which clearly slows down everything

    here are some examples:

    Downloaded (12, 'http://www.herteldenbirname.com/wp-content/uploads/2014/05/Italia-Independent-Flocked-Aviator-Sunglasses-150x150.jpg') in 10.019284009933472
    Downloaded (124, 'http://image.rakuten.co.jp/sneak/cabinet/shoes-03/cr-ucrocs5-a.jpg?_ex=128x128') in 10.01184344291687
    Downloaded (146, 'http://www.slicingupeyeballs.com/wp-content/uploads/2009/05/stoneroses452.jpg') in 10.006474256515503
    Downloaded (122, 'https://media.mwcradio.com/mimesis/2013-03/01/2013-03-01T153415Z_1_CBRE920179600_RTROPTP_3_TECH-US-GERMANY-EREADER_JPG_475x310_q85.jpg') in 10.241626739501953
    Downloaded (282, 'https://8d1aee3bcc.site.internapcdn.net/00/images/media/5/5cfb2eba8f1f6244c6f7e261b9320a90-1.jpg') in 10.431355476379395
    Downloaded (298, 'https://my-furniture.com.au/media/catalog/product/cache/1/small_image/295x295/9df78eab33525d08d6e5fb8d27136e95/a/u/au0019-stool-01.jpg') in 10.005694150924683
    Downloaded (300, 'http://images.tastespotting.com/thumbnails/889506.jpg') in 10.007027387619019
    Downloaded (330, 'https://www.infoworld.pk/wp-content/uploads/2016/02/Cool-HD-Valentines-Day-Wallpapers-480x300.jpeg') in 10.004335880279541
    Downloaded (361, 'http://pendantscarf.com/image/cache/data/necklace/JW0013-(2)-150x150.jpg') in 10.00539231300354
    Downloaded (408, 'https://www.solidrop.net/photo-6/animorphia-coloring-books-for-adults-children-drawing-book-secret-garden-style-relieve-stress-graffiti-painting-book.jpg') in 10.004313945770264

    Let's try to implement request timeout

    I tried #153, eventlet and #260, and none of them can timeout properly

    A good value for timeout is 2s

    opened by rom1504 16
  • Add asyncio implementation of downloader

    Add asyncio implementation of downloader

    #252 #256

    The implementation of the asyncio downloader. It can also run properly on Windows (without a 3rd party dns resolver) with an average of 500~600Mbps (under a 1Gbps network).

    use the command arg --downloader to choose the type of downloader ("normal", "async"):

    img2dataset --downloader async
    

    mscoco download test

    opened by KohakuBlueleaf 3
  • opencv-python => opencv-python-headless

    opencv-python => opencv-python-headless

    This PR replaces opencv-python with opencv-python-headless to remove the dependency on GUI-related libraries (see: https://github.com/opencv/opencv-python/issues/370#issuecomment-671202529). I tested that this works on the python:3.9 Docker image.

    opened by shionhonda 2