Imagededup - 😎 Finding duplicate images made easy

Overview

Image Deduplicator (imagededup)


imagededup is a Python package that simplifies the task of finding exact and near duplicates in an image collection.

This package provides functionality to make use of hashing algorithms that are particularly good at finding exact duplicates as well as convolutional neural networks which are also adept at finding near duplicates. An evaluation framework is also provided to judge the quality of deduplication for a given dataset.

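The package provides the following functionality: several hashing algorithms (PHash, DHash, WHash, AHash) and convolutional neural networks for finding duplicates, an evaluation framework to judge the quality of deduplication, and plotting of the duplicates found.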

Detailed documentation for the package can be found at: https://idealo.github.io/imagededup/

imagededup is compatible with Python 3.6+ and runs on Linux, MacOS X and Windows. It is distributed under the Apache 2.0 license.

โš™๏ธ Installation

There are two ways to install imagededup:

  • Install imagededup from PyPI (recommended):
pip install imagededup

โš ๏ธ Note: The TensorFlow >=2.1 and TensorFlow 1.15 release now include GPU support by default. Before that CPU and GPU packages are separate. If you have GPUs, you should rather install the TensorFlow version with GPU support especially when you use CNN to find duplicates. It's way faster. See the TensorFlow guide for more details on how to install it for older versions of TensorFlow.

  • Install imagededup from the GitHub source:
git clone https://github.com/idealo/imagededup.git
cd imagededup
pip install "cython>=0.29"
python setup.py install

🚀 Quick Start

To find duplicates in an image directory using perceptual hashing, the following workflow can be used:

  • Import perceptual hashing method
from imagededup.methods import PHash
phasher = PHash()
  • Generate encodings for all images in an image directory
encodings = phasher.encode_images(image_dir='path/to/image/directory')
  • Find duplicates using the generated encodings
duplicates = phasher.find_duplicates(encoding_map=encodings)
  • Plot the duplicates obtained for a given file (e.g. 'ukbench00120.jpg') using the duplicates dictionary
from imagededup.utils import plot_duplicates
plot_duplicates(image_dir='path/to/image/directory',
                duplicate_map=duplicates,
                filename='ukbench00120.jpg')

The output is a plot of the given file together with its retrieved duplicates.

The complete code for the workflow is:

from imagededup.methods import PHash
phasher = PHash()

# Generate encodings for all images in an image directory
encodings = phasher.encode_images(image_dir='path/to/image/directory')

# Find duplicates using the generated encodings
duplicates = phasher.find_duplicates(encoding_map=encodings)

# plot duplicates obtained for a given file using the duplicates dictionary
from imagededup.utils import plot_duplicates
plot_duplicates(image_dir='path/to/image/directory',
                duplicate_map=duplicates,
                filename='ukbench00120.jpg')
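
Once `duplicates` is available it can be post-processed like any other dictionary. The sketch below assumes the map has the shape described in the package documentation, where each filename maps to a list of the filenames judged to be its duplicates. The sample data and the helper name `files_to_remove` are made up for illustration and are not part of the package.

```python
def files_to_remove(duplicate_map):
    """Keep one file per duplicate group and return the rest, sorted."""
    keep, remove = set(), set()
    for name in sorted(duplicate_map):
        if name in remove:
            continue
        keep.add(name)
        # Everything listed as a duplicate of a kept file can go.
        remove.update(d for d in duplicate_map[name] if d not in keep)
    return sorted(remove)


# Hypothetical duplicates map for a four-image directory.
sample = {
    "a.jpg": ["b.jpg", "c.jpg"],
    "b.jpg": ["a.jpg", "c.jpg"],
    "c.jpg": ["a.jpg", "b.jpg"],
    "d.jpg": [],
}
print(files_to_remove(sample))
```

Which file of a group is kept is arbitrary here (the alphabetically first one); a real workflow might prefer the largest or oldest file instead.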

For more examples, refer to this part of the repository.

For more detailed usage of the package functionality, refer to: https://idealo.github.io/imagededup/

โณ Benchmarks

Detailed benchmarks on speed and classification metrics for the different methods are provided in the documentation. Generally speaking, the following conclusions can be drawn:

  • CNN works best for near duplicates and datasets containing transformations.
  • All deduplication methods fare well on datasets containing exact duplicates, but Difference hashing is the fastest.
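
To give a feel for why hashing handles exact duplicates so cheaply, here is a minimal sketch of the idea behind difference hashing: each bit records whether a pixel is brighter than its right-hand neighbour, and two images are compared by the Hamming distance between their bit strings. This is an illustration on toy data, not imagededup's implementation; the function names are hypothetical.

```python
def dhash_bits(gray):
    """Difference hash of a 2-D grid of grayscale values.

    Each bit records whether a pixel is brighter than its right-hand neighbour.
    """
    return [int(row[i] > row[i + 1]) for row in gray for i in range(len(row) - 1)]


def hamming(a, b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


# Toy 3x4 "images"; a real hash would first resize the image to a fixed grid.
original = [[10, 20, 30, 40], [40, 30, 20, 10], [5, 5, 9, 1]]
brighter = [[v + 3 for v in row] for row in original]  # uniform brightness shift
different = [[40, 30, 20, 10], [10, 20, 30, 40], [1, 9, 5, 5]]

print(hamming(dhash_bits(original), dhash_bits(brighter)))
print(hamming(dhash_bits(original), dhash_bits(different)))
```

The uniformly brightened copy yields the same bits as the original, while the unrelated grid differs in most positions.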

๐Ÿค Contribute

We welcome all kinds of contributions. See the Contribution guide for more details.

๐Ÿ“ Citation

Please cite Imagededup in your publications if this is useful for your research. Here is an example BibTeX entry:

@misc{idealods2019imagededup,
  title={Imagededup},
  author={Tanuj Jain and Christopher Lennan and Zubin John and Dat Tran},
  year={2019},
  howpublished={\url{https://github.com/idealo/imagededup}},
}

๐Ÿ— Maintainers

© Copyright

See LICENSE for details.

Comments
  • Optional parallelization of cosine similarity computation (Issue #95)

    The PR adds parallel_cosine_similarity to find_duplicates which is set to True by default so that it won't break any existing code. I have a really huge dataset and not enough RAM to use multiprocessing so introducing the ability to disable parallel computation of cosine similarity was the only way to use the package.
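
To make the trade-off in this request concrete, the following sketch computes cosine similarity pair by pair in a single process, yielding results one at a time instead of holding a full similarity matrix or spawning worker processes. It is not the package's code: the function names and the toy encodings are assumptions for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_pairs(encodings, threshold):
    """Yield (name_a, name_b, score) one pair at a time, in a single process."""
    names = sorted(encodings)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            score = cosine_similarity(encodings[a], encodings[b])
            if score >= threshold:
                yield a, b, score


# Made-up 3-dimensional encodings; real CNN encodings are much longer.
toy = {"x.jpg": [1.0, 0.0, 0.0], "y.jpg": [2.0, 0.0, 0.0], "z.jpg": [0.0, 1.0, 0.0]}
print(list(similar_pairs(toy, threshold=0.9)))
```

The sequential version is slower on many cores but keeps memory use flat, which is the situation described above.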

    enhancement 
    opened by EduardKononov 8
  • Cannot install imagededup==0.0.1, .. imagededup==0.1.0 because these package versions have conflicting dependencies.

    Not sure if this is a recent issue, or related to my using an M1 Mac. Nevertheless, the tail of a very long traceback is below:

    ERROR: Cannot install imagededup==0.0.1, imagededup==0.0.2, imagededup==0.0.3, imagededup==0.0.4 and imagededup==0.1.0 because these package versions have conflicting dependencies.
    
    The conflict is caused by:
        imagededup 0.1.0 depends on tensorflow==2.0.0
        imagededup 0.0.4 depends on tensorflow==2.0.0
        imagededup 0.0.3 depends on tensorflow==1.13.1
        imagededup 0.0.2 depends on tensorflow==1.13.1
        imagededup 0.0.1 depends on tensorflow==1.13.1
    
    To fix this you could try to:
    1. loosen the range of package versions you've specified
    2. remove package versions to allow pip attempt to solve the dependency conflict
    
    opened by robmarkcole 7
  • Duplicates are not found when comparing two files

    The code below always returns empty duplicates when two files are compared, regardless of whether the PNG files are the same or not. (Python 3.8, OSX Catalina)

    import sys
    import os
    
    from imagededup.methods import PHash
    
    if __name__ == '__main__':
        hasher = PHash()
        image_map = {}
    
        for i in range(1,3):
            if not os.path.exists(sys.argv[i]):
                sys.exit(sys.argv[i] + " not found")
            image_map[sys.argv[i]] = hasher.encode_image(image_file=sys.argv[i])
        
        
        duplicates = hasher.find_duplicates(
            encoding_map=image_map,
            max_distance_threshold=0,
            scores=True)
    
        print(duplicates)
    
    bug duplicate 
    opened by mmertama 7
  • Relax the dependency specifiers

    I can see why you may think you need to lock these down to exact versions, but in general for a library like this it's better to keep them more liberal for several reasons:

    1. Locking gives a false sense of security - it doesn't actually lock the whole tree, so a release of any of the dependencies of these packages could cause a break. You need to use a tool like Pipenv, poetry or pip-tools to lock the entire dependency tree.

    2. This is a library, you need to ensure that it fits in nicely with larger applications that may well have similar dependencies. By locking them to exact versions you are going to cause dependency issues as their specifiers may conflict (i.e Pillow >6.0.0)

    3. It's conceptually wrong - you're saying that this library only works with Pillow ==6.0.0. Not 6.0.1, which could be a bugfix release that fixes a bunch of issues.

    Testing dependencies, like pytest and others, should also not be locked. They are not critical to the application as a whole, and it's fairly obvious if they break. Which they won't.

    opened by orf 6
  • Installation error

    I forked the repository and I worked on the "dev" branch and while running the command:

    python setup.py install
    

    I got this error message:

    Processing dependencies for imagededup==0.1.0
    Searching for tensorflow~=2.0.0
    Reading https://pypi.org/simple/tensorflow/
    No local packages or working download links found for tensorflow~=2.0.0
    
    opened by ShaharNaveh 5
  • Handle multi picture objects (MPO)

    I have a LOT of images (roughly 1/3 of my entire personal library) that register in PIL as MPO. Your code barfs on all of them, but changing image_utils.py line 15 to IMG_FORMATS = ['JPEG', 'PNG', 'BMP', 'MPO'] fixes this.

    I didn't want to submit a PR in case you already know this and it causes a headache somewhere else. If not, please change this.

    enhancement 
    opened by drrelyea 5
  • Unable to execute this project

    I get several errors when running this project. The main one appears after running the "pip install imagededup" command: [image]

    I have already looked at all the other issues, but none of them solves this problem.

    opened by uly94 4
  • Fix/installation

    Running python setup.py install on the dev branch fails.

    Tensorflow and numpy aren't playing well with each other. Leaving tf>1.0 in setup.py with no mention of numpy (i.e., relying on tf to pull in numpy) leads to an error: tf 2.4.1 gets installed along with numpy 1.20.1 (due to the pip resolver algorithm), but tf 2.4.1 needs numpy~=1.19.2. Explicitly requiring numpy<1.20.0 makes the installation work. Additionally, newer versions of numpy (1.20.1), scipy (1.6 onwards) and matplotlib no longer support Python 3.6. The changes proposed in this PR will also work from Python 3.7 onwards.

    opened by tanujjain 4
  • cannot identify image file 'filename.png' 2021-01-17 20:12:33,709: WARNING Invalid image file filename.png:

    Getting this error, cannot identify image file 'filename.png' 2021-01-17 20:12:33,709: WARNING Invalid image file filename.png: in v0.2.4 for .png file

    opened by awsaf49 4
  • Image format of encode_image method

    Hi there, Thanks for your repo. To use encode_image(image_array=img_array) method, must the input numpy image array format be as BGR or RGB (because for example OpenCV default format is BGR, but Pillow is RGB)? Best
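
The question is not answered here, so the order that `encode_image` expects is left open. Whichever it is, converting between BGR and RGB amounts to reversing the channel axis. The sketch below shows this on a plain nested-list image; the helper name `swap_bgr_rgb` is hypothetical.

```python
def swap_bgr_rgb(image):
    """Reverse the channel order of an image given as rows of [c0, c1, c2] pixels."""
    return [[list(reversed(pixel)) for pixel in row] for row in image]


# A 1x2 image in BGR order: one pure-blue pixel and one pure-red pixel.
bgr = [[[255, 0, 0], [0, 0, 255]]]
rgb = swap_bgr_rgb(bgr)
print(rgb)  # [[[0, 0, 255], [255, 0, 0]]]
# For a NumPy array of shape (height, width, 3), the equivalent is image[..., ::-1].
```

Applying the function twice returns the original image, so the same helper converts in both directions.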

    opened by ahkarami 4
  • Fix tests

    Currently we have failing Linux tests because we rely on the order of how images are loaded which varies across OSs. We don't care in which order images are loaded so we shouldn't test for it, that's why I removed the lines from tests/test_hashing.py

    We also have a failing test for macOS Python 3.6 on Azure pipelines which was not reproducible on my MacBook but was fixed by initialising a new CNN object for the failing test.

    opened by clennan 4
  • Supporting image compression?

    Hey!

    I believe most duplicates are created through compression. For example, if I upload an image to a different service, I re-download it. It's usually compressed, and its metadata may have changed. I have many duplicates from various platforms like Google, Facebook, etc.

    I haven't seen anything in the documentation about how this repository handles compression. Will it be able to recognize duplicate images with various levels of compression?

    Thanks!

    opened by PetrochukM 3
  •  Introducing new optional multiprocessing parameters

    WHAT/WHY

    Introduces new optional multiprocessing parameters to several methods:

    • num_enc_workers - Change number of processes to generate encodings (Addresses #156)
    • num_sim_workers/num_dist_workers - Change number of processes to compute similarity/distances (Addresses #95, #113)

    HOW

    APIs impacted

    For both CNN as well as Hashing functions, the following user-facing api calls get new parameters-

    • encode_images
    • find_duplicates
    • find_duplicates_to_remove

    Choice of default values

    • num_enc_workers: For CNN methods, this is set to 0 by default (0=disabled). Furthermore, parallelization of CNN encoding generation is only supported on Linux. This is because PyTorch requires the call to DataLoader to be wrapped in an if __name__ == '__main__' construct to enable multiprocessing on non-Linux platforms (due to the difference between the fork() and spawn() calls used for multiprocessing on different systems). Such a construct does not fit well with the current code structure. For Hashing methods, this parameter is set to the CPU count of the system. The default values preserve backward compatibility.
    • num_dist_workers/num_sim_workers: Set to cpu count of the system. The default values preserve backward compatibility.
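
The `if __name__ == '__main__'` requirement mentioned above can be illustrated with the standard library alone. The sketch below is not the package's code: `fake_encode` and `encode_all` are hypothetical stand-ins, and `num_workers=0` mirrors the "0=disabled" convention described for num_enc_workers.

```python
import multiprocessing


def fake_encode(name):
    """Stand-in for an expensive per-image encoding step (hypothetical)."""
    return len(name)


def encode_all(names, num_workers=0):
    """Encode serially when num_workers is 0, otherwise use a process pool."""
    if num_workers == 0:
        return [fake_encode(n) for n in names]
    with multiprocessing.Pool(processes=num_workers) as pool:
        return pool.map(fake_encode, names)


if __name__ == "__main__":
    # Under spawn, each worker re-imports this module; without the guard,
    # the pool would be created again inside every worker.
    print(encode_all(["a.jpg", "bb.jpg"], num_workers=2))
```

With fork, workers inherit the parent's memory and never re-run the module's top-level code, which is why the guard is only mandatory on platforms that use spawn.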
    opened by tanujjain 0
  • Formatted and Updated README.md - Added Streamlit based WebApp 👨‍💻✅

    Hello @tanujjain / @clennan / @datitran, and https://github.com/idealo ,

    Kudos to you for bringing up imagededup. I worked on developing a simple streamlit based webapp on the same and I think it will be fruitful to have it as a part of README here as the motivation behind developing this came from your work 😄! The entire webapp codebase has also been added as a separate folder while the main README.md of the project has been updated with the same.

    You can find the entire webapp source-code in the stream_app directory of the repo.

    Happy opensourcing!

    Cheers, Prateek

    opened by prateekralhan 7
  • On demand duplicate check during runtime with a 'growing' BKTree

    What I would like to achieve is roughly the following:

    EXISTING_HASHES: set = set()

    def is_duplicate(img_bytes: bytes) -> bool:
        # get_hash is a placeholder for whichever hashing function is used
        return get_hash(img_bytes) in EXISTING_HASHES

    def main():
        # get_new_image and file are placeholders as well
        image_bytes = get_new_image()
        if is_duplicate(image_bytes):
            return

        with open(file, 'wb') as f:
            f.write(image_bytes)
            
    
    opened by sla-te 0
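
For the on-demand check requested in the issue above, a set only catches identical hashes. A BK-tree that grows as new hashes arrive also catches near matches. The sketch below is a minimal, hypothetical version over integer hashes with Hamming distance; it is not imagededup's BKTree class, and the hash values are made up.

```python
class BKTree:
    """Minimal BK-tree over integer hashes, using Hamming distance."""

    def __init__(self):
        self.root = None  # each node is [hash_value, {distance: child_node}]

    @staticmethod
    def distance(a, b):
        return bin(a ^ b).count("1")

    def add(self, value):
        if self.root is None:
            self.root = [value, {}]
            return
        node = self.root
        while True:
            d = self.distance(value, node[0])
            if d == 0:
                return  # already stored
            if d not in node[1]:
                node[1][d] = [value, {}]
                return
            node = node[1][d]

    def has_near(self, value, max_distance):
        """True if any stored hash lies within max_distance of value."""
        stack = [self.root] if self.root is not None else []
        while stack:
            node_value, children = stack.pop()
            d = self.distance(value, node_value)
            if d <= max_distance:
                return True
            # Triangle inequality: only children at edge distance in
            # [d - max_distance, d + max_distance] can contain a match.
            stack.extend(c for k, c in children.items()
                         if d - max_distance <= k <= d + max_distance)
        return False


tree = BKTree()
for h in (0b10110000, 0b00001111):
    tree.add(h)
print(tree.has_near(0b10110001, max_distance=1))
print(tree.has_near(0b01010101, max_distance=1))
```

A runtime check would call `has_near` on each incoming hash and `add` it only when no near match is found.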
  • Error in the case of too many images

    Hello,

    I ran into a problem when I tried to find duplicate images in a dataset of 40000 images. [images]

    I have already tried the solution from https://github.com/idealo/imagededup/issues/95, but it didn't work.

    Do you know how to fix it? Thank you in advance.

    Best regards

    opened by hoangkhoiLE 0
Releases(v0.3.0)
  • v0.3.0(Oct 15, 2022)

    Installation fix

    • Make package installable by removing tensorflow as a dependency and replacing it with pytorch #173
    • Drop support for python 3.6 and python 3.7 #173

    ✨ New features and improvements

    • Use MobileNetv3 for generating CNN encodings #173
    • Introduce a 'recursive' option to generate encodings for images organized in a nested directory structure #104

    Breaking changes

    • Size of CNN encodings is 576 instead of 1024 #173
    • Since CNN encodings are generated using a different network, the robustness might be different; user might need to change similarity threshold settings #173
    • Hashes (all types) may be different from previous versions for a given image #173
  • v0.2.4(Nov 23, 2020)

    🔴 Bug fixes

    • Fix broken cython brute force in Python 3.8 #117
    • Close figure after plotting to avoid figure overwrite #111
    • Allow encode_image method of cnn to accept 2d arrays #110
    • Relax dependencies and update packages #116, #107, #102, #119
  • v0.2.2(Dec 11, 2019)

    ✨ New features and improvements

    • Switched to list comprehensions instead of slower explicit for loops that call append in every iteration. #76
    • Used sets for membership tests
    • Used broadcasting instead of explicit for loops
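
The first two changes in this list can be pictured with a small, self-contained example. It only illustrates the general idioms on made-up filenames and does not reproduce the package's code.

```python
names = [f"img_{i}.jpg" for i in range(1000)]
wanted = ["img_3.jpg", "img_999.jpg", "missing.jpg"]

# Explicit loop with append, and membership tests against a list (linear scans).
found_loop = []
for w in wanted:
    if w in names:
        found_loop.append(w)

# List comprehension, with membership tests against a set (hash lookups).
name_set = set(names)
found_comp = [w for w in wanted if w in name_set]

print(found_comp)
```

Both variants produce the same result; the second avoids the repeated append calls and the linear scan for every lookup.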
  • v0.2.1(Nov 3, 2019)

  • v0.2.0(Oct 30, 2019)

    ✨ New features and improvements

    • Added a Cython implementation of brute-force search. This is now the default search_method on Linux and MacOS X. On Windows, bktree remains the default, as we are not sure that popcnt is supported #56
    • Expand supported image formats. Now it also supports: 'MPO', 'PPM', 'TIFF', 'GIF', 'SVG', 'PGM', 'PBM' #35

    🔴 Bug fixes

    • Relaxing the package dependencies #36
    • Removal of print statements #39
    • Fix type error when saving scores #55 & #61

    👥 Contributors

    Thanks to @jonatron, @orf, @DannyFeliz, @ImportTaste, @fridzema, @iozevo, @MomIsBestFriend, @YadunandanH for the pull requests and contributions.

  • v0.1.0(Oct 8, 2019)

    This is the first release of imagededup.

    We added:

    • 🧮 Several hashing algorithms (PHash, DHash, WHash, AHash) and convolutional neural networks
    • 🔎 An evaluation framework to judge the quality of deduplication
    • 🖼 Easy plotting functionality of duplicates
    • ⚙️ Simple API

Owner
idealo
idealo's technology org page, Germany's largest price comparison service. Visit us at https://idealo.github.io/.