Python script for finding duplicate images within a folder.

Overview

Duplicate Image Finder (DIF)


Tired of going through all images in a folder and comparing them manually to check if they are duplicates? The Duplicate Image Finder (DIF) for Python automates this task for you!


Description

The DIF searches for images in a specified target folder, compares them, and checks whether they are duplicates. It then outputs the image files classified as duplicates along with the filenames of the lower-resolution versions, so you know which duplicates are safe to delete. You can either delete them manually or let the DIF delete them for you.

Basic Usage

Use the following function to make DIF search for duplicates in the specified folder:

from difPy import compare_images
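
The exact entry point depends on the difPy version; both forms below appear in the issues and tracebacks later on this page, so treat this as a version-dependent sketch rather than the definitive API:

    # older versions expose a compare_images function
    from difPy import compare_images
    compare_images("C:/Path/to/Folder/")

    # newer versions use the dif class instead
    from difPy import dif
    search = dif("C:/Path/to/Folder/")
    print(search.result)         # dict mapping each image to its duplicates
    print(search.lower_quality)  # lower-resolution duplicates, candidates for deletion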

Comments
  • run the CLI, how?

    Hello,

    call me stupid, but I'm trying to run the CLI version of this code. I can run it from a basic script:

        from difPy import dif
        search = dif("C:/Path/to/Folder/")

    and this works. But if I run it as:

        python dif.py -A "C:/Path/to/Folder_A/"

    I get a "no such file or directory" error.

    And yes, I'm not very familiar with Python (yet).

    Kind Regards,

    Gerrit Kuilder
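
    For context on the error above: python dif.py -A ... only works if dif.py sits in the current directory. With a pip install, the script lives inside site-packages (the traceback in the ValueError issue below shows such a path), so one hedged workaround is to call it by its full installed path; the path below is only an example and will differ per environment:

        python "C:/Users/<user>/.conda/envs/<env>/Lib/site-packages/difPy/dif.py" -A "C:/Path/to/Folder_A/"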

    question 
    opened by GerritKuilder 4
  • Search results' keys are just names, but sometimes in sub-folders

    Hi there! I have a folder like this:

    folder/
    | - IMG_202201.jpg
    | - IMG_202202.jpg
    | - subfolder/
    |  | - IMG_202203.jpg
    

    and I use it as the first argument.

    I noticed that difPy.dif() search results give me just the file name, with the subfolder not noted anywhere :neutral_face:

    This broke my script with FileNotFoundError: [Errno 2] No such file or directory.
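
    Until the result keys carry their subfolder, one workaround is to resolve each bare filename against the root folder before touching the file. A minimal sketch, assuming filenames are unique across subfolders (which is exactly the fragile assumption this issue exposes):

        import os

        def resolve_paths(root):
            # map bare filenames to their full paths by walking root recursively
            paths = {}
            for dirpath, _subdirs, filenames in os.walk(root):
                for name in filenames:
                    paths[name] = os.path.join(dirpath, name)
            return paths

        paths = resolve_paths("folder/")
        print(paths["IMG_202203.jpg"])  # folder/subfolder/IMG_202203.jpg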

    bug 
    opened by TheLastGimbus 4
  • PNGs with transparency are mistakenly counted as duplicate and not rendered properly in GUI compare

    Great tool! I learned a lot reading the article you wrote about this as well.

    I tested it on some of my files, but found that some PNGs that were just line-art (black line-art on a transparent background) were flagged as duplicates when they were completely different, even on high sensitivity. In fact, the listed MSE is 0.00.

    They also did not render properly during the image comparison when running with -d False, with both image previews looking like black squares. Note: this does not apply to line-art of a different color on a transparent background, only black.

    I am not familiar with how the PNG file format encodes black vs transparent, but I believe that the issue stems from that.

    [Screenshot attached]

    question 
    opened by SPRCoreDump 4
  • ValueError.

    Hi there,

    I'm trying to run this code on a folder with more than 80k images:

    Traceback (most recent call last):
      File ".\difpy.py", line 3, in <module>
        dif.compare_images("PATH TO FOLDER")
      File "C:\Users\user\.conda\envs\gan\lib\site-packages\difPy\dif.py", line 35, in compare_images
        imgs_matrix = dif.create_imgs_matrix(directory, px_size)
      File "C:\Users\user\.conda\envs\gan\lib\site-packages\difPy\dif.py", line 121, in create_imgs_matrix
        imgs_matrix = np.concatenate((imgs_matrix, img))
      File "<__array_function__ internals>", line 6, in concatenate
    ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 3 dimension(s) and the array at index 1 has 2 dimension(s)
    

    What am I doing wrong?

    Thanks in advance
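
    For context: cv2.imdecode returns a 2-D array for grayscale images but a 3-D array for color images, and np.concatenate fails where the two shapes meet, which matches this traceback. A minimal normalization sketch (the v2.4.4 release notes below mention a fix for black-and-white decoding along these lines):

        import numpy as np

        def to_three_channels(img):
            # stack a 2-D grayscale array into three identical channels so shapes match
            if img.ndim == 2:
                img = np.stack([img] * 3, axis=-1)
            return img[..., :3]  # also drops an alpha channel if present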

    bug 
    opened by rqtqp 4
  • Same duplicate in different keys

    We have found that when you use dif on a folder of folders, there may be some unexpected behaviour. In our case, we have a pair of duplicates in one folder and a third duplicate in another. As a result, the output looks like:

    [Screenshot of the output dictionary]

    So an element that was detected as a duplicate is later used as a key. We do not know if this is a bug or a feature, but it seems inconsistent with the behavior of not repeating duplicates in later keys. Still, for our use we can just use a set() as a workaround to ignore "duplicates of duplicates".

    Nice work on the tool, it has helped us a lot with a nasty database. Thank you, have a nice day!
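
    The set() workaround mentioned above, sketched against the result layout visible in the source code pasted in a later issue (search is assumed to be a finished dif(...) run):

        # flatten every "duplicates" list and drop repeats across keys
        unique_duplicates = set()
        for entry in search.result.values():
            unique_duplicates.update(entry["duplicates"])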

    bug 
    opened by Fenho 3
  • Erroneous results on particular image set

    I've been testing various image sets trying to isolate a bug, and I got weird results on this one. There are no duplicates or similar images in this set, and similarity was set to "high". For example, the first result detected 32 duplicates, with many of the files listed more than once.

    difPy output.zip

    The image set can be downloaded here, since it's too big to post: https://drive.google.com/file/d/1pbl7SttHF-mB35V1Q5ehj6A5wCb68o3B/view?usp=sharing

    bug 
    opened by MarcG2 3
  • Match Single Image with Read-Only Directory

    Dear Developer,

    I am a noob but still love programming (I have just started), so excuse me if anything below is "obvious" or "incorrectly stated".

    I got the gist that this will match all files in the given directory for similarity.

    First point: is it possible to match a single image (file path passed as a parameter) against a directory (folder path passed as a parameter)? That is, instead of matching all images against all images, we could match just one image against all images of a folder.

    Second point: does the function write anything into the search folder (like tensor data)? I am asking to understand whether this can work on a read-only directory. (I tried reading the code but could not figure it out.)

    Third point: if we have to run it multiple times on a large folder, would it take a long time analyzing all files each time, or is it possible to pass a path to a file/folder where it can save the analysis to save time?

    Example:

        Input_file_path = "~/Downloads/image.jpg"  # any valid image file
        Target_Folder_path = "~/A_Readonly_Folder_of_Images"  # a read-only folder with, say, 56,000 files to search
        Working_File_or_Folder_path = "~/A_File_or_Folder_with_Read_Write_Access"
        # a write-enabled file/folder for the analysis data: if it does not exist,
        # create it and save the analysis there; if it does, read it and use it
        # instead of analyzing the target folder again

        dif.compare_image(Input_file_path, Target_Folder_path, Working_File_or_Folder_path)

    Please excuse me if I am crossing any limits here. I just became curious about this wonderful concept, but I know nothing about GitHub and how it works.

    Best Regards Ashish
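
    None of the three points exist in difPy as described. A rough sketch of the first one, reusing the package's resize-and-MSE idea with Pillow; every name here is hypothetical, and since nothing is written anywhere, it also answers the second point (a read-only folder is fine):

        import os
        import numpy as np
        from PIL import Image

        def match_one_against_folder(image_path, folder, px_size=50, threshold=200):
            # compare one image against every image in a (possibly read-only) folder
            def load(path):
                img = Image.open(path).convert("RGB").resize((px_size, px_size))
                return np.asarray(img, dtype=float)
            ref = load(image_path)
            matches = []
            for name in os.listdir(folder):
                path = os.path.join(folder, name)
                try:
                    candidate = load(path)
                except (OSError, ValueError):  # skip subfolders and non-image files
                    continue
                err = np.sum((ref - candidate) ** 2) / (px_size * px_size)
                if err < threshold:
                    matches.append(path)
            return matches

    For the third point, the resized arrays could be saved with np.save into the writable working folder and reloaded on later runs instead of re-reading the 56,000 images.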

    question 
    opened by ashish128 3
  • [CHANGE REQUEST] replacing 'output directory' with 'move_path'

    Hello! First of all, I would like to thank you for creating and maintaining this project. It has certainly helped me find a bunch of duplicate images in my enormous gallery.

    I discovered this project 3-4 months ago. I needed a way for difPy to move my duplicate images to certain directories, but that was not possible, so I edited the source code, which was really easy even with little to no prior Python experience.

    As I recently wanted to make a pull request, I noticed that this repository had been updated, which meant that I had to update my version as well. Along with the updates, I noticed a new output_directory flag, which is only useful when using the program through the command line. I made my changes and would like to introduce my implementation.

    Instead of the (now present) output_directory flag, I added move, silent_move and move_path as parameters to the __init__ function. Here are the details:

    • Their default values are (of course) False
    • move and silent_move are passed on to the _validate_parameters() function
    • After processing directory_A and directory_B, if move is set to True, move_path is validated (checked against directory_A and/or directory_B) and passed on to the _process_directory() function
    • An appropriate prompt for the silent_move parameter
    • In the _validate_parameters() function, move and delete cannot both be True, and move and silent_move accept only boolean values
    • A _move_imgs() function, similar to _delete_imgs(), with appropriate behavior
    • -m, --move, -M, --silent_move, -mp and --move-path CLI flags

    The currently implemented output_directory flag only works for the CLI, not for Python scripts, as it is not passed on to the __init__ function. As a result, I have removed the output_directory flag and replaced it with my move implementation. This version takes both the command line and scripts into account.

    I would be happy to submit a pull request with my changes if this idea sounds good to you, so you can take a better look at how they would be implemented.

    Looking forward to collaborating and contributing to this project as much as I can.

    new feature out of scope 
    opened by bojanmilevski 2
  • Near duplicate Image detection

    Hello, first of all thanks for creating this package; it is really good for detecting duplicate images. I have tried it and found that it can detect images with 100% similarity, but not images whose similarity is below 100%: even at 99.99% similarity it fails to detect them. I tried playing with the pixel and similarity values, but it still could not detect them. So, is there a way to detect images with a similarity score of less than 100% using the difPy package?

    I have attached a few images which it was not able to detect. Note: the percentage values I have referred to several times come from the matchTemplate method; the attached images have 99% similarity.

    [Attached images: TOI_Delhi_12-07-2022_4_1, TOI_Delhi_12-07-2022_4_2, TOI_Delhi_12-07-2022_4_3, TOI_Delhi_12-07-2022_4_7, TOI_Delhi_12-07-2022_4_8]
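
    For reference, the MSE thresholds behind the similarity levels are visible in the _map_similarity source pasted in a later issue: "high" < 0.1, "normal" < 200, "low" < 1000. Since difPy compares raw resized pixel grids rather than image features, a 99% template-match similarity can still land above even the "low" threshold. A sketch for checking where a given pair actually lands (filenames are placeholders):

        import numpy as np
        from PIL import Image

        def mse(path_a, path_b, px_size=50):
            # the same measure difPy uses: MSE over resized RGB pixels
            a = np.asarray(Image.open(path_a).convert("RGB").resize((px_size, px_size)), dtype=float)
            b = np.asarray(Image.open(path_b).convert("RGB").resize((px_size, px_size)), dtype=float)
            return np.sum((a - b) ** 2) / (px_size * px_size)

        print(mse("TOI_Delhi_12-07-2022_4_1.jpg", "TOI_Delhi_12-07-2022_4_2.jpg"))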

    question 
    opened by dhruvbhatnagar9548 2
  • search in Sub directories

    Hi Elise!

    Thank you for existing!

    My OneDrive duplicated my library about 4 years ago; that, plus countless backups from WhatsApp and Messenger, left a 550 GB mess. Yeah, you get the point.

    I'm really new to coding and git, so I figured I'd post code instead. It's not clean, but I'm pressed for time, studying applied data science and working as a product manager.

    I have a few more ideas, but the code below was necessary for me right now :)

    The code finds photos in all subdirectories (folders within folders) of the given file paths. The code I added is marked between the comments "#added by Kristofer from" and "#added by Kristofer to".

    import skimage.color
    import matplotlib.pyplot as plt
    import numpy as np
    import cv2
    import os
    import imghdr
    import time
    import collections
    # added by Kristofer
    from pathlib import Path

    class dif:

    def __init__(self, directory_A, directory_B = None, similarity="normal", px_size=50, sort_output=False, show_output=False, delete=False, silent_del=False):
        """
        directory_A (str)......folder path to search for duplicate/similar images
        directory_B (str)....second folder path to search for duplicate/similar images
        similarity (str)....."normal" = searches for duplicates, recommended setting, MSE < 200
                             "high" = serached for exact duplicates, extremly sensitive to details, MSE < 0.1
                             "low" = searches for similar images, MSE < 1000
        px_size (int)........recommended not to change default value
                             resize images to px_size height x width (in pixels) before being compared
                             the higher the pixel size, the more computational resources and time required 
        sort_output (bool)...False = adds the duplicate images to output dictionary in the order they were found
                             True = sorts the duplicate images in the output dictionary alphabetically 
        show_output (bool)...False = omits the output and doesn't show found images
                             True = shows duplicate/similar images found in output            
        delete (bool)........! please use with care, as this cannot be undone
                             lower resolution duplicate images that were found are automatically deleted
        silent_del (bool)....! please use with care, as this cannot be undone
                             True = skips the asking for user confirmation when deleting lower resolution duplicate images
                             will only work if "delete" AND "silent_del" are both == True
        
        OUTPUT (set).........a dictionary with the filename of the duplicate images 
                             and a set of lower resolution images of all duplicates
        """
        start_time = time.time()
    
       
        if directory_B != None:
            # process both directories
            dif._process_directory(directory_A)
            dif._process_directory(directory_B)
        else:
            # process one directory
            dif._process_directory(directory_A)
            directory_B = directory_A
    
        all_directories_A = [directory_A]
        all_directories_B = [directory_B]
    
        #added by Kristofer from
        for path in Path(directory_A).iterdir():
            if path.is_dir():
                all_directories_A.append(path)
    
        for path in Path(directory_B).iterdir():
            if path.is_dir():
                all_directories_B.append(path)
        
        dif._validate_parameters(sort_output, show_output, similarity, px_size, delete, silent_del)
    
        for dif_A in all_directories_A:
            for dif_B in all_directories_B:
    
                directory_A = str(dif_A)
                directory_B = str(dif_B)
        #added by Kristofer to                    
                       
                if directory_B == directory_A:
                    result, lower_quality = dif._search_one_dir(directory_A, 
                                                                    similarity, px_size, sort_output, show_output, delete)
                else:
                    result, lower_quality = dif._search_two_dirs(directory_A, directory_B, 
                                                                    similarity, px_size, sort_output, show_output, delete)
                    if len(lower_quality) != len(set(lower_quality)):
                        print("DifPy found that there are duplicates within directory A.")
                        
                if sort_output == True:
                    result = collections.OrderedDict(sorted(result.items()))
                
                time_elapsed = np.round(time.time() - start_time, 4)
                
                self.result = result
                self.lower_quality = lower_quality
                self.time_elapsed = time_elapsed
                
                if len(result) == 1:
                    images = "image"
                else:
                    images = "images"
                print("Found", len(result), images, "with one or more duplicate/similar images in", time_elapsed, "seconds.")
                
                if len(result) != 0:
                    if delete:
                        if not silent_del:
                            usr = input("Are you sure you want to delete all lower resolution duplicate images? \nThis cannot be undone. (y/n)")
                            if str(usr) == "y":
                                dif._delete_imgs(set(lower_quality))
                            else:
                                print("Image deletion canceled.")
                        else:
                            dif._delete_imgs(set(lower_quality))
    
                    
            
    def _search_one_dir(directory_A, similarity="normal", px_size=50, sort_output=False, show_output=False, delete=False):
        
        img_matrices_A, filenames_A = dif._create_imgs_matrix(directory_A, px_size)
        result = {}
        lower_quality = []   
        
        ref = dif._map_similarity(similarity)
        
        # find duplicates/similar images within one folder
        for count_A, imageMatrix_A in enumerate(img_matrices_A):
            for count_B, imageMatrix_B in enumerate(img_matrices_A):
                if count_B != 0 and count_B > count_A and count_A != len(img_matrices_A):      
                    rotations = 0
                    while rotations <= 3:
                        if rotations != 0:
                            imageMatrix_B = dif._rotate_img(imageMatrix_B)
    
                        err = dif._mse(imageMatrix_A, imageMatrix_B)
                        if err < ref:
                            if show_output:
                                dif._show_img_figs(imageMatrix_A, imageMatrix_B, err)
                                dif._show_file_info(str("..." + directory_A[-35:]) + "/" + filenames_A[count_A], 
                                                   str("..." + directory_A[-35:]) + "/" + filenames_A[count_B])
                            if filenames_A[count_A] in result.keys():
                                result[filenames_A[count_A]]["duplicates"] = result[filenames_A[count_A]]["duplicates"] + [directory_A + "/" + filenames_A[count_B]]
                            else:
                                result[filenames_A[count_A]] = {"location" : directory_A + "/" + filenames_A[count_A],
                                                                    "duplicates" : [directory_A + "/" + filenames_A[count_B]]
                                                                   }
                            high, low = dif._check_img_quality(directory_A, directory_A, filenames_A[count_A], filenames_A[count_B])
                            lower_quality.append(low)                         
                            break
                        else:
                            rotations += 1    
        if sort_output == True:
            result = collections.OrderedDict(sorted(result.items()))
        return result, lower_quality            
    
    def _search_two_dirs(directory_A, directory_B = None, similarity="normal", px_size=50, sort_output=False, show_output=False, delete=False):
    
        img_matrices_A, filenames_A = dif._create_imgs_matrix(directory_A, px_size)
        img_matrices_B, filenames_B = dif._create_imgs_matrix(directory_B, px_size)
        
        result = {}
        lower_quality = []   
        
        ref = dif._map_similarity(similarity)
            
        # find duplicates/similar images between two folders
        for count_A, imageMatrix_A in enumerate(img_matrices_A):
            for count_B, imageMatrix_B in enumerate(img_matrices_B):
                rotations = 0
                #print(count_A, count_B)
                while rotations <= 3:
    
                    if rotations != 0:
                        imageMatrix_B = dif._rotate_img(imageMatrix_B)
                        
                    err = dif._mse(imageMatrix_A, imageMatrix_B)
                    #print(err)
                    if err < ref:
                        if show_output:
                            dif._show_img_figs(imageMatrix_A, imageMatrix_B, err)
                            dif._show_file_info(str("..." + directory_A[-35:]) + "/" + filenames_A[count_A], 
                                               str("..." + directory_B[-35:]) + "/" + filenames_B[count_B])
                        
                        if filenames_A[count_A] in result.keys():
                            result[filenames_A[count_A]]["duplicates"] = result[filenames_A[count_A]]["duplicates"] + [directory_B + "/" + filenames_B[count_B]]
                        else:
                            result[filenames_A[count_A]] = {"location" : directory_A + "/" + filenames_A[count_A],
                                                                "duplicates" : [directory_B + "/" + filenames_B[count_B]]
                                                               }
                        high, low = dif._check_img_quality(directory_A, directory_B, filenames_A[count_A], filenames_B[count_B])
                        lower_quality.append(low)                         
                        break
                    else:
                        rotations += 1    
                
        if sort_output == True:
            result = collections.OrderedDict(sorted(result.items()))
        return result, lower_quality
    
    def _process_directory(directory):
        # check if directories are valid
        directory += os.sep
        if not os.path.isdir(directory):
            raise FileNotFoundError("Directory " + directory + " does not exist")
        return directory
    
    def _validate_parameters(sort_output, show_output, similarity, px_size, delete, silent_del):
        # validate the parameters of the function
        if sort_output != True and sort_output != False:
            raise ValueError('Invalid value for "sort_output" parameter.')
        if show_output != True and show_output != False:
            raise ValueError('Invalid value for "show_output" parameter.')
        if similarity not in ["low", "normal", "high"]:
            raise ValueError('Invalid value for "similarity" parameter.')
        if px_size < 10 or px_size > 5000:
            raise ValueError('Invalid value for "px_size" parameter.')
        if delete != True and delete != False:
            raise ValueError('Invalid value for "delete" parameter.')   
        if silent_del != True and silent_del != False:
            raise ValueError('Invalid value for "silent_del" parameter.')   
    
    def _create_imgs_matrix(directory, px_size):
        directory = dif._process_directory(directory)
        img_filenames = []
        # create list of all files in directory     
        folder_files = [filename for filename in os.listdir(directory)]
    
        # create images matrix   
        imgs_matrix = []
        for filename in folder_files:
            path = os.path.join(directory, filename)
            # check if the file is not a folder
            if not os.path.isdir(path):
                try:
                    img = cv2.imdecode(np.fromfile(path, dtype=np.uint8), cv2.IMREAD_UNCHANGED)
                    if type(img) == np.ndarray:
                        img = img[..., 0:3]
                        img = cv2.resize(img, dsize=(px_size, px_size), interpolation=cv2.INTER_CUBIC)
                        
                        if len(img.shape) == 2:
                            img = skimage.color.gray2rgb(img)
                        imgs_matrix.append(img)
                        img_filenames.append(filename)
                except Exception:
                    pass  # skip files that cannot be decoded as images
        return imgs_matrix, img_filenames
    
    def _map_similarity(similarity):
        if similarity == "low":
            ref = 1000
        # search for exact duplicate images, extremely sensitive, MSE < 0.1
        elif similarity == "high":
            ref = 0.1
        # normal, search for duplicates, recommended, MSE < 200
        else:
            ref = 200
        return ref
    
    # Function that calculates the mean squared error (mse) between two image matrices
    def _mse(imageA, imageB):
        err = np.sum((imageA.astype("float") - imageB.astype("float")) ** 2)
        err /= float(imageA.shape[0] * imageA.shape[1])
        return err
    
    # Function that plots two compared image files and their mse
    def _show_img_figs(imageA, imageB, err):
        fig = plt.figure()
        plt.suptitle("MSE: %.2f" % (err))
        # plot first image
        ax = fig.add_subplot(1, 2, 1)
        plt.imshow(imageA, cmap=plt.cm.gray)
        plt.axis("off")
        # plot second image
        ax = fig.add_subplot(1, 2, 2)
        plt.imshow(imageB, cmap=plt.cm.gray)
        plt.axis("off")
        # show the images
        plt.show()
        
    # Function for printing filename info of plotted image files
    def _show_file_info(imageA, imageB):
        print("""Duplicate files:\n{} and \n{}
        
        """.format(imageA, imageB))
        
    # Function for rotating an image matrix by a 90 degree angle
    def _rotate_img(image):
        image = np.rot90(image, k=1, axes=(0, 1))
        return image
    
    # Function for checking the quality of compared images, appends the lower quality image to the list
    def _check_img_quality(directoryA, directoryB, imageA, imageB):
        dirA = dif._process_directory(directoryA)
        dirB = dif._process_directory(directoryB)
        size_imgA = os.stat(dirA + imageA).st_size
        size_imgB = os.stat(dirB + imageB).st_size
        if size_imgA >= size_imgB:
            return directoryA + "/" + imageA, directoryB + "/" + imageB
        else:
            return directoryB + "/" + imageB, directoryA + "/" + imageA
        
    # Function for deleting the lower quality images that were found after the search    
    def _delete_imgs(lower_quality_set):
        deleted = 0
        for file in lower_quality_set:
            print("\nDeletion in progress...", end = "\r")
            try:
                os.remove(file)
                print("Deleted file:", file, end = "\r")
                deleted += 1
            except:
                print("Could not delete file:", file, end = "\r")
        print("\n***\nDeleted", deleted, "images.")
    


    new feature 
    opened by DeyoSwed 2
  • Local variable 'imgs_matrix' referenced before assignment

    Hello,

    I get this error while trying to run this simple line from your package (the import works). Some help would be very welcome.

    UnboundLocalError: local variable 'imgs_matrix' referenced before assignment

    [Screenshot attached]
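
    A likely cause, judging from the older difPy source shown in the ValueError issue above: imgs_matrix is only assigned once the first image in the folder decodes successfully, so a folder containing no decodable images (or a mistyped path) leaves it unassigned. A quick pre-check sketch (the path is a placeholder):

        import os

        folder = "C:/Path/to/Folder/"  # placeholder: the folder passed to difPy
        candidates = [f for f in os.listdir(folder)
                      if f.lower().endswith((".png", ".jpg", ".jpeg", ".bmp"))]
        print(len(candidates), "candidate image files found")  # 0 would explain the error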

    bug 
    opened by Tesax123 2
  • Refactoring - Optional Merge

    Hi Elise :wave:

    first of all, cool idea! I recently needed to compare large chunks of images, and your approach for comparing them worked pretty well :+1:

    That being said, the current implementation is rather slow: comparing larger sets of images (15,000+) takes a while. Moreover, you use a lot of dependencies, some of which are quite large (e.g. opencv). This makes it difficult to install the tool in specific environments, such as inside a Docker container.

    Since I will probably need to compare images again in the future, I worked on improving these issues, and this pull request provides the results. Before talking about the changes, let me apologize for the huge pull request. I do not like large pull requests to my own repos and avoid sending them to other people as well. However, the dependency changes and especially the multiprocessing required a larger restructuring of your tool, so I totally understand if you do not want to merge the changes. In that case, I'm fine with maintaining a fork of your repository that provides an alternative implementation. Just decide as you like :)

    Here is a brief summary of the changes I made:

    1. Make a clearer cut between CLI and library. The CLI script is now contained in /bin/difpy, while the code in /difPy/difPy.py only contains the library implementation.
    2. Reduce dependencies. The whole technique you describe can be implemented using only numpy and Pillow. This makes it possible to build a Docker container running difPy that is only 161 MB; before, with opencv, we were around 1.2 GB.
    3. Add multiprocessing. Work can now be distributed between different cores, which should speed up the operation quite a bit for larger image sets.
    4. Add a fast compare option. When image A is similar to image B, one probably does not want to compare B to other images, but is fine with only comparing A with the others from here on (sketched after this list). Sure, this may miss some edge-case duplicates, but in most situations it should be fine and provides a huge speedup.
    5. Change the command line layout. It feels more intuitive now (at least to me :D)
    6. Change the output format. The output format is still JSON based, but no longer includes much statistical information. The regular end user is probably not that interested in when a comparison took place, but more in the actual comparison result. The new reduced output format should be easier to read and parse.
    7. Add a Dockerfile for building a container running difpy.
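
    The fast-compare option from point 4, sketched in isolation; mse here stands for any pairwise error function, such as the one in the difPy source pasted in another issue, and threshold for the usual similarity cut-off:

        def fast_compare(matrices, threshold, mse):
            # group duplicates, never reusing an already-claimed image as an anchor
            matched, groups = set(), {}
            for i, anchor in enumerate(matrices):
                if i in matched:
                    continue
                for j in range(i + 1, len(matrices)):
                    if j not in matched and mse(anchor, matrices[j]) < threshold:
                        matched.add(j)
                        groups.setdefault(i, []).append(j)
            return groups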

    As I said, many changes. Just think about whether you want to merge or whether we keep these changes in a separate fork. I'm fine with both approaches :wink:

    Best Tobias

    new feature 
    opened by qtc-de 1
  • Multi-processing

    I am currently working on making this project multithreaded, as I have many folders with tens of thousands of images (perhaps 100k+) and want a faster option.

    Opening this as a means of communication. If you have a discord account/email that would work better, as I will likely see that before a github issue comment.
    My discord account is thecodingchicken#4835 if you would prefer to reach out there.

    new feature 
    opened by thecodingchicken 3
  • Multi-threading

    Hi! I have a nice AMD CPU with 8 cores, and when I'm searching through 2 big folders it takes a lot of time because only one core is being used.

    Dividing the work into multiple threads seems like an obvious task for this library; it would be awesome if you implemented it (or suggested how it could be done, so someone can open a pull request)!
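
    A minimal sketch of how the pairwise comparisons could be spread across cores with the standard library, assuming the image matrices are already loaded and equally sized (difPy itself did not do this at the time of these issues):

        from itertools import combinations
        from multiprocessing import Pool

        import numpy as np

        def mse_pair(pair):
            # mean squared error for one indexed pair of image matrices
            (i, a), (j, b) = pair
            err = np.sum((a.astype(float) - b.astype(float)) ** 2) / (a.shape[0] * a.shape[1])
            return i, j, err

        def find_duplicates(matrices, threshold=200, workers=8):
            pairs = combinations(enumerate(matrices), 2)
            with Pool(workers) as pool:  # on Windows, call under `if __name__ == "__main__":`
                return [(i, j)
                        for i, j, err in pool.imap_unordered(mse_pair, pairs, chunksize=256)
                        if err < threshold]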

    new feature 
    opened by TheLastGimbus 1
  • feature request: chunking of source folder

    Thank you for your library! Just a heads up that I edited one of your previous versions by adding a parameter that allows the source folder to be split into n chunks for processing. Scenario: I have image folders that contain over 50,000 images collected over time.

    For me, an image file is most likely to be a duplicate of other images added around the same time. Comparing each image against the entire 50,000+ took an enormous amount of time, so I made it possible to split the folder into chunks of 5,000 (for example) and evaluate in sections. It also allowed me to restart from a position if I had to stop the evaluation for some reason. I added a little more to make it robust: for example, chunk n+1 also includes some files from the previous chunk, so there is a degree of overlap. Anyway, this worked out well for me, and if you are still adding to this library, I found it very useful.

    The route I took is not as robust as going through every image each time, but in my personal tests the performance was close enough and the time savings were significant! Cheers,
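
    A sketch of the chunking idea, including the overlap between consecutive chunks described above (parameter values are illustrative):

        import os

        def chunk_files(filenames, chunk_size=5000, overlap=500):
            # split a chronologically sorted file list into overlapping chunks;
            # chunk_size must be greater than overlap to make progress
            chunks, start = [], 0
            while start < len(filenames):
                chunks.append(filenames[start:start + chunk_size])
                start += chunk_size - overlap
            return chunks

        # files added around the same time land in the same chunk when names sort by date
        for chunk in chunk_files(sorted(os.listdir("folder/"))):  # "folder/" is a placeholder
            ...  # run the duplicate search on this chunk only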

    new feature 
    opened by ALCarter2 1
Releases (v2.4.5)
  • v2.4.5(Jan 1, 2023)

    Major updates and bug fixes:

    • Fixed issue #42 where duplicate files in subfolders would be added twice to the search.result output dictionary
    • @stberg-os implemented the feature to disable recursive search: search within subfolders can now be turned off
    • Various other minor code updates

    Full Changelog: https://github.com/elisemercury/Duplicate-Image-Finder/compare/v2.4.4...v2.4.5

    Source code(tar.gz)
    Source code(zip)
  • v2.4.4(Aug 25, 2022)

    Major code improvements & fixes

    • Fixed issue #37 where black and white images would not be correctly decoded.
    • Fixed issue where command line parameter -s / -similarity would not accept integers as input
    • Various other fixes in the code

    Full Changelog: https://github.com/elisemercury/Duplicate-Image-Finder/compare/v2.4.3...v2.4.4

    Source code(tar.gz)
    Source code(zip)
  • v2.4.3(Aug 24, 2022)

    Please update to a higher version as a major issue was found in v2.4.3.

    Major bug fix

    • Fixed issue #37 which caused difPy's output to be inaccurate.

    Full Changelog: https://github.com/elisemercury/Duplicate-Image-Finder/compare/v2.4.2...v2.4.3

    Source code(tar.gz)
    Source code(zip)
  • v2.4.2(Aug 21, 2022)

    Please update to a higher version as a major issue was found in v2.4.2.

    Bug fixes & minor code improvements

    • Fixed issue #33 where files with the same filename in different folders would be put under the same key in the output results dictionary
    • Removed sort_output parameter as it became obsolete with the above fix
    • Support for setting the MSE threshold for comparison directly from the similarity parameter
    • Implemented handling for issue #32 where CTRL-C would not abort the difPy process when running in a terminal
    • Various other code improvements

    Full Changelog: https://github.com/elisemercury/Duplicate-Image-Finder/compare/v2.4.1...v2.4.2

    Source code(tar.gz)
    Source code(zip)
  • v2.4.1(Jul 10, 2022)

    Minor code updates and bug fixes

    • Changed the show progress parameter to default to True: difPy's progress bar is now shown by default
    • Added -Z / -output_directory parameter to the CLI interface: allows setting the output folder for the result files
    • More detailed progress tracking: progress bar is shown when difPy is preparing the files in the target folder(s), and when difPy is comparing the images
    • Fixed an issue where search in subfolders was imprecise
    • @ethanmann fixed issue #25
    • Minor other code adjustments and bug fixes

    Full Changelog: https://github.com/elisemercury/Duplicate-Image-Finder/compare/v2.4...v2.4.1

    Source code(tar.gz)
    Source code(zip)
  • v2.4(Jun 30, 2022)

    Major new features and code improvements:

    • Enhancement #12 and #18: added support for search within subfolders
    • Enhancement #11: added support for usage through CLI interface
    • Improved path handling of files to be os-independent
    • Various minor code updates

    Full Changelog: https://github.com/elisemercury/Duplicate-Image-Finder/compare/v2.3...v2.4

    Source code(tar.gz)
    Source code(zip)
  • v2.3(Jun 29, 2022)

    New features and code improvements:

    • Enhancement https://github.com/elisemercury/Duplicate-Image-Finder/pull/19: added support for a progress bar to track the process of difPy
    • Enhancement https://github.com/elisemercury/Duplicate-Image-Finder/pull/20: added support for generation of statistics on the difPy process
    • Fixed bug #17 which caused a FileNotFoundError when files were moved/deleted while difPy was running
    • Various updates & improvements to the code

    Full Changelog: https://github.com/elisemercury/Duplicate-Image-Finder/compare/v2.2...v2.3

    Source code(tar.gz)
    Source code(zip)
  • v2.2(Mar 6, 2022)

  • v2.0(Dec 26, 2021)

    Major code updates and various new features added:

    • difPy v2.0 runs 6x faster than previous versions
    • Support for search within two different folders
    • Support for sorting of output by filename alphabetically
    • Optimization and implementation of error handling
    • Various other code improvements
    Source code(tar.gz)
    Source code(zip)
  • v1.2(Nov 10, 2021)

  • v1.0.0(Oct 30, 2021)

    Various updates to the code.

    New features:

    • Automatically delete the lower resolution duplicate files that were found
    • Addition of a new similarity-level at which images are compared: now 3 levels can be chosen ("low", "normal" and "high")

    Upload as package to PyPI.org

    Source code(tar.gz)
    Source code(zip)
  • v0.0(Oct 30, 2021)
