Optimizing Image Deduplication with pHash Techniques

Implementing pHash in Python: Step-by-Step Tutorial

Perceptual hashing (pHash) creates compact fingerprints that represent an image’s visual content, allowing detection of similar or near-duplicate images even after edits like resizing, compression, or minor color changes. This tutorial walks through a clear, practical implementation of pHash in Python, from theory to working code, including comparison and tuning tips.

Prerequisites

  • Python 3.8+
  • Libraries: Pillow, numpy, scipy, imagehash (optional helper)
    • Install: pip install pillow numpy scipy imagehash

How pHash works (brief)

  1. Convert image to grayscale and resize to a fixed small size (commonly 32×32).
  2. Compute the 2D discrete cosine transform (DCT) of the image.
  3. Keep the low-frequency DCT coefficients (top-left 8×8 block is common).
  4. Compute the median (or mean) of those coefficients (excluding the DC term optionally).
  5. Build the hash: for each coefficient in the selected block, set bit = 1 if > median, else 0. The result is a compact binary hash (commonly 64 bits).
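Steps 4 and 5 can be illustrated on a toy example before touching real images. The sketch below is only an illustration: the coefficient values are made up, the helper name `bits_from_coefficients` is hypothetical, and it uses the standard-library `statistics.median` rather than numpy.

```python
from statistics import median

def bits_from_coefficients(coeffs, exclude_dc=True):
    """Threshold a flat list of DCT coefficients against their median.

    coeffs[0] is assumed to be the DC term. It can be left out of the
    median, but it still contributes one bit to the output.
    """
    reference = coeffs[1:] if exclude_dc else coeffs
    threshold = median(reference)
    return [1 if c > threshold else 0 for c in coeffs]

# Hypothetical coefficients for a 2x2 low-frequency block, flattened row by row.
toy_block = [250.0, -12.5, 3.0, 0.5]
bits = bits_from_coefficients(toy_block)
print(bits)
```

With the DC term excluded, the median of the remaining three values is 0.5, so only coefficients strictly greater than 0.5 produce a 1 bit.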

Step-by-step implementation

1) Minimal implementation using Pillow + numpy + scipy
python
from PIL import Image
import numpy as np
from scipy.fft import dct

def phash(image_path, size=32, hash_size=8):
    # 1. Load image, convert to grayscale, resize
    #    (Image.LANCZOS replaces Image.ANTIALIAS, which was removed in Pillow 10)
    img = Image.open(image_path).convert("L").resize((size, size), Image.LANCZOS)
    pixels = np.asarray(img, dtype=np.float32)

    # 2. Apply 2D DCT (a 1D DCT along axis 0, then along axis 1)
    dct_axis0 = dct(pixels, axis=0, norm="ortho")
    dct_result = dct(dct_axis0, axis=1, norm="ortho")

    # 3. Extract top-left low-frequency block
    dct_low_freq = dct_result[:hash_size, :hash_size]

    # 4. Use median excluding the DC term at [0, 0]
    dct_flat = dct_low_freq.flatten()
    median = np.median(dct_flat[1:])  # exclude DC

    # 5. Build hash: 1 if coefficient > median
    diff = dct_flat > median

    # convert to hex string for convenience
    bit_string = "".join("1" if x else "0" for x in diff)
    hex_hash = "{:0{}x}".format(int(bit_string, 2), hash_size * hash_size // 4)
    return hex_hash

Usage:

python
print(phash("image1.jpg"))
print(phash("image2.jpg"))

2) Compare hashes with Hamming distance
python
def hamming_distance(hex_hash1, hex_hash2):
    # convert hex to int, XOR, count set bits
    x = int(hex_hash1, 16) ^ int(hex_hash2, 16)
    return bin(x).count("1")

Example:

python
h1 = phash("image1.jpg")
h2 = phash("image2.jpg")
print("Hamming:", hamming_distance(h1, h2))

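Interpretation: for a 64-bit hash, distances of about 0–5 generally indicate near-identical images; higher values indicate greater visual differences. The right threshold depends on the use case, so validate it on sample pairs from your own data.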
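A threshold check can be wrapped in a small helper. The following is a minimal sketch, not a definitive rule: the function name `is_near_duplicate`, the default `max_distance=5`, and the two example hashes are assumptions chosen for illustration, and the block redefines `hamming_distance` so that it runs on its own.

```python
def hamming_distance(hex_hash1, hex_hash2):
    """Count differing bits between two equal-length hex hash strings."""
    if len(hex_hash1) != len(hex_hash2):
        raise ValueError("hashes must have the same length")
    return bin(int(hex_hash1, 16) ^ int(hex_hash2, 16)).count("1")

def is_near_duplicate(hex_hash1, hex_hash2, max_distance=5):
    """Return True when the hashes differ in at most max_distance bits.

    max_distance=5 is an illustrative default for 64-bit hashes, not a
    universal constant; tune it on labeled examples from your own data.
    """
    return hamming_distance(hex_hash1, hex_hash2) <= max_distance

# Made-up 64-bit hashes (16 hex digits) that differ only in the last hex digit.
a = "ffd8c0a0b0908880"
b = "ffd8c0a0b0908887"
print(hamming_distance(a, b), is_near_duplicate(a, b))
```

Rejecting hashes of different lengths up front avoids silently comparing, say, a 64-bit hash against a 256-bit one.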

3) Using the imagehash library (quick)
python
import imagehash
from PIL import Image

hash1 = imagehash.phash(Image.open("image1.jpg"))
hash2 = imagehash.phash(Image.open("image2.jpg"))
print(hash1)  # prints hash like 74a1f2…
print(hash1 - hash2)  # returns Hamming distance

Performance & tuning tips

  • hash_size: 8 → 64-bit; increase to 16 for more sensitivity (256-bit).
  • size: larger resize (e.g., 64) preserves more detail before DCT but increases compute.
  • Median vs mean: median is more robust to outliers.
  • Excluding the DC term prevents brightness shifts from dominating the hash.
  • For large datasets,
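To make the hash_size tip concrete, the arithmetic behind "8 → 64-bit" and "16 → 256-bit" can be sketched as follows. The helper names `hash_dimensions` and `normalized_distance` are hypothetical, and dividing the distance by the bit count is one possible way to compare thresholds across hash sizes, offered here as an assumption rather than an established rule.

```python
def hash_dimensions(hash_size):
    """Return (bits, hex_digits) for a hash built from a hash_size x hash_size block."""
    bits = hash_size * hash_size
    hex_digits = bits // 4  # one hex digit encodes 4 bits
    return bits, hex_digits

def normalized_distance(distance, hash_size):
    """Express a Hamming distance as a fraction of the total hash bits."""
    bits, _ = hash_dimensions(hash_size)
    return distance / bits

for n in (8, 16):
    print(n, hash_dimensions(n))
```

Because a 256-bit hash has four times as many bits as a 64-bit one, a raw distance threshold tuned for hash_size=8 will usually need revisiting after switching to hash_size=16.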
