Implementing pHash in Python: Step-by-Step Tutorial
Perceptual hashing (pHash) creates compact fingerprints that represent an image’s visual content, allowing detection of similar or near-duplicate images even after edits like resizing, compression, or minor color changes. This tutorial walks through a clear, practical implementation of pHash in Python, from theory to working code, including comparison and tuning tips.
Prerequisites
- Python 3.8+
- Libraries: Pillow, numpy, scipy, imagehash (optional helper)
- Install:
pip install pillow numpy scipy imagehash
- Install:
How pHash works (brief)
- Convert image to grayscale and resize to a fixed small size (commonly 32×32).
- Compute the 2D discrete cosine transform (DCT) of the image.
- Keep the low-frequency DCT coefficients (top-left 8×8 block is common).
- Compute the median (or mean) of those coefficients (excluding the DC term optionally).
- Build the hash: for each coefficient in the selected block, set bit = 1 if > median, else 0. The result is a compact binary hash (commonly 64 bits).
Step-by-step implementation
1) Minimal implementation using Pillow + numpy + scipy
python
from PIL import Imageimport numpy as npfrom scipy.fftpack import dct def phash(image_path, size=32, hash_size=8): # 1. Load image, convert to grayscale, resize img = Image.open(image_path).convert(“L”).resize((size, size), Image.ANTIALIAS) pixels = np.asarray(img, dtype=np.float32) # 2. Apply 2D DCT (first along rows, then columns) dct_rows = dct(pixels, axis=0, norm=‘ortho’) dct_result = dct(dct_rows, axis=1, norm=‘ortho’) # 3. Extract top-left low-frequency block dct_low_freq = dct_result[:hash_size, :hash_size] # 4. Use median excluding the DC term at [0,0] dct_flat = dct_low_freq.flatten() median = np.median(dct_flat[1:]) # exclude DC # 5. Build hash: 1 if coefficient > median diff = dct_flat > median # convert to hex string for convenience bit_string = “.join([‘1’ if x else ‘0’ for x in diff]) hex_hash = ‘{:0{}x}’.format(int(bit_string, 2), hash_size*hash_size//4) return hex_hash
Usage:
python
print(phash(“image1.jpg”))print(phash(“image2.jpg”))
2) Compare hashes with Hamming distance
python
def hamming_distance(hex_hash1, hex_hash2): # convert hex to int, XOR, count set bits x = int(hex_hash1, 16) ^ int(hex_hash2, 16) return bin(x).count(“1”)
Exampleh1 = phash(“image1.jpg”)h2 = phash(“image2.jpg”)print(“Hamming:”, hamming_distance(h1, h2))
Interpretation: distances 0–5 generally mean near-identical; higher values indicate greater differences. Thresholds depend on use case.
3) Using the imagehash library (quick)
python
import imagehashfrom PIL import Image
hash1 = imagehash.phash(Image.open(“image1.jpg”))hash2 = imagehash.phash(Image.open(“image2.jpg”))print(hash1) # prints hash like 74a1f2…print(hash1 - hash2) # returns Hamming distance
Performance & tuning tips
- hash_size: 8 → 64-bit; increase to 16 for more sensitivity (256-bit).
- size: larger resize (e.g., 64) preserves more detail before DCT but increases compute.
- Median vs mean: median is more robust to outliers.
- Excluding DC prevents brightness shifts from dominating hash.
- For large datasets,
Leave a Reply