Shq64 is like a super-efficient fingerprinting system for your data. It creates a tiny, fixed-size code that tells you how similar two pieces of information are, without revealing the original data or causing accidental matches in your search results.

Shq64: SimHash Variant for Similarity Preservation

Overview

Shq64 (SimHash QuadB64) is a compact, similarity-preserving hash encoding that combines the power of locality-sensitive hashing with position-safe encoding. It produces a fixed 16-character output that maintains similarity relationships between inputs while preventing substring pollution.

Key Characteristics

  • Fixed Size: Always 16 characters (64 bits)
  • Similarity-Preserving: Similar inputs → similar hashes
  • Position-Safe: No false substring matches
  • Ultra-Fast: Optimized for high-throughput scenarios
  • One-Way: Not reversible (hash function)

How SimHash Works

The Algorithm

SimHash is a locality-sensitive hashing technique that preserves cosine similarity:

  1. Feature Extraction: Break input into features (shingles)
  2. Hash Features: Generate hash for each feature
  3. Weighted Sum: Accumulate weighted bit vectors
  4. Binarization: Convert to final hash
def simhash(data: bytes, num_bits: int = 64) -> int:
    # Initialize bit vector
    bit_vector = [0] * num_bits
    
    # Extract features (e.g., 4-byte shingles)
    for i in range(len(data) - 3):
        feature = data[i:i+4]
        
        # Hash the feature
        feature_hash = hash(feature)
        
        # Update bit vector
        for j in range(num_bits):
            if feature_hash & (1 << j):
                bit_vector[j] += 1
            else:
                bit_vector[j] -= 1
    
    # Binarize
    result = 0
    for j in range(num_bits):
        if bit_vector[j] > 0:
            result |= (1 << j)
    
    return result

Position-Safe Encoding

After generating the 64-bit SimHash, Shq64 applies position-safe encoding:

def encode_shq64(simhash_value: int) -> str:
    # Convert 64-bit hash to bytes
    hash_bytes = simhash_value.to_bytes(8, 'big')
    
    # Apply Eq64-style encoding with position safety
    encoded = encode_eq64(hash_bytes)
    
    # Result: 16 characters with dots
    # Example: "QRsT.UvWx.YZab.cdef"
    return encoded[:19]  # 16 chars + 3 dots

Usage Examples

Basic Usage

from uubed import encode_shq64

# Text similarity
text1 = "The quick brown fox jumps over the lazy dog"
text2 = "The quick brown fox jumps over the lazy cat"
text3 = "Python is a great programming language"

hash1 = encode_shq64(text1.encode())
hash2 = encode_shq64(text2.encode())
hash3 = encode_shq64(text3.encode())

print(f"Text 1: {hash1}")  # QRsT.UvWx.YZab.cdef
print(f"Text 2: {hash2}")  # QRsT.UvWx.YZab.cdeg (similar!)
print(f"Text 3: {hash3}")  # mnOp.qRsT.uvWx.yzAB (different)

Embedding Similarity

import numpy as np
from uubed import encode_shq64, hamming_distance

# Similar embeddings
embedding1 = np.random.rand(768).astype(np.float32)
embedding2 = embedding1 + np.random.normal(0, 0.01, 768)  # Small perturbation

hash1 = encode_shq64(embedding1.tobytes())
hash2 = encode_shq64(embedding2.tobytes())

# Compare similarity
distance = hamming_distance(hash1, hash2)
print(f"Hamming distance: {distance}")  # Small value (0-3)

Deduplication

from uubed import encode_shq64

# Document deduplication
documents = [
    "Machine learning is transforming industries",
    "Machine learning is transforming industries.",  # Near duplicate
    "Deep learning is a subset of machine learning",
    "Machine learning is transforming industries!",  # Near duplicate
]

# Generate hashes
hashes = {}
for i, doc in enumerate(documents):
    hash_code = encode_shq64(doc.encode())
    if hash_code in hashes:
        print(f"Document {i} is likely duplicate of {hashes[hash_code]}")
    else:
        hashes[hash_code] = i

Advanced Features

Custom Feature Extraction

from uubed import Shq64Encoder

# Custom shingle size
encoder = Shq64Encoder(shingle_size=8)  # 8-byte shingles
hash_code = encoder.encode(data)

# Custom feature weights
encoder = Shq64Encoder(
    feature_extractor=lambda data: extract_weighted_features(data)
)

Batch Processing

from uubed import encode_shq64_batch

# Process multiple embeddings efficiently
embeddings = [model.encode(text) for text in documents]

# Parallel processing
hashes = encode_shq64_batch(
    [emb.tobytes() for emb in embeddings],
    num_workers=4
)

Similarity Metrics

from uubed import shq64_similarity

# Compare two hashes
hash1 = "QRsT.UvWx.YZab.cdef"
hash2 = "QRsT.UvWx.YZab.cdeg"

similarity = shq64_similarity(hash1, hash2)
print(f"Similarity: {similarity:.2%}")  # 93.75% (15/16 bits match)

Integration Patterns

With Redis

import redis
from uubed import encode_shq64

r = redis.Redis()

# Store embeddings with similarity hashes
def store_embedding(doc_id: str, embedding: np.ndarray, metadata: dict):
    hash_code = encode_shq64(embedding.tobytes())
    
    # Store in Redis with hash as key prefix
    r.hset(f"emb:{hash_code}:{doc_id}", mapping={
        "data": embedding.tobytes(),
        "metadata": json.dumps(metadata)
    })
    
    # Add to similarity index
    r.sadd(f"similar:{hash_code[:8]}", doc_id)

# Find similar documents
def find_similar(embedding: np.ndarray, threshold: int = 2):
    hash_code = encode_shq64(embedding.tobytes())
    
    similar_docs = []
    # Check variations within hamming distance
    for variant in generate_variants(hash_code, threshold):
        docs = r.smembers(f"similar:{variant[:8]}")
        similar_docs.extend(docs)
    
    return similar_docs

With PostgreSQL

import psycopg2
from uubed import encode_shq64

# Create table with Shq64 column
cursor.execute("""
    CREATE TABLE documents (
        id SERIAL PRIMARY KEY,
        content TEXT,
        embedding BYTEA,
        shq64_hash CHAR(19) NOT NULL,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );
    
    CREATE INDEX idx_shq64 ON documents(shq64_hash);
    CREATE INDEX idx_shq64_prefix ON documents(LEFT(shq64_hash, 8));
""")

# Insert with hash
def insert_document(content: str, embedding: np.ndarray):
    hash_code = encode_shq64(embedding.tobytes())
    
    cursor.execute("""
        INSERT INTO documents (content, embedding, shq64_hash)
        VALUES (%s, %s, %s)
    """, (content, embedding.tobytes(), hash_code))

# Find near-duplicates
def find_near_duplicates(embedding: np.ndarray):
    hash_code = encode_shq64(embedding.tobytes())
    prefix = hash_code[:8]  # First 2 groups
    
    cursor.execute("""
        SELECT id, content, shq64_hash
        FROM documents
        WHERE LEFT(shq64_hash, 8) = %s
        AND shq64_hash != %s
    """, (prefix, hash_code))
    
    return cursor.fetchall()

With Faiss

import faiss
from uubed import encode_shq64

class ShqIndex:
    def __init__(self, dim: int):
        self.index = faiss.IndexFlatL2(dim)
        self.hash_to_ids = {}
        self.id_to_hash = {}
        
    def add(self, embeddings: np.ndarray, ids: List[int]):
        self.index.add(embeddings)
        
        for i, emb in enumerate(embeddings):
            hash_code = encode_shq64(emb.tobytes())
            self.hash_to_ids.setdefault(hash_code, []).append(ids[i])
            self.id_to_hash[ids[i]] = hash_code
    
    def search_with_dedup(self, query: np.ndarray, k: int = 10):
        # Initial search
        distances, indices = self.index.search(query, k * 3)
        
        # Deduplicate by Shq64 hash
        seen_hashes = set()
        results = []
        
        for idx in indices[0]:
            if idx == -1:
                continue
            hash_code = self.id_to_hash.get(idx)
            if hash_code not in seen_hashes:
                seen_hashes.add(hash_code)
                results.append(idx)
                if len(results) >= k:
                    break
        
        return results

Performance Characteristics

Speed Benchmarks

Operation Throughput Latency (1KB)
Shq64 encoding 117 MB/s 8.5 μs
Hamming distance 2.1M ops/s 0.48 μs
Batch (1000 items) 156 MB/s 6.4 ms

Similarity Preservation

Shq64 maintains excellent similarity preservation:

Cosine Similarity Avg Hamming Distance Correlation
95-100% 0-3 bits 0.94
85-95% 3-8 bits 0.91
70-85% 8-16 bits 0.87
<70% 16-32 bits 0.82

Best Practices

Do’s

  1. Use for deduplication: Excellent for finding near-duplicates
  2. Combine with vector search: Pre-filter candidates
  3. Index prefixes: For efficient similarity queries
  4. Batch processing: Better throughput for large datasets
  5. Cache hashes: They’re small and reusable

Don’ts

  1. Don’t expect reversibility: It’s a one-way hash
  2. Don’t use for exact matching only: Wastes similarity features
  3. Don’t ignore false positives: Similar hash ≠ identical content
  4. Don’t modify hash strings: Breaks position encoding

Tuning Parameters

Shingle Size

# Smaller shingles = more sensitive to small changes
encoder_sensitive = Shq64Encoder(shingle_size=2)

# Larger shingles = more robust to noise
encoder_robust = Shq64Encoder(shingle_size=16)

Bit Allocation

# Custom bit allocation for different feature types
encoder = Shq64Encoder(
    bit_allocations={
        'content': 48,    # 48 bits for content
        'structure': 16   # 16 bits for structure
    }
)

Use Cases

1. Document Deduplication

Perfect for finding duplicate or near-duplicate documents:

  • News article deduplication
  • Research paper similarity
  • Code clone detection
  • Email threading

2. Embedding Clustering

Pre-cluster embeddings for faster search:

  • Group similar embeddings
  • Reduce search space
  • Accelerate k-NN queries

3. Content Fingerprinting

Create compact fingerprints:

  • Video frame similarity
  • Audio matching
  • Image deduplication

4. Anomaly Detection

Identify outliers:

  • Detect unusual embeddings
  • Find corrupted data
  • Security monitoring

Limitations

  1. Not cryptographic: Don’t use for security
  2. Fixed size: Always 64 bits, regardless of input
  3. Approximate: Can have false positives
  4. Not order-preserving: Can’t do range queries

Summary

Shq64 provides a powerful combination of similarity preservation and position safety in just 16 characters. It’s ideal for:

  • High-volume deduplication: Process millions of documents
  • Similarity search: Pre-filter before expensive operations
  • Compact storage: 16 characters vs full embeddings
  • Fast comparison: Hamming distance is extremely efficient

Use Shq64 when you need fast, approximate similarity matching with protection against substring pollution. It’s particularly effective as a pre-filter for more expensive similarity computations or when storage space is at a premium.


Copyright © 2024 UUBED Project. Distributed under the MIT License.