This guide is your fast track to QuadB64: install the uubed library, encode and decode your data, and see why position-safe encoding beats old-school Base64 for modern search. Get ready to make your data smarter and your searches cleaner!

Quick Start Guide

Get up and running with QuadB64 in minutes! This guide covers installation, basic usage, and common integration patterns.

Installation

The simplest way to install uubed:

pip install uubed

For maximum performance with native extensions:

pip install uubed[native]

For development or latest features:

git clone https://github.com/twardoch/uubed.git
cd uubed
pip install -e ".[dev]"

See the Installation Guide for detailed instructions and troubleshooting.
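
Once installed, a quick sanity check confirms everything imports and shows whether native acceleration is active (has_native_extensions is covered under Performance below):

from uubed import encode_eq64, has_native_extensions

print(encode_eq64(b"ping"))      # prints a dotted, position-safe code
print(has_native_extensions())   # True if the native extension loaded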

Basic Usage

Simple Text Encoding

from uubed import encode_eq64, decode_eq64

# Encode any binary data
data = b"Hello, QuadB64 World!"
encoded = encode_eq64(data)
print(f"Encoded: {encoded}")
# Output: SGVs.bG8s.IFFV.YWRC.NjQg.V29y.bGQh

# Decode back to original
decoded = decode_eq64(encoded)
assert decoded == data
print(f"Decoded: {decoded}")
# Output: b'Hello, QuadB64 World!'

Working with Embeddings

import numpy as np
from uubed import encode, decode

# Create a sample embedding (e.g., from an ML model)
embedding = np.random.rand(768).astype(np.float32)

# Convert to bytes
embedding_bytes = embedding.tobytes()

# Full precision encoding with Eq64
full_code = encode(embedding_bytes, method="eq64")
print(f"Full encoding length: {len(full_code)} chars")

# Compact similarity hash with Shq64
compact_code = encode(embedding_bytes, method="shq64")
print(f"Compact hash: {compact_code}")  # 16 characters

# Decode back (only works for eq64)
decoded_bytes = decode(full_code)
decoded_embedding = np.frombuffer(decoded_bytes, dtype=np.float32)
assert np.allclose(embedding, decoded_embedding)

Why QuadB64?

The Problem with Traditional Base64

When search engines index Base64-encoded data, they treat it as regular text:

# Two completely different embeddings
# ("model" is any sentence-embedding model, e.g. from sentence-transformers)
embedding1 = model.encode("cats are cute")
embedding2 = model.encode("quantum physics")

# Traditional Base64 encoding
import base64
b64_1 = base64.b64encode(embedding1.tobytes()).decode()
b64_2 = base64.b64encode(embedding2.tobytes()).decode()

# Substring pollution: random matches!
# "YWJj" might appear in both encodings by chance
# Search engines will falsely match these unrelated documents

The QuadB64 Solution

QuadB64 uses position-dependent encoding to prevent false matches:

# QuadB64 encoding
from uubed import encode_eq64
q64_1 = encode_eq64(embedding1.tobytes())
q64_2 = encode_eq64(embedding2.tobytes())

# Position-safe: "YWJj" at position 0 ≠ "YWJj" at position 4
# No false substring matches between unrelated documents!
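
To make "position-dependent alphabets" concrete, here is a toy sketch of the idea. It is an illustration only (not uubed's actual alphabets or output format), but it shows how remapping each 4-character chunk through a rotated alphabet makes identical input chunks encode differently at different offsets:

# Toy illustration only: not uubed's real alphabets or wire format
import base64

B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def toy_position_safe_encode(data: bytes) -> str:
    std = base64.b64encode(data).decode().rstrip("=")
    chunks = []
    for i in range(0, len(std), 4):
        # Rotate the alphabet by the chunk index, so the same four
        # characters encode differently at different positions
        rot = (i // 4) % 64
        alphabet = B64[rot:] + B64[:rot]
        chunks.append("".join(alphabet[B64.index(c)] for c in std[i:i + 4]))
    return ".".join(chunks)

print(toy_position_safe_encode(b"abcabc"))
# Output: YWJj.ZXKk  (plain Base64 would repeat: YWJjYWJj)

Because the repeated input chunk no longer produces a repeated substring, a match on "YWJj" can only be genuine.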

Key benefits:

  • No substring pollution: Position-dependent alphabets
  • Search accuracy: Only genuine matches are found
  • Easy integration: Drop-in replacement for Base64
  • High performance: Minimal overhead vs Base64

Encoding Methods

uubed provides multiple encoding schemes optimized for different use cases:

Eq64 - Full Embeddings

Perfect for when you need lossless encoding:

from uubed import encode_eq64, decode_eq64

data = b"Your binary data here"
encoded = encode_eq64(data)  # Position-safe, dots every 4 chars
decoded = decode_eq64(encoded)  # Get original data back

  • Size: ~1.33x original (same ratio as Base64), plus one dot separator per 4-character group
  • Use cases: Full embeddings, binary files, any lossless encoding
  • Features: Complete reversibility, position safety
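
You can check the overhead yourself; expect roughly Base64's 4/3 expansion plus the dot separators:

from uubed import encode_eq64

data = bytes(range(96))   # 96 arbitrary bytes
encoded = encode_eq64(data)
# Plain Base64 would need len(data) * 4 / 3 = 128 chars;
# the dots add roughly one more per 4-character group
print(len(data), len(encoded))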

Shq64 - SimHash Variant

Compact similarity-preserving hashes:

from uubed import encode_shq64

# 768-dimensional embedding ("model" is any sentence-embedding model)
embedding = model.encode("sample text")
hash_code = encode_shq64(embedding.tobytes())
print(hash_code)  # 16 hash characters plus dot separators, e.g. "QRsT.UvWx.YZab.cdef"

  • Size: Always 16 characters (64-bit hash)
  • Use cases: Deduplication, similarity search, clustering
  • Features: Preserves cosine similarity, extremely compact
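
Because the hash preserves cosine similarity, nearby embeddings should agree on most code characters. A quick illustrative check (behavior assumed from the similarity-preserving claim above; exact agreement rates vary):

import numpy as np
from uubed import encode_shq64

base = np.random.rand(768).astype(np.float32)
near = base + 0.01 * np.random.rand(768).astype(np.float32)  # small perturbation
far = np.random.rand(768).astype(np.float32)                 # unrelated vector

def agreement(a: str, b: str) -> float:
    # Fraction of positions where the two codes agree
    return sum(x == y for x, y in zip(a, b)) / len(a)

code = encode_shq64(base.tobytes())
print(agreement(code, encode_shq64(near.tobytes())))  # expect close to 1.0
print(agreement(code, encode_shq64(far.tobytes())))   # expect much lower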

T8q64 - Top-k Indices

Sparse representation capturing most important features:

from uubed import encode_t8q64

# Encode top-8 most significant indices
sparse_code = encode_t8q64(embedding.tobytes(), k=8)

  • Size: 16 characters (8 indices + magnitudes)
  • Use cases: Sparse embeddings, feature selection
  • Features: Captures most informative dimensions

Zoq64 - Z-order Curve

Spatial locality-preserving encoding:

from uubed import encode_zoq64
import numpy as np

# 2D or higher dimensional data
# (illustrative input: packed float32 coordinates; check the API reference for the exact format)
coordinates = np.array([37.7749, -122.4194], dtype=np.float32).tobytes()
spatial_code = encode_zoq64(coordinates)

  • Size: Variable (based on precision needs)
  • Use cases: Geospatial data, multi-dimensional indexing
  • Features: Nearby points have similar prefixes
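
Locality means nearby points should share a code prefix. A minimal sketch, assuming the same packed-float32 input as above:

import numpy as np
from os.path import commonprefix
from uubed import encode_zoq64

# Two nearby points (hypothetical input format: packed float32 pairs)
p1 = np.array([37.7749, -122.4194], dtype=np.float32).tobytes()
p2 = np.array([37.7750, -122.4195], dtype=np.float32).tobytes()

c1, c2 = encode_zoq64(p1), encode_zoq64(p2)
print(commonprefix([c1, c2]))  # expect a shared prefix for nearby points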

Performance

QuadB64 is designed for production workloads:

Operation       Pure Python   With Native Extensions   Speedup
Eq64 encoding   5.5 MB/s      230+ MB/s                40-105x
Shq64 hashing   12 MB/s       117 MB/s                 9.7x
T8q64 sparse    8 MB/s        156 MB/s                 19.5x
Zoq64 spatial   0.3 MB/s      480 MB/s                 1600x

Check if native extensions are available:

from uubed import has_native_extensions

if has_native_extensions():
    print("🚀 Native acceleration enabled!")
else:
    print("Using pure Python implementation")

Common Patterns

Batch Processing

from uubed import encode_batch

# Process multiple embeddings efficiently
# ("documents" is a list of strings; "model" is any embedding model)
embeddings = [model.encode(text) for text in documents]
encoded_batch = encode_batch(embeddings, method="shq64")

# Parallel processing for large datasets
from concurrent.futures import ProcessPoolExecutor

def process_chunk(chunk):
    return [encode_eq64(emb.tobytes()) for emb in chunk]

with ProcessPoolExecutor() as executor:
    chunks = [embeddings[i:i+100] for i in range(0, len(embeddings), 100)]
    results = list(executor.map(process_chunk, chunks))

Configuration Options

from uubed import encode, Config

# Custom configuration
config = Config(
    default_variant="eq64",
    use_native=True,
    chunk_size=8192,
    num_threads=4
)

# Apply configuration
encoded = encode(data, config=config)

Real-World Integration

Vector Databases (Pinecone, Weaviate, Qdrant)

from uubed import encode_shq64
import pinecone

# Initialize your vector database (legacy pinecone-client shown; adjust for the current SDK)
pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENV")
index = pinecone.Index("my-index")

# Store embeddings with QuadB64 codes
for doc_id, text in documents.items():
    embedding = model.encode(text)
    q64_code = encode_shq64(embedding.tobytes())
    
    index.upsert(
        # Metadata travels with each vector as the tuple's third element
        vectors=[(doc_id, embedding.tolist(), {"text": text, "q64_code": q64_code})]
    )

# Similarity search without substring pollution
query_embedding = model.encode(query_text)
query_code = encode_shq64(query_embedding.tobytes())

# Find exact code matches (no false positives!)
results = index.query(
    vector=query_embedding.tolist(),
    filter={"q64_code": {"$eq": query_code}},
    top_k=10
)

Elasticsearch / OpenSearch

from elasticsearch import Elasticsearch
from uubed import encode_eq64

es = Elasticsearch()

# Index documents with position-safe encoding
doc = {
    "title": "Introduction to QuadB64",
    "content": "QuadB64 solves substring pollution...",
    "embedding": embedding.tolist(),
    "embedding_q64": encode_eq64(embedding.tobytes())
}

es.index(index="docs", id="doc1", body=doc)

# Search with exact matching on the encoded field
target_code = encode_eq64(query_embedding.tobytes())  # code for the query's embedding
query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"content": "QuadB64"}},
                {"term": {"embedding_q64.keyword": target_code}}
            ]
        }
    }
}

results = es.search(index="docs", body=query)

LangChain Integration

import numpy as np
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from uubed import encode_shq64

# Vector stores expect embed_documents to return plain vectors,
# so the QuadB64 codes are kept in a side table keyed by text
q64_codes = {}

class QuadB64Embeddings(OpenAIEmbeddings):
    def embed_documents(self, texts):
        embeddings = super().embed_documents(texts)
        for text, emb in zip(texts, embeddings):
            q64_codes[text] = encode_shq64(np.array(emb, dtype=np.float32).tobytes())
        return embeddings

# Use with vector store
embeddings = QuadB64Embeddings()
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings
)



Copyright © 2024 UUBED Project. Distributed under the MIT License.