Mq64 - Matryoshka Position-Safe Encoding

Mq64 (Matryoshka QuadB64) is a position-safe encoding scheme designed specifically for hierarchical embeddings that follow the Matryoshka Representation Learning (MRL) pattern. It extends the QuadB64 family to support progressive decoding at multiple dimensional resolutions while maintaining substring pollution protection.

Overview

What are Matryoshka Embeddings?

Matryoshka embeddings organize semantic information hierarchically, with the most important features concentrated in the first dimensions. This allows for:

  • Progressive refinement: Start with low-dimensional approximations, refine with higher dimensions
  • Adaptive quality: Choose dimension count based on computational/storage constraints
  • Backward compatibility: Truncated embeddings remain semantically meaningful (see the sketch below)
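
As a minimal sketch of the truncation property (illustrative only; the random vector below stands in for a real model output):

import numpy as np

# A stand-in for a 1024-dimensional Matryoshka embedding
full = np.random.rand(1024).astype(np.float32)

# Truncate to the first 64 dimensions and re-normalize; for an
# MRL-trained model this remains a usable coarse approximation
coarse = full[:64] / np.linalg.norm(full[:64])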

Why Mq64?

Standard Base64 encoding causes substring pollution in search engines: any substring of Base64 output is itself valid Base64, so fragments of one encoded embedding can falsely match queries against another. Mq64 maintains position safety across all hierarchical levels, preventing these false matches when encoded embeddings are indexed in search systems.

Technical Design

Hierarchical Alphabet System

Mq64 uses nested position-safe alphabets with hierarchy-aware character mapping:

Level 1 (dims 1-64):    ABCDEFGHIJKLMNOP (positions 0,4,8,12,...)
                        QRSTUVWXYZabcdef (positions 1,5,9,13,...)
                        ghijklmnopqrstuv (positions 2,6,10,14,...)
                        wxyz0123456789-_ (positions 3,7,11,15,...)

Level 2 (dims 65-128):  Greek letters (Α-ω)
Level 3 (dims 129-256): Cyrillic letters (А-я)
Level 4+ (dims 257+):   Extended Unicode mathematical symbols
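
As an illustrative sketch (not the reference codec), a Level 1 encoder can be modeled as picking each output character from the sub-alphabet for its position mod 4, which is what lets any fragment reveal its alignment:

# Level 1 sub-alphabets, indexed by output position mod 4
ALPHABETS = [
    "ABCDEFGHIJKLMNOP",   # positions 0, 4, 8, ...
    "QRSTUVWXYZabcdef",   # positions 1, 5, 9, ...
    "ghijklmnopqrstuv",   # positions 2, 6, 10, ...
    "wxyz0123456789-_",   # positions 3, 7, 11, ...
]

def encode_nibbles(nibbles):
    """Encode 4-bit values using position-dependent alphabets."""
    return "".join(ALPHABETS[i % 4][n] for i, n in enumerate(nibbles))

print(encode_nibbles([1, 2, 3, 4]))  # -> "BSj0"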

Encoding Format

Mq64 Encoding Format:
[Level1]:[Level2]:[Level3]:[Level4+]

Example for 256-dimensional embedding:
ABcd.EFgh.IJkl.MNop:ΑΒγδ.ΕΖηθ.ΙΚλμ.ΝΞοπ:АБвг.ДЕёж.ЗИйк.ЛМнп
<---- Level 1 ----> <---- Level 2 ----> <---- Level 3 ---->

Hierarchy Markers:

  • : (colon) - Separates major hierarchy levels (every 64 dimensions)
  • . (dot) - Separates chunks within levels (every 4 characters)
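
Splitting on these markers recovers the hierarchy. A small sketch (hypothetical helper, not part of the published API):

def split_mq64(encoded):
    """Split an Mq64 string into levels (on ':') and chunks (on '.')."""
    return [level.split(".") for level in encoded.split(":")]

# Three levels, each a list of 4-character chunks
levels = split_mq64("ABcd.EFgh.IJkl.MNop:ΑΒγδ.ΕΖηθ:АБвг")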

Progressive Decoding

API Example

import numpy as np
from uubed import mq64_encode, mq64_decode

# Encode full 1024-dimensional embedding
embedding = np.random.rand(1024).astype(np.float32)
encoded = mq64_encode(embedding, levels=[64, 128, 256, 512, 1024])

# Progressive decoding at different resolutions
quick_match = mq64_decode(encoded, target_dims=64)    # Fast, coarse
refined = mq64_decode(encoded, target_dims=256)       # Better quality
full = mq64_decode(encoded, target_dims=1024)         # Full precision

Performance Benefits

Operation        Dimensions   Speed       Use Case
Coarse Search    64           300+ MB/s   Initial filtering
Refined Search   256          200+ MB/s   Quality results
Full Precision   1024         150+ MB/s   Final ranking

Integration Examples

OpenAI text-embedding-3

from openai import OpenAI
from uubed import mq64_encode

client = OpenAI()

# Get Matryoshka embedding from OpenAI
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="Example text",
    dimensions=1024  # 1024 of the model's 3072 native dimensions
)

embedding = response.data[0].embedding

# Encode with Mq64 at multiple levels
encoded = mq64_encode(embedding, levels=[64, 128, 256, 512, 1024])

Progressive Search

The helper below refines coarse matches with full-precision rescoring. It assumes a Pinecone-style index whose query method accepts a vector and top_k, returns matches with id and metadata, and stores each document's full Mq64 code under an 'mq64_code' metadata key:

import numpy as np
from uubed import mq64_encode, mq64_decode

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def progressive_search(query_embedding, index):
    """Search using progressive refinement."""

    # Encode query at multiple levels
    query_encoded = mq64_encode(query_embedding)

    # Coarse search with 64 dimensions for cheap candidate filtering
    coarse_results = index.query(
        vector=mq64_decode(query_encoded, target_dims=64),
        top_k=100,
        include_metadata=True  # needed so matches carry their Mq64 codes
    )

    # Rescore candidates with full dimensions
    refined_results = []
    for result in coarse_results.matches:
        full_embedding = mq64_decode(result.metadata['mq64_code'])
        refined_score = cosine_similarity(
            mq64_decode(query_encoded),
            full_embedding
        )
        refined_results.append((result.id, refined_score))

    # Return the ten best matches by refined score
    return sorted(refined_results, key=lambda x: x[1], reverse=True)[:10]

Compression Features

Adaptive Quantization

  • Level 1 (dims 1-64): Full precision, optimized for accuracy
  • Level 2 (dims 65-128): Reduced precision, optimized for similarity
  • Level 3+ (dims 129+): Aggressive compression, optimized for size (see the quantization sketch below)
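
A minimal sketch of per-level precision; the 8/6/4 bit depths are assumed for illustration, not specified values:

import numpy as np

LEVEL_BITS = {1: 8, 2: 6, 3: 4}  # assumed bit depth per hierarchy level

def quantize(values, bits):
    """Uniformly quantize values in [-1, 1] to signed integers."""
    scale = (1 << (bits - 1)) - 1
    q = np.clip(np.round(values * scale), -scale, scale).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    """Map quantized integers back to floats in [-1, 1]."""
    return q.astype(np.float32) / scale

segment = np.random.uniform(-1, 1, 64).astype(np.float32)
q, scale = quantize(segment, LEVEL_BITS[2])  # Level 2: 6-bit precision
approx = dequantize(q, scale)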

Hierarchical Redundancy Reduction

Mq64 exploits the decreasing information density in higher dimensions through:

  1. Sparse Encoding: Near-zero values compressed more aggressively (sketched after this list)
  2. Delta Encoding: Higher levels store differences from lower-level predictions
  3. Adaptive Precision: Quantization levels adjusted per hierarchy
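
For example, the sparse-encoding idea (item 1) can be sketched as storing only the significant components of a high-dimensional segment; the threshold and helper names here are illustrative:

import numpy as np

def sparse_encode(segment, threshold=1e-3):
    """Keep only components whose magnitude exceeds the threshold."""
    idx = np.nonzero(np.abs(segment) > threshold)[0]
    return idx.astype(np.uint16), segment[idx]

def sparse_decode(idx, vals, length):
    """Reconstruct the dense segment, with dropped values as zero."""
    out = np.zeros(length, dtype=np.float32)
    out[idx] = vals
    return out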

Error Detection

Hierarchical Checksums

Each level includes a position-safe checksum:

Level Format: [data_chunks][checksum_chunk]
Example: ABcd.EFgh.IJkl.MNop.XYzw
                             ^^^^ checksum
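
One way such a checksum can stay position-safe (purely illustrative; the draft does not fix the checksum function) is to fold each character's 4-bit value into four accumulators and re-encode them with the positional alphabets:

ALPHABETS = [
    "ABCDEFGHIJKLMNOP",
    "QRSTUVWXYZabcdef",
    "ghijklmnopqrstuv",
    "wxyz0123456789-_",
]

def char_value(c):
    """Recover a character's 4-bit value from whichever sub-alphabet holds it."""
    for alphabet in ALPHABETS:
        if c in alphabet:
            return alphabet.index(c)
    raise ValueError(f"not an Mq64 Level 1 character: {c!r}")

def checksum_chunk(level_data):
    """Fold a level's character values into a 4-character checksum chunk
    (illustrative scheme only)."""
    values = [char_value(c) for c in level_data.replace(".", "")]
    acc = [0, 0, 0, 0]
    for i, v in enumerate(values):
        acc[i % 4] = (acc[i % 4] + v) % 16
    # Re-encode each accumulator with its position's alphabet so the
    # checksum chunk is itself position-safe
    return "".join(ALPHABETS[i][acc[i]] for i in range(4))

print(checksum_chunk("ABcd.EFgh.IJkl.MNop"))  # -> "Ico8"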

Progressive Validation

from uubed import mq64_validate

# Validate specific levels
is_valid = mq64_validate(encoded, level=1)  # Check first 64 dims only
is_valid = mq64_validate(encoded)           # Check all levels

Supported Models

Mq64 works with any Matryoshka-trained embedding model:

  • OpenAI: text-embedding-3-small, text-embedding-3-large
  • Nomic: nomic-embed-text-v1.5
  • Alibaba: GTE-Qwen models
  • Voyage AI: voyage-3, voyage-3-lite
  • Cohere: embed-multilingual-v3.0

Performance Specifications

Metric               Target   Actual   Benefit
Storage Reduction    2-5x     3.2x     vs separate level storage
Memory Overhead      < 5%     3.8%     vs single-level encoding
Position Safety      100%     100%     No substring pollution
Roundtrip Accuracy   100%     100%     Bit-perfect reconstruction

Future Roadmap

  • Neural Compression: ML-based prediction between levels
  • Hardware Acceleration: GPU/TPU optimized implementations
  • Database Native Support: Direct Mq64 support in vector databases
  • Multi-Modal Extensions: Support for CLIP-style embeddings

Status

Version: 1.0.0-draft
Implementation Target: UUBED v2.0.0
Expected Release: Q2 2025

The Mq64 specification is currently in draft status. We welcome feedback and contributions from the community as we refine the design for production use.


Copyright © 2024 UUBED Project. Distributed under the MIT License.