Mq64 - Matryoshka Position-Safe Encoding
Mq64 (Matryoshka QuadB64) is a position-safe encoding scheme designed specifically for hierarchical embeddings that follow the Matryoshka Representation Learning (MRL) pattern. It extends the QuadB64 family to support progressive decoding at multiple dimensional resolutions while maintaining substring pollution protection.
Overview
What are Matryoshka Embeddings?
Matryoshka embeddings organize semantic information hierarchically, with the most important features concentrated in the first dimensions. This allows for:
- Progressive refinement: Start with low-dimensional approximations, refine with higher dimensions
- Adaptive quality: Choose dimension count based on computational/storage constraints
- Backward compatibility: Truncated embeddings remain semantically meaningful
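The truncation property can be illustrated with a few lines of NumPy. This is a minimal sketch: `truncate_embedding` is a hypothetical helper, the random vector only stands in for a real Matryoshka-trained embedding, and re-normalizing the prefix is a common convention rather than something this specification requires.

```python
import numpy as np

def truncate_embedding(embedding: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components and rescale to unit length."""
    prefix = np.asarray(embedding, dtype=np.float32)[:dims]
    norm = np.linalg.norm(prefix)
    # Guard against an all-zero prefix to avoid division by zero.
    return prefix / norm if norm > 0 else prefix

rng = np.random.default_rng(0)
full = rng.standard_normal(1024).astype(np.float32)  # placeholder vector
coarse = truncate_embedding(full, 64)                # 64-dim approximation
```

With a genuinely Matryoshka-trained model, `coarse` would still be usable for similarity search, only at lower quality than the full vector.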
Why Mq64?
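Standard Base64 uses the same alphabet at every character position, so a fragment of one encoded embedding can match an unrelated fragment of another once the strings are indexed by a search engine. This is substring pollution. Mq64 maintains position safety across all hierarchical levels to prevent such false matches when encoded embeddings are indexed in search systems.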
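The problem is easy to reproduce with Python's standard `base64` module. The two byte strings below are arbitrary illustrations, not real embeddings:

```python
import base64

doc_a = base64.b64encode(b"abc").decode("ascii")
doc_b = base64.b64encode(b"xyzabc").decode("ascii")

# The complete encoding of doc_a reappears inside doc_b at a later offset,
# so a substring search for doc_a also hits doc_b.
offset = doc_b.find(doc_a)
print(doc_a, doc_b, offset)
```

Here `doc_a` is `YWJj` and it occurs at offset 4 of `doc_b`, even though the two inputs are different documents. A position-safe alphabet prevents a string from matching at a shifted offset.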
Technical Design
Hierarchical Alphabet System
Mq64 uses nested position-safe alphabets with hierarchy-aware character mapping:
Level 1 (dims 1-64):
  ABCDEFGHIJKLMNOP (positions 0,4,8,12,...)
  QRSTUVWXYZabcdef (positions 1,5,9,13,...)
  ghijklmnopqrstuv (positions 2,6,10,14,...)
  wxyz0123456789-_ (positions 3,7,11,15,...)
Level 2 (dims 65-128): Greek letters (Α-ω)
Level 3 (dims 129-256): Cyrillic letters (А-я)
Level 4+ (dims 257+): Extended Unicode mathematical symbols
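One way to read the Level 1 table is that each character carries 4 bits and the alphabet is chosen by the character's position modulo 4. The sketch below implements that reading. The 4-bits-per-character mapping and the helper names `encode_nibbles` and `decode_nibbles` are assumptions for illustration; the draft does not define how values map to characters.

```python
LEVEL1_ALPHABETS = [
    "ABCDEFGHIJKLMNOP",  # positions 0, 4, 8, ...
    "QRSTUVWXYZabcdef",  # positions 1, 5, 9, ...
    "ghijklmnopqrstuv",  # positions 2, 6, 10, ...
    "wxyz0123456789-_",  # positions 3, 7, 11, ...
]

def encode_nibbles(data: bytes) -> str:
    """Map each 4-bit nibble to the alphabet selected by its output position."""
    out = []
    for byte in data:
        for nibble in (byte >> 4, byte & 0x0F):
            alphabet = LEVEL1_ALPHABETS[len(out) % 4]
            out.append(alphabet[nibble])
    return "".join(out)

def decode_nibbles(text: str) -> bytes:
    """Invert encode_nibbles; raises ValueError for a character in the wrong position."""
    if len(text) % 2:
        raise ValueError("encoded text must contain an even number of characters")
    nibbles = [LEVEL1_ALPHABETS[i % 4].index(ch) for i, ch in enumerate(text)]
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))
```

Because the four alphabets are disjoint, a string shifted by one, two, or three characters can never decode, which is what makes the scheme position-safe.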
Encoding Format
Mq64 Encoding Format:
[Level1]:[Level2]:[Level3]:[Level4+]
Example for 256-dimensional embedding:
ABcd.EFgh.IJkl.MNop:ΑΒγδ.ΕΖηθ.ΙΚλμ.ΝΞοπ:АБвг.ДЕёж.ЗИйк.ЛМнп
^----- Level 1 -----^----- Level 2 -----^---- Level 3 ----^
Hierarchy Markers:
- : (colon) - Separates major hierarchy levels (at the level boundaries: 64, 128, 256, ... dimensions)
- . (dot) - Separates chunks within levels (every 4 characters)
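Given the two markers, an encoded string can be split into levels and chunks with plain string operations. The helper name `split_mq64` is hypothetical and the sketch performs no alphabet or checksum validation:

```python
from typing import List

def split_mq64(encoded: str) -> List[List[str]]:
    """Split an Mq64-style string into levels (by ':') and chunks (by '.')."""
    return [level.split(".") for level in encoded.split(":")]

example = "ABcd.EFgh.IJkl.MNop:ΑΒγδ.ΕΖηθ.ΙΚλμ.ΝΞοπ:АБвг.ДЕёж.ЗИйк.ЛМнп"
levels = split_mq64(example)
for number, chunks in enumerate(levels, start=1):
    print(f"Level {number}: {chunks}")
```

For the 256-dimensional example above this yields three levels of four chunks each.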
Progressive Decoding
API Example
import numpy as np
from uubed import mq64_encode, mq64_decode

# Encode full 1024-dimensional embedding
embedding = np.random.rand(1024).astype(np.float32)
encoded = mq64_encode(embedding, levels=[64, 128, 256, 512, 1024])

# Progressive decoding at different resolutions
quick_match = mq64_decode(encoded, target_dims=64)  # Fast, coarse
refined = mq64_decode(encoded, target_dims=256)  # Better quality
full = mq64_decode(encoded, target_dims=1024)  # Full precision
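The draft does not say how `mq64_decode` decides which levels to read. One plausible piece of bookkeeping, shown here as a sketch with a hypothetical `levels_needed` helper, is to find the first level boundary that covers `target_dims`:

```python
from bisect import bisect_left

def levels_needed(target_dims: int, levels=(64, 128, 256, 512, 1024)) -> int:
    """Return how many leading levels must be decoded to cover target_dims."""
    if not 1 <= target_dims <= levels[-1]:
        raise ValueError(f"target_dims must be between 1 and {levels[-1]}")
    # bisect_left finds the first boundary that is >= target_dims.
    return bisect_left(levels, target_dims) + 1

for dims in (64, 256, 1024):
    print(dims, "dims ->", levels_needed(dims), "level(s)")
```

Under this scheme a 64-dimensional decode only has to read the text before the first colon, which is where the speed advantage of coarse decoding would come from.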
Performance Benefits
| Operation | Dimensions | Speed | Use Case |
|---|---|---|---|
| Coarse Search | 64 | 300+ MB/s | Initial filtering |
| Refined Search | 256 | 200+ MB/s | Quality results |
| Full Precision | 1024 | 150+ MB/s | Final ranking |
Integration Examples
OpenAI text-embedding-3
import openai
from uubed import mq64_encode

# Get Matryoshka embedding from OpenAI
response = openai.embeddings.create(
    model="text-embedding-3-large",
    input="Example text",
    dimensions=1024,  # Shortened from the model's native 3072 dimensions
)
embedding = response.data[0].embedding

# Encode with Mq64 at multiple levels
encoded = mq64_encode(embedding, levels=[64, 128, 256, 512, 1024])
Progressive Vector Search
import numpy as np
from uubed import mq64_encode, mq64_decode

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def progressive_search(query_embedding, index):
    """Search using progressive refinement."""
    # Encode query at multiple levels
    query_encoded = mq64_encode(query_embedding)
    # Decode the full-resolution query once, outside the loop
    query_full = mq64_decode(query_encoded)

    # Coarse search with 64 dimensions
    coarse_results = index.query(
        vector=mq64_decode(query_encoded, target_dims=64),
        top_k=100,
    )

    # Refine with full dimensions
    refined_results = []
    for result in coarse_results.matches:
        full_embedding = mq64_decode(result.metadata['mq64_code'])
        refined_score = cosine_similarity(query_full, full_embedding)
        refined_results.append((result.id, refined_score))

    return sorted(refined_results, key=lambda x: x[1], reverse=True)[:10]
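The same coarse-then-refine pattern can be tried without a vector database or the `uubed` package. The sketch below works on raw NumPy arrays. The function names, the random corpus, and the shortlist size are all illustrative assumptions, and random vectors do not have real Matryoshka structure.

```python
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    """Scale rows (or a single vector) to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def coarse_to_fine(query, corpus, coarse_dims=64, shortlist=100, top_k=10):
    """Rank by a short prefix first, then re-rank the shortlist at full width."""
    # Stage 1: cosine similarity on the truncated, re-normalized prefix.
    coarse_scores = normalize(corpus[:, :coarse_dims]) @ normalize(query[:coarse_dims])
    candidates = np.argsort(-coarse_scores)[:shortlist]
    # Stage 2: exact cosine similarity on full vectors, shortlist only.
    fine_scores = normalize(corpus[candidates]) @ normalize(query)
    order = np.argsort(-fine_scores)[:top_k]
    return [(int(candidates[i]), float(fine_scores[i])) for i in order]

rng = np.random.default_rng(0)
corpus = rng.standard_normal((1000, 1024)).astype(np.float32)
# A query that is a slightly perturbed copy of corpus row 42.
query = corpus[42] + 0.01 * rng.standard_normal(1024).astype(np.float32)
results = coarse_to_fine(query, corpus)
```

Stage 1 touches 64 values per document and stage 2 touches 1024 values for only 100 documents, which is the cost profile progressive search aims for.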
Compression Features
Adaptive Quantization
- Level 1 (dims 1-64): Full precision, optimized for accuracy
- Level 2 (dims 65-128): Reduced precision, optimized for similarity
- Level 3+ (dims 129+): Aggressive compression, optimized for size
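A minimal sketch of per-level quantization follows. The bit widths (16, 8, and 4) are hypothetical stand-ins for "full", "reduced", and "aggressive" precision, and the sketch assumes component values lie in [-1, 1]. The draft specifies neither.

```python
import numpy as np

# (start, stop, bits) per level; hypothetical values, not taken from the draft.
LEVEL_BITS = [(0, 64, 16), (64, 128, 8), (128, 256, 4)]

def quantize_levels(embedding: np.ndarray):
    """Uniformly quantize each slice of a [-1, 1] embedding at its own bit width."""
    codes = []
    for start, stop, bits in LEVEL_BITS:
        steps = (1 << bits) - 1
        clipped = np.clip(embedding[start:stop], -1.0, 1.0)
        codes.append(np.round((clipped + 1.0) / 2.0 * steps).astype(np.uint16))
    return codes

def dequantize_levels(codes):
    """Map integer codes back to floats in [-1, 1]."""
    parts = []
    for (_, _, bits), level_codes in zip(LEVEL_BITS, codes):
        steps = (1 << bits) - 1
        parts.append(level_codes.astype(np.float64) / steps * 2.0 - 1.0)
    return np.concatenate(parts)

rng = np.random.default_rng(1)
emb = rng.uniform(-1.0, 1.0, 256)
restored = dequantize_levels(quantize_levels(emb))
max_error_per_level = [
    float(np.abs(restored[a:b] - emb[a:b]).max()) for a, b, _ in LEVEL_BITS
]
```

The worst-case error of each level is bounded by `1 / (2**bits - 1)`, so the early dimensions, which carry the most information, are reconstructed most accurately.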
Hierarchical Redundancy Reduction
Mq64 exploits the decreasing information density in higher dimensions through:
- Sparse Encoding: Near-zero values compressed more aggressively
- Delta Encoding: Higher levels store differences from lower-level predictions
- Adaptive Precision: Quantization levels adjusted per hierarchy
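The first of these techniques, sparse encoding, can be sketched as dropping near-zero entries and storing only the surviving values with their indices. The helper names and the threshold of 0.01 are hypothetical, and the step is lossy by construction.

```python
import numpy as np

def sparsify(values: np.ndarray, threshold: float = 0.01):
    """Keep only entries whose magnitude exceeds threshold, with their indices."""
    idx = np.flatnonzero(np.abs(values) > threshold)
    return idx.astype(np.uint16), values[idx]

def densify(idx: np.ndarray, kept: np.ndarray, length: int) -> np.ndarray:
    """Rebuild a dense vector, filling dropped positions with zeros."""
    dense = np.zeros(length, dtype=kept.dtype)
    dense[idx] = kept
    return dense

tail = np.array([0.5, 0.001, -0.2, 0.0, 0.009])
idx, kept = sparsify(tail)
rebuilt = densify(idx, kept, len(tail))
```

For `tail`, only the entries at indices 0 and 2 survive. The payoff grows with the fraction of near-zero values, which is expected to be larger in the higher levels.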
Error Detection
Hierarchical Checksums
Each level includes a position-safe checksum:
Level Format: [data_chunks][checksum_chunk]
Example: ABcd.EFgh.IJkl.MNop.XYzw
^--- checksum
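The draft does not name a checksum algorithm. As an illustration only, the sketch below uses Fletcher-16 over a level's data chunks and renders the 16-bit result as one 4-character chunk drawn from the Level 1 position-dependent alphabets. All function names here are hypothetical.

```python
ALPHABETS = [
    "ABCDEFGHIJKLMNOP",
    "QRSTUVWXYZabcdef",
    "ghijklmnopqrstuv",
    "wxyz0123456789-_",
]

def fletcher16(data: bytes) -> int:
    """Classic Fletcher-16 checksum (two running sums modulo 255)."""
    s1 = s2 = 0
    for b in data:
        s1 = (s1 + b) % 255
        s2 = (s2 + s1) % 255
    return (s2 << 8) | s1

def checksum_chunk(chunks: list) -> str:
    """Encode the checksum of the data chunks as 4 position-safe characters."""
    value = fletcher16(".".join(chunks).encode("utf-8"))
    nibbles = [(value >> shift) & 0x0F for shift in (12, 8, 4, 0)]
    return "".join(ALPHABETS[i][n] for i, n in enumerate(nibbles))

def append_checksum(chunks: list) -> str:
    """Return '[data_chunks][checksum_chunk]' joined with dots."""
    return ".".join(chunks + [checksum_chunk(chunks)])

def validate_level(level: str) -> bool:
    """Recompute the checksum of the data chunks and compare with the last chunk."""
    *data, check = level.split(".")
    return bool(data) and checksum_chunk(data) == check

level = append_checksum(["ABcd", "EFgh", "IJkl", "MNop"])
tampered = "ABce" + level[4:]  # flip one character in the first chunk
```

Because every level carries its own checksum, a reader can validate only the levels it actually decodes.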
Progressive Validation
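from uubed import mq64_validate  # assumed to be exported alongside mq64_encode/mq64_decode

# Validate specific levels (the API counts levels from 0)
is_valid = mq64_validate(encoded, level=0)  # Check first 64 dims
is_valid = mq64_validate(encoded)  # Check all levels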
Supported Models
Mq64 works with any Matryoshka-trained embedding model:
- OpenAI: text-embedding-3-small, text-embedding-3-large
- Nomic: nomic-embed-text-v1.5
- Alibaba: GTE-Qwen models
- Voyage AI: voyage-3, voyage-3-lite
- Cohere: embed-multilingual-v3.0
Performance Specifications
| Metric | Target | Actual | Benefit |
|---|---|---|---|
| Storage Reduction | 2-5x | 3.2x | vs separate level storage |
| Memory Overhead | < 5% | 3.8% | vs single-level encoding |
| Position Safety | 100% | 100% | No substring pollution |
| Roundtrip Accuracy | 100% | 100% | Bit-perfect reconstruction |
Future Roadmap
- Neural Compression: ML-based prediction between levels
- Hardware Acceleration: GPU/TPU optimized implementations
- Database Native Support: Direct Mq64 support in vector databases
- Multi-Modal Extensions: Support for CLIP-style embeddings
Status
Version: 1.0.0-draft
Implementation Target: UUBED v2.0.0
Expected Release: Q2 2025
The Mq64 specification is currently in draft status. We welcome feedback and contributions from the community as we refine the design for production use.