Vector Functions
Functions for vector similarity search, distance calculations, and embedding operations.
Vector Similarity Functions
SDBQL provides vector functions for computing similarity between embeddings, measuring distances, and normalizing vectors for semantic search applications.
Similarity Functions
VECTOR_SIMILARITY(vec1, vec2)
Calculates the cosine similarity between two vectors: the dot product of the vectors divided by the product of their magnitudes. Returns a value between -1 and 1, where 1 means identical direction, 0 means orthogonal, and -1 means opposite direction.
-- Identical vectors have similarity 1.0
RETURN VECTOR_SIMILARITY([1, 0, 0], [1, 0, 0])
-- Result: 1.0
-- Orthogonal vectors have similarity 0
RETURN VECTOR_SIMILARITY([1, 0, 0], [0, 1, 0])
-- Result: 0.0
-- Similar vectors have high similarity
RETURN VECTOR_SIMILARITY([0.9, 0.1, 0], [1, 0, 0])
-- Result: ~0.994
Distance Functions
VECTOR_DISTANCE(vec1, vec2, metric)
Calculates the distance between two vectors using the specified metric. Supported metrics: "cosine" (1 minus cosine similarity, range 0 to 2), "euclidean" (straight-line distance), and "dot" (the raw dot product; unlike the other metrics it can be negative, and higher values mean closer vectors).
-- Euclidean distance (3-4-5 triangle)
RETURN VECTOR_DISTANCE([0, 0], [3, 4], "euclidean")
-- Result: 5.0
-- Cosine distance (1 - cosine similarity)
RETURN VECTOR_DISTANCE([1, 0], [0, 1], "cosine")
-- Result: 1.0
-- Dot product (for pre-normalized vectors)
RETURN VECTOR_DISTANCE([0.6, 0.8], [1, 0], "dot")
-- Result: 0.6
Utility Functions
VECTOR_NORMALIZE(vec)
Normalizes a vector to unit length (magnitude = 1). Useful for preparing vectors for dot product similarity.
-- Normalize a 3-4-5 vector
RETURN VECTOR_NORMALIZE([3, 4, 0])
-- Result: [0.6, 0.8, 0.0]
-- The normalized vector has magnitude 1
LET norm = VECTOR_NORMALIZE([3, 4, 0])
RETURN SQRT(norm[0]*norm[0] + norm[1]*norm[1] + norm[2]*norm[2])
-- Result: 1.0
Index Statistics
VECTOR_INDEX_STATS(collection, index_name)
Returns statistics about a vector index, including its dimension, vector count, distance metric, quantization status, memory usage, and HNSW build parameters (m, ef_construction).
-- Get statistics for a vector index
LET stats = VECTOR_INDEX_STATS("articles", "embedding_idx")
RETURN stats
-- Result: {
-- "name": "embedding_idx",
-- "field": "embedding",
-- "dimension": 768,
-- "vectors": 15420,
-- "metric": "cosine",
-- "quantization": "scalar",
-- "memory_bytes": 11842560,
-- "compression_ratio": 4.0,
-- "m": 16,
-- "ef_construction": 200
-- }
-- Check if an index is quantized
LET stats = VECTOR_INDEX_STATS("articles", "embedding_idx")
RETURN stats.quantization != "none"
-- Result: true
-- Calculate memory savings from quantization
LET stats = VECTOR_INDEX_STATS("articles", "embedding_idx")
-- Unquantized size: 4 bytes (float32) per dimension per vector
LET full_memory = stats.vectors * stats.dimension * 4
LET savings = full_memory - stats.memory_bytes
RETURN {
vectors: stats.vectors,
compression: stats.compression_ratio,
savings_mb: ROUND(savings / 1048576, 2)
}
Choosing a Metric
Cosine is ideal for text embeddings (OpenAI, sentence-transformers), where direction matters more than magnitude. Euclidean works well for image embeddings or whenever magnitude is meaningful. Dot product is the fastest but requires pre-normalized vectors; on unit-length vectors it equals cosine similarity.
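To make the relationship concrete, here is a small sketch (the vectors are made up for illustration) comparing all three metrics on the same pair. Because both vectors are unit length, the dot product equals the cosine similarity.
-- Compare metrics on two unit-length vectors
LET a = [0.6, 0.8]
LET b = [1, 0]
RETURN {
  cosine_similarity: VECTOR_SIMILARITY(a, b),
  cosine_distance: VECTOR_DISTANCE(a, b, "cosine"),
  euclidean: VECTOR_DISTANCE(a, b, "euclidean"),
  dot: VECTOR_DISTANCE(a, b, "dot")
}
-- Result: { "cosine_similarity": 0.6, "cosine_distance": 0.4, "euclidean": ~0.894, "dot": 0.6 }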
Practical Examples
-- Find similar articles to a query
FOR doc IN articles
LET sim = VECTOR_SIMILARITY(doc.embedding, @query_embedding)
FILTER sim > 0.8
SORT sim DESC
LIMIT 10
RETURN { title: doc.title, score: sim }
-- Find products similar to a given product
LET product = (FOR p IN products FILTER p._key == @product_id RETURN p)[0]
FOR other IN products
FILTER other._key != @product_id
LET sim = VECTOR_SIMILARITY(other.embedding, product.embedding)
SORT sim DESC
LIMIT 5
RETURN { name: other.name, similarity: ROUND(sim, 3) }
-- Combine vector search with filters
FOR doc IN documents
FILTER doc.category == "technology"
FILTER doc.published >= "2024-01-01"
LET sim = VECTOR_SIMILARITY(doc.embedding, @query_vec)
FILTER sim > 0.75
SORT sim DESC
LIMIT 20
RETURN {
title: doc.title,
category: doc.category,
score: ROUND(sim, 4)
}
-- Find nearest neighbors using Euclidean distance
FOR item IN items
LET dist = VECTOR_DISTANCE(item.features, @target_features, "euclidean")
SORT dist ASC
LIMIT 5
RETURN { id: item._key, distance: dist }
-- Normalize vectors before storage
LET raw_embedding = @embedding_from_model
LET normalized = VECTOR_NORMALIZE(raw_embedding)
INSERT { content: @content, embedding: normalized } INTO documents
-- Semantic search with score threshold
FOR chunk IN document_chunks
LET relevance = VECTOR_SIMILARITY(chunk.embedding, @question_embedding)
FILTER relevance >= 0.7
SORT relevance DESC
LIMIT 5
RETURN {
text: chunk.text,
source: chunk.source_doc,
relevance: ROUND(relevance, 3)
}
-- Multi-vector query (average of multiple embeddings)
LET query_vecs = @user_interest_embeddings
FOR product IN products
LET scores = (
FOR qv IN query_vecs
RETURN VECTOR_SIMILARITY(product.embedding, qv)
)
LET avg_score = SUM(scores) / LENGTH(scores)
SORT avg_score DESC
LIMIT 10
RETURN { name: product.name, match_score: avg_score }
Tips for Vector Queries
Performance
- Create vector indexes on embedding fields
- Apply non-vector filters first to reduce candidates
- Use LIMIT to cap result size
- Pre-normalize vectors when using the dot product metric
- For large indexes (over ~10K vectors), use HNSW for roughly 100x faster search than a linear scan (see the sketch after this list)
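A minimal sketch combining these tips; the collection, the lang filter, and the bind variables are hypothetical, and it assumes both the stored embeddings and @query_vec were pre-normalized:
-- Cheap attribute filter first shrinks the candidate set,
-- then the fast "dot" metric ranks the survivors
FOR doc IN documents
FILTER doc.lang == "en"
LET score = VECTOR_DISTANCE(doc.embedding, @query_vec, "dot")
SORT score DESC
LIMIT 10
RETURN { title: doc.title, score: ROUND(score, 3) }
Note that dot product scores are sorted descending, since higher values mean closer vectors.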
Best Practices
- Store embeddings as float arrays
- Match dimension to your model (1536, 768, etc.)
- Use cosine for most text embedding models
- Test similarity thresholds against labeled examples for your use case (see the sketch below)
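One way to do that last step, sketched with hypothetical bind variables holding hand-labeled document keys (and assuming MIN and MAX aggregate over arrays):
-- A workable threshold lies between max_irrelevant and min_relevant
LET relevant = (
FOR d IN documents
FILTER d._key IN @relevant_keys
RETURN VECTOR_SIMILARITY(d.embedding, @query_vec)
)
LET irrelevant = (
FOR d IN documents
FILTER d._key IN @irrelevant_keys
RETURN VECTOR_SIMILARITY(d.embedding, @query_vec)
)
RETURN { min_relevant: MIN(relevant), max_irrelevant: MAX(irrelevant) }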
Quick Reference
| Function | Returns | Description |
|---|---|---|
| VECTOR_SIMILARITY(a, b) | Number (-1 to 1) | Cosine similarity between vectors |
| VECTOR_DISTANCE(a, b, metric) | Number | Distance using specified metric; "dot" can be negative |
| VECTOR_NORMALIZE(v) | Array | Unit-length normalized vector |
| VECTOR_INDEX_STATS(coll, idx) | Object | Index stats: vectors, memory, quantization |