Vector Functions

Functions for vector similarity search, distance calculations, and embedding operations.

Vector Similarity Functions

SDBQL provides vector functions for computing similarity between embeddings, measuring distances, and normalizing vectors for semantic search applications.

Similarity Functions

VECTOR_SIMILARITY(vec1, vec2)

Calculates cosine similarity between two vectors. Returns a value between -1 and 1, where 1 means identical direction, 0 means orthogonal, and -1 means opposite direction.

-- Identical vectors have similarity 1.0
RETURN VECTOR_SIMILARITY([1, 0, 0], [1, 0, 0])
-- Result: 1.0

-- Orthogonal vectors have similarity 0
RETURN VECTOR_SIMILARITY([1, 0, 0], [0, 1, 0])
-- Result: 0.0

-- Similar vectors have high similarity
RETURN VECTOR_SIMILARITY([0.9, 0.1, 0], [1, 0, 0])
-- Result: ~0.994
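
Opposite vectors complete the documented range; a quick check:

-- Opposite vectors have similarity -1.0
RETURN VECTOR_SIMILARITY([1, 0, 0], [-1, 0, 0])
-- Result: -1.0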

Distance Functions

VECTOR_DISTANCE(vec1, vec2, metric)

Calculates the distance between two vectors using the specified metric. Supported metrics: "cosine", "euclidean", and "dot". Note that the "dot" metric returns the raw dot product, where larger values indicate closer vectors, so it behaves as a similarity score rather than a true distance.

-- Euclidean distance (3-4-5 triangle)
RETURN VECTOR_DISTANCE([0, 0], [3, 4], "euclidean")
-- Result: 5.0

-- Cosine distance (1 - cosine similarity)
RETURN VECTOR_DISTANCE([1, 0], [0, 1], "cosine")
-- Result: 1.0

-- Dot product (for pre-normalized vectors)
RETURN VECTOR_DISTANCE([0.6, 0.8], [1, 0], "dot")
-- Result: 0.6
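
Since the "cosine" metric is defined as 1 minus cosine similarity, VECTOR_DISTANCE and VECTOR_SIMILARITY should sum to 1 for any pair of vectors; a quick sanity check:

-- Cosine distance and cosine similarity sum to 1
LET v1 = [0.9, 0.1, 0]
LET v2 = [1, 0, 0]
RETURN VECTOR_DISTANCE(v1, v2, "cosine") + VECTOR_SIMILARITY(v1, v2)
-- Result: 1.0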

Utility Functions

VECTOR_NORMALIZE(vec)

Normalizes a vector to unit length (magnitude = 1). Useful for preparing vectors for dot product similarity.

-- Normalize a 3-4-5 vector
RETURN VECTOR_NORMALIZE([3, 4, 0])
-- Result: [0.6, 0.8, 0.0]

-- The normalized vector has magnitude 1
LET norm = VECTOR_NORMALIZE([3, 4, 0])
RETURN SQRT(norm[0]*norm[0] + norm[1]*norm[1] + norm[2]*norm[2])
-- Result: 1.0
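
Because cosine similarity ignores magnitude, the dot product of two unit-length vectors equals their cosine similarity; a sketch verifying this with the functions above:

-- Dot product of normalized vectors equals cosine similarity
LET a = VECTOR_NORMALIZE([3, 4, 0])
LET b = VECTOR_NORMALIZE([1, 2, 2])
RETURN {
  dot: VECTOR_DISTANCE(a, b, "dot"),
  cosine: VECTOR_SIMILARITY(a, b)
}
-- Result: { "dot": ~0.733, "cosine": ~0.733 }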

Index Statistics

VECTOR_INDEX_STATS(collection, index_name)

Returns statistics about a vector index, including its dimension, vector count, distance metric, quantization status, memory usage, and the HNSW construction parameters m and ef_construction.

-- Get statistics for a vector index
LET stats = VECTOR_INDEX_STATS("articles", "embedding_idx")
RETURN stats
-- Result: {
--   "name": "embedding_idx",
--   "field": "embedding",
--   "dimension": 768,
--   "vectors": 15420,
--   "metric": "cosine",
--   "quantization": "scalar",
--   "memory_bytes": 11842560,
--   "compression_ratio": 4.0,
--   "m": 16,
--   "ef_construction": 200
-- }

-- Check if an index is quantized
LET stats = VECTOR_INDEX_STATS("articles", "embedding_idx")
RETURN stats.quantization != "none"
-- Result: true

-- Calculate memory savings
LET stats = VECTOR_INDEX_STATS("articles", "embedding_idx")
-- Uncompressed size: 4 bytes per float32 component
LET full_memory = stats.vectors * stats.dimension * 4
LET savings = full_memory - stats.memory_bytes
RETURN {
  vectors: stats.vectors,
  compression: stats.compression_ratio,
  savings_mb: ROUND(savings / 1048576, 2)
}
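
The reported compression_ratio should agree with the same arithmetic; a sketch, assuming uncompressed float32 storage as in the savings calculation above:

-- Recompute the compression ratio from the raw float32 size
LET stats = VECTOR_INDEX_STATS("articles", "embedding_idx")
RETURN ROUND((stats.vectors * stats.dimension * 4) / stats.memory_bytes, 1)
-- Result: 4.0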

Choosing a Metric

Cosine is ideal for text embeddings (OpenAI, sentence-transformers) where direction matters more than magnitude. Euclidean works well for image embeddings or when magnitude is meaningful. Dot product is fastest but requires pre-normalized vectors.
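
The differences are easiest to see side by side. A sketch comparing one un-normalized pair under all three metrics; note how the raw dot product reflects magnitude, which is why "dot" should only be used on pre-normalized vectors:

-- One vector pair, three metrics
LET a = [3, 4]
LET b = [4, 3]
RETURN {
  cosine: VECTOR_DISTANCE(a, b, "cosine"),
  euclidean: VECTOR_DISTANCE(a, b, "euclidean"),
  dot: VECTOR_DISTANCE(a, b, "dot")
}
-- Result: { "cosine": 0.04, "euclidean": ~1.414, "dot": 24 }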

Practical Examples

-- Find similar articles to a query
FOR doc IN articles
  LET sim = VECTOR_SIMILARITY(doc.embedding, @query_embedding)
  FILTER sim > 0.8
  SORT sim DESC
  LIMIT 10
  RETURN { title: doc.title, score: sim }

-- Find products similar to a given product
LET product = (FOR p IN products FILTER p._key == @product_id RETURN p)[0]
FOR other IN products
  FILTER other._key != @product_id
  LET sim = VECTOR_SIMILARITY(other.embedding, product.embedding)
  SORT sim DESC
  LIMIT 5
  RETURN { name: other.name, similarity: ROUND(sim, 3) }

-- Combine vector search with filters
FOR doc IN documents
  FILTER doc.category == "technology"
  FILTER doc.published >= "2024-01-01"
  LET sim = VECTOR_SIMILARITY(doc.embedding, @query_vec)
  FILTER sim > 0.75
  SORT sim DESC
  LIMIT 20
  RETURN {
    title: doc.title,
    category: doc.category,
    score: ROUND(sim, 4)
  }

-- Find nearest neighbors using Euclidean distance
FOR item IN items
  LET dist = VECTOR_DISTANCE(item.features, @target_features, "euclidean")
  SORT dist ASC
  LIMIT 5
  RETURN { id: item._key, distance: dist }

-- Normalize vectors before storage
LET raw_embedding = @embedding_from_model
LET normalized = VECTOR_NORMALIZE(raw_embedding)
INSERT { content: @content, embedding: normalized } INTO documents
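
With embeddings stored normalized as above, the faster "dot" metric can stand in for cosine similarity. A sketch, assuming the @normalized_query bind parameter has itself been passed through VECTOR_NORMALIZE:

-- Dot-product search over pre-normalized embeddings
FOR doc IN documents
  -- assumes both sides are unit-length, so dot equals cosine similarity
  LET score = VECTOR_DISTANCE(doc.embedding, @normalized_query, "dot")
  SORT score DESC
  LIMIT 10
  RETURN { content: doc.content, score: ROUND(score, 3) }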

-- Semantic search with score threshold
FOR chunk IN document_chunks
  LET relevance = VECTOR_SIMILARITY(chunk.embedding, @question_embedding)
  FILTER relevance >= 0.7
  SORT relevance DESC
  LIMIT 5
  RETURN {
    text: chunk.text,
    source: chunk.source_doc,
    relevance: ROUND(relevance, 3)
  }

-- Multi-vector query (average of multiple embeddings)
LET query_vecs = @user_interest_embeddings
FOR product IN products
  LET scores = (
    FOR qv IN query_vecs
      RETURN VECTOR_SIMILARITY(product.embedding, qv)
  )
  LET avg_score = SUM(scores) / LENGTH(scores)
  SORT avg_score DESC
  LIMIT 10
  RETURN { name: product.name, match_score: avg_score }

Tips for Vector Queries

Performance
  • Create vector indexes on embedding fields
  • Apply non-vector filters first to reduce candidates
  • Use LIMIT to cap result size
  • Pre-normalize vectors for dot product
  • Large indexes (over ~10K vectors) use HNSW for ~100x faster search
Best Practices
  • Store embeddings as float arrays
  • Match dimension to your model (1536, 768, etc.)
  • Use cosine for most text embedding models
  • Test similarity thresholds for your use case (see the sketch below)
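
Thresholds that suit one embedding model can be too strict or too loose for another, so inspect real scores before fixing a cutoff. A sketch that samples the top of the score distribution:

-- Sample top scores to choose a sensible threshold
FOR doc IN articles
  LET sim = VECTOR_SIMILARITY(doc.embedding, @query_embedding)
  SORT sim DESC
  LIMIT 100
  RETURN ROUND(sim, 2)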

Quick Reference

Function                        Returns             Description
VECTOR_SIMILARITY(a, b)         Number (-1 to 1)    Cosine similarity between vectors
VECTOR_DISTANCE(a, b, metric)   Number              Distance using the given metric ("dot" returns the raw dot product)
VECTOR_NORMALIZE(v)             Array               Unit-length normalized vector
VECTOR_INDEX_STATS(coll, idx)   Object              Index stats: vectors, memory, quantization