Incremental construction with SearchGraph

by: Eric S. Téllez

using SimilaritySearch

For incremental construction we need a database backend that supports incremental insertions. Currently, there are two backends for this: DynamicMatrixDatabase and VectorDatabase:

dim = 8
db = DynamicMatrixDatabase(Float32, dim) # or VectorDatabase(Vector{Float32})
dist = L1Distance()

it can use any distance function described in SimilaritySearch and Distances.jl, and in fact any SemiMetric as described in the later package. The index construction is made as follows:

G = SearchGraph(; dist, db)
ctx = SearchGraphContext()

instead of index! we can use push_item! and append_items! functions

for _ in 1:10^4
    push_item!(G, ctx, rand(Float32, dim))  # push_item! inserts one item at a time
end

we can also use append_items! if we have a batch of items

append_items!(G, ctx, MatrixDatabase(rand(Float32, dim, 10^4))) # append_items! inserts many items at once

Note that we used a MatrixDatabase to wrap the matrix to be inserted since it will be copied into the index. Now we have a populated index.

@assert length(G) == 20_000

this will display a lot of information in the console, since as construction advances the hyperparameters of the index are adjusted.

Once the index is created, the index can solve nearest neighbor queries


1Q = MatrixDatabase(rand(dim, 30))
2k = 5
3I, D = searchbatch(G, ctx, Q, k)
display((typeof(I), typeof(D)))
display((size(I), size(D)))
1
Creates the query
2
The number of nearest neighbors to retrieve
3
Solve queries, returns neighbor identifiers and distances.
(Matrix{Int32}, Matrix{Float32})
((5, 30), (5, 30))

Environment and dependencies

Julia Version 1.10.9
Commit 5595d20a287 (2025-03-10 12:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, cascadelake)
Threads: 64 default, 0 interactive, 32 GC (on 64 virtual cores)
Environment:
  JULIA_PROJECT = .
  JULIA_NUM_THREADS = auto
  JULIA_LOAD_PATH = @:@stdlib
Status `~/sites/SimilaritySearchDemos/Project.toml`
  [aaaa29a8] Clustering v0.15.8
  [944b1d66] CodecZlib v0.7.8
  [a93c6f00] DataFrames v1.7.0
  [c5bfea45] Embeddings v0.4.6
  [f67ccb44] HDF5 v0.17.2
  [b20bd276] InvertedFiles v0.8.0 `~/.julia/dev/InvertedFiles`
  [682c06a0] JSON v0.21.4
  [23fbe1c1] Latexify v0.16.6
  [eb30cadb] MLDatasets v0.7.18
  [06eb3307] ManifoldLearning v0.9.0
⌃ [ca7969ec] PlotlyLight v0.11.0
  [91a5bcdd] Plots v1.40.11
  [27ebfcd6] Primes v0.5.7
  [ca7ab67e] SimSearchManifoldLearning v0.3.0 `~/.julia/dev/SimSearchManifoldLearning`
  [053f045d] SimilaritySearch v0.12.0 `~/.julia/dev/SimilaritySearch`
⌅ [2913bbd2] StatsBase v0.33.21
  [f3b207a7] StatsPlots v0.15.7
  [7f6f6c8a] TextSearch v0.19.0 `~/.julia/dev/TextSearch`
Info Packages marked with ⌃ and ⌅ have new versions available. Those with ⌃ may be upgradable, but those with ⌅ are restricted by compatibility constraints from upgrading. To see why use `status --outdated`