using SimilaritySearch
Incremental construction with SearchGraph
by: Eric S. Téllez
For incremental construction we need a database backend that supports incremental insertions. Currently, there are two backends for this: DynamicMatrixDatabase
and VectorDatabase
:
DynamicMatrixDatabase
: produces matrix like databases (all objects having the same number of components).VectorDatabase
: A generic conainer of objects, objects can be of any kind.
= 8
dim = DynamicMatrixDatabase(Float32, dim) # or VectorDatabase(Vector{Float32})
db = L1Distance() dist
it can use any distance function described in SimilaritySearch
and Distances.jl
, and in fact any SemiMetric
as described in the later package. The index construction is made as follows:
= SearchGraph(; dist, db)
G = SearchGraphContext() ctx
instead of index!
we can use push_item!
and append_items!
functions
for _ in 1:10^4
push_item!(G, ctx, rand(Float32, dim)) # push_item! inserts one item at a time
end
we can also use append_items!
if we have a batch of items
append_items!(G, ctx, MatrixDatabase(rand(Float32, dim, 10^4))) # append_items! inserts many items at once
Note that we used a MatrixDatabase
to wrap the matrix to be inserted since it will be copied into the index. Now we have a populated index.
@assert length(G) == 20_000
this will display a lot of information in the console, since as construction advances the hyperparameters of the index are adjusted.
Once the index is created, the index can solve nearest neighbor queries
1= MatrixDatabase(rand(dim, 30))
Q 2= 5
k 3= searchbatch(G, ctx, Q, k)
I, D display((typeof(I), typeof(D)))
display((size(I), size(D)))
- 1
- Creates the query
- 2
- The number of nearest neighbors to retrieve
- 3
- Solve queries, returns neighbor identifiers and distances.
(Matrix{Int32}, Matrix{Float32})
((5, 30), (5, 30))
Environment and dependencies
Julia Version 1.10.9
Commit 5595d20a287 (2025-03-10 12:51 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 64 × Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, cascadelake)
Threads: 64 default, 0 interactive, 32 GC (on 64 virtual cores)
Environment:
JULIA_PROJECT = .
JULIA_NUM_THREADS = auto
JULIA_LOAD_PATH = @:@stdlib
Status `~/sites/SimilaritySearchDemos/Project.toml`
[aaaa29a8] Clustering v0.15.8
[944b1d66] CodecZlib v0.7.8
[a93c6f00] DataFrames v1.7.0
[c5bfea45] Embeddings v0.4.6
[f67ccb44] HDF5 v0.17.2
[b20bd276] InvertedFiles v0.8.0 `~/.julia/dev/InvertedFiles`
[682c06a0] JSON v0.21.4
[23fbe1c1] Latexify v0.16.6
[eb30cadb] MLDatasets v0.7.18
[06eb3307] ManifoldLearning v0.9.0
⌃ [ca7969ec] PlotlyLight v0.11.0
[91a5bcdd] Plots v1.40.11
[27ebfcd6] Primes v0.5.7
[ca7ab67e] SimSearchManifoldLearning v0.3.0 `~/.julia/dev/SimSearchManifoldLearning`
[053f045d] SimilaritySearch v0.12.0 `~/.julia/dev/SimilaritySearch`
⌅ [2913bbd2] StatsBase v0.33.21
[f3b207a7] StatsPlots v0.15.7
[7f6f6c8a] TextSearch v0.19.0 `~/.julia/dev/TextSearch`
Info Packages marked with ⌃ and ⌅ have new versions available. Those with ⌃ may be upgradable, but those with ⌅ are restricted by compatibility constraints from upgrading. To see why use `status --outdated`