Incremental construction

Usage demonstrations of `SimilaritySearch` with synthetic and real world data

Incremental construction with SearchGraph

By Eric S. Téllez

using SimilaritySearch

For incremental construction we need a database backend that supports incremental insertions. Currently, there are two backends for this: DynamicMatrixDatabase and VectorDatabase

  • DynamicMatrixDatabase: produces matrix like databases (all objects having the same number of components)

  • VectorDatabase: A generic conainer of objects, objects can be of any kind

const dim = 8

db = DynamicMatrixDatabase(Float32, dim) # or VectorDatabase(Vector{Float32})
dist = L1Distance()
SimilaritySearch.L1Distance()

it can use any distance function described in SimilaritySearch and Distances.jl, and in fact any SemiMetric as described in the later package. The index construction is made as follows

G = SearchGraph(; dist, db, verbose=false)
SimilaritySearch.SearchGraph{SimilaritySearch.L1Distance, SimilaritySearch.DynamicMatrixDatabase{Float32, 8}, SimilaritySearch.AdjacencyLists.AdjacencyList{UInt32}, SimilaritySearch.BeamSearch}
  dist: SimilaritySearch.L1Distance SimilaritySearch.L1Distance()
  db: SimilaritySearch.DynamicMatrixDatabase{Float32, 8}
  adj: SimilaritySearch.AdjacencyLists.AdjacencyList{UInt32}
  hints: Array{Int32}((0,)) Int32[]
  search_algo: SimilaritySearch.BeamSearch
  len: Base.RefValue{Int64}
  verbose: Bool false

instead of index! we can use push_item! and append_items! functions

for _ in 1:10^4
    push_item!(G, rand(Float32, dim))  # push_item! inserts one item at a time
end

append_items!(G, MatrixDatabase(rand(Float32, dim, 10^4))) # append_items! inserts many items at once
SimilaritySearch.SearchGraph{SimilaritySearch.L1Distance, SimilaritySearch.DynamicMatrixDatabase{Float32, 8}, SimilaritySearch.AdjacencyLists.AdjacencyList{UInt32}, SimilaritySearch.BeamSearch}
  dist: SimilaritySearch.L1Distance SimilaritySearch.L1Distance()
  db: SimilaritySearch.DynamicMatrixDatabase{Float32, 8}
  adj: SimilaritySearch.AdjacencyLists.AdjacencyList{UInt32}
  hints: Array{Int32}((100,)) Int32[190, 354, 392, 429, 442, 455, 502, 621, 709, 748  …  3317, 3354, 3357, 3361, 3368, 3381, 3404, 3414, 3415, 3462]
  search_algo: SimilaritySearch.BeamSearch
  len: Base.RefValue{Int64}
  verbose: Bool false

Note that we used a MatrixDatabase to wrap the matrix to be inserted since it will be copied into the index.

Now we have a populated index

@assert length(G) == 20_000

this will display a lot of information in the console, since as construction advances the hyperparameters of the index are adjusted. The default optimization takes into account both quality and speed, and tries to adjust to take the best of both. See ParetoRecall in docs.

Once the index is created, the index can solve nearest neighbor queries

Q = MatrixDatabase(rand(dim, 30))
k = 5
I, D = searchbatch(G, Q, k)
D
5×30 Matrix{Float32}:
 0.502752  0.52576   0.726595  0.401183  0.464119  0.444866  0.756742  0.456873  0.40771   0.367571  0.370388  0.507813  0.744219  0.876406  0.836829  0.528599  0.545427  0.661519  0.5463    0.399571  0.745745  0.409689  0.729336  0.539553  0.582203  0.724071  0.609153  0.335656  0.638144  0.625584
 0.553149  0.684503  0.75752   0.433573  0.591741  0.479094  0.853407  0.499817  0.532571  0.609003  0.475572  0.515589  0.785361  0.906292  0.847161  0.536406  0.766627  0.664324  0.589698  0.461357  0.766175  0.601051  0.785035  0.669979  0.608382  0.724944  0.661835  0.479266  0.784063  0.731361
 0.660131  0.695987  0.785675  0.49282   0.649916  0.635248  0.948492  0.767378  0.544482  0.644764  0.503525  0.542498  0.805892  0.922792  0.852931  0.546636  0.789724  0.672152  0.714921  0.542695  0.78998   0.635067  0.78868   0.698604  0.627333  0.738631  0.740231  0.633751  0.806445  0.812788
 0.68536   0.70219   0.823099  0.724192  0.685036  0.652608  0.949991  0.781874  0.604717  0.699274  0.608819  0.567185  0.821345  0.930386  0.866786  0.636144  0.815309  0.683346  0.842521  0.626694  0.830637  0.694602  0.829939  0.724968  0.636138  0.878802  0.741749  0.636044  0.867437  0.924791
 0.694604  0.707156  0.889462  0.738963  0.704689  0.719938  0.953765  0.782926  0.725796  0.733226  0.649799  0.601782  0.82788   0.956573  0.893695  0.706392  0.852041  0.691399  0.853528  0.65921   0.840649  0.709495  0.843366  0.729257  0.669603  0.893223  0.784865  0.685117  0.899927  0.943416

Dependencies

Status `~/Research/SimilaritySearchDemos/site-src/Project.toml`
⌃ [91a5bcdd] Plots v1.15.2
  [053f045d] SimilaritySearch v0.10.8
Info Packages marked with ⌃ have new versions available and may be upgradable.
CC BY-SA 4.0 Eric S. Tellez . Last modified: March 14, 2022. Website built with Franklin.jl and the Julia programming language.