using SimilaritySearch, Markdown
Using the SimilaritySearch package
by: Eric S. Téllez
This short tutorial shows a minimal example of working with SimilaritySearch. The package accepts several options that are left at their defaults here. While this should be enough for many purposes, you are invited to see the rest of the tutorials to take advantage of other features.
MatrixDatabase is a required wrapper that tells SimilaritySearch how to access the underlying objects, since the package supports different kinds of objects. In this setup, each column is an object and is accessed through views using the MatrixDatabase. Since the backend doesn’t support appends or pushes, the index can be seen as a static index.
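The idea of treating each column as an object and reaching it through a view can be illustrated in plain Julia, independently of the package. The snippet below is only a sketch of that idea: it uses Base functions (`view`, `eachcol`) and makes no claim about how MatrixDatabase is implemented internally. The names `X`, `obj`, and `objects` are made up for the illustration.

```julia
# Plain-Julia illustration; no SimilaritySearch calls are used here.
X = Float32[1 2 3;
            4 5 6]            # 2×3 matrix: three objects of dimension 2

# A view references column 2 without copying its data.
obj = view(X, :, 2)

# Writing to the parent matrix is visible through the view,
# which confirms that no copy was made.
X[1, 2] = 20

# Iterating over columns also yields views, one per object.
objects = collect(eachcol(X))
```

Because views share memory with the parent matrix, accessing objects this way avoids allocating a new vector per object.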
function synthetic_benchmark(n, m, dim)
    db = MatrixDatabase(randn(Float32, dim, n))
    queries = MatrixDatabase(randn(Float32, dim, m))
    dist = SqL2Distance()
    (; db, queries, dist)
end
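The `dist` field above holds a squared Euclidean distance (`SqL2Distance`). As a rough illustration of what such a distance computes, here is a minimal plain-Julia sketch. The helper `sq_l2` and the vectors `u`, `v`, `w` are hypothetical names introduced for this example; they are not part of SimilaritySearch or Distances.jl.

```julia
# Hypothetical helper; not part of SimilaritySearch or Distances.jl.
function sq_l2(a::AbstractVector, b::AbstractVector)
    s = zero(promote_type(eltype(a), eltype(b)))
    for i in eachindex(a, b)   # errors if the vectors differ in length
        d = a[i] - b[i]
        s += d * d
    end
    s
end

u = Float32[0, 0]
v = Float32[3, 4]
w = Float32[1, 1]

sq_l2(u, v)        # 25.0f0, the square of the Euclidean distance 5
sqrt(sq_l2(u, v))  # 5.0f0
```

Since the square root is monotone, ranking neighbors by the squared distance gives the same order as ranking by the Euclidean distance, while skipping the square root for every comparison.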
The index can use any distance function described in SimilaritySearch and Distances.jl, and in fact any SemiMetric as described in the latter package. The index is constructed as follows:
B = synthetic_benchmark(3000, 50, 2)
G = SearchGraph(; B.dist, B.db)
ctx = SearchGraphContext()
index!(G, ctx)
This displays a lot of information in the console, because the hyperparameters of the index are adjusted as construction advances. The default optimization tries to reach a recall of 0.9, which is a typical tradeoff between quality and speed. Once the index is created, it can solve nearest neighbor queries:
k = 16
I, D = searchbatch(G, ctx, B.queries, k)
The searchbatch function takes a set of queries and solves them using the given index. I is a matrix of identifiers in db, with one column per query, and D contains their corresponding distances.
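To make the meaning of the identifier matrix and of recall concrete, the following plain-Julia sketch computes exact neighbors by brute force and measures how many of them an approximate result contains. Both `bruteforce_knn` and `recall` are hypothetical helpers written for this illustration, not SimilaritySearch functions. The sketch assumes the same layout as above (one object per column, one result column per query) and a squared Euclidean distance.

```julia
# Hypothetical helpers written in plain Julia; they are not SimilaritySearch functions.

# Exact k nearest neighbors of each query column among the columns of X,
# using the squared Euclidean distance.
function bruteforce_knn(X::AbstractMatrix, Q::AbstractMatrix, k::Integer)
    n, m = size(X, 2), size(Q, 2)
    ids = Matrix{Int32}(undef, k, m)
    dists = Matrix{Float32}(undef, k, m)
    for j in 1:m
        q = view(Q, :, j)
        d = [sum(abs2, view(X, :, i) .- q) for i in 1:n]
        nearest = partialsortperm(d, 1:k)   # indices of the k smallest distances, sorted
        ids[:, j] .= nearest
        dists[:, j] .= d[nearest]
    end
    ids, dists
end

# Mean fraction of exact neighbors found in the approximate result, column by column.
function recall(approx_ids::AbstractMatrix, exact_ids::AbstractMatrix)
    m = size(exact_ids, 2)
    hits = sum(length(intersect(view(approx_ids, :, j), view(exact_ids, :, j))) for j in 1:m)
    hits / (size(exact_ids, 1) * m)
end

# Tiny dataset: four points on a line and two queries.
X = Float32[0 1 5 6;
            0 0 0 0]
Q = Float32[0.2 5.4;
            0.0 0.0]
exact_ids, exact_dists = bruteforce_knn(X, Q, 2)
```

With the variables of this tutorial, one could presumably call `bruteforce_knn(B.db.matrix, B.queries.matrix, k)` and then `recall(I, exact_ids)` to check how close the result is to the 0.9 target; the `.matrix` field is the one accessed in the plotting code of this tutorial. Brute force costs time proportional to the dataset size for every query, which is what the index is meant to avoid.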
Visualizing what we just did
using Plots
scatter(B.db.matrix[1, :], B.db.matrix[2, :], size=(600, 600), color=:cyan, ma=0.3, a=0.3, ms=1, msw=0, label="")
for c in eachcol(I)
    R = B.db.matrix[:, c]
    @views scatter!(R[1, :], R[2, :], m=:diamond, ma=0.3, a=0.3, color=:auto, ms=2, msw=0, label="")
end
@views scatter!(B.queries.matrix[1, :], B.queries.matrix[2, :], color=:black, m=:star, ma=0.5, a=0.5, ms=4, msw=0, label="")
plot!()
Cyan points identify the dataset while stars are query points. The nearest neighbor points are colored automatically and can repeat, but they lie quite close to the query points; in dense areas they even hide them.
Environment and dependencies
Julia Version 1.10.9
Commit 5595d20a287 (2025-03-10 12:51 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 64 × Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, cascadelake)
Threads: 64 default, 0 interactive, 32 GC (on 64 virtual cores)
Environment:
JULIA_PROJECT = .
JULIA_NUM_THREADS = auto
JULIA_LOAD_PATH = @:@stdlib
Status `~/sites/SimilaritySearchDemos/Project.toml`
[aaaa29a8] Clustering v0.15.8
[944b1d66] CodecZlib v0.7.8
[a93c6f00] DataFrames v1.7.0
[c5bfea45] Embeddings v0.4.6
[f67ccb44] HDF5 v0.17.2
[b20bd276] InvertedFiles v0.8.0 `~/.julia/dev/InvertedFiles`
[682c06a0] JSON v0.21.4
[23fbe1c1] Latexify v0.16.6
[eb30cadb] MLDatasets v0.7.18
[06eb3307] ManifoldLearning v0.9.0
⌃ [ca7969ec] PlotlyLight v0.11.0
[91a5bcdd] Plots v1.40.11
[27ebfcd6] Primes v0.5.7
[ca7ab67e] SimSearchManifoldLearning v0.3.0 `~/.julia/dev/SimSearchManifoldLearning`
[053f045d] SimilaritySearch v0.12.0 `~/.julia/dev/SimilaritySearch`
⌅ [2913bbd2] StatsBase v0.33.21
[f3b207a7] StatsPlots v0.15.7
[7f6f6c8a] TextSearch v0.19.0 `~/.julia/dev/TextSearch`
Info Packages marked with ⌃ and ⌅ have new versions available. Those with ⌃ may be upgradable, but those with ⌅ are restricted by compatibility constraints from upgrading. To see why use `status --outdated`