1. The dimension to use in the synthetic data
2. The synthetic database
3. The synthetic queries
4. The distance function; we will use the squared L2, which preserves the order of L2 but is faster to compute.
5. The number of neighbors to retrieve
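The annotated definitions above correspond to a setup like the following sketch; the concrete sizes (dimension 8, a database of 100,000 vectors, 1,000 queries, k = 12) are assumptions for illustration, not values recovered from the original listing:

```julia
using SimilaritySearch

dim = 8                                              # 1: dimension of the synthetic data
db = MatrixDatabase(rand(Float32, dim, 100_000))     # 2: the synthetic database
queries = MatrixDatabase(rand(Float32, dim, 1_000))  # 3: the synthetic queries
dist = SqL2Distance()                                # 4: squared L2, order-preserving w.r.t. L2
k = 12                                               # 5: number of neighbors to retrieve
```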
Automatic Hyperparameter Optimization
by: Eric S. Téllez
This example explores different hyperparameter optimization strategies and the trade-offs they allow.
Computing ground truth
We will generate a ground truth with an exhaustive method.
gold_knns = searchbatch(ExhaustiveSearch(; db, dist), GenericContext(), queries, k)
Different hyperparameter optimization strategies
The hyperparameter optimization strategy and its objective are specified through a SearchGraphContext object, as follows:
G1 = SearchGraph(; dist, db)
C1 = SearchGraphContext(hyperparameters_callback=OptimizeParameters(MinRecall(0.99)))
buildtime1 = @elapsed index!(G1, C1)
This construction is optimized for a very high recall, which can be costly but also produces a high-quality index.
G2 = SearchGraph(; dist, db)
C2 = SearchGraphContext(hyperparameters_callback=OptimizeParameters(MinRecall(0.9)))
buildtime2 = @elapsed index!(G2, C2)
search, searchbatch, index!, append_items!, and push_item! accept context arguments.
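The listed functions take the context as their second argument. For instance, incremental insertions might look like the following sketch; the exact push_item! and append_items! signatures shown here are assumptions inferred from the calls in this example, not verified against the package documentation:

```julia
using SimilaritySearch

# hypothetical small setup; dim, dist, and db are assumptions for illustration
dim = 8
dist = SqL2Distance()
db = MatrixDatabase(rand(Float32, dim, 1_000))

G = SearchGraph(; dist, db)
C = SearchGraphContext(hyperparameters_callback=OptimizeParameters(MinRecall(0.9)))
index!(G, C)

push_item!(G, C, rand(Float32, dim))                         # insert a single vector
append_items!(G, C, MatrixDatabase(rand(Float32, dim, 10)))  # insert a batch
```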
Performance
Searching times
time1 = @elapsed knns1 = searchbatch(G1, C1, queries, k)
time2 = @elapsed knns2 = searchbatch(G2, C2, queries, k)
recall1 = macrorecall(gold_knns, knns1)
recall2 = macrorecall(gold_knns, knns2)
The recall is a score between 0 and 1, where values closer to 1 indicate better quality.
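macrorecall computes the macro-averaged recall: for each query, the fraction of gold-standard neighbors that the approximate search retrieved, averaged over all queries. A pure-Julia sketch of this computation (our own illustration, not the package's implementation):

```julia
# knns are (k × m) integer matrices: column j holds the k neighbor ids of query j
function my_macrorecall(gold::AbstractMatrix, approx::AbstractMatrix)
    m = size(gold, 2)
    s = 0.0
    for j in 1:m
        # per-query recall: retrieved gold neighbors over k
        s += length(intersect(view(gold, :, j), view(approx, :, j))) / size(gold, 1)
    end
    s / m  # macro average over queries
end

gold   = [1 4; 2 5; 3 6]  # two queries, k = 3
approx = [1 6; 2 5; 9 8]  # retrieves 2/3 of the gold neighbors for each query
my_macrorecall(gold, approx)  # returns 2/3 ≈ 0.667
```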
build time:
buildtime1: 7.405784697
buildtime2: 2.458756746
search time:
time1: 0.028004235
time2: 0.005745139
recall values:
recall1: 0.9525833333333302
recall2: 0.648249999999999
Here we observe smaller recalls than expected; this is an effect of the difference between the indexed elements (which are the objects used to perform the hyperparameter optimization) and the actual queries. In any case, we can appreciate the differences among them, showing that high-quality constructions may produce faster indexes; this is a consequence of the quality of the underlying structure. Contrary to this example, on higher-dimensional or larger datasets we will observe much higher construction times for high-quality constructions.
Optimizing an already created SearchGraph to achieve a desired quality
The hyperparameter optimization is performed in exponential stages while the SearchGraph is created; therefore, the current hyperparameters may need an update. To optimize an already created SearchGraph, we use optimize_index! instead of index!.
Context objects are especially relevant for construction since they encapsulate several hyperparameters; for searching, they also contain caches. A context can be shared among indexes; however, if the indexes have different sizes or you expect very different queries, it is better to maintain separate contexts.
optimize_index!(G1, C1, MinRecall(0.9))
optimize_index!(G2, C1, MinRecall(0.9))
After optimizing the index, its quality and speed change:
time1 = @elapsed knns1 = searchbatch(G1, C1, queries, k)
time2 = @elapsed knns2 = searchbatch(G2, C1, queries, k)
recall1 = macrorecall(gold_knns, knns1)
recall2 = macrorecall(gold_knns, knns2)
This results in the following performance figures:
build time:
buildtime1: 7.405784697
buildtime2: 2.458756746
search time:
time1: 0.014133835
time2: 0.006368638
recall values:
recall1: 0.6095000000000004
recall2: 0.6896666666666663
Please note that faster searches are expected from indexes created with higher quality requirements, but the higher construction cost must be paid. Also note that the recall values are lower than expected due, as explained before, to differences between the distributions (more precisely, between already-indexed points and unseen points).
Giving more realistic queries for optimization
The default optimization uses already-indexed objects to tune the hyperparameters, which is too optimistic in real applications since already-indexed objects are particularly easy to find. We can obtain a better optimization using external data:
optqueries = MatrixDatabase(rand(Float32, dim, 64))
optimize_index!(G1, C1, MinRecall(0.9); queries=optqueries)
optimize_index!(G2, C1, MinRecall(0.9); queries=optqueries)
After optimizing the index, its quality and speed change:
time1 = @elapsed knns1 = searchbatch(G1, C1, queries, k)
time2 = @elapsed knns2 = searchbatch(G2, C1, queries, k)
recall1 = macrorecall(gold_knns, knns1)
recall2 = macrorecall(gold_knns, knns2)
This results in the following performance figures:
build time:
buildtime1: 7.405784697
buildtime2: 2.458756746
search time:
time1: 0.020129612
time2: 0.0125475
recall values:
recall1: 0.9097499999999974
recall2: 0.9020833333333316
These scores are much closer to those we are looking for.
Be careful when calling optimize_index!(..., queries=queries) with your evaluation query set, since this can lead to overfitting on it.
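One way to reduce that risk is to tune on a held-out sample that is disjoint from the queries used to measure recall. The following sketch shows only the split logic with Julia's standard library; the commented lines indicate where the tuning call from this example would go (Q is an assumed dim × m query matrix):

```julia
using Random

m = 1_000                 # assumed number of available query vectors
perm = randperm(m)
tune_ids = perm[1:64]     # small sample reserved for hyperparameter tuning
eval_ids = perm[65:end]   # the rest is used to measure recall

# With the setup of this example, tuning would then use only the held-out columns:
# optqueries = MatrixDatabase(Q[:, tune_ids])
# optimize_index!(G1, C1, MinRecall(0.9); queries=optqueries)
```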
Environment and dependencies
Julia Version 1.10.11
Commit a2b11907d7b (2026-03-09 14:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (x86_64-apple-darwin24.0.0)
  CPU: 8 × Intel(R) Core(TM) i5-8257U CPU @ 1.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
  Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
  JULIA_NUM_THREADS = auto
  JULIA_PROJECT = @.
  JULIA_LOAD_PATH = @:@stdlib

Status `~/Research/SimilaritySearchDemos/Project.toml`
  [aaaa29a8] Clustering v0.15.8
  [944b1d66] CodecZlib v0.7.8
  [5ae59095] Colors v0.13.1
  [a93c6f00] DataFrames v1.8.1
  [c5bfea45] Embeddings v0.4.6
  [f67ccb44] HDF5 v0.17.2
  [916415d5] Images v0.26.2
  [b20bd276] InvertedFiles v0.9.2
⌅ [682c06a0] JSON v0.21.4
  [23fbe1c1] Latexify v0.16.10
  [eb30cadb] MLDatasets v0.7.21
  [06eb3307] ManifoldLearning v0.9.0
⌃ [ca7969ec] PlotlyLight v0.11.0
  [27ebfcd6] Primes v0.5.7
  [ca7ab67e] SimSearchManifoldLearning v0.4.0
  [053f045d] SimilaritySearch v0.14.3
⌅ [2913bbd2] StatsBase v0.33.21
  [7f6f6c8a] TextSearch v0.20.0
Info: Packages marked with ⌃ and ⌅ have new versions available. Those with ⌃ may be upgradable, but those with ⌅ are restricted by compatibility constraints from upgrading. To see why, use `status --outdated`.