1. The dimension to use in the synthetic data
2. The synthetic database
3. The synthetic queries
4. The distance function; we will use the squared L2, which preserves the order of L2 but is faster to compute.
5. The number of neighbors to retrieve
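The annotated definitions above correspond to a setup like the following sketch; the concrete sizes (dimension 8, a database of 100,000 vectors, 1,000 queries, k = 12) are assumptions for illustration, not values recovered from the original listing:

```julia
using SimilaritySearch

dim = 8                                              # 1: dimension of the synthetic data
db = MatrixDatabase(rand(Float32, dim, 100_000))     # 2: the synthetic database
queries = MatrixDatabase(rand(Float32, dim, 1_000))  # 3: the synthetic queries
dist = SqL2Distance()                                # 4: squared L2, order-preserving w.r.t. L2
k = 12                                               # 5: number of neighbors to retrieve
```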
Automatic Hyperparameter Optimization
by: Eric S. Téllez
This example explores different hyperparameter optimization strategies and the trade-offs they allow.
Computing ground truth
We will generate a ground truth with an exhaustive method.
gold_knns = searchbatch(ExhaustiveSearch(; db, dist), GenericContext(), queries, k)
Different hyperparameter optimization strategies
The hyperparameter optimization strategy and its objective are specified through a SearchGraphContext object, as follows:
G1 = SearchGraph(; dist, db)
C1 = SearchGraphContext(hyperparameters_callback=OptimizeParameters(MinRecall(0.99)))
buildtime1 = @elapsed index!(G1, C1)
This construction is optimized for a very high recall, which can be costly but also produces a high-quality index.
G2 = SearchGraph(; dist, db)
C2 = SearchGraphContext(hyperparameters_callback=OptimizeParameters(MinRecall(0.9)))
buildtime2 = @elapsed index!(G2, C2)
search, searchbatch, index!, append_items!, and push_item! accept context arguments.
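The listed functions take the context as their second argument. For instance, incremental insertions might look like the following sketch; the exact push_item! and append_items! signatures shown here are assumptions inferred from the calls in this example, not verified against the package documentation:

```julia
using SimilaritySearch

# hypothetical small setup; dim, dist, and db are assumptions for illustration
dim = 8
dist = SqL2Distance()
db = MatrixDatabase(rand(Float32, dim, 1_000))

G = SearchGraph(; dist, db)
C = SearchGraphContext(hyperparameters_callback=OptimizeParameters(MinRecall(0.9)))
index!(G, C)

push_item!(G, C, rand(Float32, dim))                         # insert a single vector
append_items!(G, C, MatrixDatabase(rand(Float32, dim, 10)))  # insert a batch
```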
Performance
Searching times
time1 = @elapsed knns1 = searchbatch(G1, C1, queries, k)
time2 = @elapsed knns2 = searchbatch(G2, C2, queries, k)
recall1 = macrorecall(gold_knns, knns1)
recall2 = macrorecall(gold_knns, knns2)
The recall is a score between 0 and 1, where values closer to 1 indicate better quality.
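macrorecall computes the macro-averaged recall: for each query, the fraction of gold-standard neighbors that the approximate search retrieved, averaged over all queries. A pure-Julia sketch of this computation (our own illustration, not the package's implementation):

```julia
# knns are (k × m) integer matrices: column j holds the k neighbor ids of query j
function my_macrorecall(gold::AbstractMatrix, approx::AbstractMatrix)
    m = size(gold, 2)
    s = 0.0
    for j in 1:m
        # per-query recall: retrieved gold neighbors over k
        s += length(intersect(view(gold, :, j), view(approx, :, j))) / size(gold, 1)
    end
    s / m  # macro average over queries
end

gold   = [1 4; 2 5; 3 6]  # two queries, k = 3
approx = [1 6; 2 5; 9 8]  # retrieves 2/3 of the gold neighbors for each query
my_macrorecall(gold, approx)  # returns 2/3 ≈ 0.667
```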
build time:
buildtime1: 7.405784697
buildtime2: 2.458756746
search time:
time1: 0.028004235
time2: 0.005745139
recall values:
recall1: 0.9525833333333302
recall2: 0.648249999999999
Here we observe smaller recalls than expected; this is an effect of the difference between the indexed elements (which are the objects used to perform the hyperparameter optimization) and the actual queries. In any case, we can appreciate the differences among them, showing that high-quality constructions may produce faster indexes; this is a consequence of the quality of the underlying structure. Contrary to this example, on higher-dimensional or larger datasets we will observe much higher construction times for high-quality constructions.
Optimizing an already created SearchGraph to achieve a desired quality
The hyperparameter optimization is performed in exponential stages while the SearchGraph is created; therefore, the current hyperparameters may need an update. To optimize an already created SearchGraph, we use optimize_index! instead of index!.
Context objects are especially relevant for construction since they encapsulate several hyperparameters; for searching, they also contain caches. A context can be shared among indexes; however, if the indexes have different sizes or you expect very different queries, it is better to maintain separate contexts.
optimize_index!(G1, C1, MinRecall(0.9))
optimize_index!(G2, C1, MinRecall(0.9))
After optimizing the index, its quality and speed change:
time1 = @elapsed knns1 = searchbatch(G1, C1, queries, k)
time2 = @elapsed knns2 = searchbatch(G2, C1, queries, k)
recall1 = macrorecall(gold_knns, knns1)
recall2 = macrorecall(gold_knns, knns2)
This results in the following performance figures:
build time:
buildtime1: 7.405784697
buildtime2: 2.458756746
search time:
time1: 0.014133835
time2: 0.006368638
recall values:
recall1: 0.6095000000000004
recall2: 0.6896666666666663
Please note that faster searches are expected from indexes created with higher quality requirements, but the higher construction cost must be paid. Also note that the recall values are lower than expected due, as explained before, to differences between the distributions (more precisely, between already-indexed points and unseen points).
Giving more realistic queries for optimization
The default optimization uses already-indexed objects to tune the hyperparameters, which is too optimistic in real applications since already-indexed objects are particularly easy to find. We can obtain a better optimization using external data:
optqueries = MatrixDatabase(rand(Float32, dim, 64))
optimize_index!(G1, C1, MinRecall(0.9); queries=optqueries)
optimize_index!(G2, C1, MinRecall(0.9); queries=optqueries)
After optimizing the index, its quality and speed change:
time1 = @elapsed knns1 = searchbatch(G1, C1, queries, k)
time2 = @elapsed knns2 = searchbatch(G2, C1, queries, k)
recall1 = macrorecall(gold_knns, knns1)
recall2 = macrorecall(gold_knns, knns2)
This results in the following performance figures:
build time:
buildtime1: 7.405784697
buildtime2: 2.458756746
search time:
time1: 0.020129612
time2: 0.0125475
recall values:
recall1: 0.9097499999999974
recall2: 0.9020833333333316
These scores are much closer to those we are looking for.
Be careful when calling optimize_index!(..., queries=queries) with your evaluation query set, since this can lead to overfitting on it.
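One way to reduce that risk is to tune on a held-out sample that is disjoint from the queries used to measure recall. The following sketch shows only the split logic with Julia's standard library; the commented lines indicate where the tuning call from this example would go (Q is an assumed dim × m query matrix):

```julia
using Random

m = 1_000                 # assumed number of available query vectors
perm = randperm(m)
tune_ids = perm[1:64]     # small sample reserved for hyperparameter tuning
eval_ids = perm[65:end]   # the rest is used to measure recall

# With the setup of this example, tuning would then use only the held-out columns:
# optqueries = MatrixDatabase(Q[:, tune_ids])
# optimize_index!(G1, C1, MinRecall(0.9); queries=optqueries)
```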
Environment and dependencies
Julia Version 1.10.11
Commit a2b11907d7b (2026-03-09 14:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (x86_64-apple-darwin24.0.0)
  CPU: 8 × Intel(R) Core(TM) i5-8257U CPU @ 1.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
  Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
  JULIA_NUM_THREADS = auto
  JULIA_PROJECT = @.
  JULIA_LOAD_PATH = @:@stdlib

Status `~/Research/SimilaritySearchDemos/Project.toml`
  [aaaa29a8] Clustering v0.15.8
  [944b1d66] CodecZlib v0.7.8
  [5ae59095] Colors v0.13.1
  [a93c6f00] DataFrames v1.8.1
  [c5bfea45] Embeddings v0.4.6
  [f67ccb44] HDF5 v0.17.2
  [916415d5] Images v0.26.2
  [b20bd276] InvertedFiles v0.9.2
⌅ [682c06a0] JSON v0.21.4
  [23fbe1c1] Latexify v0.16.10
  [eb30cadb] MLDatasets v0.7.21
  [06eb3307] ManifoldLearning v0.9.0
⌃ [ca7969ec] PlotlyLight v0.11.0
  [27ebfcd6] Primes v0.5.7
  [ca7ab67e] SimSearchManifoldLearning v0.4.0
  [053f045d] SimilaritySearch v0.14.3
⌅ [2913bbd2] StatsBase v0.33.21
  [7f6f6c8a] TextSearch v0.20.0
Info: Packages marked with ⌃ and ⌅ have new versions available. Those with ⌃ may be upgradable, but those with ⌅ are restricted by compatibility constraints from upgrading. To see why, use `status --outdated`.