Automatic Hyperparameter Optimization

Usage demonstrations of `SimilaritySearch` with synthetic and real-world data


By Eric S. Téllez

using SimilaritySearch

const dim = 16
16

This example compares different hyperparameter optimization strategies, which allow different trade-offs between construction time, search speed, and result quality.

X = rand(Float32, dim, 10^5)  # database: 10^5 random points of dimension dim
Q = rand(Float32, dim, 10^3)  # query set: 10^3 random points
k = 10                        # number of nearest neighbors to retrieve

We will use the squared L2 distance, which preserves the order of L2 but is faster to compute.

dist = SqL2Distance()
db = MatrixDatabase(X)
queries = MatrixDatabase(Q)
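As a quick check of this claim, the neighbor ordering induced by the squared L2 distance coincides with that of L2, since squaring is monotone for non-negative values. The following sketch assumes `L2Distance` and `evaluate` are available in `SimilaritySearch` (they are not used elsewhere in this demo):

u = X[:, 1]
d_sq = [evaluate(SqL2Distance(), u, X[:, i]) for i in 2:101]  # squared L2 to 100 other points
d_l2 = [evaluate(L2Distance(), u, X[:, i]) for i in 2:101]    # plain L2 to the same points
sortperm(d_sq) == sortperm(d_l2)  # true: both distances rank the candidates identically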

Computing ground truth

verbose = false
goldI, goldD = searchbatch(ExhaustiveSearch(; db, dist), queries, k)  # `ExhaustiveSearch` solves with brute force
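Assuming the batch layout used by `searchbatch` in this version, `goldI` and `goldD` are k × m matrices: column j holds the identifiers and distances of the k nearest neighbors of the j-th query, with distances in increasing order. For instance:

size(goldI)   # (10, 1000): k rows, one column per query (under the layout assumed above)
goldI[:, 1]   # identifiers of the 10 nearest neighbors of the first query
goldD[:, 1]   # the corresponding squared distances, sorted increasingly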

Different hyperparameter optimization strategies

The hyperparameter optimization strategy and its objective are specified with a `SearchGraphCallbacks` object, as follows:

G1 = SearchGraph(; dist, db=MatrixDatabase(X), verbose)
callbacks1 = SearchGraphCallbacks(ParetoRecall())  # ParetoRecall is the default, and it will be used unless you change it
@elapsed index!(G1; callbacks=callbacks1)
0.388286811

Using `ParetoRadius`: this should be faster since it does not need the costly recall computation, but it can be easily fooled by the distance distribution.

G2 = SearchGraph(; dist, db=MatrixDatabase(X), verbose)
callbacks2 = SearchGraphCallbacks(ParetoRadius())  # uses distances instead of recall; faster to compute but lower quality
@elapsed index!(G2; callbacks=callbacks2)
0.950074613

Using `MinRecall`: it ensures a minimum quality on a small validation set.

G3 = SearchGraph(; dist, db=MatrixDatabase(X), verbose)
callbacks3 = SearchGraphCallbacks(MinRecall(0.95))
@elapsed index!(G3; callbacks=callbacks3)
0.660934935

`index!`, `append_items!`, and `push_item!` accept callbacks, as sketched below.
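A minimal sketch of incremental construction; it assumes that `DynamicMatrixDatabase(Float32, dim)` creates an empty growable database and that `append_items!` and `push_item!` take the same `callbacks` keyword as `index!` (check these assumptions against the version you are running):

Gdyn = SearchGraph(; dist, db=DynamicMatrixDatabase(Float32, dim), verbose)  # starts empty (assumed constructor)
callbacksdyn = SearchGraphCallbacks(MinRecall(0.9))
append_items!(Gdyn, MatrixDatabase(X); callbacks=callbacksdyn)  # bulk insertion with online tuning (assumed keyword)
push_item!(Gdyn, rand(Float32, dim))                            # single-item insertion with default callbacks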

Performance

Search times

@time I1, D1 = searchbatch(G1, queries, k)
@time I2, D2 = searchbatch(G2, queries, k)
@time I3, D3 = searchbatch(G3, queries, k)
  0.001746 seconds (2.01 k allocations: 109.859 KiB)
  0.001166 seconds (2.01 k allocations: 109.859 KiB)
  0.002012 seconds (2.01 k allocations: 109.859 KiB)

The recall (from 0 to 1, where 1 is the best):

@show macrorecall(goldI, I1)
@show macrorecall(goldI, I2)
@show macrorecall(goldI, I3)
macrorecall(goldI, I1) = 0.7939999999999962
macrorecall(goldI, I2) = 0.3682999999999996
macrorecall(goldI, I3) = 0.8744999999999933

Here we can see smaller recalls than expected, but we can also appreciate the differences among the strategies.
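For reference, `macrorecall` is just the per-query recall averaged over all queries. A minimal re-implementation for clarity (it assumes the k × m matrix layout of `searchbatch` and is not a replacement for the function exported by `SimilaritySearch`):

using Statistics: mean

function myrecall(gold::AbstractMatrix, found::AbstractMatrix)
    # per-query recall: fraction of the k gold identifiers recovered, averaged over all queries
    mean(length(intersect(Set(view(gold, :, j)), Set(view(found, :, j)))) / size(gold, 1) for j in 1:size(gold, 2))
end

myrecall(goldI, I1) ≈ macrorecall(goldI, I1)  # should agree with the values shown above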

Optimizing SearchGraph without inserting

The hyperparameter optimization is performed in exponential stages while the `SearchGraph` is being built, and therefore the current hyperparameters may need an update once construction finishes. To optimize an already created `SearchGraph`, we use `optimize!` instead of `index!`:

optimize!(G1, ParetoRecall())
optimize!(G2, ParetoRadius())
optimize!(G3, MinRecall(0.95))

@time I1, D1 = searchbatch(G1, queries, k)
@time I2, D2 = searchbatch(G2, queries, k)
@time I3, D3 = searchbatch(G3, queries, k)
  0.001568 seconds (2.01 k allocations: 109.734 KiB)
  0.001135 seconds (2.01 k allocations: 109.734 KiB)
  0.002298 seconds (2.01 k allocations: 109.734 KiB)

The recall (from 0 to 1, where 1 is the best):

@show macrorecall(goldI, I1)
@show macrorecall(goldI, I2)
@show macrorecall(goldI, I3)
macrorecall(goldI, I1) = 0.7164999999999981
macrorecall(goldI, I2) = 0.3682999999999996
macrorecall(goldI, I3) = 0.9187999999999927

The optimization uses objects from the dataset itself as queries to compute the objective function. If external queries are available for the optimization (not in the database and not in the query set, but following the expected distribution), they can be provided as follows:

equeries = MatrixDatabase(rand(dim, 64))
optimize!(G1, ParetoRecall(), queries=equeries)
optimize!(G2, ParetoRadius(), queries=equeries)
optimize!(G3, MinRecall(0.95), queries=equeries)

@time I1, D1 = searchbatch(G1, queries, k)
@time I2, D2 = searchbatch(G2, queries, k)
@time I3, D3 = searchbatch(G3, queries, k)

@show macrorecall(goldI, I1)
@show macrorecall(goldI, I2)
@show macrorecall(goldI, I3)
  0.001835 seconds (2.01 k allocations: 109.734 KiB)
  0.001232 seconds (2.01 k allocations: 109.734 KiB)
  0.002550 seconds (2.01 k allocations: 109.734 KiB)
macrorecall(goldI, I1) = 0.7906999999999968
macrorecall(goldI, I2) = 0.297
macrorecall(goldI, I3) = 0.9418999999999941

Finally, regardless of the strategy used while creating the index, we can optimize it for the desired quality, as follows:

optimize!(G1, MinRecall(0.95), queries=equeries)
optimize!(G2, MinRecall(0.95), queries=equeries)
optimize!(G3, MinRecall(0.95), queries=equeries)

@time I1, D1 = searchbatch(G1, queries, k)
@time I2, D2 = searchbatch(G2, queries, k)
@time I3, D3 = searchbatch(G3, queries, k)

@show macrorecall(goldI, I1)
@show macrorecall(goldI, I2)
@show macrorecall(goldI, I3)
  0.002958 seconds (2.01 k allocations: 109.734 KiB)
  0.009363 seconds (2.28 k allocations: 1.187 MiB)
  0.002534 seconds (2.01 k allocations: 109.734 KiB)
macrorecall(goldI, I1) = 0.9315999999999932
macrorecall(goldI, I2) = 0.9301999999999929
macrorecall(goldI, I3) = 0.9371999999999938

Please note that faster searches are expected from indexes created with higher-quality objectives, as the timings above illustrate.

Dependencies

Status `~/Research/SimilaritySearchDemos/site-src/Project.toml`
  [053f045d] SimilaritySearch v0.10.8
CC BY-SA 4.0 Eric S. Téllez