Incremental construction

Usage demonstrations of `SimilaritySearch` with synthetic and real world data

All-KNN

By Eric S. Téllez

using SimilaritySearch

Computing the kk nearest neighbors of a dataset (all vs all) is a useful task to take knowledge of a given dataset. This is a core task for some clustering algorithms and non-linear dimensional reduction algorithms.

Given a metric database (X,dist)(X, dist) and a relatively small kk value, the goal is to compute {knn(x)xX}\{ knn(x) \mid x \in X \} taking into account that each xiXx_i \in X, and therefore, xix_i should be removed from the ii-th knnknn result set.

Solving allknn fast and accuratelly is the goal of this example.

const dim = 16
k = 15
verbose = false
db = MatrixDatabase(randn(Float32, dim, 10^5))
dist = SqL2Distance()
G = SearchGraph(; dist, db, verbose)
itime = @elapsed (index!(G); optimize!(G, MinRecall(0.9)))

Now we can solve all kknn

allknntime = @elapsed knns, dists = allknn(G, k)

Differences between allknn(G, k) and searchbatch(G, X, k)

We can solve similarly with searchbatch but self-references should be removed later, and more important, allknn use special pivoting/boosting strategies that yields to faster searches.

searchtime = @elapsed sknns, sdists = searchbatch(G, db, k)

About the cost of construction + allknn instead of a brute force computation.

allknn for ExhaustiveSearch doesn't perform any optimization but removes self references.

etime = @elapsed eknns, edists = allknn(ExhaustiveSearch(; db, dist), k)

Comparing solution times

indexing, allknn, and searchbatch times

itime, allknntime, searchtime
(5.040281815, 1.043150425, 0.48428813)

full cost allknn

itime + allknntime
6.0834322400000005

full cost searchbatch

itime + searchtime
5.5245699450000005

exhaustive allknn

etime
6.325815174

Quality

macro recall of allknn

macrorecall(eknns, knns)
0.9139300000006781

macro recall of searchbatch

macrorecall(eknns, sknns)
0.914159333334015

Final notes:

Exhaustive search will fetch the exact solution but it has a higher cost and this could be more notorious as dataset's size increases.

Also note that even SimilaritySearch has multithreading operations, it runs in a single thread due to the html generation pipeline.

Dependencies

Status `~/Research/SimilaritySearchDemos/site-src/Project.toml`
  [053f045d] SimilaritySearch v0.10.8
CC BY-SA 4.0 Eric S. Tellez . Last modified: March 15, 2022. Website built with Franklin.jl and the Julia programming language.