Incremental construction
Usage demonstrations of `SimilaritySearch` with synthetic and real world data
All-KNN
By Eric S. Téllez
using SimilaritySearch
Computing the nearest neighbors of a dataset (all vs all) is a useful task to take knowledge of a given dataset. This is a core task for some clustering algorithms and non-linear dimensional reduction algorithms.
Given a metric database and a relatively small value, the goal is to compute taking into account that each , and therefore, should be removed from the -th result set.
Solving allknn
fast and accuratelly is the goal of this example.
const dim = 16
k = 15
verbose = false
db = MatrixDatabase(randn(Float32, dim, 10^5))
dist = SqL2Distance()
G = SearchGraph(; dist, db, verbose)
itime = @elapsed (index!(G); optimize!(G, MinRecall(0.9)))
Now we can solve all nn
allknntime = @elapsed knns, dists = allknn(G, k)
Differences between allknn(G, k)
and searchbatch(G, X, k)
We can solve similarly with searchbatch
but self-references should be removed later, and more important, allknn
use special pivoting/boosting strategies that yields to faster searches.
searchtime = @elapsed sknns, sdists = searchbatch(G, db, k)
About the cost of construction + allknn
instead of a brute force computation.
allknn
for ExhaustiveSearch
doesn't perform any optimization but removes self references.
etime = @elapsed eknns, edists = allknn(ExhaustiveSearch(; db, dist), k)
Comparing solution times
indexing, allknn, and searchbatch times
itime, allknntime, searchtime
(5.040281815, 1.043150425, 0.48428813)
full cost allknn
itime + allknntime
6.0834322400000005
full cost searchbatch
itime + searchtime
5.5245699450000005
exhaustive allknn
etime
6.325815174
Quality
macro recall of allknn
macrorecall(eknns, knns)
0.9139300000006781
macro recall of searchbatch
macrorecall(eknns, sknns)
0.914159333334015
Final notes:
Exhaustive search will fetch the exact solution but it has a higher cost and this could be more notorious as dataset's size increases.
Also note that even SimilaritySearch
has multithreading operations, it runs in a single thread due to the html generation pipeline.
Dependencies
Status `~/Research/SimilaritySearchDemos/site-src/Project.toml`
[053f045d] SimilaritySearch v0.10.8