Obtaining efficient exact search algorithms for high-dimensional data is, in general, out of reach. To cope with high-dimensional datasets, we relax the definition of $knn$ to an approximate version $knn^*$, which is allowed to miss part of the exact result set.
Here we introduce the recall quality measure: $$\frac{|knn(q, S) \cap knn^*(q, S)|}{k} \leq 1.0$$
Hereafter, we will use $knn$ for both exact and approximate searches and only distinguish between them whenever needed.
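As a concrete illustration, the following Julia sketch computes this recall given the exact and approximate result sets as vectors of object identifiers; the function and variable names are illustrative and not part of any library discussed later:

```julia
# Recall of an approximate k-nn result against the exact one.
# `exact_ids` and `approx_ids` are illustrative vectors of object identifiers.
function recall(exact_ids::Vector{Int}, approx_ids::Vector{Int})
    k = length(exact_ids)
    length(intersect(Set(exact_ids), Set(approx_ids))) / k
end

recall([1, 5, 9, 12], [1, 9, 12, 33])  # 3 of the 4 exact neighbors retrieved -> 0.75
```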
A family $H$ of functions $h : \mathbb{R}^d \rightarrow \mathbb{N}$ is called $(P_1, P_2, r, cr)$-sensitive if, for any $p, q$:
- if $d(p, q) \leq r$, then $\Pr_H[h(p) = h(q)] \geq P_1$;
- if $d(p, q) \geq cr$, then $\Pr_H[h(p) = h(q)] \leq P_2$.
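As an illustration only, a classical family of this kind for the angular distance is random hyperplane hashing (SimHash); the sketch below assumes that family (the text does not commit to a particular one) and concatenates $b$ sign bits into a single integer code:

```julia
using LinearAlgebra, Random

# Illustrative locality-sensitive family for the angular distance:
# nearby vectors receive equal codes with high probability, distant ones with low probability.
struct HyperplaneHash
    W::Matrix{Float64}   # one random direction per row
end

HyperplaneHash(b::Int, dim::Int; rng=Random.default_rng()) = HyperplaneHash(randn(rng, b, dim))

function (h::HyperplaneHash)(x::AbstractVector)
    code = 0
    for w in eachrow(h.W)
        code = (code << 1) | (dot(w, x) >= 0 ? 1 : 0)  # one bit per hyperplane
    end
    code
end

h = HyperplaneHash(8, 4)        # 8-bit codes for vectors in R^4
h([1.0, 0.0, 0.0, 0.0])         # an integer in 0:255
```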
Let $P = \{p_1, p_2, \ldots, p_m\}$ be a set of points called references.
We use $P$ to project the metric space into a rank space; that is, each object $u$ is represented by the order in which it sees the references $p_i$, i.e., by the permutation of $P$ sorted by increasing distance $d(u, p_i)$.
For example,
$$u_1 = 3, 2, 1, 4, 5, 7, 6$$ $$u_2 = 1, 3, 2, 4, 5, 7, 6$$ $$u_3 = 4, 5, 3, 1, 2, 6, 7$$
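The following Julia sketch shows how such permutations could be computed; the Euclidean distance and the names `perm` and `references` are assumptions for illustration:

```julia
using LinearAlgebra

# Rank-space projection: each object is represented by the permutation of the
# reference identifiers sorted by increasing distance to it (Euclidean here).
perm(u, references) = sortperm([norm(u - p) for p in references])

references = [rand(2) for _ in 1:7]   # m = 7 references, as in the example above
u1 = rand(2)
perm(u1, references)                  # a permutation of 1:7, e.g. 3, 2, 1, 4, 5, 7, 6
```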
Considering only the closest references gives rise to the neighborhood-based approaches (sketched below):
NATIX and SimilaritySearch.jl
Tellez, E. S., Ruiz, G., Chavez, E., & Graff, M. (2021). A scalable solution to the nearest neighbor search problem through local-search methods on neighbor graphs. Pattern Analysis and Applications, 24(2), 763-777.
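A minimal sketch of this idea follows: each object keeps only the identifiers of its $\kappa$ nearest references, and candidates are retrieved through an inverted file. All names are illustrative; this is not the API of NATIX or SimilaritySearch.jl:

```julia
using LinearAlgebra

# Keep only the κ nearest references as the object's signature, and map each
# reference id to the objects that list it; candidates are then verified with d.
perm(u, refs) = sortperm([norm(u - p) for p in refs])
signature(u, refs, κ) = perm(u, refs)[1:κ]

function build_invfile(db, refs, κ)
    invfile = Dict{Int,Vector{Int}}()
    for (i, u) in enumerate(db)
        for r in signature(u, refs, κ)
            push!(get!(invfile, r, Int[]), i)
        end
    end
    invfile
end

refs = [rand(2) for _ in 1:7]
db = [rand(2) for _ in 1:100]
invfile = build_invfile(db, refs, 3)
q = rand(2)
candidates = unique(reduce(vcat, (get(invfile, r, Int[]) for r in signature(q, refs, 3))))
# `candidates` would then be re-ranked with the actual distance d(q, ·).
```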
For HNSW, PQ, and LSH, we use the FAISS implementations, written in C++; in particular, we use the CPU version. We also use FALCONN-LSH for benchmarking on DEEPIMAGE.
We use our own open-source implementations of BS, IHC, and KNR, written in the Julia language. We also report the performance of Seq, our brute-force search, and Flat, FAISS's exhaustive-search implementation.