API · SimilaritySearch.jl

Indexes

ExhaustiveSearch(dist::SemiMetric, db::AbstractVector)

Solves queries evaluating dist for the query and all elements in the dataset

source

ParallelExhaustiveSearch(; dist=SqL2Distance(), db=VectorDatabase{Float32}())

Solves queries evaluating dist in parallel for the query and all elements in the dataset. Note that this should not be used in conjunction with searchbatch(...; parallel=true) since they will compete for resources.

source

SimilaritySearch.SearchGraph — Type

struct SearchGraph <: AbstractSearchIndex

SearchGraph index. It stores a set of points that can be compared through a distance function dist. The performance is determined by the search algorithm search_algo and the neighborhood policy. It supports callbacks to adjust parameters as insertions are made.

hints: Initial points for exploration (empty hints imply using random points)

Note: Parallel insertions should be made through append! or index! function with parallel_block > 1

source

Searching

SimilaritySearch.search — Function

search(seq::ExhaustiveSearch, context::AbstractContext, q, res::KnnResult)

Solves the query evaluating all items in the given query.

source

search(ex::ParallelExhaustiveSearch, context::GenericContext, q, res::KnnResult)

Solves the query evaluating all items in the given query.

Arguments

ex: the search structure
q: the query to solve
res: the result set
context: running context

source

search(bs::BeamSearch, index::SearchGraph, context, q, res, hints; bsize=bs.bsize, Δ=bs.Δ, maxvisits=bs.maxvisits)

Tries to reach the set of nearest neighbors specified in res for q.

bs: the parameters of BeamSearch
index: the local search index
q: the query
res: The result object, it stores the results and also specifies the kind of query
hints: Starting points for searching, randomly selected when it is an empty collection
context: A SearchGraphContext object with preallocated objects

Optional arguments (defaults to values in bs)

bsize: Beam size
Δ: exploration expansion factor
maxvisits: Maximum number of nodes to visit (distance evaluations)

source

search(index::SearchGraph, context::SearchGraphContext, q, res; hints=index.hints

Solves the specified query res for the query object q.

source

SimilaritySearch.searchbatch — Function

searchbatch(index, ctx, Q, k::Integer) -> indices, distances
searchbatch(index, Q, k::Integer) -> indices, distances

Searches a batch of queries in the given index (searches for k neighbors).

Arguments

index: The search structure
Q: The set of queries
k: The number of neighbors to retrieve
context: caches, hyperparameters, and meta data (defaults to getcontext(index))

Note: The i-th column in indices and distances correspond to the i-th query in Q Note: The final indices at each column can be 0 if the search process was unable to retrieve k neighbors.

source

searchbatch(index, ctx, Q, I::AbstractMatrix{Int32}, D::AbstractMatrix{Float32}) -> indices, distances

Searches a batch of queries in the given index and I and D as output (searches for k=size(I, 1))

Arguments

index: The search structure
Q: The set of queries
k: The number of neighbors to retrieve
ctx: environment for running searches (hyperparameters and caches)

source

searchbatch(index, context, Q, KNN::AbstractVector{KnnResult}) -> indices, distances

Searches a batch of queries in the given index using an array of KnnResult's; each KnnResult object can specify different k values.

Arguments

context: contain caches to reduce memory allocations and hyperparameters for search customization

source

Note: KnnResult based functions are significantly faster in general on pre-allocated objects that similar functions accepting matrices of identifiers and distances. Matrix based outputs are based on KnnResult methods that copy their results on the matrices. Preallocation is also costly, so if you have relatively small datasets, you are not intended to repeat the search process many times, or you are unsure, it is safe to use matrix-based functions.

Computing all knns

The operation of computing all knns in the index is computed as follows:

SimilaritySearch.allknn — Function

allknn(g::AbstractSearchIndex, context, k::Integer; minbatch=0, pools=getpools(g)) -> knns, dists
allknn(g, context, knns, dists; minbatch=0, pools=getpools(g)) -> knns, dists

Computes all the k nearest neighbors (all vs all) using the index g. It removes self references.

Parameters:

g: the index
context: the index's context (caches, hyperparameters, logger, etc)
Query specification and result:
- k: the number of neighbors to retrieve
- knns: an uninitialized integer matrix of (k, n) size for storing the k nearest neighbors of the n elements
- dists: an uninitialized floating point matrix of (k, n) size for storing the k nearest distances of the n elements
context: caches, hyperparameters and meta specifications, e.g., see SearchGraphContext

Returns:

knns a (k, n) matrix of identifiers; the i-th column corresponds to the i-th object in the dataset. Zeros can happen to the end of each column meaning that the retrieval was less than the desired k
dists a (k, n) matrix of distances; the i-th column corresponds to the i-th object in the dataset. Zero values in knns should be ignored in dists

Note:

This function was introduced in v0.8 series, and removes self references automatically. In v0.9 the self reference is kept since removing from the algorithm introduces a considerable overhead.

source

Computing closest pair

The operation of finding the closest pair of elements in the indexed dataset.

SimilaritySearch.closestpair — Function

closestpair(idx::AbstractSearchIndex, ctx::AbstractContext; minbatch=0)

Finds the closest pair among all elements in idx. If the index idx is approximate then pair of points could be also an approximation.

Arguments:

idx: the search structure that indexes the set of points
ctx: the search context (caches, hyperparameters, etc)

Keyword Arguments:

minbatch: controls how multithreading is used for evaluating configurations, see getminbatch

source

Remove near duplicates

Finds and removes near duplicate items in a metric dataset

SimilaritySearch.neardup — Function

neardup(idx::AbstractSearchIndex, ctx::AbstractContext, X::AbstractDatabase, ϵ::Real; k::Int=8, blocksize::Int=get_parallel_block(), minbatch=0, filterblocks=true, verbose=true)
neardup(dist::SemiMetric, X::AbstractDatabase, ϵ::Real; kwargs...)

Find nearest duplicates in database X using the empty index idx. The algorithm iteratively try to index elements in X, and items being near than ϵ to some element in idx will be ignored.

The function returns a named tuple (idx, map, nn, dist) where:

idx: it is the index of the non duplicated elements
ctx: the index's context
map: a mapping from |idx|-1 to its positions in X
nn: an array where each element in $x \in X$ points to its covering element (previously indexed element u such that $d(u, x_i) \leq ϵ$)
dist: an array of distance values to each covering element (correspond to each element in nn)

Arguments

idx: An empty index (i.e., a SearchGraph)
X: The input dataset
ϵ: Real value to cut, if negative, then ϵ will be computed using the quantile value at 'abs(ϵ)' in a small sample of nearest neighbor distances; the quantile method should be used only for applications that need some vague approximations to ϵ

Keyword arguments

k: The number of nearest neighbors to retrieve (some algorithms benefit from retrieving larger k values)
blocksize: the number of items processed at the time
minbatch: argument to control @batch macro (see Polyester package multithreading)
filterblocks: if true then it filters neardups inside blocks (see blocksize parameter), otherwise, it supposes that blocks are free of neardups (e.g., randomized order).
verbose: controls the verbosity of the function

Notes

The index idx must support incremental construction
If you need to customize object insertions, you must wrap the index idx and implement your custom methods; it requires valid implementations of the following functions:
- searchbatch(idx::AbstractSearchIndex, ctx, queries::AbstractDatabase, knns::Matrix, dists::Matrix)
- distance(idx::AbstractSearchIndex)
- length(idx::AbstractSearchIndex)
- append_items!(idx::AbstractSearchIndex, ctx, items::AbstractDatabase)
You can access the set of elements being 'ϵ-non duplicates (the $ϵ-net$) using database(idx) or where nn[i] == i

source

Indexing elements

SimilaritySearch.push_item! — Function

push_item!(res::KnnResult, p::IdWeight)
push_item!(res::KnnResult, id::Integer, dist::Real)

Appends an item into the result set

source

push_item!(
    index::SearchGraph,
    context,
    item;
    push_item=true
)

Appends an object into the index.

Arguments:

index: The search graph index where the insertion is going to happen
item: The object to be inserted, it should be in the same space than other objects in the index and understood by the distance metric.
context: The context environment of the graph, see SearchGraphContext.
push_db: if push_db=false is an internal option, used by append! and index! (it avoids to insert item into the database since it is already inserted but not indexed)

source

push_item!(
    index::SearchGraph,
    context,
    item,
    push_item
)

Appends an object into the index. Internal function

Arguments:

index: The search graph index where the insertion is going to happen.
item: The object to be inserted, it should be in the same space than other objects in the index and understood by the distance metric.
context: The context environment of the graph, see SearchGraphContext.
push_db: if false is an internal option, used by append! and index! (it avoids to insert item into the database since it is already inserted but not indexed).
Note: setting callbacks as nothing ignores the execution of any callback

source

SimilaritySearch.append_items! — Function

append_items!(
    index::SearchGraph,
    context::SearchGraphContext,
    db
)

Appends all items in db to the index. It can be made in parallel or sequentially.

Arguments:

index: the search graph index
db: the collection of objects to insert, an AbstractDatabase is the canonical input, but supports any iterable objects
context: The context environment of the graph, see SearchGraphContext.

source

SimilaritySearch.index! — Function

index!(index::SearchGraph, context::SearchGraphContext)

Indexes the already initialized database (e.g., given in the constructor method). It can be made in parallel or sequentially. The arguments are the same than append_items! function but using the internal index.db as input.

Arguments:

index: The graph index
context: The context environment of the graph, see SearchGraphContext.

source

SimilaritySearch.rebuild — Function

rebuild(g::SearchGraph; context=SearchGraphContext())

Rebuilds the SearchGraph index but seeing the whole dataset for the incremental construction, i.e., it can connect the i-th vertex to its knn in the 1..n possible vertices instead of its knn among 1..(i-1) as in the original algorithm.

Arguments

g: The search index to be rebuild.
context: The context to run the procedure, it can differ from the original one.
minbatch: controls how the multithreading is made, see getminbatch

source

Distance functions

The distance functions are defined to work under the evaluate(::metric, u, v) function (borrowed from Distances.jl package).

Minkowski vector distance functions

SimilaritySearch.L1Distance — Type

L1Distance()

The manhattan distance or $L_1$ is defined as

\[L_1(u, v) = \sum_i{|u_i - v_i|}\]

source

SimilaritySearch.L2Distance — Type

L2Distance()

The euclidean distance or $L_2$ is defined as

\[L_2(u, v) = \sqrt{\sum_i{(u_i - v_i)^2}}\]

source

SimilaritySearch.SqL2Distance — Type

SqL2Distance()

The squared euclidean distance is defined as

\[L_2(u, v) = \sum_i{(u_i - v_i)^2}\]

It avoids the computation of the square root and should be used whenever you are able do it.

source

LInftyDistance()

The Chebyshev or $L_{\infty}$ distance is defined as

\[L_{\infty}(u, v) = \max_i{\left| u_i - v_i \right|}\]

source

SimilaritySearch.LpDistance — Type

LpDistance(p)
LpDistance(p, pinv)

The general Minkowski distance $L_p$ distance is defined as

\[L_p(u, v) = \left|\sum_i{(u_i - v_i)^p}\right|^{1/p}\]

Where $p_{inv} = 1/p$. Note that you can specify unrelated p and pinv if you need an specific behaviour.

source

The package implements some of these functions using the @turbo macro from LoopVectorization package, only for v1.10 and previous versions.

Cosine and angle distance functions for vectors

CosineDistance()

The cosine is defined as:

\[\cos(u, v) = \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2} \sqrt{\sum_i v_i^2}}\]

The cosine distance is defined as $1 - \cos(u,v)$

source

NormalizedCosineDistance()

Similar to CosineDistance but suppose that input vectors are already normalized, and therefore, reduced to simply one minus the dot product.

\[1 - \sum_i {u_i v_i}\]

source

SimilaritySearch.AngleDistance — Type

AngleDistance()

The angle distance is defined as:

\[∠(u, v)= \arccos(\cos(u, v))\]

source

NormalizedAngleDistance()

Similar to AngleDistance but suppose that input vectors are already normalized

\[\arccos \sum_i {u_i v_i}\]

source

Set distance functions

Set bbject are represented as ordered arrays

JaccardDistance()

The Jaccard distance is defined as

\[J(u, v) = \frac{|u \cap v|}{|u \cup v|}\]

source

SimilaritySearch.DiceDistance — Type

DiceDistance()

The Dice distance is defined as

\[D(u, v) = \frac{2 |u \cap v|}{|u| + |v|}\]

source

IntersectionDissimilarity()

The intersection dissimilarity uses the size of the intersection as a mesuare of similarity as follows:

\[I(u, v) = 1 - \frac{|u \cap v|}{\max \{|u|, |v|\}}\]

source

CosineDistanceSet()

The cosine distance for very sparse binary vectors represented as sorted lists of positive integers where ones occur.

source

String alignment distances

The following uses strings/arrays as input, i.e., objects follow the array interface. A broader set of distances for strings can be found in the StringDistances.jl package.

CommonPrefixDissimilarity()

Uses the common prefix as a measure of dissimilarity between two strings

source

GenericLevenshteinDistance(;icost, dcost, rcost)

The levenshtein distance measures the minimum number of edit operations to convert one string into another. The costs insertion icost, deletion cost dcost, and replace cost rcost. Not thread safe, use a copy of for each thread.

source

StringHammingDistance()

The hamming distance counts the differences between two equally sized strings

source

LevenshteinDistance()

Instantiates a GenericLevenshteinDistance object to perform traditional levenshtein distance

source

SimilaritySearch.LcsDistance — Function

LcsDistance()

Instantiates a GenericLevenshteinDistance object to perform LCS distance

source

Distances for Cloud of points

HausdorffDistance(dist::SemiMetric)

Hausdorff distance is defined as the maximum of the minimum between two clouds of points.

\[Hausdorff(U, V) = \max{\max_{u \in U} nndist(u, V), \max{v \in V} nndist(v, U) }\]

where $nndist(u, V)$ computes the distance of $u$ to its nearest neighbor in $V$ using the dist metric.

source

MinHausdorffDistance(dist::SemiMetric)

Similar to HausdorffDistance but using minimum instead of maximum.

source

Functions that customize parameters

Several algorithms support arguments that modify the performance, for instance, some of them should be computed or prepared with external functions or structs

SimilaritySearch.getminbatch — Function

getminbatch(minbatch, n)

Used by functions that use parallelism based on Polyester.jl minibatches specify how many queries (or something else) are solved per thread whenever the thread is used (in minibatches).

Arguments

minbatch
- Integers $1 ≤ minbatch ≤ n$ are valid values (where n is the number of objects to process, i.e., queries)
- Defaults to 0 which computes a default number based on the number of available cores and n.
- Set minbatch=-1 to avoid parallelism.

source

SimilaritySearch.getknnresult — Function

getknnresult(k::Integer, ctx::AbstractContext) -> KnnResult

Generic function to obtain a shared result set for the same thread and avoid memory allocations. This function should be specialized for indexes and caches that use shared results or threads in some special way.

source

Missing docstring.

Missing docstring for getpools. Check Documenter's build log for details.

SimilaritySearch.Neighborhood — Type

Neighborhood(; logbase=2, minsize=2, reduce=SatNeighborhood())

Determines the size of the neighborhood, $k$ is adjusted as a callback, and it is intended to affect previously inserted vertices. The neighborhood is designed to consider two components $k=in+out$, i.e. incoming and outgoing edges for each vertex.

The $out$ size is computed as $minsize + \log(logbase, n)$ where $n$ is the current number of indexed elements; this is computed searching

for $out$ elements in the current index.

The $in$ size is unbounded.
reduce is intended to postprocess neighbors (after search process, i.e., once out edges are computed); do not change $k$ but always must return a copy of the reduced result set.

Note: Set $logbase=Inf$ to obtain a fixed number of $in$ nodes; and set $minsize=0$ to obtain a pure logarithmic growing neighborhood.

source

SimilaritySearch.Callback — Type

abstract type Callback end

Abstract type to trigger callbacks after some number of insertions. SearchGraph stores the callbacks in callbacks (a dictionary that associates symbols and callback objects); A SearchGraph object controls when callbacks are fired using callback_logbase and callback_starting

source

Missing docstring.

Missing docstring for SearchGraphCallbacks. Check Documenter's build log for details.

BeamSearchSpace(; bsize, Δ, bsize_scale, Δ_scale)

Define search space for beam search autotuning

source

Database API

abstract type AbstractDatabase end

Base type to represent databases. A database is a collection of objects that can be accessed like a similar interface to AbstractVector. It is separated to allow SimilaritySearch methods to know what is a database and what is an object (since most object representations will look as vectors and matrices).

The basic implementations are:

MatrixDatabase: A wrapper for object-vectors stored in a Matrix, columns are the objects. It is static.
DynamicMatrixDatabase: A dynamic representation for vectors that allows adding new vectors.
VectorDatabase: A wrapper for vector-like structures. It can contain any kind of objects.
SubDatabase: A sample of a given database

In particular, the storage details are not used by VectorDatabase and MatrixDatabase. For instance, it is possible to use matrices like Matrix, SMatrix or StrideArrays; or even use generated objects with VectorDatabase (supporting a vector-like interface).

If the storage backend support it, it is possible to use vector operations, for example:

get the i-th element obj = db[i], elements in the database are identified by position
get the elements list in a list of indices lst as db[lst] (also using view)
set a value at the i-th element db[i] = obj
random sampling rand(db), rand(db, 3)
iterate and collect objects in the database
get the number of elements in the database length(db)
add new objects to the end of the database (not all internal containers will support it)
- push_item!(db, u) adds a single element u
- append_items!(db, lst) adds a list of objects to the end of the database

source

struct MatrixDatabase{M<:AbstractDatabase} <: AbstractDatabase

MatrixDatabase(matrix::AbstractMatrix)

Wraps a matrix-like object matrix into a MatrixDatabase. Please see AbstractDatabase for general usage.

source

struct VectorDatabase{V} <: AbstractDatabase

A wrapper for vector-like databases

source

struct DynamicMatrixDatabase{DType,Dim} <: AbstractDatabase

A dynamic matrix with elements of type DType and dimension Dim

source

struct StrideMatrixDatabase{M<:StrideArray} <: AbstractDatabase

StrideMatrixDatabase(matrix::StrideArray)

Wraps a matrix object into a StrideArray and wrap it as a database. Please see AbstractDatabase for general usage.

source

SimilaritySearch.find_neighborhood — Function

find_neighborhood(index::SearchGraph{T}, context, item; hints=index.hints)

Searches for item neighborhood in the index, i.e., if item were in the index whose items should be its neighbors (intenal function). res is always reused since reduce creates a new KnnResult from it (a copy if reduce in its simpler terms)

Arguments

index: The search index.
item: The item to be inserted.
context: context, neighborhood, and cache objects to be used
hints: Search hints

source

SimilaritySearch.SatPruning — Type

SatPruning(k; kind=DistalSatNeighborhood())
SatPruning(k)

Selects SatNeighborhood or DistalSatNeighborhood for each vertex. Defaults to DistalSatNeighborhood.

k: the threshold size to apply the Sat reduction, i.e., neighbors larger than k will be pruned.

source

SimilaritySearch.RandomPruning — Type

RandomPruning(k)

Selects k random edges for each vertex

source

KeepNearestPruning(k)

Kept k nearest neighbor edges for each vertex

source

NeighborhoodPruning

Abstract data type for neighborhood pruning strategies

source

SimilaritySearch.maxlength — Function

maxlength(res::KnnResult)

The maximum allowed cardinality (the k of knn)

source

get_parallel_block()

Used by SearchGraph insertion functions to solve find_neighborhood in blocks. Small blocks are better to ensure quality; faster constructions will be achieved if parallel_block is a multiply of Threads.nthreads()

source

SimilarityFromDistance(dist)

Evaluates as $1/(1 + d)$ for a distance evaluation $d$ of dist. This is not a distance function and is part of the hacks to get a similarity for searching farthest elements on indexes that can handle this hack (e.g., ExhaustiveSearch, ParallelExhaustiveSearch, SearchGraph).

source

SimilaritySearch.execute_callbacks — Function

execute_callbacks(index, context, n=length(index), m=n+1)

Process all registered callbacks

source