Indexes

SimilaritySearch.ParallelExhaustiveSearchType
ParallelExhaustiveSearch(; dist=SqL2Distance(), db=VectorDatabase{Float32}())

Solves queries evaluating dist in parallel for the query and all elements in the dataset. Note that this should not be used in conjunction with searchbatch(...; parallel=true) since they will compete for resources.

source
SimilaritySearch.SearchGraphType
struct SearchGraph <: AbstractSearchIndex

SearchGraph index. It stores a set of points that can be compared through a distance function dist. The performance is determined by the search algorithm search_algo and the neighborhood policy. It supports callbacks to adjust parameters as insertions are made.

  • hints: Initial points for exploration (empty hints imply using random points)

Note: Parallel insertions should be made through append! or index! function with parallel_block > 1

source

Searching

SimilaritySearch.searchFunction
search(seq::ExhaustiveSearch, context::AbstractContext, q, res::KnnResult)

Solves the query evaluating all items in the given query.

source
search(ex::ParallelExhaustiveSearch, context::GenericContext, q, res::KnnResult)

Solves the query evaluating all items in the given query.

Arguments

  • ex: the search structure
  • q: the query to solve
  • res: the result set
  • context: running context
source
search(bs::BeamSearch, index::SearchGraph, context, q, res, hints; bsize=bs.bsize, Δ=bs.Δ, maxvisits=bs.maxvisits)

Tries to reach the set of nearest neighbors specified in res for q.

  • bs: the parameters of BeamSearch
  • index: the local search index
  • q: the query
  • res: The result object, it stores the results and also specifies the kind of query
  • hints: Starting points for searching, randomly selected when it is an empty collection
  • context: A SearchGraphContext object with preallocated objects

Optional arguments (defaults to values in bs)

  • bsize: Beam size
  • Δ: exploration expansion factor
  • maxvisits: Maximum number of nodes to visit (distance evaluations)
source
search(index::SearchGraph, context::SearchGraphContext, q, res; hints=index.hints

Solves the specified query res for the query object q.

source
SimilaritySearch.searchbatchFunction
searchbatch(index, ctx, Q, k::Integer) -> indices, distances
searchbatch(index, Q, k::Integer) -> indices, distances

Searches a batch of queries in the given index (searches for k neighbors).

Arguments

  • index: The search structure
  • Q: The set of queries
  • k: The number of neighbors to retrieve
  • context: caches, hyperparameters, and meta data (defaults to getcontext(index))

Note: The i-th column in indices and distances correspond to the i-th query in Q Note: The final indices at each column can be 0 if the search process was unable to retrieve k neighbors.

source
searchbatch(index, ctx, Q, I::AbstractMatrix{Int32}, D::AbstractMatrix{Float32}) -> indices, distances

Searches a batch of queries in the given index and I and D as output (searches for k=size(I, 1))

Arguments

  • index: The search structure
  • Q: The set of queries
  • k: The number of neighbors to retrieve
  • ctx: environment for running searches (hyperparameters and caches)
source
searchbatch(index, context, Q, KNN::AbstractVector{KnnResult}) -> indices, distances

Searches a batch of queries in the given index using an array of KnnResult's; each KnnResult object can specify different k values.

Arguments

  • context: contain caches to reduce memory allocations and hyperparameters for search customization
source

Note: KnnResult based functions are significantly faster in general on pre-allocated objects that similar functions accepting matrices of identifiers and distances. Matrix based outputs are based on KnnResult methods that copy their results on the matrices. Preallocation is also costly, so if you have relatively small datasets, you are not intended to repeat the search process many times, or you are unsure, it is safe to use matrix-based functions.

Computing all knns

The operation of computing all knns in the index is computed as follows:

SimilaritySearch.allknnFunction
allknn(g::AbstractSearchIndex, context, k::Integer; minbatch=0, pools=getpools(g)) -> knns, dists
allknn(g, context, knns, dists; minbatch=0, pools=getpools(g)) -> knns, dists

Computes all the k nearest neighbors (all vs all) using the index g. It removes self references.

Parameters:

  • g: the index

  • context: the index's context (caches, hyperparameters, logger, etc)

  • Query specification and result:

    • k: the number of neighbors to retrieve
    • knns: an uninitialized integer matrix of (k, n) size for storing the k nearest neighbors of the n elements
    • dists: an uninitialized floating point matrix of (k, n) size for storing the k nearest distances of the n elements
  • context: caches, hyperparameters and meta specifications, e.g., see SearchGraphContext

Returns:

  • knns a (k, n) matrix of identifiers; the i-th column corresponds to the i-th object in the dataset. Zeros can happen to the end of each column meaning that the retrieval was less than the desired k
  • dists a (k, n) matrix of distances; the i-th column corresponds to the i-th object in the dataset. Zero values in knns should be ignored in dists

Note:

This function was introduced in v0.8 series, and removes self references automatically. In v0.9 the self reference is kept since removing from the algorithm introduces a considerable overhead.

source

Computing closest pair

The operation of finding the closest pair of elements in the indexed dataset.

SimilaritySearch.closestpairFunction
closestpair(idx::AbstractSearchIndex, ctx::AbstractContext; minbatch=0)

Finds the closest pair among all elements in idx. If the index idx is approximate then pair of points could be also an approximation.

Arguments:

  • idx: the search structure that indexes the set of points
  • ctx: the search context (caches, hyperparameters, etc)

Keyword Arguments:

  • minbatch: controls how multithreading is used for evaluating configurations, see getminbatch
source

Remove near duplicates

Finds and removes near duplicate items in a metric dataset

SimilaritySearch.neardupFunction
neardup(idx::AbstractSearchIndex, ctx::AbstractContext, X::AbstractDatabase, ϵ::Real; k::Int=8, blocksize::Int=get_parallel_block(), minbatch=0, filterblocks=true, verbose=true)
neardup(dist::SemiMetric, X::AbstractDatabase, ϵ::Real; kwargs...)

Find nearest duplicates in database X using the empty index idx. The algorithm iteratively try to index elements in X, and items being near than ϵ to some element in idx will be ignored.

The function returns a named tuple (idx, map, nn, dist) where:

  • idx: it is the index of the non duplicated elements
  • ctx: the index's context
  • map: a mapping from |idx|-1 to its positions in X
  • nn: an array where each element in $x \in X$ points to its covering element (previously indexed element u such that $d(u, x_i) \leq ϵ$)
  • dist: an array of distance values to each covering element (correspond to each element in nn)

Arguments

  • idx: An empty index (i.e., a SearchGraph)
  • X: The input dataset
  • ϵ: Real value to cut, if negative, then ϵ will be computed using the quantile value at 'abs(ϵ)' in a small sample of nearest neighbor distances; the quantile method should be used only for applications that need some vague approximations to ϵ

Keyword arguments

  • k: The number of nearest neighbors to retrieve (some algorithms benefit from retrieving larger k values)
  • blocksize: the number of items processed at the time
  • minbatch: argument to control @batch macro (see Polyester package multithreading)
  • filterblocks: if true then it filters neardups inside blocks (see blocksize parameter), otherwise, it supposes that blocks are free of neardups (e.g., randomized order).
  • verbose: controls the verbosity of the function

Notes

  • The index idx must support incremental construction
  • If you need to customize object insertions, you must wrap the index idx and implement your custom methods; it requires valid implementations of the following functions:
    • searchbatch(idx::AbstractSearchIndex, ctx, queries::AbstractDatabase, knns::Matrix, dists::Matrix)
    • distance(idx::AbstractSearchIndex)
    • length(idx::AbstractSearchIndex)
    • append_items!(idx::AbstractSearchIndex, ctx, items::AbstractDatabase)
  • You can access the set of elements being 'ϵ-non duplicates (the $ϵ-net$) using database(idx) or where nn[i] == i
source

Indexing elements

SimilaritySearch.push_item!Function
push_item!(res::KnnResult, p::IdWeight)
push_item!(res::KnnResult, id::Integer, dist::Real)

Appends an item into the result set

source
push_item!(
    index::SearchGraph,
    context,
    item;
    push_item=true
)

Appends an object into the index.

Arguments:

  • index: The search graph index where the insertion is going to happen
  • item: The object to be inserted, it should be in the same space than other objects in the index and understood by the distance metric.
  • context: The context environment of the graph, see SearchGraphContext.
  • push_db: if push_db=false is an internal option, used by append! and index! (it avoids to insert item into the database since it is already inserted but not indexed)
source
push_item!(
    index::SearchGraph,
    context,
    item,
    push_item
)

Appends an object into the index. Internal function

Arguments:

  • index: The search graph index where the insertion is going to happen.

  • item: The object to be inserted, it should be in the same space than other objects in the index and understood by the distance metric.

  • context: The context environment of the graph, see SearchGraphContext.

  • push_db: if false is an internal option, used by append! and index! (it avoids to insert item into the database since it is already inserted but not indexed).

  • Note: setting callbacks as nothing ignores the execution of any callback

source
SimilaritySearch.append_items!Function
append_items!(
    index::SearchGraph,
    context::SearchGraphContext,
    db
)

Appends all items in db to the index. It can be made in parallel or sequentially.

Arguments:

  • index: the search graph index
  • db: the collection of objects to insert, an AbstractDatabase is the canonical input, but supports any iterable objects
  • context: The context environment of the graph, see SearchGraphContext.
source
SimilaritySearch.index!Function
index!(index::SearchGraph, context::SearchGraphContext)

Indexes the already initialized database (e.g., given in the constructor method). It can be made in parallel or sequentially. The arguments are the same than append_items! function but using the internal index.db as input.

Arguments:

  • index: The graph index
  • context: The context environment of the graph, see SearchGraphContext.
source
SimilaritySearch.rebuildFunction
rebuild(g::SearchGraph; context=SearchGraphContext())

Rebuilds the SearchGraph index but seeing the whole dataset for the incremental construction, i.e., it can connect the i-th vertex to its knn in the 1..n possible vertices instead of its knn among 1..(i-1) as in the original algorithm.

Arguments

  • g: The search index to be rebuild.
  • context: The context to run the procedure, it can differ from the original one.
  • minbatch: controls how the multithreading is made, see getminbatch
source

Distance functions

The distance functions are defined to work under the evaluate(::metric, u, v) function (borrowed from Distances.jl package).

Minkowski vector distance functions

SimilaritySearch.SqL2DistanceType
SqL2Distance()

The squared euclidean distance is defined as

\[L_2(u, v) = \sum_i{(u_i - v_i)^2}\]

It avoids the computation of the square root and should be used whenever you are able do it.

source
SimilaritySearch.LpDistanceType
LpDistance(p)
LpDistance(p, pinv)

The general Minkowski distance $L_p$ distance is defined as

\[L_p(u, v) = \left|\sum_i{(u_i - v_i)^p}\right|^{1/p}\]

Where $p_{inv} = 1/p$. Note that you can specify unrelated p and pinv if you need an specific behaviour.

source

The package implements some of these functions using the @turbo macro from LoopVectorization package, only for v1.10 and previous versions.

Cosine and angle distance functions for vectors

SimilaritySearch.CosineDistanceType

CosineDistance()

The cosine is defined as:

\[\cos(u, v) = \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2} \sqrt{\sum_i v_i^2}}\]

The cosine distance is defined as $1 - \cos(u,v)$

source

Set distance functions

Set bbject are represented as ordered arrays

SimilaritySearch.IntersectionDissimilarityType
IntersectionDissimilarity()

The intersection dissimilarity uses the size of the intersection as a mesuare of similarity as follows:

\[I(u, v) = 1 - \frac{|u \cap v|}{\max \{|u|, |v|\}}\]

source

String alignment distances

The following uses strings/arrays as input, i.e., objects follow the array interface. A broader set of distances for strings can be found in the StringDistances.jl package.

SimilaritySearch.GenericLevenshteinDistanceType
GenericLevenshteinDistance(;icost, dcost, rcost)

The levenshtein distance measures the minimum number of edit operations to convert one string into another. The costs insertion icost, deletion cost dcost, and replace cost rcost. Not thread safe, use a copy of for each thread.

source

Distances for Cloud of points

SimilaritySearch.HausdorffDistanceType
HausdorffDistance(dist::SemiMetric)

Hausdorff distance is defined as the maximum of the minimum between two clouds of points.

\[Hausdorff(U, V) = \max{\max_{u \in U} nndist(u, V), \max{v \in V} nndist(v, U) }\]

where $nndist(u, V)$ computes the distance of $u$ to its nearest neighbor in $V$ using the dist metric.

source

Functions that customize parameters

Several algorithms support arguments that modify the performance, for instance, some of them should be computed or prepared with external functions or structs

SimilaritySearch.getminbatchFunction
getminbatch(minbatch, n)

Used by functions that use parallelism based on Polyester.jl minibatches specify how many queries (or something else) are solved per thread whenever the thread is used (in minibatches).

Arguments

  • minbatch
    • Integers $1 ≤ minbatch ≤ n$ are valid values (where n is the number of objects to process, i.e., queries)
    • Defaults to 0 which computes a default number based on the number of available cores and n.
    • Set minbatch=-1 to avoid parallelism.
source
SimilaritySearch.getknnresultFunction
getknnresult(k::Integer, ctx::AbstractContext) -> KnnResult

Generic function to obtain a shared result set for the same thread and avoid memory allocations. This function should be specialized for indexes and caches that use shared results or threads in some special way.

source
Missing docstring.

Missing docstring for getpools. Check Documenter's build log for details.

SimilaritySearch.NeighborhoodType
Neighborhood(; logbase=2, minsize=2, reduce=SatNeighborhood())

Determines the size of the neighborhood, $k$ is adjusted as a callback, and it is intended to affect previously inserted vertices. The neighborhood is designed to consider two components $k=in+out$, i.e. incoming and outgoing edges for each vertex.

  • The $out$ size is computed as $minsize + \log(logbase, n)$ where $n$ is the current number of indexed elements; this is computed searching

for $out$ elements in the current index.

  • The $in$ size is unbounded.
  • reduce is intended to postprocess neighbors (after search process, i.e., once out edges are computed); do not change $k$ but always must return a copy of the reduced result set.

Note: Set $logbase=Inf$ to obtain a fixed number of $in$ nodes; and set $minsize=0$ to obtain a pure logarithmic growing neighborhood.

source
SimilaritySearch.CallbackType
abstract type Callback end

Abstract type to trigger callbacks after some number of insertions. SearchGraph stores the callbacks in callbacks (a dictionary that associates symbols and callback objects); A SearchGraph object controls when callbacks are fired using callback_logbase and callback_starting

source
Missing docstring.

Missing docstring for SearchGraphCallbacks. Check Documenter's build log for details.

Database API

SimilaritySearch.AbstractDatabaseType
abstract type AbstractDatabase end

Base type to represent databases. A database is a collection of objects that can be accessed like a similar interface to AbstractVector. It is separated to allow SimilaritySearch methods to know what is a database and what is an object (since most object representations will look as vectors and matrices).

The basic implementations are:

  • MatrixDatabase: A wrapper for object-vectors stored in a Matrix, columns are the objects. It is static.
  • DynamicMatrixDatabase: A dynamic representation for vectors that allows adding new vectors.
  • VectorDatabase: A wrapper for vector-like structures. It can contain any kind of objects.
  • SubDatabase: A sample of a given database

In particular, the storage details are not used by VectorDatabase and MatrixDatabase. For instance, it is possible to use matrices like Matrix, SMatrix or StrideArrays; or even use generated objects with VectorDatabase (supporting a vector-like interface).

If the storage backend support it, it is possible to use vector operations, for example:

  • get the i-th element obj = db[i], elements in the database are identified by position
  • get the elements list in a list of indices lst as db[lst] (also using view)
  • set a value at the i-th element db[i] = obj
  • random sampling rand(db), rand(db, 3)
  • iterate and collect objects in the database
  • get the number of elements in the database length(db)
  • add new objects to the end of the database (not all internal containers will support it)
    • push_item!(db, u) adds a single element u
    • append_items!(db, lst) adds a list of objects to the end of the database
source
SimilaritySearch.find_neighborhoodFunction
find_neighborhood(index::SearchGraph{T}, context, item; hints=index.hints)

Searches for item neighborhood in the index, i.e., if item were in the index whose items should be its neighbors (intenal function). res is always reused since reduce creates a new KnnResult from it (a copy if reduce in its simpler terms)

Arguments

  • index: The search index.
  • item: The item to be inserted.
  • context: context, neighborhood, and cache objects to be used
  • hints: Search hints
source
SimilaritySearch.SatPruningType
SatPruning(k; kind=DistalSatNeighborhood())
SatPruning(k)

Selects SatNeighborhood or DistalSatNeighborhood for each vertex. Defaults to DistalSatNeighborhood.

  • k: the threshold size to apply the Sat reduction, i.e., neighbors larger than k will be pruned.
source
SimilaritySearch.get_parallel_blockFunction
get_parallel_block()

Used by SearchGraph insertion functions to solve find_neighborhood in blocks. Small blocks are better to ensure quality; faster constructions will be achieved if parallel_block is a multiply of Threads.nthreads()

source
SimilaritySearch.SimilarityFromDistanceType
SimilarityFromDistance(dist)

Evaluates as $1/(1 + d)$ for a distance evaluation $d$ of dist. This is not a distance function and is part of the hacks to get a similarity for searching farthest elements on indexes that can handle this hack (e.g., ExhaustiveSearch, ParallelExhaustiveSearch, SearchGraph).

source