Indexes
SimilaritySearch.ExhaustiveSearch
— TypeExhaustiveSearch(dist::SemiMetric, db::AbstractVector)
Solves queries evaluating dist
for the query and all elements in the dataset
SimilaritySearch.ParallelExhaustiveSearch
— TypeParallelExhaustiveSearch(; dist=SqL2Distance(), db=VectorDatabase{Float32}())
Solves queries evaluating dist
in parallel for the query and all elements in the dataset. Note that this should not be used in conjunction with searchbatch(...; parallel=true)
since they will compete for resources.
SimilaritySearch.SearchGraph
— Typestruct SearchGraph <: AbstractSearchIndex
SearchGraph index. It stores a set of points that can be compared through a distance function dist
. The performance is determined by the search algorithm algo
and the neighborhood policy. It supports callbacks to adjust parameters as insertions are made.
hints
: Initial points for exploration (empty hints imply using random points)
Note: Parallel insertions should be made through append!
or index!
function with parallel_block > 1
Searching
SimilaritySearch.search
— Functionsearch(seq::ExhaustiveSearch, context::AbstractContext, q, res::KnnResult)
Solves the query evaluating all items in the given query.
search(ex::ParallelExhaustiveSearch, context::GenericContext, q, res::KnnResult)
Solves the query evaluating all items in the given query.
Arguments
ex
: the search structureq
: the query to solveres
: the result setcontext
: running context
search(bs::BeamSearch, index::SearchGraph, context, q, res, hints; bsize=bs.bsize, Δ=bs.Δ, maxvisits=bs.maxvisits)
Tries to reach the set of nearest neighbors specified in res
for q
.
bs
: the parameters ofBeamSearch
index
: the local search indexq
: the queryres
: The result object, it stores the results and also specifies the kind of queryhints
: Starting points for searching, randomly selected when it is an empty collectioncontext
: A SearchGraphContext object with preallocated objects
Optional arguments (defaults to values in bs
)
bsize
: Beam sizeΔ
: exploration expansion factormaxvisits
: Maximum number of nodes to visit (distance evaluations)
search(index::SearchGraph, context::SearchGraphContext, q, res; hints=index.hints
Solves the specified query res
for the query object q
.
SimilaritySearch.searchbatch
— Functionsearchbatch(index, ctx, Q, k::Integer) -> indices, distances
searchbatch(index, Q, k::Integer) -> indices, distances
Searches a batch of queries in the given index (searches for k neighbors).
Arguments
index
: The search structureQ
: The set of queriesk
: The number of neighbors to retrievecontext
: caches, hyperparameters, and meta data (defaults togetcontext(index)
)
Note: The i-th column in indices and distances correspond to the i-th query in Q
Note: The final indices at each column can be 0
if the search process was unable to retrieve k
neighbors.
Note: KnnResult
based functions are significantly faster in general on pre-allocated objects that similar functions accepting matrices of identifiers and distances. Matrix based outputs are based on KnnResult
methods that copy their results on the matrices. Preallocation is also costly, so if you have relatively small datasets, you are not intended to repeat the search process many times, or you are unsure, it is safe to use matrix-based functions.
Computing all knns
The operation of computing all knns in the index is computed as follows:
SimilaritySearch.allknn
— Functionallknn(g::AbstractSearchIndex, context, k::Integer; minbatch=0, pools=getpools(g)) -> knns, dists
allknn(g, context, knns, dists; minbatch=0, pools=getpools(g)) -> knns, dists
Computes all the k nearest neighbors (all vs all) using the index g
. It removes self references.
Parameters:
g
: the indexcontext
: the index's context (caches, hyperparameters, logger, etc)Query specification and result:
k
: the number of neighbors to retrieveknns
: an uninitialized integer matrix of (k, n) size for storing thek
nearest neighbors of then
elementsdists
: an uninitialized floating point matrix of (k, n) size for storing thek
nearest distances of then
elements
context
: caches, hyperparameters and meta specifications, e.g., seeSearchGraphContext
Returns:
knns
a (k, n) matrix of identifiers; the i-th column corresponds to the i-th object in the dataset. Zeros can happen to the end of each column meaning that the retrieval was less than the desiredk
dists
a (k, n) matrix of distances; the i-th column corresponds to the i-th object in the dataset. Zero values inknns
should be ignored indists
Note:
This function was introduced in v0.8
series, and removes self references automatically. In v0.9
the self reference is kept since removing from the algorithm introduces a considerable overhead.
Computing closest pair
The operation of finding the closest pair of elements in the indexed dataset.
SimilaritySearch.closestpair
— Functionclosestpair(idx::AbstractSearchIndex, ctx::AbstractContext; minbatch=0)
Finds the closest pair among all elements in idx
. If the index idx
is approximate then pair of points could be also an approximation.
Arguments:
idx
: the search structure that indexes the set of pointsctx
: the search context (caches, hyperparameters, etc)
Keyword Arguments:
minbatch
: controls how multithreading is used for evaluating configurations, seegetminbatch
Remove near duplicates
Finds and removes near duplicate items in a metric dataset
SimilaritySearch.neardup
— Functionneardup(idx::AbstractSearchIndex, ctx::AbstractContext, X::AbstractDatabase, ϵ::Real; k::Int=8, blocksize::Int=get_parallel_block(), minbatch=0, filterblocks=true, verbose=true)
neardup(dist::SemiMetric, X::AbstractDatabase, ϵ::Real; kwargs...)
Find nearest duplicates in database X
using the empty index idx
. The algorithm iteratively try to index elements in X
, and items being near than ϵ
to some element in idx
will be ignored.
The function returns a named tuple (idx, map, nn, dist)
where:
idx
: it is the index of the non duplicated elementsctx
: the index's contextmap
: a mapping from|idx|-1
to its positions inX
nn
: an array where each element in $x \in X$ points to its covering element (previously indexed elementu
such that $d(u, x_i) \leq ϵ$)dist
: an array of distance values to each covering element (correspond to each element innn
)
Arguments
idx
: An empty index (i.e., aSearchGraph
)X
: The input datasetϵ
: Real value to cut, if negative, then ϵ will be computed using the quantile value at 'abs(ϵ)' in a small sample of nearest neighbor distances; the quantile method should be used only for applications that need some vague approximations toϵ
Keyword arguments
k
: The number of nearest neighbors to retrieve (some algorithms benefit from retrieving largerk
values)blocksize
: the number of items processed at the timeminbatch
: argument to control@batch
macro (seePolyester
package multithreading)filterblocks
: if true then it filters neardups inside blocks (seeblocksize
parameter), otherwise, it supposes that blocks are free of neardups (e.g., randomized order).verbose
: controls the verbosity of the function
Notes
- The index
idx
must support incremental construction - If you need to customize object insertions, you must wrap the index
idx
and implement your custom methods; it requires valid implementations of the following functions:searchbatch(idx::AbstractSearchIndex, ctx, queries::AbstractDatabase, knns::Matrix, dists::Matrix)
distance(idx::AbstractSearchIndex)
length(idx::AbstractSearchIndex)
append_items!(idx::AbstractSearchIndex, ctx, items::AbstractDatabase)
- You can access the set of elements being 'ϵ-non duplicates (the $ϵ-net$) using
database(idx)
or wherenn[i] == i
Indexing elements
SimilaritySearch.push_item!
— Functionpush_item!(res::KnnResult, p::IdWeight)
push_item!(res::KnnResult, id::Integer, dist::Real)
Appends an item into the result set
push_item!(
index::SearchGraph,
context,
item;
push_item=true
)
Appends an object into the index.
Arguments:
index
: The search graph index where the insertion is going to happenitem
: The object to be inserted, it should be in the same space than other objects in the index and understood by the distance metric.context
: The context environment of the graph, seeSearchGraphContext
.push_db
: ifpush_db=false
is an internal option, used byappend!
andindex!
(it avoids to insertitem
into the database since it is already inserted but not indexed)
push_item!(
index::SearchGraph,
context,
item,
push_item
)
Appends an object into the index. Internal function
Arguments:
index
: The search graph index where the insertion is going to happen.item
: The object to be inserted, it should be in the same space than other objects in the index and understood by the distance metric.context
: The context environment of the graph, seeSearchGraphContext
.push_db
: iffalse
is an internal option, used byappend!
andindex!
(it avoids to insertitem
into the database since it is already inserted but not indexed).Note: setting
callbacks
asnothing
ignores the execution of any callback
SimilaritySearch.append_items!
— Functionappend_items!(
index::SearchGraph,
context::SearchGraphContext,
db
)
Appends all items in db to the index. It can be made in parallel or sequentially.
Arguments:
index
: the search graph indexdb
: the collection of objects to insert, anAbstractDatabase
is the canonical input, but supports any iterable objectscontext
: The context environment of the graph, seeSearchGraphContext
.
SimilaritySearch.index!
— Functionindex!(index::SearchGraph, context::SearchGraphContext)
Indexes the already initialized database (e.g., given in the constructor method). It can be made in parallel or sequentially. The arguments are the same than append_items!
function but using the internal index.db
as input.
Arguments:
index
: The graph indexcontext
: The context environment of the graph, seeSearchGraphContext
.
SimilaritySearch.rebuild
— Functionrebuild(g::SearchGraph; context=SearchGraphContext())
Rebuilds the SearchGraph
index but seeing the whole dataset for the incremental construction, i.e., it can connect the i-th vertex to its knn in the 1..n possible vertices instead of its knn among 1..(i-1) as in the original algorithm.
Arguments
g
: The search index to be rebuild.context
: The context to run the procedure, it can differ from the original one.minbatch
: controls how the multithreading is made, seegetminbatch
Distance functions
The distance functions are defined to work under the evaluate(::metric, u, v)
function (borrowed from Distances.jl package).
Minkowski vector distance functions
SimilaritySearch.L1Distance
— TypeL1Distance()
The manhattan distance or $L_1$ is defined as
\[L_1(u, v) = \sum_i{|u_i - v_i|}\]
SimilaritySearch.L2Distance
— TypeL2Distance()
The euclidean distance or $L_2$ is defined as
\[L_2(u, v) = \sqrt{\sum_i{(u_i - v_i)^2}}\]
SimilaritySearch.SqL2Distance
— TypeSqL2Distance()
The squared euclidean distance is defined as
\[L_2(u, v) = \sum_i{(u_i - v_i)^2}\]
It avoids the computation of the square root and should be used whenever you are able do it.
SimilaritySearch.LInftyDistance
— TypeLInftyDistance()
The Chebyshev or $L_{\infty}$ distance is defined as
\[L_{\infty}(u, v) = \max_i{\left| u_i - v_i \right|}\]
SimilaritySearch.LpDistance
— TypeLpDistance(p)
LpDistance(p, pinv)
The general Minkowski distance $L_p$ distance is defined as
\[L_p(u, v) = \left|\sum_i{(u_i - v_i)^p}\right|^{1/p}\]
Where $p_{inv} = 1/p$. Note that you can specify unrelated p
and pinv
if you need an specific behaviour.
The package implements some of these functions using the @turbo
macro from LoopVectorization
package, only for v1.10 and previous versions.
Cosine and angle distance functions for vectors
SimilaritySearch.CosineDistance
— TypeCosineDistance()
The cosine is defined as:
\[\cos(u, v) = \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2} \sqrt{\sum_i v_i^2}}\]
The cosine distance is defined as $1 - \cos(u,v)$
SimilaritySearch.NormalizedCosineDistance
— TypeNormalizedCosineDistance()
Similar to CosineDistance
but suppose that input vectors are already normalized, and therefore, reduced to simply one minus the dot product.
\[1 - \sum_i {u_i v_i}\]
SimilaritySearch.AngleDistance
— TypeAngleDistance()
The angle distance is defined as:
\[∠(u, v)= \arccos(\cos(u, v))\]
SimilaritySearch.NormalizedAngleDistance
— TypeNormalizedAngleDistance()
Similar to AngleDistance
but suppose that input vectors are already normalized
\[\arccos \sum_i {u_i v_i}\]
Set distance functions
Set bbject are represented as ordered arrays
SimilaritySearch.JaccardDistance
— TypeJaccardDistance()
The Jaccard distance is defined as
\[J(u, v) = \frac{|u \cap v|}{|u \cup v|}\]
SimilaritySearch.DiceDistance
— TypeDiceDistance()
The Dice distance is defined as
\[D(u, v) = \frac{2 |u \cap v|}{|u| + |v|}\]
SimilaritySearch.IntersectionDissimilarity
— TypeIntersectionDissimilarity()
The intersection dissimilarity uses the size of the intersection as a mesuare of similarity as follows:
\[I(u, v) = 1 - \frac{|u \cap v|}{\max \{|u|, |v|\}}\]
SimilaritySearch.CosineDistanceSet
— TypeCosineDistanceSet()
The cosine distance for very sparse binary vectors represented as sorted lists of positive integers where ones occur.
String alignment distances
The following uses strings/arrays as input, i.e., objects follow the array interface. A broader set of distances for strings can be found in the StringDistances.jl package.
SimilaritySearch.CommonPrefixDissimilarity
— TypeCommonPrefixDissimilarity()
Uses the common prefix as a measure of dissimilarity between two strings
SimilaritySearch.GenericLevenshteinDistance
— TypeGenericLevenshteinDistance(;icost, dcost, rcost)
The levenshtein distance measures the minimum number of edit operations to convert one string into another. The costs insertion icost
, deletion cost dcost
, and replace cost rcost
. Not thread safe, use a copy of for each thread.
SimilaritySearch.StringHammingDistance
— TypeStringHammingDistance()
The hamming distance counts the differences between two equally sized strings
SimilaritySearch.LevenshteinDistance
— FunctionLevenshteinDistance()
Instantiates a GenericLevenshteinDistance object to perform traditional levenshtein distance
SimilaritySearch.LcsDistance
— FunctionLcsDistance()
Instantiates a GenericLevenshteinDistance object to perform LCS distance
Distances for Cloud of points
SimilaritySearch.HausdorffDistance
— TypeHausdorffDistance(dist::SemiMetric)
Hausdorff distance is defined as the maximum of the minimum between two clouds of points.
\[Hausdorff(U, V) = \max{\max_{u \in U} nndist(u, V), \max{v \in V} nndist(v, U) }\]
where $nndist(u, V)$ computes the distance of $u$ to its nearest neighbor in $V$ using the dist
metric.
SimilaritySearch.MinHausdorffDistance
— TypeMinHausdorffDistance(dist::SemiMetric)
Similar to HausdorffDistance
but using minimum instead of maximum.
Functions that customize parameters
Several algorithms support arguments that modify the performance, for instance, some of them should be computed or prepared with external functions or structs
SimilaritySearch.getminbatch
— Functiongetminbatch(minbatch, n)
Used by functions that use parallelism based on Polyester.jl
minibatches specify how many queries (or something else) are solved per thread whenever the thread is used (in minibatches).
Arguments
minbatch
- Integers $1 ≤ minbatch ≤ n$ are valid values (where n is the number of objects to process, i.e., queries)
- Defaults to 0 which computes a default number based on the number of available cores and
n
. - Set
minbatch=-1
to avoid parallelism.
SimilaritySearch.getknnresult
— Functiongetknnresult(k::Integer, ctx::AbstractContext) -> KnnResult
Generic function to obtain a shared result set for the same thread and avoid memory allocations. This function should be specialized for indexes and caches that use shared results or threads in some special way.
Missing docstring for getpools
. Check Documenter's build log for details.
SimilaritySearch.Neighborhood
— TypeNeighborhood(; logbase=2, minsize=2, filter=SatNeighborhood())
Determines the size of the neighborhood, $k$ is adjusted as a callback, and it is intended to affect previously inserted vertices. The neighborhood is designed to consider two components $k=in+out$, i.e. incoming and outgoing edges for each vertex.
- The $out$ size is computed as $minsize + \log(logbase, n)$ where $n$ is the current number of indexed elements; this is computed searching
for $out$ elements in the current index.
- The $in$ size is unbounded.
- filter is intended to postprocess neighbors (after search process, i.e., once out edges are computed); do not change $k$ but always must return a copy of the filterd result set.
Note: Set $logbase=Inf$ to obtain a fixed number of $in$ nodes; and set $minsize=0$ to obtain a pure logarithmic growing neighborhood.
SimilaritySearch.Callback
— Typeabstract type Callback end
Abstract type to trigger callbacks after some number of insertions. SearchGraph stores the callbacks in callbacks
(a dictionary that associates symbols and callback objects); A SearchGraph object controls when callbacks are fired using callback_logbase
and callback_starting
Missing docstring for SearchGraphCallbacks
. Check Documenter's build log for details.
SimilaritySearch.BeamSearchSpace
— TypeBeamSearchSpace(; bsize, Δ, bsize_scale, Δ_scale)
Define search space for beam search autotuning
Database API
SimilaritySearch.AbstractDatabase
— Typeabstract type AbstractDatabase end
Base type to represent databases. A database is a collection of objects that can be accessed like a similar interface to AbstractVector
. It is separated to allow SimilaritySearch
methods to know what is a database and what is an object (since most object representations will look as vectors and matrices).
The basic implementations are:
MatrixDatabase
: A wrapper for object-vectors stored in aMatrix
, columns are the objects. It is static.DynamicMatrixDatabase
: A dynamic representation for vectors that allows adding new vectors.VectorDatabase
: A wrapper for vector-like structures. It can contain any kind of objects.SubDatabase
: A sample of a given database
In particular, the storage details are not used by VectorDatabase
and MatrixDatabase
. For instance, it is possible to use matrices like Matrix
, SMatrix
or StrideArrays
; or even use generated objects with VectorDatabase
(supporting a vector-like interface).
If the storage backend support it, it is possible to use vector operations, for example:
- get the
i
-th elementobj = db[i]
, elements in the database are identified by position - get the elements list in a list of indices
lst
asdb[lst]
(also usingview
) - set a value at the
i
-th elementdb[i] = obj
- random sampling
rand(db)
,rand(db, 3)
- iterate and collect objects in the database
- get the number of elements in the database
length(db)
- add new objects to the end of the database (not all internal containers will support it)
push_item!(db, u)
adds a single elementu
append_items!(db, lst)
adds a list of objects to the end of the database
SimilaritySearch.MatrixDatabase
— Typestruct MatrixDatabase{M<:AbstractDatabase} <: AbstractDatabase
MatrixDatabase(matrix::AbstractMatrix)
Wraps a matrix-like object matrix
into a MatrixDatabase
. Please see AbstractDatabase
for general usage.
SimilaritySearch.VectorDatabase
— Typestruct VectorDatabase{V} <: AbstractDatabase
A wrapper for vector-like databases
SimilaritySearch.DynamicMatrixDatabase
— Typestruct DynamicMatrixDatabase{DType,Dim} <: AbstractDatabase
A dynamic matrix with elements of type DType
and dimension Dim
SimilaritySearch.StrideMatrixDatabase
— Typestruct StrideMatrixDatabase{M<:StrideArray} <: AbstractDatabase
StrideMatrixDatabase(matrix::StrideArray)
Wraps a matrix object into a StrideArray
and wrap it as a database. Please see AbstractDatabase
for general usage.
SimilaritySearch.find_neighborhood
— Functionfind_neighborhood(copy_, index::SearchGraph{T}, context, item; hints=index.hints)
Searches for item
neighborhood in the index, i.e., if item
were in the index whose items should be its neighbors (intenal function). The copy_
function forces to control how the returned KnnResult object is handled because it uses a cache result set from the given context.
Arguments
copy_
: A copying function, it controls what is retrieved by the function.index
: The search index.item
: The item to be inserted.context
: context, neighborhood, and cache objects to be usedhints
: Search hints
Missing docstring for SatPruning
. Check Documenter's build log for details.
Missing docstring for RandomPruning
. Check Documenter's build log for details.
Missing docstring for KeepNearestPruning
. Check Documenter's build log for details.
Missing docstring for NeighborhoodPruning
. Check Documenter's build log for details.
SimilaritySearch.maxlength
— Functionmaxlength(res::KnnResult)
The maximum allowed cardinality (the k of knn)
SimilaritySearch.get_parallel_block
— Functionget_parallel_block()
Used by SearchGraph insertion functions to solve find_neighborhood
in blocks. Small blocks are better to ensure quality; faster constructions will be achieved if parallel_block
is a multiply of Threads.nthreads()
SimilaritySearch.SimilarityFromDistance
— TypeSimilarityFromDistance(dist)
Evaluates as $1/(1 + d)$ for a distance evaluation $d$ of dist
. This is not a distance function and is part of the hacks to get a similarity for searching farthest elements on indexes that can handle this hack (e.g., ExhaustiveSearch
, ParallelExhaustiveSearch
, SearchGraph
).
SimilaritySearch.execute_callbacks
— Functionexecute_callbacks(index, context, n=length(index), m=n+1)
Process all registered callbacks