Indexes
SimilaritySearch.ExhaustiveSearch — Type
ExhaustiveSearch(dist::PreMetric, db::AbstractVector)Solves queries evaluating dist for the query and all elements in the dataset
SimilaritySearch.ParallelExhaustiveSearch — Type
ParallelExhaustiveSearch(; dist=SqL2Distance(), db=VectorDatabase{Float32}())Solves queries evaluating dist in parallel for the query and all elements in the dataset. Note that this should not be used in conjunction with searchbatch(...; parallel=true) since they will compete for resources.
SimilaritySearch.SearchGraph — Type
struct SearchGraph <: AbstractSearchIndexSearchGraph index. It stores a set of points that can be compared through a distance function dist. The performance is determined by the search algorithm algo and the neighborhood policy. It supports callbacks to adjust parameters as insertions are made.
hints: Initial points for exploration (empty hints imply using random points)
Note: Parallel insertions should be made through append! or index! function with parallel_block > 1
Searching
SimilaritySearch.search — Function
search(seq::ExhaustiveSearch, ctx::AbstractContext, q, res)Solves the query evaluating all items in the given query.
search(pex::ParallelExhaustiveSearch, ctx::GenericContext, q, res)Solves the query evaluating all items in the given query.
Arguments
pex: the search structureq: the query to solveres: the result setctx: running ctx
search(bs::BeamSearch, index::SearchGraph, ctx, q, res, hints, vstate)Tries to reach the set of nearest neighbors specified in res for q.
bs: the parameters ofBeamSearchindex: the local search indexctx: A SearchGraphContext object with preallocated objectsq: the queryres: The result object, it stores the results and also specifies the kind of queryhints: Starting points for searching, randomly selected when it is an empty collectionvstate: data structure to mark visited vertices
search(index::SearchGraph, ctx::SearchGraphContext, q, resSolves the specified query res for the query object q.
SimilaritySearch.searchbatch — Function
searchbatch(index, ctx, Q, k::Integer) -> indices, distances
searchbatch(index, Q, k::Integer) -> indices, distancesSearches a batch of queries in the given index (searches for k neighbors).
Arguments
index: The search structureQ: The set of queriesk: The number of neighbors to retrievecontext: caches, hyperparameters, and meta data (defaults togetcontext(index))sorted=true: ensures that the results are sorted by distance.
Note: The i-th column in indices and distances correspond to the i-th query in Q Note: The final indices at each column can be 0 if the search process was unable to retrieve k neighbors.
SimilaritySearch.searchbatch! — Function
searchbatch!(index, ctx, Q, knns; sorted) -> knnsSearches a batch of queries in the given index and use knns as output (searches for k=size(I, 1))
Arguments
index: The search structurectx: Context of the search algorithm, environment for running searches (hyperparameters and caches)Q: The set of queriesknns: Output, a matrix of IdDist elements (initialized withzeros); an array of KnnAbstract elements, use this form to retrieve search costs.
Keyword arguments
sorted: indicates whether the output should be sorted or not.
Computing all knns
The operation of computing all knns in the index is computed as follows:
SimilaritySearch.allknn — Function
allknn(index, ctx, k::Integer; minbatch=0, sort=true, show_progress=true) -> knnsComputes all the k nearest neighbors (all vs all) using the given index. User must remove self references
Parameters:
index: the indexctx: the index's ctx (caches, hyperparameters, logger, etc)Query specification and result:
k: the number of neighbors to retrieveknns: an uninitialized IdDist matrix of (k, n) size for storing theknearest neighbors and sitances of thenelements
sort: ensures that result set is presented in ascending order by distanceshow_progress: enables or disables progress bar
Returns:
knnsa (k, n) matrix ofIdDistelements, i.e.,zeros(IdDist, k, n); the i-th column corresponds to the i-th object in the dataset. Zeros can happen to the end of each column meaning that the retrieval was less than the desiredk
Computing closest pair
The operation of finding the closest pair of elements in the indexed dataset.
SimilaritySearch.closestpair — Function
closestpair(idx::AbstractSearchIndex, ctx::AbstractContext)Finds the closest pair among all elements in idx. If the index idx is approximate then pair of points could be also an approximation.
Arguments:
idx: the search structure that indexes the set of pointsctx: the search context (caches, hyperparameters, etc)
Keyword Arguments:
min_k: instead of looking fork=1some approximate methods can take advantage of a largerk
Remove near duplicates
Finds and removes near duplicate items in a metric dataset
SimilaritySearch.neardup — Function
neardup(idx::AbstractSearchIndex, ctx::AbstractContext, X::AbstractDatabase, ϵ::Real; k::Int=8, blocksize::Int=get_parallel_block(), filterblocks=true, verbose=true)
neardup(dist::PreMetric, X::AbstractDatabase, ϵ::Real; kwargs...)Find nearest duplicates in database X using the empty index idx. The algorithm iteratively try to index elements in X, and items being near than ϵ to some element in idx will be ignored.
The function returns a named tuple (idx, map, nn, dist) where:
idx: it is the index of the non duplicated elementsctx: the index's contextmap: a mapping from|idx|-1to its positions inXnn: an array where each element in $x \in X$ points to its covering element (previously indexed elementusuch that $d(u, x_i) \leq ϵ$)dist: an array of distance values to each covering element (correspond to each element innn)
Arguments
idx: An empty index (i.e., aSearchGraph)X: The input datasetϵ: Real value to cut, if negative, then ϵ will be computed using the quantile value at 'abs(ϵ)' in a small sample of nearest neighbor distances; the quantile method should be used only for applications that need some vague approximations toϵ
Keyword arguments
k: The number of nearest neighbors to retrieve (some algorithms benefit from retrieving largerkvalues)blocksize: the number of items processed at the timefilterblocks: if true then it filters neardups inside blocks (seeblocksizeparameter), otherwise, it supposes that blocks are free of neardups (e.g., randomized order).verbose: controls the verbosity of the function
Notes
- The index
idxmust support incremental construction - If you need to customize object insertions, you must wrap the index
idxand implement your custom methods; it requires valid implementations of the following functions:searchbatch(idx::AbstractSearchIndex, ctx, queries::AbstractDatabase, knns::Matrix, dists::Matrix)distance(idx::AbstractSearchIndex)length(idx::AbstractSearchIndex)append_items!(idx::AbstractSearchIndex, ctx, items::AbstractDatabase)
- You can access the set of elements being 'ϵ-non duplicates (the $ϵ-net$) using
database(idx)or wherenn[i] == i
Indexing elements
SimilaritySearch.push_item! — Function
push_item!(res::KnnHeap, p::IdDist)Appends an item into the result set
push_item!(res::KnnSorted, p::IdDist)Appends an item into the result set
push_item!(
index::SearchGraph,
ctx,
item,
qcache
push_item
)Appends an object into the index. Internal function
Arguments:
index: The search graph index where the insertion is going to happen.item: The object to be inserted, it should be in the same space than other objects in the index and understood by the distance metric.ctx: The context environment of the graph, seeSearchGraphContext.tmp: knnqueue to be used by the neighborhood computationneighbors: knnqueue to be used by the neighborhood computationpush_db: iffalseis an internal option, used byappend!andindex!(it avoids to insertiteminto the database since it is already inserted but not indexed).Note: setting
callbacksasnothingignores the execution of any callback
SimilaritySearch.append_items! — Function
append_items!(
index::SearchGraph,
ctx::SearchGraphContext,
db
)Appends all items in db to the index. It can be made in parallel or sequentially.
Arguments:
index: the search graph indexdb: the collection of objects to insert, anAbstractDatabaseis the canonical input, but supports any iterable objectsctx: The context environment of the graph, seeSearchGraphContext.
SimilaritySearch.index! — Function
index!(index::SearchGraph, ctx::SearchGraphContext)Indexes the already initialized database (e.g., given in the constructor method). It can be made in parallel or sequentially. The arguments are the same than append_items! function but using the internal index.db as input.
Arguments:
index: The graph indexctx: The context environment of the graph, seeSearchGraphContext.
SimilaritySearch.rebuild — Function
rebuild(g::SearchGraph; context=SearchGraphContext())Rebuilds the SearchGraph index but seeing the whole dataset for the incremental construction, i.e., it can connect the i-th vertex to its knn in the 1..n possible vertices instead of its knn among 1..(i-1) as in the original algorithm.
Arguments
g: The search index to be rebuild.context: The context to run the procedure, it can differ from the original one.
Distance functions
The distance functions are defined to work under the evaluate(::metric, u, v) function (borrowed from Distances.jl package).
Minkowski vector distance functions
Cosine and angle distance functions for vectors
Missing docstring for NormalizedCosineDistance. Check Documenter's build log for details.
Missing docstring for NormalizedAngleDistance. Check Documenter's build log for details.
Set distance functions
Set bbject are represented as ordered arrays
Missing docstring for IntersectionDissimilarity. Check Documenter's build log for details.
Missing docstring for CosineDistanceSet. Check Documenter's build log for details.
String alignment distances
The following uses strings/arrays as input, i.e., objects follow the array interface. A broader set of distances for strings can be found in the StringDistances.jl package.
Missing docstring for CommonPrefixDissimilarity. Check Documenter's build log for details.
Missing docstring for GenericLevenshteinDistance. Check Documenter's build log for details.
Missing docstring for StringHammingDistance. Check Documenter's build log for details.
Missing docstring for LevenshteinDistance. Check Documenter's build log for details.
Distances for Cloud of points
Missing docstring for HausdorffDistance. Check Documenter's build log for details.
Missing docstring for MinHausdorffDistance. Check Documenter's build log for details.
Functions that customize parameters
Several algorithms support arguments that modify the performance, for instance, some of them should be computed or prepared with external functions or structs
SimilaritySearch.getminbatch — Function
getminbatch(n::Int, nt::Int)Used by functions that use parallelism on small batches; each block is processed by a single thread
Arguments
ctx: The search contextn: the number of elements to processnt: number of threads to use
SimilaritySearch.Neighborhood — Type
Neighborhood(; logbase=2, minsize=2, neardup=typemin(Float32), filter=SatNeighborhood())Determines the size of the neighborhood; it is adjusted as a callback exponentially. More detailed, the insertion algorithm searches for $log_\text{logbase}(N) + minsize)$ in the index where $N$ is the size of the index/dataset, then these neighbors are filtered with filter. The algorithms use neardup to discard proximal items to be part of a neighborhood.
Parameters
logbase=2: logarithmic base to determine the number of neighbors to retrieveminsize=2: minimum number of elements to retrieveneardup=typemin(Float32): distance to identify an element as duplicate (neardups could be ignored from neighborhoods)filter=SatNeighborhood(): strategy to reduce the number of neighbors
Note: Set $logbase=Inf$ to obtain a fixed number of $in$ nodes; and set $minsize=0$ to obtain a pure logarithmic growing neighborhood.
SimilaritySearch.Callback — Type
abstract type Callback endAbstract type to trigger callbacks after some number of insertions. SearchGraph stores the callbacks in callbacks (a dictionary that associates symbols and callback objects); A SearchGraph object controls when callbacks are fired using callback_logbase and callback_starting
Missing docstring for SearchGraphCallbacks. Check Documenter's build log for details.
SimilaritySearch.BeamSearchSpace — Type
BeamSearchSpace(; bsize, Δ, bsize_scale, Δ_scale)Define search space for beam search autotuning
Database API
SimilaritySearch.AbstractDatabase — Type
abstract type AbstractDatabase endBase type to represent databases. A database is a collection of objects that can be accessed like a similar interface to AbstractVector. It is separated to allow SimilaritySearch methods to know what is a database and what is an object (since most object representations will look as vectors and matrices).
The basic implementations are:
MatrixDatabase: A wrapper for object-vectors stored in aMatrix, columns are the objects. It is static.DynamicMatrixDatabase: A dynamic representation for vectors that allows adding new vectors.VectorDatabase: A wrapper for vector-like structures. It can contain any kind of objects.SubDatabase: A sample of a given database
In particular, the storage details are not used by VectorDatabase and MatrixDatabase. For instance, it is possible to use matrices like Matrix, SMatrix or StrideArrays; or even use generated objects with VectorDatabase (supporting a vector-like interface).
If the storage backend support it, it is possible to use vector operations, for example:
- get the
i-th elementobj = db[i], elements in the database are identified by position - get the elements list in a list of indices
lstasdb[lst](also usingview) - set a value at the
i-th elementdb[i] = obj - random sampling
rand(db),rand(db, 3) - iterate and collect objects in the database
- get the number of elements in the database
length(db) - add new objects to the end of the database (not all internal containers will support it)
push_item!(db, u)adds a single elementuappend_items!(db, lst)adds a list of objects to the end of the database
SimilaritySearch.MatrixDatabase — Type
struct MatrixDatabase{M<:AbstractDatabase} <: AbstractDatabase
MatrixDatabase(matrix::AbstractMatrix)Wraps a matrix-like object matrix into a MatrixDatabase. Please see AbstractDatabase for general usage.
SimilaritySearch.VectorDatabase — Type
struct VectorDatabase{V} <: AbstractDatabaseA wrapper for vector-like databases
Missing docstring for DynamicMatrixDatabase. Check Documenter's build log for details.
SimilaritySearch.StrideMatrixDatabase — Type
struct StrideMatrixDatabase{M<:StrideArray} <: AbstractDatabase
StrideMatrixDatabase(matrix::StrideArray)Wraps a matrix object into a StrideArray and wrap it as a database. Please see AbstractDatabase for general usage.
SimilaritySearch.find_neighborhood! — Function
find_neighborhood!(index::SearchGraph{T}, ctx, item, ksearch, blockrange; hints=index.hints)Searches for item neighborhood in the index, i.e., if item were in the index whose items should be its neighbors (intenal function).
Arguments
index: The search index.item: The item to be inserted.blockrange: Extra block range for parallel insertions, defaults to an empty rangectx: context, neighborhood, and cache objects to be usedhints: Search hints
Missing docstring for KeepNearestPruning. Check Documenter's build log for details.
Missing docstring for NeighborhoodPruning. Check Documenter's build log for details.
SimilaritySearch.maxlength — Function
maxlength(res::KnnHeap)The maximum allowed cardinality (the k of knn)
maxlength(res::KnnSorted)The maximum allowed cardinality (the k of knnSorted)
SimilaritySearch.get_parallel_block — Function
get_parallel_block()Used by SearchGraph insertion functions to solve find_neighborhood! in blocks. Small blocks are better to ensure quality; faster constructions will be achieved if parallel_block is a multiply of Threads.nthreads()