Indexes

SimilaritySearch.ParallelExhaustiveSearchType
ParallelExhaustiveSearch(; dist=SqL2Distance(), db=VectorDatabase{Float32}())

Solves queries evaluating dist in parallel for the query and all elements in the dataset. Note that this should not be used in conjunction with searchbatch(...; parallel=true) since they will compete for resources.

source
SimilaritySearch.SearchGraphType
struct SearchGraph <: AbstractSearchIndex

SearchGraph index. It stores a set of points that can be compared through a distance function dist. The performance is determined by the search algorithm algo and the neighborhood policy. It supports callbacks to adjust parameters as insertions are made.

  • hints: Initial points for exploration (empty hints imply using random points)

Note: Parallel insertions should be made through append! or index! function with parallel_block > 1

source

Searching

SimilaritySearch.searchFunction
search(seq::ExhaustiveSearch, ctx::AbstractContext, q, res)

Solves the query evaluating all items in the given query.

source
search(pex::ParallelExhaustiveSearch, ctx::GenericContext, q, res)

Solves the query evaluating all items in the given query.

Arguments

  • pex: the search structure
  • q: the query to solve
  • res: the result set
  • ctx: running ctx
source
search(bs::BeamSearch, index::SearchGraph, ctx, q, res, hints, vstate)

Tries to reach the set of nearest neighbors specified in res for q.

  • bs: the parameters of BeamSearch
  • index: the local search index
  • ctx: A SearchGraphContext object with preallocated objects
  • q: the query
  • res: The result object, it stores the results and also specifies the kind of query
  • hints: Starting points for searching, randomly selected when it is an empty collection
  • vstate: data structure to mark visited vertices
source
search(index::SearchGraph, ctx::SearchGraphContext, q, res

Solves the specified query res for the query object q.

source
SimilaritySearch.searchbatchFunction
searchbatch(index, ctx, Q, k::Integer) -> indices, distances
searchbatch(index, Q, k::Integer) -> indices, distances

Searches a batch of queries in the given index (searches for k neighbors).

Arguments

  • index: The search structure
  • Q: The set of queries
  • k: The number of neighbors to retrieve
  • context: caches, hyperparameters, and meta data (defaults to getcontext(index))
  • sorted=true: ensures that the results are sorted by distance.

Note: The i-th column in indices and distances correspond to the i-th query in Q Note: The final indices at each column can be 0 if the search process was unable to retrieve k neighbors.

source
SimilaritySearch.searchbatch!Function
searchbatch!(index, ctx, Q, knns; sorted) -> knns

Searches a batch of queries in the given index and use knns as output (searches for k=size(I, 1))

Arguments

  • index: The search structure
  • ctx: Context of the search algorithm, environment for running searches (hyperparameters and caches)
  • Q: The set of queries
  • knns: Output, a matrix of IdDist elements (initialized with zeros); an array of KnnAbstract elements, use this form to retrieve search costs.

Keyword arguments

  • sorted: indicates whether the output should be sorted or not.
source

Computing all knns

The operation of computing all knns in the index is computed as follows:

SimilaritySearch.allknnFunction
allknn(index, ctx, k::Integer; minbatch=0, sort=true, show_progress=true) -> knns

Computes all the k nearest neighbors (all vs all) using the given index. User must remove self references

Parameters:

  • index: the index

  • ctx: the index's ctx (caches, hyperparameters, logger, etc)

  • Query specification and result:

    • k: the number of neighbors to retrieve
    • knns: an uninitialized IdDist matrix of (k, n) size for storing the k nearest neighbors and sitances of the n elements
  • sort: ensures that result set is presented in ascending order by distance

  • show_progress: enables or disables progress bar

Returns:

  • knns a (k, n) matrix of IdDist elements, i.e., zeros(IdDist, k, n); the i-th column corresponds to the i-th object in the dataset. Zeros can happen to the end of each column meaning that the retrieval was less than the desired k
source

Computing closest pair

The operation of finding the closest pair of elements in the indexed dataset.

SimilaritySearch.closestpairFunction
closestpair(idx::AbstractSearchIndex, ctx::AbstractContext)

Finds the closest pair among all elements in idx. If the index idx is approximate then pair of points could be also an approximation.

Arguments:

  • idx: the search structure that indexes the set of points
  • ctx: the search context (caches, hyperparameters, etc)

Keyword Arguments:

  • min_k: instead of looking for k=1 some approximate methods can take advantage of a larger k
source

Remove near duplicates

Finds and removes near duplicate items in a metric dataset

SimilaritySearch.neardupFunction
neardup(idx::AbstractSearchIndex, ctx::AbstractContext, X::AbstractDatabase, ϵ::Real; k::Int=8, blocksize::Int=get_parallel_block(), filterblocks=true, verbose=true)
neardup(dist::PreMetric, X::AbstractDatabase, ϵ::Real; kwargs...)

Find nearest duplicates in database X using the empty index idx. The algorithm iteratively try to index elements in X, and items being near than ϵ to some element in idx will be ignored.

The function returns a named tuple (idx, map, nn, dist) where:

  • idx: it is the index of the non duplicated elements
  • ctx: the index's context
  • map: a mapping from |idx|-1 to its positions in X
  • nn: an array where each element in $x \in X$ points to its covering element (previously indexed element u such that $d(u, x_i) \leq ϵ$)
  • dist: an array of distance values to each covering element (correspond to each element in nn)

Arguments

  • idx: An empty index (i.e., a SearchGraph)
  • X: The input dataset
  • ϵ: Real value to cut, if negative, then ϵ will be computed using the quantile value at 'abs(ϵ)' in a small sample of nearest neighbor distances; the quantile method should be used only for applications that need some vague approximations to ϵ

Keyword arguments

  • k: The number of nearest neighbors to retrieve (some algorithms benefit from retrieving larger k values)
  • blocksize: the number of items processed at the time
  • filterblocks: if true then it filters neardups inside blocks (see blocksize parameter), otherwise, it supposes that blocks are free of neardups (e.g., randomized order).
  • verbose: controls the verbosity of the function

Notes

  • The index idx must support incremental construction
  • If you need to customize object insertions, you must wrap the index idx and implement your custom methods; it requires valid implementations of the following functions:
    • searchbatch(idx::AbstractSearchIndex, ctx, queries::AbstractDatabase, knns::Matrix, dists::Matrix)
    • distance(idx::AbstractSearchIndex)
    • length(idx::AbstractSearchIndex)
    • append_items!(idx::AbstractSearchIndex, ctx, items::AbstractDatabase)
  • You can access the set of elements being 'ϵ-non duplicates (the $ϵ-net$) using database(idx) or where nn[i] == i
source

Indexing elements

SimilaritySearch.push_item!Function
push_item!(res::KnnHeap, p::IdDist)

Appends an item into the result set

source
push_item!(res::KnnSorted, p::IdDist)

Appends an item into the result set

source
push_item!(
    index::SearchGraph,
    ctx,
    item,
    qcache
    push_item
)

Appends an object into the index. Internal function

Arguments:

  • index: The search graph index where the insertion is going to happen.

  • item: The object to be inserted, it should be in the same space than other objects in the index and understood by the distance metric.

  • ctx: The context environment of the graph, see SearchGraphContext.

  • tmp: knnqueue to be used by the neighborhood computation

  • neighbors: knnqueue to be used by the neighborhood computation

  • push_db: if false is an internal option, used by append! and index! (it avoids to insert item into the database since it is already inserted but not indexed).

  • Note: setting callbacks as nothing ignores the execution of any callback

source
SimilaritySearch.append_items!Function
append_items!(
    index::SearchGraph,
    ctx::SearchGraphContext,
    db
)

Appends all items in db to the index. It can be made in parallel or sequentially.

Arguments:

  • index: the search graph index
  • db: the collection of objects to insert, an AbstractDatabase is the canonical input, but supports any iterable objects
  • ctx: The context environment of the graph, see SearchGraphContext.
source
SimilaritySearch.index!Function
index!(index::SearchGraph, ctx::SearchGraphContext)

Indexes the already initialized database (e.g., given in the constructor method). It can be made in parallel or sequentially. The arguments are the same than append_items! function but using the internal index.db as input.

Arguments:

source
SimilaritySearch.rebuildFunction
rebuild(g::SearchGraph; context=SearchGraphContext())

Rebuilds the SearchGraph index but seeing the whole dataset for the incremental construction, i.e., it can connect the i-th vertex to its knn in the 1..n possible vertices instead of its knn among 1..(i-1) as in the original algorithm.

Arguments

  • g: The search index to be rebuild.
  • context: The context to run the procedure, it can differ from the original one.
source

Distance functions

The distance functions are defined to work under the evaluate(::metric, u, v) function (borrowed from Distances.jl package).

Minkowski vector distance functions

Missing docstring.

Missing docstring for L1Distance. Check Documenter's build log for details.

Missing docstring.

Missing docstring for L2Distance. Check Documenter's build log for details.

Missing docstring.

Missing docstring for SqL2Distance. Check Documenter's build log for details.

Missing docstring.

Missing docstring for LInftyDistance. Check Documenter's build log for details.

Missing docstring.

Missing docstring for LpDistance. Check Documenter's build log for details.

Cosine and angle distance functions for vectors

Missing docstring.

Missing docstring for CosineDistance. Check Documenter's build log for details.

Missing docstring.

Missing docstring for NormalizedCosineDistance. Check Documenter's build log for details.

Missing docstring.

Missing docstring for AngleDistance. Check Documenter's build log for details.

Missing docstring.

Missing docstring for NormalizedAngleDistance. Check Documenter's build log for details.

Set distance functions

Set bbject are represented as ordered arrays

Missing docstring.

Missing docstring for JaccardDistance. Check Documenter's build log for details.

Missing docstring.

Missing docstring for DiceDistance. Check Documenter's build log for details.

Missing docstring.

Missing docstring for IntersectionDissimilarity. Check Documenter's build log for details.

Missing docstring.

Missing docstring for CosineDistanceSet. Check Documenter's build log for details.

String alignment distances

The following uses strings/arrays as input, i.e., objects follow the array interface. A broader set of distances for strings can be found in the StringDistances.jl package.

Missing docstring.

Missing docstring for CommonPrefixDissimilarity. Check Documenter's build log for details.

Missing docstring.

Missing docstring for GenericLevenshteinDistance. Check Documenter's build log for details.

Missing docstring.

Missing docstring for StringHammingDistance. Check Documenter's build log for details.

Missing docstring.

Missing docstring for LevenshteinDistance. Check Documenter's build log for details.

Missing docstring.

Missing docstring for LcsDistance. Check Documenter's build log for details.

Distances for Cloud of points

Missing docstring.

Missing docstring for HausdorffDistance. Check Documenter's build log for details.

Missing docstring.

Missing docstring for MinHausdorffDistance. Check Documenter's build log for details.

Functions that customize parameters

Several algorithms support arguments that modify the performance, for instance, some of them should be computed or prepared with external functions or structs

SimilaritySearch.getminbatchFunction
getminbatch(n::Int, nt::Int)

Used by functions that use parallelism on small batches; each block is processed by a single thread

Arguments

  • ctx: The search context
  • n: the number of elements to process
  • nt: number of threads to use
source
Missing docstring.

Missing docstring for getknnresult. Check Documenter's build log for details.

Missing docstring.

Missing docstring for getpools. Check Documenter's build log for details.

SimilaritySearch.NeighborhoodType
Neighborhood(; logbase=2, minsize=2, neardup=typemin(Float32), filter=SatNeighborhood())

Determines the size of the neighborhood; it is adjusted as a callback exponentially. More detailed, the insertion algorithm searches for $log_\text{logbase}(N) + minsize)$ in the index where $N$ is the size of the index/dataset, then these neighbors are filtered with filter. The algorithms use neardup to discard proximal items to be part of a neighborhood.

Parameters

  • logbase=2: logarithmic base to determine the number of neighbors to retrieve
  • minsize=2: minimum number of elements to retrieve
  • neardup=typemin(Float32): distance to identify an element as duplicate (neardups could be ignored from neighborhoods)
  • filter=SatNeighborhood(): strategy to reduce the number of neighbors

Note: Set $logbase=Inf$ to obtain a fixed number of $in$ nodes; and set $minsize=0$ to obtain a pure logarithmic growing neighborhood.

source
SimilaritySearch.CallbackType
abstract type Callback end

Abstract type to trigger callbacks after some number of insertions. SearchGraph stores the callbacks in callbacks (a dictionary that associates symbols and callback objects); A SearchGraph object controls when callbacks are fired using callback_logbase and callback_starting

source
Missing docstring.

Missing docstring for SearchGraphCallbacks. Check Documenter's build log for details.

Database API

SimilaritySearch.AbstractDatabaseType
abstract type AbstractDatabase end

Base type to represent databases. A database is a collection of objects that can be accessed like a similar interface to AbstractVector. It is separated to allow SimilaritySearch methods to know what is a database and what is an object (since most object representations will look as vectors and matrices).

The basic implementations are:

  • MatrixDatabase: A wrapper for object-vectors stored in a Matrix, columns are the objects. It is static.
  • DynamicMatrixDatabase: A dynamic representation for vectors that allows adding new vectors.
  • VectorDatabase: A wrapper for vector-like structures. It can contain any kind of objects.
  • SubDatabase: A sample of a given database

In particular, the storage details are not used by VectorDatabase and MatrixDatabase. For instance, it is possible to use matrices like Matrix, SMatrix or StrideArrays; or even use generated objects with VectorDatabase (supporting a vector-like interface).

If the storage backend support it, it is possible to use vector operations, for example:

  • get the i-th element obj = db[i], elements in the database are identified by position
  • get the elements list in a list of indices lst as db[lst] (also using view)
  • set a value at the i-th element db[i] = obj
  • random sampling rand(db), rand(db, 3)
  • iterate and collect objects in the database
  • get the number of elements in the database length(db)
  • add new objects to the end of the database (not all internal containers will support it)
    • push_item!(db, u) adds a single element u
    • append_items!(db, lst) adds a list of objects to the end of the database
source
Missing docstring.

Missing docstring for DynamicMatrixDatabase. Check Documenter's build log for details.

SimilaritySearch.find_neighborhood!Function
find_neighborhood!(index::SearchGraph{T}, ctx, item, ksearch, blockrange; hints=index.hints)

Searches for item neighborhood in the index, i.e., if item were in the index whose items should be its neighbors (intenal function).

Arguments

  • index: The search index.
  • item: The item to be inserted.
  • blockrange: Extra block range for parallel insertions, defaults to an empty range
  • ctx: context, neighborhood, and cache objects to be used
  • hints: Search hints
source
Missing docstring.

Missing docstring for SatPruning. Check Documenter's build log for details.

Missing docstring.

Missing docstring for RandomPruning. Check Documenter's build log for details.

Missing docstring.

Missing docstring for KeepNearestPruning. Check Documenter's build log for details.

Missing docstring.

Missing docstring for NeighborhoodPruning. Check Documenter's build log for details.

SimilaritySearch.maxlengthFunction
maxlength(res::KnnHeap)

The maximum allowed cardinality (the k of knn)

source
maxlength(res::KnnSorted)

The maximum allowed cardinality (the k of knnSorted)

source
SimilaritySearch.get_parallel_blockFunction
get_parallel_block()

Used by SearchGraph insertion functions to solve find_neighborhood! in blocks. Small blocks are better to ensure quality; faster constructions will be achieved if parallel_block is a multiply of Threads.nthreads()

source
Missing docstring.

Missing docstring for SimilarityFromDistance. Check Documenter's build log for details.

Missing docstring.

Missing docstring for execute_callbacks. Check Documenter's build log for details.