NeighborhoodApproximationIndex

NeighborhoodApproximationIndex.DistanceOnTopKOrderingType
struct DistanceOnTopKOrdering <: KnrOrderingStrategy
    top::Int
end

Used as ordering parameter of KnrIndex specifies that only the top elements are evaluated against the distance metric. Useful for very costly distance functions. If you are in doubt please use DistanceOrdering instead.

source
NeighborhoodApproximationIndex.DistanceOrderingType
struct DistanceOrdering <: KnrOrderingStrategy end

Used as ordering parameter of KnrIndex specifies that each object u found by the inverted index will be evaluated against a given query. This is the default ordering strategy.

source
NeighborhoodApproximationIndex.InternalDistanceOrderingType
struct InternalDistanceOrdering <: KnrOrderingStrategy end

Used as ordering parameter of KnrIndex specifies that the internal distance of the underlying inverted index will be used to ordering the k nearest neighbors. Useful when the internal distance is quite representative of the real distance metric or whenever speed is the major objective.

source
NeighborhoodApproximationIndex.KnrIndexType
struct KnrIndex <: AbstractSearchContext

The K nearest references inverted index

Parameters

  • dist: the distance function of the index
  • db: the database of indexed objects
  • centers: a search index for a set of references
  • invfile: an inverted file data structure
  • kbuild: the number of references to be computed and stored by each indexed object
  • ordering: specifies how the index performs final k nn selection
  • opt: the parameters to be optimized by optimize!
source
NeighborhoodApproximationIndex.KnrIndexMethod
KnrIndex(
    dist::SemiMetric,
    db::AbstractDatabase;
    invfiletype=BinaryInvertedFile,
    invfiledist=JaccardDistance(),
    initial=:dnet,
    maxiters=0,
    refs=references(dist, db; initial),
    centers=nothing,
    kbuild=3,
    ksearch=1,
    centersrecall::AbstractFloat=length(db) > 10^3 ? 0.95 : 1.0,
    ordering=DistanceOrdering(),
    pools=nothing,
    parallel_block=get_parallel_block(length(db)),
    verbose=false
)

A convenient function to create a KnrIndex, it uses several default arguments. After the construction, use optimize! to adjust the index to some performance.

Arguments

  • dist: Distance object (a SemiMetric object, see Distances.jl)
  • db: The database of objects to be indexed.

Keyword arguments

  • invfiletype: the type of the underlying inverted file (BinaryInvertedFile or WeightedInvertedFile)
  • invfiledist: the distance of the underlying inverted file (see InvertedFiles.jl package)
  • centers: The index used for centers/references, if centers === nothing then a sample of db will be used.
  • initial: indicates how references are selected, only used if refs will be computed; see references for more detail.
  • maxiters: how many iterations of the Lloyd algorithm are applied to initial references, only used if refs will be computed; see references for more detail.
  • refs: the set of reference, only used if centers === nothing
  • centersrecall: used when centers === nothing; if centersrecall == 1 then it will create an exact index on refs or an approximate if 0 < centersrecall < 1
  • kbuild: the number of references to compute and store on construction
  • ksearch: the number of references to compute on searching
  • ordering: the ordering strategy
  • pools: an object with preallocated caches specific for KnrIndex, if pools=nothing it will use default caches.
  • parallel_block Parallel construction works on batches, this is the size of these blocks
  • verbose true if you want to see messages
source
Base.append!Method
append!(idx::KnrIndex, db; <kwargs>)

Appends all items in the database db into the index

Arguments

  • idx: the index structure
  • db: the objects to be appended

Keyword arguments

  • parallel_block: the number of elements to be inserted in parallel
  • pools: unused argument
  • verbose: controls the verbosity of the procedure
source
Base.push!Method
push!(idx::KnrIndex, obj; pools=getpools(idx), encpools=getpools(idx.centers))

Inserts obj into the indexed

source
NeighborhoodApproximationIndex.referencesMethod
references(dist::SemiMetric, db; <kwargs>) -> SubDatabase

Computes a set of references, a sample of db, using the KCenters specification, it is a wrapper to kcenters function in the KCenters.jl package. The set of references are computed taking into account the specified r but also the distance function dist.

Arguments

  • dist: a distance function
  • db: the database to be sampled

Keyword arguments

  • k::Int: the number of centers to compute, defaults to sqrt(|db|)
  • sample::Real: indicates the sampling size before computing the set of k references, defaults to log(|db|) k; sample=0 means for no sampling.
  • maxiters::Int: number of iterationso of the Lloyd algorithm that should be applied on the initial computation of centers, that is, maxiters > 0 applies maxiters iterations of the algorithm.
  • tol::Float64: change tolerance to stop the Lloyd algorithm (error changes smaller than tol among iterations will stop the algorithm)
  • initial: initial centers or a strategy to compute initial centers, symbols :rand, :fft, and :dnet.

There are several interactions between initial values and parameter interactions (described in KCenters object), for instance, the maxiters > 0 apply the Lloyd's algorithm to the initial computation of references.

  • if initial=:rand:
    • maxiters = 0 will retrieve a simple random sampling
    • maxiters > 0' achieve kmeans-centroids,maxiters` should be set appropiately for the the dataset
  • if initial=:dnet:
    • maxiters = 0 computes a pure density-net
    • maxiters > 0 will compute a kmeans centroids but with an initialization based on the dnet
  • if initial=:fft:
    • maxiters = 0 computes k centers with the farthest first traversal algorithm
    • maxiters > 0 will use the FFT based kcenters as initial points for the Lloyd algorithm

Note 1: maxiters > 0 needs to compute centroids and these centroids should be defined for the specific data model, and also be of use in the specific metric distance and error function.

Note 2: The error function is defined as the mean of distances of all objects in db to its associated nearest centers in each iteration.

Note 3: The computation of references on large databases can be prohibitive, in these cases please consider to work on a sample of db

source
SimilaritySearch.index!Method
index!(idx::KnrIndex; parallel_block=get_parallel_block(length(idx.db)), pools=nothing, verbose=true)

Indexes all non indexed items in the database

Arguments

  • idx: the index structure

Keyword arguments

  • parallel_block: the number of elements to be inserted in parallel
  • pools: unused parameter
  • verbose: controls verbosity of the procedure
source
SimilaritySearch.searchMethod
search(idx::KnrIndex, q, res::KnnResult; ksearch=idx.opt.ksearch, ordering=idx.ordering, pools=getpools(D))

Searches nearest neighbors of q inside the index under the distance function dist.

source