NeighborhoodApproximationIndex
NeighborhoodApproximationIndex.DistanceOnTopKOrdering
NeighborhoodApproximationIndex.DistanceOrdering
NeighborhoodApproximationIndex.InternalDistanceOrdering
NeighborhoodApproximationIndex.KnrCaches
NeighborhoodApproximationIndex.KnrIndex
NeighborhoodApproximationIndex.KnrIndex
Base.append!
Base.push!
NeighborhoodApproximationIndex.get_parallel_block
NeighborhoodApproximationIndex.references
SimilaritySearch.index!
SimilaritySearch.search
NeighborhoodApproximationIndex.DistanceOnTopKOrdering
— Typestruct DistanceOnTopKOrdering <: KnrOrderingStrategy
top::Int
end
Used as ordering
parameter of KnrIndex
specifies that only the top
elements are evaluated against the distance metric. Useful for very costly distance functions. If you are in doubt please use DistanceOrdering
instead.
NeighborhoodApproximationIndex.DistanceOrdering
— Typestruct DistanceOrdering <: KnrOrderingStrategy end
Used as ordering
parameter of KnrIndex
specifies that each object u
found by the inverted index will be evaluated against a given query. This is the default ordering strategy.
NeighborhoodApproximationIndex.InternalDistanceOrdering
— Typestruct InternalDistanceOrdering <: KnrOrderingStrategy end
Used as ordering
parameter of KnrIndex
specifies that the internal distance of the underlying inverted index will be used to ordering the k
nearest neighbors. Useful when the internal distance is quite representative of the real distance metric or whenever speed is the major objective.
NeighborhoodApproximationIndex.KnrCaches
— Typestruct KnrCaches
enc
end
Caches used for KnrIndex
(one per thread)
Properties
enc
:KnnResult
for encoding purposes
NeighborhoodApproximationIndex.KnrIndex
— Typestruct KnrIndex <: AbstractSearchContext
The K nearest references inverted index
Parameters
dist
: the distance function of the indexdb
: the database of indexed objectscenters
: a search index for a set of referencesinvfile
: an inverted file data structurekbuild
: the number of references to be computed and stored by each indexed objectordering
: specifies how the index performs finalk
nn selectionopt
: the parameters to be optimized byoptimize!
NeighborhoodApproximationIndex.KnrIndex
— MethodKnrIndex(
dist::SemiMetric,
db::AbstractDatabase;
invfiletype=BinaryInvertedFile,
invfiledist=JaccardDistance(),
initial=:dnet,
maxiters=0,
refs=references(dist, db; initial),
centers=nothing,
kbuild=3,
ksearch=1,
centersrecall::AbstractFloat=length(db) > 10^3 ? 0.95 : 1.0,
ordering=DistanceOrdering(),
pools=nothing,
parallel_block=get_parallel_block(length(db)),
verbose=false
)
A convenient function to create a KnrIndex
, it uses several default arguments. After the construction, use optimize!
to adjust the index to some performance.
Arguments
dist
: Distance object (aSemiMetric
object, seeDistances.jl
)db
: The database of objects to be indexed.
Keyword arguments
invfiletype
: the type of the underlying inverted file (BinaryInvertedFile
orWeightedInvertedFile
)invfiledist
: the distance of the underlying inverted file (seeInvertedFiles.jl
package)centers
: The index used for centers/references, ifcenters === nothing
then a sample ofdb
will be used.initial
: indicates how references are selected, only used ifrefs
will be computed; seereferences
for more detail.maxiters
: how many iterations of the Lloyd algorithm are applied to initial references, only used ifrefs
will be computed; seereferences
for more detail.refs
: the set of reference, only used ifcenters === nothing
centersrecall
: used whencenters === nothing
; ifcentersrecall == 1
then it will create an exact index onrefs
or an approximate if0 < centersrecall < 1
kbuild
: the number of references to compute and store on constructionksearch
: the number of references to compute on searchingordering
: the ordering strategypools
: an object with preallocated caches specific forKnrIndex
, ifpools=nothing
it will use default caches.parallel_block
Parallel construction works on batches, this is the size of these blocksverbose
true if you want to see messages
Base.append!
— Methodappend!(idx::KnrIndex, db; <kwargs>)
Appends all items in the database db
into the index
Arguments
idx
: the index structuredb
: the objects to be appended
Keyword arguments
parallel_block
: the number of elements to be inserted in parallelpools
: unused argumentverbose
: controls the verbosity of the procedure
Base.push!
— Methodpush!(idx::KnrIndex, obj; pools=getpools(idx), encpools=getpools(idx.centers))
Inserts obj
into the indexed
NeighborhoodApproximationIndex.get_parallel_block
— Methodget_parallel_block(n)
An heuristic to compute the parallel_block
with respect with the number of elements to insert
NeighborhoodApproximationIndex.references
— Methodreferences(dist::SemiMetric, db; <kwargs>) -> SubDatabase
Computes a set of references, a sample of db
, using the KCenters
specification, it is a wrapper to kcenters
function in the KCenters.jl
package. The set of references are computed taking into account the specified r
but also the distance function dist
.
Arguments
dist
: a distance functiondb
: the database to be sampled
Keyword arguments
k::Int
: the number of centers to compute, defaults tosqrt(|db|)
sample::Real
: indicates the sampling size before computing the set ofk
references, defaults tolog(|db|) k
;sample=0
means for no sampling.maxiters::Int
: number of iterationso of the Lloyd algorithm that should be applied on the initial computation of centers, that is,maxiters > 0
appliesmaxiters
iterations of the algorithm.tol::Float64
: change tolerance to stop the Lloyd algorithm (error changes smaller thantol
among iterations will stop the algorithm)initial
: initial centers or a strategy to compute initial centers, symbols:rand
,:fft
, and:dnet
.
There are several interactions between initial values and parameter interactions (described in KCenters
object), for instance, the maxiters > 0
apply the Lloyd's algorithm to the initial computation of references.
- if
initial=:rand
:maxiters = 0
will retrieve a simple random samplingmaxiters > 0' achieve kmeans-centroids,
maxiters` should be set appropiately for the the dataset
- if
initial=:dnet
:maxiters = 0
computes a pure density-netmaxiters > 0
will compute a kmeans centroids but with an initialization based on the dnet
- if
initial=:fft
:maxiters = 0
computesk
centers with the farthest first traversal algorithmmaxiters > 0
will use the FFT based kcenters as initial points for the Lloyd algorithm
Note 1: maxiters > 0
needs to compute centroids and these centroids should be defined for the specific data model, and also be of use in the specific metric distance and error function.
Note 2: The error function is defined as the mean of distances of all objects in db
to its associated nearest centers in each iteration.
Note 3: The computation of references on large databases can be prohibitive, in these cases please consider to work on a sample of db
SimilaritySearch.index!
— Methodindex!(idx::KnrIndex; parallel_block=get_parallel_block(length(idx.db)), pools=nothing, verbose=true)
Indexes all non indexed items in the database
Arguments
idx
: the index structure
Keyword arguments
parallel_block
: the number of elements to be inserted in parallelpools
: unused parameterverbose
: controls verbosity of the procedure
SimilaritySearch.search
— Methodsearch(idx::KnrIndex, q, res::KnnResult; ksearch=idx.opt.ksearch, ordering=idx.ordering, pools=getpools(D))
Searches nearest neighbors of q
inside the index
under the distance function dist
.