UMAP
SimSearchManifoldLearning.UMAP
— Type

struct UMAP

The UMAP model struct.

Properties

- graph: the fuzzy simplicial set that represents the all-knn graph
- embedding: the embedding projection
- k: the number of neighbors used to create the model
- a and b: parameters that ensure a well-distributed and smooth projection (derived from the min_dist and spread arguments of fit)
- index: the search index; it can be nothing if the model is handled directly with precomputed knns and dists matrices
StatsAPI.fit
— Function

fit(Type{UMAP}, knns, dists; <kwargs>) -> UMAP object
Create a model representing the embedding of data (X, dist) into a maxoutdim-dimensional space. Note that knns and dists jointly specify the all-k-nearest-neighbors result of $(X, dist)$; these results must not include self-references. See the allknn method in SimilaritySearch.

Arguments

- knns: a $(k, n)$ matrix of integers (identifiers)
- dists: a $(k, n)$ matrix of floating-point values (distances)

The projection uses all available threads.
Keyword Arguments

- maxoutdim::Integer=2: the number of components in the embedding
- n_epochs::Integer=300: the number of training epochs for embedding optimization
- learning_rate::Real=1: the initial learning rate during optimization
- learning_rate_decay::Real=0.9: how much learning_rate is updated on each epoch (learning_rate *= learning_rate_decay); a minimum value of 1e-6 is also enforced
- layout::AbstractLayout=SpectralLayout(): how to initialize the output embedding
- min_dist::Real=0.1: the minimum spacing of points in the output embedding
- spread::Real=1: the effective scale of embedded points; in combination with min_dist, it determines how clustered the embedded points are
- set_operation_ratio::Real=1: interpolates between fuzzy set union and fuzzy set intersection when constructing the UMAP graph (global fuzzy simplicial set). The value should be between 0.0 and 1.0: 1.0 indicates pure fuzzy union, while 0.0 indicates pure fuzzy intersection
- local_connectivity::Integer=1: the number of nearest neighbors that should be assumed to be locally connected. The higher this value, the more connected the manifold becomes. It should not be set higher than the intrinsic dimension of the manifold
- repulsion_strength::Real=1: the weighting of negative samples during the optimization process
- neg_sample_rate::Integer=5: the number of negative samples to select for each positive sample. Higher values increase computational cost but yield slightly better accuracy
- tol::Real=1e-4: tolerance for early stopping while optimizing embeddings
- minbatch=0: controls how the computation is parallelized; zero uses the SimilaritySearch defaults and -1 avoids parallel computation. Passed to the @batch macro of the Polyester package
- a=nothing: controls the embedding; by default it is determined automatically from min_dist and spread
- b=nothing: controls the embedding; by default it is determined automatically from min_dist and spread
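As an illustrative sketch (assuming the SimSearchManifoldLearning and SimilaritySearch packages and the names used in the docstrings above; exact SimilaritySearch calls may vary by version), fitting from a precomputed all-knn result looks like:

```julia
using SimSearchManifoldLearning, SimilaritySearch

# 1000 points in 8 dimensions, one point per column (SimilaritySearch convention)
X = MatrixDatabase(rand(Float32, 8, 1000))
index = SearchGraph(; dist=L2Distance(), db=X)
index!(index)

# (k, n) identifier and distance matrices; allknn excludes self-references
knns, dists = allknn(index, 16)

model = fit(UMAP, knns, dists; maxoutdim=2, n_epochs=100)
```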
fit(::Type{<:UMAP}, index_or_data;
    k=15,
    dist::SemiMetric=L2Distance(),
    minbatch=0,
    kwargs...)

A wrapper for fit that computes the k nearest neighbors on index_or_data and passes these, along with kwargs, to the regular fit method.
Arguments

- index_or_data: an already constructed index (see SimilaritySearch), a matrix, or an abstract database (SimilaritySearch)

Keyword arguments

- k=15: the number of neighbors to compute
- dist=L2Distance(): a distance function (see Distances.jl)
- searchctx: search context (hyperparameters, caches, etc.)
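For illustration (a sketch, assuming a raw matrix input with one point per column), the wrapper can be called directly on the data:

```julia
using SimSearchManifoldLearning

X = rand(Float32, 8, 1000)                # one point per column
model = fit(UMAP, X; k=15, maxoutdim=2)   # builds the index and knn graph internally
```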
fit(UMAP::UMAP, maxoutdim; <kwargs>)

Reuses a previously computed model to produce an embedding with a different number of components.

Keyword arguments

- n_epochs=50: the number of epochs to run
- learning_rate::Real=1f0: the initial learning rate
- learning_rate_decay::Real=0.9f0: how the learning rate is adjusted on each epoch (learning_rate *= learning_rate_decay)
- repulsion_strength::Float32=1f0: repulsion force (for negative sampling)
- neg_sample_rate::Integer=5: how many negative examples are used per object
- tol::Real=1e-4: tolerance for early stopping while optimizing embeddings
- minbatch=0: controls how the computation is parallelized; see SimilaritySearch.getminbatch and @batch (Polyester package)
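A minimal sketch of reusing a fitted model (assuming model is a previously fitted 2-d UMAP):

```julia
model3 = fit(model, 3; n_epochs=50)  # 3-d embedding; the knn graph is reused
emb3 = predict(model3)               # (3, n) projection matrix
```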
StatsAPI.predict
— Function

predict(model::UMAP)

Returns the internal embedding (the entire dataset projection).

predict(model::UMAP, Q::AbstractDatabase; k::Integer=15, kwargs...)
predict(model::UMAP, knns, dists; <kwargs>) -> embedding

Use the given model to embed new points $Q$ into an existing embedding produced by $(X, dist)$. The second method represents Q using its k nearest neighbors in X under some distance function (the knns and dists matrices). See searchbatch in SimilaritySearch to compute both (also for AbstractDatabase objects).
Arguments

- model: the fitted model
- knns: a matrix of identifiers (integers) of size $(k, |Q|)$
- dists: a matrix of distances (floating-point values) of size $(k, |Q|)$

Note: the number of neighbors k (encoded in the knn matrices) controls the embedding. Larger values capture more global structure in the data, while smaller values capture more local structure.
Keyword Arguments

- searchctx=getcontext(model.index): the search context for the knn index (caches, hyperparameters, loggers, etc.)
- n_epochs::Integer=30: the number of training epochs for embedding optimization
- learning_rate::Real=1: the initial learning rate during optimization
- learning_rate_decay::Real=0.8: a decay factor for the learning_rate parameter (applied on each epoch)
- set_operation_ratio::Real=1: interpolates between fuzzy set union and fuzzy set intersection when constructing the UMAP graph (global fuzzy simplicial set). The value should be between 0.0 and 1.0: 1.0 indicates pure fuzzy union, while 0.0 indicates pure fuzzy intersection
- local_connectivity::Integer=1: the number of nearest neighbors that should be assumed to be locally connected. The higher this value, the more connected the manifold becomes. It should not be set higher than the intrinsic dimension of the manifold
- repulsion_strength::Real=1: the weighting of negative samples during the optimization process
- neg_sample_rate::Integer=5: the number of negative samples to select for each positive sample. Higher values increase computational cost but yield slightly better accuracy
- tol::Real=1e-4: tolerance for early stopping while optimizing embeddings
- minbatch=0: controls how the computation is parallelized; see SimilaritySearch.getminbatch and @batch (Polyester package)
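As a sketch (assuming model retains its internal search index, i.e., it was not fitted from precomputed matrices), new points can be projected as:

```julia
using SimSearchManifoldLearning, SimilaritySearch

Q = MatrixDatabase(rand(Float32, 8, 50))  # 50 new queries, same dimension as X
emb = predict(model, Q; k=15)             # (maxoutdim, 50) embedding of Q
```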
Layouts
SimSearchManifoldLearning.RandomLayout
— Type

RandomLayout <: AbstractLayout

Initializes the embedding using a random set of points. It may converge slowly.
SimSearchManifoldLearning.SpectralLayout
— Type

SpectralLayout <: AbstractLayout

Initializes the embedding using the spectral layout method. It can be costly on very large datasets.
SimSearchManifoldLearning.PrecomputedLayout
— Type

PrecomputedLayout <: AbstractLayout

Initializes the embedding using a previously computed layout, i.e., a (maxoutdim, n_points) matrix.
SimSearchManifoldLearning.KnnGraphLayout
— Type

KnnGraphLayout <: AbstractLayout

A lattice-like clouds-of-points initialization that uses the computed all-knn graph. This layout initialization is a simple proof of concept; please use it with that in mind.
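To illustrate (a sketch; the PrecomputedLayout constructor argument is assumed from its description above), the initialization strategy is selected through the layout keyword of fit:

```julia
using SimSearchManifoldLearning

X = rand(Float32, 8, 500)
m1 = fit(UMAP, X; layout=RandomLayout())   # cheap initialization, may converge slowly

init = rand(Float32, 2, 500)               # (maxoutdim, n_points) starting positions
m2 = fit(UMAP, X; layout=PrecomputedLayout(init))
```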
Precomputed Knn matrices
If you don't want to use SimilaritySearch for solving the k nearest neighbors, you can also pass precomputed knns and dists matrices.
SimSearchManifoldLearning.PrecomputedKnns
— Type

struct PrecomputedKnns <: AbstractSearchIndex
    knns
    dists
end

An index-like wrapper for a precomputed all-knn result (knns and dists matrices of size (k, n)).
SimSearchManifoldLearning.PrecomputedAffinityMatrix
— Type

struct PrecomputedAffinityMatrix <: AbstractSearchIndex
    dists # precomputed distances for all pairs (square matrix)
end

An index-like wrapper for a precomputed affinity matrix.
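As a sketch (with dummy matrices standing in for a real knn result computed by any external library), the wrapper can be passed wherever an index is accepted:

```julia
using SimSearchManifoldLearning

# Dummy stand-ins: in practice knns (integers) and dists (floats) come from a
# real knn computation, both of shape (k, n) and free of self-references
k, n = 15, 1000
knns = rand(1:n, k, n)
dists = rand(Float32, k, n)

index = PrecomputedKnns(knns, dists)
model = fit(UMAP, index; maxoutdim=2)
```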
Distance functions
Distance functions are defined through the evaluate(::SemiMetric, u, v) function (borrowed from the Distances.jl package).
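As an illustrative sketch of this extension point (assuming the Distances.jl API), a custom distance only needs a SemiMetric subtype and an evaluate method:

```julia
import Distances: SemiMetric, evaluate

# A toy Manhattan (L1) distance; any SemiMetric with an evaluate method works
struct ManhattanDist <: SemiMetric end

evaluate(::ManhattanDist, u, v) = sum(abs, u .- v)

# evaluate(ManhattanDist(), [1.0, 2.0], [3.0, 4.0]) == 4.0
```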