UMAP
SimSearchManifoldLearning.UMAP — Type

```julia
struct UMAP
```

The UMAP model struct.
Properties
- `graph`: the fuzzy simplicial set that represents the all-knn graph.
- `embedding`: the embedding projection.
- `k`: the number of neighbors used to create the model.
- `a` and `b`: parameters that ensure a well-distributed and smooth projection (derived from the `min_dist` and `spread` arguments of `fit`).
- `index`: the search index; it can be `nothing` if the model is handled directly with precomputed `knns` and `dists` matrices.
StatsAPI.fit — Function

```julia
fit(::Type{UMAP}, knns, dists; <kwargs>) -> UMAP
```

Creates a model representing the embedding of data $(X, dist)$ into a `maxoutdim`-dimensional space. Note that `knns` and `dists` jointly specify the all-$k$-nearest-neighbor results of $(X, dist)$; these results must not include self-references. See the `allknn` method in `SimilaritySearch`.
Arguments
- `knns`: a $(k, n)$ matrix of integer identifiers.
- `dists`: a $(k, n)$ matrix of floating-point distances.
It uses all available threads for the projection.
Keyword Arguments
- `maxoutdim::Integer=2`: the number of components in the embedding.
- `n_epochs::Integer=300`: the number of training epochs for embedding optimization.
- `learning_rate::Real=1`: the initial learning rate during optimization.
- `learning_rate_decay::Real=0.9`: how much `learning_rate` is updated on each epoch (`learning_rate *= learning_rate_decay`); a minimum value of `1e-6` is also enforced.
- `layout::AbstractLayout=SpectralLayout()`: how to initialize the output embedding.
- `min_dist::Real=0.1`: the minimum spacing of points in the output embedding.
- `spread::Real=1`: the effective scale of embedded points. In combination with `min_dist`, it determines how clustered the embedded points are.
- `set_operation_ratio::Real=1`: interpolates between fuzzy set union and fuzzy set intersection when constructing the UMAP graph (the global fuzzy simplicial set). The value should lie between 0.0 and 1.0: 1.0 indicates pure fuzzy union, while 0.0 indicates pure fuzzy intersection.
- `local_connectivity::Integer=1`: the number of nearest neighbors assumed to be locally connected. The higher this value, the more connected the manifold becomes. It should not be set higher than the intrinsic dimension of the manifold.
- `repulsion_strength::Real=1`: the weighting of negative samples during the optimization process.
- `neg_sample_rate::Integer=5`: the number of negative samples to select for each positive sample. Higher values increase computational cost but yield slightly better accuracy.
- `tol::Real=1e-4`: tolerance for early stopping while optimizing embeddings.
- `minbatch=0`: controls how the computation is parallelized; zero uses the `SimilaritySearch` defaults and -1 avoids parallel computation. Passed to the `@batch` macro of the `Polyester` package.
- `a=nothing`: controls the embedding; by default, it is determined automatically from `min_dist` and `spread`.
- `b=nothing`: controls the embedding; by default, it is determined automatically from `min_dist` and `spread`.
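As a minimal sketch of the input that `fit` expects, the $(k, n)$ `knns` and `dists` matrices can be built by brute force in plain Julia; real code would use `allknn` from `SimilaritySearch` instead. Note how the self-reference of each point is excluded, as required above.

```julia
# Build (k, n) knn matrices by brute force. fit(UMAP, knns, dists) expects
# this shape and no self-references. This is a sketch only;
# SimilaritySearch.allknn computes the same result efficiently.
function brute_force_allknn(X::AbstractMatrix, k::Int)
    n = size(X, 2)
    knns = Matrix{Int32}(undef, k, n)
    dists = Matrix{Float32}(undef, k, n)
    for i in 1:n
        d = [sqrt(sum(abs2, X[:, i] .- X[:, j])) for j in 1:n]  # L2 distances
        order = sortperm(d)
        neighbors = [j for j in order if j != i][1:k]  # drop the self-reference
        knns[:, i] = neighbors
        dists[:, i] = d[neighbors]
    end
    knns, dists
end

X = randn(4, 100)                       # 100 points in 4 dimensions
knns, dists = brute_force_allknn(X, 15)
# model = fit(UMAP, knns, dists; maxoutdim=2)  # requires SimSearchManifoldLearning
```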
```julia
fit(::Type{<:UMAP}, index_or_data;
    k=15,
    dist::SemiMetric=L2Distance(),
    minbatch=0,
    kwargs...)
```

A wrapper for `fit` that computes the `k` nearest neighbors on `index_or_data` and passes these, along with `kwargs`, to the regular `fit`.
Arguments
- `index_or_data`: an already constructed index (see `SimilaritySearch`), a matrix, or an abstract database (`SimilaritySearch`).
Keyword arguments
- `k=15`: the number of neighbors to compute.
- `dist=L2Distance()`: a distance function (see `Distances.jl`).
- `searchctx`: search context (hyperparameters, caches, etc.).
```julia
fit(UMAP::UMAP, maxoutdim; <kwargs>)
```

Reuses a previously computed model with a different number of components.
Keyword arguments
- `n_epochs=50`: the number of epochs to run.
- `learning_rate::Real=1f0`: initial learning rate.
- `learning_rate_decay::Real=0.9f0`: how the learning rate is adjusted per epoch (`learning_rate *= learning_rate_decay`).
- `repulsion_strength::Float32=1f0`: repulsion force (for negative sampling).
- `neg_sample_rate::Integer=5`: how many negative examples are used per object.
- `tol::Real=1e-4`: tolerance for early stopping while optimizing embeddings.
- `minbatch=0`: controls how the computation is parallelized. See `SimilaritySearch.getminbatch` and `@batch` (`Polyester` package).
StatsAPI.predict — Function

```julia
predict(model::UMAP)
```

Returns the internal embedding (the projection of the entire dataset).

```julia
predict(model::UMAP, Q::AbstractDatabase; k::Integer=15, kwargs...)
predict(model::UMAP, knns, dists; <kwargs>) -> embedding
```

Uses the given model to embed new points $Q$ into an existing embedding produced from $(X, dist)$. The second form represents $Q$ by its $k$ nearest neighbors in $X$ under some distance function (`knns` and `dists`). See `searchbatch` in `SimilaritySearch` to compute both matrices (it also works for `AbstractDatabase` objects).
Arguments
- `model`: the fitted model.
- `knns`: a matrix of integer identifiers of size $(k, |Q|)$.
- `dists`: a matrix of floating-point distances of size $(k, |Q|)$.
Note: the number of neighbors $k$ (encoded in the knn matrices) controls the embedding. Larger values capture more global structure in the data, while smaller values capture more local structure.
Keyword Arguments
- `searchctx=getcontext(model.index)`: the search context for the knn index (caches, hyperparameters, loggers, etc.).
- `n_epochs::Integer=30`: the number of training epochs for embedding optimization.
- `learning_rate::Real=1`: the initial learning rate during optimization.
- `learning_rate_decay::Real=0.8`: a decay factor for the `learning_rate` parameter (applied on each epoch).
- `set_operation_ratio::Real=1`: interpolates between fuzzy set union and fuzzy set intersection when constructing the UMAP graph (the global fuzzy simplicial set). The value should lie between 0.0 and 1.0: 1.0 indicates pure fuzzy union, while 0.0 indicates pure fuzzy intersection.
- `local_connectivity::Integer=1`: the number of nearest neighbors assumed to be locally connected. The higher this value, the more connected the manifold becomes. It should not be set higher than the intrinsic dimension of the manifold.
- `repulsion_strength::Real=1`: the weighting of negative samples during the optimization process.
- `neg_sample_rate::Integer=5`: the number of negative samples to select for each positive sample. Higher values increase computational cost but yield slightly better accuracy.
- `tol::Real=1e-4`: tolerance for early stopping while optimizing embeddings.
- `minbatch=0`: controls how the computation is parallelized. See `SimilaritySearch.getminbatch` and `@batch` (`Polyester` package).
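For the two-matrix form of `predict`, the $(k, |Q|)$ matrices can be sketched by brute force in plain Julia; in real use, `searchbatch` from `SimilaritySearch` computes them efficiently. Since the queries are not part of the reference set, there are no self-references to drop:

```julia
# Build (k, |Q|) matrices of neighbors of queries Q among reference points X,
# as expected by predict(model, knns, dists). A sketch only; searchbatch in
# SimilaritySearch computes the same result efficiently.
function brute_force_searchbatch(X::AbstractMatrix, Q::AbstractMatrix, k::Int)
    n, m = size(X, 2), size(Q, 2)
    knns = Matrix{Int32}(undef, k, m)
    dists = Matrix{Float32}(undef, k, m)
    for qi in 1:m
        d = [sqrt(sum(abs2, Q[:, qi] .- X[:, j])) for j in 1:n]  # L2 distances
        order = sortperm(d)[1:k]  # queries are not in X, so no self-references
        knns[:, qi] = order
        dists[:, qi] = d[order]
    end
    knns, dists
end

X = randn(4, 200)                 # reference points
Q = randn(4, 10)                  # new queries
qknns, qdists = brute_force_searchbatch(X, Q, 15)
# emb = predict(model, qknns, qdists)  # requires a fitted UMAP model
```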
Layouts
SimSearchManifoldLearning.RandomLayout — Type

`RandomLayout <: AbstractLayout`

Initializes the embedding with a random set of points. It may converge slowly.
SimSearchManifoldLearning.SpectralLayout — Type

`SpectralLayout <: AbstractLayout`

Initializes the embedding using the spectral layout method. It can be costly on very large datasets.
SimSearchManifoldLearning.PrecomputedLayout — Type

`PrecomputedLayout <: AbstractLayout`

Initializes the embedding using a previously computed layout, i.e., a `(maxoutdim, n_points)` matrix.
SimSearchManifoldLearning.KnnGraphLayout — Type

`KnnGraphLayout <: AbstractLayout`

A lattice-like clouds-of-points initialization that uses the computed all-knn graph. This layout initialization is a simple proof of concept, so please use it with that in mind.
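As an illustration of the shape `PrecomputedLayout` expects, the sketch below builds a `(maxoutdim, n_points)` initial layout via a PCA projection in plain Julia. Passing the matrix to the `PrecomputedLayout` constructor is shown as an assumption; any matrix of that shape would do:

```julia
using LinearAlgebra

# A PCA-based initial layout: produces the (maxoutdim, n_points) matrix shape
# that PrecomputedLayout expects. A sketch; not part of the package API.
function pca_layout(X::AbstractMatrix, maxoutdim::Int=2)
    Xc = X .- sum(X; dims=2) ./ size(X, 2)  # center each dimension
    U, _, _ = svd(Xc)
    Matrix(U[:, 1:maxoutdim]' * Xc)         # project onto top components
end

X = randn(8, 50)          # 50 points in 8 dimensions
init = pca_layout(X, 2)   # a (2, 50) matrix
# layout = PrecomputedLayout(init)  # assumed constructor usage; check the package docs
```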
Precomputed Knn matrices
If you don't want to use `SimilaritySearch` to solve the $k$ nearest neighbors, you can also pass precomputed `knns` and `dists` matrices.
SimSearchManifoldLearning.PrecomputedKnns — Type

```julia
struct PrecomputedKnns <: AbstractSearchIndex
    knns
    dists
end
```

An index-like wrapper for a precomputed all-knn result (given as `knns` and `dists` matrices of size `(k, n)`).
SimSearchManifoldLearning.PrecomputedAffinityMatrix — Type

```julia
struct PrecomputedAffinityMatrix <: AbstractSearchIndex
    dists # precomputed distances for all pairs (square matrix)
end
```

An index-like wrapper for a precomputed affinity matrix.
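The square all-pairs matrix that `PrecomputedAffinityMatrix` wraps can be sketched in plain Julia; the constructor call at the end is shown under the assumption that it takes the matrix directly, matching the struct's single field:

```julia
# Build the square all-pairs L2 distance matrix that PrecomputedAffinityMatrix
# wraps. A sketch; for large n this O(n^2) matrix becomes expensive.
function pairwise_l2(X::AbstractMatrix)
    n = size(X, 2)
    [sqrt(sum(abs2, X[:, i] .- X[:, j])) for i in 1:n, j in 1:n]
end

X = randn(4, 30)
D = pairwise_l2(X)  # a (30, 30) symmetric matrix with a zero diagonal
# index = PrecomputedAffinityMatrix(D)  # assumed constructor usage
```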
Distance functions
The distance functions are defined to work under the `evaluate(::SemiMetric, u, v)` function (borrowed from the `Distances.jl` package).
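A custom distance plugs into this interface by subtyping `SemiMetric` and defining an `evaluate` method for it. The sketch below mimics the interface with standalone stand-ins for `SemiMetric` and `evaluate` so it runs on its own; in real code, these come from `Distances.jl` and you would extend `Distances.evaluate` instead:

```julia
# Standalone mimic of the Distances.jl interface. In real use, subtype
# Distances.SemiMetric and extend Distances.evaluate rather than redefining them.
abstract type SemiMetric end

struct L2Distance <: SemiMetric end

# The (distance, u, v) calling convention used throughout this package.
evaluate(::L2Distance, u, v) = sqrt(sum(abs2, u .- v))

d = evaluate(L2Distance(), [0.0, 0.0], [3.0, 4.0])  # classic 3-4-5 triangle → 5.0
```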