UMAP

SimSearchManifoldLearning.UMAP — Type
struct UMAP

The UMAP model struct

Properties

  • graph: the fuzzy simplicial set that represents the all-knn graph
  • embedding: the embedding projection
  • k: the number of neighbors used to create the model
  • a and b: parameters that ensure a well-distributed and smooth projection (derived from the min_dist and spread arguments of fit)
  • index: the search index; it can be nothing if the model is handled directly with precomputed knns and dists matrices
StatsAPI.fit — Function
fit(::Type{<:UMAP}, knns, dists; <kwargs>) -> UMAP object

Create a model representing the embedding of the dataset (X, dist) into a maxoutdim-dimensional space. The knns and dists matrices jointly specify the k nearest neighbors of every element of $(X, dist)$; these results must not include self-references. See the allknn method in SimilaritySearch. A usage sketch follows the keyword-argument list below.

Arguments

  • knns: A $(k, n)$ matrix of integers (identifiers).
  • dists: A $(k, n)$ matrix of floating points (distances).

It uses all available threads for the projection.

Keyword Arguments

  • maxoutdim::Integer=2: The number of components in the embedding
  • n_epochs::Integer = 300: the number of training epochs for embedding optimization
  • learning_rate::Real = 1: the initial learning rate during optimization
  • learning_rate_decay::Real = 0.9: how much learning_rate is decayed on each epoch (learning_rate *= learning_rate_decay); a minimum value of 1e-6 is enforced
  • layout::AbstractLayout = SpectralLayout(): how to initialize the output embedding
  • min_dist::Real = 0.1: the minimum spacing of points in the output embedding
  • spread::Real = 1: the effective scale of embedded points. Determines how clustered embedded points are in combination with min_dist.
  • set_operation_ratio::Real = 1: interpolates between fuzzy set union and fuzzy set intersection when constructing the UMAP graph (global fuzzy simplicial set). The value of this parameter should be between 1.0 and 0.0: 1.0 indicates pure fuzzy union, while 0.0 indicates pure fuzzy intersection.
  • local_connectivity::Integer = 1: the number of nearest neighbors that should be assumed to be locally connected. The higher this value, the more connected the manifold becomes. This should not be set higher than the intrinsic dimension of the manifold.
  • repulsion_strength::Real = 1: the weighting of negative samples during the optimization process.
  • neg_sample_rate::Integer = 5: the number of negative samples to select for each positive sample. Higher values will increase computational cost but result in slightly better accuracy.
  • tol::Real = 1e-4: tolerance to early stopping while optimizing embeddings.
  • minbatch=0: controls how the parallel computation is batched; zero uses the SimilaritySearch defaults and -1 disables parallelism. Passed to the @batch macro of the Polyester package.
  • a = nothing: this controls the embedding. By default, this is determined automatically by min_dist and spread.
  • b = nothing: this controls the embedding. By default, this is determined automatically by min_dist and spread.
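A minimal usage sketch (hypothetical data; the exact SimilaritySearch calls, ExhaustiveSearch and allknn, may vary slightly between versions): compute the all-knn matrices with a brute-force index and fit a UMAP model from them.

using SimilaritySearch, SimSearchManifoldLearning

X = MatrixDatabase(rand(Float32, 8, 1_000))           # 1000 points of dimension 8 (hypothetical data)
index = ExhaustiveSearch(; db=X, dist=L2Distance())   # brute-force index, no construction step
knns, dists = allknn(index, 15)                       # (k, n) matrices; self-references are excluded
model = fit(UMAP, knns, dists; maxoutdim=2, n_epochs=100)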
fit(::Type{<:UMAP}, index_or_data;
    k=15,
    dist::SemiMetric=L2Distance(),
    minbatch=0,
    kwargs...)

Wrapper for fit that computes the k nearest neighbors of index_or_data and passes these, together with kwargs, to the regular fit method.

Arguments

  • index_or_data: an already constructed index (see SimilaritySearch), a matrix, or an abstract database (see SimilaritySearch)

Keyword arguments

  • k=15: number of neighbors to compute
  • dist=L2Distance(): A distance function (see Distances.jl)
  • searchctx: search context (hyperparameters, caches, etc)
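A minimal sketch of this wrapper, assuming a plain Float32 matrix whose columns are the points (hypothetical data); the nearest neighbors are computed internally and the remaining keyword arguments are forwarded to the regular fit.

using SimSearchManifoldLearning

X = rand(Float32, 8, 1_000)               # columns are points (hypothetical data)
model = fit(UMAP, X; k=15, maxoutdim=2)   # builds an index, computes knns and dists, then fits
emb = model.embedding                     # (maxoutdim, n) projection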
fit(UMAP::UMAP, maxoutdim; <kwargs>)

Reuses a previously computed model to produce an embedding with a different number of components (maxoutdim).

Keyword arguments

  • n_epochs=50: number of epochs to run
  • learning_rate::Real = 1f0: initial learning rate
  • learning_rate_decay::Real = 0.9f0: how the learning rate is adjusted on each epoch: learning_rate *= learning_rate_decay
  • repulsion_strength::Float32 = 1f0: repulsion force (for negative sampling)
  • neg_sample_rate::Integer = 5: how many negative examples per object are used.
  • tol::Real = 1e-4: tolerance to early stopping while optimizing embeddings.
  • minbatch=0: controls how parallel computation is made. See SimilaritySearch.getminbatch and @batch (Polyester package).
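For example, a previously fitted two-dimensional model (model2d below is hypothetical) can be reused to obtain a three-dimensional projection without recomputing the knn graph:

model3d = fit(model2d, 3; n_epochs=50)   # reuses the fuzzy simplicial set of model2d
emb3 = predict(model3d)                  # (3, n) embedding of the original dataset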
StatsAPI.predict — Function
predict(model::UMAP)

Returns the internal embedding (the entire dataset projection)

predict(model::UMAP, Q::AbstractDatabase; k::Integer=15, kwargs...)
predict(model::UMAP, knns, dists; <kwargs>) -> embedding

Use the given model to embed new points $Q$ into an existing embedding produced from $(X, dist)$. The second method represents Q by its k nearest neighbors in X under some distance function (the knns and dists matrices). See searchbatch in SimilaritySearch to compute both matrices (it also works for AbstractDatabase objects).

Arguments

  • model: The fitted model
  • knns: matrix of identifiers (integers) of size $(k, |Q|)$
  • dists: matrix of distances (floating point values) of size $(k, |Q|)$

Note: the number of neighbors k (encoded in the knn matrices) controls the embedding: larger values capture more global structure in the data, while smaller values capture more local structure.

Keyword Arguments

  • searchctx = getcontext(model.index): the search context for the knn index (caches, hyperparameters, loggers, etc)
  • n_epochs::Integer = 30: the number of training epochs for embedding optimization
  • learning_rate::Real = 1: the initial learning rate during optimization
  • learning_rate_decay::Real = 0.8: A decay factor for the learning_rate param (on each epoch)
  • set_operation_ratio::Real = 1: interpolates between fuzzy set union and fuzzy set intersection when constructing the UMAP graph (global fuzzy simplicial set). The value of this parameter should be between 1.0 and 0.0: 1.0 indicates pure fuzzy union, while 0.0 indicates pure fuzzy intersection.
  • local_connectivity::Integer = 1: the number of nearest neighbors that should be assumed to be locally connected. The higher this value, the more connected the manifold becomes. This should not be set higher than the intrinsic dimension of the manifold.
  • repulsion_strength::Real = 1: the weighting of negative samples during the optimization process.
  • neg_sample_rate::Integer = 5: the number of negative samples to select for each positive sample. Higher values will increase computational cost but result in slightly more accuracy.
  • tol::Real = 1e-4: tolerance to early stopping while optimizing embeddings.
  • minbatch=0: controls how parallel computation is made. See SimilaritySearch.getminbatch and @batch (Polyester package).
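A minimal sketch of both forms, assuming model was fitted with a search index and Q holds new points of the same dimension as the training data (hypothetical data; the searchbatch signature may differ slightly between SimilaritySearch versions):

using SimilaritySearch, SimSearchManifoldLearning

Q = MatrixDatabase(rand(Float32, 8, 100))         # 100 new points to project (hypothetical data)
emb_q = predict(model, Q; k=15, n_epochs=30)      # model.index resolves the knns internally

# equivalent two-step form with explicit knn matrices
knns, dists = searchbatch(model.index, Q, 15)     # (k, |Q|) matrices
emb_q2 = predict(model, knns, dists; n_epochs=30)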

Layouts

SimSearchManifoldLearning.KnnGraphLayout — Type
KnnGraphLayout <: AbstractLayout

A lattice-like plus clouds-of-points initialization that uses the computed all-knn graph. This layout initialization is a simple proof of concept, so please use it under that assumption.

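A usage sketch (hypothetical data X): the layout is selected through the layout keyword of fit instead of the default SpectralLayout.

using SimSearchManifoldLearning

X = rand(Float32, 8, 1_000)                           # columns are points (hypothetical data)
model = fit(UMAP, X; k=15, layout=KnnGraphLayout())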

Precomputed Knn matrices

If you don't want to use SimilaritySearch for solving the k nearest neighbors, you can also pass precomputed knns and dists matrices, as in the sketch below.
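A minimal sketch of that workflow using a plain brute-force loop (acceptable only for small datasets; the data and the value of k are hypothetical):

using SimSearchManifoldLearning

X = rand(Float32, 8, 500)                 # columns are points (hypothetical data)
n, k = size(X, 2), 15
knns  = Matrix{Int32}(undef, k, n)
dists = Matrix{Float32}(undef, k, n)

for i in 1:n
    d = vec(sum(abs2, X .- X[:, i]; dims=1))   # squared Euclidean distances to point i
    d[i] = Inf                                 # exclude the self-reference, as fit requires
    p = partialsortperm(d, 1:k)                # indices of the k smallest distances
    knns[:, i]  .= p
    dists[:, i] .= d[p]
end

model = fit(UMAP, knns, dists; maxoutdim=2)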

Distance functions

Distance functions are defined to work through the evaluate(::SemiMetric, u, v) function (a convention borrowed from the Distances.jl package).
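A minimal sketch of a user-defined distance; it assumes that SemiMetric and evaluate are available from SimilaritySearch, as the fit signature above suggests (in other setups they may come from Distances.jl instead), and the ManhattanLike name is purely illustrative.

using SimSearchManifoldLearning, SimilaritySearch

struct ManhattanLike <: SemiMetric end

function SimilaritySearch.evaluate(::ManhattanLike, u, v)
    s = zero(float(eltype(u)))
    @inbounds for i in eachindex(u, v)
        s += abs(u[i] - v[i])       # L1 (Manhattan) accumulation
    end
    s
end

# model = fit(UMAP, rand(Float32, 8, 1_000); dist=ManhattanLike(), k=15)   # hypothetical usage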