UMAP

SimSearchManifoldLearning.UMAP — Type
struct UMAP

The UMAP model struct

Properties

  • graph: the fuzzy simplicial set that represents the all-knn graph
  • embedding: the embedding projection
  • k: the number of neighbors used to create the model
  • a and b: parameters that ensure a well-distributed and smooth projection (derived from the min_dist and spread arguments of fit)
  • index: the search index; it can be nothing if the model is handled directly with precomputed knns and dists matrices
StatsAPI.fit — Function
fit(::Type{<:UMAP}, knns, dists; <kwargs>) -> UMAP object

Create a model representing the embedding of the dataset (X, dist) into a maxoutdim-dimensional space. The knns and dists matrices jointly specify the k nearest neighbors of every element of $(X, dist)$; these results must not include self-references. See the allknn method in SimilaritySearch. A usage sketch follows the keyword-argument list below.

Arguments

  • knns: A $(k, n)$ matrix of integers (identifiers).
  • dists: A $(k, n)$ matrix of floating points (distances).

It uses all available threads for the projection.

Keyword Arguments

  • maxoutdim::Integer=2: The number of components in the embedding
  • n_epochs::Integer = 300: the number of training epochs for embedding optimization
  • learning_rate::Real = 1: the initial learning rate during optimization
  • learning_rate_decay::Real = 0.9: how much learning_rate is decayed on each epoch (learning_rate *= learning_rate_decay); a minimum value of 1e-6 is enforced
  • layout::AbstractLayout = SpectralLayout(): how to initialize the output embedding
  • min_dist::Real = 0.1: the minimum spacing of points in the output embedding
  • spread::Real = 1: the effective scale of embedded points. Determines how clustered embedded points are in combination with min_dist.
  • set_operation_ratio::Real = 1: interpolates between fuzzy set union and fuzzy set intersection when constructing the UMAP graph (global fuzzy simplicial set). The value of this parameter should be between 1.0 and 0.0: 1.0 indicates pure fuzzy union, while 0.0 indicates pure fuzzy intersection.
  • local_connectivity::Integer = 1: the number of nearest neighbors that should be assumed to be locally connected. The higher this value, the more connected the manifold becomes. This should not be set higher than the intrinsic dimension of the manifold.
  • repulsion_strength::Real = 1: the weighting of negative samples during the optimization process.
  • neg_sample_rate::Integer = 5: the number of negative samples to select for each positive sample. Higher values will increase computational cost but result in slightly better accuracy.
  • tol::Real = 1e-4: tolerance to early stopping while optimizing embeddings.
  • minbatch=0: controls how the parallel computation is batched; zero uses the SimilaritySearch defaults and -1 disables parallelism. Passed to the @batch macro of the Polyester package.
  • a = nothing: this controls the embedding. By default, this is determined automatically by min_dist and spread.
  • b = nothing: this controls the embedding. By default, this is determined automatically by min_dist and spread.
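A minimal usage sketch (hypothetical data; the exact SimilaritySearch calls, ExhaustiveSearch and allknn, may vary slightly between versions): compute the all-knn matrices with a brute-force index and fit a UMAP model from them.

using SimilaritySearch, SimSearchManifoldLearning

X = MatrixDatabase(rand(Float32, 8, 1_000))           # 1000 points of dimension 8 (hypothetical data)
index = ExhaustiveSearch(; db=X, dist=L2Distance())   # brute-force index, no construction step
knns, dists = allknn(index, 15)                       # (k, n) matrices; self-references are excluded
model = fit(UMAP, knns, dists; maxoutdim=2, n_epochs=100)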
fit(::Type{<:UMAP}, index_or_data;
    k=15,
    dist::SemiMetric=L2Distance(),
    minbatch=0,
    kwargs...)

Wrapper for fit that computes the k nearest neighbors of index_or_data and passes these, together with kwargs, to the regular fit method.

Arguments

  • index_or_data: an already constructed index (see SimilaritySearch), a matrix, or an abstract database (see SimilaritySearch)

Keyword arguments

  • k=15: number of neighbors to compute
  • dist=L2Distance(): A distance function (see Distances.jl)
  • searchctx: search context (hyperparameters, caches, etc)
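A minimal sketch of this wrapper, assuming a plain Float32 matrix whose columns are the points (hypothetical data); the nearest neighbors are computed internally and the remaining keyword arguments are forwarded to the regular fit.

using SimSearchManifoldLearning

X = rand(Float32, 8, 1_000)               # columns are points (hypothetical data)
model = fit(UMAP, X; k=15, maxoutdim=2)   # builds an index, computes knns and dists, then fits
emb = model.embedding                     # (maxoutdim, n) projection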
fit(UMAP::UMAP, maxoutdim; <kwargs>)

Reuses a previously computed model to produce an embedding with a different number of components (maxoutdim).

Keyword arguments

  • n_epochs=50: number of epochs to run
  • learning_rate::Real = 1f0: initial learning rate
  • learning_rate_decay::Real = 0.9f0: how the learning rate is adjusted on each epoch: learning_rate *= learning_rate_decay
  • repulsion_strength::Float32 = 1f0: repulsion force (for negative sampling)
  • neg_sample_rate::Integer = 5: how many negative examples per object are used.
  • tol::Real = 1e-4: tolerance to early stopping while optimizing embeddings.
  • minbatch=0: controls how parallel computation is made. See SimilaritySearch.getminbatch and @batch (Polyester package).
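For example, a previously fitted two-dimensional model (model2d below is hypothetical) can be reused to obtain a three-dimensional projection without recomputing the knn graph:

model3d = fit(model2d, 3; n_epochs=50)   # reuses the fuzzy simplicial set of model2d
emb3 = predict(model3d)                  # (3, n) embedding of the original dataset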
StatsAPI.predict — Function
predict(model::UMAP)

Returns the internal embedding (the entire dataset projection)

predict(model::UMAP, Q::AbstractDatabase; k::Integer=15, kwargs...)
predict(model::UMAP, knns, dists; <kwargs>) -> embedding

Use the given model to embed new points $Q$ into an existing embedding produced from $(X, dist)$. The second method represents Q by its k nearest neighbors in X under some distance function (the knns and dists matrices). See searchbatch in SimilaritySearch to compute both matrices (it also works for AbstractDatabase objects).

Arguments

  • model: The fitted model
  • knns: matrix of identifiers (integers) of size $(k, |Q|)$
  • dists: matrix of distances (floating point values) of size $(k, |Q|)$

Note: the number of neighbors k (encoded in the knn matrices) controls the embedding: larger values capture more global structure in the data, while smaller values capture more local structure.

Keyword Arguments

  • searchctx = getcontext(model.index): the search context for the knn index (caches, hyperparameters, loggers, etc)
  • n_epochs::Integer = 30: the number of training epochs for embedding optimization
  • learning_rate::Real = 1: the initial learning rate during optimization
  • learning_rate_decay::Real = 0.8: A decay factor for the learning_rate param (on each epoch)
  • set_operation_ratio::Real = 1: interpolates between fuzzy set union and fuzzy set intersection when constructing the UMAP graph (global fuzzy simplicial set). The value of this parameter should be between 1.0 and 0.0: 1.0 indicates pure fuzzy union, while 0.0 indicates pure fuzzy intersection.
  • local_connectivity::Integer = 1: the number of nearest neighbors that should be assumed to be locally connected. The higher this value, the more connected the manifold becomes. This should not be set higher than the intrinsic dimension of the manifold.
  • repulsion_strength::Real = 1: the weighting of negative samples during the optimization process.
  • neg_sample_rate::Integer = 5: the number of negative samples to select for each positive sample. Higher values will increase computational cost but result in slightly more accuracy.
  • tol::Real = 1e-4: tolerance to early stopping while optimizing embeddings.
  • minbatch=0: controls how parallel computation is made. See SimilaritySearch.getminbatch and @batch (Polyester package).
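A minimal sketch of both forms, assuming model was fitted with a search index and Q holds new points of the same dimension as the training data (hypothetical data; the searchbatch signature may differ slightly between SimilaritySearch versions):

using SimilaritySearch, SimSearchManifoldLearning

Q = MatrixDatabase(rand(Float32, 8, 100))         # 100 new points to project (hypothetical data)
emb_q = predict(model, Q; k=15, n_epochs=30)      # model.index resolves the knns internally

# equivalent two-step form with explicit knn matrices
knns, dists = searchbatch(model.index, Q, 15)     # (k, |Q|) matrices
emb_q2 = predict(model, knns, dists; n_epochs=30)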

Layouts

SimSearchManifoldLearning.KnnGraphLayout — Type
KnnGraphLayout <: AbstractLayout

A lattice-like plus clouds-of-points initialization that uses the computed all-knn graph. This layout initialization is a simple proof of concept, so please use it under that assumption.

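A usage sketch (hypothetical data X): the layout is selected through the layout keyword of fit instead of the default SpectralLayout.

using SimSearchManifoldLearning

X = rand(Float32, 8, 1_000)                           # columns are points (hypothetical data)
model = fit(UMAP, X; k=15, layout=KnnGraphLayout())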

Precomputed Knn matrices

If you don't want to use SimilaritySearch for solving the k nearest neighbors, you can also pass precomputed knns and dists matrices, as in the sketch below.
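A minimal sketch of that workflow using a plain brute-force loop (acceptable only for small datasets; the data and the value of k are hypothetical):

using SimSearchManifoldLearning

X = rand(Float32, 8, 500)                 # columns are points (hypothetical data)
n, k = size(X, 2), 15
knns  = Matrix{Int32}(undef, k, n)
dists = Matrix{Float32}(undef, k, n)

for i in 1:n
    d = vec(sum(abs2, X .- X[:, i]; dims=1))   # squared Euclidean distances to point i
    d[i] = Inf                                 # exclude the self-reference, as fit requires
    p = partialsortperm(d, 1:k)                # indices of the k smallest distances
    knns[:, i]  .= p
    dists[:, i] .= d[p]
end

model = fit(UMAP, knns, dists; maxoutdim=2)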

Distance functions

Distance functions are defined to work through the evaluate(::SemiMetric, u, v) function (a convention borrowed from the Distances.jl package).
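A minimal sketch of a user-defined distance; it assumes that SemiMetric and evaluate are available from SimilaritySearch, as the fit signature above suggests (in other setups they may come from Distances.jl instead), and the ManhattanLike name is purely illustrative.

using SimSearchManifoldLearning, SimilaritySearch

struct ManhattanLike <: SemiMetric end

function SimilaritySearch.evaluate(::ManhattanLike, u, v)
    s = zero(float(eltype(u)))
    @inbounds for i in eachindex(u, v)
        s += abs(u[i] - v[i])       # L1 (Manhattan) accumulation
    end
    s
end

# model = fit(UMAP, rand(Float32, 8, 1_000); dist=ManhattanLike(), k=15)   # hypothetical usage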