Visualizing the MNIST database

by: Eric S. Téllez

This example creates a visualization of the MNIST images (handwritten digits), using MLDatasets.jl to retrieve them.

Note: This example needs a lot of computing power; you may want to set the environment variable JULIA_NUM_THREADS=auto before running Julia.
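To confirm that the setting took effect, the thread count can be checked from inside the session. This is a minimal sketch using only Julia's standard `Threads` module; the printed number depends on the machine and on how Julia was launched.

```julia
using Base.Threads

# Number of threads available to this Julia session.
# With JULIA_NUM_THREADS=auto this normally matches the number of CPU threads.
println("Julia is running with ", nthreads(), " thread(s)")
```

If this prints 1, Julia was most likely started without the environment variable (or without the equivalent `--threads=auto` command line flag).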

using SimilaritySearch, SimSearchManifoldLearning, PlotlyLight, Colors, StatsBase, LinearAlgebra, Markdown, MLDatasets, Random
db, y, dist = let data = MNIST(split=:train)
    T, y = data.features, data.targets
    n = size(T, 3)
    MatrixDatabase(Float32.(reshape(T, (28*28, n)))), y, Dist.CastF32.SqL2()
end
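The loading step above flattens each 28×28 image into one 784-dimensional column. The following sketch shows the same reshape on a small random array standing in for the MNIST tensor (the array `T` and its size are illustrative assumptions, not the real data), so it runs without downloading anything.

```julia
# Synthetic stand-in for the MNIST tensor: 5 "images" of 28×28 pixels.
T = rand(Float32, 28, 28, 5)
n = size(T, 3)

# Each image becomes one column of a 784×n matrix.
# reshape keeps Julia's column-major order, so column i is vec(T[:, :, i]).
X = reshape(T, (28 * 28, n))

println(size(X))
println(X[:, 3] == vec(T[:, :, 3]))
```

Because the layout is preserved, reshaping a column back to `(28, 28)` recovers the original image, which is what the rendering code later relies on.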

Now we can create the index

# Define the index and the search context (caches and hyperparameters).
# We request a very high-quality build, MinRecall(0.99); high-quality
# constructions yield faster queries due to the underlying graph structure.
index = SearchGraph(; dist, db)
ctx = SearchGraphContext(hyperparameters_callback=OptimizeParameters(MinRecall(0.99)))
# Actual indexing procedure using the given search context.
index!(index, ctx)
# Optimize the index to trade quality for speed.
optimize_index!(index, ctx, MinRecall(0.95))
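The quality targets above are expressed as recall. As a rough illustration, and assuming the usual definition (the fraction of the exact nearest neighbors that the approximate search also returns), recall for one query can be computed as follows. The helper `recall` and the identifier lists are hypothetical and not part of SimilaritySearch.

```julia
# Hypothetical helper: fraction of the exact neighbors found by the approximate search.
recall(exact_ids, approx_ids) = length(intersect(exact_ids, approx_ids)) / length(exact_ids)

exact  = [3, 17, 42, 8]   # ids returned by an exhaustive search (made up)
approx = [3, 42, 99, 5]   # ids returned by an approximate index (made up)

println(recall(exact, approx))  # 0.5: two of the four exact neighbors were found
```

Under this reading, MinRecall(0.95) would accept parameter settings that miss about one in twenty true neighbors in exchange for faster searches.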

Searching

Our index can solve queries over the entire dataset, for instance, retrieving the images most similar to a given digit as a nearest neighbor query.

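function search_and_render(index, ctx, q, res)
    res = reuse!(res)  # clear the result set so it can be reused across queries
    @time search(index, ctx, q, res)
    # Invert the query image so it is easy to tell apart from its neighbors
    qinverted = 1 .- reshape(q, (28, 28))'
    # Query image followed by its nearest neighbors, side by side
    h = hcat(qinverted, [reshape(index[id_], (28, 28))' for id_ in IdView(res)]...)

    Gray.(h)
end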

res = knnqueue(ctx, 12)
for _ in 1:7
    qid = rand(1:length(index))  # pick a random database element as the query
    display(search_and_render(index, ctx, index[qid], res))
end
  0.000108 seconds (3 allocations: 64 bytes)
  0.000104 seconds (3 allocations: 64 bytes)
  0.000124 seconds (3 allocations: 64 bytes)
  0.000097 seconds (3 allocations: 64 bytes)
  0.000107 seconds (3 allocations: 64 bytes)
  0.000103 seconds (3 allocations: 64 bytes)
  0.000079 seconds (3 allocations: 64 bytes)
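For reference, the queries above ask for the 12 closest database elements under the squared Euclidean distance. The sketch below solves the same kind of query exhaustively on a tiny made-up matrix using only Base Julia; `sql2` and `brute_force_knn` are illustrative names, not SimilaritySearch functions, and this is the baseline that the index approximates rather than how the index works internally.

```julia
# Squared Euclidean distance between two vectors.
sql2(a, b) = sum(abs2, a .- b)

# Exhaustive k nearest neighbors: evaluate every column of X against the query q.
function brute_force_knn(X::AbstractMatrix, q::AbstractVector, k::Integer)
    d = [sql2(view(X, :, i), q) for i in axes(X, 2)]
    ids = collect(partialsortperm(d, 1:k))  # indices of the k smallest distances
    ids, d[ids]
end

# Four 2-dimensional points stored as columns (made-up data).
X = Float32[0 1 2 3;
            0 0 0 0]
q = Float32[2.2, 0]

ids, dists = brute_force_knn(X, q, 2)
println(ids)  # [3, 4]
```

An exhaustive scan costs one distance evaluation per database element, which is why a graph index pays off on the 60,000 training images.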

UMAP Visualization

Computing the UMAP projections

e2 = let min_dist=0.5f0,
             k=7,
             n_epochs=75,
             neg_sample_rate=3,
             tol=1e-3,
             layout=SpectralLayout()

    @time "Compute 2D UMAP model" U2 = fit(UMAP, index; k, neg_sample_rate, layout, n_epochs, tol, min_dist)
    @time "predicting 2D embeddings" e2 = clamp.(predict(U2), -10f0, 10f0)
    e2
end
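The `clamp.` call above limits every embedding coordinate to the interval [-10, 10]. A plausible motivation (an assumption, since the source does not state it) is to keep a few far-away points from stretching the plot. The effect on a small made-up matrix:

```julia
# Made-up 2×3 embedding with two coordinates outside [-10, 10].
E = Float32[-12.0  0.5  3.0;
              4.0 15.0 -1.0]

# Broadcasting clamp over the matrix pulls out-of-range values to the nearest bound.
E_clamped = clamp.(E, -10f0, 10f0)

println(E_clamped)
```

Values already inside the interval are left untouched, so only outliers move.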

Now visualizing


hovertext = ["$t" for t in y]

data = [Config(;
    x = view(e2, 1, :),
    y = view(e2, 2, :),
    mode = "markers",
    marker = (
        color = y, 
        colorscale = "Viridis",
        size = 4, 
        line = (width = 0,)
    ),
    hovertext,
    type = "scattergl" 
)]

layout = Config(
    width = 600,
    height = 600,
    xaxis = (visible = false, showgrid = false, zeroline = false),
    yaxis = (visible = false, showgrid = false, zeroline = false),
    hovermode = "closest",
    plot_bgcolor = "white"
)

Plot(data, layout)

Final notes

This example shows how to index and visualize the MNIST dataset using low-dimensional UMAP projections. The projections are computed with SimSearchManifoldLearning; SimilaritySearch is also used to compute the all \(k\) nearest neighbors needed by the UMAP model. This notebook should be run with several threads to reduce running time.
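To make the all \(k\) nearest neighbors step concrete, here is a minimal exhaustive, multithreaded sketch in Base Julia. It is not the algorithm used by SimilaritySearch (which relies on the index); the function name `allknn_bruteforce` is hypothetical, and excluding each point from its own neighbor list is an assumption of this sketch.

```julia
using Base.Threads

# For every column of X, find the k closest other columns (squared Euclidean distance).
function allknn_bruteforce(X::AbstractMatrix{<:AbstractFloat}, k::Integer)
    n = size(X, 2)
    knns = Matrix{Int}(undef, k, n)
    @threads for i in 1:n
        d = [sum(abs2, view(X, :, i) .- view(X, :, j)) for j in 1:n]
        d[i] = Inf                            # assumption: a point is not its own neighbor
        knns[:, i] = partialsortperm(d, 1:k)  # each thread writes a distinct column
    end
    knns
end

# Four 1-dimensional points stored as columns (made-up data).
X = Float32[0 1 3 7]
println(allknn_bruteforce(X, 1))  # [2 1 2 3]
```

Each iteration is independent, so the loop scales with the number of threads; the quadratic cost of this baseline is what makes an index worthwhile at MNIST scale.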

Environment and dependencies

Julia Version 1.10.11
Commit a2b11907d7b (2026-03-09 14:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (x86_64-apple-darwin24.0.0)
  CPU: 8 × Intel(R) Core(TM) i5-8257U CPU @ 1.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
  JULIA_NUM_THREADS = auto
  JULIA_PROJECT = @.
  JULIA_LOAD_PATH = @:@stdlib
Status `~/Research/SimilaritySearchDemos/Project.toml`
  [aaaa29a8] Clustering v0.15.8
  [944b1d66] CodecZlib v0.7.8
  [5ae59095] Colors v0.13.1
  [a93c6f00] DataFrames v1.8.1
  [c5bfea45] Embeddings v0.4.6
  [f67ccb44] HDF5 v0.17.2
  [916415d5] Images v0.26.2
  [b20bd276] InvertedFiles v0.9.2
 [682c06a0] JSON v0.21.4
  [23fbe1c1] Latexify v0.16.10
  [eb30cadb] MLDatasets v0.7.21
  [06eb3307] ManifoldLearning v0.9.0
 [ca7969ec] PlotlyLight v0.11.0
  [27ebfcd6] Primes v0.5.7
  [ca7ab67e] SimSearchManifoldLearning v0.4.0
  [053f045d] SimilaritySearch v0.14.3
 [2913bbd2] StatsBase v0.33.21
  [7f6f6c8a] TextSearch v0.20.0
Info Packages marked with ⌃ and ⌅ have new versions available. Those with ⌃ may be upgradable, but those with ⌅ are restricted by compatibility constraints from upgrading. To see why use `status --outdated`