using SimilaritySearch, SimSearchManifoldLearning, Plots, StatsBase, LinearAlgebra, Markdown, MLDatasets, Random
Visualizing MNIST database
by: Eric S. Téllez
This example creates a visualization of the MNIST images (handwritten digits), using MLDatasets.jl to retrieve them.
Note: This example needs a lot of computing power; therefore, you may want to set the environment variable JULIA_NUM_THREADS=auto before running julia.
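A quick way to confirm that the thread setting took effect is to query the thread count from inside the session. The following is a minimal sketch using only Base Julia; the launch commands in the comment are common ways to set the thread count and assume a POSIX-like shell.

```julia
# Start Julia with, e.g.:  JULIA_NUM_THREADS=auto julia   (or: julia --threads=auto)
using Base.Threads

nt = nthreads()  # number of threads available to this session
println("Julia is running with $nt thread(s)")
nt == 1 && @warn "Single-threaded session; indexing and UMAP may be slow"
```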
db, y, dist = let data = MNIST(split=:train)
    T, y = data.features, data.targets
    n = size(T, 3)
    MatrixDatabase(Float32.(reshape(T, (28*28, n)))), y, SqL2_asf32()
end
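The loading step flattens each 28×28 image into one column of a 784×n matrix. The sketch below illustrates only that reshape, using a small random array as a stand-in for the MNIST tensor (the array `T` and its size are assumptions for illustration; no dataset is downloaded).

```julia
# Synthetic stand-in for the MNIST tensor: 5 "images" of 28×28 pixels.
T = rand(Float32, 28, 28, 5)
n = size(T, 3)

# Flatten every image into one column of a 784×n matrix.
X = reshape(T, (28 * 28, n))
println(size(X))

# Column i holds image i in column-major order, so the image can be recovered.
img2 = reshape(X[:, 2], (28, 28))
println(img2 == T[:, :, 2])
```

Because each object is a column, a column index in the matrix is also the object identifier used when rendering search results.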
Now we can create the index
index = SearchGraph(; dist, db)  # (1)
ctx = SearchGraphContext(hyperparameters_callback=OptimizeParameters(MinRecall(0.99)))
index!(index, ctx)  # (2)
optimize_index!(index, ctx, MinRecall(0.95))  # (3)
1. Defines the index and the search context (caches and hyperparameters); in particular, we use a very high-quality build, MinRecall(0.99); high-quality constructions yield faster queries due to the underlying graph structure.
2. Runs the actual indexing procedure using the given search context.
3. Optimizes the index to trade quality for speed.
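The MinRecall targets refer to recall: the fraction of the true nearest neighbors that an approximate search returns. The sketch below shows only this definition on hypothetical identifier lists; it is not how SimilaritySearch estimates recall internally, and the helper `recall` is a name introduced here for illustration.

```julia
# Recall of an approximate result against the exact one:
# the fraction of true neighbors that the approximate search returned.
recall(exact_ids, approx_ids) = length(intersect(exact_ids, approx_ids)) / length(exact_ids)

exact  = [3, 17, 42, 8, 25]   # hypothetical ids from an exhaustive search
approx = [3, 42, 8, 99, 25]   # hypothetical ids returned by an index

r = recall(exact, approx)     # 4 of the 5 true neighbors were found
println("recall = ", r)
```

Under this reading, a target of 0.99 asks for almost all true neighbors, while 0.95 tolerates a few more misses in exchange for speed.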
Searching
Our index can solve queries over the entire dataset, for instance, finding the images most similar to a given digit as nearest neighbor queries.
function search_and_render(index, ctx, q, res)
    res = reuse!(res)
    @time search(index, ctx, q, res)
    qinverted = 1 .- reshape(q, (28, 28))' # invert the query image for distinguishability
    h = hcat(qinverted, [reshape(index[id_], (28, 28))' for id_ in IdView(res)]...)
    Gray.(h)
end

res = KnnResult(12)
for _ in 1:7
    for qid in rand(1:length(index))
        display(search_and_render(index, ctx, index[qid], res))
    end
end
0.000199 seconds (2 allocations: 32 bytes)
0.000138 seconds (2 allocations: 32 bytes)
0.000151 seconds (2 allocations: 32 bytes)
0.000100 seconds (2 allocations: 32 bytes)
0.000136 seconds (2 allocations: 32 bytes)
0.000130 seconds (2 allocations: 32 bytes)
0.000161 seconds (2 allocations: 32 bytes)
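For reference, the result that each query approximates can be computed exhaustively. The following is a minimal brute-force sketch in Base Julia over a synthetic database (the random matrix `X`, the helper `sqdist`, and the sizes are assumptions for illustration); it evaluates every distance, which is the work the index avoids.

```julia
using Random

Random.seed!(0)
X = rand(Float32, 784, 1000)    # synthetic database, one object per column
q = X[:, 1]                     # query with a database element, as in the loop above

# Squared Euclidean distance between two vectors.
sqdist(a, b) = sum(abs2, a .- b)

d = [sqdist(q, x) for x in eachcol(X)]   # distance to every object
k = 12
knn_ids = partialsortperm(d, 1:k)        # ids of the k smallest distances, closest first

println(knn_ids[1])                      # the query itself, at distance zero
```

The cost of this scan grows linearly with the database size for every query, which is why a graph index pays off on 60,000 images.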
UMAP Visualization
e2, e3 = let min_dist=0.5f0,
             k=7,
             n_epochs=75,
             neg_sample_rate=3,
             tol=1e-3,
             layout=SpectralLayout()
    @time "Compute 2D UMAP model" U2 = fit(UMAP, index; k, neg_sample_rate, layout, n_epochs, tol, min_dist)
    @time "Compute 3D UMAP model" U3 = fit(U2, 3; neg_sample_rate, n_epochs, tol)
    @time "predicting 2D embeddings" e2 = clamp.(predict(U2), -10f0, 10f0)
    @time "predicting 3D embeddings" e3 = clamp.(predict(U3), -10f0, 10f0)
    e2, e3
end

# Rescale a vector in place to the range [0, 1]
function normcolors(V)
    min_, max_ = extrema(V)
    V .= (V .- min_) ./ (max_ - min_)
    V .= clamp.(V, 0, 1)
end

# Each component of the 3D embedding becomes one color channel
normcolors(@view e3[1, :])
normcolors(@view e3[2, :])
normcolors(@view e3[3, :])

let C = [RGB(c[1], c[2], c[3]) for c in eachcol(e3)],
    X = view(e2, 1, :),
    Y = view(e2, 2, :)
    scatter(X, Y, color=C, fmt=:png, alpha=0.2, size=(600, 600), ma=0.3, ms=2, msw=0, label="", yticks=nothing, xticks=nothing, xaxis=false, yaxis=false)
    # Label 100 random points with their digit
    for i in 1:100
        j = rand(1:length(y))
        annotate!(X[j], Y[j], text(y[j], :black, :right, 8, "noto"))
    end
end
plot!()
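The plot places points with the 2D embedding and colors them with the 3D embedding, after scaling each of its three components to [0, 1]. The sketch below reproduces only that scaling on a tiny hypothetical 3×3 matrix standing in for e3, using an out-of-place helper (`minmax01` is a name introduced here) and plain tuples instead of a color type.

```julia
# Hypothetical 3×n embedding standing in for e3 (one point per column).
E = Float32[-4 0 6; 2 2.5 3; -1 0 1]

# Min-max scale a vector to [0, 1]; assumes the vector is not constant.
minmax01(v) = (v .- minimum(v)) ./ (maximum(v) - minimum(v))

# Apply the scaling to each row, i.e., to each embedding component.
C = mapslices(minmax01, E; dims=2)

# One (r, g, b) triple per point.
colors = [Tuple(c) for c in eachcol(C)]
println(colors[2])
```

Points that are close in the 3D embedding receive similar colors, so the coloring adds a second view of the neighborhood structure to the 2D scatter plot.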
Final notes
This example shows how to index and visualize the MNIST dataset using UMAP low-dimensional projections. The projections are computed with SimSearchManifoldLearning; SimilaritySearch is also used to compute the all \(k\) nearest neighbors needed by the UMAP model. This notebook should be run with several threads to reduce running time.
Environment and dependencies
Julia Version 1.10.9
Commit 5595d20a287 (2025-03-10 12:51 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 64 × Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, cascadelake)
Threads: 64 default, 0 interactive, 32 GC (on 64 virtual cores)
Environment:
JULIA_PROJECT = .
JULIA_NUM_THREADS = auto
JULIA_LOAD_PATH = @:@stdlib
Status `~/sites/SimilaritySearchDemos/Project.toml`
[aaaa29a8] Clustering v0.15.8
[944b1d66] CodecZlib v0.7.8
[a93c6f00] DataFrames v1.7.0
[c5bfea45] Embeddings v0.4.6
[f67ccb44] HDF5 v0.17.2
[b20bd276] InvertedFiles v0.8.0 `~/.julia/dev/InvertedFiles`
[682c06a0] JSON v0.21.4
[23fbe1c1] Latexify v0.16.6
[eb30cadb] MLDatasets v0.7.18
[06eb3307] ManifoldLearning v0.9.0
⌃ [ca7969ec] PlotlyLight v0.11.0
[91a5bcdd] Plots v1.40.11
[27ebfcd6] Primes v0.5.7
[ca7ab67e] SimSearchManifoldLearning v0.3.0 `~/.julia/dev/SimSearchManifoldLearning`
[053f045d] SimilaritySearch v0.12.0 `~/.julia/dev/SimilaritySearch`
⌅ [2913bbd2] StatsBase v0.33.21
[f3b207a7] StatsPlots v0.15.7
[7f6f6c8a] TextSearch v0.19.0 `~/.julia/dev/TextSearch`
Info Packages marked with ⌃ and ⌅ have new versions available. Those with ⌃ may be upgradable, but those with ⌅ are restricted by compatibility constraints from upgrading. To see why use `status --outdated`