using SimilaritySearch, SimSearchManifoldLearning, Primes, Plots, StatsBase, LinearAlgebra, Markdown, Random
Prime numbers
by: Eric S. Téllez
This demonstration is about prime numbers and their similarity based on their sets of factors. It is a well-known demonstration of SimSearchManifoldLearning.jl. This notebook does not require downloading any dataset.
Note: This example needs a lot of computing power; therefore, you may want to set the environment variable JULIA_NUM_THREADS=auto before running julia.
Now we can define our dataset. The set of factors of each integer is computed with the Primes package. Note that we use VectorDatabase to represent the dataset.
n = 100_000
F = Vector{Vector{Int32}}(undef, n+1)

function encode_factors(num)
    sort!(Int32[convert(Int32, f) for f in factor(Set, num)])
end

F[1] = Int32[1]
for i in 2:n+1
    s = encode_factors(i)
    F[i] = s
end

db = VectorDatabase(F)
#dist = JaccardDistance() # Other distances from SimilaritySearch
#dist = IntersectionDissimilarity()
#dist = CosineDistanceSet()
dist = DiceDistance()
We use sorted Int32 arrays of prime factors to represent each integer. The cell above selects the distance function for this set representation (Dice, with Jaccard, intersection, and cosine-set alternatives commented out). While other representations may perform faster, this one is quite straightforward and demonstrates the use of user-defined distance functions.
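For intuition, here is a minimal standalone sketch of the standard Dice dissimilarity between two factor sets; `dice` is a hypothetical helper written for illustration, assuming DiceDistance follows the usual definition.

```julia
# Standard Dice dissimilarity between two sets: 1 - 2|A ∩ B| / (|A| + |B|).
# Hypothetical helper for illustration only; SimilaritySearch's DiceDistance
# is assumed to follow this definition.
dice(a, b) = 1 - 2 * length(intersect(a, b)) / (length(a) + length(b))

dice(Int32[2, 3, 7], Int32[2, 3, 7])   # identical factor sets -> 0.0
dice(Int32[2, 373], Int32[2, 3, 373])  # one extra factor -> 1 - 4/5 = 0.2
```

Numbers with identical factor sets are at distance zero, so every multiple of an integer built from the same primes is a perfect match.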
Index construction
Note that the prime factor sets can be pretty large for some large \(n\), and this implies challenges for metric indexes (since it is related to the intrinsic dimension of the dataset). We used a kernel that starts 64 threads; it solves \(n = 100000\) in a few seconds, but it can take a long time with a single thread and larger \(n\) values. The index is used by the visualization algorithm (UMAP) to construct an all-kNN graph, which can be a quite costly procedure.
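To get a feel for the dataset before indexing, the following small sketch (my own, not part of the demo) counts distinct prime factors per integer on a smaller range; `sizes` and the range bound are illustrative choices.

```julia
using Primes, StatsBase

# ω(i): number of distinct prime factors. It grows very slowly (≈ log log i),
# so individual factor sets are short; the difficulty for the index comes
# from the combinatorial variety of these sets, not their length.
sizes = [length(factor(Set, i)) for i in 2:10_000]
countmap(sizes)  # integers below 10_000 have between 1 and 5 distinct prime factors
```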
G = SearchGraph(; db, dist)
ctx = SearchGraphContext(hyperparameters_callback=OptimizeParameters(MinRecall(0.95)))  # 1
index!(G, ctx)  # 2
optimize_index!(G, ctx, MinRecall(0.9))  # 3

1. Defines the index and the search context (caches and hyperparameters); in particular, we request a very high quality build with MinRecall(0.95). High quality constructions yield faster queries due to the underlying graph structure.
2. The actual indexing procedure using the given search context.
3. Optimizes the index to trade quality for speed.
Searching
Our index can solve queries over the entire dataset; for instance, finding integers with similar factor sets as nearest neighbor queries.
function search_and_render(index, ctx, qnum, qenc, res, k)
    res = reuse!(res, k)
    @time search(index, ctx, qenc, res)
    L = [
        """## result list for num: $(qnum), factors: $qenc""",
        """| nn | num | factors | dist |""",
        """|----|------|--------|------|"""
    ]

    for (j, p) in enumerate(res)
        push!(L, """| $j | $(p.id) | $(index[p.id]) | $(round(p.weight, digits=3)) |""")
    end

    Markdown.parse(join(L, "\n"))
end
display(md"## Search examples (random)")
res = KnnResult(12)

for qnum in [42, 51, 1000, 1492, 3192]
    qenc = encode_factors(qnum)
    search_and_render(G, ctx, qnum, qenc, res, maxlength(res)) |> display
end
0.000450 seconds (3 allocations: 2.969 KiB)
0.000511 seconds (4 allocations: 3.766 KiB)
0.000764 seconds (4 allocations: 3.766 KiB)
0.000183 seconds (3 allocations: 848 bytes)
0.000342 seconds (3 allocations: 2.969 KiB)
Search examples (random)
result list for num: 42, factors: Int32[2, 3, 7]

| nn | num | factors | dist |
|----|-----|---------|------|
| 1 | 42 | Int32[2, 3, 7] | 0.0 |
| 2 | 84 | Int32[2, 3, 7] | 0.0 |
| 3 | 126 | Int32[2, 3, 7] | 0.0 |
| 4 | 168 | Int32[2, 3, 7] | 0.0 |
| 5 | 252 | Int32[2, 3, 7] | 0.0 |
| 6 | 294 | Int32[2, 3, 7] | 0.0 |
| 7 | 336 | Int32[2, 3, 7] | 0.0 |
| 8 | 378 | Int32[2, 3, 7] | 0.0 |
| 9 | 504 | Int32[2, 3, 7] | 0.0 |
| 10 | 672 | Int32[2, 3, 7] | 0.0 |
| 11 | 756 | Int32[2, 3, 7] | 0.0 |
| 12 | 588 | Int32[2, 3, 7] | 0.0 |
result list for num: 51, factors: Int32[3, 17]

| nn | num | factors | dist |
|----|-----|---------|------|
| 1 | 51 | Int32[3, 17] | 0.0 |
| 2 | 153 | Int32[3, 17] | 0.0 |
| 3 | 459 | Int32[3, 17] | 0.0 |
| 4 | 867 | Int32[3, 17] | 0.0 |
| 5 | 1377 | Int32[3, 17] | 0.0 |
| 6 | 2601 | Int32[3, 17] | 0.0 |
| 7 | 4131 | Int32[3, 17] | 0.0 |
| 8 | 7803 | Int32[3, 17] | 0.0 |
| 9 | 12393 | Int32[3, 17] | 0.0 |
| 10 | 14739 | Int32[3, 17] | 0.0 |
| 11 | 23409 | Int32[3, 17] | 0.0 |
| 12 | 37179 | Int32[3, 17] | 0.0 |
result list for num: 1000, factors: Int32[2, 5]

| nn | num | factors | dist |
|----|-----|---------|------|
| 1 | 10 | Int32[2, 5] | 0.0 |
| 2 | 20 | Int32[2, 5] | 0.0 |
| 3 | 40 | Int32[2, 5] | 0.0 |
| 4 | 50 | Int32[2, 5] | 0.0 |
| 5 | 80 | Int32[2, 5] | 0.0 |
| 6 | 100 | Int32[2, 5] | 0.0 |
| 7 | 160 | Int32[2, 5] | 0.0 |
| 8 | 200 | Int32[2, 5] | 0.0 |
| 9 | 250 | Int32[2, 5] | 0.0 |
| 10 | 320 | Int32[2, 5] | 0.0 |
| 11 | 400 | Int32[2, 5] | 0.0 |
| 12 | 500 | Int32[2, 5] | 0.0 |
result list for num: 1492, factors: Int32[2, 373]

| nn | num | factors | dist |
|----|-----|---------|------|
| 1 | 746 | Int32[2, 373] | 0.0 |
| 2 | 1492 | Int32[2, 373] | 0.0 |
| 3 | 2984 | Int32[2, 373] | 0.0 |
| 4 | 5968 | Int32[2, 373] | 0.0 |
| 5 | 11936 | Int32[2, 373] | 0.0 |
| 6 | 23872 | Int32[2, 373] | 0.0 |
| 7 | 47744 | Int32[2, 373] | 0.0 |
| 8 | 95488 | Int32[2, 373] | 0.0 |
| 9 | 61918 | Int32[2, 83, 373] | 0.2 |
| 10 | 2238 | Int32[2, 3, 373] | 0.2 |
| 11 | 8206 | Int32[2, 11, 373] | 0.2 |
| 12 | 9698 | Int32[2, 13, 373] | 0.2 |
result list for num: 3192, factors: Int32[2, 3, 7, 19]

| nn | num | factors | dist |
|----|-----|---------|------|
| 1 | 798 | Int32[2, 3, 7, 19] | 0.0 |
| 2 | 1596 | Int32[2, 3, 7, 19] | 0.0 |
| 3 | 2394 | Int32[2, 3, 7, 19] | 0.0 |
| 4 | 3192 | Int32[2, 3, 7, 19] | 0.0 |
| 5 | 4788 | Int32[2, 3, 7, 19] | 0.0 |
| 6 | 5586 | Int32[2, 3, 7, 19] | 0.0 |
| 7 | 6384 | Int32[2, 3, 7, 19] | 0.0 |
| 8 | 7182 | Int32[2, 3, 7, 19] | 0.0 |
| 9 | 9576 | Int32[2, 3, 7, 19] | 0.0 |
| 10 | 11172 | Int32[2, 3, 7, 19] | 0.0 |
| 11 | 12768 | Int32[2, 3, 7, 19] | 0.0 |
| 12 | 14364 | Int32[2, 3, 7, 19] | 0.0 |
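The many zero-distance hits above are expected: integers sharing the same set of distinct prime factors (the same squarefree kernel, or radical) are indistinguishable under set dissimilarities. A quick standalone check (`radical` is my own helper name, not part of the demo):

```julia
using Primes

# All numbers of the form 2^a * 3^b * 7^c (a, b, c ≥ 1) share the factor set
# {2, 3, 7}, hence have distance 0 to 42 under any set dissimilarity.
radical(n) = sort!(collect(factor(Set, n)))  # distinct prime factors, sorted

radical(42) == radical(84) == radical(588)  # true: all three tie at distance 0
```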
Visualizing with UMAP projection
We choose to initialize the embedding randomly; this could yield lower quality embeddings, but it is much faster than other techniques such as spectral layout. Note that we pass the search graph G. We also make a second call to compute a 3D embedding, used as a kind of colour embedding; there we pass U2 to avoid recomputing several of the involved structures.
function normcolors(V)
    min_, max_ = extrema(V)
    V .= (V .- min_) ./ (max_ - min_)
    V .= clamp.(V, 0, 1)
end

normcolors(@view e3[1, :])
normcolors(@view e3[2, :])
normcolors(@view e3[3, :])

let C = [RGB(c[1], c[2], c[3]) for c in eachcol(e3)],
    X = view(e2, 1, :),
    Y = view(e2, 2, :)

    scatter(X, Y, color=C, fmt=:png, alpha=0.2, size=(600, 600), ma=0.3, ms=2, msw=0, label="", yticks=nothing, xticks=nothing, xaxis=false, yaxis=false)
end
plot!()
e2, e3 = let min_dist=0.5f0,
    k=20,
    n_epochs=75,
    neg_sample_rate=3,
    tol=1e-3,
    layout=SpectralLayout()

    @time "Compute 2D UMAP model" U2 = fit(UMAP, G; k, neg_sample_rate, layout, n_epochs, tol, min_dist)
    @time "Compute 3D UMAP model" U3 = fit(U2, 3; neg_sample_rate, n_epochs, tol)
    @time "predicting 2D embeddings" e2 = clamp.(predict(U2), -10f0, 10f0)
    @time "predicting 3D embeddings" e3 = clamp.(predict(U3), -10f0, 10f0)
    e2, e3
end
Environment and dependencies
Julia Version 1.10.9
Commit 5595d20a287 (2025-03-10 12:51 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 64 × Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, cascadelake)
Threads: 64 default, 0 interactive, 32 GC (on 64 virtual cores)
Environment:
JULIA_PROJECT = .
JULIA_NUM_THREADS = auto
JULIA_LOAD_PATH = @:@stdlib
Status `~/sites/SimilaritySearchDemos/Project.toml`
[aaaa29a8] Clustering v0.15.8
[944b1d66] CodecZlib v0.7.8
[a93c6f00] DataFrames v1.7.0
[c5bfea45] Embeddings v0.4.6
[f67ccb44] HDF5 v0.17.2
[b20bd276] InvertedFiles v0.8.0 `~/.julia/dev/InvertedFiles`
[682c06a0] JSON v0.21.4
[23fbe1c1] Latexify v0.16.6
[eb30cadb] MLDatasets v0.7.18
[06eb3307] ManifoldLearning v0.9.0
⌃ [ca7969ec] PlotlyLight v0.11.0
[91a5bcdd] Plots v1.40.11
[27ebfcd6] Primes v0.5.7
[ca7ab67e] SimSearchManifoldLearning v0.3.0 `~/.julia/dev/SimSearchManifoldLearning`
[053f045d] SimilaritySearch v0.12.0 `~/.julia/dev/SimilaritySearch`
⌅ [2913bbd2] StatsBase v0.33.21
[f3b207a7] StatsPlots v0.15.7
[7f6f6c8a] TextSearch v0.19.0 `~/.julia/dev/TextSearch`
Info Packages marked with ⌃ and ⌅ have new versions available. Those with ⌃ may be upgradable, but those with ⌅ are restricted by compatibility constraints from upgrading. To see why use `status --outdated`