Prime numbers

by: Eric S. Téllez

This demonstration explores prime numbers and the similarity of integers based on their sets of prime factors. It is a well-known demonstration of SimSearchManifoldLearning.jl. This notebook does not require downloading any dataset.

Note: This example needs a lot of computing power; therefore you may want to set the environment variable JULIA_NUM_THREADS=auto before running julia.
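You can verify that the setting took effect from within Julia; a quick sanity check:

using Base.Threads
nthreads()  # reports the number of threads available to Julia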

using SimilaritySearch, SimSearchManifoldLearning, Primes, Plots, StatsBase, LinearAlgebra, Markdown, Random

Now we can define our dataset. The sets of prime factors are computed with the Primes package. Note that we use VectorDatabase to represent the dataset.

n = 100_000
F = Vector{Vector{Int32}}(undef, n+1)  # F[i] holds the distinct prime factors of i

# Sorted Int32 array of the distinct prime factors of `num` (multiplicities discarded)
function encode_factors(num)
    sort!(Int32[f for f in factor(Set, num)])
end

F[1] = Int32[1]  # 1 has no prime factors; we represent it with the singleton [1]

for i in 2:n+1
    F[i] = encode_factors(i)
end
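For instance, \(12 = 2^2 \cdot 3\), so its encoding keeps only the distinct primes:

encode_factors(12)  # -> Int32[2, 3]; multiplicities are discarded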

db = VectorDatabase(F)
#dist = JaccardDistance()  # Other distances from SimilaritySearch
#dist = IntersectionDissimilarity()
#dist = CosineDistanceSet()
dist = DiceDistance()
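For reference, the Dice dissimilarity between two sets \(u\) and \(v\) is commonly defined as

\[ \operatorname{Dice}(u, v) = 1 - \frac{2\,|u \cap v|}{|u| + |v|}, \]

which is \(0\) when the factor sets coincide and \(1\) when they share no prime; we assume DiceDistance follows this standard definition.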

We represent each integer by the sorted Int32 array of its distinct prime factors. The preceding cell selects the Dice dissimilarity; the commented alternatives are other set dissimilarities available in SimilaritySearch. While other representations may perform faster, this one is quite straightforward, and the same pattern also supports user-defined distance functions.
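As a minimal sketch of that pattern (assuming SimilaritySearch dispatches distances through Distances.jl's SemiMetric/evaluate API; SortedDiceDistance and intersection_size are illustrative names, not part of the library):

using SimilaritySearch, Distances

# Hypothetical user-defined distance: Dice dissimilarity over sorted integer arrays
struct SortedDiceDistance <: SemiMetric end

# Count the common elements of two sorted arrays with a linear merge scan
function intersection_size(u::AbstractVector, v::AbstractVector)
    i = j = 1
    c = 0
    @inbounds while i <= length(u) && j <= length(v)
        if u[i] == v[j]
            c += 1; i += 1; j += 1
        elseif u[i] < v[j]
            i += 1
        else
            j += 1
        end
    end
    c
end

function Distances.evaluate(::SortedDiceDistance, u, v)
    1.0 - 2 * intersection_size(u, v) / (length(u) + length(v))
end

# evaluate(SortedDiceDistance(), Int32[2, 3, 7], Int32[2, 3]) ≈ 0.2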

Index construction

Note that the sets of prime factors can be fairly large for large \(n\), and this poses challenges for metric indexes (it is related to the intrinsic dimension of the dataset). We used a kernel that starts 64 threads; it indexes the \(100000\) elements in a few seconds, but it can take quite a long time with a single thread or with larger values of \(n\). The index is later used by the visualization algorithm (UMAP) to construct an all-kNN graph, which can be a quite costly procedure.

G = SearchGraph(; db, dist)
ctx = SearchGraphContext(hyperparameters_callback=OptimizeParameters(MinRecall(0.95)))
index!(G, ctx)
optimize_index!(G, ctx, MinRecall(0.9))

1. The SearchGraph and SearchGraphContext lines define the index and the search context (caches and hyperparameters); in particular, we request a very high-quality build with MinRecall(0.95). High-quality constructions yield faster queries due to the underlying graph structure.
2. index! is the actual indexing procedure, using the given search context.
3. optimize_index! re-tunes the index to trade quality for speed.
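As noted above, the UMAP step consumes an all-kNN graph. A minimal sketch of computing such a graph directly with the index, assuming allknn follows SimilaritySearch's documented (index, context, k) signature and returns neighbor and distance matrices:

# Sketch: k nearest neighbors of every indexed element (the kind of
# all-kNN graph UMAP builds internally); signature assumed from the
# SimilaritySearch API.
knns, dists = allknn(G, ctx, 16)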

Searching

Our index can solve queries over the entire dataset; for instance, it can find the numbers whose factor sets most resemble a given query's, posed as nearest neighbor queries.

# Solve a k-nearest-neighbor query for `qenc` and render the result as a Markdown table
function search_and_render(index, ctx, qnum, qenc, res, k)
    res = reuse!(res, k)  # reset the result set to hold up to k neighbors
    @time search(index, ctx, qenc, res)

    L = [
        """## result list for num: $(qnum), factors: $qenc""",
        """| nn | num | factors | dist |""",
        """|----|------|--------|------|"""
    ]
    for (j, p) in enumerate(res)
        push!(L, """| $j | $(p.id) | $(index[p.id]) | $(round(p.weight, digits=3)) |""")     
    end

    Markdown.parse(join(L, "\n"))
end

display(md"## Search examples (random)")

res = KnnResult(12)
for qnum in [42, 51, 1000, 1492, 3192]
    qenc = encode_factors(qnum)
    search_and_render(G, ctx, qnum, qenc, res, maxlength(res)) |> display
end
  0.000450 seconds (3 allocations: 2.969 KiB)
  0.000511 seconds (4 allocations: 3.766 KiB)
  0.000764 seconds (4 allocations: 3.766 KiB)
  0.000183 seconds (3 allocations: 848 bytes)
  0.000342 seconds (3 allocations: 2.969 KiB)

Search examples (random)

result list for num: 42, factors: Int32[2, 3, 7]

| nn | num | factors | dist |
|----|-----|---------|------|
| 1 | 42 | Int32[2, 3, 7] | 0.0 |
| 2 | 84 | Int32[2, 3, 7] | 0.0 |
| 3 | 126 | Int32[2, 3, 7] | 0.0 |
| 4 | 168 | Int32[2, 3, 7] | 0.0 |
| 5 | 252 | Int32[2, 3, 7] | 0.0 |
| 6 | 294 | Int32[2, 3, 7] | 0.0 |
| 7 | 336 | Int32[2, 3, 7] | 0.0 |
| 8 | 378 | Int32[2, 3, 7] | 0.0 |
| 9 | 504 | Int32[2, 3, 7] | 0.0 |
| 10 | 672 | Int32[2, 3, 7] | 0.0 |
| 11 | 756 | Int32[2, 3, 7] | 0.0 |
| 12 | 588 | Int32[2, 3, 7] | 0.0 |

result list for num: 51, factors: Int32[3, 17]

| nn | num | factors | dist |
|----|-----|---------|------|
| 1 | 51 | Int32[3, 17] | 0.0 |
| 2 | 153 | Int32[3, 17] | 0.0 |
| 3 | 459 | Int32[3, 17] | 0.0 |
| 4 | 867 | Int32[3, 17] | 0.0 |
| 5 | 1377 | Int32[3, 17] | 0.0 |
| 6 | 2601 | Int32[3, 17] | 0.0 |
| 7 | 4131 | Int32[3, 17] | 0.0 |
| 8 | 7803 | Int32[3, 17] | 0.0 |
| 9 | 12393 | Int32[3, 17] | 0.0 |
| 10 | 14739 | Int32[3, 17] | 0.0 |
| 11 | 23409 | Int32[3, 17] | 0.0 |
| 12 | 37179 | Int32[3, 17] | 0.0 |

result list for num: 1000, factors: Int32[2, 5]

| nn | num | factors | dist |
|----|-----|---------|------|
| 1 | 10 | Int32[2, 5] | 0.0 |
| 2 | 20 | Int32[2, 5] | 0.0 |
| 3 | 40 | Int32[2, 5] | 0.0 |
| 4 | 50 | Int32[2, 5] | 0.0 |
| 5 | 80 | Int32[2, 5] | 0.0 |
| 6 | 100 | Int32[2, 5] | 0.0 |
| 7 | 160 | Int32[2, 5] | 0.0 |
| 8 | 200 | Int32[2, 5] | 0.0 |
| 9 | 250 | Int32[2, 5] | 0.0 |
| 10 | 320 | Int32[2, 5] | 0.0 |
| 11 | 400 | Int32[2, 5] | 0.0 |
| 12 | 500 | Int32[2, 5] | 0.0 |

result list for num: 1492, factors: Int32[2, 373]

| nn | num | factors | dist |
|----|-----|---------|------|
| 1 | 746 | Int32[2, 373] | 0.0 |
| 2 | 1492 | Int32[2, 373] | 0.0 |
| 3 | 2984 | Int32[2, 373] | 0.0 |
| 4 | 5968 | Int32[2, 373] | 0.0 |
| 5 | 11936 | Int32[2, 373] | 0.0 |
| 6 | 23872 | Int32[2, 373] | 0.0 |
| 7 | 47744 | Int32[2, 373] | 0.0 |
| 8 | 95488 | Int32[2, 373] | 0.0 |
| 9 | 61918 | Int32[2, 83, 373] | 0.2 |
| 10 | 2238 | Int32[2, 3, 373] | 0.2 |
| 11 | 8206 | Int32[2, 11, 373] | 0.2 |
| 12 | 9698 | Int32[2, 13, 373] | 0.2 |

result list for num: 3192, factors: Int32[2, 3, 7, 19]

| nn | num | factors | dist |
|----|-----|---------|------|
| 1 | 798 | Int32[2, 3, 7, 19] | 0.0 |
| 2 | 1596 | Int32[2, 3, 7, 19] | 0.0 |
| 3 | 2394 | Int32[2, 3, 7, 19] | 0.0 |
| 4 | 3192 | Int32[2, 3, 7, 19] | 0.0 |
| 5 | 4788 | Int32[2, 3, 7, 19] | 0.0 |
| 6 | 5586 | Int32[2, 3, 7, 19] | 0.0 |
| 7 | 6384 | Int32[2, 3, 7, 19] | 0.0 |
| 8 | 7182 | Int32[2, 3, 7, 19] | 0.0 |
| 9 | 9576 | Int32[2, 3, 7, 19] | 0.0 |
| 10 | 11172 | Int32[2, 3, 7, 19] | 0.0 |
| 11 | 12768 | Int32[2, 3, 7, 19] | 0.0 |
| 12 | 14364 | Int32[2, 3, 7, 19] | 0.0 |
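The zero-distance ties in these tables follow directly from the encoding: any multiple that introduces no new prime has exactly the same factor set, since multiplicities are discarded. For example:

encode_factors(42) == encode_factors(84) == encode_factors(126)  # true: all are Int32[2, 3, 7]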

Visualizing with UMAP projection

We initialize the embedding with a spectral layout (layout=SpectralLayout() below); a random initialization would be faster but can yield lower-quality embeddings. Note that we pass the SearchGraph G directly. A second call computes a 3D embedding that serves as a kind of colour embedding: its three normalized coordinates become the RGB channels of each point, so colour varies smoothly across the projection. There we pass U2 to avoid recomputing several of the involved structures.

e2, e3 = let min_dist=0.5f0,
             k=20,
             n_epochs=75,
             neg_sample_rate=3,
             tol=1e-3,
             layout=SpectralLayout()

    @time "Compute 2D UMAP model" U2 = fit(UMAP, G; k, neg_sample_rate, layout, n_epochs, tol, min_dist)
    @time "Compute 3D UMAP model" U3 = fit(U2, 3; neg_sample_rate, n_epochs, tol)
    @time "predicting 2D embeddings" e2 = clamp.(predict(U2), -10f0, 10f0)
    @time "predicting 3D embeddings" e3 = clamp.(predict(U3), -10f0, 10f0)
    e2, e3
end

# Normalize each coordinate of the 3D embedding to [0, 1] so it can serve as an RGB channel
function normcolors(V)
    min_, max_ = extrema(V)
    V .= (V .- min_) ./ (max_ - min_)
    V .= clamp.(V, 0, 1)
end

normcolors(@view e3[1, :])
normcolors(@view e3[2, :])
normcolors(@view e3[3, :])

let C = [RGB(c[1], c[2], c[3]) for c in eachcol(e3)],
    X = view(e2, 1, :),
    Y = view(e2, 2, :)
    scatter(X, Y, color=C, fmt=:png, alpha=0.2, size=(600, 600), ma=0.3, ms=2, msw=0, label="", yticks=nothing, xticks=nothing, xaxis=false, yaxis=false)
end

plot!()

Environment and dependencies

Julia Version 1.10.9
Commit 5595d20a287 (2025-03-10 12:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, cascadelake)
Threads: 64 default, 0 interactive, 32 GC (on 64 virtual cores)
Environment:
  JULIA_PROJECT = .
  JULIA_NUM_THREADS = auto
  JULIA_LOAD_PATH = @:@stdlib
Status `~/sites/SimilaritySearchDemos/Project.toml`
  [aaaa29a8] Clustering v0.15.8
  [944b1d66] CodecZlib v0.7.8
  [a93c6f00] DataFrames v1.7.0
  [c5bfea45] Embeddings v0.4.6
  [f67ccb44] HDF5 v0.17.2
  [b20bd276] InvertedFiles v0.8.0 `~/.julia/dev/InvertedFiles`
  [682c06a0] JSON v0.21.4
  [23fbe1c1] Latexify v0.16.6
  [eb30cadb] MLDatasets v0.7.18
  [06eb3307] ManifoldLearning v0.9.0
⌃ [ca7969ec] PlotlyLight v0.11.0
  [91a5bcdd] Plots v1.40.11
  [27ebfcd6] Primes v0.5.7
  [ca7ab67e] SimSearchManifoldLearning v0.3.0 `~/.julia/dev/SimSearchManifoldLearning`
  [053f045d] SimilaritySearch v0.12.0 `~/.julia/dev/SimilaritySearch`
⌅ [2913bbd2] StatsBase v0.33.21
  [f3b207a7] StatsPlots v0.15.7
  [7f6f6c8a] TextSearch v0.19.0 `~/.julia/dev/TextSearch`
Info Packages marked with ⌃ and ⌅ have new versions available. Those with ⌃ may be upgradable, but those with ⌅ are restricted by compatibility constraints from upgrading. To see why use `status --outdated`