Prime numbers

by: Eric S. Téllez

This demonstration is about prime numbers and the similarity of integers based on their prime factors. It is a well-known demonstration of SimSearchManifoldLearning.jl. This notebook does not require downloading any dataset.

Note: This example needs a lot of computing power; therefore, you may want to set the environment variable JULIA_NUM_THREADS=auto before running julia.
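For instance, from a shell (the --project=. flag matches the JULIA_PROJECT setting shown in the environment section at the end of this notebook):

JULIA_NUM_THREADS=auto julia --project=.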

using SimilaritySearch, SimSearchManifoldLearning, Primes, Plots, StatsBase, LinearAlgebra, Markdown, Random

Now we can define our dataset. The sets of prime factors are computed with the Primes package. Note that we use a VectorDatabase to represent the dataset.

n = 100_000

# F[i] holds the sorted distinct prime factors of the integer i;
# 1 has no prime factors, so it is encoded as [1]
F = Vector{Vector{Int32}}(undef, n+1)

function encode_factors(num)
    # factor(Set, num) returns the set of distinct prime factors (Primes package)
    sort!(Int32[f for f in factor(Set, num)])
end

F[1] = Int32[1]

for i in 2:n+1
    F[i] = encode_factors(i)
end

db = VectorDatabase(F)
#dist = JaccardDistance()  # Other distances from SimilaritySearch
#dist = IntersectionDissimilarity()
#dist = CosineDistanceSet()
dist = DiceDistance()
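A quick sanity check of the representation and of the selected distance; evaluate is the distance-evaluation entry point of SimilaritySearch, and the expected values in the comments follow from the Dice dissimilarity, 1 - 2|A ∩ B| / (|A| + |B|):

encode_factors(12)                                      # Int32[2, 3]
encode_factors(97)                                      # Int32[97], 97 is prime
evaluate(dist, encode_factors(10), encode_factors(15))  # {2,5} vs {3,5} -> 0.5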

We use ordered Int32 arrays of distinct prime factors to represent each integer. The cell above selects the Dice distance for this representation; the commented lines show other set dissimilarities available in SimilaritySearch. While other representations may perform faster, this one is quite straightforward and also demonstrates how user-defined distance functions can be plugged in.
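For instance, here is a minimal sketch of a user-defined distance, assuming the SemiMetric/evaluate convention that SimilaritySearch follows; MyDiceDistance is a hypothetical name that mimics the built-in DiceDistance on sorted arrays:

import SimilaritySearch: evaluate, SemiMetric

struct MyDiceDistance <: SemiMetric end

function evaluate(::MyDiceDistance, u, v)
    # merge-style scan over two sorted arrays counting common elements
    c = 0
    i = j = 1
    while i <= length(u) && j <= length(v)
        if u[i] == v[j]
            c += 1; i += 1; j += 1
        elseif u[i] < v[j]
            i += 1
        else
            j += 1
        end
    end
    1.0 - 2c / (length(u) + length(v))  # Dice dissimilarity on sets
end

An instance, MyDiceDistance(), could then be passed as dist when building the index, exactly like the built-in distances above.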

Index construction

Note that the sets of prime factors become pretty large for some large n, and this implies challenges for metric indexes (since it is related to the intrinsic dimension of the dataset). We used a kernel that starts 64 threads; it indexes n = 100,000 items in a few seconds, but it can take much longer with a single thread and larger values of n. The index is used by the visualization algorithm (UMAP) to construct an all-k-nearest-neighbor graph, which can be a quite costly procedure.

G = SearchGraph(; db, dist)
ctx = SearchGraphContext(hyperparameters_callback=OptimizeParameters(MinRecall(0.95)))
index!(G, ctx)
optimize_index!(G, ctx, MinRecall(0.9))
In the cell above:

1. The first two lines define the index and the search context (caches and hyperparameters); in particular, we request a very high-quality build with MinRecall(0.95). High-quality constructions yield faster queries due to the underlying graph structure.
2. index! runs the actual indexing procedure using the given search context.
3. optimize_index! tunes the index to trade off quality and speed.
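A hedged sketch of how one could estimate the empirical recall of the approximate index on a small query sample; it assumes that searchbatch, ExhaustiveSearch, and macrorecall accept these arguments as in recent SimilaritySearch releases, and sample_ids is an ad-hoc name:

# compare the approximate index against a brute-force baseline (assumed API)
sample_ids = rand(1:length(F), 64)
Q = VectorDatabase(F[sample_ids])
k = 12
goldknns, _ = searchbatch(ExhaustiveSearch(; db, dist), ctx, Q, k)  # exact answers
knns, _ = searchbatch(G, ctx, Q, k)                                 # approximate answers
@show macrorecall(goldknns, knns)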

Searching

Our index can solve queries over the entire dataset, for instance, finding the integers whose factorizations are most similar to a given one as nearest neighbor queries.

function search_and_render(index, ctx, qnum, qenc, res, k)
    res = reuse!(res, k)  # recycle the result container for k neighbors
    @time search(index, ctx, qenc, res)

    # render the result set as a markdown table
    L = [
        """## result list for num: $(qnum), factors: $qenc""",
        """| nn | num | factors | dist |""",
        """|----|------|--------|------|"""
    ]
    for (j, p) in enumerate(viewitems(res))
        push!(L, """| $j | $(p.id) | $(index[p.id]) | $(round(p.weight, digits=3)) |""")
    end

    Markdown.parse(join(L, "\n"))
end

display(md"## Search examples (random)")

res = knnqueue(ctx, 12)
for qnum in [42, 51, 1000, 1492, 3192]
    qenc = encode_factors(qnum)
    search_and_render(G, ctx, qnum, qenc, res, maxlength(res)) |> display
end
  0.000385 seconds (3 allocations: 64 bytes)
  0.000585 seconds (3 allocations: 64 bytes)
  0.000355 seconds (3 allocations: 64 bytes)
  0.000289 seconds (3 allocations: 64 bytes)
  0.000176 seconds (3 allocations: 64 bytes)

Search examples (random)

result list for num: 42, factors: Int32[2, 3, 7]

| nn | num | factors | dist |
|----|-------|----------------|------|
| 1 | 756 | Int32[2, 3, 7] | 0.0 |
| 2 | 42 | Int32[2, 3, 7] | 0.0 |
| 3 | 84 | Int32[2, 3, 7] | 0.0 |
| 4 | 294 | Int32[2, 3, 7] | 0.0 |
| 5 | 1176 | Int32[2, 3, 7] | 0.0 |
| 6 | 2058 | Int32[2, 3, 7] | 0.0 |
| 7 | 2352 | Int32[2, 3, 7] | 0.0 |
| 8 | 4536 | Int32[2, 3, 7] | 0.0 |
| 9 | 5292 | Int32[2, 3, 7] | 0.0 |
| 10 | 7056 | Int32[2, 3, 7] | 0.0 |
| 11 | 8064 | Int32[2, 3, 7] | 0.0 |
| 12 | 15876 | Int32[2, 3, 7] | 0.0 |

result list for num: 51, factors: Int32[3, 17]

| nn | num | factors | dist |
|----|-------|--------------|------|
| 1 | 51 | Int32[3, 17] | 0.0 |
| 2 | 153 | Int32[3, 17] | 0.0 |
| 3 | 459 | Int32[3, 17] | 0.0 |
| 4 | 867 | Int32[3, 17] | 0.0 |
| 5 | 1377 | Int32[3, 17] | 0.0 |
| 6 | 2601 | Int32[3, 17] | 0.0 |
| 7 | 4131 | Int32[3, 17] | 0.0 |
| 8 | 7803 | Int32[3, 17] | 0.0 |
| 9 | 12393 | Int32[3, 17] | 0.0 |
| 10 | 14739 | Int32[3, 17] | 0.0 |
| 11 | 23409 | Int32[3, 17] | 0.0 |
| 12 | 44217 | Int32[3, 17] | 0.0 |

result list for num: 1000, factors: Int32[2, 5]

| nn | num | factors | dist |
|----|------|-------------|------|
| 1 | 10 | Int32[2, 5] | 0.0 |
| 2 | 20 | Int32[2, 5] | 0.0 |
| 3 | 40 | Int32[2, 5] | 0.0 |
| 4 | 50 | Int32[2, 5] | 0.0 |
| 5 | 80 | Int32[2, 5] | 0.0 |
| 6 | 100 | Int32[2, 5] | 0.0 |
| 7 | 160 | Int32[2, 5] | 0.0 |
| 8 | 200 | Int32[2, 5] | 0.0 |
| 9 | 250 | Int32[2, 5] | 0.0 |
| 10 | 320 | Int32[2, 5] | 0.0 |
| 11 | 400 | Int32[2, 5] | 0.0 |
| 12 | 500 | Int32[2, 5] | 0.0 |

result list for num: 1492, factors: Int32[2, 373]

| nn | num | factors | dist |
|----|-------|-------------------|------|
| 1 | 95488 | Int32[2, 373] | 0.0 |
| 2 | 2238 | Int32[2, 3, 373] | 0.2 |
| 3 | 4476 | Int32[2, 3, 373] | 0.2 |
| 4 | 6714 | Int32[2, 3, 373] | 0.2 |
| 5 | 8952 | Int32[2, 3, 373] | 0.2 |
| 6 | 13428 | Int32[2, 3, 373] | 0.2 |
| 7 | 17904 | Int32[2, 3, 373] | 0.2 |
| 8 | 20142 | Int32[2, 3, 373] | 0.2 |
| 9 | 26856 | Int32[2, 3, 373] | 0.2 |
| 10 | 35808 | Int32[2, 3, 373] | 0.2 |
| 11 | 40284 | Int32[2, 3, 373] | 0.2 |
| 12 | 53712 | Int32[2, 3, 373] | 0.2 |

result list for num: 3192, factors: Int32[2, 3, 7, 19]

| nn | num | factors | dist |
|----|-------|---------------------|------|
| 1 | 798 | Int32[2, 3, 7, 19] | 0.0 |
| 2 | 1596 | Int32[2, 3, 7, 19] | 0.0 |
| 3 | 2394 | Int32[2, 3, 7, 19] | 0.0 |
| 4 | 3192 | Int32[2, 3, 7, 19] | 0.0 |
| 5 | 4788 | Int32[2, 3, 7, 19] | 0.0 |
| 6 | 5586 | Int32[2, 3, 7, 19] | 0.0 |
| 7 | 6384 | Int32[2, 3, 7, 19] | 0.0 |
| 8 | 7182 | Int32[2, 3, 7, 19] | 0.0 |
| 9 | 9576 | Int32[2, 3, 7, 19] | 0.0 |
| 10 | 11172 | Int32[2, 3, 7, 19] | 0.0 |
| 11 | 19152 | Int32[2, 3, 7, 19] | 0.0 |
| 12 | 21546 | Int32[2, 3, 7, 19] | 0.0 |
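
The same calls also work outside the helper. For instance, since 2024 = 2^3 · 11 · 23, the following looks for numbers sharing the prime factors 2, 11, and 23; res1 is an ad-hoc name, while knnqueue, search, and viewitems are used exactly as in the helper above:

res1 = knnqueue(ctx, 5)
search(G, ctx, encode_factors(2024), res1)
for p in viewitems(res1)
    println(p.id, " => ", G[p.id], "  dist = ", round(p.weight, digits=3))
end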

Visualizing with UMAP projection

The layout keyword controls how the embedding is initialized; a random initialization could yield lower-quality embeddings but is much faster than alternatives such as the spectral layout, which we use below. Note that we pass the SearchGraph G directly. We also use a second call to compute a 3D embedding, which serves as a kind of color embedding; there we pass U2 to avoid recomputing several of the involved structures.

e2, e3 = let min_dist=0.5f0,
             k=20,
             n_epochs=75,
             neg_sample_rate=3,
             tol=1e-3,
             layout=SpectralLayout()

    @time "Compute 2D UMAP model" U2 = fit(UMAP, G; k, neg_sample_rate, layout, n_epochs, tol, min_dist)
    @time "Compute 3D UMAP model" U3 = fit(U2, 3; neg_sample_rate, n_epochs, tol)
    @time "predicting 2D embeddings" e2 = clamp.(predict(U2), -10f0, 10f0)
    @time "predicting 3D embeddings" e3 = clamp.(predict(U3), -10f0, 10f0)
    e2, e3
end

function normcolors(V)
    # rescale a color channel to the unit interval [0, 1]
    min_, max_ = extrema(V)
    V .= (V .- min_) ./ (max_ - min_)
    V .= clamp.(V, 0, 1)
end

# map the 3D embedding to RGB channels
normcolors(@view e3[1, :])
normcolors(@view e3[2, :])
normcolors(@view e3[3, :])

# 2D scatter of the embedding, colored by the normalized 3D embedding
let C = [RGB(c[1], c[2], c[3]) for c in eachcol(e3)],
    X = view(e2, 1, :),
    Y = view(e2, 2, :)
    scatter(X, Y, color=C, fmt=:png, alpha=0.2, size=(600, 600), ma=0.3, ms=2, msw=0, label="", yticks=nothing, xticks=nothing, xaxis=false, yaxis=false)
end

plot!()
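
If you want to keep the figure, Plots can write the current plot to disk; the filename here is arbitrary:

savefig("umap-primes.png")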

Environment and dependencies

Julia Version 1.10.10
Commit 95f30e51f41 (2025-06-27 09:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, cascadelake)
Threads: 64 default, 0 interactive, 32 GC (on 64 virtual cores)
Environment:
  JULIA_NUM_THREADS = auto
  JULIA_PROJECT = .
  JULIA_LOAD_PATH = @:@stdlib
Status `~/Research/SimilaritySearchDemos/Project.toml`
  [aaaa29a8] Clustering v0.15.8
  [944b1d66] CodecZlib v0.7.8
  [a93c6f00] DataFrames v1.8.0
  [c5bfea45] Embeddings v0.4.6
  [f67ccb44] HDF5 v0.17.2
  [b20bd276] InvertedFiles v0.8.1
  [682c06a0] JSON v0.21.4
  [23fbe1c1] Latexify v0.16.10
  [eb30cadb] MLDatasets v0.7.18
  [06eb3307] ManifoldLearning v0.9.0
⌃ [ca7969ec] PlotlyLight v0.11.0
  [91a5bcdd] Plots v1.40.20
  [27ebfcd6] Primes v0.5.7
  [ca7ab67e] SimSearchManifoldLearning v0.3.1
  [053f045d] SimilaritySearch v0.13.0
⌅ [2913bbd2] StatsBase v0.33.21
  [f3b207a7] StatsPlots v0.15.7
  [7f6f6c8a] TextSearch v0.19.6
Info Packages marked with ⌃ and ⌅ have new versions available. Those with ⌃ may be upgradable, but those with ⌅ are restricted by compatibility constraints from upgrading. To see why use `status --outdated`