Prime numbers

by: Eric S. Téllez

This demonstration is about prime numbers and their similarity based on their factors. It is a well-known demonstration of SimSearchManifoldLearning.jl. This notebook does not require downloading any dataset.

Note: This example needs a lot of computing power; you may therefore want to set the environment variable JULIA_NUM_THREADS=auto before running julia.
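A quick sanity check after startup (plain Base Julia, nothing specific to this demo) is to print how many threads the session actually received:

```julia
# With JULIA_NUM_THREADS=auto the thread count should match the available
# CPU cores; a single-threaded session will print 1.
println("threads: ", Threads.nthreads())
```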

using SimilaritySearch, SimSearchManifoldLearning, Primes, PlotlyLight, StatsBase, LinearAlgebra, Markdown, Random

Now we can define our dataset. The set of factors of each integer is found using the Primes package. Note that we use VectorDatabase to represent the dataset.

n = 10^4
F = Vector{Vector{Int32}}(undef, n+1)

function encode_factors(num)
    # sorted vector of the distinct prime factors of `num`
    sort!(Int32[f for f in factor(Set, num)])
end

F[1] = Int32[1]

for i in 2:n+1
    s = encode_factors(i)
    F[i] = s
end

db = VectorDatabase(F)
#dist = Dist.Sets.Jaccard()  # Other distances from SimilaritySearch
#dist = Dist.Sets.Intersection()
#dist = Dist.Sets.Cosine()
dist = Dist.Sets.Dice()

We represent each integer by a sorted Int32 array of its prime factors. The previous cell selects a set distance (Dice) for this representation; other set distances from SimilaritySearch are shown commented out. While other representations may perform faster, this one is quite straightforward and illustrates how user-defined distance functions can be plugged in.
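The cell above uses the package's built-in Dist.Sets.Dice() distance. As a rough sketch of what a user-defined distance for this representation could look like (a stand-alone illustrative function, not the package's implementation), the Dice dissimilarity 1 - 2|a ∩ b| / (|a| + |b|) can be computed with a merge-style scan over two sorted factor vectors:

```julia
# Dice dissimilarity over sorted vectors of distinct prime factors.
# Illustrative sketch only; the demo itself uses Dist.Sets.Dice().
function dice_dist(a::AbstractVector, b::AbstractVector)
    i, j, isect = 1, 1, 0
    while i <= length(a) && j <= length(b)  # merge-style intersection count
        if a[i] == b[j]
            isect += 1; i += 1; j += 1
        elseif a[i] < b[j]
            i += 1
        else
            j += 1
        end
    end
    1.0 - 2isect / (length(a) + length(b))
end

dice_dist(Int32[2, 3, 7], Int32[2, 3, 7])  # identical factor sets -> 0.0
dice_dist(Int32[3, 17], Int32[3, 5, 17])   # 1 - 4/5 ≈ 0.2
```

Any function of this shape could, in principle, back a custom distance for SimilaritySearch; here it mainly serves to make the distance values in the search results interpretable.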

Index construction

Note that the sets of prime factors can be pretty large for some large \(n\), and this implies challenges for metric indexes (since it is related to the intrinsic dimension of the dataset). We used a kernel that starts 64 threads; it solves \(n = 100000\) in a few seconds, but it can take a pretty long time with a single thread and larger \(n\) values. The index is used by the visualization algorithm (UMAP) to construct an all-kNN graph, which can be a quite costly procedure.

G = SearchGraph(; db, dist)
ctx = SearchGraphContext(hyperparameters_callback=OptimizeParameters(MinRecall(0.95)))  # <1>
index!(G, ctx)  # <2>
optimize_index!(G, ctx, MinRecall(0.9))  # <3>

1. Defines the index and the search context (caches and hyperparameters); in particular, we request a very high quality build with MinRecall(0.95): high quality constructions yield faster queries due to the underlying graph structure.
2. The actual indexing procedure using the given search context.
3. Optimizes the index to trade quality for speed.

Searching

Our index can solve queries over the entire dataset; for instance, posing an integer's factor set as a nearest neighbor query retrieves the numbers with the most similar factorizations.

function search_and_render(index, ctx, qnum, qenc, res, k)
    res = reuse!(res, k)
    @time search(index, ctx, qenc, res)

    L = [
        """## result list for num: $(qnum), factors: $qenc""",
        """| nn | num | factors | dist |""",
        """|----|------|--------|------|"""
    ]
    for (j, p) in enumerate(viewitems(res))
        push!(L, """| $j | $(p.id) | $(index[p.id]) | $(round(p.dist, digits=3)) |""")     
    end

    Markdown.parse(join(L, "\n"))
end

display(md"## Search examples (random)")

res = knnqueue(ctx, 12)
for qnum in [42, 51, 1000, 1492, 3192]
    qenc = encode_factors(qnum)
    search_and_render(G, ctx, qnum, qenc, res, maxlength(res)) |> display
end
  0.000051 seconds (3 allocations: 64 bytes)
  0.000132 seconds (3 allocations: 64 bytes)
  0.000079 seconds (3 allocations: 64 bytes)
  0.000111 seconds (3 allocations: 64 bytes)
  0.000029 seconds (3 allocations: 64 bytes)

Search examples (random)

result list for num: 42, factors: Int32[2, 3, 7]

| nn | num | factors | dist |
|----|------|---------|------|
| 1 | 1176 | Int32[2, 3, 7] | 0.0 |
| 2 | 42 | Int32[2, 3, 7] | 0.0 |
| 3 | 84 | Int32[2, 3, 7] | 0.0 |
| 4 | 126 | Int32[2, 3, 7] | 0.0 |
| 5 | 168 | Int32[2, 3, 7] | 0.0 |
| 6 | 252 | Int32[2, 3, 7] | 0.0 |
| 7 | 294 | Int32[2, 3, 7] | 0.0 |
| 8 | 336 | Int32[2, 3, 7] | 0.0 |
| 9 | 378 | Int32[2, 3, 7] | 0.0 |
| 10 | 504 | Int32[2, 3, 7] | 0.0 |
| 11 | 588 | Int32[2, 3, 7] | 0.0 |
| 12 | 672 | Int32[2, 3, 7] | 0.0 |

result list for num: 51, factors: Int32[3, 17]

| nn | num | factors | dist |
|----|------|---------|------|
| 1 | 51 | Int32[3, 17] | 0.0 |
| 2 | 153 | Int32[3, 17] | 0.0 |
| 3 | 459 | Int32[3, 17] | 0.0 |
| 4 | 867 | Int32[3, 17] | 0.0 |
| 5 | 1377 | Int32[3, 17] | 0.0 |
| 6 | 2601 | Int32[3, 17] | 0.0 |
| 7 | 4131 | Int32[3, 17] | 0.0 |
| 8 | 7803 | Int32[3, 17] | 0.0 |
| 9 | 255 | Int32[3, 5, 17] | 0.2 |
| 10 | 2397 | Int32[3, 17, 47] | 0.2 |
| 11 | 3009 | Int32[3, 17, 59] | 0.2 |
| 12 | 3111 | Int32[3, 17, 61] | 0.2 |

result list for num: 1000, factors: Int32[2, 5]

| nn | num | factors | dist |
|----|------|---------|------|
| 1 | 1600 | Int32[2, 5] | 0.0 |
| 2 | 10 | Int32[2, 5] | 0.0 |
| 3 | 20 | Int32[2, 5] | 0.0 |
| 4 | 40 | Int32[2, 5] | 0.0 |
| 5 | 50 | Int32[2, 5] | 0.0 |
| 6 | 80 | Int32[2, 5] | 0.0 |
| 7 | 100 | Int32[2, 5] | 0.0 |
| 8 | 160 | Int32[2, 5] | 0.0 |
| 9 | 200 | Int32[2, 5] | 0.0 |
| 10 | 250 | Int32[2, 5] | 0.0 |
| 11 | 320 | Int32[2, 5] | 0.0 |
| 12 | 400 | Int32[2, 5] | 0.0 |

result list for num: 1492, factors: Int32[2, 373]

| nn | num | factors | dist |
|----|------|---------|------|
| 1 | 746 | Int32[2, 373] | 0.0 |
| 2 | 1492 | Int32[2, 373] | 0.0 |
| 3 | 2984 | Int32[2, 373] | 0.0 |
| 4 | 5968 | Int32[2, 373] | 0.0 |
| 5 | 2238 | Int32[2, 3, 373] | 0.2 |
| 6 | 3730 | Int32[2, 5, 373] | 0.2 |
| 7 | 8206 | Int32[2, 11, 373] | 0.2 |
| 8 | 5222 | Int32[2, 7, 373] | 0.2 |
| 9 | 9698 | Int32[2, 13, 373] | 0.2 |
| 10 | 4476 | Int32[2, 3, 373] | 0.2 |
| 11 | 6714 | Int32[2, 3, 373] | 0.2 |
| 12 | 8952 | Int32[2, 3, 373] | 0.2 |

result list for num: 3192, factors: Int32[2, 3, 7, 19]

| nn | num | factors | dist |
|----|------|---------|------|
| 1 | 798 | Int32[2, 3, 7, 19] | 0.0 |
| 2 | 1596 | Int32[2, 3, 7, 19] | 0.0 |
| 3 | 2394 | Int32[2, 3, 7, 19] | 0.0 |
| 4 | 3192 | Int32[2, 3, 7, 19] | 0.0 |
| 5 | 4788 | Int32[2, 3, 7, 19] | 0.0 |
| 6 | 5586 | Int32[2, 3, 7, 19] | 0.0 |
| 7 | 6384 | Int32[2, 3, 7, 19] | 0.0 |
| 8 | 7182 | Int32[2, 3, 7, 19] | 0.0 |
| 9 | 9576 | Int32[2, 3, 7, 19] | 0.0 |
| 10 | 3990 | Int32[2, 3, 5, 7, 19] | 0.111 |
| 11 | 8778 | Int32[2, 3, 7, 11, 19] | 0.111 |
| 12 | 7980 | Int32[2, 3, 5, 7, 19] | 0.111 |

Visualizing with UMAP projection

Here we initialize the embedding with a spectral layout (the layout argument below); a random initialization is faster but may yield lower quality embeddings. Note that we pass the search graph G. We also use a second call to compute a 3D embedding that serves as a kind of colour embedding; there we pass U2 to avoid recomputing several of the involved structures.

e2, e3 = let min_dist=0.5f0,
             k=20,
             n_epochs=75,
             neg_sample_rate=3,
             tol=1e-3,
             layout=SpectralLayout()

    @time "Compute 2D UMAP model" U2 = fit(UMAP, G; k, neg_sample_rate, layout, n_epochs, tol, min_dist)
    @time "Compute 3D UMAP model" U3 = fit(U2, 3; neg_sample_rate, n_epochs, tol)
    @time "predicting 2D embeddings" e2 = clamp.(predict(U2), -10f0, 10f0)
    @time "predicting 3D embeddings" e3 = clamp.(predict(U3), -10f0, 10f0)
    e2, e3
end    

Now, plotting: we normalize each coordinate of the 3D embedding into [0, 1], map it to an RGB colour per point, and draw the 2D embedding with PlotlyLight.

function normcolors!(V)
    min_, max_ = extrema(V)
    range = max_ - min_
    if range > 0
        V .= (V .- min_) ./ range
    end
    V .= clamp.(V, 0, 1)
end

normcolors!(@view e3[1, :])
normcolors!(@view e3[2, :])
normcolors!(@view e3[3, :])

colors = [
    "rgba($(round(Int, c[1]*255)), $(round(Int, c[2]*255)), $(round(Int, c[3]*255)), 0.3)" 
    for c in eachcol(e3)
]

hovertext = ["p.f. $f" for f in F]  # prime factors shown on hover

data = [Config(;
    x = view(e2, 1, :),
    y = view(e2, 2, :),
    mode = "markers",
    marker = (
        color = colors, 
        size = 4, 
        line = (width = 0,)
    ),
    hovertext,
    type = "scattergl" # recommended for many points (embeddings)
)]

layout = Config(
    width = 600,
    height = 600,
    xaxis = (visible = false, showgrid = false, zeroline = false),
    yaxis = (visible = false, showgrid = false, zeroline = false),
    hovermode = "closest",
    plot_bgcolor = "white"
)

Plot(data, layout)

Environment and dependencies

Julia Version 1.10.11
Commit a2b11907d7b (2026-03-09 14:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (x86_64-apple-darwin24.0.0)
  CPU: 8 × Intel(R) Core(TM) i5-8257U CPU @ 1.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
  JULIA_NUM_THREADS = auto
  JULIA_PROJECT = @.
  JULIA_LOAD_PATH = @:@stdlib
Status `~/Research/SimilaritySearchDemos/Project.toml`
  [aaaa29a8] Clustering v0.15.8
  [944b1d66] CodecZlib v0.7.8
  [a93c6f00] DataFrames v1.8.1
  [c5bfea45] Embeddings v0.4.6
  [f67ccb44] HDF5 v0.17.2
  [b20bd276] InvertedFiles v0.9.2
 [682c06a0] JSON v0.21.4
  [23fbe1c1] Latexify v0.16.10
  [eb30cadb] MLDatasets v0.7.21
  [06eb3307] ManifoldLearning v0.9.0
 [ca7969ec] PlotlyLight v0.11.0
  [91a5bcdd] Plots v1.41.6
  [27ebfcd6] Primes v0.5.7
  [ca7ab67e] SimSearchManifoldLearning v0.4.0
  [053f045d] SimilaritySearch v0.14.3
 [2913bbd2] StatsBase v0.33.21
  [f3b207a7] StatsPlots v0.15.8
  [7f6f6c8a] TextSearch v0.20.0
Info Packages marked with ⌃ and ⌅ have new versions available. Those with ⌃ may be upgradable, but those with ⌅ are restricted by compatibility constraints from upgrading. To see why use `status --outdated`